Prerequisites
- Python 3.10+ or Node 20+
- Basic familiarity with one LLM API
Step 1: Install tokenizers per model family
pip install tiktoken anthropic google-cloud-aiplatform
Each vendor ships a different tokenizer, so you need one per model family. OpenAI uses tiktoken, Anthropic counts tokens through its Messages API (client.messages.count_tokens), and Google exposes a count-tokens endpoint.
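To see that counts really differ, here is a quick local check using two OpenAI encodings (the sample sentence is arbitrary; exact counts depend on your tiktoken version):
import tiktoken
sample = "Token counts differ between model families."
for name in ("cl100k_base", "o200k_base"):  # GPT-4-era vs GPT-4o/o-series encodings
    print(name, len(tiktoken.get_encoding(name).encode(sample)))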
Step 2: Write a unified counter
import tiktoken
from anthropic import Anthropic
def count_tokens(text: str, model: str) -> int:
    if model.startswith("gpt") or model.startswith("o"):
        try:
            enc = tiktoken.encoding_for_model(model)
        except KeyError:  # model name newer than the installed tiktoken release
            enc = tiktoken.get_encoding("o200k_base")
        return len(enc.encode(text))
    if model.startswith("claude"):
        # The SDK no longer ships a local tokenizer; this hits the count-tokens
        # endpoint, so it needs ANTHROPIC_API_KEY and a real API model ID.
        resp = Anthropic().messages.count_tokens(
            model=model, messages=[{"role": "user", "content": text}])
        return resp.input_tokens
    # Fallback: rough approximation, ~4 chars/token
    return len(text) // 4
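A quick smoke test (the strings are arbitrary; the second call does not match any known prefix, so it exercises the 4-chars-per-token fallback):
print(count_tokens("Estimate costs before you call the API.", "gpt-4o"))         # exact tiktoken count
print(count_tokens("Estimate costs before you call the API.", "unknown-model"))  # heuristic: len(text) // 4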
Step 3: Load a pricing table
Keep this in a JSON file you can update monthly (shown below as the equivalent Python dict). Prices are USD per 1M tokens; context is the model's maximum window in tokens.
# Snapshot values; always cross-check the vendors' current pricing pages.
PRICING = {
    "gpt-5":            {"input": 1.25,  "output": 10.00, "context": 400_000},
    "gpt-4o":           {"input": 2.50,  "output": 10.00, "context": 128_000},
    "gpt-4o-mini":      {"input": 0.15,  "output": 0.60,  "context": 128_000},
    "claude-4-opus":    {"input": 15.00, "output": 75.00, "context": 200_000},
    "claude-4-sonnet":  {"input": 3.00,  "output": 15.00, "context": 200_000},
    "claude-4.5-haiku": {"input": 1.00,  "output": 5.00,  "context": 200_000},
    "gemini-2.5-pro":   {"input": 1.25,  "output": 10.00, "context": 1_000_000},
    "gemini-2.5-flash": {"input": 0.30,  "output": 2.50,  "context": 1_000_000},
}
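If you do keep the table in a file instead, load it once at startup. A minimal sketch, assuming a file named pricing.json with the same shape as the dict above:
import json
from pathlib import Path
PRICING = json.loads(Path("pricing.json").read_text())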
Step 4: Estimator function
def estimate(prompt: str, expected_output_tokens: int, models: list[str]):
    results = []
    for m in models:
        price = PRICING[m]
        input_tokens = count_tokens(prompt, m)
        # Skip models whose context window the request would overflow.
        if input_tokens + expected_output_tokens > price["context"]:
            results.append({"model": m, "fits": False})
            continue
        cost = (input_tokens * price["input"] +
                expected_output_tokens * price["output"]) / 1_000_000
        results.append({
            "model": m,
            "fits": True,
            "input_tokens": input_tokens,
            "output_tokens": expected_output_tokens,
            "cost_usd": round(cost, 6),
        })
    # Cheapest fitting model first; non-fitting models sort to the end.
    return sorted(results, key=lambda r: (not r["fits"], r.get("cost_usd", 1)))
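For example (the prompt and the 500-token output guess are placeholders, and the candidate list is GPT-only so it runs without an Anthropic key):
rows = estimate("Summarize the attached meeting notes.", 500,
                ["gpt-4o-mini", "gpt-4o", "gpt-5"])
for row in rows:
    print(row)   # fits / input_tokens / output_tokens / cost_usd, cheapest first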
Step 5: Use it as a router
def pick_model(prompt: str, expected_output: int, quality_tier: str = "mid") -> str:
candidates = {
"cheap": ["gemini-2.5-flash", "gpt-4o-mini", "claude-4.5-haiku"],
"mid": ["gpt-4o", "claude-4-sonnet", "gemini-2.5-pro"],
"top": ["gpt-5", "claude-4-opus"],
}[quality_tier]
estimates = estimate(prompt, expected_output, candidates)
for e in estimates:
if e["fits"]:
return e["model"]
raise RuntimeError("prompt too large for any candidate")
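Wired into a call site it looks like this (prompt, output guess, and tier are illustrative; note that tiers containing Claude models trigger the count-tokens API call, so an Anthropic key must be configured for those):
model = pick_model("Draft a polite reply to this customer email.",
                   expected_output=300, quality_tier="cheap")
print(model)   # cheapest model in the tier that the prompt fits into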
What this saves you
- Surprise bills from feeding a 100k-token doc to an Opus-class model when Haiku would have sufficed
- Runtime errors from sending prompts that exceed the context window
- The cost of running a production A/B test across models; now you can estimate before calling
Variations
- Add batch-API pricing columns (typically 50% off in exchange for up to 24-hour turnaround)
- Add cached-input pricing (Claude cache discounts for repeated context)
- Log every real call; at the end of the day, compare actuals to estimates and tune your output-token guesses per task (a sketch follows below)
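For that last variation, a minimal sketch; the log file name and record fields are assumptions, not part of the recipe above:
import json, datetime
LOG_PATH = "usage_log.jsonl"
def log_call(model: str, estimated: dict, actual_input: int, actual_output: int):
    # Append one JSON line per real API call so estimates can be audited later.
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model": model,
        "est_cost_usd": estimated["cost_usd"],
        "actual_cost_usd": round(
            (actual_input * PRICING[model]["input"]
             + actual_output * PRICING[model]["output"]) / 1_000_000, 6),
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")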
