Development

Token Budget & Cost Estimator for Production LLM Apps

A pre-flight cost calculator that tells you -- before you call -- how many tokens a given prompt + expected response will burn, across multiple models, so your app can choose the cheapest model that fits the window.

●●○○○ • ~30 minutes • GPT-5, Claude 4.6, Gemini 2.5

Prerequisites

  • Python 3.10+ or Node 20+
  • Basic familiarity with one LLM API

Step 1: Install tokenizers per model family

pip install tiktoken anthropic google-cloud-aiplatform

Each vendor tokenizes differently, so you need a counter per family. OpenAI's tiktoken runs locally; Anthropic counts tokens through an API endpoint (messages.count_tokens), and Google exposes a count-tokens method -- which means Claude and Gemini counts cost a network round trip, while GPT counts are free and offline.

Step 2: Write a unified counter

import tiktoken
from anthropic import Anthropic

_anthropic = Anthropic()  # reuse one client; reads ANTHROPIC_API_KEY from the env

def count_tokens(text: str, model: str) -> int:
    if model.startswith(("gpt", "o")):
        try:
            enc = tiktoken.encoding_for_model(model)
        except KeyError:
            # Models newer than your tiktoken release won't be in its registry
            enc = tiktoken.get_encoding("o200k_base")
        return len(enc.encode(text))
    if model.startswith("claude"):
        # Anthropic no longer ships a local tokenizer; counting is an API call
        resp = _anthropic.messages.count_tokens(
            model=model, messages=[{"role": "user", "content": text}]
        )
        return resp.input_tokens
    # Fallback: rough approximation, ~4 chars/token for English prose
    return len(text) // 4
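The fallback deserves a caveat, so here it is isolated as its own sketch: the 4-chars-per-token ratio is an English-prose rule of thumb, and code, CJK text, or long numeric strings can throw it off by 2x or more. Good enough for a pre-flight budget check, not for billing reconciliation.

```python
def approx_tokens(text: str) -> int:
    """Rough English-prose heuristic: ~4 characters per token."""
    return len(text) // 4
```

If you care about accuracy for a specific corpus, measure your own chars-per-token ratio against a real tokenizer once and substitute it for the constant 4.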

Step 3: Load a pricing table

Keep this table in a JSON file you can update monthly; it's shown here as a Python dict for readability. Prices are USD per 1M tokens.

PRICING = {
  "gpt-5":           {"input": 2.50, "output": 10.00, "context": 400_000},
  "gpt-4o":          {"input": 2.50, "output": 10.00, "context": 128_000},
  "gpt-4o-mini":     {"input": 0.15, "output": 0.60,  "context": 128_000},
  "claude-4-opus":   {"input": 15.00, "output": 75.00, "context": 1_000_000},
  "claude-4-sonnet": {"input": 3.00,  "output": 15.00, "context": 1_000_000},
  "claude-4.5-haiku":{"input": 0.80,  "output": 4.00,  "context": 200_000},
  "gemini-2.5-pro":  {"input": 1.25,  "output": 5.00,  "context": 2_000_000},
  "gemini-2.5-flash":{"input": 0.075, "output": 0.30,  "context": 1_000_000},
}

Step 4: Estimator function

def estimate(prompt: str, expected_output_tokens: int, models: list[str]):
    results = []
    for m in models:
        price = PRICING[m]
        input_tokens = count_tokens(prompt, m)
        if input_tokens + expected_output_tokens > price["context"]:
            results.append({"model": m, "fits": False})
            continue
        cost = (input_tokens * price["input"] +
                expected_output_tokens * price["output"]) / 1_000_000
        results.append({
            "model": m,
            "fits": True,
            "input_tokens": input_tokens,
            "output_tokens": expected_output_tokens,
            "cost_usd": round(cost, 6),
        })
    # Non-fitting models sort last; inf keeps them behind any real cost
    return sorted(results, key=lambda r: (not r["fits"], r.get("cost_usd", float("inf"))))
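To see the ranking behave end to end without API keys, here is a self-contained run of the same logic, substituting the chars/4 fallback for the real tokenizers (model names and prices copied from the table above):

```python
PRICING = {
    "gpt-4o-mini":      {"input": 0.15,  "output": 0.60, "context": 128_000},
    "gemini-2.5-flash": {"input": 0.075, "output": 0.30, "context": 1_000_000},
}

def estimate(prompt, expected_output_tokens, models):
    results = []
    for m in models:
        price = PRICING[m]
        input_tokens = len(prompt) // 4  # offline stand-in for count_tokens
        if input_tokens + expected_output_tokens > price["context"]:
            results.append({"model": m, "fits": False})
            continue
        cost = (input_tokens * price["input"]
                + expected_output_tokens * price["output"]) / 1_000_000
        results.append({"model": m, "fits": True, "cost_usd": round(cost, 6)})
    return sorted(results, key=lambda r: (not r["fits"], r.get("cost_usd", float("inf"))))

# 4,000 chars ≈ 1,000 input tokens; expect 500 output tokens
ranked = estimate("x" * 4000, 500, ["gpt-4o-mini", "gemini-2.5-flash"])
# Flash wins: (1000 × $0.075 + 500 × $0.30) / 1M = $0.000225
```

Cheapest fitting model comes first, so a router can just take `ranked[0]`.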

Step 5: Use it as a router

def pick_model(prompt: str, expected_output: int, quality_tier: str = "mid") -> str:
    candidates = {
      "cheap": ["gemini-2.5-flash", "gpt-4o-mini", "claude-4.5-haiku"],
      "mid":   ["gpt-4o", "claude-4-sonnet", "gemini-2.5-pro"],
      "top":   ["gpt-5", "claude-4-opus"],
    }[quality_tier]
    estimates = estimate(prompt, expected_output, candidates)
    for e in estimates:
        if e["fits"]:
            return e["model"]
    raise RuntimeError("prompt too large for any candidate")
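One pattern worth layering on top: if nothing in the requested tier fits, escalate to the next tier instead of failing outright (mid-tier Gemini Pro has the largest window in the table). A sketch, assuming the caller accepts paying more for a bigger window -- in practice you would call pick_model per tier and catch its RuntimeError; the ceilings here are read off the table above:

```python
def pick_tier(total_tokens: int, tiers=("cheap", "mid", "top")) -> str:
    """Return the first tier whose largest context window fits the request."""
    # Largest context window available in each tier, from the pricing table
    ceiling = {"cheap": 1_000_000, "mid": 2_000_000, "top": 1_000_000}
    for tier in tiers:
        if total_tokens <= ceiling[tier]:
            return tier
    raise RuntimeError("prompt too large for any tier")
```

Note the ceilings are per-tier maxima: a 500k-token prompt "fits" the cheap tier only because Gemini Flash is in it, so pick_model still has to select the right model within the tier.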

What this saves you

  • Surprise bills from feeding a 100k-token doc to an Opus-class model when Haiku would have sufficed
  • Runtime errors from sending prompts that exceed the context window
  • The cost of running a production A/B across models -- now you can estimate before calling

Variations

  • Add batch-API pricing columns (50% off for 24hr turnaround)
  • Add cached-input pricing (Claude cache discounts for repeated context)
  • Log every real call; at end of day, compare actual to estimate, tune your output-token estimates per task
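The last variation can be as simple as an exponentially weighted moving average per task type. A sketch -- the alpha smoothing knob and the function name are illustrative, not from the original:

```python
def update_output_estimate(current: int, actual: int, alpha: float = 0.2) -> int:
    """Blend the latest observed output-token count into the running estimate.

    Higher alpha reacts faster to drift; lower alpha resists noisy outliers.
    """
    return round((1 - alpha) * current + alpha * actual)
```

Run it once per completed call, keyed by task type, and your expected_output_tokens inputs to the estimator converge toward what the models actually produce.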