Development

Context Window

The maximum number of tokens (input + output combined) a model can process in a single request. Content past this limit is truncated or requires chunking.

First published April 14, 2026

Context windows have expanded rapidly: GPT-3.5 had 4k tokens, GPT-4 Turbo had 128k, Claude 4 has 1M, Gemini 2.5 has 2M. Bigger windows change what's possible: full codebases, entire books, and hour-long meeting transcripts can fit in one call.

Caveats: (1) cost scales with window usage -- a 1M-token request is not free, (2) attention quality degrades with distance -- the "lost-in-the-middle" effect is real, (3) some models charge different rates for long vs short context. Design for the window size you actually need, not the one your model allows.
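To make caveat (1) concrete, here is a rough cost estimate. The per-token rates below are placeholders for illustration, not any provider's actual pricing:

```python
# Rough request-cost estimate. Prices are hypothetical placeholders;
# check your provider's current rate card before relying on them.
PRICE_PER_M_INPUT = 3.00    # USD per 1M input tokens (assumed)
PRICE_PER_M_OUTPUT = 15.00  # USD per 1M output tokens (assumed)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the assumed rates."""
    return (input_tokens * PRICE_PER_M_INPUT
            + output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000

# A full 1M-token prompt costs ~1000x a short 1k-token prompt:
print(f"{request_cost(1_000_000, 2_000):.2f}")   # 3.03
print(f"{request_cost(1_000, 2_000):.4f}")       # 0.0330
```

At these assumed rates, an agent loop that resends a full 1M-token context every turn spends dollars per iteration, which is why trimming to the window size you actually need matters.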

Example

# Managing the context window in a long conversation.
# Uses tiktoken for exact token counts; `system_prompt` and `history`
# (a list of message strings) come from the surrounding application.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def count(text: str) -> int:
    """Exact token count under gpt-4o's tokenizer."""
    return len(enc.encode(text))

MAX_CONTEXT = 128_000          # gpt-4o's context window
RESERVED_FOR_OUTPUT = 4_000    # headroom for the model's reply
budget = MAX_CONTEXT - count(system_prompt) - RESERVED_FOR_OUTPUT

# Trim the oldest turns until the remaining history fits in `budget`
while sum(count(turn) for turn in history) > budget:
    history.pop(0)  # or: history = summarize_oldest(history)

When to use it

  • Planning context composition for agents and RAG
  • Deciding whether to chunk a large document or feed it whole
  • Sizing the conversation history buffer
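The chunk-or-feed-whole decision above reduces to a token estimate against the budget. A minimal sketch, assuming a 128k window and the crude 4-characters-per-token heuristic (the helper names and chunk size are illustrative):

```python
MAX_CONTEXT = 128_000
RESERVED = 8_000  # system prompt + output headroom (assumed)

def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English text.
    return len(text) // 4

def plan(document: str, chunk_tokens: int = 20_000) -> list[str]:
    """Return the document whole if it fits the budget, else fixed-size chunks."""
    if estimate_tokens(document) <= MAX_CONTEXT - RESERVED:
        return [document]
    step = chunk_tokens * 4  # characters per chunk under the heuristic
    return [document[i:i + step] for i in range(0, len(document), step)]

print(len(plan("a short memo")))    # 1 -- fits in a single call
print(len(plan("x" * 1_000_000)))   # 13 -- ~250k tokens, must be chunked
```

In production you would count tokens with the model's real tokenizer rather than the character heuristic, and split on semantic boundaries (paragraphs, sections) rather than fixed character offsets.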

When NOT to use it

  • Filling the window just because you can -- attention quality degrades
  • Relying on 1M+ windows for precise recall -- use retrieval instead