Context windows have expanded rapidly: GPT-3.5 shipped with 4k tokens, GPT-4 Turbo raised the ceiling to 128k, Claude Sonnet 4 supports up to 1M, and Gemini 1.5 Pro up to 2M. Bigger windows change what's possible: full codebases, entire books, and hour-long meeting transcripts can fit in a single call.
Caveats: (1) cost scales with the tokens you actually send -- a 1M-token request is far from free; (2) attention quality degrades with distance -- the "lost-in-the-middle" effect is real; (3) some providers charge higher per-token rates once a request crosses a long-context threshold. Design for the window size you actually need, not the one your model allows.
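To make caveat (1) concrete, here is a back-of-envelope cost sketch. The per-token rates are illustrative placeholders, not any provider's real pricing:

```python
# Illustrative rates only -- check your provider's pricing page.
INPUT_RATE_PER_MTOK = 3.00    # assumed $ per 1M input tokens
OUTPUT_RATE_PER_MTOK = 15.00  # assumed $ per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the assumed rates."""
    return ((input_tokens / 1_000_000) * INPUT_RATE_PER_MTOK
            + (output_tokens / 1_000_000) * OUTPUT_RATE_PER_MTOK)

# Same output length, but the input side dominates as the window fills:
full_window = request_cost(1_000_000, 1_000)  # 1M-token prompt
small_prompt = request_cost(1_000, 1_000)     # 1k-token prompt
```

Input cost grows linearly with every token in the window, so a habit of sending the full history on each turn multiplies your bill by the conversation length.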
Example Prompt
# Managing the context window in a long conversation.
# `count_tokens` and `summarize_oldest` are assumed helpers:
# count_tokens wraps the model's tokenizer (e.g. tiktoken);
# summarize_oldest compresses the oldest turns into a short summary.
MAX_CONTEXT = 128_000          # gpt-4o's context window
RESERVED_FOR_OUTPUT = 4_000    # leave headroom for the completion
system_tokens = count_tokens(system_prompt)
budget = MAX_CONTEXT - system_tokens - RESERVED_FOR_OUTPUT

# Trim old turns until the history fits in `budget`.
while count_tokens(history) > budget:
    history = summarize_oldest(history)
When to use it
- Planning context composition for agents and RAG
- Deciding whether to chunk a large document or feed it whole
- Sizing the conversation history buffer
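For the chunk-or-feed-whole decision, a plain token-count check against the budget is enough. `count_tokens` below is a crude whitespace stand-in for a real tokenizer, and the splitting strategy is deliberately naive:

```python
def count_tokens(text: str) -> int:
    # Rough approximation: real code should use the model's tokenizer.
    return len(text.split())

def fits_in_one_call(doc: str, budget: int) -> bool:
    return count_tokens(doc) <= budget

def chunk(doc: str, budget: int) -> list[str]:
    """Split on whitespace into pieces of at most `budget` tokens."""
    words = doc.split()
    return [" ".join(words[i:i + budget])
            for i in range(0, len(words), budget)]

doc = "word " * 10
pieces = [doc] if fits_in_one_call(doc, budget=4) else chunk(doc, budget=4)
```

In practice you would split on semantic boundaries (sections, paragraphs) rather than raw word counts, but the budget comparison is the same.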
When NOT to use it
- Filling the window just because you can -- attention quality degrades
- Relying on 1M+ windows for precise recall -- use retrieval instead
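A minimal illustration of the retrieval alternative: score stored chunks against the query and send only the top hits, instead of stuffing the whole corpus into the window. Word overlap here is a toy stand-in for a real embedding index:

```python
def overlap_score(query: str, chunk: str) -> int:
    # Toy relevance score: count shared lowercase words.
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Keep only the k most relevant chunks for the prompt.
    return sorted(chunks, key=lambda c: overlap_score(query, c),
                  reverse=True)[:k]

corpus = [
    "The billing API rate limit is 60 requests per minute.",
    "Our office dog is named Biscuit.",
    "Billing errors return HTTP 402 with a JSON body.",
]
relevant = top_k("billing rate limit", corpus, k=2)
```

The model then reasons over a few hundred relevant tokens rather than hunting for one fact buried in a million.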
