Context Window
The maximum number of tokens a model can attend to in a single request, including the response.
The context window is the working memory of a language model. Anything outside it is invisible to the model on this request.
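Because the window covers both input and output, a response budget has to be reserved up front. A minimal sketch of that budget check, where the window size and the 4-characters-per-token heuristic are illustrative assumptions (a real tokenizer gives exact counts):

```python
CONTEXT_WINDOW = 128_000   # assumed window size, in tokens
CHARS_PER_TOKEN = 4        # rough heuristic for English text

def estimate_tokens(text: str) -> int:
    """Cheap token estimate; swap in a real tokenizer in production."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def fits_in_window(prompt: str, max_response_tokens: int) -> bool:
    """True if the prompt plus the reserved response budget fit the window."""
    return estimate_tokens(prompt) + max_response_tokens <= CONTEXT_WINDOW

fits_in_window("hello " * 1000, max_response_tokens=4096)
```

Anything the check rejects must be trimmed, summarized, or retrieved selectively before the request is sent.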
In 2026 most frontier models support 128K–1M token windows. That's enough to fit hundreds of pages of documentation, an entire small codebase, or a full year of email threads. Prompt caching makes large contexts much more affordable by reusing already-processed prefixes.
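The intuition behind prompt caching can be sketched in a few lines: when many requests share a long prefix (say, the same system prompt plus documentation), the expensive prefix processing is done once and reused. The cache key and the `process_prefix` stand-in below are illustrative assumptions, not a real provider API:

```python
import hashlib

_prefix_cache: dict[str, str] = {}

def process_prefix(prefix: str) -> str:
    """Stand-in for the expensive prefill pass over the shared prefix."""
    return f"<state:{len(prefix)} chars>"

def run_request(shared_prefix: str, user_suffix: str) -> tuple[str, bool]:
    """Return (result, cache_hit): reuse the prefix state when it is cached."""
    key = hashlib.sha256(shared_prefix.encode()).hexdigest()
    hit = key in _prefix_cache
    if not hit:
        _prefix_cache[key] = process_prefix(shared_prefix)
    return _prefix_cache[key] + "|" + user_suffix, hit
```

The first request over a given prefix pays full cost; every later request that shares it hits the cache, which is why providers typically price cached prefix tokens far below fresh ones.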
Long context isn't free. Latency, cost, and, perhaps surprisingly, quality can all degrade as the window fills; models often attend less reliably to material buried in the middle of a long context. A common pattern is hybrid: use retrieval-augmented generation (RAG) to retrieve only the relevant chunks instead of stuffing everything into context.
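The hybrid pattern above can be sketched as: score stored chunks against the query and place only the top-k into the prompt. The word-overlap scorer here is an illustrative stand-in for a real embedding-based retriever:

```python
def score(query: str, chunk: str) -> float:
    """Fraction of query words that appear in the chunk (toy relevance score)."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most relevant to the query."""
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:k]

chunks = [
    "The context window includes both input and output tokens.",
    "Prompt caching reuses already-processed prefixes.",
    "Retrieval picks only the chunks relevant to the query.",
]
prompt_context = "\n".join(retrieve("how does retrieval pick chunks", chunks))
```

Only `prompt_context` goes into the request, so the window stays small regardless of how large the underlying corpus grows.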