Apr 23, 20266 min read

Reducing token costs in AI agents without losing quality

How I think about token costs in agents and chatbots pragmatically: from prompt discipline, context management, and RAG to caching, model routing, and bounded agent loops.

AI AgentsToken CostsRAGLLM Ops

Costs rarely come from one place only

When people discuss token costs, the conversation often jumps straight to the prompt. That is understandable, but too narrow. In production agents, costs come from system prompts, tool definitions, conversation history, RAG context, intermediate reasoning steps, retry logic, and outputs. If you only shorten wording, you often optimize the smallest part of the problem.

That is why I treat token optimization as an architectural concern. The central question is not only: how can I write this shorter? It is: what information does the model actually need, when does it need it, and which model should handle this task in the first place?

Keep prompts compact, but do not compress blindly

A compact system prompt is almost always useful. Redundant explanations, duplicate rules, and too many examples cost money on every call. Especially in recurring agent flows, it is worth reviewing instructions regularly: which rule is truly needed, which one only grew historically, and which examples measurably improve the result?

At the same time, brevity should not come at the expense of control. An overly compressed prompt can become more expensive if the model asks more follow-up questions, uses tools incorrectly, or requires heavier post-processing. Good prompt optimization saves tokens without making roles, boundaries, and quality criteria vague.

Context management is the biggest everyday lever

Many chatbots and agents become expensive because they carry too much history. Every message, tool result, and old intermediate step returns to the context even though only a small part is still decision-relevant. Sliding windows, trimming, and running summaries are therefore not convenience features, but cost controls.

Selective context is even better. Instead of sending complete histories or documents, the system should actively decide which information is relevant for the current task. This is where QA thinking and agent design meet: context has to be traceable, sufficient, and bounded.

RAG, caching, and routing instead of large default contexts

RAG is one of the strongest levers when agents work with knowledge bases. Instead of loading entire documents or long policy texts into the prompt, only relevant chunks are retrieved, ranked, and injected. Good chunking, low overlap, and reranking matter more than returning as many matches as possible. Three well-chosen chunks are often better than twenty mediocre ones.

Caching adds another layer. Static system parts, recurring tool context, or semantically similar standard questions should not have to be paid for from scratch every time. And with model routing, not every task needs the strongest model. Classification, pre-filtering, simple extraction, and formatting can often run on smaller models, while complex cases are escalated deliberately.

Agent loops need clear boundaries

Agents become especially expensive when they get stuck in long tool loops. One tool call creates context, the next call creates more context, and without stopping criteria the history grows quickly. Max iterations, clear stop conditions, and compact tool definitions therefore belong directly in the design.

Tool descriptions are part of the token bill as well. Every description is shown to the model. Short, precise descriptions and deferred tool loading help avoid carrying a full toolbox into every request. An agent should only see the tools that are plausibly relevant to the current task.

My pragmatic priority mix

If I had to prioritize a production chatbot or agent scenario, I would usually start with three levers: RAG instead of large contexts, prompt caching for stable system parts, and model routing for simple requests. These three measures change the cost structure noticeably without rebuilding the whole product.

After that comes refinement: shorter prompts, better summaries, smaller outputs, response caching, and batch processing for asynchronous workloads. The important part is not to optimize cost in isolation. A cheap agent that gives wrong answers or behaves opaquely is not a good system. The goal is not the lowest possible token count, but a reliable balance of quality, traceability, and cost.