Why RAG pipelines waste tokens
The problem is rarely the model itself. In many production pipelines, the largest hidden cost comes from inefficient context preparation.
Raw HTML, duplicated boilerplate, oversized chunks and low-signal embeddings inflate token usage at every layer: ingestion, retrieval and inference.
As RAG systems scale, these inefficiencies compound quickly.
Most retrieval systems ingest documents that were originally designed for browsers, not language models.
As a result, AI systems frequently spend tokens on content that was never meant for them:
- navigation menus
- cookie banners
- repeated layout wrappers
- inline CSS
- tracking scripts
- duplicated responsive markup
- irrelevant metadata
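
As a rough illustration of how the items above might be filtered out before chunking, here is a minimal sketch that uses only Python's standard-library `html.parser`. The names `SKIP_TAGS`, `ContentExtractor` and `extract_content` are hypothetical, and the set of skipped tags is an assumption; a real pipeline would likely need site-specific rules or a dedicated extraction library.

```python
from html.parser import HTMLParser

# Tags assumed to carry low-signal content (an illustrative choice, not a standard).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "noscript"}


class ContentExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes (including inline style="...") are never collected.
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._parts.append(text)

    def text(self):
        return "\n".join(self._parts)


def extract_content(html: str) -> str:
    """Return the text outside the skipped elements, one fragment per line."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = """
<html><head><style>body{margin:0}</style><script>track()</script></head>
<body>
<nav><a href="/">Home</a><a href="/docs">Docs</a></nav>
<div style="color:red"><h1>Chunking basics</h1><p>Keep chunks small.</p></div>
<footer>Cookie settings</footer>
</body></html>
"""

cleaned = extract_content(sample)
print(cleaned)
print(len(sample), "->", len(cleaned), "characters")
```

In this sample, only the heading and paragraph text survive; the navigation links, tracking script, stylesheet, inline CSS and footer banner are dropped before any tokenizer sees them. Character counts are only a proxy for tokens, so the real saving depends on the tokenizer in use.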