The Vector-Database Tax
by Rikkert Engels on Jun 23, 2026 3:15:28 PM
How a flaw in AI search wastes up to a quarter of your token budget, and how we fixed it.
We fixed the flaw with a new retrieval index that fuses metadata as a ranking signal, not a bolted-on filter, so RAG stops looping, retrying, and overspending.
If your enterprise AI bill is climbing, the instinct is to blame the model. It's the wrong suspect. A large share of what you pay isn't spent thinking. It's spent searching: the retrieval pipeline that feeds the model, looping and retrying to find the right context. And the reason it loops comes down to a single design flaw that sits underneath nearly every RAG system in production.
We modeled what that flaw costs. For an enterprise spending north of $10M a year on LLMs, fixing it is worth a modeled $1M–$2.5M per year per $10M of spend, roughly a quarter of the retrieval-driven budget in the base case. We'll show that math in full, with every assumption sourced, further down. But first, the flaw, because once you see it, the cost is obvious.
The flaw: filtering and ranking fight each other
Ask anyone who has built enterprise retrieval to draw it on a whiteboard and you'll get the same picture: embed the query, search a vector database for similar chunks, apply metadata filters, run a reranker to fix what the filters broke, trim to fit the prompt, hand it to the model. If the answer is weak, an agent rewrites the query and runs the whole thing again.
Every box is reasonable. Together they're a workaround for one problem nobody questions, because it's been there since the first RAG pipeline was drawn:
Ranking is done by vector similarity. It understands meaning, but it's blind to metadata. It has no idea whether a result is recent, authoritative, permitted, or critical.
Filtering is done by a separate gate. It understands metadata (date, author, tenant, permission), but it's a blunt yes/no that knows nothing about relevance.
Because they're separate steps, they pull against each other. Filter before you rank, and you throw away relevant results before they're ever scored. Filter after, and the top-N cutoff quietly drops the right answer. The entire modern RAG stack (hybrid search, rerankers, graph traversal, agentic re-query loops) exists to paper over that one conflict. Each layer adds latency, complexity, and tokens. None removes the cause.
That's the tax. It doesn't show up as a line item. It hides inside the retry loops a flawed index forces on you.
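To make the filter-versus-rank conflict concrete, here is a minimal sketch on toy data. Everything in it is an assumption for illustration: the `Doc` fields, the similarity scores, and the hard `min_year` cutoff are made up, and no real vector database is involved.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    similarity: float  # hypothetical precomputed similarity to the query
    year: int          # hypothetical metadata field


# Toy corpus: several old but very similar documents, one current answer.
DOCS = [
    Doc("old-policy-v1", 0.95, 2019),
    Doc("old-policy-v2", 0.93, 2020),
    Doc("old-policy-v3", 0.91, 2021),
    Doc("near-miss", 0.97, 2023),
    Doc("current-policy", 0.85, 2025),
    Doc("unrelated-memo", 0.30, 2025),
]


def filter_then_rank(docs, min_year, top_n):
    """Apply the metadata gate first, then rank the survivors by similarity."""
    kept = [d for d in docs if d.year >= min_year]
    kept.sort(key=lambda d: d.similarity, reverse=True)
    return [d.doc_id for d in kept[:top_n]]


def rank_then_filter(docs, min_year, top_n):
    """Take the top-N by similarity first, then apply the metadata gate."""
    top = sorted(docs, key=lambda d: d.similarity, reverse=True)[:top_n]
    return [d.doc_id for d in top if d.year >= min_year]


if __name__ == "__main__":
    # The gate removes "near-miss" before it is ever scored, and the
    # result is padded with a barely related memo.
    print(filter_then_rank(DOCS, min_year=2024, top_n=2))
    # The top-3 cutoff is filled with older documents, so the gate
    # leaves nothing and "current-policy" is silently lost.
    print(rank_then_filter(DOCS, min_year=2024, top_n=3))
```

In this toy setup, neither ordering returns a clean answer on its own, which is the situation a reranker or a re-query loop is then asked to repair.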
The fix: make metadata a ranking signal, not a filter
We kept asking one question: what if filtering and ranking weren't two steps?
What if a document's date, authority, permission and criticality weren't gates applied after search, but dimensions the search ranks on directly, weighted against meaning, in the same pass?
If you can do that, the conflict disappears. There's nothing left to filter, because the filter criteria are already inside the thing being ranked. Filtering becomes ranking.
You can't get there with a cleverer query on a conventional vector database. In those, metadata simply isn't in the vector; it lives in a side table the similarity math never sees. So we changed what gets stored. In Aspected's index, each document's content vector is fused with its metadata aspect vectors into a single multi-aspect vector, preserving each aspect's distinct contribution. One query ranks across meaning, recency, authority, structure and permission at once.
One pass. No separate filter to fight the ranker. No reranker to undo a bad first pass. No agentic loop to recover lost relevance. The retrieval goes in a straight line.
It's a new index structure, not a new query, which is why it's patented, and why an incumbent can't replicate it by bolting on another stage.
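The general idea of ranking on meaning and metadata in one pass can be sketched with a plain weighted concatenation. This is not Aspected's patented index, whose internals this post doesn't describe; it is a simplified stand-in, and the function names, the two scalar aspects (`recency`, `authority`, both assumed to lie in [0, 1]) and the weights are all hypothetical.

```python
import math


def unit(vec):
    """Scale a vector to unit length so content similarity is a cosine."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def fuse_document(content, recency, authority):
    """Store content and metadata aspects together in one vector."""
    return unit(content) + [recency, authority]


def fuse_query(content, w_content, w_recency, w_authority):
    """Build a query vector whose aspect weights are chosen at query time."""
    return [w_content * x for x in unit(content)] + [w_recency, w_authority]


def score(query_vec, doc_vec):
    """A single dot product covers meaning and metadata at once."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))


def search(query_vec, index):
    """Return document ids ordered by fused score, best first."""
    return sorted(index, key=lambda doc_id: score(query_vec, index[doc_id]), reverse=True)


# Toy index with made-up content vectors and aspect values.
INDEX = {
    "stale-but-similar": fuse_document([1.0, 0.0], recency=0.1, authority=0.2),
    "fresh-and-relevant": fuse_document([0.8, 0.6], recency=0.9, authority=0.9),
    "fresh-but-off-topic": fuse_document([0.0, 1.0], recency=1.0, authority=0.5),
}

if __name__ == "__main__":
    # Meaning only: the stale document wins on similarity alone.
    print(search(fuse_query([1.0, 0.0], 1.0, 0.0, 0.0), INDEX))
    # Meaning weighted against recency and authority, in the same pass.
    print(search(fuse_query([1.0, 0.0], 0.6, 0.25, 0.15), INDEX))
```

Because the weights sit on the query side in this sketch, the same stored vectors can serve different trade-offs without a separate filter stage. Hard constraints such as permissions are not modeled here.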
Why the savings follow
Two things every enterprise AI team is told to trade off, accuracy and cost, turn out to be victims of the same flaw. Close it and both move.
Cost falls because the loop is gone. The complex queries that today escalate into multi-step agentic retrieval, each step billed in tokens, resolve in one pass. And because retrieval is precise, the prompt carries the right two or three chunks instead of eight padded ones.
Accuracy rises because the system ranks on the aspect that actually matters. Ask for critical tickets and you get the ones that are critical, not the ones that merely use urgent-sounding words. Similar stops standing in for correct.
The number, shown honestly
We were curious enough about the cost effect to model it in full and to publish the model so you can check it yourself. The headline figure: on a $10M LLM budget, a modeled saving of $1M (conservative) to $2.5M (base) per year, scaling linearly with spend.
We're deliberate about what that is. It's an illustrative model built entirely on public benchmarks, not a measurement. Every assumption sits at or below the midpoint of its cited source. We publish the conservative case next to the base case, and we keep our own lab numbers out of the inputs entirely; independent validation is underway with TNO. An outside sense-check helps: independent reviews routinely find 22–48% of enterprise LLM spend recoverable through retrieval and prompt optimization, and our base case lands inside that band.
The full working (every assumption, formula and source) is in the technical whitepaper that accompanies this post.
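The headline range is easy to reproduce for your own budget. The sketch below only restates the two figures quoted in this post ($1M conservative and $2.5M base per $10M of spend) as rates and applies the stated linear scaling; the function and constant names are ours for illustration, and none of the whitepaper's underlying assumptions are modeled here.

```python
# Rates implied by the post's headline figures per $10M of annual LLM spend.
CONSERVATIVE_RATE = 1_000_000 / 10_000_000  # $1M per $10M
BASE_RATE = 2_500_000 / 10_000_000          # $2.5M per $10M


def modeled_saving(annual_llm_spend):
    """Return the (conservative, base) modeled annual saving in dollars.

    Assumes the linear scaling stated in the post; this is an
    illustration of the headline figure, not a measurement.
    """
    return (annual_llm_spend * CONSERVATIVE_RATE, annual_llm_spend * BASE_RATE)


if __name__ == "__main__":
    for spend in (10_000_000, 25_000_000, 40_000_000):
        low, high = modeled_saving(spend)
        print(f"${spend:,.0f} spend: ${low:,.0f} to ${high:,.0f} per year")
```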
For a decade, the industry has optimized the model while the retrieval layer underneath it stayed structurally the same: rank on meaning, filter on metadata, patch the gap forever. We don't think that gap was meant to be patched. It was meant to be closed.
Aspected is the retrieval infrastructure layer for enterprise AI: a patented single-vector, multi-aspect index that resolves meaning and metadata in one pass. Read the full technical whitepaper, with the complete cost model and sources, here, or ask us for a live demo on your own data.
Team @ Aspected