It Wasn’t a Hallucination. It Was a Retrieval Failure.
An analysis of the root causes of hallucinations in RAG systems, and a relevance-first approach that reduces them by design.
TL;DR
Many errors attributed to LLM “hallucinations” are in fact the result of retrieval failures in retrieval-augmented generation systems [1, 2]. When AI systems work on incomplete or poorly structured context, incorrect answers are an expected outcome. Aspected proposes a different way of thinking about retrieval, where relevance is computed across multiple semantic aspects directly in vector space. By embedding context into relevance itself, Aspected reduces “hallucinations” by design rather than by external mitigation strategies.
Why “Hallucination” Is the Wrong Diagnosis
(From this point onward, the term hallucination appears without quotation marks.)
Hallucination has become the default term for incorrect outputs produced by large language models. It is commonly used to describe responses that sound fluent and confident, yet are factually incorrect or unsupported by evidence [1]. While true hallucinations can occur at the model level, their prevalence is often overstated.
In real-world AI systems, especially those built on retrieval-augmented generation (RAG) architectures, the model rarely operates in isolation. In RAG, the language model is supplemented with external documents retrieved at query time, grounding its responses in indexed knowledge sources [2]. Its outputs are constrained by the quality and structure of the information retrieved upstream. When the retrieval layer fails to provide the right documents, or provides incomplete or badly ranked context, the model bases its response on insufficient or partially incorrect information. The output may appear fluent and confident, but it reflects limitations in the retrieved input rather than creative fabrication. What looks like a model error is often the consequence of missing or misaligned context.

This distinction matters. Treating hallucinations purely as a generation problem leads to mitigation strategies such as prompt tuning, response constraints, or post-generation validation. These approaches address symptoms rather than causes. If relevance is computed incorrectly, no amount of careful prompting will reliably fix the outcome. Furthermore, system efficiency suffers: extra prompts, validation layers, and retries are added to compensate for retrieval errors instead of removing them at the source.
Retrieval as the Real Bottleneck in Modern AI Systems
Most production-grade AI systems today rely on a retrieval layer to ground model outputs in external and up-to-date knowledge [2]. Documents are embedded, indexed, and retrieved based on semantic similarity, typically using vector search [3]. The retrieved context is then passed to the model as input.
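The retrieval step described above can be sketched in a few lines. This is a deliberately minimal toy, not a production pipeline: the three-dimensional document vectors are made up for illustration, and a real system would use an embedding model and an approximate nearest neighbor index instead of brute-force scoring.

```python
import math

# Hypothetical document embeddings; in a real system these come from
# an embedding model and live in a vector index.
DOCS = {
    "refund-policy-2023": [0.9, 0.1, 0.0],
    "refund-policy-2024": [0.8, 0.2, 0.1],
    "baggage-rules":      [0.1, 0.9, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, k=2):
    """Rank all documents by cosine similarity and return the top-k ids.
    These ids are what gets stuffed into the model's context window."""
    ranked = sorted(DOCS, key=lambda d: cosine(query_vec, DOCS[d]), reverse=True)
    return ranked[:k]

# A refund-related query vector lands near both refund documents.
context_ids = retrieve([1.0, 0.0, 0.0])
```

Everything the model says downstream is conditioned on whatever `retrieve` returns, which is why errors at this step propagate directly into the generated answer.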
However, this architecture comes with clear limitations:
- Semantic similarity is usually computed only over unstructured text.
- Structured signals such as time, document type, ownership, or sensitivity are handled separately from the search.
- Metadata is applied as pre-retrieval constraints or post-retrieval filters.
- Relevance is still approximated through fine-tuned heuristics rather than being computed directly.
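The separation described in the list above can be made concrete. In the sketch below (toy vectors and field names, not any particular product's API), metadata never shapes the similarity score; it can only exclude candidates outright before ranking happens:

```python
# Content vectors (hypothetical 2-d toy values) plus metadata held
# outside the vector space, as in a typical metadata-filtered pipeline.
DOCS = [
    {"id": "policy-2019", "vec": [0.9, 0.1], "year": 2019},
    {"id": "policy-2024", "vec": [0.7, 0.3], "year": 2024},
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(query_vec, k=1, year=None):
    """Pre-retrieval filter: metadata removes candidates outright
    before similarity is ever computed -- a hard, binary constraint,
    not a graded contribution to relevance."""
    pool = [d for d in DOCS if year is None or d["year"] == year]
    pool.sort(key=lambda d: dot(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in pool[:k]]
```

Without the filter, raw similarity prefers the stale 2019 document; with the filter, recency is enforced as an all-or-nothing cutoff rather than weighed against the other signals.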
As a result, retrieval pipelines often exclude relevant documents too early or surface context that is semantically similar but contextually wrong. This leads to the model producing answers that are coherent but incorrect. The failure is not creative invention by the model; it is the predictable outcome of an incomplete relevance model.
In domains where errors have serious consequences, such as legal, regulatory, financial, or enterprise knowledge systems, the cost of these failures is real. Reliability cannot be achieved by improving generation alone. Retrieval has to be treated as a first-class concern.
In 2024, Air Canada was held liable after its customer service chatbot provided a passenger with incorrect refund information based on outdated policy documentation [5]. The tribunal ruled that the airline was responsible for the chatbot’s output, even though the response was generated automatically. The issue was not model creativity but incorrect grounding in policy information.
Why Vector Search Filters Are Not Enough
The dominant RAG pattern follows a familiar flow: text is embedded, top results are retrieved by vector similarity, and metadata is used to filter or rerank them. This approach assumes that relevance can be approximated through incremental refinement.
In practice, this assumption does not hold up as systems grow. Filters remove results instead of shaping similarity. Reranking happens after the initial retrieval decision has already been made. If the relevant context was filtered out or never included in the top-k results, no amount of reranking can recover it. As collections grow larger and more heterogeneous, the gap between semantic similarity and actual relevance widens.
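The reranking limitation is easy to demonstrate. In this hypothetical two-stage pipeline, the truly relevant document ranks third by raw similarity, so with top-2 retrieval even a perfect reranker never sees it:

```python
def retrieve_then_rerank(scores, relevance, k=2):
    """Two-stage pipeline: shortlist the top-k by similarity, then pick
    the best candidate by a better relevance score. If the right document
    is not in the top-k shortlist, reranking cannot recover it."""
    candidates = sorted(scores, key=scores.get, reverse=True)[:k]
    return max(candidates, key=lambda d: relevance[d])

# Raw vector similarity puts the truly relevant document "c" third...
sim = {"a": 0.95, "b": 0.90, "c": 0.85}
# ...even though an oracle reranker would score it highest.
true_rel = {"a": 0.2, "b": 0.3, "c": 0.9}
```

With `k=2` the pipeline returns `"b"`; only widening the shortlist to `k=3` lets the reranker find `"c"`, and in large heterogeneous collections there is no `k` that is both cheap and safe.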
This is why many hallucination mitigation strategies quickly reach their limits. They operate downstream of a retrieval architecture that was never designed to compute relevance as a whole.
Aspected: Relevance as a Retrieval Primitive
This is precisely where Aspected comes in. Rather than treating retrieval as a sequence of semantic search followed by metadata filtering, Aspected computes relevance directly across multiple aspects, addressing retrieval failure at its root [4]. Instead of deriving a single vector from unstructured text, it represents each document across multiple aspects. These aspects can include content, time, structure, sensitivity, ownership, or any other meaningful property. Both original metadata and AI-enriched attributes are transformed into vector representations and combined into a unified embedding.
From the perspective of the vector database, this embedding behaves like any other vector. Existing approximate nearest neighbor indexing structures can still be used, but the distance functions are adapted to allow selective dimension masking without distorting similarity computation. What changes is what similarity means. In Aspected, similarity is no longer a proxy for relevance. It is relevance. Because multiple aspects contribute directly to the similarity score, retrieval happens in a single step across all relevant dimensions. Context is not applied externally. It is built into the retrieval model itself.
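The shape of this idea can be illustrated with a small sketch. To be clear, this is not Aspected's implementation: the aspect names, the plain concatenation, and the masking scheme below are all assumptions chosen to show how per-aspect dimensions can contribute to, or be masked out of, a single similarity score without re-indexing.

```python
import math

def unified_embedding(aspects):
    """Concatenate per-aspect vectors into one embedding and record
    which dimension span each aspect occupies.
    `aspects` is an ordered list of (name, vector) pairs."""
    vec, spans, start = [], {}, 0
    for name, v in aspects:
        vec.extend(v)
        spans[name] = (start, start + len(v))
        start += len(v)
    return vec, spans

def masked_cosine(a, b, spans, mask=()):
    """Cosine similarity with selected aspect dimensions zeroed out,
    so a query can ignore aspects at search time."""
    keep = [True] * len(a)
    for name in mask:
        lo, hi = spans[name]
        for i in range(lo, hi):
            keep[i] = False
    ax = [x if k else 0.0 for x, k in zip(a, keep)]
    bx = [x if k else 0.0 for x, k in zip(b, keep)]
    dot = sum(x * y for x, y in zip(ax, bx))
    na = math.sqrt(sum(x * x for x in ax))
    nb = math.sqrt(sum(x * x for x in bx))
    return dot / (na * nb) if na and nb else 0.0
```

Because the unified embedding is still an ordinary vector, it can live in a standard vector index; the masking only changes how distance is computed at query time, which matches the single-step retrieval described above.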
When retrieval returns documents that are both semantically related and contextually aligned, the language model operates with more complete and appropriate input. Temporal mismatches, policy violations, and structural conflicts are less likely because they influence similarity scoring directly, rather than being handled as afterthoughts. In this setting, many outputs previously labeled as hallucinations disappear. Not because the model has changed, but because the system is finally retrieving what actually matters. This suggests that hallucinations are often a systems problem, not a model flaw. Incorrect answers are often a signal that relevance was approximated instead of computed.
Ultimately, while hallucinations are often treated as an inevitable side effect of generative models, many of them are, in practice, avoidable. When an AI system gives a wrong answer, the first question should not be whether the model made something up. It should be whether the system retrieved the right information in the first place. Aspected was created because relevance is not a feature to be tuned. It is a primitive to be computed.
If you are exploring how to build more reliable retrieval systems, you can read our previous deep dive on the Aspected architecture, or visit http://aspected.com to learn more.
Team @ Aspected
References
[1] Mialon, G., et al. (2023). Augmented Language Models: A Survey. ACM Computing Surveys.
[2] Microsoft (2024). Grounding LLMs with Retrieval-Augmented Generation. Azure Architecture Center.
[3] Karpukhin, V., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP.
[4] Xillio Aspected (2026). Enterprise Retrieval Solutions: Enable AI on Existing Content Without Migration.
[5] BBC (2024). Airline held liable for its chatbot giving passenger bad advice.