Retrieval-Augmented Generation (RAG)¶
Trained and fine-tuned LLMs can generate high-quality results, though those results are generally confined to the information the models were trained on. Additionally, responses can suffer from:
- Confabulations and Hallucinations that create false or inaccurate information
- Lack of attribution, making it difficult to ascertain validity
- Staleness due to new or updated information
Retrieval-Augmented Generation (RAG) helps to address these issues: it is a context-augmentation method that couples the model to external memory.
Here is a basic comparison of the two:
Comparison with/without RAG
graph LR
style QueryEncoder fill:#D2E1FA,stroke:#333,stroke-width:1px
style QueryOptimizer1 fill:#E7B4E1,stroke:#333,stroke-width:1px
style Query fill:#FADAD2,stroke:#333,stroke-width:1px
style Prompt fill:#D2FAFA,stroke:#333,stroke-width:1px
style Docs fill:#FADAD2,stroke:#333,stroke-width:1px
style QueryOptimizer2 fill:#E7B4E1,stroke:#333,stroke-width:1px
style DocEncoder fill:#D2E1FA,stroke:#333,stroke-width:1px
style Retriever fill:#E1E7B4,stroke:#333,stroke-width:1px
style Context fill:#B4E1E7,stroke:#333,stroke-width:1px
style Generator fill:#FAD2E1,stroke:#333,stroke-width:1px
style Answer fill:#E1FAD2,stroke:#333,stroke-width:1px
QueryEncoder --> |Retrieve\n from|Retriever
Prompt --> Generator[LLM\n Generation]
Query --> Generator
Query --> QueryOptimizer1(Query\n Optimizer)
QueryOptimizer1 --> QueryEncoder[Encoder]
Docs --> QueryOptimizer2(Docs\n Optimizer)
QueryOptimizer2 --> DocEncoder[Encoder]
DocEncoder --> |Index\n to| Retriever[Database]
Retriever --> Context
Context --> Generator
Generator --> Answer
graph LR
style Query fill:#E1FAD2,stroke:#333,stroke-width:1px
style Prompt fill:#D2FAFA,stroke:#333,stroke-width:1px
style Generator fill:#FAD2E1,stroke:#333,stroke-width:1px
style Answer fill:#E1FAD2,stroke:#333,stroke-width:1px
Query --> Generator[LLM Generation]
Prompt --> Generator
Generator --> Answer
The original incarnations of RAG connect queries to embedding-based lookups, though other mechanisms, including keyword searches and other lookups from memory sources, are also possible.
RAG is still an active area of research, with a number of components that can be optimized, including:
- Manner of document encoding and chunking
- Manner of query encoding, and deciding when and what to retrieve
- How to combine the contexts with the prompts
One of the seminal papers on RAG, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, introduced end-to-end training of the document encoder, query encoder, and generator, and demonstrated improved results over solutions where model components were frozen. For simplicity, however, the now-standard approach uses frozen models to embed documents and queries.
It is important to evaluate your system to ensure your RAG efforts are well spent.
Why use RAG?¶
Large foundation models are trained on large corpora of public (and sometimes private) data. Models may lose effective semantic grounding because of the breadth of implicit knowledge codified in their next-token predictors. To improve the groundedness and appropriateness of the output, RAG fetches relevant information that is combined with the prompt context so the LLM can generate appropriate results. This is particularly important when information changes and needs to be incorporated quickly.
Importantly, you can use RAG for data summarization, question answering, and the ability to 'know how' information was generated in a somewhat more interpretable manner.
Use RAG because:
- You need knowledge beyond the LLM's training set
- You want to minimize hallucinations
- Your data can be highly dynamic
- The results need to be interpretable
- You don't have training data available
Why not use RAG?¶
The primary challenges with RAG tend to be organizational or functional.
Don't use RAG because:
- You have latency requirements that the added retrieval step may violate.
- You don't want to pay for, or maintain and support a RAG database.
- There are ethical or privacy concerns relating to sending data to a third-party API
RAG vs Finetuning¶
Because fine-tuning can ingrain intrinsic knowledge in an LLM, it generally leads to improved performance.
RAG vs. Finetuning comparisons suggest that fine-tuning boosts performance over RAG
That said, using RAG to inform fine-tuning, as in Retrieval-Augmented Fine-Tuning (RAFT), possibly with variations such as mixture-of-experts, can lead to even better performance.
The RAG process can be divided into two main stages: Preparation (offline) and Retrieval and Generation (online).
Document Indexing (offline)¶
Indexing involves loading data, splitting data, embedding data, adding metadata, and storing the data.
It is useful to maintain, in parallel with indexing, a record of what has been put into the vector store.
This record-keeping improves performance, saving time and money by not:
- Re-processing unchanged content
- Re-computing embeddings of unchanged content
- Inserting duplicated content
The LangChain blog and docs on indexing provide quality discussions of these topics.
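A rough sketch of that record-keeping, assuming LangChain's indexing API with `SQLRecordManager` and a Chroma vector store (collection names, paths, and the example document are illustrative):

```python
from langchain.indexes import SQLRecordManager, index
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# The vector store that chunks are written into.
vectorstore = Chroma(collection_name="my_docs", embedding_function=OpenAIEmbeddings())

# The record manager tracks what has already been indexed, keyed by source.
record_manager = SQLRecordManager("chroma/my_docs", db_url="sqlite:///record_manager.sql")
record_manager.create_schema()

docs = [Document(page_content="RAG couples an LLM with external memory.",
                 metadata={"source": "notes/rag.md"})]

# "incremental" cleanup skips unchanged documents and removes stale chunks per source.
result = index(docs, record_manager, vectorstore, cleanup="incremental", source_id_key="source")
print(result)  # e.g. {'num_added': 1, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
```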
Indexing process (clickable)
graph LR
style DocumentSelection fill:#B4E1E7,stroke:#333,stroke-width:1px
style LoadDocuments fill:#FAD2E1,stroke:#333,stroke-width:1px
style SplitDocuments fill:#E1FAD2,stroke:#333,stroke-width:1px
style EmbedDocumentSplits fill:#D2FAFA,stroke:#333,stroke-width:1px
style StoringData fill:#FADAD2,stroke:#333,stroke-width:1px
DocumentSelection[Select Documents] --> LoadDocuments[Load \nDocuments]
LoadDocuments --> SplitDocuments[Split \n Documents]
SplitDocuments --> EmbedDocumentSplits[Embed \n Document \n Splits]
EmbedDocumentSplits --> StoringData[Store in \nDatabase]
click DocumentSelection "#selecting-data"
click LoadDocuments "#loading-data"
click SplitDocuments "#splitting-data"
click EmbedDocumentSplits "#embedding-data"
click StoringData "#storing-data"
The preparation stage involves the following steps, performed offline (a combined sketch follows the list):
- Data Selection: Choose the appropriate data to ingest.
- Loading Data: Load the data in a manner that can be consumed by the models.
- Splitting Data: Split the data into chunks that can be both consumed by the model and retrieved with a reasonable degree of relevance.
- Embedding Data: Embed the data.
- Storing Data: Store the embeddings.
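Put together, a minimal offline preparation pipeline might look like the following sketch, assuming LangChain-style components (the file path, chunk sizes, and embedding model are illustrative):

```python
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# 1. Select and load the data (a single markdown file here, purely illustrative).
docs = TextLoader("knowledge_base/handbook.md").load()

# 2. Split into chunks the embedding model and retriever can handle well.
splits = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150).split_documents(docs)

# 3. Embed the chunks and 4. store them (FAISS as a simple local vector store).
vectorstore = FAISS.from_documents(splits, OpenAIEmbeddings())
vectorstore.save_local("rag_index")
```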
Selecting Data¶
Users should only access data that is appropriate for their application. Including too much information can be unnecessary, or even harmful to retrieval if the retriever cannot handle the volume or complexity of the data. It is also crucial to ensure data privacy and to avoid providing data that is not appropriate (or legal) to access.
Loading Data¶
Different data types require different loaders. Raw text, PDFs, spreadsheets, and more proprietary formats need to be processed so that the most relevant information is preserved. Text is easy to process, but some data, especially multimodal data like PDFs, may need to be formatted with a schema to allow for more effective searching.
Splitting Data¶
Once data has been loaded in a way that a model can process it, it must be split. There are several ways of splitting data:
- By the max size a model can handle.
- By some heuristic break, such as sentences (`.`), return characters (`\n`), or paragraphs (see the sketch after this list).
- In a manner that maximizes topic coherence. In this case, splitting and embedding may happen simultaneously.
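A heuristic splitter can be told to prefer paragraph, newline, and sentence boundaries before falling back to a hard size limit; a minimal sketch using LangChain's `RecursiveCharacterTextSplitter` (sizes, separators, and the file path are illustrative):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,       # hard upper bound on chunk length (characters)
    chunk_overlap=100,    # overlap to preserve context across chunk boundaries
    separators=["\n\n", "\n", ". ", " "],  # paragraphs, then lines, then sentences, then words
)
chunks = splitter.split_text(open("handbook.md").read())
```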
Late Chunking of Short Chunks in Long-Context Embedding Models
The authors show in their blog and paper that embedding the full text at the token level first, and then intelligently pooling those token embeddings into chunks, yields better embeddings for lookup.
Contextual retrieval
Anthropic describes contextual retrieval, where entire documents are cached (for efficiency) and each chunk is prefixed with generated context, significantly improving RAG retrieval. They use the following prompt to generate the contextual text that is paired with each chunk when performing embedding. The reported results show significant performance improvements (up to a 67% reduction in retrieval failures).
<document>
{{ WHOLE_DOCUMENT }}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{{ CHUNK_CONTENT }}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
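A rough sketch of how such contextual chunks might be produced before embedding (the `llm` argument is a placeholder for whatever client you use, the prompt placeholders are renamed for Python's `str.format`, and provider-side prompt caching of the whole document is omitted):

```python
CONTEXT_PROMPT = """<document>
{whole_document}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{chunk}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the \
purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""

def contextualize(whole_document: str, chunks: list[str], llm) -> list[str]:
    """Prepend an LLM-generated situating context to each chunk before embedding."""
    contextualized = []
    for chunk in chunks:
        # llm is assumed to be a callable that takes a prompt string and returns text.
        context = llm(CONTEXT_PROMPT.format(whole_document=whole_document, chunk=chunk))
        contextualized.append(f"{context}\n\n{chunk}")
    return contextualized
```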
Embedding Data¶
Index Building - One of the most useful tricks is multi-representation indexing: decouple what you index for retrieval (e.g., table or image summary) from what you pass to the LLM for answer synthesis (e.g., the raw image, a table). Read more
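A minimal sketch of that decoupling without committing to a particular retriever class: index the summaries, keep an id-to-raw-item map, and hand the raw item to the LLM at answer time (all names, paths, and summaries are illustrative):

```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

# Raw items (e.g., a table file or an image path) and their LLM-written summaries.
raw_items = {"tbl-1": "data/q3_financials.csv", "img-1": "data/architecture.png"}
summaries = {"tbl-1": "Quarterly revenue and margin table for 2023.",
             "img-1": "Diagram of the ingestion and retrieval architecture."}

# Index only the summaries, tagging each with the id of the raw item it represents.
summary_docs = [Document(page_content=text, metadata={"doc_id": key})
                for key, text in summaries.items()]
vectorstore = FAISS.from_documents(summary_docs, OpenAIEmbeddings())

# At query time, retrieve against the summaries but pass the raw item to the LLM.
hits = vectorstore.similarity_search("How did revenue change in Q3?", k=1)
raw_for_llm = [raw_items[hit.metadata["doc_id"]] for hit in hits]
```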
Adding metadata¶
Information such as dates, chapters, or key words can allow for filtering and key-word lookup.
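Most vector stores expose such metadata as a query-time filter; a small sketch using Chroma via LangChain (the field name and collection are illustrative, and filter syntax varies by store):

```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma(collection_name="handbook", embedding_function=OpenAIEmbeddings())

# Restrict the semantic search to chunks whose metadata matches the filter.
results = vectorstore.similarity_search(
    "vacation carry-over policy",
    k=4,
    filter={"chapter": "benefits"},
)
```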
Storing Data¶
The embedded data is stored for future retrieval and use. This is done via standard database methods, with embeddings serving as vector-retrieval keys and metadata supporting more traditional (keyword) search methods.
Retrieval and Generation (online)¶
The retrieval and generation stage involves the following steps:
- Retrieving Data: Retrieve the data based on input in such a way that relevant documents and chunks can be used in downstream chains.
- Generating Output: Generate an output using a prompt that integrates the query and retrieved data.
Whether and what to retrieve will depend on the additional context the agent needs to be aware of.
It might not always be necessary to retrieve documents. When it is, it is important to know where to retrieve from (routing) and then how to match the query to the appropriately stored information. Both may involve rewriting the query so that retrieval is more effective.
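Before walking through the individual steps (diagrammed below), here is a minimal sketch of the online stage as a whole; it assumes the FAISS index built in the earlier offline sketch, and the model name is illustrative:

```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.load_local("rag_index", embeddings, allow_dangerous_deserialization=True)
llm = ChatOpenAI(model="gpt-4o-mini")

def answer(query: str) -> str:
    # Retrieve the top chunks for the (optionally optimized) query.
    docs = vectorstore.similarity_search(query, k=4)
    context = "\n\n".join(d.page_content for d in docs)
    # Combine the retrieved context with the query and ask for a grounded answer.
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm.invoke(prompt).content
```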
Retrieval and generation (clickable)
graph LR
style C fill:#B4E1E7,stroke:#333,stroke-width:1px
style T fill:#FAD2E1,stroke:#333,stroke-width:1px
style RR fill:#E1FAD2,stroke:#333,stroke-width:1px
style R fill:#FADAD2,stroke:#333,stroke-width:1px
style F fill:#E7B4E1,stroke:#333,stroke-width:1px
style G fill:#D2E1FA,stroke:#333,stroke-width:1px
style H fill:#E1E7B4,stroke:#333,stroke-width:1px
C[Query] --> T[Optimize]
T --> RR[Route]
RR --> R[Match and \nRank Documents]
R --> F[Combine With\n Context]
F --> G[LLM \nGeneration]
G --> H[Answer]
click T "#query-optimization"
click RR "#routing"
click R "#match-and-rank"
click F "#CombineWithContext"
click G "#LLMGeneration"
click H "#Answer"
Query Optimization¶
In production settings, the queries that users ask are unlikely to be optimal for retrieval. This can be due to a combination of challenges, such as questions that are:
- Irrelevant
- Vague
- Not related to retrieval
- Made of multiple questions
Query optimization looks to improve these queries in several ways. Here are several approaches, with more detailed descriptions in LangChain's query analysis docs.
Rewrite-Retrieve-Read¶
This approach involves rewriting the query for better retrieval and reading of the relevant documents.
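A minimal sketch, reusing the `llm` and `vectorstore` objects from the earlier examples (the rewrite prompt wording is illustrative):

```python
REWRITE_PROMPT = (
    "Rewrite the following user question as a concise, self-contained search query "
    "for a document database. Return only the query.\n\nQuestion: {question}"
)

def rewrite_retrieve_read(question: str, llm, vectorstore) -> str:
    # Rewrite: turn the raw question into a retrieval-friendly query.
    search_query = llm.invoke(REWRITE_PROMPT.format(question=question)).content.strip()
    # Retrieve: fetch documents for the rewritten query.
    docs = vectorstore.similarity_search(search_query, k=4)
    context = "\n\n".join(d.page_content for d in docs)
    # Read: answer the original question from the retrieved context.
    return llm.invoke(f"Context:\n{context}\n\nAnswer the question: {question}").content
```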
Step Back Prompting¶
This method generates an intermediate context that 'abstracts' the question. Once generated, the additional context can be used alongside the normal context, as in the prompt below.
Step back
You are an expert of world knowledge. I am going to ask you a question. Your response should be comprehensive and not contradicted with the following context if they are relevant. Otherwise, ignore them if they are not relevant.
{normal_context}
{step_back_context}
Original Question: {question}
Answer:
Query Rephrasing¶
Particularly in chat settings, it's important to include all of the appropriate context to create an effective search query.
Query Decomposition¶
When a question is composed of multiple questions, or effectively answering it requires answering several sub-questions, breaking it into multiple queries may be essential. This may involve sequential queries built from previously retrieved information, or independent queries that can be run irrespective of other results, as in the sketch below. See LangChain's query decomposition docs.
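A sketch of the independent (non-sequential) case, where sub-questions are generated once and evidence is retrieved for each separately (the prompt wording and helper names are illustrative):

```python
DECOMPOSE_PROMPT = (
    "Break the question below into the minimal set of independent sub-questions "
    "needed to answer it, one per line.\n\nQuestion: {question}"
)

def decompose_and_retrieve(question: str, llm, vectorstore) -> dict[str, list]:
    # Ask the LLM for one sub-question per line, then strip empty lines.
    sub_questions = [
        line.strip() for line in
        llm.invoke(DECOMPOSE_PROMPT.format(question=question)).content.splitlines()
        if line.strip()
    ]
    # Retrieve evidence for each sub-question independently.
    return {sq: vectorstore.similarity_search(sq, k=3) for sq in sub_questions}
```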
Query Expansion¶
Query expansion generates multiple rephrased versions of the query to increase the likelihood of a hit, or uses the rephrasings together to triangulate higher-quality hits.
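A sketch of expansion combined with reciprocal rank fusion (RRF) to merge the per-variant result lists (the constant 60 is a conventional RRF default; the prompt and helper names are illustrative):

```python
from collections import defaultdict

EXPAND_PROMPT = "Generate 3 differently-worded versions of this search query, one per line:\n{query}"

def expanded_search(query: str, llm, vectorstore, k: int = 4) -> list:
    # Generate rephrasings and keep the original query as well.
    variants = [query] + [
        v.strip() for v in llm.invoke(EXPAND_PROMPT.format(query=query)).content.splitlines()
        if v.strip()
    ]
    # Reciprocal rank fusion: documents ranked highly by several variants float to the top.
    scores: dict[str, float] = defaultdict(float)
    docs_by_key = {}
    for variant in variants:
        for rank, doc in enumerate(vectorstore.similarity_search(variant, k=k)):
            key = doc.page_content
            docs_by_key[key] = doc
            scores[key] += 1.0 / (60 + rank)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [docs_by_key[key] for key in ranked[:k]]
```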
Query Clarifying¶
Particularly in chat settings when questions are vague, asking follow-up questions can be instrumental in ensuring the lookup can be as effective as possible.
Query structuring¶
When answers can be 'filtered' using metadata, structuring the query around those elements can be highly valuable. Filterable attributes include date, location, and subject. See LangChain's query construction docs for additional information.
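A sketch that uses structured output to turn a natural-language question into a search string plus metadata filters (the schema fields are illustrative, and `with_structured_output` assumes a LangChain-style chat model):

```python
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class SearchFilters(BaseModel):
    """Structured form of a user question for filtered retrieval."""
    query: str = Field(description="Semantic search string")
    subject: str | None = Field(default=None, description="Subject or topic filter")
    after_date: str | None = Field(default=None, description="Only include content after this ISO date")

structurer = ChatOpenAI(model="gpt-4o-mini").with_structured_output(SearchFilters)
filters = structurer.invoke("What did the 2024 onboarding guide say about laptop setup?")
# filters.query, filters.subject, and filters.after_date can now drive a filtered vector search.
```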
Routing¶
Depending on the question asked, queries may need to be routed to different data sources or indexes. OpenAI's RAG strategies provide some guidance on question routing.
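A sketch of a simple LLM-based router that picks one of several indexes before retrieving (the index names and routing prompt are illustrative):

```python
ROUTE_PROMPT = (
    "Which data source best answers this question? "
    "Reply with exactly one of: product_docs, support_tickets, engineering_wiki.\n\n"
    "Question: {question}"
)

def route_and_retrieve(question: str, llm, indexes: dict) -> list:
    # Let the LLM pick a source, then fall back to a default if the reply is unexpected.
    choice = llm.invoke(ROUTE_PROMPT.format(question=question)).content.strip().lower()
    vectorstore = indexes.get(choice, indexes["product_docs"])
    return vectorstore.similarity_search(question, k=4)
```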
Matching and Ranking¶
Matching involves aligning the query with the appropriately stored information.
Multi-Hop RAG¶
In order to effectively answer some queries, evidence may need to be retrieved from multiple documents. This is known as multi-hop RAG.
MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries provides a dataset for evaluating multi-hop RAG:
"MultiHop-RAG: a QA dataset to evaluate retrieval and reasoning across documents with metadata in the RAG pipelines. It contains 2556 queries, with evidence for each query distributed across 2 to 4 documents. The queries also involve document metadata, reflecting complex scenarios commonly found in real-world RAG applications."
Iterating and Corrective RAG¶
SELF-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
The authors show in their blog and paper an iterative, self-reflective RAG approach that yields SOTA results on QA and fact-verification tasks.
In their own words:
> The issue: Factual inaccuracies of versatile LLMs
Despite their remarkable capabilities, large language models (LLMs) often produce responses containing factual inaccuracies due to their sole reliance on the parametric knowledge they encapsulate. They often generate hallucinations, especially in long-tail, their knowledge gets obsolete, and lacks attribution.
Is Retrieval-Augmented Generation a silver bullet?
Retrieval-Augmented Generation (RAG), an ad hoc approach that augments LMs with retrieval of relevant knowledge, decreases such issues and shows effectiveness in knowledge-intensive tasks such as QA. However, indiscriminately retrieving and incorporating a fixed number of retrieved passages, regardless of whether retrieval is necessary, or passages are relevant, diminishes LM versatility or can lead to unhelpful response generation. Moreover, there's no guarantee that generations are entailed by cited evidence.
What is Self-RAG?
Self-Reflective Retrieval-Augmented Generation (Self-RAG) is a new framework to enhance an LM's quality and factuality through retrieval and self-reflection. Our framework trains a single arbitrary LM that adaptively retrieves passages on-demand (e.g., can retrieve multiple times during generation, or completely skip retrieval), and generates and reflects on retrieved passages and its own generations using special tokens, called _reflection tokens_. Generating reflection tokens makes the LM controllable during the inference phase, enabling it to tailor its behavior to diverse task requirements.
How good is Self-RAG?
Experiments show that Self-RAG (7B and 13B parameters) significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks. Specifically, Self-RAG outperforms ChatGPT and retrieval-augmented Llama2-chat on Open-domain QA, reasoning and fact verification tasks, and it shows significant gains in improving factuality and citation accuracy for long-form generations relative to these models.
![image](https://github.com/user-attachments/assets/7166c3e0-6145-4fe4-9e02-f5cbe0c70b52)
![image](https://github.com/user-attachments/assets/0f045d00-5cfc-4b60-9ae9-df4deb319409)
![image](https://github.com/user-attachments/assets/0713e3ac-a55d-4f42-a939-7cdc66e0d4ec)
Best results so far: Corrective Retrieval Augmented Generation
Developments: The authors show in their paper an iterative RAG approach that evaluates document relevance and the confidence with which different actions should be taken. Called Corrective Retrieval-Augmented Generation (CRAG), it achieves significant improvements over other solutions, including Self-RAG.
A tutorial combines Self-RAG and Corrective RAG ideas into a self-corrective RAG application for answering questions about the Pandas documentation using LangGraph Cloud, flexibly handling model hallucinations. You'll see how to check for hallucinations after an answer is generated, and how to check answer relevancy before returning the answer to the user.
Small to big lookup¶
Small-to-big retrieval indexes and matches small, precise chunks, but passes a larger parent section (or a surrounding window) of each matched chunk to the LLM, improving retrieval precision without starving generation of context.
Reranking¶
After an initial, broad retrieval, a reranker (often a cross-encoder or a hosted reranking API) re-scores the candidate chunks against the query and keeps only the most relevant ones for generation; see the sketch below.
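A minimal reranking sketch using a sentence-transformers cross-encoder (the checkpoint name is a common public model and is an assumption; any cross-encoder or reranking API could be substituted):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 4) -> list[str]:
    # Score each (query, passage) pair jointly -- slower than embedding lookup, but more precise.
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_n]]
```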
Generating responses¶
The final step is generating an output using a prompt that integrates the query and retrieved data.
Challenges in generating responses include:
- Not having enough information: RAG can help minimize the generation of non-factual responses, but only if the retrieved information provides sufficient context to answer the question properly. If the question cannot be answered with a reasonable degree of certainty, then the response should be along the lines of "I don't know."
- Conflicting information: When retrieved results contain different answers to the same question, a definitive response may not be possible.
- Stale information: When information is no longer relevant.
Advanced methods¶
STRUCTRAG: BOOSTING KNOWLEDGE INTENSIVE REASONING OF LLMS VIA INFERENCE-TIME HYBRID INFORMATION STRUCTURIZATION
Developments: The authors create a new framework called StructRAG that identifies the optimal structure for the documents fed into the prompts and show strong improvements in results. The core components are:
As seen on the internet:
🛣️ Hybrid Structure Router: analyzes the input question and determines the best format to structure the data before processing it. It can choose from:
- Tables for tasks with a lot of statistical data
- Graphs for tasks requiring long-chain reasoning, like tracing cause-effect relationships
- Catalogues for summarizing or organizing hierarchical information
- Chunks for simpler, one-off tasks
- Algorithms for more procedural tasks
Each type benefits from a specific structure. For example, using a table for a statistical comparison task is much more efficient than just presenting the raw text.
🧱 Scattered Knowledge Structurizer: once the Hybrid Structure Router has selected the best knowledge format, StructRAG takes all the relevant information and organizes it into the appropriate structure:
- For tables, it arranges data into rows and columns (e.g., comparing company financials across years).
- For graphs, it forms entity-relationship triples like "Company A → revenue growth → 10%."
- For chunks, it keeps the text but filters out the noise, giving the model only what's relevant.
🛠️ Structured Knowledge Utilizer: this component decomposes complex questions into sub-questions and extracts relevant information from the structured knowledge to answer each one. Then, it integrates those sub-answers into a final inference. For example, if you ask the model, "Which company has shown the best growth over the last 5 years?" the Utilizer breaks this down into sub-questions like:
- What was each company's growth percentage?
- How did their revenue change year-on-year?
- How do those numbers compare?
It retrieves precise data from the structured knowledge (e.g., the table) and uses it to construct an answer that's more accurate and contextually aware.
In their own words:
StructRAG framework consists of three modules designed to sequentially identify the most suitable structure type, construct structured knowledge in that format, and utilize that structured knowledge to infer the final answer. First, recognizing that different structure types are suited for different tasks, a hybrid structure router is proposed to determine the most appropriate structure type based on the question and document information of the current task. Second, given that constructing structured knowledge is complex and requires strong comprehension and generation abilities, an LLM-based scattered knowledge structurizer is employed to convert raw documents into structured knowledge in the optimal type. Finally, since questions in knowledge-intensive reasoning tasks can often be complex composite problems that are challenging to solve directly, a structured knowledge utilizer is used to perform question decomposition and precise knowledge extraction for more accurate answer inference.
Multimodal RAG¶
Natural-language lookup with RAG can be improved by incorporating other modalities, such as tables and images, at the same time. There are several ways this may be accomplished, as described in LangChain's multi-modal RAG docs:
Option 1:
- Use multimodal embeddings (such as CLIP) to embed images and text
- Retrieve both using similarity search
- Pass raw images and text chunks to a multimodal LLM for answer synthesis
Option 2:
- Use a multimodal LLM (such as GPT4-V, LLaVA, or FUYU-8b) to produce text summaries from images
- Embed and retrieve text
- Pass text chunks to an LLM for answer synthesis
Option 3:
- Use a multimodal LLM (such as GPT4-V, LLaVA, or FUYU-8b) to produce text summaries from images
- Embed and retrieve image summaries with a reference to the raw image
- Pass raw images and text chunks to a multimodal LLM for answer synthesis
- Multi-Modal: This approach is used for RAG on a Substack that has many images of densely packed tables and graphs. Here is an example implementation, and here is one that works with private data.
- Semi-Structured: This approach is used for RAG on documents with tables, which can be split using naive RAG text-splitting that does not explicitly preserve them. Here is an example implementation.
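A minimal sketch of Option 1's shared text/image embedding space using the sentence-transformers CLIP checkpoint (the model name is a public checkpoint, paths are illustrative, and vector storage plus answer synthesis are omitted):

```python
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")  # embeds both images and text into one space

image_embeddings = model.encode([Image.open("figures/revenue_chart.png")])
text_embeddings = model.encode(["Quarterly revenue grew 12% year over year."])
query_embedding = model.encode(["How fast is revenue growing?"])

def cosine(a, b):
    # Cosine similarity between each row of a and each row of b.
    return a @ b.T / (np.linalg.norm(a, axis=1, keepdims=True) * np.linalg.norm(b, axis=1))

# The best-scoring image/text hits would then go to a multimodal LLM for answer synthesis.
print(cosine(query_embedding, np.vstack([image_embeddings, text_embeddings])))
```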
Evaluating and Comparing¶
Because there are so many ways of performing RAG, it is important to evaluate the quality of the implemented solution.
RAG Arena interfaces with LangChain to provide a RAG chatbot experience where queries receive multiple responses.
Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely
Development: The authors present a survey that introduces a RAG task categorization method that helps to classify user queries into four levels according to the type of external data required and the focus of the task. It summarizes key challenges in building robust data-augmented LLM applications and the most effective techniques for addressing them.
In general, it breaks down the complexity of queries into several levels:
- L1: Explicit Fact Queries: answer specific questions based on documents or snippets within the collection.
- L2: Implicit Fact Queries: answer questions involving data dependencies or some level of logical or common-sense reasoning.
- L3: Interpretable Rationale Queries: queries that require external data to create a rationale for comparison.
- L4: Hidden Rationale Queries: queries that require domain-specific reasoning that may not be explicitly described and is difficult to enumerate.
Open source tools and applications¶
Resources, Tutorials and Blogs¶
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks introduces a complete solution for enabling improved response generation with LLMs.
The authors reveal that allowing fine-tuning of the models, when equipped with RAG, improved the results.
12 RAG Pain Points and Proposed Solutions
Things that might lead to failure of a RAG pipeline, mostly taken from the blog post.
Pain points and proposed solutions:
1: Missing Content:
- Clean your data
- Better prompting
2: Missed the Top Ranked Documents
- Hyperparameter tuning for `chunk_size` and `similarity_top_k`, as in Hyperparameter Optimization for RAG.
- Reranking: the Improving Retrieval Performance by Fine-tuning Cohere Reranker with LlamaIndex notebook uses `CohereRerank` to rerank the results:
import os
from llama_index.postprocessor.cohere_rerank import CohereRerank

api_key = os.environ["COHERE_API_KEY"]
cohere_rerank = CohereRerank(api_key=api_key, top_n=2)  # return top 2 nodes from reranker

query_engine = index.as_query_engine(
    similarity_top_k=10,  # we can set a high top_k here to ensure maximum relevant retrieval
    node_postprocessors=[cohere_rerank],  # pass the reranker to node_postprocessors
)

response = query_engine.query(
    "What did Sam Altman do in this essay?",
)
3: Not in Context — Consolidation Strategy Limitations
- Tweak retrieval strategies
- Finetune embeddings
4: Not Extracted
- Clean your Data
- Prompt Compression
- Long Context Reorder (put crucial content at beginning and end)
5: Wrong Format
- Output Parsing
- Pydantic
6: Incorrect Specificity
7: Incomplete and Impartial Responses
8: Data Ingestion Scalability
- Chain of table and Llama solution
- Mix-Self-Consistency Pack based on Rethinking Tabular Data Understanding with Large Language Models Llama solution
9: Structured Data QA
- Use LlamaIndex's `ChainOfTablePack`, based on Chain of Table
- Use LlamaIndex's `MixSelfConsistencyQueryEngine`, based on Rethinking Tabular Data Understanding with Large Language Models
10: Data Extraction from Complex PDFs
- Use pdf2htmlEX
- Use `EmbeddedTablesUnstructuredRetrieverPack` in LlamaIndex
11: Fallback Model(s): Use a model router such as Neutrino or OpenRouter:
from llama_index.llms import Neutrino
from llama_index.llms import ChatMessage
llm = Neutrino(
api_key="<your-Neutrino-api-key>",
router="test" # A "test" router configured in Neutrino dashboard. You treat a router as a LLM. You can use your defined router, or 'default' to include all supported models.
)
response = llm.complete("What is large language model?")
print(f"Optimal model: {response.raw['model']}")
from llama_index.llms import OpenRouter
from llama_index.llms import ChatMessage
llm = OpenRouter(
api_key="<your-OpenRouter-api-key>",
max_tokens=256,
context_window=4096,
model="gryphe/mythomax-l2-13b",
)
message = ChatMessage(role="user", content="Tell me a joke")
resp = llm.chat([message])
print(resp)
12: LLM Security
- Use things like Llama Guard