Retrieval-Augmented Generation (RAG)¶
Trained and fine-tuned LLMs can generate high-quality results, though those results are generally confined to the information they were trained on. Additionally, responses can suffer from:
- Confabulations and hallucinations that create false or inaccurate information
- Lack of attribution, making it difficult to ascertain validity
- Staleness due to new or updated information
Retrieval-Augmented Generation (RAG) helps to address these problems. It is a context-augmentation method that couples the model's generation to external memory.
Here is a basic comparison of the two:
Comparison with/without RAG
graph LR
style QueryEncoder fill:#D2E1FA,stroke:#333,stroke-width:1px
style QueryOptimizer1 fill:#E7B4E1,stroke:#333,stroke-width:1px
style Query fill:#FADAD2,stroke:#333,stroke-width:1px
style Prompt fill:#D2FAFA,stroke:#333,stroke-width:1px
style Docs fill:#FADAD2,stroke:#333,stroke-width:1px
style QueryOptimizer2 fill:#E7B4E1,stroke:#333,stroke-width:1px
style DocEncoder fill:#D2E1FA,stroke:#333,stroke-width:1px
style Retriever fill:#E1E7B4,stroke:#333,stroke-width:1px
style Context fill:#B4E1E7,stroke:#333,stroke-width:1px
style Generator fill:#FAD2E1,stroke:#333,stroke-width:1px
style Answer fill:#E1FAD2,stroke:#333,stroke-width:1px
QueryEncoder --> |Retrieve<br> from|Retriever
Prompt --> Generator[LLM<br> Generation]
Query --> Generator
Query --> QueryOptimizer1(Query<br> Optimizer)
QueryOptimizer1 --> QueryEncoder[Encoder]
Docs --> QueryOptimizer2(Docs<br> Optimizer)
QueryOptimizer2 --> DocEncoder[Encoder]
DocEncoder --> |Index<br> to| Retriever[Database]
Retriever --> Context
Context --> Generator
Generator --> Answer
graph LR
style Query fill:#E1FAD2,stroke:#333,stroke-width:1px
style Prompt fill:#D2FAFA,stroke:#333,stroke-width:1px
style Generator fill:#FAD2E1,stroke:#333,stroke-width:1px
style Answer fill:#E1FAD2,stroke:#333,stroke-width:1px
Query --> Generator[LLM Generation]
Prompt --> Generator
Generator --> Answer
Original formulations of RAG answer queries through embedding-based lookups, though other lookup mechanisms, including keyword search and lookups from other memory sources, are also possible.
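As a rough illustration of the lookup-then-generate idea, the sketch below retrieves the closest document and places it in a prompt. It assumes a toy bag-of-words vector in place of a real embedding model, and the names `bag_of_words`, `retrieve`, and `build_prompt` are hypothetical helpers, not part of any library.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(docs, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Combine the retrieved context with the user query."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"
    )


docs = [
    "RAG couples a language model to an external document store.",
    "Fine-tuning updates the weights of a model on new data.",
    "Bananas are rich in potassium.",
]
question = "How does RAG use an external store?"
top = retrieve(question, docs, k=1)
prompt = build_prompt(question, top)
```

The resulting `prompt` would then be sent to whichever LLM is in use; swapping `bag_of_words` for a learned encoder gives the embedding-based variant.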
RAG is still an area of active optimization, with a number of components that may be tuned. These include:
- Manner of document encoding and chunking
- Manner of query encoding, and when and what to retrieve
- How to combine the contexts with the prompts
One of the seminal papers on RAG, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, introduced a solution for end-to-end training of the models involved in document encoding, query encoding, and lookup, and demonstrated improved results over solutions where model components were frozen. For simplicity, however, a generally standard approach uses frozen models to embed and query documents.
It is important to evaluate your system to ensure that effort spent on RAG is used efficiently.
Why use RAG?¶
Large foundation models are trained on large corpora of public (and sometimes private) data. Models may lose effective semantic grounding because of the breadth of implicit knowledge they have codified in their next-token predictors. To improve the groundedness and appropriateness of the output, RAG fetches relevant information that can be combined with the prompt context so that the LLM can generate appropriate results. This can be particularly important when information may be changing and needs to be incorporated quickly.
Importantly, you can use RAG to help with data summarization and question answering, and to 'know how' information is generated in a somewhat more interpretable manner.
Use RAG because:
- You need knowledge beyond the LLM's training set
- You want to minimize hallucinations
- Your data can be highly dynamic
- The results need to be interpretable
- You don't have training data available
Why not use RAG?¶
The primary challenges with RAG tend to be organizational or functional.
Don't use RAG because:
- You have latency requirements that the added retrieval step may break.
- You don't want to pay for, or maintain and support, a RAG database.
- There are ethical or privacy concerns relating to sending data to a third-party API.
RAG vs Finetuning¶
Because finetuning can ingrain intrinsic knowledge in an LLM, it generally leads to improved performance.
RAG vs Finetuning reveals that fine-tuning boosts performance over RAG.
That said, using RAG to inform fine-tuning, as in Retrieval Augmented Fine Tuning (RAFT), or in variations that use a mixture of experts, can lead to even better performance.
Types of RAG include: Original RAG, Graph RAG, LongRAG, Self-RAG, Corrective RAG, EfficientRAG, Golden-Retriever, Adaptive RAG, Modular RAG, Speculative RAG, RankRAG, and Multi-Head RAG.
For more information, see https://www.turingpost.com/p/12-types-of-rag
Implementing RAG¶
The RAG process can be divided into two main stages: Preparation (offline) and Retrieval and Generation (online).
Document Indexing (offline)¶
Indexing involves loading data, splitting data, embedding data, adding metadata, and storing the data.
It is useful to perform parallel indexing that keeps track of records that are put into vector stores.
Tracking records in this way improves performance, saving time and money by not:
- Re-processing unchanged content
- Re-computing embeddings of unchanged content
- Inserting duplicated content
The LangChain blog and docs on indexing provide quality discussions on these topics.
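The record tracking described above can be sketched with content hashes. This is a minimal illustration, not LangChain's indexing API: `RecordManager` and `index_documents` are hypothetical names, and the `store` dictionary stands in for a real vector store.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class RecordManager:
    """Hypothetical helper: remembers the content hash last indexed per source."""

    def __init__(self) -> None:
        self.seen: dict[str, str] = {}  # source id -> content hash

    def needs_indexing(self, source_id: str, text: str) -> bool:
        digest = content_hash(text)
        if self.seen.get(source_id) == digest:
            return False  # unchanged: skip re-processing and re-embedding
        self.seen[source_id] = digest
        return True


def index_documents(
    docs: dict[str, str], manager: RecordManager, store: dict[str, str]
) -> list[str]:
    """Write only new or changed documents to `store`; return the ids written."""
    written = []
    for source_id, text in docs.items():
        if manager.needs_indexing(source_id, text):
            store[source_id] = text  # a real pipeline would embed and upsert here
            written.append(source_id)
    return written


manager, store = RecordManager(), {}
first = index_documents({"a.txt": "hello", "b.txt": "world"}, manager, store)
second = index_documents({"a.txt": "hello", "b.txt": "world!"}, manager, store)
```

On the second run only `b.txt` is rewritten, because its content changed while `a.txt` did not. This sketch keys on the source id, so identical content arriving under two different ids would still be stored twice.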
Indexing process (clickable)
graph LR
style DocumentSelection fill:#B4E1E7,stroke:#333,stroke-width:1px
style LoadDocuments fill:#FAD2E1,stroke:#333,stroke-width:1px
style SplitDocuments fill:#E1FAD2,stroke:#333,stroke-width:1px
style EmbedDocumentSplits fill:#D2FAFA,stroke:#333,stroke-width:1px
style StoringData fill:#FADAD2,stroke:#333,stroke-width:1px
DocumentSelection[Select Documents] --> LoadDocuments[Load <br>Documents]
LoadDocuments --> SplitDocuments[Split <br> Documents]
SplitDocuments --> EmbedDocumentSplits[Embed <br> Document <br> Splits]
EmbedDocumentSplits --> StoringData[Store in <br>Database]
click DocumentSelection "#selecting-data"
click LoadDocuments "#loading-data"
click SplitDocuments "#splitting-data"
click EmbedDocumentSplits "#embedding-data"
click StoringData "#storing-data"
The preparation stage involves the following steps in an offline manner:
- Data Selection: Choose the appropriate data to ingest.
- Loading Data: Load the data in a manner that can be consumed by the models.
- Splitting Data: Split the data into chunks that can be both consumed by the model and retrieved at a reasonable level of granularity.
- Embedding Data: Embed the data.
- Storing Data: Store the embedding.
Selecting Data¶
Users should only access data that is appropriate for their application. Including too much information might be unnecessary, or even harmful to retrieval, if the retrieval cannot handle the volume or complexity of the data. It is also crucial to protect data privacy and to avoid providing data that might not be appropriate (or legal) to access.
Loading Data¶
Different data types require different loaders. Raw text, PDFs, spreadsheets, and more proprietary formats need to be processed so that the most relevant information in the data is preserved. Text is easy to process, but some data, especially multimodal data like PDFs, may need to be formatted with a schema to allow for more effective searching.
Splitting Data¶
Once data has been loaded in a way that a model can process it, it must be split. There are several ways of splitting data:
- By the max size a model can handle.
- By some heuristic break, such as sentences (.), return characters (<br>), or paragraphs or newlines (\p).
- In a manner that maximizes the topic coherence. In this case, splitting and embedding may happen simultaneously.
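The first two strategies above can be combined in a small splitter. The following is a minimal sketch that assumes character counts as the size limit (a real system would more likely count tokens); `split_text` is a hypothetical helper.

```python
import re


def split_text(text: str, max_chars: int = 200) -> list[str]:
    """Split on blank lines, then pack pieces into chunks of roughly max_chars.

    Paragraphs longer than max_chars are further split at sentence ends.
    A single sentence longer than max_chars is kept whole, so a chunk can
    exceed the limit in that case.
    """
    pieces: list[str] = []
    for para in re.split(r"\n\s*\n", text.strip()):
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_chars:
            pieces.append(para)
        else:
            pieces.extend(s for s in re.split(r"(?<=[.!?])\s+", para) if s)

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current}\n{piece}" if current else piece
        if len(candidate) <= max_chars or not current:
            current = candidate  # still fits (or is the first piece of a chunk)
        else:
            chunks.append(current)  # close the chunk and start a new one
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph. It has two sentences.\n\nSecond paragraph.\n\nThird."
chunks = split_text(sample, max_chars=40)
```

Here the first paragraph fills one chunk on its own, and the two short paragraphs that follow are packed together into a second chunk.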
Late Chunking of Short Chunks in Long-Context Embedding Models
The authors show, in their blog and paper, that tokenizing and encoding the full text first, and only then pooling the token embeddings chunk by chunk, gives better embeddings for lookup.
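The pooling step can be illustrated as follows. This sketch assumes mean pooling and assumes the token embeddings were already produced by one pass of a long-context encoder over the whole document; the vectors below are made-up values, and `late_chunk` is a hypothetical helper.

```python
def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def late_chunk(
    token_embeddings: list[list[float]], spans: list[tuple[int, int]]
) -> list[list[float]]:
    """Pool token embeddings per chunk.

    token_embeddings: one vector per token, computed over the WHOLE document
        so that each token vector already reflects document-wide context.
    spans: (start, end) token indices of each chunk, end exclusive.
    """
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]


# Hypothetical token embeddings for a 4-token document (2 dimensions each).
tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]
chunk_vectors = late_chunk(tokens, spans=[(0, 2), (2, 4)])
```

The difference from ordinary chunking is only the order of operations: chunk boundaries are applied after encoding rather than before it.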
Contextual retrieval
Anthropic describes contextual retrieval, where entire documents are cached (for efficiency) and RAG retrieval is significantly improved. They use the following prompt to generate a short context for each chunk, which is paired with the chunk when performing embedding. Anthropic reports significant improvements, with up to a 67% reduction in failed retrievals.
<document>
{{ WHOLE_DOCUMENT }}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{{ CHUNK_CONTENT }}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
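One way to wire the prompt above into an indexing step is sketched below. The `fake_llm` function is a stand-in for a real model call and returns a fixed string; `build_context_prompt` and `contextualize` are hypothetical helper names.

```python
CONTEXT_PROMPT = """<document>
{whole_document}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{chunk_content}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""


def build_context_prompt(whole_document: str, chunk_content: str) -> str:
    """Fill the template for one chunk of one document."""
    return CONTEXT_PROMPT.format(
        whole_document=whole_document, chunk_content=chunk_content
    )


def contextualize(chunk: str, context: str) -> str:
    """Prepend the generated context so both are embedded together."""
    return f"{context}\n\n{chunk}"


def fake_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (assumption for this sketch)."""
    return "From the 2023 annual report, revenue section."


document = "Annual report 2023. Revenue grew 3%. Costs fell."
chunk = "Revenue grew 3%."
prompt = build_context_prompt(document, chunk)
text_to_embed = contextualize(chunk, fake_llm(prompt))
```

The string in `text_to_embed`, rather than the bare chunk, is what would be passed to the embedding model.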
Embedding Data¶
Index building: one of the most useful tricks is multi-representation indexing, which decouples what you index for retrieval (e.g., a table or image summary) from what you pass to the LLM for answer synthesis (e.g., the raw image or table).
Adding metadata¶
Information such as dates, chapters, or keywords can allow for filtering and keyword lookup.
Storing Data¶
The embedded data is stored for future retrieval and use. This is done via standard database methods, with the embeddings used as vector retrieval addresses and the metadata used for more traditional (keyword) search methods.
Retrieval and Generation (online)¶
The retrieval and generation stage involves the following steps:
- Retrieving Data: Retrieve the data based on input in such a way that relevant documents and chunks can be used in downstream chains.
- Generating Output: Generate an output using a prompt that integrates the query and retrieved data.
Whether and how to retrieve documents depends on the additional context the agents need to be aware of.
It might not always be necessary to retrieve documents. When it is necessary, it is important to know where to retrieve from (routing), and then to match the query to the appropriately stored information. Both of these may involve rewriting the prompt to be more effective for the way the data is retrieved.
Retrieval and generation (clickable)
graph LR
style C fill:#B4E1E7,stroke:#333,stroke-width:1px
style T fill:#FAD2E1,stroke:#333,stroke-width:1px
style RR fill:#E1FAD2,stroke:#333,stroke-width:1px
style R fill:#FADAD2,stroke:#333,stroke-width:1px
style F fill:#E7B4E1,stroke:#333,stroke-width:1px
style G fill:#D2E1FA,stroke:#333,stroke-width:1px
style H fill:#E1E7B4,stroke:#333,stroke-width:1px
C[Query] --> T[Optimize]
T --> RR[Route]
RR --> R[Match and <br>Rank Documents]
R --> F[Combine With<br> Context]
F --> G[LLM <br>Generation]
G --> H[Answer]
click T "#query-optimization"
click RR "#routing"
click R "#match-and-rank"
click F "#CombineWithContext"
click G "#LLMGeneration"
click H "#Answer"
Query Optimization¶
In production settings, the queries that users ask are unlikely to be optimal for retrieval. This can be due to a combination of challenges, such as questions that are:
- Irrelevant
- Vague
- Not related to retrieval
- Made of multiple questions
Query optimization looks to improve these queries in several ways.
Rewrite-Retrieve-Read¶
This approach involves rewriting the query for better retrieval and reading of the relevant documents.
Step Back Prompting¶
This method generates an intermediate context that helps to 'abstract' the information. Once generated, the additional context can be used.
Step back
You are an expert of world knowledge. I am going to ask you a question. Your response should be comprehensive and not contradicted with the following context if they are relevant. Otherwise, ignore them if they are not relevant.
{normal_context}
{step_back_context}
Original Question: {question}
Answer:
Query Rephrasing¶
Particularly in chat settings, it's important to include all of the appropriate context to create an effective search query.
Query Decomposition¶
When questions are directly made of multiple questions, or the effective answer to these questions involves answering several sub-questions, breaking the questions into multiple queries may be essential. This may involve performing sequential queries that are created based on retrieved information, or queries that can be run irrespective of other results. See LangChain's query decomposition.
Query Expansion¶
Query expansion generates multiple rephrased versions of the query to increase the likelihood of a hit, or uses advanced retrieval methods to triangulate higher quality hits.
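One common way to merge the results of several rephrased queries is reciprocal rank fusion (RRF). RRF is not named in the text above; it is used here as one plausible fusion method, with the conventional constant `k = 60` and made-up document ids.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents found by several query variants rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by id for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for three rephrasings of the same question.
results_per_variant = [
    ["d1", "d2", "d3"],
    ["d2", "d1"],
    ["d2", "d4"],
]
fused = reciprocal_rank_fusion(results_per_variant)
```

Document `d2` is returned by all three variants and therefore ranks first, even though it was not the top hit for the first variant.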
Query Clarifying¶
Particularly in chat settings when questions are vague, asking follow-up questions can be instrumental in ensuring the lookup can be as effective as possible.
Query structuring¶
When answers to queries can be 'filtered' using metadata, structuring the query based on its elements can be highly valuable. This can include attributes such as date, location, and subject. See LangChain's query construction for additional information.
Routing¶
Depending on the question asked, queries may need to be routed to different sources of data, or indexes. OpenAI's RAG strategies provide some guidance on question routing.
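A very simple router can be sketched with keyword overlap. This is only an illustration under the assumption that each index can be described by a handful of keywords; in practice an LLM or a trained classifier may make this decision. The index names and keywords are hypothetical.

```python
import re


def route(query: str, routes: dict[str, set[str]], default: str) -> str:
    """Pick the index whose keywords overlap most with the query words."""
    words = set(re.findall(r"[a-z0-9]+", query.lower()))
    best, best_hits = default, 0
    for name, keywords in routes.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best, best_hits = name, hits
    return best  # falls back to `default` when nothing matches


routes = {
    "hr_index": {"vacation", "salary", "benefits"},
    "code_index": {"api", "function", "bug"},
}
hr_choice = route("How many vacation days do I get?", routes, default="general_index")
fallback = route("Tell me a joke", routes, default="general_index")
```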
Matching and Ranking¶
Matching involves aligning the query with the appropriately stored information.
Multi-Hop RAG¶
To answer some queries effectively, evidence may need to be retrieved from multiple documents. This is known as multi-hop RAG.
MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries provides a dataset for evaluating multi-hop RAG.
"MultiHop-RAG: a QA dataset to evaluate retrieval and reasoning across documents with metadata in the RAG pipelines. It contains 2556 queries, with evidence for each query distributed across 2 to 4 documents. The queries also involve document metadata, reflecting complex scenarios commonly found in real-world RAG applications."
Small to big lookup¶
TODO xxx
Reranking¶
TODO xxx Reranking
Generating responses¶
The final step is generating an output using a prompt that integrates the query and retrieved data.
Challenges in generating responses can involve:
- Not having enough information: RAG can help minimize response generation of non-factual information, but only if retrieved information provides sufficient context to answer the question properly. If the question cannot be answered with a reasonable degree of certainty, then the response should be along the lines of "I don't know."
- Conflicting information: When retrieved results contain different responses to the same question, a definitive response may not be possible.
- Stale information: When information is no longer relevant.
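The "I don't know" behavior for insufficient context can be sketched as a guard in front of the generator. The score threshold and the `llm` callable are assumptions made for this sketch, and `stub_llm` stands in for a real model.

```python
from typing import Callable, Optional


def build_answer_prompt(
    question: str, hits: list[tuple[str, float]], min_score: float = 0.5
) -> Optional[str]:
    """Build a prompt from (text, score) hits, or None if none is confident enough."""
    usable = [text for text, score in hits if score >= min_score]
    if not usable:
        return None
    context = "\n".join(f"- {text}" for text in usable)
    return (
        "Answer from the context only. If the context is insufficient, "
        'say "I don\'t know."\n\n'
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def answer(
    question: str,
    hits: list[tuple[str, float]],
    llm: Callable[[str], str],
    min_score: float = 0.5,
) -> str:
    """Abstain without calling the model when retrieval found nothing usable."""
    prompt = build_answer_prompt(question, hits, min_score)
    if prompt is None:
        return "I don't know."
    return llm(prompt)


def stub_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (assumption for this sketch)."""
    return "stub answer"


weak = answer("Who wrote it?", [("unrelated text", 0.2)], stub_llm)
strong = answer("Who wrote it?", [("It was written by Ada.", 0.9)], stub_llm)
```

The prompt also repeats the abstention instruction, since retrieved text can score well and still fail to contain the answer.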
Advanced methods¶
Multimodal RAG¶
Natural-language lookup with RAG can be improved by allowing other modalities, such as tables and images, at the same time. There are several ways that this may be accomplished, as described in LangChain's multi-modal RAG:
Option 1:
Use multimodal embeddings (such as CLIP) to embed images and text
Retrieve both using similarity search
Pass raw images and text chunks to a multimodal LLM for answer synthesis
Option 2:
Use a multimodal LLM (such as GPT4-V, LLaVA, or FUYU-8b) to produce text summaries from images
Embed and retrieve text
Pass text chunks to an LLM for answer synthesis
Option 3:
Use a multimodal LLM (such as GPT4-V, LLaVA, or FUYU-8b) to produce text summaries from images
Embed and retrieve image summaries with a reference to the raw image
Pass raw images and text chunks to a multimodal LLM for answer synthesis
- Multi-Modal: This approach is used for RAG on a substack that has many images of densely packed tables and graphs. Here is an example implementation, and here is one that works with private data.
- Semi-Structured: This approach is used for RAG on documents with tables, which can be split using naive RAG text-splitting that does not explicitly preserve them. Here is an example implementation.
Evaluating and Comparing¶
Because there are many ways to perform RAG, it is important to evaluate the quality of the implemented solution.
RAG Arena provides interfaces with LangChain to provide a RAG chatbot experience where queries receive multiple responses.
Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely
Development: The authors present a survey that introduces a RAG task categorization method that helps to classify user queries into four levels according to the type of external data required and the focus of the task. It summarizes key challenges in building robust data-augmented LLM applications and the most effective techniques for addressing them.
In general, it breaks down the complexity of queries into several levels:
- L1: Explicit Fact Queries: Answer specific questions directly from documents or snippets within the collection.
- L2: Implicit Fact Queries: Answer questions involving data dependencies or some level of logical or common-sense reasoning.
- L3: Interpretable Rationale Queries: Queries that require external data to supply the rationale for comparison.
- L4: Hidden Rationale Queries: Queries with domain-specific reasoning that may not be explicitly described and is difficult to enumerate.
Resources, Tutorials and Blogs¶
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks introduces a complete solution for enabling improved response generation with LLMs.
The authors reveal that allowing for fine tuning of the models when equipped with RAG improved the results.
12 RAG Pain Points and Proposed Solutions
Things that might lead to failure of a RAG pipeline. Mostly taken from the blog.
Pain points and solutions:
1: Missing Content:
- Clean your data
- Better prompting
2: Missed the Top Ranked Documents
- Hyperparameter tuning for chunk_size and similarity_top_k, as in Hyperparameter Optimization for RAG.
- Reranking: the notebook Improving Retrieval Performance by Fine-tuning Cohere Reranker with LlamaIndex uses CohereRerank to rerank the results
import os
from llama_index.postprocessor.cohere_rerank import CohereRerank

api_key = os.environ["COHERE_API_KEY"]
cohere_rerank = CohereRerank(api_key=api_key, top_n=2)  # return top 2 nodes from reranker

# `index` is assumed to be an existing LlamaIndex index built elsewhere
query_engine = index.as_query_engine(
    similarity_top_k=10,  # we can set a high top_k here to ensure maximum relevant retrieval
    node_postprocessors=[cohere_rerank],  # pass the reranker to node_postprocessors
)

response = query_engine.query(
    "What did Sam Altman do in this essay?",
)
3: Not in Context — Consolidation Strategy Limitations
- Tweak retrieval strategies
- Finetune embeddings
4: Not Extracted
- Clean your Data
- Prompt Compression
- Long Context Reorder (put crucial content at beginning and end)
5: Wrong Format
- Output Parsing
- Pydantic
6: Incorrect Specificity
7: Incomplete and Impartial Responses
8: Data Ingestion Scalability
- Chain of table and Llama solution
- Mix-Self-Consistency Pack based on Rethinking Tabular Data Understanding with Large Language Models Llama solution
9: Structured Data QA
- Use LlamaIndex ChainOfTablePack, based on Chain of Table
- Use LlamaIndex MixSelfConsistencyQueryEngine, based on Rethinking Tabular Data Understanding with Large Language Models
10: Data Extraction from Complex PDFs
- Use pdf2htmlEX
- Use EmbeddedTablesUnstructuredRetrieverPack in LlamaIndex
11: Fallback Model(s): Use a model router like
- Neutrino
from llama_index.llms import Neutrino
from llama_index.llms import ChatMessage

llm = Neutrino(
    api_key="<your-Neutrino-api-key>",
    # A "test" router configured in the Neutrino dashboard. You treat a router as an LLM.
    # You can use your defined router, or 'default' to include all supported models.
    router="test",
)
response = llm.complete("What is large language model?")
print(f"Optimal model: {response.raw['model']}")
- OpenRouter
from llama_index.llms import OpenRouter
from llama_index.llms import ChatMessage

llm = OpenRouter(
    api_key="<your-OpenRouter-api-key>",
    max_tokens=256,
    context_window=4096,
    model="gryphe/mythomax-l2-13b",
)
message = ChatMessage(role="user", content="Tell me a joke")
resp = llm.chat([message])
print(resp)
12: LLM Security
- Use things like Llama Guard
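The long-context reorder listed under pain point 4 (put crucial content at the beginning and end) can be sketched in a few lines. This is an illustrative stand-alone function and not the implementation used by any particular library; it assumes the input is already sorted from most to least relevant.

```python
def reorder_for_long_context(docs_by_relevance: list[str]) -> list[str]:
    """Place the most relevant items at the two ends, the least relevant in the middle.

    docs_by_relevance must be sorted from most to least relevant.
    """
    front: list[str] = []
    back: list[str] = []
    for i, doc in enumerate(docs_by_relevance):
        # Alternate: even positions fill the front, odd positions fill the back.
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]


ranked = ["d1", "d2", "d3", "d4", "d5"]  # d1 is the most relevant
reordered = reorder_for_long_context(ranked)
```

With five documents, the two most relevant (`d1`, `d2`) end up first and last, while the least relevant (`d5`) sits in the middle of the context.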