Retrieval-Augmented Generation (RAG)¶
Trained and fine-tuned LLMs can generate high-quality results, though those results are generally confined to the information the models were trained on. Additionally, responses can suffer from:
- Confabulations and Hallucinations that create false or inaccurate information
- Lack of attribution, making it difficult to ascertain validity
- Staleness due to new or updated information
Retrieval-Augmented Generation (RAG) helps to address these issues: it is a context-augmentation method that couples the model to external memory.
Here is a basic comparison of the two:
Comparison with/without RAG
graph LR
style QueryEncoder fill:#D2E1FA,stroke:#333,stroke-width:1px
style QueryOptimizer1 fill:#E7B4E1,stroke:#333,stroke-width:1px
style Query fill:#FADAD2,stroke:#333,stroke-width:1px
style Prompt fill:#D2FAFA,stroke:#333,stroke-width:1px
style Docs fill:#FADAD2,stroke:#333,stroke-width:1px
style QueryOptimizer2 fill:#E7B4E1,stroke:#333,stroke-width:1px
style DocEncoder fill:#D2E1FA,stroke:#333,stroke-width:1px
style Retriever fill:#E1E7B4,stroke:#333,stroke-width:1px
style Context fill:#B4E1E7,stroke:#333,stroke-width:1px
style Generator fill:#FAD2E1,stroke:#333,stroke-width:1px
style Answer fill:#E1FAD2,stroke:#333,stroke-width:1px
QueryEncoder --> |Retrieve\n from|Retriever
Prompt --> Generator[LLM\n Generation]
Query --> Generator
Query --> QueryOptimizer1(Query\n Optimizer)
QueryOptimizer1 --> QueryEncoder[Encoder]
Docs --> QueryOptimizer2(Docs\n Optimizer)
QueryOptimizer2 --> DocEncoder[Encoder]
DocEncoder --> |Index\n to| Retriever[Database]
Retriever --> Context
Context --> Generator
Generator --> Answer
graph LR
style Query fill:#E1FAD2,stroke:#333,stroke-width:1px
style Prompt fill:#D2FAFA,stroke:#333,stroke-width:1px
style Generator fill:#FAD2E1,stroke:#333,stroke-width:1px
style Answer fill:#E1FAD2,stroke:#333,stroke-width:1px
Query --> Generator[LLM Generation]
Prompt --> Generator
Generator --> Answer
The original incarnations of RAG connect queries to embedding-based lookups, though other mechanisms, including keyword searches and other lookups from memory sources, are also possible.
RAG is still an active area of research, with a number of components that can be optimized, including:
- Manner of document encoding and chunking
- Manner of query encoding, and deciding when and what to retrieve
- How to combine the contexts with the prompts
One of the seminal papers on RAG, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, introduced end-to-end training of the document encoder, query encoder, and generator, and demonstrated improved results over solutions where model components were frozen. For simplicity, however, the now-standard approach uses frozen models to embed documents and queries.
It is important to evaluate your system to ensure your RAG efforts are well spent.
Why use RAG?¶
Large foundation models are trained on large corpora of public (and sometimes private) data. Models may lose effective semantic grounding because of the breadth of implicit knowledge codified in their next-token predictors. To improve the groundedness and appropriateness of the output, RAG fetches relevant information that is combined with the prompt context so the LLM can generate appropriate results. This is particularly important when information changes and needs to be incorporated quickly.
Importantly, you can use RAG for data summarization, question answering, and the ability to 'know how' information was generated in a somewhat more interpretable manner.
Use RAG because:
- You need knowledge beyond the LLM's training set
- You want to minimize hallucinations
- Your data can be highly dynamic
- The results need to be interpretable
- You don't have training data available
Why not use RAG?¶
The primary challenges with RAG tend to be organizational or functional.
Don't use RAG because:
- You have latency requirements that the added retrieval step may violate.
- You don't want to pay for, or maintain and support a RAG database.
- There are ethical or privacy concerns relating to sending data to a third-party API
RAG vs Finetuning¶
Because fine-tuning can ingrain intrinsic knowledge in an LLM, it generally leads to improved performance.
RAG vs. Finetuning comparisons suggest that fine-tuning boosts performance over RAG
That said, using RAG to inform fine-tuning, as in Retrieval-Augmented Fine-Tuning (RAFT), possibly with variations such as mixture-of-experts, can lead to even better performance.
The RAG process can be divided into two main stages: Preparation (offline) and Retrieval and Generation (online).
Document Indexing (offline)¶
Indexing involves loading data, splitting data, embedding data, adding metadata, and storing the data.
It is useful to maintain, in parallel with indexing, a record of what has been put into the vector store.
This record-keeping improves performance, saving time and money by not:
- Re-processing unchanged content
- Re-computing embeddings of unchanged content
- Inserting duplicated content
The LangChain blog and docs on indexing provide quality discussions of these topics.
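A rough sketch of that record-keeping, assuming LangChain's indexing API with `SQLRecordManager` and a Chroma vector store (collection names, paths, and the example document are illustrative):

```python
from langchain.indexes import SQLRecordManager, index
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# The vector store that chunks are written into.
vectorstore = Chroma(collection_name="my_docs", embedding_function=OpenAIEmbeddings())

# The record manager tracks what has already been indexed, keyed by source.
record_manager = SQLRecordManager("chroma/my_docs", db_url="sqlite:///record_manager.sql")
record_manager.create_schema()

docs = [Document(page_content="RAG couples an LLM with external memory.",
                 metadata={"source": "notes/rag.md"})]

# "incremental" cleanup skips unchanged documents and removes stale chunks per source.
result = index(docs, record_manager, vectorstore, cleanup="incremental", source_id_key="source")
print(result)  # e.g. {'num_added': 1, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
```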
Indexing process (clickable)
graph LR
style DocumentSelection fill:#B4E1E7,stroke:#333,stroke-width:1px
style LoadDocuments fill:#FAD2E1,stroke:#333,stroke-width:1px
style SplitDocuments fill:#E1FAD2,stroke:#333,stroke-width:1px
style EmbedDocumentSplits fill:#D2FAFA,stroke:#333,stroke-width:1px
style StoringData fill:#FADAD2,stroke:#333,stroke-width:1px
DocumentSelection[Select Documents] --> LoadDocuments[Load \nDocuments]
LoadDocuments --> SplitDocuments[Split \n Documents]
SplitDocuments --> EmbedDocumentSplits[Embed \n Document \n Splits]
EmbedDocumentSplits --> StoringData[Store in \nDatabase]
click DocumentSelection "#selecting-data"
click LoadDocuments "#loading-data"
click SplitDocuments "#splitting-data"
click EmbedDocumentSplits "#embedding-data"
click StoringData "#storing-data"
The preparation stage involves the following steps, performed offline (a combined sketch follows the list):
- Data Selection: Choose the appropriate data to ingest.
- Loading Data: Load the data in a manner that can be consumed by the models.
- Splitting Data: Split the data into chunks that can be both consumed by the model and retrieved with a reasonable degree of relevance.
- Embedding Data: Embed the data.
- Storing Data: Store the embeddings.
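Put together, a minimal offline preparation pipeline might look like the following sketch, assuming LangChain-style components (the file path, chunk sizes, and embedding model are illustrative):

```python
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# 1. Select and load the data (a single markdown file here, purely illustrative).
docs = TextLoader("knowledge_base/handbook.md").load()

# 2. Split into chunks the embedding model and retriever can handle well.
splits = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150).split_documents(docs)

# 3. Embed the chunks and 4. store them (FAISS as a simple local vector store).
vectorstore = FAISS.from_documents(splits, OpenAIEmbeddings())
vectorstore.save_local("rag_index")
```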
Selecting Data¶
Users should only access data that is appropriate for their application. Including too much information can be unnecessary, or even harmful to retrieval if the retriever cannot handle the volume or complexity of the data. It is also crucial to ensure data privacy and to avoid providing data that is not appropriate (or legal) to access.
Loading Data¶
Different data types require different loaders. Raw text, PDFs, spreadsheets, and more proprietary formats need to be processed so that the most relevant information is preserved. Text is easy to process, but some data, especially multimodal data like PDFs, may need to be formatted with a schema to allow for more effective searching.
Splitting Data¶
Once data has been loaded in a way that a model can process it, it must be split. There are several ways of splitting data:
- By the max size a model can handle.
- By some heuristic break, such as sentences (`.`), return characters (`\n`), or paragraphs (see the sketch after this list).
- In a manner that maximizes topic coherence. In this case, splitting and embedding may happen simultaneously.
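A heuristic splitter can be told to prefer paragraph, newline, and sentence boundaries before falling back to a hard size limit; a minimal sketch using LangChain's `RecursiveCharacterTextSplitter` (sizes, separators, and the file path are illustrative):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,       # hard upper bound on chunk length (characters)
    chunk_overlap=100,    # overlap to preserve context across chunk boundaries
    separators=["\n\n", "\n", ". ", " "],  # paragraphs, then lines, then sentences, then words
)
chunks = splitter.split_text(open("handbook.md").read())
```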
Late Chunking of Short Chunks in Long-Context Embedding Models
The authors show in their blog and paper that embedding the full text at the token level first, and then intelligently pooling those token embeddings into chunks, yields better embeddings for lookup.
Contextual retrieval
Anthropic describes contextual retrieval, where entire documents are cached (for efficiency) and each chunk is prefixed with generated context, significantly improving RAG retrieval. They use the following prompt to generate the contextual text that is paired with each chunk when performing embedding. The reported results show significant performance improvements (up to a 67% reduction in retrieval failures).
<document>
{{ WHOLE_DOCUMENT }}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{{ CHUNK_CONTENT }}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
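A rough sketch of how such contextual chunks might be produced before embedding (the `llm` argument is a placeholder for whatever client you use, the prompt placeholders are renamed for Python's `str.format`, and provider-side prompt caching of the whole document is omitted):

```python
CONTEXT_PROMPT = """<document>
{whole_document}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{chunk}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the \
purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""

def contextualize(whole_document: str, chunks: list[str], llm) -> list[str]:
    """Prepend an LLM-generated situating context to each chunk before embedding."""
    contextualized = []
    for chunk in chunks:
        # llm is assumed to be a callable that takes a prompt string and returns text.
        context = llm(CONTEXT_PROMPT.format(whole_document=whole_document, chunk=chunk))
        contextualized.append(f"{context}\n\n{chunk}")
    return contextualized
```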
Embedding Data¶
Index Building - One of the most useful tricks is multi-representation indexing: decouple what you index for retrieval (e.g., table or image summary) from what you pass to the LLM for answer synthesis (e.g., the raw image, a table). Read more
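A minimal sketch of that decoupling without committing to a particular retriever class: index the summaries, keep an id-to-raw-item map, and hand the raw item to the LLM at answer time (all names, paths, and summaries are illustrative):

```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

# Raw items (e.g., a table file or an image path) and their LLM-written summaries.
raw_items = {"tbl-1": "data/q3_financials.csv", "img-1": "data/architecture.png"}
summaries = {"tbl-1": "Quarterly revenue and margin table for 2023.",
             "img-1": "Diagram of the ingestion and retrieval architecture."}

# Index only the summaries, tagging each with the id of the raw item it represents.
summary_docs = [Document(page_content=text, metadata={"doc_id": key})
                for key, text in summaries.items()]
vectorstore = FAISS.from_documents(summary_docs, OpenAIEmbeddings())

# At query time, retrieve against the summaries but pass the raw item to the LLM.
hits = vectorstore.similarity_search("How did revenue change in Q3?", k=1)
raw_for_llm = [raw_items[hit.metadata["doc_id"]] for hit in hits]
```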
Adding metadata¶
Information such as dates, chapters, or key words can allow for filtering and key-word lookup.
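Most vector stores expose such metadata as a query-time filter; a small sketch using Chroma via LangChain (the field name and collection are illustrative, and filter syntax varies by store):

```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma(collection_name="handbook", embedding_function=OpenAIEmbeddings())

# Restrict the semantic search to chunks whose metadata matches the filter.
results = vectorstore.similarity_search(
    "vacation carry-over policy",
    k=4,
    filter={"chapter": "benefits"},
)
```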
Storing Data¶
The embedded data is stored for future retrieval and use. This is done via standard database methods, with embeddings serving as vector-retrieval keys and metadata supporting more traditional (keyword) search methods.
Retrieval and Generation (online)¶
The retrieval and generation stage involves the following steps:
- Retrieving Data: Retrieve the data based on input in such a way that relevant documents and chunks can be used in downstream chains.
- Generating Output: Generate an output using a prompt that integrates the query and retrieved data.
Whether and what to retrieve will depend on the additional context the agent needs to be aware of.
It might not always be necessary to retrieve documents. When it is, it is important to know where to retrieve from (routing) and then how to match the query to the appropriately stored information. Both may involve rewriting the query so that retrieval is more effective.
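Before walking through the individual steps (diagrammed below), here is a minimal sketch of the online stage as a whole; it assumes the FAISS index built in the earlier offline sketch, and the model name is illustrative:

```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.load_local("rag_index", embeddings, allow_dangerous_deserialization=True)
llm = ChatOpenAI(model="gpt-4o-mini")

def answer(query: str) -> str:
    # Retrieve the top chunks for the (optionally optimized) query.
    docs = vectorstore.similarity_search(query, k=4)
    context = "\n\n".join(d.page_content for d in docs)
    # Combine the retrieved context with the query and ask for a grounded answer.
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm.invoke(prompt).content
```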
Retrieval and generation (clickable)
graph LR
style C fill:#B4E1E7,stroke:#333,stroke-width:1px
style T fill:#FAD2E1,stroke:#333,stroke-width:1px
style RR fill:#E1FAD2,stroke:#333,stroke-width:1px
style R fill:#FADAD2,stroke:#333,stroke-width:1px
style F fill:#E7B4E1,stroke:#333,stroke-width:1px
style G fill:#D2E1FA,stroke:#333,stroke-width:1px
style H fill:#E1E7B4,stroke:#333,stroke-width:1px
C[Query] --> T[Optimize]
T --> RR[Route]
RR --> R[Match and \nRank Documents]
R --> F[Combine With\n Context]
F --> G[LLM \nGeneration]
G --> H[Answer]
click T "#query-optimization"
click RR "#routing"
click R "#match-and-rank"
click F "#CombineWithContext"
click G "#LLMGeneration"
click H "#Answer"
Query Optimization¶
In production settings, the queries that users ask are unlikely to be optimal for retrieval. This can be due to a combination of challenges, such as questions that are:
- Irrelevant
- Vague
- Not related to retrieval
- Made of multiple questions
Query optimization looks to improve these queries in several ways. Here are several approaches, with more detailed descriptions in LangChain's query analysis docs.
Rewrite-Retrieve-Read¶
This approach involves rewriting the query for better retrieval and reading of the relevant documents.
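A minimal sketch, reusing the `llm` and `vectorstore` objects from the earlier examples (the rewrite prompt wording is illustrative):

```python
REWRITE_PROMPT = (
    "Rewrite the following user question as a concise, self-contained search query "
    "for a document database. Return only the query.\n\nQuestion: {question}"
)

def rewrite_retrieve_read(question: str, llm, vectorstore) -> str:
    # Rewrite: turn the raw question into a retrieval-friendly query.
    search_query = llm.invoke(REWRITE_PROMPT.format(question=question)).content.strip()
    # Retrieve: fetch documents for the rewritten query.
    docs = vectorstore.similarity_search(search_query, k=4)
    context = "\n\n".join(d.page_content for d in docs)
    # Read: answer the original question from the retrieved context.
    return llm.invoke(f"Context:\n{context}\n\nAnswer the question: {question}").content
```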
Step Back Prompting¶
This method generates an intermediate context that 'abstracts' the question. Once generated, the additional context can be used alongside the normal context, as in the prompt below.
Step back
You are an expert of world knowledge. I am going to ask you a question. Your response should be comprehensive and not contradicted with the following context if they are relevant. Otherwise, ignore them if they are not relevant.
{normal_context}
{step_back_context}
Original Question: {question}
Answer:
Query Rephrasing¶
Particularly in chat settings, it's important to include all of the appropriate context to create an effective search query.
Query Decomposition¶
When a question is composed of multiple questions, or effectively answering it requires answering several sub-questions, breaking it into multiple queries may be essential. This may involve sequential queries built from previously retrieved information, or independent queries that can be run irrespective of other results, as in the sketch below. See LangChain's query decomposition docs.
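A sketch of the independent (non-sequential) case, where sub-questions are generated once and evidence is retrieved for each separately (the prompt wording and helper names are illustrative):

```python
DECOMPOSE_PROMPT = (
    "Break the question below into the minimal set of independent sub-questions "
    "needed to answer it, one per line.\n\nQuestion: {question}"
)

def decompose_and_retrieve(question: str, llm, vectorstore) -> dict[str, list]:
    # Ask the LLM for one sub-question per line, then strip empty lines.
    sub_questions = [
        line.strip() for line in
        llm.invoke(DECOMPOSE_PROMPT.format(question=question)).content.splitlines()
        if line.strip()
    ]
    # Retrieve evidence for each sub-question independently.
    return {sq: vectorstore.similarity_search(sq, k=3) for sq in sub_questions}
```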
Query Expansion¶
Query expansion generates multiple rephrased versions of the query to increase the likelihood of a hit, or uses the rephrasings together to triangulate higher-quality hits.
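A sketch of expansion combined with reciprocal rank fusion (RRF) to merge the per-variant result lists (the constant 60 is a conventional RRF default; the prompt and helper names are illustrative):

```python
from collections import defaultdict

EXPAND_PROMPT = "Generate 3 differently-worded versions of this search query, one per line:\n{query}"

def expanded_search(query: str, llm, vectorstore, k: int = 4) -> list:
    # Generate rephrasings and keep the original query as well.
    variants = [query] + [
        v.strip() for v in llm.invoke(EXPAND_PROMPT.format(query=query)).content.splitlines()
        if v.strip()
    ]
    # Reciprocal rank fusion: documents ranked highly by several variants float to the top.
    scores: dict[str, float] = defaultdict(float)
    docs_by_key = {}
    for variant in variants:
        for rank, doc in enumerate(vectorstore.similarity_search(variant, k=k)):
            key = doc.page_content
            docs_by_key[key] = doc
            scores[key] += 1.0 / (60 + rank)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [docs_by_key[key] for key in ranked[:k]]
```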
Query Clarifying¶
Particularly in chat settings when questions are vague, asking follow-up questions can be instrumental in ensuring the lookup can be as effective as possible.
Query structuring¶
When answers can be 'filtered' using metadata, structuring the query around those elements can be highly valuable. Filterable attributes include date, location, and subject. See LangChain's query construction docs for additional information.
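A sketch that uses structured output to turn a natural-language question into a search string plus metadata filters (the schema fields are illustrative, and `with_structured_output` assumes a LangChain-style chat model):

```python
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class SearchFilters(BaseModel):
    """Structured form of a user question for filtered retrieval."""
    query: str = Field(description="Semantic search string")
    subject: str | None = Field(default=None, description="Subject or topic filter")
    after_date: str | None = Field(default=None, description="Only include content after this ISO date")

structurer = ChatOpenAI(model="gpt-4o-mini").with_structured_output(SearchFilters)
filters = structurer.invoke("What did the 2024 onboarding guide say about laptop setup?")
# filters.query, filters.subject, and filters.after_date can now drive a filtered vector search.
```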
Routing¶
Depending on the question asked, queries may need to be routed to different data sources or indexes. OpenAI's RAG strategies provide some guidance on question routing.
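A sketch of a simple LLM-based router that picks one of several indexes before retrieving (the index names and routing prompt are illustrative):

```python
ROUTE_PROMPT = (
    "Which data source best answers this question? "
    "Reply with exactly one of: product_docs, support_tickets, engineering_wiki.\n\n"
    "Question: {question}"
)

def route_and_retrieve(question: str, llm, indexes: dict) -> list:
    # Let the LLM pick a source, then fall back to a default if the reply is unexpected.
    choice = llm.invoke(ROUTE_PROMPT.format(question=question)).content.strip().lower()
    vectorstore = indexes.get(choice, indexes["product_docs"])
    return vectorstore.similarity_search(question, k=4)
```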
Matching and Ranking¶
Matching involves aligning the query with the appropriately stored information.
Multi-Hop RAG¶
In order to effectively answer some queries, evidence may need to be retrieved from multiple documents. This is known as multi-hop RAG.
MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries provides a dataset for evaluating multi-hop RAG:
"MultiHop-RAG: a QA dataset to evaluate retrieval and reasoning across documents with metadata in the RAG pipelines. It contains 2556 queries, with evidence for each query distributed across 2 to 4 documents. The queries also involve document metadata, reflecting complex scenarios commonly found in real-world RAG applications."
Iterating and Corrective RAG¶
SELF-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
The authors show in their blog and paper an iterative, self-reflective RAG approach that yields SOTA results on QA and fact-verification tasks.
In their own words:
> The issue: Factual inaccuracies of versatile LLMs
Despite their remarkable capabilities, large language models (LLMs) often produce responses containing factual inaccuracies due to their sole reliance on the parametric knowledge they encapsulate. They often generate hallucinations, especially in long-tail, their knowledge gets obsolete, and lacks attribution.
Is Retrieval-Augmented Generation a silver bullet?
Retrieval-Augmented Generation (RAG), an ad hoc approach that augments LMs with retrieval of relevant knowledge, decreases such issues and shows effectiveness in knowledge-intensive tasks such as QA. However, indiscriminately retrieving and incorporating a fixed number of retrieved passages, regardless of whether retrieval is necessary, or passages are relevant, diminishes LM versatility or can lead to unhelpful response generation. Moreover, there's no guarantee that generations are entailed by cited evidence.
What is Self-RAG?
Self-Reflective Retrieval-Augmented Generation (Self-RAG) is a new framework to enhance an LM's quality and factuality through retrieval and self-reflection. Our framework trains a single arbitrary LM that adaptively retrieves passages on-demand (e.g., can retrieve multiple times during generation, or completely skip retrieval), and generates and reflects on retrieved passages and its own generations using special tokens, called _reflection tokens_. Generating reflection tokens makes the LM controllable during the inference phase, enabling it to tailor its behavior to diverse task requirements.
How good is Self-RAG?
Experiments show that Self-RAG (7B and 13B parameters) significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks. Specifically, Self-RAG outperforms ChatGPT and retrieval-augmented Llama2-chat on Open-domain QA, reasoning and fact verification tasks, and it shows significant gains in improving factuality and citation accuracy for long-form generations relative to these models.
![image](https://github.com/user-attachments/assets/7166c3e0-6145-4fe4-9e02-f5cbe0c70b52)
![image](https://github.com/user-attachments/assets/0f045d00-5cfc-4b60-9ae9-df4deb319409)
![image](https://github.com/user-attachments/assets/0713e3ac-a55d-4f42-a939-7cdc66e0d4ec)
Best results so far: Corrective Retrieval Augmented Generation
Developments: The authors show in their paper an iterative RAG approach that evaluates document relevance and the confidence with which different actions should be taken. Called Corrective Retrieval-Augmented Generation (CRAG), it achieves significant improvements over other solutions, including Self-RAG.
A tutorial combines Self-RAG and Corrective RAG ideas into a self-corrective RAG application for answering questions about the Pandas documentation using LangGraph Cloud, flexibly handling model hallucinations. You'll see how to check for hallucinations after an answer is generated, and how to check answer relevancy before returning the answer to the user.
Small to big lookup¶
Small-to-big retrieval indexes and matches small, precise chunks, but passes a larger parent section (or a surrounding window) of each matched chunk to the LLM, improving retrieval precision without starving generation of context.
Reranking¶
After an initial, broad retrieval, a reranker (often a cross-encoder or a hosted reranking API) re-scores the candidate chunks against the query and keeps only the most relevant ones for generation; see the sketch below.
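A minimal reranking sketch using a sentence-transformers cross-encoder (the checkpoint name is a common public model and is an assumption; any cross-encoder or reranking API could be substituted):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 4) -> list[str]:
    # Score each (query, passage) pair jointly -- slower than embedding lookup, but more precise.
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_n]]
```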
Generating responses¶
The final step is generating an output using a prompt that integrates the query and retrieved data.
Challenges in generating responses include:
- Not having enough information: RAG can help minimize the generation of non-factual responses, but only if the retrieved information provides sufficient context to answer the question properly. If the question cannot be answered with a reasonable degree of certainty, then the response should be along the lines of "I don't know."
- Conflicting information: When retrieved results contain different answers to the same question, a definitive response may not be possible.
- Stale information: When information is no longer relevant.
Advanced methods¶
STRUCTRAG: BOOSTING KNOWLEDGE INTENSIVE REASONING OF LLMS VIA INFERENCE-TIME HYBRID INFORMATION STRUCTURIZATION
Developments: The authors create a new framework called StructRAG that identifies the optimal structure for the documents fed into the prompts and show strong improvements in results. The core components are:
As seen on the internet:
🛣️ Hybrid Structure Router: analyzes the input question and determines the best format to structure the data before processing it. It can choose from:
- Tables for tasks with a lot of statistical data
- Graphs for tasks requiring long-chain reasoning, like tracing cause-effect relationships
- Catalogues for summarizing or organizing hierarchical information
- Chunks for simpler, one-off tasks
- Algorithms for more procedural tasks
Each type benefits from a specific structure. For example, using a table for a statistical comparison task is much more efficient than just presenting the raw text.
🧱 Scattered Knowledge Structurizer: once the Hybrid Structure Router has selected the best knowledge format, StructRAG takes all the relevant information and organizes it into the appropriate structure:
- For tables, it arranges data into rows and columns (e.g., comparing company financials across years).
- For graphs, it forms entity-relationship triples like "Company A → revenue growth → 10%."
- For chunks, it keeps the text but filters out the noise, giving the model only what's relevant.
🛠️ Structured Knowledge Utilizer: this component decomposes complex questions into sub-questions and extracts relevant information from the structured knowledge to answer each one. Then, it integrates those sub-answers into a final inference. For example, if you ask the model, "Which company has shown the best growth over the last 5 years?" the Utilizer breaks this down into sub-questions like:
- What was each company's growth percentage?
- How did their revenue change year-on-year?
- How do those numbers compare?
It retrieves precise data from the structured knowledge (e.g., the table) and uses it to construct an answer that's more accurate and contextually aware.
In their own words:
StructRAG framework consists of three modules designed to sequentially identify the most suitable structure type, construct structured knowledge in that format, and utilize that structured knowledge to infer the final answer. First, recognizing that different structure types are suited for different tasks, a hybrid structure router is proposed to determine the most appropriate structure type based on the question and document information of the current task. Second, given that constructing structured knowledge is complex and requires strong comprehension and generation abilities, an LLM-based scattered knowledge structurizer is employed to convert raw documents into structured knowledge in the optimal type. Finally, since questions in knowledge-intensive reasoning tasks can often be complex composite problems that are challenging to solve directly, a structured knowledge utilizer is used to perform question decomposition and precise knowledge extraction for more accurate answer inference.
Multimodal RAG¶
Natural-language lookup with RAG can be improved by incorporating other modalities, such as tables and images, at the same time. There are several ways this may be accomplished, as described in LangChain's multi-modal RAG docs:
Option 1:
- Use multimodal embeddings (such as CLIP) to embed images and text
- Retrieve both using similarity search
- Pass raw images and text chunks to a multimodal LLM for answer synthesis
Option 2:
- Use a multimodal LLM (such as GPT4-V, LLaVA, or FUYU-8b) to produce text summaries from images
- Embed and retrieve text
- Pass text chunks to an LLM for answer synthesis
Option 3:
- Use a multimodal LLM (such as GPT4-V, LLaVA, or FUYU-8b) to produce text summaries from images
- Embed and retrieve image summaries with a reference to the raw image
- Pass raw images and text chunks to a multimodal LLM for answer synthesis
- Multi-Modal: This approach is used for RAG on a Substack that has many images of densely packed tables and graphs. Here is an example implementation, and here is one that works with private data.
- Semi-Structured: This approach is used for RAG on documents with tables, which can be split using naive RAG text-splitting that does not explicitly preserve them. Here is an example implementation.
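A minimal sketch of Option 1's shared text/image embedding space using the sentence-transformers CLIP checkpoint (the model name is a public checkpoint, paths are illustrative, and vector storage plus answer synthesis are omitted):

```python
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")  # embeds both images and text into one space

image_embeddings = model.encode([Image.open("figures/revenue_chart.png")])
text_embeddings = model.encode(["Quarterly revenue grew 12% year over year."])
query_embedding = model.encode(["How fast is revenue growing?"])

def cosine(a, b):
    # Cosine similarity between each row of a and each row of b.
    return a @ b.T / (np.linalg.norm(a, axis=1, keepdims=True) * np.linalg.norm(b, axis=1))

# The best-scoring image/text hits would then go to a multimodal LLM for answer synthesis.
print(cosine(query_embedding, np.vstack([image_embeddings, text_embeddings])))
```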
Evaluating and Comparing¶
Because there are so many ways of performing RAG, it is important to evaluate the quality of the implemented solution.
RAG Arena interfaces with LangChain to provide a RAG chatbot experience where queries receive multiple responses.
Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely
Development: The authors present a survey that introduces a RAG task categorization method that helps to classify user queries into four levels according to the type of external data required and the focus of the task. It summarizes key challenges in building robust data-augmented LLM applications and the most effective techniques for addressing them.
In general, it breaks down the complexity of queries into several levels:
- L1: Explicit Fact Queries: answer specific questions based on documents or snippets within the collection.
- L2: Implicit Fact Queries: answer questions involving data dependencies or some level of logical or common-sense reasoning.
- L3: Interpretable Rationale Queries: queries that require external data to create a rationale for comparison.
- L4: Hidden Rationale Queries: queries that require domain-specific reasoning that may not be explicitly described and is difficult to enumerate.
Open source tools and applications¶
Resources, Tutorials and Blogs¶
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks introduces a complete solution for enabling improved response generation with LLMs.
The authors reveal that allowing fine-tuning of the models, when equipped with RAG, improved the results.
12 RAG Pain Points and Proposed Solutions
Things that might lead to failure of a RAG pipeline, mostly taken from the blog post.
Pain points and proposed solutions:
1: Missing Content:
- Clean your data
- Better prompting
2: Missed the Top Ranked Documents
- Hyperparameter tuning for `chunk_size` and `similarity_top_k`, as in Hyperparameter Optimization for RAG.
- Reranking: the Improving Retrieval Performance by Fine-tuning Cohere Reranker with LlamaIndex notebook uses `CohereRerank` to rerank the results:
import os
from llama_index.postprocessor.cohere_rerank import CohereRerank

api_key = os.environ["COHERE_API_KEY"]
cohere_rerank = CohereRerank(api_key=api_key, top_n=2)  # return top 2 nodes from reranker

query_engine = index.as_query_engine(
    similarity_top_k=10,  # we can set a high top_k here to ensure maximum relevant retrieval
    node_postprocessors=[cohere_rerank],  # pass the reranker to node_postprocessors
)

response = query_engine.query(
    "What did Sam Altman do in this essay?",
)
3: Not in Context — Consolidation Strategy Limitations
- Tweak retrieval strategies
- Finetune embeddings
4: Not Extracted
- Clean your Data
- Prompt Compression
- Long Context Reorder (put crucial content at beginning and end)
5: Wrong Format
- Output Parsing
- Pydantic
6: Incorrect Specificity
7: Incomplete and Impartial Responses
8: Data Ingestion Scalability
- Chain of table and Llama solution
- Mix-Self-Consistency Pack based on Rethinking Tabular Data Understanding with Large Language Models Llama solution
9: Structured Data QA
- Use LlamaIndex's `ChainOfTablePack`, based on Chain of Table
- Use LlamaIndex's `MixSelfConsistencyQueryEngine`, based on Rethinking Tabular Data Understanding with Large Language Models
10: Data Extraction from Complex PDFs
- Use pdf2htmlEX
- Use `EmbeddedTablesUnstructuredRetrieverPack` in LlamaIndex
11: Fallback Model(s): Use a model router such as Neutrino or OpenRouter:
from llama_index.llms import Neutrino
from llama_index.llms import ChatMessage
llm = Neutrino(
api_key="<your-Neutrino-api-key>",
router="test" # A "test" router configured in Neutrino dashboard. You treat a router as a LLM. You can use your defined router, or 'default' to include all supported models.
)
response = llm.complete("What is large language model?")
print(f"Optimal model: {response.raw['model']}")
from llama_index.llms import OpenRouter
from llama_index.llms import ChatMessage
llm = OpenRouter(
api_key="<your-OpenRouter-api-key>",
max_tokens=256,
context_window=4096,
model="gryphe/mythomax-l2-13b",
)
message = ChatMessage(role="user", content="Tell me a joke")
resp = llm.chat([message])
print(resp)
12: LLM Security
- Use things like Llama Guard