Guide to Building a RAG Based LLM Application

Large language models (LLMs) are in the spotlight. They hold a vast store of knowledge and have changed how search engines work. Information retrieval and search have become much easier since the debut of chatbots like ChatGPT. However, the knowledge of a model such as ChatGPT is confined to its training data, which at the time of writing extends only up to 2021. Bing takes a different approach: it supplements the model with up-to-date information retrieved from the internet, offering a more current and comprehensive knowledge base.

Retrieval Augmented Generation (RAG)

There is another approach that augments the knowledge of an LLM with information retrieved from custom content. It is called Retrieval Augmented Generation (RAG). Utility tools like ChatPDF are popular examples: a PDF document is connected as an external data source, and we interact with it through an LLM. RAG inserts additional data into the context (prompt) of a model at inference time. This helps the LLM give more precise and relevant answers to our queries than zero-shot prompting does. Another way of looking at it is in the context of a doctor and patient. A doctor’s diagnosis can be significantly more precise and accurate when they have access to the patient’s test results and charts, as opposed to relying solely on symptomatic observations.

Here’s a quick step-by-step guide to building a RAG based LLM application.

The System Workflow

The workflow of the RAG based LLM application will be as follows:

Receive query from the user.
Convert it to an embedded query vector preserving the semantics, using an embedding model.
Retrieve the top-k relevant content from the vector database by computing similarity between the query embedding and the content embedding in the database.
Pass the retrieved content and query as a prompt to an LLM.
The LLM gives the required response.
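
The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation built later in this guide: the embed function is a toy word-count vector standing in for a real embedding model, the list of chunks stands in for the vector database, and the names embed, cosine, retrieve, and build_prompt are hypothetical. The sketch stops at the prompt that would be sent to the LLM.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a word-count vector (a real system uses an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, context_chunks):
    """Combine the retrieved context and the query into a single prompt."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical knowledge base: three chunks from an imaginary lab report.
chunks = [
    "Hemoglobin level is 13.5 g/dL, within the normal range.",
    "The patient reported mild headaches over two weeks.",
    "Fasting blood glucose is 110 mg/dL, slightly elevated.",
]

query = "What is the fasting blood glucose level?"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

In the real application, the embedding model, the vector database, and the LLM replace embed, the chunks list, and the final print respectively.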

Prerequisites

The directory structure for the project is as shown.

Ensure that you are using Python 3.9.0 or later. List the following Python libraries in a requirements.txt file and install them with pip.

langchain==0.0.279
torch==2.0.1
transformers==4.32.1
accelerate==0.22.0
sentence_transformers==2.2.2
chromadb==0.4.2
pdfminer.six
bitsandbytes
requests
bs4

$ pip install -r requirements.txt

The large language model used here runs with acceptable performance on a CPU with at least 8 GB of RAM, though better specifications are recommended. If you’re considering larger language models, a cloud-based environment like E2E Cloud might be necessary.

Clone the model repository from Hugging Face into the working directory.

Make sure you have git and Git LFS installed on your system, since the model weights are stored with Git LFS.

$ git lfs install
$ git clone https://huggingface.co/MBZUAI/LaMini-T5-738M

Configuring the Database

constants.py

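from chromadb.config import Settings

# Shared Chroma client settings, imported by both ingest.py and app.py.
# Assumption to verify: chroma_db_impl is a legacy option from chromadb 0.3.x
# and may be rejected by the 0.4.x release pinned in requirements.txt.
CHROMA_SETTINGS = Settings(
    chroma_db_impl='duckdb+parquet',  # storage backend
    persist_directory='db',           # directory where the database is saved
    anonymized_telemetry=False        # opt out of usage telemetry
)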

Loading the Data

The retrieval knowledge base must be constructed before building the application. For this we use a vector database. In order to retrieve specific information from a document, such as a patient’s lab report, we first need to process the content of the document. This involves converting the raw data into a format that can be understood and manipulated by our system.

Once the data is processed, it is then stored in a database. However, instead of storing the data in its original form, we convert it into a mathematical representation known as an embedding. These embeddings capture the semantic meaning of the data and allow us to perform complex operations on it. For example, if we want to query information from a patient’s lab report, we search for the embedding of its contents in our vector database.

We are using Chroma DB here for simplicity. Chroma DB is an open-source, feature-rich, simple vector database for building AI applications. Check out the documentation for details.

Import libraries and load contents into the vector database.

ingest.py

from langchain.document_loaders import PyPDFLoader, PDFMinerLoader, DirectoryLoader
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from os.path import join
import os
from constants import CHROMA_SETTINGS

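Walk the docs directory and create a PDFMinerLoader object for each PDF file. The contents of every file are collected in the documents list, so all PDFs in the directory end up in the knowledge base rather than only the last one processed.

ingest.py

documents = []
for root, dirs, files in os.walk("docs"):
    for file in files:
        if file.endswith(".pdf"):
            # Load this PDF and add its contents to the running list
            loader = PDFMinerLoader(join(root, file))
            documents.extend(loader.load())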

The document is split into multiple chunks so that retrieval can return only the most relevant passages. We use RecursiveCharacterTextSplitter from LangChain to split the document into chunks of at most 500 characters, with an overlap of 50 characters between consecutive chunks. The overlap must be smaller than the chunk size; its purpose is to keep text that straddles a chunk boundary available in both neighbouring chunks.

We then create a SentenceTransformerEmbeddings object with the “all-MiniLM-L6-v2” model, which generates an embedding (a numerical representation) for each chunk of text.

ingest.py

# The 50-character overlap is an illustrative value; tune it for your documents
textsplitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = textsplitter.split_documents(documents)

embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
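
To see how chunk size and overlap interact, here is a simplified sliding-window splitter. It is only a sketch of the idea: RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraphs and sentences, which this version ignores. The function name split_with_overlap is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character chunks that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        # With overlap == chunk_size the window would never move forward.
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts chunk_size - chunk_overlap characters after the previous one, so the last three characters of one chunk reappear at the start of the next.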

Create a Chroma object from the given texts and their corresponding embeddings. It then persists (saves) this data in the form of parquet files to a directory named “db” for future use.

ingest.py

db = Chroma.from_documents(texts, embeddings, persist_directory="db", client_settings=CHROMA_SETTINGS)

Creating LLM Object

Here we are using an open-source, lightweight LLM called LaMini-T5-738M. Load the tokenizer and the model from the pretrained checkpoint cloned earlier. The AutoModelForSeq2SeqLM class loads a seq2seq (encoder-decoder) model that has a language modeling (LM) head on top.

app.py

from transformers import AutoTokenizer, pipeline, AutoModelForSeq2SeqLM
import torch
checkpoint="LaMini-T5-738M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
base_model = AutoModelForSeq2SeqLM.from_pretrained(
    checkpoint,
    device_map="auto",
    torch_dtype=torch.float32
)

Create a pipeline for text-to-text generation using the specified model, tokenizer, and several parameters that control the text generation process. The temperature parameter controls the randomness of the output when do_sample=True; lower values make the output more deterministic. top_p restricts sampling to the most probable tokens whose cumulative probability reaches that value, and max_length sets the maximum length of the generated text.

app.py

from langchain.llms import HuggingFacePipeline

def llm_pipeline():
    pipe=pipeline(
        'text2text-generation',
        model=base_model,
        tokenizer=tokenizer,
        temperature=0.4, 
        max_length=256,
        do_sample=True,
        top_p=0.95
    )
    local_llm=HuggingFacePipeline(pipeline=pipe)
    return local_llm
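
The effect of temperature and top_p can be illustrated on a made-up next-token distribution. This is a conceptual sketch in plain Python, not the code path the transformers library runs internally; the logits and the function names softmax_with_temperature and top_p_filter are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most probable tokens until their cumulative mass reaches top_p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
flat = softmax_with_temperature(logits, 1.0)
sharp = softmax_with_temperature(logits, 0.4)
nucleus = top_p_filter(sharp, 0.95)

print("temperature 1.0:", [round(p, 3) for p in flat])
print("temperature 0.4:", [round(p, 3) for p in sharp])
print("tokens kept by top_p=0.95:", sorted(nucleus))
```

At temperature 0.4 the most likely token takes a larger share of the probability than at 1.0, and top_p then discards the unlikely tail before a token is sampled.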

Configuring the Chain

Set up a question-answering pipeline using the language model and a retriever. db.as_retriever() creates a retriever from the Chroma database. The retriever is responsible for fetching the documents most relevant to a query.

In LangChain, a chain is a wrapper around a sequence of individual components. Each step within the chain can be either a request to the Large Language Model (LLM) or a function call that taps into another data source. The “chain type” selects how the retrieved documents are handed to the LLM. Here we use “stuff”, which places all retrieved documents into a single prompt.

app.py

from langchain.chains import RetrievalQA
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma
from constants import CHROMA_SETTINGS

def qa_llm():
    llm = llm_pipeline()
    embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
    db = Chroma(persist_directory="db", embedding_function=embeddings, client_settings=CHROMA_SETTINGS)
    retriever = db.as_retriever()
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True
    )
    return qa
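
Conceptually, the "stuff" chain type used above concatenates the retrieved documents and places them in one prompt together with the question. The sketch below shows that idea in plain Python. The prompt wording and the function name stuff_prompt are illustrative assumptions, not LangChain's actual template.

```python
def stuff_prompt(question, documents, separator="\n\n"):
    """Join all retrieved documents into one context block and append the question."""
    context = separator.join(documents)
    return (
        "Answer the question using only the context below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks
retrieved = [
    "Fasting blood glucose is 110 mg/dL, slightly elevated.",
    "Hemoglobin level is 13.5 g/dL, within the normal range.",
]
prompt = stuff_prompt("Is the blood glucose level normal?", retrieved)
print(prompt)
```

Because everything goes into a single prompt, this approach only works while the retrieved chunks plus the question fit within the model's context length, which is one reason to keep chunks small.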

Now pass the query and generate a response from the LLM.

app.py

def process_answer(instruction):
    # Build the chain, run the query, and return both the answer text
    # and the full result (which also holds the source documents)
    qa = qa_llm()
    generation = qa(instruction)
    answer = generation['result']
    return answer, generation

if __name__ == "__main__":
    instruction = "Your query goes here"  # replace this with your query

    answer, generation = process_answer(instruction)
    print("Answer:", answer)
    print("Generation:", generation)
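
Because the chain is created with return_source_documents=True, the result should also carry the retrieved chunks under the source_documents key. The helper below is a hedged sketch of how those could be summarised. FakeDocument is a hypothetical stand-in for LangChain's Document class (which exposes page_content and metadata), summarize_sources is an illustrative name, and the "source" metadata key is an assumption about what the PDF loader records.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document: text plus a metadata dictionary."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_sources(generation, max_chars=80):
    """Return one 'source: snippet' line per retrieved document."""
    lines = []
    for doc in generation.get("source_documents", []):
        source = doc.metadata.get("source", "unknown")
        # Collapse whitespace so PDF line breaks do not clutter the snippet
        snippet = " ".join(doc.page_content.split())[:max_chars]
        lines.append(f"{source}: {snippet}")
    return lines


# Hypothetical result shaped like the chain's output
generation = {
    "query": "What is the fasting blood glucose level?",
    "result": "110 mg/dL",
    "source_documents": [
        FakeDocument("Fasting blood   glucose is\n110 mg/dL.", {"source": "docs/report.pdf"}),
    ],
}

for line in summarize_sources(generation):
    print(line)
```

Printing the sources next to the answer makes it easier to check whether the response is actually grounded in the PDF.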

Add the PDF file you want to explore to the docs directory. Then run ingest.py to prepare the database with the embeddings, followed by app.py to see the result.

$ python3 ingest.py

$ python3 app.py

Sample query:

Wrapping Up

And that was a simple RAG-based LLM application. You have learnt how to use an LLM with RAG to generate relevant and informative answers from your own documents. Try experimenting with various chains in LangChain and build multi-PDF readers. We hope you enjoyed this tutorial and found it useful for your projects.

References

https://developer.dataiku.com/latest/tutorials/machine-learning/genai/nlp/gpt-lc-chroma-rag/index.html

https://www.trychroma.com/

https://github.com/AIAnytime/Search-Your-PDF-App
