
Guide to Using GPT Cache with Langchain and LlamaIndex: The Semantic Cache for LLMs to Speed Up Inferencing

December 6, 2023

Introduction

Inferencing is the process in which a trained machine learning model makes predictions, decisions, or draws conclusions based on input data and the knowledge gained during its training phase. It involves applying the trained model to new, unseen data to generate an output. Inferencing plays a crucial role in many applications of artificial intelligence and machine learning, including image and object recognition, natural language processing (NLP), speech recognition, recommendation systems, healthcare diagnostics, autonomous vehicles, fraud detection, and manufacturing quality control.

Caching is a technique used in computing to store and reuse previously computed or fetched data. In the context of machine learning models like the Generative Pre-trained Transformer (GPT), caching might refer to using cached representations of input sequences to speed up inferencing.

In autoregressive language models like GPT, the model generates output one token at a time, and each new token depends on all the tokens before it. If you have a long input sequence and you're generating output tokens sequentially, you can cache the intermediate computations for the tokens already processed (in transformer models, the attention keys and values) to avoid redundant calculations when generating subsequent tokens. This is particularly useful during inference when you're generating text based on a given context: caching avoids recomputing the entire context for each new token, making the process more efficient.

A semantic cache is a caching mechanism that takes into account the semantics, or meaning, of the data being cached. It goes beyond exact-match key-value lookups and considers the content or context of the data, so that a query can be served from the cache when a sufficiently similar query has already been answered. This allows more intelligent and efficient caching strategies, which can improve performance and responsiveness during inferencing. LlamaIndex and Langchain are two tools that can be used for this purpose.
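To make the idea concrete, here is a minimal sketch of a semantic cache in plain Python. It assumes a toy bag-of-words embedding and a similarity threshold of 0.8; the names embed, cosine, and SemanticCache are hypothetical, and a real system would use a learned embedding model in place of word counts.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed cut-off; tune for real embeddings
        self.entries = []           # list of (embedding, response) pairs

    def get(self, query):
        """Return the response of the most similar cached query, or None."""
        q = embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query, response):
        self.entries.append((embed(query), response))


cache = SemanticCache()
cache.put("What is the capital of France?", "Paris")
hit = cache.get("what is the capital of france")  # same words, different form
miss = cache.get("How do I bake bread?")           # unrelated query
```

Here the first lookup is served from the cache even though the text is not byte-for-byte identical, while the unrelated query falls below the threshold and would go to the model.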

In this article, we look at how these two tools can be used with a semantic GPT cache to speed up inferencing.

LlamaIndex & Langchain: An Overview

LlamaIndex works as a bridge between large language models and external data sources, while Langchain serves as a framework for managing and empowering applications based on Large Language Models (LLMs).

The basic difference between the two tools is that LlamaIndex focuses on providing tools to create and organize knowledge using different index types, such as tree index, list index, and vector store index, allowing users to arrange and compose indexes in a way that suits their data. A key feature of Langchain, on the other hand, is its Agents, which let a Large Language Model decide which tools to call. The two can also be combined: you can build several different indexes in LlamaIndex and then use a Langchain Agent as a router across them to achieve the best results.

What Is LlamaIndex

LlamaIndex is a tool that acts as a bridge between your custom data and large language models (LLMs) like GPT-4, which are powerful models capable of understanding human-like text. Whether your data is stored in APIs, databases, or PDFs, LlamaIndex makes it easy to integrate this data into conversations with these intelligent machines. This bridging makes your data more accessible and usable, paving the way for smarter applications and workflows. The following steps occur while using LlamaIndex:

STEP 1: Ingesting Data

It means getting the data from its original source, such as a PDF or an API, into the system.

STEP 2: Structuring Data

It means organizing the data in a way that the language models can easily understand.

STEP 3: Retrieval of Data

It means finding and fetching the right pieces of data when needed.

STEP 4: Integration

It makes it easier to combine your data with various application frameworks.

The above steps help in facilitating better integration of the LLMs with external sources of data.

Installation and Set-Up

To install LlamaIndex on your system, if you are familiar with Python, use this command:


pip install llama-index

Next, set your OpenAI API key as an environment variable:


import os
os.environ["OPENAI_API_KEY"] = "your_api_key"

Now, we will create a LlamaIndex document. We can use the following syntax for doing the same:


from llama_index import download_loader


GoogleDocsReader = download_loader('GoogleDocsReader')
loader = GoogleDocsReader()
documents = loader.load_data(document_ids=[...])

By following the steps outlined above, you can load documents and make them available to your large language model (LLM) as external data. Let's now experiment with different data sources using data connectors.

PDF Files: We can use SimpleDirectoryReader for this purpose:


from llama_index import SimpleDirectoryReader
reader = SimpleDirectoryReader(input_files = ["XYZ.pdf"])
Pdf_documents = reader.load_data()

Similarly, for Wikipedia pages, a WikipediaReader can be obtained through download_loader, following the same pattern as the Google Docs example above.

There are several other data connectors:

SimpleDirectoryReader: Supports a broad range of file types (.pdf, .jpg, .png, .docx, etc.) from a local file directory.
NotionPageReader: Ingests data from Notion.
SlackReader: Imports data from Slack.
ApifyActor: Capable of web crawling, scraping, text extraction, and file downloading.

Creating Nodes

In LlamaIndex, once the data has been ingested and represented as documents, there is an option to further process these documents into nodes. Nodes are more granular data entities representing 'chunks' of source documents, which could include text chunks, images, or other types of data. They also carry metadata and information about relationships with other nodes, which can be instrumental in building a more structured and relational index.

To parse documents into nodes, LlamaIndex provides NodeParser classes. Here's how you can use a SimpleNodeParser to parse your documents into nodes:


from llama_index.node_parser import SimpleNodeParser
# Assuming documents have already been loaded
# Initialize the parser
parser = SimpleNodeParser.from_defaults(chunk_size=1024, chunk_overlap=20)
# Parse documents into nodes
nodes = parser.get_nodes_from_documents(Pdf_documents)
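
To show what chunk_size and chunk_overlap control, here is a simplified sketch of overlapping chunking. It is not LlamaIndex's implementation: it counts whitespace-separated words rather than tokens, and the function name chunk_words is hypothetical.

```python
def chunk_words(text, chunk_size, chunk_overlap):
    """Split text into word windows of chunk_size that share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
chunks = chunk_words(sample, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last word of the previous one, so a sentence that straddles a chunk boundary is still partly visible in both neighbouring chunks.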

Now we have to create an index with nodes and documents. The core essence of LlamaIndex lies in its ability to build structured indices over ingested data, represented as either documents or nodes.

Building Index with Documents

Here's how you can build an index directly from documents using the VectorStoreIndex:


from llama_index import VectorStoreIndex
# Assuming docs is your list of Document objects
index = VectorStoreIndex.from_documents(docs)

Different types of indices in LlamaIndex handle data in distinct ways:

Summary Index: Stores nodes as a sequential chain, and during query time, all nodes are loaded into the Response Synthesis module if no other query parameters are specified.
Vector Store Index: This index stores each node and its corresponding embedding in a vector store, where queries involve fetching the top-k most similar nodes.
Tree Index: Builds a hierarchical tree from a set of nodes, and queries involve traversing from root nodes down to leaf nodes.
Keyword Table Index: This index extracts keywords from each node to build mapping. Queries then use these relevant keywords to fetch corresponding nodes.
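
As an illustration of the keyword table idea described above, the following sketch builds a keyword-to-node mapping and uses it to answer a query. It is a toy model, not LlamaIndex code: the stopword list, the node texts, and the function names are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Small, assumed stopword list for the example
STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in", "for", "what", "how"}


def keywords(text):
    """Very rough keyword extraction: lowercase words minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def build_keyword_table(nodes):
    """Map each keyword to the set of node ids whose text contains it."""
    table = defaultdict(set)
    for node_id, text in nodes.items():
        for kw in keywords(text):
            table[kw].add(node_id)
    return table


def query_table(table, question):
    """Return node ids ranked by how many query keywords they match."""
    scores = defaultdict(int)
    for kw in keywords(question):
        for node_id in table.get(kw, ()):
            scores[node_id] += 1
    return sorted(scores, key=lambda n: (-scores[n], n))


nodes = {
    "n1": "Caching stores results for reuse.",
    "n2": "A vector index stores embeddings.",
    "n3": "Embeddings capture the meaning of text.",
}
table = build_keyword_table(nodes)
result = query_table(table, "What is the meaning of embeddings?")
```

The query shares two keywords with n3 and one with n2, so n3 is ranked first and n1 is not returned at all.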

Building Index with Nodes

You can also build an index directly from node objects, following the parsing of documents into nodes or through manual node creation:


from llama_index import VectorStoreIndex
# Assuming nodes is your list of Node objects
index = VectorStoreIndex(nodes)

Using Index to Query Data

After having established a well-structured index using LlamaIndex, the next pivotal step is querying this index to extract meaningful insights or answers to specific inquiries.

LlamaIndex provides a high-level API that facilitates straightforward querying, ideal for common use cases.


# Assuming 'index' is your constructed index object
query_engine = index.as_query_engine()
response = query_engine.query("your_query")
print(response)

In this simple approach, the as_query_engine() method creates a query engine from your index, and the query() method executes a query.

What Is Langchain

Although you probably don’t have enough money and computational resources to train an LLM from scratch in your basement, you can still use pre-trained LLMs to build something cool, such as:

Personal Assistant which can interact with the outside world based on your data.
Chatbots customized for your purpose.
Analysis or Summarization of your documents or code.

LangChain is a framework that helps you build LLM-powered applications more easily by providing you with the following:

A generic interface to a variety of different foundation models.
A framework to help you manage your prompts.
A central interface for long-term memory, external data, other LLMs, and other agents for tasks an LLM is not able to handle (e.g., calculations or search).

Set-up and Installation

To install Langchain, use this command:


pip install langchain

Building the Knowledge Base


from datasets import load_dataset
data = load_dataset("wikipedia", "20220301.simple", split='train[:10000]')
data

Vector Database

To create a vector database, we first need a free API key from Pinecone. Then we initialize it, like this:


from getpass import getpass

import pinecone

# find API key in console at app.pinecone.io
YOUR_API_KEY = getpass("Pinecone API Key: ")
# find ENV (cloud region) next to API key in console
YOUR_ENV = input("Pinecone environment: ")
index_name = 'langchain-retrieval-augmentation'
pinecone.init(
    api_key=YOUR_API_KEY,
    environment=YOUR_ENV
)
if index_name not in pinecone.list_indexes():
    # we create a new index
    pinecone.create_index(
        name=index_name,
        metric='cosine',
        dimension=1536  # dimensionality of text-embedding-ada-002
    )

Indexing

We can perform the indexing task using the LangChain vector store object. But, for now, it is much faster to do it via the Pinecone Python client directly.

Creating a Vector Store and Querying

Now that we've built our index, we can switch back to LangChain. We start by initializing a vector store using the same index we just built.


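from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

# embedding model; the article does not show where embed is defined, so this
# assumes text-embedding-ada-002 with OPENAI_API_KEY set in the environment
embed = OpenAIEmbeddings(model='text-embedding-ada-002')

text_field = "text"
# switch back to normal index for langchain
index = pinecone.Index(index_name)
vectorstore = Pinecone(
    index, embed.embed_query, text_field
)

Now we will query the data we have provided using the following code:


query = "who was Benito Mussolini?"
vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

Generative Question-Answering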

In Generative Question-Answering (GQA), we pass the query as a question to an LLM, but the LLM must answer it based on the information returned from the vector store.

To do this, we initialize a Retrieval QA object like this:


import os

from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# the OpenAI key set as an environment variable earlier
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
# chat completion llm
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-3.5-turbo',
    temperature=0.0
)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)
qa.run(query)
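
The "stuff" chain type places all retrieved documents into a single prompt. The sketch below shows that idea in plain Python. The prompt wording, the function name build_stuff_prompt, and the placeholder passages are assumptions for illustration and do not reproduce Langchain's actual template.

```python
def build_stuff_prompt(question, documents):
    """Concatenate all retrieved documents into one context block ("stuffing")."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder passages standing in for the vector store results
retrieved = ["First retrieved passage.", "Second retrieved passage."]
prompt = build_stuff_prompt("who was Benito Mussolini?", retrieved)
print(prompt)
```

Because every passage goes into one prompt, this approach is simple but only works while the combined text fits into the model's context window.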

By following the process outlined above, we can use Langchain with large language models (LLMs) to ground their answers in our own data.

Use of Langchain and LlamaIndex Together in Storing and Using GPT Cache

Caching mechanisms can be used with large language models like GPT. Caching is often employed to store intermediate results, precomputed values, or model outputs to improve efficiency and reduce computation time. Here's a general approach we might consider:

Langchain is a framework for building LLM applications, and it includes a caching layer for model calls that you can integrate into your application or workflow. This layer stores and retrieves cached results, and it supports several backends, including GPTCache.

On the other hand, LlamaIndex excels in indexing and organizing data. Its indexes provide a structured way to search, update, and access stored information, which is the same kind of similarity lookup a semantic cache relies on.

Integration of Langchain and LlamaIndex with GPT Cache

When using GPT, you can cache the model outputs for specific inputs. This is useful when you have repetitive queries or inputs that are used frequently.

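When a new query comes in, first check the cache to see whether a result for the same query, or a semantically similar one, is already stored. If it is, you can retrieve the cached result instead of re-running the expensive GPT inference. In this setup, Langchain's cache layer stores and returns the results, while an index such as the ones LlamaIndex builds can supply the similarity lookup.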
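
The check-the-cache-first flow can be sketched as follows. This is a minimal exact-match version under stated assumptions: CountingLLM is a hypothetical stand-in for a real model call, and the cache is a plain dictionary.

```python
class CountingLLM:
    """Stand-in for an expensive model call; counts how often it is invoked."""

    def __init__(self):
        self.calls = 0

    def __call__(self, prompt):
        self.calls += 1
        return f"answer to: {prompt}"


def cached_completion(prompt, llm, cache):
    """Return a cached response if present, otherwise call the model and store it."""
    if prompt in cache:
        return cache[prompt]      # cache hit: no model call
    response = llm(prompt)        # cache miss: run the expensive inference
    cache[prompt] = response
    return response


llm = CountingLLM()
cache = {}
first = cached_completion("What is caching?", llm, cache)
second = cached_completion("What is caching?", llm, cache)
print(llm.calls)
```

The second request returns the stored response, so the model is invoked only once for the two identical prompts.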

Updating and Evicting Cache

Implement a strategy for updating and evicting the cache to ensure that the stored results are up-to-date. This might involve setting a time-to-live for cached entries or updating them when underlying data changes.
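
A time-to-live policy can be sketched like this. The class name TTLCache and the injectable clock parameter are assumptions made so that expiry can be demonstrated without waiting; a production cache would more likely rely on a library or the store's built-in expiry.

```python
import time


class TTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so expiry can be shown without sleeping
        self.store = {}      # key -> (expires_at, value)

    def put(self, key, value):
        self.store[key] = (self.clock() + self.ttl, value)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self.store[key]   # evict the stale entry on access
            return None
        return value

    def invalidate(self, key):
        """Drop an entry explicitly, e.g. when the underlying data changes."""
        self.store.pop(key, None)


now = [0.0]  # fake clock value used only for this demonstration
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
cache.put("q", "cached answer")
fresh = cache.get("q")   # read at t=0, still valid
now[0] = 61.0
stale = cache.get("q")   # read at t=61, past the 60 second TTL
```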

Managing Cache Size

Consider implementing a mechanism to manage the size of the cache, especially if storage resources are limited. A common approach is to cap the number of entries and evict the least recently used ones first.
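
One way to bound the cache size is least recently used (LRU) eviction. The sketch below assumes a fixed maximum number of entries and uses the standard library's OrderedDict; the class name LRUCache is hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.store = OrderedDict()   # oldest entries first, newest last

    def get(self, key):
        if key not in self.store:
            return None
        self.store.move_to_end(key)  # mark as most recently used
        return self.store[key]

    def put(self, key, value):
        self.store[key] = value
        self.store.move_to_end(key)
        while len(self.store) > self.max_entries:
            self.store.popitem(last=False)  # drop the least recently used entry


cache = LRUCache(max_entries=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" becomes the most recently used entry
cache.put("c", 3)    # exceeds the limit, so "b" is evicted
remaining = list(cache.store)
```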

Concurrency and Parallelism

Take into account potential concurrency issues, especially in a multi-user or multi-threaded environment. Ensure that the caching mechanism is thread-safe and handles concurrent requests appropriately.
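
A simple way to make a cache safe for concurrent use is to guard it with a lock. The sketch below is one possible design, not the only one: it holds the lock while computing, which guarantees a single computation per key but serializes all misses. The class name ThreadSafeCache is hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ThreadSafeCache:
    def __init__(self):
        self._lock = threading.Lock()
        self._store = {}
        self.computations = 0

    def get_or_compute(self, key, compute):
        # The lock covers both the lookup and the computation, so two threads
        # asking for the same missing key never compute it twice.
        with self._lock:
            if key not in self._store:
                self.computations += 1
                self._store[key] = compute(key)
            return self._store[key]


cache = ThreadSafeCache()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(
        pool.map(lambda _: cache.get_or_compute("q", str.upper), range(32))
    )
```

For expensive model calls, a finer-grained scheme such as one lock per key would let unrelated queries proceed in parallel.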

It's important to note that the effectiveness of such a caching strategy depends on the specific use case, the nature of the queries, and the characteristics of the data being processed. Always consider the trade-offs between storage, computation, and the frequency of data updates when designing a caching system. Additionally, check the latest documentation for Langchain and LlamaIndex for specific integration details and best practices.

The next section describes how a semantic cache can be used to speed up inferencing in LLMs.

Use of Semantic Cache in Speeding Up Inferencing

Semantic caching involves storing the meaning or semantics of data, which can be particularly useful in natural language processing tasks, like working with models such as GPT.

Here are some general steps you can take to increase inferencing speed using a semantic cache:

Identify Repeated Queries

Determine which queries or inputs are repeated frequently. These could be similar or identical requests made to your model.

Semantic Representation

Instead of caching raw inputs, consider storing a semantic representation of the input. This is typically an embedding vector, which places inputs with similar meaning close together. A hash of a normalized form of the input is a more compact alternative, but it only matches inputs that are identical after normalization.

Use Efficient Data Structures

Choose data structures that enable fast retrieval based on semantic representations. Hash tables provide quick access to cached results for exact keys, while a vector index supports nearest-neighbor lookup over embeddings.

Query Transformation

Transform incoming queries into a standardized semantic representation before checking the cache. This ensures that semantically equivalent queries produce the same cache lookup key.
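
A lightweight form of query transformation is text normalization followed by hashing. The sketch below assumes that lowercasing, removing punctuation, and collapsing whitespace is an acceptable notion of equivalence; the function names normalize and cache_key are hypothetical. It only catches surface variations, not paraphrases, which need embedding-based matching.

```python
import hashlib
import re


def normalize(query):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", query.lower())
    return " ".join(cleaned.split())


def cache_key(query):
    """Stable lookup key derived from the normalized query text."""
    return hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()


key_a = cache_key("What is   Semantic Caching?")
key_b = cache_key("what is semantic caching")
key_c = cache_key("What is prompt caching?")
print(key_a == key_b, key_a == key_c)
```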

Hashing and Indexing

Utilize efficient hashing algorithms or indexing mechanisms to map semantic representations to cached results. This can significantly speed up the process of retrieving cached data.

Partial Results and Incremental Updates

Cache partial results or intermediate representations if the full inference is expensive. This allows you to reuse parts of the computation when the same or similar queries are encountered.

Versioning and Expiry

Implement versioning for your cache entries to manage updates. Set expiry times for entries to ensure that cached results are not outdated.
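
Versioning can be as simple as including version identifiers in the cache key, so that entries produced by an older model or dataset are never returned for a newer one. In this sketch the version strings and the function name versioned_key are made-up examples.

```python
import hashlib


def versioned_key(query, model_version, data_version):
    """Cache key that changes whenever the model or the data version changes."""
    raw = f"{model_version}|{data_version}|{query}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


cache = {}
cache[versioned_key("q", "model-v1", "data-2023-12")] = "old answer"

lookup_same = cache.get(versioned_key("q", "model-v1", "data-2023-12"))
lookup_new_model = cache.get(versioned_key("q", "model-v2", "data-2023-12"))
```

After a model upgrade the old entries simply stop matching and can be left to expire.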

Parallel Processing

Explore parallel processing techniques to perform cache lookups concurrently. This can be especially useful in scenarios with high concurrent inference requests.

Monitor and Optimize

Regularly monitor the cache hit rate and overall system performance. Optimize the cache strategy based on usage patterns and evolving requirements.
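
Hit-rate monitoring needs only two counters. The following sketch, with the hypothetical class name MonitoredCache, records hits and misses on every lookup and exposes the ratio.

```python
class MonitoredCache:
    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        return None

    def put(self, key, value):
        self.store[key] = value

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = MonitoredCache()
for q in ["a", "b", "a", "a"]:
    if cache.get(q) is None:
        cache.put(q, f"answer to {q}")
print(cache.hits, cache.misses, cache.hit_rate)
```

In this run the first lookups of "a" and "b" miss and the two repeated lookups of "a" hit, giving two hits out of four lookups.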

Consideration for Context

Depending on the nature of the application, we can consider caching results with respect to context. For language models like GPT, context is crucial, so caching should account for it.

Remember that the effectiveness of semantic caching depends on the specific characteristics of the application, the nature of the queries, and the workload. It's often a trade-off between storage space, computational cost, and the benefits gained from caching. We should regularly evaluate and fine-tune our caching strategy based on real-world usage patterns and system requirements.

Optimizing Large Language Model (LLM) Performance with LlamaIndex, LangChain, and GPTCache

The realm of artificial intelligence (AI) has witnessed remarkable advancements in recent years, with large language models (LLMs) emerging as powerful tools for a wide range of tasks, including natural language processing (NLP), text generation, translation, and question-answering. However, the computational demands of LLMs pose challenges, particularly when dealing with large datasets or repetitive prompts. To address these concerns, the combination of LlamaIndex, LangChain, and GPTCache offers a promising solution.

LlamaIndex: Efficient Data Retrieval

LlamaIndex, a data framework that connects LLMs to external data, serves as the foundation for efficient data retrieval in this setup. It constructs a semantic index of documents, enabling rapid identification of relevant passages based on their contextual meaning. This index reduces the computational overhead associated with searching through vast text corpora.

To illustrate how LlamaIndex works, consider a document collection containing articles on various topics. LlamaIndex would process each document, creating a vector representation that captures its semantic meaning. When a user submits a query, LlamaIndex would compare the query vector to the document vectors, identifying the most relevant documents based on their semantic similarity.
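
The comparison step can be sketched as a top-k search by cosine similarity. The three-dimensional vectors and topic names below are invented for illustration; real embeddings have hundreds or thousands of dimensions, and the function names cosine and top_k are hypothetical.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query vector."""
    scored = ((cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


# Hand-written example vectors (assumed, not produced by a real model)
doc_vecs = {
    "sports": [0.9, 0.1, 0.0],
    "cooking": [0.1, 0.9, 0.1],
    "finance": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.3, 0.0]
best = top_k(query_vec, doc_vecs, k=2)
```

The query vector points mostly in the same direction as the "sports" vector, so that document ranks first, followed by "cooking".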

LangChain: Modular NLP Framework

LangChain, a modular framework for building LLM applications, provides components such as prompt templates, chains, memory, and agents. It can be used together with LlamaIndex, giving the application access to the indexed data. LangChain also offers utilities for loading and splitting text, which further support the NLP pipeline.

In the context of LLM usage optimization, LangChain plays a crucial role in preparing prompts for LLM processing. It can extract key information from retrieved documents, generate concise and informative prompts, and incorporate relevant context to improve the quality of LLM responses.

GPTCache: Semantic Caching

GPTCache, a semantic cache, acts as a gatekeeper between the user and the LLM, preventing unnecessary LLM calls and reducing response latency. It stores frequently used prompts and their corresponding responses, eliminating the need to repeatedly call the LLM for the same information.

GPTCache operates by maintaining a cache of prompt-response pairs. When a user submits a prompt, GPTCache converts it into an embedding, searches the cache for the same or a semantically similar prompt, and returns the cached response if one is found. If there is no match, the prompt is sent to the LLM for generation, and the response is stored in the cache for future use.

Integration and Benefits

The combined use of LlamaIndex, LangChain, and GPTCache offers several advantages for optimizing LLM performance:

Efficiency: LlamaIndex's semantic index enables rapid retrieval of relevant information, reducing search time and minimizing LLM calls.

Accuracy: LangChain's text processing capabilities ensure that prompts accurately reflect the user's intent, leading to more relevant and informative responses from the LLM.

Reduced Latency: GPTCache eliminates redundant LLM calls, significantly improving response time and overall system throughput.

Cost Optimization: By reducing LLM usage, the system incurs lower computational costs, making it more economical to operate.

Code Implementation

To illustrate the practical implementation of this framework, consider the following code snippet:


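import hashlib

import langchain
from gptcache import Cache
from gptcache.adapter.api import init_similar_cache
from langchain.cache import GPTCache
from langchain.llms import OpenAI
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Sketch using the llama_index, langchain, and gptcache packages as of late 2023;
# it assumes OPENAI_API_KEY is set and that document_corpus.txt exists locally.

# Create a LlamaIndex index and load the document corpus into it
documents = SimpleDirectoryReader(input_files=["document_corpus.txt"]).load_data()
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=3)

# Register GPTCache (similarity matching) as the LangChain LLM cache
def init_gptcache(cache_obj: Cache, llm: str):
    hashed_llm = hashlib.sha256(llm.encode()).hexdigest()
    init_similar_cache(cache_obj=cache_obj, data_dir=f"similar_cache_{hashed_llm}")

langchain.llm_cache = GPTCache(init_gptcache)

# Create the LangChain LLM wrapper
llm = OpenAI(temperature=0.0)

# Process user query
query = input("Enter your query: ")

# Retrieve relevant passages using LlamaIndex
retrieved = retriever.retrieve(query)
context = "\n\n".join(item.node.get_content() for item in retrieved)

# Generate a prompt based on the retrieved passages
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Generate the response; GPTCache is checked first and the LLM is only called on a miss
response = llm(prompt)

# Present the response to the user
print("Response:")
print(response)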

This code snippet demonstrates the integration of LlamaIndex, LangChain, and GPTCache to process user queries, retrieve relevant information, generate prompts, and provide responses using the LLM. Serving repeated or similar prompts from the cache reduces LLM usage, improving overall system performance and cost-effectiveness.

Summary

In order to increase the efficiency of inferencing with a semantic cache, we essentially aim to leverage precomputed results for frequently occurring or similar queries. This can significantly reduce the computational load and improve response times. We can enhance inferencing using a semantic cache by considering the following:

Identify Reusable Queries

We should always try to analyse the types of queries that are frequently used or repeated in the application. These are good candidates for caching.

Semantic Representation

Convert queries into a semantic representation that captures their meaning. This might involve tokenization, vectorization, or other methods that allow for efficient comparison.

Cache Key Generation

Create a unique cache key for each query based on its semantic representation. This key should be consistent for queries with the same meaning.

Cache Lookup

Before performing an inference, check the semantic cache using the cache key. If a result is found, retrieve it directly instead of running the inference again.

Cache Miss Handling

If a cache miss occurs, proceed with the inference as usual. After obtaining the result, store it in the semantic cache with the corresponding cache key.

Expiration and Eviction Policies

We should always implement policies for cache expiration or eviction to ensure that outdated or less relevant results are removed from the cache.

Size Management

Consider the size of the cache and implement mechanisms to manage it. This may involve setting a maximum cache size, using a least recently used (LRU) policy, or other strategies.

Concurrency and Consistency

Ensure that the caching mechanism is thread-safe and handles concurrency appropriately. Consistency is crucial to avoid returning stale or incorrect results.

Logging and Monitoring

Implement logging and monitoring to track cache hits, misses, and overall cache performance. This can help you fine-tune the caching strategy based on actual usage patterns.

Adaptive Caching

Depending on the workload and usage patterns, consider adaptive caching strategies that dynamically adjust cache parameters to optimize performance.

Versioning

If your model or data undergoes changes, implement versioning in the cache to handle different versions of queries and results.

Conclusion

We should always remember that the effectiveness of a semantic cache depends on the nature of your queries and data. It's essential to strike a balance between caching efficiency and the potential for changing or dynamic queries.

We should regularly analyse cache performance and make adjustments based on real-world usage patterns to ensure optimal efficiency in inferencing with Large Language Models.
