Retrieval-Augmented Generation (RAG) with LangChain and local Ollama
Pre-trained Large Language Models (LLMs), like ChatGPT, do a tremendous job of generating answers to our questions by re-organizing the information they acquired during training (the way they store information is nothing like a Wikipedia page). Often, though, we want models to use only part of their knowledge, or to use information they didn't have access to during training (for example, the internal documents of our organization).
If we want LLMs to focus only (in reality, mostly) on specific texts, we have two choices:
- Fine-tuning a pre-trained model or
- Setting up a Retrieval-Augmented Generation (RAG) pipeline.
In this post, we'll focus on the second solution. This is the first of a series on RAG. While we cover the entire pipeline here, subsequent posts will focus on specific parts.
TLDR
With RAG you can use your documents to enhance prompts for LLMs. LangChain offers tools for the entire pipeline: from loading and indexing documents to submitting prompts to the model. This post presents a simple application of RAG using LangChain and a local LLM run with Ollama. You can skip to the end for the whole working code, or you can dive into specific parts of the post:
- Indexing: where we load documents and store them in a vector store.
- Retrieval: where we use the query to retrieve relevant documents from the store.
- Generation: where we use the query and the retrieved documents to build a prompt and query an LLM.
What is Retrieval-Augmented Generation (RAG)
The prompt used to query an LLM usually contains a context that steers text generation. For example:
During the second world war, who was the president of the United States?
contains the question who was the president of the United States? and the context during the second world war. Even a simple local LLM (llama3.2:1b with Ollama) gives the correct answer:
Franklin D. Roosevelt served as the President of the United States from 1933 to 1945 [...]
The same model gives a completely different answer if we omit the context. Asking
who was the president of the United States?
gives:
As of my last update in 2023, the President of the United States is Joe Biden. [...]
If you are analyzing historical documents on WWII, you would like the model to infer the context (i.e., WWII) from such documents. Retrieval-Augmented Generation is a way to automatically do this.
RAG uses previously indexed documents to build a context based on the query, which is then added to construct the final prompt for the LLM.
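Conceptually, the pipeline looks like the sketch below; retrieve_relevant_chunks and ask_llm are hypothetical stand-ins for the retrieval and generation steps we build with LangChain later in the post.
def retrieve_relevant_chunks(query: str) -> list[str]:
    # Stand-in for the real retrieval step: search an index of your documents.
    return ["Felicette was the first cat sent to space (France, 1963)."]

def ask_llm(prompt: str) -> str:
    # Stand-in for the real generation step: call the language model.
    return f"(model answer based on: {prompt!r})"

def rag_answer(query: str) -> str:
    context = "\n\n".join(retrieve_relevant_chunks(query))  # retrieve
    prompt = f"Context:\n{context}\n\nQuestion: {query}"    # augment
    return ask_llm(prompt)                                  # generate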
Theory
You can find more details on RAG and its theoretical underpinnings in my notes on RAG.
Preliminaries
I'll assume you're familiar with Python and basic package management. Besides that, there are two things you need to do before following this guide:
- Install Ollama and download a model that suits your hardware, for example with ollama pull llama3.2:1b (you can also use commercial, paid models, like OpenAI's, by swapping in the appropriate LangChain classes).
- Prepare a Python environment (a virtual environment is recommended) with all the necessary dependencies:
pip install langchain langchain-community langchainhub langchain-ollama
You’re now ready to build a RAG application.
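As a quick, optional sanity check (assuming you pulled llama3.2:1b, the model used throughout this post), you can verify that Ollama and langchain-ollama talk to each other:
# Optional sanity check: if this prints a greeting, the local model is reachable.
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.2:1b")
print(llm.invoke("Say hello in one short sentence.").content)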
Indexing 🗂️
The first step in creating a RAG system is to load and index the documents of interest. To demonstrate the potential of RAG, I'll use this cat facts dataset, which has already been cleaned and converted to plain text.
Warning
The quality of indexed documents is crucial for the performance of your RAG system. The way in which content is organized heavily impacts the LLM responses.
Loading and splitting 📂
First, we load all files to be indexed. LangChain has a broad selection of Document Loaders and you won't be short on options! (see also the API Documentation).
from langchain.document_loaders import TextLoader
file_path = "./cat-facts.txt"
loader = TextLoader(file_path, encoding="utf8")
docs = loader.load()
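For example, if your documents were spread across several text files instead of a single one, a DirectoryLoader could load them all at once (a sketch; the ./facts/ folder is hypothetical):
# Hypothetical alternative: load every .txt file in a folder with one loader.
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader("./facts/", glob="**/*.txt", loader_cls=TextLoader)
docs = loader.load()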
After loading, documents are split into chunks. Splitting helps overcome issues such as unequal document lengths and the limited context of LLMs. When documents are to be used with an LLM, it is a good idea to perform a tokenization-based split (a sketch of one is given at the end of this subsection). To keep things simple, we split on the newline separator \n, which is consistent with the content of our file. More on LangChain splitters can be found in the documentation.
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
separator="\n",
chunk_size=100,
chunk_overlap=0,
length_function=len,
)
all_splits = text_splitter.split_documents(docs)
I used short chunks (chunk_size=100) without overlap (chunk_overlap=0), but results may vary: you should experiment with these hyperparameters (welcome to the machine learning world!).
Warning
This step may generate several warnings saying that some splits exceed the 100-character limit. I decided to ignore them, but it may be better to use a higher chunk_size.
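If you prefer the tokenization-based split mentioned earlier, the splitters also offer a from_tiktoken_encoder constructor; a minimal sketch, assuming the tiktoken package is installed (chunk sizes are then measured in tokens rather than characters):
# Sketch of a tokenization-based split; requires the tiktoken package.
from langchain_text_splitters import CharacterTextSplitter

token_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=100,
    chunk_overlap=0,
)
token_splits = token_splitter.split_documents(docs)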
Storing vectors 🗄️
Next comes indexing, where documents are transformed into embedding vectors and stored in a vector store (so that they can later be retrieved). For this transformation we need an embedding model. I used OllamaEmbeddings because I'm using an Ollama chat model; however, the embedding model is not required to match the chat model, since the retrieved documents are translated back to text to create the prompt.
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="llama3.2:1b")
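If you're curious what these vectors look like, embed_query returns the raw embedding for a piece of text:
# Embeddings are just lists of floats; inspect the dimension and a few components.
vector = embeddings.embed_query("Cats sleep for most of the day.")
print(len(vector), vector[:5])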
It is worth mentioning that the llama3.2:1b model is running locally on my M1 laptop using Ollama. We are now ready to convert the documents into vectors and put them in the store. In this example I used InMemoryVectorStore from LangChain, but you can swap it based on your needs; see the LangChain Vector stores documentation (a persistent alternative is sketched after the snippet below).
from langchain_core.vectorstores import InMemoryVectorStore
vector_store = InMemoryVectorStore(embeddings)
vector_store.add_documents(documents=all_splits)
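For example, a persistent store could be sketched with Chroma (assuming the extra langchain-chroma package is installed; the collection name and path below are my own choices):
# Hedged alternative: a persistent vector store, so documents survive between runs.
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="cat-facts",
    embedding_function=embeddings,
    persist_directory="./chroma_db",  # hypothetical local path
)
vector_store.add_documents(documents=all_splits)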
Retrieval 🔍
Having stored all the documents, we can use them to retrieve relevant information for any given query; we do that by running a similarity search on the store.
query = "Were cats ever sent to space?"
retrieved_docs = vector_store.similarity_search(query)
docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)
Results are joined into a single text, with the retrieved documents separated by two newlines (\n\n).
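The search itself can be tuned; for instance, k limits how many chunks come back, and similarity_search_with_score exposes the similarity scores (a small sketch):
# Optional tuning: fewer chunks, and a look at the similarity scores.
top_docs = vector_store.similarity_search(query, k=2)
scored = vector_store.similarity_search_with_score(query)
for doc, score in scored:
    print(round(score, 3), doc.page_content[:60])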
Generation 🤖
Finally, we can use the original query and the retrieved documents to construct a prompt and present it to our LLM. We could do this “manually”, but LangChain has a hub of pre-compiled prompts we can use; in particular, there is one for RAG.
from langchain import hub
prompt = hub.pull("rlm/rag-prompt")
prompt = prompt.invoke({"question": query, "context": docs_content})
In this case, the prompt is like a template that we fill out with the question and the context.
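If you'd rather not depend on the hub, a roughly equivalent template can be written by hand; a sketch (the wording below is my own, not the hub's rlm/rag-prompt):
# Manual alternative to the hub prompt: define the template yourself.
from langchain_core.prompts import ChatPromptTemplate

manual_prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
prompt = manual_prompt.invoke({"question": query, "context": docs_content})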
from langchain_ollama import ChatOllama
llm = ChatOllama(model="llama3.2:1b")
answer = llm.invoke(prompt)
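Optionally, ChatOllama can also stream the answer token by token instead of returning it all at once:
# Optional: print the response as it is generated.
for chunk in llm.stream(prompt):
    print(chunk.content, end="", flush=True)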
In one case, I got the following answer, which has a useful first part and a less useful second part.
Yes, cats have been sent to space, with Felicette being the first, launched
by France in 1963 and returning safely. The average cat food meal is equivalent
to about five mice, suggesting that many cats are fed enough to sustain
themselves. Despite their popularity, it's unclear how often or under what
conditions cats are actually shown at fairs and exhibitions, but they have
certainly been the subject of cat shows worldwide.
Here is another (very similar) output from the same model
Yes, the French cat Felicette, also known as "Astrocat," was sent to space in 1963.
In contrast, cats are typically fed a diet consisting of around five mice per meal
on average. The first recorded cat show took place in London in 1871.
Wrapping up 🌯
All it takes is a little preparation: installing Ollama and setting up a proper Python environment.
# Optional (recommended): create a virtual environment
python -m venv .venv
source .venv/bin/activate # macOS and Linux
# Install minimal dependencies
pip install langchain langchain-community langchainhub langchain-ollama
Here's the full code for our RAG application. You can copy, paste, and run this to see RAG in action!
# IMPORTS
from langchain import hub
from langchain.document_loaders import TextLoader
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_text_splitters import CharacterTextSplitter
# INDEXING
# Load document
file_path = "./cat-facts.txt"
loader = TextLoader(file_path, encoding="utf8")
docs = loader.load()
# Split document into smaller chunks
text_splitter = CharacterTextSplitter(
separator="\n",
chunk_size=100,
chunk_overlap=0,
length_function=len,
)
all_splits = text_splitter.split_documents(docs)
# Construct embeddings for the documents and store in memory
embeddings = OllamaEmbeddings(model="llama3.2:1b")
vector_store = InMemoryVectorStore(embeddings)
vector_store.add_documents(documents=all_splits)
# RETRIEVAL
# Query the store for documents similar to the query
query = "Were cats ever sent to space?"
retrieved_docs = vector_store.similarity_search(query)
docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)
# GENERATION
# Create a prompt
prompt = hub.pull("rlm/rag-prompt")
prompt = prompt.invoke({"question": query, "context": docs_content})
# Create the chat and send the prompt
llm = ChatOllama(model="llama3.2:1b")
answer = llm.invoke(prompt)
print(answer.content)
Conclusions 👋🏼
In this post I presented a simple RAG application that uses a text file and a locally running LLM to perform augmented text generation. This is an invaluable way to start tinkering with your documents and Large Language Models without going through all the costs and issues of a full fine-tuning. I barely scratched the surface of the LangChain library; subsequent posts will focus on ways to make the above code more flexible, so stay tuned 📻.
🐻 📆 Bear of the day: Brunbjörn (Brun)
