Retrieval-Augmented Generation (RAG)

Analogy
Sometimes we have to take a test without enough time to prepare. If we are lucky, the test is online and unsupervised, and the only obstacle is finding the answers in time. We can prepare for such a test by carefully arranging our material and devising an efficient search strategy: when we see a question, we look up the answer in our material. RAG is the equivalent of this strategy for LLMs.
RAG is a way to automatically enhance the prompts presented to Large Language Models (LLMs). There are three main steps in RAG:
  1. Indexing: the documents used to enhance the prompt are stored in a database as embedding vectors.
  2. Retrieval: the database is searched for information that is relevant to the user query.
  3. Generation: a prompt containing the original query and the retrieved context is submitted to an LLM.
The next sections explain in more detail what happens during each of the three steps above. A schematic representation of the RAG process is shown below.
Schematic representation of RAG.
By Turtlecrown - Own work, CC BY-SA 4.0, Original

Indexing

If you are asked to answer questions about some documents, you want an index that helps you find the content relevant to each question (query). Such an index should be created in advance so that information can be retrieved quickly.
In RAG all the documents that contain relevant information are indexed into a vector database, which is a special type of database designed to store embedding vectors.

Indexing documents in RAG. The text Bear Generation is translated into a sequence of embedding vectors. The sequence is then stored in a vector database.

Indexing involves the following phases:
  1. Pre-process the documents, converting them to plain text if needed (embedding models and LLMs work with text-based data);
  2. Chunk the documents into smaller units (for example, lines or paragraphs);
  3. Encode each chunk into a sequence of embedding vectors; and
  4. Store the embeddings in the vector database.
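
Below is a minimal sketch of these phases in Python. It assumes the sentence-transformers package is available; the model name is an illustrative choice, and a NumPy matrix stands in for a real vector database. For simplicity, each chunk is encoded into a single sentence-level vector rather than a sequence of vectors.

import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Bears are large mammals found in forests and mountains.",
    "The polar bear lives in the Arctic and hunts seals.",
    "Pandas feed almost exclusively on bamboo.",
]

# Phase 1: pre-processing is skipped because the documents are already plain text.
# Phase 2: chunking; in this toy example each document is small enough to be one chunk.
chunks = documents

# Phase 3: encode the chunks into embedding vectors.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
embeddings = encoder.encode(chunks, normalize_embeddings=True)

# Phase 4: "store" the embeddings; here a NumPy array plays the role of the vector database.
vector_db = np.asarray(embeddings)
print(vector_db.shape)  # (number of chunks, embedding dimension)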

Retrieval

When a query is issued by the user, the RAG system looks for relevant information in the vector database. The retrieved text is used to augment the prompt for the LLM. The phases involved in retrieval are:
  1. Encode the query into embedding vectors using the same encoding used during indexing;
  2. Search the database using the encoded query;
  3. Decode the retrieved embedding vectors into text; and
  4. Construct a text prompt for the LLM.

When the user submits a query to the system (say, Do you know any bear?), the RAG system searches the vector database for relevant documents. The resulting information (Bear Generation and Bear in mind in the image) is included in the prompt submitted to the LLM.
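
The sketch below continues the toy example from the indexing section and performs the retrieval phases by hand: the query is encoded with the same encoder, the search is a cosine-similarity comparison (a dot product, since the vectors are normalized), and decoding amounts to looking up the original text of the best-matching chunks. Model name and data are illustrative.

import numpy as np
from sentence_transformers import SentenceTransformer

chunks = [
    "Bears are large mammals found in forests and mountains.",
    "The polar bear lives in the Arctic and hunts seals.",
    "Pandas feed almost exclusively on bamboo.",
]
encoder = SentenceTransformer("all-MiniLM-L6-v2")
vector_db = encoder.encode(chunks, normalize_embeddings=True)

# Phase 1: encode the query with the same encoder used during indexing.
query = "Do you know any bear?"
query_vec = encoder.encode([query], normalize_embeddings=True)[0]

# Phase 2: search; cosine similarity reduces to a dot product on normalized vectors.
scores = vector_db @ query_vec
top_k = np.argsort(scores)[::-1][:2]

# Phase 3: "decode" by looking up the original text of the best-matching chunks.
context = "\n".join(chunks[i] for i in top_k)

# Phase 4: construct the prompt (see the Generation section below).
print(context)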

Note
Often, the vector database already handles the encoding and the decoding (steps 1 and 3 above).
Note
Because LLMs already translate text into embeddings, it is natural to ask why we need to decode the retrieved embeddings back into text. There are several reasons for this:
  • We may not have access to the post-embedding part of the LLM (for example, when using public models like ChatGPT);
  • The embeddings used by the LLM and by the database may be different.
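
As an illustration of the first note, many vector databases accept text directly and handle the encoding and decoding internally. The sketch below assumes the chromadb package (the Chroma vector database) is installed and relies on its default embedding function; the collection name and documents are illustrative.

import chromadb

client = chromadb.Client()  # in-memory instance
collection = client.create_collection(name="bear_docs")

# Text in, text out: the database computes the embeddings internally.
collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "Bears are large mammals found in forests and mountains.",
        "Pandas feed almost exclusively on bamboo.",
    ],
)

results = collection.query(query_texts=["Do you know any bear?"], n_results=1)
print(results["documents"])  # retrieved text, ready to be placed in the prompt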

Generation

The final step is simply the submission of the original query along with the context, which is the text obtained in the retrieval step.
It may be useful to create a prompt that clearly states what is the context and what is the query, for example:
Given the following contextual information
<<<PUT HERE CONTEXT>>>

Answer the following query
<<<PUT HERE QUERY>>>
This example shows how the context and the query are clearly identified (also using <<< ... >>> separators) and kept apart to help the LLM construct an answer.
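
As a final sketch, the snippet below assembles such a prompt from the retrieved context and the user query. The call_llm function is a hypothetical placeholder for whatever LLM client is actually used (a hosted API or a local model), not a real library function.

def build_prompt(context: str, query: str) -> str:
    # Mirrors the template above, with <<< ... >>> separators around each part.
    return (
        "Given the following contextual information\n"
        f"<<<{context}>>>\n"
        "\n"
        "Answer the following query\n"
        f"<<<{query}>>>"
    )

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: replace with a real call to your LLM of choice.
    raise NotImplementedError

context = "Bears are large mammals found in forests and mountains."
query = "Do you know any bear?"
print(build_prompt(context, query))
# answer = call_llm(build_prompt(context, query))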