Summary

What is retrieval augmented generation?

Key components 

LangChain – a framework for building AI-powered features

Where is RAG used?

What else is needed?

What’s on the horizon for RAG?

RAG in AI – conclusions

Summary

  • Retrieval augmented generation (RAG) is an approach that combines large language models (LLMs) with external data retrieval to improve accuracy and relevance of the results in AI apps.
  • By sourcing contextual information via vector databases and integrating it into AI prompts, RAG reduces issues like hallucinations and improves reliability, especially in domain-specific or time-sensitive queries.
  • The key components of a good RAG system are a robust retrieval mechanism and an LLM, supported by frameworks like LangChain for streamlined development.
  • RAG is particularly impactful in applications like customer support, virtual assistants, and content summarization, offering scalable and tailored AI solutions by leveraging business-specific data.
  • Future advancements in interactivity, reasoning, and LLM capabilities promise even greater potential for RAG systems.

When OpenAI released GPT-3, it marked not just an upgrade in capabilities; it also set in motion a transformation in how programmers approach feature development with artificial intelligence. With unprecedented ease of access, large language models (LLMs) entered mainstream use, powering everything from chatbots to content generation tools. Yet this rapid evolution comes with obstacles of its own, such as hallucinations, prompt injections, and a lack of up-to-date information.

In this article, we’ll dive into one particularly pressing challenge: LLMs’ struggle with contextual understanding and up-to-date knowledge, which leads to hallucinations and inaccurate responses. But fear not! We’ll explore a solution known as retrieval augmented generation (RAG) – a promising approach that enhances AI’s conversational accuracy.

What is retrieval augmented generation?

In the context of AI, retrieval augmented generation (RAG) is an approach that improves the accuracy of AI apps by enhancing them with contextual data from external sources. It does this by combining information retrieval with the generation capabilities of large language models (LLMs).

Imagine a personal assistant that not only finds relevant factual information but also crafts contextually relevant responses. RAG uses a retrieval process to source data and a generation model to produce natural language output, making it ideal for chatbots, customer support, and any other application that requires specific business context.

Standard interaction with a large language model

To better understand this, let’s explore a typical interaction with an LLM. As someone from Złotoryja, a small town in Lower Silesia, Poland, I was curious if AI knows who’s currently the mayor of my hometown. So, I turned to ChatGPT for an answer.

Diagram illustrating the interaction between a user and a large language model (LLM). The user asks "Who is the mayor of Złotoryja?" and the LLM incorrectly responds "Zbigniew Szaleniec".

ChatGPT confidently informed me that the current mayor is “Zbigniew Szaleniec,” which translates to “Zbigniew the Crazy Man.” However, as a proud resident of Złotoryja, I can assert that while this information is very amusing, it is also untrue. This example illustrates a common issue with LLMs: hallucinations. They are prone to generating inaccurate or entirely fabricated answers, especially when asked domain-specific questions or when they lack up-to-date knowledge about the topic.

Introducing retrieval augmented generation to the mix

Large language models rely heavily on the context provided in prompts. By retrieving relevant information related to a user’s query and including it in the prompt, we can significantly enhance the likelihood of receiving a valid response. When accurate information is retrieved, it provides the model with a solid foundation, reducing hallucination risks and increasing reliability. This semantic search approach is central to RAG architecture.
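
In code, the idea is as simple as it sounds. Here is a minimal TypeScript sketch of prompt augmentation; `retrieveContext` is a hypothetical helper standing in for whatever retrieval mechanism you use, not a library function:

```typescript
// Hypothetical helper that returns text relevant to the question,
// e.g. excerpts from documents about local governance.
declare function retrieveContext(question: string): Promise<string>;

const question = "Who is the mayor of Złotoryja?";
const context = await retrieveContext(question);

// Instead of sending the bare question, we send a prompt grounded in the retrieved context.
const prompt = `Answer the question using only the context below.

Context:
${context}

Question: ${question}`;
```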

Diagram illustrating the Retrieval Augmented Generation (RAG) process in a large language model (LLM). It shows a user's question being augmented with information retrieved from the internet, files, and a vector database before being processed by the LLM to generate a valid answer.

For instance, identifying the current mayor of Złotoryja would involve retrieving data from sources such as databases or relevant documents about local governance. By integrating accurate and up-to-date information with LLM capabilities, we can achieve more reliable results.

Diagram showing a user asking a large language model (LLM) the question 'Who is the mayor of Złotoryja?'. The user's question is combined with context and a prompt before being processed by the LLM.  The context is drawn from information about local governments. The LLM then provides the correct answer.

Key components 

As we saw in the previous example, two primary components are crucial when developing retrieval augmented generation (RAG) systems: the AI model and the retrieval mechanism, typically facilitated by a vector database.

AI model

The AI model is crucial as it generates responses to user queries. The type of model you choose significantly impacts response quality and system effectiveness. If you’re just starting or working on a proof of concept (PoC), using closed-source models from providers like OpenAI or Anthropic is a great way to start. These models are efficient and easy to use, making them perfect for initial exploration.

For production-level retrieval augmented generation (RAG) systems, using your own or open-source models hosted on platforms like AWS Bedrock, Databricks, Google Cloud Vertex, or your own infrastructure is often preferable. This choice enhances data privacy by avoiding third-party data sharing and circumvents common issues like API availability, rate limits, and service disruptions. Self-hosting provides greater control and reliability for your RAG systems, making it ideal for production environments.

Retrieval

The second crucial component of a retrieval augmented generation (RAG) system is, as the name suggests, retrieval. This step plays a vital role in the overall performance of your system. Without accurately retrieving data relevant to the user’s query, the AI model’s response can be unsatisfactory or even incorrect. Ensuring that your retrieval process is fine-tuned is therefore essential.

In scenarios where the data volume is small, you might inject all of the relevant context directly into the prompt (provided it fits within the model’s context window). However, more often than not, the volume of data exceeds what can be accommodated in a single prompt. This makes it necessary to find an efficient method to filter through the information and extract only what’s important to the query.

This is where vector databases become invaluable. They excel at handling large datasets by organizing your source data in a way that allows for quick and precise retrieval of relevant information based on similarity searches. This capability ensures that your RAG system delivers accurate and contextually relevant responses, even when dealing with vast amounts of data.

Retrieval using vector databases

To achieve accurate answers with AI models, it’s crucial that the retrieval process returns highly relevant information. A widely used technique for this purpose is employing vector databases. These databases store vector embeddings, which represent original data in a form that facilitates efficient querying and retrieval of pertinent information.

The first step involves embedding the source data: converting various types of content, such as text or images, into vector representations using embedding models designed to capture semantic characteristics.

Diagram illustrating the process of creating embeddings and storing them in a vector database. Documents are passed through an embedding model, which converts them into numerical representations (embeddings) like '0.3 0.1 0.45...'. These embeddings are then stored in a vector database for efficient search and retrieval.

When you need to retrieve information, the query itself is also converted into a vector representation. Proximity algorithms then compare these query vectors with stored vectors using metrics like cosine similarity or Euclidean distance to identify matching documents efficiently.
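
To make the comparison step concrete, here is a minimal TypeScript sketch of cosine similarity, the metric most vector databases default to. Real databases compute this (or an approximation of it) over millions of stored vectors using specialized indexes:

```typescript
// Cosine similarity between two embedding vectors: values close to 1 mean the
// vectors point in the same direction, i.e. the texts are semantically similar.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```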

Diagram showing the workflow of a query in a system using embeddings and a vector database.

The process starts with a query which is passed through an embedding model to generate a numerical representation. This embedding is then used by similarity search algorithms (often defaulting to cosine similarity) to find relevant documents within the vector database.

This way, we can efficiently retrieve relevant information and use it in model prompts in real time.

To give you a better perspective on the entire process, here’s a typical workflow for using retrieval augmented generation (RAG), shown in the diagram below.

Diagram showing how a Large Language Model (LLM) answers user questions using a vector database. Data is pre-processed and embedded into a vector database. When a user asks a question, the system retrieves and ranks relevant information from the database before the LLM generates and refines a result.

As you can see in the diagram, there are three important steps in the RAG workflow that we haven’t discussed yet and that deserve a brief mention:

  1. preprocessing: gather, categorize, and cleanse unstructured data for storage in vector databases;
  2. post-processing: validate responses for format correctness or guideline alignment;
  3. reranking: use rerankers to improve recall performance by prioritizing relevant retrieved information. For more details on rerankers: Pinecone’s Guide on Rerankers.

LangChain – a framework for building AI-powered features

Building AI features can be a complex task, given the multitude of tools and techniques available. Fortunately, frameworks like LangChain are available for both Python and JavaScript/TypeScript, simplifying the development process with a range of useful abstractions that promote flexibility and scalability in AI systems.

For example, LangChain provides simple abstractions over chat models, allowing developers to switch between different providers effortlessly without getting bogged down by each provider’s API details. This is particularly beneficial for teams looking to build adaptable applications that can integrate with multiple AI services seamlessly.
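
For example, switching between OpenAI and Anthropic chat models is mostly a matter of constructing a different class; the rest of the pipeline stays the same. Here is a small sketch using the current @langchain/* packages (the model names are illustrative):

```typescript
import { ChatOpenAI } from "@langchain/openai";
import { ChatAnthropic } from "@langchain/anthropic";

// Both classes implement the same chat-model interface, so downstream code
// doesn't need to know which provider is behind it.
const model = process.env.USE_ANTHROPIC
  ? new ChatAnthropic({ model: "claude-3-5-sonnet-latest", temperature: 0 })
  : new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });

const response = await model.invoke("Explain retrieval augmented generation in one sentence.");
console.log(response.content);
```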

Beyond chat models, LangChain offers a wealth of tools such as document loaders, prompts, agents, and validators. These components are designed to streamline feature development across both Python and JS/TS environments, making it easier to implement robust AI solutions without having to reinvent the wheel.

However, LangChain isn’t without its challenges. The framework is under rapid development, which sometimes results in sudden changes that can lead to outdated documentation and problems with development. Despite these growing pains, LangChain’s popularity and the active community surrounding it provide optimism for its continued improvement and reliability.

Building features using retrieval augmented generation

Let’s build something exciting! With the basics covered, we can now put our knowledge into practice. Recently, I had the great honor of being a speaker at the 4Developers conference, so let’s create an AI 4Developers assistant.

We’ll use Pinecone as our vector database, taking advantage of its serverless option to save time and money by eliminating the need to host and manage the database ourselves. Additionally, we’ll use the LangChain toolset for building AI apps. This setup will allow us to take our first steps efficiently and effectively.

Keep in mind that I will be skipping the boring parts like adding .envs in this article! You can find all the details from this tutorial in my GitHub repo.

First, we need some data to get started. I’ve prepared a crawler that scans all the 4Developers events and scrapes information about every lecture in the conference’s history. This process produces a file containing objects with the following structure:
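
The exact schema lives in the repo, but based on the fields used later in this article (title, author, date, description, URL), each scraped lecture looks roughly like this; the field names below are illustrative, not the repository’s exact schema:

```typescript
// Illustrative shape of a single scraped lecture (field names are assumptions).
interface Lecture {
  title: string;       // e.g. "Inteligentny system na wyciągnięcie ręki – Architektura RAG"
  author: string;      // speaker name
  date: string;        // edition date, e.g. "2024-11-05"
  description: string; // abstract of the talk
  url: string;         // link to the lecture page
}
```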

Of course, you can use your own data on any subject. However, since we’re focusing on building a 4Developers assistant, it’s crucial to gather data specifically related to the lectures from the conference.

Next, we’ll use Pinecone to store all the information we’ve gathered for efficient retrieval. We’ll start by creating a serverless Index, where we can store our embeddings.

Screenshot of a Pinecone vector database management interface. The image highlights the 'Indexes' tab in the left-hand navigation menu and the 'Create index' button in the main content area, suggesting the first steps to getting started with creating a new index in Pinecone.

Navigate to the Indexes tab under Database and click ‘Create index’.

Screenshot of the "Create a new index" window in the Pinecone vector database platform. It shows options for configuring the new index, including selecting a pre-configured model, specifying the number of dimensions, and choosing a metric. The "Create index" button is highlighted, ready to finalize the index creation.

Then, pick a name and configuration for your serverless index. Let’s stick with the standard text-embedding-ada-002 embedding model and the cosine similarity metric.

Screenshot of the Pinecone vector database platform displaying details of an index named '4developers'. The index uses the 'text-embedding-ada-002' embedding model, 'cosine' metric, and has 1536 dimensions. It's hosted on AWS in the 'us-east-1' region with a 'Serverless' capacity mode. The index currently holds no records. Options to 'Add a Record' or 'Import data' are provided.

With the index created, we are ready to insert the data!
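
The insertion script isn’t reproduced here verbatim; a minimal version using LangChain’s Pinecone integration could look like the sketch below (the file name, the Lecture import path, and the split into @langchain/* packages are assumptions, not the repo’s exact code):

```typescript
import { readFile } from "node:fs/promises";
import { Pinecone } from "@pinecone-database/pinecone";
import { OpenAIEmbeddings } from "@langchain/openai";
import { PineconeStore } from "@langchain/pinecone";
import { Document } from "@langchain/core/documents";
import type { Lecture } from "./lecture"; // the shape sketched above (path is an assumption)

// Load the file produced by the crawler (filename is an assumption).
const lectures: Lecture[] = JSON.parse(await readFile("lectures.json", "utf-8"));

// Map each lecture to a LangChain Document: `pageContent` is what gets embedded,
// `metadata` travels alongside the vector and can later be used for filtering.
const documents = lectures.map(
  (lecture) =>
    new Document({
      pageContent: `${lecture.title}\n${lecture.description}`,
      metadata: { author: lecture.author, date: lecture.date, url: lecture.url },
    })
);

// Embed the documents and upsert them into the serverless index created above.
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
await PineconeStore.fromDocuments(
  documents,
  new OpenAIEmbeddings({ model: "text-embedding-ada-002" }),
  { pineconeIndex: pinecone.index("4developers") }
);
```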

After running the script, the documents are successfully loaded into our database. It’s important to highlight a key aspect of this process: each document inserted into the database consists of `content` and `metadata`.

  • Content: this is the core data that gets converted into vector embeddings. It represents the information we want to query by, such as lecture titles, abstracts, and other textual details.
  • Metadata: these are additional pieces of information associated with each embedding. Metadata can include elements like speaker names, lecture dates, or tags. Unlike page content, metadata allows for filtering in a manner similar to traditional databases, enabling more refined and targeted queries.

Screenshot of the Pinecone platform showing search results from a vector database. 10 matches were found, with the top two results displayed. Each result includes an ID, vector values, and metadata like author, date, description, and a final URL. The results are ranked by score, with the highest score (1.0000) at the top.

Now we are ready to hook everything up. First, let’s use LangChain to initialize the tools we’ll be using:
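
A minimal setup with the current @langchain/* packages might look like this (the model names and the number of retrieved documents are illustrative choices, not the repo’s exact configuration):

```typescript
import { Pinecone } from "@pinecone-database/pinecone";
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { PineconeStore } from "@langchain/pinecone";

// Chat model used for generation and the embedding model used for retrieval.
const model = new ChatOpenAI({ model: "gpt-4o", temperature: 0 });
const embeddings = new OpenAIEmbeddings({ model: "text-embedding-ada-002" });

// Connect to the existing "4developers" index and wrap it in a vector store.
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const vectorStore = await PineconeStore.fromExistingIndex(embeddings, {
  pineconeIndex: pinecone.index("4developers"),
});

// The retriever returns the most similar documents for a given query.
const retriever = vectorStore.asRetriever({ k: 5 });
```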

Next, we can create a chain that lets us invoke the LLM together with the retriever and a string output parser.
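
Building on the tools initialized above, a chain in this spirit can be composed with LangChain’s runnables; SYSTEM_PROMPT is shown in the next snippet. This is a sketch of the approach, not the repo’s exact code:

```typescript
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { RunnablePassthrough, RunnableSequence } from "@langchain/core/runnables";
import { formatDocumentsAsString } from "langchain/util/document";

const prompt = ChatPromptTemplate.fromMessages([
  ["system", SYSTEM_PROMPT],
  ["human", "{question}"],
]);

// The chain: retrieve documents for the question, format them into the prompt,
// call the model, and parse the response into a plain string.
const chain = RunnableSequence.from([
  {
    context: retriever.pipe(formatDocumentsAsString),
    question: new RunnablePassthrough(),
  },
  prompt,
  model,
  new StringOutputParser(),
]);
```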

Here is what the SYSTEM_PROMPT for the chain looks like:
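
The exact wording lives in the GitHub repo; the version below is an illustrative reconstruction that captures the key elements:

```typescript
// Illustrative system prompt (a reconstruction, not the original wording).
const SYSTEM_PROMPT = `You are an assistant for the 4Developers conference.
Answer the user's question using only the information provided in the context below.
If the context does not contain the answer, say that you don't know.

Context:
{context}`;
```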

The most important part in our case is the context placeholder:
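
In the illustrative prompt above, it looks like this:

```
Context:
{context}
```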

It interpolates the retrieved context into the prompt as a plain string, alongside the question.

So, after we retrieve specific documents, this section looks like this:
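
The exact retrieved text depends on what sits in the index; reconstructed from the answer shown further below (not copied from an actual run), the interpolated section could look roughly like this:

```
Context:
Inteligentny system na wyciągnięcie ręki – Architektura RAG
Radosław Karbowiak, 4Developers, November 5, 2024, Wrocław
Explores how RAG systems enable intelligent applications: key components,
monitoring, evaluation, and building a simple RAG with LangChain and Pinecone.
```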

Okay, it looks like everything is set up! Let’s give it a go!

Let’s ask about RAG presentations!
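
With the chain from the previous section in place, the question can be passed straight to it (a sketch, assuming the chain and retriever defined above):

```typescript
// The retriever supplies the relevant lecture documents; the model answers based on them.
const answer = await chain.invoke(
  "Were there any presentations about RAG at 4Developers?"
);
console.log(answer);
```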

And here we are!

Response: “Yes, the 4Developers event on November 5, 2024, in Wrocław, Poland, featured a presentation titled Inteligentny system na wyciągnięcie ręki – Architektura RAG by Radosław Karbowiak. This presentation explored the potential of retrieval augmented generation (RAG) systems in creating intelligent applications, discussing key components, best practices for monitoring, evaluation, and further development of RAG systems. It also included practical insights on building a simple RAG system using tools like LangChain and Pinecone.”

And that’s everything wired together. You can find the complete, runnable code in the GitHub repo mentioned earlier.

Where is RAG used?

Retrieval augmented generation systems are making waves across a variety of applications, from virtual assistants and chatbots providing customer support to code generation and content summarization. The versatility of RAG lies in its ability to enhance AI models not only by retrieving knowledge but also by sourcing sample answers to similar questions or efficiently sifting through data aggregated from multiple sources.

The real power of RAG systems becomes apparent when they are used to tap into your own business data or any valuable dataset that can inform the model’s responses. Without this retrieval aspect, you might just be deploying a basic “GPT wrapper,” which limits the potential of your AI implementation. By integrating effective retrieval strategies, you can unlock new opportunities and increase value, transforming standard AI applications into robust, insightful tools tailored to specific needs.

What else is needed?

When developing retrieval augmented generation (RAG) systems, there are several key considerations to keep in mind. 

Evaluation

Given the non-deterministic nature of AI, the rapid evolution of new models, and ever-changing requirements, maintaining a comprehensive evaluation dataset is crucial. It allows you to consistently assess whether your pipeline is functioning as intended. By using this evaluation data, you can define success metrics and fine-tune hyperparameters to optimize performance. Recently, tools like Promptfoo have emerged specifically to aid in this process, providing structured ways to test and refine your RAG implementations.

Monitoring

Once your RAG system is up and running, continuous monitoring becomes essential for ongoing development and optimization. You can opt for third-party solutions like LangSmith, which can be integrated seamlessly with frameworks such as LangChain. Alternatively, you might choose to collect logs independently and use visualization tools like Grafana to monitor the process. Each option has its benefits: third-party tools often offer ease of integration and specialized features, while self-hosted solutions provide greater control over data.

Monitoring your pipeline in a production environment is vital not only for spotting bugs but also for refining prompts based on real-world data. With effective monitoring, you can make informed adjustments that improve system accuracy and reliability over time.

What’s on the horizon for RAG?

As AI technology advances, retrieval augmented generation (RAG) systems are set to revolutionize how we interact with data. Here’s a glimpse of what’s next for RAG:

Interactive and intent-aware retrieval

The next generation of RAG systems promises to be far more interactive, taking cues from innovative platforms like Perplexity.AI. These systems will refine user queries by actively seeking additional input and leveraging intent detection technologies. This means that even when users provide vague or incomplete queries, the system can infer the underlying intent, leading to more precise and relevant results. By understanding the user’s end goal, RAG systems can not only improve accuracy but also enhance user satisfaction by delivering information that truly meets their needs.

Agentic RAG and complex reasoning

Future RAG implementations are expected to incorporate advanced reasoning capabilities, making them adept at managing complex, multi-step queries. By harnessing structured data sources such as knowledge graphs and utilizing text-to-SQL conversions, these systems will be able to draw connections across disparate data sets. This ability to interlink data points allows for a richer understanding of context and enables RAG systems to deliver deeper insights. Whether it’s piecing together information from multiple relevant documents or synthesizing complex relationships in data, these advancements position RAG as a powerful tool for solving complex problems.

Advances in large language models

The development of new large language models (LLMs) like Gemini 1.5 is significantly boosting the capabilities of RAG systems. With context windows extending beyond a million tokens and models fine-tuned specifically for RAG tasks, these LLMs offer unprecedented depth in processing vast amounts of contextual information. Such enhancements lead to greater accuracy in tasks like text summarization and question answering, where the ability to consider a broader array of information results in more comprehensive responses. As these models continue to evolve, they promise to improve the performance of RAG systems across various applications.

RAG in AI – conclusions

At its core, retrieval augmented generation (RAG) is a straightforward concept, yet it serves as a cornerstone for any advanced AI application. The distinction between a generic AI app, often referred to as a “GPT wrapper,” and one that offers unique value lies in the integration of specific business context through RAG. This powerful technique can elevate your AI applications to new heights, especially when there’s a need to leverage context from large volumes of data.

While this article introduces the foundational concepts of RAG, there’s much more beneath the surface. In particular, topics such as evaluating RAG pipelines, data chunking, safety considerations, optimizing parameters, observability, and mastering prompt engineering offer deeper insights into maximizing your system’s potential.

I hope this serves as a great starting point for your journey into RAG systems and inspires you to build your first retrieval augmented generation-powered application! To dive deeper, consider exploring resources online and keep an eye on Gorrion’s blog, where we will cover other AI topics in greater detail.



At Gorrion, I am a full-stack software developer specializing in the JavaScript ecosystem and AWS cloud technologies. I am an AWS-certified Solutions Architect and Developer, with a strong interest in computer science, particularly in architecture, artificial intelligence, and cloud computing. Outside of work, I enjoy playing sports and serving as a Game Master in paper RPGs. Seasonally, you can find me sailing or carving through the snow on my snowboard.
