When OpenAI released GPT-3, it marked not just an upgrade in capabilities; it also set in motion a transformation in how programmers approach feature development with artificial intelligence. With unprecedented ease of access, large language models (LLMs) entered mainstream use, powering everything from chatbots to content generation tools. Yet this rapid evolution hasn’t come without obstacles, such as hallucinations, prompt injections, and a lack of up-to-date information.
In this article, we’ll dive into one particularly pressing challenge: LLMs’ struggle with contextual understanding and up-to-date knowledge, which leads to hallucinations and inaccurate responses. But fear not! We’ll explore a solution known as retrieval augmented generation (RAG) – a promising approach that enhances AI’s conversational accuracy.
In the context of AI, retrieval augmented generation (RAG) is an approach that improves the accuracy of AI apps by enhancing them with contextual data from external sources. It does this by combining information retrieval with the generation capabilities of large language models (LLMs).
Imagine a personal assistant that not only finds relevant factual information but also crafts contextually relevant responses. RAG uses a retrieval process to source data and a generation model to produce natural language output, making it ideal for chatbots, customer support, and any other application that requires specific business context.
To better understand this, let’s explore a typical interaction with an LLM. As someone from Złotoryja, a small town in Lower Silesia, Poland, I was curious if AI knows who’s currently the mayor of my hometown. So, I turned to ChatGPT for an answer.
ChatGPT confidently informed me that the current mayor is “Zbigniew Szaleniec,” which translates to “Zbigniew the Crazy Man.” However, as a proud resident of Złotoryja, I can assert that while this information is very amusing, it is also untrue. This example illustrates a common issue with LLMs: hallucinations. They are prone to generating inaccurate or entirely fabricated answers, especially when asked domain-specific questions or when they lack up-to-date knowledge about the topic.
Large language models rely heavily on the context provided in prompts. By retrieving relevant information related to a user’s query and including it in the prompt, we can significantly enhance the likelihood of receiving a valid response. When accurate information is retrieved, it provides the model with a solid foundation, reducing hallucination risks and increasing reliability. This semantic search approach is central to RAG architecture.
For instance, identifying the current mayor of Złotoryja would involve retrieving data from sources such as databases or relevant documents about local governance. By integrating accurate and up-to-date information with LLM capabilities, we can achieve more reliable results.
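To make this concrete, here’s a minimal sketch of such an augmented prompt. The retrieved passage is just a placeholder – in a real system it would come from a database or document search.

```typescript
// A minimal sketch of prompt augmentation; the retrieved snippet is a placeholder.
const question = "Who is the current mayor of Złotoryja?";

// In a real RAG system, this would be fetched by a retrieval step
// (e.g. a vector database query or a lookup in official documents).
const retrievedContext =
  "<up-to-date passage about Złotoryja's local government from a trusted source>";

const augmentedPrompt = `Answer the question using only the context below.
If the context does not contain the answer, say that you don't know.

Context:
${retrievedContext}

Question: ${question}`;
```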
As we saw in the previous example, two primary components are crucial when developing retrieval augmented generation (RAG) systems: the AI model and the retrieval mechanism, typically facilitated by a vector database.
The AI model generates responses to user queries, so the type of model you choose significantly impacts response quality and overall system effectiveness. If you’re just starting out or working on a proof of concept (PoC), closed-source models from providers like OpenAI or Anthropic are a great way to begin. These models are efficient and easy to use, making them perfect for initial exploration.
For production-level retrieval augmented generation (RAG) systems, using your own or open-source models hosted on platforms like AWS Bedrock, Databricks, Google Cloud Vertex, or your own infrastructure is often preferable. This choice enhances data privacy by avoiding third-party data sharing and circumvents common issues like API availability, rate limits, and service disruptions. Self-hosting provides greater control and reliability for your RAG systems, making it ideal for production environments.
The second crucial component of a retrieval augmented generation (RAG) system is, as the name suggests, retrieval. This step plays a vital role in the overall performance of your system. Without accurately retrieving data relevant to the user’s query, the AI model’s response can be unsatisfactory or even incorrect. Ensuring that your retrieval process is fine-tuned is therefore essential.
In scenarios where data volume is small, you might inject all relevant context directly into the model (provided it fits within the model’s context window). However, more often than not, the volume of data exceeds what can be accommodated in a single prompt. This makes it necessary to find an efficient method to filter through the information and extract only what’s important to the query.
This is where vector databases become invaluable. They excel at handling large datasets by organizing your data in a way that allows for quick and precise retrieval of relevant information based on similarity searches. This capability ensures that your RAG system delivers accurate and contextually relevant responses, even when dealing with vast amounts of data.
To achieve accurate answers with AI models, it’s crucial that the retrieval process returns highly relevant information. A widely used technique for this purpose is employing vector databases. These databases store vector embeddings, which represent original data in a form that facilitates efficient querying and retrieval of pertinent information.
The first step involves embedding your source data by converting various types of content, such as text or images, into vector representations using embedding algorithms designed to capture semantic characteristics.
When you need to retrieve information, the query itself is also converted into a vector representation. Proximity algorithms then compare these query vectors with stored vectors using metrics like cosine similarity or Euclidean distance to identify matching documents efficiently.
This way, we can efficiently retrieve relevant information and use it in model prompts in real time.
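Conceptually, the search boils down to something like this hand-rolled sketch. It assumes the embeddings were already produced by an embedding model and stands in for what a real vector database does at scale:

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Given a query embedding, return the k most similar stored documents.
function topKSimilar(
  queryEmbedding: number[],
  documents: { text: string; embedding: number[] }[],
  k = 3
) {
  return documents
    .map((doc) => ({ ...doc, score: cosineSimilarity(queryEmbedding, doc.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```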
To give you a better perspective on the entire process, here’s a typical retrieval augmented generation (RAG) workflow, shown in the diagram below.
As you can see in the diagram, there are three important steps in the RAG workflow that we haven’t discussed yet and that deserve a brief mention:
Building AI features can be a complex task, given the multitude of tools and techniques available. Fortunately, frameworks like LangChain are available for both Python and JavaScript/TypeScript, simplifying the development process with a range of useful abstractions that promote flexibility and scalability in AI systems.
For example, LangChain provides simple abstractions over chat models, allowing developers to switch between different providers effortlessly without getting bogged down by each provider’s API details. This is particularly beneficial for teams looking to build adaptable applications that can integrate with multiple AI services seamlessly.
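For instance, here’s roughly what that looks like with LangChain’s JS/TS packages. The exact package and model names depend on the LangChain version you use, so treat this as a sketch:

```typescript
import { ChatOpenAI } from "@langchain/openai";
import { ChatAnthropic } from "@langchain/anthropic";

// Both classes expose the same chat model interface,
// so swapping providers is a one-line change.
const model = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });
// const model = new ChatAnthropic({ model: "claude-3-5-sonnet-20240620", temperature: 0 });

const response = await model.invoke(
  "Explain retrieval augmented generation in one sentence."
);
console.log(response.content);
```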
Beyond chat models, LangChain offers a wealth of tools such as document loaders, prompts, agents, and validators. These components are designed to streamline feature development across both Python and JS/TS environments, making it easier to implement robust AI solutions without having to reinvent the wheel.
However, LangChain isn’t without its challenges. The framework is under rapid development, which sometimes results in sudden changes that can lead to outdated documentation and problems with development. Despite these growing pains, LangChain’s popularity and the active community surrounding it provide optimism for its continued improvement and reliability.
Let’s build something exciting! With the basics covered, we can now put our knowledge into practice. Recently, I had the great honor of being a speaker at the 4Developers conference, so let’s create an AI 4Developers assistant.
We’ll use Pinecone as our vector database, taking advantage of its serverless option to save time and money by eliminating the need to host and manage the database ourselves. Additionally, we’ll use the LangChain toolset for building AI apps. This setup will allow us to take our first steps efficiently and effectively.
Keep in mind that I will be skipping the boring parts like adding .envs in this article! You can find all the details from this tutorial in my GitHub repo.
First, we need training data to get started. I’ve prepared a crawler that scans through all the events and scrapes information about every lecture in the history of 4Developers. This process has resulted in a file containing objects with the following structure:
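The exact structure is in the repo; a hypothetical shape of a single scraped lecture could look roughly like this (field names are illustrative):

```typescript
// Hypothetical shape of a single scraped lecture record.
interface Lecture {
  title: string;       // e.g. "Inteligentny system na wyciągnięcie ręki – Architektura RAG"
  speaker: string;     // e.g. "Radosław Karbowiak"
  description: string; // abstract of the talk
  date: string;        // ISO date of the event, e.g. "2024-11-05"
  city: string;        // e.g. "Wrocław"
  url: string;         // link to the lecture page
}
```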
Of course, you can use your own data on any subject. However, since we’re focusing on building a 4Developers assistant, it’s crucial to gather data specifically related to the lectures from the conference.
Next, we’ll use Pinecone to store all the information we’ve gathered for efficient retrieval. We’ll start by creating a serverless Index, where we can store our embeddings.
Navigate to the database and indexes section and click ‘Create index’.
Then, pick a name and configuration for your serverless index. Let’s stick with the standard text-embedding-ada-002 dimensions and the cosine similarity metric.
With the index created, we’re ready to insert the data!
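Here’s a sketch of what the ingestion script can look like with LangChain’s Pinecone integration. The index name, environment variables, and lecture fields are assumptions based on the setup above:

```typescript
import { Pinecone } from "@pinecone-database/pinecone";
import { PineconeStore } from "@langchain/pinecone";
import { OpenAIEmbeddings } from "@langchain/openai";
import { Document } from "@langchain/core/documents";
import lectures from "./lectures.json"; // the scraped data from the previous step

// Connect to the serverless index created in the Pinecone console.
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const pineconeIndex = pinecone.Index("4developers-lectures");

// Turn each lecture into a LangChain Document: the text goes into pageContent,
// everything else is kept as metadata.
const documents = lectures.map(
  (lecture: { title: string; speaker: string; description: string; date: string }) =>
    new Document({
      pageContent: `${lecture.title}\n${lecture.description}`,
      metadata: { title: lecture.title, speaker: lecture.speaker, date: lecture.date },
    })
);

// Embed the documents with text-embedding-ada-002 and upsert them into Pinecone.
await PineconeStore.fromDocuments(
  documents,
  new OpenAIEmbeddings({ model: "text-embedding-ada-002" }),
  { pineconeIndex }
);
```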
After running the script, the documents are successfully loaded into our database. It’s important to highlight a key aspect of this process: each document inserted into the database consists of `content` and `metadata`.
Now it’s time to hook everything up. First, let’s use LangChain to initialize the tools we’ll be using:
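A sketch of that setup, assuming the same index and embedding model as before, could look like this:

```typescript
import { Pinecone } from "@pinecone-database/pinecone";
import { PineconeStore } from "@langchain/pinecone";
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";

// The chat model that will generate the answers.
const model = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });

// Reconnect to the index that already contains our embedded lectures.
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const pineconeIndex = pinecone.Index("4developers-lectures");

const vectorStore = await PineconeStore.fromExistingIndex(
  new OpenAIEmbeddings({ model: "text-embedding-ada-002" }),
  { pineconeIndex }
);

// Retriever that returns the most similar documents for a query.
const retriever = vectorStore.asRetriever({ k: 4 });
```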
Next, we can create a chain that lets us invoke the LLM together with the retriever and a string output parser.
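Here’s a sketch of such a chain using LangChain’s runnables; it builds on the model and retriever from the previous snippet and on the SYSTEM_PROMPT shown below:

```typescript
import { RunnableSequence, RunnablePassthrough } from "@langchain/core/runnables";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { formatDocumentsAsString } from "langchain/util/document";

// SYSTEM_PROMPT (shown below) contains {context} and {question} placeholders.
const prompt = ChatPromptTemplate.fromTemplate(SYSTEM_PROMPT);

const chain = RunnableSequence.from([
  {
    // Retrieve documents for the question and join them into one string.
    context: retriever.pipe(formatDocumentsAsString),
    question: new RunnablePassthrough(),
  },
  prompt,
  model,
  new StringOutputParser(),
]);
```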
This is how the SYSTEM_PROMPT for the chain looks:
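I won’t reproduce the exact prompt from the repo here, but a minimal version with the same structure could look like this:

```typescript
const SYSTEM_PROMPT = `You are a helpful assistant for the 4Developers conference.
Answer the user's question using only the context below.
If the context does not contain the answer, say that you don't know.

Context:
{context}

Question: {question}`;
```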
The most important part in our case is the section that interpolates the retrieved context and the user’s question into the prompt as plain strings.
So, after we retrieve specific documents, this section looks like this:
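For illustration (using the lecture we’ll see in the response below), the interpolated section could look roughly like this:

```text
Context:
Inteligentny system na wyciągnięcie ręki – Architektura RAG
Speaker: Radosław Karbowiak | 4Developers, Wrocław, 2024-11-05
A talk about the key components of RAG systems and how to build a simple one
with LangChain and Pinecone.

Question: Were there any presentations about RAG at 4Developers?
```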
Okay, looks like everything is set up! We can try to give it a go!
Let’s ask about RAG presentations!
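Invoking the chain is a single call; the question below mirrors the one I asked:

```typescript
const answer = await chain.invoke(
  "Were there any presentations about RAG at 4Developers?"
);
console.log(answer);
```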
And here we are!
Response: “Yes, the 4Developers event on November 5, 2024, in Wrocław, Poland, featured a presentation titled Inteligentny system na wyciągnięcie ręki – Architektura RAG by Radosław Karbowiak. This presentation explored the potential of retrieval augmented generation (RAG) systems in creating intelligent applications, discussing key components, best practices for monitoring, evaluation, and further development of RAG systems. It also included practical insights on building a simple RAG system using tools like LangChain and Pinecone.”
Here is everything together – you can find the complete code in my GitHub repo.
Retrieval augmented generation systems are making waves across a variety of applications, from virtual assistants and chatbots providing customer support to code generation and content summarization. The versatility of RAG lies in its ability to enhance AI models not only by retrieving knowledge, but also by sourcing similar sample answers to previously asked questions or efficiently sifting through data aggregated from multiple sources.
The real power of RAG systems becomes apparent when they are used to tap into your own business data or any valuable dataset that can inform the model’s responses. Without this retrieval aspect, you might just be deploying a basic “GPT wrapper,” which limits the potential of your AI implementation. By integrating effective retrieval strategies, you can unlock new opportunities and increase value, transforming standard AI applications into robust, insightful tools tailored to specific needs.
When developing retrieval augmented generation (RAG) systems, there are several key considerations to keep in mind.
Given the non-deterministic nature of AI, the rapid evolution of new models, and ever-changing requirements, maintaining a comprehensive evaluation dataset is crucial. It allows you to consistently assess whether your pipeline is functioning as intended. By using this evaluation data, you can define success metrics and fine-tune hyperparameters to optimize performance. Recently, tools like Promptfoo have emerged specifically to aid in this process, providing structured ways to test and refine your RAG implementations.
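Even a small hand-rolled evaluation loop is a useful starting point. Here’s a conceptual sketch – the dataset and assertions are illustrative and not a substitute for a dedicated tool like Promptfoo:

```typescript
// A tiny, illustrative evaluation set: questions paired with phrases
// that a correct answer should contain.
const evalSet = [
  {
    question: "Were there any presentations about RAG at 4Developers?",
    mustContain: ["Radosław Karbowiak"],
  },
  // ...more cases covering typical and edge-case queries
];

let passed = 0;
for (const testCase of evalSet) {
  const answer = await chain.invoke(testCase.question);
  const ok = testCase.mustContain.every((phrase) => answer.includes(phrase));
  if (ok) passed += 1;
  else console.warn(`Failed: ${testCase.question}`);
}

console.log(`Passed ${passed}/${evalSet.length} evaluation cases`);
```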
Once your RAG system is up and running, continuous monitoring becomes essential for ongoing development and optimization. You can opt for third-party solutions like LangSmith, which can be integrated seamlessly with frameworks such as LangChain. Alternatively, you might choose to collect logs independently and use visualization tools like Grafana to monitor the process. Each option has its benefits: third-party tools often offer ease of integration and specialized features, while self-hosted solutions provide greater control over data.
Monitoring your pipeline in a production environment is vital not only for spotting bugs but also for refining prompts based on real-world data. With effective monitoring, you can make informed adjustments that improve system accuracy and reliability over time.
As AI technology advances, retrieval augmented generation (RAG) systems are set to revolutionize how we interact with data. Here’s a glimpse of what’s next for RAG:
The next generation of RAG systems promises to be far more interactive, taking cues from innovative platforms like Perplexity.AI. These systems will refine user queries by actively seeking additional input and leveraging intent detection technologies. This means that even when users provide vague or incomplete queries, the system can infer the underlying intent, leading to more precise and relevant results. By understanding the user’s end goal, RAG systems can not only improve accuracy but also enhance user satisfaction by delivering information that truly meets their needs.
Future RAG implementations are expected to incorporate advanced reasoning capabilities, making them adept at managing complex, multi-step queries. By harnessing structured data sources such as knowledge graphs and utilizing text-to-SQL conversions, these systems will be able to draw connections across disparate data sets. This ability to interlink data points allows for a richer understanding of context and enables RAG systems to deliver deeper insights. Whether it’s piecing together information from multiple relevant documents or synthesizing complex relationships in data, these advancements position RAG as a powerful tool for solving complex problems.
The development of new large language models (LLMs) like Gemini 1.5 is significantly boosting the capabilities of RAG systems. With context windows extending beyond a million tokens and models fine-tuned specifically for RAG tasks, these LLMs offer unprecedented depth in processing vast amounts of contextual information. Such enhancements lead to greater accuracy in tasks like text summarization and question answering, where the ability to consider a broader array of information results in more comprehensive responses. As these models continue to evolve, they promise to improve the performance of RAG systems across various applications.
At its core, retrieval augmented generation (RAG) is a straightforward concept, yet it serves as a cornerstone for any advanced AI application. The distinction between a generic AI app, often referred to as a “GPT wrapper,” and one that offers unique value lies in the integration of specific business context through RAG. This powerful technique can elevate your AI applications to new heights, especially when there’s a need to leverage context from large volumes of data.
While this article introduces the foundational concepts of RAG, there’s much more beneath the surface. In particular, topics such as evaluating RAG pipelines, data chunking, safety considerations, optimizing parameters, observability, and mastering prompt engineering offer deeper insights into maximizing your system’s potential.
I hope this serves as a great starting point for your journey into RAG systems and inspires you to build your first retrieval augmented generation-powered application! To dive deeper, consider exploring resources online and keep an eye on Gorrion’s blog, where we will cover other AI topics in greater detail.
At Gorrion, I am a full-stack software developer specializing in the JavaScript ecosystem and AWS cloud technologies. I am an AWS-certified Solutions Architect and Developer, with a strong interest in computer science, particularly in architecture, artificial intelligence, and cloud computing. Outside of work, I enjoy playing sports and serving as a Game Master in paper RPGs. Seasonally, you can find me sailing or carving through the snow on my snowboard.