<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem:  Sanket Sudake</title>
    <description>The latest articles on Forem by  Sanket Sudake (@sanketsudake).</description>
    <link>https://forem.com/sanketsudake</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F383793%2F0e398e98-2224-4b29-9ffb-cbc09ff87787.jpeg</url>
      <title>Forem:  Sanket Sudake</title>
      <link>https://forem.com/sanketsudake</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sanketsudake"/>
    <language>en</language>
    <item>
      <title>Retrieval-Augmented Generation: Using your Data with LLMs</title>
      <dc:creator> Sanket Sudake</dc:creator>
      <pubDate>Thu, 18 Apr 2024 05:43:12 +0000</pubDate>
      <link>https://forem.com/infracloud/retrieval-augmented-generation-using-your-data-with-llms-1l7d</link>
      <guid>https://forem.com/infracloud/retrieval-augmented-generation-using-your-data-with-llms-1l7d</guid>
      <description>&lt;p&gt;Large Language Models (LLM) are great at answering questions based on which they have been trained on, mostly content available in the public domain. However, most enterprises want to use the LLMs with their specific data. Methods like LLM finetuning, RAG, and fitting data into the prompt context are different ways to achieve this.  &lt;/p&gt;

&lt;p&gt;This article explains the basics of Retrieval-Augmented Generation (RAG) and demonstrates how your data can be used with LLM-backed applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Retrieval Augmented Generation (RAG)
&lt;/h2&gt;

&lt;p&gt;Retrieval-Augmented Generation (RAG) involves retrieving specific data chunks from an indexed dataset and using a Large Language Model (LLM) to generate an answer to a given question or task. This process typically consists of two main components at a high level:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Retriever&lt;/strong&gt;: This component retrieves relevant data for the user-provided query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generator&lt;/strong&gt;: The generator augments the retrieved data (usually by putting it in the prompt context) and feeds it to the LLM to generate a relevant response.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.infracloud.io%2Fassets%2Fimg%2FBlog%2Fretrieval-augmented-generation-using-data-with-llms%2Fhigh-level-rag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.infracloud.io%2Fassets%2Fimg%2FBlog%2Fretrieval-augmented-generation-using-data-with-llms%2Fhigh-level-rag.png" alt="High-level RAG"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;(High-level RAG)&lt;/center&gt;
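
The two components above can be sketched in a few lines of Python. This is a minimal, illustrative sketch only: the word-overlap retriever and the echoing `call_llm` stub are stand-ins for the vector search and LLM API used later in the article.

```python
import re

# Minimal sketch of the two RAG components. The word-overlap retriever and
# the echoing call_llm stub are illustrative stand-ins, not the real
# vector search or LLM API described later.

def _words(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Retriever: return the k chunks sharing the most words with the query."""
    query_words = _words(query)
    ranked = sorted(chunks, key=lambda c: len(query_words & _words(c)), reverse=True)
    return ranked[:k]

def generate(query: str, context_chunks: list[str], call_llm) -> str:
    """Generator: augment the prompt with retrieved chunks, then ask the LLM."""
    prompt = "Context:\n" + "\n".join(context_chunks) + f"\n\nQuestion: {query}\nAnswer:"
    return call_llm(prompt)

chunks = [
    "John Doe: 8 years of experience, skilled in Kubernetes and Go.",
    "Jane Roe: 5 years of experience, skilled in Python and machine learning.",
]
top = retrieve("Which candidate is skilled in Kubernetes?", chunks, k=1)
answer = generate("Which candidate is skilled in Kubernetes?", top, call_llm=lambda p: p)
```

Swapping the toy pieces for a real embeddings model, vector store, and LLM call gives exactly the pipeline described in the rest of this article.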

&lt;h2&gt;
  
  
  SkillScout - Implementing Retrieval Augmented Generation (RAG)
&lt;/h2&gt;

&lt;p&gt;While the above explanation gives a high-level overview of RAG, below we walk through an example RAG implementation for a use case explored internally at InfraCloud.  &lt;/p&gt;

&lt;p&gt;It started with an internal ask from the talent management team:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Team, as of today, when we are looking for a specific skill for a customer among our talent, we mostly scan through updated resumes or try to match from memory whether a person has a certain skillset. This is not very efficient.&lt;br&gt;&lt;br&gt;
We can explore whether we can leverage an LLM to help us query our resume dataset better.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let's say we have a dataset of resumes/CVs of candidates from an organization, and the talent search/management team is looking to extract valuable insights from the dataset via a ChatGPT-like question-answering interface. We nickname our solution “SkillScout”.&lt;/p&gt;

&lt;p&gt;Sample prompts/questions we are expecting from users of the solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Show me candidates skilled in {skill}”&lt;/li&gt;
&lt;li&gt;“Show me candidates with {certification name} certification”&lt;/li&gt;
&lt;li&gt;“Find resume with {x-y} years of experience”&lt;/li&gt;
&lt;li&gt;“What are the skills of {candidate} ?”&lt;/li&gt;
&lt;li&gt;“How much experience does {candidate} have?”&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;This section gives a brief overview of different components within our solution and data flow amongst them. &lt;/p&gt;

&lt;p&gt;We use &lt;a href="https://python.langchain.com/docs/get_started/introduction" rel="noopener noreferrer"&gt;langchain&lt;/a&gt; and &lt;a href="https://platform.openai.com/docs/guides/text-generation" rel="noopener noreferrer"&gt;OpenAI text generation&lt;/a&gt; APIs for the implementation, but most of the parts in the pipelines below are customizable. You can refer to the source in the &lt;a href="https://github.com/infracloudio/skillscout/tree/main" rel="noopener noreferrer"&gt;SkillScout GitHub Repo&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
E.g., the LLM or embeddings model can be swapped with any compatible implementation, as you can see in &lt;a href="https://github.com/infracloudio/skillscout/tree/main/core" rel="noopener noreferrer"&gt;the core package&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;We have three pipelines at a high level:&lt;/p&gt;
&lt;h4&gt;
  
  
  Document Preprocessing pipeline
&lt;/h4&gt;

&lt;p&gt;This pipeline handles the extraction of specific details from resumes in various PDF formats. Given the diverse structures of resumes, this step is crucial for retrieving relevant information based on user queries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.infracloud.io%2Fassets%2Fimg%2FBlog%2Fretrieval-augmented-generation-using-data-with-llms%2Fdocument-preprocessing-pipeline.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.infracloud.io%2Fassets%2Fimg%2FBlog%2Fretrieval-augmented-generation-using-data-with-llms%2Fdocument-preprocessing-pipeline.png" alt="Document Preprocessing Pipeline"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;(Document Preprocessing Pipeline)&lt;/center&gt;

&lt;p&gt;The summary prompt below is sent to the LLM along with the resume data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Please extract the following data from the resume text provided in
triple quotes below and provide them only in format specified:

Name of the candidate:
Professional Summary:
Total years of experience:
Skills (comma separated):
Companies worked for (comma separated):
Education:
Certifications (comma separated):

Extract the candidate’s total years of experience if it is available
in the resume text, or calculate the total years of experience based
on the dates provided in the resume text.

If you are calculating, then assume the current month and year are Jan 2024.

Resume Text:
{text}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
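
To make the preprocessing step concrete, the prompt above can be filled with raw resume text by a simple template function before the LLM call. This is a hedged sketch: the abbreviated template and the triple-quote wrapping are illustrative assumptions, and `summarize`'s `call_llm` parameter stands in for a real OpenAI/langchain chat-completion call.

```python
# Fill the summary prompt with raw resume text before sending it to the LLM.
# The template is an abbreviated, illustrative version of the prompt above;
# wrapping the resume text in triple quotes matches the prompt's own instruction.

SUMMARY_PROMPT = (
    "Please extract the following data from the resume text provided in\n"
    "triple quotes below and provide them only in the format specified:\n\n"
    "Name of the candidate:\n"
    "Total years of experience:\n"
    "Skills (comma separated):\n\n"
    'Resume Text:\n"""{text}"""\n'
)

def build_summary_prompt(resume_text: str) -> str:
    """Substitute the extracted PDF text into the summary prompt."""
    return SUMMARY_PROMPT.format(text=resume_text)

def summarize(resume_text: str, call_llm) -> str:
    """call_llm is a placeholder for an OpenAI / langchain completion call."""
    return call_llm(build_summary_prompt(resume_text))
```

The structured, field-per-line response this prompt asks for is what gets indexed in the next pipeline.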



&lt;h4&gt;
  
  
  Vector Store Ingestion pipeline
&lt;/h4&gt;

&lt;p&gt;Once the resumes are processed and summarized, this pipeline stores and indexes all the resume data with a vector store. This ensures the information is readily available and easily accessible for the Retrieval-Augmented Generation (RAG) process.&lt;br&gt;&lt;br&gt;
We use “sentence-transformers” encodings for our pipeline, but &lt;a href="https://huggingface.co/spaces/mteb/leaderboard" rel="noopener noreferrer"&gt;the MTEB leaderboard&lt;/a&gt; is a good source to compare different embedding models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.infracloud.io%2Fassets%2Fimg%2FBlog%2Fretrieval-augmented-generation-using-data-with-llms%2Fvector-store-ingestion-pipeline.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.infracloud.io%2Fassets%2Fimg%2FBlog%2Fretrieval-augmented-generation-using-data-with-llms%2Fvector-store-ingestion-pipeline.png" alt="Vector Store Ingestion Pipeline"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;(Vector Store Ingestion Pipeline)&lt;/center&gt;
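
The ingestion step can be sketched as below. This is a toy stand-in under stated assumptions: a real pipeline would use a sentence-transformers model and a proper vector store, while the sparse bag-of-words embedding here only illustrates the embed-and-index data flow.

```python
import math
from collections import Counter

# Toy ingestion sketch: embed each resume summary and index it for similarity
# search. The sparse bag-of-words "embedding" is a stand-in for a real
# sentence-transformers model; VectorStore stands in for a vector database.

def embed(text: str) -> dict[str, float]:
    """L2-normalized word-count vector, stored sparsely as a dict."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    return sum(v * b.get(w, 0.0) for w, v in a.items())

class VectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self):
        self.items: list[tuple[dict[str, float], str]] = []

    def add(self, doc: str) -> None:
        self.items.append((embed(doc), doc))

    def search(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [doc for _, doc in ranked[:k]]
```

With a real embeddings model, `embed` returns a dense vector and `search` becomes an approximate-nearest-neighbor query, but the interface stays the same.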
&lt;h4&gt;
  
  
  Retrieval Augmented Generation pipeline
&lt;/h4&gt;

&lt;p&gt;When a new user query arrives, the query undergoes a formatting phase, where it is restructured based on past chat history and previous questions. This step ensures that the query is not only understandable but also aligns with the context of the ongoing conversation.  &lt;/p&gt;

&lt;p&gt;Next, the reformatted query is fed into our embeddings model, which converts it into an embedding vector.  &lt;/p&gt;

&lt;p&gt;With the query now embedded, we move on to a pivotal step: the vector similarity search. Here, we compare the embedded query against a vector store containing relevant information. This retrieval step fetches highly relevant and contextually appropriate data, setting the stage for a more informed response.  &lt;/p&gt;

&lt;p&gt;Armed with the retrieved information, we construct a prompt tailored to the LLM model, enriching it with the contextual data gleaned from the vector search. This custom prompt is then presented to the LLM, which leverages it to generate responses that are not just accurate but also deeply contextualized to the user's query and the ongoing conversation.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.infracloud.io%2Fassets%2Fimg%2FBlog%2Fretrieval-augmented-generation-using-data-with-llms%2Fretrieval-augmented-generation-pipeline.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.infracloud.io%2Fassets%2Fimg%2FBlog%2Fretrieval-augmented-generation-using-data-with-llms%2Fretrieval-augmented-generation-pipeline.png" alt="Retrieval Augmented Generation Pipeline"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;(Retrieval Augmented Generation Pipeline)&lt;/center&gt;
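
The steps above chain together roughly as follows. This is a hedged sketch: every collaborator passed into `answer_query` is a stub standing in for the real component (the condense-question LLM call, the embeddings model, the vector store retriever, and the LLM), not the project's actual interfaces.

```python
# Sketch of the RAG query path described above:
# reformulate -> embed -> retrieve -> augment -> generate.

def answer_query(question, chat_history, condense, embed, search, llm):
    standalone = condense(chat_history, question)   # 1. reformulate using history
    query_vec = embed(standalone)                   # 2. embed the query
    chunks = search(query_vec, k=3)                 # 3. vector similarity search
    prompt = (                                      # 4. augment the prompt
        "Answer the question only based on the following context.\n"
        "Context: " + " ".join(chunks) + "\n"
        f"Question: {standalone}\nAnswer:"
    )
    return llm(prompt)                              # 5. generate the response

# Wiring it up with trivial stubs:
result = answer_query(
    question="What's his total years of experience?",
    chat_history=["What skills does John Doe have?"],
    condense=lambda history, q: q.replace("his", "John Doe's"),
    embed=lambda text: text,                        # identity stand-in
    search=lambda vec, k: ["John Doe has 8 years of experience."],
    llm=lambda prompt: prompt,                      # echo stand-in
)
```

Keeping the collaborators as parameters like this is what makes the LLM or embeddings model swappable, as noted earlier.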

&lt;p&gt;Based on the chat history and the query from a user, we first check whether we should reformulate the question to reduce its dependency on the chat history.  &lt;/p&gt;

&lt;p&gt;Let's say we have two questions,&lt;br&gt;&lt;br&gt;
Query 1: What skills does John Doe have?&lt;br&gt;&lt;br&gt;
Query 2: What's his total years of experience?  &lt;/p&gt;

&lt;p&gt;Now in Query 2, &lt;strong&gt;&lt;em&gt;“his”&lt;/em&gt;&lt;/strong&gt; is a possessive pronoun which refers to John Doe.  We may rewrite Query 2 as “What’s John Doe’s total years of experience?”  &lt;/p&gt;

&lt;p&gt;This helps embed contextual information, such as chat history, into the query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chat History:
{chat_history}

Question: {question}

Given a chat history and the latest user question which might
reference context in the chat history, formulate a standalone question
which can be understood without the chat history.

Do NOT answer the question, just reformulate it if needed and
otherwise return it as is.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generation with retrieved content from the vector store is done via the following prompt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Context has resume data of candidates. 
Answer the question only based on the following context.
If you don't know the answer, just say that you don't know.
Be concise.

Context: {context}
Question: {question}
Answer:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  SkillScout demo
&lt;/h3&gt;

&lt;p&gt;We load our minimal resume data as mentioned in the &lt;a href="https://github.com/infracloudio/skillscout" rel="noopener noreferrer"&gt;SkillScout project README&lt;/a&gt;. After generating summaries of the resumes and loading the data into the vector store, we can open the chat interface via the app command.  &lt;/p&gt;

&lt;p&gt;You can see in the screenshot below that we ask about a particular candidate's experience and skill set. We also ask for people with certain experience. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.infracloud.io%2Fassets%2Fimg%2FBlog%2Fretrieval-augmented-generation-using-data-with-llms%2Fskillscout-resume-chatbot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.infracloud.io%2Fassets%2Fimg%2FBlog%2Fretrieval-augmented-generation-using-data-with-llms%2Fskillscout-resume-chatbot.png" alt="SkillScout resume chatbot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Enhancing RAG Implementation for production applications
&lt;/h2&gt;

&lt;p&gt;One of the key reasons behind RAG's popularity among application developers is its relatively lower barrier to entry compared to directly working with LLMs. With a background in application development, one can easily get started with RAG. Additionally, RAG allows developers to experiment with different foundational or customized LLMs and compare the results, offering flexibility and exploration.  &lt;/p&gt;

&lt;p&gt;However, a RAG implementation can be nuanced and complex, depending on factors such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy expectations&lt;/strong&gt;: The level of accuracy required for the task at hand will influence the design and implementation of the RAG system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination prevention&lt;/strong&gt;: Ensuring the generated responses are accurate and relevant, without introducing false or misleading information, is crucial.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dataset scale and nature&lt;/strong&gt;: The dataset's size and nature will impact the retrieval process and the overall performance of the RAG system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While the above implementation demonstrates the feasibility of the RAG approach, several considerations are crucial for deploying RAG in production applications.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Factors for Retrieved Chunks in RAG:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Relevance&lt;/strong&gt;: Chunks should be highly relevant to the query or context, ensuring that generated responses are on-topic and useful.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diversity&lt;/strong&gt;: Chunks should be diverse, avoiding redundancy and providing a broad range of information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contextuality&lt;/strong&gt;: Chunks should be contextual, considering the conversation history for coherent responses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt;: Ensure the retrieved information is accurate and factually correct, so incorrect facts are not passed on to the generator.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ranking&lt;/strong&gt;: The order of retrieved chunks matters; re-ranking steps post-retrieval and incorporating user feedback improves relevance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Optimizing the RAG Pipeline:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Optimizing the RAG pipeline is essential for generating better and more relevant responses. Considerations include:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Privacy and security&lt;/strong&gt;: Ensure personal and sensitive information is not revealed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data management policies&lt;/strong&gt;: Adhere to data management policies, including forgetting past data and injecting fresh data as per requirements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability, reliability, and cost-efficiency&lt;/strong&gt;: Ensure the RAG solution is scalable, reliable, and cost-efficient for enterprise usage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Addressing these considerations can optimize RAG implementations for production applications, providing users with accurate, diverse, and contextually relevant responses while maintaining privacy and security.  &lt;/p&gt;

&lt;p&gt;What are other datasets where you can consider using the RAG technique?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Dataset&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Doc QA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A chatbot can answer questions based on product documentation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Medical QA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;We can have a bot answering questions about different medicines based on actual manuals, etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Product QA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Product buyers/consumers can ask questions about the product info and any additional information.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Comparing RAG and LLM finetuning
&lt;/h2&gt;

&lt;p&gt;When it comes to integrating your data with LLMs, two prominent methods stand out: LLM finetuning and Retrieval-Augmented Generation (RAG). Each approach has its own advantages and drawbacks, and ongoing developments continuously enhance their efficacy.&lt;br&gt;&lt;br&gt;
In a practical solution, you may use a &lt;strong&gt;combination&lt;/strong&gt; of both to get better results.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;LLM Finetuning&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;RAG&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alters LLM&lt;/td&gt;
&lt;td&gt;No alterations to LLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Less adaptability/flexibility&lt;/td&gt;
&lt;td&gt;More adaptability/flexibility&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tightly coupled with LLM&lt;/td&gt;
&lt;td&gt;No tight dependency on the LLM; the model can be swapped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dataset size is constrained by the cost of finetuning&lt;/td&gt;
&lt;td&gt;No such dataset restrictions; cost increases marginally as the dataset grows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, Retrieval-Augmented Generation (RAG) represents a powerful approach to leveraging Large Language Models (LLMs) for tasks requiring contextual understanding and data retrieval.  &lt;/p&gt;

&lt;p&gt;RAG enables more accurate and contextually relevant responses, making it particularly useful for applications that provide tailored responses to users' queries by extracting information.  &lt;/p&gt;

&lt;p&gt;As advancements in GenAI technology continue, we can expect RAG to play an increasingly important role in enabling more intelligent and context-aware interactions with data. You can follow this article to train LLM-backed applications on your data with RAG; however, having an efficient, scalable cloud infrastructure for AI application building is crucial. For that, you can bring in &lt;a href="https://www.infracloud.io/build-ai-cloud/" rel="noopener noreferrer"&gt;AI &amp;amp; GPU Cloud experts&lt;/a&gt; onboard to help you out.  &lt;/p&gt;

&lt;p&gt;We hope you found this post informative and engaging. For more posts like this one, subscribe to our weekly newsletter. We’d love to hear your thoughts on this post, so do start a conversation on &lt;a href="https://www.linkedin.com/in/sanketsudake/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; :)&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2005.11401" rel="noopener noreferrer"&gt;Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2004.04906" rel="noopener noreferrer"&gt;Dense Passage Retrieval for Open-Domain Question Answering&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>vectordatabase</category>
      <category>llm</category>
    </item>
    <item>
      <title>Using CoreDNS Effectively with Kubernetes</title>
      <dc:creator> Sanket Sudake</dc:creator>
      <pubDate>Sat, 08 May 2021 14:56:54 +0000</pubDate>
      <link>https://forem.com/infracloud/using-coredns-effectively-with-kubernetes-1mcl</link>
      <guid>https://forem.com/infracloud/using-coredns-effectively-with-kubernetes-1mcl</guid>
      <description>&lt;h3&gt;
  
  
  Backstory
&lt;/h3&gt;

&lt;p&gt;We were increasing HTTP requests to one of our applications, hosted on a Kubernetes cluster, which resulted in a spike of 5xx errors.&lt;br&gt;
The application is a GraphQL server that calls a lot of external APIs and then returns an aggregated response.&lt;br&gt;
Our initial response was to increase the number of replicas for the application to see if that improved performance and reduced errors.&lt;br&gt;
As we drilled down further with the application developer, we found that most of the failures were related to DNS resolution.&lt;br&gt;
That’s where we started digging into DNS resolution in Kubernetes.&lt;/p&gt;

&lt;p&gt;This post highlights our learnings about CoreDNS from the deep dive we did in the process of troubleshooting.&lt;/p&gt;
&lt;h3&gt;
  
  
  CoreDNS Metrics
&lt;/h3&gt;

&lt;p&gt;A DNS server stores records in its database and answers domain name queries using that database.&lt;br&gt;
If the DNS server doesn’t have the data, it tries to get an answer from other DNS servers.&lt;/p&gt;

&lt;p&gt;CoreDNS became the &lt;a href="https://kubernetes.io/blog/2018/12/03/kubernetes-1-13-release-announcement/#coredns-is-now-the-default-dns-server-for-kubernetes"&gt;default DNS service&lt;/a&gt; from Kubernetes 1.13 onwards.&lt;br&gt;
Nowadays, when you are using a managed Kubernetes cluster or you are self-managing a cluster for your application workloads, you often focus on tweaking your application but not much on the services provided by Kubernetes or how you are leveraging them.&lt;br&gt;
DNS resolution is the basic requirement of any application so you need to ensure it’s working properly.&lt;br&gt;
We would suggest looking at &lt;a href="https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/"&gt;dns-debugging-resolution&lt;/a&gt; troubleshooting guide and ensure your CoreDNS is configured and running properly.&lt;/p&gt;

&lt;p&gt;Whenever you provision a cluster, you should have a dashboard to observe key CoreDNS metrics.&lt;br&gt;
To get CoreDNS metrics, you should have the &lt;a href="https://coredns.io/plugins/metrics/"&gt;Prometheus plugin&lt;/a&gt; enabled as part of the CoreDNS config.&lt;/p&gt;

&lt;p&gt;Below is a sample config using the &lt;code&gt;prometheus&lt;/code&gt; plugin to enable metrics collection from a CoreDNS instance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
      pods verified
      fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Following are the key metrics we would suggest having in your dashboard.&lt;br&gt;
If you are using Prometheus, DataDog, Kibana, etc., you may find a ready-to-use dashboard template from the community or your provider.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache Hit percentage: Percentage of requests responded using CoreDNS cache&lt;/li&gt;
&lt;li&gt;DNS requests latency

&lt;ul&gt;
&lt;li&gt;CoreDNS: Time taken by CoreDNS to process DNS request&lt;/li&gt;
&lt;li&gt;Upstream server: Time taken to process DNS request forwarded to upstream&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Number of requests forwarded to upstream servers&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.iana.org/assignments/dns-parameters/dns-parameters.xhtml#dns-parameters-6"&gt;Error codes&lt;/a&gt; for requests

&lt;ul&gt;
&lt;li&gt;NXDomain: Non-Existent Domain&lt;/li&gt;
&lt;li&gt;FormErr: Format Error in DNS request&lt;/li&gt;
&lt;li&gt;ServFail: Server Failure&lt;/li&gt;
&lt;li&gt;NoError: No Error, successfully processed request&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;CoreDNS resource usage: Different resources consumed by server such as memory, CPU etc.&lt;/li&gt;
&lt;/ul&gt;
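
The panels above can be backed by Prometheus queries over the series CoreDNS exports. The queries below are a hedged sketch: the metric names follow recent CoreDNS versions (older releases used different names), so verify against your own /metrics endpoint before building dashboards on them.

```
# Cache hit percentage
100 * sum(rate(coredns_cache_hits_total[5m]))
  / (sum(rate(coredns_cache_hits_total[5m])) + sum(rate(coredns_cache_misses_total[5m])))

# p99 CoreDNS request latency
histogram_quantile(0.99, sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le))

# Requests forwarded to upstream servers
sum(rate(coredns_forward_requests_total[5m])) by (to)

# Responses by return code (NXDomain, ServFail, NoError, ...)
sum(rate(coredns_dns_responses_total[5m])) by (rcode)
```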

&lt;p&gt;We were using DataDog for specific application monitoring. Following is just a sample dashboard I built with DataDog for my analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OqqePNLD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/etbh6bha1r74znwjmyae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OqqePNLD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/etbh6bha1r74znwjmyae.png" alt="DataDog CoreDNS dashboard"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Reducing DNS Errors
&lt;/h3&gt;

&lt;p&gt;As we started drilling down more into how the application is making requests to CoreDNS, we observed most of the outbound requests happening through the application to an external API server.&lt;/p&gt;

&lt;p&gt;This is typically how resolv.conf looks in the application deployment pod.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nameserver 10.100.0.10
search kube-namespace.svc.cluster.local svc.cluster.local cluster.local us-west-2.compute.internal
options ndots:5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To understand how Kubernetes resolves an FQDN, note that the resolver tries DNS lookups at different levels.&lt;/p&gt;

&lt;p&gt;Considering the above DNS config, when the DNS resolver sends a query to the CoreDNS server, it tries to search the domain considering the search path.&lt;/p&gt;

&lt;p&gt;If we are looking for the domain botkube.io, it would make the following queries, and only the last query receives a successful response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;botkube.io.kube-namespace.svc.cluster.local  &amp;lt;= NXDomain
botkube.io.svc.cluster.local &amp;lt;= NXDomain
botkube.io.cluster.local &amp;lt;= NXDomain
botkube.io.us-west-2.compute.internal &amp;lt;= NXDomain
botkube.io &amp;lt;= NoError
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
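
The query sequence above can be reproduced with a small model of the resolver's search-path expansion. This is a simplification of glibc's behaviour under stated assumptions (it ignores timeouts and other resolver options); the search list is copied from the resolv.conf shown earlier.

```python
# Simplified model of how the resolver expands a name using the search path
# and the ndots option from the resolv.conf shown above.

SEARCH_PATH = [
    "kube-namespace.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "us-west-2.compute.internal",
]

def lookup_order(name: str, search=SEARCH_PATH, ndots: int = 5) -> list[str]:
    if name.endswith("."):                    # already fully qualified
        return [name.rstrip(".")]
    if name.count(".") >= ndots:              # enough dots: try the name as-is first
        return [name] + [f"{name}.{domain}" for domain in search]
    # too few dots: walk the search path first, literal name last
    return [f"{name}.{domain}" for domain in search] + [name]
```

With the default `ndots:5`, `lookup_order("botkube.io")` yields the five queries listed above; lowering ndots to 1 makes the literal name the first query, so an external domain resolves in a single lookup.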



&lt;p&gt;As we were making too many external lookups, we were getting a lot of NXDomain responses for DNS searches.&lt;br&gt;
To optimize this, we customized &lt;code&gt;spec.template.spec.dnsConfig&lt;/code&gt; in the Deployment object.&lt;br&gt;
This is how the change looks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;     &lt;span class="na"&gt;dnsPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterFirst&lt;/span&gt;
     &lt;span class="na"&gt;dnsConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ndots&lt;/span&gt;
         &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the above change, resolv.conf in the pods changed.&lt;br&gt;
Search-path expansion was now skipped for external domains, since any name with at least one dot is tried as-is first.&lt;br&gt;
This reduced the number of queries to the DNS servers.&lt;br&gt;
It also helped reduce the 5xx errors for the application.&lt;br&gt;
You can notice the difference in the NXDomain response count in the following graph.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oph6EfYE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zmx156sjhscebauvk1ml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oph6EfYE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zmx156sjhscebauvk1ml.png" alt="CoreDNS Rcode NxDomain Reduction"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A better solution for this problem is &lt;a href="https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/"&gt;NodeLocal DNSCache&lt;/a&gt;, which was introduced in Kubernetes 1.18.&lt;/p&gt;

&lt;h3&gt;
  
  
  Customizing CoreDNS to your needs
&lt;/h3&gt;

&lt;p&gt;We can customize CoreDNS by using plugins.&lt;br&gt;
Kubernetes supports different kinds of workloads, and the standard CoreDNS config may not fit all your needs.&lt;br&gt;
CoreDNS has a number of in-tree and external plugins.&lt;br&gt;
Depending on the kind of workloads you run on your cluster (applications intercommunicating with each other, or standalone apps interacting with services outside your Kubernetes cluster), the kinds of FQDNs you try to resolve will vary.&lt;br&gt;
You should adjust the knobs of CoreDNS accordingly.&lt;br&gt;
Suppose you are running Kubernetes in a particular public/private cloud and most of the DNS-backed applications are in the same cloud.&lt;br&gt;
In that case, CoreDNS also provides cloud-specific or generic plugins that can be used to extend DNS zone records.&lt;/p&gt;

&lt;p&gt;If you are interested in customizing DNS behaviour for your needs, we would suggest going through the book “Learning CoreDNS” by Cricket Liu and John Belamaric.&lt;br&gt;
The book provides a detailed overview of different CoreDNS plugins and their use cases.&lt;br&gt;
It also covers the CoreDNS + Kubernetes integration in depth.&lt;/p&gt;

&lt;p&gt;Whether you are running an appropriate number of CoreDNS instances in your Kubernetes cluster is another critical decision.&lt;br&gt;
It’s recommended to run at least two instances of the CoreDNS server for a better guarantee that DNS requests will be served.&lt;br&gt;
Factors like the number of requests being served, the nature of those requests, the number of workloads running on the cluster, and the cluster size should help you decide how many CoreDNS instances you need.&lt;br&gt;
You may need to add extra instances of CoreDNS or configure HPA (Horizontal Pod Autoscaler) for your cluster.&lt;/p&gt;
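
One common way to right-size CoreDNS with cluster growth is the cluster-proportional-autoscaler described in the Kubernetes DNS horizontal autoscaling guide. The ConfigMap below is an illustrative sketch only: the names and numbers are example assumptions, not recommendations, so tune them against your own cluster.

```yaml
# cluster-proportional-autoscaler parameters (illustrative values).
# replicas = max(ceil(cores / coresPerReplica), ceil(nodes / nodesPerReplica), min)
apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-autoscaler
  namespace: kube-system
data:
  linear: |-
    {
      "coresPerReplica": 256,
      "nodesPerReplica": 16,
      "preventSinglePointFailure": true,
      "min": 2
    }
```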

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;This blog post highlights the importance of the DNS request cycle in Kubernetes. Many times you will end up in a situation where you start with “it’s not DNS” but end at “it’s always DNS!”.&lt;br&gt;
So be careful of these landmines.&lt;/p&gt;

&lt;p&gt;Enjoyed the article? Let's connect on &lt;a href="https://twitter.com/sanketsudake"&gt;Twitter&lt;/a&gt; and start a conversation.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>coredns</category>
      <category>devops</category>
      <category>dns</category>
    </item>
  </channel>
</rss>
