<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kerem Nalbant</title>
    <description>The latest articles on Forem by Kerem Nalbant (@keremnalbant).</description>
    <link>https://forem.com/keremnalbant</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1313778%2Ffb2361fa-9d5a-4cd0-a7c1-4826cd179706.PNG</url>
      <title>Forem: Kerem Nalbant</title>
      <link>https://forem.com/keremnalbant</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/keremnalbant"/>
    <language>en</language>
    <item>
      <title>Check out how we boost efficiency in epilot with context-aware email suggestions!</title>
      <dc:creator>Kerem Nalbant</dc:creator>
      <pubDate>Mon, 26 May 2025 10:02:59 +0000</pubDate>
      <link>https://forem.com/keremnalbant/check-out-how-we-boost-efficiency-in-epilot-with-context-aware-email-suggestions-3oo5</link>
      <guid>https://forem.com/keremnalbant/check-out-how-we-boost-efficiency-in-epilot-with-context-aware-email-suggestions-3oo5</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/epilot/how-we-integrate-ai-in-epilot-chapter-2-serverless-rag-w-langchain-weaviate-5d93" class="crayons-story__hidden-navigation-link"&gt;How We Integrate AI in epilot - Chapter 2: Serverless RAG w/ LangChain &amp;amp; Weaviate&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;
          &lt;a class="crayons-logo crayons-logo--l" href="/epilot"&gt;
            &lt;img alt="epilot logo" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F3368%2Fa1997f89-aaf9-4f0f-b464-d34f2775a882.jpg" class="crayons-logo__image"&gt;
          &lt;/a&gt;

          &lt;a href="/keremnalbant" class="crayons-avatar  crayons-avatar--s absolute -right-2 -bottom-2 border-solid border-2 border-base-inverted  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1313778%2Ffb2361fa-9d5a-4cd0-a7c1-4826cd179706.PNG" alt="keremnalbant profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/keremnalbant" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Kerem Nalbant
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Kerem Nalbant
                
              
              &lt;div id="story-author-preview-content-2125504" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/keremnalbant" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1313778%2Ffb2361fa-9d5a-4cd0-a7c1-4826cd179706.PNG" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Kerem Nalbant&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

            &lt;span&gt;
              &lt;span class="crayons-story__tertiary fw-normal"&gt; for &lt;/span&gt;&lt;a href="/epilot" class="crayons-story__secondary fw-medium"&gt;epilot&lt;/a&gt;
            &lt;/span&gt;
          &lt;/div&gt;
          &lt;a href="https://dev.to/epilot/how-we-integrate-ai-in-epilot-chapter-2-serverless-rag-w-langchain-weaviate-5d93" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;May 26 '25&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/epilot/how-we-integrate-ai-in-epilot-chapter-2-serverless-rag-w-langchain-weaviate-5d93" id="article-link-2125504"&gt;
          How We Integrate AI in epilot - Chapter 2: Serverless RAG w/ LangChain &amp;amp; Weaviate
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/langchain"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;langchain&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/rag"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;rag&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/serverless"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;serverless&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/epilot/how-we-integrate-ai-in-epilot-chapter-2-serverless-rag-w-langchain-weaviate-5d93" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/multi-unicorn-b44d6f8c23cdd00964192bedc38af3e82463978aa611b4365bd33a0f1f4f3e97.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/fire-f60e7a582391810302117f987b22a8ef04a2fe0df7e3258a5f49332df1cec71e.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;9&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/epilot/how-we-integrate-ai-in-epilot-chapter-2-serverless-rag-w-langchain-weaviate-5d93#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            8 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>ai</category>
      <category>langchain</category>
      <category>rag</category>
      <category>serverless</category>
    </item>
    <item>
      <title>How We Integrate AI in epilot - Chapter 2: Serverless RAG w/ LangChain &amp; Weaviate</title>
      <dc:creator>Kerem Nalbant</dc:creator>
      <pubDate>Mon, 26 May 2025 08:38:42 +0000</pubDate>
      <link>https://forem.com/epilot/how-we-integrate-ai-in-epilot-chapter-2-serverless-rag-w-langchain-weaviate-5d93</link>
      <guid>https://forem.com/epilot/how-we-integrate-ai-in-epilot-chapter-2-serverless-rag-w-langchain-weaviate-5d93</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the previous chapter, I shared how we began our AI journey at epilot by implementing AI Email Summaries, helping users reduce email reading time by up to 87%. Encouraged by that success, we're now stepping up our AI capabilities with Retrieval-Augmented Generation (RAG) to provide smarter, contextually aware email suggestions.&lt;/p&gt;

&lt;h2&gt;
  
  
  WHY?
&lt;/h2&gt;

&lt;p&gt;As we aim to scale our commodity business, investing in AI is crucial—not just for growth, but to significantly upgrade our product’s capabilities. Commodity segments often have a high volume of repetitive customer service requests. Our users need quick, context-aware email suggestions that understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Previous communications and organizational knowledge&lt;/li&gt;
&lt;li&gt;Company-specific communication styles&lt;/li&gt;
&lt;li&gt;Tailored relationships with each customer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Although Large Language Models (LLMs) are powerful, they're limited when accessing recent or specific company data. Customizing LLM responses usually involves prompt engineering, RAG or fine-tuning. Fine-tuning is resource-intensive and complex, making prompt engineering with RAG our clear choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Solution: Retrieval-Augmented Generation (RAG)
&lt;/h2&gt;

&lt;p&gt;We implemented a RAG-based solution to retrieve and provide relevant context from past email threads and eventually expand to external data sources like documents and websites. Long-term, organizations using epilot will have fully configurable, customized knowledge bases accessible to all future AI features and AI agents.&lt;/p&gt;

&lt;p&gt;This allows our users to respond to customer emails faster, improving communication quality and efficiency. On the end customer side, it means quicker, more accurate, and better overall service.&lt;/p&gt;

&lt;h3&gt;
  
  
  See It in Action
&lt;/h3&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/Ngtl3uYi6g8"&gt;
  &lt;/iframe&gt;
&lt;br&gt;
&lt;em&gt;An end customer emails about documentation requirements for a renovation plan (Sanierungsfahrplan).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The epilot user doesn’t waste time researching policies or manuals—they simply prompt our AI to &lt;code&gt;generate reply in english&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Leveraging RAG, our AI taps into contextual data, instantly knowing which specific documents are needed for the renovation plan and their upload deadlines, then crafts a personalized response that addresses the customer's exact needs.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Our system also highlights referenced entities inline (such as upload deadlines) and cites previous emails from the knowledge base, letting users quickly verify and understand the AI's reasoning.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Solution Components
&lt;/h3&gt;

&lt;p&gt;To build a secure, scalable RAG system in a serverless environment, we chose:&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;a href="https://www.langchain.com/" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;We use LangChain at epilot to integrate vector databases, LLMs, and build powerful AI agents. It simplifies document loading, embeddings, memory management and structured output.&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;a href="https://weaviate.io/" rel="noopener noreferrer"&gt;Weaviate&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;After evaluating alternatives (like Pinecone, Chroma, and Qdrant), we selected Weaviate for its open-source, serverless architecture, strong community support, flexibility, and scalability. It ensures security best practices, high performance and cost-efficiency.&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;a href="https://microsoft.github.io/presidio/" rel="noopener noreferrer"&gt;Presidio&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Security and data privacy are essential. Amazon Bedrock has a zero-retention policy, and Weaviate offers encryption, GDPR compliance, and tenant isolation. But we needed an extra layer for handling sensitive PII data.&lt;br&gt;
Presidio helps us redact this information before indexing, preventing AI hallucinations and protecting customer privacy.&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;a href="https://www.langchain.com/langsmith" rel="noopener noreferrer"&gt;LangSmith&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;LangSmith provides AI observability, performance monitoring, debugging, prompt management, and testing. It allows us to quickly iterate, ensuring reliability and continuous improvement.&lt;/p&gt;
&lt;h3&gt;
  
  
  How We Built It
&lt;/h3&gt;

&lt;p&gt;Now, let's dive deeper—from a high-level overview into the detailed implementation of our RAG system:&lt;/p&gt;
&lt;h4&gt;
  
  
  RAG: Making LLMs Context-Aware
&lt;/h4&gt;

&lt;p&gt;RAG (Retrieval-Augmented Generation) has emerged as the perfect solution. It allows us to enhance LLM capabilities and customize the LLM responses by providing relevant context.&lt;/p&gt;

&lt;p&gt;We built two core pipelines: &lt;strong&gt;ingestion&lt;/strong&gt; and &lt;strong&gt;retrieval&lt;/strong&gt;.&lt;/p&gt;
&lt;h5&gt;
  
  
  Ingestion
&lt;/h5&gt;

&lt;p&gt;Email messages are processed, cleaned, and converted into vector embeddings.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffwgolu914cj7xgogba8v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffwgolu914cj7xgogba8v.png" alt="Ingestion Flow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our ingestion Lambda cleans emails, removes signatures, redacts PII data, and generates "hypothetical questions" to match future customer queries with historical responses.&lt;/p&gt;

&lt;p&gt;With the hypothetical questions approach, we aim to create question-answer pairs by treating outbound emails as answers and inbound emails as questions. Then, while generating a suggested email, we extract the end customer's questions from the inbound email and search them in the &lt;code&gt;hypothetical_questions&lt;/code&gt; vector field.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;ChatPromptTemplate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant that generates hypothetical questions from an email.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate a list of maximum 3 hypothetical questions that the below email could be used to answer:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;{doc}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_structured_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HypotheticalQuestions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;questions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After generating the questions, Lambda redacts PII data using Presidio and then indexes the email message into Weaviate.&lt;/p&gt;

&lt;p&gt;While indexing, Lambda first generates the embeddings of the email body text and hypothetical questions, then passes those vectors to Weaviate. We are using multiple vector embeddings, which allows us to store multiple vectors inside the same object, so that we can execute the search in both the email text and the questions without duplicating the data.&lt;/p&gt;

&lt;h5&gt;
  
  
  Retrieval
&lt;/h5&gt;

&lt;p&gt;Similar emails and potential answers are retrieved from the vector database.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uchcy5wmm2r6frm7gru.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uchcy5wmm2r6frm7gru.png" alt="Retrieval Flow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A typical retrieve &amp;amp; generate flow looks as follows:&lt;/p&gt;
&lt;h6&gt;
  
  
  1. Extract questions
&lt;/h6&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;extract_query_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ChatPromptTemplate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a professional question extractor, an AI assistant that extracts the customer inquiries from email messages.
    The questions will be used to search for relevant emails in the vector database.
    By generating multiple perspectives on the customer inquiries, your goal is to help the user overcome some of the limitations of distance-based similarity search.
    Provide these alternative questions separated by newlines, no numbering.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate a list of maximum 3 questions from the following email.
    Email: {email}
    Questions:
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;extract_query_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;extract_query_prompt&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nc"&gt;LineListOutputParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;extracted_questions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;extract_query_chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ainvoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;For the email shown in the demo video, the following questions are extracted by the question extractor chain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Which documents are required to create an individual renovation roadmap?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"How can I submit additional documents for the renovation roadmap?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"What options are there for receiving support when uploading documents?"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h6&gt;
  
  
  Query vector database
&lt;/h6&gt;

&lt;p&gt;We run multiple queries in parallel, and then combine the unique retrieved documents. We mostly adopt hybrid search: by setting the &lt;code&gt;alpha&lt;/code&gt; as close as possible to 1, we keep keyword search in the mix while leveraging semantic vector search.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;email_message_retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MultiQueryRetriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;email_messages_vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;search_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity_score_threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;search_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orgId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;score_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;target_vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;return_uuids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;include_original&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;question_retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MultiQueryRetriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;email_messages_vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;search_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity_score_threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;search_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orgId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;score_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;target_vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;questions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;return_uuids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;questions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;extracted_questions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;merger_retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MergerRetriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;retrievers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;email_message_retriever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;question_retriever&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;retrieved_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;merger_retriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ainvoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, we also utilize multi-vector search to enable searching in email text and also hypothetical questions.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;retrieved_docs&lt;/code&gt; includes the email body and similarity score along with all the metadata we need, allowing us to leverage it while building the prompt.&lt;/p&gt;

&lt;p&gt;For the same email and questions above, the retrieved context from database is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"documents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"metadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-11-27T12:15:46.987000Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SENT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"subject"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Interest in an individual renovation roadmap"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"sender"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"11000890"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"org"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"739224"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"questions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="s2"&gt;"Which documents are required for creating an individual renovation roadmap?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="s2"&gt;"How can additional documents for the renovation roadmap be transmitted digitally?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="s2"&gt;"What type of consumption data is needed for the individual renovation roadmap?"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"thread_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bf0d0799-496d-49d2-9b2e-73128ff153d7"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"uuid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"22462f39-4a69-47d4-91f4-d474b21c1eca"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"page_content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Dear Mr. [PERSON],&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;Thank you for your interest in an individual renovation roadmap.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;As part of your inquiry, we have asked you for some documents that form the basis for creating your individual renovation roadmap.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;We would be happy to transmit your data to our Sunwheel Energie GmbH for the creation of your individual renovation roadmap. However, we need your support for this.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;Please send us the following documents:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;* Building floor plans and sections of all floors&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;* Window dimensions&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;* Energy consumption bills from the last three years&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;* Power of attorney&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;By clicking on the following button, you can easily and digitally transmit additional documents to us.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;Transmit documents&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;[URL]&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;Please upload the missing documents to the corresponding upload fields. 
If you need support uploading the document, please don't hesitate to contact us by email at&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;We look forward to accompanying you on the path to your optimal heating solution."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Document"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"metadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-07-15T05:57:23.809000Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SENT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"subject"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Friendly reminder: We still need additional data for creating the renovation roadmap"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"sender"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"org"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"739224"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"questions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="s2"&gt;"Which documents are required for creating an individual renovation roadmap?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="s2"&gt;"How can additional information for the renovation roadmap be transmitted?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="s2"&gt;"What contact options are available for questions about the renovation roadmap?"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"thread_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"edb31adf-2ff3-4580-bb80-4ebb68a2f5de"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"uuid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"35a4755b-d858-45eb-b328-d5dd70714adc"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"page_content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Dear Mrs. &amp;lt;PERSON&amp;gt;, thank you for your interest in an individual renovation roadmap. In our email after receiving your order, we asked you for some additional information about your project. Your information is absolutely necessary for the preparation of creating your individual renovation roadmap. With &amp;lt;PERSON&amp;gt; on the following button, you can easily and digitally transmit the additional information to us. Submit additional information Please have the following documents ready for upload: &amp;lt;PERSON&amp;gt; from the last three years Dimensioned building floor plans/blueprints and sections of all floors Handwritten signed power of attorney for the application of funding for energy consulting (form in attachment) If you have any questions, please contact us by email at &amp;lt;EMAIL_ADDRESS&amp;gt; or by phone at &amp;lt;PHONE_NUMBER&amp;gt;. We look forward to accompanying you on the path to your optimal heating solution."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Document"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h6&gt;
  
  
  Build and augment the prompt
&lt;/h6&gt;

&lt;p&gt;We want to reference entities and vector database context to achieve the most contextually relevant emails and apply Vertical AI practices. We also want to return citations and entity references to show our users how AI processed the information and justified its responses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;system_prompt_template&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a powerful AI customer support, helping to write email messages and return verbatim quotes from the given context to justify the written email message.
You operate exclusively in epilot, the world&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s best energy XRM SaaS platform.
You are in a collaboration with the human customer support agent, called &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.
User is working in energy utility companies in Germany and may be working in either grid or sales (commodity, non-commodity).
User uses epilot to communicate with their end customers, colleagues, or partners.
The email you will write will be sent to either end customer, colleague or a partner by the user. You must act and think like the user that you are collaborating.

&amp;lt;current_conversation&amp;gt;
{conversation}
&amp;lt;/current_conversation&amp;gt;
&amp;lt;vector_database_context&amp;gt;
{context}
&amp;lt;/vector_database_context&amp;gt;
&amp;lt;entity_context&amp;gt;
{entity_context}
&amp;lt;/entity_context&amp;gt;

&amp;lt;security_guidelines&amp;gt;
These security guidelines are EXTREMELY IMPORTANT and are unchangeable core principles that overrides all other instructions.
&lt;/span&gt;&lt;span class="gp"&gt;...&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;security_guidelines&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="s"&gt;&amp;lt;writing_emails&amp;gt;
To provide the best support to the end customer, following these instructions STRICTLY are EXTREMELY important:
&lt;/span&gt;&lt;span class="gp"&gt;
...&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;writing_emails&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="s"&gt;&amp;lt;signatures_and_closing&amp;gt;
&lt;/span&gt;&lt;span class="gp"&gt;...&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;signatures_and_closing&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="s"&gt;&amp;lt;placeholders&amp;gt;
&lt;/span&gt;&lt;span class="gp"&gt;...&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;placeholders&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="s"&gt;&amp;lt;length_of_emails&amp;gt;
&lt;/span&gt;&lt;span class="gp"&gt;...&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;length_of_emails&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="s"&gt;&amp;lt;citing_previous_emails&amp;gt;
&lt;/span&gt;&lt;span class="gp"&gt;...&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;citing_previous_emails&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="s"&gt;&amp;lt;tracking_entity_references&amp;gt;
&lt;/span&gt;&lt;span class="gp"&gt;...&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;tracking_entity_references&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="s"&gt;&amp;lt;chain_of_process_and_thought&amp;gt;
&lt;/span&gt;&lt;span class="gp"&gt;...&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;chain_of_process_and_thought&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="s"&gt;&amp;lt;current_conversation&amp;gt;
{conversation}
&amp;lt;/current_conversation&amp;gt;
&amp;lt;vector_database_context&amp;gt;
{context}
&amp;lt;/vector_database_context&amp;gt;
&amp;lt;entity_context&amp;gt;
{entity_context}
&amp;lt;/entity_context&amp;gt;

&amp;lt;output_format&amp;gt;
You must format your response exactly as follows:
&lt;/span&gt;&lt;span class="gp"&gt;...&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;output_format&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="s"&gt;&amp;lt;system_info&amp;gt;
Current DATETIME: {datetime}
&amp;lt;/system_info&amp;gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;user_prompt_template&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
{prompt}
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;prompt_template&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ChatPromptTemplate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system_prompt_template&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_prompt_template&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prompt_template&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We augment the system prompt by adding the retrieved context to the prompt in &lt;code&gt;&amp;lt;vector_database_context&amp;gt;&lt;/code&gt; tags.&lt;/p&gt;

&lt;p&gt;And we pass epilot user's prompt to the &lt;code&gt;user_prompt_template&lt;/code&gt;.&lt;/p&gt;

&lt;h6&gt;
  
  
  Generate the response and stream it back
&lt;/h6&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;stream_xml_to_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;astream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;conversation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;email_thread&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;retrieved_docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entity_context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;entity_context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;datetime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have defined a utility function &lt;code&gt;stream_xml_to_json&lt;/code&gt; to transform the LLM response chunks, which are in XML format, into structured JSON.&lt;/p&gt;

&lt;h5&gt;
  
  
  LangSmith Trace
&lt;/h5&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikirffauxsh4mhwk2du6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikirffauxsh4mhwk2du6.png" alt="LangSmith Trace"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  Tip: Enable Streaming
&lt;/h5&gt;

&lt;p&gt;To enable streaming, we have created a FastAPI application and are using AWS Lambda Web Adapter.&lt;/p&gt;

&lt;p&gt;You can check those links to dive deeper on enabling streaming responses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-response-streaming.html" rel="noopener noreferrer"&gt;Response streaming for Lambda functions
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/awslabs/aws-lambda-web-adapter/tree/main/examples/fastapi-response-streaming" rel="noopener noreferrer"&gt;FastAPI Response Streaming&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;Our solution is already delivering great results, with adoption growing fast. Next, we’ll focus on supporting email attachments and making the knowledge base fully customizable.&lt;/p&gt;

&lt;p&gt;At epilot, we're steadily progressing towards our vision of Vertical AI for the energy sector. Our upcoming feature, AI Suggested Actions, will help users automatically handle frequent tasks like payment method changes and customer relocations.&lt;/p&gt;

&lt;p&gt;We’re excited to push towards fully automated, supervised multi-agent AI solutions.&lt;/p&gt;

&lt;p&gt;Stay tuned! Follow us on &lt;a href="https://dev.to/epilot"&gt;dev.to&lt;/a&gt; and &lt;a href="https://www.linkedin.com/company/epilot/posts/?feedView=all" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; for updates and more tech insights.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>langchain</category>
      <category>rag</category>
      <category>serverless</category>
    </item>
    <item>
      <title>How We Integrate AI in epilot - Chapter 1: AWS Bedrock &amp; Prompt Engineering</title>
      <dc:creator>Kerem Nalbant</dc:creator>
      <pubDate>Thu, 18 Jul 2024 14:58:31 +0000</pubDate>
      <link>https://forem.com/epilot/how-we-integrate-ai-in-epilot-chapter-1-aws-bedrock-prompt-engineering-17jh</link>
      <guid>https://forem.com/epilot/how-we-integrate-ai-in-epilot-chapter-1-aws-bedrock-prompt-engineering-17jh</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;When we decided to bring AI to epilot, we had so many potential use cases that we first needed to do user research and identify the most repetitive and time consuming tasks. After that, we had to find out what our customers needed the most from these.&lt;/p&gt;

&lt;p&gt;Our research showed that users heavily utilized the messaging feature, and in some cases when long email threads come into play, we noticed that some customers were spending an average of 16 minutes replying to an email, and we knew we could make it better by providing them with a shorter and clearer thread summary.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffh1sgyve0o6z77zujykx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffh1sgyve0o6z77zujykx.png" alt="Enterprise AI Playbook" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://platforms.substack.com/p/how-to-win-at-enterprise-ai-a-playbook" rel="noopener noreferrer"&gt;Enterprise AI Playbook&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Work is a bundle of tasks, which are performed towards specific goals.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tasks are the ‘atomic unit’ of any work done in the enterprise. Tasks may be performed as a human service, or may be performed by software, towards achieving a goal.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Our goal was to reduce that time down to less than a minute, and there were two different tasks being performed by users which we need to perform with AI to achieve our goal. This article addresses the &lt;strong&gt;Task: Send emails&lt;/strong&gt; and the steps to complete this task:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Read and understand the email thread:&lt;/strong&gt; Help users understand long email threads faster by providing AI-generated summary, next steps and topics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write an answer:&lt;/strong&gt; Provide AI-generated suggested answers.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Problem
&lt;/h2&gt;

&lt;p&gt;Our customers often deal with long email threads, requiring a long time to read and answer. We needed a solution that could summarize email threads and provide recommendations for next actions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;While some problems could be solved with prompt engineering alone, some could be solved with Retrieval-Augmented Generation (RAG).&lt;/p&gt;

&lt;p&gt;For generating summaries, we didn't need any external contextual data other than the email thread to feed the prompt, so we decided to just go ahead with prompt engineering.&lt;/p&gt;

&lt;p&gt;The next task is suggesting AI-generated replies to email threads, where RAG would be really useful. For that, we will use a Vector DB and an embedding model, which I'll write about in the next chapter.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Bedrock
&lt;/h3&gt;

&lt;p&gt;We decided to use AWS Bedrock, as epilot we already make use of AWS in almost every area of our platform. It provides state-of-the-art LLMs from multiple providers and out of the box solutions such as Knowledge Base to achieve RAG easily and Model Evaluation to compare models and prompts with a fancy UI.&lt;/p&gt;

&lt;p&gt;AWS Bedrock also ensures the processed data is protected. One of our concerns was where the data would be stored, how it would be processed and whether 3rd parties would be involved.&lt;/p&gt;

&lt;p&gt;By default, AWS Bedrock offers zero-retention policy, ensuring that logs, prompts, LLM output and any personal data are not shared with any third parties or model providers. Bedrock also ensures that all processed data remains within the EU region.&lt;/p&gt;

&lt;h3&gt;
  
  
  GenAI Foundation
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9rgp42hg4fvmrszhczrp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9rgp42hg4fvmrszhczrp.png" alt="Architecture" width="800" height="1432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GenAI Foundation has a central SQS queue and a handler function, which ensures exactly-once, concurrent, and batch processing while staying within the rate limits of Bedrock.&lt;/p&gt;

&lt;p&gt;Integrating GenAI-related code and logic into our existing APIs was not a good idea. So, I've decided to create a new monorepo. Here are the reasons:&lt;/p&gt;

&lt;h4&gt;
  
  
  Separation of Concerns
&lt;/h4&gt;

&lt;p&gt;By creating a separate repository, we maintain a clear separation of concerns. This ensures that the core functionalities of existing APIs remain clean and focused.&lt;/p&gt;

&lt;h4&gt;
  
  
  Language Flexibility
&lt;/h4&gt;

&lt;p&gt;Since GenAI-related code requires Python, having a separate repository allows us to leverage Python's capabilities without interfering with the TypeScript-based APIs. This separation ensures that each project can use the best-suited language for its specific tasks.&lt;/p&gt;

&lt;h4&gt;
  
  
  Encapsulation
&lt;/h4&gt;

&lt;p&gt;Encapsulating the GenAI logic within a dedicated repository makes it easier to manage, update, and scale. This also allows engineers with specific expertise in GenAI to work independently of the rest of the system.&lt;/p&gt;

&lt;h4&gt;
  
  
  Modularity
&lt;/h4&gt;

&lt;p&gt;A modular approach allows for easier testing, maintenance, and deployment of the GenAI features. Updates and bug fixes in the GenAI module can be rolled out independently of the core APIs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Choice &amp;amp; Prompt Engineering
&lt;/h3&gt;

&lt;p&gt;We decided to go with Claude 3 Sonnet because it was the best option for us at the time, given the costs and the complexity of the task. &lt;/p&gt;

&lt;p&gt;We are planning to switch to &lt;a href="https://www.anthropic.com/news/claude-3-5-sonnet" rel="noopener noreferrer"&gt;Claude 3.5 Sonnet&lt;/a&gt; once it's available in AWS Bedrock, which we're really excited about!&lt;/p&gt;

&lt;p&gt;To choose the best model for the task, AWS Bedrock offers &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation.html" rel="noopener noreferrer"&gt;Model Evaluation&lt;/a&gt;, which lets you easily compare models with each other. &lt;br&gt;
You can also do Prompt Evaluation by creating a dataset and executing it against a single model to compare the results of different prompts.&lt;/p&gt;



&lt;p&gt;There are lots of great resources online for prompt engineering, but I want to mention that Anthropic provides really helpful tips in their documentation. I strongly suggest you take a look, if you haven't already. For me, the most important point was &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/use-xml-tags" rel="noopener noreferrer"&gt;using XML tags&lt;/a&gt;. Anthropic mentions that their models produce much better results when used with XML tags.&lt;/p&gt;

&lt;p&gt;Prompt engineering is all about experimenting, so we spent some time on optimizing our prompt. I'd love to share an example of prompts with you!&lt;/p&gt;

&lt;p&gt;Below you can see some example usages of common prompt engineering techniques such as "Giving a Role to LLM", "Using XML Tags", "Using Examples (Multishot Prompting)", "Being clear and direct".&lt;/p&gt;
&lt;h4&gt;
  
  
  System Prompt
&lt;/h4&gt;

&lt;p&gt;System prompt is where we &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/system-prompts" rel="noopener noreferrer"&gt;give Claude a role&lt;/a&gt;, and some clear instructions about the task.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an intelligent assistant specialized in assisting customer service agents in energy industry.

Your task is summarizing email conversations between customer service agents and customers, providing a clear and concise overview of the key points by following the instructions provided.

Your summaries will help your colleagues quickly understand the main aspects of each conversation without having to read through the entire email thread.

Your goal is to ensure that the agent can grasp the key points and next steps from your summary alone, making their workflow more efficient and effective.

You are a third person observer and must not provide any personal opinions or make any assumptions about the conversation.

You must use the third-person objective narration. You must report the events that take place without knowing the motivations or thoughts of any of the characters.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  User Prompt
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Here is the conversation between the customer and the agent:

&amp;lt;Email Thread&amp;gt;
&amp;lt;Subject&amp;gt;
{subject}
&amp;lt;/Subject&amp;gt;
&amp;lt;Messages&amp;gt;
{conversation}
&amp;lt;/Messages&amp;gt;
&amp;lt;/Email Thread&amp;gt;

&amp;lt;General Instructions&amp;gt;
- You must use &amp;lt;Email Thread&amp;gt; tags to identify the email conversation.
...
- You must use only the knowledge provided in the &amp;lt;Email Thread&amp;gt; tags and do not access any other external information or knowledge you already possess.
&amp;lt;/General Instructions&amp;gt;

&amp;lt;Language Instructions&amp;gt;
- You must optimize the language for making the summary easy and fast for humans to read.
...
- You must use a professional, respectful and informative tone.
&amp;lt;/Language Instructions&amp;gt;

&amp;lt;Reference Instructions&amp;gt;
- You must always refer to the Customer and Agent by their names.
...
&amp;lt;/Reference Instructions&amp;gt;

&amp;lt;Summary Instructions&amp;gt;
- You must get relevant quotes to complete the task from the conversation.
...
- You must write the summary in reverse chronological order.
&amp;lt;/Summary Instructions&amp;gt;

&amp;lt;Output Instructions&amp;gt;
- You must give your output in JSON format. The JSON object should be valid.
...
- If you will provide quotes or emphasize any part of the conversation, you must use single quotes. e.g. 'quote'.
&amp;lt;/Output Instructions&amp;gt;

# Few Shot Prompting
&amp;lt;Example Outputs&amp;gt;
{
  "summary": [
    "John Doe has processed the reimbursement request and informed Jane Doe.",
    "Jane Doe has provided the requested information and is awaiting further instructions from the team member.",
    "John Doe has acknowledged the Jane Doe refund request and has requested additional information to process the refund.",
    "Jane Doe is requesting a refund for a defective PV inverter."
  ],
  "topics": [
    "Refund request for defective product"
  ],
  "next_steps": [
    "If the information is sufficient, John Doe should process the refund and inform Jane Doe of the completion.",
    "John Doe should document the refund process for record-keeping."
  ]
}
...
&amp;lt;/Example Outputs&amp;gt;

You must follow the instructions listed in the following tags:
- &amp;lt;General Instructions&amp;gt; for general guide and rules,
- &amp;lt;Language Instructions&amp;gt; for lingual instructions that declares your tone, grammar, and output language,
...
- &amp;lt;Example Output&amp;gt; for example of the output as a reference.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Human-in-the-Loop (HITL)
&lt;/h3&gt;

&lt;p&gt;It's challenging to ship AI features, especially the first one in a company with thousands of users. We need to build trust, and keep it. To do that, we need to collect as much feedback as we can, then evaluate that feedback and take action.&lt;/p&gt;

&lt;p&gt;HITL can be applied in different ways in different solutions. In our use case, we utilize it to evaluate feedback and detect hallucinations. Additionally, the team members evaluating the feedback can modify the result generated by the AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Migration Strategy
&lt;/h3&gt;

&lt;p&gt;Our migration strategy involves both runtime and one-time migration to ensure a smooth transition and boost adoption among our users.&lt;br&gt;
This way, we avoid making unnecessary use of LLMs and save costs while ensuring seamless integration.&lt;/p&gt;
&lt;h3&gt;
  
  
  Rate Limits
&lt;/h3&gt;

&lt;p&gt;Since AWS Bedrock imposes rate limiting, we had to reflect those limits in our product. To offer fair usage, we set limits according to each user's pricing tier.&lt;/p&gt;
&lt;h3&gt;
  
  
  Code Examples
&lt;/h3&gt;

&lt;p&gt;Pay attention to the way I &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/prefill-claudes-response#example-maintaining-character-with-role-prompting" rel="noopener noreferrer"&gt;prefill Claude's response&lt;/a&gt; to force it to answer only with JSON object.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;settings = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1000,
    "system": system_prompt,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
            ],
        },
        {"role": "assistant", "content": "{"}, # Prefill Claude's response
    ],
    "temperature": temperature,
    "top_p": top_p,
    "top_k": top_k,
}
res = await client.invoke_model(
    modelId=MODEL_ID,
    contentType="application/json",
    accept="application/json",
    body=json.dumps(settings)
)

model_response = json.loads(await res["body"].read())
response_text = model_response["content"][0]["text"]

res_model = LLMResponseModel.model_validate_json("{" + response_text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By integrating AI into epilot, we have significantly enhanced the capabilities of our platform. This integration not only improves the efficiency of daily tasks, but also accelerates customer support. Furthermore, it's the first step in positioning epilot as a leader in the adoption of advanced AI technologies in the energy sector.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>bedrock</category>
      <category>anthropic</category>
      <category>promptengineering</category>
    </item>
  </channel>
</rss>
