Forem: Vitalii Honchar

flow-run: LLM Orchestration, Prompt Testing & Cost Monitoring

Vitalii Honchar — Tue, 19 Aug 2025 15:00:00 +0000

Introduction

Over the past couple of years, I've been observing a trending phenomenon on X (Twitter): "build in public." Developers building products share screenshots, code snippets, and progress updates from their projects, posting them with the hashtag #buildinpublic.

While this trend is fascinating, the projects being showcased are typically closed source and proprietary. I believe that #buildinpublic should be truly public, with projects being open sourced from day one.

That's why I'm excited to announce my new open source project flow-run, which I'll build completely in public and document every step of the journey. The source code will be available on GitHub from the very first day of development.

The idea for this project was inspired by my previous product ai-svc, developed for AI Founder, which I described here: Building ai-svc: A Reliable Foundation for AI Founder

Context

AI Engineering is an emerging trend, much like building in public, and we're seeing an explosion of AI-native applications being released. While I enjoy building AI-native applications myself, they all share common challenges:

LLM providers lack reliability (I previously published an article exploring this issue)
LLM prompt development differs fundamentally from classical programming, yet modern approaches tightly couple prompts with application code
AI application frameworks are limited to specific languages like Python or TypeScript, restricting the choice of programming languages for efficient application development

Drawing from my infrastructure experience, LLM integration resembles an infrastructure component rather than an application component. Prompts are remarkably similar to SQL queries - we have a dedicated engine (the LLM) that executes prompts, and we send requests to it. So why should we restructure our applications to embed prompts directly in application code when this is fundamentally an infrastructure concern?

The same principle applies to prompt versioning. While prompts resemble SQL, they're far more complex. It's insufficient to run only integration and load tests to verify functionality. With prompts, we need evaluation tests to ensure new versions perform better than previous ones. Embedding prompts within application code makes these tests unnecessarily complex.

Through reflection on these challenges, a product idea crystallized:

What if we treated prompts as code, similar to Infrastructure as Code tools like Terraform?
What if prompt execution was completely decoupled from application execution, with applications simply calling an execution engine?
What if prompt developers could focus exclusively on prompt development, evaluation, and deployment?
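Here is a minimal Python sketch of these ideas. It is not flow-run's design: the JSON prompt format, `PromptEngine`, and `fake_llm` are hypothetical names used only for illustration. Prompt definitions live in data, as they would in a Git-versioned file, and the application only asks the engine to execute a prompt by name, much like sending a query to a database.

```python
import json
from string import Template
from typing import Callable

# Hypothetical prompt definitions, as they might be stored in a versioned file.
PROMPTS_FILE = """
{
  "summarize": {
    "version": 2,
    "model": "cheap-model",
    "template": "Summarize the following text in one sentence: $text"
  }
}
"""


class PromptEngine:
    """Executes prompts by name, so application code never contains prompt text."""

    def __init__(self, definitions: str, llm: Callable[[str, str], str]):
        self.prompts = json.loads(definitions)
        self.llm = llm  # called as llm(model, prompt) -> completion

    def execute(self, name: str, **params: str) -> str:
        spec = self.prompts[name]
        # Template.substitute raises KeyError if a template variable is missing.
        prompt = Template(spec["template"]).substitute(**params)
        return self.llm(spec["model"], prompt)


def fake_llm(model: str, prompt: str) -> str:
    # Stand-in for a real provider call: echoes the model and the prompt.
    return f"[{model}] {prompt}"


engine = PromptEngine(PROMPTS_FILE, fake_llm)
print(engine.execute("summarize", text="flow-run keeps prompts out of application code."))
```

With this split, changing a prompt or its model means editing the definition file, not redeploying the application.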

User Stories

Let's define the scope and requirements for this project. I'll use the user stories technique to understand user needs and derive project requirements from them.

User Personas

I'm building flow-run for two distinct user personas:

User Persona 1: Prompt Developer

Role: Develop, debug, evaluate, and deploy prompts
Background: Former software engineer with knowledge of building software products and using development tools
Primary Pain Point: No unified approach to prompt development; constantly writing Python scripts for quick testing, implementing workarounds for prompt evaluation

User Persona 2: Application Developer

Role: Develop, debug, and deploy application servers
Background: Software engineer with expertise in building software products and using development tools
Primary Pain Point: Lacks time or specialized knowledge for prompt development; focuses primarily on business logic implementation; existing LLM integrations are unreliable

Prompt Developer User Stories

US-1-1: Develop Prompts

As a Prompt Developer, I want to define my prompts without traditional programming languages like Python, while maintaining the benefits of source code versioning (Git) and syntax highlighting (IDE).

US-1-2: Test prompts

As a Prompt Developer, I want to test my prompts immediately after development without writing custom Python code or waiting for CI builds to execute.

US-1-3: Evaluate prompt versions

As a Prompt Developer, I want to evaluate newly developed prompts against their production versions to ensure the new version performs better than the previous one.
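One way to picture such an evaluation is a gate that scores both versions on the same dataset and promotes the candidate only if it wins. The sketch below is hypothetical: the tiny eval set, the exact-match metric, and the stub "prompt versions" (dictionary lookups standing in for LLM calls) are assumptions, and real prompt evaluations usually need softer scoring.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical evaluation set: (input, expected answer) pairs.
EVAL_SET: List[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("3 * 3", "9"),
]

PromptVersion = Callable[[str], Optional[str]]


def accuracy(run_prompt: PromptVersion) -> float:
    """Fraction of eval cases where the output matches the expected answer exactly."""
    hits = sum(run_prompt(question) == expected for question, expected in EVAL_SET)
    return hits / len(EVAL_SET)


def should_promote(candidate: PromptVersion, production: PromptVersion) -> bool:
    # Deploy the candidate only if it scores strictly better than production.
    return accuracy(candidate) > accuracy(production)


# Stubs standing in for two prompt versions executed against an LLM.
production_v1 = {"2 + 2": "4", "capital of France": "Paris", "3 * 3": "6"}.get
candidate_v2 = {"2 + 2": "4", "capital of France": "Paris", "3 * 3": "9"}.get

print(accuracy(production_v1), accuracy(candidate_v2), should_promote(candidate_v2, production_v1))
```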

US-1-4: Deploy prompts

As a Prompt Developer, I want to deploy my prompts to dev and prod environments easily and reliably.

US-1-5: Automated prompts testing

As a Prompt Developer, I want to automate my prompt testing and run tests in CI after each Git push.

US-1-6: Automated prompts deployment

As a Prompt Developer, I want to automate prompt deployments through CD after each Git push to the main branch.

US-1-7: Prompts Workflows

As a Prompt Developer, I want to build workflows with my prompts where each step executes sequentially.

US-1-8: Prompts Agents

As a Prompt Developer, I want to build agents with my prompts and incorporate them into workflows as described in US-1-7.

US-1-9: Observability & Costs Management

As a Prompt Developer, I want to monitor prompt execution and track LLM costs.

US-1-10: Easy LLM swap

As a Prompt Developer, I want to switch LLM providers easily without extensive code changes.
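A common way to achieve this is to hide providers behind one interface and pick the implementation from configuration. The sketch below assumes that design; `LLMProvider`, the stub classes, and the config keys are hypothetical, and the stubs return strings instead of calling real APIs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


class LLMProvider(Protocol):
    def complete(self, prompt: str) -> str: ...


@dataclass
class OpenAIStub:
    model: str

    def complete(self, prompt: str) -> str:
        return f"openai/{self.model}: {prompt}"  # a real client would call the API here


@dataclass
class DeepSeekStub:
    model: str

    def complete(self, prompt: str) -> str:
        return f"deepseek/{self.model}: {prompt}"


# Provider name from configuration -> factory that takes the model name.
PROVIDERS: Dict[str, Callable[[str], LLMProvider]] = {
    "openai": OpenAIStub,
    "deepseek": DeepSeekStub,
}


def provider_from_config(config: Dict[str, str]) -> LLMProvider:
    return PROVIDERS[config["provider"]](config["model"])


# Swapping providers is a configuration change; the calling code stays the same.
for config in ({"provider": "openai", "model": "gpt-4o"},
               {"provider": "deepseek", "model": "deepseek-chat"}):
    print(provider_from_config(config).complete("Hello"))
```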

Application Developer User Stories

US-2-1: Execute AI Flow

As an Application Developer, I want to reliably execute AI Flows defined by Prompt Developers in the flow-run service to add AI integration to my application.

US-2-2: Get AI Flow results

As an Application Developer, I want to retrieve AI Flow results from the flow-run service when they're ready.

Project Requirements

Based on the defined user stories, I can establish the following project requirements:

Reliable execution of AI flows using fire-and-forget semantics with guaranteed execution
Infrastructure-as-Code approach for prompt development and deployment
CLI tool for running, evaluating, and deploying prompts
CI/CD support for prompt testing and deployment
Support for common AI flow abstractions: tasks, workflows, and agents
Observability and cost reporting for AI flow executions
Multi-LLM provider support within the execution engine
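The execute/get-results contract from US-2-1 and US-2-2 can be sketched as an engine whose `submit` returns an execution ID immediately while a worker runs the flow later. Everything here (`FlowRunner`, `submit`, `work_once`, `get_result`) is hypothetical and not flow-run's API. The queue is in memory, so unlike the guaranteed execution the requirements call for, this sketch would lose work if the process crashed.

```python
import uuid
from collections import deque
from typing import Callable, Deque, Dict, Optional, Tuple

Flow = Callable[[dict], str]


class FlowRunner:
    """Hypothetical engine: submit() returns at once, a worker runs flows later."""

    def __init__(self, flows: Dict[str, Flow]):
        self.flows = flows
        self.queue: Deque[Tuple[str, str, dict]] = deque()
        self.results: Dict[str, Optional[str]] = {}

    def submit(self, flow: str, params: dict) -> str:
        """Fire-and-forget: enqueue the execution and return its ID immediately."""
        execution_id = str(uuid.uuid4())
        self.results[execution_id] = None  # None means "still pending"
        self.queue.append((execution_id, flow, params))
        return execution_id

    def work_once(self) -> None:
        """Process one queued execution; a background worker would loop on this."""
        if not self.queue:
            return
        execution_id, flow, params = self.queue.popleft()
        self.results[execution_id] = self.flows[flow](params)

    def get_result(self, execution_id: str) -> Optional[str]:
        return self.results[execution_id]


def greet(params: dict) -> str:
    return f"Hello, {params['name']}!"  # stand-in for an LLM-backed flow


runner = FlowRunner({"greet": greet})
execution_id = runner.submit("greet", {"name": "Ann"})
print(runner.get_result(execution_id))  # None: still pending
runner.work_once()
print(runner.get_result(execution_id))
```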

Project Roadmap

Stage	User Story	Description
v1	US-1-1: Develop prompts	Enable prompt developers to create prompts
v1	US-1-7: Prompts Workflows	Implement workflow support in `flow-run`
v1	US-1-4: Deploy prompts	Enable prompt deployment capabilities
v1	US-1-10: Easy LLM swap	Support LLM provider switching from day one
v1	US-2-1: Execute AI Flow	Enable application developers to execute developed prompts
v1	US-2-2: Get AI Flow results	Enable retrieval of execution results
v2	US-1-2: Test prompts	Improve prompt quality in `flow-run`
v2	US-1-3: Evaluate prompt versions	Enable prompt evolution without quality degradation
v2	US-1-9: Observability & Costs Management	Add monitoring capabilities and cost tracking
v2	US-1-5: Automated prompts testing	Enable CI integration for testing
v2	US-1-6: Automated prompts deployment	Enable CD integration for deployments
v3	US-1-8: Prompts Agents	Implement support for prompt agents

Roadmap Explanation:

v1 stage delivers a minimum viable product supporting basic prompt development and workflow execution. This enables applications to begin integrating with flow-run without waiting for full feature completion
v2 stage introduces improvements in prompt testing and observability capabilities
v3 stage implements AI Agent support, which represents a complex feature requiring dedicated development focus

Conclusions

Thank you for reading this announcement! I'm thrilled to launch this truly public project that I've been contemplating for the past 2-3 years. In upcoming articles, I'll cover the system design and share regular progress updates. All source code will be available on GitHub throughout the development journey.


Designing AI Applications: Principles from Distributed Systems Applicable in a New AI World

Vitalii Honchar — Tue, 05 Aug 2025 15:00:00 +0000

Introduction

For more than a year, I've been exploring AI Engineering by observing numerous new startups leveraging AI and launching my own products. AI is a super hot topic, and when you first try to build an AI tool or read about AI applications, it seems like a magical world where completely new principles are applied. But the key thing I learned during this period is that building AI applications is a process that's not too different from building distributed systems where reliability is a mandatory requirement.

In this article, I will explain how to build AI applications by applying distributed systems principles to make applications reliable and scalable.

Main Problem

Let's consider a simple application:

User sends a request to a service
Service sends a request to LLM
LLM returns a response
Service returns a response to user

The main problem is that LLMs don't respond reliably, and too often LLM providers return 429 Too Many Requests errors. In this case, users will be unhappy but can retry a request themselves (or leave the app and never use it again).

But the situation gets worse when we have async user request processing, where the service accepts a user request and starts a batch job that needs to perform N communications with the LLM.

What if on the 3rd LLM request, the LLM returns a 429 Too Many Requests error and our batch job fails?

In the end, when users check the results, they'll receive an empty response because the LLM failed to process the job, which will result in losing this user and, at scale, losing many such users.

To deal with this situation, there is a solution, and it's not OpenRouter 😉.

Solution

The problem described above is a classic vendor dependency problem in software engineering: the vendor API is not reliable, but your service can't go down just because the vendor is unable to handle your request. To solve this, we need:

Retries - the most brilliant solutions are often the simplest.
Delays between retries - because we don't want to DDoS the vendor, and we need to give them time to heal their systems during outages.
Durability - a guarantee that once we have accepted a user request, it will be handled.

There are two options for how these properties can be implemented.

Option 1: In-Memory Retries

The most obvious and simple solution is to add a for-loop and retry if an exception happens:

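import asyncio
import random


async def send_request_with_retries():
    err = None
    for i in range(10):  # 1. Retries at most 10 times
        if i > 0:
            # 2. Sleeps a random 50-200 ms between attempts
            await asyncio.sleep(random.uniform(0.05, 0.2))
        try:
            # do_request() is the async LLM call (defined elsewhere)
            return await do_request()
        except Exception as e:
            err = e
    # 3. Throws an error once all attempts have failed
    raise RuntimeError(f"unable to process a request: {err}") from err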

This code works like this:

Retries a request at most 10 times
Randomly sleeps for 50 ms to 200 ms between retries to avoid overloading the LLM API. A fixed sleep won't work here: clients that failed at the same moment would all retry at the same moment, which still hits the API like a DDoS.
Throws an error if the retry limit is reached
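A widely used refinement of the random sleep above, swapped in here rather than taken from the snippet, is exponential backoff with "full jitter": the upper bound of the random delay doubles after each failed attempt, so a struggling provider gets progressively more room to recover. The helper name and its parameters are assumptions for illustration.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    call: Callable[[], Awaitable[T]],
    attempts: int = 10,
    base_delay: float = 0.05,
    max_delay: float = 5.0,
) -> T:
    """Retry an async call, sleeping a random 'full jitter' delay between attempts."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        if attempt > 0:
            # The delay cap doubles each time: base, 2*base, 4*base, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            await asyncio.sleep(random.uniform(0, cap))
        try:
            return await call()
        except Exception as error:  # real code should retry only retryable errors, e.g. 429
            last_error = error
    raise RuntimeError(f"unable to process a request: {last_error}") from last_error
```

Like the original loop, this still keeps all state in memory, so it addresses the first two properties but not durability.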

The code is simple, but it doesn't solve the last property: Durability. If our service fails in the middle of the for loop, we'll forget about this request and never retry it again. This is not an issue in Option 2.

Option 2: Transactional Outbox Pattern

Instead of running a simple retry loop, we can use the Transactional Outbox Pattern.

User sends a request to our service.
Service saves the request to a database.
Service responds 200 OK to the user.
The scheduler module in the service checks every 1 second for pending jobs in the database.
The scheduler module executes the LLM request if a pending job is in the database, and if LLM execution is successful, it saves the execution result in the database and marks the pending job as done.
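A minimal, single-node sketch of these steps, using Python's built-in sqlite3 module with an in-memory database. The `jobs` table and the function names are hypothetical (this is not reddit-agent's code), and coordination between several scheduler nodes is deliberately left out.

```python
import sqlite3
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, request TEXT NOT NULL,"
        " state TEXT NOT NULL DEFAULT 'pending', result TEXT)"
    )


def accept_request(conn: sqlite3.Connection, request: str) -> int:
    # Steps 1-3: persist the request first; only then answer "200 OK" to the user.
    with conn:  # commits the transaction on exit
        cur = conn.execute("INSERT INTO jobs (request) VALUES (?)", (request,))
    return cur.lastrowid


def scheduler_tick(conn: sqlite3.Connection, call_llm: Callable[[str], str]) -> None:
    # Steps 4-5: the real service would run this every second.
    pending = conn.execute("SELECT id, request FROM jobs WHERE state = 'pending'").fetchall()
    for job_id, request in pending:
        try:
            result = call_llm(request)
        except Exception:
            continue  # the job stays 'pending' and is retried on the next tick
        with conn:
            conn.execute(
                "UPDATE jobs SET state = 'done', result = ? WHERE id = ?",
                (result, job_id),
            )


conn = sqlite3.connect(":memory:")
init_db(conn)
accept_request(conn, "research r/golang")  # returns immediately with the job id
scheduler_tick(conn, lambda request: f"insights for: {request}")
print(conn.execute("SELECT state, result FROM jobs").fetchall())
```

Because the request is committed before the user gets a response, a crash between ticks only delays the job: the next scheduler run finds it still pending.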

This pattern guarantees that once the service accepts a user request, the request will be handled (or almost guaranteed, because retry limits apply here as well).

A reliable mechanism for calling LLMs not only makes the application reliable but also lets us reduce LLM costs. Instead of using expensive models from providers like Anthropic or OpenAI, we can use cheap models like DeepSeek and significantly reduce the cost of running the application.

Fun fact: I spent $20 during development of reddit-agent just on running local tests. After implementing the Transactional Outbox pattern, I completely migrated from the OpenAI API to DeepSeek, and right now reddit-agent costs me $0: the LLM is free but not reliable, and the pattern makes it reliable.

Internally, the scheduler module uses the retry logic from Option 1 but adds a persistence layer to ensure that the service handles every accepted user request.

In the rest of this article, I will explain how to adopt the Transactional Outbox pattern using the example of my application reddit-agent, which I built specifically for this article. The source code is available on GitHub.

Example: reddit-agent

I built an AI Agent that researches Reddit for predefined topics. It is available at https://insights.vitaliihonchar.com/.

The design is shown in this diagram:

Every day, the scheduler executes a batch of AI Agents that analyze Reddit and try to retrieve information specified in prompts.

Each agent was implemented with the ReAct pattern:

The ReAct pattern is a perfect choice here because it lets the LLM decide creatively how to find the right posts and analyze them.

Finally, users can read findings on the insights page:

Implementation: reddit-agent

The high-level design is clear, but the details are where things get complex. I wanted this service to be an example of an AI application that is both reliable and scalable, so let's look at its architecture:

I introduced 2 services:

insights - web UI that is responsible for serving user requests and showing a page with insights, available at insights.vitaliihonchar.com
agentapi - a generic platform for running LangGraph agents, which currently contains only one AI Agent: Reddit Search Agent. I plan more AI Agent experiments, which is why I built a generic platform for flexibility in the future.

agentapi implements the Transactional Outbox pattern and allows any agent to be executed via API:

When insights executes an agent, agentapi simply saves AgentExecution in the pending state in the database.

And a scheduler inside agentapi checks every 1 second for pending executions in the database:

So if the LLM fails at any step of execution, my scheduler inside agentapi will restart the agent.

Transactional Outbox Pattern Implementation

The most interesting part is how to query pending agent executions, because this is a common performance bottleneck in incorrectly implemented Transactional Outbox patterns, which I've observed at different companies during my career.

This is a good topic for another article, but in short: do not use pessimistic locks to get pending jobs from the database; instead use optimistic locks. A detailed implementation is on GitHub:

Find pending agent executions - find agent executions that match criteria. Always specify a time threshold and do not query the most recent ones, because there's a possibility that another node is concurrently handling the same agent executions, and we don't want to handle them twice.
Optimistically acquire a lock for an agent execution - compare the state in the database with the in-memory state, and if they match, update the record in the database while incrementing a counter. Atomicity (the A in ACID) ensures that only one of the concurrent updates succeeds, while the others match no rows and report 0 affected rows. When we get 0 affected rows, we treat the lock as not acquired and simply skip this agent execution for now.

The lock looks like this in SQL (commit the transaction as fast as possible to minimize contention):

UPDATE agent_execution 
SET state = 'processing', 
    executions = executions + 1 
WHERE id = ? 
    AND state = 'pending' 
    AND executions = ?

So this approach is super efficient for Postgres, MySQL, CockroachDB, and MongoDB databases (I've implemented it for all of them in different projects at different companies during the last 5 years) and allows us to scale our services without a database bottleneck.

As a result, a scheduler is not guaranteed to execute as many jobs as its initial find-pending query returned. That's OK: another scheduler concurrently executes the skipped jobs, which is exactly how a distributed system should behave.

Benefits of Transactional Outbox Pattern

This pattern improves the reliability and scalability of applications, which allows us to:

Improve user experience by ensuring that almost all user requests will be processed.
Use cheaper LLMs or even self-host Open Source LLMs with less strict uptime guarantees, which will significantly reduce infrastructure bills. (This is what I did by moving from OpenAI to DeepSeek, as I mentioned above).

So again, no OpenRouter needed here to improve AI application reliability - just old software engineering practices proven over time.

Scalability

Let's consider the second important topic: scalability. We've applied the pattern described above, and our application reliably handles user requests, but how do we scale it? If the Transactional Outbox Pattern is implemented correctly, it's trivial - just deploy more nodes.

Also, with the Transactional Outbox Pattern, we no longer need to fear that adding nodes will push us over LLM request limits: even if we receive 429 errors, the failed requests will be retried.

So with the described approach, it's very easy to scale AI applications.

Conclusions

In this article, I explained why we need to use the Transactional Outbox Pattern for AI applications. This is my first article in a series on applying distributed systems knowledge in the new AI world. Key outcomes:

To achieve AI application reliability, you need to use the Transactional Outbox Pattern.
To implement the Transactional Outbox Pattern, you need to use optimistic locks, not distributed locks or pessimistic locks.
Do not rely on external services like OpenRouter for reliability; instead, use proven software engineering practices.

Subscribe to my Substack to not miss my next articles, in which I'm planning to focus on:

Observability in AI applications
Infrastructure for AI applications
Development best practices for AI applications development

Subscribe to my Substack to not miss my new articles 😊

Why LangGraph Overcomplicates AI Agents (And My Go Alternative)

Vitalii Honchar — Tue, 15 Jul 2025 15:00:00 +0000

Introduction

LangGraph tries to reinvent programming language control flow by implementing graphs for AI agent development. But here's the fundamental issue: programming languages already are graphs with compile-time validation and control flow management.

During my research into AI agent development, I built agents using Python and LangGraph for cybersecurity scanning, documented in these articles:

The key insight I discovered is that an AI agent is fundamentally just a pattern of using LLMs that looks like this:

for {
    res := callLLM(ctx)
    if res.ToolsCalling {
        ctx = executeTools(res.ToolsCalling)
    }
    if res.End {
        return
    }
}

This is simply calling an LLM in a loop and allowing the LLM to make decisions for the next step.
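Because the argument is not specific to Go, the same loop can be written as a runnable Python sketch. `LLMResponse`, `ToolCall`, and the scripted LLM are hypothetical stand-ins for a real model, and the `max_steps` bound is an extra safeguard that the loop above does not have.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class LLMResponse:
    tool_calls: List[ToolCall] = field(default_factory=list)
    end: bool = False
    answer: str = ""


def run_agent(
    call_llm: Callable[[List[str]], LLMResponse],
    tools: Dict[str, Callable[..., str]],
    max_steps: int = 10,
) -> str:
    context: List[str] = []
    for _ in range(max_steps):  # the loop is the cycle in the "graph"
        res = call_llm(context)  # vertex: ask the LLM for the next step
        for call in res.tool_calls:  # edge: did the LLM request tools?
            context.append(tools[call.name](**call.args))  # vertex: run the tool
        if res.end:  # edge: did the LLM decide to stop?
            return res.answer
    raise RuntimeError("agent did not finish within max_steps")


def scripted_llm(context: List[str]) -> LLMResponse:
    # Fake LLM: first asks for the add tool, then answers using its result.
    if not context:
        return LLMResponse(tool_calls=[ToolCall("add", {"a": 2, "b": 3})])
    return LLMResponse(end=True, answer=f"sum is {context[-1]}")


def add(a: float, b: float) -> str:
    return str(a + b)


print(run_agent(scripted_llm, {"add": add}))
```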


The LangGraph Problem

LangGraph proposes using graph structures to implement application flow:

This introduces unnecessary complexity because programming languages already implement graph structures with compile-time flow validation. In LangGraph:

Vertices specify business logic
Edges specify control flow

In any programming language, the same functionality is achieved with standard language constructs:

Operators specify business logic
Conditions (if/else) specify control flow

The agent code example demonstrates this natural graph structure:

for {
    res := callLLM(ctx)     // vertex (business logic)
    if res.ToolsCalling {   // edge (control flow)
        ctx = executeTools(res.ToolsCalling) // vertex (business logic)
    }
    if res.End {            // edge (control flow)
        return
    }
}

LangGraph compiles graphs and performs validation, which adds little value in compiled programming languages that already provide these guarantees. This observation led me to develop my own AI agent library that leverages existing language features instead of reimplementing them.

The go-agent Library

Current Status: Active development, not production-ready

GitHub: https://github.com/vitalii-honchar/go-agent

Features:

ReAct Agent support
OpenAI API integration
Type-safe AI agent development

I chose Go for several technical advantages over Python:

Strict compilation checks catch errors at build time
True parallelism with goroutines vs Python's GIL limitations
Superior performance for infrastructure workloads
Better suited for engineering tasks rather than data science experiments

Instead of implementing graph abstractions, I focused on agent patterns. The first implementation targets the ReAct pattern:

// Define tool parameters with JSON schema validation
type AddToolParams struct {
    Num1 float64 `json:"num1" jsonschema_description:"First number to add"`
    Num2 float64 `json:"num2" jsonschema_description:"Second number to add"`
}

type AddResult struct {
    llm.BaseLLMToolResult
    Sum float64 `json:"sum" jsonschema_description:"Sum of the two numbers"`
}

// Create type-safe tool with validation
addTool := llm.NewLLMTool(
    llm.WithLLMToolName("add"),
    llm.WithLLMToolDescription("Adds two numbers together"),
    llm.WithLLMToolParametersSchema[AddToolParams](),
    llm.WithLLMToolCall(func(callID string, params AddToolParams) (AddResult, error) {
        return AddResult{
            BaseLLMToolResult: llm.BaseLLMToolResult{ID: callID},
            Sum:              params.Num1 + params.Num2,
        }, nil
    }),
)

// Configure agent with usage limits and behavior
calculatorAgent, err := agent.NewAgent(
    agent.WithName[CalculatorResult]("calculator"),
    agent.WithLLMConfig[CalculatorResult](llmConfig),
    agent.WithBehavior[CalculatorResult]("Use the add tool to calculate sums. Do not calculate manually."),
    agent.WithTool[CalculatorResult]("add", addTool),
    agent.WithToolLimit[CalculatorResult]("add", 5), // Maximum 5 calls
)

Developer Experience Advantages

The library requires developers to specify only:

Tools that the agent can use
Behavior prompts focused on domain-specific tasks

The system prompt for ReAct pattern implementation is handled automatically (source):

var systemPromptTemplate = NewPrompt(`You are an agent that implements the ReAct ` +
    `(Reasoning-Action-Observation) pattern to solve tasks through systematic thinking and tool usage.

## REASONING PROTOCOL

Before EVERY action:
1. **THINK**: State your reasoning for the next step
2. **ACT**: Execute the appropriate tool with complete parameters
3. **OBSERVE**: Analyze the results and their implications

Always maintain explicit reasoning chains. Your thoughts should be visible and logical.

## EXECUTION CONTEXT

TOOLS AVAILABLE TO USE:
{{.tools}}

CURRENT TOOLS USAGE:
{{.tools_usage}}

TOOLS USAGE LIMITS:
{{.calling_limits}}

## AGENT BEHAVIOR

<BEHAVIOR>
{{.behavior}}
</BEHAVIOR>
`)

This abstraction allows developers to focus on business logic rather than ReAct implementation details.
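For readers more at home in Python, here is a rough analogue of how such a template could be filled in on each step with `string.Template`. The abbreviated template, the function names, and the "- name: value" formatting are assumptions; how go-agent actually renders `{{.tools}}`, `{{.tools_usage}}`, and `{{.calling_limits}}` may differ.

```python
from string import Template
from typing import Dict

# Abbreviated, hypothetical analogue of the Go template above.
SYSTEM_PROMPT = Template(
    "You are an agent that implements the ReAct pattern.\n"
    "TOOLS AVAILABLE TO USE:\n$tools\n"
    "CURRENT TOOLS USAGE:\n$tools_usage\n"
    "TOOLS USAGE LIMITS:\n$calling_limits\n"
    "<BEHAVIOR>\n$behavior\n</BEHAVIOR>"
)


def as_lines(mapping: Dict[str, object]) -> str:
    return "\n".join(f"- {name}: {value}" for name, value in sorted(mapping.items()))


def build_system_prompt(
    tools: Dict[str, str],   # tool name -> description (supplied by the developer)
    usage: Dict[str, int],   # tool name -> calls so far (tracked by the library)
    limits: Dict[str, int],  # tool name -> maximum calls (supplied by the developer)
    behavior: str,           # domain-specific prompt (supplied by the developer)
) -> str:
    return SYSTEM_PROMPT.substitute(
        tools=as_lines(tools),
        tools_usage=as_lines({name: usage.get(name, 0) for name in tools}),
        calling_limits=as_lines(limits),
        behavior=behavior,
    )


print(build_system_prompt(
    {"add": "Adds two numbers together"},
    {"add": 2},
    {"add": 5},
    "Use the add tool to calculate sums. Do not calculate manually.",
))
```

Only the tools, limits, and behavior come from the developer; the ReAct instructions and the usage counters belong to the library.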

Flexible LLM Configuration

The library supports flexible LLM configuration with a simple interface:

agent.WithLLMConfig[HashResult](llm.LLMConfig{
    Type:        llm.LLMTypeOpenAI,
    APIKey:      apiKey,
    Model:       "gpt-4o",
    Temperature: 0.0,
})

Currently, only the OpenAI API is supported, with other providers planned.

Development Roadmap

The go-agent library is in early development. I'm building real AI agents with it to refine the API before releasing version 1.0.0. Planned features include:

Memory support for persistent agent state
Ollama integration for local LLM deployment
Multi-agent orchestration capabilities
Concurrent tool execution leveraging Go's parallelism
Advanced error handling patterns

Technical Philosophy

I built go-agent because I see AI agents becoming critical infrastructure components that require:

High performance for production workloads
Strong guarantees through type safety
Maintainability by software engineering teams

The separation of concerns should be:

Software engineers build and maintain the agent infrastructure layer
Data scientists/prompt engineers develop domain-specific prompts and behavior

This division of responsibility makes LangGraph's approach problematic due to Python's performance limitations and the unnecessary complexity of reimplementing control flow that programming languages already provide efficiently.

Conclusion

LangGraph attempts to solve problems that don't exist in compiled languages while introducing complexity that hinders development velocity. The go-agent library demonstrates that AI agents can be built more efficiently by leveraging existing language features rather than creating new abstractions.

By focusing on what actually matters—type safety, performance, and developer productivity—we can build more reliable AI agent systems that scale with real-world infrastructure demands.


Pipeline of Agents Pattern: Building Maintainable AI Workflows with LangGraph

Vitalii Honchar — Tue, 08 Jul 2025 09:09:00 +0000

Introduction

In the previous article, How to Build a ReAct AI Agent for Cybersecurity Scanning with Python and LangGraph, I explained how to build a simple ReAct Agent to scan a web target for vulnerabilities. But a cybersecurity audit involves more than scanning. It includes:

Scanning Stage - get information about possible vulnerabilities in the target.
Attacking Stage - try to exploit vulnerabilities and prove our hypothesis from the scanning stage.
Reporting Stage - create a comprehensive report that the company which requested the audit can use to apply fixes.

To build this, I first tried a single graph, but then realized that this approach is not flexible and violates the Single Responsibility Principle from SOLID.

That's why I built a pipeline of agents, where each agent is responsible for only one thing and does it well.


Pipeline of Agents

Pipeline of Agents is an architectural pattern that chains specialized AI agents in a sequential workflow, where each agent processes the output of the previous agent and passes refined data to the next. Unlike monolithic agents that try to do everything, pipeline agents follow the single responsibility principle - each agent excels at one specific task.

Main characteristics of this pattern:

Each agent has a single, specialized responsibility.
Sequential execution with data flow - output from Agent N becomes input for Agent N + 1.
Composable and modular - you can swap agents or change the pipeline order.
State isolation - agents don't share internal state, only defined outputs.
Failure handling - a failure is isolated to the stage where it happened, so it's easy to see which agent broke.
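A framework-agnostic sketch of these characteristics in plain Python. The agents here are stub functions rather than LangGraph graphs, and `run_pipeline`, `StageResult`, and `PipelineError` are hypothetical names: each stage sees only the previous stage's output, and a failure is reported together with the name of the stage that caused it.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Agent = Callable[[str], str]


@dataclass(frozen=True)
class StageResult:
    stage: str
    output: str


class PipelineError(RuntimeError):
    def __init__(self, stage: str, cause: Exception):
        super().__init__(f"stage '{stage}' failed: {cause}")
        self.stage = stage


def run_pipeline(stages: List[Tuple[str, Agent]], initial_input: str) -> List[StageResult]:
    """Run agents in order; each agent only receives the previous agent's output."""
    results: List[StageResult] = []
    data = initial_input
    for name, agent in stages:
        try:
            data = agent(data)
        except Exception as error:
            # Stop immediately and report which stage failed.
            raise PipelineError(name, error) from error
        results.append(StageResult(name, data))
    return results


# Stub agents; in a real system each would be a small, separately tested graph.
def scan_agent(target: str) -> str:
    return f"scan({target})"


def attack_agent(scan_summary: str) -> str:
    return f"attack({scan_summary})"


def summary_agent(attack_summary: str) -> str:
    return f"summary({attack_summary})"


pipeline = [("scan", scan_agent), ("attack", attack_agent), ("summary", summary_agent)]
print(run_pipeline(pipeline, "https://example.com")[-1].output)
```

This is a simplification: in the design described in this article, the summary step uses both the scan and attack summaries, which the returned `results` list would make available.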

Why Use Pipeline of Agents?

Let's look at an example of why we should use a Pipeline of Agents. My original implementation of the Cyber Security AI Agent didn't use it, and the system was very hard to develop and maintain. Here is the code that builds its graph:

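from langchain_openai import ChatOpenAI
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import END, START, StateGraph
from langgraph.graph.state import CompiledStateGraph
from langgraph.prebuilt import ToolNode

# TargetScanState, the node/edge classes and the tools are project-specific
# (see the original repository); their imports are omitted here.


def create_graph() -> CompiledStateGraph: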
    llm = ChatOpenAI(model="gpt-4o", temperature=0.3)

    # tools
    attack_tools = [ffuf_directory_scan, curl_tool, flexible_http_tool]
    scan_tools = [ffuf_directory_scan]

    llm_with_attack_tools = llm.bind_tools(attack_tools, parallel_tool_calls=True)
    llm_with_scan_tools = llm.bind_tools(scan_tools, parallel_tool_calls=True)

    # nodes init
    process_tool_result_node = ProcessToolResultNode(llm=llm)
    generate_report_node = GenerateReportNode(llm=llm)
    scan_target_node = ScanTargetNode(llm_with_tools=llm_with_scan_tools)
    attack_target_node = AttackTargetNode(llm_with_tools=llm_with_attack_tools)

    # edges init
    scan_tools_router = ToolRouterEdge(
        origin_node="scan_target_node",
        tools_type="scan",
        end_node="attack_target_node",
        tools_node="scan_tools",
    )
    attack_tools_router = ToolRouterEdge(
        origin_node="attack_target_node",
        tools_type="attack",
        end_node="generate_report",
        tools_node="attack_tools",
    )

    # graph init
    builder = StateGraph(TargetScanState)

    # nodes
    builder.add_node("scan_target_node", scan_target_node)
    builder.add_node("attack_target_node", attack_target_node)
    builder.add_node("scan_tools", ToolNode(scan_tools))
    builder.add_node("attack_tools", ToolNode(attack_tools))
    builder.add_node(
        "process_scan_results", process_tool_result_node.process_tool_results
    )
    builder.add_node(
        "process_attack_results", process_tool_result_node.process_tool_results
    )
    builder.add_node("generate_report", generate_report_node.generate_report)

    # edges
    builder.add_edge(START, "scan_target_node")
    builder.add_conditional_edges("scan_target_node", scan_tools_router)
    builder.add_conditional_edges("attack_target_node", attack_tools_router)

    builder.add_edge("scan_tools", "process_scan_results")
    builder.add_edge("process_scan_results", "scan_target_node")

    builder.add_edge("attack_tools", "process_attack_results")
    builder.add_edge("process_attack_results", "attack_target_node")

    builder.add_edge("generate_report", END)

    # Add memory checkpointer for state persistence
    memory = MemorySaver()
    return builder.compile(checkpointer=memory)

Here I'm just creating a bunch of nodes with edges that conditionally jump from one node to another. There is no single responsibility, no specialization. This code is hard to develop and hard to test: I can't test only the "scanning" stage of my system, because the only way to test it is to launch the whole workflow, and a bug in the "attacking" stage hides bugs from the "scanning" stage.

So LangGraph graphs should be small and simple, like microservices, for efficient development. That's why the Pipeline of Agents is a good solution in my case: it lets me split my big graph into:

Scan Agent graph
Attack Agent graph

I can develop and test these two graphs in isolation, and only then build the whole pipeline for my Cyber Security Agent by combining the two smaller agents.

High Level Design

User sends input information about a target.
Scan Agent scans a target and generates scan summary.
Attack Agent attacks a target and generates attack summary.
Summary Generation generates a final summary based on the scan and attack summaries.

Scan ReAct Agent

This agent uses the ReAct architecture with these tools:

ffuf - for enumerating possible endpoints.
curl - for quickly testing the enumeration output or performing custom tests.

I covered the ReAct pattern and this agent's implementation in detail in the How to Build a ReAct AI Agent for Cybersecurity Scanning with Python and LangGraph article.

Attack ReAct Agent

The Attack Agent uses the same ReAct architecture with only the curl tool to exploit vulnerabilities. I built it as part of a research project, so one tool was enough; other tools can easily be added if needed.

Summary Generation

Implementation

To implement this system I used:

LangGraph
Python

Source code is available on GitHub.

LangGraph Graph

User sends a URL.
scan_agent_node executes Scan Agent and performs web target scanning.
scan_agent_node produces scanning result as an output.
attack_agent_node executes Attack Agent with an output from scan_agent_node and performs target attack.
summary_node generates summary based on outputs from scan_agent_node and attack_agent_node

There is code which builds this graph in Python (also available on GitHub):

from langchain_openai import ChatOpenAI
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import END, START, StateGraph
from langgraph.graph.state import CompiledStateGraph

from cybersecurity_agent.node import ScanAgentNode, AttackAgentNode, CybersecuritySummaryNode
from cybersecurity_agent.state import CybersecurityAgentState


def create_cybersecurity_graph(
    scan_react_limit: int = 25,
    scan_ffuf_limit: int = 2,
    scan_curl_limit: int = 5,
    attack_react_limit: int = 25,
    attack_curl_limit: int = 10,
) -> CompiledStateGraph:
    llm = ChatOpenAI(model="gpt-4.1-2025-04-14", temperature=0.3)

    # Use parameterized wrapper nodes with configurable limits
    scan_agent_node = ScanAgentNode(
        react_usage_limit=scan_react_limit,
        ffuf_tool_limit=scan_ffuf_limit,
        curl_tool_limit=scan_curl_limit,
    )
    attack_agent_node = AttackAgentNode(
        react_usage_limit=attack_react_limit,
        curl_tool_limit=attack_curl_limit,
    )
    cybersecurity_summary_node = CybersecuritySummaryNode(llm=llm)

    # Build the graph
    builder = StateGraph(CybersecurityAgentState)

    # Add nodes that use compiled sub-graphs internally
    builder.add_node("scan_agent", scan_agent_node)
    builder.add_node("attack_agent", attack_agent_node)
    builder.add_node("cybersecurity_summary", cybersecurity_summary_node)

    # Define the workflow: scan -> attack -> summary
    builder.add_edge(START, "scan_agent")
    builder.add_edge("scan_agent", "attack_agent")
    builder.add_edge("attack_agent", "cybersecurity_summary")
    builder.add_edge("cybersecurity_summary", END)

    return builder.compile(checkpointer=MemorySaver())

Passing State for Child Graphs

All my agents are built with LangGraph, so each of them is a graph. LangGraph can embed one graph inside another, but I don't want to share my parent graph's state with the child graphs, so I decided not to embed them. Instead, I built a node wrapper that converts the parent graph state into the child graph state and passes only the data the child graph needs.

This lets me hide information and pass each agent only the minimum data it needs. This is my parent graph state (see on GitHub):

class CybersecurityAgentState(TypedDict):
    target: Target
    scan_summary: ScanAgentSummary | None
    attack_summary: AttackReportSummary | None
    cybersecurity_report: CybersecurityReport | None

And this is my child graph state:

class AttackAgentState(ReActAgentState):
    scan_summary: ScanAgentSummary
    attack_summary: AttackReportSummary | None

To hide information, I built a wrapper node for each agent.
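Stripped of LangGraph, the wrapper is an adapter with three parts. It projects the parent state onto the child's input, runs the child, and maps the child's output back to a parent update. This is a minimal sketch under those assumptions. `make_wrapper_node`, `fake_attack_agent` and the simplified string-typed states are hypothetical. The real wrapper nodes build ReAct state objects and call `ainvoke` on a compiled graph.

```python
from typing import Callable, TypedDict


class ParentState(TypedDict, total=False):
    target: str
    scan_summary: str
    attack_summary: str
    report: str


class AttackInput(TypedDict):
    target: str
    scan_summary: str


def make_wrapper_node(
    child: Callable[[dict], dict],
    to_child: Callable[[ParentState], dict],
    from_child: Callable[[dict], dict],
) -> Callable[[ParentState], dict]:
    """Adapt a child agent to a parent graph without sharing the full parent state."""

    def node(parent_state: ParentState) -> dict:
        child_input = to_child(parent_state)  # project the parent state onto the child input
        child_output = child(child_input)  # the child never sees other parent keys
        return from_child(child_output)  # map the child output back to a parent update

    return node


seen_keys: list[set[str]] = []


def fake_attack_agent(state: dict) -> dict:
    # Hypothetical child agent: records the keys it received and returns extra internals.
    seen_keys.append(set(state))
    return {"attack_summary": f"tested {state['target']}", "messages": ["internal"]}


attack_node = make_wrapper_node(
    fake_attack_agent,
    to_child=lambda s: AttackInput(target=s["target"], scan_summary=s["scan_summary"]),
    from_child=lambda out: {"attack_summary": out["attack_summary"]},
)

update = attack_node(
    {"target": "http://localhost:8000", "scan_summary": "2 endpoints", "report": "stale"}
)
```

Information is hidden in both directions. The child never receives `report`, and the child's internal `messages` never leak into the parent state.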

Scan Agent Execution

The detailed implementation of the Scan Agent is described in How to Build a ReAct AI Agent for Cybersecurity Scanning with Python and LangGraph.

Wrapper node for Scan Agent (see on GitHub):

from langchain_core.runnables.config import RunnableConfig

from scan_agent.graph import create_scan_graph
from cybersecurity_agent.state import CybersecurityAgentState
from agent_core.state import ReActUsage, Tools, ToolsUsage
from agent_core.tool import CURL_TOOL, FFUF_TOOL


class ScanAgentNode:
    def __init__(
        self,
        react_usage_limit: int = 25,
        ffuf_tool_limit: int = 2,
        curl_tool_limit: int = 5,
    ):
        self.scan_graph = create_scan_graph()
        self.react_usage_limit = react_usage_limit
        self.ffuf_tool_limit = ffuf_tool_limit
        self.curl_tool_limit = curl_tool_limit

    async def __call__(self, state: CybersecurityAgentState) -> dict:
        scan_state = {
            "target": state["target"],
            "usage": ReActUsage(limit=self.react_usage_limit),
            "tools_usage": ToolsUsage(
                limits={
                    FFUF_TOOL.name: self.ffuf_tool_limit,
                    CURL_TOOL.name: self.curl_tool_limit,
                }
            ),
            "tools": Tools(tools=[FFUF_TOOL, CURL_TOOL]),
        }

        config = RunnableConfig(
            max_concurrency=10,
            recursion_limit=25,
            configurable={"thread_id": f"scan_{hash(str(state['target']))}"},
        )

        final_state = await self.scan_graph.ainvoke(scan_state, config)

        scan_summary = final_state.get("summary")

        return {"scan_summary": scan_summary}

This node just creates a state for the Scan Agent and executes it. The Scan Agent knows nothing about the parent state, or even that it is part of a bigger flow. On top of that, the wrapper gives me the flexibility to define which tools and limits the Scan Agent is allowed to use.

Attack Agent Execution

The Attack Agent node looks similar to the Scan Agent node (see on GitHub):

from langchain_core.runnables.config import RunnableConfig

from attack_agent.graph import create_attack_graph
from cybersecurity_agent.state import CybersecurityAgentState
from agent_core.state import ReActUsage, Tools, ToolsUsage
from agent_core.tool import CURL_TOOL


class AttackAgentNode:
    def __init__(
        self,
        react_usage_limit: int = 25,
        curl_tool_limit: int = 20,
    ):
        self.attack_graph = create_attack_graph()
        self.react_usage_limit = react_usage_limit
        self.curl_tool_limit = curl_tool_limit

    async def __call__(self, state: CybersecurityAgentState) -> dict:
        attack_state = {
            "target": state["target"],
            "scan_summary": state["scan_summary"],
            "usage": ReActUsage(limit=self.react_usage_limit),
            "tools_usage": ToolsUsage(
                limits={
                    CURL_TOOL.name: self.curl_tool_limit,
                }
            ),
            "tools": Tools(tools=[CURL_TOOL]),
        }

        config = RunnableConfig(
            max_concurrency=10,
            recursion_limit=25,
            configurable={"thread_id": f"attack_{hash(str(state['target']))}"},
        )

        final_state = await self.attack_graph.ainvoke(attack_state, config)
        attack_summary = final_state.get("attack_summary")

        return {"attack_summary": attack_summary}

Summary Generation

To generate the summary, I pass the outputs of both the Scan and Attack Agents. The scan output may contain findings that weren't useful for the attack process and that the Attack Agent simply ignored, but in the summary I want to see all information from the cybersecurity assessment (see on GitHub):

import json
from langchain_core.language_models import BaseChatModel
from langchain_core.messages import SystemMessage

from cybersecurity_agent.state import CybersecurityAgentState
from cybersecurity_agent.state.cybersecurity_agent_state import CybersecurityReport

CYBERSECURITY_SUMMARY_PROMPT = "Omitted for simplicity. Full prompt available on GitHub."

class CybersecuritySummaryNode:
    def __init__(self, llm: BaseChatModel):
        self.structured_llm = llm.with_structured_output(CybersecurityReport)

    def __call__(self, state: CybersecurityAgentState) -> dict:
        target = state["target"]
        scan_summary = state["scan_summary"]
        attack_summary = state["attack_summary"]

        system_prompt = CYBERSECURITY_SUMMARY_PROMPT.format(
            target_url=target.url,
            target_description=target.description,
            target_type=target.type,
            scan_summary=scan_summary.model_dump_json() if scan_summary else "No reconnaissance data available",
            attack_summary=attack_summary.model_dump_json() if attack_summary else "No attack execution data available"
        )

        # Create a simple message to trigger the analysis
        user_message = "Please analyze the provided reconnaissance and exploitation data to create a comprehensive cybersecurity assessment report."

        prompt_messages = [
            SystemMessage(content=system_prompt),
            {"role": "user", "content": user_message}
        ]

        cybersecurity_report = self.structured_llm.invoke(prompt_messages)

        return {"cybersecurity_report": cybersecurity_report}

Testing

To test the system, I generated a vulnerable application with Claude Code (available on GitHub). After that, I simply executed my flow from a Jupyter Notebook.

The result contains a lot of detail that is useful both for a business owner who requested security testing of a system and for a security team.

Summary

In this article I explained how to build a pipeline of agents using LangGraph node wrappers in Python. The resulting pipeline is a powerful system with strict control over each agent's execution.

All code from this article is available on GitHub.

The previous article in the Cyber Security AI Agent series is: How to Build a ReAct AI Agent for Cybersecurity Scanning with Python and LangGraph

Subscribe to my Substack to not miss my new articles 😊

How to Build a ReAct AI Agent for Cybersecurity Scanning with Python and LangGraph

Vitalii Honchar — Tue, 24 Jun 2025 09:09:00 +0000

Introduction

ReAct agents are tricky to implement correctly. In this article I will show how to do it, using the example of a cybersecurity AI Agent that finds vulnerabilities in a given web target. Today I will explain:

How to use tokens efficiently in ReAct agents
How to force ReAct agents to use tools sufficiently instead of being lazy


Theory

Before we jump to the implementation, let's first define what an AI Agent is and how to build one.

AI Agent

An AI Agent is a popular architecture pattern built from an LLM, a loop, and actions (tools) that the LLM can perform. The LLM acts as the brain that decides what to do next, replacing the hand-written decision logic that software relied on before.

Traditional automation breaks when conditions change. Agents adapt. That's the real value - resilience, not just "LLM with extra steps."

Here is a very basic AI Agent architecture:

The user executes the AI Agent.
The LLM decides on its own to call a tool to perform an action.
The tool returns a result to the LLM, which lets the LLM make a new decision. This process loops until the LLM decides that a result can be provided to the user or certain conditions are met.
The LLM produces the final result of the agent.

There are different patterns for building AI Agents, and to me they are similar to classic software patterns like Factory, Singleton, or Strategy. In this article I focus on the simplest one - the ReAct pattern.

ReAct Agent Pattern

ReAct is a pattern for AI Agents with these steps:

Reason - LLM thinks about data or tool results
Act - LLM calls tools to perform some actions
Observe - LLM handles results of tool execution
LLM provides final result
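The Reason/Act/Observe cycle fits in a single loop. The following is a minimal sketch with a scripted function standing in for the LLM. `Decision`, `react_loop`, `scripted_llm` and the fake `curl` tool are hypothetical names for illustration only, not part of LangGraph or the project's code. The `max_steps` budget plays the same role as the usage limits described later in this article.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    """What the (stubbed) LLM decided after reasoning over the observations."""
    tool: str | None = None
    argument: str = ""
    answer: str | None = None


def react_loop(
    decide: Callable[[list[tuple[str, str, str]]], Decision],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> tuple[str | None, list[tuple[str, str, str]]]:
    observations: list[tuple[str, str, str]] = []
    for _ in range(max_steps):
        decision = decide(observations)  # Reason: look at what was observed so far
        if decision.tool is None:
            return decision.answer, observations  # final result
        result = tools[decision.tool](decision.argument)  # Act: call the chosen tool
        observations.append((decision.tool, decision.argument, result))  # Observe
    return None, observations  # step budget exhausted without a final answer


def scripted_llm(observations: list[tuple[str, str, str]]) -> Decision:
    # Hypothetical stand-in for the LLM: probe the target once, then answer.
    if not observations:
        return Decision(tool="curl", argument="http://localhost:8000/health")
    return Decision(answer=f"Observed: {observations[-1][2]}")


fake_tools = {"curl": lambda url: f"200 OK from {url}"}
answer, trace = react_loop(scripted_llm, fake_tools)
```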

Project Requirements

To learn how to build a ReAct Agent, I decided to build a vulnerability scanning AI Agent. It will accept a web service URL as input and provide a vulnerability report as output.

I used this technology stack to build it:

Python
LangGraph

System Design

User specifies a target to scan.
scan_node asks LLM to reason about context and make a decision about the next step.
scan_node uses tools to perform Web Target scanning if LLM decides to do so.
tools scan Web Target.
scan_node calls summary_node if LLM decides that no additional tool usage is required.
summary_node provides context about scanning results to LLM.
LLM produces summary output

So it's a basic ReAct pattern with an extra node for summary generation. This approach produces a higher-quality summary than consuming the ReAct loop's final output directly.

Implementation

Short term memory - graph state

Code available on GitHub

To implement short term memory I used this graph state:

class ReActAgentState(MessagesState):
    usage: ReActUsage
    tools_usage: ToolsUsage
    tools: Tools
    results: Annotated[list[ToolResult], operator.add]
    target: Target

Which contains:

tools_usage - to track current tool usage and make sure it doesn't exceed its limits.
usage - to track how many ReAct iterations have run and stop before the graph recursion limit is reached.
tools - a dynamic list of tools that users can specify during graph execution, which makes this graph reusable.
results - a list of tool execution results, used to reduce LLM token usage.
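The real ReActUsage and ToolsUsage classes live in agent_core and are not shown in this article. Here is a hypothetical sketch of what such limit trackers could look like, based only on how they are called elsewhere in this article (`is_limit_reached()`, `increment_usage(name)`, `is_limit_reached(tool_names)`, `to_dict()`). The class names carry a `Sketch` suffix to make clear they are not the project's code. The "all listed tools exhausted" semantics of `is_limit_reached(tool_names)` is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ReActUsageSketch:
    """Counts ReAct iterations (LLM decision steps) against a hard limit."""
    limit: int
    current: int = 0

    def increment(self) -> None:
        self.current += 1

    def is_limit_reached(self) -> bool:
        return self.current >= self.limit


@dataclass
class ToolsUsageSketch:
    """Counts calls per tool name against per-tool limits."""
    limits: dict[str, int]
    usage: dict[str, int] = field(default_factory=dict)

    def increment_usage(self, tool_name: str) -> None:
        self.usage[tool_name] = self.usage.get(tool_name, 0) + 1

    def is_limit_reached(self, tool_names: list[str]) -> bool:
        # Assumption: "reached" means every listed tool has used up its budget.
        # A tool without a configured limit counts as exhausted (limit 0).
        return all(self.usage.get(n, 0) >= self.limits.get(n, 0) for n in tool_names)

    def to_dict(self) -> dict:
        # Serializable form, e.g. for json.dumps in the system prompt.
        return {"limits": dict(self.limits), "usage": dict(self.usage)}


tools_usage = ToolsUsageSketch(limits={"ffuf": 2, "curl": 5})
tools_usage.increment_usage("ffuf")
tools_usage.increment_usage("ffuf")
react_usage = ReActUsageSketch(limit=25)
```

With these limits, ffuf alone is exhausted after two calls, but the ffuf and curl pair is not, because curl still has budget left.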

In a ReAct agent, the LLM calls tools and then has to parse their results in the Reason step to decide what to do next. The default approach in LangGraph has a problem:

Call a tool.
Receive the tool result as a message.
Perform reasoning.

The history of tool executions stays in the message list until the current graph run reaches the end, and in the default setup the whole list is sent to the LLM on every call. This makes token usage enormously high. To reduce it, I save tool execution results in the state.results field and give the LLM only these compact results instead of the full message history.
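The `results` field relies on LangGraph's reducer semantics. A node returns a partial update. A field annotated with a reducer, like `Annotated[list[ToolResult], operator.add]`, is combined with the existing value, while a plain field is overwritten. The toy function below only mimics that behavior to make it concrete. `ToyState` and `merge_update` are hypothetical, and this is not LangGraph's actual implementation.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class ToyState(TypedDict):
    results: Annotated[list[str], operator.add]  # accumulated across node updates
    last_node: str  # plain field: overwritten on every update


def merge_update(schema: type, state: dict, update: dict) -> dict:
    """Toy re-creation of reducer semantics; not LangGraph's actual code."""
    hints = get_type_hints(schema, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints[key], "__metadata__", ())
        if metadata and key in merged:
            merged[key] = metadata[0](merged[key], value)  # operator.add concatenates lists
        else:
            merged[key] = value
    return merged


state = {"results": ["ffuf: /admin found"], "last_node": "scan_node"}
state = merge_update(
    ToyState,
    state,
    {"results": ["curl: /admin -> 200"], "last_node": "process_tool_results_node"},
)
```

Because `results` only holds compact tool outputs, the prompt can include `json.dumps` of this list instead of replaying every AI and tool message.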

Common node - ReActNode

Code available on GitHub.

The scan_node inherits from a common ReAct node, which I introduced to avoid duplicating work in the future:

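import json
import logging
from abc import ABC, abstractmethod

from langchain_core.language_models import LanguageModelInput
from langchain_core.messages import BaseMessage, SystemMessage
from langchain_core.runnables import Runnable

from agent_core.state import ReActAgentState

system_prompt = """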
You are an agent that should act as specified in escaped content <BEHAVIOR></BEHAVIOR>.

TOOLS AVAILABLE TO USE:
{tools}

TOOLS USAGE LIMITS:
{tools_usage}

TOOLS CALLING LIMITS:
{calling_limits}

PREVIOUS TOOLS EXECUTION RESULTS:
{tools_results}

<BEHAVIOR>
{behavior}
</BEHAVIOR>
"""


class ReActNode[StateT: ReActAgentState](ABC):
    def __init__(self, llm_with_tools: Runnable[LanguageModelInput, BaseMessage]):
        self.llm_with_tools = llm_with_tools

    def __call__(self, state: StateT) -> dict:
        prompt = system_prompt.format(
            tools=json.dumps(state["tools"].to_dict()),
            tools_usage=json.dumps(state["tools_usage"].to_dict()),
            calling_limits=json.dumps(state["usage"].to_dict()),
            tools_results=json.dumps([r.to_dict() for r in state.get("results", [])]),
            behavior=self.get_system_prompt(state),
        )
        system_message = SystemMessage(prompt)

        res = self.llm_with_tools.invoke([system_message])

        logging.debug(
            "[ReActNode] Executed LLM request: state = %s, response = %s", state, res
        )
        return {"messages": [res]}

    @abstractmethod
    def get_system_prompt(self, state: StateT) -> str:
        pass

The StateT type parameter is generic (this uses the Python 3.12 generic class syntax), so any subclass can specify its own state type, which makes this node very flexible. In the __call__ method I build a prompt that controls tool usage. Since the tools field is dynamic, I don't need to hardcode a tool usage guide in my system prompt; it is generated from the tools field of the state.

Subclasses of ReActNode implement the get_system_prompt method, which returns a node-specific prompt. They don't need to care about the common logic implemented in ReActNode.
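The split between shared prompt assembly and node-specific behavior is the template method pattern. Here is a dependency-free sketch of it. It uses `TypeVar`/`Generic` instead of the Python 3.12 `class ReActNode[StateT: ...]` syntax so it also runs on older interpreters. `PromptNode`, `ScanPromptNode` and the shortened `BASE_PROMPT` are hypothetical simplifications, and no LLM is called.

```python
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

StateT = TypeVar("StateT", bound=dict)

# Shortened, hypothetical version of the shared system prompt.
BASE_PROMPT = "TOOLS AVAILABLE TO USE:\n{tools}\n<BEHAVIOR>\n{behavior}\n</BEHAVIOR>"


class PromptNode(ABC, Generic[StateT]):
    def build_prompt(self, state: StateT) -> str:
        # Shared part: the tool list comes from the state, so it is never hardcoded.
        tools = ", ".join(state["tools"])
        return BASE_PROMPT.format(tools=tools, behavior=self.get_system_prompt(state))

    @abstractmethod
    def get_system_prompt(self, state: StateT) -> str:
        """Subclasses return only their task-specific behavior."""


class ScanPromptNode(PromptNode[dict]):
    def get_system_prompt(self, state: dict) -> str:
        return f"Scan {state['target']} for vulnerabilities."


prompt = ScanPromptNode().build_prompt(
    {"tools": ["ffuf", "curl"], "target": "http://localhost:8000"}
)
```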

Core of the system - scan_node

Code available on GitHub.

The scan_node node is an instance of ScanNode, a subclass of the ReActNode class:

from typing import override

from langchain_core.language_models import LanguageModelInput
from langchain_core.messages import AIMessage, BaseMessage, SystemMessage
from langchain_core.runnables import Runnable

from agent_core.node import ReActNode
from scan_agent.state import ScanAgentState

SCAN_BEHAVIOR_PROMPT = "Omitted for simplicity. Full prompt available on GitHub."


class ScanNode(ReActNode[ScanAgentState]):
    def __init__(self, llm_with_tools: Runnable[LanguageModelInput, BaseMessage]):
        super().__init__(llm_with_tools=llm_with_tools)

    @override
    def get_system_prompt(self, state: ScanAgentState) -> str:
        target = state.get("target", {})
        target_url = getattr(target, "url", "Unknown") if target else "Unknown"
        target_description = (
            getattr(target, "description", "No description provided")
            if target
            else "No description provided"
        )

        return SCAN_BEHAVIOR_PROMPT.format(
            target_url=target_url, target_description=target_description
        )

It simply provides the scan-specific task prompt.

Control Edge - ToolRouterEdge

Code available on GitHub.

To control graph execution I used a dynamic routing edge:

import logging
from dataclasses import dataclass

from langchain_core.messages import AIMessage

from agent_core.state import ReActAgentState


@dataclass
class ToolRouterEdge[StateT: ReActAgentState]:
    origin_node: str
    end_node: str
    tools_node: str

    def __call__(self, state: StateT) -> str:
        """Route based on tool calls and limits"""
        last_message = state["messages"][-1]
        usage = state["usage"]
        tools_usage = state["tools_usage"]
        tools = state["tools"]
        tools_names = [t.name for t in tools.tools]

        if usage.is_limit_reached():
            logging.info(
                "Limit is reached, routing to end node: usage = %s, end_node = %s",
                usage,
                self.end_node,
            )
            return self.end_node

        if isinstance(last_message, AIMessage) and last_message.tool_calls:
            logging.info("Routing to tools node: %s", self.tools_node)
            return self.tools_node

        if not tools_usage.is_limit_reached(tools_names):
            logging.info(
                "Limit is not reached: tools = %s, usage = %s, origin_node = %s",
                tools_names,
                tools_usage,
                self.origin_node,
            )
            return self.origin_node

        logging.info(
            "ToolRouterEdge: No tool calls found in the last message. "
            "Usage limit reached. Routing to end node: %s. "
            "Last message: %s",
            self.end_node,
            last_message,
        )
        return self.end_node

This decides what node to call next based on the LLM decision and current tool usage.

Usually in the ReAct pattern the LLM decides which tools to call. But an LLM is not deterministic: sometimes it is lazy and uses a tool only once, or never, and we can't predict when it will behave like this. To avoid such cases, I use the LLM only as a "decision engine" and control tool calling with good old if-else code:

Problem	Solution
LLM keeps calling tools	If the ReAct usage limit (`usage`) is reached, stop using tools even if the LLM asked for them and go to the `end_node`
LLM doesn't call tools	If the LLM decides not to use any tool but the tool usage limits are not reached yet, re-run the origin node to force the LLM to choose a tool until the tool usage limits are reached
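The rules in the table can be restated as a pure function, which makes them easy to unit test without building a graph. This sketch follows the same decision order as ToolRouterEdge. Booleans stand in for `usage.is_limit_reached()`, for "the last AIMessage has tool_calls", and for `tools_usage.is_limit_reached(...)`. The `route` function itself is hypothetical.

```python
def route(
    *,
    react_limit_reached: bool,
    has_tool_calls: bool,
    tools_limit_reached: bool,
    origin_node: str,
    tools_node: str,
    end_node: str,
) -> str:
    """Same decision order as ToolRouterEdge, with the state lookups stripped out."""
    if react_limit_reached:
        return end_node  # hard stop, even if the LLM asked for more tools
    if has_tool_calls:
        return tools_node  # the LLM chose tools: execute them
    if not tools_limit_reached:
        return origin_node  # the LLM got lazy: ask it again
    return end_node  # budgets used up and no tool calls: go to the summary


NODES = {"origin_node": "scan_node", "tools_node": "scan_tools", "end_node": "summary_node"}
```

For example, `route(react_limit_reached=False, has_tool_calls=False, tools_limit_reached=False, **NODES)` sends control back to `scan_node`, which is the "lazy LLM" row of the table.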

Tools Results Processor - ProcessToolResultsNode

Code available on GitHub.

As mentioned above, to fight high LLM token usage I introduced a special field in the state:

class ReActAgentState(MessagesState):
    results: Annotated[list[ToolResult], operator.add]

This field contains tool execution results. To populate it, the ProcessToolResultsNode class parses the LangGraph message list:

from langchain_core.messages import (
    AIMessage,
    AnyMessage,
    ToolMessage,
)

from agent_core.state import ReActAgentState, ToolResult
import logging


class ProcessToolResultsNode[StateT: ReActAgentState]:
    def __call__(self, state: StateT) -> dict:
        messages = state["messages"]
        tools_usage = state["tools_usage"]
        new_results = []

        results = state.get("results", [])

        call_id_to_result = {
            result.tool_call_id: result for result in results if result.tool_call_id
        }

        reversed_messages = list(reversed(messages))
        for msg in reversed_messages:
            if isinstance(msg, ToolMessage):
                if msg.tool_call_id not in call_id_to_result:
                    if msg.name is not None:
                        tools_usage.increment_usage(msg.name)

                    new_results.append(
                        ToolResult(
                            result=str(msg.content),
                            tool_name=msg.name,
                            tool_arguments=self._find_tool_call_args(
                                reversed_messages, msg.tool_call_id
                            ),
                            tool_call_id=msg.tool_call_id,
                        )
                    )

        logging.debug(
            "ProcessToolResultsNode: Processed tool results: %s",
            new_results,
        )
        return {
            "results": list(reversed(new_results)),
            "tools_usage": tools_usage,
        }

    def _find_tool_call_args(
        self, messages: list[AnyMessage], tool_call_id: str
    ) -> dict | None:
        for msg in messages:
            if isinstance(msg, AIMessage):
                for tool_call in msg.tool_calls:
                    if tool_call.get("id") == tool_call_id:
                        return tool_call.get("args")

This associates each new tool result with the tool call that requested it (matched by tool_call_id), updates the tool usage counters, and saves the results in the graph state for further processing.
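Here is the same pairing logic with plain dicts standing in for LangChain's AIMessage and ToolMessage. It is a simplified sketch: the dict keys mirror the attributes used above (`tool_calls`, `tool_call_id`, `name`, `content`), but the dict shape and the `collect_new_results` helper are hypothetical. It iterates forward, so results come out in chronological order without the double reversal.

```python
from collections import Counter


def collect_new_results(messages: list[dict], known_call_ids: set[str]) -> tuple[list[dict], Counter]:
    """Pair unseen tool outputs with the arguments of the call that produced them."""
    args_by_call_id = {
        call["id"]: call["args"]
        for msg in messages
        if msg["type"] == "ai"
        for call in msg.get("tool_calls", [])
    }
    new_results, usage = [], Counter()
    for msg in messages:
        if msg["type"] != "tool" or msg["tool_call_id"] in known_call_ids:
            continue  # not a tool output, or already stored in state["results"]
        usage[msg["name"]] += 1
        new_results.append({
            "tool_name": msg["name"],
            "tool_arguments": args_by_call_id.get(msg["tool_call_id"]),
            "tool_call_id": msg["tool_call_id"],
            "result": msg["content"],
        })
    return new_results, usage


messages = [
    {"type": "ai", "tool_calls": [{"id": "c1", "args": {"url": "/"}}, {"id": "c2", "args": {"url": "/admin"}}]},
    {"type": "tool", "name": "curl", "tool_call_id": "c1", "content": "200"},
    {"type": "tool", "name": "curl", "tool_call_id": "c2", "content": "403"},
]
# "c1" was processed on a previous pass, so only "c2" is new.
results, usage = collect_new_results(messages, known_call_ids={"c1"})
```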

Summary Generation

Code available on GitHub.

After scanning finishes or the limits are reached, we need to generate a summary that is easy for people, or the next AI Agent in the chain, to consume:

import json
from langchain_core.language_models import BaseChatModel
from langchain_core.messages import SystemMessage

from scan_agent.state import ScanAgentState
from scan_agent.state.scan_agent_state import ScanAgentSummary

SUMMARY_BEHAVIOR_PROMPT = "Omitted for simplicity. Full prompt available on GitHub."

class SummaryNode:
    def __init__(self, llm: BaseChatModel):
        self.structured_llm = llm.with_structured_output(ScanAgentSummary)

    def __call__(self, state: ScanAgentState) -> dict:
        target = state["target"]

        system_prompt = SUMMARY_BEHAVIOR_PROMPT.format(
            target_url=target.url,
            target_description=target.description,
            target_type=target.type,
            tool_results=json.dumps([r.to_dict() for r in state.get("results", [])]),
        )

        prompt_messages = [SystemMessage(content=system_prompt), state["messages"][-1]]
        summary = self.structured_llm.invoke(prompt_messages)

        return {"summary": summary}

Graph

Code available on GitHub.

To build a graph I used this code:

from langchain_openai import ChatOpenAI
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import END, START, StateGraph
from langgraph.graph.state import CompiledStateGraph
from langgraph.prebuilt import ToolNode

from agent_core.edge import ToolRouterEdge
from agent_core.node import ProcessToolResultsNode
from agent_core.tool import ffuf_directory_scan, curl_tool
from scan_agent.node import ScanNode
from scan_agent.node.summary_node import SummaryNode
from scan_agent.state import ScanAgentState


def create_scan_graph() -> CompiledStateGraph:
    llm = ChatOpenAI(model="gpt-4.1-2025-04-14", temperature=0.3)
    tools = [ffuf_directory_scan, curl_tool]
    llm_with_tools = llm.bind_tools(tools, parallel_tool_calls=True)

    scan_node = ScanNode(llm_with_tools=llm_with_tools)
    summary_node = SummaryNode(llm=llm)
    process_tool_results_node = ProcessToolResultsNode[ScanAgentState]()

    tools_router = ToolRouterEdge[ScanAgentState](
        origin_node="scan_node",
        end_node="summary_node",
        tools_node="scan_tools",
    )

    builder = StateGraph(ScanAgentState)

    builder.add_node("scan_node", scan_node)
    builder.add_node("summary_node", summary_node)
    builder.add_node("scan_tools", ToolNode(tools))
    builder.add_node("process_tool_results_node", process_tool_results_node)

    builder.add_edge(START, "scan_node")
    builder.add_edge("scan_tools", "process_tool_results_node")
    builder.add_edge("process_tool_results_node", "scan_node")
    builder.add_edge("summary_node", END)

    builder.add_conditional_edges("scan_node", tools_router)

    return builder.compile(checkpointer=MemorySaver())

Testing

To test my scan agent, I asked Claude Code to develop a deliberately vulnerable REST API with FastAPI and launched it locally. The code of that service is available here.

I ran my agent against this target with the following script:

import logging
import uuid

from langchain_core.runnables.config import RunnableConfig

from agent_core.graph import run_graph
from agent_core.state import ReActUsage, Target, Tools, ToolsUsage
from agent_core.tool import CURL_TOOL, FFUF_TOOL
from scan_agent.graph import create_scan_graph

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

# Build the scan agent graph described above
graph = create_scan_graph()

state = {
    "target": Target(
        url="http://localhost:8000", description="Local REST API target", type="web"
    ),
    "usage": ReActUsage(limit=25),
    "tools_usage": ToolsUsage(
        limits={
            FFUF_TOOL.name: 2,
            CURL_TOOL.name: 5,
        }
    ),
    "tools": Tools(tools=[FFUF_TOOL, CURL_TOOL]),
}
thread_id = str(uuid.uuid4())[:8]
config = RunnableConfig(
    max_concurrency=10,
    recursion_limit=25,
    configurable={"thread_id": thread_id},
)

print(f"🚀 Starting improved event processing with thread ID: {thread_id}")
print("=" * 80)

event = await run_graph(graph, state, config)

I got pretty solid results:

My agent found critical vulnerabilities, and that was just the scan agent, not the attack agent (which I will explain in the next article). I was very happy that this agent worked so well. LLMs can make unexpected decisions that would have been practically impossible to hand-code in the previous software era.

Summary

I built a simple agent to perform cybersecurity scanning, and it worked amazingly well. Modern LLMs give software engineers the power to build systems that were practically impossible to build before. I'm excited to build more agents and solve real-world problems.

Main insights for ReAct agent development:

To minimize LLM token usage, save tool output in the graph state instead of relying on the full list of messages.
To guarantee sufficient tool usage, control it from code instead of relying on the LLM's decisions.

In the next article I will explain how to combine multiple AI Agents with LangGraph to perform a complete cybersecurity assessment of a system.


From SaaS to Open Source: The Full Story of AI Founder

Vitalii Honchar — Tue, 10 Jun 2025 07:31:10 +0000

Introduction

Originally published on my blog: From SaaS to Open Source: The Full Story of AI Founder

I have a project, AI Founder, which is supposed to help people validate business problems before launching new products. Its target audience is:

Indie Hackers
Serial Entrepreneurs
Startups

But over the past months of running it, I decided to pivot and focus on the blog. In this blog post I will explain how I built AI Founder and share its open source version.

AI Founder is now open source and available on GitHub.

The landing page is available here.

AI Founder is available here.

A demo of AI Founder is available here: YouTube


Problem Statement

I'm a software engineer and I love developing my own products, but I don't know how to decide which idea is worth investing my effort in. That's why I decided to automate the process of idea validation in AI Founder.

High Level Analysis Flow

User sends an idea to analyze
AI performs HWW Analysis.
AI performs TAM-SAM-SOM Analysis.
AI performs Competitors Analysis.
AI generates Summary

HWW Analysis:

How big is this problem?

Why does this problem exist?

Why is nobody solving it?

Who faces this problem?

TAM-SAM-SOM Analysis:

Total Addressable Market

Serviceable Available Market

Serviceable Obtainable Market

Market Landscape

Key learning here:

An LLM can't perform a good multi-step analysis in a single prompt: the analysis comes out incomplete, and the LLM misses critical parts of it.

That's why it's worth applying the Single Responsibility principle to the analysis and running each part as a separate LLM call to improve accuracy.

So my analysis flow contains multiple steps rather than a single all-in-one prompt.
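The multi-step idea can be sketched as one focused LLM call per step, where each step sees the outputs of the earlier ones. This is a minimal sketch, not AI Founder's code (which is JavaScript). `ANALYSIS_STEPS`, its instructions and `run_multistep_analysis` are hypothetical, and the real prompts are in the AI Founder repository. The `llm` argument is any function from prompt to text, so a fake can be used for testing.

```python
from typing import Callable

# Hypothetical step instructions; the real prompts are in the AI Founder repository.
ANALYSIS_STEPS = [
    ("hww", "Analyze how big the problem is, why it exists, why nobody solves it and who faces it."),
    ("tam_sam_som", "Estimate TAM, SAM and SOM and describe the market landscape."),
    ("competitors", "List and compare the main competitors."),
    ("summary", "Summarize the analysis and assess the idea."),
]


def run_multistep_analysis(idea: str, llm: Callable[[str], str]) -> dict[str, str]:
    """One focused LLM call per step; each step sees the outputs of earlier steps."""
    context = {"idea": idea}
    for name, instruction in ANALYSIS_STEPS:
        previous = "\n".join(f"{key}: {value}" for key, value in context.items())
        context[name] = llm(f"{instruction}\n\nContext:\n{previous}")
    return context


prompts: list[str] = []


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call: record the prompt and return a numbered answer.
    prompts.append(prompt)
    return f"answer {len(prompts)}"


analysis = run_multistep_analysis("AI tool for validating startup ideas", fake_llm)
```

Each call has a single responsibility, so a missing or weak section can be traced to one prompt instead of one giant prompt.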

Differentiation Points

Although the analysis uses multiple LLM steps, the application is still very easy to copy, because right now my main differentiation points are:

Prompts
Multistep analysis graph

To make the application harder to copy, I decided to add a good user interface as a third differentiation point.

So the final list of differentiation points is:

Prompts
Multistep analysis graph
User Interface

The most critical issue, which I didn't solve and which eventually became a problem for the application, is that the LLM was the single source of truth for AI Founder: all data used in the analysis came from the LLM itself.

I will explain the solution later in the article.

With the analysis approach and the differentiation points clear, let's jump to the implementation.

Technologies Decision

I decided to use:

Next.js - to build a good UI by leveraging LLMs, which were trained on a huge amount of React code and can therefore generate pretty good UI code.
JavaScript - Next.js is a JavaScript framework, so the choice was obvious. At that time I also decided not to use TypeScript, to speed up development. That was a mistake: the project quickly became complex, and next time I will simply use TypeScript.
Supabase - managed Postgres with additional features like user registration and authorization.
Claude 3.5 Sonnet - at that time it was one of the best LLMs. I wanted good reasoning during idea analysis at a reasonable API price.
DigitalOcean - because the LLM analysis is a long-running process, I couldn't use Vercel without paying Vercel $20 per month. DigitalOcean was a significantly cheaper option, especially since I know how to build infrastructure.
Cursor - to speed up development I used Cursor Agent, which boosted my work a lot and cut my time to market by 2-3x.

High Level Design

The user sends an idea to ai-founder for analysis
ai-founder saves the user's idea to the database
ai-founder performs the idea analysis via the LLM

Long analysis processing design decision

Because LLM calls are slow, a full analysis of an idea took ai-founder 60-70 seconds.

Users will not watch loading spinners for 60-70 seconds. They will just close the application and never open it again.

That's why I decided to solve this issue with this approach:

Each analysis result is saved to the database immediately after it is received from the LLM. Since the analysis is a directed graph where each node builds on the previous one, it's possible to save intermediate results in the database and let the client query ai-founder every second. The user can then read the first results after 7-12 seconds instead of 60-70.

In the diagram above, Client is the browser-side JS client, which queries ai-founder automatically; it's not a manual query from a user 😊.

I didn't use WebSockets here because I was building an MVP and didn't need real-time updates.
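The save-as-you-go plus polling design can be sketched with asyncio. Under these assumptions, an in-memory dict replaces the database and an in-process loop replaces the browser's HTTP polling. All names here (`analyze_and_persist`, `poll_results`, `fake_step`) are hypothetical and are not taken from the AI Founder code. The point is that the client sees partial results long before the whole analysis finishes.

```python
import asyncio


async def analyze_and_persist(store: dict, project_id: str, steps: list) -> None:
    """Run analysis steps in order and save each result as soon as it is ready."""
    store[project_id] = {}
    for name, step in steps:
        store[project_id][name] = await step()


async def poll_results(store: dict, project_id: str, expected: int, interval: float) -> list[list[str]]:
    """Client-side loop: periodically read whichever results are already saved."""
    snapshots = []
    while True:
        await asyncio.sleep(interval)
        ready = list(store.get(project_id, {}))
        snapshots.append(ready)
        if len(ready) == expected:
            return snapshots


def fake_step(name: str, seconds: float):
    # Hypothetical stand-in for a slow LLM call.
    async def step() -> str:
        await asyncio.sleep(seconds)
        return f"{name} done"
    return step


async def main() -> list[list[str]]:
    store: dict = {}
    steps = [(n, fake_step(n, 0.05)) for n in ("hww", "tam_sam_som", "summary")]
    _, snapshots = await asyncio.gather(
        analyze_and_persist(store, "p1", steps),
        poll_results(store, "p1", expected=len(steps), interval=0.01),
    )
    return snapshots


snapshots = asyncio.run(main())
```

Each entry in `snapshots` is what the client would have displayed at that poll. The list grows step by step, which mirrors the 7-12 seconds to first results versus 60-70 seconds for the full analysis.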

Those were the main design decisions, so let's jump to the implementation.

Implementation

The source code of AI Founder is available on GitHub.

Validation Service

Validation Service - the component responsible for executing the idea analysis and evaluating how good the idea is.

const createValidationService = () => {

    const saveValidationInput = async (projectId, request) =>
        projectUpdateService.updateProject(projectId, async (project) => {
            project.data = project.data || defaultData;
            project.data.input.validation = request;
            project.data.tasks = project.data.tasks || {};
            project.data.analysis.validation = {};
            project.data.tasks.validation = Object.values(validationTasks);
        });

    const processAsyncValidation = async (log, projectId, userInput) => {
        const data = { userInput };
        // 4. analyze hwww
        const hww = await analyzeAndSaveHww(log, projectId, data);
        data.hww = hww;

        // 5. analyze tam sam som
        const tamSamSom = await analyzeAndSaveTamSamSom(log, projectId, data);
        data.tamSamSom = tamSamSom;

        // 6. analyze competitors
        const competitorAnalysis = await generateAnalyzeCompetitors(log, projectId, data);
        data.competitorAnalysis = competitorAnalysis;

        // 7. generate a summary based on all previous steps
        const summary = await generateSummary(log, projectId, data);
        data.summary = summary;

        // 8. generate optimizations
        await generateOptimizations(log, projectId, data);
    };

    const generateValidation = async (userId, projectId, request) => {
        const log = loggerWithProjectId(userId, projectId);
        const startTime = Date.now();

        if (!request || typeof request !== 'object') {
            throw new Error('Invalid validation request format');
        }

        try {
            // 1. save user input
            await saveValidationInput(projectId, request);

            withDuration(log, startTime).info('Starting parallel analysis execution');
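            // 2. generate name for the idea (not awaited; log failures instead of leaving an unhandled rejection)
            generateAndSaveName(log, projectId, request)
                .catch(error => log.error({ error }, 'Name generation failed'));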
            // 3. execute LLM analysis logic
            processAsyncValidation(log, projectId, request)
                .finally(() => withDuration(log, startTime).info('Validation analysis completed'))
                .catch(error =>
                    withDuration(log, startTime).error({ error }, 'Initial analysis failed in async context'));
            return await projectRepo.get(projectId);
        } catch (error) {
            log.error({ error: error.message }, 'Validation generation failed');
            throw error;
        }
    };
    return { generateValidation };
};

export default createValidationService();

I omitted parts of validation_service.js here to keep the example easy to follow and focused on the most important parts. The full code of the Validation Service is available on GitHub.

The most important steps here are:

Save the initial user input to the database so the user can retrieve the analysis later.
Generate a name for the idea.
Execute the LLM analysis logic, storing each analysis step in the database:
Analyze HWWW
Analyze TAM-SAM-SOM

As you can see, the processAsyncValidation function simply runs the steps one by one instead of executing a graph. This was a strategic decision to ship AI Founder faster rather than implement graph execution for the AI flow. But I still think that representing AI flows as a graph is a good idea, so I evolved it in a separate service, ai-svc, described in Building ai-svc: A Reliable Foundation for AI Founder. This service will be open source as well, and I will publish an article about it soon.
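For comparison, here is a rough Python sketch of the graph-based alternative. The dependencies in FLOW are an illustrative guess based on processAsyncValidation, not a graph taken from ai-svc, and run_flow is a hypothetical helper. It uses TopologicalSorter from Python's standard graphlib module, which only hands out a node once all of its dependencies are marked done.

```python
# Hypothetical sketch: the analysis flow as a dependency graph executed in
# topological order with graphlib (Python 3.9+ standard library).
# The graph and the step functions are assumptions for illustration.
from graphlib import TopologicalSorter
from typing import Callable

# node -> set of nodes it depends on
FLOW: dict[str, set[str]] = {
    "hww": set(),
    "tam_sam_som": {"hww"},
    "competitors": {"hww", "tam_sam_som"},
    "summary": {"hww", "tam_sam_som", "competitors"},
    "optimizations": {"summary"},
}


def run_flow(flow: dict[str, set[str]],
             steps: dict[str, Callable[[dict], object]],
             user_input: str) -> dict:
    results: dict = {"user_input": user_input}
    sorter = TopologicalSorter(flow)
    sorter.prepare()
    while sorter.is_active():
        # All nodes returned here have finished dependencies, so they could
        # run in parallel; this sketch runs them one by one for simplicity.
        for node in sorter.get_ready():
            inputs = {dep: results[dep] for dep in flow[node]}
            inputs["user_input"] = user_input
            results[node] = steps[node](inputs)
            sorter.done(node)
    return results
```

Nodes returned together by get_ready() do not depend on each other, so a real engine could dispatch them concurrently; the hard-coded sequence gives up that option in exchange for simplicity.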

Prompts

Prompts are special instructions that tell an LLM to perform some action, for example an analysis. They are like JavaScript code, but written in English. It's very important to keep prompts clean and explicit; otherwise the LLM will not produce the expected output.

Here is an example of the prompt I used to generate the HWWW analysis:

import { validationPrompts } from '@/lib/ai/prompt/context';

// 1. Prompt itself
const instructions = `
- Perform comprehensive HWW analysis with growth-oriented criteria:
    - Market Opportunity Assessment:
        - Validate breakthrough potential
        - Classify as "Market disruption" vs "Growth accelerator"
        - Identify innovation opportunities
        - Evaluate first-mover advantages
        - Analyze success accelerators
        - Study scaling opportunities

    - Market Leadership Validation:
        - Assess premium value potential
        - Analyze pricing optimization opportunities
        - Identify rapid adoption catalysts
        - Map market domination paths
        - Optimize customer acquisition channels
        - Document value multipliers

    - Solution Leadership Check:
        - Validate competitive advantages
        - Identify scaling opportunities
        - Map compliance innovations
        - Optimize launch timeline
        - Analyze efficiency multipliers
        - Evaluate market leadership potential

    - Target Demographics Analysis:
        - Map market domination potential
        - Document behavioral opportunities
        - Track engagement accelerators
        - Identify loyalty builders
        - Map growth multipliers
        - Validate premium positioning
        - Assess revenue optimization
        - Map decision catalysts
        - List adoption accelerators

- Use evidence-based validation with breakthrough focus
- Focus on market leadership metrics
- Document clear domination paths
- Maintain balanced yet ambitious assessment
- Provide readiness to pay and a number for money which users are willing to pay for the solution.
`.trim();

// 2. Output schema
const schema = {
  "how_big_a_problem_is": {
    "overview": {
      "description": "",
      "size": 0,
      "dimension": ""
    },
    "frequency": [
      {
        "name": "",
        "explanation": ""
      }
    ],
    "readiness_to_pay": {
      "summary": "",
      "pricing": 0,
      "researches": [
        {
          "research": "",
          "explanation": ""
        }
      ]
    },
    "persistence": {
      "duration": "",
      "trend": "",
      "explanation": ""
    },
    "urgency": {
      "level": "",
      "explanation": ""
    },
    "historical_attempts": [
      {
        "name": "",
        "result": ""
      }
    ]
  },
  "why_does_this_problem_exist": {
    "summary": "",
    "reasons": [
      {
        "name": "",
        "explanation": ""
      }
    ]
  },
  "why_nobody_solving_it": {
    "summary": "",
    "reasons": [
      {
        "name": "",
        "explanation": ""
      }
    ]
  },
  "who_faces_this_problem": {
    "summary": "",
    "metrics": {
      "characteristics": [
        {
          "name": "",
          "value": ""
        }
      ],
      "geography": [],
      "psychology_patterns": [
        {
          "name": "",
          "value": ""
        }
      ],
      "specific_interests": [],
      "habitual_behaviour": [
        {
          "name": "",
          "value": ""
        }
      ],
      "trust_issues": [
        {
          "name": "",
          "value": ""
        }
      ],
      "where_to_find_them": [
        {
          "name": "",
          "value": ""
        }
      ]
    }
  }
};

export default validationPrompts(instructions, schema);

As you can see, the prompt has two parts:

The prompt itself, which contains the instructions in English.
The output schema the LLM should produce. It contains the field names with empty values whose types hint to the LLM which type to use for each field.
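The same example schema that guides the LLM can also check the shape of its answer. This Python sketch is hypothetical: build_prompt and matches_example are illustrative names, and AI Founder's validationPrompts helper is not shown in the article, so this is not its implementation. It treats each empty value as a type hint: "" means string, 0 means number, [] means any list, and a one-item list describes every element.

```python
# Hypothetical sketch of the "instructions + example schema" prompt structure,
# plus a check that an LLM result matches the example schema's shape.
import json


def build_prompt(instructions: str, example_schema: dict) -> str:
    # Empty values ("" / 0 / [] / {}) act as type hints for the LLM.
    return (
        f"{instructions}\n\n"
        "Respond with JSON only, using exactly this structure. "
        "Empty values show the expected type of each field:\n"
        f"{json.dumps(example_schema, indent=2)}"
    )


def matches_example(value, example) -> bool:
    """Recursively check an LLM result against the example schema."""
    if isinstance(example, dict):
        return isinstance(value, dict) and all(
            k in value and matches_example(value[k], v) for k, v in example.items()
        )
    if isinstance(example, list):
        if not isinstance(value, list):
            return False
        # [] accepts any list; otherwise every item must match the first example item.
        return not example or all(matches_example(item, example[0]) for item in value)
    if isinstance(example, (int, float)) and not isinstance(example, bool):
        # 0 means "a number": accept ints and floats, but not booleans.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return isinstance(value, type(example))
```

A result that fails matches_example could be retried or rejected before it is saved to the database.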

The other prompts used in AI Founder are available on GitHub.

UI Development

AI Founder is available here.

As I mentioned above, the UI became a differentiation point for my application, so I had to build a good one. But I'm a back-end engineer: the UIs I build are good enough for me but not for end users. So to build a good UI I used Cursor in agent mode with prompts like:

Use TailwindCSS and Next.js to build idea analyser page which will contain:
- HWW 
- TAM-SAM-SOM
- ...

This page should have a beautiful UX and be mobile friendly

After a number of iterations, Cursor built a beautiful UI for me:

I also added dynamic behaviour, such as polling for analysis results on a timer, to improve UX:

import { useState, useEffect } from 'react';
import projecApi from '@/lib/client/api/project_api';

const POLLING_INTERVAL = 3000;

const hasAnyTasks = (tasks) => {
    if (!tasks) return false;
    return Object.values(tasks).some(taskList => Array.isArray(taskList) && taskList.length > 0);
};

export default function useProjectPolling(initialProject) {
    const [project, setProject] = useState(initialProject);
    const [isPolling, setIsPolling] = useState(hasAnyTasks(initialProject?.data?.tasks));

    const startPolling = (project) => {
        setProject(project);
        setIsPolling(true);
    };

    useEffect(() => {
        let timeoutId;

        const pollProject = async () => {
            if (!isPolling) return;

            try {
                console.log('Polling project:', project.id);
                const updatedProject = await projecApi.getProject(project.id);
                setProject(updatedProject);

                if (hasAnyTasks(updatedProject?.data?.tasks)) {
                    timeoutId = setTimeout(pollProject, POLLING_INTERVAL);
                } else {
                    setIsPolling(false);
                }
            } catch (error) {
                console.error('Polling error:', error);
                if (isPolling) {
                    timeoutId = setTimeout(pollProject, POLLING_INTERVAL);
                }
            }
        };

        if (isPolling) {
            timeoutId = setTimeout(pollProject, POLLING_INTERVAL);
        }

        return () => {
            if (timeoutId) {
                clearTimeout(timeoutId);
            }
        };
    }, [project?.id, isPolling]);

    return { project, startPolling, setProject };
}

The code above uses React hooks to poll the project's analysis every 3 seconds, which gives end users a dynamic feel.

There are many other interesting code solutions in this project. I can't cover them all here, but the code is on GitHub, so feel free to check it out.

Deployment

I deployed AI Founder on DigitalOcean simply by using docker-compose with this configuration:

services:
  swag:
    image: linuxserver/swag
    container_name: swag
    cap_add:
      - NET_ADMIN
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=Etc/UTC
      - URL=<url>
      - EXTRA_DOMAINS=<extra_url>
      - VALIDATION=http
      - EMAIL=<email>
    volumes:
      - ./swag/config:/config
    ports:
      - 443:443
      - 80:80
    restart: unless-stopped
    networks:
      - web
  aifounder:
    image: weaxme/pet-project:ai-business-founder-latest
    container_name: aifounder
    environment:
      - NEXT_PUBLIC_SUPABASE_URL=<supabase_url>
      - NEXT_PUBLIC_SUPABASE_ANON_KEY=<supabase_key>
      - ANTHROPIC_API_KEY=<anthropic_key>
      - NODE_ENV=production
      - BASE_URL=<url>
      - NEXT_PUBLIC_API_URL=<api_url>
      - STRIPE_SECRET_KEY=<stripe_secret_key>
      - STRIPE_HOBBY_PRICE_ID=<stripe_price>
      - STRIPE_PRO_PRICE_ID=<stripe_price>
    restart: unless-stopped
    networks:
      - web
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/api/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
networks:
  web:
    external: false

linuxserver/swag - reverse proxy management (nginx with automatic TLS certificates)
weaxme/pet-project:ai-business-founder-latest - AI Founder docker image

Important tips for deploying an indie project with docker-compose:

Always define a healthcheck endpoint. It lets you see a service's status at a glance in the docker ps output. Keep in mind that standalone Docker (without Swarm) does not restart a container just because it becomes unhealthy: a restart policy such as unless-stopped only restarts containers that exit, for example after an OOM kill, so restarting on failed health checks requires Swarm or an external watcher.
Docker Hub's free plan allows only one private repository, which means that with multiple projects I would need to buy a paid plan. But if the image tag holds the service name rather than just the version, as in weaxme/pet-project:ai-business-founder-latest, Docker Hub lets you host any number of pet projects on the free plan. The tag encodes both the service name and the version, instead of following the registry convention of putting the service name in the repository name.

Key Learnings

AI Founder didn't earn money, but it gave me valuable lessons that I will apply in my next projects.

Technology Decisions

Having built the project, here is what I would change next time:

JavaScript -> TypeScript - I chose JS over TS at the start of the project to speed up development, but it was a huge mistake: the project quickly became complex, and types would have simplified development a lot. Next time I will simply use TS from the start.
Supabase -> Postgres - I chose Supabase to minimize the infrastructure I had to configure myself, but it was a huge mistake. I locked myself into a vendor with very strict limits, and the user registration flow simply didn't work as smoothly as I expected. Next time I will just use Postgres and implement the user registration flow myself. What's more, I have already set up my own infrastructure on a VPS, which is cheap, reusable, and free of strict vendor limits.
DigitalOcean -> Hetzner - while DigitalOcean is pretty good and significantly cheaper than Vercel or AWS, I found that Hetzner is cheaper still. I have already built infrastructure on Hetzner for my next projects.

Product Idea Validation

While working on a product for validating business ideas, I validated the idea of this product itself with a Custom GPT, Product Idea Analyzer, which I developed in June 2024. It showed promising results, and I didn't verify anything else. This was a huge mistake, because users simply didn't see enough value in my product to pay for it.

I learned that statistical data is not enough to validate a product idea; I should talk to customers first. In March 2025 I spent a month trying to sell AI Founder and realized that I was doing something wrong, which is why I started learning in the ID Accelerator program. It gave me some of the most important product knowledge I have acquired recently.

I shouldn't build anything first; I should talk to real customers and only after that start prototyping a solution.

This is a completely different approach from the one I used to follow:

Build product
Try to sell
Fail
Repeat

With the new approach, everything changes:

Find customers for interviews
Speak with customers to identify their problems
Prototype a solution for customers
Validate prototype with customers
Build a product

That's it. Building a new product is not about programming; it's about:

Understanding customers and their problems
Marketing
20% programming

Product Decisions

The most important learnings:

Don't build email/password login; use OAuth2 providers such as Google or Facebook instead. Too many users abandoned my app when they didn't receive the confirmation email within the first couple of seconds, which was a common issue at the start, so I simply disabled the email confirmation requirement. In my next product I will use something like Auth0 from the beginning.
Product features should work well and reliably. That's how the new project ai-svc was born.

Differentiation points in an AI product

I think that in an AI product the value should come not from prompts but from:

Unique data.
Unique process automation (not specific to AI, but still relevant here)

Conclusions

Originally published on my blog: From SaaS to Open Source: The Full Story of AI Founder

AI Founder became a project where I learned a lot about AI engineering and product building. Along the way I built another product, intended as an AI infrastructure project. I also learned that I had been validating product ideas the wrong way.

I should act as an entrepreneur not as an engineer if I want to execute a successful product.

This is hard for me to accept, because I still love writing code and configuring infrastructure too much. One result of AI Founder is this technical blog, where I will publish my new learnings and share my journey.

Subscribe to my Substack so you don't miss my new articles 😊

References

AI Founder source code on GitHub
Try AI Founder here
Demo is here
ID Accelerator - highly recommended if you want to gain product-building skills.

Python RAG API Tutorial with LangChain & FastAPI – Complete Guide

Vitalii Honchar — Thu, 29 May 2025 07:55:45 +0000

Introduction

Originally published on vitaliihonchar.com

Over the last few months I have been following new releases in the AI sector and new startups that use AI, and I was curious: what are they doing, and how do they build these AI features? While I have some experience building AI applications, I feel it's not enough, and I want to learn more about building them. That's why with this blog post I'm starting a new journey in my life: blogging about software engineering.

In this blog post I will explain how to build an AI-powered application for chatting with uploaded PDF files. It uses these techniques and frameworks:

Retrieval Augmented Generation (RAG)
LangChain to build RAG and communicate with OpenAI
FastAPI to build API
Python 😊

The code from this article is available on GitHub.

High Level Architecture

pdf-analyzer - the service that analyzes PDF documents and answers user questions using their content

The user sends a question to the pdf-analyzer service
The pdf-analyzer service retrieves documents related to the question from the Postgres database
The pdf-analyzer service sends the user question and the documents retrieved in step 2 to the OpenAI API to get an answer

Before we jump into the implementation details, let's understand why this architecture is called "retrieval augmented generation".

Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) - a pattern in AI applications in which, to answer a user, the application provides the LLM with information related to the user's request. This makes the LLM's answer "smarter", because the LLM gets more context about the problem it has to solve.

The RAG process is best illustrated by this diagram:

The user sends a request to the AI application
The AI application retrieves information from external storage
The AI application augments the original user request with the retrieved information and sends it to the LLM to generate an answer

This approach produces much better LLM responses than sending a document with many pages directly to the LLM and asking for an answer.
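The three steps can be sketched in a few lines of Python. This is a framework-free illustration under assumptions: retrieve and generate are stand-in functions for a vector store and an LLM client, and SYSTEM_PROMPT is a made-up instruction, not the prompt pdf-analyzer uses later.

```python
# Hypothetical sketch of retrieve -> augment -> generate with the retriever and
# the LLM passed in as plain functions, so the flow is visible without LangChain.
from typing import Callable

SYSTEM_PROMPT = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say you don't know."
)


def augment(question: str, chunks: list[str]) -> list[tuple[str, str]]:
    # Put the retrieved text next to the user's question as (role, content) pairs.
    context = "\n\n".join(chunks)
    return [
        ("system", SYSTEM_PROMPT),
        ("system", f"Context:\n{context}"),
        ("human", question),
    ]


def rag_answer(question: str,
               retrieve: Callable[[str], list[str]],
               generate: Callable[[list[tuple[str, str]]], str]) -> str:
    chunks = retrieve(question)           # step 2: query the external storage
    messages = augment(question, chunks)  # step 3: augment the original request
    return generate(messages)             # step 3: the LLM generates the answer
```

Keeping retrieval and generation behind plain functions makes each step easy to swap or test in isolation.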

Use Cases of RAG

The RAG pattern is useful when the amount of information is larger than the LLM's context window. Although modern LLMs have huge context windows, RAG can still help: in my experience, once the context is more than about 50% full, the chance of hallucinations becomes very high. So to get the best responses from an LLM, keep context usage minimal.
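One way to apply the 50% rule of thumb is to turn it into a chunk budget. The helper below is a hypothetical back-of-the-envelope calculation; max_chunks is an illustrative name, and the token counts in the example are made-up numbers, not measurements.

```python
# Hypothetical helper: how many retrieved chunks fit if only a fraction of the
# model's context window should be used. Token counts are illustrative inputs.
def max_chunks(context_window: int, chunk_tokens: int,
               reserved_tokens: int, budget_fraction: float = 0.5) -> int:
    """reserved_tokens covers the system prompt, the question and the answer."""
    budget = int(context_window * budget_fraction) - reserved_tokens
    return max(budget, 0) // chunk_tokens
```

For example, with a 128,000-token window, 1,000-token chunks and 4,000 tokens reserved, max_chunks(128_000, 1_000, 4_000) returns 60, because (64,000 - 4,000) / 1,000 = 60. Retrieval is what picks which 60 chunks are worth sending.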

Use Cases of RAG in the real world

In the real world RAG can be used in these applications:

AI Chat with company documentation
Customer Support AI Bot
Frequent retrieval of information from unstructured data
An intermediate step in a more complex flow

That's it for the theory; let's jump to the implementation 😎

User Flows

Upload PDF document

The user uploads a PDF document to the pdf-analyzer service.
The pdf-analyzer service extracts the text from the PDF and splits it into chunks to improve retrieval accuracy.
The pdf-analyzer service uses the OpenAI API to convert each text chunk into a vector that represents it. Later we use these vectors to search the database mathematically, by vector similarity.
The vectors are saved in storage: at this step we store both the numeric vectors and the text itself. Later we use vector math to find the text chunks most relevant to a user question.
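Splitting with overlap can be illustrated in plain Python. pdf-analyzer itself delegates this to a LangChain TextSplitter, so split_text below is only a hypothetical sketch of the idea: fixed-size character chunks where neighbouring chunks share overlap characters, which reduces the chance that a sentence cut at a boundary is lost.

```python
# Hypothetical sketch of fixed-size chunking with overlap, measured in characters.
# Not LangChain's algorithm; it only illustrates the idea behind text splitting.
def split_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new chunk starts after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks
```

For instance, split_text("abcdefghij", chunk_size=4, overlap=1) yields "abcd", "defg" and "ghij": each chunk repeats the last character of the previous one.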

Chat with uploaded PDF document

The user sends a question via the API to the pdf-analyzer service
The pdf-analyzer service converts the question to a numeric vector using the OpenAI API
The pdf-analyzer service finds the vectors in storage that are closest to the question's vector
The pdf-analyzer service sends the user question, the retrieved documents, and a system prompt to the OpenAI API to get the most accurate answer
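"Closest vectors" is commonly measured with cosine similarity. In pdf-analyzer the search runs inside Postgres, so this plain-Python version is only an illustration: cosine_similarity and top_k are hypothetical helpers, and the tiny 3-dimensional vectors in the example are made up (real embeddings have hundreds or thousands of dimensions). The default k=4 mirrors the default of LangChain's similarity_search.

```python
# Hypothetical sketch of nearest-neighbour search: cosine similarity plus
# top-k selection over an in-memory dict of chunk_id -> embedding vector.
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # treat zero vectors as unrelated


def top_k(query: list[float], chunks: dict[str, list[float]], k: int = 4) -> list[str]:
    """Return the ids of the k chunks whose vectors are closest to the query."""
    return heapq.nlargest(k, chunks, key=lambda cid: cosine_similarity(query, chunks[cid]))
```

A database does the same ranking with an index instead of scanning every vector, which matters once there are many documents.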

Technology Decisions

Knowing the user flows above, we can decide which technologies to use for this application.

LangChain Framework - a popular framework for building AI systems that covers a lot of use cases
Python - LangChain's original language is Python, so we will go with it
FastAPI - a modern and very convenient framework for building APIs in Python that can handle high load
Postgres - a mature database that supports vector storage via an extension

Service Architecture

The pdf-analyzer service will use a classical layered architecture:

Routes (files and chats) handle HTTP requests and use services to execute business logic
Services (document service and AI service) execute business logic and integrate with Postgres and the OpenAI API

This architecture follows the single responsibility principle and keeps the system simple.

The full source code for this article is available on GitHub. To keep the article simple, I will include only the code that highlights the most important concepts of a RAG API.

Implementation

Document Service

DocumentService - the service responsible for saving and searching documents.

import tempfile

from langchain_core.vectorstores import VectorStore
from langchain_core.documents import Document
from langchain_text_splitters.base import TextSplitter
from pdf_analyzer.models import File
from dataclasses import dataclass
from sqlmodel import Session
from langchain_community.document_loaders import PyPDFLoader
from pdf_analyzer.repositories.file import FileRepository
from uuid import UUID


@dataclass
class DocumentService:

    vector_store: VectorStore
    text_splitter: TextSplitter
    file_repository: FileRepository

    async def save(self, session: Session, file: File) -> File:
        # 1. Save file to the database
        file = self.file_repository.create_file(session, file)

        # 2. Convert file to a list of LangChain documents
        documents = self.__convert_to_documents(file)
        # 3. Split list of LangChain documents to smaller documents to improve accuracy of RAG
        all_splits = self.text_splitter.split_documents(documents)
        # 4. Add metadata to each chunk so that searches can be limited to specific files
        self.__add_metadata(all_splits, file)
        # 5. Save documents in the vector store
        await self.vector_store.aadd_documents(all_splits)

        return file

    async def search(self, text: str, file_ids: list[UUID] | None = None) -> list[Document]:
        documents_filter = None
        if file_ids:
            documents_filter = {
                "file_id": {"$in": [str(file_id) for file_id in file_ids]}
            }
        return await self.vector_store.asimilarity_search(text, filter=documents_filter)

    def __add_metadata(self, documents: list[Document], file: File):
        for doc in documents:
            doc.metadata["file_name"] = file.name
            doc.metadata["file_id"] = str(file.id)

    def __convert_to_documents(self, file: File) -> list[Document]:
        with tempfile.NamedTemporaryFile(suffix=".pdf", delete=True) as tmp_file:
            tmp_file.write(file.content)
            tmp_file.flush()

            loader = PyPDFLoader(tmp_file.name)
            return loader.load()

The most interesting part of the system is DocumentService, which saves a file in the database in these steps:

Save the file to the database
Convert the file to a list of LangChain documents
Split the list of LangChain documents into smaller documents to improve RAG accuracy
Add metadata to each chunk so that users can chat with specific files
Save the documents in the vector store

Step 4 is especially important, because in the end our users want to chat with specific files, not with all files in the system. That's why we add the file_id metadata tag in the __add_metadata method. For example:

User 1 uploads file 1, and the __add_metadata method sets file_id: 123 on its chunks.
User 2 uploads file 2, and the __add_metadata method sets file_id: 456 on its chunks.

When users search for relevant content, they pass file_id values, which the search method uses to restrict the results to specific files.

AI Service

AIService - the service responsible for integrating with the OpenAI LLM API.

from langchain_core.language_models import BaseChatModel
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        ("system", "{data}"),
        ("human", "{text}"),
    ]
)


class Output(BaseModel):
    answer: str | None = Field(
        default=None,
        description="Answer on the question",
    )


class AIService:

    def __init__(self, llm: BaseChatModel):
        self.llm = llm
        self.structured_llm = llm.with_structured_output(schema=Output)

    def retrieve_answer(self, question: str, docs: list[Document]) -> str | None:
        data = "\n\n".join(doc.page_content for doc in docs)
        prompt = prompt_template.invoke({"text": question, "data": data})
        llm_result = self.structured_llm.invoke(prompt)

        return Output.model_validate(llm_result).answer if llm_result else None

Retrieving an answer from the documents works like this:

The list of LangChain documents is joined into a single string
The LangChain prompt template substitutes the template variables and produces the final prompt
The LangChain LLM class sends the prompt to OpenAI and returns a structured Output response
The LLM response is validated as a Pydantic Output model

ChatService

ChatService - the service responsible for the user's conversation with the LLM and for augmenting user requests to the LLM.

from dataclasses import dataclass
from pdf_analyzer.schemas import ChatCreate
from pdf_analyzer.repositories import ChatRepository, MessageRepository
from pdf_analyzer.models import Chat, Message, SenderType
from sqlmodel import Session, select
from pdf_analyzer.schemas import MessageCreate
from pdf_analyzer.services.ai import AIService
from pdf_analyzer.services.document import DocumentService
from uuid import UUID
from typing import Sequence


@dataclass
class ChatService:
    chat_repository: ChatRepository
    message_repository: MessageRepository
    ai_svc: AIService
    document_svc: DocumentService

    def create_chat(self, session: Session, chat_create: ChatCreate):
        chat = Chat(name="New Chat", files=[])
        return self.chat_repository.create(session, chat, chat_create.file_ids)

    def find_all_chats(self, session: Session):
        return self.chat_repository.find_all(session)

    def get_chat(self, session: Session, chat_id: UUID):
        chat = session.exec(select(Chat).where(Chat.id == chat_id)).one_or_none()
        if not chat:
            raise ValueError(f"Chat with ID {chat_id} does not exist.")
        return chat

    async def send_message(
        self, session: Session, chat_id: UUID, message_create: MessageCreate
    ):
        human_message = Message(
            content=message_create.content,
            chat_id=chat_id,
            sender_type=SenderType.HUMAN,
        )

        chat = self.get_chat(session, chat_id)
        docs = await self.document_svc.search(
            human_message.content, [file.id for file in chat.files]
        )

        answer = self.ai_svc.retrieve_answer(
            human_message.content,
            docs,
        )
        if not answer:
            answer = "N/A"

        ai_message = Message(content=answer, chat_id=chat_id, sender_type=SenderType.AI)

        self.message_repository.save_messages(session, human_message, ai_message)

        return ai_message

    def find_messages(self, session: Session, chat_id: UUID) -> Sequence[Message]:
        return self.message_repository.find_by_chat_id(session, chat_id)

The most interesting method is send_message, which:

Gets the chat by its chat_id
Searches for documents relevant to the message within the chat's files
Sends a request to the LLM with the user's message and the retrieved documents
Saves the user message and the AI response
Returns the AI response to the user

Testing

0. Install dependencies

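To run this project, Poetry must be installed on your system.

poetry install - installs the dependencies
poetry shell - activates the project's virtualenv in the current shell (in Poetry 2.x this command requires the poetry-plugin-shell plugin; poetry env activate is the built-in alternative)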

1. Create .env file

Let's test the API manually to see how it works. The code is available on GitHub, so you can clone the repository and run it locally. Create a .env file with these variables:

PDF_ANALYZER_OPENAI_API_KEY - OpenAI API key.
PDF_ANALYZER_DB_URL - Postgres connection string.
- Use postgresql://root:root@localhost:5432/pdf-analyzer if you run Postgres from the docker-compose.yaml file.

2. Launch docker-compose.yaml

docker compose up -d - starts Postgres with the vector extension configured in a Docker container.

3. Launch FastAPI server

Run this command to start FastAPI:

fastapi dev src/pdf_analyzer/main.py

Logs will look like this:

4. Upload a file

Open http://127.0.0.1:8000/docs#/files/upload_file_files_upload__post and upload any PDF file. In my example I upload the Technology Radar PDF.

5. Create a chat

Open http://127.0.0.1:8000/docs#/chats/create_chat_chats__post and create a chat using the file ID returned in the upload response.

6. Send a message

Open http://127.0.0.1:8000/docs#/chats/send_message_chats__chat_id__message_post and send a message to the chat to ask about the uploaded file.

Here is the response:

Conclusions

The source code is available on GitHub.

In this article I showed how to build a RAG API in Python with LangChain and FastAPI. The RAG technique looks useful, and I plan to integrate it into some real-world applications.

To recap, the general RAG algorithm looks like this:


🚀 If you enjoyed this, check out my blog for more AI + backend deep-dives: 🔗 vitaliihonchar.com

Or subscribe to my newsletter on building real-world AI systems: 📬 Substack – Vitalii Honchar

⛓️‍💥 Let's connect!