<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: nghiach</title>
    <description>The latest articles on Forem by nghiach (@chnghia).</description>
    <link>https://forem.com/chnghia</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3682079%2Fee01a5bf-c192-4de2-bcc0-ed5ba0cbf8ed.png</url>
      <title>Forem: nghiach</title>
      <link>https://forem.com/chnghia</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/chnghia"/>
    <language>en</language>
    <item>
      <title>Beyond RAG: Why AI Agents Need Memory as an Asset — Not a Cache</title>
      <dc:creator>nghiach</dc:creator>
      <pubDate>Thu, 26 Feb 2026 14:30:30 +0000</pubDate>
      <link>https://forem.com/chnghia/beyond-rag-why-ai-agents-need-memory-as-an-asset-not-a-cache-5hbi</link>
      <guid>https://forem.com/chnghia/beyond-rag-why-ai-agents-need-memory-as-an-asset-not-a-cache-5hbi</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Most AI agents today have memory.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;But almost all of them forget.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We’ve built smarter models.&lt;br&gt;
We’ve built tool-using agents.&lt;br&gt;
We’ve built autonomous loops.&lt;/p&gt;

&lt;p&gt;And yet — when you talk to most AI systems for a week, they still feel… stateless.&lt;/p&gt;

&lt;p&gt;They don’t evolve with you.&lt;br&gt;
They don’t accumulate structured knowledge.&lt;br&gt;
They don’t grow.&lt;/p&gt;

&lt;p&gt;The problem is not reasoning.&lt;br&gt;
The problem is memory.&lt;br&gt;
And more specifically — the way we think about memory is fundamentally broken.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Illusion of Memory in Modern AI Agents
&lt;/h2&gt;

&lt;p&gt;Let’s look at how “memory” is implemented today.&lt;/p&gt;
&lt;h3&gt;
  
  
  1️⃣ Chat History
&lt;/h3&gt;

&lt;p&gt;Most systems treat memory as conversation history.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Append messages.&lt;/li&gt;
&lt;li&gt;Trim old tokens.&lt;/li&gt;
&lt;li&gt;Inject recent context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not memory.&lt;br&gt;
This is a sliding window buffer.&lt;/p&gt;

&lt;p&gt;It is fragile, token-bound, and disposable.&lt;/p&gt;


&lt;h3&gt;
  
  
  2️⃣ Vector Databases
&lt;/h3&gt;

&lt;p&gt;The more advanced pattern?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store embeddings.&lt;/li&gt;
&lt;li&gt;Retrieve semantically similar chunks.&lt;/li&gt;
&lt;li&gt;Inject into prompt.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is often called “long-term memory.”&lt;br&gt;
But it’s still not memory.&lt;br&gt;
It’s retrieval.&lt;br&gt;
Frameworks like LangChain and many RAG systems popularized this pattern.&lt;br&gt;
Even large research organizations like OpenAI have experimented with memory retrieval layers.&lt;/p&gt;

&lt;p&gt;But let’s be honest:&lt;br&gt;
A vector store is not memory.&lt;br&gt;
It’s an index.&lt;/p&gt;


&lt;h3&gt;
  
  
  3️⃣ RAG
&lt;/h3&gt;

&lt;p&gt;Retrieval-Augmented Generation (RAG) is powerful.&lt;br&gt;
But RAG is about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Injecting documents&lt;/li&gt;
&lt;li&gt;Improving grounding&lt;/li&gt;
&lt;li&gt;Reducing hallucination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RAG does not answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who owns the memory?&lt;/li&gt;
&lt;li&gt;How does memory evolve?&lt;/li&gt;
&lt;li&gt;How is memory versioned?&lt;/li&gt;
&lt;li&gt;When should memory be archived?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RAG treats memory as context.&lt;br&gt;
Not as a governed asset.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why This Is Fundamentally Broken
&lt;/h2&gt;

&lt;p&gt;If we want agents to move beyond chatbots, three things must change.&lt;/p&gt;


&lt;h3&gt;
  
  
  ❗ 1. No Lifecycle
&lt;/h3&gt;

&lt;p&gt;In most systems:&lt;br&gt;
Memory is written.&lt;br&gt;
And then… it just sits there.&lt;/p&gt;

&lt;p&gt;No:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validation&lt;/li&gt;
&lt;li&gt;Confirmation&lt;/li&gt;
&lt;li&gt;Evolution&lt;/li&gt;
&lt;li&gt;Archiving&lt;/li&gt;
&lt;li&gt;Deprecation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In real systems — especially enterprise systems — data has lifecycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Draft&lt;/li&gt;
&lt;li&gt;Review&lt;/li&gt;
&lt;li&gt;Commit&lt;/li&gt;
&lt;li&gt;Archive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why should agent memory be different?&lt;/p&gt;


&lt;h3&gt;
  
  
  ❗ 2. No Governance
&lt;/h3&gt;

&lt;p&gt;Enterprise AI requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Audit trails&lt;/li&gt;
&lt;li&gt;Versioning&lt;/li&gt;
&lt;li&gt;Ownership&lt;/li&gt;
&lt;li&gt;Access control&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But current agent memory layers are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implicit&lt;/li&gt;
&lt;li&gt;Opaque&lt;/li&gt;
&lt;li&gt;Unstructured&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If an agent “remembers” something wrong, how do you correct it?&lt;br&gt;
If it infers a pattern incorrectly, how do you roll it back?&lt;br&gt;
Without governance, memory becomes a liability.&lt;/p&gt;


&lt;h3&gt;
  
  
  ❗ 3. No Abstraction
&lt;/h3&gt;

&lt;p&gt;Most frameworks reduce memory to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;embedding → similarity search → context injection
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s an implementation detail.&lt;br&gt;
Memory is not an embedding.&lt;br&gt;
Memory is structured knowledge that evolves over time.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Shift in Mindset: Memory as an Asset
&lt;/h2&gt;

&lt;p&gt;In modern data engineering, platforms like &lt;a href="https://dagster.io/" rel="noopener noreferrer"&gt;Dagster&lt;/a&gt; and &lt;a href="https://hamilton.apache.org/" rel="noopener noreferrer"&gt;Apache Hamilton&lt;/a&gt; introduced a powerful idea:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Treat data as assets.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Assets are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Versioned&lt;/li&gt;
&lt;li&gt;Observable&lt;/li&gt;
&lt;li&gt;Governed&lt;/li&gt;
&lt;li&gt;Materialized through pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What if we applied the same thinking to agent memory?&lt;/p&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Memory = context&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We define:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Memory = Asset&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That means memory becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First-class&lt;/li&gt;
&lt;li&gt;Queryable&lt;/li&gt;
&lt;li&gt;Lifecycle-managed&lt;/li&gt;
&lt;li&gt;Policy-enforced&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This changes everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  From Chatbot to Stateful System
&lt;/h2&gt;

&lt;p&gt;There is an evolution happening in AI systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Chatbot
&lt;/h3&gt;

&lt;p&gt;Stateless. Reactive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2: Tool-Using Agent
&lt;/h3&gt;

&lt;p&gt;Calls APIs. Slightly more capable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 3: Autonomous Agent
&lt;/h3&gt;

&lt;p&gt;Loops. Plans. Executes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 4 (Emerging): Stateful System
&lt;/h3&gt;

&lt;p&gt;Structured memory. Lifecycle. Governance.&lt;/p&gt;

&lt;p&gt;The first three focus on reasoning.&lt;br&gt;
The fourth focuses on continuity.&lt;br&gt;
Continuity is what makes intelligence compound.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Does “Memory as Asset” Actually Mean?
&lt;/h2&gt;

&lt;p&gt;It means we stop thinking about memory as a blob of text.&lt;br&gt;
Instead, we design it like a data system.&lt;/p&gt;




&lt;h3&gt;
  
  
  1️⃣ Memory Has a Lifecycle
&lt;/h3&gt;

&lt;p&gt;Every memory object should go through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Draft → Refine → Commit → Archive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;User says:&lt;br&gt;
“I started going to the gym.”&lt;/p&gt;

&lt;p&gt;The system should not instantly mutate the user profile permanently.&lt;/p&gt;

&lt;p&gt;Instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Draft memory: “User may have started gym habit.”&lt;/li&gt;
&lt;li&gt;Confirm or observe consistency.&lt;/li&gt;
&lt;li&gt;Promote to committed memory.&lt;/li&gt;
&lt;li&gt;Archive if outdated.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This mirrors how enterprise systems handle critical data.&lt;/p&gt;
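&lt;p&gt;As a minimal sketch of the lifecycle above (all names here are hypothetical, not from any specific framework), memory objects can carry an explicit status and only move along allowed transitions:&lt;/p&gt;

```python
from enum import Enum

class MemoryStatus(Enum):
    DRAFT = "draft"
    COMMITTED = "committed"
    ARCHIVED = "archived"

# Allowed transitions: a draft must be confirmed before promotion,
# and committed memories can only be archived, never silently edited.
TRANSITIONS = {
    MemoryStatus.DRAFT: {MemoryStatus.COMMITTED, MemoryStatus.ARCHIVED},
    MemoryStatus.COMMITTED: {MemoryStatus.ARCHIVED},
    MemoryStatus.ARCHIVED: set(),
}

class MemoryObject:
    def __init__(self, content: str):
        self.content = content
        self.status = MemoryStatus.DRAFT  # every memory starts as a draft

    def transition(self, target: MemoryStatus) -> None:
        if target not in TRANSITIONS[self.status]:
            raise ValueError(f"Illegal transition {self.status} -> {target}")
        self.status = target

# The gym observation is drafted first, then promoted once confirmed.
m = MemoryObject("User may have started a gym habit.")
m.transition(MemoryStatus.COMMITTED)
```

The point is not the enum itself but that promotion is an explicit, checkable step rather than an implicit side effect of writing to a store.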




&lt;h3&gt;
  
  
  2️⃣ Memory Has Layers
&lt;/h3&gt;

&lt;p&gt;Not all memory is equal.&lt;br&gt;
A practical structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fact&lt;/strong&gt; — Atomic event&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temporal&lt;/strong&gt; — Time-aware sequence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pattern&lt;/strong&gt; — Repeated behavior&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Insight&lt;/strong&gt; — Derived inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Fact:&lt;br&gt;
“User bought coffee.”&lt;/p&gt;

&lt;p&gt;Temporal:&lt;br&gt;
“User buys coffee almost every weekday.”&lt;/p&gt;

&lt;p&gt;Pattern:&lt;br&gt;
“User has strong morning routine behavior.”&lt;/p&gt;

&lt;p&gt;Insight:&lt;br&gt;
“User productivity correlates with early caffeine intake.”&lt;/p&gt;

&lt;p&gt;This layered design prevents shallow personalization.&lt;br&gt;
It enables structured evolution.&lt;/p&gt;
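&lt;p&gt;One way to sketch these layers (a simplified illustration, not a prescribed schema) is to tag each memory with its layer and keep a provenance link to the lower-layer memories it was derived from:&lt;/p&gt;

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    layer: str               # "fact" | "temporal" | "pattern" | "insight"
    text: str
    derived_from: list = field(default_factory=list)  # provenance chain

fact = Memory("fact", "User bought coffee.")
temporal = Memory("temporal", "User buys coffee almost every weekday.", [fact])
pattern = Memory("pattern", "User has strong morning routine behavior.", [temporal])

def provenance(m: Memory) -> list:
    """Walk the derivation chain: an insight is only as good as its facts."""
    chain = [m.text]
    for parent in m.derived_from:
        chain.extend(provenance(parent))
    return chain
```

With explicit provenance, a wrong pattern can be traced back to the facts that produced it and invalidated together with them.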




&lt;h3&gt;
  
  
  3️⃣ Memory Is Governed
&lt;/h3&gt;

&lt;p&gt;When memory becomes an asset, it must support:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Versioning&lt;/li&gt;
&lt;li&gt;Traceability&lt;/li&gt;
&lt;li&gt;Ownership&lt;/li&gt;
&lt;li&gt;Deletion policies&lt;/li&gt;
&lt;li&gt;Access scope&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is critical for enterprise AI.&lt;br&gt;
Without governance, long-term memory becomes a compliance nightmare.&lt;/p&gt;
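&lt;p&gt;A hedged sketch of what governed memory could look like (hypothetical names, append-only versioning as one possible policy): every revision appends a new version instead of overwriting, so rollback and audit come for free:&lt;/p&gt;

```python
from datetime import datetime, timezone

class GovernedMemory:
    """Append-only record: revisions are versioned, never destructive."""

    def __init__(self, owner: str, text: str):
        self.owner = owner  # explicit ownership for access scoping
        self.versions = [(1, text, datetime.now(timezone.utc))]

    def revise(self, text: str) -> None:
        version = self.versions[-1][0] + 1
        self.versions.append((version, text, datetime.now(timezone.utc)))

    def rollback(self) -> None:
        if len(self.versions) > 1:
            self.versions.pop()  # revert a bad inference; history stays auditable

    @property
    def current(self) -> str:
        return self.versions[-1][1]

m = GovernedMemory("user:42", "User prefers morning meetings.")
m.revise("User prefers afternoon meetings.")  # a wrong inference
m.rollback()                                  # corrected, with a trail
```

This is the answer to "if it infers a pattern incorrectly, how do you roll it back?": the same way a database does, through versions.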




&lt;h2&gt;
  
  
  Why This Matters for Enterprise AI
&lt;/h2&gt;

&lt;p&gt;Enterprises don’t fear AI because of hallucination.&lt;br&gt;
They fear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uncontrolled state mutation&lt;/li&gt;
&lt;li&gt;Lack of audit trail&lt;/li&gt;
&lt;li&gt;Untraceable decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If an AI system accumulates knowledge about customers, employees, or operations:&lt;br&gt;
It must behave like a data platform — not a chatbot.&lt;br&gt;
Memory as Asset enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trust&lt;/li&gt;
&lt;li&gt;Control&lt;/li&gt;
&lt;li&gt;Observability&lt;/li&gt;
&lt;li&gt;Compliance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It bridges the gap between:&lt;/p&gt;

&lt;p&gt;LLM experimentation&lt;br&gt;
and&lt;br&gt;
Enterprise-grade systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters for Personal AI
&lt;/h2&gt;

&lt;p&gt;On the consumer side, the impact is even bigger.&lt;/p&gt;

&lt;p&gt;Imagine an AI that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Truly understands your long-term goals&lt;/li&gt;
&lt;li&gt;Evolves with your habits&lt;/li&gt;
&lt;li&gt;Detects patterns in your behavior&lt;/li&gt;
&lt;li&gt;Refines its understanding over months or years&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not by guessing every session.&lt;br&gt;
But by managing memory structurally.&lt;br&gt;
This is how we move from:&lt;/p&gt;

&lt;p&gt;“Smart assistant”&lt;br&gt;
to&lt;br&gt;
“Personal AI Operating System.”&lt;/p&gt;




&lt;h2&gt;
  
  
  The Future of Agentic AI
&lt;/h2&gt;

&lt;p&gt;Today, most research focuses on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better reasoning&lt;/li&gt;
&lt;li&gt;Longer context windows&lt;/li&gt;
&lt;li&gt;Stronger tool use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the real bottleneck is not reasoning power.&lt;br&gt;
It is memory structure.&lt;br&gt;
The future of agentic AI will not be defined by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bigger models&lt;/li&gt;
&lt;li&gt;More tools&lt;/li&gt;
&lt;li&gt;Longer prompts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It will be defined by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stateful systems&lt;/li&gt;
&lt;li&gt;Structured memory&lt;/li&gt;
&lt;li&gt;Lifecycle governance&lt;/li&gt;
&lt;li&gt;Asset-based intelligence&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Autonomous agents are exciting.&lt;br&gt;
But autonomy without memory discipline is chaos.&lt;br&gt;
If we want AI systems that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Persist&lt;/li&gt;
&lt;li&gt;Personalize&lt;/li&gt;
&lt;li&gt;Compound intelligence&lt;/li&gt;
&lt;li&gt;Earn trust&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We must stop treating memory as a cache.&lt;br&gt;
And start treating it as an asset.&lt;/p&gt;




&lt;p&gt;If this idea resonates with you, I’m currently exploring it in more depth through a structured architecture approach for stateful AI systems.&lt;br&gt;
Because the next frontier of AI is not just thinking better.&lt;br&gt;
It’s remembering better.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>agenticai</category>
      <category>aiarchitecture</category>
    </item>
    <item>
      <title>Atomic Stateful Agent — From Architecture Idea to Working Code</title>
      <dc:creator>nghiach</dc:creator>
      <pubDate>Sun, 22 Feb 2026 08:24:20 +0000</pubDate>
      <link>https://forem.com/chnghia/atomic-stateful-agent-from-architecture-idea-to-working-code-1ljh</link>
      <guid>https://forem.com/chnghia/atomic-stateful-agent-from-architecture-idea-to-working-code-1ljh</guid>
      <description>&lt;p&gt;In a previous article, I introduced the idea of the &lt;strong&gt;Atomic Stateful Agent (ASA)&lt;/strong&gt; — an architecture for building AI agents that can operate safely inside real enterprise workflows.&lt;/p&gt;

&lt;p&gt;This repo, &lt;strong&gt;atomic-stateful-agent&lt;/strong&gt;, is the practical implementation of that idea.&lt;/p&gt;

&lt;p&gt;It’s not about making agents “talk better”.&lt;br&gt;
It’s about making agents &lt;strong&gt;behave safely around real system state&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/chnghia/atomic-stateful-agent" rel="noopener noreferrer"&gt;https://github.com/chnghia/atomic-stateful-agent&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/Vv2As_h7kjM"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h2&gt;
  
  
  🚨 The Problem with Chat-Centric Agents
&lt;/h2&gt;

&lt;p&gt;Most agent systems today look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User message → LLM → tool call → system update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works well for demos.&lt;br&gt;
But in real systems (ERP, CRM, tickets, finance records…):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The LLM might misunderstand intent&lt;/li&gt;
&lt;li&gt;A half-complete instruction can still trigger a write&lt;/li&gt;
&lt;li&gt;Conversation history becomes your “state” (unstructured, fragile)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s risky.&lt;/p&gt;

&lt;p&gt;Enterprises don’t run on chat history.&lt;br&gt;
They run on &lt;strong&gt;entities, workflows, and transactions&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  💡 What Is an Atomic Stateful Agent?
&lt;/h2&gt;

&lt;p&gt;An &lt;strong&gt;Atomic Stateful Agent (ASA)&lt;/strong&gt; is a system where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Workflow logic is modeled as a &lt;strong&gt;state machine&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Business objects are treated as &lt;strong&gt;entities&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Changes happen through a &lt;strong&gt;draft → commit&lt;/strong&gt; process&lt;/li&gt;
&lt;li&gt;LLM reasoning is &lt;strong&gt;controlled&lt;/strong&gt;, not in charge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;LLM = reasoning layer&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;ASA = transactional control layer&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The LLM can suggest.&lt;br&gt;
The system decides.&lt;/p&gt;


&lt;h2&gt;
  
  
  🧩 AIU + ASA: Two Layers, Two Responsibilities
&lt;/h2&gt;

&lt;p&gt;This repo combines two ideas:&lt;/p&gt;
&lt;h3&gt;
  
  
  🧠 AIU — Atomic Inference Unit
&lt;/h3&gt;

&lt;p&gt;(from my earlier article)&lt;/p&gt;

&lt;p&gt;AIU is about &lt;strong&gt;atomic reasoning&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An AIU:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has clear structured input&lt;/li&gt;
&lt;li&gt;Produces structured output&lt;/li&gt;
&lt;li&gt;Solves one focused inference task&lt;/li&gt;
&lt;li&gt;Is &lt;strong&gt;stateless&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Does not control workflows or databases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this repo, AIUs are built using the &lt;strong&gt;JIL stack&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Jinja → Instructor → LiteLLM&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;They live mainly in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;core/      # inference abstractions
schemas/   # structured IO contracts
prompts/   # Jinja templates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AIU = safe, structured thinking&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
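&lt;p&gt;To make the AIU contract concrete, here is a stdlib-only sketch (the repo itself uses Jinja, Instructor, and LiteLLM; here the template is &lt;code&gt;string.Template&lt;/code&gt; and the LLM call is stubbed, so all names are illustrative):&lt;/p&gt;

```python
from dataclasses import dataclass
from string import Template

@dataclass
class IntentInput:        # structured input contract
    user_message: str

@dataclass
class IntentOutput:       # structured output contract
    intent: str
    confidence: float

PROMPT = Template("Classify the intent of this message: $user_message")

def classify_intent(inp: IntentInput, llm=None) -> IntentOutput:
    """One focused, stateless inference task: message -> intent."""
    prompt = PROMPT.substitute(user_message=inp.user_message)
    if llm is None:
        # Stub: a real AIU would send `prompt` to an LLM and validate the
        # response against the output schema before returning it.
        return IntentOutput(intent="daily_log", confidence=0.0)
    return llm(prompt)
```

The essential property is that the AIU owns no workflow state: it maps one typed input to one typed output and nothing else.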




&lt;h3&gt;
  
  
  🔄 ASA — Atomic Stateful Agent
&lt;/h3&gt;

&lt;p&gt;(the workflow/state layer)&lt;/p&gt;

&lt;p&gt;ASA is about &lt;strong&gt;atomic state control&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Workflow steps&lt;/li&gt;
&lt;li&gt;State transitions&lt;/li&gt;
&lt;li&gt;Draft data&lt;/li&gt;
&lt;li&gt;Commit/cancel logic&lt;/li&gt;
&lt;li&gt;Interaction with external systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This layer is implemented through:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nodes/     # workflow nodes
state.py   # AgentState
graph.py   # LangGraph state machine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AIUs think&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;ASA decides what can happen&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This separation is the core design principle.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔑 Core ASA Patterns in This Repo
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1️⃣ Sticky Routing (Intent Lock)
&lt;/h3&gt;

&lt;p&gt;Once a workflow starts, the agent stays in that intent.&lt;/p&gt;

&lt;p&gt;Example state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;active_intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily_log&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;active_draft&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{...},&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;record_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent doesn’t jump to other tasks just because the user changes wording.&lt;/p&gt;
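&lt;p&gt;A minimal sketch of the intent lock (simplified from the repo; function names here are hypothetical): the router only re-classifies when no workflow is active, otherwise it returns the locked intent:&lt;/p&gt;

```python
def route(state: dict, classify) -> str:
    """Sticky routing: an active workflow pins all follow-up messages."""
    if state.get("active_intent"):
        return state["active_intent"]        # intent lock: stay in the flow
    intent = classify(state["last_message"]) # only classify when unlocked
    state["active_intent"] = intent
    return intent

state = {"active_intent": None, "last_message": "log my workout"}
first = route(state, classify=lambda msg: "daily_log")

# The user changes wording mid-flow; the lock keeps the same workflow.
state["last_message"] = "actually, something else entirely"
locked = route(state, classify=lambda msg: "other_flow")
```

The lock is released only by an explicit commit or cancel, not by the classifier changing its mind.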




&lt;h3&gt;
  
  
  2️⃣ Draft → Commit Protocol
&lt;/h3&gt;

&lt;p&gt;The agent never writes to the real system immediately.&lt;/p&gt;

&lt;p&gt;Instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User input
   ↓
Create draft state
   ↓
Refine over multiple turns
   ↓
User confirms
   ↓
Commit → persist
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;During conversation, the user edits a &lt;strong&gt;draft object&lt;/strong&gt;, not production data.&lt;/p&gt;
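&lt;p&gt;The protocol above can be sketched in a few lines (a simplified stand-in for the repo’s state model; the in-memory dict plays the role of &lt;code&gt;mock_db.py&lt;/code&gt;):&lt;/p&gt;

```python
class DraftSession:
    """Edits mutate a draft; only an explicit commit persists anything."""

    def __init__(self, db: dict):
        self.db = db
        self.draft = {}

    def update(self, **fields) -> None:
        self.draft.update(fields)       # safe: production data untouched

    def commit(self) -> int:
        record_id = len(self.db) + 1
        self.db[record_id] = dict(self.draft)  # single atomic write
        self.draft = {}
        return record_id

    def cancel(self) -> None:
        self.draft = {}                 # rollback: nothing was persisted

db = {}
session = DraftSession(db)
session.update(title="Review PR for auth feature")
session.update(priority="high")         # refine over multiple turns
record_id = session.commit()            # user confirms -> persist
```

Every conversational turn before the commit is a cheap, reversible edit to the draft, never a write to the system of record.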




&lt;h3&gt;
  
  
  3️⃣ Recall &amp;amp; Hydrate
&lt;/h3&gt;

&lt;p&gt;When editing existing data:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load record from DB&lt;/li&gt;
&lt;li&gt;Hydrate it into draft state&lt;/li&gt;
&lt;li&gt;Let the user edit&lt;/li&gt;
&lt;li&gt;Commit or cancel&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is not chat memory.&lt;br&gt;
It’s structured state restoration.&lt;/p&gt;
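&lt;p&gt;Recall &amp;amp; hydrate reduces to two small operations (sketched here against a plain dict standing in for the database):&lt;/p&gt;

```python
def hydrate(db: dict, record_id: int) -> dict:
    """Load a record into draft state: the draft is a copy, not the live row."""
    return dict(db[record_id])

def save(db: dict, record_id: int, draft: dict) -> None:
    """Commit the edited draft back over the original record."""
    db[record_id] = dict(draft)

db = {1: {"title": "Review PR for auth feature", "priority": "high"}}

draft = hydrate(db, 1)
draft["title"] = "Review PR plan"   # user edits the draft, not the record
save(db, 1, draft)                  # explicit save applies the change
```

Because the draft is a structured copy of the entity, the agent can resume editing at any time; no amount of chat-history retrieval gives you that.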




&lt;h2&gt;
  
  
  📦 Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/
├── core/        # AIU foundation (inference layer)
├── schemas/     # structured IO for AIUs
├── prompts/     # Jinja templates (AIU prompts)
├── nodes/       # ASA workflow nodes
├── state.py     # AgentState model
├── mock_db.py   # in-memory DB
└── graph.py     # LangGraph builder

main.py          # CLI entry point
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key idea:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Prompts do not define the system&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;State logic lives in code&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LLM reasoning is modular (AIUs)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Workflow control is deterministic (ASA)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ▶️ Running the Agent
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# add LLM key&lt;/span&gt;

python main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get an interactive console, but the behavior is state-driven — not just free chat.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧪 Example — Creating a Record
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: Log task: Review PR for auth feature
Agent: Draft created

You: Change priority to high
Agent: Draft updated

You: Done
Agent: Task saved successfully!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The task was not saved immediately&lt;/li&gt;
&lt;li&gt;Only after confirmation does it become real&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔁 Editing an Existing Record
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: Edit the PR review task
Agent: Record found. Hydrating draft…

You: Change title to "Review PR plan"
Agent: Draft updated

You: Save
Agent: Task updated!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is structured state editing, not memory guessing.&lt;/p&gt;




&lt;h2&gt;
  
  
  🏗 Why This Matters
&lt;/h2&gt;

&lt;p&gt;If your agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Updates tickets&lt;/li&gt;
&lt;li&gt;Modifies contracts&lt;/li&gt;
&lt;li&gt;Handles finance data&lt;/li&gt;
&lt;li&gt;Talks to ERP/CRM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then &lt;strong&gt;prompt engineering is not enough&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;State machines&lt;/li&gt;
&lt;li&gt;Entity models&lt;/li&gt;
&lt;li&gt;Transaction-like control&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s what ASA provides.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔄 Architecture → Implementation
&lt;/h2&gt;

&lt;p&gt;My previous article explained the &lt;strong&gt;architecture and mindset&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This repo shows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Here’s how to actually build it.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Together:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AIU → Atomic reasoning&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;ASA → Atomic state control&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 Final Thought
&lt;/h2&gt;

&lt;p&gt;The future of enterprise agents is not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Smarter chat”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It’s:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Controlled AI operating on structured state&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Start thinking less about longer prompts,&lt;br&gt;
and more about &lt;strong&gt;state, entities, and workflows&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>Beyond the Chatbot: The “Atomic Stateful Agent” Architecture for Enterprise AI</title>
      <dc:creator>nghiach</dc:creator>
      <pubDate>Wed, 28 Jan 2026 09:00:07 +0000</pubDate>
      <link>https://forem.com/chnghia/beyond-the-chatbot-the-atomic-stateful-agent-architecture-for-enterprise-ai-4504</link>
      <guid>https://forem.com/chnghia/beyond-the-chatbot-the-atomic-stateful-agent-architecture-for-enterprise-ai-4504</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Most AI agents today are great at talking.&lt;br&gt;
Enterprises, however, don’t run on conversations — they run on &lt;strong&gt;transactions, entities, and state&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This article introduces a new architectural mindset: &lt;strong&gt;Atomic Stateful Agents (ASA)&lt;/strong&gt; — a pattern designed not for demos, but for &lt;strong&gt;real enterprise workflows&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. The Pain Point: When Agent Demos Meet Enterprise Reality&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We’ve all seen it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LangGraph demos 🤯&lt;/li&gt;
&lt;li&gt;CrewAI task chains 🚀&lt;/li&gt;
&lt;li&gt;Multi-agent orchestration videos with glowing UI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It works beautifully when the task is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Research this topic”&lt;br&gt;
“Summarize this report”&lt;br&gt;
“Generate a plan”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then we bring it into the company.&lt;/p&gt;

&lt;p&gt;Suddenly the task becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log a task into the system&lt;/li&gt;
&lt;li&gt;Submit a PO approval&lt;/li&gt;
&lt;li&gt;Update project cost&lt;/li&gt;
&lt;li&gt;Modify contract metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And everything… breaks.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Root Cause: The Linear Chat Trap&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Most current agent systems operate like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → Message → Agent → Tool → Response → Done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This model assumes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;State = conversation history&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But enterprise workflows don’t work like that.&lt;/p&gt;

&lt;p&gt;In real systems:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Reality&lt;/th&gt;
&lt;th&gt;What Chat Agents Assume&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data is stored in &lt;strong&gt;entities&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Data is “remembered” in chat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Changes are &lt;strong&gt;transactions&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Changes happen by “saying things”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Work is &lt;strong&gt;paused and resumed&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Conversations are linear&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Edits are normal&lt;/td&gt;
&lt;td&gt;Past output is final&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We are trying to edit &lt;strong&gt;structured data&lt;/strong&gt; using &lt;strong&gt;unstructured dialogue&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That’s the mismatch.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;2. The Trinity Model: Brain – Heart – Face&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To move beyond chatbot-style agents, we need a layered view.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The A.S.G Stack&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;What it Represents&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Brain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reasoning&lt;/td&gt;
&lt;td&gt;LLM, planning, tool selection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Heart&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;State Control&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Atomic Stateful Agent (ASA)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Face&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Interface&lt;/td&gt;
&lt;td&gt;Chat UI, Forms, APIs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most systems today focus almost entirely on the &lt;strong&gt;Brain&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But enterprises fail without the &lt;strong&gt;Heart&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  ❤️ &lt;strong&gt;ASA = The Heart&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Atomic Stateful Agent&lt;/strong&gt; is responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Managing &lt;strong&gt;entities&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Controlling &lt;strong&gt;state transitions&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Enforcing &lt;strong&gt;workflow structure&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Acting as a &lt;strong&gt;transaction boundary&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s not “smart” in the LLM sense.&lt;br&gt;
It’s &lt;strong&gt;deterministic, structured, and stubborn&lt;/strong&gt; — by design.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Brain can improvise.&lt;br&gt;
The Heart must not.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;3. The Core Patterns (The Real Power)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is where ASA becomes fundamentally different from “fire-and-forget” agents.&lt;/p&gt;


&lt;h3&gt;
  
  
  🔒 Pattern 1: &lt;strong&gt;Sticky Routing (The Lock)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt;&lt;br&gt;
Normal routers behave statelessly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Message A → Agent X  
Message B → Agent Y  
Message C → Agent Z
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But enterprise tasks are &lt;strong&gt;sessions tied to an entity&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Example: Creating a Purchase Order.&lt;/p&gt;

&lt;p&gt;Once a user starts editing PO #2026-001:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;All actions must stay &lt;strong&gt;inside this entity context&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Sticky Routing does:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Binds the session to a specific &lt;strong&gt;entity instance&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Prevents jumping to unrelated flows&lt;/li&gt;
&lt;li&gt;Ensures the agent continues within the same &lt;strong&gt;state machine&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not “conversation memory.”&lt;br&gt;
This is &lt;strong&gt;context locking&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  📝 Pattern 2: &lt;strong&gt;Draft–Commit Protocol (The Editor)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This is the most important shift.&lt;/p&gt;
&lt;h4&gt;
  
  
  ❌ Traditional Agent Model
&lt;/h4&gt;

&lt;p&gt;User says something → Agent calls tool → Data saved immediately.&lt;/p&gt;

&lt;p&gt;That’s like autosaving every typo directly to production.&lt;/p&gt;
&lt;h4&gt;
  
  
  ✅ ASA Model
&lt;/h4&gt;

&lt;p&gt;We introduce a &lt;strong&gt;transaction buffer&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Input → Draft State → Review → Modify → Commit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key ideas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Draft is &lt;strong&gt;mutable&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Commit is &lt;strong&gt;atomic&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Nothing touches the real system until commit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This mirrors &lt;strong&gt;ACID transaction thinking&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Database Concept&lt;/th&gt;
&lt;th&gt;ASA Equivalent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Transaction&lt;/td&gt;
&lt;td&gt;Workflow session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Buffer&lt;/td&gt;
&lt;td&gt;Draft state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Commit&lt;/td&gt;
&lt;td&gt;Final confirmation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rollback&lt;/td&gt;
&lt;td&gt;Discard draft&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We are no longer “talking to a system.”&lt;br&gt;
We are &lt;strong&gt;editing a transaction&lt;/strong&gt;.&lt;/p&gt;
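&lt;p&gt;As a rough sketch (class and field names are hypothetical, not ASA's API), the transaction buffer can be as simple as:&lt;/p&gt;

```python
# Hypothetical sketch of the Draft-Commit protocol: user edits mutate a
# draft buffer; only commit() writes to the real store, and rollback()
# discards everything, mirroring a database transaction.
class DraftSession:
    def __init__(self, store: dict):
        self.store = store   # the "real system"
        self.draft = {}      # mutable transaction buffer

    def edit(self, field: str, value) -> None:
        self.draft[field] = value   # safe to change as often as needed

    def commit(self) -> None:
        self.store.update(self.draft)  # one write, from the store's view
        self.draft = {}

    def rollback(self) -> None:
        self.draft = {}                # the store was never touched


po_store = {}
session = DraftSession(po_store)
session.edit("vendor", "Acme")
session.edit("quantity", 10)
assert po_store == {}  # still untouched before commit
session.commit()
assert po_store == {"vendor": "Acme", "quantity": 10}
```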


&lt;h3&gt;
  
  
  ⏳ Pattern 3: &lt;strong&gt;Recall &amp;amp; Hydrate (The Time Machine)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This is where ASA breaks away from RAG-based thinking.&lt;/p&gt;
&lt;h4&gt;
  
  
  ❌ How Most Systems Treat the Past
&lt;/h4&gt;

&lt;p&gt;Past = text.&lt;/p&gt;

&lt;p&gt;You retrieve:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“User created a PO for laptops.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But this is just dead text.&lt;/p&gt;

&lt;p&gt;You cannot &lt;strong&gt;continue editing&lt;/strong&gt; that PO.&lt;/p&gt;
&lt;h4&gt;
  
  
  ✅ ASA View
&lt;/h4&gt;

&lt;p&gt;Past workflows are &lt;strong&gt;serialized entities&lt;/strong&gt;, not paragraphs.&lt;/p&gt;

&lt;p&gt;When recalled, they are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Reconstructed → Hydrated → Returned to Draft Mode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hydration means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Restore entity structure&lt;/li&gt;
&lt;li&gt;Restore state machine position&lt;/li&gt;
&lt;li&gt;Restore editable fields&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don’t just “read history.”&lt;/p&gt;

&lt;p&gt;You &lt;strong&gt;re-enter a live workflow instance&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is closer to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Reopening a saved document,&lt;br&gt;
not retrieving a document summary.&lt;/p&gt;
&lt;/blockquote&gt;
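&lt;p&gt;A sketch of the dehydrate/hydrate round trip (the field names are illustrative):&lt;/p&gt;

```python
# Hypothetical sketch of recall-and-hydrate: a past workflow is stored as a
# serialized entity (not prose), so recalling it restores the structure,
# the state machine position, and the editable fields, back in draft mode.
import json


def dehydrate(entity_id: str, state: str, fields: dict) -> str:
    """Serialize a live workflow instance for storage."""
    return json.dumps({"entity_id": entity_id, "state": state, "fields": fields})


def hydrate(blob: str) -> dict:
    """Reconstruct the workflow as an editable draft, not a text summary."""
    record = json.loads(blob)
    record["mode"] = "draft"   # re-enters the workflow, ready for edits
    return record


blob = dehydrate("PO-2026-001", "awaiting_approval",
                 {"vendor": "Acme", "quantity": 10})
draft = hydrate(blob)
print(draft["state"])  # awaiting_approval
```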




&lt;h2&gt;
  
  
  &lt;strong&gt;4. Mindset Shift: AI Should Not Invent Workflows&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here’s the hard truth:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Free-form AI is bad at governance.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We must stop asking AI to “figure out the process.”&lt;/p&gt;

&lt;p&gt;Instead:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Design deterministic workflows. Let AI flex inside them.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Or simply:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Don’t use AI to invent the process.&lt;br&gt;
Use AI to flexibly fill the process.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the marriage of:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Software Engineering&lt;/th&gt;
&lt;th&gt;Generative AI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Structure&lt;/td&gt;
&lt;td&gt;Flexibility&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State Machines&lt;/td&gt;
&lt;td&gt;Language Understanding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deterministic flows&lt;/td&gt;
&lt;td&gt;Adaptive input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transactions&lt;/td&gt;
&lt;td&gt;Draft reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;ASA vs Traditional Agent Architectures&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Fire-and-Forget Agent&lt;/th&gt;
&lt;th&gt;Atomic Stateful Agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;State&lt;/td&gt;
&lt;td&gt;Conversation history&lt;/td&gt;
&lt;td&gt;Explicit entity state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Handling&lt;/td&gt;
&lt;td&gt;Immediate tool writes&lt;/td&gt;
&lt;td&gt;Draft → Commit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workflow Model&lt;/td&gt;
&lt;td&gt;Linear chat&lt;/td&gt;
&lt;td&gt;State machine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Past Sessions&lt;/td&gt;
&lt;td&gt;Text memory (RAG)&lt;/td&gt;
&lt;td&gt;Hydratable entities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Routing&lt;/td&gt;
&lt;td&gt;Message-based&lt;/td&gt;
&lt;td&gt;Context-locked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Determinism&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise Safety&lt;/td&gt;
&lt;td&gt;Fragile&lt;/td&gt;
&lt;td&gt;Structured&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why This Matters Now&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;As AI moves from:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“cool assistant” → “system operator”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We need architectures that treat:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;State as first-class&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Entities as durable objects&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI as a collaborator inside constraints&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ASA is not about making agents smarter.&lt;/p&gt;

&lt;p&gt;It’s about making them &lt;strong&gt;safe to plug into real systems&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;P.S.: For a working implementation of this architecture, see &lt;a href="https://dev.to/chnghia/atomic-stateful-agent-from-architecture-idea-to-working-code-1ljh"&gt;Atomic Stateful Agent: From Architecture Idea to Working Code&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>agents</category>
    </item>
    <item>
      <title>Decoupling the AI Stack: How to Architect a Production-Grade Local LLM System</title>
      <dc:creator>nghiach</dc:creator>
      <pubDate>Thu, 22 Jan 2026 03:44:17 +0000</pubDate>
      <link>https://forem.com/chnghia/decoupling-the-ai-stack-how-to-architect-a-production-grade-local-llm-system-1a0c</link>
      <guid>https://forem.com/chnghia/decoupling-the-ai-stack-how-to-architect-a-production-grade-local-llm-system-1a0c</guid>
      <description>&lt;p&gt;&lt;em&gt;From "Localhost" to "On-Premise": An open-source blueprint for building a privacy-first, scalable AI infrastructure with vLLM and LiteLLM.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We are currently living in the "Golden Age" of Local AI. Tools like Ollama and LM Studio have democratized access to Large Language Models (LLMs), allowing any developer to spin up a 7B parameter model on their laptop in minutes.&lt;/p&gt;

&lt;p&gt;However, a significant gap remains in the ecosystem. While these tools are fantastic for &lt;strong&gt;single-user experimentation&lt;/strong&gt;, they often encounter bottlenecks when promoted to a &lt;strong&gt;shared, enterprise environment&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When you try to move from a "Hobbyist" setup to a "Production" on-premise infrastructure for your team, you face a new set of challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Concurrency:&lt;/strong&gt; How do you serve multiple concurrent users without queuing requests indefinitely?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decoupling:&lt;/strong&gt; How do you swap models (e.g., Llama 3 to Qwen 2.5) without breaking client applications?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance:&lt;/strong&gt; How do you manage API keys, log usage, and enforce budget limits?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article explores an architectural approach to solving these problems by &lt;strong&gt;decoupling the AI stack&lt;/strong&gt;. I will also introduce &lt;strong&gt;SOLV Stack&lt;/strong&gt;, an open-source reference implementation I built to demonstrate this architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architectural Shift: Decoupling Components
&lt;/h2&gt;

&lt;p&gt;In traditional web development, we wouldn't connect our frontend directly to our database. We use API Gateways and Backend services. We need to apply the same rigor to AI Infrastructure.&lt;/p&gt;

&lt;p&gt;A production-grade Local AI system should be composed of three distinct, loosely coupled layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Presentation Layer (UI):&lt;/strong&gt; Where users interact (Chat interface).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Governance Layer (Gateway):&lt;/strong&gt; Where routing, logging, and auth happen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Inference Layer (Compute):&lt;/strong&gt; Where the raw model processing occurs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By separating these concerns, we avoid vendor lock-in and ensure scalability.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reference Architecture (SOLV)
&lt;/h2&gt;

&lt;p&gt;To implement this philosophy practically, I created a Dockerized boilerplate called &lt;strong&gt;SOLV Stack&lt;/strong&gt;. It stands for the four core components selected for their performance and enterprise readiness:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;S&lt;/strong&gt;earXNG (Privacy-focused Search)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;O&lt;/strong&gt;penWebUI (The Interface)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L&lt;/strong&gt;iteLLM (The Gateway)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;V&lt;/strong&gt;LLM (The Inference Engine)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is how data flows through the system:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07t9i29x9kko3hod1ifq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07t9i29x9kko3hod1ifq.png" alt=" " width="800" height="908"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Inference Layer: Why vLLM?
&lt;/h3&gt;

&lt;p&gt;For local development, tools like Ollama (based on llama.cpp) are excellent. However, for a shared infrastructure, throughput is king.&lt;/p&gt;

&lt;p&gt;I chose &lt;strong&gt;vLLM&lt;/strong&gt; for this stack because of its &lt;strong&gt;PagedAttention&lt;/strong&gt; technology. In a multi-user scenario, vLLM manages GPU memory much more efficiently than standard loaders, allowing higher throughput through continuous batching. It is designed to be a server first, maximizing the utilization of your expensive GPUs (like the RTX 5090).&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The Gateway Layer: The Power of LiteLLM
&lt;/h3&gt;

&lt;p&gt;This is perhaps the most critical component for an enterprise architecture. &lt;strong&gt;LiteLLM&lt;/strong&gt; acts as a universal proxy.&lt;/p&gt;

&lt;p&gt;It normalizes all inputs to the OpenAI standard format. This means your client applications (whether it's OpenWebUI, a custom React app, or an IDE plugin like Continue) only need to know how to speak "OpenAI." They don't need to know if the backend is running vLLM, Azure, or Anthropic.&lt;/p&gt;

&lt;p&gt;This enables a &lt;strong&gt;Hybrid Architecture&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Routine tasks:&lt;/strong&gt; Route to local vLLM (Zero cost, 100% privacy).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex reasoning:&lt;/strong&gt; Route to GPT-4 (Pay per token).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This logic is handled strictly at the config level, not in your application code.&lt;/p&gt;
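&lt;p&gt;As a sketch, both routes can live side by side in the same &lt;code&gt;litellm_config.yaml&lt;/code&gt; (model names and the environment-variable key reference below are placeholders to adapt to your deployment):&lt;/p&gt;

```yaml
model_list:
  # Routine tasks stay local: zero cost, full privacy
  - model_name: local-default
    litellm_params:
      model: openai/qwen2.5-coder
      api_base: http://vllm-backend:8000/v1
      api_key: EMPTY
  # Complex reasoning escalates to a hosted model, pay per token
  - model_name: cloud-reasoning
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
```

&lt;p&gt;Clients pick a route by model name only; swapping what sits behind each name is a config change, not a code change.&lt;/p&gt;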

&lt;h3&gt;
  
  
  3. The Interface: OpenWebUI
&lt;/h3&gt;

&lt;p&gt;Currently, &lt;strong&gt;OpenWebUI&lt;/strong&gt; offers the most comprehensive feature set for teams, including RAG (Retrieval Augmented Generation) pipelines, user role management, and chat history. Because our stack is decoupled, if a better UI comes out next year (e.g., LibreChat), you can swap this layer without touching your backend models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation: The SOLV-Stack Boilerplate
&lt;/h2&gt;

&lt;p&gt;I have packaged this entire architecture into a &lt;code&gt;docker-compose&lt;/code&gt; setup that supports &lt;strong&gt;NVIDIA GPUs&lt;/strong&gt; on both Linux and &lt;strong&gt;Windows (WSL2)&lt;/strong&gt;—a crucial feature for organizations where developers work on Windows machines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuration Example
&lt;/h3&gt;

&lt;p&gt;The magic happens in the &lt;code&gt;litellm_config.yaml&lt;/code&gt;. Here, we map our internal vLLM instance to a user-facing model name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# The client sees "gpt-4-local"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4-local&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# But we route it to our local Qwen 2.5 instance&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/qwen2.5-coder&lt;/span&gt;
      &lt;span class="na"&gt;api_base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://vllm-backend:8000/v1&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;EMPTY&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Real-World Use Case: The Private Coding Assistant
&lt;/h3&gt;

&lt;p&gt;One of the most immediate benefits of this stack is enabling AI coding assistants for your team without sending code to the cloud.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deploy SOLV Stack on a local server with an RTX 5090.&lt;/li&gt;
&lt;li&gt;Developers install the &lt;strong&gt;Continue&lt;/strong&gt; or &lt;strong&gt;Cline&lt;/strong&gt; extension in VS Code.&lt;/li&gt;
&lt;li&gt;Point the extension to &lt;code&gt;http://your-server:8080/llm/v1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Result: A Copilot-like experience that runs entirely within your firewall.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building a local AI platform is not just about downloading model weights; it's about designing a system that is stable, observable, and adaptable.&lt;/p&gt;

&lt;p&gt;By moving from a monolithic "localhost" tool to a decoupled architecture using vLLM and LiteLLM, you gain control over your data and your infrastructure.&lt;/p&gt;

&lt;p&gt;If you want to try this architecture yourself, I've open-sourced the setup. It includes scripts for model downloading, Nginx configuration, and RAG pipeline setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/chnghia/solv-stack" rel="noopener noreferrer"&gt;github.com/chnghia/solv-stack&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I'd love to hear how you are architecting your local AI stack. Are you using a Gateway pattern? Let me know in the comments!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Accelerating AI Inference Workflows with the Atomic Inference Boilerplate</title>
      <dc:creator>nghiach</dc:creator>
      <pubDate>Mon, 19 Jan 2026 07:55:57 +0000</pubDate>
      <link>https://forem.com/chnghia/accelerating-ai-inference-workflows-with-the-atomic-inference-boilerplate-75b</link>
      <guid>https://forem.com/chnghia/accelerating-ai-inference-workflows-with-the-atomic-inference-boilerplate-75b</guid>
      <description>&lt;p&gt;&lt;em&gt;An opinionated foundation for reliable, composable LLM inference&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Large language model (LLM) applications grow complex fast. Prompt logic, schema validation, multi-provider setups, and execution patterns become scattered. What if you could standardize &lt;em&gt;how&lt;/em&gt; individual inference steps are written, validated, and executed — leaving orchestration, pipelines, and workflows to higher-level layers?&lt;/p&gt;

&lt;p&gt;That’s the problem the &lt;strong&gt;atomic-inference-boilerplate&lt;/strong&gt; aims to solve: provide a &lt;strong&gt;production-ready foundation&lt;/strong&gt; for building robust inference units that are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Atomic&lt;/strong&gt;: Each unit performs one focused step — rendering a prompt, calling an LLM, validating structured output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composable&lt;/strong&gt;: Easily integrated into larger workflows such as LangGraph, Prefect, or custom orchestration layers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type-safe&lt;/strong&gt;: Outputs are never raw strings; results conform strictly to Pydantic schemas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider-agnostic&lt;/strong&gt;: Works with OpenAI, Anthropic, Ollama, LM Studio via LiteLLM routing — switch models without rewriting logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s unpack what this boilerplate brings to your AI toolkit.&lt;/p&gt;




&lt;h3&gt;
  
  
  🧱 &lt;strong&gt;Project Philosophy: Atomic Execution Units&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;At the heart is a simple but powerful design principle:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“Complex reasoning should be broken down into atomic units — single, focused inference steps.”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An &lt;em&gt;Atomic Unit&lt;/em&gt; encapsulates:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A Prompt Template (Jinja2)&lt;/strong&gt; – separates text generation templates from business logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A Schema (Pydantic)&lt;/strong&gt; – defines strong typing expectations on outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A Runner (LiteLLM + Instructor)&lt;/strong&gt; – resolves the model provider, generates completions, and validates output&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This structure ensures your inference logic is &lt;strong&gt;modular, testable, and predictable&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  📂 &lt;strong&gt;Repository Structure&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here’s how the repo’s main components are organized:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/
├── core/           # Boilerplate core classes (AtomicUnit, renderer, client)
├── modules/        # Shared utilities (vector store helpers, validation utils)
├── prompts/        # Jinja2 prompt template files
└── schemas/        # Pydantic schema definitions
examples/           # Usage samples (basic, LangGraph, Prefect pipelines)
tests/              # Unit and integration tests
docs/ specs/        # Extended specifications and docs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The core, prompts, and schemas folders embody the atomic execution pattern. The &lt;code&gt;examples/&lt;/code&gt; folder contains concrete patterns you can use in real projects — from basic extraction tasks to multi-agent LangGraph configurations.&lt;/p&gt;




&lt;h3&gt;
  
  
  ⚙️ &lt;strong&gt;Getting Started (Quickstart)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Clone the repo and install dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &amp;lt;repo-url&amp;gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;atomic-inference-boilerplate
conda activate atomic      &lt;span class="c"&gt;# or your Python env&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env       &lt;span class="c"&gt;# configure API keys&lt;/span&gt;
python examples/basic.py   &lt;span class="c"&gt;# run a basic example&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This bootstraps the boilerplate and executes a simple inference unit from the &lt;code&gt;examples/&lt;/code&gt; directory.&lt;/p&gt;




&lt;h3&gt;
  
  
  🧪 &lt;strong&gt;Example: Define &amp;amp; Run an Inference Unit&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Each atomic unit is defined with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a &lt;strong&gt;template&lt;/strong&gt;,&lt;/li&gt;
&lt;li&gt;an &lt;strong&gt;output schema&lt;/strong&gt;, and&lt;/li&gt;
&lt;li&gt;optional &lt;strong&gt;model choice&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A simple example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;src.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AtomicUnit&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ExtractedEntity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;entity_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="n"&gt;extractor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AtomicUnit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;template_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extraction.j2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ExtractedEntity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;extractor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Apple Inc. is a technology company.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ExtractedEntity(name='Apple Inc.', entity_type='company')
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the unit receives a text prompt, formats the Jinja2 template, executes the LLM call via LiteLLM, and validates the structured output against the &lt;code&gt;ExtractedEntity&lt;/code&gt; schema. No loose strings — everything is typed and predictable.&lt;/p&gt;




&lt;h3&gt;
  
  
  🤖 &lt;strong&gt;Scaling to Real Workflows&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Rather than replacing a workflow or orchestration framework, this boilerplate &lt;strong&gt;plugs into them&lt;/strong&gt;. For instance:&lt;/p&gt;

&lt;h4&gt;
  
  
  📌 LangGraph Integration
&lt;/h4&gt;

&lt;p&gt;Examples like &lt;code&gt;langgraph_single_agent.py&lt;/code&gt; and &lt;code&gt;langgraph_multi_agent.py&lt;/code&gt; demonstrate how atomic units become the &lt;em&gt;execution layer&lt;/em&gt; behind orchestration decisions made by LangGraph. Higher layers decide &lt;em&gt;what&lt;/em&gt; to do next, while atomic units decide &lt;em&gt;how&lt;/em&gt; to perform each inference step.&lt;/p&gt;

&lt;h4&gt;
  
  
  📌 Prefect Pipelines
&lt;/h4&gt;

&lt;p&gt;In extract-transform-load style pipelines (e.g., document processing), atomic units can extract metadata, detect structure, and chunk content — each step isolated, typed, and testable.&lt;/p&gt;

&lt;p&gt;This separation of concerns improves maintainability and accelerates development. Instead of ad-hoc prompts scattered across your codebase, you get a clear, reusable pattern for every LLM interaction.&lt;/p&gt;
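&lt;p&gt;To make the composition idea concrete, here is a framework-free sketch (the functions below merely stand in for atomic units; they are not the repo's actual API):&lt;/p&gt;

```python
# Framework-free sketch of composing atomic steps in a pipeline: each step
# takes typed input and returns typed output, so every step can be tested
# and swapped independently of the orchestrator around it.
from dataclasses import dataclass


@dataclass
class ChunkResult:
    chunks: list


def chunk_step(text: str, size: int = 20) -> ChunkResult:
    """One atomic step: split a document into fixed-size chunks."""
    pieces = [text[i:i + size] for i in range(0, len(text), size)]
    return ChunkResult(chunks=pieces)


def count_step(result: ChunkResult) -> int:
    """The next step consumes the previous step's typed output."""
    return len(result.chunks)


doc = "x" * 45
assert count_step(chunk_step(doc)) == 3  # chunks of 20 + 20 + 5
```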




&lt;h3&gt;
  
  
  🧠 &lt;strong&gt;Why Atomic Inference Matters&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In modern LLM applications, teams rapidly face challenges like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt logic tangled with business logic&lt;/li&gt;
&lt;li&gt;Dirty text outputs requiring fragile parsing&lt;/li&gt;
&lt;li&gt;Changing LLM providers or models&lt;/li&gt;
&lt;li&gt;Hard-to-test inference steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The atomic-inference-boilerplate tackles these by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;enforcing &lt;em&gt;template + schema separation&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;building in &lt;em&gt;type safety by design&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;enabling &lt;em&gt;provider abstraction&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;fostering &lt;em&gt;modularity and reuse&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach mirrors best practices seen in software architecture (like atomic design in UI or modular microservices), but applied to the &lt;em&gt;inference layer&lt;/em&gt; of AI systems.&lt;/p&gt;




&lt;h3&gt;
  
  
  🏁 &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you’re building AI applications with anything beyond throwaway prototypes — where inference must be reliable, validated, maintainable, and scalable — then structuring your inference logic matters.&lt;/p&gt;

&lt;p&gt;This boilerplate is a strong candidate for the core execution layer of your LLM pipelines. Whether you embed it inside workflow frameworks like Prefect, orchestrators like LangGraph, or custom pipelines, you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;predictable and testable inference steps&lt;/li&gt;
&lt;li&gt;clear separation between prompting and logic&lt;/li&gt;
&lt;li&gt;extensibility to multiple providers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Give it a try and share your patterns on &lt;em&gt;dev.to&lt;/em&gt;! Let’s build better AI workflows.&lt;/p&gt;

&lt;p&gt;My repo:&lt;br&gt;
&lt;a href="https://github.com/chnghia/atomic-inference-boilerplate" rel="noopener noreferrer"&gt;https://github.com/chnghia/atomic-inference-boilerplate&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
      <category>tooling</category>
    </item>
    <item>
      <title>I Built an Agentic AI Boilerplate (Agent-First, Conversation-First)</title>
      <dc:creator>nghiach</dc:creator>
      <pubDate>Sun, 28 Dec 2025 04:16:10 +0000</pubDate>
      <link>https://forem.com/chnghia/i-built-an-agentic-ai-boilerplate-agent-first-conversation-first-4ngf</link>
      <guid>https://forem.com/chnghia/i-built-an-agentic-ai-boilerplate-agent-first-conversation-first-4ngf</guid>
      <description>&lt;p&gt;Most current AI applications treat AI prompts as a simple plugin. You have a form, a prompt, and a linear workflow to generate text or assets.&lt;/p&gt;

&lt;p&gt;However, I believe this is not enough. The future of software is not just about generating content. It is about Agents that can think, plan, and coordinate tasks. I built this boilerplate to explore that future.&lt;/p&gt;

&lt;h3&gt;
  
  
  What do I mean by “Agentic”?
&lt;/h3&gt;

&lt;p&gt;There is a lot of buzz around the word "Agent." Let me clarify my definition.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It is not just a chatbot: A chatbot just replies to text.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It is not just a workflow: A workflow is a static path (A to B to C).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To me, a true Agent must be "alive." It needs state to know its current status. It needs memory to remember context. It must make decisions on its own. Finally, it must have the ability to call other agents to help solve complex problems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core philosophy of this boilerplate
&lt;/h3&gt;

&lt;p&gt;I built this system based on three main principles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Agent-first, not feature-first&lt;br&gt;
Usually, we build features (like an "Export to PDF" button). In this boilerplate, we build an Agent that knows how to export a PDF. The agent is the core capability, not the UI buttons.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Conversation-first, no dashboards&lt;br&gt;
Complex internal tools often have messy dashboards. I believe the best interface is a conversation. You talk to the system, and the system acts. The UI should focus on the chat, not on static tables and charts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Explicit orchestration (state machine over magic)&lt;br&gt;
I do not like "magic" loops that run forever without control. I prefer explicit orchestration. I use state machines to define exactly what the agent can and cannot do. This makes the system predictable and easy to debug.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
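&lt;p&gt;A minimal sketch of the state-machine idea (the state and event names are illustrative, not the boilerplate's actual graph):&lt;/p&gt;

```python
# Explicit orchestration: allowed transitions are declared up front, so the
# agent can never wander into an undefined state or loop without control.
TRANSITIONS = {
    "idle":      {"start":   "planning"},
    "planning":  {"plan_ok": "executing", "cancel": "idle"},
    "executing": {"done":    "idle",      "error":  "planning"},
}


def step(state: str, event: str) -> str:
    """Advance the machine, rejecting any transition not declared above."""
    allowed = TRANSITIONS.get(state, {})
    if event not in allowed:
        raise ValueError(f"event {event!r} not allowed in state {state!r}")
    return allowed[event]


state = step("idle", "start")    # planning
state = step(state, "plan_ok")   # executing
state = step(state, "done")      # back to idle
```

&lt;p&gt;Debugging then becomes reading a transition log instead of reconstructing an agent's "reasoning."&lt;/p&gt;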

&lt;h3&gt;
  
  
  High-level architecture
&lt;/h3&gt;

&lt;p&gt;I wanted a stack that is modern, fast, and scalable. Here is the architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;FastAPI: This serves as the API and the runtime environment for the backend.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;LangGraph: This is the brain. I use it to orchestrate the agent's logic and manage the state.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SSE (Server-Sent Events): Instead of simple request/response, the backend pushes events to the frontend in real time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Frontend as a “Viewer”: It acts as the communication interface. It focuses on interacting with the user through messages, prompt cards (Generative UI), tool calls, and errors.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
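&lt;p&gt;A sketch of the SSE framing only (event names are illustrative, and in the boilerplate a generator like this would be wrapped in a streaming HTTP response with the &lt;code&gt;text/event-stream&lt;/code&gt; content type):&lt;/p&gt;

```python
# Sketch of the event frames the backend pushes over SSE: each frame is an
# event name plus a JSON payload, terminated by a blank line.
import json


def sse_event(event: str, data: dict) -> str:
    """Format one Server-Sent Event frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def agent_stream():
    # The agent emits progress first, then the final message, as
    # discrete events the frontend "viewer" can render incrementally.
    yield sse_event("tool_call", {"name": "search", "status": "running"})
    yield sse_event("message", {"text": "Here is what I found."})


frames = list(agent_stream())
print(frames[0].startswith("event: tool_call"))  # True
```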

&lt;h3&gt;
  
  
  What this repo gives you (and what it doesn’t)
&lt;/h3&gt;

&lt;p&gt;What it gives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean skeleton: A solid structure for an agentic system.&lt;/li&gt;
&lt;li&gt;Orchestrator pattern: A clear way to manage how agents talk to each other.&lt;/li&gt;
&lt;li&gt;Structured design: A specific place to add your sub-agents and tools.&lt;/li&gt;
&lt;li&gt;Event-driven execution: A full model for handling real-time events.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it doesn’t:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No UI framework: It provides a basic viewer, not a UI component library.&lt;/li&gt;
&lt;li&gt;No prompt magic: You still need to write good prompts.&lt;/li&gt;
&lt;li&gt;No SaaS features: It does not include billing, user management, or subscription logic.&lt;/li&gt;
&lt;li&gt;Not a chatbot starter: If you just want a "Hello World" bot, this is overkill.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Who this boilerplate is for
&lt;/h3&gt;

&lt;p&gt;This is for you if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You are building a Personal Agent, an internal AI tool, or Enterprise AI.&lt;/li&gt;
&lt;li&gt;You want full control over your agent's logic and flow.&lt;/li&gt;
&lt;li&gt;You hate "prompt spaghetti" (messy code mixed with prompts).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is NOT for you if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You are a low-code user.&lt;/li&gt;
&lt;li&gt;You just want a standard chatbot running in 5 minutes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How I’m using it (or planning to)
&lt;/h3&gt;

&lt;p&gt;I am using this boilerplate as the foundation for my Personal Agentic Hub. My goals are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Journaling / Bookmarking Agent: An agent that organizes my notes and links automatically.&lt;/li&gt;
&lt;li&gt;Long-running Agents: Agents that can keep memory for days or weeks to help me track long-term projects.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Repo &amp;amp; next steps
&lt;/h3&gt;

&lt;p&gt;The code is open-source and available here: 👉 &lt;a href="https://github.com/chnghia/agentic-boilerplate" rel="noopener noreferrer"&gt;https://github.com/chnghia/agentic-boilerplate&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My Roadmap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adding support for Sub-agents.&lt;/li&gt;
&lt;li&gt;Implementing long-term Memory.&lt;/li&gt;
&lt;li&gt;Optimizing for Local-first models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I would love to hear your thoughts. Feel free to leave feedback, start a discussion, or open a Pull Request!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agentic</category>
      <category>langgraph</category>
      <category>backend</category>
    </item>
  </channel>
</rss>
