<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nishant prakash</title>
    <description>The latest articles on Forem by Nishant prakash (@nishant_prakash_780f5d541).</description>
    <link>https://forem.com/nishant_prakash_780f5d541</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3596122%2Fbf82630b-dedc-4453-8918-a3cc87df539f.jpg</url>
      <title>Forem: Nishant prakash</title>
      <link>https://forem.com/nishant_prakash_780f5d541</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nishant_prakash_780f5d541"/>
    <language>en</language>
    <item>
      <title>Stop stuffing tools into your agent 😤</title>
      <dc:creator>Nishant prakash</dc:creator>
      <pubDate>Wed, 08 Apr 2026 08:25:21 +0000</pubDate>
      <link>https://forem.com/nishant_prakash_780f5d541/stop-stuffing-tools-into-your-agent-j8h</link>
      <guid>https://forem.com/nishant_prakash_780f5d541/stop-stuffing-tools-into-your-agent-j8h</guid>
      <description>&lt;p&gt;There is a point in almost every agent project where the excitement starts to fade.&lt;/p&gt;

&lt;p&gt;At first, it feels magical. You wire an LLM to a few Python functions, wrap them as tools, and suddenly your assistant can calculate, search, transform, and automate. But then the project grows. A few tools become ten. Ten become thirty. Business logic starts mixing with agent logic. Logging becomes messy. Reuse becomes painful. One agent needs the same tools as another, so you copy code. Then you copy it again. And somewhere in that process, your “smart system” quietly turns into a pile of tightly coupled Python.&lt;/p&gt;

&lt;p&gt;That is exactly where &lt;strong&gt;MCP&lt;/strong&gt; starts to make sense.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77rtkhfilui9n05ufknf.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77rtkhfilui9n05ufknf.webp" alt="MCP Meme" width="620" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; is an open standard for exposing tools, resources, and prompts to LLM applications in a structured way. The official docs describe it as a standardized way for AI apps to connect to external systems, and even compare it to a “USB-C port for AI applications.” (&lt;a href="https://modelcontextprotocol.io/specification/2025-11-25?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;And once that clicks, a very important design shift becomes obvious:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your tools do not have to live inside your agent code anymore.&lt;/strong&gt;&lt;br&gt;
You can build them once, run them as an MCP server, and let agents consume them cleanly from the outside. (&lt;a href="https://docs.langchain.com/oss/python/langchain/mcp?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;LangChain Docs&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;That is the idea this post is about.&lt;/p&gt;

&lt;p&gt;I’ll show you how I built a tiny MCP server with math and dice tools, added logging to observe tool calls, and then plugged that server into a LangChain agent exposed via FastAPI. Along the way, the architecture changed from “my agent has tools” to something much cleaner:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;my tools live in their own server, and my agent just uses them.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Why this matters more than it looks
&lt;/h2&gt;

&lt;p&gt;Imagine a normal backend team.&lt;/p&gt;

&lt;p&gt;One person owns business logic. Another owns APIs. Another owns observability. Now imagine if every API consumer copied that business logic into their own codebase. That would quickly turn into chaos.&lt;/p&gt;

&lt;p&gt;That’s essentially what happens when we embed tools directly inside agent code.&lt;/p&gt;

&lt;p&gt;The issue isn’t that it breaks; it’s that it doesn’t scale.&lt;/p&gt;

&lt;p&gt;With MCP, things get cleaner:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the agent focuses on reasoning&lt;/li&gt;
&lt;li&gt;the tool server handles execution&lt;/li&gt;
&lt;li&gt;logging and latency stay observable&lt;/li&gt;
&lt;li&gt;tools can be reused across agents&lt;/li&gt;
&lt;li&gt;you can evolve tools without touching the agent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That separation is exactly what MCP brings to the table, and libraries like &lt;code&gt;langchain-mcp-adapters&lt;/code&gt; make this integration seamless (&lt;a href="https://docs.langchain.com/oss/python/langchain/mcp?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;LangChain Docs&lt;/a&gt;).&lt;/p&gt;


&lt;h2&gt;
  
  
  So what is MCP, in plain English?
&lt;/h2&gt;

&lt;p&gt;Let’s strip away the buzzwords.&lt;/p&gt;

&lt;p&gt;When an LLM needs to do something real, like querying a database, calling an API, or running a workflow, we usually define those as tools inside the agent code.&lt;/p&gt;

&lt;p&gt;MCP changes that idea:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if tools were exposed through a standard protocol instead?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now the same tool server can be used by multiple agents, clients, or even frameworks (&lt;a href="https://gofastmcp.com/getting-started/welcome" rel="noopener noreferrer"&gt;FastMCP&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;A simple way to think about it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP server&lt;/strong&gt; → where tools live&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP client&lt;/strong&gt; → connects to the server&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent&lt;/strong&gt; → decides when to use tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It sounds like a small shift, but it’s the difference between a quick demo and a system you can actually scale.&lt;/p&gt;


&lt;h2&gt;
  
  
  Building a tiny MCP server (and making it observable)
&lt;/h2&gt;

&lt;p&gt;To understand MCP, I didn’t start with anything complex.&lt;/p&gt;

&lt;p&gt;I built a small server with just two kinds of tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;basic math operations&lt;/li&gt;
&lt;li&gt;a dice roll&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of pasting the full code here, you can check the clean version directly:&lt;/p&gt;

&lt;p&gt;Simple MCP server: &lt;a href="https://github.com/trickste/mcp_test/blob/main/mcp_server/server_simple.py" rel="noopener noreferrer"&gt;server_simple.py&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At its core, it looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server.fastmcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;

&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;math-server&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s all it takes to expose a function as a tool.&lt;/p&gt;

&lt;p&gt;Now here’s the important part:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;These tools are &lt;strong&gt;not inside your agent&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;They are running in a &lt;strong&gt;separate server&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Running the MCP server
&lt;/h3&gt;

&lt;p&gt;You can start the server using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fastmcp run server.py &lt;span class="nt"&gt;--transport&lt;/span&gt; http &lt;span class="nt"&gt;--port&lt;/span&gt; 9090
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This exposes your tools over HTTP at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://127.0.0.1:9090/mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now any agent (or client) can connect to it.&lt;/p&gt;




&lt;h3&gt;
  
  
  The moment things start to feel real
&lt;/h3&gt;

&lt;p&gt;At this point, everything worked.&lt;br&gt;
But something was missing.&lt;/p&gt;

&lt;p&gt;I could call tools…&lt;br&gt;
but I couldn’t &lt;strong&gt;see what was happening inside them&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And that’s when logging becomes non-negotiable.&lt;/p&gt;


&lt;h3&gt;
  
  
  Adding logging (turning it into a real service)
&lt;/h3&gt;

&lt;p&gt;Instead of rewriting everything, I added a simple decorator to log:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;when a tool starts&lt;/li&gt;
&lt;li&gt;what inputs it received&lt;/li&gt;
&lt;li&gt;what it returned&lt;/li&gt;
&lt;li&gt;how long it took&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can check the full version here:&lt;/p&gt;

&lt;p&gt;Logged MCP server: &lt;a href="https://github.com/trickste/mcp_test/blob/main/mcp_server/server.py" rel="noopener noreferrer"&gt;server.py&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The core idea looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;START &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;END &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; -&amp;gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;wrapper&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nd"&gt;@log_tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;roll_dice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;faces&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
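&lt;p&gt;The &lt;code&gt;roll_dice&lt;/code&gt; body is elided above. Purely as an illustration (my own sketch, not necessarily what the repo’s &lt;code&gt;server.py&lt;/code&gt; does), a stdlib implementation could be as small as:&lt;/p&gt;

```python
import random

def roll_dice(faces: int = 6) -> int:
    # Hypothetical body: one uniform roll of an n-faced die
    return random.randint(1, faces)

print(roll_dice(faces=20))
```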






&lt;h3&gt;
  
  
  What this gives you
&lt;/h3&gt;

&lt;p&gt;Now when your agent calls a tool, you don’t just get a result,&lt;br&gt;
you get visibility:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TOOL START | roll_dice | faces=20
TOOL END   | roll_dice | result=15
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And this is where MCP starts to click.&lt;/p&gt;

&lt;p&gt;Your “tool layer” is no longer hidden inside your agent.&lt;br&gt;
It’s running as a &lt;strong&gt;separate, observable service&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  Why this step matters
&lt;/h3&gt;

&lt;p&gt;This small setup already gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tools defined independently&lt;/li&gt;
&lt;li&gt;a server that can be reused&lt;/li&gt;
&lt;li&gt;logging and latency visibility&lt;/li&gt;
&lt;li&gt;a clean boundary between reasoning and execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And we haven’t even touched the agent yet.&lt;/p&gt;

&lt;p&gt;That’s where things get interesting next.&lt;/p&gt;


&lt;h2&gt;
  
  
  Now, can I plug this server into an agent?
&lt;/h2&gt;

&lt;p&gt;And the answer is yes.&lt;/p&gt;

&lt;p&gt;LangChain now provides an MCP adapter library, &lt;code&gt;langchain-mcp-adapters&lt;/code&gt;, which lets agents consume tools directly from MCP servers. Its &lt;code&gt;MultiServerMCPClient&lt;/code&gt; can connect to one or more MCP servers, and by default it is stateless: each tool invocation opens a fresh MCP session, executes, and cleans up. (&lt;a href="https://docs.langchain.com/oss/python/langchain/mcp?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;LangChain Docs&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;That stateless behavior turned out to match my use case nicely.&lt;/p&gt;


&lt;h3&gt;
  
  
  The agent side: FastAPI + LangChain + MCP
&lt;/h3&gt;

&lt;p&gt;Now comes the satisfying part.&lt;/p&gt;

&lt;p&gt;Instead of embedding tools inside the agent, I made the agent &lt;strong&gt;connect to the MCP server over HTTP&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You can check the full working code here:&lt;/p&gt;

&lt;p&gt;Agent (FastAPI + MCP): &lt;a href="https://github.com/trickste/mcp_test/blob/main/agent/agent.py" rel="noopener noreferrer"&gt;agent.py&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  What the agent really does
&lt;/h3&gt;

&lt;p&gt;At a high level, the agent is surprisingly simple.&lt;/p&gt;

&lt;p&gt;It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;connects to the MCP server&lt;/li&gt;
&lt;li&gt;fetches available tools&lt;/li&gt;
&lt;li&gt;initializes an LLM&lt;/li&gt;
&lt;li&gt;lets the agent use those tools when needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key piece looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MultiServerMCPClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;math&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://127.0.0.1:9090/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it.&lt;/p&gt;

&lt;p&gt;MCP tools automatically become usable by the agent.&lt;/p&gt;




&lt;h3&gt;
  
  
  Adding the agent on top
&lt;/h3&gt;

&lt;p&gt;Then we plug those tools into a LangChain agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant that can use tools.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And expose it via FastAPI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatRequest&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ainvoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  And here’s the shift
&lt;/h3&gt;

&lt;p&gt;Notice what’s missing.&lt;/p&gt;

&lt;p&gt;There is &lt;strong&gt;no math logic here&lt;/strong&gt;.&lt;br&gt;
No &lt;code&gt;add&lt;/code&gt;, no &lt;code&gt;divide&lt;/code&gt;, no &lt;code&gt;roll_dice&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The FastAPI app simply says:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;here is my LLM&lt;/li&gt;
&lt;li&gt;here is my MCP client&lt;/li&gt;
&lt;li&gt;give me the tools&lt;/li&gt;
&lt;li&gt;let the agent use them&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  What the full flow looks like
&lt;/h2&gt;

&lt;p&gt;Let’s say the user sends:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Roll a 20 sided dice and then add 5 to it&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The request travels like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;user hits the FastAPI &lt;code&gt;/chat&lt;/code&gt; endpoint&lt;/li&gt;
&lt;li&gt;the agent receives the query&lt;/li&gt;
&lt;li&gt;the agent decides it needs a tool&lt;/li&gt;
&lt;li&gt;LangChain calls the MCP adapter&lt;/li&gt;
&lt;li&gt;the MCP adapter calls the MCP server over HTTP&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;roll_dice(faces=20)&lt;/code&gt; executes&lt;/li&gt;
&lt;li&gt;result comes back&lt;/li&gt;
&lt;li&gt;the agent uses that numeric output in the next tool call&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;add(a=15, b=5)&lt;/code&gt; executes&lt;/li&gt;
&lt;li&gt;final answer is returned to the user&lt;/li&gt;
&lt;/ol&gt;
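&lt;p&gt;Stripped of the LLM and the network, that flow boils down to two chained tool calls. Here is a purely illustrative, self-contained simulation; in the real system both calls go through the MCP server, not local functions:&lt;/p&gt;

```python
import random

# Local stand-ins for the MCP tools (illustration only)
def roll_dice(faces: int = 6) -> int:
    return random.randint(1, faces)

def add(a: float, b: float) -> float:
    return a + b

# The agent's plan for "Roll a 20 sided dice and then add 5 to it":
roll = roll_dice(faces=20)   # first tool call
total = add(a=roll, b=5)     # second call, fed the first result
print(f"rolled {roll}, final answer {total}")
```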

&lt;p&gt;And because the server has logging, you can actually watch this happen.&lt;/p&gt;

&lt;p&gt;Here is the kind of trace I saw:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2026-04-08 12:34:34,822 | INFO | [7cb8650b] TOOL START | roll_dice | args=() kwargs={'faces': 20}
2026-04-08 12:34:34,822 | INFO | [7cb8650b] TOOL END   | roll_dice | result=15 | 0.03ms

2026-04-08 12:34:36,231 | INFO | [149d5f59] TOOL START | add | args=() kwargs={'a': 15.0, 'b': 5.0}
2026-04-08 12:34:36,231 | INFO | [149d5f59] TOOL END   | add | result=20.0 | 0.03ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is one of those moments where the architecture suddenly feels real.&lt;/p&gt;

&lt;p&gt;You are not just “calling functions from an LLM.”&lt;br&gt;
You are watching an agent orchestrate a tool server.&lt;/p&gt;




&lt;h2&gt;
  
  
  The part that stayed with me
&lt;/h2&gt;

&lt;p&gt;What stood out wasn’t the dice roll or the API.&lt;/p&gt;

&lt;p&gt;It was the separation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the &lt;strong&gt;MCP server owns the tools&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;agent handles reasoning&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;API manages interaction&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each piece has a clear role and can evolve independently.&lt;/p&gt;

&lt;p&gt;MCP challenges the habit of packing everything into one place and instead gives you a cleaner way to separate intelligence from execution.&lt;/p&gt;

&lt;p&gt;And once you see that, it’s hard not to think: &lt;br&gt;
&lt;strong&gt;&lt;em&gt;Why was all of this in one file to begin with?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  What’s next
&lt;/h3&gt;

&lt;p&gt;This is just the starting point.&lt;/p&gt;

&lt;p&gt;In upcoming blogs, I’ll go deeper into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;making MCP tools more robust&lt;/li&gt;
&lt;li&gt;improving reliability and performance&lt;/li&gt;
&lt;li&gt;adding security and access control&lt;/li&gt;
&lt;li&gt;and turning this into something closer to production-ready&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because building tools is one thing,&lt;br&gt;
building &lt;strong&gt;safe, scalable, and reliable tool systems&lt;/strong&gt; is where things get really interesting.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>rag</category>
      <category>python</category>
    </item>
    <item>
      <title>I Thought TOON Was Hype. Then I Tested It…</title>
      <dc:creator>Nishant prakash</dc:creator>
      <pubDate>Sat, 22 Nov 2025 19:03:09 +0000</pubDate>
      <link>https://forem.com/nishant_prakash_780f5d541/i-thought-toon-was-hype-then-i-tested-it-3bad</link>
      <guid>https://forem.com/nishant_prakash_780f5d541/i-thought-toon-was-hype-then-i-tested-it-3bad</guid>
      <description>&lt;p&gt;If you’re even slightly deep into the world of agentic AI, you’ve probably seen it: blog after blog mentioning something called &lt;strong&gt;TOON&lt;/strong&gt;. The posts were short. Some offered impressive claims like “cuts JSON token usage in half.” Others just mentioned it as a rising serialization format for LLMs, especially in agent-based workflows.&lt;/p&gt;

&lt;p&gt;Curious but unsatisfied, I decided to take a different route. I didn’t just want to read about TOON, I wanted to build with it. I wanted to &lt;strong&gt;see what made it special&lt;/strong&gt;, how it compares to formats like JSON and CSV, and where it actually shines. And of course, I wanted to share what I found in a way that makes it easier for you to understand too.&lt;/p&gt;

&lt;p&gt;This blog is that story.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wy86wjzz81qt2ebkl4o.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wy86wjzz81qt2ebkl4o.gif" alt="Why waste token" width="480" height="400"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Agentic AI in One Paragraph
&lt;/h2&gt;

&lt;p&gt;If you’re working with agentic AI frameworks like &lt;strong&gt;CrewAI&lt;/strong&gt;, &lt;strong&gt;LangGraph&lt;/strong&gt;, or &lt;strong&gt;AutoGen&lt;/strong&gt;, you already know the basics: agents powered by LLMs collaborate, passing structured information between each other to complete tasks. Think of each agent like a focused worker, one might gather data, another might summarize it, and another might generate recommendations.&lt;/p&gt;

&lt;p&gt;But here’s the catch: &lt;strong&gt;every time they exchange data&lt;/strong&gt;, it costs tokens. And in LLMs, tokens mean time, money, and performance limits.&lt;/p&gt;

&lt;p&gt;That’s where TOON comes in.&lt;/p&gt;




&lt;h2&gt;
  
  
  So... What Is TOON, Really?
&lt;/h2&gt;

&lt;p&gt;TOON stands for &lt;strong&gt;Token-Oriented Object Notation&lt;/strong&gt;. It’s a new serialization format designed specifically for LLM communication. Like JSON, it can represent nested, structured data. But unlike JSON, it avoids the extra syntax: no curly braces, no repeated field names, and no quotes around every string.&lt;/p&gt;

&lt;p&gt;Here’s a side-by-side comparison to make that clearer:&lt;/p&gt;

&lt;h3&gt;
  
  
  A Simple JSON:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Alice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bob"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Same in TOON:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data[2]{id,name,score}:
  1,Alice,85
  2,Bob,92
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;TOON is compact, readable, and tailor-made for LLMs. But I wanted to know: &lt;strong&gt;how much better is it, really?&lt;/strong&gt;&lt;/p&gt;
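&lt;p&gt;One crude way to get a feel for the difference, before involving a real tokenizer, is to compare raw character counts of the same records. Character count only roughly tracks token count, so treat this as a sketch rather than a benchmark:&lt;/p&gt;

```python
import json

records = [
    {"id": 1, "name": "Alice", "score": 85},
    {"id": 2, "name": "Bob", "score": 92},
]

pretty = json.dumps(records, indent=2)                 # standard JSON
compact = json.dumps(records, separators=(",", ":"))   # compact JSON
toon = "data[2]{id,name,score}:\n  1,Alice,85\n  2,Bob,92"

for label, text in [("json", pretty), ("json-compact", compact), ("toon", toon)]:
    print(f"{label:13s} {len(text)} chars")
```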




&lt;h2&gt;
  
  
  The POC I Built
&lt;/h2&gt;

&lt;p&gt;Before diving into results, here are the two exact scripts used for benchmarking. They’re publicly available, so you can clone and run them yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Test 1 (Complex Nested JSON)&lt;/strong&gt;: &lt;a href="https://github.com/trickste/toontoon/blob/main/toon_test_1.py" rel="noopener noreferrer"&gt;https://github.com/trickste/toontoon/blob/main/toon_test_1.py&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test 2 (Flat JSON)&lt;/strong&gt;: &lt;a href="https://github.com/trickste/toontoon/blob/main/toon_test_2.py" rel="noopener noreferrer"&gt;https://github.com/trickste/toontoon/blob/main/toon_test_2.py&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These scripts serialize the same dataset into different formats and measure how many tokens each representation consumes when passed through an LLM.&lt;/p&gt;

&lt;p&gt;To really understand TOON, I built a &lt;strong&gt;two-agent proof of concept&lt;/strong&gt; powered by &lt;strong&gt;Ollama (LLaMA 3.1)&lt;/strong&gt; running locally. The idea was simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent 1&lt;/strong&gt; (Data Generator): produces structured data in Python&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent 2&lt;/strong&gt; (LLM Analyzer): consumes that data in various formats and explains it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But here’s the twist: I fed the data into Agent 2 using &lt;strong&gt;six different serialization formats&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON&lt;/li&gt;
&lt;li&gt;JSON (Compact)&lt;/li&gt;
&lt;li&gt;YAML&lt;/li&gt;
&lt;li&gt;XML&lt;/li&gt;
&lt;li&gt;CSV&lt;/li&gt;
&lt;li&gt;TOON&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And for each one, I measured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many &lt;strong&gt;tokens&lt;/strong&gt; the format used&lt;/li&gt;
&lt;li&gt;Whether the LLM could interpret it clearly&lt;/li&gt;
&lt;/ul&gt;
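
&lt;p&gt;The measurement loop can be sketched in a few lines of stdlib Python. This is an illustration, not the benchmark scripts themselves: it uses character counts as a rough stand-in for tokens (the real scripts count tokens with &lt;strong&gt;tiktoken&lt;/strong&gt;), and it covers only the stdlib formats.&lt;/p&gt;

```python
import csv
import io
import json

rows = [
    {"id": 1, "name": "Alice", "score": 85},
    {"id": 2, "name": "Bob", "score": 92},
]

# Pretty JSON repeats every key and adds indentation.
pretty = json.dumps(rows, indent=2)

# Compact JSON drops all optional whitespace.
compact = json.dumps(rows, separators=(",", ":"))

# CSV states the header once, then bare values.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "score"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# Character counts here stand in for token counts.
sizes = {"json": len(pretty), "json_compact": len(compact), "csv": len(csv_text)}
print(sizes)
```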




&lt;h2&gt;
  
  
  How the Data Looks Across Formats
&lt;/h2&gt;

&lt;p&gt;Below are simplified examples to help you visualize how &lt;strong&gt;flat&lt;/strong&gt; and &lt;strong&gt;nested&lt;/strong&gt; JSON convert into TOON.&lt;/p&gt;




&lt;h3&gt;
  
  
  Flat JSON Example
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Alice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bob"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Charlie"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;TOON&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data[3]{id,name,score}:
  1,Alice,85
  2,Bob,92
  3,Charlie,78
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
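
&lt;p&gt;For intuition, the flat case above can be produced by a tiny encoder. This is a hypothetical sketch, not the &lt;strong&gt;python-toon&lt;/strong&gt; library: it handles only uniform flat rows and does no quoting, escaping, or type handling.&lt;/p&gt;

```python
def to_toon(name, rows):
    # Header line: collection name, row count, and field names once.
    fields = list(rows[0])
    header = f"{name}[{len(rows)}]{{{','.join(fields)}}}:"
    # One comma-joined line per record, indented under the header.
    lines = ["  " + ",".join(str(r[f]) for f in fields) for r in rows]
    return "\n".join([header] + lines)

rows = [
    {"id": 1, "name": "Alice", "score": 85},
    {"id": 2, "name": "Bob", "score": 92},
    {"id": 3, "name": "Charlie", "score": 78},
]
print(to_toon("data", rows))
```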






&lt;h3&gt;
  
  
  Complex Nested JSON Example
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Alice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"contact"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"alice@example.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"phone"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"123-456"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"math"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"science"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"projects"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AI Lab"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"year"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2022&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Robotics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"year"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bob"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"contact"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bob@example.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"phone"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"789-012"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"physics"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"projects"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Quantum Lab"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"year"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;TOON&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data[2]{id,name,score,contact,tags,projects}:
  1,Alice,85,{'email': 'alice@example.com', 'phone': '123-456'},['math','science'],[{'title':'AI Lab','year':2022},{'title':'Robotics','year':2023}]
  2,Bob,92,{'email': 'bob@example.com', 'phone': '789-012'},['physics'],[{'title':'Quantum Lab','year':2021}]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where things get interesting. You can &lt;em&gt;see&lt;/em&gt; immediately that for simple rows, TOON is incredibly compact... but as soon as nested structures appear, TOON begins embedding Python-like lists and dictionaries inside its rows.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Results: Token Counts by Format
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Test 1: Complex Nested Data (contacts, tags, projects)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Token Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CSV&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;58&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSON Compact&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TOON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;118&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;135&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XML&lt;/td&gt;
&lt;td&gt;178&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;203&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Test 2: Simple Flat Data (id, name, score)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Token Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CSV&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TOON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSON Compact&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XML&lt;/td&gt;
&lt;td&gt;71&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;77&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  🧠 Key Takeaway: CSV Isn’t Always Enough
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Wait... CSV is cheaper than TOON in both cases?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yes, but there’s a catch.&lt;/p&gt;

&lt;p&gt;CSV is great for flat data. But as soon as your data includes &lt;strong&gt;nested objects&lt;/strong&gt;, &lt;strong&gt;lists&lt;/strong&gt;, or &lt;strong&gt;hierarchies&lt;/strong&gt;, CSV becomes... lossy. You end up flattening objects into strings. You lose type safety. You lose clarity.&lt;/p&gt;

&lt;p&gt;In contrast, &lt;strong&gt;TOON maintains full structure&lt;/strong&gt;, while being much leaner than JSON.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JSON&lt;/strong&gt; repeats every key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;XML&lt;/strong&gt; doubles them with open and close tags&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;YAML&lt;/strong&gt; saves space but isn’t built for tabular rows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TOON&lt;/strong&gt; gives you schema + data, with nearly CSV-level efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s why TOON is rising fast in agentic tools. It’s &lt;strong&gt;not about beating CSV&lt;/strong&gt;, but about &lt;strong&gt;replacing JSON&lt;/strong&gt; in places where token count actually matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned (And What You Can Try)
&lt;/h2&gt;

&lt;p&gt;This POC taught me far more than just how different serialization formats compare. It revealed how LLMs &lt;em&gt;actually behave&lt;/em&gt; when they are fed structured data, and why some formats may be popular even when they don’t always win the token race.&lt;/p&gt;

&lt;p&gt;Here are the real lessons that stood out.&lt;/p&gt;

&lt;h3&gt;
  
  
  TOON shines when structure matters
&lt;/h3&gt;

&lt;p&gt;If your data has any level of hierarchy, TOON keeps the meaning intact while still being much more compact than JSON. It lets you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;preserve nested objects&lt;/li&gt;
&lt;li&gt;keep column-like readability&lt;/li&gt;
&lt;li&gt;maintain type hints&lt;/li&gt;
&lt;li&gt;avoid repeated keys&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This balance explains why TOON is becoming a popular choice in agentic systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  CSV gives the fewest tokens, but at a real cost
&lt;/h3&gt;

&lt;p&gt;CSV had the lowest token counts in both tests. That part is true.&lt;/p&gt;

&lt;p&gt;But CSV comes with limitations that make it hard to use in multi-agent AI workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it cannot represent nested structures&lt;/li&gt;
&lt;li&gt;lists and objects get flattened into strings&lt;/li&gt;
&lt;li&gt;type information is lost&lt;/li&gt;
&lt;li&gt;you cannot reliably reconstruct the original JSON&lt;/li&gt;
&lt;li&gt;LLMs sometimes misinterpret flattened mixed content&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So CSV wins in raw token count, but loses the moment your agents need anything more than simple rows.&lt;/p&gt;
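
&lt;p&gt;The round-trip problem is easy to demonstrate with the stdlib &lt;code&gt;csv&lt;/code&gt; module: once a nested object is flattened into a cell, reading it back gives you a string, not a dict, and even integers come back as strings.&lt;/p&gt;

```python
import csv
import io

record = {"id": 1, "name": "Alice", "contact": {"email": "alice@example.com"}}

# Writing coerces the nested dict into its string repr.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "contact"])
writer.writeheader()
writer.writerow(record)

# Reading back: every cell is now a plain string.
row = next(csv.DictReader(io.StringIO(buf.getvalue())))
print(type(row["contact"]).__name__)  # str, not dict
print(row["id"])                      # "1" as a string; type info is gone
```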

&lt;h3&gt;
  
  
  JSON Compact is a practical middle ground
&lt;/h3&gt;

&lt;p&gt;Compact JSON performed far better than formatted JSON. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;keeps full structure&lt;/li&gt;
&lt;li&gt;removes all whitespace&lt;/li&gt;
&lt;li&gt;tokenizes efficiently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is still more verbose than TOON for flat data, but more predictable for nested data.&lt;/p&gt;
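
&lt;p&gt;In Python, compact JSON is one argument away: the &lt;code&gt;separators&lt;/code&gt; parameter of &lt;code&gt;json.dumps&lt;/code&gt; removes the default spaces after commas and colons.&lt;/p&gt;

```python
import json

data = [{"id": 1, "name": "Alice", "score": 85}]

pretty = json.dumps(data, indent=2)
compact = json.dumps(data, separators=(",", ":"))

print(compact)  # [{"id":1,"name":"Alice","score":85}]
```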

&lt;h3&gt;
  
  
  TOON is optimized for LLM clarity, not just token savings
&lt;/h3&gt;

&lt;p&gt;When LLMs read TOON, the schema line gives them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;number of items&lt;/li&gt;
&lt;li&gt;field names&lt;/li&gt;
&lt;li&gt;expected structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This helps models interpret the data more reliably than CSV or XML.&lt;/p&gt;

&lt;p&gt;In real agent workflows, clarity often beats a small token savings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Nested data changes everything
&lt;/h3&gt;

&lt;p&gt;In the nested test, TOON did not produce the smallest token count because the inline lists and dicts added overhead. This showed a valuable insight:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;TOON works best for uniform row-like structures. For deep nesting, compact JSON may be more efficient.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is important because it means serialization choices should match your data shape, not simply rely on trends.&lt;/p&gt;

&lt;h3&gt;
  
  
  The real takeaway
&lt;/h3&gt;

&lt;p&gt;Each format has strengths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CSV&lt;/strong&gt;: smallest tokens, weakest structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TOON&lt;/strong&gt;: great structure-to-token ratio, very LLM friendly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSON Compact&lt;/strong&gt;: predictable, solid for nested data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;YAML&lt;/strong&gt;: readable, moderate token usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;XML&lt;/strong&gt;: expressive but token-heavy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the biggest thing we learned:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;TOON is becoming popular not because it always produces the smallest token count, but because it gives you structure, compactness, and LLM-friendly clarity in one place.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If your agents need to share structured data and reason about it reliably, TOON is a strong contender.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools I Used
&lt;/h2&gt;

&lt;p&gt;Here’s the stack I used for this POC:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Python 3.10+&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; running &lt;strong&gt;LLaMA3:latest&lt;/strong&gt; locally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tiktoken&lt;/strong&gt; for token counting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;python-toon&lt;/strong&gt; (fallback inline encoder included)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CSV / JSON / YAML / XML&lt;/strong&gt; modules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agents are simple but modular: you can replace Agent 1 with an API call, or Agent 2 with a more sophisticated analyzer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This POC showed that serialization shapes how agents think. TOON isn’t always the smallest, but it delivers a strong balance of structure and clarity for agentic workflows. Test different formats with your own data and see what your LLM responds to best.&lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>python</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I Was Done Getting Answers - So I Built RAG That Asks Questions Too</title>
      <dc:creator>Nishant prakash</dc:creator>
      <pubDate>Wed, 12 Nov 2025 18:12:44 +0000</pubDate>
      <link>https://forem.com/nishant_prakash_780f5d541/i-was-done-getting-answers-so-i-built-rag-that-asks-questions-too-1c2e</link>
      <guid>https://forem.com/nishant_prakash_780f5d541/i-was-done-getting-answers-so-i-built-rag-that-asks-questions-too-1c2e</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“What if a RAG system could not only fetch information but also reason about it, critique itself, and write a report, all autonomously?”&lt;/em&gt;&lt;br&gt;
That question sent me down a rabbit hole that ended with &lt;strong&gt;Data-Inspector&lt;/strong&gt;, a proof-of-concept &lt;strong&gt;Agentic RAG&lt;/strong&gt; pipeline built with Ollama, LangChain, Tavily, and Streamlit.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cdjseraqpstwz83wplc.jpeg" alt="Rag was not enough" width="800" height="578"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  The Spark — From RAG to &lt;em&gt;Agentic&lt;/em&gt; RAG
&lt;/h2&gt;

&lt;p&gt;Traditional RAG systems are brilliant at &lt;em&gt;retrieval and response&lt;/em&gt;, but not at &lt;em&gt;reasoning or reflection.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;They typically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Retrieve documents relevant to a query.&lt;/li&gt;
&lt;li&gt;Feed them to a large language model (LLM).&lt;/li&gt;
&lt;li&gt;Generate an answer that sounds confident, even when it’s wrong.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The model reads, but it doesn’t &lt;em&gt;think.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So I began wondering: what if we could &lt;strong&gt;assign roles&lt;/strong&gt; inside the RAG flow?&lt;br&gt;
One agent fetches data, another summarizes, another synthesizes, another critiques, like a research team working in harmony.&lt;/p&gt;

&lt;p&gt;That’s how &lt;strong&gt;Data-Inspector&lt;/strong&gt; was born, a system that doesn’t just “search and answer,” but “reads, reasons, and reviews.”&lt;/p&gt;


&lt;h2&gt;
  
  
  What Exactly &lt;em&gt;Is&lt;/em&gt; Agentic RAG?
&lt;/h2&gt;

&lt;p&gt;Before diving into code, let’s unpack what this buzzword really means.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agentic RAG (Retrieval-Augmented Generation)&lt;/strong&gt; is an evolution of the classic RAG pipeline.&lt;br&gt;
While traditional RAG enhances an LLM with external knowledge, &lt;strong&gt;Agentic RAG gives that process a mind of its own.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  From Static Pipelines to Autonomous Reasoners
&lt;/h3&gt;

&lt;p&gt;In standard RAG, you have a single, linear pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Retrieve → Generate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It’s powerful but static: there’s no reflection, no iteration, and no specialization.&lt;/p&gt;

&lt;p&gt;Agentic RAG transforms this static chain into a &lt;strong&gt;network of intelligent roles&lt;/strong&gt;, each responsible for one cognitive task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Retrieve → Understand → Synthesize → Critique → Generate → (Loop back if needed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every role acts as an &lt;em&gt;agent&lt;/em&gt;, capable of reasoning over its inputs, producing structured outputs, and handing them off to the next stage.&lt;/p&gt;
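
&lt;p&gt;The hand-off can be sketched as a chain of small functions, each consuming the previous stage’s structured output. The stage bodies below are placeholders, not the actual Data-Inspector agents:&lt;/p&gt;

```python
# Each agent is a function that takes structured state and returns structured state.
def retrieve(query):
    return {"query": query, "docs": ["doc-a", "doc-b"]}

def summarize(state):
    state["summaries"] = [f"summary of {d}" for d in state["docs"]]
    return state

def critique(state):
    # A real critic would score the draft and could loop back for a retry.
    state["approved"] = len(state["summaries"]) > 0
    return state

state = critique(summarize(retrieve("RAG vs fine-tuning")))
print(state["approved"])  # True
```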




&lt;h3&gt;
  
  
  The Key Principles Behind Agentic RAG
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt; &lt;strong&gt;Role-based Autonomy&lt;/strong&gt;
Each agent (retriever, summarizer, critic, etc.) has a clearly defined job and communicates via structured data (JSON, markdown).&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;This modularity allows independent improvement of each skill, like retraining just your summarizer agent for better factual grounding.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt; &lt;strong&gt;Reflection Loops&lt;/strong&gt;
Agentic systems don’t stop at the first output. They evaluate and refine.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;This is what turns a “talkative assistant” into a “thoughtful collaborator.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dynamic Knowledge Access&lt;/strong&gt;&lt;br&gt;
Instead of relying only on a static vector database, agentic systems can trigger live searches, query APIs, or even plan multi-step reasoning chains.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transparency &amp;amp; Explainability&lt;/strong&gt;&lt;br&gt;
Each stage produces interpretable intermediate artifacts, summaries, reviews, critique logs, making the system auditable and debuggable.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Common Architectures of Agentic RAG
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Architecture Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Example Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Planner–Executor Loop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A planning agent decomposes a task, executors handle retrieval and summarization.&lt;/td&gt;
&lt;td&gt;Workflow orchestration in research assistants.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Critic–Refiner Loop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The system critiques its own output and regenerates it.&lt;/td&gt;
&lt;td&gt;Self-RAG, Self-Refine, Reflexion.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-Agent Collaboration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multiple specialized agents work in a pipeline, passing structured outputs downstream.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Data-Inspector&lt;/strong&gt;😛&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The approach I took, &lt;strong&gt;multi-agent collaboration&lt;/strong&gt;, felt the most natural.&lt;br&gt;
Each Python class became a self-contained professional: retriever, summarizer, synthesizer, and critic, all orchestrated by a pipeline.&lt;/p&gt;


&lt;h2&gt;
  
  
  Architecture Overview — A RAG System with Personality
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Data-Inspector/
├── agents/
│   ├── retriever.py       # Retrieval
│   ├── summarizer.py      # Summarization
│   ├── synthesizer.py     # Knowledge fusion
│   └── critic.py          # Review / Reflection
├── rag/
│   ├── chunker.py         # Document processing
│   └── vectorstore.py     # Vector memory (optional)
├── pipeline.py            # Agentic orchestration
└── ui_streamlit.py        # Interactive interface
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Each component acts like a neuron in a cognitive system, independent yet collaborative.&lt;/p&gt;


&lt;h2&gt;
  
  
  Retrieval — Learning to Find Relevant Knowledge
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Retriever&lt;/strong&gt; is powered by the Tavily API. It’s the system’s scout, locating relevant information for the query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;WebRetriever&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_urls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_sources&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])[:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_sources&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unlike traditional RAG’s static embeddings, this retrieves &lt;strong&gt;live knowledge&lt;/strong&gt;, keeping the system temporally aware and factually updated.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chunking — Learning to Read Like a Human
&lt;/h2&gt;

&lt;p&gt;HTML pages are noisy. The &lt;code&gt;chunker.py&lt;/code&gt; module cleans and splits them into coherent text segments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;prepare_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_html&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;clean_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Breaking long text into overlapping chunks lets the summarizer think locally while preserving context globally, just like a human scanning through paragraphs.&lt;/p&gt;
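
&lt;p&gt;A hypothetical version of &lt;code&gt;chunk_text&lt;/code&gt; makes the idea concrete: fixed-size windows that share an overlap, so each chunk carries a slice of its neighbor’s context.&lt;/p&gt;

```python
def chunk_text(text, size=100, overlap=20):
    # Slide a window of `size` characters, stepping by size - overlap,
    # so consecutive chunks share `overlap` characters of context.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("a" * 250, size=100, overlap=20)
print(len(chunks))  # 3 windows: [0:100], [80:180], [160:250]
```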




&lt;h2&gt;
  
  
  Summarization — Turning Reading Into Understanding
&lt;/h2&gt;

&lt;p&gt;Each chunk passes through the &lt;strong&gt;SummarizerAgent&lt;/strong&gt;, guided by a structured system prompt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;SYSTEM_SUMMARIZER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are a precise technical summarizer...
Return JSON with: key_points[], methods, evidence[], limitations[]
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sample output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"RAG vs Fine-Tuning"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"key_points"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"RAG adapts faster"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Fine-tuning offers deeper control"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"limitations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Depends on retrieval quality"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All agents speak in JSON, a shared language that prevents context drift and ensures machine-readable collaboration.&lt;/p&gt;
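
&lt;p&gt;A hypothetical guard for that hand-off: every inter-agent message must parse as JSON and carry the fields the next stage expects, so a drifting model fails loudly instead of silently corrupting the pipeline.&lt;/p&gt;

```python
import json

REQUIRED = {"title", "key_points", "limitations"}

def parse_summary(raw):
    data = json.loads(raw)  # raises if the model drifted off-format
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError(f"summary missing fields: {missing}")
    return data

msg = '{"title": "RAG vs Fine-Tuning", "key_points": ["RAG adapts faster"], "limitations": []}'
print(parse_summary(msg)["title"])  # RAG vs Fine-Tuning
```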




&lt;h2&gt;
  
  
  Synthesis — Connecting the Dots
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;SynthesisAgent&lt;/strong&gt; merges multiple summaries into a unified comparative analysis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;synthesize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;summaries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;bulletized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;key_points&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;summaries&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;System:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;User: Query: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bulletized&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the model evolves from “reader” to “analyst,” forming relationships between insights and organizing them logically.&lt;/p&gt;




&lt;h2&gt;
  
  
  Critique — Giving the System a Conscience
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;CriticAgent&lt;/strong&gt; inspects the synthesized narrative and calls out weak logic or missing perspectives.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;review&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;synthesis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;summaries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;System:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;User: Query: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;SYNTHESIS:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;synthesis&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"missing_perspectives"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Data bias"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"weak_arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Unsupported claims about fine-tuning benefits"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"overall_risk"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"medium"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reflective loop transforms a basic RAG pipeline into a &lt;strong&gt;self-aware reasoning system.&lt;/strong&gt;&lt;/p&gt;
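&lt;p&gt;One way to close that loop is to feed the critique back into synthesis. The control flow below is an illustrative sketch, not the repository's actual code: it assumes &lt;code&gt;synthesize&lt;/code&gt; and &lt;code&gt;review&lt;/code&gt; callables shaped like the snippets above, and the retry budget is an arbitrary choice.&lt;/p&gt;

```python
# Illustrative self-critique loop (assumed shapes, hypothetical max_rounds):
# re-run synthesis until the critic's risk rating drops below "high".
def refine(query, summaries, synthesize, review, max_rounds=2):
    synthesis = synthesize(query, summaries)
    for _ in range(max_rounds):
        verdict = review(query, synthesis, summaries)
        if verdict.get("overall_risk") != "high":
            break
        # Fold the critic's gaps back into the request for the next pass
        feedback = "; ".join(verdict.get("missing_perspectives", []))
        synthesis = synthesize(query + f" (address: {feedback})", summaries)
    return synthesis
```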




&lt;h2&gt;
  
  
  Report Generation — From Thought to Thesis
&lt;/h2&gt;

&lt;p&gt;Finally, all insights are compiled into a Markdown report via &lt;code&gt;pipeline.py&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;report_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
System:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_REPORT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

User: Query: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
SYNTHESIS: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;synthesis&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
CRITIC REVIEW: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;review&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Write final report in Markdown.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;report_md&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;report_llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result reads like an academic mini-paper:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Executive summary&lt;/li&gt;
&lt;li&gt;Comparative analysis&lt;/li&gt;
&lt;li&gt;Decision framework&lt;/li&gt;
&lt;li&gt;Risks and gaps&lt;/li&gt;
&lt;li&gt;References&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system doesn’t just compute; it &lt;em&gt;articulates&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Agentic RAG Outperforms Traditional RAG
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Traditional RAG&lt;/th&gt;
&lt;th&gt;Agentic RAG (Data-Inspector)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single linear chain&lt;/td&gt;
&lt;td&gt;Multi-agent collaboration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Learning Behavior&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieval + generation only&lt;/td&gt;
&lt;td&gt;Retrieval + reasoning + reflection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Error Handling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None — one-shot generation&lt;/td&gt;
&lt;td&gt;Built-in self-critique loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Explainability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Opaque output&lt;/td&gt;
&lt;td&gt;Transparent intermediate JSONs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Adaptability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Static embeddings&lt;/td&gt;
&lt;td&gt;Dynamic web retrieval + modular agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output Depth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fluent but shallow&lt;/td&gt;
&lt;td&gt;Analytical, reference-backed synthesis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Agentic RAG = Traditional RAG + Cognition.&lt;/strong&gt;&lt;br&gt;
It elevates retrieval-augmented generation into &lt;em&gt;reason-augmented generation.&lt;/em&gt;&lt;/p&gt;
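&lt;p&gt;The stages compared above compose into one pipeline. This is a minimal orchestration sketch, assuming agent objects with the method names used in the snippets earlier; the project's real &lt;code&gt;pipeline.py&lt;/code&gt; may wire things differently.&lt;/p&gt;

```python
# Orchestration sketch (assumed interfaces, not the project's actual code):
# summarize per document, synthesize across documents, critique, then report.
def run_pipeline(query, documents, summarizer, synthesizer, critic, reporter):
    summaries = [summarizer.summarize(doc) for doc in documents]  # per-doc JSON
    synthesis = synthesizer.synthesize(query, summaries)          # comparison
    verdict = critic.review(query, synthesis, summaries)          # reflection
    return reporter.write(query, synthesis, verdict)              # Markdown out
```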


&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompts are contracts.&lt;/strong&gt;
Each agent must have a clear, bounded responsibility; otherwise, outputs collapse into noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autonomy is discipline disguised as freedom.&lt;/strong&gt;
Structured interaction enables creativity without chaos.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Critique breeds truth.&lt;/strong&gt;
The CriticAgent was the breakthrough: the moment the system began questioning itself, quality skyrocketed.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Looking Ahead
&lt;/h2&gt;

&lt;p&gt;Agentic RAG hints at a future where models won’t just &lt;em&gt;generate answers&lt;/em&gt; but will &lt;em&gt;collaborate intelligently.&lt;/em&gt;&lt;br&gt;
When Data-Inspector finished its first report, it didn’t feel like I’d run code; it felt like I’d led a discussion with a team of invisible colleagues.&lt;/p&gt;


&lt;h2&gt;
  
  
  Explore the Project
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/trickste/data-inspector" rel="noopener noreferrer"&gt;Data-Inspector — Agentic RAG Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run it locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
streamlit run app/ui_streamlit.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Final Reflection
&lt;/h2&gt;

&lt;p&gt;What began as a question, &lt;em&gt;“Can RAG think critically?”&lt;/em&gt;, evolved into an experiment in &lt;strong&gt;digital reasoning&lt;/strong&gt;.&lt;br&gt;
And maybe that’s the trajectory AI will take next:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;from systems that &lt;em&gt;answer questions&lt;/em&gt; to systems that &lt;em&gt;question their own answers.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>No OpenAI API? No Problem. Build RAG Locally with Ollama and FastAPI.</title>
      <dc:creator>Nishant prakash</dc:creator>
      <pubDate>Thu, 06 Nov 2025 14:09:22 +0000</pubDate>
      <link>https://forem.com/nishant_prakash_780f5d541/no-openai-api-no-problem-build-rag-locally-with-ollama-and-fastapi-5g3c</link>
      <guid>https://forem.com/nishant_prakash_780f5d541/no-openai-api-no-problem-build-rag-locally-with-ollama-and-fastapi-5g3c</guid>
      <description>&lt;p&gt;I built a &lt;strong&gt;fully local Retrieval-Augmented Generation (RAG)&lt;/strong&gt; system that lets a &lt;strong&gt;Llama 3 model&lt;/strong&gt; answer questions about my own PDFs and Markdown files, no cloud APIs, no external servers, all running on my machine.&lt;/p&gt;

&lt;p&gt;It’s powered by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streamlit&lt;/strong&gt; for the frontend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI&lt;/strong&gt; for the backend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ChromaDB&lt;/strong&gt; for vector storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; to run Llama 3 locally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system ingests documents, chunks and embeds them, retrieves relevant parts for a query, and feeds them into Llama 3 to generate grounded answers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqz1umdeesz0cgu9jcvsr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqz1umdeesz0cgu9jcvsr.jpg" alt="You might need some help" width="800" height="1000"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 Introduction – Why Go Local?
&lt;/h2&gt;

&lt;p&gt;It started with a simple frustration: I had a bunch of private PDFs and notes I wanted to query like ChatGPT, but without sending anything to the cloud.&lt;br&gt;
LLMs are powerful, but they &lt;strong&gt;don’t know your documents&lt;/strong&gt;. RAG changes that: it gives the model a “working memory” by feeding it relevant chunks of your data each time you ask a question.&lt;/p&gt;

&lt;p&gt;So I decided to build my own mini-ChatGPT for personal docs: everything self-hosted, modular, and transparent.&lt;br&gt;
The goals were simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Upload docs → Ask questions → Get answers &lt;strong&gt;with citations&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Stay &lt;strong&gt;offline&lt;/strong&gt; and &lt;strong&gt;private&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Learn the &lt;strong&gt;moving parts&lt;/strong&gt; of a RAG pipeline by hand, not just through frameworks like LangChain.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Project Repository&lt;/strong&gt;: &lt;a href="https://github.com/trickste/raga" rel="noopener noreferrer"&gt;github.com/trickste/raga&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  🏗 Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The setup runs in two clear phases: &lt;strong&gt;ingestion&lt;/strong&gt; and &lt;strong&gt;querying&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyf9sek1j7itbue2j7g2b.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyf9sek1j7itbue2j7g2b.jpg" alt="RAG Flow Diagram" width="721" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Streamlit UI&lt;/strong&gt; lets users upload documents and ask questions interactively.&lt;br&gt;
The &lt;strong&gt;FastAPI backend&lt;/strong&gt; handles everything else: text extraction, embedding, search, and invoking the LLM.&lt;/p&gt;

&lt;p&gt;This separation makes it easy to debug, extend, and even swap components later (e.g., replace Chroma with Pinecone, or Llama 3 with Mistral).&lt;/p&gt;
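&lt;p&gt;Swapping is easiest when components share an interface. As a hypothetical sketch (the repo does not necessarily define this; the &lt;code&gt;Protocol&lt;/code&gt; and the toy store below are illustrations), any vector store that satisfies the same two methods can be dropped in:&lt;/p&gt;

```python
from typing import Protocol

# Hypothetical interface: Chroma, Pinecone, or anything else can satisfy it.
class VectorStore(Protocol):
    def add(self, texts: list, source: str) -> int: ...
    def search(self, query: str, k: int) -> list: ...

class InMemoryStore:
    """Toy stand-in using keyword match instead of embeddings; a Chroma-backed
    class would expose the same two methods."""
    def __init__(self):
        self.docs = []

    def add(self, texts, source):
        self.docs.extend(texts)
        return len(texts)

    def search(self, query, k):
        hits = [d for d in self.docs if query.lower() in d.lower()]
        return hits[:k]
```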


&lt;h2&gt;
  
  
  Code Walkthrough
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Ingestion – Reading and Chunking Documents
&lt;/h3&gt;

&lt;p&gt;The ingestion pipeline starts by reading PDFs or Markdown files and turning them into clean text chunks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_text_from_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# via PyMuPDF
&lt;/span&gt;    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_text_from_markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file_bytes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ignore&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      &lt;span class="c1"&gt;# split into ~500-char chunks
&lt;/span&gt;    &lt;span class="n"&gt;num_added&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;add_texts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;num_added&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Chunking was trickier than expected.&lt;br&gt;
Too long, and the model forgets details. Too short, and it loses context.&lt;br&gt;
After experimenting, I settled around &lt;strong&gt;300–500 words per chunk&lt;/strong&gt; with slight overlaps to maintain continuity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;CHUNK_SIZE&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;OVERLAP&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;CHUNK_SIZE&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That overlap (~10–15%) turned out to be a small tweak that made a big difference in retrieval quality.&lt;/p&gt;
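&lt;p&gt;Putting the loop above into a complete function, with the chunk size and overlap from my experiments (the exact constants in the repo may differ):&lt;/p&gt;

```python
CHUNK_SIZE = 400   # words per chunk; I settled around 300-500
OVERLAP = 50       # roughly 10-15% overlap between consecutive chunks

def chunk_text(text: str) -> list:
    """Split text into overlapping word-based chunks (the loop shown above)."""
    words = text.split()
    chunks = []
    step = CHUNK_SIZE - OVERLAP
    for i in range(0, len(words), step):
        chunk = " ".join(words[i : i + CHUNK_SIZE])
        if chunk:
            chunks.append(chunk)
    return chunks
```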




&lt;h3&gt;
  
  
  2. Vector Storage – Semantic Search with ChromaDB
&lt;/h3&gt;

&lt;p&gt;Each chunk is embedded using &lt;strong&gt;SentenceTransformers (all-MiniLM-L6-v2)&lt;/strong&gt; and stored in a local ChromaDB collection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;embedding_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PersistentClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./chroma_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_or_create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;langrag_docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_texts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normalize_embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadatas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;hex&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each document is now represented as a &lt;strong&gt;vector&lt;/strong&gt; in a 384-dimensional space.&lt;br&gt;
When a query comes in, we embed the question and retrieve the &lt;strong&gt;top-k most similar chunks&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_texts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;n_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;retrieved_chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where the “retrieval” in Retrieval-Augmented Generation happens.&lt;/p&gt;
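&lt;p&gt;One subtlety worth guarding against: documents were added with explicitly computed, normalized MiniLM embeddings, so the query should be embedded the same way rather than left to the collection's default embedder. A minimal sketch, reusing the names from the snippets above:&lt;/p&gt;

```python
# Embed the query with the SAME model and settings used at ingest time, so
# query and document vectors live in the same space, then search by vector.
def retrieve(user_query, embedding_model, collection, k=3):
    q_emb = embedding_model.encode([user_query], normalize_embeddings=True)
    results = collection.query(
        query_embeddings=[list(q_emb[0])],  # one query vector
        n_results=k,
    )
    return results["documents"][0]
```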




&lt;h3&gt;
  
  
  3. Querying – Feeding the Context to the LLM
&lt;/h3&gt;

&lt;p&gt;Once we have the most relevant chunks, we build a &lt;strong&gt;grounded prompt&lt;/strong&gt; and send it to &lt;strong&gt;Llama 3&lt;/strong&gt; via &lt;strong&gt;Ollama&lt;/strong&gt;’s local API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a document QA assistant. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer strictly using only the CONTEXT below.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CONTEXT:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;QUESTION: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama3:latest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prompt discipline was &lt;strong&gt;critical&lt;/strong&gt;.&lt;br&gt;
If you don’t tell the model to stick to the context, it will happily hallucinate.&lt;br&gt;
By enforcing “If it’s not in the context, say you don’t know,” we drastically improved trustworthiness.&lt;/p&gt;


&lt;h3&gt;
  
  
  4. The FastAPI Backend – Tying It All Together
&lt;/h3&gt;

&lt;p&gt;FastAPI glues it all together: document ingestion, querying, and LLM invocation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/add_docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_docs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UploadFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="p"&gt;(...)):&lt;/span&gt;
    &lt;span class="n"&gt;file_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;num_chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;process_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Added &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_chunks&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; chunks from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_docs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;QueryRequest&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_texts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;n_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This clean separation of endpoints made debugging painless.&lt;br&gt;
Every step logs useful info: embeddings added, chunks retrieved, retrieval distances, and LLM latency.&lt;/p&gt;


&lt;h3&gt;
  
  
  5. Streamlit Frontend – A Minimal, Interactive UI
&lt;/h3&gt;

&lt;p&gt;The frontend makes it fun: drag and drop a file, type a question, and watch it respond.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;file_uploader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Upload PDF or Markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ask a question about your documents:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;button&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Get Answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;**Answer:**&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It’s only ~30 lines of Streamlit, but it transforms the project into a usable app.&lt;/p&gt;
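&lt;p&gt;One piece not shown above is the upload side: the frontend just forwards the file to the &lt;code&gt;/add_docs&lt;/code&gt; endpoint as multipart form data. A minimal sketch (the &lt;code&gt;API_URL&lt;/code&gt; value and helper names here are assumptions, not taken from the repo):&lt;/p&gt;

```python
import requests

API_URL = "http://localhost:8000"  # assumed FastAPI address

def build_upload(name: str, data: bytes) -> dict:
    # Shape the multipart payload that FastAPI's UploadFile parameter expects.
    return {"file": (name, data)}

def upload(name: str, data: bytes) -> str:
    # POST the file to the ingestion endpoint and return its status message.
    res = requests.post(f"{API_URL}/add_docs", files=build_upload(name, data))
    res.raise_for_status()
    return res.json()["message"]
```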




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Chunking Is an Art
&lt;/h3&gt;

&lt;p&gt;Small, overlapping chunks worked better than large ones.&lt;br&gt;
Breaking text at &lt;strong&gt;semantic boundaries&lt;/strong&gt; (paragraphs or sections) gave cleaner retrievals.&lt;/p&gt;
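&lt;p&gt;As an illustration, a paragraph-aware chunker with a small overlap might look like this (the sizes are illustrative, and the project’s actual implementation may differ):&lt;/p&gt;

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # Split on paragraph boundaries, packing paragraphs into chunks of
    # roughly chunk_size characters and carrying a short tail forward
    # so adjacent chunks share a little context.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if len(current) + len(para) + 2 <= chunk_size:
            current = f"{current}\n\n{para}" if current else para
        else:
            if current:
                chunks.append(current)
            # seed the next chunk with the tail of the previous one
            current = (current[-overlap:] + "\n\n" if current else "") + para
    if current:
        chunks.append(current)
    return chunks
```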

&lt;h3&gt;
  
  
  2. Quality of Retrieval Beats Quantity
&lt;/h3&gt;

&lt;p&gt;Feeding the model too many chunks diluted its answers.&lt;br&gt;
3 relevant chunks &amp;gt; 10 vague ones.&lt;/p&gt;
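&lt;p&gt;Chroma returns distances alongside the documents, which makes a simple quality gate easy: drop anything past a distance threshold instead of padding the context. A sketch (the threshold value is illustrative):&lt;/p&gt;

```python
def filter_by_distance(results: dict, max_distance: float = 0.8) -> list[str]:
    # Keep only the chunks whose embedding distance to the query is
    # below the threshold; fewer, closer chunks beat many vague ones.
    docs = results["documents"][0]
    dists = results["distances"][0]
    return [doc for doc, dist in zip(docs, dists) if dist <= max_distance]
```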

&lt;h3&gt;
  
  
  3. Prompt Grounding Changes Everything
&lt;/h3&gt;

&lt;p&gt;Explicitly instructing the LLM not to make things up was the single most effective fix for hallucination.&lt;/p&gt;
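&lt;p&gt;Concretely, the grounding rule lives in the system message, and the retrieved context is spliced into the user message. A minimal sketch of that message construction (the exact wording here is illustrative):&lt;/p&gt;

```python
GROUNDING_RULES = (
    "Answer ONLY from the provided context. "
    "If the answer is not in the context, say you don't know."
)

def build_messages(query: str, context: str) -> list[dict]:
    # System message carries the grounding rule; user message carries
    # the retrieved context plus the actual question.
    return [
        {"role": "system", "content": GROUNDING_RULES},
        {"role": "user", "content": f"CONTEXT:\n{context}\n\nQUESTION: {query}"},
    ]
```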

&lt;h3&gt;
  
  
  4. Local Models Are Ready for Prime Time
&lt;/h3&gt;

&lt;p&gt;Running &lt;strong&gt;Llama 3 via Ollama&lt;/strong&gt; felt just as smooth as using an API, but faster, cheaper, and private.&lt;br&gt;
And yes: no API keys, no rate limits. WOOHOO!&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Observe Everything
&lt;/h3&gt;

&lt;p&gt;Logging every stage (chunk sizes, retrieval scores, final prompts) made debugging feel scientific rather than like guesswork.&lt;/p&gt;
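&lt;p&gt;In practice that observability can be as simple as wrapping the retrieval call with a timer and a structured log line. A sketch (the logger name and logged fields are assumptions):&lt;/p&gt;

```python
import logging
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("rag")

def timed_query(collection, query: str, n_results: int = 3) -> dict:
    # Run the vector search and log chunk count, distances, and latency.
    start = time.perf_counter()
    results = collection.query(query_texts=[query], n_results=n_results)
    log.info("retrieved %d chunks, distances=%s, took %.3fs",
             len(results["documents"][0]),
             results["distances"][0],
             time.perf_counter() - start)
    return results
```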




&lt;h2&gt;
  
  
  Conclusion – My Own “ChatGPT for PDFs”
&lt;/h2&gt;

&lt;p&gt;At the end of this build, I had a working &lt;strong&gt;Local RAG Assistant&lt;/strong&gt;, a tiny offline system that could read, index, and reason about my documents.&lt;br&gt;
It runs entirely on my laptop, keeps my data private, and helped me deeply understand &lt;strong&gt;how modern LLM pipelines actually work&lt;/strong&gt; under the hood.&lt;/p&gt;

&lt;p&gt;There’s plenty of room to grow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add &lt;strong&gt;source citations&lt;/strong&gt; in answers.&lt;/li&gt;
&lt;li&gt;Support more file types (Word, HTML, etc.).&lt;/li&gt;
&lt;li&gt;Experiment with &lt;strong&gt;different embedding models&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add caching or user authentication for a real-world app.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But most importantly, it taught me how retrieval, embeddings, and prompt engineering combine to make language models truly useful.&lt;/p&gt;

&lt;p&gt;If you’ve been thinking about building something similar, &lt;strong&gt;just start&lt;/strong&gt;.&lt;br&gt;
It’s incredibly rewarding to see your own files talk back intelligently.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Happy building and may your vectors always find their nearest neighbors.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>openai</category>
      <category>ai</category>
      <category>mlops</category>
      <category>python</category>
    </item>
    <item>
      <title>Teaching AI to Take Initiative – Building a Self-Thinking App with LangGraph and Ollama</title>
      <dc:creator>Nishant prakash</dc:creator>
      <pubDate>Tue, 04 Nov 2025 16:38:38 +0000</pubDate>
      <link>https://forem.com/nishant_prakash_780f5d541/teaching-ai-to-take-initiative-building-a-self-thinking-app-with-langgraph-and-ollama-2h9j</link>
      <guid>https://forem.com/nishant_prakash_780f5d541/teaching-ai-to-take-initiative-building-a-self-thinking-app-with-langgraph-and-ollama-2h9j</guid>
      <description>&lt;h2&gt;
  
  
  Introduction – When AI Becomes a Recruiter
&lt;/h2&gt;

&lt;p&gt;Imagine uploading a job description and a candidate’s resume, and an AI instantly evaluates the match, gives ATS scores, writes tailored feedback, and even generates interview questions.&lt;br&gt;
No spreadsheets. No manual parsing. Just structured, explainable intelligence.&lt;/p&gt;

&lt;p&gt;That’s what I set out to build: an &lt;strong&gt;Agentic AI Resume Evaluator&lt;/strong&gt; powered by &lt;strong&gt;LangGraph&lt;/strong&gt;, &lt;strong&gt;Ollama&lt;/strong&gt;, and &lt;strong&gt;FastAPI&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project Repository:&lt;/strong&gt; &lt;a href="https://github.com/trickste/ntrvur" rel="noopener noreferrer"&gt;github.com/trickste/ntrvur&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this blog, I’ll walk through how I stitched together an end-to-end agentic system that autonomously reasons, uses tools, and collaborates with sub-agents to produce actionable hiring insights.&lt;/p&gt;

&lt;p&gt;By the end, you’ll see how agentic architectures, local LLMs, and good old Python orchestration come together to make AI do work, not just talk.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpeh9cjkx4ydjj3bm0ina.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpeh9cjkx4ydjj3bm0ina.jpg" alt="Agent?Really?" width="800" height="882"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  What Is Agentic AI, Really?
&lt;/h2&gt;

&lt;p&gt;Agentic AI means giving AI the ability to think, decide, and act toward a goal, much like a human agent.&lt;br&gt;
Instead of “just predicting text,” the AI plans its steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;“Extract candidate details → Compute ATS score → Evaluate → Review → Merge → Return final report.”
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tools&lt;/strong&gt; – for specific actions (e.g., ATS computation, text extraction).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory &amp;amp; State&lt;/strong&gt; – to carry information between steps.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reasoning Loop&lt;/strong&gt; – to know when to invoke which tool.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s where frameworks like LangChain and its orchestration layer LangGraph shine: they let you turn an LLM into a multi-step problem solver.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tech Stack – Tools Behind the Magic
&lt;/h2&gt;

&lt;p&gt;Before diving into code, let's introduce the key players in this adventure:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LangGraph&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Builds and executes the agent workflow as a directed graph. Each node performs a task (ATS, experience extraction, etc.).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LangChain (Community)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Provides integrations and structured tool definitions (like &lt;code&gt;StructuredTool.from_function&lt;/code&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ollama&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs local LLMs (like &lt;code&gt;llama3&lt;/code&gt; or &lt;code&gt;phi3&lt;/code&gt;) offline, via &lt;code&gt;ollama.Client&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FastAPI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Wraps everything in a clean REST API (&lt;code&gt;/api/evaluate&lt;/code&gt;) for users to upload JDs &amp;amp; resumes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;scikit-learn&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Handles ATS scoring via TF-IDF cosine similarity.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PyMuPDF (fitz)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extracts text from PDF resumes reliably.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With this stack, I had the ingredients for an AI that could think (LLM via Ollama), remember/learn (via LangChain’s memory or retrieval components), act (use tools or retrieve info, potentially via MCP or LangChain tools), and be deployed (FastAPI serving a LangGraph-managed workflow). Now it was time to design the brain of the operation – the agent’s architecture.&lt;/p&gt;
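&lt;p&gt;The ATS score in the table above is plain scikit-learn: vectorize the JD and resume with TF-IDF and take the cosine similarity between the two vectors. A sketch of that computation (the 0–100 scaling is an assumption):&lt;/p&gt;

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def ats_score(jd_text: str, resume_text: str) -> float:
    # TF-IDF both texts over a shared vocabulary, then score the match
    # as the cosine of the angle between the two vectors.
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform([jd_text, resume_text])
    return round(float(cosine_similarity(tfidf[0], tfidf[1])[0][0]) * 100, 2)
```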

&lt;h2&gt;
  
  
  Designing an Agentic Workflow (Architecture)
&lt;/h2&gt;

&lt;p&gt;The system runs like a small team of AI specialists working in sync. Once a JD and resume are uploaded, FastAPI triggers a LangGraph workflow, a series of smart nodes that extract the candidate’s name, compute the ATS score, and infer years of experience. A local Ollama LLM then wraps these findings into structured JSON.&lt;/p&gt;

&lt;p&gt;From there, the Evaluator Agent analyzes the match and drafts interview questions, while the Reviewer Agent critiques and improves them. A final Synthesizer merges both perspectives into one coherent report, a clean orchestration of reasoning, review, and refinement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2xyg61n1gcgcemxthcd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2xyg61n1gcgcemxthcd.jpg" alt="Flow Architecture" width="701" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each part operates independently but shares structured state through LangGraph.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s unpack the pieces.
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Agentic Evaluation with LangGraph
&lt;/h3&gt;

&lt;p&gt;LangGraph was the backbone of my orchestration.&lt;br&gt;
I defined a &lt;code&gt;StateGraph&lt;/code&gt; with typed state fields using &lt;strong&gt;Pydantic&lt;/strong&gt;, and created sequential nodes that act like tools in a workflow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# agentic_evaluator.py
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EvaluationState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;jd_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;resume_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;ATS_SCORE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;YEARS_OF_EXPERIENCE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;CANDIDATE_NAME&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Nodes as Actions
&lt;/h4&gt;

&lt;p&gt;Each node modifies and returns the updated state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ats_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;EvaluationState&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ats_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jd_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resume_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[ATS] → &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_copy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;experience_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;EvaluationState&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;experience_extractor_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jd_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Experience] → &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_copy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, a small &lt;strong&gt;LLM finalizer&lt;/strong&gt; via Ollama neatly packages all extracted values as JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;llm_finalize_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;EvaluationState&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOllama&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ATS_SCORE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ATS_SCORE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YEARS_OF_EXPERIENCE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;YEARS_OF_EXPERIENCE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CANDIDATE_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CANDIDATE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Given this data, return valid JSON:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LangGraph made it effortless to visualize this reasoning loop as a flowchart.&lt;br&gt;
Each node is deterministic, while the last one invokes LLM reasoning: the true “thinking” step.&lt;/p&gt;
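&lt;p&gt;Stripped of the library, the pattern each node follows is simply “take the state, return an updated copy,” folded through a fixed sequence. A conceptual sketch without LangGraph (the names and score value are illustrative):&lt;/p&gt;

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class State:
    jd_text: str
    ats_score: float = 0.0

def ats_node(state: State) -> State:
    # Stand-in for the real TF-IDF scorer node.
    return replace(state, ats_score=42.0)

def run_pipeline(state: State, nodes) -> State:
    # LangGraph's compiled StateGraph performs this fold for you,
    # adding branching, persistence, and visualization on top.
    for node in nodes:
        state = node(state)
    return state
```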
&lt;h3&gt;
  
  
  Step 2: Evaluator Agent – LLM as the Hiring Analyst
&lt;/h3&gt;

&lt;p&gt;Once the structured MCP (Model Context Protocol)-like output was ready, I passed it into an &lt;strong&gt;Evaluator Agent&lt;/strong&gt;, an Ollama-powered LLM acting as a hiring analyst.&lt;/p&gt;

&lt;p&gt;It uses templated system and user prompts stored as Markdown:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;app/prompts/evaluator_system.md
app/prompts/evaluator_user.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system prompt strictly enforces a JSON schema with ATS score, resume feedback, and 10–15 interview questions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_prompt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OllamaLLM&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This agent reads the JD and resume, scores the match, and auto-generates a structured evaluation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Reviewer Agent – The Senior Interviewer
&lt;/h3&gt;

&lt;p&gt;Next, the &lt;strong&gt;Reviewer Agent&lt;/strong&gt; steps in, another Ollama instance prompted with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The JD and resume text&lt;/li&gt;
&lt;li&gt;Candidate experience level&lt;/li&gt;
&lt;li&gt;The evaluator’s generated questions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Its role: review the quality of those questions, detect missing areas, and produce a refined list.&lt;br&gt;
This agent automates the &lt;strong&gt;human-in-the-loop&lt;/strong&gt; role: built-in quality control.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;review_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_reviewer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;jd_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;jd_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;resume_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;resume_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;questions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;questions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;years_of_experience&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;years_of_experience&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output includes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"QUESTION_REVIEW"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"missing_topics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"CI/CD"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Cloud Costing"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"improvement_suggestions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Add a behavioral question"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"UPDATED_QUESTIONS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Explain blue/green deployments?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"How do you manage secrets in CI/CD?"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
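&lt;p&gt;Since the synthesizer indexes &lt;code&gt;reviewer_json["QUESTION_REVIEW"]&lt;/code&gt; and &lt;code&gt;reviewer_json["UPDATED_QUESTIONS"]&lt;/code&gt; directly, it pays to fail fast when the model drops a key. A small guard like this works (a hypothetical helper, not part of the original repo):&lt;/p&gt;

```python
# Hypothetical guard; the original project does not necessarily include it.
REQUIRED_REVIEWER_KEYS = {"QUESTION_REVIEW", "UPDATED_QUESTIONS"}

def validate_reviewer_output(review_json: dict) -> dict:
    """Raise early if the reviewer LLM omitted a required top-level key."""
    missing = REQUIRED_REVIEWER_KEYS - review_json.keys()
    if missing:
        raise ValueError(f"Reviewer output missing keys: {sorted(missing)}")
    if not isinstance(review_json["UPDATED_QUESTIONS"], list):
        raise TypeError("UPDATED_QUESTIONS must be a list of strings")
    return review_json
```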



&lt;h3&gt;
  
  
  Step 4: Synthesizer – The Final Merge
&lt;/h3&gt;

&lt;p&gt;Finally, I wrote a &lt;code&gt;merge_evaluator_and_reviewer&lt;/code&gt; function to combine both agents’ results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;merge_evaluator_and_reviewer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;evaluator_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reviewer_json&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;candidate_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;iter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;evaluator_json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evaluator_json&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;candidate_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;updated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;candidate_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ATS_SCORE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ATS_SCORE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RESUME_FEEDBACK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RESUME_FEEDBACK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FINAL_QUESTIONS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromkeys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INTERVIEW_QUESTIONS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;reviewer_json&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UPDATED_QUESTIONS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="p"&gt;)),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QUESTIONS_REVIEW_SUMMARY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reviewer_json&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QUESTION_REVIEW&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;updated&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result is a &lt;strong&gt;rich, structured payload&lt;/strong&gt; ready for dashboards or downstream workflows.&lt;/p&gt;
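&lt;p&gt;The &lt;code&gt;FINAL_QUESTIONS&lt;/code&gt; merge relies on &lt;code&gt;list(dict.fromkeys(...))&lt;/code&gt;: dict keys are unique and preserve insertion order (Python 3.7+), so duplicates are dropped while the evaluator's ordering wins. For example:&lt;/p&gt;

```python
# dict.fromkeys keeps first-seen order and drops repeats.
evaluator_qs = ["Explain container orchestration?", "How do you manage secrets in CI/CD?"]
reviewer_qs = ["How do you manage secrets in CI/CD?", "Explain blue/green deployments?"]

merged = list(dict.fromkeys(evaluator_qs + reviewer_qs))
# -> ['Explain container orchestration?',
#     'How do you manage secrets in CI/CD?',
#     'Explain blue/green deployments?']
```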

&lt;h3&gt;
  
  
  Step 5: FastAPI – Making It Real
&lt;/h3&gt;

&lt;p&gt;To expose the system, I used FastAPI for a clean HTTP interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@router.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/evaluate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;EvaluateResponse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jd_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UploadFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resume_pdf&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UploadFile&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;jd_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;jd_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;resume_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;resume_pdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;resume_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_text_from_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resume_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;mcp_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_agentic_evaluation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jd_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resume_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;eval_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_evaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jd_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resume_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mcp_output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CANDIDATE_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;review_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_reviewer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jd_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resume_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eval_json&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INTERVIEW_QUESTIONS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;mcp_output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YEARS_OF_EXPERIENCE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;merge_evaluator_and_reviewer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eval_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;review_json&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The beauty? Upload a JD and a resume, and within seconds you get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"John Doe"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ATS_SCORE"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;84.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"RESUME_FEEDBACK"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Python"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Strong"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Docker"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Moderate"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"FINAL_QUESTIONS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Explain container orchestration?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Describe CI/CD best practices?"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What I Learned
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. &lt;strong&gt;Agentic AI is orchestration, not magic.&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Breaking down tasks into nodes and state transitions makes the system explainable and debuggable. LangGraph turns “AI reasoning” into visual, modular code.&lt;/p&gt;
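&lt;p&gt;The "nodes and state transitions" idea doesn't require LangGraph to grasp: each node reads the shared state, updates it, and names the next node. This toy runner captures the pattern (node names mirror this project's agents, but the runner itself is a simplified sketch; LangGraph adds typed state, branching, and persistence on top):&lt;/p&gt;

```python
# Toy node/state-transition runner; a simplified stand-in for LangGraph.
def extract(state):
    state["name"] = "John Doe"       # e.g. MCP-style extraction step
    return "evaluate"

def evaluate(state):
    state["questions"] = ["Q1"]      # evaluator agent proposes questions
    return "review"

def review(state):
    state["questions"].append("Q2")  # reviewer agent refines them
    return None                      # terminal node

NODES = {"extract": extract, "evaluate": evaluate, "review": review}

def run_graph(start: str) -> dict:
    """Run nodes until one returns None, threading state through each."""
    state, node = {}, start
    while node is not None:
        node = NODES[node](state)
    return state
```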

&lt;h4&gt;
  
  
  2. &lt;strong&gt;Local LLMs are surprisingly capable.&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Running &lt;code&gt;llama3&lt;/code&gt; and &lt;code&gt;phi3&lt;/code&gt; locally through Ollama gave me full control: no latency, no API keys. And switching models was a one-line config change.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. &lt;strong&gt;Strict JSON prompts save hours.&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Enforcing schema at the prompt level eliminated parsing headaches. (Still, I built &lt;code&gt;coerce_json()&lt;/code&gt; to auto-clean malformed outputs.)&lt;/p&gt;
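&lt;p&gt;The original &lt;code&gt;coerce_json()&lt;/code&gt; isn't shown here; a plausible minimal version strips Markdown fences and slices out the first balanced JSON object before parsing. This is an assumption about the idea, not the project's actual implementation:&lt;/p&gt;

```python
import json
import re

def coerce_json(raw: str) -> dict:
    """Best-effort cleanup of an LLM reply into a JSON object.

    Sketch of the idea behind coerce_json(); the real helper may differ.
    """
    # Drop Markdown code fences like ```json ... ```
    cleaned = re.sub(r"```(?:json)?", "", raw).strip()
    # Slice from the first '{' to the last '}' to drop chatty preambles
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in model output")
    return json.loads(cleaned[start:end + 1])
```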

&lt;h4&gt;
  
  
  4. &lt;strong&gt;Agents need reviewers too.&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Having a second LLM review the first’s work brought real reliability: a taste of collaborative multi-agent systems.&lt;/p&gt;

&lt;h4&gt;
  
  
  5. &lt;strong&gt;FastAPI makes deployment feel native.&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Integrating async routes and file uploads made serving AI models feel as easy as any microservice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion – Agentic AI for Real-World Automation
&lt;/h3&gt;

&lt;p&gt;This project taught me that &lt;strong&gt;Agentic AI isn’t just theory&lt;/strong&gt;; it’s a practical blueprint for intelligent systems.&lt;br&gt;
By combining &lt;strong&gt;LangGraph&lt;/strong&gt; (structured reasoning), &lt;strong&gt;Ollama&lt;/strong&gt; (local intelligence), and &lt;strong&gt;FastAPI&lt;/strong&gt; (deployment glue), I built a hiring-assistant AI that’s fast, explainable, and private.&lt;/p&gt;

&lt;p&gt;But more than that, it changed how I think about AI apps: not as “chatbots,” but as &lt;strong&gt;collaborative systems of agents&lt;/strong&gt;, each with a role, a goal, and a reasoning process.&lt;/p&gt;

&lt;p&gt;If you’re looking to dive deeper, here are the docs that shaped my journey:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/langgraph" rel="noopener noreferrer"&gt;LangGraph Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ollama.ai" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; – Run local LLMs easily&lt;/li&gt;
&lt;li&gt;&lt;a href="https://python.langchain.com/docs" rel="noopener noreferrer"&gt;LangChain Tools and Agents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://fastapi.tiangolo.com" rel="noopener noreferrer"&gt;FastAPI&lt;/a&gt; – Lightning-fast web APIs&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://modelcontextprotocol.io" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt; – Emerging standard for LLM-tool integration&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Repo Snapshot
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;app/
├── core/config.py
├── routers/evaluate.py
├── services/
│   ├── agentic_evaluator.py
│   ├── evaluator_agent.py
│   ├── reviewer_agent.py
│   ├── synthesizer_agent.py
│   └── ollama_client.py
├── prompts/
│   ├── evaluator_system.md
│   ├── evaluator_user.md
│   ├── reviewer_system.md
│   └── reviewer_user.md
└── utils/json_safety.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Final Thought:&lt;/strong&gt;&lt;br&gt;
Agentic AI is not about creating a single genius model; it’s about creating a &lt;em&gt;team&lt;/em&gt; of smart, cooperative components.&lt;br&gt;
And once they start working together, you realize that automation can actually &lt;em&gt;think&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Happy building, fellow tinkerers!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>python</category>
      <category>learning</category>
    </item>
  </channel>
</rss>
