Forem: Aman Pandey

I found a critical CVE in a top AI agent framework. Here's what it taught me about how we're all building agents wrong.

Aman Pandey — Sun, 19 Apr 2026 17:24:29 +0000

Nobody told me the scariest part of building AI agents isn't the hallucinations.
It's the attack surface you're quietly shipping to production while obsessing over your prompt.

I found out the hard way.

The vulnerability that should not have existed

While contributing to OpenHands (one of the top open-source AI agent frameworks),
I discovered a path traversal vulnerability now officially CVE-2025-68146 sitting quietly in production. The kind of bug that makes you go silent for a second before typing into Slack.

The core issue: the agent runtime wasn't properly sanitizing file paths during execution. An attacker could craft a request that escaped the intended sandbox and read arbitrary files from the host. In an agentic system that calls tools, writes code, and accesses filesystems this is catastrophic.

It was patched and shipped in OpenHands v1.8.2.

Why AI agent codebases are a security nightmare

I run a LangGraph-based agentic system handling 1,000+ executions daily at 99.9% uptime. Here's what that experience teaches you: the agentic paradigm introduces an entirely new class of security problems the community is not talking about enough.

1. Tool use expands your attack surface exponentially

Every tool you hand your agent is a potential injection vector. Shell tools,filesystem tools, HTTP tools — each needs its own threat model. Most teams bolt these on without thinking about what happens when the LLM's output is adversarial.Prompt injection from web content the agent reads is real. It's happening right now in production systems.

# What your "harmless" shell tool looks like to an attacker
tool_input = llm_output["command"]  # <- never trust this directly
subprocess.run(tool_input, shell=True)  # <- shell injection waiting to happen

2. Sandboxing is not optional — it's the whole point

The CVE I found was a sandbox escape. The system thought it was contained.
If you're running agent-generated code without a hard container boundary (Docker,
gVisor, Firecracker), you're not running a sandbox. You're running a prayer.

The fix we shipped: strict path normalization + allowlist validation before any file operation. Three lines. Months of exposure.

# Before (🤦)
def execute_file_op(path: str):
    return open(path).read()

# After (🔒)
import pathlib

ALLOWED_ROOT = pathlib.Path("/workspace").resolve()

def execute_file_op(path: str):
    resolved = (ALLOWED_ROOT / path).resolve()
    if not str(resolved).startswith(str(ALLOWED_ROOT)):
        raise PermissionError(f"Path escape attempt: {path}")
    return resolved.read_text()

Simple. Boring. Critical.

3. You're probably logging things you shouldn't be

Agent frameworks love verbose logs. Trajectory dumps, tool call outputs, intermediate LLM responses all sitting in your observability stack. I've seen teams accidentally log API keys, PII, and internal file contents through their agents. Your agent's "brain" is leaky by default.

The uncomfortable truth

We're in a gold rush for AI agents. Teams are shipping to production faster than the security community can audit. The "move fast" energy is real - I get it, I live it but the CVE database doesn't care about your demo day deadline.

The OpenHands team moved fast once I reported it. Patched within the coordinated disclosure window. But the vulnerability existed for a while before anyone caught it.That's a structural problem, not a team problem.

If you're building with AI agents in production: do a 30-minute threat model on your tool layer. What can your agent read? Write? Execute? What happens if its input is adversarial? The answers will probably surprise you.

Drop a comment genuinely curious:

Have you thought about prompt injection through your agent's tool layer?
Are you sandboxing code execution? What's your setup?
What's the scariest thing you've seen an agent do in production?

One Million Tokens. I Still Don't Think Most People Understand What That Actually Means.

Aman Pandey — Thu, 19 Mar 2026 16:32:47 +0000

GPT-5.4 dropped on March 5th. The internet did its thing a thousand hot takes, a hundred "this changes everything" threads, the usual cycle.

I want to try something different. I want to actually explain what this number means, because I've spent the last few weeks thinking about it and I think most of the coverage missed the point entirely.

Start Here: What Is a Token?

Before we get to a million of them, let's be precise about what we're counting.

A token is not a word. It's closer to a syllable a chunk of text that the model processes as a single unit. "Running" might be one token. "Unbelievable" might be two. Numbers, punctuation, spaces those are tokens too. As a rough rule of thumb, 1,000 tokens is about 750 words.

So when OpenAI says 1,000,000 tokens that's roughly 750,000 words of context that GPT-5.4 can hold in its working memory at once.

750,000 words.

The entire Harry Potter series is about 1.08 million words. So GPT-5.4 can almost hold the entire Harry Potter series in its head, and reason across all of it at the same time. Not read it once and forget. Hold it. Actively.

That's what we're talking about.

The Problem Previous Models Had That Nobody Talked About Enough

Every LLM before this had a context window problem. It was real and it was annoying and most people who used these tools professionally hit it constantly.

Here's how it showed up in practice. You'd load a long document into GPT-4. Ask questions about section 7. It would answer well. Ask a question that required connecting section 2 to section 14 to the appendix — and it would hallucinate, hedge, or quietly ignore the parts it couldn't hold anymore.

The technical term is "lost in the middle." Research from Stanford in 2023 showed that language models are dramatically better at using information from the beginning and end of their context window than information buried in the middle. Give a model a 100-page document and the crucial fact is on page 50? It might as well not be there.

A 1M token window doesn't automatically solve "lost in the middle." But it changes the shape of the problem. When you have so much room that you don't have to make hard choices about what to include, the retrieval architecture can be more deliberate. OpenAI has clearly been working on this early testing suggests GPT-5.4 holds its attention across long documents more reliably than its predecessors.

Not perfectly. But more reliably.

The Benchmark That Actually Matters Here

The headline score is 83% on something called GDPVal.

Most AI benchmarks are kind of useless for normal people to interpret. "Achieved 94% on MMLU" okay, but what does that mean? MMLU tests knowledge recall. That's a library, not a thinker.

GDPVal is different in an important way. It measures performance on tasks that have real economic value — the kind of work actual companies pay actual humans to do. Legal analysis. Financial modeling. Code review. Research synthesis. The tasks that make up knowledge work.

83% on that benchmark puts GPT-5.4 at or above human expert level.

I want to be careful here, because "human expert level" is doing a lot of work in that sentence. Human experts vary enormously. The benchmark comparison is against a specific sample. There are categories within it where GPT-5.4 is weaker. This is not me saying "AI is now as good as a lawyer."

But 83% on GDPVal is not a number you can handwave away either. The previous best score on that benchmark was in the mid-70s. This is a real jump. And the combination of that reasoning capability with a million token context window changes what you can actually ask the model to do.

What This Actually Changes, Concretely

Let me give you three specific things that become possible now that weren't practical before.

First: Whole-document legal review.

Previously, if you had a 400-page merger agreement and you wanted an AI to flag every clause that created liability exposure across the entire document you'd have to chunk it. Feed it in sections. Hope the model's understanding of section 3 was still active when it got to section 287. Hope it could connect the indemnification clause to the arbitration clause two hundred pages later.

Now you can feed the whole thing. Ask for a coherent analysis across the entire document. Get something that treats it as a single artifact rather than 30 disconnected chunks.

This is not replacing lawyers. It's changing what the first pass of due diligence looks like.

Second: Codebase-level reasoning.

Most serious software projects have codebases with hundreds of thousands of lines of code. Asking an AI to refactor a function is useful. Asking it to understand how a change in this module propagates to that service, which interacts with those three APIs, which was written by a team that made these specific assumptions that required the model to hold the entire codebase in context.

GPT-4 could do this for small projects. GPT-5.4 can do this for medium-to-large ones. The senior engineers I've talked to about this are the ones who seem most shaken by it, honestly. They're the people whose value has always been knowing the whole system. That's what just got cheaper.

Third: Longitudinal research synthesis.

Say you want to understand what the academic literature on, I don't know, attention mechanisms in transformer models actually says — not the popular summaries, the actual papers, the debates, the contradictions, the methodological criticisms. That might be 200 papers. 300. You can now feed GPT-5.4 a significant chunk of a research literature and ask it to synthesize, identify contradictions, find the questions nobody's answering.

Literature reviews take PhD students months. I'm not saying AI does this as well as a careful human scholar. I'm saying the gap is smaller than it was and getting smaller.

The Part I'm Genuinely Uncertain About

I want to be honest about something.

Every time a new model drops, there's a wave of "this changes everything" content and then a wave of "actually it's not that impressive" content. Both are usually wrong in specific ways. The first wave overstates; the second wave tests the model on tasks it's not designed for and concludes it's overhyped.

GPT-5.4 is genuinely impressive. The context window increase is real and useful. The GDPVal score is real and meaningful.

But I keep coming back to one thing: context window and raw capability are different axes. A model can hold a million tokens and still reason poorly about what's in them. Retrieval is not understanding. Holding information in context is not the same as having good judgment about what matters.

The question I don't think anyone has a clean answer to yet is: as the context window grows, does the quality of attention across that window hold up? Does having access to a 700,000-word document actually lead to better reasoning, or does it lead to the model confidently using irrelevant information it wouldn't have encountered in a smaller window?

Early evidence suggests GPT-5.4 handles this better than expected. But this is still an open question in a real way, and I'd be suspicious of anyone who tells you with confidence how it fully plays out.

What I Actually Did With It

I want to give you something concrete because I think "here is what this model scored on benchmarks" is only so useful.

I took the last three years of quarterly earnings calls from five tech companies Meta, Microsoft, Google, Amazon, Apple — transcribed and timestamped, roughly 900,000 words total. Fed the whole corpus into GPT-5.4 and asked it to identify every instance where an executive made a specific forward-looking claim about AI investment returns, then tell me how those claims aged quarter over quarter.

The result was not perfect. It missed a few things. It occasionally confused two executives with similar speaking styles. There were two moments where it summarized a position slightly stronger than what was actually said.

But it also identified a pattern I hadn't noticed: Google's Sundar Pichai has been notably more hedged about AI monetization timelines in his language since Q3 2024, while simultaneously becoming more specific about infrastructure numbers. That contrast confident about capex, careful about revenue is not something I would have easily spotted reading these calls one at a time over three years.

That's the kind of thing that's now practical to do in an afternoon.

The Honest Conclusion

I don't think the world ended when GPT-5.4 launched. The world didn't end when GPT-4 launched either. These things accumulate rather than rupture.

But I do think this release is quietly significant in a way that the noise around it obscures. It's not significant because of the context window number. It's significant because of what that number makes possible in combination with the capability level which is that a tool you can actually have a coherent analytical conversation with about very large bodies of information now exists, is accessible, and is getting cheaper every month.

That matters for some jobs more than others. It matters a lot for jobs that are essentially about synthesizing large amounts of text into insight analysts, researchers, lawyers, certain kinds of writers, certain kinds of consultants.

If your job is that kind of job, I think the honest thing to say is: pay attention. Not because GPT-5.4 is coming for you tomorrow. But because the version of this that comes out in 18 months is going to be a lot better than this one, and this one is already genuinely useful.

The context window expanding is almost beside the point. The capability sitting behind it is the part worth watching.

If this was useful, clap 50 times — it literally takes two seconds and it changes who Medium shows this to. And if you want the next one in your inbox, subscribe below.

I got tired of writing 30 lines of LangChain boilerplate every time. So I published a fix.

Aman Pandey — Fri, 06 Mar 2026 20:09:53 +0000

Every time I started a new project that needed RAG, I wrote the same 30 lines.

Load documents. Split them. Embed them. Store them. Build a retriever. Wire up a prompt template. Build a chain. Handle the response format. Add reranking later when results were bad. Add GraphRAG even later when cross-document queries failed. Add a watchdog when the index went stale.

Every single project From scratch Every time.

I got tired of it. So I built ragbox-core and published it to PyPI.

pip install ragbox-core

from ragbox import RAGBox

rag = RAGBox("./docs")
print(rag.query("What is the vacation policy?"))

3 lines. Everything else runs automatically.

What "automatically" actually means

When you point RAGBox at a folder, here's what runs without you touching it:

Document parsing — PDFs, text files, PowerPoints, Python files with AST parsing. It figures out the file type and routes accordingly.

Chunking — late chunking with context awareness, not naive 1000-token splits. The chunk boundary problem is real and most tutorials ignore it.

Embedding + FAISS indexing — Sentence-BERT embeddings, FAISS ANN index, TTL-cached so repeat queries hit cache instead of re-embedding.

Knowledge graph construction — the non-obvious one. RAGBox runs entity extraction on every document using an LLM, builds a Leiden-clustered knowledge graph, and persists it. This is what makes cross-document queries work.

Dual-mode routing — simple factual query goes fast path, skips the graph, ~12ms. Complex relationship or multi-hop query goes deep path: graph traversal, cross-encoder reranking, multi-query expansion.

Self-healing watchdog — background process watches the source folder. File changes? Re-chunks, re-embeds, updates the graph. Index never goes stale.

The thing that actually makes cross-document reasoning work

Most RAG tutorials give you vector search. Vector search is great for factual lookups.
It fails on questions like:

"Who does Maria Santos report to?" — requires connecting two documents
"What caused the Q4 revenue miss and who was responsible?" — requires 3+ documents
"How did the infrastructure outage relate to the deployment decision?" — requires causal reasoning across docs

Vector search retrieves the most semantically similar chunks. It doesn't reason about relationships between entities across documents. GraphRAG does.

Here's the honest benchmark result:

Relationship Questions (Cross-Document)

"Who does Maria Santos report to?"
  RAGBox:  0.767
  Vanilla: 0.959   ← vanilla wins here

"Which executive is responsible for both security and compliance?"
  RAGBox:  0.836
  Vanilla: 0.819   ← RAGBox wins here

Multi-Hop Questions (3+ Documents)

"Relationship between deployment strategy and the SEV1 incident?"
  RAGBox:  0.000
  Vanilla: 0.802   ← vanilla wins badly

"Plan to grow from $185M to $250M ARR?"
  RAGBox:  0.614
  Vanilla: 0.609   ← effectively tied

I published these results including the ones where RAGBox loses badly. Because if you're deciding whether to use a library, you need real numbers, not cherry-picked wins.

Honest summary: vanilla ChromaDB beats RAGBox on simple factual lookups and some multi-hop queries where graph extraction fails. RAGBox wins when the answer genuinely requires connecting entities across documents. Know what you're optimizing for.

The decisions that weren't obvious

Why Cross-Encoder reranking?

Bi-encoder similarity is fast but blunt — it scores query-document similarity in embedding space. Cross-encoders read the query and document together and produce a fine-grained relevance score. Slower, but dramatically more precise.

RAGBox uses bi-encoder for retrieval speed and ms-marco Cross-Encoder for reranking the top-k results. Wrong results at 5ms are worse than right results at 12ms.

Why Leiden instead of Louvain?

Leiden guarantees well-connected communities. Louvain can generate disconnected communities in practice. For document knowledge graphs, this shows up in multi-hop queries where the traversal path matters.

Why not just wrap LangChain?

I tried. When something goes wrong in a LangChain chain, the traceback is useless. RAGBox is a direct implementation — every component is inspectable, every failure has a clear source.

Why publish the comparison table that includes where you lose?

Because I'm a library user too. The COMPARISON.md in the repo has the full side-by-side including where LlamaIndex or LangChain is the right call. Use the right tool.

When to use this vs. when not to

Use RAGBox if:

You want a working RAG system today, not after three days of wiring LangChain
You need cross-document reasoning without building GraphRAG from scratch
You're building internal tools, prototypes, or MVPs
You want honest benchmarks you can reproduce yourself

Don't use RAGBox if:

You need custom retrieval pipelines with specific SLAs
You're building a commercial product and need to control every component
Your queries are purely simple factual lookups — vanilla vector search will be faster

Reproduce the benchmarks yourself

git clone https://github.com/ixchio/ragbox-core
cd ragbox-core
export GROQ_API_KEY="gsk_..."   # free tier works
python benchmarks/run_benchmark.py

15 questions across 8 interconnected documents. 5 factual, 5 relationship, 5 multi-hop. Scored with sentence-transformer cosine similarity. Real LLM calls, no mocks.

If you get different results, open an issue. I want to know.

→ pip install ragbox-core

→ github.com/ixchio/ragbox-core

→ pypi.org/project/ragbox-core

MIT license. PRs welcome. If it saves you the boilerplate, give it a star.