I Built a Video AI That Sees Like a Human - Not Like a Computer

hemanth kumar — Wed, 22 Apr 2026 00:41:41 +0000

Most video AI works like this:

Look at frame 1 → detect objects → done.
Look at frame 2 → detect objects → done.
Look at frame 3 → detect objects → done.

Each frame is independent. The system has no memory.
It doesn't know what happened a second ago.

That's like watching a movie with your eyes closed
between every frame. You see snapshots.
You miss the story.

I built something different.

Two layers running simultaneously on every video.

Layer 1 — Frame analysis. YOLOv8 looks at each
frame independently. Objects, people, dangerous
items. Fast. Accurate. No context.

Layer 2 — Sequence analysis. MobileNetV2 tracks
feature patterns across multiple frames. Motion
trends. Scene stability. Gradual changes. Context.

Here's why that matters:

A single frame tells you WHAT is there.
A sequence tells you WHAT IS HAPPENING.

A person standing still looks normal in any single
frame. But 50 frames later they're still in the
exact same spot — that's loitering.
Only sequence analysis catches that.
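
The loitering case can be sketched in a few lines. This is a minimal illustration, not the code from the repository: it assumes you already have one (x, y) centroid per frame for a tracked person (for example, from a detector plus a tracker), and the names `is_loitering`, `window`, and `max_drift_px` are hypothetical.

```python
from math import hypot
from typing import Sequence, Tuple

Point = Tuple[float, float]


def is_loitering(centroids: Sequence[Point], window: int = 50,
                 max_drift_px: float = 10.0) -> bool:
    """Return True if the last `window` centroids all stay within
    `max_drift_px` of the first centroid in that window."""
    if len(centroids) < window:
        return False  # not enough history to decide
    recent = centroids[-window:]
    x0, y0 = recent[0]
    return all(hypot(x - x0, y - y0) <= max_drift_px for x, y in recent)


# A person who barely moves for 50 frames vs. one walking across the scene.
standing = [(100.0 + (i % 2), 200.0) for i in range(50)]
walking = [(100.0 + 5 * i, 200.0) for i in range(50)]

print(is_loitering(standing))  # True
print(is_loitering(walking))   # False
```

A single-frame detector sees a person in both cases. Only the position history tells them apart.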

Here's the architecture that makes it work:

[Figure: pipeline architecture diagram]

I tested it on a real traffic video.

1,800 frames processed autonomously.
1,220 crowding events detected.
Zero high-severity false alarms.
Visual report generated and opened in browser
automatically when done.

No human reviewed a single frame.

Full code open source:
github.com/heManKuMAR6/video-analytics-pipeline

This is Project 6 in my series. And it's the
first one with zero LLMs — pure computer vision
and real-time systems.

Next week — what I learned building 6 agentic
and AI systems in one week and what I'd do
differently.

Subscribe if you want to follow along.

I built an open source LLM agent evaluation tool that works with any framework

hemanth kumar — Thu, 02 Apr 2026 21:36:24 +0000

Every team building AI agents hits the same wall.
You ship a LangChain agent. It works great in demos. Then it goes to
production and quietly starts hallucinating, calling the wrong tools,
or giving answers that have nothing to do with what it retrieved.

You don't find out until a user complains.

The root cause is simple: there's no standard way to evaluate agent
quality before and after every deploy.

Every framework has its own story:

LangChain has LangSmith, but it's a hosted SaaS with paid tiers and is built around the LangChain ecosystem
CrewAI has no eval tooling
AutoGen has no eval tooling
OpenAI Agents SDK has basic tracing but no scoring

If you switch frameworks, you rebuild your eval setup from scratch.
If you use multiple frameworks, you have no unified view.

This is the problem I set out to solve.

Introducing EvalForge

EvalForge is a framework-agnostic LLM agent evaluation harness. You
give it a trace JSON from any agent framework, it scores it on quality
metrics, and returns a pass/fail result your CI pipeline understands.

evalforge run --trace my_agent_run.json --metrics faithfulness

Output:

EvalForge v0.1
─────────────────────────────
Trace ID:   my-run-001
Framework:  langchain
Model:      gpt-4o
Agent:      research-agent
Steps:      4
Duration:   3421ms
─────────────────────────────
Scoring Results
─────────────────────────────
faithfulness     0.91   PASS
Reason: The answer accurately reflects the retrieved context.
─────────────────────────────
Overall: PASS

Exit code 0 = pass. Exit code 1 = fail. Plugs straight into any CI
pipeline.

How it works

Every agent run — regardless of framework — goes through the same
lifecycle:

User gives input
  → Agent thinks / plans
    → Agent calls tools
      → Agent produces final answer

EvalForge captures this in a simple universal trace format:

{
  "evalforge_version": "0.1",
  "trace_id": "run-001",
  "metadata": {
    "framework": "langchain",
    "model": "gpt-4o",
    "agent_name": "research-agent",
    "duration_ms": 3421,
    "total_tokens": 1820
  },
  "input": {
    "user": "What are the latest papers on LLM evaluation?",
    "system": "You are a helpful research assistant."
  },
  "steps": [
    {
      "step_id": 1,
      "type": "thought",
      "content": "I need to search for recent papers."
    },
    {
      "step_id": 2,
      "type": "tool_call",
      "tool": "web_search",
      "input": { "query": "LLM evaluation papers 2026" },
      "output": { "results": ["paper1", "paper2"] },
      "duration_ms": 890
    }
  ],
  "output": {
    "answer": "The latest papers on LLM evaluation include..."
  },
  "eval_hints": {
    "expected_tools": ["web_search"],
    "expected_answer": null,
    "context_documents": []
  }
}
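
To make the format concrete, here is a small sketch of how a scorer might pull the retrieved context out of a trace. It relies only on fields shown in the trace above (`steps`, `type`, `output`). The helper name `collect_tool_outputs` is hypothetical and is not part of EvalForge's API.

```python
import json
from typing import Any, Dict, List


def collect_tool_outputs(trace: Dict[str, Any]) -> List[Any]:
    """Gather the `output` of every step whose type is `tool_call`."""
    return [
        step["output"]
        for step in trace.get("steps", [])
        if step.get("type") == "tool_call" and "output" in step
    ]


raw = """
{
  "trace_id": "run-001",
  "steps": [
    {"step_id": 1, "type": "thought", "content": "I need to search."},
    {"step_id": 2, "type": "tool_call", "tool": "web_search",
     "input": {"query": "LLM evaluation papers 2026"},
     "output": {"results": ["paper1", "paper2"]}}
  ],
  "output": {"answer": "The latest papers include..."}
}
"""
trace = json.loads(raw)
print(collect_tool_outputs(trace))  # [{'results': ['paper1', 'paper2']}]
```

Thought steps are skipped because they carry no tool output, so only what the tools returned counts as context.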

Every major framework maps cleanly to this format. LangChain's
AgentAction becomes a tool_call. CrewAI's task results become
steps. AutoGen's conversation messages become thought entries.
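
As a rough sketch of the LangChain mapping, the snippet below turns (action, observation) pairs into `tool_call` steps. It assumes the pairs have the shape LangChain's AgentExecutor returns under `intermediate_steps` when `return_intermediate_steps=True`. `ActionStandIn` is a hypothetical stand-in for `AgentAction` so the example runs without LangChain installed, and `steps_from_intermediate` is an illustrative name. EvalForge's real adapter may work differently.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class ActionStandIn:
    """Stand-in with the same three fields as LangChain's AgentAction."""
    tool: str
    tool_input: Any
    log: str


def steps_from_intermediate(
    intermediate_steps: List[Tuple[ActionStandIn, Any]],
) -> List[Dict[str, Any]]:
    """Convert (action, observation) pairs into trace `tool_call` steps."""
    steps = []
    for step_id, (action, observation) in enumerate(intermediate_steps, start=1):
        steps.append({
            "step_id": step_id,
            "type": "tool_call",
            "tool": action.tool,
            "input": action.tool_input,
            "output": observation,
        })
    return steps


pairs = [(ActionStandIn("web_search", {"query": "LLM evaluation"}, ""),
          {"results": ["paper1", "paper2"]})]
steps = steps_from_intermediate(pairs)
print(steps[0]["tool"])  # web_search
```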

The scoring — LLM as judge

For v0.1 we ship faithfulness scoring.

Faithfulness asks: did the agent's final answer stay true to what
its tools actually returned?

If the tools returned facts A, B, C and the agent only used A, B, C
— high faithfulness.

If the agent invented D, E that weren't in the tool outputs — low
faithfulness. That's a hallucination.
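
The A, B, C versus D, E idea can be shown with a toy set comparison. This is only an illustration of the intuition, not how EvalForge scores: real answers are free text, so facts cannot be compared as set members, which is why an LLM judge is used. `toy_faithfulness` is a hypothetical name.

```python
from typing import Set


def toy_faithfulness(answer_facts: Set[str], context_facts: Set[str]) -> float:
    """Fraction of the answer's facts that also appear in the context."""
    if not answer_facts:
        return 1.0  # an empty answer asserts nothing unsupported
    supported = answer_facts & context_facts
    return len(supported) / len(answer_facts)


context = {"A", "B", "C"}
print(toy_faithfulness({"A", "B", "C"}, context))       # 1.0
print(toy_faithfulness({"A", "B", "D", "E"}, context))  # 0.5
```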

We score it using Claude as judge. The prompt:

You are evaluating whether an AI agent's answer is faithful 
to the context it retrieved.

Question: {question}
Retrieved Context: {context}
Agent's Answer: {answer}

Does the answer only use information from the retrieved context, 
without adding facts not present in the context?

Respond in JSON: {"score": 0.0-1.0, "reason": "explanation"}

Score >= 0.7 = PASS. Configurable with --threshold.
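
A minimal sketch of the step after the judge replies: parse the JSON, apply the threshold, and map the result to an exit code. It assumes the judge returns exactly the JSON object requested in the prompt. Real judges sometimes wrap it in extra text, which a production parser would need to handle. The function names are hypothetical.

```python
import json
from typing import Tuple


def parse_verdict(raw: str, threshold: float = 0.7) -> Tuple[float, bool]:
    """Parse the judge's JSON reply and apply the pass threshold."""
    data = json.loads(raw)
    score = float(data["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return score, score >= threshold


def exit_code(passed: bool) -> int:
    """Map a pass/fail result to a CI-style exit code (0 = pass, 1 = fail)."""
    return 0 if passed else 1


score, passed = parse_verdict('{"score": 0.91, "reason": "Grounded in context."}')
print(score, passed, exit_code(passed))  # 0.91 True 0

score, passed = parse_verdict('{"score": 0.65, "reason": "Adds an unsupported claim."}')
print(score, passed, exit_code(passed))  # 0.65 False 1
```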

Why Rust?

The core is written in Rust with a Python SDK wrapper.

Three reasons:

Speed — millisecond startup, no GIL bottleneck. Runs 1000 eval
cases in the time Python tools run 100.

Single binary — curl | sh install. No virtualenv, no
dependency hell in CI. One file that works on Linux, Mac, Windows.

Python SDK on top — users never think about Rust. They
pip install evalforge and write:

import evalforge

result = evalforge.run(
    trace="my_agent_run.json",
    metrics=["faithfulness"]
)

print(result.passed)              # True
print(result.metrics[0].score)   # 0.91
print(result.metrics[0].reason)  # "Answer stays within retrieved context"

Works with every major framework today

Framework	Language	Status
LangChain / LangGraph	Python	✅ v0.1
CrewAI	Python	✅ v0.1
AutoGen / AG2	Python	✅ v0.1
OpenAI Agents SDK	Python	✅ v0.1
Mastra	TypeScript	🔜 Planned
Vercel AI SDK	TypeScript	🔜 Planned

CI/CD integration

Add to your GitHub Actions workflow:

- name: Evaluate agent quality
  run: evalforge run --trace agent_run.json --metrics faithfulness
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Every PR now has an automatic quality gate on your agent. Merge only
when your agent passes.

What's coming

v0.2 — tool_accuracy, goal_completion, hallucination metrics
v0.3 — Native CI integrations (GitHub Actions marketplace)
v0.4 — JavaScript SDK + Mastra support
v0.5 — Auto trace capture from LangChain/CrewAI callbacks
v1.0 — Web dashboard + team collaboration

Update: What shipped since launch

A lot has happened since I first posted this. Here's what
EvalForge looks like today:

7 metrics now live

Metric	What it measures
`faithfulness`	Answer stays true to retrieved context
`tool_accuracy`	Agent used the right tools (deterministic)
`goal_completion`	Agent finished the task
`hallucination`	Agent made up facts
`g_eval`	Your custom rubric in plain English
`context_precision`	Was all retrieved context relevant?
`answer_relevance`	Is the answer actually about the question?
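
The table does not say how `tool_accuracy` is computed, so the following is one plausible deterministic definition rather than EvalForge's actual formula: the share of `eval_hints.expected_tools` that appear among the trace's `tool_call` steps. `toy_tool_accuracy` is a hypothetical name.

```python
from typing import Any, Dict


def toy_tool_accuracy(trace: Dict[str, Any]) -> float:
    """Fraction of `eval_hints.expected_tools` that the agent actually called."""
    expected = set(trace.get("eval_hints", {}).get("expected_tools", []))
    if not expected:
        return 1.0  # nothing was expected, so nothing was missed
    called = {
        step["tool"]
        for step in trace.get("steps", [])
        if step.get("type") == "tool_call" and "tool" in step
    }
    return len(expected & called) / len(expected)


trace = {
    "steps": [{"step_id": 1, "type": "tool_call", "tool": "web_search"}],
    "eval_hints": {"expected_tools": ["web_search", "calculator"]},
}
print(toy_tool_accuracy(trace))  # 0.5
```

Because no model is involved, the same trace always gets the same score, which is what "deterministic" means in the table.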

Framework adapters — no manual JSON needed

from evalforge.adapters import from_langchain
import evalforge

# `agent` is your existing LangChain agent (for example, an AgentExecutor)
result = agent.invoke({"input": "Your question"})
trace = from_langchain(result, model="gpt-4o")
eval_result = evalforge.run(trace, metrics=["faithfulness"])
print(eval_result.passed)

Supports LangChain, CrewAI, AutoGen, and OpenAI Agents SDK.

RunTrendAnalyzer — catch drift before users do

Four runs at 0.91 → 0.85 → 0.79 → 0.73 all pass
individually. EvalForge catches the regression:

evalforge trend --history results/ \
  --metrics faithfulness \
  --exit-on-regression
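
The post does not describe how the trend check is implemented. One simple way to catch this kind of drift, shown here as an assumption, is to fit a least-squares line through the scores and flag a slope that drops faster than an allowed amount per run. `slope`, `is_regressing`, and `max_drop_per_run` are hypothetical names.

```python
from typing import Sequence


def slope(scores: Sequence[float]) -> float:
    """Least-squares slope of scores against run index 0, 1, 2, ..."""
    n = len(scores)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(scores))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def is_regressing(scores: Sequence[float], max_drop_per_run: float = 0.02) -> bool:
    """Flag a regression when the fitted slope falls faster than the allowed drop."""
    return slope(scores) < -max_drop_per_run


history = [0.91, 0.85, 0.79, 0.73]
print(all(s >= 0.7 for s in history))  # True: each run passes on its own
print(round(slope(history), 2))        # -0.06
print(is_regressing(history))          # True
```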

JavaScript/TypeScript SDK

npm install evalforge

import { fromMastra, run } from 'evalforge';

const trace = fromMastra(result, { agentName: 'my-agent' });
const evalResult = run(trace, { metrics: ['faithfulness'] });

Defensible scoring — full audit log

Every --output JSON now includes:

method: "deterministic" or "llm_judge"
judge_model: exactly which model scored this
threshold: the exact value used
timestamp: UTC time of the run
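
A sketch of what such an audit record could look like when assembled in Python. The four field names come from the list above. The JSON layout, the ISO 8601 timestamp format, the helper name `audit_record`, and the placeholder model name are assumptions.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def audit_record(method: str, threshold: float,
                 judge_model: Optional[str] = None) -> Dict[str, Any]:
    """Build the audit fields attached to one metric result."""
    if method not in ("deterministic", "llm_judge"):
        raise ValueError(f"unknown method: {method}")
    return {
        "method": method,
        "judge_model": judge_model,  # None for deterministic metrics
        "threshold": threshold,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


record = audit_record("llm_judge", 0.7, judge_model="example-judge-model")
print(json.dumps(record, indent=2))
```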

Install and try

pip install evalforge
python3 -c "import evalforge; print(evalforge.demo())"

Or with npm:

npm install evalforge

GitHub: https://github.com/heManKuMAR6/evalforge

Would love to hear what metrics and frameworks matter
most to you — drop a comment below.

Try it now

git clone https://github.com/heManKuMAR6/evalforge
cd evalforge
cargo build --release

# Score a sample trace
cargo run -- run --trace tests/fixtures/sample_trace.json \
  --metrics faithfulness --mock

Or with Python:

pip install evalforge

evalforge run --trace my_trace.json --metrics faithfulness

The repo is at https://github.com/heManKuMAR6/evalforge — MIT
license, contributions welcome.

Would love feedback on:

What metrics matter most to you in production?
What frameworks should we prioritize next?
What does your current eval setup look like?

If this solves a problem you have, a GitHub star helps others find it.

Forem: hemanth kumar