<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Vilius</title>
    <description>The latest articles on Forem by Vilius (@vystartasv).</description>
    <link>https://forem.com/vystartasv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F133303%2F50baa34e-e011-4576-8b1a-5974d272fc34.jpg</url>
      <title>Forem: Vilius</title>
      <link>https://forem.com/vystartasv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/vystartasv"/>
    <language>en</language>
    <item>
      <title>The Agentic Gap: Why a SharePoint Expert's Excitement Stopped Me Cold</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Mon, 04 May 2026 21:48:25 +0000</pubDate>
      <link>https://forem.com/vystartasv/the-agentic-gap-why-a-sharepoint-experts-excitement-stopped-me-cold-5267</link>
      <guid>https://forem.com/vystartasv/the-agentic-gap-why-a-sharepoint-experts-excitement-stopped-me-cold-5267</guid>
      <description>&lt;p&gt;I saw a SharePoint MVP's post recently. Genuine excitement. Markdown support had landed in SharePoint. Not a joke — real, earned enthusiasm from someone who knows their domain inside out.&lt;/p&gt;

&lt;p&gt;And I get it. In the SharePoint world, that's real progress. It matters for real users solving real problems. You don't become an MVP without expert knowledge and public recognition. He was right to celebrate.&lt;/p&gt;

&lt;p&gt;What stopped me wasn't his post. It was the contrast with myself — with what I used to get excited about, and what I'm working on now.&lt;/p&gt;




&lt;h2&gt;The Post I Would Have Written&lt;/h2&gt;

&lt;p&gt;Eighteen months ago, I'd have written that exact post. Same enthusiasm. Same well-earned expertise. Same genuine belief that this was meaningful progress. And I'd have been right — in the world I was living in.&lt;/p&gt;

&lt;p&gt;But I've spent the last year backing into a different way of working. Then, over a long weekend, I sat down and built the infrastructure for agents to work autonomously — agent loops, error recovery, quality gates, the unglamorous fabric that makes autonomous code trustworthy instead of terrifying. By Sunday night, those agents had scaffolded 111 SharePoint web parts and 5 backend services. Design, build, test. All local. No human hands on the code.&lt;/p&gt;

&lt;p&gt;Three days of tooling produced months of human output. But the output wasn't the impressive part — the steep learning curve was.&lt;/p&gt;

&lt;p&gt;The SharePoint MVP wasn't wrong. He was just in a different conversation. And that's the part that scared me.&lt;/p&gt;




&lt;h2&gt;What YouTube Doesn't Show You&lt;/h2&gt;

&lt;p&gt;Here's what no tutorial, TikTok, or conference talk prepares you for: the grind.&lt;/p&gt;

&lt;p&gt;Over those three days, something broke roughly every few hours. Not metaphorically — literally. You fix the macOS permissions so the agent can read files. Now it needs a gateway restart. Restart it and the model config turns out to be broken — empty model name, nothing works for two hours while you trace why. Fix that, and suddenly the build fails because the SCSS configuration is extending the wrong toolchain — it was written for Gulp, not Heft. Rewrite that, and the Yeoman scaffold generator silently ignores your CLI flags because a &lt;code&gt;.yo-rc.json&lt;/code&gt; exists from a previous run. Build a manual template script to bypass it. Now the directories are PascalCase and your pipeline expected kebab-case. Fix that. Now C++ native modules won't compile on Node 22. That's before you even get to the agents looping — repeating the same three broken commands until you harden the loop detection.&lt;/p&gt;
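
&lt;p&gt;For what it's worth, the loop detection that finally held was nothing clever: a sliding window over recent commands. A sketch of the idea, not the code I shipped:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch only: abort when the same command keeps recurring in a window.
from collections import deque

class LoopDetector:
    def __init__(self, window=9, max_repeats=3):
        self.recent = deque(maxlen=window)
        self.max_repeats = max_repeats

    def record(self, command: str) -&amp;gt; bool:
        """Return True when the session should be aborted."""
        self.recent.append(hash(command))
        return any(self.recent.count(h) &amp;gt;= self.max_repeats
                   for h in set(self.recent))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;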

&lt;p&gt;None of this is in a tutorial. You can't watch a video for it. You have to live through it — hands on, late at night, no shortcut.&lt;/p&gt;

&lt;p&gt;The memory file got patched again and again. Model preferences changed. The sync method moved from a Pi server to Git-based recovery. Configs were rewritten wholesale — &lt;code&gt;config.json&lt;/code&gt;, &lt;code&gt;sass.json&lt;/code&gt;, &lt;code&gt;tsconfig&lt;/code&gt;, ESLint rules, the entire pipeline script. Each fix revealed the next breakage. The pain point just moved — same problem, different file, new and creative way of failing.&lt;/p&gt;

&lt;p&gt;This is the unglamorous truth about building agent infrastructure: you're not engineering features. You're engineering resilience. Before it can build web parts autonomously, it has to survive the environment. Before you can trust it, it has to break in every possible way. There is no "prompt engineering" your way out of this. It's systems engineering, and it's dirty.&lt;/p&gt;




&lt;h2&gt;Two Conversations, Same Industry&lt;/h2&gt;

&lt;p&gt;Here's what I keep coming back to: these two things — celebrating markdown support and watching agents build entire applications autonomously — are happening in the same industry, on the same platform, to people with the same job title.&lt;/p&gt;

&lt;p&gt;That's not a criticism of anyone. It's a data point about how fast the ground is shifting.&lt;/p&gt;

&lt;p&gt;The gap isn't between smart people and slow people. It's between two entirely different models of what software development is becoming. In one model, we're incrementally improving the tools we already know. In the other, the tools are learning to use themselves.&lt;/p&gt;

&lt;p&gt;And you can be a genuine expert — someone with years of deep domain knowledge, public recognition, real achievements — and still be standing in the first room while the second one exists a few doors down.&lt;/p&gt;

&lt;p&gt;I almost was.&lt;/p&gt;




&lt;h2&gt;How I Almost Missed It&lt;/h2&gt;

&lt;p&gt;I'm not telling this story because I saw it coming. I didn't. I backed into it.&lt;/p&gt;

&lt;p&gt;I was a SharePoint developer. Not a machine learning engineer. Not an AI researcher. A developer who spent years learning the quirks of SPFx, the SharePoint Framework, because that's what the job demanded.&lt;/p&gt;

&lt;p&gt;What changed wasn't my intelligence or foresight. It was a simple question: "What if I stopped prompting AI and started architecting workflows for it?"&lt;/p&gt;

&lt;p&gt;That shift — from treating AI as a smart autocomplete to treating it as a team member with a defined role, quality gates, and an audit trail — was the door I walked through. Not because I was clever. Because I was curious, and slightly lazy, and the alternative was writing web part number 112 by hand.&lt;/p&gt;

&lt;p&gt;The methodology that emerged — I now call it Works With Agents, but the name doesn't matter — isn't complicated. It's just… different. Different enough that it creates a perception gap. And perception gaps are where the real opportunity lives.&lt;/p&gt;




&lt;h2&gt;What This Means (For All of Us)&lt;/h2&gt;

&lt;p&gt;Here's the uncomfortable part. The gap isn't closing. It's widening.&lt;/p&gt;

&lt;p&gt;The tools are getting better faster than the mental models are updating. By the time the average team lead internalises what Claude or Copilot can do today, the agents will have moved on to something else entirely. We're not in a technology adoption curve. We're in a fragmentation event.&lt;/p&gt;

&lt;p&gt;Three things I think are true, as of May 2026:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Your technical moat is thinner than you think.&lt;/strong&gt; If your competitive advantage is "we build features faster," a research loop — cron job, web search, LLM analysis, agent scaffold — can clone your feature set in a weekend. The moat is moving to compliance, trust, and domain relationships. Things that take months or years, not hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The bottleneck isn't code generation. It's verification.&lt;/strong&gt; When an agent can produce a thousand lines of code in seconds, the hard problem isn't "did it compile?" It's "did it do what I actually needed, safely, and can I prove that to an auditor?" Regulated industries feel this most acutely, but it's coming for everyone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The people who are "behind" aren't stupid. They're in a different room.&lt;/strong&gt; And most of us are in rooms we don't know about yet. The question isn't "am I ahead?" It's "what room am I in right now that already looks like markdown support to someone else?"&lt;/p&gt;




&lt;h2&gt;The Real Question&lt;/h2&gt;

&lt;p&gt;I don't have a tidy conclusion. The SharePoint MVP was right to be excited. Markdown in SharePoint is progress. But somewhere between his post and my screen, I realised that the measure of progress had fundamentally changed. Not gradually. Suddenly. And not everyone noticed.&lt;/p&gt;

&lt;p&gt;So the question I've been sitting with: what room am I in right now, feeling perfectly current, that already looks like markdown support from the outside?&lt;/p&gt;

&lt;p&gt;If you've got an answer — or if this made you uncomfortable — I'd genuinely like to hear it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>sharepoint</category>
      <category>software</category>
    </item>
    <item>
      <title>My AI Agents Kept Burning Tokens on Subagents That Can't Code — So I Built a Decision Gate</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Mon, 04 May 2026 17:12:31 +0000</pubDate>
      <link>https://forem.com/vystartasv/my-ai-agents-kept-burning-tokens-on-subagents-that-cant-code-so-i-built-a-decision-gate-2135</link>
      <guid>https://forem.com/vystartasv/my-ai-agents-kept-burning-tokens-on-subagents-that-cant-code-so-i-built-a-decision-gate-2135</guid>
      <description>&lt;p&gt;&lt;em&gt;By Vilius Vystartas | May 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I run 19 autonomous AI agents in production. They handle research, content, monitoring, deployment — the kind of always-on work that makes a solo developer's output look like a small team's.&lt;/p&gt;

&lt;p&gt;The delegation feature was supposed to be the multiplier. Spawn a subagent, give it a task, get results in parallel. In theory, it turns one agent into many. In practice, it was burning thousands of tokens for exactly zero output.&lt;/p&gt;

&lt;p&gt;The problem wasn't the agents. It was that nobody had taught them &lt;em&gt;when not to delegate&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;The Problem That Forced My Hand&lt;/h2&gt;

&lt;p&gt;Here's what happens when you ask a subagent to code something:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The subagent spawns, reads the context, starts working — looks promising&lt;/li&gt;
&lt;li&gt;It tries to write a file. The file operation fails silently. The subagent doesn't notice&lt;/li&gt;
&lt;li&gt;It tries again with a different approach. Same silent failure&lt;/li&gt;
&lt;li&gt;Six hundred seconds later: timeout. Zero output. Thousands of tokens gone&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The core issue is structural: subagents can't reliably write files, can't run builds, can't verify their own output. They're built for &lt;strong&gt;read-only work&lt;/strong&gt; — research, analysis, data gathering. But nothing in the agent's training tells it that. It just sees "task → delegate" and fires.&lt;/p&gt;

&lt;p&gt;I watched this happen dozens of times. Every failure was another chunk of the context window gone, another session wasted, another moment of wondering whether multi-agent workflows were fundamentally broken.&lt;/p&gt;

&lt;p&gt;They weren't. The delegation call just needed a bouncer at the door.&lt;/p&gt;




&lt;h2&gt;What I Built: Agentic Delegation&lt;/h2&gt;

&lt;p&gt;Agentic Delegation is a decision protocol that sits between your agent and its delegation tool. It has three layers:&lt;/p&gt;

&lt;h3&gt;1. The Decision Tree&lt;/h3&gt;

&lt;p&gt;Before any &lt;code&gt;delegate_task&lt;/code&gt; call, the protocol classifies the work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CODING → BLOCKED. Routed to write_file/patch/terminal (10x faster, 100% reliable)
RESEARCH → ALLOWED. But verified after completion, max 2 retries
UNKNOWN → DECOMPOSED. Broken into atomic subtasks first, then routed individually
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a hard rule, not a suggestion. The skill document literally says "NEVER VIOLATE" at the top of the coding section. If your agent ignores it and delegates coding anyway, there's a self-correction protocol that kicks in after the inevitable timeout.&lt;/p&gt;
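
&lt;p&gt;In spirit, the gate is just a classifier sitting in front of the tool call. A minimal sketch (the keyword sets here are illustrative, not the repo's actual lists):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch of the decision tree, not the shipped protocol.
CODING = {"implement", "write", "fix", "refactor", "build", "patch"}
RESEARCH = {"research", "find", "summarize", "analyze", "compare"}

def route(task: str) -&amp;gt; str:
    words = set(task.lower().split())
    if words &amp;amp; CODING:
        return "direct"      # BLOCKED: write_file / patch / terminal instead
    if words &amp;amp; RESEARCH:
        return "delegate"    # ALLOWED: verified after completion, max 2 retries
    return "decompose"       # UNKNOWN: split into atomic subtasks first
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;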

&lt;h3&gt;2. The Task Decomposer&lt;/h3&gt;

&lt;p&gt;Complex tasks get broken into atomic subtasks by a lightweight classifier — either your local LLM (free) or Gemini Flash (cheap cloud fallback). No dependencies beyond Python's stdlib.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;python3.11 scripts/decompose.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"Research GRPO training papers, write a summary, and add it to README"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Research GRPO training papers"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"delegate"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Write a summary of the findings"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"direct"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Update the project README"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"direct"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three subtasks. One delegated (the research). Two handled directly (the writing). No subagent ever touches a file.&lt;/p&gt;

&lt;h3&gt;3. The Validation Gate&lt;/h3&gt;

&lt;p&gt;Models hallucinate. Sometimes the decomposer labels a coding task as "delegate." The validation gate catches this with a hard keyword check and reassigns it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'[{"id":"1","description":"implement JWT auth","tool":"delegate"}]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | python3.11 scripts/decompose.py &lt;span class="nt"&gt;--validate-only&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"implement JWT auth"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"direct"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"verify"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"[FIXED: was delegate]"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The annotation is deliberate. It leaves a paper trail so you can see what the model wanted to do vs what the gate enforced.&lt;/p&gt;
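
&lt;p&gt;The gate itself is short enough to sketch in full. The JSON fields match the output above; the keyword list is an assumption:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the validation gate; field names mirror the JSON above.
CODING_KEYWORDS = ("implement", "write", "create", "fix", "refactor", "build")

def validate(subtasks):
    for task in subtasks:
        desc = task["description"].lower()
        mislabeled = (task["tool"] == "delegate"
                      and any(k in desc for k in CODING_KEYWORDS))
        if mislabeled:
            task["tool"] = "direct"
            task["verify"] = "[FIXED: was delegate]"  # the paper trail
    return subtasks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;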




&lt;h2&gt;Architecture&lt;/h2&gt;

&lt;p&gt;The protocol is surprisingly thin — under 400 lines total. The decision tree is a markdown file. The decomposer is a single Python script. The validation gate is a 20-line function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User gives agent a complex task
         │
         ▼
┌─────────────────────┐
│  Decision Tree      │  ← SKILL.md rules
│  Coding? → BLOCKED  │
│  Research? → ALLOW  │
│  Unknown? → SPLIT   │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Task Decomposer    │  ← decompose.py
│  Local LLM (free)   │
│  or Gemini Flash    │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Validation Gate    │  ← Hard rule check
│  No coding→delegate │
│  Fixed if violated  │
└────────┬────────────┘
         │
         ▼
    Route each subtask:
    direct → write_file / patch
    delegate → delegate_task (bounded)
    terminal → terminal()
    clarify → ask user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It runs as a Hermes skill that auto-loads when delegation triggers fire, or as a standalone Python tool. Either way, it adds about 200ms of overhead per delegation decision.&lt;/p&gt;




&lt;h2&gt;What I Learned&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. The delegation feature is a UI demo, not a production primitive.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It works in a 2-minute screen recording. In production, with real tasks and real context windows, it falls apart. The gap between demo and production is where all the work lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The right answer is usually "don't delegate."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After decomposing dozens of complex tasks, a pattern emerged: roughly 85% of subtasks should be handled directly by the main agent. Delegation is only the right call for bounded, read-only research tasks. Everything else is faster and more reliable via direct tool calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. A validation gate is worth more than a better prompt.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I spent time trying to engineer the perfect decomposition prompt — more examples, stricter formatting, longer system instructions. What actually worked was adding a 20-line validation function that just checks if a coding task got mislabeled and fixes it. Defensive engineering beats prompt engineering.&lt;/p&gt;




&lt;h2&gt;Get It&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/vystartasv/agentic-delegation" rel="noopener noreferrer"&gt;github.com/vystartasv/agentic-delegation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stack:&lt;/strong&gt; Python 3.11+, oMLX AgenticQwen-8B (local, free), Hermes Agent skills system
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install as Hermes skill&lt;/span&gt;
git clone https://github.com/vystartasv/agentic-delegation.git &lt;span class="se"&gt;\&lt;/span&gt;
  ~/.hermes/skills/software-development/agentic-delegation

&lt;span class="c"&gt;# Or use standalone&lt;/span&gt;
git clone https://github.com/vystartasv/agentic-delegation.git
python3.11 agentic-delegation/scripts/decompose.py &lt;span class="s2"&gt;"your task here"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The protocol is a direct implementation of the Agentic Flow methodology — ten patterns for working with AI agents, developed over months of running a 19-agent fleet. The delegation pattern is the one that saves the most tokens.&lt;/p&gt;

&lt;p&gt;Feedback welcome — especially from anyone else running multi-agent setups who's hit the delegation wall.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>opensource</category>
      <category>productivity</category>
    </item>
    <item>
      <title>My 19 AI Agents Kept Breaking Each Other — The 4 Tools That Fixed It</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Mon, 04 May 2026 15:07:50 +0000</pubDate>
      <link>https://forem.com/vystartasv/my-19-ai-agents-kept-breaking-each-other-the-4-tools-that-fixed-it-3559</link>
      <guid>https://forem.com/vystartasv/my-19-ai-agents-kept-breaking-each-other-the-4-tools-that-fixed-it-3559</guid>
      <description>&lt;p&gt;I run 19 AI agents on my machine. They wake up throughout the day to review code, publish content, check server health, research medical literature, and self-improve. Some run hourly. Some fire at 2am.&lt;/p&gt;

&lt;p&gt;For months they were reliable. Then I noticed the cracks.&lt;/p&gt;




&lt;h2&gt;The Moment I Realised It Was Broken&lt;/h2&gt;

&lt;p&gt;Three things happened in the same week:&lt;/p&gt;

&lt;p&gt;One agent updated a skill file and another overwrote it 30 seconds later with stale data. The skill file was now wrong — silently corrupted — and both agents continued as if nothing happened.&lt;/p&gt;

&lt;p&gt;A cron job tried to publish a blog post to dev.to. It needed an API key from 1Password. The agent sat there waiting for a fingerprint that would never come. The job failed. Then it tried again next tick. And the next. 17 consecutive failures before I noticed.&lt;/p&gt;

&lt;p&gt;Another agent was trying to read a project repository. Its local model has a 40K token context window. Someone had dumped &lt;code&gt;node_modules&lt;/code&gt;, &lt;code&gt;.git&lt;/code&gt;, and every log file into the prompt. The model couldn't see the actual code. It guessed. The output was nonsense.&lt;/p&gt;

&lt;p&gt;None of these were model problems. None were prompt problems. Every single one was an &lt;em&gt;infrastructure problem&lt;/em&gt; — the layer between the agent and its environment was missing.&lt;/p&gt;




&lt;h2&gt;What I Built: Four Infrastructure Tools&lt;/h2&gt;

&lt;p&gt;I spent a weekend building four single-purpose tools that handle the four categories of failures I kept seeing. Each tool is a Python package. Each does exactly one thing. Each has tests.&lt;/p&gt;

&lt;h3&gt;1. Agent State DB — So They Stop Overwriting Each Other&lt;/h3&gt;

&lt;p&gt;The problem: 19 agents, one filesystem. No coordination. When two agents modify the same file, last-write-wins, and the loser's changes evaporate silently.&lt;/p&gt;

&lt;p&gt;The fix: a SQLite database with WAL-mode concurrency that gives every agent a persistent identity, a run journal, versioned key-value state, advisory locks, and a coordination channel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;agent-state stats
&lt;span class="go"&gt;  Registered agents:  20
  Active runs:         2
  Completed runs:     47
  Failed runs:         8
  Active locks:        1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Agents now write to the DB before touching shared files. If they see a lock on &lt;code&gt;catalog.json&lt;/code&gt;, they wait. If they want to announce what they're working on, they call &lt;code&gt;agent-state coord working-on&lt;/code&gt;. Other agents can check before starting conflicting work.&lt;/p&gt;
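
&lt;p&gt;The advisory lock itself can be a single INSERT that either wins or raises. A sketch of the pattern with an assumed schema (the package's actual tables differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of advisory locking over WAL-mode SQLite; schema is assumed.
import sqlite3, time
from pathlib import Path

conn = sqlite3.connect(str(Path("~/.hermes/agent-state.db").expanduser()))
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("""CREATE TABLE IF NOT EXISTS locks
                (resource TEXT PRIMARY KEY, holder TEXT, acquired_at REAL)""")

def try_lock(resource: str, agent: str) -&amp;gt; bool:
    """The INSERT wins the lock; an IntegrityError means someone holds it."""
    try:
        with conn:  # atomic transaction
            conn.execute("INSERT INTO locks VALUES (?, ?, ?)",
                         (resource, agent, time.time()))
        return True
    except sqlite3.IntegrityError:
        return False  # catalog.json is locked; wait and retry
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;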

&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt; Python 3.11, SQLite WAL, Click CLI. 8 tests. MIT.&lt;/p&gt;

&lt;h3&gt;2. Credential Proxy — So They Can Get Passwords Without Fingers&lt;/h3&gt;

&lt;p&gt;The problem: password managers need a fingerprint, a master password, or a hardware key tap. Cron jobs have none of those. Any agent that needs an API key is dead on arrival.&lt;/p&gt;

&lt;p&gt;The fix: a local daemon that decrypts your credentials once at boot and serves them over a Unix socket. Agents call &lt;code&gt;get_credential("github.com")&lt;/code&gt;. No Touch ID. No popups.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;credential-proxy status
&lt;span class="go"&gt;  Daemon:    running (pid 85985)
  Socket:    ~/.hermes/credential_proxy/proxy.sock
  Credentials: 353 loaded
  Chrome import: auto-deleted after import
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything is Fernet-encrypted at rest. The socket, the database, and the master key are all &lt;code&gt;chmod 600&lt;/code&gt;. Nothing touches the network. It's a locked box in your house, not a cloud service.&lt;/p&gt;
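
&lt;p&gt;From the agent side, fetching a secret is a few lines of stdlib socket code. A client sketch, assuming a newline-delimited JSON wire format (the framing is my guess, not the documented protocol):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Client sketch; the socket path matches the status output above,
# but the request/response format is an assumption.
import json, socket
from pathlib import Path

SOCK = str(Path("~/.hermes/credential_proxy/proxy.sock").expanduser())

def get_credential(domain: str) -&amp;gt; str:
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(SOCK)
        s.sendall(json.dumps({"domain": domain}).encode() + b"\n")
        reply = json.loads(s.makefile().readline())
    return reply["secret"]

token = get_credential("github.com")  # no Touch ID, no popups
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;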

&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt; Python 3.11, Fernet (AES-128-CBC + HMAC-SHA256), Unix domain sockets, launchd. 24 tests. MIT.&lt;/p&gt;

&lt;h3&gt;3. Context Packer — So Local Models Can See What Matters&lt;/h3&gt;

&lt;p&gt;The problem: local models have small context windows (40K tokens max for Q4 quants). Dumping a whole repo — &lt;code&gt;node_modules&lt;/code&gt;, build artifacts, 42MB of logs — wastes 90% of the window on noise.&lt;/p&gt;

&lt;p&gt;The fix: a deterministic pre-cron script that takes a repo path and outputs a compact markdown blob of only the high-signal files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3.11 context_packer.py ~/Agent-Projects/agent-foundry
&lt;span class="go"&gt;  2,521 files scanned
  8 high-signal files packed
  12,847 characters (safe within budget)
  Priority: README.md, pyproject.toml, src/main.py, tests/
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It reads &lt;code&gt;AGENTS.md&lt;/code&gt;, &lt;code&gt;ARCHITECTURE.md&lt;/code&gt;, &lt;code&gt;README.md&lt;/code&gt;, prioritizes recently modified files, excludes &lt;code&gt;.git&lt;/code&gt;, &lt;code&gt;node_modules&lt;/code&gt;, &lt;code&gt;__pycache__&lt;/code&gt;, and &lt;code&gt;venv&lt;/code&gt;, and outputs a token-budgeted markdown document. Drop it as a pre-cron script and your local model suddenly sees the code it's supposed to work on.&lt;/p&gt;
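
&lt;p&gt;The scoring behind that can be plain &lt;code&gt;stat()&lt;/code&gt; arithmetic. An illustrative sketch (the weights are made up; the shipped values differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of stat-based scoring for the packer; weights are illustrative.
import time
from pathlib import Path

EXCLUDE = {".git", "node_modules", "__pycache__", "venv"}
PRIORITY = {"AGENTS.md", "ARCHITECTURE.md", "README.md"}

def score(path: Path) -&amp;gt; float:
    st = path.stat()
    recency = 1.0 / (1.0 + (time.time() - st.st_mtime) / 86400)  # age in days
    priority = 10.0 if path.name in PRIORITY else 0.0
    return priority + recency

def pack(repo: Path, budget: int = 15_000) -&amp;gt; str:
    files = [p for p in repo.rglob("*")
             if p.is_file() and not EXCLUDE &amp;amp; set(p.parts)]
    chunks, used = [], 0
    for p in sorted(files, key=score, reverse=True):
        text = p.read_text(errors="ignore")
        if used + len(text) &amp;gt; budget:
            continue  # over budget: skip, try smaller files
        chunks.append(f"## {p.relative_to(repo)}\n{text}")
        used += len(text)
    return "\n\n".join(chunks)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;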

&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt; Python 3.11, stat-based file scoring. MIT.&lt;/p&gt;

&lt;h3&gt;4. Cron Guard — So Failures Don't Cascade&lt;/h3&gt;

&lt;p&gt;The problem: a broken cron job fails every tick. If it runs hourly, that's 24 failures before you wake up and notice. Multiply by 19 jobs and one bad configuration means hundreds of silent failures.&lt;/p&gt;

&lt;p&gt;The fix: a pre-cron script that checks the last 3 runs of every job via the Agent State DB. Three consecutive failures → auto-pause + alert.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3.11 cron_guard.py
&lt;span class="go"&gt;  Checked: 20 jobs
  Healthy: 19
  Blocked: 1 (k6a-weekly — 3 consecutive failures)
  Pause instructions written to /tmp/cron_guard_blocked.json
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent that was failing 17 times in a row now stops itself after 3. I get an alert. I fix the root cause. It resumes. No more failure cascades.&lt;/p&gt;
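
&lt;p&gt;The check is one query against the run journal. A sketch with an assumed schema (the real one lives in the Agent State DB):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the three-strikes check; table and columns are assumed.
import json, sqlite3
from pathlib import Path

conn = sqlite3.connect(str(Path("~/.hermes/agent-state.db").expanduser()))

def should_pause(job: str, strikes: int = 3) -&amp;gt; bool:
    rows = conn.execute(
        "SELECT status FROM runs WHERE job = ? "
        "ORDER BY started_at DESC LIMIT ?", (job, strikes)).fetchall()
    return len(rows) == strikes and all(r[0] == "failed" for r in rows)

jobs = [j for (j,) in conn.execute("SELECT DISTINCT job FROM runs")]
blocked = [j for j in jobs if should_pause(j)]
Path("/tmp/cron_guard_blocked.json").write_text(json.dumps(blocked))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;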

&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt; Python 3.11, Agent State DB integration. MIT.&lt;/p&gt;




&lt;h2&gt;How They Work Together&lt;/h2&gt;

&lt;p&gt;The four tools are independent but designed to chain:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cron Guard&lt;/strong&gt; runs first — checks if the job should even proceed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent State DB&lt;/strong&gt; registers the run — the agent gets an identity and a run ID&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Packer&lt;/strong&gt; builds the prompt context — the model sees what matters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credential Proxy&lt;/strong&gt; serves API keys on demand — the agent authenticates&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All four are pre-cron scripts. They run before the model prompt is even sent. They're deterministic Python, not LLM calls. That's intentional — infrastructure should be boring and reliable.&lt;/p&gt;




&lt;h2&gt;What I Learned&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Agent failures are rarely model failures.&lt;/strong&gt; Every failure I debugged traced back to the environment: missing credentials, corrupted files, context overflow, no coordination. The models were fine. The scaffolding was missing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Shared state is the difference between a collection of scripts and a fleet.&lt;/strong&gt; Before the Agent State DB, my 19 agents were 19 independent processes that happened to run on the same machine. After, they're a system. They know about each other. They coordinate. They journal their own history.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Infrastructure should be boring.&lt;/strong&gt; None of these tools use AI. They're deterministic Python scripts. They run in milliseconds. They have tests. The more AI you put in your AI infrastructure, the more ways it can fail. Let the models be models. Let the plumbing be plumbing.&lt;/p&gt;




&lt;h2&gt;Get It&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent State DB:&lt;/strong&gt; &lt;a href="https://github.com/vystartasv/agent-state-db" rel="noopener noreferrer"&gt;github.com/vystartasv/agent-state-db&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credential Proxy:&lt;/strong&gt; &lt;a href="https://github.com/vystartasv/credential-proxy" rel="noopener noreferrer"&gt;github.com/vystartasv/credential-proxy&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Packer:&lt;/strong&gt; &lt;a href="https://github.com/vystartasv/agent-state-db" rel="noopener noreferrer"&gt;github.com/vystartasv/agent-state-db&lt;/a&gt; (bundled in &lt;code&gt;scripts/&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cron Guard:&lt;/strong&gt; &lt;a href="https://github.com/vystartasv/agent-state-db" rel="noopener noreferrer"&gt;github.com/vystartasv/agent-state-db&lt;/a&gt; (bundled in &lt;code&gt;scripts/&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All MIT licensed. Python 3.11. Install with &lt;code&gt;pip install -e .&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you're running multiple agents and hitting the same walls, I'd love to hear what you're building. Feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
      <category>devops</category>
    </item>
    <item>
      <title>Managing 150+ AI Agent Skills at Scale — What Broke, What I Built</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Mon, 04 May 2026 12:16:27 +0000</pubDate>
      <link>https://forem.com/vystartasv/managing-150-ai-agent-skills-at-scale-what-broke-what-i-built-1e73</link>
      <guid>https://forem.com/vystartasv/managing-150-ai-agent-skills-at-scale-what-broke-what-i-built-1e73</guid>
      <description>&lt;p&gt;&lt;em&gt;By Vilius Vystartas | May 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I run a lot of AI agents. Not chatbots — autonomous agents. Cron jobs that monitor my infrastructure every hour. Self-improvers that analyze past sessions and encode learnings. Delegated coders that build features while I sleep. Together they load from a library of 153 reusable skills — structured procedures that tell an agent how to do something specific, from sending iMessages to debugging SPFx builds.&lt;/p&gt;

&lt;p&gt;The system worked fine when I had 20 skills and one agent. It started breaking when the numbers climbed.&lt;/p&gt;




&lt;h2&gt;The Problem That Forced My Hand&lt;/h2&gt;

&lt;p&gt;Here's the setup: each skill lives as a &lt;code&gt;SKILL.md&lt;/code&gt; file in &lt;code&gt;~/.hermes/skills/&lt;/code&gt;. When an agent loads a skill and discovers it's broken, missing steps, or out of date, it records the problem in a shared &lt;code&gt;skill_gaps.jsonl&lt;/code&gt; file. Later, I review the gaps and fix the skills.&lt;/p&gt;

&lt;p&gt;This is fine when one agent writes to the file at a time.&lt;/p&gt;

&lt;p&gt;It stops being fine when three autonomous agents — say, a 2am cron job, a self-improvement loop, and a code review agent — all try to write to the same JSONL file within the same second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concurrent writes collide. Lines get truncated. Data vanishes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I lost track of which skills needed fixing. Agents kept loading broken skills silently because the gap reporting was unreliable. Worse, I had no search — finding "that one skill about PyPI releases" meant grepping a directory tree and hoping the frontmatter was consistent.&lt;/p&gt;

&lt;p&gt;The flat-file approach doesn't scale past a few dozen skills. I had 153.&lt;/p&gt;




&lt;h2&gt;What I Built: Skill Forge&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Skill Forge&lt;/strong&gt; is a SQLite-backed skill registry with quality gates, full-text search, and concurrent-safe writes. It replaces the broken JSONL pipeline with atomic transactions. It doesn't move your skills — it indexes them in place.&lt;/p&gt;

&lt;p&gt;Think of it as &lt;code&gt;pip&lt;/code&gt; for agent skills, but local-first, with validation before installation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;forge status
&lt;span class="go"&gt;
Skill Forge Registry Status
===========================
  Database: ~/.hermes/skill-forge/forge.db
  Total skills: 153

  By category:
    mlops: 12     devops: 8     creative: 15
    career: 3     research: 7   (uncategorized): 108

  Quality checks run: 306
  Skills with failures: 0 ✓
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Why SQLite?&lt;/h3&gt;

&lt;p&gt;Three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;WAL mode&lt;/strong&gt; — multiple agents can read and write simultaneously without locking each other out. Each agent gets its own connection with foreign-key enforcement. When two agents register different skills at the same time, both succeed. Atomic transactions, no corrupted state (see the sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;FTS5&lt;/strong&gt; — full-text search over name, category, description, and body content. Finding "that skill about PyPI release classifiers" is &lt;code&gt;forge search "pypi classifier"&lt;/code&gt; — instant, ranked results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Single file&lt;/strong&gt; — &lt;code&gt;forge.db&lt;/code&gt; in &lt;code&gt;~/.hermes/skill-forge/&lt;/code&gt;. No server process. No configuration. Backs up with &lt;code&gt;forge export&lt;/code&gt;. Portable.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
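
&lt;p&gt;A minimal sketch of that per-connection setup plus an FTS5 query (the schema here is illustrative, not Skill Forge's actual one):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the connection pragmas and a ranked FTS5 search.
import sqlite3
from pathlib import Path

conn = sqlite3.connect(str(Path("~/.hermes/skill-forge/forge.db").expanduser()))
conn.execute("PRAGMA journal_mode=WAL")   # concurrent readers, safe writes
conn.execute("PRAGMA foreign_keys=ON")    # per-connection, not global
conn.execute("""CREATE VIRTUAL TABLE IF NOT EXISTS skills_fts
                USING fts5(name, category, description, body)""")

# Roughly what `forge search "pypi classifier"` does under the hood:
hits = conn.execute(
    "SELECT name FROM skills_fts WHERE skills_fts MATCH ? ORDER BY rank",
    ("pypi classifier",)).fetchall()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;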

&lt;h3&gt;Quality Gates That Catch Real Problems&lt;/h3&gt;

&lt;p&gt;Before Skill Forge, broken skills went undetected until an agent loaded them mid-task and hit a wall. Now every skill runs through two validation passes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frontmatter validator&lt;/strong&gt; — catches missing YAML, absent required fields (name/description/version), and invalid semver strings. A skill with &lt;code&gt;version: "latest"&lt;/code&gt; gets flagged. One with &lt;code&gt;version: "1.2.3"&lt;/code&gt; passes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structure validator&lt;/strong&gt; — checks for required sections: a description block, trigger conditions, and usage steps. A skill that's just a title and a broken shell command fails. One with proper &lt;code&gt;## Trigger&lt;/code&gt;, &lt;code&gt;## Steps&lt;/code&gt;, and &lt;code&gt;## Pitfalls&lt;/code&gt; sections passes.&lt;/p&gt;
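
&lt;p&gt;The frontmatter pass is small enough to sketch in full (illustrative; the shipped validator handles more edge cases):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the frontmatter validator described above.
import re
import yaml  # PyYAML, already in the stack

SEMVER = re.compile(r"^\d+\.\d+\.\d+$")
REQUIRED = ("name", "description", "version")

def check_frontmatter(skill_md: str) -&amp;gt; list:
    parts = skill_md.split("---")
    if len(parts) &amp;lt; 3:
        return ["missing YAML frontmatter"]
    meta = yaml.safe_load(parts[1]) or {}
    errors = [f"missing required field: {f}" for f in REQUIRED if f not in meta]
    version = str(meta.get("version", ""))
    if version and not SEMVER.match(version):
        errors.append(f"invalid semver: {version!r}")  # flags "latest"
    return errors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;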

&lt;p&gt;The first run on my 153 skills: 102 passed, 51 flagged. The flagged ones weren't bugs — they were real quality issues I'd been ignoring. Skills missing version numbers. Skills with no trigger conditions. Skills where the "Steps" section was one garbled paragraph.&lt;/p&gt;

&lt;p&gt;I fixed 38 of them that afternoon. The other 13 are low-priority and tagged for later.&lt;/p&gt;

&lt;h3&gt;CLI Commands That Match the Workflow&lt;/h3&gt;

&lt;p&gt;Ten commands, each solving a specific pain point:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;forge import-hermes              &lt;span class="c"&gt;# First run: scan ~/.hermes/skills/, register everything&lt;/span&gt;
forge register &amp;lt;path&amp;gt;            &lt;span class="c"&gt;# Add a single skill&lt;/span&gt;
forge validate &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--name&lt;/span&gt; &amp;lt;n&amp;gt;]      &lt;span class="c"&gt;# Run quality gates on all or one skill&lt;/span&gt;
forge search &amp;lt;query&amp;gt;             &lt;span class="c"&gt;# FTS5 over name + description + body&lt;/span&gt;
forge list &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--category&lt;/span&gt; &amp;lt;&lt;span class="nb"&gt;cat&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;]&lt;/span&gt;    &lt;span class="c"&gt;# Filtered listing&lt;/span&gt;
forge status                     &lt;span class="c"&gt;# Health overview&lt;/span&gt;
forge inspect &amp;lt;name&amp;gt;             &lt;span class="c"&gt;# Full detail + quality check history&lt;/span&gt;
forge prune                      &lt;span class="c"&gt;# Remove stale entries (skill file deleted from disk)&lt;/span&gt;
forge &lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;-o&lt;/span&gt; &amp;lt;file&amp;gt;]         &lt;span class="c"&gt;# JSON dump for backups or analysis&lt;/span&gt;
forge watch &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--once&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--interval&lt;/span&gt; &amp;lt;s&amp;gt;]  &lt;span class="c"&gt;# Auto-reimport on changes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;watch&lt;/code&gt; command is the cron workhorse. Drop this in a 30-minute cron job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;forge watch &lt;span class="nt"&gt;--once&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It scans the skills directory, detects new/modified files (content hash, not timestamp), registers new ones, re-registers changed ones (version bump), and marks deleted skills as stale. One pass, everything synced.&lt;/p&gt;
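
&lt;p&gt;Content-hash detection in miniature (a sketch; the function names are mine):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of content-hash change detection behind `forge watch`.
import hashlib
from pathlib import Path

def content_hash(path: Path) -&amp;gt; str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def detect_changes(skills_dir: Path, known: dict) -&amp;gt; dict:
    """known maps str(path) to the last registered content hash."""
    seen, new, changed = set(), [], []
    for md in skills_dir.rglob("SKILL.md"):
        key = str(md)
        seen.add(key)
        if key not in known:
            new.append(key)             # register
        elif known[key] != content_hash(md):
            changed.append(key)         # re-register, bump version
    stale = [k for k in known if k not in seen]  # file deleted from disk
    return {"new": new, "changed": changed, "stale": stale}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;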

&lt;h3&gt;Architecture&lt;/h3&gt;

&lt;p&gt;The stack is deliberately minimal — Python 3.11, Click for the CLI, SQLite for storage, PyYAML for frontmatter parsing. No web framework, no message queue, no cloud dependency.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLI (forge)                        ← Click entry point
  ├── registry (SQLite + WAL)      ← skill index + metadata
  ├── importer                     ← scan ~/.hermes/skills/ → register
  ├── validator                    ← frontmatter + structure checks
  └── FTS5 index                   ← full-text search

Storage:  ~/.hermes/skill-forge/forge.db  (single file)
Skills:   ~/.hermes/skills/                (unchanged — indexed in place)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Skills stay as flat &lt;code&gt;SKILL.md&lt;/code&gt; files. Forge indexes them, validates them, searches them, and tracks their history — but it never moves or modifies them. Your existing automation continues working. Forge adds a layer on top.&lt;/p&gt;

&lt;h3&gt;Tests and Quality&lt;/h3&gt;

&lt;p&gt;89 tests. Full suite runs in 0.26 seconds. Covers registry CRUD, importer (Hermes scanner + content-change detection), validators (frontmatter + structure, edge cases like empty files and missing YAML delimiters), CLI integration (prune, export, watch), and concurrent-write scenarios.&lt;/p&gt;




&lt;h2&gt;What I Learned&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;SQLite with WAL mode solves the concurrent-agent problem cleanly.&lt;/strong&gt; You don't need Postgres or Redis for this. Connection-level pragmas (&lt;code&gt;PRAGMA journal_mode=WAL&lt;/code&gt;, &lt;code&gt;PRAGMA foreign_keys=ON&lt;/code&gt;) and atomic transactions are enough when your write volume is hundreds per hour, not thousands per second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quality gates catch real problems, not theoretical ones.&lt;/strong&gt; 51 of my 153 skills had issues I didn't know about — missing versions, malformed frontmatter, empty sections. Agents were loading these skills silently. The validator turned invisible problems into visible ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content-aware sync matters.&lt;/strong&gt; My first import skipped files that already existed in the registry by path. This meant I missed skills that had been modified but not renamed. Switching to content-hash comparison caught 12 modified skills on the next import.&lt;/p&gt;




&lt;h2&gt;Get It&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/vystartasv/skill-forge" rel="noopener noreferrer"&gt;github.com/vystartasv/skill-forge&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stack:&lt;/strong&gt; Python 3.11+, Click, SQLite + FTS5, PyYAML
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/vystartasv/skill-forge
&lt;span class="nb"&gt;cd &lt;/span&gt;skill-forge
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[dev]"&lt;/span&gt;
forge import-hermes
forge status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're running autonomous AI agents with persistent skill libraries — or if you're building agent infrastructure and wondering how to manage the growing pile of procedures — I'd love feedback on the schema design and quality gate approach.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
      <category>sqlite</category>
    </item>
    <item>
<title>Installing the AWS Elastic Beanstalk CLI on openSUSE</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Tue, 04 Jun 2019 20:04:15 +0000</pubDate>
      <link>https://forem.com/vystartasv/installing-aws-elastic-beanstalk-cli-on-opensuse-358e</link>
      <guid>https://forem.com/vystartasv/installing-aws-elastic-beanstalk-cli-on-opensuse-358e</guid>
      <description>&lt;p&gt;How to successfully install EB cli on OpenSuse you do need to install a few dev build libraries &lt;strong&gt;before&lt;/strong&gt; for make to succeed the build.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sudo zypper in gcc zlib-devel libffi-devel libopenssl-devel
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
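
&lt;p&gt;With those in place, the CLI itself installs from PyPI (the package name is &lt;code&gt;awsebcli&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# pip install awsebcli
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;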



&lt;p&gt;This should save openSUSE users a lot of trouble.&lt;/p&gt;

</description>
      <category>opensuse</category>
      <category>eb</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
