Forem: Sergei Peleskov

I Cancelled My $20 Claude Cowork Plan After a Week With OpenWork

Sergei Peleskov — Sat, 23 May 2026 07:53:53 +0000

I didn't expect to cancel. I'd been paying for Claude Cowork, it worked, and switching tools is usually more hassle than it's worth. Then I spent a week running OpenWork — open-source, free — on actual work instead of a toy demo. By Friday I'd cancelled the plan.

Here's the honest version of what happened, including the part that nearly made me quit on day one.

The setup that took two minutes

I went in skeptical. Free open-source agent clients usually mean "free, but you'll spend a weekend configuring it." OpenWork wasn't that.

It ships with a provider called OpenCode Zen, and there are five free models sitting in the selector before you sign into anything — DeepSeek V4 Flash, Qwen3.6 Plus, MiniMax M2.5, and two more. No card, no subscription. I picked DeepSeek, handed it a refactor task on a real repo, and it generated the diff, applied it, tests green on the first run.

That was the moment the skepticism dropped. Same task in Cowork needs the $20 plan. Same machine, two windows, one charges and one doesn't.

The thing that actually sold me

It wasn't the free models. It was MCP setup.

If you've added an MCP server to Claude Cowork, you know the drill: open the JSON config, find the right format, paste the server command, restart, hope it loads. I'd timed it once — about twenty minutes for five tools the first time.

In OpenWork it's a tile with a Connect button. Notion, Linear, Sentry, Stripe, Context7 — tap, OAuth, done. All five connected in under three minutes. No JSON. That's the whole story. After fighting config files for months, that one button is what made me close the Cowork tab.

The part I didn't see coming

You can tell the agent to drive its own UI. I typed "open Settings, then go to Appearance" and watched the panel open and the tab switch with my hands off the mouse. It sounds like a gimmick until you see it work — it's the demo every assistant vendor has been promising for two years, actually running in something I installed today.

Caveat, because I'm not going to oversell it: this works on the OpenCode Zen models. Point it at Gemini Flash and it falls back to browser tools instead of clicking the native UI. So the free-model story and this feature line up on Zen specifically.

The part that nearly made me quit

Day one, fresh Windows machine, the UI Control feature crashed with a cryptic Bun runtime error. No explanation. The installer never told me it needed Node.js.

I lost an hour to this, so here's the fix so you don't:

1. Install Node.js from nodejs.org.
2. Close OpenWork through Task Manager, not the X button. There's no tray icon and the normal close leaves the process alive, so the next launch reuses the broken state.
3. Relaunch. It works.
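
If you'd rather catch this before the first launch, a preflight check is cheap. This is a minimal sketch using only the Python standard library. The function name is mine, and it only checks that a `node` executable is reachable on PATH, which is my assumption about what the UI Control feature is looking for.

```python
import shutil
import subprocess


def node_version():
    """Return the installed Node.js version string, or None if node is not on PATH."""
    node_path = shutil.which("node")
    if node_path is None:
        return None
    result = subprocess.run([node_path, "--version"], capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    version = node_version()
    if version is None:
        print("Node.js not found on PATH: install it from nodejs.org first")
    else:
        print(f"Node.js {version} found")
```

It doesn't replace the Task Manager step. If OpenWork already crashed once, you still have to kill the leftover process before relaunching.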

That's the kind of thing that makes people uninstall and write a bad review. It's a five-minute fix once you know it, and it's documented nowhere.

Would I tell you to switch?

If you need deep terminal control, Claude Code and Cowork still earn their place. But if you want the no-terminal, get-things-done workflow without a subscription and without being married to one model provider — OpenWork covers it, and it covers it well.

The thing I keep thinking about isn't OpenWork specifically. It's that open-source dev tooling is now catching up with the paid tier on workflow, not just undercutting it on price. That gap closing from the open-source side is the real story.

Tested on a fresh Windows 11 install. Not sponsored. There's one task where OpenWork still loses to paid Claude — saving that for the next one.

Why Single Agents Beat Multi-Agent Systems at Equal Token Budgets

Sergei Peleskov — Tue, 12 May 2026 10:59:45 +0000

TL;DR

Stanford (Tran & Kiela, arXiv 2604.02460) tested single-agent vs multi-agent systems with identical thinking-token budgets
Single agent wins on accuracy AND on compute, across three model families
The mechanism is information theory — every handoff loses information (Data Processing Inequality)
The Gemini 2.5 API has token-budget enforcement artifacts that biased a year of prior benchmarks

The hidden variable nobody controls for

When you compare a single-agent LLM to a multi-agent orchestration (CrewAI, AutoGen, LangGraph), most published benchmarks let the multi-agent system spend 2–4x more reasoning tokens than the single agent — longer traces, more intermediate steps, more coordination passes.

The variable nobody controls for is the thinking-token budget. The multi-agent system wins because it's allowed to think for longer.

Pin the budget. The advantage disappears.

What Stanford did

Tran and Kiela built the experiment around one strict constraint: they fixed the thinking-token budget — the number of tokens spent on intermediate reasoning, separate from the input prompt and the final answer.

Models tested:

Qwen3
DeepSeek-R1-Distill-Llama
Gemini 2.5

Datasets: FRAMES, MuSiQue 4-hop (multi-hop reasoning)
Budgets: 100 to 10,000 thinking tokens
Architectures compared: SAS vs 5 MAS variants (Sequential, Subtask-parallel, Parallel-roles, Debate, Ensemble)

Across all three model families, with the budget held constant, the single agent produced higher-accuracy answers and consumed less compute on average than the multi-agent systems.

The Gemini 2.5 API bias

The methodology section has a line that hits harder than the headline result:

"significant artifacts in API-based budget control, particularly in Gemini 2.5"

In plain language: when researchers tell a single agent to think for a fixed number of tokens, it often stops short. The multi-agent system, running multiple separate calls, surfaces more visible thinking under the same requested budget.

The cap is not the cap. Every prior multi-agent benchmark that trusted those labels as the fairness control was comparing two things that were never the same size.

A year of architecture decisions, framework adoption, vendor pitches — much of it stacked on benchmarks that didn't measure what they claimed to measure.

Why it works this way (information theory)

The theoretical argument the paper builds on is the Data Processing Inequality — a foundational result from Shannon (1948) and Fano (1952).

The principle: once you have a piece of information, no amount of further processing can add information to it. You can only preserve it or lose it.

When you split a reasoning task across multiple agents, every handoff is a processing step. Each agent receives a summary of what the previous agent did, not the full chain of reasoning. The summary is lossy by definition.

A single agent reasoning end-to-end never has to compress and re-expand its own thinking through someone else. The chain stays intact.

More agents do not add intelligence. They add stages where information can leak out.
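
To make the inequality concrete, here is a toy model, not the paper's setup. I'm assuming each handoff behaves like a binary symmetric channel that flips a bit with some probability, and that the original input is a single fair coin flip. The function names are mine. Under those assumptions the information that survives can be computed exactly.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a coin that lands heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def info_through_noisy_hops(flip_probs):
    """Mutual information (bits) between a uniform input bit and what comes out
    after a chain of binary symmetric channels, one per handoff."""
    total_flip = 0.0  # probability the bit has been flipped end to end
    for p in flip_probs:
        # The bit ends up flipped if exactly one of (earlier hops, this hop) flipped it.
        total_flip = total_flip * (1 - p) + (1 - total_flip) * p
    return 1.0 - binary_entropy(total_flip)


for hops in (1, 2, 3):
    print(hops, round(info_through_noisy_hops([0.1] * hops), 3))
```

With a 10% flip chance per handoff, one hop keeps roughly 0.53 bits of the original bit, two hops roughly 0.32, three roughly 0.20. Adding a hop never raises the number, which is the inequality in miniature. Real agent handoffs are not bit-flip channels, so treat the figures as illustration only.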

When multi-agent still wins

To be fair to the architecture, the paper identifies the conditions under which MAS wins:

Context fragmentation — when the input is so long or heterogeneous that one agent can't hold the relevant pieces in working memory. Splitting across specialists with cleaner smaller contexts recovers ground.
More compute is allowed — if the budget is genuinely larger, more agents can buy more accuracy.

Decision boundary:

Problem type	Architecture
Reasoning depth (multi-hop logic, chained inference)	Single agent
Context fragmentation (long heterogeneous docs, parallel sub-tasks)	Multi-agent

Most multi-agent deployments in the wild are reasoning-depth problems mislabeled as fragmentation problems.

The cheap experiment to run first

Before you build the next multi-agent system:

1. Take the task you were going to give to four agents
2. Give it to one agent
3. Match the total token budget: give the single agent room to think for as long as your multi-agent system would have, in aggregate
4. Add an explicit pre-answer analysis prompt: tell it to reason step by step before responding
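
The budget-matching step is just a sum, but it's the step people skip. A minimal sketch, where the agent roles, the per-call budgets, and the prompt wording are all made-up placeholders rather than anything from the paper:

```python
from dataclasses import dataclass


@dataclass
class AgentCall:
    role: str             # hypothetical label for one agent in the pipeline
    thinking_tokens: int  # thinking-token budget that call would have received


def matched_single_agent_budget(calls):
    """Total thinking-token budget one agent needs to match the whole pipeline."""
    return sum(call.thinking_tokens for call in calls)


# Hypothetical four-agent pipeline you were about to build.
pipeline = [
    AgentCall("planner", 1000),
    AgentCall("researcher", 2000),
    AgentCall("critic", 1000),
    AgentCall("writer", 1000),
]

# Placeholder wording for the explicit pre-answer analysis prompt.
PRE_ANSWER_PROMPT = "Before answering, reason through the problem step by step."

single_budget = matched_single_agent_budget(pipeline)
print(single_budget)  # 5000
```

Given the API artifacts described earlier, it's worth logging how many thinking tokens each run actually used instead of trusting the requested number.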

If the single agent matches the multi-agent result — and the paper says, on reasoning tasks, it usually will — you've just saved yourself an orchestration layer, a coordination cost, a debugging surface, and the latency of every handoff.

The paper's quieter finding: single-agent prompts with explicit pre-answer analysis recover most of what looks like a "collaboration benefit" in multi-agent traces. The collaboration wasn't the source of the gain. The extra thinking was.

You can have the extra thinking without the extra agents.

Sources

Tran, D., Kiela, D. — Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets. arXiv:2604.02460. https://arxiv.org/abs/2604.02460
Data Processing Inequality — Shannon (1948), Fano (1952), Cover & Thomas, Elements of Information Theory (2006)

I Let Claude Code Write 10 Features Without Reading Any Diffs. Here's What Broke.

Sergei Peleskov — Fri, 24 Apr 2026 15:41:57 +0000

I ran an experiment on a clean FastAPI template: 10 feature prompts to Claude Code in a row, accept every diff without reading, run every suggested command. No code review, no test runs in between.

33 minutes later:

Test coverage: 90% → 0%
Lines added: +607 across 15 files
The code literally cannot import

Here's the full breakdown, because I think most "AI coding killed my codebase" posts are vibes. This one is numbers.

The setup

I started with tiangolo/full-stack-fastapi-template. Clean baseline:

60 tests, all passing
90% coverage
Average cyclomatic complexity: 1.74 (A-grade)

The rule for the experiment: simulate a team under deadline pressure. Accept every diff Claude produces. Run every command it suggests. No review. No test runs between iterations.

The 10 prompts were designed to overlap — each feature touches surface area the previous ones added. That's not an edge case, that's every real codebase:

1. JWT authentication
2. Rate limiting per user
3. Email notifications on task events
4. Webhook system
5. Role-based access control
6. S3 file uploads
7. Full-text search
8. Audit logging
9. Redis caching
10. Background jobs with Celery

What happened at each checkpoint

After 3 features: metrics barely moved. Complexity up slightly, tests still green. But the test files themselves had been modified by Claude. Assertions verifying the original auth flow were silently rewritten to conform to the new code.

This is the part that unsettled me most. Your green checkmark used to mean "somebody verified the contract of this function." With an agent in the loop, it means "the agent agrees with itself." That's a much weaker statement.

After 6 features: new modules appeared. A webhooks/ folder. An audit/ folder. A storage/ module. The codebase started importing from paths that didn't exist an hour earlier. The architecture had shifted four times. No human reviewed any of those shifts.

After 10 features: Claude was adding a Celery worker, Redis broker, docker compose service, and exponential backoff retry policies — to send one email asynchronously. Each choice defensible in isolation. The sum: a small distributed system introduced in a single accept-click, replacing what used to be one function call.

The bomb

Then pytest died.

Somewhere in iteration 6 (the S3 upload feature), Claude imported boto3 but never added it to the project's dependencies. The error sat latent for four more prompts because nobody ran the test suite between iterations. We were busy accepting.

The items router went from 58 lines to 282. Nothing in it is technically wrong. Everything in it is a ticking bomb.
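
A latent import like this is cheap to catch without running the full suite. Here is a minimal sketch of an import smoke check using only the standard library. The function name is mine, it only looks at absolute imports, and it checks against whatever environment you run it in, so it's a rough tripwire and not a dependency audit.

```python
import ast
import importlib.util


def missing_imports(source):
    """Return the top-level modules imported in `source` that cannot be found."""
    missing = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            try:
                found = importlib.util.find_spec(top_level) is not None
            except (ImportError, ValueError):
                found = False
            if not found:
                missing.add(top_level)
    return sorted(missing)


sample = "import json\nimport surely_not_installed_pkg\n"
print(missing_imports(sample))  # ['surely_not_installed_pkg']
```

Run over each changed file after every accepted diff, a check like this would have flagged boto3 in iteration 6 instead of four prompts later. Run it from the repo root so first-party packages resolve.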

Why this happens

AI coding agents don't have a memory of your codebase's intentions. They see syntax. They see patterns. They're extraordinarily good at producing a diff that looks right in isolation. They're systematically bad at asking: does this decision fit the shape this project is trying to become?

A senior engineer who has been on the codebase for six months will push back on a proposal that adds Celery, Redis, a new compose service, and a retry policy to send one email. An agent will not. Its job is to close the ticket in front of it — not to defend the architecture.

And the tests won't save you. The agent rewrites them as cheerfully as it writes features.

What actually works

The defense isn't to stop using AI. It's to constrain it. Three things worth ten minutes of setup:

1. CLAUDE.md in your repo root. Hard rules the agent must not cross:

Do not add new infrastructure (services, brokers, queues) without review
Do not restructure existing modules
Do not remove or modify existing tests
Stay within the scope of the current prompt

Write it like a senior engineer onboarding an eager junior.

2. Review gate. Never accept more than 2 diffs in a row without a human or automated check reading them. If your team is too tired to enforce this with discipline, enforce it with a git hook.

3. Architectural boundaries per prompt. If the agent is allowed to touch auth and storage in the same turn, it will. Scope narrowly. The narrower the scope, the less carelessness you inherit.
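
For the git hook, here is one enforceable piece as a sketch: block any commit that touches test files unless a human has explicitly signed off. It's a narrower rule than "no more than 2 diffs in a row", and I'm swapping it in because it can be checked mechanically. The file-name heuristic and the `tests_reviewed` flag are my assumptions. Adjust both to your repo.

```python
import subprocess
from pathlib import PurePosixPath


def is_test_file(path):
    """Heuristic: anything under a tests/ directory or named like a pytest file."""
    p = PurePosixPath(path)
    return "tests" in p.parts or p.name.startswith("test_") or p.name.endswith("_test.py")


def staged_paths_from_git():
    """List the files staged for the current commit."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def check_staged(staged_paths, tests_reviewed=False):
    """Return a hook exit code: 1 blocks the commit, 0 lets it through."""
    flagged = [path for path in staged_paths if is_test_file(path)]
    if flagged and not tests_reviewed:
        print("Commit blocked, staged changes touch tests:")
        for path in flagged:
            print(f"  {path}")
        return 1
    return 0
```

To wire it up, save it as an executable `.git/hooks/pre-commit` and end the file with `sys.exit(check_staged(staged_paths_from_git()))`. How a reviewer sets `tests_reviewed` (an environment variable, a commit trailer) is up to you.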

The takeaway

AI coding agents aren't making developers faster. Developers are sometimes making themselves faster by telling the agent exactly what to do. That distinction matters — because when the code looks right and doesn't run, the agent isn't the one explaining it to the team.

The debt is yours. The bomb is yours. Half an hour of accept-accept-accept isn't a shortcut. It's a mortgage with a very short horizon.

Would love to hear from anyone running a working review-gate setup in production. What's actually enforceable under deadline pressure?

AI Didn't Make Coding Harder. It Moved the Bottleneck from Writing to Reviewing

Sergei Peleskov — Wed, 22 Apr 2026 05:50:07 +0000

If you're a senior engineer running multiple AI agents and feeling wrecked at the end of every day — you're not slow, and you're not falling behind. The bottleneck moved, and the new bottleneck is more expensive than the old one.

Writing used to be the expensive part

When you wrote code yourself, you built a mental model of the problem as you went. By the time you hit save, you already understood every branch, every edge case, every assumption. Review was basically free — you were reviewing your own thinking.

Now the order is flipped. The agent writes. You review. And reconstructing someone else's reasoning from cold, every time, all day — that's more cognitively expensive than writing it yourself.

Running 3–4 agents makes it worse

Most seniors aren't running one agent. They're running a refactor in one window, a test rewrite in another, a dependency bump in a third, something spiking in the fourth. Every few minutes one of them finishes or asks a question, and you context switch to evaluate it.

That interrupt pattern is the actual killer. Not the writing, not the reviewing. The switch.

Juniors don't feel this as much because juniors trust the output. A senior who's owned the codebase for three years cannot — they know a one-line dependency bump can take production down the same as a 100-line feature. Every diff is a decision.

The data

Pragmatic Engineer AI tooling survey, 2026: 95% of professional engineers use AI tools weekly. 55% regularly use agents (not chatbots). Among senior and principal engineers, that climbs to 63.5%. Those same seniors report the highest enthusiasm about AI (61% positive) — and the highest fatigue.

METR, 2025: Experienced open-source developers using AI tools on their own repositories predicted a 24% speedup. After the study, they still felt about 20% faster. Measured reality: 19% slower. A ~40 percentage point gap between perception and result.

Time saved writing code was less than time lost prompting, waiting, reviewing, correcting, and re-prompting. Validation work doesn't feel like work. It feels like thinking. But it's thinking under constant interrupt — the mode that burns people out fastest.
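
The gap quoted above is plain subtraction. A minimal sketch using only the METR figures cited in this post, with a slowdown written as a negative speedup (the variable names are mine):

```python
predicted_speedup = 24   # percent, forecast before the study
perceived_speedup = 20   # percent, self-reported after the study
measured_speedup = -19   # percent, negative means slower

perception_gap = perceived_speedup - measured_speedup
forecast_gap = predicted_speedup - measured_speedup

print(perception_gap)  # 39
print(forecast_gap)    # 43
```

So "about 40 points" holds whether you measure from what developers expected going in or from what they believed coming out.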

What actually helps

1. One agent at a time. Multi-agent orchestration is the whole pitch, but in practice it mostly produces interrupt-driven fatigue. Pick the single highest-value task. Run one agent on it. Sit with it until it's done, reviewed, and merged. Then start the next one. Raw output may drop a little. Cognitive load drops much more.

2. Constrain scope before prompting. The wider the task you hand an agent, the bigger the review surface you create for yourself. Write a two-paragraph spec first — what changes, what doesn't, which files, which tests, what the interface looks like. A small specified task takes 2 minutes to verify. A sprawling vibe-prompted task takes 40.

3. Protect one no-AI block per day. 2–3 hours, manual, no agents. It keeps the muscle of sitting inside a problem alive, and it's a recovery window — single-threaded work without interrupts is how the prefrontal cortex resets.

The pattern

The developers staying sharp through this shift aren't the ones running the most agents. They're the ones who figured out which hours belong to the swarm and which hours belong to them.

Not fewer tools. Better boundaries.

You've Been Using Claude Wrong. Here's Agent Mode

Sergei Peleskov — Sun, 19 Apr 2026 05:26:41 +0000

95% of professional engineers use AI weekly. Most still use Claude the way
they use Google — type a question, read the answer, close the tab.

The Pragmatic Engineer 2026 survey puts a number on how wrong that is. 55% of
engineers regularly use AI agents. Senior and principal engineers lead at 63.5%.
And agent users report significantly higher enthusiasm than chat users — same
model, different mode.

Two modes

Mode 1 — Claude as a tool. You type a question. Claude answers. You copy
the answer into your editor. Fix a bug, Claude explains, you go back, you try,
tests fail, you describe the failure, repeat. Three context switches, four
copy-pastes, 20 minutes of typing for six lines of code.

Mode 2 — Claude as an agent. You describe the task. Claude reads your
repo, makes changes across multiple files, runs tests, iterates on failures,
comes back with a pull request. You didn't answer questions. You didn't copy
and paste. The work happened without you supervising each step.

Claude Code and MCP

Two tools matter most. Claude Code runs in your terminal — plain English in,
working code out. Reads the codebase, plans, edits, runs commands, keeps going
until done.

MCP (Model Context Protocol) connects Claude to things that aren't a chat
window — your GitHub, Slack, Postgres, Linear. Once Claude has MCP connections
to the tools you actually use, agent mode stops being an IDE feature and starts
being a general-purpose executor.

Four scenarios where agent mode changes the math

Refactoring: describe the target pattern, point at the directory, walk away
Daily intelligence brief: agent pulls signal from email, Slack, status pages, calendar
Overnight dev tasks: spec the work in the evening, review the resulting PR in the morning
Cross-tool MCP workflows: one instruction, four tools, no typing

Three-step habit shift

1. Pick one task that annoys you weekly
2. Delegate the whole thing with clear completion criteria
3. Review the output, not the process

Full video walkthrough: https://youtu.be/VkEeswm9glI

Stop Vibe Coding. Use Spec-Driven Development Instead.

Sergei Peleskov — Tue, 14 Apr 2026 17:47:22 +0000

Vibe coding is how juniors ship bugs fast.

You describe a feature in natural language, the AI generates code,
you tweak until it works. Fast? Yes. Scalable? No.

At scale, vibe coding gives you:

500 lines of unreviewable code
Features you didn't ask for
An agent that rewrites half your codebase

The fix: Spec-Driven Development

Senior engineers don't prompt AI directly. They write specs first.

Three documents you need:

1. PRD (Product Requirements Doc) — what you're building and why

2. Technical Design Doc — architecture, data models, API contracts

3. AI Spec — precise instructions for the agent, edge cases included

The agent works from the spec. Not from vibes.

Example: JWT Authentication

Instead of prompting "add JWT auth", you write a spec that defines:

Token structure and expiry
Refresh logic
Error handling for each case
What the agent should NOT touch

Result: reviewable, production-grade code.
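
One way to keep a spec like that honest is to make it a structure and not free text, so a missing field is obvious. A minimal sketch, where the class, the field names, and every value are hypothetical placeholders and not a standard format:

```python
from dataclasses import dataclass, field


@dataclass
class AISpec:
    task: str
    access_token_ttl_minutes: int
    refresh_token_ttl_days: int
    error_cases: dict = field(default_factory=dict)   # condition -> expected response
    do_not_touch: list = field(default_factory=list)  # paths the agent must leave alone

    def to_prompt(self):
        """Render the spec as plain-text instructions for the agent."""
        lines = [
            f"Task: {self.task}",
            f"Access token expiry: {self.access_token_ttl_minutes} minutes",
            f"Refresh token expiry: {self.refresh_token_ttl_days} days",
            "Error handling:",
        ]
        lines += [f"  - {case}: {response}" for case, response in self.error_cases.items()]
        lines.append("Do NOT modify:")
        lines += [f"  - {path}" for path in self.do_not_touch]
        return "\n".join(lines)


# Placeholder values for illustration only.
spec = AISpec(
    task="Add JWT authentication to the login endpoint",
    access_token_ttl_minutes=15,
    refresh_token_ttl_days=7,
    error_cases={
        "expired access token": "401 with error code token_expired",
        "malformed token": "401 with error code token_invalid",
    },
    do_not_touch=["tests/", "app/models.py"],
)
print(spec.to_prompt())
```

The rendered text is what goes to the agent. The structure is what you review and version alongside the code.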

The 5-step workflow

1. Write PRD
2. Write Technical Design Doc
3. Generate AI Spec from the two above
4. Run agent against the spec
5. Review the diff, not the entire codebase

Full breakdown with examples in the video:
https://youtu.be/5ve3_inIN-8