<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Maxim Saplin</title>
    <description>The latest articles on Forem by Maxim Saplin (@maximsaplin).</description>
    <link>https://forem.com/maximsaplin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F248483%2F1cf75ff4-cb65-4592-b2a8-e2dba0d25fe5.jpeg</url>
      <title>Forem: Maxim Saplin</title>
      <link>https://forem.com/maximsaplin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/maximsaplin"/>
    <language>en</language>
    <item>
      <title>Long-Horizon Agents Are Here. Full Autopilot Isn't</title>
      <dc:creator>Maxim Saplin</dc:creator>
      <pubDate>Mon, 30 Mar 2026 06:21:06 +0000</pubDate>
      <link>https://forem.com/maximsaplin/long-horizon-agents-are-here-full-autopilot-isnt-5bo7</link>
      <guid>https://forem.com/maximsaplin/long-horizon-agents-are-here-full-autopilot-isnt-5bo7</guid>
      <description>&lt;p&gt;A good sanity check for long-horizon agents is not a benchmark. It is a task that is easy to verify and hard to fake.&lt;/p&gt;

&lt;p&gt;That is why I still like my small &lt;a href="https://github.com/maxim-saplin/hyperlink_button" rel="noopener noreferrer"&gt;hyperlink_button&lt;/a&gt; experiment so much. On paper, it sounds trivial: a Streamlit control that looks like a text link but behaves like a button. In reality, it is exactly the kind of task that exposes whether an agent can actually work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0znpajhq65p9nh1o2821.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0znpajhq65p9nh1o2821.png" alt=" " width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The task is small enough that you can tell if it succeeded. But it is also awkward enough to matter: Python on the Streamlit side, React/TypeScript on the frontend side, packaging, integration, docs, testing, and all the usual places where “looks plausible” is not the same as “works.”&lt;/p&gt;

&lt;p&gt;That is why I think this kind of project is a better test than a flashy benchmark. The real question is not whether a model can emit code. The real question is whether the workflow around it can keep it honest: make it read the right docs, implement the actual requirement, and prove it did not cheat.&lt;/p&gt;

&lt;p&gt;That question feels especially relevant right now, because early 2026 has been full of confident claims that long-horizon agents crossed a real threshold.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://metr.org/" rel="noopener noreferrer"&gt;METR&lt;/a&gt; has been tracking AI progress in terms of how long a task an agent can complete, not just how well it performs on narrow benchmarks. &lt;a href="https://sequoiacap.com/article/2026-this-is-agi/" rel="noopener noreferrer"&gt;Sequoia’s “2026: This is AGI”&lt;/a&gt; proposed a deliberately practical definition: AGI is the ability to “figure things out.” And &lt;a href="https://www.anthropic.com/research/measuring-agent-autonomy" rel="noopener noreferrer"&gt;Anthropic’s “Measuring AI agent autonomy in practice”&lt;/a&gt; added real deployment data: longer Claude Code runs, more strategic auto-approval, and a shift from step-by-step approval toward active monitoring and interruption.&lt;/p&gt;

&lt;p&gt;At the same time, the major product teams all published their own frontier stories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cursor.com/blog/scaling-agents" rel="noopener noreferrer"&gt;Cursor wrote about scaling long-running autonomous coding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/engineering/building-c-compiler" rel="noopener noreferrer"&gt;Anthropic had a team of parallel Claudes build a C compiler&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openai.com/index/harness-engineering/" rel="noopener noreferrer"&gt;OpenAI described how Codex was used to grow an agent-first codebase to around a million lines&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you only read the headlines, you land in one of two lazy positions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Either developers are cooked.&lt;/li&gt;
&lt;li&gt;Or the whole thing is smoke and mirrors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I think both reactions miss what is actually changing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real breakthrough is operational
&lt;/h2&gt;

&lt;p&gt;The most important shift is not that models suddenly became autonomous software teams. The more interesting shift is that they can now operate inside real environments.&lt;/p&gt;

&lt;p&gt;They can use a CLI. They can inspect files and logs. They can run code. They can read docs. They can check whether a change actually worked. They can keep iterating inside a feedback loop instead of handing a blob of code back to a human and hoping for the best.&lt;/p&gt;

&lt;p&gt;That is a much bigger change than “better autocomplete” or “bigger context.”&lt;/p&gt;

&lt;p&gt;It also explains why software is the natural first home for long-horizon agents. Software is unusually legible, testable, and reversible. You can run something, compare outputs, inspect logs, and decide whether the result is acceptable. In many other domains, verification is just as hard as doing the work in the first place.&lt;/p&gt;

&lt;p&gt;That is one reason &lt;a href="https://www.anthropic.com/research/measuring-agent-autonomy" rel="noopener noreferrer"&gt;Anthropic’s autonomy data&lt;/a&gt; is so interesting. The pattern is not “experienced users blindly trust agents more.” It is subtler than that. They approve more automatically, but they also interrupt more strategically. The oversight style changes.&lt;/p&gt;

&lt;p&gt;That matches my own experience almost exactly.&lt;/p&gt;

&lt;p&gt;The mature workflow is not “approve every action forever.”&lt;/p&gt;

&lt;p&gt;It is “let the system move, but stay close enough to redirect it when it starts drifting.”&lt;/p&gt;

&lt;h2&gt;
  
  
  The flagship demos were real. They were also unusually favorable.
&lt;/h2&gt;

&lt;p&gt;I do think the big public demos matter. But I also think they are easy to misread.&lt;/p&gt;

&lt;p&gt;The interesting part of &lt;a href="https://cursor.com/blog/scaling-agents" rel="noopener noreferrer"&gt;Cursor’s post&lt;/a&gt; is not that a swarm of agents can brute-force software into existence. The interesting part is that coordination turned out to be hard, flat self-coordination was brittle, and simpler planner/worker structure worked better than more clever schemes.&lt;/p&gt;
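&lt;p&gt;As a purely illustrative skeleton (not Cursor's actual implementation; &lt;code&gt;plan&lt;/code&gt; and &lt;code&gt;work&lt;/code&gt; are hypothetical stand-ins for real agent calls), the planner/worker shape is simply: decompose once, then let workers run independently, with no worker-to-worker chatter:&lt;/p&gt;

```python
# Illustrative planner/worker skeleton (hypothetical, not Cursor's code):
# one planner decomposes the goal exactly once; workers execute subtasks
# independently and never coordinate with each other.
from concurrent.futures import ThreadPoolExecutor

def planner_worker(goal, plan, work, max_workers=4):
    subtasks = plan(goal)                          # planner runs exactly once
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(work, subtasks))   # flat, independent workers
    return dict(zip(subtasks, results))
```

&lt;p&gt;The point of the structure is that all coordination lives in the single &lt;code&gt;plan&lt;/code&gt; step, which is exactly what made it less brittle than flat self-coordination.&lt;/p&gt;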

&lt;p&gt;The interesting part of &lt;a href="https://www.anthropic.com/engineering/building-c-compiler" rel="noopener noreferrer"&gt;Anthropic’s C compiler experiment&lt;/a&gt; is not just “an LLM built a compiler.” It is that the agents were operating in a world with unusually strong feedback: serious tests, known-good oracles, structured tasks, and a domain with decades of prior art. &lt;a href="https://www.modular.com/blog/the-claude-c-compiler-what-it-reveals-about-the-future-of-software" rel="noopener noreferrer"&gt;Chris Lattner’s review&lt;/a&gt; and &lt;a href="https://vizops.ai/blog/agent-scaling-laws/" rel="noopener noreferrer"&gt;Pushpendre Rastogi’s analysis&lt;/a&gt; are valuable precisely because they make that visible.&lt;/p&gt;

&lt;p&gt;And &lt;a href="https://openai.com/index/harness-engineering/" rel="noopener noreferrer"&gt;OpenAI’s harness engineering post&lt;/a&gt; may be the clearest articulation of the new role split: humans steer, agents execute. The environment, observability, repository docs, architecture rules, and feedback loops become first-class engineering artifacts.&lt;/p&gt;

&lt;p&gt;That does not make these demos fake.&lt;/p&gt;

&lt;p&gt;It does make them easier to interpret correctly.&lt;/p&gt;

&lt;p&gt;They are not proofs that software teams can be replaced by autonomous agent swarms. They are proofs that strong harnesses, rich feedback, and explicit structure can now unlock a surprising amount of useful work.&lt;/p&gt;

&lt;p&gt;That is a big deal. It is just a different deal than the headlines suggest.&lt;/p&gt;

&lt;p&gt;There is also a simpler reason these demos were unusually favorable: they were not blank-slate tasks. Browsers sit on top of standards, reference implementations, and mountains of prior art. Compilers sit on top of decades of specifications, tests, literature, and engineering patterns. Even when the outcome is new, the terrain is already heavily mapped.&lt;/p&gt;

&lt;p&gt;That matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two orchestration patterns, neither of them magic
&lt;/h2&gt;

&lt;p&gt;After the talk, I found it useful to separate two broad ways people currently try to orchestrate long-running agent work.&lt;/p&gt;

&lt;p&gt;The first is the &lt;a href="https://github.com/snarktank/ralph" rel="noopener noreferrer"&gt;Ralph pattern&lt;/a&gt;: fresh agent instances in a loop, with memory externalized into git history, progress files, and task state. It is crude, but honest. Each run starts with clean context.&lt;/p&gt;
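&lt;p&gt;A minimal sketch of that loop (assuming a hypothetical &lt;code&gt;run_agent&lt;/code&gt; callable; this is not the actual Ralph code) shows how little the pattern relies on in-context memory - everything the next iteration needs lives in a state file on disk:&lt;/p&gt;

```python
# Minimal sketch of the Ralph pattern: each iteration launches a *fresh*
# agent with clean context; all memory is externalized to a state file.
# run_agent is a hypothetical stand-in for one agent run.
import json
from pathlib import Path

STATE = Path("progress.json")

def load_state():
    return json.loads(STATE.read_text()) if STATE.exists() else {"done": []}

def save_state(state):
    STATE.write_text(json.dumps(state, indent=2))

def ralph_loop(tasks, run_agent, max_iters=100):
    for _ in range(max_iters):
        state = load_state()                   # memory comes from disk
        todo = [t for t in tasks if t not in state["done"]]
        if not todo:
            return state                       # every task verified done
        result = run_agent(todo[0], state)     # fresh context every run
        if result == "ok":
            state["done"].append(todo[0])
        save_state(state)                      # externalize progress
    return load_state()
```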

&lt;p&gt;The second is LLM-native orchestration, where a lead agent manages subagents or teammates inside a shared workflow. &lt;a href="https://code.claude.com/docs/en/agent-teams" rel="noopener noreferrer"&gt;Claude Code agent teams&lt;/a&gt; are a good example: separate contexts, shared tasks, direct inter-agent messaging, and an explicit lead.&lt;/p&gt;

&lt;p&gt;In theory, the second model should feel much smarter.&lt;/p&gt;

&lt;p&gt;In practice, my own experiments did not convince me that prompt-level orchestration is the real unlock.&lt;/p&gt;

&lt;p&gt;What I saw was much messier. The manager often wanted to become an executor. It would stop and ask for confirmation. It would ignore the delegation policy. In some runs it violated the brief completely and fell back to the exact CSS or JS workaround I had explicitly ruled out.&lt;/p&gt;

&lt;p&gt;That does not mean subagents are useless.&lt;/p&gt;

&lt;p&gt;It means orchestration is still fragile.&lt;/p&gt;

&lt;p&gt;Right now it feels more like a product and training problem than something you can solve by writing a sufficiently stern prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually worked better
&lt;/h2&gt;

&lt;p&gt;The patterns that helped were much less romantic.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Give the model a CLI.&lt;/li&gt;
&lt;li&gt;Give it docs within reach.&lt;/li&gt;
&lt;li&gt;Run a preflight check before it writes code.&lt;/li&gt;
&lt;li&gt;Make verification cheap.&lt;/li&gt;
&lt;li&gt;Prefer headless checks over fragile visual wandering.&lt;/li&gt;
&lt;li&gt;Use parallelism only when tasks are truly independent.&lt;/li&gt;
&lt;li&gt;Add a QA-style handoff before the real human handoff.&lt;/li&gt;
&lt;li&gt;Observe: watch out for drift.&lt;/li&gt;
&lt;li&gt;Interrupt and intervene.&lt;/li&gt;
&lt;li&gt;Brace for impact: there will be bugs and deficiencies, guaranteed.&lt;/li&gt;
&lt;/ul&gt;
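&lt;p&gt;The verification items above can be sketched as a tiny harness - here &lt;code&gt;agent_step&lt;/code&gt; is a hypothetical placeholder for whatever produces an edit, and the only thing the loop trusts is a headless check's exit code:&lt;/p&gt;

```python
# Hedged sketch of "make verification cheap": run a headless check after
# each agent edit and feed real failure output (not vibes) back in.
# agent_step is a hypothetical placeholder for one agent edit.
import subprocess

def verify(cmd):
    """Run a headless check; return (passed, output) for the loop."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode == 0, (proc.stdout + proc.stderr)[-2000:]

def supervised_run(agent_step, check_cmd, max_rounds=5):
    feedback = ""
    for _ in range(max_rounds):
        agent_step(feedback)        # agent edits, seeing the last failure
        ok, output = verify(check_cmd)
        if ok:
            return True             # cheap headless pass = ready for handoff
        feedback = output           # keep it honest: actual error text
    return False
```

&lt;p&gt;Nothing here is clever; the value is that "did it work" becomes a cheap, repeatable question instead of a judgment call.&lt;/p&gt;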

&lt;p&gt;That changed the economics of the work.&lt;/p&gt;

&lt;p&gt;Once the agent could run code, inspect outputs, and verify behavior directly, it stopped acting like a pure code generator and started acting more like an operator. Not an autonomous engineer. Not a magical coworker. More like a very fast worker inside a good harness.&lt;/p&gt;

&lt;p&gt;That distinction matters.&lt;/p&gt;

&lt;p&gt;The value is not just “the model got smarter.”&lt;/p&gt;

&lt;p&gt;The value is that the model can now participate in a loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I still don't buy the full autopilot story
&lt;/h2&gt;

&lt;p&gt;At the far end of the spectrum sits the software-factory vision, or what Simon Willison described in his write-up of StrongDM as &lt;a href="https://simonwillison.net/2026/Feb/7/software-factory/" rel="noopener noreferrer"&gt;the Dark Factory&lt;/a&gt;: agents writing code, agents testing code, agents reviewing code, with humans mostly stepping out of the implementation loop.&lt;/p&gt;

&lt;p&gt;I find that direction fascinating.&lt;/p&gt;

&lt;p&gt;I also think it clarifies how much infrastructure is required before “no human review” sounds remotely plausible.&lt;/p&gt;

&lt;p&gt;In my own work, fully unattended runs still tend to produce something functionally OK but awkward, sloppy, or strangely overcomplicated. They may satisfy a narrow verifier while violating the spirit of the task. They may finish the easy 95% and quietly give up on the hard 5%. They may pass checks and still feel wrong.&lt;/p&gt;

&lt;p&gt;That is not a theoretical objection.&lt;/p&gt;

&lt;p&gt;That is what I keep seeing.&lt;/p&gt;

&lt;p&gt;And honestly, it also matches the broader pattern in public demos. The output can be impressive, useful, and real while still being rough, unstable, or harder to trust than the headline implies.&lt;/p&gt;

&lt;p&gt;That is why I think the most useful conclusion is narrower than the hype, but stronger than the skepticism.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real state of long-horizon agents
&lt;/h2&gt;

&lt;p&gt;Long-horizon agents are real. They already change how software gets built.&lt;/p&gt;

&lt;p&gt;But the practical value today comes less from autonomous software teams and more from supervised software operations: strong specs, strong harnesses, cheap verification, explicit context, and active steering.&lt;/p&gt;

&lt;p&gt;The fully autonomous rocket-to-Mars version still disappoints me.&lt;/p&gt;

&lt;p&gt;The version where I launch five agents in parallel, let them work on bounded tasks, and then challenge the result like a tough lead or QA engineer is already genuinely useful.&lt;/p&gt;

&lt;p&gt;That, to me, is the real state of agentic engineering in early 2026.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
    </item>
    <item>
      <title>Ran out of Cursor tokens and switched to GitHub Copilot: Side-by-Side</title>
      <dc:creator>Maxim Saplin</dc:creator>
      <pubDate>Wed, 18 Feb 2026 17:38:27 +0000</pubDate>
      <link>https://forem.com/maximsaplin/ran-out-of-cursor-tokens-and-switched-to-github-copilot-side-by-side-2n5p</link>
      <guid>https://forem.com/maximsaplin/ran-out-of-cursor-tokens-and-switched-to-github-copilot-side-by-side-2n5p</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;, April 1 (and this is not a joke). The Insider Preview version is far more usable and capable as of now. Throughout February and March I have seen a flow of updates, and most of the concerns I brought up below are now fixed. I noticed a few Microsoft employee views on my LinkedIn in February; could it be this blog post turned into a backlog? :)&lt;/p&gt;
&lt;/blockquote&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;DISCLAIMER!&lt;/strong&gt; The best AI coding tool is the one available to you, that gives you the best model and reasonable token limits. From the text below it might look like GitHub Copilot is a horrible product - it's not. I use Copilot and I'm productive. It's just an irritating experience when I switch from Cursor. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The banner is a screenshot from my Cursor 2025 retrospective showing almost 1T tokens used - I guess one might call me a heavy user. I've been using &lt;a href="https://dev.to/maximsaplin/-cursorsh-a-competitor-to-github-copilot-58k4"&gt;it&lt;/a&gt; since 2023, and it happens to be my favourite VSCode fork. I have tried different AI-assisted IDEs (Kiro, Antigravity, Windsurf, Project IDX) and VSCode extensions such as Continue and Cody.&lt;/p&gt;

&lt;p&gt;When my monthly token limit in Cursor ran out last December, I started spending more time with GH Copilot (the Insider Preview version with the newest features). Before that I used Copilot only occasionally and mostly followed its progress through media posts and my colleagues' discussions - it's hard to miss the major AI coding assistant that Copilot is. Since 2023 I had formed the opinion that GH Copilot was an inferior product, lagging Cursor by roughly six months. Recently the gap in new feature releases has narrowed, yet the execution is still not great.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I don't like about Copilot
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Plan Mode&lt;/strong&gt; is a gray piece of misery compared to Cursor's implementation. I use it a lot in Cursor but see no reason to use it in Copilot. When I tried it for the first time in GH, I didn't even understand that a plan had been provided - it was just a few paragraphs of text produced by a subagent, and clicking the 'Proceed' button merely switched the mode to 'Agent' and pasted the text 'Proceed' into chat. All of that seemed like a waste of tokens on a subagent that made many tool calls and produced a very generic response. In Cursor you get a detailed, structured &lt;code&gt;.MD&lt;/code&gt; plan; there's a 'Build' button that spawns a new agent in a new dialog (with a model of your choice and a clean context); or you can proceed to implement it in the same thread.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggzn7kkbnixkxmcw4ce0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggzn7kkbnixkxmcw4ce0.png" alt="Cursor Plan Mode" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dialog features are poor&lt;/strong&gt; (and the dialog is the core of the UX). For example, you can't clone dialogs or branch out from a message in the middle - something I used a lot in Cursor to manage ever-growing threads and context overflows. A few more conveniences are missing in GH and keep the experience irritating: a jumpy prompt input, a faint animation that makes it hard to notice that a selected piece of a file was added to the dialog, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff1houarebxp4bh4xdgr7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff1houarebxp4bh4xdgr7.png" alt="Branching out in Cursor" width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;There's no manual dialog summarisation&lt;/strong&gt;, only automatic. Here's how I got trapped by this "feature"... In the middle of a chat (and I had no idea how big the chat was, since there was no token counter; otherwise I'd have branched it into a new thread) I typed "Proceed". After the implementation started and a few tool calls went by, summarisation kicked in, the agent got lost, and it asked: "What do you want me to proceed with?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntlhrzbovxo6frb9kmix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntlhrzbovxo6frb9kmix.png" alt="Cursor summarise" width="552" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Token counter missing for too long&lt;/strong&gt;. The Insider Preview added this feature at the end of January.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;a href="https://github.com/microsoft/vscode-copilot-release/issues/7823" rel="noopener noreferrer"&gt;issue&lt;/a&gt; requesting the feature in Copilot has been open since April 2025 and has collected many reactions. Cursor has had a context window usage indicator for longer than I can remember.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Shorter context windows&lt;/strong&gt;. For example, the GPT-5 family has a 272K input limit, and Anthropic's Claude models allow a 200K total context size by default. I had the perception that my dialogs in Copilot hit the summarisation threshold sooner than in Cursor - it turns out there's a reason for that. Why such low defaults?&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hrr3imnf00syh19jgho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hrr3imnf00syh19jgho.png" alt="Copilot Context Window sizes" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Gemini 3 Pro instability&lt;/strong&gt;. My favourite model of November randomly threw errors in longer dialogs - trying again didn't help; I had to drop those dialogs or switch models. I never noticed this instability in Cursor.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GitHub instructions&lt;/strong&gt; look inferior to Cursor's rules. For example, there are no semantic rules, where an agent pulls relevant instructions automatically; I even had to build a small &lt;a href="https://dev.to/maximsaplin/cursor-like-semantic-rules-in-github-copilot-b56"&gt;workaround for that handy feature&lt;/a&gt;. Recently the Insider Preview added support for Agent Skills, which does exactly that, yet the execution is half-baked (more on that below).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Piling-up legacy in prompt management&lt;/strong&gt;. There are instructions, chat modes, and different approaches to prompts - when recently doing a cleanup in our team's repo where GH Copilot was used, there were a lot of questions around "how do I set up my guardrails properly". A good counter-example, in my opinion, is how Cursor dropped its Rules discipline, made Agent Skills the default choice, and instantly provided a &lt;a href="https://cursor.com/docs/context/skills#migrating-rules-and-commands-to-skills" rel="noopener noreferrer"&gt;migration path&lt;/a&gt; for existing Cursor rules/commands.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This is also another example of a half-baked feature in Copilot. Agent Skills in Copilot are automatic only - the model decides when a skill is pulled into the thread - and for some reason there's no way to reference a skill explicitly. We used &lt;code&gt;/spec&lt;/code&gt; and &lt;code&gt;/task&lt;/code&gt; slash commands for spec-driven development, and those are invoked explicitly. When introducing Agent Skills, Cursor added both options: automatic pull-in and slash commands.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Missing multi-model parallel agents&lt;/strong&gt; - Cursor lets you pick several models to process a single prompt; each one creates a Git worktree, and you can continue working in the worktree you liked the most. Copilot has a Background Agent feature that spins up a new GH Copilot CLI agent - while it also relies on a worktree, it doesn't offer the same convenience.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdtk1fh4rfctvp6bka8ii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdtk1fh4rfctvp6bka8ii.png" alt="Cursor Parallel Agents" width="800" height="776"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Getting newer models can be slow&lt;/strong&gt;. GH announces model availability in Copilot the same day a model is introduced, yet new models are often opt-in, and Copilot subscription admins have to enable them manually. In Cursor's case I learn about &lt;a href="https://www.linkedin.com/posts/maxim-saplin_i-have-github-copilot-and-cursor-corporate-activity-7388911064475926528--Qze?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAAAY52t4BLtN4gJKk-YVpWKb4ZkU3sVysR8w" rel="noopener noreferrer"&gt;new model releases from its model picker&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No choice of reasoning effort for models&lt;/strong&gt;. For example, for GPT-5.2 there's only a single line in the picker, while in Cursor there are 8 options (low, medium, high, xhigh, and the same four with the -fast suffix, which is twice as expensive but faster). Technically, one can switch reasoning effort to "High" for OpenAI models, though only via the experimental setting "Chat: Responses Api Reasoning Effort" - an awkward, hard-to-reach feature.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhzrfg5hq9j3cnx474be.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhzrfg5hq9j3cnx474be.png" alt="Cursor, different variants of model reasoning" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Restoring checkpoints can be unreliable&lt;/strong&gt;. I ended up with a broken solution a few times when going back in chat history. Frankly, it is not always reliable in Cursor either - agents sometimes make changes bypassing the standard edit tools - but GH checkpoint restoring seemed less reliable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;System prompts seem awkward and less effective&lt;/strong&gt;. For instance, in Copilot the agent often responds with a "Plan" section after completing a long thread, filling the top of its report with a scroll of what the plan was. Who cares about the plan once the job is done? Very confusing after switching from Cursor. Besides, when using Copilot in the CLI, it often gets the intent wrong and doesn't produce the right command, requiring further interaction.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxeqpn2id12mw7nhsukn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxeqpn2id12mw7nhsukn.png" alt="Copilot acknowledging plan" width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The recent Cursor release of subagents is yet to be matched by Copilot&lt;/strong&gt;. The UX is better and the whole orchestration feels more polished. See below how, in Cursor, I kicked off parallel agents in their own worktrees, which in turn kicked off subagents - all in one click. Compare that to the very simplistic GH variant:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sciviag551zbkfp51uc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sciviag551zbkfp51uc.png" alt="Parallel Agents + Subagents" width="800" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dhc78pkdsye9s9ghfn4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dhc78pkdsye9s9ghfn4.png" alt="GH Subagents" width="800" height="1400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Models in Copilot &lt;a href="https://github.com/orgs/community/discussions/171733" rel="noopener noreferrer"&gt;can't view image files&lt;/a&gt;&lt;/strong&gt; - an image is only visible to the model if you paste it into chat; otherwise the model is blind to it. A use case? Taking screenshots via ADB and saving them as PNG for inspection - it took me hours of failing verification loops before I realized Copilot lacked that trivial ability. Cursor does this well.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgt1f4nh7zh9u1u9ioqxq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgt1f4nh7zh9u1u9ioqxq.png" alt=" " width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Like about Copilot
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;(Long-awaited) Token counter gives a breakdown&lt;/strong&gt;. It's curious to observe how &lt;a href="https://www.linkedin.com/posts/maxim-saplin_while-you-blinked-ai-consumed-all-of-software-activity-7425782154564972544-pbhZ?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAAAY52t4BLtN4gJKk-YVpWKb4ZkU3sVysR8w" rel="noopener noreferrer"&gt;agentic coding has recently leaped forward&lt;/a&gt; thanks to verification - and now you can easily check how much space tool call results occupy in the dialog.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kbl5kb2xri4x5x8uksz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kbl5kb2xri4x5x8uksz.png" alt="Token Counter in GH" width="444" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You can inspect prompts&lt;/strong&gt; - under "Output &amp;gt; GitHub Copilot Chat" you can view very detailed LLM traces. For example, you can see what sort of prompts are used to wrap your interactions - useful, especially if you like tinkering.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffoh5wj714nt1tumo3ec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffoh5wj714nt1tumo3ec.png" alt="GH Copilot Prompt Inspection" width="800" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Open about standard tools&lt;/strong&gt; - Cursor has no UI to control standard tool selection, only MCP ones. If you are up for tinkering, you can configure tool bundles and see the tools' exact names. For example, I often explicitly ask GH to use the &lt;code&gt;runSubagent&lt;/code&gt; tool to delegate to subagents - it works like a charm for bigger tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndmaj6vir9gbv1938vhl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndmaj6vir9gbv1938vhl.png" alt="Tool selection in GH" width="800" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kinda open-source&lt;/strong&gt; - while the back-end part has not been open-sourced, the extension has been. Besides, many AI coding assistant features have been merged into &lt;code&gt;vscode&lt;/code&gt; directly, making the creation of third-party extensions much easier. Though it's a pity that GH Copilot always requires a sign-in, locking out true local LLM use - the ticket for that is very popular and has been sitting open for almost a year.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Easier installation of MCP&lt;/strong&gt; - I found the integration in GH easier (button click); with Cursor I had to update config files.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ecosystem and integration with GitHub&lt;/strong&gt; - you have Copilot integrated in the GH web app; you can easily assign issues to Cloud agents via your phone while browsing GitHub; the extension is accessible in plenty of IDEs (though people say non-VSCode IDEs struggle with feature parity). They have recently added support for &lt;a href="https://github.blog/news-insights/company-news/pick-your-agent-use-claude-and-codex-on-agent-hq/" rel="noopener noreferrer"&gt;Claude Code and Codex&lt;/a&gt;, allowing you to run other major coding agents through a GH subscription. The breadth and reach of Copilot is great.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
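
&lt;p&gt;For context on the "config files" point above: in Cursor, wiring up an MCP server means editing a JSON file by hand (per the Cursor docs, &lt;code&gt;.cursor/mcp.json&lt;/code&gt; in the project or &lt;code&gt;~/.cursor/mcp.json&lt;/code&gt; globally). A minimal sketch - the server name, package, and env key below are placeholders, not a real server:&lt;/p&gt;

```json
{
  "mcpServers": {
    "my-example-server": {
      "command": "npx",
      "args": ["-y", "@example/mcp-server"],
      "env": { "EXAMPLE_API_KEY": "..." }
    }
  }
}
```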

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6tr1jaeivigo54zznq62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6tr1jaeivigo54zznq62.png" alt="Claude Code" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More tokens&lt;/strong&gt; - it feels like GH's premium-requests model allows for more usage than Cursor's token-based pricing. Unfortunately, there's no user-facing dashboard in Copilot to draw a clear comparison.&lt;/li&gt;
&lt;/ul&gt;
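
&lt;p&gt;The "more tokens" feeling is hard to verify without a dashboard, but the structural difference between the two pricing models is easy to state. A toy sketch - every number below is a made-up placeholder, not an actual Copilot or Cursor rate - showing why a token-hungry agentic session can come out cheaper under request-based billing:&lt;/p&gt;

```python
# Illustrative sketch only: two hypothetical billing models.
# None of these numbers are real GitHub Copilot or Cursor prices.

def premium_request_cost(requests: int, multiplier: float, price_per_request: float) -> float:
    """Request-based billing: cost depends on request count, not tokens."""
    return requests * multiplier * price_per_request

def token_cost(input_tokens: int, output_tokens: int,
               in_price_per_m: float, out_price_per_m: float) -> float:
    """Token-based billing: cost scales with tokens consumed."""
    return input_tokens / 1e6 * in_price_per_m + output_tokens / 1e6 * out_price_per_m

# A long agentic session: few user requests, lots of tokens.
by_request = premium_request_cost(requests=10, multiplier=1.0, price_per_request=0.04)
by_tokens = token_cost(2_000_000, 150_000, in_price_per_m=1.25, out_price_per_m=10.0)
print(f"request-based: ${by_request:.2f}, token-based: ${by_tokens:.2f}")
```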

&lt;h2&gt;
  
  
  From the Creators of SharePoint...
&lt;/h2&gt;

&lt;p&gt;Pun intended. A corporate touch adds a certain flavour that makes software disgusting. SharePoint and Dynamics CRM are, in my view, classic examples - ugly UI, slow. The ".aspx" extensions in their URLs are a reminder of the decades-old ASP.NET Web Forms used to build them.&lt;/p&gt;

&lt;p&gt;Somehow GitHub Copilot follows in the footsteps of other corporate products... It often feels like software created by people who (a) don't use it and (b) don't care. A product built by a &lt;a href="https://www.youtube.com/watch?v=SXM728bzYTE" rel="noopener noreferrer"&gt;slideware company&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Just recently this "don't care" approach &lt;a href="https://github.com/microsoft/vscode/issues/292452" rel="noopener noreferrer"&gt;surfaced&lt;/a&gt; when a user discovered an exploit to bypass billing. That was hilarious! A vulnerability report was submitted privately to the Microsoft Security Response Center; the folks there said billing wasn't their responsibility and advised creating a ticket in a public GitHub repo - where everyone could see the exploit and free-ride on Microsoft's tokens. And even after that, the GH issue got closed automatically by some AI bot. A few days later it was re-opened, after the exploit received public attention and media coverage.&lt;/p&gt;

&lt;p&gt;Copilot vs. the others might become yet another Harvard Business School case study on how a large established company turns slow and loses touch with the market while more nimble and energetic startups build better products.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cursor's Apple Magic
&lt;/h2&gt;

&lt;p&gt;"It just works" often comes to my mind when I use Cursor. There aren't that many options and toggles. They like building minimalist and refined UI (one of the reasons I don't like GitHub - because it's often ugly to my eye). A small example, Copilot in CLI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficqejbl8pw8h2rp6g6mw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficqejbl8pw8h2rp6g6mw.png" alt=" " width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Vs. Cursor:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjnnvogi8jo4ro0tcizvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjnnvogi8jo4ro0tcizvh.png" alt=" " width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's a bit of closedness and secrecy at Anysphere. Take, for example, their &lt;a href="https://cursor.com/blog/composer" rel="noopener noreferrer"&gt;Composer release&lt;/a&gt;, where they compare their model to an unnamed best-on-the-market model and vaguely describe what they did - not even mentioning the new model's context window size. Or the way they implemented the "use your own API key" feature: all LLM requests are still processed on their back-end, making use within a closed perimeter impossible.&lt;/p&gt;

&lt;p&gt;Apple vs. Microsoft, iOS vs. Android, startup vs. enterprise - all those analogies sum up my impressions when comparing Cursor to Copilot.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>githubcopilot</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Long-horizon agents: OpenCode + GPT-5.2 Codex Experiment</title>
      <dc:creator>Maxim Saplin</dc:creator>
      <pubDate>Thu, 22 Jan 2026 16:13:07 +0000</pubDate>
      <link>https://forem.com/maximsaplin/long-horizon-agents-opencode-gpt-52-codex-experiment-1f4h</link>
      <guid>https://forem.com/maximsaplin/long-horizon-agents-opencode-gpt-52-codex-experiment-1f4h</guid>
      <description>&lt;p&gt;Sequoia Capital has recently published a &lt;a href="https://sequoiacap.com/article/2026-this-is-agi/" rel="noopener noreferrer"&gt;blog post&lt;/a&gt; arguing that AGI has been achieved because "Long-horizon agents are functionally AGI". About the same time Cursor team has &lt;a href="https://cursor.com/blog/scaling-agents" rel="noopener noreferrer"&gt;published&lt;/a&gt; their experiments with long-running agents that coded a web browser from scratch. &lt;/p&gt;

&lt;p&gt;And my &lt;a href="https://www.linkedin.com/posts/maxim-saplin_year-2025-might-have-changed-the-substance-activity-7417638248820412416-B2tE" rel="noopener noreferrer"&gt;recent reflections&lt;/a&gt; on the past year made me realize what a huge stride AI coding has made over the course of just one year.&lt;/p&gt;

&lt;p&gt;Along the lines of agentic coding and long-horizon execution, here's my recent experiment using &lt;a href="https://opencode.ai" rel="noopener noreferrer"&gt;OpenCode&lt;/a&gt; and GPT-5.2 Codex (predominantly at high reasoning level, sometimes switching to medium and xhigh)...&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakkfil5xrvm6c62s23j0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakkfil5xrvm6c62s23j0.png" alt="Cursor screenshot" width="800" height="637"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Approach:&lt;/strong&gt; the main dialogue (or session, in OpenCode terms) acts as an orchestrator agent; you explicitly ask it to delegate individual tasks to sub-agents (OpenCode uses the built-in &lt;code&gt;task&lt;/code&gt; tool for that), verify them, and integrate the results. Why? Because we don't want to hit the model's context window limit. Though it could be an interesting experiment to rely on one single long thread with compaction happening from time to time.&lt;/p&gt;
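
&lt;p&gt;The mechanics of why delegation saves context can be sketched in a few lines. This is a toy model, not OpenCode's actual implementation: each sub-agent burns tokens in its own disposable context, and only a short summary lands back in the orchestrator's thread:&lt;/p&gt;

```python
# Toy model of orchestrator/sub-agent context budgeting (not OpenCode's
# actual implementation). Each sub-task costs many tokens to execute,
# but the orchestrator's thread only grows by the size of the summary.

def run_subagent(task: str) -> dict:
    """Stand-in for a sub-agent session: heavy work in a throwaway context."""
    tokens_burned = 120_000           # hypothetical per-task consumption
    summary = f"done: {task} (all tests green)"
    return {"tokens_burned": tokens_burned, "summary": summary}

orchestrator_thread_tokens = 5_000    # system prompt + requirements
total_tokens = orchestrator_thread_tokens

for task in ["port provider core", "add tracing", "write tests"]:
    result = run_subagent(task)
    total_tokens += result["tokens_burned"]
    # Only ~50 tokens of summary land in the orchestrator's context:
    orchestrator_thread_tokens += 50

print(f"work done: ~{total_tokens:,} tokens; "
      f"orchestrator thread: ~{orchestrator_thread_tokens:,} tokens")
```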

&lt;p&gt;&lt;strong&gt;Task:&lt;/strong&gt; rewrite a previously vibe-coded provider for litellm that implements a cascade of requests to several LLMs (using strategies such as Mixture-of-Agents or &lt;a href="https://github.com/karpathy/llm-council" rel="noopener noreferrer"&gt;LLM Council&lt;/a&gt;) before returning a final response.&lt;/p&gt;
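
&lt;p&gt;The cascade itself is conceptually simple. A minimal, litellm-free sketch of the idea (the toy "models" and the longest-draft aggregator below are placeholders; the real provider plugs logic like this into litellm's custom-provider interface):&lt;/p&gt;

```python
# Minimal sketch of a cascade/council strategy, independent of litellm.
# The entries in `council` stand in for real LLM calls; the aggregator
# here is a deliberately trivial placeholder.
from typing import Callable

def cascade(prompt: str,
            council: list[Callable[[str], str]],
            aggregate: Callable[[str, list[str]], str]) -> str:
    """Fan the prompt out to several models, then merge their drafts."""
    drafts = [ask(prompt) for ask in council]
    return aggregate(prompt, drafts)

# Toy 'models' and a trivial aggregator (pick the longest draft):
def model_a(p): return f"A's short answer to: {p}"
def model_b(p): return f"B's much more elaborate answer to: {p}"

def longest(prompt, drafts):
    return max(drafts, key=len)

answer = cascade("What is 2+2?", [model_a, model_b], longest)
print(answer)
```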

&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zl17lvvmy9mafnztzaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zl17lvvmy9mafnztzaj.png" alt="Cost and Token Stats" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;About 4 hours of pure agent work time
&lt;/li&gt;
&lt;li&gt;Orchestrator session — $4.13, 157k tokens of dialogue length by the end of the task
&lt;/li&gt;
&lt;li&gt;16 sub-agent sessions — $9.73
&lt;/li&gt;
&lt;li&gt;Total spent $13.86, about 2M tokens
&lt;/li&gt;
&lt;li&gt;26 files changed in Git
&lt;/li&gt;
&lt;li&gt;Only 5 tests written (some Kiro+Sonnet/Opus would probably have gone wild and generated a hundred tests doing no real work) — all green
&lt;/li&gt;
&lt;li&gt;The app works — the provider executes multiple LLM queries aggregating the final response, and the Streamlit dashboard shows the recorded traces.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fht6y1y49haf7ejq17d89.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fht6y1y49haf7ejq17d89.png" alt="Demo Run" width="800" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While doing the work, the agents made plenty of tool calls, crawled the code-base, made file edits and, most importantly, tested the changes being made (often the changes didn't work and the agents had to fix what was broken):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5k7qbc7g2cgnbuno312.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5k7qbc7g2cgnbuno312.png" alt="Tool Use Stats" width="800" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These ~4 hours of agent time took about half an hour of human effort and ~10 user messages, with 6 major human-in-the-loop touchpoints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discuss the scope, formulate a requirements .MD
&lt;/li&gt;
&lt;li&gt;Kick off the work by explicitly asking to delegate to sub-agents and make sure the tests are green
&lt;/li&gt;
&lt;li&gt;Ask to run a real case with actual LLM interaction
&lt;/li&gt;
&lt;li&gt;At the xhigh reasoning level, ask to analyze the real LLM interaction test case failure and give a fix plan
&lt;/li&gt;
&lt;li&gt;Run the fix loop with real LLM interactions&lt;/li&gt;
&lt;li&gt;Finishing touches: ask to fix the failing tests and tidy up the docs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The orchestrator/sub-agents approach effectively allowed fitting 2 million tokens' worth of work into a 157K-token main thread with the orchestrator - and there's still room, given that GPT-5.2 Codex has a 400K context window.&lt;/p&gt;
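
&lt;p&gt;The compression ratio is worth spelling out. A quick back-of-the-envelope check using the numbers from this run:&lt;/p&gt;

```python
# Back-of-the-envelope check on the context budget from the run above.
total_work_tokens = 2_000_000   # all tokens spent across sessions
main_thread_tokens = 157_000    # orchestrator dialogue length at the end
context_window = 400_000        # GPT-5.2 Codex window

compression = total_work_tokens / main_thread_tokens
headroom = context_window - main_thread_tokens
print(f"~{compression:.0f}x work-to-thread ratio, {headroom:,} tokens of headroom")
```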

&lt;p&gt;&lt;strong&gt;P.S.&lt;/strong&gt; I &lt;a href="https://www.linkedin.com/posts/maxim-saplin_last-week-opencode-httpsopencodeai-activity-7420047824526131200-essq?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAAAY52t4BLtN4gJKk-YVpWKb4ZkU3sVysR8w" rel="noopener noreferrer"&gt;liked&lt;/a&gt; OpenCode a lot, more than I liked Codex.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>agents</category>
    </item>
    <item>
      <title>Cursor-like Semantic Rules in GitHub Copilot</title>
      <dc:creator>Maxim Saplin</dc:creator>
      <pubDate>Thu, 08 Jan 2026 21:22:58 +0000</pubDate>
      <link>https://forem.com/maximsaplin/cursor-like-semantic-rules-in-github-copilot-b56</link>
      <guid>https://forem.com/maximsaplin/cursor-like-semantic-rules-in-github-copilot-b56</guid>
      <description>&lt;p&gt;Both GitHub Copilot and Cursor offer ways to define guardrails for agents in the form of &lt;a href="https://docs.github.com/en/copilot/how-tos/configure-custom-instructions/add-repository-instructions" rel="noopener noreferrer"&gt;Instructions&lt;/a&gt; and &lt;a href="https://cursor.com/docs/context/rules" rel="noopener noreferrer"&gt;Rules&lt;/a&gt; respectively. On the surface they look the same - just different names for a feature for customizing how AI assistants adapt to your project, be it unit test creation, documentation, or maintaining certain parts of the codebase.&lt;/p&gt;

&lt;p&gt;Yet when I turned to GitHub Copilot, I discovered that Instructions are very different conceptually - you define a single file that gets applied to a given repo, folder, or file extensions. In other words, the idea is that you are supposed to (a) have a large .MD file covering lots of topics and (b) rely on relevancy determined by file locations/names.&lt;/p&gt;
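
&lt;p&gt;For illustration, the path-scoped variant of this mechanism (as described in the VS Code Copilot docs at the time of writing) is a &lt;code&gt;NAME.instructions.md&lt;/code&gt; file whose &lt;code&gt;applyTo&lt;/code&gt; glob decides when it loads - note the match is still by file path, not by task. The file name and glob below are illustrative:&lt;/p&gt;

```markdown
---
applyTo: "lib/**/*.dart"
---
# Dart guidance (loaded only when matching files are in play)
Run `flutter analyze` after substantive Dart changes.
```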

&lt;p&gt;This approach seems problematic in many ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It's an LLM anti-pattern, bloating the model's context with huge blocks of text without the ability to organize instructions into smaller, targeted documents&lt;/li&gt;
&lt;li&gt;It's not convenient, instruction relevance is determined by file name pattern matching&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cursor's approach seems much better. The official docs propose breaking down Rules into files no longer than 500 lines. Besides, each Rule has a header section (frontmatter metadata) describing the scope of the rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
description: "Standards for code quality, linting, and modern API usage in Flutter."
globs: lib/**/*.dart, test/**/*.dart
---
# Flutter Code Quality &amp;amp; Modernization
## 1. Run the Analyzer
After making substantive changes to Dart code, **ALWAYS** run `flutter analyze` to catch errors, warnings, and deprecations.
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These targeted, small, semantic Rules were something I lacked when switching to GitHub Copilot. I liked how Cursor can match rules based on the task in the dialog, not file location. Yet I quickly found an easy workaround - use &lt;a href="https://github.com/maxim-saplin/nothingness/blob/main/.github/copilot-instructions.md" rel="noopener noreferrer"&gt;&lt;code&gt;copilot-instructions.md&lt;/code&gt;&lt;/a&gt; as a registry of smaller instructions/rules. Besides, it can serve as a shim for existing Cursor rules, easing the coexistence of guardrails used by both AI assistants:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Nothingness - GitHub Copilot Instructions
This is a Flutter media controller application. Consult the relevant rule files in `.cursor/rules/` when working in their domains.

## Rules Index
| Rule File | When to Consult |
|-----------|-----------------|
| `flutter-best-practices.mdc` | Writing/modifying Dart code. Covers linting, modern APIs, deprecations. |
| `testing-standards.mdc` | Adding features, models, services, widgets, screens. Covers test organization &amp;amp; mocking. |
| `documentation.mdc` | Adding architecture components or complex logic. Covers doc structure. |
| `flutter-commands.mdc` | Running Flutter CLI commands. Covers sandbox permissions. |
| `github-actions-polling.mdc` | Working with CI/CD workflows. Covers polling strategies &amp;amp; failure handling. |
| `rule-creation.mdc` | Creating/modifying rules in `.cursor/rules/`. Covers format &amp;amp; best practices. |

## Agent Behavior
1. **Context efficiency**: Don't load all rules—consult only those relevant to the current task
2. **Run validation**: Always run `flutter analyze` after Dart changes
3. **Reference docs**: Point to existing documentation rather than re-explaining
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It turns out modern models fine-tuned for agentic flows are quite curious and tend to follow up on relevant leads they find in the context:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb46j1jtxcbmyef5v0wzz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb46j1jtxcbmyef5v0wzz.png" alt=" " width="800" height="660"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>githubcopilot</category>
      <category>cursor</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>AI Dev: Plan Mode vs. SDD — A Weekend Experiment</title>
      <dc:creator>Maxim Saplin</dc:creator>
      <pubDate>Thu, 04 Dec 2025 17:13:48 +0000</pubDate>
      <link>https://forem.com/maximsaplin/ai-dev-plan-mode-vs-sdd-a-weekend-experiment-f8e</link>
      <guid>https://forem.com/maximsaplin/ai-dev-plan-mode-vs-sdd-a-weekend-experiment-f8e</guid>
      <description>&lt;p&gt;Three months ago, I tested &lt;a href="https://dev.to/maximsaplin/ai-dev-testing-kiro-3b5j"&gt;Kiro's Spec-Driven Development (SDD)&lt;/a&gt; workflow and walked away impressed but frustrated. The AI built 13,000 lines of Rust code with 246 tests... that took 30 minutes to run, checked God-knows-what, left CI/CD broken beyond repair, and produced a codebase I couldn't maintain. Fast-forward to this weekend: I built a complete mobile app using Cursor + Gemini 3 Pro + Flutter—structured, maintainable, and shipped in one evening plus half a day.&lt;/p&gt;

&lt;p&gt;The difference? Let's unpack...&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Have Done
&lt;/h2&gt;

&lt;p&gt;Built a Flutter app targeting Android and macOS (mainly for UI debugging) from scratch -&amp;gt; &lt;a href="https://github.com/maxim-saplin/nothingness" rel="noopener noreferrer"&gt;https://github.com/maxim-saplin/nothingness&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It shows the currently playing media, provides media controls (pause, next, etc.), and displays a spectrum analyzer using the mic &lt;/li&gt;
&lt;li&gt;Used Cursor + Gemini 3 (and some GPT 5.1 and Opus 4.5), mostly Plan and Agent modes&lt;/li&gt;
&lt;li&gt;Added 6 Cursor rules acting as Guardrails and Guidelines for agents&lt;/li&gt;
&lt;li&gt;26 Unit/integration tests&lt;/li&gt;
&lt;li&gt;Focus on Docs:

&lt;ul&gt;
&lt;li&gt;I didn't save the MDs produced by plan mode&lt;/li&gt;
&lt;li&gt;Yet I asked to follow a simple discipline adding important tech decisions to the &lt;code&gt;docs/&lt;/code&gt; folder&lt;/li&gt;
&lt;li&gt;Had a separate Cursor rule for docs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Set up and validated that GH MCP works, so agents can autonomously build CI/CD&lt;/li&gt;

&lt;li&gt;Working CI/CD with GitHub Actions - build/test on commit, release by request&lt;/li&gt;

&lt;li&gt;Saturday evening and Sunday (~ 8h effort)&lt;/li&gt;

&lt;li&gt;Spent ~$50 in tokens&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gemini-3-pro-preview&lt;/td&gt;
&lt;td&gt;42757369&lt;/td&gt;
&lt;td&gt;$32.02&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.1-high&lt;/td&gt;
&lt;td&gt;9721834&lt;/td&gt;
&lt;td&gt;$5.79&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;claude-4.5-opus-high-thinking&lt;/td&gt;
&lt;td&gt;9065436&lt;/td&gt;
&lt;td&gt;$8.66&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.1-codex-high&lt;/td&gt;
&lt;td&gt;276380&lt;/td&gt;
&lt;td&gt;$0.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;composer-1&lt;/td&gt;
&lt;td&gt;10999&lt;/td&gt;
&lt;td&gt;$0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Grand Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;61832018&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$46.68&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Why even do this app in the first place? Well, I had been driving an "analog" VW Polo for a week while my EV was in a workshop. I had serious withdrawal during this time, missing plenty of the "conveniences" my &lt;a href="https://github.com/maxim-saplin/zeekr_apk_mod" rel="noopener noreferrer"&gt;Zeekr&lt;/a&gt; provides: watching/listening to YouTube videos, a highway autopilot that let me doom-scroll, a 15-inch OLED infotainment screen always loaded with info (nav, videos). &lt;/p&gt;

&lt;p&gt;During the 2nd week of digital withdrawal I felt a sudden relief... It was a 90s vibe: a nice song coming through the car audio, a pixelated LCD screen showing the name of an artist popular at the time, no urge to pick up the phone and scroll while waiting at the traffic lights. That reminded me of a &lt;a href="https://www.youtube.com/watch?v=orQKfIXMiA8" rel="noopener noreferrer"&gt;video&lt;/a&gt; touching on how gadgets and constant connectivity steal from our lives... Why not create a simple app that darkens the infotainment in my EV?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo88mjsj47l9wxrbxz0dv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo88mjsj47l9wxrbxz0dv.png" alt="VW Polo Skin in app" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  SDD Sidenote
&lt;/h2&gt;

&lt;p&gt;After my Kiro experiment in September I moved on to scaling the SDD approach to actual production work.&lt;/p&gt;

&lt;p&gt;First, I tried GitHub SpecKit with my Cursor enterprise subscription (I couldn't use the Kiro free tier with a commercial code base) - and I didn't like what I saw. After Kiro it felt bloated: too many artifacts loaded with text, extra steps, etc.&lt;/p&gt;

&lt;p&gt;Turned out, there were Kiro prompts circulating around the internet. By tweaking them a bit and putting them in the right place, I recreated the Kiro experience in Cursor - check out &lt;a href="https://gist.github.com/maxim-saplin/49d0f490bf82dfedc26e452bf462c206" rel="noopener noreferrer"&gt;this gist&lt;/a&gt; for details.&lt;/p&gt;

&lt;p&gt;Over that week I successfully shipped 4 features in a Python/Dart codebase - merged and rolled to prod. All of that while multi-tasking and occasionally switching to check the results or untangle roadblocks. I had mixed feelings: losing grip on the codebase, being lost in a flux.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dztcm9acdvnrolj7crz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dztcm9acdvnrolj7crz.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some of the lessons learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Feasibility Checks are Mandatory:&lt;/strong&gt; Models often propose impossible or broken solutions (e.g., bad data flows, unworkable stacks). Always verify feasibility before implementation to avoid wasting days.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggressively Prune "Bloat":&lt;/strong&gt; AI tends to over-engineer (excessive env vars, extra containers, verbose docs). Reducing scope before code generation saves massive cleanup time later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read Specs:&lt;/strong&gt; Bugs caught during spec review are far cheaper than bugs caught in implementation. Poor doc review compounds AI-generated issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "Shallow" Trap:&lt;/strong&gt; AI allows you to avoid deep diving into tech, but this backfires during debugging. You are often faster if you understand requirements and the underlying tech/codebase rather than blindly trusting the agent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid "Time Sinks":&lt;/strong&gt; Be ruthless about abandoning low-value features (e.g., "Geo in Analytics," complex filters) that the AI suggests but struggles to implement cleanly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  This time I felt in Control
&lt;/h2&gt;

&lt;p&gt;In October the Cursor team introduced their response to the ever-growing demand for "think before you do" approaches - &lt;a href="https://www.linkedin.com/posts/maxim-saplin_cursor-team-has-added-native-support-of-a-activity-7381909791939534848-OIVL?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAAAY52t4BLtN4gJKk-YVpWKb4ZkU3sVysR8w" rel="noopener noreferrer"&gt;Plan Mode&lt;/a&gt;. Since then I have mostly used this mode, rarely reverting to SDD. And I never kept the produced Plans/Designs (unlike the specs produced by SDD). I saw Plan Mode as a more structured approach: spend some tokens on an "alignment" ceremony with the agent around a "transaction" - something fairly small: a task, a deliverable... Part of this transaction could be a doc put into a dedicated place, to keep traces of important decisions and be used later.&lt;/p&gt;

&lt;p&gt;While working on &lt;code&gt;nothingness&lt;/code&gt; it felt natural to plan the implementation, argue certain decisions, decide on a document creation rule, document, add a Cursor rule for creating rules, create rules, design a testing framework, expand test coverage... The experience was quite different - I felt complete control and confidence in what I was doing. Even if there were bugs or deficiencies, I had no doubt they would be easily fixable.&lt;/p&gt;

&lt;p&gt;One could say I vibe coded an app over the weekend - I would argue I exercised a disciplined approach and produced maintainable code that can be built upon. And indeed over the next day I did quite a lot of refactoring and added multiple features.&lt;/p&gt;

&lt;p&gt;The "Plan Mode" wasn't just about generating a to-do list; it acted as an &lt;strong&gt;alignment ceremony&lt;/strong&gt;. It was a deliberate pause—spending tokens to "think" and clarify intent before rushing into implementation. In the same dialog I could switch between Plan and Agent modes multiple times, periodically compacting the conversation via &lt;code&gt;/summarise&lt;/code&gt; command. When the thread was done - feature delivered, task done - I could nudge the agent to check test coverage (sometimes new tests were added) or if a doc is worth adding.&lt;/p&gt;

&lt;h2&gt;
  
  
  What about the Structured Approach?
&lt;/h2&gt;

&lt;p&gt;While most of the work flowed naturally and I did not struggle with heavy ceremonies (think BMad or SpecKit), I still applied software engineering common sense, paying attention to structure - of the solution and of work execution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;First prompt was a feasibility check of what felt like the most unclear/challenging part:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllfdt22laryj3uetwqzx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllfdt22laryj3uetwqzx.png" alt="Feasibility check" width="800" height="306"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After some discussion with the agent I outlined the requirements and worked on the plan proposed by the agent:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fejixgclb57fikypsxw3k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fejixgclb57fikypsxw3k.png" alt="Requirements planning" width="800" height="887"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In further dialogs I asked the model to define documentation discipline and when I decided it was worth making a pause and leaving traces of the docs I prompted the model to make a detour. Those docs were later used by agents when ramping up new features.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Once the minimal version of the app was running, the agent and I agreed on a general approach to testing, documented it, added the initial coverage, and later extended and modified the test harness in accordance with the testing discipline that emerged early on. Again, a best practice that protects against regressions and also gives the AI agent a strong signal about how well it is doing on newer features.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;While reviewing the produced code I challenged the solution breakdown on several occasions (e.g. why downstream code must be aware of upstream scaling details) - that led to a few refactors, test updates, and new docs being created.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CI/CD was a deliberate step to validate how MCP tooling works and whether the agent could engage with it. Along the way, a number of Cursor rules emerged documenting the peculiarities of interacting with GitHub Actions and of sandboxed CLI execution when running Flutter commands.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The MCP Stutter
&lt;/h2&gt;

&lt;p&gt;I decided to let the agent set up GitHub Actions CI/CD autonomously. To do this I needed the GitHub MCP server working properly, which led to a few hours of "setup tax" worth mentioning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Auth Trap:&lt;/strong&gt; My Personal Access Token had expired, and I wasted time browsing and reconfiguring. Classic.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The Tool Bundle Limits:&lt;/strong&gt; GitHub's MCP server recently changed how it bundles tools. The default configuration exposed a limited set of tools (about 20), missing the critical Actions-related tools I needed. The agent initially couldn't "see" the CI/CD failures because it literally didn't have the tools in its context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validating MCP tooling:&lt;/strong&gt; I explicitly probed the agent for MCP connectivity, which helped a lot, yet didn't completely close the feedback loop (see next point).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo507u0gsm7lllnccst1e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo507u0gsm7lllnccst1e.png" alt=" " width="800" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Troubleshooting YAML workflows:&lt;/strong&gt; Recently I've noticed that LLMs struggle with YAML formatting. For an hour the agent struggled to get CI/CD running due to a YAML syntax error: it pushed the broken file, checked the job status on the server, saw it had errored, and then checked the job log - which was empty. It turns out that workflow syntax errors surface in a dedicated 'annotations' output of the workflow-run tool call; this &lt;a href="https://github.com/maxim-saplin/nothingness/blob/main/.cursor/rules/github-actions-polling.mdc" rel="noopener noreferrer"&gt;rule&lt;/a&gt; handles the GH Actions feedback loop.&lt;/li&gt;
&lt;/ul&gt;
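&lt;p&gt;The "empty log" failure mode has a simple explanation: when the workflow file itself is malformed, GitHub never starts a job, so the parse error lands in the check-run annotations rather than in any job log. A minimal sketch of fetching those annotations via GitHub's REST API (my illustration, not the linked rule's actual mechanics; token handling is simplified):&lt;/p&gt;

```python
import json
import urllib.request

API = "https://api.github.com"

def annotations_url(owner: str, repo: str, check_run_id: int) -> str:
    # Workflow-file syntax errors show up here, not in the (empty) job log
    return f"{API}/repos/{owner}/{repo}/check-runs/{check_run_id}/annotations"

def fetch_annotations(owner: str, repo: str, check_run_id: int, token: str) -> list:
    # GET /repos/{owner}/{repo}/check-runs/{id}/annotations (documented GitHub REST endpoint)
    req = urllib.request.Request(
        annotations_url(owner, repo, check_run_id),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

&lt;p&gt;Pointing the agent (or a rule) at this endpoint closes the feedback loop that an empty job log leaves open.&lt;/p&gt;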

&lt;p&gt;Once I fixed the tool configuration, the payoff was massive: a green and easily maintainable CI/CD pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tips and Tricks
&lt;/h2&gt;

&lt;p&gt;If you want to replicate this "Plan Mode" flow, here are the non-obvious lessons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Treat Plans as Disposable:&lt;/strong&gt; Unlike Kiro or strict SDD, I didn't treat the generated "Plan" as a sacred artifact to be committed to the repo. It's a transient thought process. The &lt;em&gt;result&lt;/em&gt; of the plan (code, specific docs) is what matters.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Know the Task:&lt;/strong&gt; make sure you are confident in what you are building - quite often we don't actually realise what we're building (feasibility, consistency, why?)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Choose a Familiar Tech Stack:&lt;/strong&gt; it's easier to spot issues when skimming through generated code and docs&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Rules as Guardrails:&lt;/strong&gt; I added 6 specific Cursor rules (&lt;code&gt;.cursor/rules&lt;/code&gt;). One was specifically for documentation: "If you change logic, you must update the &lt;code&gt;docs/&lt;/code&gt; folder." This forced the agent to maintain a "Technical Decisions" log alongside the code, which saved me from the "black box" problem later.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Use &lt;code&gt;/summarize&lt;/code&gt; Ruthlessly:&lt;/strong&gt; Long context windows are great, but models get "dumb" and expensive as the chat grows (especially past 20-30k tokens). I frequently used the &lt;code&gt;/summarize&lt;/code&gt; command to compress the history. It keeps the agent sharp and the costs down.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Weekend Models:&lt;/strong&gt; Anecdotally, &lt;code&gt;gemini-3-pro-preview&lt;/code&gt; performed significantly better on Saturday/Sunday than during the week. Perhaps less traffic?&lt;/li&gt;
&lt;/ul&gt;
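&lt;p&gt;The compaction idea behind &lt;code&gt;/summarize&lt;/code&gt; can be sketched in a few lines (a toy illustration, not Cursor's internals; the token estimate and the summarizer below are crude stand-ins):&lt;/p&gt;

```python
# Toy sketch of context compaction: when the transcript exceeds a token
# budget, fold older turns into a single summary message.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token (a stand-in for a real tokenizer)
    return len(text) // 4

def summarize(messages: list) -> str:
    # Stand-in for an LLM summarization call
    return f"[Summary of {len(messages)} earlier turns]"

def compact(history: list, budget_tokens: int = 20_000, keep_recent: int = 4) -> list:
    total = sum(estimate_tokens(m) for m in history)
    if total > budget_tokens and len(history) > keep_recent:
        older, recent = history[:-keep_recent], history[-keep_recent:]
        return [summarize(older)] + recent
    return history

history = [f"turn {i}: " + "x" * 40_000 for i in range(10)]
print(len(compact(history)))  # 5: one summary message plus the 4 most recent turns
```

&lt;p&gt;The trade-off is the same as with the real command: you pay a little summarization cost up front to keep every later turn cheap and sharp.&lt;/p&gt;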

&lt;h2&gt;
  
  
  Model and Harness Progress
&lt;/h2&gt;

&lt;p&gt;I attribute my satisfaction with the results to the significant progress models have made over the past 3 months: they are more reliable in agentic settings (multi-turn dialogs with extensive tool use). The recent GPT 5+, Claude 4.5 and Gemini 3 feel like models that can be relied upon to produce more substantial code and docs - no more shallow verbosity or pointless unit tests.&lt;/p&gt;

&lt;p&gt;The same goes for tooling: AI IDE assistants like Cursor do a great job of context engineering - providing models with efficient tools and environments, feeding in relevant information, and establishing effective feedback loops.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disclaimer: When to Use What
&lt;/h2&gt;

&lt;p&gt;This experiment convinced me that for greenfield projects, prototypes, or "Solopreneur" work, this &lt;strong&gt;Plan Mode + Guardrails&lt;/strong&gt; approach is superior to heavy SDD. It's agile, keeps you in the driver's seat, and maintains momentum.&lt;/p&gt;

&lt;p&gt;However, &lt;strong&gt;SDD still has its place.&lt;/strong&gt; If I were tackling a massive legacy enterprise codebase, or working in a large team where "hidden knowledge" is the enemy, I would likely revert to a stricter Spec-Driven approach (like SpecKit or custom workflows). There, the overhead of generating strict artifacts pays off in alignment and safety.&lt;/p&gt;

&lt;p&gt;But for building a bespoke infotainment system for my car in a single weekend? &lt;strong&gt;AI coding with discipline&lt;/strong&gt; is the future.&lt;/p&gt;

</description>
      <category>showdev</category>
      <category>ai</category>
      <category>flutter</category>
      <category>productivity</category>
    </item>
    <item>
      <title>AI Dev: Testing Kiro</title>
      <dc:creator>Maxim Saplin</dc:creator>
      <pubDate>Mon, 25 Aug 2025 12:08:10 +0000</pubDate>
      <link>https://forem.com/maximsaplin/ai-dev-testing-kiro-3b5j</link>
      <guid>https://forem.com/maximsaplin/ai-dev-testing-kiro-3b5j</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://kiro.dev" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt; is a yet another VSCode fork (just like Cursor or Windsurf) that integrates AI coding features. What caught my attention was the "spec-driven development" &amp;gt; it makes total sense proposing a structured approach to dev (as opposed to "vibe coding"). I got my invitation and over the weekend tested Kiro. I decided to re-create a command line &lt;a href="https://github.com/maxim-saplin/NetCoreStorageSpeedTest" rel="noopener noreferrer"&gt;cross-platform disk performance benchmark&lt;/a&gt; that was built in 2018 using .NET. This time I picked Rust and used AI. My expectations were low, yet I was impressed in a good way, I (or was it Kiro) did build a working app with solid test coverage! At times Kiro was left alone working for extended periods of time following the plan... And it maintained coherence - that impressed me the most. The result is not perfect, there're some things that don't work (i.e. CI/CD is broken and God knows how much time is needed to recover it), nevertheless part of blame is on me, I could have asked for less and be more attentive to the specs. Over the course of my experiment I have extensively &lt;a href="https://github.com/maxim-saplin/cpdt2/blob/main/NOTES.md" rel="noopener noreferrer"&gt;documented the process&lt;/a&gt;. These notes were used to create the below blog post using Grok 4.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update, Aug 27:&lt;/strong&gt; After spending a few more days with the app Kiro produced I am less enthusiastic. Kiro still falls for the shortcomings of other AI tools that eagerly produce code and complete the prompt "no matter what". I poked at the cpdt2 codebase using Cursor and Kiro, trying to recover CI/CD and trying to get the app to compile and run on Linux (under Dev Containers) - and none of the attempts succeeded in a reasonable time. A classic AI SDLC dilemma: getting a result fast, then wasting loads of time fixing it and making it work. I think Kiro is a powerful tool (staying coherent while working on multiple tasks), yet when left unattended it can easily bloat your solution with more scope than you, as a human, can process. Is that the problem of the tool or of the human using it? Part of the issue is on me; I could have been more thorough and critical when sketching the specs. Anyway, below is a sample of me trying to make the integration tests run fast, launching a "spec &amp;gt; design &amp;gt; task" cycle and eventually discovering that I had gone down a wrong/non-feasible route, wasting a couple of hours. In a separate chat Kiro happily acknowledged the issue (and whatever it proposed in that chat was also not feasible):&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fznjsbflxohp2kz77o7an.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fznjsbflxohp2kz77o7an.png" alt=" " width="800" height="1002"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Hey folks, it's Maxim here—back with another dive into the wild world of AI-assisted coding. If you've read my &lt;a href="https://dev.to/maximsaplin/continuedev-the-swiss-army-knife-that-sometimes-fails-to-cut-4gg3"&gt;piece on Continue.dev&lt;/a&gt;, you know I'm all about testing these tools in the trenches, warts and all. This time, I spent a lazy Sunday (well, "lazy" if you ignore the occasional CoD: Modern Warfare 3 breaks) experimenting with Kiro, a new AI-native IDE that promises "Spec-Driven Development." Spoiler: It turned a vague prompt into a fully functional cross-platform Rust app, but not without some hilarious detours and existential questions about my role as a developer.&lt;/p&gt;

&lt;p&gt;Back in 2018, I built &lt;a href="https://github.com/maxim-saplin/CrossPlatformDiskTest" rel="noopener noreferrer"&gt;CrossPlatformDiskTest (CPDT)&lt;/a&gt;, a .NET-based storage speed tester that racked up 500k downloads on Android. It measured sequential/random reads/writes, memory copies, and more—nothing fancy, but it scratched an itch for benchmarking drives across platforms. This GUI app is in turn based on a &lt;a href="https://github.com/maxim-saplin/NetCoreStorageSpeedTest" rel="noopener noreferrer"&gt;Command Line Tool&lt;/a&gt;. Fast-forward to 2025: I decided to recreate the CLI version in Rust (a language I barely remember from a 2021 LinkedIn course) using Kiro. No hands-on coding from me—just prompts, reviews, and AI orchestration. The result? A repo called &lt;a href="https://github.com/maxim-saplin/cpdt2" rel="noopener noreferrer"&gt;cpdt2&lt;/a&gt; with 72 files, 13k lines of code, 246 tests, and even GitHub Actions for CI/CD. But let's break down the journey, because this wasn't just coding—it was coding while AI did the heavy lifting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup: From Prompt to Plan
&lt;/h2&gt;

&lt;p&gt;Kiro's big hook is its structured workflow: Spec &amp;gt; Design &amp;gt; Tasks, all in Markdown. It's like forcing yourself to think before you code, which is honestly a breath of fresh air compared to the "prompt-and-pray" chaos of other tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzphskg0p7gx75n8vamru.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzphskg0p7gx75n8vamru.png" alt=" " width="800" height="677"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I kicked things off with this prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I want to create a cross-platform disk speed test utility. It must be compilable as a command line tool for macOS, Windows, and Linux. It must have an isolated library/component that runs the speed tests and that I can later integrate with other non-CLI apps (e.g., GUI). The tests must include sequential and random read and write measurements with block sizes of 4MB for sequential and 4KB for random (default can be overridden), it must create a test file in a given device (CLI must provide a list of devices available in the system, for system drives utilize OS facilities to get writable app folder). The app must mitigate the effects of buffered reads and cached writes (by default disabling those). The stats collected must include min, max, and avg speeds. Additionally, the app must implement a 5th test - memory copy.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kiro (powered by Claude 3.7 or 4—I stuck with 4) fleshed it out into requirements, added niceties like MB/s units and progress indicators, and even suggested Android/iOS support when I nudged it. It generated a design doc, broke everything into 23 traceable tasks (e.g., core library setup, platform-specific implementations, CLI args, tests), and queued them up.&lt;/p&gt;

&lt;p&gt;The Kiro UI? Clean and intuitive—rounded corners, tabbed chats, and a content pane that feels like a souped-up VS Code. One quirk: use &lt;code&gt;#&lt;/code&gt; instead of &lt;code&gt;@&lt;/code&gt; for context in chats. I stumbled there once, but overall it was smooth sailing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Build: AI Takes the Wheel, I Play CoD
&lt;/h2&gt;

&lt;p&gt;With tasks queued, Kiro started chugging away. It handled everything from project setup (Cargo.toml, build.rs) to platform-specific code for Windows, macOS, Linux, Android, and iOS. I "supervised" by reviewing diffs in Cursor (using GPT-5 at high reasoning mode) and occasionally fixing linter warnings or slow tests.&lt;/p&gt;

&lt;p&gt;Highlights (and lowlights):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Early Wins&lt;/strong&gt;: Tasks 1-5 flew by—core config, progress tracking, stats. Kiro even added unit tests when I prompted. A quick Cursor review confirmed it was solid, though I had to install Rust manually after a terminal hiccup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platform Shenanigans (Tasks 6-8)&lt;/strong&gt;: Implementing non-buffered I/O across OSes? Kiro nailed it, but linter warnings piled up in unrelated files. I copy-pasted errors into the chat; Kiro fixed most, but it sometimes "hallucinated" checks. Still, better than older LLMs that'd just generate BS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing Drama (Tasks 9-17)&lt;/strong&gt;: The first real run was Task 9. Tests took forever (47 seconds initially) because of oversized files like 2GB dummies. I manually timed them in VS Code's Test Explorer and prompted fixes—down to 13 seconds. One test suite hung for 10-20 minutes; Kiro eventually debugged it. I even created Cursor rules for "runtime checks" (build, test, run the app) to double-check Kiro's work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Big Queue (Tasks 18-23)&lt;/strong&gt;: I dumped the rest in one go. Kiro took ~1 hour, pausing twice for CLI approvals. It added 120+ tests, code coverage tracking, docs (like TESTING.md), and even GitHub Actions for CI/CD—plus a release script for crates.io. Mind-reading? I was thinking about CI/CD, and poof, there it was.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Meanwhile, I switched tabs to save Urzikstan in CoD MW3. Vibe-coding at its finest: AI builds while I snipe baddies. But cracks appeared—integration tests felt inconsistent, and I had to revert/restart once due to messy file placements (Rust's idiom of keeping unit tests inside source files tripped me up, given my rusty Rust knowledge).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0t2kzas48qt25p42n7e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0t2kzas48qt25p42n7e.png" alt=" " width="730" height="1172"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I used Cursor and GPT-5 High between the Kiro tasks to review Git diffs - not much value, most of the reviews were "OK" and I didn't care to read the rest of the output.&lt;/p&gt;

&lt;p&gt;End result? The app runs! Pick a path, run benchmarks, get stats. It even lists devices and handles caching as specified. But oops—one original req (interactive device selection for system drives) got lost in the shuffle. And 35 linter issues lingered, plus failing GitHub Actions. Fixable, but a reminder that AI isn't perfect.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code Stats: Bloat or Brilliance?
&lt;/h2&gt;

&lt;p&gt;Compare cpdt2 to my 2018 .NET version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;cpdt2 (Rust + AI)&lt;/strong&gt;: 72 files, 13k LOC, 1.9k comments, 3.5k blanks. Includes benches, docs, scripts, and heavy testing/CI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2018 CPDT (.NET)&lt;/strong&gt;: 23 files, 1.8k LOC. Leaner, but no automation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI inflated the codebase (thanks to tests and infra), but it works cross-platform without me writing a line. In 2018, that took a week of my life; this was one Sunday.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszzttm85tnb1n9sl4tl9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszzttm85tnb1n9sl4tl9.png" alt=" " width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Reflections: Is This the Future of Coding?
&lt;/h2&gt;

&lt;p&gt;Kiro enforces discipline—think before coding—which aligns with prompt engineering best practices. It's not just "prompt &amp;gt; code"; it's a harness for coherent, long-horizon work. The agent stayed on-task for hours, breaking down complexity without losing context.&lt;/p&gt;

&lt;p&gt;But here's the rub: I coded blindly, barely glancing at the code. Am I even a developer anymore? It felt like pushing buttons while AI steered—fun, but I lost touch with the codebase. Maintainability? No clue. And without my prior CPDT knowledge, I'd be lost prompting effectively. Non-tech folks? Forget it; this still needs domain expertise.&lt;/p&gt;

&lt;p&gt;Side thoughts: Are high-level languages becoming assembly? I don't grok Rust tooling, but do I need to? AI rejection of dumb asks (e.g., fixing non-existent code) is a win over older models. Yet, running in a container from the start would've avoided potential disk litter from test files.&lt;/p&gt;

&lt;p&gt;Overall, Kiro's a promising tool—like a Swiss Army knife that mostly cuts, but occasionally needs sharpening. It turned my experiment into a working app, honed my "AI orchestration" skills, and left me pondering: If AI builds while I game, what's left for humans? Dive in, try it, and let me know your thoughts in the comments!&lt;/p&gt;

&lt;p&gt;If you're curious, check out &lt;a href="https://github.com/maxim-saplin/cpdt2" rel="noopener noreferrer"&gt;cpdt2 on GitHub&lt;/a&gt;. And yes, I'll fix those linter warnings... eventually.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>genai</category>
      <category>llm</category>
    </item>
    <item>
      <title>LLMs are Bad at Math</title>
      <dc:creator>Maxim Saplin</dc:creator>
      <pubDate>Fri, 13 Jun 2025 06:20:11 +0000</pubDate>
      <link>https://forem.com/maximsaplin/llms-are-bad-at-math-5h4d</link>
      <guid>https://forem.com/maximsaplin/llms-are-bad-at-math-5h4d</guid>
      <description>&lt;p&gt;LLMs are known to struggle with math. Not in those PhD level tasks from AIME eval, where the &lt;a href="https://openai.com/index/learning-to-reason-with-llms/" rel="noopener noreferrer"&gt;reasoning models compete and shine&lt;/a&gt;... But rather in the everyday math we deal with - additions, multiplications, etc.&lt;/p&gt;

&lt;p&gt;Take for example Grok 3's DeepSearch, where I prompted it to "... list countries by their GDP per capita in Japanese Yen". As you can see in the screenshot below, the agent went about it quite reasonably: it found a readily available GDP per capita table from the IMF, picked a USD to JP¥ conversion rate, and created a summary table with the IMF data converted using that rate.&lt;/p&gt;

&lt;p&gt;In its explanation of the approach - "... each USD value was multiplied by 146 to get JPY. For example, Luxembourg’s 140,941 USD became 20,577,186 JPY (140,941 × 146)" - Grok 3 makes a calculation mistake. My non-AI native calculator gives me 20,577,386 as the result of the 140,941 × 146 multiplication. All the cells in the following table were also wrong.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9h7o1gepuo586sd6ta1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9h7o1gepuo586sd6ta1.png" alt=" " width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I went further by testing Grok in 3 different modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No thinking + Web Search&lt;/li&gt;
&lt;li&gt;Thinking + Web Search&lt;/li&gt;
&lt;li&gt;DeepSearch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwrjmc1nm8qqc8b8a0wd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwrjmc1nm8qqc8b8a0wd.png" alt=" " width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In each mode Grok's approach was the same: find the source data in USD, peg to a certain exchange rate, do the calculation, and output the resulting table. Putting aside the questions of why the exchange rate differed across the 3 cases and why a particular list of countries was picked (never the full list of countries and territories)... I tested how one of the best SOTA models (Grok-3 ?Mini) fared at converting USD to JPY:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No thinking + Web Search: 32 countries, 3 wrong calculations&lt;/li&gt;
&lt;li&gt;Thinking + Web Search: 13 countries, all correct&lt;/li&gt;
&lt;li&gt;DeepSearch: 11 countries, 11 wrong (deviating by ~0.5% from the true values)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1bh1s07oxa6yoeih5qr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1bh1s07oxa6yoeih5qr.png" alt=" " width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The complete calculation verification is available in this &lt;a href="https://docs.google.com/spreadsheets/d/1GKd_elYoa4OpCASUIlYtuBwmhgOdH926/edit?usp=sharing&amp;amp;ouid=107546815842839456165&amp;amp;rtpof=true&amp;amp;sd=true" rel="noopener noreferrer"&gt;spreadsheet&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The example demonstrates a very common pitfall in LLM use. Any prompt or context dealing with numbers may require the model to do basic math. It will likely not resort to a tool call (i.e. asking a Python interpreter to run the calculations), and hence the numbers an LLM produces are not trustworthy. I rarely see prompts with numbers followed by a tool call for the arithmetic; models readily return completions with the calculations already done.&lt;/p&gt;
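&lt;p&gt;The remedy is to have the model delegate arithmetic to a deterministic tool. A minimal illustration (mine, not Grok's actual tooling) of the kind of "calculator" tool call an LLM could make instead of multiplying inside its weights, using the Luxembourg figures from the example:&lt;/p&gt;

```python
# Deterministic conversion - the kind of "calculator" tool an LLM should
# call instead of doing the multiplication inside its weights.

def convert(amount_usd, rate):
    return amount_usd * rate

# Luxembourg's GDP per capita at the 146 JPY/USD rate from the example
print(convert(140_941, 146))  # 20577386 - exact, every time
```

&lt;p&gt;A code interpreter gets this right on every run; the model only has to fill in the arguments.&lt;/p&gt;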

&lt;p&gt;Say you have Office 365 Copilot, Claude, ChatGPT, or any other chatbot doing errands for you. You ask it to look into an invoice and highlight value-for-money outliers. Or you are working on a quote and ask the chatbot to prepare a report. Or as a PM you use the AI assistant to look into sprint stats and evaluate velocity. There are numerous cases requiring basic number crunching. And if your life depends on the accuracy of those numbers I wouldn't trust any digit in the result. No matter what LLM product you use, Perplexity, Glean, Deep Research, Copilot, Gemini - all are based on LLMs that are bad at math.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;But how bad are LLMs at this sort of math? Assume you have the correct input (though it is rarely the case, models can easily hallucinate at any step, e.g. while &lt;a href="https://www.linkedin.com/posts/maxim-saplin_these-days-tools-like-perplexity-glean-activity-7311434030128726016-uYFC?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAAAY52t4BLtN4gJKk-YVpWKb4ZkU3sVysR8w" rel="noopener noreferrer"&gt;processing a table in a picture&lt;/a&gt;). What are the chances LLM will get the math right?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I've created a benchmark testing just that: &lt;a href="https://github.com/maxim-saplin/llm_arithmetic" rel="noopener noreferrer"&gt;llm_arithmetic&lt;/a&gt;. It prompts a model multiple times to do additions, subtractions, multiplications, and divisions of random numbers - and registers the accuracy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ Model                                      ┃ Trials ┃ Correct % ┃  NaN % ┃  Dev % ┃ Comp. Tok. ┃       Cost ┃      Avg Error ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ o4-mini-2025-04-16-medium                  │    480 │    97.08% │  0.00% │  2.92% │ 1110603.00 │  $4.903872 │         0.002% │
│ o4-mini-2025-04-16-medium-4k               │    480 │    93.54% │  0.00% │  6.46% │ 1083780.00 │  $6.741561 │         0.001% │
│ o4-mini-2025-04-16-low                     │    480 │    88.96% │  0.00% │ 11.04% │  575871.00 │  $2.551050 │         0.959% │
│ deepseek-r1                                │    480 │    84.17% │  0.21% │ 15.62% │ 1462524.00 │  $3.210413 │      2669.789% │
│ claude-sonnet-4-20250514-thinking16000     │    480 │    76.04% │  0.00% │ 23.96% │ 1332908.00 │ $20.085939 │      1740.396% │
│ o3-mini-2025-01-31-medium                  │    480 │    75.21% │  0.00% │ 24.79% │  945716.00 │  $4.178371 │         2.287% │
│ grok-3-mini-beta-high                      │    480 │    71.88% │  1.25% │ 26.88% │    2702.00 │  $0.006156 │       827.580% │
│ deepseek-r1-4k                             │    480 │    70.00% │  0.00% │ 30.00% │  620371.00 │  $0.000000 │       712.913% │
│ qwen3-32b@cerebras-thinking                │    480 │    69.58% │  5.62% │ 24.79% │ 2767460.00 │  $0.000000 │ 840317057.169% │
│ qwen3-14b@q8_0-ctx4k-thinking              │    480 │    66.25% │  0.21% │ 33.54% │ 2338564.00 │  $0.000000 │      9492.622% │
│ o1-mini-2024-09-12                         │    480 │    66.04% │  0.00% │ 33.96% │  572960.00 │  $7.617905 │      6825.446% │
│ claude-opus-4-20250514-thinking16000       │    480 │    65.83% │  0.00% │ 34.17% │  396158.00 │  $0.000000 │      1831.015% │
│ qwen3-14b@iq4_xs-ctx32k-thinking           │    480 │    65.83% │  0.83% │ 33.33% │ 2552276.00 │  $0.000000 │      8152.815% │
│ qwen3-32b@iq4_xs-ctx16k-thinking           │    480 │    65.62% │  3.75% │ 30.63% │ 3499454.00 │  $0.000000 │      5227.605% │
│ o3-mini-2025-01-31-low                     │    480 │    65.21% │  0.00% │ 34.79% │  284738.00 │  $1.270064 │         5.435% │
│ qwen3-14b@iq4_xs-ctx4k-thinking            │    480 │    65.00% │  0.42% │ 34.58% │ 2245910.00 │  $0.000000 │  72213401.589% │
│ qwen3-14b@q4_k_m-ctx4k-thinking            │    480 │    64.79% │  0.00% │ 35.21% │ 2334475.00 │  $0.000000 │      3769.350% │
│ claude-sonnet-3.7-20250219-thinking4096    │    480 │    57.08% │ 18.96% │ 23.96% │ 1214269.00 │ $18.306354 │       889.557% │
│ gemini-2.5-pro-preview-03-25               │    480 │    55.83% │  0.00% │ 44.17% │    5517.00 │  $0.078019 │        20.602% │
│ qwen3-14b@iq4_xs-ctx32k-thinking-4k        │    480 │    55.21% │  0.21% │ 44.58% │  710967.00 │  $0.000000 │       988.474% │
│ claude-sonnet-3.7-20250219-4k              │    480 │    52.50% │  0.00% │ 47.50% │    4213.00 │  $0.000000 │      2217.925% │
│ xai/grok-3-mini-beta                       │    480 │    51.46% │  0.00% │ 48.54% │    2511.00 │  $0.006060 │       913.579% │
│ claude-sonnet-3.7-20250219                 │    480 │    51.04% │  0.00% │ 48.96% │    4147.00 │  $0.114204 │      1302.437% │
│ claude-opus-4-20250514                     │    480 │    50.42% │  0.00% │ 49.58% │    4169.00 │  $0.572685 │      5037.315% │
│ gemini-2.5-flash-preview-04-17-thinking    │    480 │    50.42% │  0.21% │ 49.38% │  521284.00 │  $0.315585 │        27.894% │
│ claude-sonnet-4-20250514                   │    480 │    50.00% │  0.00% │ 50.00% │    4125.00 │  $0.113868 │        20.410% │
│ gemini-2.5-flash-preview-04-17-thinking    │    480 │    49.79% │  0.21% │ 50.00% │  310022.00 │  $1.087891 │       481.693% │
│ claude-3.5-haiku                           │    480 │    49.58% │  0.00% │ 50.42% │    3987.00 │  $0.029816 │      3351.666% │
│ gpt-4.5-preview-2025-02-27                 │    480 │    49.58% │  0.00% │ 50.42% │    2647.00 │  $1.607175 │        24.709% │
│ gpt-4.1-2025-04-14-4k                      │    480 │    48.54% │  0.00% │ 51.46% │    2688.00 │  $5.163010 │        25.919% │
│ gemini-2.5-flash-preview-04-17-no-thinking │    480 │    48.54% │  0.00% │ 51.46% │    5238.00 │  $0.005956 │        30.566% │
│ gpt-4.1-2025-04-14                         │    480 │    48.12% │  0.00% │ 51.88% │    2729.00 │  $0.068629 │      7284.099% │
│ qwen3-32b@cerebras                         │    480 │    46.46% │  0.00% │ 53.54% │    7457.00 │  $0.000000 │        63.979% │
│ qwen3-32b@iq4_xs-ctx16k                    │    480 │    46.04% │  1.04% │ 52.92% │    7132.00 │  $0.000000 │        63.271% │
│ qwen3-14b@iq4_xs-ctx32k                    │    480 │    45.21% │  1.67% │ 53.12% │    7533.00 │  $0.000000 │ 392239118.901% │
│ gpt-4-0613                                 │    480 │    41.04% │  0.00% │ 58.96% │    2450.00 │  $0.631020 │    362466.402% │
│ gpt-4.1-nano-2025-04-14                    │    480 │    38.54% │  0.42% │ 61.04% │    2841.00 │  $0.002749 │    686001.894% │
│ gpt-35-turbo-0125                          │    480 │    35.62% │  0.62% │ 63.75% │    2438.00 │  $0.011725 │        43.177% │
│ gpt-35-turbo-1106                          │    480 │    33.96% │  0.21% │ 65.83% │    2560.00 │  $0.011907 │       409.261% │
│ gpt-4o-mini-2024-07-18                     │    480 │    32.29% │  0.00% │ 67.71% │    2862.00 │  $0.004137 │        64.570% │
│ claude-2.1                                 │    480 │    13.33% │  0.00% │ 86.67% │    2661.00 │  $0.000000 │       174.584% │
│ deepseek-r1-distill-qwen-14b@iq4_xs        │    480 │    10.21% │ 70.21% │ 19.58% │ 1113604.00 │  $0.000000 │       163.793% │
└────────────────────────────────────────────┴────────┴───────────┴────────┴────────┴────────────┴────────────┴────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My observations based on testing a range of models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In general, models are fine with small numbers (2-3 digits)&lt;/li&gt;
&lt;li&gt;Performance is worse with multiplication and the worst with division&lt;/li&gt;
&lt;li&gt;There's a huge gap in performance between models&lt;/li&gt;
&lt;li&gt;o3/o4 models are surprisingly good - I'd trust them with number-crunching tasks where error under 1 percent is tolerable&lt;/li&gt;
&lt;/ul&gt;
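&lt;p&gt;For reference, accuracy numbers like the ones in the table boil down to the relative error between a model's numeric answer and ground truth. A minimal sketch of that scoring (the function and problem-generator names here are mine, not from the eval):&lt;/p&gt;

```python
import random

def relative_error_pct(predicted, expected):
    """Percent deviation of a model's numeric answer from ground truth."""
    if expected == 0:
        return abs(predicted) * 100.0
    return abs(predicted - expected) / abs(expected) * 100.0

def make_problem(rng, digits=6):
    """Random arithmetic problem; division yields a float, the rest integers."""
    hi = 10 ** digits
    a, b = rng.randint(1, hi), rng.randint(1, hi)
    op = rng.choice(["+", "-", "*", "/"])
    expected = {"+": a + b, "-": a - b, "*": a * b, "/": a / b}[op]
    return f"{a} {op} {b}", expected

prompt, truth = make_problem(random.Random(42))
# An answer 0.5% off from ground truth scores 0.5 on this metric:
print(round(relative_error_pct(100.5, 100), 6))  # 0.5
```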

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4f8ne7ghotkyn0clp1f3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4f8ne7ghotkyn0clp1f3.png" alt="LLM Arithmetic Accuracy Heatmap" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>math</category>
    </item>
    <item>
      <title>Grok 3 API - Reasoning Tokens are Counted Differently</title>
      <dc:creator>Maxim Saplin</dc:creator>
      <pubDate>Thu, 15 May 2025 16:24:21 +0000</pubDate>
      <link>https://forem.com/maximsaplin/grok-3-api-reasoning-tokens-are-counted-differently-197</link>
      <guid>https://forem.com/maximsaplin/grok-3-api-reasoning-tokens-are-counted-differently-197</guid>
      <description>&lt;p&gt;I learned this the hard way... If you use the recently released Grok-3 Mini reasoning model (which is &lt;a href="https://maxim-saplin.github.io/llm_chess/" rel="noopener noreferrer"&gt;great&lt;/a&gt;, by the way) you might find your token usage reported wrong...&lt;/p&gt;

&lt;h2&gt;
  
  
  TLDR;
&lt;/h2&gt;

&lt;p&gt;While both OpenAI and xAI report reasoning usage in &lt;code&gt;usage.completion_tokens_details.reasoning_tokens&lt;/code&gt; field:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI includes reasoning tokens in &lt;code&gt;usage.completion_tokens&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;xAI doesn't include them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hence for OpenAI (and, per my tests, for DeepSeek R1), to get the total generated tokens you can use the good old &lt;code&gt;completion_tokens&lt;/code&gt; field. With xAI you need to add up the two values to get the right totals (and keep your cost estimates correct).&lt;/p&gt;
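&lt;p&gt;A sketch of the workaround in Python (the &lt;code&gt;is_xai&lt;/code&gt; flag and the dict-shaped &lt;code&gt;usage&lt;/code&gt; object are illustrative assumptions; the field names are the ones above):&lt;/p&gt;

```python
def total_completion_tokens(usage, is_xai):
    """Return total generated tokens (visible output + reasoning).

    OpenAI (and, per my tests, DeepSeek R1) already fold reasoning
    tokens into completion_tokens; xAI's Grok reports them separately.
    """
    completion = usage["completion_tokens"]
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens", 0)
    if is_xai:
        return completion + reasoning  # Grok: the two counts are disjoint
    return completion  # OpenAI: reasoning tokens already included

openai_usage = {"completion_tokens": 900,
                "completion_tokens_details": {"reasoning_tokens": 700}}
xai_usage = {"completion_tokens": 200,
             "completion_tokens_details": {"reasoning_tokens": 700}}
print(total_completion_tokens(openai_usage, is_xai=False))  # 900
print(total_completion_tokens(xai_usage, is_xai=True))      # 900
```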

&lt;p&gt;Neither &lt;code&gt;litellm&lt;/code&gt; nor &lt;code&gt;AG2&lt;/code&gt; (of the LLM libs I've used recently) adjusts the reported usage for this Grok quirk.&lt;/p&gt;

&lt;h2&gt;
  
  
  Not fully OpenAI Chat Completions API Compliant
&lt;/h2&gt;

&lt;p&gt;The Grok API provides an OpenAI-compatible endpoint. For reasoning models they didn't reinvent the wheel and used the standard &lt;a href="https://docs.x.ai/docs/guides/reasoning#control-how-hard-the-model-thinks" rel="noopener noreferrer"&gt;&lt;code&gt;reasoning_effort&lt;/code&gt; parameter&lt;/a&gt;, just like &lt;a href="https://platform.openai.com/docs/guides/reasoning?api-mode=chat" rel="noopener noreferrer"&gt;OpenAI does&lt;/a&gt; with its o1/o3/o4 models. Yet for some reason xAI decided to deviate from OpenAI's approach to reasoning-token accounting.&lt;/p&gt;

&lt;p&gt;It's unfortunate that this inconsistency made it into xAI's production API. &lt;/p&gt;

</description>
      <category>llm</category>
      <category>chatgpt</category>
      <category>api</category>
      <category>programming</category>
    </item>
    <item>
      <title>XYZ% of Code is Now Written by AI... Who Cares?</title>
      <dc:creator>Maxim Saplin</dc:creator>
      <pubDate>Thu, 01 May 2025 18:30:30 +0000</pubDate>
      <link>https://forem.com/maximsaplin/xyz-of-code-is-now-written-by-ai-who-cares-5o9</link>
      <guid>https://forem.com/maximsaplin/xyz-of-code-is-now-written-by-ai-who-cares-5o9</guid>
      <description>&lt;ul&gt;
&lt;li&gt;Microsoft CEO Satya Nadella &lt;a href="https://www.cnbc.com/2025/04/29/satya-nadella-says-as-much-as-30percent-of-microsoft-code-is-written-by-ai.html" rel="noopener noreferrer"&gt;said&lt;/a&gt; that &lt;em&gt;"as much as 30% of the company’s code is now written by artificial intelligence"&lt;/em&gt; (Apr 2025).&lt;/li&gt;
&lt;li&gt;Anthropic's CEO &lt;a href="https://www.businessinsider.com/anthropic-ceo-ai-90-percent-code-3-to-6-months-2025-3" rel="noopener noreferrer"&gt;made a forecast&lt;/a&gt; that &lt;em&gt;"in 12 months, we may be in a world where AI is writing essentially all of the code,"&lt;/em&gt; (Mar 2025).&lt;/li&gt;
&lt;li&gt;Google CEO &lt;a href="https://blog.google/inside-google/message-ceo/alphabet-earnings-q3-2024/#search" rel="noopener noreferrer"&gt;stated&lt;/a&gt; that &lt;em&gt;"more than a quarter of code they've been adding was AI-generated"&lt;/em&gt; (Oct 2024).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When I see this sort of headline, I often get the sense that the XYZ figure carries the connotation of a &lt;strong&gt;software engineer replacement rate&lt;/strong&gt;: code written by AI is code not written by humans, so we don't need those 30% of humans typing on their keyboards. With the media's focus on sensationalism and competition for the reader's attention, I don't see why they wouldn't optimize for more drama...&lt;/p&gt;

&lt;p&gt;While this sort of speculation is curious (&lt;em&gt;how can those CEOs measure the metric beyond guesstimates based on clues/heuristics?&lt;/em&gt;), I don't see much meaning in it beyond gauging the adoption rates of AI coding tools...&lt;/p&gt;

&lt;h2&gt;
  
  
  100% of Code is Generated, 70% of Code is Deleted After Review
&lt;/h2&gt;

&lt;p&gt;Let me give you a recent example. I worked on a &lt;a href="https://github.com/maxim-saplin/mcp_safe_local_python_executor" rel="noopener noreferrer"&gt;small project&lt;/a&gt; creating a local Python interpreter wrapped as an MCP tool - think Code Interpreter for ChatGPT.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Why even bother, aren't there Python tools already? There are, yet it's either Python execution in the local environment, which is dangerous, &lt;strong&gt;OR&lt;/strong&gt; reliance on Docker or remote environments that take some effort to set up.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The idea was to wrap into an MCP Server the custom-made, sandboxed, local &lt;a href="https://github.com/huggingface/smolagents/blob/main/src/smolagents/local_python_executor.py" rel="noopener noreferrer"&gt;Python interpreter&lt;/a&gt; provided with HuggingFace's &lt;a href="https://huggingface.co/docs/smolagents/en/index" rel="noopener noreferrer"&gt;smolagents&lt;/a&gt; library.&lt;/p&gt;

&lt;p&gt;After cloning the smolagents repo, investigating the codebase, and creating a small example of isolated use of the interpreter, I instructed Cursor's Agent to create a new MCP Server project. I showed it the example and the interpreter code, and gave it a link to the MCP Server docs by Anthropic. The agent created a complete, linter-warnings-free code base.&lt;/p&gt;

&lt;p&gt;Yet over the next couple of hours, I iterated on the produced code. I removed most of the files and lines. I used AI actively, both autocompletion and chat, i.e. I typed very little Python myself.&lt;/p&gt;

&lt;p&gt;Can I state that 100% of code was AI-generated? Probably. Does this imply that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I was not needed in the process of building software (100% replaced by AI).&lt;/li&gt;
&lt;li&gt;Or did I get a 100x productivity boost, since as an average human I can type 30 words per minute while SOTA models generate them at ~3000 WPM (~150-200 tokens per second)?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's the &lt;a href="https://github.com/maxim-saplin/mcp_safe_local_python_executor/tree/main/.VSCodeCounter" rel="noopener noreferrer"&gt;stats&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1st version by Claude 3.7/Cursor Agent: 9 files, 1062 lines, 45 comments, 158 blanks&lt;/li&gt;
&lt;li&gt;Final modified and published version: 4 files, 318 lines, 9 comments, 79 blanks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While iterating on the code base, I used my brain cycles to make sense of what AI had produced, gaining a better understanding of what actually needed to be built - and that takes effort and time. Sometimes writing code is easier than reading it. Besides, writing code (or rather making low-level modifications) serves a very important function: learning the code base and giving the task time to sink in and make sense.&lt;/p&gt;

&lt;p&gt;After all, I dropped ~70% of the AI-generated code. Does that tell us much? Does it mean AI code is junk if it had to be thrown away - generated in minutes, reworked/debugged in hours? I don't think so. Yet the rework percentage isn't a telling metric on its own, just like the percent-generated metric.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;One might say the example is isolated: creating a small project from scratch is a corner case not met that often in real life. That's true. Yet I think it makes a relevant point and puts some numbers on it. There's the same tendency to remove/rework a lot of generated code when maintaining a large code base. The larger the scope of the task, the more agentic the workflow, and the more lines/files are touched - the more you have to fix. For some reason, the best AI tools still have a hard time getting the "vibes" of a project - they struggle to create consistent changes that follow the "spirit" of the code base.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Building Software is not About Writing Code
&lt;/h2&gt;

&lt;p&gt;It's about integrating and shipping code. Did you know that at some point Microsoft had a 3-year release cycle of Windows and &lt;em&gt;"on average, a release took about three years from inception to completion but only about six to nine months of that time was spent developing “new” code? The rest of the time was spent in integration, testing, alpha and beta periods"&lt;/em&gt; &lt;a href="https://news.ycombinator.com/item?id=16139236" rel="noopener noreferrer"&gt;1&lt;/a&gt;, &lt;a href="https://arstechnica.com/gadgets/2018/10/microsofts-problem-isnt-shipping-windows-updates-its-developing-them/" rel="noopener noreferrer"&gt;2&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Writing code is just one very important part, yet it is not the only one. Did you know that (according to a recent Microsoft &lt;a href="https://arxiv.org/pdf/2502.15287" rel="noopener noreferrer"&gt;study&lt;/a&gt;) developers spend just 20% of their time coding/refactoring (that's where the XYZ% AI-generated metric lands):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3d98j8ffxu3r64jmvh0z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3d98j8ffxu3r64jmvh0z.png" alt="The Gap Between Developers’ Ideal vs Actual Workweeks" width="800" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Working with teams and customers building software, I see many areas where AI can barely help.&lt;/p&gt;

&lt;p&gt;What if your stakeholders become unresponsive, play internal politics, and can't make up their minds about the requirements? Will ChatGPT (or some fancy "agent") chase the client, flush out all the contradictions in the requirements (&lt;a href="https://www.youtube.com/watch?v=BKorP55Aqvg" rel="noopener noreferrer"&gt;7 green lines, 1 must be transparent&lt;/a&gt;), communicate with the whole team, and mitigate any of the &lt;a href="https://www.informit.com/store/waltzing-with-bears-managing-risk-on-software-projects-9780133492057" rel="noopener noreferrer"&gt;core risks&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;Even if you have what seem to be refined requirements... How much time will it take for every individual team member to internalize the "thing" he or she is trying to achieve? How much time will it take for the team to find internal consensus on how to organize around the goal, break down the scope, and bridge business requirements to implementation details? Will Gen-AI tools accelerate the &lt;a href="https://en.wikipedia.org/wiki/Tuckman%27s_stages_of_group_development" rel="noopener noreferrer"&gt;team dynamics&lt;/a&gt;, leapfrogging from forming and storming to norming and performing in days, not weeks?&lt;/p&gt;

&lt;p&gt;I see it all the time: people are slow thinkers; there are natural constraints on how much info our brains can process, how many &lt;a href="https://en.wikipedia.org/wiki/Dunbar%27s_number" rel="noopener noreferrer"&gt;social connections&lt;/a&gt; we can build and maintain, etc. Generating lots of text that few care to read (and fewer try to understand) doesn't solve anything.&lt;/p&gt;

&lt;p&gt;Given the current state and trajectory of AI tools in software development, I see them as isolated productivity tools where the human is the bottleneck. There's little progress with AI agents filling all the gaps a human worker fills in a daily routine. Even at a higher level of AI autonomy, people would still need time to make up their minds, evolve their perspectives, talk, and agree.&lt;/p&gt;

&lt;h2&gt;
  
  
  Productivity
&lt;/h2&gt;

&lt;p&gt;Ultimately, businesses seek to get more work done with less effort/money. Adopt AI in dev teams and cut costs/headcount by some magic number (for some reason it's always 20-30%) - it doesn't seem to work that way. There's no definitive demonstration of a step change in developer productivity across the industry. I like these 2 examples, studies into developer productivity with AI from last autumn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;👍 Microsoft, Accenture (&lt;a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4945566" rel="noopener noreferrer"&gt;https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4945566&lt;/a&gt;) reported a 26% increase in completed tasks.&lt;/li&gt;
&lt;li&gt;👎 Uplevel (&lt;a href="https://lnkd.in/eHnbrWAQ" rel="noopener noreferrer"&gt;https://lnkd.in/eHnbrWAQ&lt;/a&gt;) found no change in cycle time and a 41% increase in bugs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;P.S&amp;gt;&lt;/strong&gt; Did you know that as of April 2025, the popular AI coding assistant Aider has &lt;a href="https://aider.chat/HISTORY.html" rel="noopener noreferrer"&gt;~80% of its code&lt;/a&gt; generated with Aider ;)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P.P.S&amp;gt;&lt;/strong&gt; Up to the "P.P.S&amp;gt;", the article was exactly 1200 words; it took me several days to contemplate and 4 hours to write. GPT 4.1 would have needed 12 seconds to generate a blog post of similar size :)&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>coding</category>
      <category>programming</category>
    </item>
    <item>
      <title>GPT 4.1, o3, o4-mini - OpenAI releases through the lens of LLM_Chess</title>
      <dc:creator>Maxim Saplin</dc:creator>
      <pubDate>Mon, 21 Apr 2025 17:21:10 +0000</pubDate>
      <link>https://forem.com/maximsaplin/gpt-41-o3-o4-mini-openai-releases-through-the-lens-of-llmchess-1pcg</link>
      <guid>https://forem.com/maximsaplin/gpt-41-o3-o4-mini-openai-releases-through-the-lens-of-llmchess-1pcg</guid>
      <description>&lt;p&gt;This will be a quick post. I've run the recent OpenAI models through the &lt;a href="https://maxim-saplin.github.io/llm_chess/" rel="noopener noreferrer"&gt;LLM Chess eval&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;o4-mini and o3 demonstrate solid chess performance and instruction following&lt;/li&gt;
&lt;li&gt;GPT 4.1 didn't qualify due to multiple model errors&lt;/li&gt;
&lt;li&gt;4.1 Mini is a good increment over 4o Mini, 4.1 Nano didn't impress&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below is a matrix view of models' performance with Y-axis showing chess proficiency and X-axis instruction following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6npvqifeumly2nzv0tpv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6npvqifeumly2nzv0tpv.png" alt="LLM Chess Matrix View" width="800" height="649"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;P.S&amp;gt; The "Notes" section of the &lt;a href="https://maxim-saplin.github.io/llm_chess/" rel="noopener noreferrer"&gt;leaderboard web site&lt;/a&gt; dives deeper into model's performance.&lt;/p&gt;

</description>
      <category>openai</category>
      <category>llm</category>
      <category>ai</category>
      <category>genai</category>
    </item>
    <item>
      <title>Mercury Coder - A Quick Test of Diffusion Language Model</title>
      <dc:creator>Maxim Saplin</dc:creator>
      <pubDate>Fri, 18 Apr 2025 14:34:38 +0000</pubDate>
      <link>https://forem.com/maximsaplin/mercury-coder-a-quick-test-of-diffusion-language-model-12b2</link>
      <guid>https://forem.com/maximsaplin/mercury-coder-a-quick-test-of-diffusion-language-model-12b2</guid>
      <description>&lt;p&gt;I have recently touched on how diffusion/transformer models cross over into new &lt;a href="https://dev.to/maximsaplin/4o-image-gen-diffusiontransformer-cross-over-trend-4p6k"&gt;domains&lt;/a&gt; - specifically the February news on Large Language Diffusion models (LLaDA, Mercury).&lt;/p&gt;

&lt;p&gt;Last weekend, I received an invitation from Inception Labs to take part in beta testing their Mercury Coder Small model - one of a few representatives of the breed of dLLMs.&lt;/p&gt;

&lt;p&gt;The model is presented as &lt;strong&gt;(a)&lt;/strong&gt; based on novel non-transformer tech, &lt;strong&gt;(b)&lt;/strong&gt; matching the performance of SOTA SLMs (think OpenAI GPT Mini, Anthropic Haiku, and Google Gemini Flash) and &lt;strong&gt;(c)&lt;/strong&gt; being 5-10x faster.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;By performance I mean how smart the model is and how good its answers are. Speed tells how many tokens per second a model can generate.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Speed
&lt;/h2&gt;

&lt;p&gt;The key selling point from the &lt;a href="https://www.inceptionlabs.ai/news" rel="noopener noreferrer"&gt;Mercury introduction post&lt;/a&gt; was the generation speed - a 5-10x increase over similar-sized models. That's what I decided to test first.&lt;/p&gt;

&lt;p&gt;I have used a simple &lt;a href="https://github.com/maxim-saplin/py_chat_ui" rel="noopener noreferrer"&gt;Python UI&lt;/a&gt; that supports OpenAI compatible endpoints and can show tokens per second metric after the response is received:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqink4cyjfuli076m68t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqink4cyjfuli076m68t.png" alt="Py Chat UI Mercury Coder" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I obtained a stable 370 tokens/second generation with almost zero variation. What's also curious is that the model only works with the temperature set to 0, and it always produced the same answer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mercury-coder-small / 1102 tokens
TTF Chunk: 0.85s / TPS: 369.13
TTF Chunk: 0.73s / TPS: 369.86
TTF Chunk: 0.77s / TPS: 368.73
TTF Chunk: 0.77s / TPS: 370.96
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the one hand, it is faster than the ~200 tokens/second we've seen from models like GPT-4o Mini and Gemini 2.0 Flash. On the other hand, it's half the &lt;a href="https://www.inceptionlabs.ai/news" rel="noopener noreferrer"&gt;advertised&lt;/a&gt; 737 tok/s. I also tried a simple &lt;code&gt;curl&lt;/code&gt; request with no streaming, yet received the same speed.&lt;/p&gt;
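&lt;p&gt;The TTF Chunk / TPS figures above are simple stopwatch math over the streamed response; here is a sketch of the computation with made-up timestamps (the function name and trace are illustrative, not the UI's actual code):&lt;/p&gt;

```python
def stream_stats(start, chunk_times, token_count):
    """Compute time-to-first-chunk and tokens/second for a streamed reply.

    start        -- wall-clock time the request was sent
    chunk_times  -- arrival time of each streamed chunk
    token_count  -- completion tokens reported in the final usage object
    """
    ttf_chunk = chunk_times[0] - start
    total_time = chunk_times[-1] - start
    return ttf_chunk, token_count / total_time

# Hypothetical trace: request at t=0, first chunk at 0.85s, last at 2.98s
ttf, tps = stream_stats(0.0, [0.85, 1.5, 2.2, 2.98], 1102)
print(f"TTF Chunk: {ttf:.2f}s / TPS: {tps:.2f}")  # TTF Chunk: 0.85s / TPS: 369.80
```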

&lt;blockquote&gt;
&lt;p&gt;A side note. Mercury provides an OpenAI-compatible chat completions API (but who doesn't these days...). It turned out it supports &lt;a href="https://platform.inceptionlabs.ai/docs" rel="noopener noreferrer"&gt;streaming&lt;/a&gt; responses, and that was a surprise for me. The key differentiator of dLLMs from traditional LLMs is that they produce a large block of text and then gradually change parts of it (the diffusion effect), rather than spill out tokens one by one. I don't see how streaming can be implemented in that case except by completing the full generation in the backend and then simulating the streaming (at a significant slowdown).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For comparison, recently I got my hands on GPT 4.1 Nano. With the same prompt, it produced a reply of similar size (1000 tokens), and the speed fluctuated between 150 and 340 tokens per second:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gpt-4.1-nano-2025-04-14 / 1096-1251 tokens
TTF Chunk: 0.31s / TPS: 341.31
TTF Chunk: 0.45s / TPS: 305.91
TTF Chunk: 0.34s / TPS: 268.98
TTF Chunk: 0.36s / TPS: 200.13
TTF Chunk: 0.45s / TPS: 132.89
TTF Chunk: 0.45s / TPS: 150.13
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;I didn't use the model as a daily driver (in IDE or as a chatbot) and would not pretend to have a comprehensive answer to how good it is...&lt;/p&gt;

&lt;p&gt;First of all, I ran a few prompts in chat, and it seemed OK, generating code snippets and giving basic answers.&lt;/p&gt;

&lt;p&gt;Then I ran it through my own &lt;a href="https://maxim-saplin.github.io/llm_chess/" rel="noopener noreferrer"&gt;LLM Chess&lt;/a&gt; eval. It landed in the middle-bottom of the list: a typical SLM struggling to adhere to the game protocol, making on average 10 moves before breaking the game loop:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfe51gsnts2iuz7w7hkb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfe51gsnts2iuz7w7hkb.png" alt="Mercury Coder Small LLM Chess" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Note: the token accounting in the screenshot is off - the model is not as verbose.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The major conclusion is that the new tech doesn't solve the old problem of LLM hallucinations. The common failure mode is yielding illegal moves. See in the example dialog traces below how the model gets the list of legal moves and still fails to pick one (making up a combination not on the list):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--------------------------------------------------------------------------------
Proxy (to Player_Black):

g8h8,f8e7,f8d6,f8c5,f8b4,f8a3,e8e7,d8e7,d8f6,d8g5,d8h4,c8b7,b8c6,b8a6,e4f6,e4d6,e4g5,e4c5,e4g3,e4c3,e4f2,e4d2,h7h6,g7g6,f7f6,d7d6,c7c6,a7a6,h7h5,g7g5,f7f5,d7d5,c7c5,a7a5

--------------------------------------------------------------------------------
Player_Black (to Proxy):

Thank you for providing the list of legal moves. After analyzing the position and the available moves, I'll make a move to develop my pieces and control the center of the board. I'll move my knight from g8 to h6. Please provide the command 'make_move g8h6'.

--------------------------------------------------------------------------------
Proxy (to Player_Black):

Failed to make move: illegal uci: 'g8h6' in rnbqkbr1/pQpp1ppp/8/4p3/2P1n3/N4P1N/PP1P2PP/R1B1KB1R b KQq - 0 11
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
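&lt;p&gt;Notably, this failure is trivially machine-checkable: the proxy hands the model the full legal-move list, so validating a reply is a single set lookup. A sketch using the moves from the trace above (the &lt;code&gt;check_move&lt;/code&gt; helper is mine, not the harness's actual code):&lt;/p&gt;

```python
# The legal-move list the Proxy sent in the trace above
legal_uci = set(
    "g8h8,f8e7,f8d6,f8c5,f8b4,f8a3,e8e7,d8e7,d8f6,d8g5,d8h4,c8b7,"
    "b8c6,b8a6,e4f6,e4d6,e4g5,e4c5,e4g3,e4c3,e4f2,e4d2,h7h6,g7g6,"
    "f7f6,d7d6,c7c6,a7a6,h7h5,g7g5,f7f5,d7d5,c7c5,a7a5".split(",")
)

def check_move(uci):
    """Return True when the move is on the list the proxy provided."""
    return uci in legal_uci

print(check_move("g7g6"))  # True  - a move from the list
print(check_move("g8h6"))  # False - the hallucinated knight move
```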



&lt;p&gt;I have also run the model through &lt;a href="https://github.com/maxim-saplin/LiveBench/blob/experiments/NOTES.md" rel="noopener noreferrer"&gt;LiveBench&lt;/a&gt; (public dataset from November 2024) which I had readily configured for local runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;category                              average  coding  data_analysis  instruction_following  language  math  reasoning
model
google_gemma-3-27b-it@iq4_xs             50.7    36.9           52.8                   82.1      31.9  53.5       47.3
gpt-4.1-nano-2025-04-14                  42.7    40.6           46.0                   60.0      24.0  46.8       39.1
gemma-2-27b-it@iq4_xs                    39.8    36.6           48.1                   67.6      29.5  25.0       32.0
mercury-coder-small                      35.9    34.4           44.7                   53.2      12.4  35.1       35.8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Any Good?
&lt;/h2&gt;

&lt;p&gt;The model doesn't impress with its smarts. As for speed - that's something we've seen already: Groq and Cerebras have been serving popular open-source models such as Llama, Gemma, Qwen, etc. for quite a while, using their custom hardware and boasting thousands of tokens per second.&lt;/p&gt;

&lt;p&gt;At the same time, I don't think Mercury is a failure. It's a win. They've built a capable model that qualifies as a general-purpose chatbot. And they did it leveraging completely new ideas (in LLMs at least). There's no need for custom hardware to run diffusion models at insane speeds, while Groq/Cerebras might find it difficult to find applications for their LPUs beyond transformer model inference.&lt;/p&gt;

&lt;p&gt;The model is at a 2023 level in terms of performance. I'm looking forward to new increments; if they can match the SOTA models of 2024-2025, this will bring low-cost, high-speed LLM inference.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>genai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Llama 4 - 10M Context? Coding? Decent Follow-up?</title>
      <dc:creator>Maxim Saplin</dc:creator>
      <pubDate>Tue, 08 Apr 2025 19:04:53 +0000</pubDate>
      <link>https://forem.com/maximsaplin/llama-4-10m-context-coding-decent-follow-up-426n</link>
      <guid>https://forem.com/maximsaplin/llama-4-10m-context-coding-decent-follow-up-426n</guid>
      <description>&lt;p&gt;Meta brought out the long-awaited &lt;a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/" rel="noopener noreferrer"&gt;Llama 4 models&lt;/a&gt; on Saturday, April 5. Llama 3 came out on April 26, 2024, so this year Meta came earlier - looks like a good sign. The newer models were trained on (!) 40T tokens, an almost 3x bump from the previous version's 15T training dataset. The release brought plenty of changes, including MoE architecture, multi-modality, huge context windows, a giant 2T version behind the scenes, and evaluations showing that the models match other SOTA models. And yet I didn't find the release exciting.&lt;/p&gt;

&lt;h2&gt;
  
  
  No Longer "Local LLaMA"
&lt;/h2&gt;

&lt;p&gt;None of the released models are of reasonable size. Llama 2 had 7B and 13B variants, and Llama 3 came with an 8B variant on day one. The name "llama" became synonymous with LLM hobbyists tinkering with local execution, and large communities formed around the Llama family of models... There are super popular projects that bear the name: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;llama.cpp&lt;/code&gt; is probably the most popular runtime for local models (not just Llama). There's a whole ecosystem with llama.cpp's native &lt;code&gt;GGUF&lt;/code&gt; model format&lt;/li&gt;
&lt;li&gt;Ollama might be the most popular tool to discover, download, and run &lt;code&gt;GGUF&lt;/code&gt; models locally.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LocalLLaMA&lt;/code&gt; is a subreddit with 426K users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In 2023 Llama models ignited open-source research in LLMs: papers were published, enthusiasts fine-tuned models on their gaming GPUs, and tinkering recipes flooded the internet. Alpacas, Vicunas, Wizards (and plenty of other fine-tunes) were published as open-weight models. At that time there was the famous &lt;a href="https://semianalysis.com/2023/05/04/google-we-have-no-moat-and-neither/" rel="noopener noreferrer"&gt;"We Have No Moat"&lt;/a&gt; leaked document questioning Google's ability to succeed in the LLM race against the open-source community (equipped with Llama)...&lt;/p&gt;

&lt;p&gt;With Meta boasting how the smallest 109B Llama 4 Scout can fit into a single $20k H100 GPU, I don't see how the new models can be treated as a follow-up to Llama 1, 2, and 3. As of now the lineage is broken, and there's no sign that truly local models are coming.&lt;/p&gt;

&lt;h2&gt;
  
  
  Long Context
&lt;/h2&gt;

&lt;p&gt;The new models promised super huge context windows at 1M and 10M tokens - something unseen in open models. There's also a claim of nearly perfect retrieval in the &lt;a href="https://arize.com/blog-course/the-needle-in-a-haystack-test-evaluating-the-performance-of-llm-rag-systems/" rel="noopener noreferrer"&gt;"Needle in the Haystack"&lt;/a&gt; test:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5rrpyuzwpjw48dbrmo8s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5rrpyuzwpjw48dbrmo8s.png" alt="LLama 4 Needle in the Haystack" width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few weeks ago I found an interesting take on context-window evals, one that tests models' comprehension - whether they can truly understand long texts, build mental representations and relations between subjects, and answer complicated questions - &lt;a href="https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/oQdzQvKHw8JyXbN87" rel="noopener noreferrer"&gt;Fiction.liveBench&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For some reason their version of the results is not sorted or colorized, so here's the April 6 version that I formatted:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6bsn68f1l3dkjcnj1n7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6bsn68f1l3dkjcnj1n7.png" alt="Fiction.liveBench, Llama 4" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The ranking puts Llama 4 at the bottom :(&lt;/p&gt;

&lt;h2&gt;
  
  
  Coding
&lt;/h2&gt;

&lt;p&gt;Another disappointment came from the &lt;a href="https://aider.chat/docs/leaderboards/" rel="noopener noreferrer"&gt;Aider benchmark&lt;/a&gt; - here the larger 400B Maverick model occupies the bottom of the leaderboard, right below Qwen2.5-Coder-32B-Instruct. Here are a few items from the leaderboard:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;claude-3-7 (no thinking): 60.4%&lt;/li&gt;
&lt;li&gt;claude-3-5-sonnet-20241022: 51.6%&lt;/li&gt;
&lt;li&gt;Qwen2.5-Coder-32B-Instruct: 16.4%&lt;/li&gt;
&lt;li&gt;Llama 4 Maverick: 15.6%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Yet if we look at the coding part of &lt;a href="https://livebench.ai/#/?Coding=a&amp;amp;organization=Meta%2CAnthropic" rel="noopener noreferrer"&gt;LiveBench&lt;/a&gt;, suddenly Maverick is not that bad - it's actually great! The model ranks above Claude 3.5 and 3.7, the Anthropic models that have earned a reputation as the best ones for programmers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5c7b7aciy3zxlcdf58v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5c7b7aciy3zxlcdf58v.png" alt="LLama 4 LiveBench Coding" width="800" height="762"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Benchmarks Skepticism
&lt;/h2&gt;

&lt;p&gt;After I created my own &lt;a href="https://maxim-saplin.github.io/llm_chess/" rel="noopener noreferrer"&gt;LLM Chess benchmark&lt;/a&gt; I care about benchmark scores much less :) My experience with LLMs can be summarised with a simple statement: "sometimes they work, sometimes they don't". LLM performance is very task-specific; the same model used inside an IDE can shine with one request and fail miserably with another.&lt;/p&gt;

&lt;p&gt;If one took the needle-in-the-haystack results for Llama 4 at face value, one would be convinced it is the best model for long prompts, supporting the largest context windows. The results coming from Fiction.liveBench would cross Llama 4 out as garbage. The same polar conclusions follow from evaluating coding abilities based on the Aider and LiveBench figures.&lt;/p&gt;

&lt;p&gt;There has been plenty of criticism towards evals. People complain that they don't tell the full story and do not reflect real-life experience. E.g. there is valid criticism that HumanEval may have been a good coding benchmark in 2021, at the dawn of chat models, yet it is barely relevant to the day-to-day routine of software engineers. Many evals are just fixed questions with well-known answers, and LLMs are great at memorizing - there is a valid point questioning any static "question - right answer" kind of benchmark... and suspecting AI labs of "overfitting" (or, simply put, cheating) on benchmarks - after all, there are huge incentives to score great in those press releases.&lt;/p&gt;

&lt;p&gt;LMArena (aka LMSYS, Chatbot Arena) is often one of the key metrics that AI shops boast upon releases - a rating score reflecting human preferences. Yet for quite some time this score has been just noise for me... And &lt;a href="https://arxiv.org/html/2409.12822v1" rel="noopener noreferrer"&gt;this study&lt;/a&gt; demonstrated how easy it is to trick humans by over-optimizing via RLHF.&lt;/p&gt;

&lt;p&gt;By the way, there's a sketchy story here: a loud intro of Llama 4 as the &lt;a href="https://x.com/lmarena_ai/status/1908601011989782976" rel="noopener noreferrer"&gt;2nd best in the arena&lt;/a&gt;... followed by users suspecting something was off and an &lt;a href="https://x.com/lmarena_ai/status/1909397817434816562" rel="noopener noreferrer"&gt;official clarification&lt;/a&gt; that the arena's Llama 4 &lt;strong&gt;(a)&lt;/strong&gt; was neither the Scout nor the Maverick version that was released and &lt;strong&gt;(b)&lt;/strong&gt; was specifically tuned for human preferences (to score higher). Not cheating - "optimising for human preferences", as they say :)&lt;/p&gt;

&lt;h2&gt;
  
  
  P.S.
&lt;/h2&gt;

&lt;p&gt;I have my reservations regarding Fiction.liveBench. While I like the idea, there's little info on the implementation: no code, no tech report... There are many things that make my inner perfectionist unhappy: what is the token size of the 0-token story, how many questions are there, why not rank the models by average? If the same attention to detail characterises the actual implementation, I question the findings...&lt;/p&gt;

&lt;h2&gt;
  
  
  P.P.S.
&lt;/h2&gt;

&lt;p&gt;My &lt;a href="https://maxim-saplin.github.io/llm_chess/" rel="noopener noreferrer"&gt;LLM Chess Eval&lt;/a&gt; has recently surfaced similar contradictions in evals. Gemma 2 was one of the best open models on the leaderboard; it was consistent in the game loop, demonstrating solid instruction following - I would say the best of the smaller models. These results matched my own chatting experience, which put Gemma 2 models above Llama 3. The Gemma 3 release got me excited - I anticipated a successor that would surpass the previous model. And it didn't happen :)&lt;/p&gt;

&lt;h2&gt;
  
  
  P.P.P.S.
&lt;/h2&gt;

&lt;p&gt;Context window size has interested me for quite a while. Here's my article putting into perspective the token sizes of &lt;a href="https://dev.to/maximsaplin/gpt-4-128k-context-it-is-not-big-enough-1h02"&gt;various artifacts&lt;/a&gt;. I've been keeping an eye on related evals; unfortunately, there aren't many (the aforementioned needle in the haystack, RULER). My own intuition is that filling even 10% or 20% of the context can be too much and degrade LLM performance - I am one of those guys who starts a clean chat for every small conversation.&lt;/p&gt;

&lt;p&gt;When you check the leaderboards you can notice that there's almost never 100% accuracy in retrieval, or it drops right away, e.g. at 1k tokens. Take for example &lt;a href="https://github.com/NVIDIA/RULER" rel="noopener noreferrer"&gt;RULER&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygil3iodx9ub3h5tayzj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygil3iodx9ub3h5tayzj.png" alt="RULER LLM Bench" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When I use tools such as Perplexity, DeepResearch, or any RAG-based chatbot, I always have this at the back of my mind: the anticipation that there are plenty of inaccuracies and missed bits in the generated responses.&lt;/p&gt;

&lt;p&gt;Here's &lt;a href="https://www.linkedin.com/posts/maxim-saplin_these-days-tools-like-perplexity-glean-activity-7311434030128726016-uYFC?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAAAY52t4BLtN4gJKk-YVpWKb4ZkU3sVysR8w" rel="noopener noreferrer"&gt;my story&lt;/a&gt; of the (unproductive) use of GPT-4.5 to convert the aforementioned Fiction.liveBench results into a table to colorize and sort in Excel (the screenshot I shared above).&lt;/p&gt;

</description>
      <category>genai</category>
      <category>llm</category>
      <category>ai</category>
      <category>localllama</category>
    </item>
  </channel>
</rss>
