Forem: Darko from Kilo

The Arrival of GPT-5.5: OpenAI’s New Deep-Thinking Powerhouse

Darko from Kilo — Mon, 27 Apr 2026 09:19:50 +0000

OpenAI recently rolled out GPT-5.5 and its heavy-duty sibling, GPT-5.5 Pro, and everybody wants to put them to the test.

If you feel like the model landscape is moving faster and faster, you're right. OpenAI's chief data scientist told TechCrunch this week that "the last two years have been surprisingly slow," but what he meant is that now we're really moving — now we're cooking with gas. And that's a good thing for consumers.

These SOTA models aren't just becoming smarter and more comprehensive, they're also becoming more token-efficient for larger tasks.

What's new?

GPT-5.5 is OpenAI's latest release for complex professional workloads, building on GPT-5.4 with stronger reasoning, higher reliability, and improved token efficiency on hard tasks.
GPT-5.5 Pro is OpenAI's high-capability model optimized for deep reasoning and accuracy on complex, high-stakes workloads.

Both new models are now available in the Kilo Gateway, and GPT-5.5 is one of our top recommended models out of the gate.

A New Standard for Complex Work

GPT-5.5 is particularly impressive when it comes to coding and reasoning, and the kind of computer-use and browser skills needed by always-on agents like KiloClaw:

Terminal-Bench 2.0 (Command-line workflows & tool coordination): 82.7% (vs. GPT-5.4: 75.1% | Claude Opus 4.7: 69.4%)
Expert-SWE (Internal long-horizon coding tasks ~20 hours): 73.1% (vs. GPT-5.4: 68.5%)
GDPval (Knowledge work across 44 occupations): 84.9% (vs. GPT-5.4: 83.0% | Claude Opus 4.7: 80.3%)
OSWorld-Verified (Operating real computer environments): 78.7% (vs. GPT-5.4: 75.0% | Claude Opus 4.7: 78.0%)
BrowseComp: 84.4% (GPT-5.5 Pro scores 90.1%)

But benchmarks are only half the story. We had the privilege of pre-testing the alpha release of GPT-5.5, and we're ready to share what this means for builders, agents, and the broader AI ecosystem. First of all, it's exciting to see OpenAI continuing to bridge the gap between execution and high-level strategy. Coming just two days after the release of GPT-5.4 Image 2, a stunning new image generation model for multimodal workflows, GPT-5.5 covers a lot of bases for professional workloads. This new model can transform how engineering teams scale their most complex autonomous workflows.

In our testing, GPT-5.5 has proven to be tremendously capable at long-context tasks and agentic coding. Where previous generation models would occasionally lose the plot during massive refactoring jobs or deep-reasoning requirements for large codebases, GPT-5.5 stays locked in.

More importantly for our ecosystem, it has become a formidable daily driver for KiloClaw as well as an excellent fit for getting a new claw up and running and exploring new use cases. We've been using it to run always-on agents handling highly complex, multi-step professional work, and the reliability jump is palpable.

As we noted in our recent deep dive comparing Claude Opus 4.7 and Moonshot's Kimi K2.6, the frontier of AI is fiercely competitive right now. While Opus 4.7 and Kimi K2.6 brought massive leaps in their own rights, GPT-5.5 introduces a new class of autonomous capability that specifically targets professional, high-stakes workflows where fewer retries and higher reliability directly translate to better outcomes.

GPT-5.5 is definitely crushing a wide range of benchmarks, which fits with our experience testing the model in Kilo Code and KiloClaw. Significantly, it topped the Artificial Analysis Intelligence Index by 3 points, breaking a three-way tie with Anthropic and Google.

In our testing, GPT-5.5 did have some issues with UI-related design tasks, but we found that more specific instructions helped resolve some of those problems.

So which one should you use?

GPT-5.5 is priced higher than GPT-5.4, reflecting its heavy-duty reasoning capabilities. OpenAI has pushed pricing up again with this release.

Even so, GPT-5.5 ($5 / Mtok input, $30 / Mtok output, $0.50 / Mtok cached input) is more approachable than the list price suggests. The 5.5 series is more token-efficient than 5.4. On hard tasks, that efficiency often means a lower actual cost per completed task, because the model tends to get it right on the first try without endless prompt engineering or retry loops.

GPT-5.5 often reaches higher-quality outputs with fewer retries, so it can be more token-efficient on real workflows even at higher reasoning effort. And good news for Kilo Coders: it's the most token-efficient at coding workflows.
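To make the cost-per-completed-task argument concrete, here is a minimal arithmetic sketch. It uses the GPT-5.5 list prices quoted above. The token counts, the success rates, and the cheaper comparison model are invented for illustration only and are not measurements.

```typescript
// Price per million tokens, in USD.
interface Pricing {
  inputPerMtok: number;
  outputPerMtok: number;
}

// GPT-5.5 list prices quoted in this post.
const GPT_5_5: Pricing = { inputPerMtok: 5, outputPerMtok: 30 };

// Hypothetical cheaper model, used only for comparison (not a real price list).
const CHEAPER_HYPOTHETICAL: Pricing = { inputPerMtok: 2.5, outputPerMtok: 15 };

// Cost of a single attempt at a task.
function attemptCost(p: Pricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1_000_000) * p.inputPerMtok +
         (outputTokens / 1_000_000) * p.outputPerMtok;
}

// Expected cost per completed task, assuming each attempt succeeds
// independently with probability successRate (expected attempts = 1 / successRate).
function costPerCompletedTask(
  p: Pricing,
  inputTokens: number,
  outputTokens: number,
  successRate: number,
): number {
  if (successRate <= 0 || successRate > 1) {
    throw new RangeError("successRate must be in (0, 1]");
  }
  return attemptCost(p, inputTokens, outputTokens) / successRate;
}

// Assumed workload: 200k input tokens and 20k output tokens per attempt.
const firstTry = costPerCompletedTask(GPT_5_5, 200_000, 20_000, 1);
const withRetries = costPerCompletedTask(CHEAPER_HYPOTHETICAL, 200_000, 20_000, 1 / 3);
```

Under these made-up numbers, the model at half the list price ends up costing more per finished task once it needs three attempts on average. The real break-even point depends entirely on your own retry rates.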

We would also like to echo OpenAI's own advice here: "Higher reasoning can use more tokens, so customers should match reasoning effort to the task."

In-memory prompt caching is not supported for GPT-5.5. Caching for this model relies exclusively on extended prompt caching. During inference, the model caches tokens from previous requests directly on GPU-local storage.

Does it Claw?

We're excited to see what Kilo users around the world do with it. Like the new Opus, it's super smart. But is it too smart for daily tasks? Or will it become your daily driver?

My prediction is that GPT-5.5 will compete more directly with the latest Opus release for coding, but will serve more as the top-level agent driver in Hermes and OpenClaw workflows like KiloClaw, where sub-agents will likely need to use smaller models or OSS models to remain cost-efficient.

That said, the only way to really

Shell Security Plugin

Darko from Kilo — Mon, 27 Apr 2026 09:16:14 +0000

I ran openclaw security audit on my instance the other day and got back a wall of text. Six findings — one critical, three warnings, two informational. I stared at it for a minute, scrolled through the nested objects, and thought: "Okay, but what should I actually do about this?"

That's the gap the new Shell Security plugin fills. It takes that same audit output, sends the findings (not your secrets, not your config) to the KiloCode Security Advisor API, and gives you back a prioritized report with specific remediation steps. The whole thing happens in your chat — Telegram, Slack, the Control UI, wherever you talk to your agent.

What it does

The plugin is a thin bridge between two things that already exist:

openclaw security audit — the built-in CLI command that checks your local config for common security foot-guns (weak models without sandboxing, exposed runtime tools, missing trusted proxies, multi-user setups without isolation)
KiloCode's Security Advisor API — an endpoint that takes those findings and returns expert analysis with context-specific remediation guidance

The plugin runs the audit locally, packages the JSON output, and sends it off. What comes back is a markdown report that covers what was found, why it matters, and what to do about it — organized by priority.

Installing it

It's currently dev-only but will be released soon!

openclaw plugins install @kilocode/shell-security
openclaw plugins enable shell-security
openclaw gateway restart

The gateway restart is a one-time thing after install. If you're talking to your agent through Slack or Telegram, you'll see a brief connection blip and then it's back.

Two ways to run it

Slash command (recommended):

This runs the plugin directly and renders the full report. It bypasses the LLM's summarization layer entirely, so you get the complete output regardless of which model you're running.

Natural language:

You can also just say "run a security checkup" or "audit my OpenClaw config" and the agent will call the tool. One thing to know: if you're running a smaller model (Haiku, GPT-x-nano), it might paraphrase or truncate the report. Capable models like Sonnet or GPT's latest handle it fine. When in doubt, use the slash command.

First-run authentication

The first time you run it, the plugin prompts you to connect your KiloCode account through a device auth flow:

Open a URL in your browser
Enter a code
Sign in or create a free account
Run /security-checkup again

After that, the token is saved and you never see the auth flow again. There's a gateway reload on first auth (the plugin writes the token to your config), but subsequent runs are instant.

If you're running OpenClaw in CI or a container, you can skip the interactive flow entirely by setting KILOCODE_API_KEY as an environment variable.

What gets sent (and what doesn't)

This matters. Your OpenClaw instance has access to your filesystem, your API keys, your chat history. The plugin doesn't send any of that.

Sent:

The JSON output of openclaw security audit — finding IDs and summaries, no secrets
Your OpenClaw version and plugin version
Your instance's public IP (for optional remote probes)

Not sent:

Config file contents
API keys, secrets, or tokens
Chat history
Workspace files

Everything goes over HTTPS, authenticated with your KiloCode account token.

What the report looks like

On my instance, the report came back with findings grouped by severity — the critical one about small models running without sandboxing at the top, followed by the warnings about trusted proxies and multi-user heuristics, and then the informational items. Each finding includes context about why it's a risk and concrete steps to fix it.
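As a rough illustration of what "grouped by severity" means, the sketch below orders a list of findings so critical items come first. The `Finding` shape, the severity names, and the sample IDs are hypothetical. The real audit JSON and the Security Advisor response may look quite different.

```typescript
// Hypothetical finding shape; the actual audit output is not documented here.
type Severity = "critical" | "warn" | "info";

interface Finding {
  id: string;
  severity: Severity;
  summary: string;
}

// Lower rank sorts first.
const SEVERITY_RANK: Record<Severity, number> = { critical: 0, warn: 1, info: 2 };

// Returns a new array ordered critical -> warn -> info.
// Array.prototype.sort is stable, so findings keep their original
// relative order within each severity.
function prioritize(findings: readonly Finding[]): Finding[] {
  return [...findings].sort(
    (a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity],
  );
}

// Invented sample mirroring the counts in this post: 1 critical, 3 warnings, 2 info.
const sampleFindings: Finding[] = [
  { id: "info-1", severity: "info", summary: "example informational item" },
  { id: "warn-1", severity: "warn", summary: "example trusted-proxy warning" },
  { id: "crit-1", severity: "critical", summary: "example small-model sandboxing issue" },
  { id: "warn-2", severity: "warn", summary: "example multi-user warning" },
  { id: "info-2", severity: "info", summary: "example informational item" },
  { id: "warn-3", severity: "warn", summary: "example warning" },
];

const orderedFindings = prioritize(sampleFindings);
```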

It's... a lot of text right now. The formatting still needs work — the dev release is functional but not polished. There's also a bug where the KiloClaw call-to-action shows up even if you're already a KiloClaw user. These are known rough edges that'll get smoothed out before the stable release.

Why this is useful

Running openclaw security audit is already good practice. But JSON output requires you to interpret each finding yourself, look up what the check IDs mean, and figure out the right remediation. The Security Advisor layer turns those findings into specific guidance you can act on immediately.

For anyone running OpenClaw as a personal assistant (which is most of us), the security surface is real. Your agent has shell access, filesystem access, web browsing. A misconfigured model fallback or an unintended multi-user exposure means your agent could be manipulated by untrusted input. Having something that checks this and explains the results in plain language saves you from reading JSON and guessing at severity.

Current status

The npm package is live and the source is on GitHub under MIT license. A stable release is coming — the main work remaining is formatting improvements and fixing the conditional CTA logic.

Install it, run /shell-security, see what it finds. It takes about thirty seconds.

New VS Code Extension - Week Three: Memory, Stability, and Moving at Kilo Speed Into the Future

Darko from Kilo — Fri, 24 Apr 2026 08:16:10 +0000

Three weeks ago we GA'd the completely rebuilt Kilo Code extension for VS Code. Week one was about what we were hearing and what we were shipping. Week two was about addressing the most urgent feedback and bumps.

This week is about the two other areas of frequent feedback and challenges: memory usage on Windows and session stability under sustained use. Both are materially better now than they were a week ago. Neither is 100% fixed and "done" (we can see from open GitHub issues that some of you still hit rough edges), but the experience is significantly improved, especially on Windows when using Agent Manager.

Across the week we shipped 80+ Kilo PRs and merged three more upstream OpenCode releases.

Windows Memory: A Big Step Forward

This is the one we know has caused the most pain. Users on Windows reported the Kilo core process climbing into multiple GB of RAM within minutes of opening Agent Manager and staying there. A handful of you sent us heap snapshots (thank you), which helped us track down the root cause of some harder-to-reproduce issues.

The high-level story: Agent Manager was polling git status and diffs through the Kilo core subprocess, and on Windows the combination of IPC round-trips, diff payload sizes, and allocator behavior meant freed memory wasn't being returned to the OS cleanly. In v7.2.20 we've restructured that path (#9046) and made the extension much more careful about what it holds in memory:

Agent Manager's git work now runs directly in the extension host, not through the core process.
We cap how much of any single diff we'll read into memory, so opening a very large file no longer causes a spike the allocator can't recover from.
We also tuned the allocator on the core process itself to release memory back to the OS more promptly on Windows.
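The diff cap in the second point can be pictured roughly like this: read the diff in chunks and stop as soon as a budget is exhausted, rather than loading the whole thing and trimming afterwards. This is a minimal sketch, not the extension's actual code. `readCapped`, the chunk-reader callback, and the character-based limit are all assumptions made for illustration.

```typescript
interface CappedDiff {
  text: string;
  truncated: boolean; // true when the source had more data than the cap allowed
}

// Pulls chunks from nextChunk() until it returns null or the cap is reached.
// The cap is counted in UTF-16 code units (string length), not bytes.
// Once the cap is hit, no further chunks are requested.
function readCapped(nextChunk: () => string | null, maxChars: number): CappedDiff {
  if (maxChars < 0) throw new RangeError("maxChars must be non-negative");
  const parts: string[] = [];
  let used = 0;
  for (let chunk = nextChunk(); chunk !== null; chunk = nextChunk()) {
    const remaining = maxChars - used;
    if (chunk.length > remaining) {
      // Keep only what still fits, then stop reading.
      parts.push(chunk.slice(0, remaining));
      return { text: parts.join(""), truncated: true };
    }
    parts.push(chunk);
    used += chunk.length;
  }
  return { text: parts.join(""), truncated: false };
}
```

The point of the early return is that a very large file costs at most one chunk beyond the cap, so peak memory stays bounded regardless of file size.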

If you were running on a downgraded 5.x build because of memory issues, this is the release to come back on. If you're still seeing unbounded growth, please keep the issues coming — the heap-snapshot command we added this cycle (#9034) makes those reports much easier to act on.

Session Stability: Fewer Interruptions

The second theme was sessions getting interrupted mid-flow — usually recoverable by sending another message or re-opening the session/extension. Most of the reports we got traced back to a handful of specific state-machine edges, and those are now meaningfully better.

The one we heard about most often was sessions ending up stuck — most visibly when VS Code was closed while a suggestion prompt was still showing, which left the session permanently marked busy and any follow-up message queued forever. Sessions now go idle correctly while waiting on a suggestion response (#9199). A related set of stuck states around the end-of-plan flow — where "Start new session" and "Continue here" didn't reliably transition you into the handover session — also got fixed, so those buttons now move you into a new session that stays visibly busy until the handover summary lands (#9245, #9300).

Everyday chat behavior got a lot smoother too. The most common irritation was the chat view snapping back to the bottom while you were trying to read earlier context during a streaming response; that no longer happens, and scrolling back through long sessions now correctly reloads earlier history from the virtualized list (#9236, #9194). Switching between long sessions in Agent Manager — which used to briefly freeze the UI — is now near-instant, with the chat view self-healing if messages arrived while it was in the background (#8911). Smaller queue and layout fixes also landed around follow-up prompts and tool output interleaving.

Finally, a nice performance-and-stability win from the community: @IamCoder18 landed visibility-aware git polling plus resolution caching in Agent Manager's git stats poller (#8703), meaningfully reducing the number of git subprocesses the extension spawns on repos with many worktrees.

New Capabilities This Cycle

Stability was the priority, but we still shipped meaningful new capability:

Fork sessions from any user message — both in Agent Manager (#9207) and in the sidebar (#9244). Branch at any point without losing the original.
KiloClaw chat panel in VS Code — the KiloClaw group chat experience now lives directly inside the editor (#7960).
Folder @-mentions — reference a folder with @ and include its top-level file contents as context (#9023).
Autocomplete backend prewarm — inline completions are ready on the first keystroke without having to open the Kilo sidebar first, and autocomplete state refreshes when workspace folders change (#9305).
Heap snapshots from the Command Palette — capture a snapshot of the bundled Kilo core directly from VS Code (#9034).
"Contribute on GitHub" CTA in Marketplace — a subtle footer link inviting contributions of new skills, modes, and MCP servers (#9099).

Upstream OpenCode

Three more OpenCode upstream releases merged this cycle — v1.4.4, v1.4.5, and v1.4.6 — bringing continued improvements to session sync, provider compatibility, Windows terminal handling, and the underlying AI SDK layer. Building on a shared open-source foundation continues to pay off: work from the broader OpenCode community lands in Kilo automatically.

Codebase Indexing Progress

Community contributor @shssoichiro's codebase indexing work (#6966) remains active. The branch is being kept current against main, review iterations are ongoing, and we're closing in on a form we can land. This is a substantial feature and we want to get it right — thank you for the sustained effort here.

Community Update

Some numbers and names from this cycle:

80+ PRs merged on top of the upstream OpenCode work.
3 upstream OpenCode releases merged — v1.4.4, v1.4.5, and v1.4.6.
Multiple stable releases promoted to the marketplace through the period, with v7.2.20 as the current stable.

Thank you to community contributors whose work landed or continued this cycle:

@shssoichiro — continued work on codebase indexing (#6966).
@IamCoder18 — visibility-aware git polling in GitStatsPoller (#8703).

And broad thanks to every community member who filed heap snapshots, reproduction steps, Discord reports, and sustained the long-running Windows performance thread (#8030). That conversation is the reason we had the signal we needed to tackle the memory work head-on this week.

Moving at Kilo Speed Into the Future

This is the last of the regular weekly updates in this series. The core issues that we highlighted in Week 1 — rate limiting, Plan/Ask strictness, human-in-the-loop controls, config resilience, and Windows memory — are either resolved or meaningfully better. We will continue to focus on smoothing out the rough edges in the near future.

We will also be driving Kilo further towards the vision of where agentic coding is going, enabling engineering teams to ship at Kilo Speed safely and confidently, faster than ever before. We are excited about this future and believe that the new V7 is on a strong foundation to build on. Agent Manager continues to improve for those who like to run multiple agent sessions in parallel, and will only become more useful as models continue to improve and become more capable and need less oversight. And when a particular change or workstyle requires closer agent supervision and pair programming, you can do that too. The AI landscape is evolving quickly and models keep advancing, and the tools we use need to keep pace.

To everyone who showed up over these three weeks — the issue filers, the PR authors, the Discord commenters, the prerelease testers, the heap-snapshot senders, and the folks who point to the future with feature requests — thank you. Your feedback, issues, and pull requests are genuinely what makes this community great. We value every piece of it, and we'll keep making the extension better because of it.

See you in the release notes.

— Josh and Mark

Move at Kilo Speed.

The future of Product Managers

Darko from Kilo — Thu, 23 Apr 2026 12:54:39 +0000

A product leader we know has 15 years of experience shipping developer tools. He spent a decade at a household name. He is, genuinely, one of the best product minds we've encountered in this industry.

He can't get a conversation for a group PM role.

That is a signal, not a market blip.

We've spent a lot of time talking about what AI is doing to engineers – how one developer with the right tools now ships what used to require a team of five. But we had an adjacent question: what happens to product managers?

Shipping isn't a funnel anymore

For years, software development worked like a funnel. PMs turned customer insights into specs. Engineers turned specs into code. The funnel created a natural place for the PM to sit – upstream, owning the translation layer.

Shipping was expensive. So you needed someone to decide what was worth shipping.

That's no longer true. Shipping is close to free now. So what is a PM's role now that the funnel has collapsed and PMs aren't filtering a very costly resource (engineering time)? Is there still a place for PMs in this new world?

As former PMs ourselves, we're watching this shift from two very different vantage points. At Kilo, there are about 40 people and one PM. We operate with a WAUzer (Weekly Active User) model – every engineer owns a single product area and is accountable for the weekly active users in that area. Every Monday, Evgeny would stand up for two minutes: here's what I did on cloud agents, here are the numbers, here's my target for next week. He was fast. He was accountable. And across those product areas, we saw roughly 10% week-over-week growth.

The product hat shifted to engineers. And it worked.

But it didn't work everywhere – the VS Code extension had too much surface area for one engineer to own clearly. So we brought in Josh. He runs a pod. He decides what gets built. Traditional PM model.

At Solo (Asher's company), it's just two people – one developer – moving at a pace that would have required a team of 10 three years ago. No PM at all. No coordination layer. The product question and the building question sit with the same person.

Two different experiments. Same conclusion forming.

It's always been vibe coding

"PMs were the original vibe coders. We wrote the spec, and the engineers were our LLMs."

That framing came out of a conversation between us. Because if the spec-to-code handoff is getting absorbed by AI tooling – if engineers can hold the product context and build without a translation layer – then the PM role has to move. The question is where.

We see two paths forward.

Path one: shift left toward go-to-market. The thing that's genuinely hard, even in an AI-native company, is knowing what to build. Not technically – but commercially. What will people pay for? What problem are we actually solving? Who is the buyer, and do we have them before we build?

That's where PMs might land. Not writing specs, but sitting closer to sales, customer research, and market discovery to orchestrate the product strategy and business rationale for building a feature. A big portion of the PM's role will be saying no to features to prevent bloat, and identifying customers who are willing to pay for a feature before it gets built.

Path two: the long thin layer – engineers who wear the product hat. Each engineer owns their area completely. Customer conversations, support, metrics, roadmap decisions – all of it. No handoff, no telephone game.

The upside is accountability. The downside is that it requires people who can go wide – technically sharp AND commercially minded AND customer-facing. That's a rare profile. And at some point, a customer doesn't want your one thin area. They want the whole package.

Both paths are real. You'll see companies betting on each.

The traditional shipping funnel is gone. It's dead in startups now and will die in F100s over the next 5 years. The people who figure out the new shape of product ownership – whether that's engineers, PMs who've shifted left, or something we don't have a name for yet – are the ones who'll be standing in three years.

The senior product leader we mentioned will land somewhere. His experience is real. But the role he's looking for may not look like what it used to. The best thing any PM can do right now is stop waiting for the old model to come back and start experimenting with new models.

Developers are working in the future. PMs need to join them.

We Gave Claude Opus 4.7 and Kimi K2.6 the Same Workflow Orchestration Spec

Darko from Kilo — Thu, 23 Apr 2026 12:47:55 +0000

Kimi K2.6 launched on April 20, 2026, four days after Anthropic released Claude Opus 4.7. We gave both models the same spec for FlowGraph, a persistent workflow orchestration API with DAG validation, atomic worker claims, lease expiry recovery, pause/resume/cancel, and SSE event streaming. Then we reviewed the code and reproduced the edge cases the models' own tests did not cover.

TL;DR: Claude Opus 4.7 scored 91/100 and Kimi K2.6 scored 68/100 on the same build. Kimi K2.6 reached 75% of Claude Opus's score at 19% of the cost, but the 23-point gap sits in lease handling, scheduling, and live streaming (the parts its own tests never exercised).

Pricing

Claude Opus 4.7 runs at roughly 5x the input cost and 6x the output cost of Kimi K2.6. That is the gap we wanted to pressure-test.

Why a Workflow Orchestration Spec

A workflow engine runs jobs like a nightly settlement: fetch captured payments, charge customers, send receipts, publish analytics. Four steps with dependencies between them, retries when a step fails, and recovery when a worker crashes mid-step. Temporal, Airflow, and AWS Step Functions all solve the same problem at different scales.
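The nightly settlement example can be written down as a small dependency graph. The sketch below validates such a graph and returns one valid execution order using Kahn's algorithm. The step IDs and the exact dependency edges are assumed here for illustration (the example only says the four steps depend on one another), and `topologicalOrder` is a hypothetical helper, not part of the FlowGraph spec.

```typescript
interface StepDef {
  id: string;
  dependsOn: string[];
}

// Returns step ids in an order where every step comes after its dependencies.
// Throws on duplicate ids, unknown dependencies, or cycles.
function topologicalOrder(steps: readonly StepDef[]): string[] {
  const indegree = new Map<string, number>();
  const dependents = new Map<string, string[]>();

  for (const s of steps) {
    if (indegree.has(s.id)) throw new Error(`duplicate step id: ${s.id}`);
    indegree.set(s.id, 0);
    dependents.set(s.id, []);
  }
  for (const s of steps) {
    for (const dep of s.dependsOn) {
      const list = dependents.get(dep);
      if (list === undefined) throw new Error(`unknown dependency: ${dep}`);
      list.push(s.id);
      indegree.set(s.id, (indegree.get(s.id) ?? 0) + 1);
    }
  }

  // Start with every step that has no unmet dependencies.
  const ready: string[] = [];
  indegree.forEach((n, id) => {
    if (n === 0) ready.push(id);
  });

  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift() as string;
    order.push(id);
    for (const next of dependents.get(id) ?? []) {
      const n = (indegree.get(next) ?? 0) - 1;
      indegree.set(next, n);
      if (n === 0) ready.push(next);
    }
  }

  // Any step left unprocessed is part of a cycle.
  if (order.length !== steps.length) throw new Error("cycle detected");
  return order;
}

// Assumed shape of the nightly settlement workflow.
const settlement: StepDef[] = [
  { id: "fetch_payments", dependsOn: [] },
  { id: "charge_customers", dependsOn: ["fetch_payments"] },
  { id: "send_receipts", dependsOn: ["charge_customers"] },
  { id: "publish_analytics", dependsOn: ["charge_customers"] },
];

const settlementOrder = topologicalOrder(settlement);
```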

Most of our API comparisons test a wide range of skills (architecture, auth, filtering, error handling). For this test we wanted a single deep build where correctness was the main axis. A workflow engine with DAG validation, atomic step claims, lease expiry recovery, retry scheduling, and pause/resume/cancel semantics has objectively right and wrong answers. Either two workers can win the same step or they can't. Either an expired lease is recovered or it isn't. Either a step becomes runnable when its dependencies succeed or it doesn't.

The spec also calls out at-least-once execution, deterministic scheduling across all eligible steps, and SQLite as the source of truth. The full spec is 1,042 lines and covers 20 endpoints across workflow definitions, runs, workers, events, health, and metrics.

The Prompt

We ran both tests in Kilo CLI and gave both models the same prompt:

"Read @spec.md and build the project in the current directory. Treat @spec.md as the source of truth. Do not simplify this into a mock, toy app, or basic CRUD scaffold. Create all code, configuration, Prisma schema, tests, and README needed for a runnable project. Work autonomously and continue until the implementation is complete. Before you finish, install dependencies, run the test suite, fix any failures you can reproduce, and make sure the project is runnable."

Claude Opus 4.7 ran on high thinking mode. Kimi K2.6 ran on thinking mode. Each model worked in its own empty directory with no shared state.

What Each Model Produced

Claude Opus 4.7 finished in about 20 minutes. Kimi K2.6 took longer on the clock, but we are not scoring elapsed time here. Kimi K2.6 was released the day of this test and provider availability is still limited. Wall-clock comparisons against a model as well-supported as Claude Opus 4.7 would distort the picture. Expect that gap to close as more providers host Kimi K2.6.

Both models delivered the project shape we asked for:

Prisma with SQLite as the source of truth
Hono routes for workflow definitions, runs, worker actions, events, health, and metrics
Conditional updateMany for step claiming
Retry and lease-expiry scheduling
A RunEvent table for audit logs
Readmes with setup instructions and at-least-once execution notes
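The conditional-update claim pattern in the list above is worth spelling out, since it is what makes a claim atomic. The expected current state goes into the update's filter, and the caller checks how many rows changed (Prisma's `updateMany` reports this as a count). The sketch below models the idea with an in-memory map instead of a database, so `conditionalClaim`, `tryClaim`, and the row shape are illustrative assumptions rather than either model's code.

```typescript
type StepStatus = "pending" | "running" | "succeeded";

interface StepRow {
  id: string;
  status: StepStatus;
  workerId: string | null;
}

// In-memory stand-in for:
//   UPDATE steps SET status = 'running', workerId = ?
//   WHERE id = ? AND status = 'pending'
// Returns the number of rows changed, like an affected-row count.
function conditionalClaim(
  table: Map<string, StepRow>,
  stepId: string,
  workerId: string,
): number {
  const row = table.get(stepId);
  if (row === undefined || row.status !== "pending") return 0;
  row.status = "running";
  row.workerId = workerId;
  return 1;
}

// A worker owns the step only if its own update changed exactly one row.
function tryClaim(table: Map<string, StepRow>, stepId: string, workerId: string): boolean {
  return conditionalClaim(table, stepId, workerId) === 1;
}
```

With a real database the check and the write happen in one statement, so two workers racing for the same step cannot both see a count of 1.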

Both Models Said Their Tests Passed

Claude Opus 4.7 ran 31 tests across 6 files. Every test passed. Kimi K2.6 ran 20 tests inside a single file. Every test passed.

If we had stopped there, the two implementations would look close. They weren't. A direct code review plus targeted reproductions against isolated SQLite databases surfaced one real bug in Claude Opus 4.7 and six in Kimi K2.6. We will show each one with the line that causes it.

Claude Opus 4.7: One Real Bug

Multi-expired lease recovery leaves retryable siblings on a failed run

The spec says that when a step exhausts retries, the parent run fails and every other non-terminal step becomes blocked. Claude Opus 4.7's recovery path handles this correctly for a single expired lease. With two expired leases in the same recovery pass, it can undo its own block.

In src/services/workers.ts, runRecovery() loads every expired running step into memory and iterates:

If the first iteration exhausts retries for one step, failRunDueToDeadStep() fires, the run becomes failed, and every other non-succeeded step is set to blocked. That is correct.

The problem is the second iteration. handleLeaseExpiry() updates by id only:

There is no guard on status, so a step that was just marked blocked by the prior failure cascade gets updated back to waiting_retry.

We reproduced it with a run containing two expired running steps: a with maxAttempts = 1 and b with maxAttempts = 2. After recovery:

Step b should have been blocked because the run had already failed. Instead it is eligible to be claimed again on the next /workers/claim call.

Claude Opus 4.7's test suite does not cover this case. It tests single-step lease expiry in isolation.
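To show the shape of this bug without reproducing the model's code, here is a small in-memory simulation. Everything in it is a hypothetical reconstruction: the `Run` and `Step` types, the `dead` status name, and the `recover` function are assumptions, and the `guarded` flag stands in for adding a status condition to the update.

```typescript
type Status = "running" | "waiting_retry" | "blocked" | "dead" | "succeeded";

interface Step {
  id: string;
  status: Status;
  attempts: number;
  maxAttempts: number;
}

interface Run {
  status: "running" | "failed";
  steps: Step[];
}

// Failure cascade: the run fails and every other non-succeeded step is blocked.
function failRun(run: Run, deadStepId: string): void {
  run.status = "failed";
  for (const s of run.steps) {
    if (s.id === deadStepId) s.status = "dead";
    else if (s.status !== "succeeded") s.status = "blocked";
  }
}

// Recovery pass over steps whose leases are assumed to have expired.
// The list of expired steps is snapshotted up front, as described above.
function recover(run: Run, guarded: boolean): void {
  const expiredIds = run.steps.filter((s) => s.status === "running").map((s) => s.id);
  for (const id of expiredIds) {
    const step = run.steps.find((s) => s.id === id);
    if (step === undefined) continue;
    // Guarded variant: re-check the current status before touching the step,
    // the in-memory equivalent of "WHERE id = ? AND status = 'running'".
    if (guarded && step.status !== "running") continue;
    if (step.attempts >= step.maxAttempts) failRun(run, step.id);
    else step.status = "waiting_retry";
  }
}

// Same setup as the reproduction: a has maxAttempts = 1, b has maxAttempts = 2.
function makeRun(): Run {
  return {
    status: "running",
    steps: [
      { id: "a", status: "running", attempts: 1, maxAttempts: 1 },
      { id: "b", status: "running", attempts: 1, maxAttempts: 2 },
    ],
  };
}
```

Calling `recover(makeRun(), false)` leaves step b in `waiting_retry` on a failed run, which is the behavior described above. With `guarded` set to `true`, b stays `blocked`.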

Smaller contract risks

Two smaller issues turned up in review but did not need a full reproduction.

The claim path reads maxClaims * 10 candidates. That is fine most of the time, but a queue with many skipped candidates at the front can hide valid work farther down the ordered list.
The SSE stream subscribes after replay finishes and treats an unknown afterEventId as "replay everything." The spec does not define unknown-cursor behavior explicitly, so this is more a looseness than a bug.

Kimi K2.6: Six Confirmed Issues

1. Claim ordering is not global across runs

The spec requires that when multiple steps are eligible, claim order is priority descending, then availableAt ascending, then createdAt ascending, across all eligible steps.

Kimi K2.6's claim loop orders steps inside each run, then iterates runs in whatever order the database returns them:

We reproduced this with two active runs on the same queue. One had a step at priority = 10. The other had a step at priority = 100. The call to POST /workers/claim returned the priority 10 step first.
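The ordering rule from the spec (priority descending, then availableAt ascending, then createdAt ascending, across all eligible steps) translates into a single comparator applied to one combined list. This is a minimal sketch under assumed field types: timestamps are plain epoch milliseconds, and `EligibleStep` and `globalClaimOrder` are hypothetical names.

```typescript
interface EligibleStep {
  id: string;
  runId: string;
  priority: number;
  availableAt: number; // epoch milliseconds (assumed representation)
  createdAt: number;   // epoch milliseconds (assumed representation)
}

// Priority descending, then availableAt ascending, then createdAt ascending.
function compareForClaim(a: EligibleStep, b: EligibleStep): number {
  if (a.priority !== b.priority) return b.priority - a.priority;
  if (a.availableAt !== b.availableAt) return a.availableAt - b.availableAt;
  return a.createdAt - b.createdAt;
}

// Sorts ALL eligible steps together, regardless of which run they belong to.
function globalClaimOrder(steps: readonly EligibleStep[]): EligibleStep[] {
  return [...steps].sort(compareForClaim);
}
```

Sorting per run and then walking runs in database order, as in the reproduction above, gives a different result whenever the highest-priority step is not in the first run returned.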

2. SSE is replay-only, not live

The spec requires that GET /runs/:id/events/stream replays stored events and then switches to live streaming.

Kimi K2.6's stream reads every persisted event, writes them to the stream, and then starts a keepalive timer. Nothing subscribes to new events. The file src/lib/events.ts even defines an emitAndBroadcast function and a subscriber map, but the route never wires to them:

Clients receive replayed history once, then silence. The README still claims live streaming.
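One way to get replay-then-live behavior is to subscribe before replaying, buffer anything that arrives during the replay, and then flush the buffer while skipping duplicates. The sketch below shows that pattern with an in-memory bus. `EventBus`, `streamEvents`, and the assumption that event IDs increase monotonically are all illustrative and unrelated to the `emitAndBroadcast` helper mentioned above.

```typescript
interface RunEvent {
  id: number; // assumed to increase monotonically per run
  type: string;
}

type Listener = (e: RunEvent) => void;

class EventBus {
  private readonly listeners = new Set<Listener>();

  // Registers a listener and returns a function that removes it.
  subscribe(fn: Listener): () => void {
    this.listeners.add(fn);
    return () => {
      this.listeners.delete(fn);
    };
  }

  publish(e: RunEvent): void {
    this.listeners.forEach((fn) => fn(e));
  }
}

// Replays stored events, then keeps forwarding live ones.
// Returns an unsubscribe function to call when the client disconnects.
function streamEvents(
  loadStored: () => RunEvent[],
  bus: EventBus,
  send: (e: RunEvent) => void,
): () => void {
  let lastSentId = Number.NEGATIVE_INFINITY;
  let replaying = true;
  const buffered: RunEvent[] = [];

  // Deliver each event at most once, in id order.
  const deliver = (e: RunEvent): void => {
    if (e.id > lastSentId) {
      send(e);
      lastSentId = e.id;
    }
  };

  // Subscribe first so nothing published during the replay is lost.
  const unsubscribe = bus.subscribe((e) => {
    if (replaying) buffered.push(e);
    else deliver(e);
  });

  for (const e of loadStored()) deliver(e);
  replaying = false;
  for (const e of buffered) deliver(e);

  return unsubscribe;
}
```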

3. Expired leases can still be completed

The heartbeat endpoint rejects expired leases. The complete and fail endpoints do not. We reproduced this by claiming a step, forcing leaseExpiresAt into the past, and calling POST /step-runs/:id/complete:

The step was marked succeeded on an expired lease. The spec treats lease expiry as a failed attempt. A worker can crash, its lease can expire, recovery can schedule a retry for the next worker, and the original worker can still phone in a "success" afterwards.
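The missing check is small. A minimal sketch of a completion guard follows, assuming a hypothetical `StepRun` shape with a numeric `leaseExpiresAt` in epoch milliseconds. The result type and reason strings are invented for illustration and are not taken from the spec.

```typescript
interface StepRun {
  id: string;
  status: "running" | "succeeded" | "failed";
  workerId: string;
  leaseExpiresAt: number; // epoch milliseconds (assumed representation)
}

type CompleteResult =
  | { ok: true }
  | { ok: false; reason: "not_running" | "wrong_worker" | "lease_expired" };

// Marks the step succeeded only if the caller still holds a live lease.
function completeStep(step: StepRun, workerId: string, now: number): CompleteResult {
  if (step.status !== "running") return { ok: false, reason: "not_running" };
  if (step.workerId !== workerId) return { ok: false, reason: "wrong_worker" };
  // The same expiry test the heartbeat endpoint applies.
  if (step.leaseExpiresAt <= now) return { ok: false, reason: "lease_expired" };
  step.status = "succeeded";
  return { ok: true };
}
```

In a database-backed implementation the lease condition would normally live in the update's filter, so the check and the write cannot be separated by a concurrent recovery pass.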

4. "No active version" returns 404 instead of 409

The spec: if there is no active version and no explicit version, return 409.

Kimi K2.6 raises NOT_FOUND (404):

5. Validation is narrower than the spec

CreateRunSchema and CompleteSchema use z.record(z.any()) for input, metadata, and output. The spec allows arbitrary JSON payloads. A string, array, or number payload is rejected even though the spec accepts it.
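The difference is easy to see with a plain type guard. A record schema only accepts objects, while "arbitrary JSON" also includes strings, numbers, booleans, null, and arrays. The sketch below is dependency-free and hypothetical. It does not use zod, and it assumes the payload came from `JSON.parse`, so it contains no cycles.

```typescript
type JsonValue =
  | null
  | boolean
  | number
  | string
  | JsonValue[]
  | { [key: string]: JsonValue };

// Accepts any value that can appear in a JSON document.
function isJsonValue(v: unknown): v is JsonValue {
  if (v === null) return true;
  switch (typeof v) {
    case "string":
    case "boolean":
      return true;
    case "number":
      return Number.isFinite(v); // NaN and Infinity are not valid JSON
    case "object": {
      if (Array.isArray(v)) return v.every(isJsonValue);
      // Only plain objects count; class instances such as Date are rejected.
      const proto = Object.getPrototypeOf(v);
      if (proto !== Object.prototype && proto !== null) return false;
      const rec = v as Record<string, unknown>;
      return Object.keys(rec).every((k) => isJsonValue(rec[k]));
    }
    default:
      return false; // undefined, functions, symbols, bigints
  }
}

// The narrower rule: objects only, roughly what a record schema accepts.
function isObjectOnly(v: unknown): boolean {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}
```

A string, array, or number payload passes `isJsonValue` but fails `isObjectOnly`, which matches the rejections described above.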

6. The clean build path fails

npm test passes. npm run build does not:

package.json expects npm start to run node dist/index.js, so the documented build-and-start flow is broken on a clean checkout.

What Each Model Said About Itself

Both models produced end-of-run summaries claiming their implementations were complete and all tests passed. Both were technically true. Neither flagged the issues above.

Claude Opus 4.7's summary was mostly accurate. It described its recovery path, atomic claim pattern, and event persistence correctly. The one thing it missed was the multi-expired lease interaction.

Kimi K2.6's summary claimed deterministic global scheduling and live SSE streaming. Both of those claims are in the README too. The code does not deliver either.

"My tests pass" is not the same thing as "my implementation is correct." Both models understood the spec well enough to build most of it. Neither model wrote tests that would have caught its own worst behavior.

Scoring

We scored each model on the spec, weighted by how much each category mattered for a correctness-first workflow engine.

Claude Opus 4.7 lost points on the reproduced recovery bug, the bounded claim scan, and the SSE cursor fallback.

Kimi K2.6 lost points on the six confirmed issues above. The biggest hits are in recovery, scheduling, and streaming, which is exactly where the spec's hardest requirements live.

Cost vs Quality

Kimi K2.6 is about 4x cheaper per point. The missing 23 points are in step-leasing, scheduling, and event streaming, which is where the hardest spec requirements live. Those are the parts that separate "the endpoints exist" from "the system behaves correctly under load."
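The per-point arithmetic can be sketched as follows, using the run costs quoted in this post ($0.67 and $3.56) and the 23-point gap. The function names are ours, and the score of 90 is purely a placeholder to show the shape of the calculation, not a figure from our scoring.

```typescript
// Run costs quoted in this post, in USD.
const KIMI_RUN_COST = 0.67;
const OPUS_RUN_COST = 3.56;
const SCORE_GAP = 23; // points Kimi K2.6 trails by

function costPerPoint(runCostUsd: number, score: number): number {
  if (score <= 0) throw new Error("score must be positive");
  return runCostUsd / score;
}

// How many times cheaper per point Kimi K2.6 is, given Claude Opus 4.7's score.
function perPointAdvantage(opusScore: number): number {
  const kimiScore = opusScore - SCORE_GAP;
  return costPerPoint(OPUS_RUN_COST, opusScore) / costPerPoint(KIMI_RUN_COST, kimiScore);
}

// Raw cost ratio, before accounting for the score gap.
const rawCostRatio = OPUS_RUN_COST / KIMI_RUN_COST;

// 90 is a placeholder score for illustration, not a figure from this post.
const PLACEHOLDER_OPUS_SCORE = 90;
console.log(rawCostRatio.toFixed(2), perPointAdvantage(PLACEHOLDER_OPUS_SCORE).toFixed(2));
```

The raw run cost differs by roughly 5.3x. Dividing by score pulls that down, because the cheaper run also earned fewer points.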

Where Open-Weight Models Stand Right Now

This test sits inside a pattern we've been tracking for a while. MiniMax M2.7 matched Claude Opus 4.6's detection rate on our last three-part benchmark. GLM-5.1 scored five points behind Claude Opus 4.6 on our job queue spec. Kimi K2.6 landed 23 points behind Claude Opus 4.7 here on a harder spec, but still produced the right shape of the system on the first pass.

The gap on surface coverage has narrowed meaningfully over the last year. The gap on correctness inside hard code paths (lease recovery, cross-run scheduling, streaming semantics) is still there. For work where the bugs only show up under contention or mid-crash, frontier proprietary models are the safer choice today. For work where you need the scaffold, the tables, the endpoint surface, and a starting test suite, open-weight models like Kimi K2.6 are close enough that the price delta matters.

Kimi K2.6's current pricing ($0.95 / $4 per million tokens) is a starting point, not a floor. Moonshot AI releases open weights, which means Kimi K2.6 will end up hosted on multiple providers, with pricing and latency converging on whoever runs it most efficiently. That is already playing out with MiniMax M2.5, which became the #1 most-used model across every mode in Kilo Code in the months after release. Price competition tends to pull these numbers down further as more hosts come online.

Being open-weight also means you can self-host or fine-tune Kimi K2.6 if you have data residency requirements, custom workflows, or a cost profile that makes API-only models impractical at scale. That is not a capability Claude Opus 4.7 offers at any price.

None of that changes the correctness findings above. It does reframe them. At $0.67 with a careful review pass, Kimi K2.6 is a real option now. At $3.56 with fewer corrections needed, Claude Opus 4.7 is the safer call. Which trade-off wins depends on the work. A year ago, that choice did not really exist at this level of complexity.

Takeaways

For building the scaffold of a complex backend: Kimi K2.6 did well. It produced the right project shape, the right tables, the right endpoint surface, and a test suite that passed. For prototyping, exploring a design, or generating a starting point you plan to review carefully, the $0.67 run is a good deal.

For systems where state-machine correctness matters: Claude Opus 4.7 pulled clearly ahead. The two implementations look similar in shape but diverge in the code paths that are hard to test casually (lease expiry, cross-run ordering, SSE, expired-lease rejection). If the project needs to behave correctly when leases expire, when multiple runs compete for workers, or when events need to flow live to clients, Claude Opus 4.7's output is closer to something you could ship.

On trusting model self-reports: Both models said they were done. One was mostly right. The other had six spec-level issues in shipped code. "Tests pass" is a necessary signal. It is not a sufficient one for work this correctness-sensitive. A review pass plus a few targeted reproductions closed the gap between what the models said and what they actually built.

A Note on Kimi K2.6 Speed

Kimi K2.6 was released the day of this test. Provider availability is limited right now, so the current wall-clock timings understate the model's real speed. We saw similar adoption curves on previous open-weight releases from MiniMax and Z.ai as more providers came online. We expect Kimi K2.6's elapsed time (and its effective cost) to keep dropping as that happens.

Testing performed using Kilo Code, a free open-source AI coding assistant for VS Code and JetBrains with 2,300,000+ Kilo Coders.

Enterprise AI Has a Trust Problem. We’re Hearing It Firsthand.

Darko from Kilo — Thu, 23 Apr 2026 12:41:29 +0000

The last few weeks have been chaotic for anyone paying attention to the AI tooling market. Cursor is set to sell to SpaceX. Anthropic pulled the rug on subscription pricing for businesses. And in the middle of all that noise, our conversations with enterprise teams have been converging on the same frustrations.

The specifics differ by industry. The underlying problem is consistent: walled gardens and pricing uncertainty.

Their Ceiling Is Your Ceiling

Take infrastructure trust. A top-three auto manufacturer came to us because their developers were hitting Cursor rate limits and couldn't build while they waited for them to reset. That same company had a second concern, quieter but more significant: they suspected the frontier lab powering their primary tool had oversold capacity and was running into compute headroom issues.

Whether that was actually true didn't matter. The perception had already taken root. If your workflow depends on one lab's availability, their ceiling is your ceiling.

Then there's cost visibility. A Director of DevEx at one of the world's largest banks came to us because his developers had existing model agreements with frontier labs, negotiated at the enterprise level, and he wanted them to actually use those models instead of routing everything through a middleman — which isn't possible on vendor-locked tools. On top of that, the other tools he'd evaluated gave him no visibility into token-level costs. When you can't see what you're paying for, you're trusting a vendor's math on your own spend.

A platform engineer at one of the UK's largest retailers had a similar frustration: his colleague was evaluating a tool with an opaque credit system and finding that developers burned through credits fast when they asked what he called "some juicy questions of the codebase." They wanted powerful models, but they also wanted to know what those models were costing.

Routing and Compliance Shouldn't Be Optional

For others, the issue is routing and compliance. A healthcare software CEO was simultaneously in contract negotiations with two different vendors when he reached out. He wanted to know if there was a more open alternative before he signed with either, and was already writing his own model routing layer internally (a CEO, doing infrastructure work) because "the world changes too much to bet on any one solution."

A separate healthcare data company came to us for a specific technical reason: they work with PHI and can't route that data through outside vendor infrastructure, but they still need frontier models for tasks that don't touch patient data. They needed one tool that could route differently based on what was actually in the request. That's not an unusual ask. It's compliance.
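A minimal sketch of content-based routing, assuming each request arrives with a flag set by an upstream classifier or by the calling service. The route names, the request shape, and chooseRoute are hypothetical and are not Kilo's API.

```typescript
// Hypothetical route names and request shape, for illustration only.
type InferenceRoute = "self-hosted" | "frontier-gateway";

interface InferenceRequest {
  prompt: string;
  // Set upstream; undefined means the request was never checked.
  containsPhi?: boolean;
}

function chooseRoute(request: InferenceRequest): InferenceRoute {
  // Fail closed: anything not explicitly cleared stays on internal infrastructure.
  if (request.containsPhi === false) {
    return "frontier-gateway";
  }
  return "self-hosted";
}
```

The design choice worth noting is the default. Treating an unchecked request as if it contained PHI costs some model quality on those calls, but it avoids sending patient data out by accident.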

And then there's the on-prem and sovereignty tier. A defense contractor with CUI requirements told us that on-prem model routing wasn't optional, it was a contractual necessity. A cloud CTO asked for mixed inference on day one, with some calls going to self-hosted models, others to their existing AWS Bedrock commitments, and the rest through our gateway, because running models is literally his business and single-vendor inference lock-in was a risk he'd already mapped out. The platform engineer at the UK retailer liked the tool he'd been using personally for 18 months, but said plainly, "obviously I can't bring that to my work environment." He needed enterprise data controls with his company's own Bedrock models underneath.

The AI champion at a major fast food chain put it most directly: closed vendors are building something that looks a lot like OpenClaw but locked inside their own walled garden, and that's precisely why model-agnostic infrastructure matters to her. The capability isn't the moat. Who controls access to the models is.

The Data Backs This Up

We see this play out in our usage data too, and the numbers are striking. On an average day this month, Kilo users are actively running 348 different models. Yesterday, the top 10 by usage came from six different labs: MiniMax, StepFun, xAI, ByteDance, Anthropic, and NVIDIA. MiniMax was #1 by request volume. The three most popular models combined only covered half of all usage, and a full third of Kilo traffic goes to labs that most people wouldn't have recognized 18 months ago.

Nearly half of Kilo users run models from more than one lab in a given month, and that share grew from 29% to 46% over the last six weeks. Among organizational customers specifically, 42% used models from two or more labs in a single week, generating 1.1 million requests routed to 19 different labs. The number of labs with 1,000+ weekly active users on Kilo grew from 8 in January to 12 in April.

People also aren't just switching between projects. Yesterday, 15% of users routed to two or more models within a single hour. Power users average five labs a month. The average Kilo employee, who has every model available and no spend cap, draws from 5.7 labs per month. Even internally, with unlimited access, nobody settles on one lab. Multi-model isn't a power-user quirk anymore. It's becoming the default way developers work.

Cursor & SpaceX: The Cost of Structural Dependency

The Cursor/SpaceX deal is worth understanding through this lens. Cursor built a genuinely good product and still ended up in a position where the models at the core of their product were controlled by companies now competing directly against them. The $60 billion acquisition option and access to a million H100s are the cost of buying out of that structural dependency: training their own models so they're not reliant on infrastructure providers who also ship competing tools. That's not a Cursor problem. That's just what it costs to not be dependent on your competitors.

The auto manufacturer waiting on rate limits, the bank that can't see its token costs, the healthcare company that can't route PHI externally, the defense contractor with on-prem requirements, the retailer who loved a tool he couldn't bring to work. These are all expressions of the same structural problem. When you don't own the model layer, the decisions of whoever does become your constraints.

And as frontier labs move further into tooling, the likelihood of those constraints tightening only goes up. One enterprise customer said it plainly: "I do not like vendor lock-in. All the features that these big companies are making to try and lure you in and get vendor lock-in on their flagship models is not something I'm interested in."

He's not alone. The market is moving toward infrastructure that stays out of the way, routing intelligently to whatever model fits the task, showing you exactly what it costs, and not requiring you to trust a vendor's judgment about which models you should have access to. The walled garden is a bet that lock-in wins. Increasingly, the developers and enterprise teams we talk to are betting the other way.

Kilo is the all-in-one agentic engineering platform, open-source and model-agnostic. Install the VS Code extension or get started at app.kilo.ai.

Congratulations Cursor on being acquired by SpaceX!

Darko from Kilo — Wed, 22 Apr 2026 13:33:41 +0000

Cursor reportedly just sold for $60 billion. To SpaceX. Which already owns xAI.

When a coding tool gets acquired by an AI lab, users don't get more choices; they get fewer. This is a pattern. Anthropic pulled model access from Windsurf the moment acquisition talks with OpenAI became public. That is how this industry works. Every major lab wants to own the full stack: the model and the tool sitting on top of it. Control the tool, control what developers reach for every day. Control what they reach for, and you control which models win.

The endgame is lock-in. Your coding assistant becomes a distribution channel for whatever model the parent company needs to push.

SpaceX (@SpaceX), 10:11 PM · Apr 21, 2026:

SpaceXAI and @cursor_ai are now working closely together to create the world's best coding and knowledge work AI. The combination of Cursor's leading product and distribution to expert software engineers with SpaceX's million H100 equivalent Colossus training supercomputer will…

Think about what made Cursor worth using in the first place. It wasn't the interface alone. It was the fact that developers could reach for Claude when they needed deep reasoning, GPT-4o when they needed speed, and whatever else was best for the job at hand. That flexibility was the product. It was the reason engineers trusted it with real workflows.

SpaceX has an AI strategy. It's called xAI. They spent $1.25 trillion worth of equity absorbing it in February. You don't make that bet and then happily route Cursor users to Anthropic's models. You route them to Grok. You fund xAI's next release. You use Cursor as a growth lever for the model you already own. That's just how businesses work.

Cursor was under pressure. OpenAI (Codex) and Anthropic (Claude Code) both gained significant traction this year. They were embraced quickly. Cursor found itself fighting uphill on two fronts: defending market share from tools backed by the very labs whose models Cursor depended on, while also shopping for its next funding round. SpaceX was a lifeline. A $60 billion one.

But lifelines come with strings. The string here is that Cursor users are inheriting Elon Musk's AI roadmap, whether they asked for it or not.

There's also the Anthropic question, which is more immediate than the xAI consolidation story. Anthropic pulled access from Windsurf during acquisition talks. Not after acquisition. During. The logic is straightforward: if a competitor is about to buy a distribution channel that runs your models, why hand them more leverage? Cursor users who rely heavily on Claude should take that precedent seriously. It could move fast.

This matters beyond just Cursor. What's happening here is the broader consolidation of the AI coding market into vertically integrated stacks. OpenAI has its own coding product. Anthropic has its own coding product. Google has its own coding product. Each of those labs has clear incentives to favor tools they own or control. The independent, multi-model layer is shrinking.

Kilo doesn't have a model to sell. We have a tool to build. That means Opus 4.7 when it is best, GPT-4o when GPT-4o is best, Mistral Large 3 when it's the right tool for the job, and the next breakthrough model the moment it's available — whoever ships it. We have no incentive to steer you toward any particular model.

That's model freedom. It sounds simple because it is. Use the best tool for the job. Don't let your coding assistant's corporate parent make that decision for you.

The Cursor news is a reminder of why that principle matters.

The Elephant is Out of the Bag: Meet Ant Group's Ling-2.6-flash

Darko from Kilo — Wed, 22 Apr 2026 12:33:55 +0000

A short time ago, we announced Elephant, a 100B-parameter stealth model from a prominent open model lab.

The response from the Kilo community was fantastic. Across coding tasks, complex document parsing, and dynamic agentic workflows, your feedback was incredibly consistent. Elephant was extremely fast and capable. The speculation immediately took off on X and Discord. Was it a new proprietary model from a well-known tech giant? A highly tuned open-source derivative? A completely new architecture?

Today, it is time to address the Elephant in the room and take off the mask.

We're excited to officially reveal that the stealth model you've been using in your day-to-day coding workflows and agentic assistants is none other than Ant Group's Ling-2.6-flash.

After all, you can't spell Elephant without ant!

By releasing Ling-2.6-flash under a pseudonym, we wanted to let the model's performance speak entirely for itself, free from any pre-existing brand bias or market expectations. The community's blind tests confirmed what we already suspected: this model is an absolute powerhouse for developers building next-generation AI applications. With super fast inference from Novita, it was a win-win.

But the Kilo community didn't just test Elephant — you actively helped refine how it operates. During the stealth phase, we received some absolutely great community PRs improving the system prompts and fine-tuning the integration. Thanks to your collaborative optimizations, the model's performance on Kilo has been pushed even further, unlocking better instruction adherence and sharper contextual reasoning.

So, what exactly is under the hood of the model formerly known as Elephant?

Here is the official description of the newly unmasked powerhouse:

Introducing Ling-2.6-flash, an Instant model with 104B total parameters and 7.4B active parameters, built for real-world agents to deliver fast responses, strong execution, and high token efficiency — matching SOTA-class performance at similar scale while significantly reducing token usage across coding, document processing, and lightweight agent workflows.

This architectural balance is what makes it so agile. In an era where AI agents are expected to operate autonomously, process massive context windows, and return actionable code quickly, Ling-2.6-flash hits a sweet spot of intelligence and speed. You get the capacity of a 104B-parameter model, paired with the low latency and cost-effectiveness of activating only 7.4B of those parameters at a time.

Many are familiar with Ant Group's trillion-parameter model released at the end of 2025 — Ling-1T — which seemed designed to compete directly with DeepSeek-V3. This flash model is an intriguing refinement of those capabilities.

A major driver behind this agility is how the model was trained. Ant Ling models are specifically designed with Agentic RL (Reinforcement Learning). Because of this agent-first foundation, Ling-2.6-flash is fully compatible with OpenClaw. The best way to see how this works is to use KiloClaw, our hosted OpenClaw that's faster, easier and safer than anything else on the market. This empowers the model to go far beyond simple text generation and seamlessly handle complex agentic workflows, executing terminal operations, managing dynamic GUI interactions, and coordinating sophisticated tool calls.

Celebrate the Reveal: Free Ling-2.6-flash for an Entire Week

To celebrate this unmasking and thank you for your incredible contributions, we want to make sure everyone has the opportunity to experience the raw power of Ling-2.6-flash without any friction. You'll find it under the inclusionAI moniker, which is the name of Ant Group's Artificial General Intelligence (AGI) initiative.

Starting right now, Ling-2.6-flash is completely free to use in Kilo Code and KiloClaw for an entire week — with absolutely no limits. That's right. No rate limits holding back your automated agent loops, no token caps on your massive document processing tasks, and no paywalls stopping your late-night coding sessions. Whether you are building an autonomous research agent or a personal AI assistant with KiloClaw, we've got you covered.

The Elephant is out of the bag 🐘

We can't wait to see what you build with Ant's Ling-2.6-flash during this unlimited free week. Log into Kilo now, and let your agents loose!

Thank you, Roo! We’ll take it from here.

Darko from Kilo — Wed, 22 Apr 2026 12:26:23 +0000

TL;DR: Roo Code is no more. We're grateful for what the Roo team contributed to Kilo, and we're still going full speed on building the best agentic coding experience in VS Code. Install here.

Roo Code is officially shutting down. The team announced they're archiving the repo on May 15th to go all-in on Roomote, their cloud agent.

First: congrats to the Roo team. 3 million installs is a hell of a run, and a lot of the modern IDE agent playbook came out of that project. Custom modes, the Architect/Code/Debug split, diff-based editing, the whole "let the agent actually do things" philosophy that's now table stakes. Roo pushed it forward when a lot of people were still arguing about whether autocomplete was enough.

Kilo started as a fork of Roo. We've been contributing back upstream since our inception, and a lot of what Kilo does well today started with the work Roo shipped first. For that, we're very grateful!

Kilo is not slowing down on VS Code

The IDE is not over. Far from it, actually. Every independent developer, every engineering team, every enterprise shipping production software still lives in an editor for most of their working hours. That's not going away, and the quality of the agent sitting next to them in that environment matters enormously.

Which is why we just completely rebuilt the Kilo VS Code extension from the ground up on the OpenCode server, a portable open-source core that now shares the same engine as the Kilo CLI and Cloud Agents. That's not something you do if you think the IDE is a dead end.

The rebuild unlocked things that weren't possible before: true parallel execution, subagent delegation, an Agent Manager for running and monitoring multiple agents at once, inline diff review with line-level comments, and cross-platform sessions that carry state between your terminal and your editor without losing context.

It's already a fundamentally different surface than what we shipped at launch just a few weeks ago, and we're still actively hardening it based on what the community is telling us.

We also think coding isn't the only place that AI should be working for you. KiloClaw is a personal AI assistant that can proactively take actions across external platforms, automate workflows on its own schedule, and handle work that doesn't require you to be in the IDE at all.

We care about both sides of how developers actually spend their time, and we're not slowing down on either.

For the Roo community

If you've been using Roo Code, you'll feel right at home in the Kilo extension — the codebases share a grandparent, after all.

If you want to go deeper, the repo is completely open. Open source is how this whole ecosystem got here, and open source contributors are the reason Kilo moves as fast as it does. If you've been contributing to Roo, we'd love to have you.

To the Roo team and the Roo community: thanks for everything. The bar you set is the reason the rest of us had something to aim at.

Kimi K2.6 Has Arrived: An Open-Weight Powerhouse for Agentic Work

Darko from Kilo — Tue, 21 Apr 2026 10:38:21 +0000

Moonshot AI just dropped their latest model, Kimi K2.6, and it's an absolute powerhouse for agentic workflows. Even better? It's completely open-weight from release day.

Moonshot AI is starting to feel like less of a "moonshot" and more of a sure thing. The lab's previous big release, Kimi K2.5, was an immediate hit on Kilo. Our users praised its ability to reason through complex codebases, suggest refactoring strategies, and maintain context across large-scale projects.

The next iteration doesn't disappoint, ensuring that Kimi models will stay competitive with frontier offerings like OpenAI's GPT models. During our early preview testing, Kimi K2.6 blew us away with its ability to handle complex, long-context tasks across massive codebases. We're thrilled to announce that Kimi K2.6 is already live, fully integrated, and available to use in Kilo Code and KiloClaw.

K2.6 offers SOTA-level performance at a fraction of the cost. It's tremendously good at long-context tasks across the codebase, as well as the day-to-day work needed to support an always-on agent like KiloClaw. Moonshot has impressed us yet again!

—Scott Breitenother, Co-founder & CEO, Kilo Code

A Model Designed for OpenClaw

What sets Kimi K2.6 apart is its sheer stamina and reliability for continuous, long-horizon coding tasks. This isn't just an iterative update. The numbers speak for themselves. In head-to-head benchmarking against the industry's heaviest closed-source hitters, Kimi K2.6 is an open-weight model that holds its own. It scored an impressive 80.2% on SWE-Bench Verified and 58.6% on SWE-Bench Pro, showcasing its deep understanding of real-world software engineering issues. Additionally, it achieved a strong 92.5% F1-score on DeepSearchQA and an excellent 66.7% on Terminal-Bench 2.0.

We found Kimi K2.6 to be tremendously capable at handling the rigorous, day-to-day processing required to support an always-on agent like KiloClaw. Over a continuous 13-hour execution period, Kimi K2.6 independently iterated through 12 optimization strategies, made over 1,000 tool calls, and precisely modified more than 4,000 lines of code. The result? A massive 185% leap in median throughput (from 0.43 to 1.24 MT/s).

We're excited to run it through PinchBench shortly and see whether these results carry over to the benchmark's OpenClaw tasks.

Kimi K2.6 raises the bar for open-source models. It excels in coding and especially for agentic tools like OpenClaw and Hermes. In early testing, it sustains long multi-step sessions with impressive stability.

—Michael Chiang, Co-founder, Ollama

For teams deploying multi-agent systems, Kimi K2.6 elevates the Agent Swarm architecture to entirely new heights. The model can now dynamically scale horizontally to 300 sub-agents executing across 4,000 coordinated steps simultaneously — a massive leap from K2.5's limit of 100 sub-agents and 1,500 steps. This extreme parallelization fundamentally reduces end-to-end latency while enabling the swarm to execute deeply complex, heterogeneous tasks concurrently.

K2.6 also gives single agents more power. It lets you turn files such as PDFs, spreadsheets, slides, and Word documents into agent skills, unlocking a wide range of agentic knowledge work (see their release post for examples).

Ready for Kilo Code and KiloClaw

Whether you're doing deep codebase refactoring, hunting down non-obvious bugs, or setting up autonomous 24/7 workflows, K2.6 delivers the performance, instruction-following, and stability you need. One caveat: the model can be very creative, so make sure you give it clear instructions; when you do, its ability to minimize repetitive overhead translates to a significantly smoother, more trustworthy end-to-end experience for developers.

Ready to put these impressive stats to the test in your own workspace? Kimi K2.6 is available to use now in the Kilo Gateway. That means you can use it wherever you use Kilo — in the Kilo CLI, our VS Code and JetBrains extensions, Hermes, KiloClaw (our hosted OpenClaw), and more. Experience the next evolution of open-source agentic intelligence today.

Read more about the official model release and dive into the full technical benchmarks from Moonshot AI here: Kimi K2.6 Announcement.

Talk to the Claw: The Interface Is Now a Single Sentence

Darko from Kilo — Tue, 21 Apr 2026 10:35:15 +0000

At Kilo, we aren't approaching this question in the abstract — we're living it every day.

As we lean into agentic flows, we're discovering that working in a new interface means that the layer between you and the tool is no longer a dashboard, a form, or a button. It's a sentence.

You will still hear people talk about UX improvements. Better navigation. Cleaner design. More intuitive onboarding flows. It will be framed as progress.

But the real change runs deeper than any redesign. The interface layer is decoupling from the application layer entirely. You don't need to know where the button is. You don't need to learn the menu structure. You just say what you need done.

Natural language is the new UI.

I'm not saying every app will disappear.
I'm not saying this works perfectly today for every use case.
I'm not saying you should throw away your existing workflows.

But here's what I am saying: the apps you already use didn't have to rebuild themselves from scratch for this to be true. KiloClaw can talk to Todoist and Linear and your calendar and your inbox — through the same window, using the same language you'd use to text a colleague. You don't have to live inside each one to operate them.


This isn't about saving five minutes. It's about a bigger shift. The way we interact with software is fundamentally changing.

Twelve Tools, One Front Door

Here's where the new interface really shines.

Last week, I had a new project land in my inbox. I downloaded the PDF, uploaded it to my KiloClaw bot on Telegram, and typed a simple prompt in natural language: Create a Todoist project for this and add the tasks based on these guidelines.

That's it. No excessive bulleted lists. No diagrams. No long paragraphs discussing the background and goals for this project. Just a single sentence.

Thirty seconds later, it was done.

On Monday, I was meeting with a friend and colleague, and we agreed to sync again the following week. We both pulled up our calendars, found a time. I sent a message to KiloClaw. My friend received a calendar invite a minute later.

Two different tools. Two different workflows. One conversation.

Here's the thing: Todoist actually has a feature for this. It's called Todoist Ramble — you can talk to it, describe your project, and it populates tasks for you. That's cool. But that's not the unlock I'm talking about.

I'm the kind of person who has a different tool for everything. Todoist for tasks. Obsidian for a knowledge base. GitHub for engineering projects. Slack for team communication. Gmail for email. Each of them lives in its own silo, with its own interface, its own learning curve, its own quirks.

The problem has never been the tools.

The problem is the twelve different front doors. With a unified interface that acts on natural language, we now have a single way into the house.

The New Interface is the Front Door We Always Needed

Count the apps you opened before lunch today.

Email. Slack. Calendar. Linear. Todoist.

They're all like different doors into your life, each with its own login, its own layout, its own way of asking you to do the same basic thing: move information from your head into the right place.

That tax — the constant context-switching, the re-orienting, the "where does this live?" — is so familiar that most of us stopped noticing it.

We got so used to micro context-switching that we forgot there could be a better way.

Curious?

Here's what I recommend you do to get started: Choose one workflow you do repeatedly. Something tedious. Something where you're just copying information from one place to another. Tell your bot to do it instead.

You might be surprised how short the conversation needs to be.

#MyBotDoesThat: 7 Tasks the Kilo Team Retired From Forever

Darko from Kilo — Mon, 20 Apr 2026 12:57:02 +0000

What Does It Actually Look Like to Retire from a Task?

A lot of people are still waiting for AI to deliver on the futurist promise of freeing us from the boring, tedious, repetitive tasks that nobody wants to do. Scheduling, monitoring, status updates. The kind of low-stakes stuff that somehow still eats 30 minutes of your day because you have to context-switch into it, do the thing, and context-switch back out.

The thing is, that future is already here. Most people just haven't noticed yet. KiloClaw is an always-on personal AI agent that connects to your other platforms, runs in the background, and handles the tasks you keep telling yourself you'll "get to later."

People are using Claws for everything from meal prep to cattle farming, and we wanted to share what that actually looks like in practice before we tell you about the challenge we're running (and the prizes attached to it).

Challenge TL;DR: Retire from a mundane task by automating it with KiloClaw, film a 30-second video of it, post it on social media with #MyBotDoesThat, and nominate 3 people to do the same. You need to mention KiloClaw by name and show part of your dashboard workflow. First place gets $500 in Kilo credits, a $250 Amazon gift card, and 2 free months of hosting.

Enter the Challenge

7 Automations from the Kilo Team

Evgeny: Weekly Meal Prep

Evgeny, Engineer at Kilo, has been running a meal prep workflow through his Claw for the past couple of weeks. Every Friday evening it sends him a reminder and they plan the next week's meals together. The Claw then pushes all the groceries into a single Todoist list and creates a separate list for each day's dishes with step-by-step prep instructions.

He batch cooks on the weekend, freezes everything, and each evening the Claw tells him what to pull out of the freezer for the next day. The loop of plan, shop, prep, freeze, defrost, eat is all coordinated through a single bot that he set up once.

Ligia: Running a Cattle Farm from 10,700 km Away

Ligia, Kilo Support Engineer, manages a cattle operation in Brazil while living in the Netherlands. She uses a health monitoring tool that tracks whether each cow is healthy, lactating, or dry. The tool throws off a constant stream of live updates, most of which are just noise, so she pipes all of them into her Claw. If a health issue persists for more than 24 hours, the Claw adds it to her to-do list and drafts a message to her vet over email or Telegram.
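For the curious, the "ignore the noise, escalate what persists" rule can be sketched in a few lines of Python. This is only an illustration of the idea, not how Ligia's Claw is actually configured: the `HealthUpdate` shape, the `is_issue` flag, and the `find_persistent_issues` helper are all hypothetical names, and the only detail taken from her setup is the 24-hour threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# From the post: escalate only if an issue persists for more than 24 hours.
PERSISTENCE_THRESHOLD = timedelta(hours=24)


@dataclass(frozen=True)
class HealthUpdate:
    """Hypothetical shape of one update from the monitoring tool."""
    cow_id: str
    status: str          # e.g. "healthy", "lactating", "dry", or an issue label
    is_issue: bool       # assumed flag: True when the status is a health problem
    timestamp: datetime


def find_persistent_issues(updates, now):
    """Return the IDs of cows whose ongoing issue has lasted past the threshold."""
    issue_started = {}  # cow_id -> time the current, uninterrupted issue began
    for update in sorted(updates, key=lambda u: u.timestamp):
        if update.is_issue:
            # Keep the earliest timestamp of the ongoing issue.
            issue_started.setdefault(update.cow_id, update.timestamp)
        else:
            # A non-issue update clears any ongoing issue for that cow.
            issue_started.pop(update.cow_id, None)
    return sorted(
        cow_id
        for cow_id, started in issue_started.items()
        if now - started > PERSISTENCE_THRESHOLD
    )
```

Anything this returns would be what lands on the to-do list and in the draft to the vet. Everything else stays noise.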

The Claw also has access to the system that monitors milk production and collection, which means she can check output volumes and cash flow without logging into anything. She's doing all of this from over 10,000 kilometers away from the actual farm, using her Claw as the single interface for everything that's happening on the ground.

Scott: Meeting Prep He Never Has to Think About

Scott, Co-Founder and CEO of Kilo, gave his Claw access to its own Google account and connected it to his calendar. Thirty minutes before any meeting, the Claw reviews the attendee list and any attached documents and cross-references everything with his CRM. About 10 minutes before the meeting starts, it sends him a briefing with exactly what he needs to know. He no longer scrambles to remember who someone is or what the last conversation was about, because the bot already did that work for him.
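The timing here is simple enough to write down. A minimal sketch, assuming the two lead times are fixed offsets from the meeting start (the post says "about" 10 minutes, so treat the numbers as approximate); `prep_schedule` and its dictionary keys are hypothetical names for illustration, not part of KiloClaw:

```python
from datetime import datetime, timedelta

REVIEW_LEAD = timedelta(minutes=30)    # when the Claw starts reviewing attendees and docs
BRIEFING_LEAD = timedelta(minutes=10)  # roughly when the briefing is sent


def prep_schedule(meeting_start):
    """Return when to start the review and when to send the briefing."""
    return {
        "review_at": meeting_start - REVIEW_LEAD,
        "briefing_at": meeting_start - BRIEFING_LEAD,
    }
```

For a 2:00 pm meeting, that puts the review at 1:30 pm and the briefing at 1:50 pm.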

Emilie: Checking the Weather

Emilie, Co-Founder and VP of Engineering at Kilo, took about 15 minutes to set up a daily weather briefing as part of her morning update. Instead of opening a weather app, making sure it's pulling the right location, and scrolling past hourly forecasts she doesn't care about, she just gets the relevant info sent to her each morning. It's one of those automations that sounds almost too simple to bother with, but once it's running you realize how many small steps you were doing manually every single day.

Brian: Flight Info Without the Email Dig

Brian, DevRel at Kilo, forwarded his personal Gmail and Google Calendar to his Claw's own Google account so it can scan for flight information whenever it comes in. Now, whenever he has a flight coming up, the Claw pulls together a briefing with the gate number, flight number, departure time, and seat number. The whole routine of opening your email at the airport, searching "confirmation," scrolling past three marketing emails from the airline, and finally finding the actual itinerary is just gone. The bot already read it and told him what he needs.

Brendan: Spinning Up Benchmarks

Brendan, DevRel at Kilo, retired from dialing into servers to spin up new benchmarks for PinchBench, the benchmark that tests how models perform in OpenClaw. That was it for him. He set up the Claw, pointed it at the workflow, and stopped thinking about it. Not every retirement needs to be elaborate — sometimes the best automation is the one where you just never do the thing again and forget it was ever manual.

Ari: Getting Around NYC

Ari retired from using multiple transit apps to get around NYC. He was tired of apps that don't sync correctly with his calendar, suggest routes that aren't actually great, and ignore some of the best options for getting around the city, like ferries. So he deleted all of them and replaced the whole stack with his KiloClaw bot. One interface that knows his schedule, knows his options, and doesn't try to upsell him on a premium subscription to see the fastest route.

The #MyBotDoesThat Challenge

We're running a challenge where you do the same thing these folks did: retire from a task by offloading it to KiloClaw, film it, and post it. The best videos win prizes.

Place	Prize
🥇 1st Place	$500 in Kilo credits + $250 Amazon gift card + 2 free months of hosting
🥈 2nd Place	$250 in Kilo credits + 2 free months of hosting
🥉 3rd Place	$150 in Kilo credits + 2 free months of hosting

How to Enter

Do the task one last time and make it dramatic.
Say the line: "I'm [name], and I'm retiring from [task] permanently. My KiloClaw bot handles that now."
Show your workflow. Flash your KiloClaw dashboard on screen, screenshare it, or walk through the prompt you gave your bot. We need to see some part of your KiloClaw setup.
Nominate 3 people by name. "I nominate [name], [name], and [name]. What are YOU retiring from?"
Keep it around 30 seconds and post it on any social media platform with #MyBotDoesThat. Tag your nominees in the caption and call them out in the video.
Drop your link here: Enter the Challenge

To be eligible, you need to mention KiloClaw by name and show some part of your workflow in the KiloClaw dashboard. Post it on TikTok, X, LinkedIn, Instagram, YouTube Shorts — wherever you want!

The contest closes Friday, April 24th at 11:59 pm PDT.

Need Inspiration?

We have a massive recipe book of KiloClaw use cases here: ClawBytes

The best entries won't be about impressive automations. They'll be about relatable ones — the kind where everyone watching thinks "wait, I could automate that too." The nomination chain handles distribution and the relatability handles the rest.

What are you retiring from?