Forem: Arindam Majumder

Scaling Karpathy’s AutoResearch Using Nebius Token Factory

Arindam Majumder — Sun, 03 May 2026 13:50:44 +0000

Introduction

We ran about 250 prompt optimization experiments overnight using a small AI agent loop. The idea was simple: let an AI system propose an experiment, run it, evaluate the result, and then try again with a better idea. Instead of manually testing prompts one by one, the system keeps improving its own attempts over multiple iterations.

This idea comes from Andrej Karpathy’s AutoResearch, where an AI agent can automate the typical machine learning research cycle. In a normal workflow, researchers adjust parameters, run experiments, observe the results, and repeat the process many times before reaching a good configuration. AutoResearch shows that this repetitive process can be handled by an intelligent agent.

In this article, we will walk through how we built a cloud-native AutoResearch loop using Nebius Token Factory for LLM inference, allowing the agent to run hundreds of experiments automatically while keeping structured records of every attempt.

Pitfalls in the Original AutoResearch Implementation

Karpathy’s AutoResearch project is a very interesting idea and a great starting point to understand how AI agents can run experiments automatically.

The repository mainly consists of simple files such as program.md (which defines the research goal), train.py (which runs the experiment), and a lightweight results log that records experiment outcomes. The agent reads the goal, modifies the experiment code, runs it, and stores the results, demonstrating how an AI system can iterate through experiments automatically.

It was built as a research prototype to demonstrate the concept, not as a full system for running large-scale experiments in real workflows.

As a result, a few limitations arise when we try to scale the idea.

Experiment tracking using TSV files: In the original implementation, experiment results are stored in a simple TSV (tab-separated values) file. This file usually contains basic fields like experiment ID, score, and parameters. While this is easy to implement, it becomes difficult to manage as the number of experiments grows.
Limited structure for experiment metadata: A flat TSV file does not provide structured storage for experiment details such as prompts, responses, timestamps, or reasoning steps.
Local model dependency: The original workflow usually depends on running models locally. This means developers often need access to local GPUs or preconfigured environments to run inference.
Difficulty scaling experiment loops: Because of the local infrastructure and simple logging system, running hundreds of experiments in a controlled way becomes harder.

// Detect dark theme var iframe = document.getElementById('tweet-2030371219518931079-421'); if (document.body.className.includes('dark-theme')) { iframe.src = "https://platform.twitter.com/embed/Tweet.html?id=2030371219518931079&theme=dark" }

What We Are Going to Build

The use case for this project is prompt optimization. In many real workflows, developers often try multiple prompts manually to get the best response from a language model. This usually involves repeated trial-and-error, where prompts are slightly modified, tested, and evaluated before arriving at a good result. When the number of experiments increases, managing and tracking these attempts becomes difficult.

In this tutorial, we will build a small system that keeps the same AutoResearch idea while addressing a few of those practical limitations.

AutoResearch-style experiment loop: We implement the same research cycle where an agent proposes an experiment, runs it, evaluates the result, stores the outcome, and then proposes the next attempt based on previous results.
Cloud-based reasoning using Nebius Token Factory: Instead of relying on local models, the agent uses Nebius Token Factory to generate experiment ideas and responses. This removes the dependency on local GPUs and makes the research loop easier to run.
Structured experiment tracking using JSON: Instead of logging experiments in a flat TSV file, each experiment is stored as a structured JSON record. This allows us to track prompts, responses, scores, and timestamps more clearly.
Prompt optimization as the experiment domain: For this tutorial, the system will try to generate the best explanation for vector databases. Each experiment proposes a different prompt and evaluates how well the response matches our criteria.
Running a large experiment loop: The system runs around 250 iterations, allowing the agent to gradually improve prompts by learning from previous experiment results.

Why Nebius Token Factory

To run the AutoResearch loop properly, the agent needs a reliable way to generate experiment ideas and responses. Instead of running models locally, we use Nebius Token Factory as the inference layer.

Managed model inference: Nebius Token Factory lets us run open models through an API. We do not need to download models or manage GPUs locally.
Access to open models: The platform provides models such as Llama, DeepSeek, and Qwen, which can be used for reasoning, prompt generation, and experiment responses.
OpenAI-compatible API: The API follows the same structure used in the OpenAI SDK. This makes integration simple in Python applications.
Agent reasoning backend: In our system, the AutoResearch agent calls Nebius Token Factory to analyze previous experiments and propose the next prompt.
Supports large experiment loops: Because inference runs in the cloud, we can run hundreds of experiment iterations without worrying about local compute limits.
No local GPU requirement: Developers can run the experiment loop directly from their machine while the model inference happens in Nebius infrastructure.

For more demos, refer to the cookbooks here.

Tutorial: Building an AutoResearch-like System with Nebius Token Factory

Step 1 — Clone the Project Repository

Clone the repository.

git clone https://github.com/studio1hq/Nebius_autoresearch.git

Install required dependencies.

pip install requests python-dotenv

Set up a working environment.
Review the Project Structure

File	Purpose	What it Does
program.md	Defines the research goal	Contains the task the agent is trying to optimize. In this project, the goal is to generate the best prompt for explaining vector databases.
agent.py	Generates the next experiment	Uses Nebius Token Factory to analyze previous experiment results and propose the next prompt to test.
experiment.py	Runs the experiment	Sends the generated prompt to Nebius Token Factory and retrieves the model’s response.
scorer.py	Evaluates experiment output	Scores the response based on simple rules such as response length and presence of relevant keywords. Returns a numeric score.
main.py	Controls the AutoResearch loop	Loads previous experiment history, runs the experiment cycle, logs results, and repeats the process for multiple iterations.
results.json	Stores experiment history	Saves structured experiment records including prompt, response, score, and timestamp. Easier to analyze compared to flat TSV logs.

Step 2 — Configure Nebius Token Factory

Create a Nebius Token Factory API key

Open the Token Factory dashboard:

Navigate to the API Keys section in the left sidebar and click Get API key.

Create a new key for your project. This key will be used by the application to authenticate API requests sent to Nebius Token Factory.

Store the API key as an environment variable

Create a .env file in the root of the project and store the key as an environment variable:

NEBIUS_API_KEY=your_api_key

Configure the Token Factory client

Nebius Token Factory provides an OpenAI-compatible API, so we can use the same client structure used for OpenAI integrations.

Initialize the client using the Nebius API base URL and the API key stored in the environment variable.

The request sends the prompt to the selected model (for example, llama-3-70b-instruct) and receives the generated response.

In this system, these API calls power two parts of the AutoResearch loop:

Agent reasoning: Where the agent proposes the next experiment prompt (agent.py)

from typing import List, Dict, Any
from client import generate_response

def propose_experiment(history: List[Dict[str, Any]], goal: str) -> str:
    """
    Proposes a new system prompt based on the research goal and experiment history.

    Args:
        history: List of previous experiment results (dicts with 'prompt' and 'score').
        goal: The research goal string.

    Returns:
        str: The proposed system prompt for the next experiment.
    """
    # Format history for the prompt context
    if not history:
        history_text = "No previous experiments."
    else:
        # Use the last 3 experiments to provide context without overflowing context window
        recent_history = history[-3:]
        history_text = "\n".join([
            f"Attempt {i+1}:\n"
            f"Prompt: {r.get('prompt', 'Unknown')}\n"
            f"Score: {r.get('score', 0)}/10\n" 
            for i, r in enumerate(recent_history)
        ])

    system_instruction = (
        "You are an expert AI researcher optimizing a system prompt to achieve a specific goal.\n"
        "Analyze the previous attempts and their scores. Identify what worked and what didn't.\n"
        "Then, generate a NEW, improved system prompt that is likely to achieve a higher score.\n"
        "Do not repeat previous prompts. Be creative and precise.\n\n"
        "Format your response exactly as follows:\n"
        "THOUGHT: <your analysis and plan>\n"
        "PROMPT: <the actual system prompt text>"
    )

    user_message = (
        f"Research Goal:\n{goal}\n\n"
        f"Experiment History:\n{history_text}\n\n"
        "Based on the above, generate the next system prompt to test.\n"
        "Ensure you include your thought process before the prompt."
    )

The agent builds a context using the goal + past experiments.
This prompt is sent to Nebius using generate_response().
The model suggests the next experiment (new prompt).
This is where the “thinking” of the agent happens.

Experiment execution: where the prompt is sent to the model, and the response is evaluated. (experiment.py)

from client import generate_response

def run_experiment(system_prompt):
    """
    Runs the experiment by using the proposed system prompt to explain vector databases.
    """
    test_prompt = f"{system_prompt}\n\nExplain vector databases. Respond in under 120 words."
    return generate_response(test_prompt)

The generated prompt is sent again to Nebius.
The model produces the actual output for evaluation.
This response is later scored and stored.
This is the execution step of the experiment.

With this configuration in place, the AutoResearch loop can use Nebius Token Factory as its inference backend.

For example, (refer to client.py)

api_key = os.environ.get("NEBIUS_API_KEY")
    if not api_key:
        raise ValueError("NEBIUS_API_KEY environment variable not set")

    # Nebius Token Factory OpenAI-compatible endpoint
    url = "https://api.tokenfactory.nebius.com/v1/chat/completions"

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }

    payload = {
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [
            {"role": "user", "content": prompt}
        ],
        # Optional parameters for generation control
        "temperature": 0.7,
        "max_tokens": 4096
    }

You can also use an OpenAI-compatible API interface. The same client patterns used in OpenAI-based applications can be used here with minimal changes, making integration into existing workflows straightforward.

from openai import OpenAI
client = OpenAI(base_url="https://api.tokenfactory.nebius.com",
        api_key="NEBIUS_API_KEY")

completion = client.chat.completions.create(
    model="llama-3-70b-instruct",
    messages=[{
      "role":"user",
      "content":"What is the answer to all questions?"
    }]
)
print(completion.choices[0].message.content)

Step 3 — Understanding the AutoResearch Loop

The system follows a simple research loop where the agent proposes experiments, evaluates the results, and improves the next attempt based on previous runs.

Load the research goal: The system reads the task defined in program.md. In this case, the goal is to generate the best prompt for explaining vector databases.
Load previous experiment history: The program reads results.json to understand what prompts were already tested and how they performed.
The agent proposes the next experiment: The agent uses Nebius Token Factory to analyze the previous results and suggest a new prompt to try.
Run the experiment: The new prompt is sent to the model through the Token Factory API, and the response is generated.
Score the result: The response is evaluated using the scoring function defined in scorer.py.
Store the experiment record: The prompt, response, score, and timestamp are saved in results.json.
Repeat the loop: The system continues this cycle for around 250 experiment iterations, allowing the agent to gradually improve the prompts.

Step 4 — Run the System

Command:

python main.py

Terminal output shows:

iteration number
proposed prompt
generated response
score.

Every 25 experiments, the system prints a short report in the terminal showing the number of experiments completed, the best score so far, and the prompt that produced the best result. This helps track how the agent is improving over time.

Each experiment is stored as a structured record in results.json. The record contains the prompt used, the response generated by the model, the score assigned by the evaluation function, and the timestamp of the run. This structured format makes it easier to inspect and analyze experiment history later.

Instead of starting from scratch every time, the system loads the existing results.json file when the program starts. It detects the last completed experiment and continues from the next iteration. This allows the experiment loop to resume without losing previous results.

Key Takeaways

AutoResearch shows how an AI agent can run experiments automatically by proposing ideas, testing them, and learning from the results.

In this tutorial, we extended that idea by using Nebius Token Factory for cloud-based model inference and structured experiment logging for better tracking. This makes the research loop easier to run, easier to observe, and more practical for developers experimenting with AI workflows.

If you want to try similar experimentation workflows, you can start with Nebius Token Factory to run open models through a simple API without managing GPUs. The broader Nebius Cloud platform also provides GPU infrastructure, scalable inference, and tools for building AI applications.

Explore the available services, experiment with different models, and build your own AI-driven systems using the Nebius ecosystem.

Build a HackMD-Style Collaborative Markdown Editor with React, Antigravity IDE & Velt

Arindam Majumder — Mon, 27 Apr 2026 07:21:30 +0000

TL;DR

Building real-time collaboration from scratch takes significant effort. You need sync logic, presence, comments, and infrastructure before you even ship the feature.

In this guide, we generate a pixel perfect HackMD style editor UI using Antigravity, connect live markdown preview in React, and then use Velt to add presence, live sync, and comments in just a few steps.

What We’re Building

We are building a HackMD style markdown editor with a clean two pane layout. On the left, users can write markdown. On the right, they see a live rendered preview. The interface follows a dark theme and mirrors the structure and layout of HackMD closely.

This is not just a static clone. The final result will support real time collaboration, allowing multiple users to edit, comment, and stay aware of each other inside the same document.

Tech Stack

We use a focused, minimal stack:

React with Vite and TypeScript: Provides a fast development setup and a clean component-based architecture.
**Antigravity:** Used to generate a pixel-accurate editor UI directly from a reference image. This allows us to replicate the layout precisely without manual design iteration.
**Velt React SDK:** Adds the collaboration layer. We use it to enable presence, live state sync, and contextual comments without building real-time infrastructure from scratch.

Step 1: Generating a Pixel-Perfect UI with Antigravity

Antigravity is an AI-powered development platform and “agent-first” IDE where AI agents assist with coding tasks across your editor, terminal, and browser, moving beyond simple code completion toward autonomous execution of complex software workflows.

It lets you generate and modify real code based on high-level instructions, orchestrating planning, editing, and validation with minimal manual effort.

Why Use Antigravity?

Cloning an interface like HackMD manually is time-consuming. Matching spacing, typography, layout, and dark mode details takes careful iteration.

We used Antigravity to generate the editor UI directly from the reference image. The prompt enforced strict visual fidelity. No redesign. No interpretation.

This gave us:

Rapid UI cloning: Full split layout with header, editor, preview, and status bar in minutes.
Pixel-accurate output: Layout and styling matched the reference closely.
No design drift: The UI stayed consistent with the original.

With the UI ready, we could move straight to functionality and collaboration.

The Prompt Strategy

The prompt was written with strict visual constraints. Every layout detail, spacing rule, and styling decision had to follow the reference image exactly.

We enforced a simple rule: the image always wins. If there was any conflict between best practice and the screenshot, the screenshot was treated as the authority.

You are an expert frontend engineer and UI pixel-perfect implementer.

Your task is to build **only the editor UI** of a HackMD-style markdown editor **exactly matching the provided image**. This is **not** a redesign, interpretation, or approximation. This must be a **visual and behavioral clone** of the image.

---

### CRITICAL INSTRUCTIONS (DO NOT IGNORE)

* **DO NOT make any assumptions** about layout, spacing, colors, typography, sizing, or behavior.
* **DO NOT invent UI elements** that are not visible in the image.
* **DO NOT omit UI elements** that are visible in the image.
* **DO NOT restyle or “improve” anything.**
* **DO NOT change colors, icons, padding, fonts, or alignment.**
* **DO NOT guess breakpoints** — infer responsiveness strictly from the image and standard proportional scaling.
* **Follow the image exactly as it is.** If something is unclear, replicate it as faithfully as possible from the visual evidence alone.

---

### INPUT CONTEXT

* You are working inside a **basic React project**.
* You are building **only the editor UI** (no authentication, no backend, no real GitHub integration).
* The editor consists of:

  * **Left pane**: Markdown editor
  * **Right pane**: Live markdown preview
* The provided image is the **single source of truth**.

---

### REQUIRED OUTPUT

Produce **production-ready React code** that recreates the UI **pixel-perfectly**.

You must:

1. Use **React functional components**
2. Use **CSS (or CSS Modules / styled-components)** to precisely match styles
3. Ensure the layout is **fully responsive** to all screen sizes while preserving proportions
4. Match:

   * Background colors
   * Pane widths
   * Divider behavior
   * Toolbar icons and placement
   * Font family, size, weight
   * Line height
   * Button styles
   * Hover/focus states (only if visible/implied)
   * Spacing and margins
5. Implement:

   * Markdown input on the left
   * Live preview rendering on the right
6. Match **dark mode styling exactly** as shown
7. Match scrollbar appearance as closely as possible
8. Use **no external UI libraries** unless strictly necessary for markdown parsing
9. Use semantic HTML where applicable

---

### LAYOUT REQUIREMENTS

* Two-column split layout
* Left: editable markdown text area
* Right: rendered markdown preview
* Divider exactly positioned as shown
* Toolbar at the top exactly matching icon order, spacing, and alignment
* Bottom GitHub buttons and template buttons must appear exactly as shown (visual only)

---

### RESPONSIVENESS REQUIREMENTS

* On smaller screens:

  * Maintain proportional scaling
  * Preserve visual hierarchy
  * Do NOT collapse, remove, or redesign panes unless explicitly shown in the image
* No mobile-specific UI unless clearly implied by the image

---

### FUNCTIONAL REQUIREMENTS

* Markdown typing updates preview in real time
* Toolbar buttons do NOT need real functionality unless explicitly visible
* GitHub buttons are **visual only**
* No routing, no persistence, no API calls

---

### DELIVERY FORMAT

* Return:

  * React component(s)
  * CSS
  * Brief explanation of structure
* Code must be clean, readable, and copy-paste ready

---

### FINAL RULE

If there is ever a conflict between best practices and the image:
**THE IMAGE ALWAYS WINS.**

Use the provided image as the **absolute authority** and replicate it **exactly**.

Generated Component Structure

Antigravity generated a clean, modular React structure instead of a single large file. The UI was split into focused components, each responsible for one section of the editor. This made the layout easy to reason about and ready for collaboration features.

Header.tsx — Top navigation and toolbar
Editor.tsx — Left pane markdown input
Preview.tsx — Right pane rendered markdown
StatusBar.tsx — Bottom metadata bar
Layout.tsx — Structural wrapper composing the editor layout

This structure is accurate based on the repo you shared. It matches the actual src/components breakdown and reflects how the UI is assembled in App.tsx.

Implementing Markdown + Live Preview

The editor follows a simple two-pane model. The left pane is a controlled textarea where users write markdown. The right pane renders the parsed markdown in real time.

State is lifted to a shared parent component so that every keystroke updates both the editor and the preview instantly. This keeps the UI predictable and ensures the preview always reflects the latest content.

Step 2: Making the Editor Collaborative with Velt

Once the local markdown editor was working, the next step was to make it collaborative. Instead of building real-time infrastructure from scratch, we integrated Velt to handle sync, presence, and comments.

Why Use Velt?

Velt is a collaboration SDK that lets developers embed real-time collaboration features into web products quickly and efficiently. It provides fully managed components and backend support so you can add multiplayer-style experiences without building real-time infrastructure from scratch.

Key Features of Velt:

Live Sync – Real-time shared state across users so everyone sees updates instantly.
Comments – Contextual commenting components like those in Figma, Google Docs, and spreadsheet tools.
Presence & Cursors – Shows active users and cursor positions in shared sessions.
Multiplayer Editing – Multiple users can edit content concurrently with conflict resolution.
Notifications – Built-in support for alerts and updates (mentions, replies).
Recording & Huddles – audio/video/screen recording and in-app collaborative sessions.
Customizable SDK – Components and behavior can be styled and extended to match your product.

Agent Skills and MCP Integration

Velt recently introduced Agent Skills and an implementation MCP that allow collaboration features to be integrated using AI agents. Instead of manually wiring presence, comments, and live sync, agents can now orchestrate much of the integration flow.

Installing & Setting Up Velt

We started by installing the Velt React SDK and adding it to the project. This gives us access to collaboration primitives such as live state, presence, and comments.

npm install @veltdev/react
# Optional: npm install --save-dev @veltdev/types

Next, we wrapped the root of the application with the Velt provider. This initializes the collaboration layer and connects the app to Velt using an API key.

From this point on, collaboration features can be layered into existing components without restructuring the entire application.

Since Velt also supports Agent Skills and MCP-based implementations in an agent-enabled environment, collaboration features can be scaffolded automatically, without manually wiring every component. The agent can configure provider setup, inject components, and connect live state with minimal manual steps.

In other words:

Manual SDK setup → Explicit integration in code
Agent Skills / MCP → AI-assisted integration with reduced setup effort

For this project, we used the manual SDK approach. But teams using agent driven workflows can accelerate collaboration integration even further.

Adding Text Comments

With Velt initialized, the next step was enabling inline comments inside the document.

We wrapped the editor layout with the VeltComments component in text mode. This attaches a collaborative comment layer directly to the markdown content without changing the editor’s internal logic.

Contextual inline comments: Users can select text and leave feedback directly within the document.
Anchored collaboration: Comments stay attached to specific sections even as content evolves.
Multi-user discussion: Multiple users can comment and reply in the same document in real time.

At this point, the editor moves from being a single user tool to a shared workspace.

function App() {
  const apiKey = import.meta.env.VITE_VELT_API_KEY;
  const [currentUser, setCurrentUser] = useState(staticUsers[0]);

  const switchUser = (user: VeltUser) => {
    setCurrentUser(user);
    localStorage.setItem("hackmd-current-user", user.userId);
  };

  // Load user preference on app start
  useEffect(() => {
    const storedUserId = localStorage.getItem("hackmd-current-user");
    const user = storedUserId
      ? staticUsers.find((u) => u.userId === storedUserId) || staticUsers[0]
      : staticUsers[0];
    setCurrentUser(user);
  }, []);

  return (
    <VeltProvider apiKey={apiKey}>
      <VeltComments textMode={true} darkMode={true} />
      <AppContent
        currentUser={currentUser}
        staticUsers={staticUsers}
        onSwitchUser={switchUser}
      />
    </VeltProvider>
  );
}

export default App;

Enabling Live Sync

Comments make the document collaborative, but the content itself is still local. To enable true multi-user editing, we replaced local React state with Velt’s shared live state.

Instead of managing markdown with useState, we switched to useLiveState. This hook stores the document content in a shared real time layer managed by Velt.

Every update to the markdown now propagates instantly across connected users. No WebSockets, no manual sync logic, no conflict resolution setup.

The rest of the component structure remains unchanged. Only the state source is replaced.

Multi user editing — Multiple users can type in the same document simultaneously.
Instant shared updates — Changes appear in real time across all active sessions.

This is the moment where the editor becomes fully collaborative.

import React from 'react';
import { useLiveState } from '@veltdev/react';
import Header from './Header';
import Editor from './Editor';
import Preview from './Preview';
import StatusBar from './StatusBar';
import type { VeltUser } from '../types/veltUser';
import { defaultMarkdown } from '../constants/defaultTemplate';

interface LayoutProps {
    currentUser: VeltUser;
    staticUsers: VeltUser[];
    onSwitchUser: (user: VeltUser) => void;
}

const Layout: React.FC<LayoutProps> = ({ currentUser, staticUsers, onSwitchUser }) => {
    const [markdown, setMarkdown] = useLiveState<string>('hackmd-clone-markdown', defaultMarkdown);

    return (
        <div style={{
            display: 'flex',
            flexDirection: 'column',
            height: '100vh',
            width: '100vw',
            overflow: 'hidden'
        }}>
            <Header currentUser={currentUser} staticUsers={staticUsers} onSwitchUser={onSwitchUser} />
            <div style={{
                display: 'flex',
                flex: 1,
                overflow: 'hidden',
                position: 'relative'
            }}>
                <Editor value={markdown} onChange={setMarkdown} />
                <div style={{
                    width: '1px',
                    backgroundColor: '#000',
                    opacity: 0.3,
                    zIndex: 10
                }} />
                <Preview content={markdown} />
            </div>
            <StatusBar />
        </div>
    );
};

export default Layout;

Presence Awareness

Editing and commenting are core collaboration features, but presence adds awareness. It lets users see who else is currently active inside the document.

With Velt, presence is automatically tracked once the provider is configured. Active users can be identified in the session, enabling visual indicators such as avatars or active participant signals.

This creates a collaborative awareness layer. Users know when others are viewing or editing the same document, which reduces overlap and improves coordination.

import React, { useState } from 'react';
import { VeltPresence } from '@veltdev/react';
import type { VeltUser } from '../types/veltUser';

interface HeaderProps {
    currentUser: VeltUser;
    staticUsers: VeltUser[];
    onSwitchUser: (user: VeltUser) => void;
}

const Header: React.FC<HeaderProps> = ({ currentUser, staticUsers, onSwitchUser }) => {
    const [showUserMenu, setShowUserMenu] = useState(false);
    return (
        <div style={{
            height: 'var(--toolbar-height)',
            backgroundColor: '#2f3136', // Darker gray for toolbar
            display: 'flex',
            alignItems: 'center',
            justifyContent: 'space-between',
            padding: '0 16px',
            borderBottom: '1px solid #111',
            fontSize: '14px',
            color: '#b9bbbe'
        }}>
            {/* Left Section */}
            <div style={{ display: 'flex', alignItems: 'center', gap: '8px' }}>
                <div style={{ 
                    display: 'flex', 
                    alignItems: 'center', 
                    gap: '8px', 
                    color: '#fff', 
                    fontWeight: 600,
                    marginRight: '12px'
                }}>
                    <div style={{
                        width: '24px', 
                        height: '24px', 
                        borderRadius: '50%', 
                        background: '#3370b7', 
                        display: 'flex', 
                        alignItems: 'center', 
                        justifyContent: 'center'
                    }}>
                        <Power size={14} color="white" />
                    </div>
                    <span>My workspace</span>
                </div>

                <div style={{ width: '1px', height: '20px', background: '#4f545c', margin: '0 4px' }}></div>

                {/* Editor Mode Buttons */}
                <div style={{ display: 'flex', background: '#333', borderRadius: '4px', padding: '2px' }}>
                    <button style={{ padding: '4px 8px', background: '#444', borderRadius: '3px', color: '#fff' }}>
                        <Pencil size={14} />
                    </button>
                    <button style={{ padding: '4px 8px', color: '#888' }}>
                        <Columns size={14} />
                    </button>
                    <button style={{ padding: '4px 8px', color: '#888' }}>
                        <Eye size={14} />
                    </button>
                </div>

                <button style={{ padding: '4px' }}><Plus size={18} /></button>
                <button style={{ padding: '4px' }}><HelpCircle size={18} /></button>
                <button style={{ padding: '4px' }}><Search size={18} /></button>
            </div>

...

export default Header;

What We Didn’t Have to Build

Using Velt removed the need to build and maintain a complex collaboration infrastructure.

No WebSocket layer for managing real-time connections
No CRDT or conflict resolution system for concurrent edits
No custom backend service for syncing document state
No notification engine for mentions and updates
No database layer for storing and anchoring comments

This allowed us to ship faster, reduce engineering overhead, and keep the codebase focused on core product functionality rather than infrastructure.

Try It Yourself

You can run the full demo locally and explore the collaborative features in action.

Clone the repository
Install dependencies
Add your Velt API key
Start the development server

Once running, open the app in two different browsers or devices. You will see live sync, comments, and presence working in real time.

Key Takeaways

Modern tooling changes how fast we can ship collaborative software. AI can drastically accelerate UI replication, allowing you to move from design to production-ready components in minutes. At the same time, collaboration infrastructure no longer needs to be built from scratch. By layering Velt on top of a clean React architecture, you can enable live sync, comments, and presence without managing real-time systems yourself.

If you’re building collaborative features into your product, explore Velt and see how quickly you can turn a single user interface into a shared workspace.

Resources

Velt Documentation
GitHub Repository
Live Demo: Try the application yourself

Claude Opus 4.7 seems to use way more tokens than expected

Arindam Majumder — Wed, 22 Apr 2026 04:35:24 +0000

While playing with Opus 4.7 over the last few days, I noticed that prompts were filling context much faster than I expected.

I also came across a few measurements from others testing it with real developer inputs like project instructions, git logs, stack traces, and long coding prompts.

Claude Opus 4.7 seems to use way more tokens than expected
Anthropic mentions the updated tokenizer may produce around 1.0–1.35× more tokens compared to previous models.

But a lot of the real-world measurements seem closer to ~1.4–1.47× more tokens. Which becomes noticeable pretty quickly if you're running larger contexts.

That means:

context budgets disappear faster
long-running sessions accumulate tokens much quicker
effective cost per workflow goes up

Not necessarily a bad thing, though.

I mean, Tokenizer changes are usually made to improve how the model handles code, markdown, structured text, and other developer-heavy inputs. So there’s probably a capability tradeoff happening here.

I made a short video here walking through the measurements, the tokenizer changes, and what it means in practice, if you want to explore more

Claude’s new Advisor Strategy is pretty interesting

Arindam Majumder — Wed, 15 Apr 2026 20:07:17 +0000

A lot of people building AI agents run into the same problem sooner or later.

If you run the entire agent on a powerful model, it works well but the costs grow quickly. If you run everything on a cheaper model, the system stays fast and affordable but it sometimes makes weak decisions, especially when planning complex tasks or choosing tools.

Anthropic recently introduced something called Advisor Strategy that tries to solve this in a simple way.

Instead of using one model for everything, the agent runs on a smaller executor model like Sonnet or Haiku. That model handles the normal workflow such as calling tools, executing steps, and moving the task forward. When the agent reaches something more complex, it can consult a stronger model like Opus for guidance. The advisor reads the full context, suggests what to do next, and the executor continues the workflow.

So most of the work stays cheap and fast, but the agent can still get strong reasoning when it actually needs it. It feels a lot like how a junior engineer works most of the time but occasionally asks a senior engineer for advice.

I found this architecture interesting because it pushes agent systems toward multi-model setups instead of relying on a single model for everything, which seems like a direction many frameworks will probably move toward.

I made a short video breaking down how the advisor strategy works and how developers can implement it in their own agents

Production-Aware AI: Giving LLMs Real Debugging Context

Arindam Majumder — Thu, 09 Apr 2026 05:22:32 +0000

TL;DR

Large language models struggle with production debugging because they do not have visibility into how code actually executes at runtime.
Inputs such as logs, stack traces, and metrics provide incomplete signals, which often cause confident but incorrect conclusions about root causes.
When AI reasoning is grounded in function-level runtime data collected from production systems, debugging becomes accurate, explainable, and reliable.

Introduction

Large language models are increasingly used by developers to understand code, analyze failures, and assist during incident response. In controlled environments, they are effective at explaining logic and suggesting fixes. In production systems, however, their usefulness often drops sharply.

A recent survey of developers found that a quarter of developers spend more time debugging than writing code each week. The same survey reported that bugs and tooling failures cost teams nearly 20 working days per year in lost productivity. These numbers reflect a reality most engineering teams already experience.

Production debugging takes time because failures depend on runtime factors such as traffic patterns, concurrency, queue depth, and system state that are absent in non-production environments. Most AI systems do not observe these execution conditions. They analyze code structure and reported symptoms, rather than the runtime behavior that caused the failure.

In this article, we will discuss why production context is critical for AI debugging, what production-aware AI really means, and how runtime intelligence enables more accurate and trustworthy debugging outcomes.

Why Production Issues Cannot Be Understood from Code Alone

Code defines control flow and data handling, but production behavior is determined by runtime conditions such as traffic volume, concurrency, and system state.

In production, requests arrive concurrently and compete for shared resources. As traffic increases, queues begin to accumulate work, caches evolve, and external dependencies respond with variable latency or partial failures. Together, these factors influence execution order, timing, and resource contention in ways that are not visible when reading code or running isolated tests.

Many production failures arise only when specific runtime conditions are met. Race conditions appear under concurrent access. Performance regressions surface under sustained or uneven load. Retry mechanisms can magnify transient upstream failures into system-wide impact. In each case, the logic itself may be correct, while the observed failure is a result of how that logic behaves under real execution pressure.

This leads to a common outcome during incident response. The code appears correct because the failure is not caused by a logical error. The root cause exists in how the code executes under real production conditions, not in how it reads in isolation.

How LLMs Debug Today: Strengths and Structural Limits

Large language models assist debugging by analyzing text. They infer intent, recognize common patterns, and map symptoms to known classes of problems. This makes them effective for code review, error explanation, and reasoning about familiar failure modes.

However, their understanding is entirely constrained by the inputs they receive. Without access to runtime execution data, their conclusions are based on probability rather than evidence.

Aspect	What LLMs Do Well	Structural Limitation
Code understanding	Explain logic, control flow, and common anti patterns	Cannot observe how code executes under real load
Input analysis	Reason over logs, stack traces, and snippets	Inputs represent symptoms, not full execution context
Pattern matching	Identify known bug patterns and typical fixes	Fails when failures are novel or environment specific
Root cause analysis	Propose plausible explanations	Cannot validate causality without runtime signals
Decision making	Rank likely fixes based on training data	Relies on probabilistic inference when facts are missing

Without visibility into execution order, timing, frequency, and state, LLMs are forced to guess. The results may sound correct, but they are not grounded in how the system actually behaved.

Hallucinations Are Caused by Missing Runtime Evidence

Hallucinations in AI-assisted debugging usually appear when the system does not have enough information about what actually happened during execution. This is common in production, where AI is asked to explain failures using logs, stack traces, or small pieces of code that describe symptoms but not runtime behavior.

Recent research on AI reliability shows that incorrect answers increase when important contextual details are missing. In debugging scenarios, these details include execution order, timing, system state, and how frequently specific code paths were executed. Without this information, AI systems infer causes based on likelihood rather than evidence.

The same pattern appears in studies on AI-driven debugging and code repair. When models are given execution traces or feedback from real runs, fault localization and fix accuracy improve. When this runtime information is absent, models often produce explanations and fixes that appear reasonable but fail to address the real cause of the issue.

Prompt refinement does not address this limitation. Clearer prompts help structure responses, but they do not introduce new facts. If execution data is missing, the model still reasons without evidence about how the system behaved.

In production debugging, hallucinations are therefore expected. They occur when AI systems are asked to explain failures they cannot observe, not because the reasoning process is flawed, but because the necessary runtime evidence is absent.

The Missing Context in AI Debugging Workflows

Most AI debugging workflows rely on the same signals engineers have used for years. These signals are useful, but they describe outcomes, not execution, which creates a gap between what failed and why it failed.

What AI usually receives today

Logs: Logs capture messages emitted by code paths that were explicitly instrumented. They are selective, often incomplete, and rarely reflect execution order, frequency, or timing across concurrent requests.
Stack traces: Stack traces show where an error surfaced, not how the system reached that state. They lack information about prior execution paths, state changes, and interactions with other components.
Metrics: Metrics summarize system behavior at an aggregate level. They indicate that something is slow or failing, but they do not identify which functions caused the issue or how behavior changed over time.

What is missing

Function level execution behavior: Which functions ran, how often they executed, and how long they took under real load conditions.
Runtime performance characteristics: Execution timing, concurrency effects, retries, and resource contention that emerge only during live operation.
Connection between user impact and code: Clear linkage between affected endpoints or workflows and the exact functions responsible for the observed behavior.

When AI reasons over incomplete signals, it cannot establish causality. Proposed fixes are derived from statistical patterns rather than observed execution, which often results in changes that compile or deploy successfully but do not resolve the underlying issue. Effective debugging requires visibility into execution behavior, not only error reports or surface-level symptoms.

Defining Production-Aware AI

Consider a common production incident. An API endpoint becomes slow after a deployment. Logs show no errors. Metrics show increased latency. The code itself looks unchanged or correct. An AI system reviewing this information can suggest several possible causes, such as a database query, a cache miss, or an external dependency. Each suggestion sounds reasonable, but none is confirmed.

This is where production awareness matters. A production-aware AI does not rely only on aggregated metrics or isolated log lines. It reasons using information about how the system actually executed under real traffic. It can see which functions ran more often than before, where execution time increased, and which code paths were exercised during the slowdown.

Production-aware AI is defined by the context it uses. It grounds reasoning in runtime behavior rather than static structure. It focuses on how functions are executed, how often they ran, and how their performance changes over time, instead of relying only on what the code looks like or what developers expect it to do.

This approach changes the quality of debugging. Instead of proposing likely explanations, the AI reasons from observed execution evidence.

Why Function-Level Runtime Intelligence Changes AI Debugging

Function-level runtime intelligence gives AI direct visibility into how software behaves while it is running. This visibility changes debugging from interpreting symptoms to analyzing execution.

Instead of inferring behavior from secondary signals, AI can reason using execution facts collected in real time.

Function-level data as the missing signal: Function-level data shows which functions executed, how frequently they ran, and how long they took under real load. This information allows AI to identify abnormal behavior at the exact point where performance or correctness changed.
Linking endpoints to execution paths: Runtime intelligence connects external symptoms to internal execution. When an HTTP endpoint slows down, or a queue backs up, AI can trace the issue to the specific functions involved, rather than reasoning only at the service or request level.
Temporal awareness across deployments: By comparing runtime behavior before and after a deployment, AI can identify which functions changed execution characteristics. This makes regressions visible without relying on alerts or manual comparison.

How Hud Enables Production-Aware AI

Hud captures function-level execution behavior directly from production systems. Instead of relying on aggregated metrics, sampled traces, or predefined alert rules, it observes how individual functions execute under real traffic, including errors and performance changes.

This execution data can be consumed directly by engineers and AI systems to reason about production behavior based on observed runtime evidence.

Below are the core capabilities that allow Hud to provide production-aware runtime context for AI-assisted debugging.

Runtime code sensing at the function level: Hud acts as a runtime code sensor. You get continuous function-level execution data from production, without manual instrumentation or ongoing maintenance. This data reflects how code actually runs under real traffic.
Automatic detection of errors and slowdowns: Hud automatically detects errors and performance degradations based on changes in runtime behavior, not static rules.
Linking user impact to code: When an endpoint slows down, or a queue backs up, Hud connects that business-level symptom directly to the functions responsible. You can see which parts of the code caused the impact, not just where it surfaced.
Post-deployment behavior comparison: Hud automatically detects deployments and compares function behavior across versions. You can see what changed in production after a release and identify regressions without manual diffing.
Runtime context for AI debugging: Hud provides a full forensic runtime context that you can use inside the IDE or pass to AI agents through its MCP server. This allows AI to reason from execution evidence instead of guessing from partial signals.

Key Takeaways

Without visibility into how code actually ran in production, AI systems reason over symptoms instead of causes, which leads to incorrect or incomplete fixes. Production systems demand runtime grounded reasoning, where function-level behavior, execution timing, and real traffic conditions are first-class inputs.

When AI is given this level of visibility, hallucination decreases, and confidence aligns with correctness. Production-aware AI is therefore not an optimization, but a requirement for reliable debugging.

Hud gives you function-level runtime visibility directly from production, with no configuration and no maintenance. Explore how Hud works, read the documentation, or book a demo to see how production-aware debugging changes the way you and your AI systems understand failures.

I built a local dashboard to inspect Claude Code sessions, tokens, and costs

Arindam Majumder — Thu, 02 Apr 2026 07:58:55 +0000

I’ve been using Claude Code heavily over the last few weeks and started wondering where my tokens were actually going.

Claude stores everything locally in ~/.claude/, which is great, but the data mostly sits in JSON logs. If you want to understand session usage, token costs, tool calls, or activity patterns, you basically end up digging through raw files.

So I built a small tool called cc-lens.

I built a local dashboard to inspect Claude Code sessions, tokens, and costs
It’s a local-first dashboard that reads your Claude Code session files and turns them into something you can actually explore.

It runs entirely on your machine. It doesn't have any cloud sync, sign-ups, or telemetry.

Some things it shows:

• Usage overview: sessions, messages, tokens, estimated cost
• Per-project breakdown: see which repos are burning the most tokens
• Full session replay: inspect conversations turn-by-turn with token counts and tool calls
• Cost & cache analytics: stacked charts by model and cache usage
• Activity heatmap: GitHub-style view of when you’re using Claude the most
• Memory & plan explorer: browse/edit Claude memory files and saved plans
• Export/import: move dashboards across machines

You can run it instantly with:

npx cc-lens
(or clone the repo if you prefer).

Here's the Github Repo, if you want to try it out!

[Boost]

Arindam Majumder — Sat, 28 Mar 2026 13:37:31 +0000

Arindam Majumder for Studio1

Mar 17

Running LLM Applications Across Providers with Bifrost

#ai #llm #proxy #litellm

Comments

5 min read

Build a Semantic Movie Discovery App with Claude Code and Weaviate Agent Skills

Arindam Majumder — Fri, 27 Mar 2026 20:45:45 +0000

Introduction

Versatility in agentic coding is increasing as new tools such as Model Context Protocol (MCP) servers and Agent Skills become more common. At the same time, many developers ask the same question when building AI applications: should they use MCP servers or Agent Skills? The important thing is understanding what each approach does well and choosing the one that fits your use case.

In this post, we’ll explain what MCP servers and Agent Skills are and how they differ, including architecture diagrams and technical details. In the later sections, we’ll also walk through how to use Weaviate Agent Skills with Claude Code to build a “Semantic Movie Discovery” application with several useful features.

Let’s get started!

Understanding MCP

The Model Context Protocol (MCP) is an open standard introduced by Anthropic that enables Large Language Models (LLMs) to interact with external systems such as data sources, APIs and services. MCP provides a structured way for an AI agent to connect to compliant tools through a single interface instead of requiring custom integrations for each service.

MCP Architecture

The MCP system operates on a client–server model and consists of three main components.

Host: the application that runs the AI model and provides the environment where the agent operates.
Client: the protocol connector inside the host that handles communication between the model and MCP servers.
Server: an external service that exposes tools, resources, or prompts that the agent can access.

MCP and Agentic Coding

Before MCP, each AI tool required custom integrations for every external service it wanted to connect to. MCP simplifies this process by introducing a shared protocol that multiple agents and tools can use.

Developers can now expose capabilities through an MCP server once and allow any compatible agent to access them without building separate integrations for each system.

Understanding Agent Skills

Agent Skills, also introduced by Anthropic, provide developers with a simple way to extend AI coding agents without running MCP servers. An Agent Skill is a structured configuration file, usually written as markdown files with YAML metadata that defines capabilities, parameter schemas and natural-language instructions describing how the agent should use those capabilities.

AI tools such as Claude Code read these files at session start and load the skills directly into the agent's working context without requiring an additional runtime.

How Agent Skills Work

When Claude Code detects a skill file in the project directory (typically under .claude/skills/), it loads the manifest into the agent's context at the beginning of the session.
The skill definition describes available capabilities, how to invoke them correctly and when to prefer one approach over another. Because the instructions are written in natural language alongside parameter schemas, the agent can reason about how to use the skill.
Skills are portable across repositories. If a developer commits a skill file to a repository, any collaborator who clones the project and opens it in Claude Code automatically gains access to the same capabilities without additional setup.

MCP and Agent Skills solve different problems in agent systems. MCP provides a standardized way for AI agents to connect to external tools, APIs, databases and services through a client–server architecture with structured schemas. Agent Skills extend the agent’s capabilities through configuration files that define workflows, instructions and parameter schemas without requiring a running server.

In simple terms, MCP enables agents to access external systems, while Agent Skills define how agents perform tasks or workflows within their environment.

Weaviate Agent Skills

Weaviate has released an official set of Agent Skills designed for use with Claude Code and other compatible agent-based development environments like Cursor, Antigravity, Windsurf and more. These skills provide structured access to Weaviate vector databases, allowing agents to perform common operations such as search, querying, schema inspection, data exploration and collection management.

The repository includes ready-to-use skill definitions for tasks like semantic, hybrid and keyword search, along with natural language querying through the Query Agent. It also supports workflows such as creating collections, importing data and fetching filtered results, and cookbooks. This enables agents to interact/build with Weaviate and perform multi-step retrieval and agentic tasks more effectively.

Agent Skills and Vector Databases

AI coding agents face difficulties when working with vector databases. Vector database APIs provide extensive capabilities, including basic “key–value” retrieval, single-vector near-text searches, multimodal near-image searches, hybrid BM25-plus-vector search, generative modules and multi-tenant system support. Without structured guidance, even a capable coding agent may produce suboptimal queries: correct syntax but the wrong search strategy, missing parameters or failure to use powerful features like the Weaviate Query Agent.

Weaviate Agent Skills address this by providing correct usage patterns, parameter recommendations and decision logic, enabling coding agents to generate production-ready code from their initial attempts.

The Weaviate Agent Skills repository is organized into two main parts:

Weaviate 𝗦𝗸𝗶𝗹𝗹 (skills/weaviate): Focused scripts for tasks such as schema inspection, data ingestion and vector search. Agents use these while writing application logic or backend code.
𝗖𝗼𝗼𝗸𝗯𝗼𝗼𝗸𝘀 Skill (skills/weaviate-cookbooks): End-to-end project examples that combine tools such as FastAPI, Next.js and Weaviate to demonstrate full application workflows.

Weaviate Agent Skills work with several development environments, including Claude Code, Cursor, GitHub Copilot, VS Code and Gemini CLI. When connected to a Weaviate Cloud instance, agents can directly interact with database modules and perform search, data management and retrieval tasks.

To evaluate how effective Weaviate Agent Skills really are, let’s build a small project and see how they accelerate RAG and agentic application development with Claude Code.

Building the Semantic Movie Discovery Application

We will build a Movie Discovery App that takes a natural-language description and returns the most semantically similar movies from a Weaviate collection. In the process, we will explore Weaviate capabilities such as multimodal storage, named vector search, generative AI (RAG) and the Query Agent in action with Claude Code, showing how these Agentic tools help you build applications faster.

Prerequisites

Python 3.10 or higher
Weaviate Cloud – Create a free cluster and obtain an API key.
TMDB API key – Used to fetch movie metadata
OpenAI API key – Required for RAG features.
Access to Claude Code
Node.js 18+ and npm – Required to run the Next.js frontend

Step 1: Project Setup

Create a movie-discovery-app folder

mkdir movie-discovery-app

Create and activate a Python virtual environment in the folder

cd movie-discovery-app py -m venv venv && source venv\Scripts\activate.bat

Install Python dependencies

pip install weaviate-client==4.20.1 fastapi uvicorn[standard] openai weaviate-agents>=1.3.0 requests python-dotenv

Install Node.js dependencies for the frontend

cd frontend && npm install

Now, create a .env file at the project root. Add the following parameters to configure Weaviate Agent Skills with Claude Code, along with your OpenAI API key and TMDB API key.

WEAVIATE_URL=your-cluster-host-without-https
WEAVIATE_API_KEY=your-api-key
OPENAI_API_KEY=your-openai-key
TMDB API key=your-tmdb-api-key

After signing up for Weaviate, click the Create Cluster button to start a new cluster for your use.

Click “How to Connect” to view the required Weaviate connection parameters.

Now that everything is set up, we can connect Weaviate Cloud with Claude Code by running claude in your project terminal:

Use the following prompt in your Claude terminal:

Write and run `check_modules.py` that connects using `weaviate.connect_to_weaviate_cloud`with `skip_init_checks=True`, loads credentials from `.env` with `python-dotenv`,
and prints the full JSON list of enabled Weaviate modules.
Run it with `venv/Scripts/python check_modules.py`."

Step 2: Create A Weaviate Collection and Import Sample Movie Data

In this step, we create a Weaviate collection and import the movie dataset into Weaviate. The dataset contains movie metadata sourced from the TMDB API. Each entry includes: title, overview, release_date, poster_url, popularity, and other important movie fields. You can import a JSON or CSV dataset directly into Weaviate.

Run this prompt to retrieve the dataset from the TMDB API and save it to a file named movies.json.

Create a TMDB dataset JSON file, movies.json, that contains 100 movie metadata and poster URLs directly from the TMDB API.

Afterwards, Weaviate Import Skills creates a Weaviate collection and imports the data from movies.json into the Weaviate database. Claude code activates Weaviate to perform this action when prompted with:

Import `movie.json` into a new Weaviate collection called Movie

Then the data is imported

Step 3: Building the FastAPI Backend and Next.js Frontend with Weaviate Cookbooks

Weaviate cookbooks enable the app to use a two-layer architecture: a FastAPI backend that exposes REST endpoints and a Next.js frontend that renders the UI. The backend connects directly to Weaviate Cloud and the Weaviate Query Agent. Weaviate cookbooks also include some frontend guidelines to communicate with the Weaviate backend over HTTP.

The app is organized into two views accessed via a collapsible sidebar:

Search view: performs semantic search and RAG using Weaviate named vectors.
Chat view: handles multi-turn conversations through the Weaviate Query Agent.

Our app includes the following features:

Layer	Component	Role
Backend	backend.py (FastAPI) - REST API on port 8000/docs	Routes: GET /health, GET /search, POST /ai/explain, POST /ai/plan, POST /chat
Frontend	Next.js + TypeScript (port 3000)	Single-page app with sidebar navigation
	SearchView.tsx	Semantic search (near_text), AI explanations (single_prompt), Movie Night Planner (grouped_task)
	MovieCard.tsx	Renders base64 poster inline, watchlist add/remove button
	ChatView.tsx	Multi-turn Query AI Agent chat
	AppSidebar.tsx	Navigation (Search/Chat), Weaviate logo + feature summary, watchlist manager with ‘.txt’ export

Use the following prompts with Claude Code to generate the backend and frontend:

Backend Prompt:

/weaviate cookbooks 

Create `backend.py`: a FastAPI app with CORS enabled for localhost:3000.
Connect to Weaviate Cloud using credentials from .env with skip_init_checks=True.
The /search endpoint should return genre and vote_average alongside title, description, release_year, and poster.
Implement these routes:  

- GET  /health                  → {"status": "ok"}  
- GET  /search?q=...&limit=3    → near_text on text_vector, return title/description/release_year/poster  
- POST /ai/explain              → generate.near_text with single_prompt  
- POST /ai/plan                 → generate.near_text with grouped_task  
- POST /chat                    → QueryAgent.ask() with full message history

Frontend Prompt:

Using Weaviate cookbooks frontend reference, create a Next.js TypeScript app in the frontend/ folder.
MovieCard.tsx should display a star rating (vote_average) and genre tag beneath the movie title. 

Components needed:  

- page.tsx        — SidebarProvider layout, view state (search | chat)  
- SearchView.tsx  — search input, MovieCard grid, AI explain and plan buttons  
- MovieCard.tsx   — poster image, title, year, description, watchlist button  
- ChatView.tsx    — message bubbles, source citations, clear chat  
- AppSidebar.tsx  — navigation, Weaviate logo + feature list, watchlist + exportBackend base URL from NEXT_PUBLIC_BACKEND_HOST env var (default localhost:8000)

Run backend and frontend servers with: uvicorn backend:app --reload --port 800 and npm run dev

After this, Claude Code will automatically build the app by adding relevant files and start both servers. You can start using the application immediately.

The FastAPI backend runs at http://localhost:8000/docswhile the frontend app is available at http://localhost:3000

You can also manually start both processes in separate terminals:

# Terminal 1 — Backend 
uvicorn backend:app --reload --port 8000
# Terminal 2 — Frontend
cd frontend && npm run dev

Congratulations! You’ve completed the project without needing to do much manual configuration or coding. 🔥

Demo

So far, we have used Weaviate Agent Skills with Claude Code to build a Semantic Movie Discovery Application powered by an OpenAI API key, a TMDB API key, and Weaviate.

The Movie Discovery app we built includes the following features:

Semantic search: Describe a mood or theme and retrieve matching movies using vector-based search (near_text).
AI explanations: Generate per-movie summaries using RAG with single_prompt.
Movie Night Planner: Create a viewing order, snack pairings and a theme summary using grouped_task.
Conversational chat: Ask questions about the movie collection through a chat interface powered by the Weaviate Query Agent, with source citations.
Watchlist: Save movies during your session and export the list as a .txt file.

What’s Next?

You could add image-based search to find similar movies and better meet your movie choices. You could also include a hybrid search feature that incorporates keyword-heavy queries and image search.

You can take your app even further by getting up to speed with Weaviate’s latest releases and becoming familiar with features such as server-side batching, async replication improvements, Object TTL and many more.

To explore further, check out the latest Weaviate releases and join the discussion on the community forum.

Weaviant Agent Skills in Action

Weaviate modules were used in the application:

Text2vec-weaviate: Responsible for text embeddings.
Multi2multivec-weaviate: Responsible for embedding images.
Generative-openai: Integrates GPT directly into the query workflow.
Weaviate Skill: Creates a collection and imports data.
Weaviate Cookbooks Skill: For defining the app’s logic.
Weaviate Query Agent: A higher-level abstraction that accepts natural language queries, decides the best query method, executes queries, synthesizes results and returns answers.

Weaviate Agent Skills help in shipping faster and more accurate RAG applications. Backend development tasks such as schema inspection, data ingestion and search operations are automated and optimized. Ultimately, this helps developers save valuable development time.

Conclusion

Both MCP servers and Agent Skills provide useful patterns for building AI-powered applications. MCP servers are well-suited for exposing external tools and services through a standardized interface, while Agent Skills focus on guiding coding agents with structured workflows and best practices.

In this tutorial, we demonstrated how Weaviate Agent Skills can simplify development by helping Claude Code generate correct database queries, ingestion pipelines and search logic. By combining vector search, multimodal storage and generative capabilities, we built a semantic movie discovery application with minimal manual setup.

As agentic development environments continue to evolve, tools like MCP servers and Agent Skills will likely be used together. The key is understanding where each approach fits and selecting the one that best supports your application architecture.

Happy building.

Resources

Model Context Protocol
Weaviate Agent Skills
Claude Code
GitHub Repository for the Movie Discovery App

We Cut Our MCP Token Spend in Half. Here's the Architecture

Arindam Majumder — Wed, 25 Mar 2026 19:04:52 +0000

When we started scaling our MCP workflows, token usage was something we barely tracked. The system worked well, responses were accurate, and adding more tools felt like the right next step. Over time, the cost began rising in ways that did not align with how much the system was actually used.

At first, we assumed this was due to higher usage or more complex queries. The data showed something else. Even simple requests were using more tokens than expected. This led us to ask a basic question. What exactly are we sending to the LLM on every call?

A closer look made things clearer. The issue came from how the system was built. We handled context, tool definitions, and execution flow by adding extra tokens at every step.

This article explains how we found the root cause and redesigned the architecture to fix it. The changes cut our MCP token usage by nearly half and gave us better control over how the system behaves.

Understanding Token Usage in MCP Systems

Once we started examining token usage, a clear pattern showed up. The LLM was receiving far more context than most requests actually needed. A large part of this came from tool definitions being sent repeatedly on every call.

Each request included the full list of tools, even when only one or two were needed. On top of that, earlier outputs and intermediate results were passed back into the model. The context kept growing, even for simple queries.

The execution flow added to the problem. The LLM would choose a tool, call it, process the result, and then repeat the same cycle if another step was needed. Each step added more tokens, and the same data often appeared many times across calls.

This setup worked at a small scale. As the number of tools increased, the cost grew quickly. More tools meant more context. More steps meant repeated processing. The system was doing extra work without adding real value. At this point, the cause was clear. Token usage came from how the system handled context and execution. The design itself was driving the overhead.

Introducing Bifrost

We started looking for a way to change how the system handled tool execution. The goal was simple. Reduce the amount of context sent to the LLM and avoid repeated processing across steps.

During this process, we came across Bifrost, an open source MCP gateway. It works between the application, the model, and the tools. It brings structure for how tools are discovered and executed, so the LLM receives only what is needed on each call.

This changed how we thought about the system. Tool access became more controlled. Context stayed limited to what was required for each request. The overall flow of execution became easier to follow and reason about.

These changes directly addressed the issues we were seeing. Tool definitions were sent only when required. Repeated decision loops were reduced. The system handled execution in a more controlled and predictable way.

From here, the focus moved away from adjusting prompts and toward changing how the system runs end-to-end.

Architectural Changes with Bifrost Code Mode

The main change came from how execution was handled inside Bifrost. Code Mode is a Bifrost feature that changes how the LLM interacts with MCP tools. Earlier, the LLM handled both planning and step-by-step tool interaction. Each step required another call, and each call carried a growing context.

Code Mode separates these responsibilities. The LLM focuses on planning. It generates executable code that defines the full workflow for a task.

Code Mode works best when multiple MCP servers are involved, workflows have several steps, or tools need to share data. For simpler setups with one or two tools, Classic MCP works well.

A mixed setup also works. Use Code Mode for heavier workflows like search or databases, and keep simple tools as direct calls.

This includes:

Selecting the right tools
Passing data between tools
Defining how the final output is produced

The system exposes a minimal interface to the LLM. It can list available tools, read tool details, and, when required, understand how each tool works. Tool definitions are accessed on demand, which keeps the initial context small.

Once the plan is generated, execution moves to a runtime environment. The code runs in a sandbox and interacts directly with tools. All intermediate steps, tool responses, and data transformations stay within this layer.

This removes the need for repeated LLM calls during execution. The workflow runs in one pass, guided by the generated code. The LLM is involved mainly at the planning stage and for producing the final response if required.

The flow becomes more structured. A request comes in, relevant tools are identified, code is generated, and execution happens in a controlled environment. The system handles state and intermediate data outside the LLM.

This approach improves clarity in how tasks are executed. The generated code can be inspected, debugged, and understood directly. Each request follows a defined path, which makes behavior easier to track and reason about.

Using Bifrost CLI in Our Workflow

Getting started required two commands. First, start the gateway:

npx -y @maximhq/bifrost

Then launch the CLI from a separate terminal:

npx -y @maximhq/bifrost-cli

MCP servers are registered once through the API. The key flag is is_code_mode_client, which tells Bifrost to handle that server through Code Mode instead of sending its tool definitions on every request:

curl -X POST http://localhost:8080/api/mcp/client \
  -H "Content-Type: application/json" \
  -d '{
    "name": "youtube",
    "connection_type": "http",
    "connection_string": "http://localhost:3001/mcp",
    "tools_to_execute": ["*"],
    "is_code_mode_client": true
  }'

Once registered, the LLM discovers tools on demand using listToolFiles and readToolFile, then submits a full execution plan through executeToolCode. A workflow that previously took six LLM turns now completes in three to four.

Bifrost organizes tool definitions using two binding levels. Server-level (default) groups all tools from a server into one .pyi file. Tool-level gives each tool its own file — better for servers with 30+ tools. Set it once in config.json:

{
  "mcp": {
    "tool_manager_config": {
      "code_mode_binding_level": "server"
    }
  }
}

Debugging became simpler because the generated code is the execution plan. When something went wrong, the issue was visible directly in the code rather than buried in prompt chains. This setup also made execution easier to inspect.

results = youtube.search(query="AI infrastructure", maxResults=5)
titles = [item["snippet"]["title"] for item in results["items"]]
result = {"titles": titles, "count": len(titles)}

The execution runs in a Starlark interpreter, a restricted subset of Python. A few constraints to keep in mind:

No import statements, file I/O, or network access
Classes are not supported, use dictionaries
Tool calls run synchronously; async handling is not required
Each tool call has a default timeout of 30 seconds

Code Mode also works with Agent Mode for automated workflows. The listToolFiles and readToolFile tools are always auto-executable since they are read-only.

The executeToolCode tool only auto-executes if every tool call within the generated code is on the approved list. If any call falls outside that list, Bifrost returns it to the user for approval before running.

Impact on Token Usage and System Efficiency

The reduction in token usage came from four specific changes:

Tool schemas were sent only when required
Intermediate outputs stayed within the execution layer
Repeated context across steps was removed
Fewer LLM calls were needed, since execution moved to a sandbox and ran in a single flow

These changes had a clear effect. Token usage dropped by nearly half. Latency reduced along with it. Execution became more predictable, since each request followed a defined path with fewer moving parts.

The broader takeaway is clear. Token cost comes from system design. Small changes in prompts or outputs help at the edges. The main overhead comes from the system's structure.

LLMs work best when they focus on planning. Managing execution through repeated loops adds cost and introduces variability. A separate execution layer keeps the flow stable and easier to understand. Context also needs careful control. It should be built for each request with only the required information. Letting it grow across steps results in unnecessary overhead and increased token usage.

Conclusion

Token inefficiency in MCP workflows comes from system design. Bifrost and Code Mode introduced a clear separation between planning and execution. The LLM handles planning, and the runtime handles execution. This brought immediate and measurable improvements in both cost and system behavior.

If you are working with MCP workflows at scale, Bifrost is worth exploring. The documentation provides a good starting point to set up the gateway, connect servers, and run workflows using Code Mode.

Composer 2 is controversial, but my actual experience was solid

Arindam Majumder — Sat, 21 Mar 2026 06:55:09 +0000

I tried Composer 2 properly today, and honestly, if you put all the controversy aside for a second, the model itself is not bad at all.

In fact, my first impression is that it’s a real upgrade over Composer 1 and 1.5. I gave it a pretty solid test. I asked it to build a full-stack Reddit clone and deploy it too.

On the first go, it handled most of the work surprisingly well. The deployment also worked, which was a good sign. The main thing that broke was authentication.

Then on the second prompt, I asked it to fix that, and it actually fixed the auth issue and redeployed the app.

That said, it was not perfect. There were still some backend issues left that it could not fully solve. So I would not say it is at the level of Claude Opus 4.6 or GPT-5.4 for coding quality.

But speed-wise, it felt much faster. For me, it was around 5 to 7x faster than Opus 4.6 / GPT-5.4 in actual workflow, and it also feels much more cost-effective.

That combination matters a lot.

Because even if the raw coding quality is still below Opus 4.6 / GPT-5.4, the overall experience was smoother than I expected. It gets you from idea to working product much faster, and for a lot of people that tradeoff will be worth it.

My current take is:

Better than Composer 1 / 1.5 by a clear margin
Fast enough to change how often I’d use it
Good at getting most of the app done quickly
Still weak enough in backend reliability that I would not fully trust it yet for complex production work
Not as strong as Opus 4.6 / GPT-5.4 in coding depth, but still very usable

So yeah, I agree with the criticism that it is not on the same level as Opus 4.6 / GPT-5.4 for hard-coding tasks. ( may be because the base model is Kimi K2.5)

But I also think some people are dismissing it too quickly. If you judge it as a fast, cheaper, improved Composer, it is genuinely solid.

I shared a longer breakdown here with the exact build flow, where it got things right, and where it still fell short, in case anyone wants more context

Building an AI-Powered Content Moderation API with InsForge Edge Functions

Arindam Majumder — Fri, 20 Mar 2026 09:55:13 +0000

Introduction

Modern applications rely on user-generated content such as comments, reviews, and messages. Platforms must moderate this content to enforce safety policies and maintain compliance. Manual moderation does not scale, so production systems typically rely on automated moderation pipelines powered by AI.

Traditional implementations require multiple backend services. Developers often provision servers, integrate AI APIs, manage databases, and configure storage separately. This fragmented setup increases operational overhead and slows development.

InsForge simplifies this architecture by combining Edge Functions, PostgreSQL Database, Storage, and Model Gateway in a single platform. Benchmarks also show that it can deliver ~1.6× faster responses and 2.4x lower token usage compared to fragmented integrations.

In this tutorial, we will build a production-ready AI moderation API that runs entirely within InsForge.

What We Are Building

Here are the tools that we will be using to build a simple backend moderation workflow using InsForge core services:

AI Moderation API Endpoint: We will create an API endpoint using Edge Functions that accepts user-submitted text content and processes moderation requests.
AI-Powered Content Evaluation: The API will use Model Gateway to access an AI model that classifies submitted content as SAFE or UNSAFE.
Database Storage for Approved Content: Approved comments will be stored in a PostgreSQL Database managed by InsForge.
Attachment Handling with Storage: Optional user attachments will be uploaded and stored using Storage Buckets.
Automated Moderation Response: Unsafe content will be rejected immediately, and the API will return a structured moderation response.
Production-Ready Backend Workflow: The moderation pipeline will run entirely within InsForge using Database, Edge Functions, Model Gateway, and Storage, without external servers or additional infrastructure.

Project Setup and Repository Structure

Before configuring the backend resources, clone the project repository and review the project structure.

Clone the repository:

git clone https://github.com/Studio1HQ/Content-moderation-Insforge
cd content-moderation-app

Install dependencies:

npm install

The repository contains both the Next.js frontend and the InsForge Edge Function used for moderation.

Repository Structure

Folder	Purpose
`src/app`	Next.js application pages and layouts
`src/components`	UI components such as the moderation form
`src/lib`	Client utilities for connecting to InsForge APIs
`insforge-functions/moderate-comment`	Edge Function implementation for moderation
`handler.ts`	Serverless function that processes moderation requests

This structure keeps the frontend and backend logic organized within the same project while allowing the Edge Function to be deployed independently.

After cloning the repository, proceed with configuring the backend resources in InsForge.

Note: You can set up this backend in two ways. Follow the manual steps in this tutorial to create the database, storage bucket, and Edge Function using the dashboard and CLI. Alternatively, you can use InsForge MCP with your AI coding agent to provision the same resources using a single prompt. See the MCP section at the end of the article for the prompt template and instructions.

Step 1: Setting Up the Database

InsForge provides a managed PostgreSQL Database that you can configure directly from the dashboard.

Open the Tables Section

Open your project in the InsForge Dashboard.
In the left sidebar, select Tables.
Click the + icon next to Tables.

Create the following columns.

Column	Type	Description
`id`	`uuid`	Primary key for each comment
`content`	`string`	User submitted comment text
`attachment_url`	`string`	URL for uploaded file (optional)
`status`	`string`	Moderation result (`approved` or `rejected`)
`created_at`	`timestamp`	Time when the comment was created

Save the Table

Click Create Table to apply the schema.
The comments table will appear in the Tables panel.

Step 2: Creating the Edge Function

Next, create the serverless API that will process moderation requests.

InsForge Edge Functions allow you to run backend logic without managing servers. In this tutorial, the function receives user content, evaluates it using AI, and stores approved results in the database.

Navigate to the Edge Function directory in the repository:

insforge-functions/moderate-comment/

Inside this folder, there will be a file named:

handler.ts

This file will contain the moderation logic executed by the Edge Function.

The Edge Function performs the following tasks:

Accept a POST request containing user content.
Send the content to the AI model through Model Gateway.
Classify the content as SAFE or UNSAFE.
Upload attachments to Storage if present.
Insert approved content into the comments table.
Return a structured moderation response.

All moderation logic runs inside the Edge Function, keeping the backend workflow centralized within InsForge.

Deploy the function using the InsForge CLI:

insforge functions deploy moderate-comment--file ./insforge-functions/moderate-comment/handler.ts

Once deployed, the function becomes available as a backend API endpoint that the frontend application can call.

Step 3: AI Integration Inside the Function

The moderation logic inside the Edge Function uses Model Gateway, which provides unified access to multiple AI models directly within InsForge.

Model Gateway allows Edge Functions to call AI models without configuring external API clients or managing provider-specific integrations.

Open the Model Gateway section in the InsForge dashboard and enable a model for the project.

For this tutorial, enable:

openai/gpt-4o-mini

This model will be used to classify incoming content during moderation.

Use the CLI to send a test request to the moderation API.

insforge functions invoke moderate-comment--data"{\"content\":\"This community platform is very helpful.\"}"

This command sends a JSON payload containing the content field to the Edge Function.

The Edge Function also inserts the approved comment into the comments table in the database.

Step 4: Configuring Insforge Storage

The moderation workflow also supports optional file uploads using InsForge Storage. Storage provides an S3-compatible object storage system that integrates directly with Edge Functions and the database.

When a user submits a comment with an attachment, the Edge Function uploads the file to a storage bucket before inserting the comment into PostgreSQL.

Create a Storage Bucket

Open the Storage section in the InsForge dashboard.

Navigate to Storage in the sidebar.
Click Create Bucket.
Name the bucket: attachments

This bucket will store files uploaded with moderated comments.

The upload operation returns a public file URL, which is stored in the attachment_url column of the comments table.

The moderation function processes attachments as follows:

The user submits content with an optional file.
The Edge Function evaluates the text using AI moderation.
If the content is classified as SAFE, the file is uploaded to the attachments bucket.
The returned file URL is stored in the comments table.
If the content is UNSAFE, the function rejects the request and no file is uploaded.

This ensures that only approved content and attachments are stored, keeping the storage system aligned with the moderation rules.

Step 5: Building the Next.js UI

The repository already includes a Next.js application that provides a simple interface for interacting with the moderation API.

Navigate to the frontend code inside the src directory.

Key UI Files

File / Folder	Purpose
`src/app/page.tsx`	Main page that renders the moderation interface
`src/components`	Reusable UI components for the moderation workflow
`src/lib/insforge.ts`	Utility for connecting the frontend to the InsForge backend

The UI includes a form where users submit content for moderation.

The form collects:

Text content entered by the user
Optional file attachment
Submit an action that triggers the moderation request

When the user submits the form, the application sends a POST request to the Edge Function endpoint.

The UI handles the API response and updates the interface accordingly.

Approved comments appear in the moderation results section.
Rejected content displays an error message.
Approved entries are also visible in the comments database table.

This setup creates a complete workflow where the Next.js UI communicates with the InsForge Edge Function to perform moderation in real time.

Using an AI Agent to Build the UI

You can also accelerate this step using an AI coding agent (such as Cursor, Claude Code, or other agent-based tools). Instead of manually writing the UI components, the agent can generate the form, API calls, and component structure based on a prompt.

Example prompt:

Create a Next.js page for a content moderation demo.

Requirements:
- A form with a textarea for user comments
- An optional file upload input
- A submit button
- Send a POST request to the InsForge Edge Function endpoint for moderation
- Display the moderation result (approved or rejected) in the UI
- Use React state to handle form submission and responses

Step 6: Testing the API Endpoint

After deploying the Edge Function and setting up the UI, test the moderation workflow to verify that the API behaves correctly.

Submit Safe Content

Enter a comment through the UI and submit the form.

Expected behavior:

The Edge Function sends the content to the AI moderation model.
The model classifies the text as SAFE.
The function inserts the comment into the comments table in PostgreSQL.
If an attachment is included, the file is uploaded to the attachments storage bucket.
The API returns an approved response to the frontend.

Next, test a rejection case.

Expected behavior:

The Edge Function sends the text to the AI moderation model.
The model classifies the content as UNSAFE.
The function immediately returns a rejection response.
No entry is inserted into the comments table.
No file is uploaded to Storage.

The table in your Insforge dashboard also reflects the results:

Step 6: Deployment Using InsForge

Once the function and UI are ready, deploy the backend using the InsForge CLI. This publishes the Edge Function and connects it to the project environment.

Refer to the deployment guide here

Authenticate the CLI with your InsForge account.

insforge auth login

Complete the authentication process in the browser. Link the local project directory to your InsForge backend.

insforge link

Select the project created earlier in the InsForge dashboard. This connects the CLI to the correct backend workspace.

Deploy the Next.js application while passing the required environment variable.

insforge deployments deploy .--env"{\"NEXT_PUBLIC_INSFORGE_BASE_URL\":\"https://your-project.insforge.app\"}"

This environment variable allows the frontend to communicate with the deployed Edge Function.

Verify the Deployment

After deployment, the application becomes accessible via the InsForge-hosted domain.

Access the live demo here

Using MCP to Accelerate Development

Instead of manually creating tables, storage buckets, and Edge Functions, you can also configure the backend using Remote MCP (Model Context Protocol).

MCP exposes InsForge backend capabilities as tools that an AI coding agent can call to provision resources automatically. With a single prompt, the agent can generate the database schema, configure storage, and deploy the moderation function.

Example prompt used to create this backend workflow:

Create backend resources for a content moderation application using InsForge.

Requirements:

1. Create a PostgreSQL table named "comments" with fields:
   id (UUID primary key)
   content (text)
   attachment_url (text, nullable)
   status (text)
   created_at (timestamp)

2. Create a storage bucket named "attachments" for storing uploaded files.

3. Create an Edge Function named "moderate-comment" that:
   - accepts POST requests with comment text
   - sends the text to an AI model
   - classifies the content as SAFE or UNSAFE
   - uploads attachments to storage if present
   - inserts approved content into the database

Using MCP, developers can provision backend resources and deploy functions directly from prompts, significantly accelerating backend setup while keeping the same architecture described in this tutorial.

Refer to the quick demo here.

Conclusion

In this tutorial, we built a content moderation API using InsForge Edge Functions, integrated AI-powered classification through Model Gateway, stored approved results in PostgreSQL, and handled optional file uploads with Storage. The entire workflow runs inside InsForge, without external servers or fragmented infrastructure.

This approach demonstrates how developers can combine Edge Functions, AI integration, database services, and storage to implement production-ready backend APIs with minimal operational overhead.

If your application relies on user-generated content, moderation pipelines, or AI-assisted workflows, this architecture provides a straightforward and scalable foundation.

Ready to simplify your backend stack? Explore InsForge’s Edge Functions, Model Gateway, PostgreSQL database, and Storage services to build intelligent APIs without managing infrastructure.

Try InsForge
Quickstart guide here

Cursor Composer 2: Features, Pricing, Benchmarks, and Initial Impressions

Arindam Majumder — Thu, 19 Mar 2026 20:25:28 +0000

Introduction

Cursor has released Composer 2, the latest version of its in-house coding model.

The announcement is focused and fairly easy to summarize. Cursor is making three main claims:

Composer 2 is frontier-level at coding
it is materially better than previous Composer versions on Cursor’s published benchmarks
it is priced aggressively enough to be practical for everyday use

That combination makes the release worth paying attention to. In this post, I’ll walk through what Composer 2 is, what Cursor says improved, how the benchmark results look, what the pricing means, and my initial take on the release.

What Is Composer 2?

Composer 2 is Cursor’s latest in-house coding model.

Cursor describes it as frontier-level at coding and positions it as a better cost-performance option for agentic software work. The model is now available in Cursor, and the announcement puts most of the emphasis on three areas:

stronger coding performance
improved long-horizon task handling
lower cost than many competing fast models

Unlike some model launches that bundle a large number of product features together, this one is mostly about the model itself. Cursor is not presenting Composer 2 as a general platform shift. It is presenting it as a more capable and more economical coding model.

Composer 2 Key Features

The Composer 2 announcement is short, but there are still a few important takeaways.

Better coding performance

Cursor says Composer 2 delivers large improvements on all of the benchmarks it tracks, including Terminal-Bench 2.0 and SWE-bench Multilingual.

That matters because it suggests the gains are not limited to one internal evaluation. Cursor is showing improvement across several coding-oriented benchmarks rather than relying on a single headline number.

Continued pretraining

One of the most notable details in the post is that these improvements come from Cursor’s first continued pretraining run.

This is important because continued pretraining is often what gives a model a stronger base before more specialized post-training methods are applied. Cursor is explicitly saying that Composer 2 starts from a better foundation than earlier Composer versions.

Reinforcement learning for long-horizon tasks

Cursor also says it trains Composer 2 on long-horizon coding tasks using reinforcement learning.

This is probably the most interesting technical claim in the announcement. Cursor says Composer 2 can solve challenging tasks requiring hundreds of actions. That implies the model is being optimized for sustained multi-step software tasks, not just short code completions or simple edits.

A fast variant with the same intelligence

Cursor also introduces a faster Composer 2 variant and says it has the same intelligence.

That is a useful product choice. Instead of forcing users to pick between a “smart” model and a “fast” model family, Cursor is presenting speed as a deployment option on top of the same underlying capability level.

Composer 2 Benchmarks

Cursor publishes three benchmark comparisons in the announcement:

Model	CursorBench	Terminal-Bench 2.0	SWE-bench Multilingual
Composer 2	61.3	61.7	73.7
Composer 1.5	44.2	47.9	65.9
Composer 1	38.0	40.0	56.9

These gains are large enough to be meaningful.

The biggest point here is not just that Composer 2 is ahead of Composer 1 and 1.5, but that the improvements show up consistently across all three benchmarks. That gives the release more credibility than a single isolated result would.

Terminal-Bench 2.0 is especially relevant because Cursor frames it as an evaluation for agentic terminal use. If Composer 2 is genuinely stronger there, that supports Cursor’s claim that the model is getting better at longer, more interactive coding tasks.

SWE-bench Multilingual is also worth noting because it suggests broader coding competence beyond narrow English-only setups.

Still, these are vendor-published numbers, so the right takeaway is measured optimism rather than certainty.

How Composer 2 Is Priced

Cursor says Composer 2 is priced at:

$0.50 per million input tokens
$2.50 per million output tokens

The faster variant is priced at:

$1.50 per million input tokens
$7.50 per million output tokens

Cursor also says the fast variant has lower cost than other fast models and that fast will be the default option.

This part of the announcement is more important than it looks. Model releases are usually judged on benchmark quality first, but pricing determines whether a model becomes part of normal daily use or gets reserved for occasional high-value tasks. Cursor is clearly trying to push Composer 2 into the first category.

On individual plans, Composer usage is part of a standalone usage pool with generous usage included.

Composer 2 vs Earlier Composer Versions

Based on Cursor’s published table, Composer 2 is a clear step up from Composer 1.5 and Composer 1.

The improvement is visible across all the benchmarks included in the post, and Cursor attributes that jump to a combination of:

a stronger base model from continued pretraining
reinforcement learning on long-horizon coding tasks

That is a sensible recipe for a coding model. Better base training improves general capability, while long-horizon RL helps the model stay coherent over extended multi-step tasks.

From the announcement alone, Composer 2 looks like a real model upgrade rather than a minor iteration.

Initial Impressions

My first impression is that this is a disciplined release.

Cursor is not trying to claim that Composer 2 changes everything. The message is narrower and more believable: the model is better, it handles long-horizon coding tasks more effectively, and it is priced aggressively enough to be useful in regular workflows.

The long-horizon point is the one I would pay most attention to. A lot of coding models can produce a good patch in one pass. Fewer models stay reliable across a task that unfolds over many actions. If Composer 2 is genuinely stronger there, that is a meaningful improvement.

The pricing is the other major strength. A coding model can be strong on benchmarks and still be awkward in practice if the economics are wrong. Cursor seems to understand that and is making cost a central part of the launch rather than an afterthought.

At the same time, this is still an announcement built around Cursor’s own evaluation framing. The benchmark gains look strong, but the real test will be whether Composer 2 feels materially better in day-to-day software work.

Final Thoughts

Composer 2 looks like a meaningful upgrade to Cursor’s coding model stack.

The release is compelling for three reasons:

the benchmark gains are substantial
the training story is technically coherent
the pricing is practical

If you already use Cursor, Composer 2 is worth trying.

If you evaluate coding models more broadly, this release is notable because it tries to improve both capability and economics at the same time. That is the right combination to optimize for.