<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sunil Kumar Dash</title>
    <description>The latest articles on Forem by Sunil Kumar Dash (@sunilkumrdash).</description>
    <link>https://forem.com/sunilkumrdash</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F898740%2Fbe1827e3-0e8a-40dc-8b74-90c2913aa39e.jpg</url>
      <title>Forem: Sunil Kumar Dash</title>
      <link>https://forem.com/sunilkumrdash</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sunilkumrdash"/>
    <language>en</language>
    <item>
      <title>I rebuilt OpenClaw from scratch without the security flaws</title>
      <dc:creator>Sunil Kumar Dash</dc:creator>
      <pubDate>Mon, 16 Feb 2026 10:33:21 +0000</pubDate>
      <link>https://forem.com/composiodev/i-rebuilt-openclaw-from-scratch-without-the-security-flaws-2mle</link>
      <guid>https://forem.com/composiodev/i-rebuilt-openclaw-from-scratch-without-the-security-flaws-2mle</guid>
      <description>&lt;p&gt;OpenClaw launched with great fanfare, and I was curious whether you could truly "vibe code" the entire project on your own, especially since the original creator built it with Codex. We're in the era of "build it yourself instead of setting it up" and I wanted to take that philosophy a step further by recreating it from scratch.&lt;/p&gt;

&lt;p&gt;This is the story of how I rebuilt OpenClaw using modern coding agent SDKs, tackled integration challenges across multiple messaging platforms, and deployed it securely in production, all while avoiding the security pitfalls of the original.&lt;/p&gt;

&lt;p&gt;Check out the repository here: &lt;a href="https://github.com/ComposioHQ/secure-openclaw" rel="noopener noreferrer"&gt;Secure OpenClaw&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Research &amp;amp; Planning
&lt;/h2&gt;

&lt;p&gt;The first thing I did was use GPT Pro mode to research the entire codebase and explain all the features and tools used. The Pro model excels at these broad tasks that require processing large amounts of information in a single shot. It gave me a detailed product spec on how OpenClaw works and what it uses for each functionality.&lt;/p&gt;

&lt;p&gt;I decided to use coding agent SDKs because they represent the first real use cases people have had with LLMs beyond writing. Claude provides the Claude Agent SDK, and OpenCode provides a similar SDK. These SDKs natively provide access to tools like read, write, bash, edit, and support for skills and MCP (Model Context Protocol).&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;I wanted to set up two modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Terminal mode&lt;/strong&gt;: For direct interaction and development&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gateway mode&lt;/strong&gt;: For 24/7 operation, listening to WhatsApp, Telegram, Signal, iMessage, and other messaging apps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gateway architecture is what makes OpenClaw powerful: it runs continuously in the background, monitoring multiple communication channels and responding autonomously.&lt;/p&gt;

&lt;h2&gt;
  
  
  Messaging Platform Integrations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  WhatsApp
&lt;/h3&gt;

&lt;p&gt;WhatsApp integration uses a library called &lt;a href="https://github.com/WhiskeySockets/Baileys" rel="noopener noreferrer"&gt;Baileys&lt;/a&gt; to establish a WhatsApp Web connection. Here's how it works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Baileys connects to WhatsApp Web's WebSocket&lt;/li&gt;
&lt;li&gt;When a message arrives, WhatsApp's server pushes it via WebSocket&lt;/li&gt;
&lt;li&gt;Baileys emits a &lt;code&gt;messages.upsert&lt;/code&gt; event with type &lt;code&gt;'notify'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The agent can then process and respond to the message&lt;/li&gt;
&lt;/ul&gt;
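
&lt;p&gt;A minimal sketch of that event handling, assuming Baileys' usual event shape (a &lt;code&gt;type&lt;/code&gt; plus a &lt;code&gt;messages&lt;/code&gt; array); in the real gateway this function would be registered via &lt;code&gt;sock.ev.on('messages.upsert', handler)&lt;/code&gt;:&lt;/p&gt;

```javascript
// Hypothetical handler for Baileys' messages.upsert event.
// In a running gateway it would be wired up with:
//   sock.ev.on("messages.upsert", (evt) =&gt; agent.handle(extractIncoming(evt)))
function extractIncoming(event) {
  // Only 'notify' batches contain genuinely new messages;
  // other types (e.g. history syncs) can be ignored here.
  if (event.type !== "notify") return [];
  return event.messages.filter(function (m) {
    return !m.key.fromMe; // skip messages the agent itself sent
  });
}
```

&lt;p&gt;Filtering out &lt;code&gt;fromMe&lt;/code&gt; messages prevents the agent from replying to its own outgoing messages in a loop.&lt;/p&gt;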

&lt;p&gt;One challenge I encountered was creating the allowlist for WhatsApp numbers. WhatsApp doesn't use phone numbers directly in the WebSocket connection; it uses link IDs. Messages arrive with these IDs, and I needed bidirectional conversion between phone numbers and link IDs. Claude Code initially struggled with building the right mapping, but after some iteration, we got it working correctly.&lt;/p&gt;
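
&lt;p&gt;The mapping can be sketched as a simple bidirectional lookup; the ID formats below are illustrative, not WhatsApp's exact wire format:&lt;/p&gt;

```javascript
// Sketch of the phone-number / link-ID mapping used for the allowlist.
// IDs here are made up for illustration.
function createIdMap() {
  const phoneToLid = new Map();
  const lidToPhone = new Map();
  return {
    link(phone, lid) {
      phoneToLid.set(phone, lid);
      lidToPhone.set(lid, phone);
    },
    // Allowlist check: a message is accepted only if the sender's
    // link ID resolves back to an allowlisted phone number.
    isAllowed(lid, allowlist) {
      const phone = lidToPhone.get(lid);
      return phone !== undefined ? allowlist.includes(phone) : false;
    },
  };
}
```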

&lt;h3&gt;
  
  
  Telegram
&lt;/h3&gt;

&lt;p&gt;Telegram was much more straightforward thanks to its Bot API. The implementation uses &lt;strong&gt;long polling&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Periodically calls Telegram's &lt;code&gt;getUpdates&lt;/code&gt; API&lt;/li&gt;
&lt;li&gt;Waits up to 30 seconds for new messages&lt;/li&gt;
&lt;li&gt;When a message arrives, it immediately returns and calls &lt;code&gt;getUpdates&lt;/code&gt; again&lt;/li&gt;
&lt;li&gt;Emits a &lt;code&gt;message&lt;/code&gt; event for each new message&lt;/li&gt;
&lt;/ul&gt;
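
&lt;p&gt;The steps above can be sketched like this; &lt;code&gt;fetchJson&lt;/code&gt; is a hypothetical stand-in for an HTTP GET that returns parsed JSON, and the offset bookkeeping is what stops &lt;code&gt;getUpdates&lt;/code&gt; from returning the same batch twice:&lt;/p&gt;

```javascript
// Telegram requires offset = highest seen update_id + 1;
// otherwise getUpdates keeps re-delivering the same updates.
function nextOffset(updates, current) {
  let max = current - 1;
  for (const u of updates) {
    if (u.update_id > max) max = u.update_id;
  }
  return max + 1;
}

// One iteration of the long-polling loop (run repeatedly in practice).
async function pollOnce(fetchJson, token, offset) {
  const params = new URLSearchParams({ timeout: "30", offset: String(offset) });
  const url = "https://api.telegram.org/bot" + token + "/getUpdates?" + params;
  const body = await fetchJson(url); // the server holds this open up to 30s
  return { updates: body.result, offset: nextOffset(body.result, offset) };
}
```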

&lt;p&gt;The Bot API is well-documented and significantly easier to set up than WhatsApp.&lt;/p&gt;

&lt;h3&gt;
  
  
  iMessage
&lt;/h3&gt;

&lt;p&gt;iMessage integration was a fascinating unlock. It uses a library called &lt;a href="https://github.com/steipete/imessage-exporter" rel="noopener noreferrer"&gt;imsg&lt;/a&gt;, built by Peter Steinberger himself. The approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads the SQLite database where all iMessages are stored&lt;/li&gt;
&lt;li&gt;Monitors the database using &lt;strong&gt;FSEvents&lt;/strong&gt;, a kernel-level file system monitoring API on macOS&lt;/li&gt;
&lt;li&gt;Detects new messages in real-time as they're written to the database&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives the agent access to iMessage without requiring any official API.&lt;/p&gt;
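
&lt;p&gt;The change-detection step can be sketched as follows: when FSEvents signals that the database changed, fetch the rows with a &lt;code&gt;ROWID&lt;/code&gt; above the last one seen. The row shape here just mimics the message table for illustration:&lt;/p&gt;

```javascript
// Given rows ordered by ROWID ascending (as a SQLite query would return
// them), pick out only the messages written since the last check.
function newMessages(rows, lastRowId) {
  const fresh = rows.filter(function (r) {
    return r.ROWID > lastRowId;
  });
  const next = fresh.length > 0 ? fresh[fresh.length - 1].ROWID : lastRowId;
  return { fresh, lastRowId: next };
}
```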




&lt;h2&gt;
  
  
  Tools &amp;amp; Integrations
&lt;/h2&gt;

&lt;p&gt;As they say, an agent is nothing without the tools it uses. I equipped the agent with:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core Tools:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read, Write, Edit (file operations)&lt;/li&gt;
&lt;li&gt;Bash (command execution)&lt;/li&gt;
&lt;li&gt;Glob, Grep (file searching)&lt;/li&gt;
&lt;li&gt;TodoWrite (task management)&lt;/li&gt;
&lt;li&gt;Skill (access to predefined workflows)&lt;/li&gt;
&lt;li&gt;AskUserQuestion (user interaction)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Custom Tools:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cron tools for scheduled tasks&lt;/li&gt;
&lt;li&gt;Gateway tools for WhatsApp and Telegram communication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Third-Party Integrations:&lt;/strong&gt; For secure integration with services like Slack, GitHub, Teams, and more, I used &lt;a href="https://composio.dev/" rel="noopener noreferrer"&gt;Composio&lt;/a&gt;. Composio lets you securely connect and use these tools in a sandbox environment while handling all the credentials and authentication.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deployment Challenges
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Docker Setup
&lt;/h3&gt;

&lt;p&gt;I created a Docker setup designed to run in the background on a DigitalOcean droplet. The goal was to make it quickly deployable without too many setup hassles. However, I ran into several issues:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 1: OOM (Out of Memory) Errors&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Running on a $6/month instance with 2GB RAM, the container kept crashing. The issue? It tried installing Claude Code and OpenAI's SDK at the same time, exhausting available memory. Once I identified this, I staggered the installations and the problem was resolved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 2: Permission Mode Conflicts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The gateway uses &lt;code&gt;permissionMode: 'bypassPermissions'&lt;/code&gt; so the agent can run autonomously without human approval for each tool call. However, Claude Code refuses to enable this when running as root, a built-in security feature.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I had to restructure the entire Dockerfile to use a non-root user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create non-root user (Claude Code refuses bypassPermissions as root)
RUN useradd -m -s /bin/bash claw &amp;amp;&amp;amp; chown -R claw:claw /app
USER claw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This cascaded into fixing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All file paths (&lt;code&gt;/root/&lt;/code&gt; → &lt;code&gt;/home/claw/&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Docker Compose volume mounts&lt;/li&gt;
&lt;li&gt;CLI installation directories&lt;/li&gt;
&lt;li&gt;Workspace permissions&lt;/li&gt;
&lt;/ul&gt;
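
&lt;p&gt;As an illustration, the volume-mount side of that change looked roughly like the following; the service and volume names are examples, not the project's actual config:&lt;/p&gt;

```yaml
# Illustrative compose fragment after the non-root switch.
services:
  claw:
    build: .
    user: claw
    volumes:
      - ./workspace:/home/claw/workspace   # was ./workspace:/root/workspace
      - claw-config:/home/claw/.config     # CLI state survives restarts
volumes:
  claw-config:
```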

&lt;p&gt;The refactoring took several hours but resulted in a much more secure deployment that adheres to best practices.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Modern coding agents are incredibly capable&lt;/strong&gt; - With proper tooling and context, they can rebuild complex systems from scratch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security by design matters&lt;/strong&gt; - The forced non-root user setup, while initially frustrating, led to a more secure architecture&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration complexity varies wildly&lt;/strong&gt; - Telegram took 30 minutes, WhatsApp took hours, iMessage required creative solutions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource constraints force better architecture&lt;/strong&gt; - The 2GB RAM limitation pushed me to optimize installation and runtime behavior&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation is everything&lt;/strong&gt; - Services with good APIs (like Telegram) are significantly easier to integrate than those requiring reverse engineering&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The rebuilt OpenClaw is now running in production, handling messages across multiple platforms without the security issues that plagued the original. Future improvements include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adding more messaging platforms (Discord, Slack DMs)&lt;/li&gt;
&lt;li&gt;Implementing better error handling and retry logic&lt;/li&gt;
&lt;li&gt;Creating a web dashboard for monitoring and configuration&lt;/li&gt;
&lt;li&gt;Optimizing memory usage to run on even smaller instances&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Building this from scratch was an excellent exercise in understanding how modern AI agents work in production. The combination of LLM capabilities, proper tooling, and careful architecture makes it possible to create powerful autonomous systems that were previously extremely difficult to build.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>javascript</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to better your Claude CoWork experience with MCPs</title>
      <dc:creator>Sunil Kumar Dash</dc:creator>
      <pubDate>Mon, 19 Jan 2026 13:08:02 +0000</pubDate>
      <link>https://forem.com/composiodev/how-to-better-your-claude-cowork-experience-with-mcps-3hfp</link>
      <guid>https://forem.com/composiodev/how-to-better-your-claude-cowork-experience-with-mcps-3hfp</guid>
      <description>&lt;p&gt;Right when everyone was busy talking about how good Claude Code is, Anthropic launched Claude CoWork, basically Claude Code with a much less intimidating interface for automating fake email jobs. It can access your local file system, connectors, MCPs, and do almost everything that can be executed through the shell.&lt;/p&gt;

&lt;p&gt;Claude CoWork is currently available as a research preview in the Claude Desktop app as a separate tab for Max subscribers ($100 or $200 per month plans) on macOS, with Windows support planned for the future. &lt;/p&gt;

&lt;p&gt;The tool works by giving users access to a folder on their computer, where it can read, edit, or create files on their behalf. It works inside a local containerised environment by mounting your local folders, which means you can trust that it won’t access folders that you haven’t granted permission to.&lt;/p&gt;

&lt;p&gt;There’s a lot to say about CoWork, but that’s for a separate blog post. This one covers using connectors and MCPs to do more than organising files.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;If you don't want to waste time, just use &lt;a href="https://rube.app" rel="noopener noreferrer"&gt;rube.app&lt;/a&gt; inside Claude CoWork. &lt;br&gt;
You'll get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instant access to 900+ SaaS apps (Gmail, GitHub, Bitbucket, etc.)&lt;/li&gt;
&lt;li&gt;Zero OAuth and key-management hassle&lt;/li&gt;
&lt;li&gt;Dynamic tool loading, hence reduced token usage and better execution&lt;/li&gt;
&lt;li&gt;Reusable workflows you can access as tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc85w46fwu1cupjmgdjm2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc85w46fwu1cupjmgdjm2.png" alt="Claude Cowork with 900+ MCPs"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://rube.app" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Try Rube Now for FREE&lt;/a&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  Working with MCP Connectors
&lt;/h2&gt;

&lt;p&gt;Claude AI Connectors are direct integrations that let Claude access your actual work tools and data. Launched in July 2025, these connectors transform Claude from an AI that knows a lot about the world into an AI that knows a lot about &lt;em&gt;your&lt;/em&gt; world.&lt;/p&gt;

&lt;p&gt;Claude comes with pre-built integrations, including Gmail, Google Drive, GitHub, and Google Calendar. Apart from these, there are tons of local and remote MCP servers from HubSpot, Snowflake, Figma, and Context7. &lt;/p&gt;

&lt;h3&gt;
  
  
  Using default Integrations
&lt;/h3&gt;

&lt;p&gt;For default integrations, all you need to do is just connect your accounts and start working with them. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Settings &amp;gt; Connectors&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Find the integration you want to enable&lt;/li&gt;
&lt;li&gt;Click the "Connect" button&lt;/li&gt;
&lt;li&gt;Follow the authentication flow&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Pro, Max, Team, and Enterprise users can add these connectors to Claude or Claude Desktop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9fdew1sngbcd7ygkrdhd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9fdew1sngbcd7ygkrdhd.png" alt="Image 1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Using Anthropic Marketplace Connectors
&lt;/h3&gt;

&lt;p&gt;Anthropic has an MCP marketplace where you can find Anthropic-reviewed tools, both local and remote-hosted connectors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Desktop/Local MCPs:&lt;/strong&gt; Click Desktop → Search your MCP → Click Install&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fois14qjojaysejlrlpmm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fois14qjojaysejlrlpmm.png" alt="Image 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For remote MCPs:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Navigate to Browse Connectors&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On the Web tab, search your MCPs &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsdzjshteys5mbmh5ri08.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsdzjshteys5mbmh5ri08.png" alt="Image 3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Provide your server URL if needed, and you’re done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom MCP Server&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the most interesting part. You can use whatever MCP servers you prefer.&lt;/p&gt;

&lt;p&gt;Click on Add a Custom Connector → Provide MCP name and Server URL → (Optional) OAuth credentials&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2kwp7pmw1xokw2vzzjbo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2kwp7pmw1xokw2vzzjbo.png" alt="Image 4"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  But… you shouldn’t be using MCP servers
&lt;/h2&gt;

&lt;p&gt;MCP servers are definitely a force multiplier, making it easy for LLMs to access data. However, they have practical limitations.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. MCPs are token-hungry
&lt;/h3&gt;

&lt;p&gt;Each MCP tool has a schema definition: what it does, the parameters, and sometimes examples. The more detailed the tool definitions, the more reliable the execution; however, LLMs have a limited context window (200K tokens for Claude). And it’s well known that LLMs are more effective when their context is not bloated. The more MCPs there are, the less space there is for actual execution.&lt;/p&gt;

&lt;p&gt;For example, the GitHub and Linear official MCPs have 40 and 27 tools, respectively, and they consume 17.1K tokens (8.5%). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7u1xwycat03slp4utyj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7u1xwycat03slp4utyj.png" alt="Image 5"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Tool definitions are always loaded, even when unused
&lt;/h3&gt;

&lt;p&gt;Most MCP clients eagerly load all available tools into the model context. That means tools the model will never call still consume tokens on every request.&lt;/p&gt;

&lt;p&gt;If your server exposes 20 endpoints but a given task only needs 2, the model still incurs the cost of all 20. Over time, this pushes teams to artificially split MCP servers, not for architectural clarity, but to work around context limits.&lt;/p&gt;

&lt;p&gt;This also discourages experimentation. Engineers hesitate to add new tools because every addition degrades all existing interactions.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Large tool outputs quietly destroy context
&lt;/h3&gt;

&lt;p&gt;The biggest failures are less about schemas; they are caused by results.&lt;/p&gt;

&lt;p&gt;Logs, database rows, file lists, search results, stack traces, and JSON blobs all flow straight back into the model. Even a single careless response can erase half the conversation history.&lt;/p&gt;

&lt;p&gt;This is far from ideal: once the context fills with tool output, the model starts losing track of earlier goals.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Tool selection degrades as tool count grows
&lt;/h3&gt;

&lt;p&gt;As the number of MCP tools increases, tool selection accuracy drops.&lt;/p&gt;

&lt;p&gt;Models begin to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Call near matches instead of the correct tool&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Overuse generic tools&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Avoid tools altogether and hallucinate answers&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This happens even if all tools are well described. The attention budget simply is not infinite. Past a certain point, the model stops fully reading tool definitions.&lt;/p&gt;

&lt;p&gt;You can observe this directly by adding more tools and watching call precision decline.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to fix this?
&lt;/h2&gt;

&lt;p&gt;By implementing a few architectural improvements:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. On-demand tool loading
&lt;/h3&gt;

&lt;p&gt;Instead of loading every tool definition into the context upfront, only load the tools you actually need for the current task.&lt;/p&gt;

&lt;p&gt;This is the simplest way to cut token usage, because tool schemas are the “always-on” cost. If you can turn that into a “pay only when used” cost, you immediately get more room for reasoning and better reliability.&lt;/p&gt;

&lt;p&gt;We’ve implemented this in Rube, a universal MCP server that dynamically loads tools based on the task context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A Planner tool that produces a detailed plan for a task, and a Search tool that finds and retrieves the required tools.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When the model needs something, it asks for the specific tool definition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Only then do you inject that tool’s schema into the context.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This also fixes the experimentation problem. You can add more tools without degrading every session, since most sessions won't load them.&lt;/p&gt;
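
&lt;p&gt;A minimal sketch of the idea, with illustrative tool names and a string-length stand-in for token cost:&lt;/p&gt;

```javascript
// On-demand tool loading: the model sees only names and one-line
// purposes up front, and pays the full schema cost per tool used.
function createRegistry(catalogue) {
  const loaded = new Map(); // schemas injected into context so far
  return {
    list() {
      return catalogue.map(function (t) {
        return { name: t.name, purpose: t.purpose };
      });
    },
    load(name) {
      if (!loaded.has(name)) {
        const tool = catalogue.find(function (t) { return t.name === name; });
        if (tool) loaded.set(name, tool.schema);
      }
      return loaded.get(name);
    },
    contextCost() {
      // Rough proxy for tokens spent on schemas (characters, not tokens).
      let total = 0;
      for (const schema of loaded.values()) total += JSON.stringify(schema).length;
      return total;
    },
  };
}
```

&lt;p&gt;Unloaded tools cost nothing, so adding a hundredth tool to the catalogue doesn't degrade sessions that never touch it.&lt;/p&gt;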

&lt;h3&gt;
  
  
  2. Indexing tools for better discoverability
&lt;/h3&gt;

&lt;p&gt;Tool selection gets worse as the tool count grows, even if every tool is well described.&lt;/p&gt;

&lt;p&gt;So don’t rely on the model to “scan” a long list of tools. Give it a way to search tools like an index.&lt;/p&gt;

&lt;p&gt;The pattern is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Maintain a small searchable catalogue of tools, effectively a vector database with hybrid search (full-text match + vector embeddings of tool definitions)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Each entry has: tool name, one-line purpose, key parameters, and a few example queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Let the model search the catalogue with natural language.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Return the top 3-5 matches, then load only those schemas.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also makes tool naming less painful. Even if a tool name is slightly off, the index can still match on description.&lt;/p&gt;
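
&lt;p&gt;A toy version of that catalogue search, with plain keyword overlap standing in for the hybrid full-text-plus-embedding scoring:&lt;/p&gt;

```javascript
// Rank catalogue entries by how many query words appear in the
// tool's name or one-line purpose, then return the top k names.
function searchTools(catalogue, query, k) {
  const words = query.toLowerCase().split(/\s+/);
  const scored = catalogue.map(function (t) {
    const hay = (t.name + " " + t.purpose).toLowerCase();
    let score = 0;
    for (const w of words) {
      if (hay.includes(w)) score += 1;
    }
    return { tool: t, score };
  });
  scored.sort(function (a, b) { return b.score - a.score; });
  return scored
    .slice(0, k)
    .filter(function (s) { return s.score > 0; })
    .map(function (s) { return s.tool.name; });
}
```

&lt;p&gt;Because matching runs over descriptions too, a slightly-off tool name still resolves to the right entry.&lt;/p&gt;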

&lt;h3&gt;
  
  
  3. Handling Large Outputs outside the LLM’s context
&lt;/h3&gt;

&lt;p&gt;This is the biggest lever.&lt;/p&gt;

&lt;p&gt;Most MCP failures occur when tools return a large payload, and you paste it straight back into the model. Once you do that, the session starts forgetting earlier goals and acting strangely.&lt;/p&gt;

&lt;p&gt;The fix is to stop treating the model like your output buffer.&lt;/p&gt;

&lt;p&gt;Instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Store large outputs outside the prompt (local file, object store, database, even a temp cache).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Return a small summary plus a handle (file path, ID, cursor, pointer).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Models are extremely good at file operations, and storing large blobs in file storage and letting the model retrieve only what’s needed can go a long way.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model should never be forced to read 200 KB of JSON just because the tool had it available.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Programmatic Tool Calling or CodeAct
&lt;/h3&gt;

&lt;p&gt;LLMs are extremely good at writing code. So, instead of giving LLMs direct MCP tools, it's better to give them a workbench where they can write glue code for MCP tool chaining and execute it to get outputs.&lt;/p&gt;

&lt;p&gt;Instead of LLMs calling a tool, waiting, reading the result, then deciding the next tool call (and repeating that cycle over and over), LLMs &lt;strong&gt;write a small chunk of code inside a code execution container that calls your tools as functions&lt;/strong&gt;. That code can loop, branch, filter, aggregate, and stop early without requiring a new model round-trip for every step.&lt;/p&gt;

&lt;p&gt;The reason this matters for MCPs is context.&lt;/p&gt;

&lt;p&gt;With traditional tool calling, every intermediate result is included in the chat and consumes token space. With programmatic tool calling, &lt;strong&gt;the intermediate tool results are processed inside the code execution environment and do not enter Claude’s context&lt;/strong&gt;. Claude only sees the final output of the code, which is usually a much smaller summary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling" rel="noopener noreferrer"&gt;Anthropic’s guidance&lt;/a&gt; is that it pays off most when you have any of these patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Large datasets where you only need aggregates or summaries&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-step workflows with 3 or more dependent tool calls&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Filtering, sorting, or transforming tool results before Claude sees them&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Parallel operations across many items (for example, checking 50 things)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tasks where intermediate data should not influence reasoning&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is some overhead because you are adding code execution to the loop, so it’s less useful for a single quick lookup.&lt;/p&gt;
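
&lt;p&gt;The "parallel operations across many items" case can be sketched like this; &lt;code&gt;fetchStatus&lt;/code&gt; is a hypothetical stand-in for any MCP tool exposed to the sandbox as a function:&lt;/p&gt;

```javascript
// Glue code the model would write inside the execution container:
// loop over items, keep intermediate results local, and return
// only the aggregate, so 50 raw responses never enter the context.
function checkAll(fetchStatus, ids) {
  const failing = [];
  for (const id of ids) {
    const s = fetchStatus(id); // intermediate result stays in the sandbox
    if (s.ok !== true) failing.push(id);
  }
  return { checked: ids.length, failing }; // the only thing the model sees
}
```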




&lt;h2&gt;
  
  
  We’ve already solved it
&lt;/h2&gt;

&lt;p&gt;Before this became mainstream knowledge (thanks to Anthropic’s &lt;a href="https://www.anthropic.com/engineering/code-execution-with-mcp" rel="noopener noreferrer"&gt;blog post&lt;/a&gt;), we had already implemented the pattern at scale with &lt;a href="https://rube.app/" rel="noopener noreferrer"&gt;Rube&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It’s an MCP server with meta tools that implements the above design patterns and more. This is a wrapper over our core tool infrastructure. You can access all our &lt;a href="https://composio.dev/toolkits" rel="noopener noreferrer"&gt;877 SaaS toolkits&lt;/a&gt; without the headaches of implementing authentication.&lt;/p&gt;

&lt;p&gt;Here’s what we’ve got in Rube MCP.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discovery &amp;amp; Connection Tools
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RUBE_SEARCH_TOOLS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Discovers relevant tools and generates execution plans for tasks. Always call this first when starting a workflow. Returns tools, schemas, connection status, and recommended steps.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RUBE_GET_TOOL_SCHEMAS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieves complete input parameter schemas for tools. Use when SEARCH_TOOLS returns &lt;code&gt;schemaRef&lt;/code&gt; instead of full schema.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RUBE_MANAGE_CONNECTIONS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Creates or manages connections to user's apps. Returns auth links for OAuth/API key setup. Never execute tools without an active connection.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Execution Tools
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RUBE_MULTI_EXECUTE_TOOL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast parallel executor for up to 50 tools across apps. Primary way to run discovered tools. Includes memory storage for persistent facts across executions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RUBE_REMOTE_WORKBENCH&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Executes Python code in a remote Jupyter sandbox. Use for processing large data files, bulk operations, or scripting complex tool chains. Has 4-minute timeout.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RUBE_REMOTE_BASH_TOOL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Executes bash commands in a remote sandbox. Useful for file operations and processing JSON with tools like &lt;code&gt;jq&lt;/code&gt;, &lt;code&gt;awk&lt;/code&gt;, &lt;code&gt;sed&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Recipe Tools (Reusable Workflows)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RUBE_CREATE_UPDATE_RECIPE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Converts completed workflows into reusable notebooks/recipes with defined inputs, outputs, and executable code.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RUBE_EXECUTE_RECIPE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs an existing recipe with provided input parameters.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RUBE_FIND_RECIPE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Searches for recipes using natural language (e.g., "GitHub PRs to Slack"). Returns matching recipes with IDs for execution.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RUBE_GET_RECIPE_DETAILS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieves full details of a recipe by ID, including code, schema, and defaults.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RUBE_MANAGE_RECIPE_SCHEDULE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Creates, updates, pauses, or deletes recurring schedules for recipes using cron expressions.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Typical Workflow
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;RUBE_SEARCH_TOOLS&lt;/strong&gt; → Find tools for your task&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RUBE_MANAGE_CONNECTIONS&lt;/strong&gt; → Ensure apps are connected&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RUBE_MULTI_EXECUTE_TOOL&lt;/strong&gt; → Execute the tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RUBE_REMOTE_WORKBENCH&lt;/strong&gt; → Process large results if needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RUBE_CREATE_UPDATE_RECIPE&lt;/strong&gt; → Save as reusable recipe (optional)&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  How to use Rube with Claude CoWork
&lt;/h2&gt;

&lt;p&gt;The process is essentially the same as adding any remote MCP server.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Head to &lt;a href="http://rube.app/" rel="noopener noreferrer"&gt;Rube.app&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Click on Use Rube&lt;/li&gt;
&lt;li&gt;Copy the MCP URL &lt;code&gt;https://rube.app/mcp&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ceifdgj0kyd8hm0hgdi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ceifdgj0kyd8hm0hgdi.png" alt="Image 6"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open the Claude app and go to Connectors&lt;/li&gt;
&lt;li&gt;Paste the MCP URL&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbia0visuxua2dyvd9gu6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbia0visuxua2dyvd9gu6.png" alt="Image 7"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;And… you’re done.&lt;/li&gt;
&lt;li&gt;Ask for whatever you want. You’ll be prompted to authenticate with the apps you need; after that, leave the rest up to Claude.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Some cool examples that I use every day
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Analyse blog post performance from Google Search Console and create Notion files
&lt;/h3&gt;

&lt;h3&gt;
  
  
  2. Converting the Google Sheet to Notion
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://youtu.be/PsPjcFp4-iY" rel="noopener noreferrer"&gt;https://youtu.be/PsPjcFp4-iY&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  End Note
&lt;/h2&gt;

&lt;p&gt;Claude CoWork is genuinely useful on its own, but it becomes far more capable once it can reach the apps you use every day. Connecting them through Rube is the simplest way to get there.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>ai</category>
      <category>beginners</category>
    </item>
    <item>
      <title>From Auth to Action: The Complete Guide to Secure &amp; Scalable AI Agent Infrastructure (2026)</title>
      <dc:creator>Sunil Kumar Dash</dc:creator>
      <pubDate>Mon, 10 Nov 2025 10:50:57 +0000</pubDate>
      <link>https://forem.com/composiodev/from-auth-to-action-the-complete-guide-to-secure-scalable-ai-agent-infrastructure-2026-2ieb</link>
      <guid>https://forem.com/composiodev/from-auth-to-action-the-complete-guide-to-secure-scalable-ai-agent-infrastructure-2026-2ieb</guid>
      <description>&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auth is Not Enough:&lt;/strong&gt; Getting an OAuth token (Pillar 1) is just the first step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production Needs Guardrails:&lt;/strong&gt; You must build Granular Control (Pillar 2) with patterns like Brokered Credentials to prevent security risks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability Requires an Engine:&lt;/strong&gt; A reliable action layer (Pillar 3) with a Unified API and managed retries is essential to move from prototype to production.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Understanding the "Authentication Wall" for AI Agents
&lt;/h2&gt;

&lt;p&gt;You've built a powerful AI agent. Using a framework like LangChain or CrewAI, you've designed a sophisticated workflow that can reason, plan, and execute tasks. There's just one problem: Your agent is trapped in a sandbox, unable to interact with the real world. To be useful, it needs access to user-specific tools like Google Calendar, Salesforce, or Jira. This is where you hit the "Authentication Wall".&lt;/p&gt;

&lt;p&gt;Suddenly, you're wrestling with the complexities of AI agent authentication. You're managing multi-step OAuth 2.0 flows, securely storing refresh tokens, and handling credential management for dozens of different APIs. It's a significant engineering challenge, and it's a common reason why promising agent prototypes never make it to production.&lt;/p&gt;

&lt;p&gt;But solving authentication isn't the real goal. It's just the gateway to a much larger set of problems. Getting an OAuth token is the first step. The real challenge is building a secure, production-ready, and governable system for an agent to act on a user's behalf. This is a problem of secure AI agent workflow management, not just auth.&lt;/p&gt;

&lt;p&gt;A production-ready AI agent infrastructure requires three essential pillars: 1. Secure Authentication, 2. Granular Control, and 3. Reliable Action. This guide walks through the architecture of all three, helping you move beyond the Authentication Wall and build agents that are truly ready for production.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pillar 1: Secure Authentication (The Gateway to Real-World Action)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Solving the Token Problem: The Role of Managed OAuth, PKCE, and Refresh Tokens
&lt;/h3&gt;

&lt;p&gt;Before an agent can do anything, it needs a key. Securely acquiring that key is the foundational layer of your infrastructure. This is the problem that solutions for managed authentication for AI agents aim to solve. They abstract away the tedious and error-prone process of connecting to each API individually.&lt;/p&gt;

&lt;p&gt;This foundational pillar must include:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Managed OAuth:&lt;/strong&gt; A robust system must handle the entire multi-step OAuth dance for you. This includes generating the correct authorization URL, handling the callback, exchanging the authorization code for a token, and securely storing the credentials.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modern Standards:&lt;/strong&gt; The security landscape evolves. The current standard is &lt;a href="https://oauth.net/2.1/" rel="noopener noreferrer"&gt;OAuth 2.1 with mandatory Proof Key for Code Exchange (PKCE)&lt;/a&gt;. PKCE is critical for headless agents that cannot securely store a client secret, as it &lt;a href="https://www.scalekit.com/blog/pkce-developers-guide-secure-oauth-flows" rel="noopener noreferrer"&gt;prevents authorization code interception attacks&lt;/a&gt;. Any modern OAuth for AI agents solution must support this.&lt;/p&gt;
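&lt;p&gt;As a concrete illustration, here is a minimal Python sketch of generating a PKCE pair per RFC 7636: a random &lt;code&gt;code_verifier&lt;/code&gt; and its S256 &lt;code&gt;code_challenge&lt;/code&gt;. It uses only the standard library.&lt;/p&gt;

```python
import base64
import hashlib
import secrets

def make_pkce_pair() -> tuple[str, str]:
    """Generate a PKCE code_verifier and its S256 code_challenge (RFC 7636)."""
    # 32 random bytes -> a 43-char URL-safe verifier, within the spec's
    # allowed 43-128 character range.
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return verifier, challenge
```

&lt;p&gt;The agent sends the challenge (with &lt;code&gt;code_challenge_method=S256&lt;/code&gt;) in the authorization request, then proves possession by sending the verifier in the token exchange, so an intercepted authorization code is useless on its own.&lt;/p&gt;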

&lt;p&gt;&lt;strong&gt;Persistent Sessions:&lt;/strong&gt; Users expect agents to work in the background without constant re-authentication. This requires a system that automatically refreshes expired access tokens. The security best practice here is &lt;a href="https://www.descope.com/blog/post/refresh-token-rotation" rel="noopener noreferrer"&gt;refresh token rotation&lt;/a&gt;, where a new refresh token is issued with every access token refresh, and the old one is immediately invalidated. This significantly reduces the risk of a compromised refresh token providing long-term access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Secure Credential Storage:&lt;/strong&gt; Storing tokens, API keys, and other secrets in environment variables or application code is a major security risk. These credentials must be &lt;a href="https://developer.hashicorp.com/validated-patterns/vault/ai-agent-identity-with-hashicorp-vault" rel="noopener noreferrer"&gt;stored in an encrypted vault&lt;/a&gt;, completely isolated from your agent's application logic.&lt;/p&gt;

&lt;p&gt;Platforms that offer these features provide a necessary service. They solve the immediate pain of getting a token. But this is just the beginning.&lt;/p&gt;

&lt;p&gt;So your agent is authenticated. You have the key. The problem is solved, right? Wrong. Now you have a new, bigger problem: an autonomous agent with the full power of a user's account.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pillar 2: Granular Control (Establishing Guardrails for Autonomous Agents)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Your Agent Has the Keys. Who's Stopping It From Deleting Your Entire Google Drive?
&lt;/h3&gt;

&lt;p&gt;Once you have an OAuth token, you've given your agent the keys to a user's digital kingdom. A standard token grants the agent all of the user's permissions by default. This is a massive security risk, especially for autonomous agents. This is where the second pillar, Granular Control, becomes essential for any enterprise AI agent authentication platform. You need guardrails to ensure an agent can only do what it's supposed to do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Principle of Least Privilege:&lt;/strong&gt; An agent that only needs to read calendar events shouldn't have the power to delete your entire Google Drive. Your infrastructure must enforce the principle of least privilege by de-scoping the agent's permissions. Modern standards like &lt;a href="https://datatracker.ietf.org/doc/html/rfc9396" rel="noopener noreferrer"&gt;Rich Authorization Requests (RAR)&lt;/a&gt; allow an agent to request just-in-time, specific permissions for a single action, rather than asking for broad, standing access. This is a core tenet of AI agent security.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Preventing Credential Leakage:&lt;/strong&gt; One of the top risks for LLM applications, as identified by OWASP, is &lt;a href="https://blog.1password.com/security-principles-guiding-1passwords-approach-to-ai/" rel="noopener noreferrer"&gt;credential leakage through the prompt context&lt;/a&gt;. If you pass an API key or bearer token directly to the LLM, a clever prompt injection attack could trick the agent into revealing it. The solution is a &lt;a href="https://1password.com/solutions/agentic-ai" rel="noopener noreferrer"&gt;Brokered Credentials&lt;/a&gt; pattern. In this architecture, a secure middle layer makes the API call on the agent's behalf. The LLM decides what to do, but the broker handles the how. The LLM never sees the token, completely neutralizing this risk. This is a critical feature for platforms that securely connect AI agents to APIs.&lt;/p&gt;

&lt;p&gt;This sequence diagram illustrates the brokered credentials flow, ensuring the LLM never handles secrets:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxom1upuh7dr6a68f45n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxom1upuh7dr6a68f45n.png" alt="Composio-Brokered Jira Ticket Creation Flow" width="800" height="693"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Granular Access Control:&lt;/strong&gt; How do you enforce these fine-grained permissions at scale? The modern approach uses &lt;a href="https://www.cerbos.dev/blog/mcp-security-ai-agent-authorization-a-ciso-and-architects-guide" rel="noopener noreferrer"&gt;Policy-as-Code&lt;/a&gt; engines like Open Policy Agent (OPA) or Cedar. These systems externalize authorization logic, allowing you to define rules like "this agent can only transfer up to $100" or "this agent can only access records created this week". The tool-calling layer queries this policy engine before every action, ensuring every operation is explicitly permitted.&lt;/p&gt;
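&lt;p&gt;The shape of such a check looks like this. In production you would query an external engine like OPA or Cedar; the in-process rules below are a stand-in to show where the check sits relative to tool execution.&lt;/p&gt;

```python
from dataclasses import dataclass, field

@dataclass
class ActionRequest:
    agent_id: str
    tool: str
    params: dict = field(default_factory=dict)

# Hypothetical rules standing in for an external policy engine (OPA, Cedar).
# Each rule returns True if it permits the request.
POLICIES = [
    lambda r: not (r.tool == "payments.transfer" and r.params.get("amount", 0) > 100),
    lambda r: r.tool != "drive.delete_all",
]

def is_permitted(request: ActionRequest) -> bool:
    """Allow an action only if every policy rule permits it.
    The tool-calling layer runs this before every execution."""
    return all(rule(request) for rule in POLICIES)
```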

&lt;p&gt;&lt;strong&gt;Delegated Authority:&lt;/strong&gt; For a clear audit trail, you need to know not just what happened, but who authorized it. The &lt;a href="https://learn.microsoft.com/en-us/entra/identity-platform/v2-oauth2-on-behalf-of-flow" rel="noopener noreferrer"&gt;On-Behalf-Of (OBO) Token Exchange&lt;/a&gt; is the gold standard for this. The agent presents the user's token and its own credentials to an authorization server, which issues a new token containing claims for both the user and the agent. This creates an auditable chain of command, proving the agent was acting with delegated authority from the user.&lt;/p&gt;
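&lt;p&gt;The wire format of that exchange follows RFC 8693. Here is a sketch of the request body an agent would POST to the authorization server's token endpoint; the parameter names come from the RFC, while the token values and audience are placeholders.&lt;/p&gt;

```python
from urllib.parse import urlencode

def build_obo_request(user_token: str, agent_client_id: str,
                      agent_client_secret: str, audience: str) -> str:
    """Build an RFC 8693 token-exchange request body.

    The authorization server responds with a new token carrying claims for
    both the user (subject) and the agent (actor), which is what creates the
    auditable delegation chain.
    """
    return urlencode({
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "subject_token": user_token,
        "subject_token_type": "urn:ietf:params:oauth:token-type:access_token",
        "client_id": agent_client_id,
        "client_secret": agent_client_secret,
        "audience": audience,
    })
```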

&lt;p&gt;Simple auth solutions leave you to build this entire governance layer yourself. A true AI agent integration platform provides these guardrails out of the box, preventing catastrophic mistakes and giving you the control needed for enterprise-grade applications.&lt;/p&gt;

&lt;p&gt;Now your agent is authenticated and secure. It has a key, and it knows which doors it's allowed to open. You're ready for production. Almost. What happens when the lock changes or there are thousands of different doors?&lt;/p&gt;




&lt;h2&gt;
  
  
  Pillar 3: Reliable Action (The Engine for Scalable Integrations)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  An Agent That Can't Use Its Keys is Just an Expensive Chatbot
&lt;/h3&gt;

&lt;p&gt;Authentication and control are useless if the agent can't perform its job reliably and scalably across a wide range of tools. This is the final and most critical pillar of production-ready infrastructure. It's the engine that turns an agent's intent into reliable action in the real world.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 1: The "N+1" API Problem.&lt;/strong&gt; Every new tool you want your agent to use means learning a new API, a new data schema, and a new set of failure modes. Integrating with Jira is different from Asana, which is different from Trello. This maintenance burden grows with every new tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution: The Unified API.&lt;/strong&gt; A powerful integration platform abstracts this complexity behind a single, consistent interface. Your agent can learn to perform a generic action like &lt;code&gt;tasks.create&lt;/code&gt;, and the platform handles the translation to the specific API calls for Jira, Asana, or Trello. This dramatically &lt;a href="https://www.merge.dev/blog/best-ai-agent-auth-tool" rel="noopener noreferrer"&gt;simplifies agent development&lt;/a&gt; and maintenance.&lt;/p&gt;
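&lt;p&gt;The translation layer can be pictured as a table of adapters behind one generic call. This is a sketch to show the idea; the payload field names are illustrative, not the real Jira or Asana API schemas.&lt;/p&gt;

```python
# Hypothetical per-provider payload builders behind one generic tasks.create.
ADAPTERS = {
    "jira":  lambda task: {"fields": {"summary": task["title"],
                                      "project": {"key": task["project"]}}},
    "asana": lambda task: {"data": {"name": task["title"],
                                    "projects": [task["project"]]}},
}

def tasks_create(provider: str, task: dict) -> dict:
    """Translate one generic task into the provider-specific payload.
    The agent only ever learns the generic shape; the platform owns the
    per-API differences."""
    if provider not in ADAPTERS:
        raise ValueError(f"No adapter registered for provider {provider!r}")
    return ADAPTERS[provider](task)
```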

&lt;p&gt;&lt;strong&gt;Problem 2: The "What Can You Do?" Problem.&lt;/strong&gt; How does an agent discover the tools available to it and their specific functions without you hardcoding them? An agent needs to adapt as new tools are added or existing ones change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution: Standardized Tool Discovery.&lt;/strong&gt; The &lt;a href="https://modelcontextprotocol.io/docs/getting-started/intro" rel="noopener noreferrer"&gt;Model Context Protocol (MCP)&lt;/a&gt; is an emerging standard that solves this. It allows an agent to dynamically query the integration platform to discover the hundreds of tools it can use, what actions each tool supports, and what parameters are required. This enables agents to be more autonomous and adaptable.&lt;/p&gt;
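&lt;p&gt;At the protocol level, discovery is a JSON-RPC call. The sketch below shows the shape of a &lt;code&gt;tools/list&lt;/code&gt; request and response parsing; a real MCP client would use an official SDK and complete the initialization handshake first.&lt;/p&gt;

```python
import json

def mcp_tools_list_request(request_id: int = 1) -> str:
    """Build an MCP `tools/list` JSON-RPC request body."""
    return json.dumps({"jsonrpc": "2.0", "id": request_id, "method": "tools/list"})

def summarize_tools(response: dict) -> list[str]:
    """Pull tool names out of a `tools/list` response so the agent can
    learn its capabilities at runtime instead of having them hardcoded."""
    return [tool["name"] for tool in response.get("result", {}).get("tools", [])]
```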

&lt;p&gt;&lt;strong&gt;Problem 3: The "It Broke" Problem.&lt;/strong&gt; Real-world APIs are unreliable. They go down, they return unexpected errors, they have rate limits, and tokens can expire unexpectedly. A naive implementation will fail constantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution: A Managed Integration Layer.&lt;/strong&gt; A production-grade platform provides enterprise-grade infrastructure to handle this messy reality. This includes built-in retries with exponential backoff for transient errors, intelligent rate limit handling, comprehensive logging for debugging, and robust patterns like the &lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/saga-orchestration.html" rel="noopener noreferrer"&gt;Saga pattern&lt;/a&gt; for handling partial failures in multi-tool workflows. If one step in a five-step process fails, the system can gracefully roll back the completed steps to maintain a consistent state.&lt;/p&gt;
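&lt;p&gt;The retry half of this is simple enough to sketch. A minimal backoff wrapper, assuming transient failures surface as exceptions, looks like:&lt;/p&gt;

```python
import random
import time

def call_with_backoff(fn, max_attempts: int = 5, base_delay: float = 0.5,
                      retriable=(TimeoutError, ConnectionError)):
    """Retry a flaky call with exponential backoff plus jitter.

    Transient failures (timeouts, dropped connections) are retried with
    delays of roughly base_delay * 2**attempt; anything else surfaces
    immediately, since retrying a validation error never helps.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except retriable:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

&lt;p&gt;Rollback for multi-step workflows (the Saga pattern) is the harder part: each step needs a registered compensating action, which is exactly the kind of machinery a managed platform amortizes for you.&lt;/p&gt;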

&lt;h3&gt;
  
  
  How to Achieve Observability for AI Agent Actions (Logging &amp;amp; Monitoring)
&lt;/h3&gt;

&lt;p&gt;A production-ready system isn't a black box. For DevOps and SREs, observability is non-negotiable. You need deep visibility into your agent's actions to debug failures, monitor performance, and control costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured Logging:&lt;/strong&gt; Every tool call must be logged in a &lt;a href="https://cheatsheetseries.owasp.org/cheatsheets/Logging_Cheat_Sheet.html" rel="noopener noreferrer"&gt;structured format (like JSON)&lt;/a&gt;. These logs should include a &lt;code&gt;trace_id&lt;/code&gt; to correlate actions across services, along with critical context like &lt;code&gt;agent_id&lt;/code&gt;, &lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;tool_name&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;, and &lt;code&gt;duration&lt;/code&gt;. This is essential for debugging failed workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metrics and Monitoring:&lt;/strong&gt; Your infrastructure should expose key metrics to a &lt;a href="https://grafana.com/docs/grafana-cloud/alerting-and-irm/slo/best-practices/" rel="noopener noreferrer"&gt;monitoring system like Prometheus or Grafana&lt;/a&gt;. Track API error rates (4xx, 5xx), p95/p99 latencies for tool calls, and token refresh success rates. Set up alerts for anomalies, such as a sudden spike in 401 errors, which could indicate a widespread credential issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost and Usage Tracking:&lt;/strong&gt; Agents can make thousands of API calls. A managed platform should provide dashboards to track tool usage and associated costs, preventing runaway agents from causing unexpected bills from downstream API providers.&lt;/p&gt;

&lt;p&gt;This is where the value of a comprehensive platform becomes undeniable. It handles the messy, unreliable reality of working with hundreds of different APIs at scale, allowing you to focus on building intelligent agent logic, not brittle integration code.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Integrate the Three Pillars with LangChain or CrewAI
&lt;/h2&gt;

&lt;p&gt;The architectural concepts of Authentication, Control, and Action come together when you provide tools to your agent. A comprehensive platform abstracts these pillars into a simple set of tools that can be directly passed to frameworks like LangChain or CrewAI.&lt;/p&gt;

&lt;p&gt;The developer experience is streamlined to instantiating a client and retrieving the available tools for a given user. The platform handles the underlying complexity of token management, security, and reliability.&lt;/p&gt;

&lt;p&gt;This complete, runnable example demonstrates the full flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# --- Step 1: Installation ---
# Make sure you have the required packages installed.
# pip install python-dotenv langchain langchain-openai langchain-core langgraph composio-langchain
&lt;/span&gt;
&lt;span class="c1"&gt;# --- Step 2: Environment Setup ---
# Set your API keys in .env file
# export OPENAI_API_KEY="sk-..."
# export COMPOSIO_API_KEY="comp_..."
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;composio&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Composio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;composio_langchain&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LangchainProvider&lt;/span&gt;

&lt;span class="c1"&gt;# Load environment variables from .env file
&lt;/span&gt;&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# In a real application, this would be the unique ID of your authenticated user.
# It tells Composio which user's connections to use.
&lt;/span&gt;&lt;span class="n"&gt;USER_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;your-user-id&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Replace with a dynamic user ID
&lt;/span&gt;
&lt;span class="c1"&gt;# --- Step 3: Initialize the LLM and Composio Client ---
# Instantiate the LLM you want the agent to use.
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Instantiate the Composio client with LangchainProvider.
# It will automatically use the COMPOSIO_API_KEY from your environment.
&lt;/span&gt;&lt;span class="n"&gt;composio_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Composio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;LangchainProvider&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# --- Step 4: Fetch User-Specific Tools ---
# Fetch all tools for the "jira" toolkit that are available for the specified user.
# The `user_id` parameter is crucial for security and multi-tenancy.
# Composio's brokered credential pattern ensures the LLM never sees the user's token.
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;composio_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;USER_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;toolkits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jira&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error fetching tools: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="c1"&gt;# --- Step 5: Create and Run the Agent ---
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Create the agent using the new LangChain 1.0 pattern.
&lt;/span&gt;    &lt;span class="c1"&gt;# The create_agent function returns a compiled graph that can be invoked directly.
&lt;/span&gt;    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant that uses tools to perform tasks.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Invoke the agent to perform a task.
&lt;/span&gt;    &lt;span class="c1"&gt;# The agent will reason, select the jira.create_issue tool, and execute it.
&lt;/span&gt;    &lt;span class="c1"&gt;# Note: The new pattern uses a messages format instead of a simple input dict.
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create a Jira ticket in the &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PROJ&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; project to fix the auth bug.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent execution result:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An error occurred during agent execution: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No tools were fetched. Agent cannot execute the task.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Decision Framework: How to Choose Your Agent Architecture (Build vs. Buy vs. Integrate)
&lt;/h2&gt;

&lt;p&gt;When building your agent's infrastructure, you have three primary paths. Each comes with significant trade-offs in cost, speed, and security.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DIY (Do-It-Yourself):&lt;/strong&gt; You build the entire stack in-house. This gives you maximum control but requires a massive investment in engineering, security, and ongoing maintenance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auth Components (e.g., &lt;a href="https://nango.dev/" rel="noopener noreferrer"&gt;Nango&lt;/a&gt;, &lt;a href="https://www.arcade.dev/" rel="noopener noreferrer"&gt;Arcade&lt;/a&gt;):&lt;/strong&gt; You use a managed service to handle the initial OAuth headache (Pillar 1). This is a great starting point but leaves you to build the critical governance (Pillar 2) and action (Pillar 3) layers yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auth-to-Action Platform (e.g., &lt;a href="https://composio.dev/agentauth" rel="noopener noreferrer"&gt;Composio&lt;/a&gt;):&lt;/strong&gt; You use a comprehensive platform that provides an end-to-end solution covering all three pillars. This is the fastest and most secure path for most teams.&lt;/p&gt;

&lt;p&gt;The Total Cost of Ownership (TCO) for a DIY solution is often &lt;a href="https://workos.com/blog/workos-vs-auth0-vs-stytch" rel="noopener noreferrer"&gt;deceptively high&lt;/a&gt;. While there's no subscription fee, the hidden costs in engineering salaries, on-call burdens, and continuous security reviews can easily run into hundreds of thousands of dollars.&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparison Table: Build vs. Buy vs. Integrate
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;DIY (In-House)&lt;/th&gt;
&lt;th&gt;Auth Components (e.g., Nango, Arcade)&lt;/th&gt;
&lt;th&gt;Auth-to-Action (e.g., Composio)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Authentication&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full build required&lt;/td&gt;
&lt;td&gt;✅ Managed OAuth &amp;amp; Refreshes&lt;/td&gt;
&lt;td&gt;✅ Managed OAuth &amp;amp; Refreshes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Granular Control&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual build required&lt;/td&gt;
&lt;td&gt;❌ (Requires custom layer)&lt;/td&gt;
&lt;td&gt;✅ Built-in Governance &amp;amp; Scoping&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Credential Security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual build required&lt;/td&gt;
&lt;td&gt;❌ (LLM can still see token)&lt;/td&gt;
&lt;td&gt;✅ Brokered Credentials (No token in context)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unified API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;❌ (Per-API integration)&lt;/td&gt;
&lt;td&gt;✅ Single interface for 500+ tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool Discovery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual build required&lt;/td&gt;
&lt;td&gt;❌ (Requires custom layer)&lt;/td&gt;
&lt;td&gt;✅ MCP for dynamic discovery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reliability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual build required&lt;/td&gt;
&lt;td&gt;❌ (Requires custom layer)&lt;/td&gt;
&lt;td&gt;✅ Managed Retries, Rate Limiting, Logging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time to Market&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6-12 months&lt;/td&gt;
&lt;td&gt;1-2 months&lt;/td&gt;
&lt;td&gt;1-2 weeks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TCO&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Very High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low (Predictable)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Conclusion: Don't Just Buy a Lock. Build a Secure House.
&lt;/h2&gt;

&lt;p&gt;The conversation around AI agent authentication is too narrow. Focusing only on getting a token is like buying a high-tech lock for your front door while leaving all the windows open and forgetting to build a foundation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auth-Only Solutions&lt;/strong&gt; give you a key to one door. It's a useful component, but it's not a complete solution for a production system.&lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;"Auth-to-Action" Platform&lt;/strong&gt; like Composio gives you a master-key system for the entire building. It provides the keys (Authentication), a security guard to check permissions at every door (Control), and a unified concierge that can get any job done reliably (Action).&lt;/p&gt;

&lt;p&gt;Building truly useful, secure, and scalable AI agents requires thinking about the entire infrastructure, from the moment a user grants consent to the final, successful action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stop building patchwork infrastructure. Start building production-ready agents.&lt;/strong&gt; &lt;a href="https://platform.composio.dev/auth" rel="noopener noreferrer"&gt;Explore Composio's platform&lt;/a&gt; or &lt;a href="https://docs.composio.dev/docs/quickstart" rel="noopener noreferrer"&gt;read our 5-minute quickstart&lt;/a&gt; to see the three pillars in action.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What solutions offer authentication management for AI agents connecting to multiple applications?
&lt;/h3&gt;

&lt;p&gt;You have a few paths. You can build it all yourself using raw OAuth. You can use auth-only components like Nango or Arcade, which are great at handling the initial token. Or you can use a full auth-to-action platform; Composio is an example of this. It handles the auth, but also the security and reliability needed for production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which AI agent integration platforms offer enterprise-level control and governance?
&lt;/h3&gt;

&lt;p&gt;Enterprise control goes far beyond just authentication. It means having granular permissions, clear audit logs, and policy enforcement. Most auth-only tools do not provide this; you need a platform built for governance. Composio, for example, is designed for this. It lets you define and enforce rules for what an agent can and cannot do.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which AI agent authentication platforms are recommended for small teams?
&lt;/h3&gt;

&lt;p&gt;Small teams should look for the fastest path to production. Auth-only tools like Arcade or Nango are great starting points for the token. However, your team still has to build the security and action layers. A complete platform like Composio can be much faster. It provides all the production-ready components out of the box. This often means a lower total cost of ownership because your team writes less code.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are the most cost-effective managed OAuth solutions for AI agents?
&lt;/h3&gt;

&lt;p&gt;Cost effectiveness depends on your total cost, not just the subscription price. Open source options can seem free but require your team's time for hosting and maintenance. Managed auth services are low cost to start, but you must add the engineering cost of building your own governance and integration layers. Full platforms like Composio can be more cost-effective overall because they save significant engineering time.&lt;/p&gt;

&lt;h3&gt;
  
  
  What platforms can prevent credential leakage when integrating AI agents with external tools and apps?
&lt;/h3&gt;

&lt;p&gt;This is a major security risk. The best way to prevent leakage is with a pattern called Brokered Credentials. In this pattern, the LLM never actually sees the API key or token. Instead, a secure service like Composio makes the API call on the agent's behalf. This completely removes the risk of a token leaking through a prompt.&lt;/p&gt;
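&lt;p&gt;A minimal sketch of what brokering looks like in JavaScript (the names &lt;code&gt;tokenStore&lt;/code&gt; and &lt;code&gt;callApi&lt;/code&gt; are illustrative stand-ins, not a real SDK): the model only ever handles an opaque connection ID, and the broker resolves the real token server-side.&lt;/p&gt;

```javascript
// Brokered-credentials sketch: the LLM passes a connection id, never a token.
// tokenStore and callApi are hypothetical stand-ins for a secrets store and
// an HTTP client.
function makeBroker(tokenStore, callApi) {
  return async function executeToolCall({ connectionId, endpoint, payload }) {
    const token = tokenStore.get(connectionId); // resolved server-side only
    if (!token) throw new Error(`No connection for ${connectionId}`);
    const result = await callApi(endpoint, token, payload);
    return { ok: true, result }; // only the outcome flows back to the model
  };
}
```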

&lt;h3&gt;
  
  
  What platforms exist for granting AI agents access to use tools on behalf of users?
&lt;/h3&gt;

&lt;p&gt;This is a key challenge called delegated authority, and any platform you choose needs to handle it. It involves managing complex OAuth flows, refresh tokens, and ideally modern standards like PKCE. Platforms like Composio manage this entire lifecycle. They provide the secure infrastructure so your agent can act on a user's behalf without you building the auth system from scratch.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; agent auth, authentication for ai agents&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>webdev</category>
    </item>
    <item>
      <title>OpenAI launched Atlas and I killed it with a Chrome extension</title>
      <dc:creator>Sunil Kumar Dash</dc:creator>
      <pubDate>Wed, 05 Nov 2025 14:42:09 +0000</pubDate>
      <link>https://forem.com/composiodev/openai-launched-atlas-and-i-killed-it-with-a-chrome-extension-1cfb</link>
      <guid>https://forem.com/composiodev/openai-launched-atlas-and-i-killed-it-with-a-chrome-extension-1cfb</guid>
      <description>&lt;p&gt;OpenAI recently launched ChatGPT Atlas, a fork of Chromium with agentic capabilities. The UI is clean, rebuilt with SwiftUI, AppKit, and Metal, but take that away and it’s the same capabilities you can already access on ChatGPT’s website.&lt;/p&gt;

&lt;p&gt;Is it really that hard to get agentic capabilities in your browser? Do you really need another browser for it? Turns out, no: a Chrome extension can do that and more. I spent my weekend building one, and here’s how you can build it too.&lt;/p&gt;

&lt;p&gt;Here’s the demo:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/Ea6SGiunsp8"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;Here’s the code for the project: &lt;a href="https://github.com/ComposioHQ/open-chatgpt-atlas" rel="noopener noreferrer"&gt;https://github.com/ComposioHQ/open-chatgpt-atlas&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Chrome Extensions Are Actually Perfect For This
&lt;/h2&gt;

&lt;p&gt;Before diving into the build, let me explain why a Chrome extension is the right approach. The first question I had to answer was: Can a Chrome extension do what an AI browser can do?&lt;/p&gt;

&lt;p&gt;The answer is yes, and here's why:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Extensions have access to everything that matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They can take screenshots of the current tab&lt;/li&gt;
&lt;li&gt;They can inject JavaScript into any page&lt;/li&gt;
&lt;li&gt;They can listen to page navigation events&lt;/li&gt;
&lt;li&gt;They can create UI (sidepanel, popups, context menus)&lt;/li&gt;
&lt;li&gt;They run with elevated permissions that the webpage doesn't have&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. They're easier to distribute:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No install process, just add to Chrome&lt;/li&gt;
&lt;li&gt;Updates happen automatically&lt;/li&gt;
&lt;li&gt;Works on any OS that runs Chrome&lt;/li&gt;
&lt;li&gt;Users don't have to abandon their existing browser setup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. They're cheaper to build:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No need to maintain a Chromium fork&lt;/li&gt;
&lt;li&gt;No need to handle browser-level features (tabs, bookmarks, updates)&lt;/li&gt;
&lt;li&gt;Focus purely on the agent capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The UI is simple. All you need is a sidebar in the browser where the AI agent can take actions, and for anything the agent can't or won't do through browser automation, you can use MCP (Model Context Protocol) to route to external tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The first step was deciding on the LLM. There were three providers with top-tier models: OpenAI, Anthropic, and Google.&lt;/p&gt;

&lt;p&gt;OpenAI and Anthropic both charge for their APIs, with no free tier. This means a lot of people won't be able to access or build on top of them without immediately hitting a paywall. I wanted this to be something other developers could fork and experiment with without worrying about bills.&lt;/p&gt;

&lt;p&gt;Google, on the other hand, offers a generous free tier for Gemini models that most people can access and build on top of. The free tier gives you 150 requests per minute for Gemini 2.5 Pro, which is far more than you'll need unless you're running this commercially. Gemini 2.5 Computer Use is also cheaper and faster than Claude’s Computer Use with Sonnet 4.5.&lt;/p&gt;

&lt;p&gt;Setting up a Chrome extension is actually pretty straightforward. The core file is the &lt;code&gt;manifest.json&lt;/code&gt;. It defines what the extension is and what permissions it needs. What we need is a Chrome extension that sits in the sidebar and can take actions on the open browser. This means we need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;manifest.json&lt;/code&gt; that declares permissions and entry points&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;background.ts&lt;/code&gt; file that runs as a service worker, listening for messages from the sidepanel&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;content.ts&lt;/code&gt; that gets injected into webpages and can extract page content and execute actions&lt;/li&gt;
&lt;li&gt;UI files: &lt;code&gt;sidepanel.tsx&lt;/code&gt; (React), &lt;code&gt;settings.tsx&lt;/code&gt;, and their corresponding HTML/CSS&lt;/li&gt;
&lt;/ul&gt;
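
&lt;p&gt;To make those entry points concrete, here is a minimal &lt;code&gt;manifest.json&lt;/code&gt; sketch wiring them together (the &lt;code&gt;.js&lt;/code&gt; filenames assume Vite has already bundled the &lt;code&gt;.ts&lt;/code&gt; sources, and the permissions are trimmed to the essentials):&lt;/p&gt;

```json
{
  "manifest_version": 3,
  "name": "Open Atlas",
  "version": "0.1.0",
  "side_panel": { "default_path": "sidepanel.html" },
  "background": { "service_worker": "background.js", "type": "module" },
  "content_scripts": [
    { "matches": ["<all_urls>"], "js": ["content.js"] }
  ],
  "permissions": ["sidePanel", "storage", "tabs", "scripting"]
}
```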

&lt;p&gt;Taking the above requirements into consideration, here's how the file directory looks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;atlas-extension/
├── Core Extension Files
│   ├── manifest.json             // Chrome extension config
│   ├── background.ts             // Message router &amp;amp; coordinator
│   ├── content.ts                // Injected into pages, executes actions
│   ├── sidepanel.tsx             // Main chat interface (React)
│   ├── types.ts                  // TypeScript interfaces
│   ├── tools.ts                  // Composio tool definitions
│   ├── settings.tsx              // API key configuration
│   ├── settings.html
│   └── sidepanel.html
│
├── Config Files
│   ├── package.json
│   ├── vite.config.ts            // Build tool (bundles TS → JS)
│   ├── tsconfig.json
│   └── tsconfig.node.json
│
├── Styling
│   ├── sidepanel.css
│   └── settings.css
│
└── Assets
    └── icons/
        └── icon.png

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Permissions That Make It Work
&lt;/h2&gt;

&lt;p&gt;These are the permissions we need from Chrome that make a browsing agent work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="s2"&gt;"permissions"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;
  &lt;span class="s2"&gt;"sidePanel"&lt;/span&gt;,      // Create sidebar UI
  &lt;span class="s2"&gt;"storage"&lt;/span&gt;,        // Save API keys, settings to chrome.storage.local
  &lt;span class="s2"&gt;"tabs"&lt;/span&gt;,           // ⭐ CRUser types &lt;span class="nb"&gt;command &lt;/span&gt;&lt;span class="k"&gt;in &lt;/span&gt;Sidepanel
  &lt;span class="s2"&gt;"history"&lt;/span&gt;,        // Read browser &lt;span class="nb"&gt;history&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;for &lt;/span&gt;context&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="s2"&gt;"bookmarks"&lt;/span&gt;,      // Read bookmarks &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;for &lt;/span&gt;context&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="s2"&gt;"webNavigation"&lt;/span&gt;,  // Track when pages load/unload
  &lt;span class="s2"&gt;"scripting"&lt;/span&gt;,      // Inject content scripts dynamically
  &lt;span class="s2"&gt;"contextMenus"&lt;/span&gt;    // Add right-click menu items
&lt;span class="o"&gt;]&lt;/span&gt;,

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most important one is &lt;code&gt;tabs&lt;/code&gt;. This is what lets you capture screenshots of the current page, which is essential for computer use. Without screenshots, the AI is blind—it has no idea what the page actually looks like, so it can't make intelligent decisions about where to click or what to type.&lt;/p&gt;
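
&lt;p&gt;The screenshot step itself is a one-liner with the promise-based MV3 API; a sketch (run inside the service worker, with the active tab's window implied):&lt;/p&gt;

```javascript
// Capture the visible tab as a PNG data: URL; this is what gets sent to
// Gemini as the agent's "eyes". Assumes the MV3 promise-based chrome API.
async function captureScreenshot() {
  const dataUrl = await chrome.tabs.captureVisibleTab({ format: 'png' });
  return dataUrl; // "data:image/png;base64,..."
}
```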

&lt;p&gt;The &lt;code&gt;scripting&lt;/code&gt; permission is also critical because it allows you to inject &lt;code&gt;content.ts&lt;/code&gt; into any webpage dynamically. This is how you execute actions on the page: clicking buttons, filling forms, scrolling, and so on.&lt;/p&gt;




&lt;h2&gt;
  
  
  System Architecture: How Messages Flow
&lt;/h2&gt;

&lt;p&gt;Here's how the pieces talk to each other:&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;background.ts&lt;/code&gt; is the central nervous system. It’s always running, and it coordinates everything. When you send a message from the side panel, this worker routes it to the correct flow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3wzn48h46flfwidkb00.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3wzn48h46flfwidkb00.png" alt="Image 1"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Computer Use: The Browser Automation Loop
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; The agent captures a screenshot of the current page state using &lt;code&gt;chrome.tabs.captureVisibleTab()&lt;/code&gt;. This screenshot is the agent's "eyes"—it sees what you see.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; The screenshot gets sent to Gemini along with your natural language intent ("click the login button") and the page's DOM structure (for additional context).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Gemini analyzes the screenshot, identifies the login button visually, and returns a function call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;action&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;click&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;coordinates&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;x&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;450&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;y&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;320&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;reasoning&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Found login button at top-right of page&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; &lt;code&gt;background.ts&lt;/code&gt; receives this action and forwards it to &lt;code&gt;content.ts&lt;/code&gt; running on the current webpage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5:&lt;/strong&gt; &lt;code&gt;content.ts&lt;/code&gt; executes the click at those coordinates, shows a blue visual indicator to show what happened, and reports success or failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6:&lt;/strong&gt; The loop repeats with a fresh screenshot of the new page state. If the click opened a modal, the next iteration sees the modal and can interact with it. If a page is loading, it waits and adapts.&lt;/p&gt;

&lt;p&gt;This repeats up to 30 times per task. Each iteration adapts based on what it sees. It's not running a predetermined script—it's genuinely reacting to the current state of the page.&lt;/p&gt;
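
&lt;p&gt;Stripped of the Chrome and Gemini plumbing, the loop’s control flow is simple. Here’s a sketch with the three side-effecting calls injected as dependencies (&lt;code&gt;captureScreenshot&lt;/code&gt;, &lt;code&gt;askGemini&lt;/code&gt;, and &lt;code&gt;executeAction&lt;/code&gt; are illustrative names, not the actual functions):&lt;/p&gt;

```javascript
// Screenshot → analyze → act loop with a hard iteration cap. Dependencies
// are injected so the flow can be followed (and tested) without a browser.
async function runAgentLoop(task, deps, maxSteps = 30) {
  const trace = [];
  for (let step = 0; step < maxSteps; step++) {
    const screenshot = await deps.captureScreenshot();       // the agent's "eyes"
    const decision = await deps.askGemini(task, screenshot); // function call or done
    if (decision.action === 'done') return { done: true, trace };
    const result = await deps.executeAction(decision);       // content.ts side
    trace.push({ decision, result });                        // feeds the next turn
  }
  return { done: false, trace }; // hit the cap without finishing
}
```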

&lt;h3&gt;
  
  
  How content.ts Executes Actions
&lt;/h3&gt;

&lt;p&gt;When &lt;code&gt;background.ts&lt;/code&gt; receives an &lt;code&gt;EXECUTE_ACTION&lt;/code&gt; message from Gemini (e.g., &lt;code&gt;{type: 'EXECUTE_ACTION', action: 'click', coordinates: {x: 100, y: 200}}&lt;/code&gt;), it relays this to &lt;code&gt;content.ts&lt;/code&gt; running on the current webpage.&lt;/p&gt;

&lt;p&gt;The content script's &lt;code&gt;executePageAction()&lt;/code&gt; function handles 12 different browser actions. Here are the important ones:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Click:&lt;/strong&gt; Uses &lt;code&gt;document.elementFromPoint(x, y)&lt;/code&gt; to find the element at those coordinates, then fires a click event. If a CSS selector is provided instead, it queries and clicks that element directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;click&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;elementFromPoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;element&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tagName&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Fill:&lt;/strong&gt; Finds the input/textarea element, focuses it (which triggers any React state updates), then uses &lt;code&gt;keyboard_type()&lt;/code&gt; to type the text character-by-character. This is important for React apps that listen for input events instead of just checking &lt;code&gt;.value&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fill&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;elementFromPoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tagName&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;INPUT&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tagName&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;TEXTAREA&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;focus&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;keyboard_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why character-by-character? Because if you just set &lt;code&gt;.value = "text"&lt;/code&gt;, React doesn't know the value changed. You have to dispatch keyboard events for each character so React's synthetic event system picks it up. This was one of those annoying things that took way too long to debug.&lt;/p&gt;
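
&lt;p&gt;A sketch of that character-by-character approach (simplified: the real &lt;code&gt;content.ts&lt;/code&gt; also dispatches &lt;code&gt;keydown&lt;/code&gt;/&lt;code&gt;keyup&lt;/code&gt; events per character):&lt;/p&gt;

```javascript
// Append each character and fire an 'input' event so React's synthetic
// event system sees every change, instead of silently setting .value once.
async function typeLikeUser(el, text) {
  el.focus?.(); // trigger any focus-driven state updates
  for (const ch of text) {
    el.value = (el.value ?? '') + ch;
    el.dispatchEvent(new Event('input', { bubbles: true }));
  }
}
```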

&lt;p&gt;&lt;strong&gt;3. Scroll:&lt;/strong&gt; Scrolls the page (or a specific element) up/down/to-top/to-bottom by manipulating &lt;code&gt;scrollTop&lt;/code&gt; and &lt;code&gt;scrollLeft&lt;/code&gt; or using &lt;code&gt;.scrollIntoView()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Keyboard Type:&lt;/strong&gt; Types text one character at a time using &lt;code&gt;dispatchEvent(new KeyboardEvent('keydown'))&lt;/code&gt; and &lt;code&gt;dispatchEvent(new KeyboardEvent('keyup'))&lt;/code&gt;, mimicking real typing. This is actually faster than setting &lt;code&gt;.value&lt;/code&gt; because it doesn't cause React to re-render the entire component tree on every character.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Press Key:&lt;/strong&gt; Presses individual keys (Enter, Tab, Escape, etc.) by dispatching keyboard events. Useful for submitting forms or navigating through interfaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Key Combination:&lt;/strong&gt; Presses multiple keys simultaneously (Ctrl+A, Cmd+C, etc.) for complex keyboard shortcuts. This is how you can make the agent copy/paste or select all text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Drag &amp;amp; Drop:&lt;/strong&gt; Simulates drag-and-drop by dispatching &lt;code&gt;mousedown&lt;/code&gt;, &lt;code&gt;mousemove&lt;/code&gt;, and &lt;code&gt;mouseup&lt;/code&gt; events from source to destination coordinates. Useful for dragging sliders or reordering lists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Hover:&lt;/strong&gt; Moves the mouse cursor to coordinates and fires &lt;code&gt;mouseover&lt;/code&gt; and &lt;code&gt;mousemove&lt;/code&gt; events. This is useful for triggering dropdowns or tooltips that only appear on hover.&lt;/p&gt;

&lt;p&gt;Each action returns a result object (e.g., &lt;code&gt;{success: true, element: 'BUTTON'}&lt;/code&gt;) back through &lt;code&gt;background.ts&lt;/code&gt; to the sidepanel, so Gemini can see what happened and decide on the next action. The content script also creates a visual indicator, a blue outline and pulsing circle at the action location, that disappears after 600ms. This gives you real-time feedback on what the agent is doing, which is surprisingly important for building trust. Without the visual feedback, the agent feels like a black box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flow summary:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Sidepanel calls streamWithGeminiComputerUse()
  → background.ts captures screenshot
  → Gemini API receives screenshot + DOM
  → Gemini returns function calls
  → background.ts forwards to content.ts
  → content.ts executes actions
  → repeat up to 30 times
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Tool Router: External API Integration
&lt;/h2&gt;

&lt;p&gt;Computer use is excellent for browser automation, but what if you need to send a Slack message? Or create a GitHub issue? Or search your Gmail?&lt;/p&gt;

&lt;p&gt;That's where the Tool Router comes in. Instead of looping with screenshots and browser actions, you hand off the work directly to specialised external services via Composio's 500+ integrated tools.&lt;/p&gt;

&lt;p&gt;The key difference: Computer use is iterative and visual (screenshot → analyze → act → repeat), while the Tool Router is a single API call to an external service. When you need to "send a Slack message," the Tool Router targets the Slack API, sends a single request, and the job is completed on their servers.&lt;/p&gt;

&lt;p&gt;The Tool Router handles three critical features:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Discovery:&lt;/strong&gt; Searches across all available tools to find tools that match your task. Returns relevant toolkits with their descriptions, schemas, and connection status. For example, if you say "send an email," it searches and finds &lt;code&gt;GMAIL_SEND_MESSAGE&lt;/code&gt;, &lt;code&gt;OUTLOOK_SEND_EMAIL&lt;/code&gt;, etc., and returns them with their parameters so Gemini knows what to call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Authentication:&lt;/strong&gt; Checks if you have an active connection to the required toolkit. If not, it creates an auth config and returns a connection URL using Composio's Auth Link. You complete authentication via this link, and your credentials are stored securely. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Execution:&lt;/strong&gt; Loads authenticated tools into context and executes them, supporting parallel execution across multiple tools for efficiency. For example, if you say "find all emails from Bob and create a summary doc," it can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Search Gmail in parallel&lt;/li&gt;
&lt;li&gt;Process the results&lt;/li&gt;
&lt;li&gt;Call the Google Docs API to create the summary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All in one flow.&lt;/p&gt;

&lt;p&gt;The beauty of this dual approach (Computer Use + Tool Router) is that you can mix them. You can use computer use to navigate to a page and extract information, then use the Tool Router to send that information via Slack. The agent picks which approach to use based on the task.&lt;/p&gt;
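
&lt;p&gt;In the extension the model makes this call itself, but the split can be sketched as a simple heuristic (the keyword list below is purely illustrative):&lt;/p&gt;

```javascript
// Hypothetical routing heuristic: external-service tasks go to the Tool
// Router (one API call), everything else stays in the computer-use loop.
const EXTERNAL_HINTS = ['slack', 'gmail', 'email', 'github', 'calendar'];

function pickMode(task) {
  const t = task.toLowerCase();
  return EXTERNAL_HINTS.some((hint) => t.includes(hint))
    ? 'tool-router'   // e.g. "send a Slack message"
    : 'computer-use'; // e.g. "click the login button"
}
```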

&lt;h2&gt;
  
  
  The Sidepanel: Where You Actually Use It
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;sidepanel.tsx&lt;/code&gt; is where you interact with the agent. It's a React component that renders in Chrome's sidebar (that panel that slides out from the right side of the browser).&lt;/p&gt;

&lt;p&gt;Here's what it does:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Chat interface:&lt;/strong&gt; You type natural language commands ("click the login button", "fill out this form with my details", "send a summary of this page to Slack").&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Live conversation history:&lt;/strong&gt; Displays the back-and-forth between you and the agent, including what actions it took and why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Mode switcher:&lt;/strong&gt; Toggle between two systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Computer Use (Gemini):&lt;/strong&gt; For direct browser automation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tool Router (Composio):&lt;/strong&gt; For external API calls to Gmail, Slack, GitHub, etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Visual feedback:&lt;/strong&gt; Shows when actions are executing, displays screenshots the agent is analyzing (if you want), and reports errors clearly.&lt;/p&gt;

&lt;p&gt;The interface is intentionally minimal. You don't need a complex UI when the agent is doing all the work: just a text input and a conversation history.&lt;/p&gt;




&lt;h2&gt;
  
  
  AI Coding Tool Costs
&lt;/h2&gt;

&lt;p&gt;I started with Claude Sonnet 4.5 in Cursor. I set a $50 budget and figured that would last me at least a week. It was gone in three days.&lt;/p&gt;

&lt;p&gt;The problem with Sonnet isn't that it's bad at coding—it's excellent at coding. The problem is that it's a token-guzzling machine. Here's where the tokens went:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Redundant documentation files:&lt;/strong&gt; Sonnet loves creating &lt;code&gt;TECHNICAL_IMPLEMENTATION.md&lt;/code&gt;, &lt;code&gt;ARCHITECTURE.md&lt;/code&gt;, &lt;code&gt;CHANGELOG.md&lt;/code&gt;, and other markdown files that have no real utility except maybe giving Claude context on the changes it made. Highly inefficient when you're trying to complete a project quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Verbose explanations:&lt;/strong&gt; Every code change comes with a three-paragraph explanation of &lt;em&gt;why&lt;/em&gt; it made the change. Great for understanding, terrible for token efficiency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Full-file rewrites:&lt;/strong&gt; Instead of making targeted edits, Sonnet often rewrites entire files. If you have a 500-line file and need to change one function, Sonnet will regenerate all 500 lines. That's 500 output tokens instead of 20.&lt;/p&gt;

&lt;p&gt;Here's what my Cursor usage looked like with Sonnet 4.5:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6j5txyu9lyf1vb4bpfz3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6j5txyu9lyf1vb4bpfz3.png" alt="Cursor dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the first three days, I'd burned through most of my $50 budget. I was stretching it by switching to Composer mode (which is slower but more thoughtful), but even that wasn't sustainable.&lt;/p&gt;

&lt;p&gt;Then Anthropic launched Haiku 4.5, which performs at the same level as Sonnet 4. I was sceptical—usually "performs at the same level" means "performs at 80% of the level for niche tasks"—but I was desperate.&lt;/p&gt;

&lt;p&gt;I switched to Haiku 4.5 midway through the project. I completed the remaining work for $30 total.&lt;/p&gt;

&lt;p&gt;Here's the difference:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6qeakrehofc7zvb7lke4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6qeakrehofc7zvb7lke4.png" alt="Image 3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key observations:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Haiku 4.5:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More focused changes, fewer tokens per edit&lt;/li&gt;
&lt;li&gt;Rarely creates unnecessary documentation files&lt;/li&gt;
&lt;li&gt;Makes targeted edits instead of full-file rewrites&lt;/li&gt;
&lt;li&gt;Suggestion acceptance rate is actually higher (because suggestions are more precise)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Sonnet 4.5:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better at high-level architecture decisions&lt;/li&gt;
&lt;li&gt;More verbose explanations (good for learning, bad for budget)&lt;/li&gt;
&lt;li&gt;More likely to rewrite everything&lt;/li&gt;
&lt;li&gt;Suggestion acceptance rate lower (because it suggests more changes per edit)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The verdict:&lt;/strong&gt; For extension development—or really any project where you know roughly what you need to build—Haiku 4.5 is 95% as good for 30% of the cost.&lt;/p&gt;

&lt;p&gt;The 5% where Sonnet is better? Initial architecture decisions, figuring out how to structure something you've never built before, debugging peculiar issues. But for "implement this feature" or "fix this bug," Haiku is more than good enough.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Claude Got Wrong (So You Don't Waste Hours Like I Did)
&lt;/h2&gt;

&lt;p&gt;Let me save you some pain by documenting the places where Claude Code absolutely struggled. These aren't bugs in Claude—they're gaps in its understanding of Chrome extension architecture and Gemini's API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Issue 1: Text Input Wouldn't Work
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt; The agent could click buttons, scroll pages, and navigate between screens. But it couldn't type text into input fields. Every time it tried, nothing happened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude's diagnosis (first 10 attempts):&lt;/strong&gt; "The coordinates must be wrong. Let me try calculating them differently."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude's diagnosis (next 10 attempts):&lt;/strong&gt; "Maybe the input isn't focused. Let me add a focus event first."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude's diagnosis (next 10 attempts):&lt;/strong&gt; "The timing might be off. Let me add delays between keystrokes."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Actual problem:&lt;/strong&gt; Gemini requires human-in-the-loop permission for tasks it considers sensitive, like typing text. By default, it blocks text input actions entirely unless you explicitly tell it not to.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// In your Gemini API config&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;https://generativelanguage.googleapis.com/v1beta/&amp;gt;...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[...],&lt;/span&gt;
    &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[...],&lt;/span&gt;
    &lt;span class="na"&gt;safety_settings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;HARM_CATEGORY_DANGEROUS_CONTENT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;BLOCK_NONE&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;// ← This is what you need&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;safety_settings&lt;/code&gt; parameter is how you manually dial back the guardrails Gemini has active by default; it controls how conservative the model is about "dangerous" actions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time wasted:&lt;/strong&gt; 2+ hours across multiple sessions&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The kicker:&lt;/strong&gt; This is documented in Google's Gemini Computer Use guide, but Claude never thought to check there. It was convinced it was a coordinate or timing issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; When working with computer use models, always check their specific documentation for permission and safety settings BEFORE debugging for hours. The model might be refusing to do something, not failing to do it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Issue 2: Screenshot Capture Hell
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt; The computer use loop would start, send the first screenshot, Gemini would respond with an action, and then the extension would crash when trying to capture the second screenshot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude's diagnosis (attempt 1-5):&lt;/strong&gt; "The screenshot might be too large. Let me try compressing it more."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude's diagnosis (attempt 6-10):&lt;/strong&gt; "Maybe the format is wrong. Let me try converting PNG → JPG."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude's diagnosis (attempt 11-15):&lt;/strong&gt; "Let me try JPG → PNG instead."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude's diagnosis (attempt 16-30):&lt;/strong&gt; Variations of the above, trying different quality settings, different compression libraries, different encoding methods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Actual problem:&lt;/strong&gt; Chrome extensions can't capture screenshots from the sidebar context. The sidebar runs in its own isolated context and doesn't have access to the main window's visual content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The wrong approach (what Claude kept trying):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// This doesn't work from sidepanel context&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;screenshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;chrome&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tabs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;captureVisibleTab&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="c1"&gt;// Error: No tab found&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The right approach:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// You need to query for the main window's active tab first&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tabs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;chrome&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tabs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;active&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;currentWindow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tabs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;screenshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;chrome&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tabs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;captureVisibleTab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tabs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;windowId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;png&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference is subtle but critical. The sidebar doesn't have a concept of "current window" in the same way a content script does. You have to explicitly query for the active tab and specify its window ID.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time wasted:&lt;/strong&gt; 1+ hour, 30+ different approaches&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Moment of realization:&lt;/strong&gt; I finally found this buried in a GitHub issue from 2019 where someone else had the exact same problem. It's not in Chrome's official documentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Claude doesn't understand Chrome extension context boundaries. When it fails to capture something that should work, check if you're in the right context (background vs content vs sidepanel vs popup).&lt;/p&gt;

&lt;p&gt;This one fix ended a long, frustrating debugging session in which I had been trying every possible variation of the screenshot API without understanding the fundamental issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Issue 3: Permission Manifest Confusion
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt; Some Chrome APIs would work in development but fail in production after packaging the extension.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude's diagnosis:&lt;/strong&gt; "The manifest permissions must be incomplete. Let me add more permissions."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Claude did:&lt;/strong&gt; Added every permission that sounded remotely related: &lt;code&gt;activeTab&lt;/code&gt;, &lt;code&gt;tabs&lt;/code&gt;, &lt;code&gt;&amp;lt;all_urls&amp;gt;&lt;/code&gt;, &lt;code&gt;webRequest&lt;/code&gt;, etc.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Actual problem:&lt;/strong&gt; Chrome has different permission requirements for MV3 (Manifest V3) extensions vs MV2. Claude kept suggesting MV2 patterns because most Stack Overflow answers are from the MV2 era.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Understanding the difference between MV3's service workers and MV2's background pages, and adjusting the manifest accordingly.&lt;/p&gt;
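&lt;p&gt;For reference, a minimal MV3 manifest for this kind of sidepanel extension looks roughly like the sketch below. The exact permission list is an assumption and depends on what your build actually uses, but the key MV3 difference is &lt;code&gt;background.service_worker&lt;/code&gt; where MV2 had a background page:&lt;/p&gt;

```json
{
  "manifest_version": 3,
  "name": "Browser Agent",
  "version": "1.0.0",
  "permissions": ["activeTab", "tabs", "sidePanel", "storage"],
  "host_permissions": ["&lt;all_urls&gt;"],
  "background": { "service_worker": "background.js" },
  "side_panel": { "default_path": "sidepanel.html" }
}
```

&lt;p&gt;Note that host access moved out of &lt;code&gt;permissions&lt;/code&gt; into the separate &lt;code&gt;host_permissions&lt;/code&gt; key in MV3, which is exactly the kind of detail MV2-era Stack Overflow answers get wrong.&lt;/p&gt;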

&lt;p&gt;&lt;strong&gt;Time wasted:&lt;/strong&gt; 30 minutes&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Always check which manifest version you're using and make sure Claude's suggestions match that version. The APIs are similar but the permission model is different.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;I've open-sourced the code at &lt;a href="http://github.com/composiohq/open-chatgpt-atlas" rel="noopener noreferrer"&gt;github.com/composiohq/open-chatgpt-atlas&lt;/a&gt;. Here's how to get started:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup (5 minutes):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Clone the repo: &lt;code&gt;git clone ...&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Install dependencies: &lt;code&gt;npm install&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Build the extension: &lt;code&gt;npm run build&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Load in Chrome: Go to &lt;code&gt;chrome://extensions&lt;/code&gt;, enable Developer Mode, click "Load unpacked", select the &lt;code&gt;dist&lt;/code&gt; folder&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Get a Gemini API key: &lt;a href="https://ai.google.dev/" rel="noopener noreferrer"&gt;https://ai.google.dev/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Open the extension, go to Settings, paste your API key&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;First task:&lt;/strong&gt; Open any webpage, click the extension icon, and try: "Click the search button and type 'AI browser agents'"&lt;/p&gt;

&lt;p&gt;Watch the blue flashes as it executes each action. If it fails, check the console for errors (right-click the extension → Inspect).&lt;/p&gt;

&lt;p&gt;Thanks for the read. Here is the &lt;a href="https://github.com/ComposioHQ/open-chatgpt-atlas" rel="noopener noreferrer"&gt;repository&lt;/a&gt;; feel free to star it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>javascript</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I built a voice AI agent to clean my emails, meetings, and Slack DMs (Composio, Vapi, OpenAI TTS) 🪄</title>
      <dc:creator>Sunil Kumar Dash</dc:creator>
      <pubDate>Tue, 23 Sep 2025 11:07:05 +0000</pubDate>
      <link>https://forem.com/composiodev/i-built-a-voice-ai-agent-to-clean-my-emails-meetings-and-slack-dms-composio-vapi-openai-tts-472b</link>
      <guid>https://forem.com/composiodev/i-built-a-voice-ai-agent-to-clean-my-emails-meetings-and-slack-dms-composio-vapi-openai-tts-472b</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;I am the Voice from the Outer World! I will lead you to PARADISE&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Paul Atreides uses the Voice as a tool for control and assertion. Imagine commandeering an AI agent with this voice. We built an AI agent using &lt;a href="https://composio.dev" rel="noopener noreferrer"&gt;Composio&lt;/a&gt;, &lt;a href="https://vapi.ai" rel="noopener noreferrer"&gt;Vapi&lt;/a&gt;, and &lt;a href="https://platform.openai.com/docs/guides/text-to-speech" rel="noopener noreferrer"&gt;OpenAI TTS&lt;/a&gt; integrated with Gmail, Slack, and Google Calendar. It can summarise emails, schedule meetings, and search for Slack messages, making your entire morning routine stress-free.&lt;/p&gt;

&lt;p&gt;The entire thing was built using Claude Code inside the Cursor IDE. &lt;/p&gt;




&lt;h2&gt;
  
  
  The problem in Arrakis
&lt;/h2&gt;

&lt;p&gt;Checking Slack and Gmail is a morning ritual I religiously follow, but comprehending each message while still half-asleep feels like swimming through molasses. Voice agents excel at this exact problem - you can ask them to summarise the critical stuff, explain confusing threads, or drill into specific details while you make coffee. The impressions in the screenshot below also indicate that I’m not alone in facing this problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffe46hwjb5lzxi3n3gcjh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffe46hwjb5lzxi3n3gcjh.png" alt="Mathew Berman tweet"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Cultivating the Spice
&lt;/h2&gt;

&lt;p&gt;I started with a Next.js app and immediately hit the latency wall that kills most voice projects. Voice demands conversational flow - unlike text interfaces, where users tolerate waiting, voice agents need to respond instantly or the illusion breaks. My initial approach was embarrassingly naive: STT → LLM → Tool Call → TTS. Sequential processing meant 3-5 seconds of awkward silence after each command.&lt;/p&gt;

&lt;p&gt;Then I discovered Vapi, which handles the entire voice pipeline elegantly - parallel processing, model swapping, automatic interruption handling. It turned my clunky prototype into something that actually feels conversational.&lt;/p&gt;

&lt;p&gt;For integrations, Composio was the obvious choice - it abstracts away the OAuth complexity and gives you clean, reliable connections to Gmail, Calendar, and Slack without writing boilerplate for each API.&lt;/p&gt;

&lt;p&gt;On the development side, I'm convinced that running Claude Code inside Cursor is the optimal setup. Standalone Claude Code in terminal lacks proper diffs - you're flying blind with file changes. Cursor alone has good DX but weaker code generation. But Claude Code &lt;em&gt;inside&lt;/em&gt; Cursor? You get Claude's superior coding ability with Cursor's visual diffs, giving you a lot more control and visibility over the changes being made.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where the Spice Flows
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5kptmnb768dlosouawy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5kptmnb768dlosouawy.png" alt="Voice agent architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;User → Vapi Widget&lt;/strong&gt;: User clicks “Talk to Assistant” to start the voice session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Widget → LLM&lt;/strong&gt;: The widget starts a call, sending the system prompt, model, voice, and the tool catalogue from vapiToolsConfig with concrete server URLs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM → Widget&lt;/strong&gt;: The LLM streams speech and final transcripts back; the widget updates speaking/listening indicators and the transcript UI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM → API Route&lt;/strong&gt;: When an action is needed (e.g., send email), the LLM triggers a tool call: an HTTP POST to the matching /api/tools/... route.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API self-work (route-helpers)&lt;/strong&gt;: The route extracts the toolCallId and arguments, races execution against a 30-second timeout, and normalises errors/success into Vapi’s expected response shape.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API → Composio&lt;/strong&gt;: The route calls the relevant wrapper in lib/composio.ts, which invokes &lt;a href="http://composio.tools/" rel="noopener noreferrer"&gt;composio.tools&lt;/a&gt;.execute(...).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composio → Provider&lt;/strong&gt;: Composio talks to Gmail/Calendar/Slack APIs and returns a ComposioToolResult.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API → LLM&lt;/strong&gt;: The API responds with { results: [{ toolCallId, result }] }. The LLM consumes this and continues the conversation with updated context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM/Widget → User&lt;/strong&gt;: The widget reflects new messages/results in the transcript and UI state.&lt;/li&gt;
&lt;/ol&gt;
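&lt;p&gt;Steps 5 and 8 above can be sketched in a few lines. This is a minimal illustration, not the project's actual route code - names like &lt;code&gt;executeTool&lt;/code&gt; and &lt;code&gt;handleToolCall&lt;/code&gt; are my own - but the response shape &lt;code&gt;{ results: [{ toolCallId, result }] }&lt;/code&gt; matches what the article describes:&lt;/p&gt;

```javascript
const TIMEOUT_MS = 30000; // the 30-second budget mentioned in step 5

// Race a promise against a timeout, clearing the timer afterwards so a
// resolved call doesn't leave a stray 30s timer keeping the process alive.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((resolve, reject) => {
    timer = setTimeout(() => reject(new Error('Tool call timed out')), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Normalise both success and failure into the same response shape,
// so the LLM always gets something it can continue the conversation with.
async function handleToolCall(toolCallId, args, executeTool, ms = TIMEOUT_MS) {
  try {
    const result = await withTimeout(executeTool(args), ms);
    return { results: [{ toolCallId, result }] };
  } catch (err) {
    return { results: [{ toolCallId, result: `Error: ${err.message}` }] };
  }
}
```

&lt;p&gt;Returning errors in-band, rather than throwing, is the important design choice here: a failed tool call becomes context the model can react to instead of a dead air moment for the user.&lt;/p&gt;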




&lt;h2&gt;
  
  
  Following Shai Hulud
&lt;/h2&gt;

&lt;p&gt;Claude Code's recent performance has been frustrating, and I'm not alone in noticing the decline in quality. Despite feeding it comprehensive documentation from Composio and Vapi, it consistently reverted to outdated API patterns. I'd explicitly show it how to implement routes using Vapi's specific request/response schemas, and it would acknowledge understanding, then immediately generate code using deprecated methods. Its debugging process became almost comical - fix one error, create three new ones, then insist the original fix was perfect while ignoring the fresh breakage. The silver lining? It nailed the core architecture, cleanly separating Composio actions into individual route files with a centralised wrapper.&lt;/p&gt;

&lt;p&gt;The UI challenge revealed another Claude Code quirk: without explicit visual direction, it defaults to the same tired template every time - hero section, three feature cards, call it a day. Voice interfaces are surprisingly hard to find inspiration for; most hide behind wake words or bury the actual interaction. Thankfully, Vapi's documentation included a pre-built voice widget that I could feed directly to Claude Code as a starting point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/J3UG6VrlXFU"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;The agent currently handles nine core actions across three platforms: Gmail (fetch, send, and draft), Slack (create channels, list conversations, and send messages), and Google Calendar (create events and find conflicts). Each action executes with sub-500ms latency - fast enough that conversation never breaks flow.&lt;/p&gt;

&lt;p&gt;The real power is &lt;a href="https://composio.dev" rel="noopener noreferrer"&gt;Composio's extensibility&lt;/a&gt;. Adding new tools requires just a few lines of configuration rather than wrestling with OAuth flows and API quirks. Want Notion for meeting notes? Linear for task creation? Each addition makes the assistant exponentially more useful. The vision is simple: reduce the mechanical parts of knowledge work to voice commands.&lt;/p&gt;

&lt;p&gt;Vapi’s observability on the dashboard is extremely helpful when trying to debug behaviours with voice agents because, unlike text, you can’t directly get into the trenches. Metrics and call logs provide a clear understanding of the agent’s behaviour.&lt;/p&gt;

&lt;p&gt;Next on the roadmap: MCP (Model Context Protocol) support for smarter tool coordination, improved response handling to make conversations feel more natural rather than command-response, and a UI that actually shows what's happening under the hood. The current interface works, but it should feel like magic - visual feedback for active tools, confidence scores for actions, and a preview of what's about to happen before confirmation. The foundation is solid; now it's time to make it shine.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>ai</category>
      <category>javascript</category>
    </item>
    <item>
      <title>We raised $29M to make your agents stronger, smarter, and better</title>
      <dc:creator>Sunil Kumar Dash</dc:creator>
      <pubDate>Tue, 22 Jul 2025 17:10:51 +0000</pubDate>
      <link>https://forem.com/composiodev/we-raised-29m-to-make-your-agents-stronger-smarter-and-better-3fa2</link>
      <guid>https://forem.com/composiodev/we-raised-29m-to-make-your-agents-stronger-smarter-and-better-3fa2</guid>
      <description>&lt;p&gt;Support us on Twitter by liking or just quoting with whatever you feel like. We put some real work into this video, do check it out.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1947680602083496319-686" src="https://platform.twitter.com/embed/Tweet.html?id=1947680602083496319"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

&lt;h2&gt;
  
  
  Our thoughts on the future of the Agents landscape
&lt;/h2&gt;

&lt;p&gt;AI agents today don't learn from experience. You can engineer prompts endlessly, but your agents won't build intuition over time. They won't learn why API edge cases need special handling or remember your specialised way of interacting with complex systems.&lt;/p&gt;

&lt;p&gt;At Composio, we're building the infrastructure that enables AI agents to evolve, backed by $29M in funding.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Infrastructure for Intuition
&lt;/h2&gt;

&lt;p&gt;We’re creating a shared learning infrastructure where every interaction makes the entire ecosystem smarter. When one agent masters a tool or discovers an optimal workflow, every agent on our platform benefits instantly.&lt;/p&gt;

&lt;p&gt;Over centuries, humans have built better tools for themselves from shared experience; now we're bringing that advantage to AI at unprecedented speed and scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evolving Skills, Not Static Tools
&lt;/h2&gt;

&lt;p&gt;Agents built with Composio won’t just execute tasks; they will evolve like a fleet of Waymos, learning from each other.&lt;/p&gt;

&lt;p&gt;Imagine an agent finding a founding engineer in San Francisco. Through feedback, it learns valuable heuristics: Twitter outperforms LinkedIn, prioritize daily coders, and verify vesting schedules. These insights are inherited across our platform—no starting from scratch. All Composio agents collectively learn and develop real-time intuition from these interactions.&lt;/p&gt;

&lt;p&gt;To be clear, we're nowhere near done. Building infrastructure for collective AI learning means solving problems no one's cracked yet. How do you capture tacit knowledge from millions of interactions? How do you turn edge cases into intuition? How do you make sure skills evolve with experience? Hard problems. But solvable ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Join Us
&lt;/h2&gt;

&lt;p&gt;We're building a team that will craft the infrastructure shaping AI's future alongside teammates committed to creating systems that feel magical.&lt;/p&gt;

&lt;p&gt;We’re looking for engineers who:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Love building distributed, self-improving systems&lt;/li&gt;
&lt;li&gt;Think infrastructure should be elegant and invisible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're excited about this, reach out: &lt;a href="mailto:hiring@composio.dev"&gt;hiring@composio.dev&lt;/a&gt; or DM me on &lt;a href="https://x.com/GanatraSoham" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;— Soham, CEO, Composio&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>javascript</category>
      <category>python</category>
    </item>
    <item>
      <title>I cloned this VC-funded AI super agent app in a weekend, here's how🪄✨</title>
      <dc:creator>Sunil Kumar Dash</dc:creator>
      <pubDate>Thu, 17 Jul 2025 14:31:09 +0000</pubDate>
      <link>https://forem.com/composiodev/i-cloned-this-vc-funded-ai-super-agent-app-in-a-weekend-heres-how-43np</link>
      <guid>https://forem.com/composiodev/i-cloned-this-vc-funded-ai-super-agent-app-in-a-weekend-heres-how-43np</guid>
      <description>&lt;p&gt;General-purpose AI agents like Manus and GenSpark have caught everyone’s attention. And VSs are pouring money into them. You can find many in the YC cohorts. These agents are really cool and provide access to a wide range of external tools used in our daily lives, such as spreadsheets, documents, and PowerPoint slides.&lt;/p&gt;

&lt;p&gt;I received a text to build this kind of Agent within 24 hours for a demo. Let’s vibe code this shit. Here’s how I went about it. I opened my Cursor instance and set up the repo. My weapon of choice was Claude 4 Sonnet (thinking) in agent mode.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fere6ktibk29sth27kvo4.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fere6ktibk29sth27kvo4.gif" alt="Kid vibing"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Vibe Coding Setup
&lt;/h2&gt;

&lt;p&gt;I had to choose between Claude Code and the Cursor IDE. For something more open-ended, I’d use Claude Code to let the model explore and build, but due to time constraints I needed more control, so I went with the Cursor Agent. I decided to build a web app with Next.js and use the AI SDK for the ease of working with agents and LLMs. &lt;/p&gt;

&lt;p&gt;LangGraph, by comparison, would have been significantly more complex: I’d have had to define the workflows myself, which isn’t necessary for open-ended tasks. Instead of the Gemini 2.5 Pro + GPT 4.1 approach from last time, I went all guns blazing with Claude 4 Sonnet (thinking), hoping the model could handle most of the development without me managing every aspect.&lt;/p&gt;

&lt;p&gt;For the Agent tools, &lt;a href="https://composio.dev/" rel="noopener noreferrer"&gt;Composio&lt;/a&gt; was the choice because I can handle authentication with Google Suite Apps and utilise their APIs as actions in the agent without having to read and plumb through Google’s API documentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbe3irvumwwuuvw1iqpf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbe3irvumwwuuvw1iqpf.png" alt="Super agent dashboard"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What to avoid?
&lt;/h2&gt;

&lt;p&gt;The worst mistake you can make while vibe coding is making open-ended requests. I made the dumb mistake of giving Claude documentation and asking it to build based on that. The code it wrote was disastrous. Worse, Claude also tends to use a lot of dummy variables. I rejected all of its changes, set up the Next.js project myself, and installed the necessary packages, mainly the AI SDK and Composio. What core abilities from GenSpark did I want to replicate? The ability to read and edit sheets, documents, and presentations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting back on course
&lt;/h2&gt;

&lt;p&gt;I didn’t expect embedding Google Sheets/Docs as iframes in a sidebar to be so easy. I anticipated a drawn-out process, but it was straightforward. Still, I can’t ship this to users without implementing authentication for each user’s Google Sheets account. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyyg9dmd1uyi3cxktxde.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyyg9dmd1uyi3cxktxde.png" alt="Super agent schema"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I used Composio for easy authentication with Sheets and Docs. Once signed in, the agent can access the user’s files. After handling authentication, the challenging part was enabling the agent to create presentations. There’s no native tool that lets you do it, and I did not want to explore the Google Slides API. I referred to GenSpark and noticed it wrote HTML code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7qd6el4n0027007cark.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7qd6el4n0027007cark.png" alt="Super agent workflow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The super agent recognises the request for a presentation and responds with ‘[SLIDES]’. This triggers the generate-slides endpoint, where the super agent passes the topic, content, slide count, and style. In the generate-presentation endpoint, an LLM generates an array of slide objects containing: type, title, content, and bullet points. The frontend receives this array and, using my static HTML code, renders a preview version that the user can download.&lt;/p&gt;
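&lt;p&gt;The [SLIDES] hand-off can be sketched like this. The marker check and the slide object’s exact field names are my assumptions based on the description above, not the project’s actual code:&lt;/p&gt;

```javascript
// Detect the marker the super agent emits when the user asked for a deck.
function isSlidesRequest(agentReply) {
  return agentReply.includes('[SLIDES]');
}

// One slide object as the generate-presentation endpoint might return it;
// the frontend maps an array of these onto static HTML templates.
const exampleSlide = {
  type: 'bullets',
  title: 'Q3 Revenue Summary',
  content: 'Highlights pulled from the connected Google Sheet',
  bulletPoints: ['Revenue up 12%', 'Churn down 3%', 'Two new enterprise deals'],
};
```

&lt;p&gt;Using a plain-text marker like this is a pragmatic trick: it keeps the chat endpoint model-agnostic, since any LLM can be prompted to emit a fixed token instead of needing structured tool-call support.&lt;/p&gt;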

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdcnv2n0qdfvsjetedmbh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdcnv2n0qdfvsjetedmbh.png" alt="Super agent"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s discuss the Google Sheets and Docs integrations. I wanted a sidebar to view the sheets/docs being edited in real time; it’s nice to see the changes instantly as your agent makes them. Composio to the rescue: I had the toolkits ready and just had to pass them to the generateText function from the AI SDK. I then added the code to render a resizable sidebar for any detected Drive doc URL. I also integrated a web search tool, and then it was time for the browser.&lt;/p&gt;

&lt;p&gt;In Python, there are multiple browser-agent libraries, but in JS there are very few. I planned to use a well-known browser provider, but it refused to let me sign in. I tried clearing the cookies, but I couldn’t spend more time fixing that because of the deadline, so I looked at other options and chose Puppeteer since it was easy to integrate.&lt;/p&gt;

&lt;p&gt;I provided Claude with the documents for Composio’s custom tool creation, and Claude created the Puppeteer tool, wrapped it in the custom tool format, and passed it to the Super Agent with the ability to scrape, click, and input text.&lt;/p&gt;
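&lt;p&gt;A rough sketch of what such a Puppeteer tool can look like (the object shape below is hypothetical; Composio’s actual custom-tool format differs, and Puppeteer is loaded lazily inside execute so the module parses even where the package isn’t installed):&lt;/p&gt;

```javascript
// Hypothetical browser tool wrapping Puppeteer: scrape, click, or type.
const browserTool = {
  name: "browser_action",
  description: "Scrape text, click elements, or type into inputs on a page",
  async execute(input) {
    // Lazy-load so merely defining the tool needs no browser binary.
    const puppeteer = require("puppeteer");
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    try {
      await page.goto(input.url);
      if (input.action === "click") {
        await page.click(input.selector);
      } else if (input.action === "type") {
        await page.type(input.selector, input.text);
      }
      // Default behaviour: scrape the visible text of the page.
      return await page.evaluate(function () {
        return document.body.innerText;
      });
    } finally {
      await browser.close();
    }
  },
};
```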

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6nr7f0w02bzzh7fmk22.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6nr7f0w02bzzh7fmk22.png" alt="Super agent output"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The final demo read data from Sheets/Docs and used it to generate slides dynamically. It worked, and I met the deadline.&lt;/p&gt;

&lt;p&gt;The code is on &lt;a href="https://github.com/ComposioHQ/google-super-agent" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Fork it, break it, make it better.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/ot-eOvaK61o"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  How difficult was it to vibe code?
&lt;/h2&gt;

&lt;p&gt;I have to admit that existing features often broke when I tried to add new ones, and one persistent error was Claude’s confusion between Tailwind v3 and v4, which created scenarios where I had to restore checkpoints to keep the UI from breaking. I wrote the code for all the route files myself; I don’t think AI agents are as good at backend logic as they are at the frontend. I used one or two &lt;a href="https://21st.dev/" rel="noopener noreferrer"&gt;21st.dev&lt;/a&gt; components for the UI.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>javascript</category>
      <category>ai</category>
    </item>
    <item>
      <title>From Figma designs to pixel-perfect components using Figma MCP &amp; Claude Code 🧙🪄</title>
      <dc:creator>Sunil Kumar Dash</dc:creator>
      <pubDate>Mon, 07 Jul 2025 15:26:04 +0000</pubDate>
      <link>https://forem.com/composiodev/from-figma-designs-to-pixel-perfect-components-using-figma-mcp-claude-code-3ao</link>
      <guid>https://forem.com/composiodev/from-figma-designs-to-pixel-perfect-components-using-figma-mcp-claude-code-3ao</guid>
<description>&lt;p&gt;Figma is one of the best tools to emerge in the last decade or so. Regardless of the organisation's size, everyone uses Figma for everything, from landing pages to dashboards. And if you have been one of those poor souls tasked with turning designs into pixel-perfect app components, I understand you. Been there, done that.&lt;/p&gt;

&lt;p&gt;The good news is that all these fancy technologies (LLMs, CLI agents, and MCPs) are going to make this a whole lot easier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuktd7b1mxj8aosux76vl.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuktd7b1mxj8aosux76vl.gif" alt="Composio Figma MCP"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yes, I have been using Claude Code a lot lately; it's the best thing that has happened to humanity since Messi's FIFA 22 campaign (don't get mad, please), and pairing MCP servers with it can do wonders.&lt;/p&gt;

&lt;p&gt;In this blog post, I will share how you can configure the Figma MCP with Claude Code to build pixel-perfect front-end components.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Covered?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Configuring the Composio Figma MCP server (the best Figma MCP server, BTW; try it to believe it).&lt;/li&gt;
&lt;li&gt;Integrating the Figma MCP server with Claude Code to build frontend components. (You can use it with Cursor and Gemini CLI as well, but I like Claude Code more.)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Set up Figma MCP server and Claude Code
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;💁 We'll use Composio to add the Figma MCP server support to our Claude Code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You don't need to create an account; head over to mcp.composio.dev and, under the Figma integration, generate the command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1pam7lcl37sy2uqxo5m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1pam7lcl37sy2uqxo5m.png" alt="Composio Figma MCP Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The command should look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="nx"&gt;npx&lt;/span&gt; &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;composio&lt;/span&gt;&lt;span class="sr"&gt;/mcp@latest setup "&amp;lt;https:/&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;composio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dev&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;partner&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;composio&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;figma&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;&lt;span class="nx"&gt;customerId&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;" "figma-605dcr-13" --client claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡NOTE: You can use pretty much the same command to set up Cursor as well. The only difference is changing --client claude to --client cursor, and that's it. You can then simply go ahead and start cloning any design.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Upon running this command, you should see something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dl6mzh4nrb3g4aql0jm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dl6mzh4nrb3g4aql0jm.png" alt="Composio MCP npx output"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, by default, it saves to the &lt;code&gt;~/.config/Claude/claude_desktop_config.json&lt;/code&gt;.&lt;br&gt;
However, I prefer not to save it globally. So, in the project where you plan to run Claude Code, make sure you copy that file to a local .mcp.json file.&lt;/p&gt;

&lt;p&gt;This helps separate MCP servers per project, which is very helpful when adding multiple ones in the future for other projects.&lt;/p&gt;

&lt;p&gt;Run the following command to copy the file to the current directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; ~/.config/Claude/claude_desktop_config.json .mcp.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By doing just that, you're almost done with the complete setup.&lt;/p&gt;

&lt;p&gt;Run Claude in the project where you've just copied the .mcp.json file, and you should see something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4w2lkqzuk6nm47c64rk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4w2lkqzuk6nm47c64rk.png" alt="Run Figma MCP with Claude Code"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hit yes, and inside Claude Code, run /mcp. You should see the MCP server status as connected, and you can view a list of all the available tools as well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4frmtetujaf3gpmjt9bt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4frmtetujaf3gpmjt9bt.png" alt="Claude Code MCP"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, that's all the setup you need to do on the Claude side. There's one small step left, and as you can guess, that's to authenticate with Composio.&lt;/p&gt;

&lt;p&gt;Currently, we've only added the server but have not yet authenticated Composio to connect to our Figma account. So, inside Claude Code, ask it to initiate a connection to the Figma MCP server, and it should give you a URL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flwaxdrt8g8z14aqd4iki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flwaxdrt8g8z14aqd4iki.png" alt="Initiate Authentication with Figma MCP"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Head over to that URL, and you should be authenticated like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm51c0yam7f97dtn0bk3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm51c0yam7f97dtn0bk3.png" alt="Composio Authentication screen"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And just like that, you're done! You can now clone any Figma design, no matter how complex it is!&lt;/p&gt;




&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;💁 In this demo, I'll clone a sample CRM Dashboard design from Figma.&lt;/p&gt;

&lt;p&gt;All you need is the link to the Figma file. Just chat with Claude Code, then sit back and relax. Your clone will be ready in seconds.&lt;/p&gt;

&lt;p&gt;Prompt: I need you to clone the dashboard from this Figma design: . Use HTML, CSS, and JS. Make sure you clone the exact design. Don't show your creativity, make it exact.&lt;/p&gt;

&lt;p&gt;Here's the Figma template:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9j2l68l0qbbx3gupedh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9j2l68l0qbbx3gupedh.png" alt="Figma template"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's what it generated:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyf80ic45m0fmpmg963h4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyf80ic45m0fmpmg963h4.png" alt="Claude Code Output"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, it's almost an exact copy of the original design. You could ask it to build with Tailwind and any JS frameworks like Next.js, but for simplicity, I asked it to use plain HTML, CSS, and JS, and it did a pretty good job.&lt;/p&gt;

&lt;p&gt;Here’s the video demo:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/vq2vPY3E1Uo"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;You can find the entire code it generated here: Code for the Figma Dashboard.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;It's remarkable how quickly things are evolving with MCPs, coding agents, and LLMs. But there are emerging challenges too, particularly around security, availability, and reliability. Trusting random server providers without a proper safety net can be fatal, and guarding against that is a big part of what Composio stands for. Check out the trust page for more.&lt;/p&gt;

&lt;p&gt;Additionally, if you ever build on top of us, please tag us on Twitter and LinkedIn for free credits.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>I vibe-coded a $20M YC app in a weekend, here's how🧙‍♂️ 🪄</title>
      <dc:creator>Sunil Kumar Dash</dc:creator>
      <pubDate>Mon, 02 Jun 2025 13:04:40 +0000</pubDate>
      <link>https://forem.com/composiodev/i-vibe-coded-a-20m-yc-app-in-a-weekend-heres-how-533o</link>
      <guid>https://forem.com/composiodev/i-vibe-coded-a-20m-yc-app-in-a-weekend-heres-how-533o</guid>
<description>&lt;p&gt;I realised that many companies offer no-code platforms that let their users automate workflows.&lt;br&gt;
The numbers were kinda shocking.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1knu2pe7zanby8tsno8e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1knu2pe7zanby8tsno8e.png" alt="No-code platform statistics" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I spent a week deep-diving into Gumloop and other no-code platforms.&lt;br&gt;
They're well-designed, but here's the problem: they're not built for &lt;em&gt;agents&lt;/em&gt;. They're built for &lt;strong&gt;workflows&lt;/strong&gt;. There's a difference.&lt;/p&gt;

&lt;p&gt;Agents need customisation. They have to make decisions, route dynamically, and handle complex tool orchestration. Most platforms treat these as afterthoughts. I wanted to fix that.&lt;/p&gt;

&lt;p&gt;Although it's not production-ready and nowhere close to handling the traffic of companies like Gumloop, this project is intended to showcase the power of vibe coding and how easily you can build sophisticated apps in a matter of days. You can also carry the work forward and improve it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcej24uyfq10vasin9u5.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcej24uyfq10vasin9u5.gif" alt="Vibe coding is real" width="760" height="427"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Picking my tech stack
&lt;/h2&gt;

&lt;p&gt;NextJS was the obvious choice for the vibe-coding stack. Could I have used FastAPI with a React frontend?&lt;br&gt;
Sure — but just thinking about coordinating deployments, managing CORS, and syncing types made me tired.&lt;/p&gt;

&lt;p&gt;For adding a near-unlimited suite of SaaS app integrations, &lt;a href="https://composio.dev" rel="noopener noreferrer"&gt;Composio&lt;/a&gt; was the obvious choice. It features a JS SDK that enables you to add agent integrations easily.&lt;/p&gt;

&lt;p&gt;When it comes to agent frameworks, JS lacks the buffet Python has.&lt;/p&gt;

&lt;p&gt;It boiled down to two frameworks: &lt;a href="https://github.com/langchain-ai/langgraphjs" rel="noopener noreferrer"&gt;LangGraph&lt;/a&gt; and the &lt;a href="https://ai-sdk.dev/" rel="noopener noreferrer"&gt;AI SDK&lt;/a&gt; (I’d heard about &lt;a href="http://mastra.ai" rel="noopener noreferrer"&gt;Mastra AI&lt;/a&gt;, but I didn’t want to spend the weekend getting familiar with it).&lt;/p&gt;

&lt;p&gt;I chose &lt;strong&gt;LangGraph&lt;/strong&gt; over &lt;strong&gt;AI SDK&lt;/strong&gt; because LangGraph’s entire mental model is nodes and edges — exactly how visual agent builders should work. Every agent is just a graph; every workflow, a path through that graph. AI SDK is great, but not convenient for graph-based agents.&lt;/p&gt;
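&lt;p&gt;To make the “every agent is just a graph” claim concrete, here is a toy executor in plain JS. This is only the mental model, not LangGraph’s actual API; LangGraph adds state channels, branching, and persistence on top of the same idea:&lt;/p&gt;

```javascript
// Toy graph runner: nodes transform state, edges decide what runs next.
// (Illustrative only; not LangGraph's real API.)
function runGraph(nodes, edges, startId, input) {
  let current = startId;
  let state = input;
  while (current) {
    state = nodes[current](state); // each node transforms the state
    const next = edges.find(function (e) { return e.source === current; });
    current = next ? next.target : null; // follow the single outgoing edge
  }
  return state;
}

const nodes = {
  input_1: function (s) { return s.trim(); },
  agent_1: function (s) { return "answer for: " + s; },
};
const edges = [{ source: "input_1", target: "agent_1" }];
const out = runGraph(nodes, edges, "input_1", "  hello  ");
// out is "answer for: hello"
```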


&lt;h2&gt;
  
  
  Coding with Vibes
&lt;/h2&gt;

&lt;p&gt;If you’re a vibe-code hater, skip ahead.&lt;br&gt;
Frontend is entirely vibe-coded. I didn’t use Lovable or &lt;a href="http://bolt.new/" rel="noopener noreferrer"&gt;Bolt.new&lt;/a&gt; because it’s easier to open the code in Cursor and tweak it there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My setup&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://composio.dev/blog/gpt-4-1-vs-deepseek-v3-vs-sonnet-3-7-vs-gpt-4-5/" rel="noopener noreferrer"&gt;GPT-4.1&lt;/a&gt;&lt;/strong&gt; – &lt;em&gt;The sniper&lt;/em&gt;: does exactly what you ask, nothing more, nothing less.
Great for precise component tweaks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini 2.5 Pro&lt;/strong&gt; – &lt;em&gt;The machine-gun&lt;/em&gt;: rewrites entire components and understands context across files.
Perfect for major refactors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/21st-dev/magic-mcp" rel="noopener noreferrer"&gt;21st Dev’s MCP Server&lt;/a&gt;&lt;/strong&gt; – uses the Cursor Agent to build beautiful shadow components.
Instead of copy-pasting docs, I just describe what I want.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The canvas where users drag-and-drop nodes? Built with &lt;strong&gt;React Flow&lt;/strong&gt; plus a moving grid background from 21st Dev. Took ~30 minutes; doing it by hand would’ve exhausted me.&lt;/p&gt;
&lt;h2&gt;
  
  
  Building the Components
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr89y0z2jmgp2dv0i5jv0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr89y0z2jmgp2dv0i5jv0.png" alt="Agent builder nodes" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Strip away the marketing fluff; an AI agent is two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An LLM that makes decisions&lt;/li&gt;
&lt;li&gt;The tools it can use to take action&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. So I built exactly four fundamental nodes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input Node&lt;/strong&gt; – where data enters the system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output Node&lt;/strong&gt; – where results emerge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM Node&lt;/strong&gt; – makes decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Node&lt;/strong&gt; – takes actions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…and an &lt;strong&gt;Agent Node&lt;/strong&gt; that combines an LLM + Tools for convenience. Every complex workflow is just a remix of these primitives.&lt;/p&gt;
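&lt;p&gt;In code, the agent node really is just the LLM primitive plus tool primitives bundled together. A hypothetical sketch (none of these helpers belong to a real SDK; a single decide-then-act round is shown for brevity):&lt;/p&gt;

```javascript
// Hypothetical node factories; llm and tools are stand-ins, not a real SDK.
function makeLlmNode(llm) {
  return async function (input) { return llm(input); };
}

function makeToolNode(tool) {
  return async function (input) { return tool.execute(input); };
}

function makeAgentNode(llm, tools) {
  // The LLM decides which tool to call, e.g. { tool: "search", args: {...} }.
  return async function (input) {
    const decision = await llm(input);
    const chosen = tools.find(function (t) { return t.name === decision.tool; });
    if (!chosen) return decision; // no tool needed: plain LLM answer
    return chosen.execute(decision.args);
  };
}
```

A real agent loop would feed the tool result back to the LLM and repeat until it stops requesting tools; the primitives stay the same.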
&lt;h3&gt;
  
  
  &lt;a href="http://composio.dev" rel="noopener noreferrer"&gt;Composio&lt;/a&gt;  for adding unlimited tool integrations
&lt;/h3&gt;

&lt;p&gt;Writing tool integrations is painful. Managing auth for those tools? That’s where developers go to die.&lt;br&gt;
Every tool has a different auth flow. Multiply that by 100+ tools and you have a maintenance nightmare.&lt;/p&gt;

&lt;p&gt;Composio fixes this: one SDK, hundreds of pre-built tools, auth handled automatically. Ship in a weekend instead of spending months on OAuth.&lt;/p&gt;


&lt;h2&gt;
  
  
  API Routes
&lt;/h2&gt;

&lt;p&gt;Each workflow is a JSON graph. Here’s a tiny example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"nodes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"input_1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"customInput"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"position"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"User Query"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"edges"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"input_1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"agent_1"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I wanted one API route that takes the entire graph and executes it.&lt;/p&gt;

&lt;p&gt;When a user hits &lt;strong&gt;Run&lt;/strong&gt;, this happens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Graph Validation&lt;/strong&gt; – find the Input node, verify edges connect, check for cycles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Topological Sort&lt;/strong&gt; – determine execution order (LangGraph does this beautifully)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node Execution&lt;/strong&gt; – each node type has its own execution logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State Management&lt;/strong&gt; – pass data between nodes while maintaining context&lt;/li&gt;
&lt;/ol&gt;
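&lt;p&gt;Steps 1 and 2 boil down to a topological sort with cycle detection. LangGraph handles this internally; a minimal hand-rolled version (Kahn’s algorithm) looks like this:&lt;/p&gt;

```javascript
// Kahn's algorithm: returns node ids in execution order, or throws on a cycle.
function topoSort(nodeIds, edges) {
  const indegree = {};
  nodeIds.forEach(function (id) { indegree[id] = 0; });
  edges.forEach(function (e) { indegree[e.target] += 1; });

  // Start from nodes with no incoming edges (the Input nodes).
  const queue = nodeIds.filter(function (id) { return indegree[id] === 0; });
  const order = [];
  while (queue.length) {
    const id = queue.shift();
    order.push(id);
    edges.forEach(function (e) {
      if (e.source === id) {
        indegree[e.target] -= 1;
        if (indegree[e.target] === 0) queue.push(e.target);
      }
    });
  }
  // If some node was never released, the graph contains a cycle.
  if (order.length !== nodeIds.length) throw new Error("workflow graph has a cycle");
  return order;
}

const order = topoSort(
  ["input_1", "agent_1", "output_1"],
  [{ source: "input_1", target: "agent_1" }, { source: "agent_1", target: "output_1" }]
);
// order is ["input_1", "agent_1", "output_1"]
```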

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsp9ib4vcc957iyl9aooo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsp9ib4vcc957iyl9aooo.png" alt="Runtime workflow" width="800" height="480"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Sample snippet&lt;/span&gt;
&lt;span class="k"&gt;switch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;llm&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getModelFromApiKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;result&lt;/span&gt;       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;previousOutput&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tool&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;composio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;result&lt;/span&gt;       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;previousOutput&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;agent&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;composio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getTools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;agent&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createReActAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;result&lt;/span&gt;       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;previousOutput&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Managing Authentication with Tools&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ij9m8gpg1xjh1ntvlgl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ij9m8gpg1xjh1ntvlgl.png" alt="Auth workflow" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Authentication was my personal nightmare.&lt;br&gt;
&lt;a href="http://composio.dev" rel="noopener noreferrer"&gt;Composio&lt;/a&gt; solved the technical part, but the &lt;strong&gt;UX&lt;/strong&gt;? That took three rewrites.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v1 pain-stack&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Manually type action names (spelled perfectly)&lt;/li&gt;
&lt;li&gt;Leave my app to authenticate on Composio’s dashboard&lt;/li&gt;
&lt;li&gt;Come back and hope it worked&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I added a drop-down of actions, but auth was still clunky. So I:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pulled every available tool from Composio’s API and cached it locally.&lt;/li&gt;
&lt;li&gt;Built a modal showing each toolkit, its tools and connection status.&lt;/li&gt;
&lt;li&gt;Adapted the UI to the tool’s auth type:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Keys&lt;/strong&gt; – password input + link to get the key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OAuth2 (hosted)&lt;/strong&gt; – &lt;em&gt;Connect&lt;/em&gt; button opens a pop-up&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OAuth2 (custom)&lt;/strong&gt; – form for client credentials&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Other&lt;/strong&gt; – dynamic form built from required fields&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once authenticated, the same modal lets you search and add tools in one click.&lt;/p&gt;
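&lt;p&gt;The per-auth-type UI above boils down to a simple dispatch. Here’s a minimal JavaScript sketch; the scheme names and field shapes are my own stand-ins, not Composio’s exact API:&lt;/p&gt;

```javascript
// Map a toolkit's auth scheme to the UI the modal should render.
// Scheme names and field shapes here are illustrative stand-ins.
function authFormFor(toolkit) {
  switch (toolkit.authScheme) {
    case "API_KEY":
      // Password input plus a link to wherever the key lives.
      return { kind: "password-input", helpLink: toolkit.docsUrl };
    case "OAUTH2_HOSTED":
      // A single Connect button that opens the hosted pop-up.
      return { kind: "connect-button" };
    case "OAUTH2_CUSTOM":
      // Form for the user's own client credentials.
      return { kind: "form", fields: ["client_id", "client_secret"] };
    default:
      // Build a dynamic form from whatever fields the toolkit declares.
      return { kind: "form", fields: toolkit.requiredFields ?? [] };
  }
}

console.log(authFormFor({ authScheme: "API_KEY", docsUrl: "https://example.com/keys" }));
```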




&lt;h2&gt;
  
  
  Agent Orchestration Patterns
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx865jk2gdlncmplkbcim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx865jk2gdlncmplkbcim.png" alt="Orchestration patterns" width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Anthropic’s guide &lt;em&gt;“Building Effective Agents”&lt;/em&gt; lists several patterns. I created nodes that instantiate these instantly.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Prompt Chaining&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pattern:&lt;/strong&gt; Sequential; output of one agent feeds the next.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node example:&lt;/strong&gt;
&lt;code&gt;customInput → agent_1 → agent_2 → customOutput&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
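&lt;p&gt;With agents stubbed as plain async functions, the chain is just sequential awaits. A rough sketch (the agents here are placeholders, not real LLM calls):&lt;/p&gt;

```javascript
// Prompt chaining: each agent's output becomes the next agent's input.
const agent1 = async (input) => `outline of: ${input}`;
const agent2 = async (input) => `draft based on ${input}`;

async function chain(input, agents) {
  let output = input;
  for (const agent of agents) {
    output = await agent(output); // sequential: order matters
  }
  return output;
}

chain("a blog post about MCP", [agent1, agent2]).then(console.log);
```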

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Parallelisation&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pattern:&lt;/strong&gt; Agents run in parallel and their results are aggregated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node example:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  customInput → agent_1   (parallel)
  customInput → agent_2   (parallel)
  both       → aggregator → customOutput
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
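&lt;p&gt;The fan-out and aggregation maps naturally onto &lt;code&gt;Promise.all&lt;/code&gt;. A sketch with stubbed agents:&lt;/p&gt;

```javascript
// Parallelisation: run independent agents concurrently, then aggregate.
// Agents are stubbed as async functions for the sketch.
const summarizer = async (input) => `summary(${input})`;
const critic = async (input) => `critique(${input})`;

async function parallel(input, agents, aggregator) {
  const results = await Promise.all(agents.map((a) => a(input)));
  return aggregator(results);
}

parallel("doc", [summarizer, critic], (r) => r.join(" | ")).then(console.log);
```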



&lt;h3&gt;
  
  
  &lt;strong&gt;3. Routing&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pattern:&lt;/strong&gt; A router agent decides which branch to use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node example:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  customInput → router_agent
  router_agent → agent_1 | agent_2 → customOutput
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
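&lt;p&gt;A toy version of the router, using a keyword check as a stand-in for an LLM classification call:&lt;/p&gt;

```javascript
// Routing: a router inspects the input and picks one downstream agent.
const codeAgent = async (q) => `code answer: ${q}`;
const generalAgent = async (q) => `general answer: ${q}`;

async function route(input) {
  // A real router would be an LLM call; this keyword test is a stub.
  const branch = /bug|error|stack trace/i.test(input) ? codeAgent : generalAgent;
  return branch(input);
}

route("why does this error happen?").then(console.log);
```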



&lt;h3&gt;
  
  
  &lt;strong&gt;4. Evaluator-Optimiser&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pattern:&lt;/strong&gt; Generator agent produces solutions; evaluator checks them; loop until good.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node example:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  customInput → generator_agent → solution_output
                 ↘ evaluator_agent ↗
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
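&lt;p&gt;The generate/evaluate loop can be sketched like this; the generator and evaluator are deterministic stubs standing in for LLM calls:&lt;/p&gt;

```javascript
// Evaluator-optimiser: generate, score, retry until good enough or out of tries.
async function generate(prompt, attempt) {
  return `${prompt} (attempt ${attempt})`;
}

async function evaluate(solution) {
  // Stub: pretend quality improves each attempt; accept from attempt 3 onward.
  const attempt = Number(solution.match(/attempt (\d+)/)[1]);
  return attempt >= 3;
}

async function optimise(prompt, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const solution = await generate(prompt, attempt);
    if (await evaluate(solution)) return solution;
  }
  throw new Error("no acceptable solution within budget");
}

optimise("write a haiku").then(console.log);
```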



&lt;h3&gt;
  
  
  &lt;strong&gt;5. Augmented LLM&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pattern:&lt;/strong&gt; An agent node is augmented with tool calls / external data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node example:&lt;/strong&gt;
&lt;code&gt;customInput → agent(with tools) → customOutput&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
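&lt;p&gt;A minimal stand-in for the augmented pattern, with the tool layer and the model both stubbed out:&lt;/p&gt;

```javascript
// Augmented LLM: the agent can call tools mid-run to fetch external data.
// The tool registry and the "model" are stubs; a real version would call an LLM API
// and let the model pick which tool to invoke.
const toolRegistry = {
  search: async (q) => `results for "${q}"`,
};

async function augmentedAgent(input) {
  const evidence = await toolRegistry.search(input);
  return `answer using ${evidence}`;
}

augmentedAgent("MCP servers").then(console.log);
```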




&lt;h2&gt;
  
  
  After 48 hours of rapid development, I had a working agent platform.
&lt;/h2&gt;

&lt;p&gt;The barrier to building agents has collapsed. You don’t need a 20-person team and six months; you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear thinking about what agents are (decision-makers with tools)&lt;/li&gt;
&lt;li&gt;The right abstractions (everything is a graph)&lt;/li&gt;
&lt;li&gt;The wisdom to reuse existing solutions instead of rebuilding them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The irony? I spent more time perfecting the auth modal than building the execution engine. In the age of vibe-code, the hardest problems aren’t technical — they’re about understanding users and having the taste to build well.&lt;/p&gt;

&lt;p&gt;The code lives on &lt;strong&gt;&lt;a href="https://github.com/ComposioHQ/agent-flow" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/strong&gt;. Fork it, break it, make it better.&lt;/p&gt;


&lt;p&gt;This was all about vibe coding my way to an actual product. It may not be fully ready for the real world yet, but it's 80% of the way there after a single weekend, which would have taken months before.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>javascript</category>
      <category>ai</category>
    </item>
    <item>
      <title>Top 10 awesome MCP servers that can make your life easier 🪄✨</title>
      <dc:creator>Sunil Kumar Dash</dc:creator>
      <pubDate>Thu, 24 Apr 2025 13:54:36 +0000</pubDate>
      <link>https://forem.com/composiodev/top-10-awesome-mcp-servers-that-can-make-your-life-easier-3n4o</link>
      <guid>https://forem.com/composiodev/top-10-awesome-mcp-servers-that-can-make-your-life-easier-3n4o</guid>
      <description>&lt;p&gt;MCP by Anthropic is the talk of the town; it's the one thing everyone is talking about and building around. Why, you may ask? Well, the simple reason is that the tooling layer in agents has always been the most challenging part to solve. The MCP (Model Context Protocol) standardises how developers should build tools and clients for universal adaptability.&lt;/p&gt;

&lt;p&gt;Recently, both OpenAI and Google have officially started supporting MCP in their respective agent frameworks, the Agents SDK and the Agent Development Kit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67vizeerdna9p7m1k66u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67vizeerdna9p7m1k66u.png" alt="Demis Hassabis on MCP tweet"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This blog post discusses some of the best MCP servers I have tried to improve my productivity over the last two months. But before that, let's go over what MCP even is.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is MCP (Model Context Protocol), and why should you care?
&lt;/h2&gt;

&lt;p&gt;It’s an open standard developed by Anthropic that standardises how AI applications, LLMs, and tools communicate. It has three distinct components.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Host:&lt;/strong&gt; Applications like Cursor, Windsurf, Claude Desktop, etc.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client:&lt;/strong&gt; Manages the communication between the host application and servers—the middleman.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server:&lt;/strong&gt; Servers are tools (File, Git, Shell, Slack, Notion APIs), databases, log files, etc, which can provide additional context to agents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anthropic defines MCP as the USB-C equivalent of agentic systems. The computers are the hosts, clients are the ports, and peripheral devices are the servers.&lt;/p&gt;

&lt;p&gt;For a more detailed explanation of MCP, check out this blog post: &lt;a href="https://composio.dev/blog/what-is-model-context-protocol-mcp-explained/" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What are MCP Servers?
&lt;/h2&gt;

&lt;p&gt;MCP servers expose external data to the LLM. They can be local tools like the File System tool or remote API services like Slack, Discord, etc. Servers allow your AI apps to be genuinely agentic. &lt;/p&gt;

&lt;p&gt;This post will discuss 10 MCP servers that have helped me save hours. &lt;/p&gt;

&lt;p&gt;So, let's get started.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. &lt;a href="https://mcp.composio.dev/notion" rel="noopener noreferrer"&gt;Notion&lt;/a&gt; for automated note-taking
&lt;/h2&gt;

&lt;p&gt;One of the best productivity hacks for me has been the Notion MCP server.  I use Notion to store all the details from my conversations in the Claude app. It can also fetch any document from Notion and add it as additional context to the discussion. I have been using it with Cursor and Claude Desktop, and it’s so good.&lt;/p&gt;

&lt;p&gt;For Cursor, I use it to fetch the product requirement document and have it create features accordingly.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How to use Notion MCP in Claude Desktop&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt; First, make sure Node.js is installed: run &lt;code&gt;node -v&lt;/code&gt; in your terminal
&lt;/li&gt;
&lt;li&gt; If it isn't, install it from &lt;a href="http://nodejs.org" rel="noopener noreferrer"&gt;nodejs.org&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To get Notion MCP, go to &lt;a href="https://mcp.composio.dev" rel="noopener noreferrer"&gt;https://mcp.composio.dev&lt;/a&gt; and search for Notion. Composio also handles the OAuth authentication, so you can securely connect to the Notion app without worrying about authentication and authorisation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6f7zrwo2q1c7rhxsicqp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6f7zrwo2q1c7rhxsicqp.png" alt="Notion MCP"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You will get an &lt;code&gt;npx&lt;/code&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @composio/mcp@latest setup &lt;span class="s2"&gt;"replace it with the URL"&lt;/span&gt; &lt;span class="nt"&gt;--client&lt;/span&gt; claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, paste the generated command into your terminal and execute it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; The command will automatically add the Notion MCP to your Claude Desktop.
&lt;/li&gt;
&lt;li&gt; Refresh or restart the app; you will see a hammer icon in Claude's chat.
&lt;/li&gt;
&lt;li&gt; &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqht3uk02hso5d8sfri9w.png" alt="Claude MCP"&gt;
&lt;/li&gt;
&lt;li&gt; Click on it to see the available actions.
&lt;/li&gt;
&lt;li&gt;Start by asking in the chat to “Initiate connection with Notion.”
&lt;/li&gt;
&lt;li&gt; Complete the Auth flow and start asking questions.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For Windsurf and Cursor, you can also follow the &lt;a href="https://mcp.composio.dev/notion/grumpy-spoiled-horse-QOMA0z" rel="noopener noreferrer"&gt;instructions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Check out this tutorial on how to integrate Notion with Claude Desktop.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/Bc9jS8iZQY0"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  2. &lt;a href="https://mcp.composio.dev/figma" rel="noopener noreferrer"&gt;Figma&lt;/a&gt;: From Design to Code
&lt;/h2&gt;

&lt;p&gt;You’ll thank the Lord after using Figma MCP in your Cursor workflow. You can turn any Figma design file into working code. It will definitely make your life easier as a developer.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to use Figma MCP in Cursor
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyipjxqt57mgwpg5mefb6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyipjxqt57mgwpg5mefb6.png" alt="Figma MCP"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; Follow the same steps above and make sure your system has Node.js installed.
&lt;/li&gt;
&lt;li&gt; Go to &lt;a href="http://mcp.composio.dev/figma" rel="noopener noreferrer"&gt;http://mcp.composio.dev/figma&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; Generate the &lt;code&gt;npx&lt;/code&gt; code.
&lt;/li&gt;
&lt;li&gt; Run it in your terminal.
&lt;/li&gt;
&lt;li&gt; Now, re-open Cursor or refresh it.
&lt;/li&gt;
&lt;li&gt; You can now see your Figma tools in Cursor settings → MCP.
&lt;/li&gt;
&lt;li&gt; &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5y04th7rzxr1ho6f03g.png" alt="Cursor MCP"&gt; &lt;/li&gt;
&lt;li&gt; Now, initiate a connection with Figma by asking in the chat.
&lt;/li&gt;
&lt;li&gt; Give it the URL to your file in the Figma Project.
&lt;/li&gt;
&lt;li&gt; Now ask it to write code from the design.
&lt;/li&gt;
&lt;li&gt; The Cursor agent writes the code.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsi87tw1uuak2775loxk0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsi87tw1uuak2775loxk0.png" alt="Figma MCP"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  3. &lt;a href="https://mcp.composio.dev/supabase" rel="noopener noreferrer"&gt;Supabase&lt;/a&gt; for managing the database from an IDE
&lt;/h2&gt;

&lt;p&gt;This is yet another popular use case of MCP servers. You can connect Cursor, Windsurf, or Claude Desktop with your Supabase database.&lt;/p&gt;

&lt;h3&gt;
  
  
  What can you do with it?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schema Exploration and Documentation:&lt;/strong&gt; Use the MCP server to read and explain your table structures, relationships, and constraints in plain language.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read-Only Queries for Insights:&lt;/strong&gt; Let the MCP generate SQL SELECT statements to retrieve and summarise data for quick analysis.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explain and Debug Queries:&lt;/strong&gt; Ask the MCP to interpret or optimise your existing SQL queries and outline the query execution plan in simpler terms.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate Migrations in a Dev/Staging Environment:&lt;/strong&gt; Have the MCP propose schema changes, then review and apply them in a safe environment before production.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How to use Supabase MCP
&lt;/h3&gt;

&lt;p&gt;For a managed Supabase server:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to &lt;a href="https://mcp.composio.dev/supabase" rel="noopener noreferrer"&gt;https://mcp.composio.dev/supabase&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Get the &lt;code&gt;npx&lt;/code&gt; command and run it in your terminal
&lt;/li&gt;
&lt;li&gt;Refresh your MCP-compatible host
&lt;/li&gt;
&lt;li&gt;Initiate a new connection
&lt;/li&gt;
&lt;li&gt;And start using it
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgl2fiakkjjtdgzsz49lv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgl2fiakkjjtdgzsz49lv.png" alt="Supabase MCP"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  4. &lt;a href="https://mcp.composio.dev/firecrawl" rel="noopener noreferrer"&gt;Firecrawl MCP&lt;/a&gt; for web-crawling
&lt;/h2&gt;

&lt;p&gt;It doesn’t matter if you’re a technical or non-technical person; this can be a great boon to your productivity. Firecrawl is a tool that crawls websites and extracts content for you. With Firecrawl MCP in your chat app, you can search websites and ask for any information.&lt;/p&gt;

&lt;h3&gt;
  
  
  What can you do with it?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Collect and summarise content from any website or blog across multiple pages.
&lt;/li&gt;
&lt;li&gt;Gather competitor research data (e.g., product pricing, feature comparisons, or marketing strategies).
&lt;/li&gt;
&lt;li&gt;Combine web-crawled material with other data sources (e.g., local files or databases) for more profound insights or reports.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How to use the FireCrawl MCP server with Composio
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Go to &lt;a href="https://mcp.composio.dev/firecrawl" rel="noopener noreferrer"&gt;https://mcp.composio.dev/firecrawl&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Get the &lt;code&gt;npx&lt;/code&gt; command and run it in your terminal
&lt;/li&gt;
&lt;li&gt;Refresh your MCP-compatible host
&lt;/li&gt;
&lt;li&gt;Initiate a new connection
&lt;/li&gt;
&lt;li&gt;And start using it
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxlg3e1phxpza8loonk7n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxlg3e1phxpza8loonk7n.png" alt="FireCrawl MCP"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5. &lt;a href="https://github.com/modelcontextprotocol/servers/tree/main/src/memory" rel="noopener noreferrer"&gt;Memory MCP Server&lt;/a&gt;: Persistent memory across chat
&lt;/h2&gt;

&lt;p&gt;If you use Claude a lot, you’d know how irritating it can be sometimes to switch to a different chat window and start the conversation from scratch. Well, memory servers ease this.&lt;/p&gt;

&lt;p&gt;This Knowledge Graph Memory Server tool allows Claude to maintain persistent memory across user conversations. It essentially creates a database of user information that can be accessed and updated over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to use the Memory Graph MCP server
&lt;/h3&gt;

&lt;p&gt;You can use this server with Claude. Here’s how you can do it. Go to Claude Desktop → Settings → Developer → Edit Config&lt;/p&gt;

&lt;p&gt;Open &lt;code&gt;claude_desktop_config.json&lt;/code&gt; and add the following for &lt;code&gt;npx&lt;/code&gt;-based servers. You'll need Node.js for it to work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"memory"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"@modelcontextprotocol/server-memory"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is also configurable with environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"memory"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"@modelcontextprotocol/server-memory"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"MEMORY_FILE_PATH"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/path/to/custom/memory.json"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  6. &lt;a href="https://github.com/ahujasid/blender-mcp" rel="noopener noreferrer"&gt;Blender MCP&lt;/a&gt;: For 3D modelling, scene creation, and manipulation
&lt;/h2&gt;

&lt;p&gt;Blender MCP is the hottest thing right now. You can connect Claude AI to it and interactively build 3D renders just by prompting.&lt;/p&gt;

&lt;p&gt;Here are some features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Two-way communication:&lt;/strong&gt; Establishes a direct connection between Claude AI and Blender through a socket-based server
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Object manipulation:&lt;/strong&gt; Lets Claude create, modify, and delete 3D objects directly in your Blender scenes
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Material control:&lt;/strong&gt; Enables Claude to apply and modify materials and colours to objects in your projects
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scene inspection:&lt;/strong&gt; Allows Claude to analyse and retrieve detailed information about your current Blender scene
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code execution:&lt;/strong&gt; Empowers Claude to run Python code in Blender, opening up endless customisation possibilities
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pi8ctfp6usfhsnbijij.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pi8ctfp6usfhsnbijij.png" alt="Blender MCP"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How to integrate Blender MCP into Claude
&lt;/h3&gt;

&lt;p&gt;You’ll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Blender 3.0 or newer
&lt;/li&gt;
&lt;li&gt;Python 3.10 or newer
&lt;/li&gt;
&lt;li&gt;The uv package manager
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If you're on Mac, install uv:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;uv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;On Windows&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;powershell&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-c&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"irm &amp;lt;https://astral.sh/uv/install.ps1&amp;gt; | iex"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then add uv to your &lt;code&gt;Path&lt;/code&gt; (replace the username with your own):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;set &lt;/span&gt;&lt;span class="nv"&gt;Path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;C:&lt;span class="se"&gt;\\&lt;/span&gt;Users&lt;span class="se"&gt;\\&lt;/span&gt;nntra&lt;span class="se"&gt;\\&lt;/span&gt;.local&lt;span class="se"&gt;\\&lt;/span&gt;bin&lt;span class="p"&gt;;&lt;/span&gt;%Path%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Claude for Desktop Integration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Go to Claude &amp;gt; Settings &amp;gt; Developer &amp;gt; Edit Config &amp;gt; &lt;code&gt;claude_desktop_config.json&lt;/code&gt; and include:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"blender"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uvx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"blender-mcp"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cursor integration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;uvx&lt;/code&gt; runs blender-mcp without installing it permanently. Go to Cursor Settings &amp;gt; MCP and paste this as a command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvx blender-mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Windows users, go to Settings &amp;gt; MCP &amp;gt; Add Server, add a new server with the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"blender"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cmd"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"/c"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"uvx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"blender-mcp"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  7. &lt;a href="https://github.com/modelcontextprotocol/servers/tree/main/src/filesystem" rel="noopener noreferrer"&gt;File Search&lt;/a&gt;: Working with files from the MCP hosts
&lt;/h2&gt;

&lt;p&gt;A local tool that lets you work with the file system from Claude Desktop. You can grab any file from your disk, feed it to Claude or Cursor, and work with it however you want.&lt;/p&gt;

&lt;p&gt;Here are some features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read/write files
&lt;/li&gt;
&lt;li&gt;Create/list/delete directories
&lt;/li&gt;
&lt;li&gt;Move files/directories
&lt;/li&gt;
&lt;li&gt;Search files
&lt;/li&gt;
&lt;li&gt;Get file metadata
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: The server will only allow operations within directories specified via &lt;code&gt;args&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to add a File Search MCP server
&lt;/h3&gt;

&lt;p&gt;Add this to &lt;code&gt;claude_desktop_config.json&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"filesystem"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"@modelcontextprotocol/server-filesystem"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"/Users/username/Desktop"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"/path/to/other/allowed/dir"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
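&lt;p&gt;The note above about allowed directories can be sketched in Python. This is an illustrative check, not the server’s actual code: &lt;code&gt;ALLOWED_DIRS&lt;/code&gt; and &lt;code&gt;is_allowed&lt;/code&gt; are hypothetical names mirroring the directories passed via &lt;code&gt;args&lt;/code&gt;:&lt;/p&gt;

```python
import os

# Hypothetical helper mirroring the filesystem server's allow-list rule:
# an operation is permitted only if the target path resolves inside one
# of the directories passed via "args" in the config above.
ALLOWED_DIRS = ["/Users/username/Desktop", "/path/to/other/allowed/dir"]

def is_allowed(path, allowed_dirs=ALLOWED_DIRS):
    """Return True if `path` resolves inside one of the allowed directories."""
    real = os.path.realpath(path)
    for root in allowed_dirs:
        root_real = os.path.realpath(root)
        # commonpath equals the root exactly when `real` sits under it
        if os.path.commonpath([root_real, real]) == root_real:
            return True
    return False

print(is_allowed("/Users/username/Desktop/notes.txt"))  # True
print(is_allowed("/etc/passwd"))                        # False
```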






&lt;h2&gt;
  
  
  8. &lt;a href="https://github.com/smithery-ai/mcp-obsidian" rel="noopener noreferrer"&gt;Obsidian MCP Server&lt;/a&gt;: Note-taking meets AI
&lt;/h2&gt;

&lt;p&gt;If you’re a frequent Obsidian user, you should connect it to Claude. You can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Access your knowledge base:&lt;/strong&gt; Claude can directly search, read, and reference all your Obsidian notes.
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Create and modify notes:&lt;/strong&gt; Ask Claude to draft or update notes in your vault.
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Query across documents:&lt;/strong&gt; Find connections between ideas across your entire knowledge system
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Extract insights:&lt;/strong&gt; Have Claude analyse patterns and relationships in your notes
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How to add the Obsidian MCP server
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt; Check that Node.js is installed, and install it if it isn’t.
&lt;/li&gt;
&lt;li&gt; Then run this command in the terminal:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @smithery/cli &lt;span class="nb"&gt;install &lt;/span&gt;mcp-obsidian &lt;span class="nt"&gt;--client&lt;/span&gt; claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  9. &lt;a href="https://mcp.composio.dev/linear" rel="noopener noreferrer"&gt;Linear MCP Server&lt;/a&gt;: For ticket management
&lt;/h2&gt;

&lt;p&gt;If you're managing projects with Linear, connecting it to Claude unlocks powerful capabilities. You can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; &lt;strong&gt;Issue management:&lt;/strong&gt; Create, update, and close tickets directly through conversations
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Project tracking:&lt;/strong&gt; Get status updates and summaries across your entire workspace
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Sprint planning:&lt;/strong&gt; Generate sprint plans based on backlog analysis
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Priority management:&lt;/strong&gt; Reorganise and prioritise issues through natural language
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How to add the Linear MCP server
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt; Go to &lt;a href="https://mcp.composio.dev/linear" rel="noopener noreferrer"&gt;https://mcp.composio.dev/linear&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; Generate the &lt;code&gt;npx&lt;/code&gt; command for Cursor
&lt;/li&gt;
&lt;li&gt; Run it in your terminal
&lt;/li&gt;
&lt;li&gt; Initiate a new connection.
&lt;/li&gt;
&lt;li&gt; Start working
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhcw345a29hpgv3lg1ir.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhcw345a29hpgv3lg1ir.png" alt="Linear MCP"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  10. &lt;a href="https://mcp.composio.dev/github" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;: Working with your remote repository
&lt;/h2&gt;

&lt;p&gt;Connecting GitHub to Claude transforms your development workflow. You can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; &lt;strong&gt;Code review:&lt;/strong&gt; Have Claude analyse pull requests and suggest improvements
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Commit management:&lt;/strong&gt; Search, analyse and create commits through conversation
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Issue tracking:&lt;/strong&gt; Create, update and resolve GitHub issues
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Repository exploration:&lt;/strong&gt; Navigate codebases and understand project structures
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How to add the GitHub MCP server
&lt;/h3&gt;

&lt;p&gt;The steps are the same as before:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to &lt;a href="https://mcp.composio.dev/github" rel="noopener noreferrer"&gt;https://mcp.composio.dev/github&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Generate the &lt;code&gt;npx&lt;/code&gt; command for Cursor/Claude
&lt;/li&gt;
&lt;li&gt;Run it in your terminal
&lt;/li&gt;
&lt;li&gt;Initiate a new connection with GitHub
&lt;/li&gt;
&lt;li&gt;Start working
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxzx81cda7pep7ob5ods1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxzx81cda7pep7ob5ods1.png" alt="GitHub MCP"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For a complete list of managed &lt;a href="https://mcp.composio.dev" rel="noopener noreferrer"&gt;MCP servers&lt;/a&gt;, check out Composio. You’ll find MCP servers for mainstream application services as well as niche apps you won’t find anywhere else.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>javascript</category>
      <category>ai</category>
      <category>productivity</category>
    </item>
    <item>
      <title>MCP vs Agent2Agent: Everything you need to know</title>
      <dc:creator>Sunil Kumar Dash</dc:creator>
      <pubDate>Wed, 23 Apr 2025 14:34:49 +0000</pubDate>
      <link>https://forem.com/composiodev/mcp-vs-agent2agent-everything-you-need-to-know-52ck</link>
      <guid>https://forem.com/composiodev/mcp-vs-agent2agent-everything-you-need-to-know-52ck</guid>
      <description>&lt;p&gt;Am I a bit late to talk about MCP and A2A protocols? I hope not! Both have been all over the internet, and they are mind-blowing! A race is on right now, and nobody wants to be left behind in shipping new models and tools.  &lt;/p&gt;

&lt;p&gt;Anthropic released MCP (Model Context Protocol) for agents, which got good community traction. Recently, we saw OpenAI’s integration with MCP. MCP defines how an agent communicates with APIs, making multi-tool calling easier.  &lt;/p&gt;

&lt;p&gt;Now, Google has released an A2A (Agent2Agent) protocol to streamline agent communication. In short, A2A standardises agent-to-agent communication while MCP standardises agent-to-tools communication.  &lt;/p&gt;

&lt;p&gt;So, yes, they are not competing but &lt;strong&gt;complementing&lt;/strong&gt; each other. Google has extended support for MCP in the &lt;a href="https://google.github.io/adk-docs/tools/mcp-tools/" rel="noopener noreferrer"&gt;Agents Development Kit (ADK)&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiq5gffn8vpz89noe2dhn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiq5gffn8vpz89noe2dhn.png" alt="Screenshot 2025-04-21" width="800" height="153"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;This blog post explains how they work together to standardise building production-ready AI agents.  &lt;/p&gt;

&lt;p&gt;Let’s first discuss MCP and then proceed with the A2A protocol to see how both work.  &lt;/p&gt;




&lt;h2&gt;
  
  
  Understanding MCPs – The Role of the Model Context Protocol
&lt;/h2&gt;

&lt;p&gt;MCP stands for &lt;strong&gt;Model Context Protocol&lt;/strong&gt;, an open standard developed by Anthropic. It defines a structured and efficient way for applications to provide external context to large language models (LLMs) like Claude and GPT. Think of it like USB for AI — it lets AI models connect to external tools and data sources in a standardised way.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXe37yE9stEy71LOC4sJfdNWFK9iJ0kDIjyCQ8sq6eWrybZhCUze4qBaDDtqKz_zX68L6JsDXb7H0LRsSoq4OE58pC3mSYzBSTvRMN4M5DrVKC1gGRgLQC2nXtx4cWoKU98NVXLmRw%3Fkey%3Dz8PBt65nApq2h4WBsLfKiXrI" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXe37yE9stEy71LOC4sJfdNWFK9iJ0kDIjyCQ8sq6eWrybZhCUze4qBaDDtqKz_zX68L6JsDXb7H0LRsSoq4OE58pC3mSYzBSTvRMN4M5DrVKC1gGRgLQC2nXtx4cWoKU98NVXLmRw%3Fkey%3Dz8PBt65nApq2h4WBsLfKiXrI" alt="MCP diagram" width="800" height="400"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What’s the Core Problem MCP Solves?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Every app exposes its tools differently; MCP solves this integration problem through three critical components:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Client:&lt;/strong&gt; Maintains a 1-to-1 connection with servers, handles all LLM routing and orchestration, and negotiates capabilities.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server:&lt;/strong&gt; API services, databases, and logs that LLMs can access to complete tasks. Servers expose tools that LLMs use.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protocol:&lt;/strong&gt; The core standard governing client and server communication.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For an in-depth guide on MCP, its architecture, and internal workings, see &lt;strong&gt;&lt;a href="https://composio.dev/blog/what-is-model-context-protocol-mcp-explained/" rel="noopener noreferrer"&gt;Model Context Protocol (MCP): Explained&lt;/a&gt;&lt;/strong&gt;.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXfy5syKIV8gC3woAOLyg4GzoFtzbrBO95g5mL4Oxs1ZGwoA6HaOIz2fy-R9RzTT-5ClPqyLMAJF98Ik0M0pb7X4Qnkjf7Bj_LDU5JADfR5jga3xYSSBkxwLZr46BS0N-GjEuRdr8g%3Fkey%3Dz8PBt65nApq2h4WBsLfKiXrI" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXfy5syKIV8gC3woAOLyg4GzoFtzbrBO95g5mL4Oxs1ZGwoA6HaOIz2fy-R9RzTT-5ClPqyLMAJF98Ik0M0pb7X4Qnkjf7Bj_LDU5JADfR5jga3xYSSBkxwLZr46BS0N-GjEuRdr8g%3Fkey%3Dz8PBt65nApq2h4WBsLfKiXrI" alt="MCP architecture" width="800" height="400"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;In short, MCP lets client developers build apps (Cursor, Windsurf, etc.) and server developers build API servers without worrying about each other’s implementation. Any MCP client can connect to any MCP server and vice-versa.  &lt;/p&gt;

&lt;p&gt;Each tool implementation is different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Different field names (&lt;code&gt;start_time&lt;/code&gt; vs &lt;code&gt;event_time&lt;/code&gt;)
&lt;/li&gt;
&lt;li&gt;Different auth schemes (OAuth, API key, JWT, etc.)
&lt;/li&gt;
&lt;li&gt;Different error formats
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MCP &lt;strong&gt;standardises&lt;/strong&gt; how servers are built. You still write integration logic for each app (or use &lt;a href="https://composio.dev" rel="noopener noreferrer"&gt;Composio&lt;/a&gt;), but MCP ensures any server can plug into any client. That makes life easier for millions of developers and abstracts away tool-by-tool quirks.  &lt;/p&gt;

&lt;p&gt;You can think of it like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User:&lt;/strong&gt; Adds a &lt;a href="https://mcp.composio.dev/googlecalendar/tinkling-faint-car-f6g1zk" rel="noopener noreferrer"&gt;Google Calendar MCP server&lt;/a&gt; to Cursor IDE.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client:&lt;/strong&gt; Fetches the server’s tools and injects them into the LLM context.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User:&lt;/strong&gt; “Schedule a team sync on Thursday at 3 PM.”
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Client:&lt;/strong&gt; LLM decides it needs to call a tool, fills parameters, and executes (after auth).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calendar Server:&lt;/strong&gt; Creates the meeting.
&lt;/li&gt;
&lt;/ul&gt;
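&lt;p&gt;To make the flow above concrete, here is a minimal sketch of the JSON-RPC message an MCP client might send for the calendar step. The &lt;code&gt;tools/call&lt;/code&gt; framing follows the MCP spec; the tool name &lt;code&gt;create_event&lt;/code&gt; and its arguments are hypothetical:&lt;/p&gt;

```python
import json

# Sketch of the JSON-RPC request an MCP client sends when the LLM decides
# to call a tool. The "tools/call" method is the MCP spec's shape; the tool
# name "create_event" and its arguments are illustrative.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "create_event",
        "arguments": {
            "title": "Team sync",
            "start_time": "2025-05-01T15:00:00",
        },
    },
}

# Serialize for the wire (STDIO or HTTP transport)
wire = json.dumps(request)
print(wire)
```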

&lt;p&gt;Instead of wiring services with brittle code, you now get a &lt;strong&gt;clean, modular interface&lt;/strong&gt;.  &lt;/p&gt;

&lt;p&gt;Despite its merits, MCP can be tough in production — security, reliability, and multiple servers get hairy. That’s why we at &lt;strong&gt;&lt;a href="https://mcp.composio.dev" rel="noopener noreferrer"&gt;Composio&lt;/a&gt;&lt;/strong&gt; are building robust MCP infrastructure for your AI workflows.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Agent2Agent Protocol by Google
&lt;/h2&gt;

&lt;p&gt;Google introduced the &lt;strong&gt;Agent-to-Agent Protocol (A2A)&lt;/strong&gt;, inspired by MCP. Where MCP focuses on agent-to-server calls, A2A focuses on agent-to-agent interoperability.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwa4frxbknjsb9a5co0c3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwa4frxbknjsb9a5co0c3.png" alt="A2A diagram" width="800" height="471"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Imagine a travel assistant planning a trip from Delhi to Mumbai. It can delegate to a train-booking agent, a hotel-booking agent, and a cab-service agent:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Plan a full trip from Delhi to Mumbai, book my train, find a hotel near the station, and arrange local transport.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Behind the scenes, A2A forms a mini-team of agents, each handling part of the job. That’s modular, connected, and smarter.  &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;A2A Design Principles&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A2A enables flexible, secure communication between autonomous agents, regardless of vendor or ecosystem.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key advantages&lt;/strong&gt;  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;True agentic behaviour&lt;/strong&gt; – independent cooperation without shared state.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Familiar tech stack&lt;/strong&gt; – HTTP + SSE + JSON-RPC.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise-grade security&lt;/strong&gt; – built-in auth/authz like OpenAPI.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Short- or long-running tasks&lt;/strong&gt; – real-time progress, state tracking, and human-in-the-loop.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modality-agnostic&lt;/strong&gt; – text, audio, video, etc.
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  How A2A Works
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;What happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Capability discovery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agents publish &lt;em&gt;Agent Cards&lt;/em&gt; (JSON) describing skills, modalities, constraints.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Task lifecycle&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A client agent delegates a &lt;em&gt;task&lt;/em&gt;; the remote agent updates status until producing an &lt;em&gt;artefact&lt;/em&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Collaboration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agents exchange messages, artefacts, and context.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UX negotiation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Messages use typed &lt;em&gt;parts&lt;/em&gt; (text, image, chart, form, …) tailored to the client’s UI.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXdu4KSAFJ_4N9L4tU3xOzF8OgC9vu5I9ktJMElU8cXZUfORUjr7QEJtNIMgmoNiGUAYGZwEUDBGbNqaK3L_pdfsVfdZx-8kia3RVrFfdkeLRLqTHLfI9VlEPBvLV26uTgOHEY1D0w%3Fkey%3Dz8PBt65nApq2h4WBsLfKiXrI" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXdu4KSAFJ_4N9L4tU3xOzF8OgC9vu5I9ktJMElU8cXZUfORUjr7QEJtNIMgmoNiGUAYGZwEUDBGbNqaK3L_pdfsVfdZx-8kia3RVrFfdkeLRLqTHLfI9VlEPBvLV26uTgOHEY1D0w%3Fkey%3Dz8PBt65nApq2h4WBsLfKiXrI" alt="A2A lifecycle" width="800" height="400"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Concepts of A2A Protocol&lt;/strong&gt;
&lt;/h3&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Multi-Agent Collaboration&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Agents share tasks, results, and work across ecosystems.
&lt;/li&gt;
&lt;li&gt;E.g., a recruiting agent chatting with a company’s hiring agent, or a delivery agent coordinating with restaurants.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Open &amp;amp; Extensible&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Open protocol with 50+ contributors (Atlassian, Box, LangChain, PayPal, etc.).
&lt;/li&gt;
&lt;li&gt;Uses standards like &lt;strong&gt;JSON-RPC&lt;/strong&gt; and Service/Event descriptions.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Secure by Default&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Auth / authz via OpenID Connect.
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.well-known/agent.json&lt;/code&gt; discovery endpoints.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Working of A2A – Examples&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Architecture Example&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;Three agents in a productivity suite:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Calendar Agent&lt;/strong&gt; – hosted server, pulls availability via MCP.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document Agent&lt;/strong&gt; – fetches documents/notes via MCP.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assistant Agent&lt;/strong&gt; – user-facing LLM delegating tasks.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Flow&lt;/strong&gt;  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Assistant → Calendar: check availability.
&lt;/li&gt;
&lt;li&gt;Assistant → Document: fetch &amp;amp; summarise doc.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So &lt;strong&gt;A2A&lt;/strong&gt; handles agent-to-agent chat, while &lt;strong&gt;MCP&lt;/strong&gt; bridges agents to apps.  &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Agent Discovery (Inspired by OpenID Connect)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Agents advertise at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;yourdomain.com/.well-known/agent.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It lists name, description, capabilities, sample queries, modalities, etc., so newcomers can discover and interact dynamically.  &lt;/p&gt;
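&lt;p&gt;A minimal discovery sketch: fetch the card from the well-known endpoint and read what the agent advertises. The &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;skills&lt;/code&gt; fields here are illustrative, not a guaranteed schema:&lt;/p&gt;

```python
import json
import urllib.request

def fetch_agent_card(domain):
    """Download <domain>/.well-known/agent.json and parse it."""
    url = f"https://{domain}/.well-known/agent.json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def skill_names(card):
    """Extract the names of the skills an agent card advertises
    (assumes an illustrative "skills" list of {"name": ...} objects)."""
    return [skill["name"] for skill in card.get("skills", [])]

# Parsing a sample card (no network needed):
sample = json.loads('{"name": "Dining Assistant", "skills": [{"name": "Table Booking"}]}')
print(skill_names(sample))  # ['Table Booking']
```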

&lt;h2&gt;
  
  
  Agent2Agent vs MCP
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;MCP&lt;/th&gt;
&lt;th&gt;Agent2Agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Communication&lt;/td&gt;
&lt;td&gt;Agent ↔ External APIs&lt;/td&gt;
&lt;td&gt;Agent ↔ Agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Goal&lt;/td&gt;
&lt;td&gt;API integration&lt;/td&gt;
&lt;td&gt;Collaboration &amp;amp; interoperability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Layer&lt;/td&gt;
&lt;td&gt;Backend&lt;/td&gt;
&lt;td&gt;Mid-layer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tech&lt;/td&gt;
&lt;td&gt;REST/JSON&lt;/td&gt;
&lt;td&gt;JSON-RPC / events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inspired by&lt;/td&gt;
&lt;td&gt;LSP&lt;/td&gt;
&lt;td&gt;OpenID Connect&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;MCP provides the tools each agent uses, while A2A facilitates the collaboration between agents. Together they cover both the execution of individual tasks and the coordination of complex, multi-step processes.&lt;/p&gt;

&lt;p&gt;Both Anthropic’s MCP and Google’s A2A facilitate interaction between AI systems and external components, but they cater to different scenarios and architectures.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Anthropic MCP&lt;/th&gt;
&lt;th&gt;Google A2A&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Objective&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Link one model to external tools&lt;/td&gt;
&lt;td&gt;Coordinate autonomous agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best fit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Secure enterprise data access&lt;/td&gt;
&lt;td&gt;Distributed B2B coordination&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Protocol&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;STDIO/HTTP + SSE&lt;/td&gt;
&lt;td&gt;HTTP/S + webhooks/SSE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Discovery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual server settings&lt;/td&gt;
&lt;td&gt;Dynamic &lt;em&gt;Agent Cards&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pattern&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Top-down calls&lt;/td&gt;
&lt;td&gt;Peer collaboration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cross-boundary focus&lt;/td&gt;
&lt;td&gt;Same, multi-agent scope&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Workflows&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Simple request-response&lt;/td&gt;
&lt;td&gt;Long-running, stateful&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  1. Communication
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;MCP: structured schemas&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In MCP (Model Context Protocol), the interaction is explicit and schema-driven.
&lt;/li&gt;
&lt;li&gt;The assistant knows exactly which tool to call, what arguments to pass, and in what format.
&lt;/li&gt;
&lt;li&gt;Flow: AI Assistant → tool with structured input → tool returns raw result.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MCP flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI sends: &lt;code&gt;get_weather_forecast(Tokyo, 2025-04-22)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Tool returns: “Sunny, 22°C”
&lt;/li&gt;
&lt;li&gt;AI simply displays the result.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;A2A: natural language&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A2A (Agent-to-Agent) is far more conversational, using natural-language tasks.
&lt;/li&gt;
&lt;li&gt;Tasks are expressed like real user queries, and agents decide internally how to interpret them.
&lt;/li&gt;
&lt;li&gt;Flow: User Agent → task in plain English → target agent processes → responds naturally.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A2A flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User says: “Can you tell me the weather in Tokyo on April 22nd and the current $NVDA price?”
&lt;/li&gt;
&lt;li&gt;The agent routes the request to the appropriate Finance/Weather agent.
&lt;/li&gt;
&lt;li&gt;Response might be: “Sure! The forecast for Tokyo on April 22nd is sunny with a high of 22°C, and $NVDA is currently at $101.42, down 0.064%.”
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  2. Task Management
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;MCP: single-stage execution&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MCP handles tasks like a classic function call.
&lt;/li&gt;
&lt;li&gt;You call the function (or “tool”) and immediately get a response: either success with the result or failure (error/exception).
&lt;/li&gt;
&lt;li&gt;The whole process is immediate and atomic: one shot, one answer.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;A2A: multi-stage lifecycle&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A2A treats tasks like long-running jobs with multiple possible states:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pending&lt;/code&gt; → waiting to start
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;running&lt;/code&gt; → work in progress (can even provide partial results!)
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;completed&lt;/code&gt; → final result ready
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;failed&lt;/code&gt; → something went wrong
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;You can check back anytime to see progress, grab partial data, or wait for the full result.
&lt;/li&gt;
&lt;/ul&gt;
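&lt;p&gt;The lifecycle above can be sketched as a small state machine. The transition rules here are an illustrative simplification, not the protocol’s exact specification:&lt;/p&gt;

```python
from enum import Enum

# Sketch of the A2A task lifecycle described above. State names mirror the
# list in the text; the transition table is an illustrative simplification.
class TaskState(Enum):
    PENDING = "pending"      # waiting to start
    RUNNING = "running"      # work in progress, may yield partial results
    COMPLETED = "completed"  # final result ready
    FAILED = "failed"        # something went wrong

VALID_TRANSITIONS = {
    TaskState.PENDING: {TaskState.RUNNING, TaskState.FAILED},
    TaskState.RUNNING: {TaskState.COMPLETED, TaskState.FAILED},
    TaskState.COMPLETED: set(),  # terminal
    TaskState.FAILED: set(),     # terminal
}

def advance(state, new_state):
    """Move a task to `new_state`, rejecting impossible jumps."""
    if new_state not in VALID_TRANSITIONS[state]:
        raise ValueError(f"cannot go from {state.value} to {new_state.value}")
    return new_state

state = TaskState.PENDING
state = advance(state, TaskState.RUNNING)
state = advance(state, TaskState.COMPLETED)
print(state.value)  # completed
```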
&lt;h3&gt;
  
  
  3. Capability Specification
&lt;/h3&gt;

&lt;p&gt;MCP: Low-Level, Instruction-Based&lt;br&gt;
MCP capabilities are described with very strict schemas, usually in JSON Schema format. They are about precision and control, like telling a machine exactly what to do and how to do it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"book_table"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Books a table at a restaurant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="nl"&gt;"inputSchema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="nl"&gt;"restaurant"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"format"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"date"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="nl"&gt;"time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"pattern"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;d{2}:&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;d{2}$"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="nl"&gt;"party_size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"integer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minimum"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"restaurant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"time"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"party_size"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
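&lt;p&gt;As a sketch of what a client or server does with such a schema, here is a minimal hand-rolled validator for the &lt;code&gt;book_table&lt;/code&gt; input. A real implementation would use a full JSON Schema validator; this just mirrors the &lt;code&gt;required&lt;/code&gt;, &lt;code&gt;pattern&lt;/code&gt;, and &lt;code&gt;minimum&lt;/code&gt; rules above:&lt;/p&gt;

```python
import re

# Hand-rolled checks mirroring the book_table JSON Schema above:
# required fields, HH:MM time pattern, and party_size minimum of 1.
REQUIRED = ["restaurant", "date", "time", "party_size"]
TIME_PATTERN = re.compile(r"^\d{2}:\d{2}$")

def validate_book_table(args):
    """Return 'ok' or a description of the first validation failure."""
    missing = [k for k in REQUIRED if k not in args]
    if missing:
        return f"missing fields: {missing}"
    if not TIME_PATTERN.match(args["time"]):
        return "time must match HH:MM"
    if not isinstance(args["party_size"], int) or args["party_size"] < 1:
        return "party_size must be an integer of at least 1"
    return "ok"

print(validate_book_table({"restaurant": "Trattoria", "date": "2025-05-02",
                           "time": "19:30", "party_size": 4}))  # ok
print(validate_book_table({"restaurant": "Trattoria", "date": "2025-05-02",
                           "time": "7pm", "party_size": 4}))    # time must match HH:MM
```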



&lt;h3&gt;
  
  
  A2A: High-Level, Goal-Oriented
&lt;/h3&gt;

&lt;p&gt;In contrast, A2A uses an Agent Card to describe capabilities in terms of goals, roles, and expertise. It’s like explaining what someone is good at and trusting them to handle it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;agent_card = AgentCard(
    id="restaurant-agent",
    name="Dining Assistant",
    description="Helps users find and book tables at restaurants.",
    agent_skills=[
        AgentSkill(
            id="table_booking",
            name="Table Booking",
            description="Can search restaurants and book tables as per user preferences.",
            examples=[
                "Book a table for 4 at an Italian place this Friday night.",
                "Find a quiet restaurant near downtown and reserve for two people.",
            ],
        )
    ],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;• MCP lets you add skills (API services, databases, records, etc.) to your agents.&lt;br&gt;
• A2A gives you flexibility, judgment, and delegation power. Think of it as a team of thoughtful coworkers.&lt;br&gt;
• They’re like pairing an engineer (MCP) with a project manager (A2A): one does exact work; the other handles the chaos.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Use MCP with A2A
&lt;/h2&gt;

&lt;p&gt;One way to integrate MCP servers into A2A agents is with Google’s &lt;strong&gt;Agent Development Kit (ADK)&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Install the ADK
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;google-adk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Import the Required Modules
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ./adk_agent_samples/mcp_agent/agent.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.agents.llm_agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LlmAgent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.runners&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.sessions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemorySessionService&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.artifacts.in_memory_artifact_service&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemoryArtifactService&lt;/span&gt;  &lt;span class="c1"&gt;# Optional
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.tools.mcp_tool.mcp_toolset&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;MCPToolset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;SseServerParams&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;StdioServerParameters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load environment variables from a .env file in the parent directory
&lt;/span&gt;&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.env&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
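&lt;p&gt;The &lt;code&gt;load_dotenv(".env")&lt;/code&gt; call above expects a &lt;code&gt;.env&lt;/code&gt; file next to the script. A minimal sketch, assuming you are calling the Gemini API directly (the variable names follow ADK’s convention; adjust if you authenticate through Vertex AI):&lt;/p&gt;

```
# .env — assumed variable names per ADK convention
GOOGLE_GENAI_USE_VERTEXAI=FALSE
GOOGLE_API_KEY=your-gemini-api-key
```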



&lt;h3&gt;
  
  
  Configure the MCP Server and Fetch Tools
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# --- Step 1: import tools from an MCP server (HTTP SSE) ---
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_tools_async&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Gets tools from the Gmail MCP server.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Attempting to connect to MCP Filesystem server…&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exit_stack&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;MCPToolset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;connection_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;SseServerParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://mcp.composio.dev/gmail/tinkling-faint-car-f6g1zk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MCP Toolset created successfully.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exit_stack&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; In the example above, we use the HTTP SSE endpoint for the Gmail server at &lt;a href="https://mcp.composio.dev" rel="noopener noreferrer"&gt;&lt;strong&gt;mcp.composio.dev&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  For a STDIO-based tool
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_tools_async&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Gets tools from a local MCP filesystem server (STDIO).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Attempting to connect to MCP Filesystem server…&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exit_stack&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;MCPToolset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;connection_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;StdioServerParameters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;npx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@modelcontextprotocol/server-filesystem&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/your/folder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MCP Toolset created successfully.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exit_stack&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Create the Agent
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_agent_async&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Creates an ADK agent equipped with tools from the MCP server.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exit_stack&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;get_tools_async&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fetched &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tools from MCP server.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;root_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LlmAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.0-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# Adjust if needed
&lt;/span&gt;        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maps_assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Help the user with mapping and directions using the available tools.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;root_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exit_stack&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Define &lt;code&gt;main&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;async_main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;session_service&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemorySessionService&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;artifacts_service&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemoryArtifactService&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Optional
&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp_maps_app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_maps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# TODO: Use specific addresses for reliable results with this server
&lt;/span&gt;    &lt;span class="n"&gt;query&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the route from 1600 Amphitheatre Pkwy to 1165 Borregas Ave&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User Query: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;

    &lt;span class="n"&gt;root_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exit_stack&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;get_agent_async&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;runner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp_maps_app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;root_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;artifact_service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;artifacts_service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Optional
&lt;/span&gt;        &lt;span class="n"&gt;session_service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Running agent…&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;events_async&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;new_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events_async&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Event received: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Closing MCP server connection…&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;exit_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aclose&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cleanup complete.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;async_main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An error occurred: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Run the Application
&lt;/h3&gt;

&lt;p&gt;Execute the script (for example, &lt;code&gt;python ./adk_agent_samples/mcp_agent/agent.py&lt;/code&gt;) and watch your A2A agent automatically call MCP-hosted tools in response to user queries.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;MCP makes it easier for agents to communicate with the tools that wrap external application services, and Agent2Agent makes it easier for multiple agents to communicate and collaborate. Both are steps toward standardising agent development, and it will be interesting to see how they transform the agentic ecosystem.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>productivity</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Notes on Llama 4: The Hits, the Misses, and the Disasters</title>
      <dc:creator>Sunil Kumar Dash</dc:creator>
      <pubDate>Fri, 11 Apr 2025 12:59:16 +0000</pubDate>
      <link>https://forem.com/composiodev/notes-on-llama-4-the-hits-the-misses-and-the-disasters-18np</link>
      <guid>https://forem.com/composiodev/notes-on-llama-4-the-hits-the-misses-and-the-disasters-18np</guid>
      <description>&lt;p&gt;The Llama 4 is here, and this time, the Llama family has three different models: Llama 4 Scout, Maverick, and Behemoth. While the former two are available on multiple platforms, the Behemoth, as per Zuck and Meta, is still in training. And it is reportedly beating the current state-of-the-art.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5qib818bhk2chd8lakek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5qib818bhk2chd8lakek.png" alt="image-1" width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is no improvement on the licensing front: the models ship with the same Llama license, which bars any company with more than 700 million monthly active users from using them without Meta’s permission and also keeps them unavailable to Europeans.&lt;/p&gt;

&lt;p&gt;It’s baffling that they persist with this restrictive license when &lt;a href="https://composio.dev/deepseek-v3-0324-the-sonnet-3-5-at-home/index.html" rel="noopener noreferrer"&gt;Deepseek v3 0324&lt;/a&gt; and R1 are readily available under MIT. A Llama license in 2025 feels criminal.&lt;/p&gt;

&lt;p&gt;But anyway, unlike Llama 3, Meta has moved on from dense models to a mixture of experts (MoE). Both models are sparse MoEs: Scout has 17B active and 109B total parameters with 16 experts, and Maverick has 17B active and 400B total parameters with 128 experts.&lt;/p&gt;
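&lt;p&gt;The practical upshot of MoE sparsity is that per-token compute scales with the active parameters while memory still scales with the total. A quick back-of-the-envelope with the numbers above:&lt;/p&gt;

```python
# Active-parameter fraction for the two Llama 4 MoE models,
# using the figures quoted above (parameters in billions).
models = {
    "Scout": {"active": 17, "total": 109, "experts": 16},
    "Maverick": {"active": 17, "total": 400, "experts": 128},
}

for name, m in models.items():
    frac = m["active"] / m["total"]
    print(f"{name}: {m['active']}B active of {m['total']}B total "
          f"({frac:.0%} of weights used per token, {m['experts']} experts)")
```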

&lt;p&gt;There are no local models this time; everyone expected Meta to bring smaller dense models (3B, 8B, 32B, 70B) like last time. But hey, we still get two open-weight models, unless you’re in Europe.&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  The pitch
&lt;/li&gt;
&lt;li&gt;  The Hits

&lt;ul&gt;
&lt;li&gt;  10M in context length
&lt;/li&gt;
&lt;li&gt;  Natively multi-modal
&lt;/li&gt;
&lt;li&gt;  Teacher-student distillation
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  The Misses

&lt;ul&gt;
&lt;li&gt;  Not enough
&lt;/li&gt;
&lt;li&gt;  Confused positioning
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  The Disasters

&lt;ul&gt;
&lt;li&gt;  Not really 10M
&lt;/li&gt;
&lt;li&gt;  Benchmark blunders
&lt;/li&gt;
&lt;li&gt;  The tokenizer terror
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  There’s still hope
&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Meta rushed the release of Llama 4 Maverick and Scout. (Could it be the tariffs?)&lt;/li&gt;
&lt;li&gt;  Scout has a humongous 10M context length; Maverick has 1M.&lt;/li&gt;
&lt;li&gt;  Both models are distilled from Llama 4 Behemoth, a 2T-parameter model.&lt;/li&gt;
&lt;li&gt;  The models are severely underwhelming on all fronts: code gen, writing, and everyday conversations.&lt;/li&gt;
&lt;li&gt;  The models tend to output verbose responses (yapping, they call it).&lt;/li&gt;
&lt;li&gt;  The models are so bad Meta had to fudge benchmarks.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Pitch
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://x.com/AIatMeta/status/1908598456144531660" rel="noopener noreferrer"&gt;Meta&lt;/a&gt;: Today is the start of a new era of natively multimodal AI innovation.&lt;/p&gt;

&lt;p&gt;Today, we’re introducing the first Llama 4 models: Llama 4 Scout and Llama 4 Maverick — our most advanced models yet and the best in their class for multimodality.&lt;/p&gt;

&lt;p&gt;Llama 4 Scout&lt;/p&gt;

&lt;p&gt;• 17B-active-parameter model with 16 experts.&lt;br&gt;
• Industry-leading context window of 10M tokens.&lt;br&gt;
• Outperforms Gemma 3, Gemini 2.0 Flash-Lite and Mistral 3.1 across a broad range of widely accepted benchmarks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Llama 4 Maverick&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;• 17B-active-parameter model with 128 experts.&lt;br&gt;
• Best-in-class image grounding with the ability to align user prompts with relevant visual concepts and anchor model responses to regions in the image.&lt;br&gt;
• Outperforms GPT-4o and Gemini 2.0 Flash across a broad range of widely accepted benchmarks.&lt;br&gt;
• Achieves comparable results to DeepSeek v3 on reasoning and coding — at half the active parameters.&lt;br&gt;
• Unparalleled performance-to-cost ratio with a chat version scoring ELO of 1417 on LMArena.&lt;/p&gt;

&lt;p&gt;These models are our best yet thanks to distillation from Llama 4 Behemoth, our most powerful model yet. Llama 4 Behemoth is still in training and is currently seeing results that outperform GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM-focused benchmarks. We’re excited to share more details about it even while it’s still in flight.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It sure got the big model vibes, though no local models will sting a lot of Llama enthusiasts.&lt;/p&gt;

&lt;p&gt;The Llama 4 pre-training recap by &lt;a href="https://x.com/eliebakouch/status/1908608627029455098" rel="noopener noreferrer"&gt;Elie&lt;/a&gt; from Hugging Face:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;MetaP: MuP inspired method to set per layers hyperparameters that transfer across batch size, width, depth and training token (huge)&lt;/p&gt;

&lt;p&gt;MoE with 16E and 128E&lt;/p&gt;

&lt;p&gt;QK Norm with no learnable parameter (and the 128E have no QK Norm it seems)&lt;/p&gt;

&lt;p&gt;FP8 Training&lt;/p&gt;

&lt;p&gt;No rope on interleaved attention layers, but i don’t see any sliding window attention? (one of the key receipe for the 10M context they said)&lt;/p&gt;

&lt;p&gt;temperature tuning on the no rope layers&lt;/p&gt;

&lt;p&gt;Native multimodal training&lt;/p&gt;

&lt;p&gt;Mixture with 30T token (text, images and video), training budget of 40T for 128E and 22T for the 16E&lt;/p&gt;

&lt;p&gt;No details on optimizer…&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Hits
&lt;/h2&gt;

&lt;h3&gt;
  
  
  10M in context length
&lt;/h3&gt;

&lt;p&gt;The models have their own highs. The most prominent is the 10 million-token context length of the Scout model, a first among both open-source and proprietary LLMs. Maverick comes with a one million context length.&lt;/p&gt;

&lt;p&gt;Ideally, you can put your entire codebase in context and get the LLM to work on it. However, I don’t think it will be the same for Behemoth, which would have been much better at coding than Llama 4 Scout.&lt;/p&gt;
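&lt;p&gt;For a sense of scale, here is a rough back-of-the-envelope. The 4-characters-per-token and 3,000-characters-per-page figures are common approximations, not measurements:&lt;/p&gt;

```python
# Rough sense of what a 10M-token context window holds.
# Assumes ~4 characters per token and ~3,000 characters per page,
# both common rules of thumb; actual tokenization varies.
CONTEXT_TOKENS = 10_000_000
CHARS_PER_TOKEN = 4       # assumption
CHARS_PER_PAGE = 3_000    # assumption (~500 words/page)

chars = CONTEXT_TOKENS * CHARS_PER_TOKEN
print(f"~{chars / 1e6:.0f} million characters")        # ~40 million characters
print(f"~{chars // CHARS_PER_PAGE:,} pages of text")   # ~13,333 pages of text
```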

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmnz3sjtjaldmfnhew7l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmnz3sjtjaldmfnhew7l.png" alt="Screenshot-2025-04-08-at-6.20.00-PM" width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is huge. Until now, only Google had cracked the long context window; now Meta has joined it. The needle-in-the-haystack results are also promising for the Llama 4 models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9c310bx2p9u6oetr3gi7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9c310bx2p9u6oetr3gi7.png" alt="Screenshot-2025-04-08-at-7.09.25-PM" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Natively Multi-modal
&lt;/h3&gt;

&lt;p&gt;The other plus is that the models are natively multi-modal, understanding text, images, audio, and video, though the output modality is limited to text.&lt;/p&gt;

&lt;h3&gt;
  
  
  Teacher-Student Distillation
&lt;/h3&gt;

&lt;p&gt;However, the more interesting part is the teacher-student distillation from Llama 4 Behemoth, a first for Meta.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Meta: We codistilled the Llama 4 Maverick model from Llama 4 Behemoth as a teacher model, resulting in substantial quality improvements across end task evaluation metrics. We developed a novel distillation loss function that dynamically weights the soft and hard targets through training.  Codistillation from Llama 4 Behemoth during pre-training amortizes the computational cost of resource-intensive forward passes needed to compute the targets for distillation for the majority of the training data used in student training. For additional new data incorporated in student training, we ran forward passes on the Behemoth model to create distillation targets.&lt;/p&gt;
&lt;/blockquote&gt;
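&lt;p&gt;Meta doesn't publish the loss itself, but the idea of "dynamically weighting the soft and hard targets" can be sketched generically: a KL term against the teacher's temperature-softened distribution plus a cross-entropy term against the ground-truth label, blended by a weight that can change over training. A minimal illustration (function names and the fixed blend are my assumptions, not Meta's actual loss):&lt;/p&gt;

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label, alpha, temperature=2.0):
    # Soft target: KL(teacher || student) at temperature T
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student))
    # Hard target: cross-entropy against the ground-truth label
    hard = -math.log(softmax(student_logits)[hard_label])
    # alpha is the (here static, in practice dynamic) soft/hard blend weight
    return alpha * soft + (1.0 - alpha) * hard
```

&lt;p&gt;The "amortization" Meta describes is about cost: the teacher's forward passes to produce &lt;code&gt;teacher_logits&lt;/code&gt; are precomputed once for most of the training data rather than recomputed per step.&lt;/p&gt;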

&lt;p&gt;Some notes from &lt;a href="https://x.com/_xjdr/status/1909278813852508486" rel="noopener noreferrer"&gt;xjdr&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Scout is best at summarization and function calling. exactly what you want from a cheap long ctx model. this is going to be a workhorse in coding flows and RAG applications. the single shot ICL recall is very very good.&lt;/li&gt;
&lt;li&gt;Maverick was built for replacing developers and doing agenic / tool calling work. it is very consistent in instruction following, very long context ICL and parallel multi tool calls. this is EXACTLY the model and capabilities i want in my coder style flows. it is not creative, i have V3 and R1 for that tho. multimodal is very good at OCR and charts and graphs outperforming both 4o and qwen 2.5 VL 72 in my typical tests. the only thing i haven’t tested is computer use but i doubt it will beat sonnet or qwen at that as both models were explicitly trained for it. The output is kind of bland (hence the constant 4o comparisons) with little personality, which is totally fine. this is a professional tool built for professional work (testing it on RP or the like will lead to terrible results). Im not sure what more you could ask for in a agent focused model.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Misses
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Not enough
&lt;/h3&gt;

&lt;p&gt;Much of the initial excitement faded as users began experiencing the models firsthand. Expectations for Llama 4 were sky-high, and recent releases from other labs have genuinely raised the bar for what’s possible, leaving people disappointed. Both models have grossly underperformed their peers.&lt;/p&gt;

&lt;p&gt;The initial reaction from &lt;a href="https://x.com/teortaxesTex/status/1908602241046528218" rel="noopener noreferrer"&gt;Teortaxes&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmwovz4una6x4n1j9dpkx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmwovz4una6x4n1j9dpkx.png" alt="Screenshot-2025-04-08-at-7.46.01-PM" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And also&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsy74d8j0rifne1yhbeqy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsy74d8j0rifne1yhbeqy.png" alt="Screenshot-2025-04-08-at-7.51.48-PM" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model struggled to score 16% on the &lt;a href="https://aider.chat/docs/leaderboards/" rel="noopener noreferrer"&gt;Aider Polyglot benchmark&lt;/a&gt;, a well-respected benchmark consisting of coding problems across multiple languages and task types.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5za62qdqf27kh91qxaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5za62qdqf27kh91qxaf.png" alt="image-2" width="800" height="516"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scoring around the level of Qwen 2.5 Coder, despite being roughly ten times the size, doesn’t instil confidence. Coding is definitely not its strongest suit.&lt;/p&gt;

&lt;p&gt;The models are also grossly underperforming on long-form writing benches.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://x.com/sam_paech/status/1908694877120192759" rel="noopener noreferrer"&gt;Sam Paech&lt;/a&gt;: I made a new longform writing benchmark. It involves planning out &amp;amp; writing a novella (8x 1000 word chapters) from a minimal prompt. Outputs are scored by sonnet-3.7.&lt;br&gt;
Llama-4 performing not so well. :~(&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fialssh3vpdvzwtaw9a12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fialssh3vpdvzwtaw9a12.png" alt="image-4" width="800" height="682"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Llama 4 models even underperform QwQ-32B and Reka Flash 3. It seems they’re not good at creative writing, either.&lt;/p&gt;

&lt;h3&gt;
  
  
  Confused Positioning
&lt;/h3&gt;

&lt;p&gt;The positioning is confused: it’s neither very cheap nor brilliant compared to its peers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://x.com/theo/status/1909001417014284553" rel="noopener noreferrer"&gt;Theo&lt;/a&gt;: Increasingly confused about where Llama 4 fits in the market&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwe88as53cgawx5prjvb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwe88as53cgawx5prjvb.png" alt="image-5" width="800" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s also not doing well on the ARC-AGI semi-private evals.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://dev.toARC-AGI"&gt;ARC-AGI&lt;/a&gt;: Llama 4 Maverick and Scout on ARC-AGI’s Semi Private Evaluation&lt;/p&gt;

&lt;p&gt;Maverick:&lt;br&gt;
* ARC-AGI-1: 4.38% ($0.0078/task)&lt;br&gt;
* ARC-AGI-2: 0.00% ($0.0121/task)&lt;/p&gt;

&lt;p&gt;Scout:&lt;br&gt;
* ARC-AGI-1: 0.50% ($0.0041/task)&lt;br&gt;
* ARC-AGI-2: 0.00% ($0.0062/task)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qhkkw9zzwwtzff1q9go.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qhkkw9zzwwtzff1q9go.png" alt="image-6" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The scores are certainly not encouraging.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Disasters
&lt;/h2&gt;

&lt;p&gt;There have been some serious issues with the Llama 4 launch, and the situation is terrible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Not really 10M
&lt;/h3&gt;

&lt;p&gt;The least serious issue is that the claim of a 10M context window is not really true. Model performance tends to degrade as the context length increases.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://x.com/ivanfioravanti/status/1909337288816861271" rel="noopener noreferrer"&gt;Ivan&lt;/a&gt;: It’s impossible for Llama-4 to degrade so much at 120k context. How can a big AI lab like Meta push a 10M limit in their announcement and have such poor real-life results? I hope there are bugs somewhere causing this.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Llama 4 models on LiveCodeBench:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkksrlrz0ql5403dc9f3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkksrlrz0ql5403dc9f3.png" alt="image-7" width="800" height="811"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmark Blunders
&lt;/h3&gt;

&lt;p&gt;Surprisingly, the biggest highlight of this launch wasn’t model performance but the LMSYS benchmark blunder. Many Llama and open-source enthusiasts pointed out the mismatch between the LMSYS Elo rating and the models’ real-world performance.&lt;/p&gt;
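&lt;p&gt;For context on what that rating measures: arena-style leaderboards aggregate many pairwise human votes into an Elo-style score, so a model that wins head-to-head votes climbs regardless of why voters preferred it. A minimal sketch of the classic Elo update (LMSYS uses a more elaborate style-controlled variant; the K-factor here is illustrative):&lt;/p&gt;

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One head-to-head battle: score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new
```

&lt;p&gt;The key point: the update only sees the vote, not its reason, so a model tuned to please voters stylistically rises just as fast as a genuinely stronger one.&lt;/p&gt;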

&lt;p&gt;Did Meta game the benchmark? No, not really.&lt;/p&gt;

&lt;p&gt;They explicitly mentioned in their release blog that they had submitted an experimental version of Llama 4 Maverick, optimised for human conversations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfeipaj1f1xcsl806tg8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfeipaj1f1xcsl806tg8.png" alt="image-8" width="800" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="%CC%A7https://x.com/Ahmad_Al_Dahle/status/1909302532306092107"&gt;Ahmad Al Dahl&lt;/a&gt;, the lead at GenAI Meta, clarified the model.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We’re glad to start getting Llama 4 in all your hands. We’re already hearing lots of great results people are getting with these models. That said, we’re also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were ready, we expect it’ll take several days for all the public implementations to get dialed in. We’ll keep working through our bug fixes and onboarding partners. We’ve also heard claims that we trained on test sets — that’s simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations. We believe the Llama 4 models are a significant advancement and we’re looking forward to working with the community to unlock their value&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Turns out the model release was rushed, and there are many rough edges Meta didn’t bother fixing before the launch.&lt;/p&gt;

&lt;p&gt;But this also exposes how bad the LMSYS arena is for LLM evaluations. In response to community questions, they released a statement.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We’ve seen questions from the community about the latest release of Llama-4 on Arena. To ensure full transparency, we’re releasing 2,000+ head-to-head battle results for public review. This includes user prompts, model responses, and user preferences. (link in next tweet)&lt;br&gt;
Early analysis shows style and model response tone was an important factor (demonstrated in style control ranking), and we are conducting a deeper analysis to understand more! (Emoji control?)&lt;/p&gt;

&lt;p&gt;In addition, we’re also adding the HF version of Llama-4-Maverick to Arena, with leaderboard results published shortly. Meta’s interpretation of our policy did not match what we expect from model providers. Meta should have made it clearer that “Llama-4-Maverick-03-26-Experimental” was a customized model to optimize for human preference. As a result of that we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Interestingly, evaluators preferred the lengthy, stylised responses from Llama 4 Maverick across the head-to-head examples. Here is one example where Llama 4 was the winner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkcmwotnw69d6vmdjti9d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkcmwotnw69d6vmdjti9d.png" alt="Screenshot-2025-04-09-at-5.47.36-PM" width="800" height="306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model’s output is tuned to please human readers, and LMSYS voters tend to reward that.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://x.com/emollick/status/1909414182962790467" rel="noopener noreferrer"&gt;Ethan Mollick&lt;/a&gt;: The Llama 4 model that won in LM Arena is different than the released version. I have been comparing the answers from Arena to the released model. They aren’t close. The data is worth a look also as it shows how LM Arena results can be manipulated to be more pleasing to humans.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here’s what Susan Zhang had to say about the same issue:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://x.com/suchenzang/status/1908795054011146308" rel="noopener noreferrer"&gt;Susan&lt;/a&gt;: how did this llama4 score so high on lmsys?? i’m still buckling up to understand qkv through family reunions and weighted values for loving cats…&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2slkhtqp41pbwkb6jty9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2slkhtqp41pbwkb6jty9.png" alt="image-9" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, I don’t know if it is an LLM problem or a benchmark problem.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://x.com/vikhyatk/status/1909403603409969533" rel="noopener noreferrer"&gt;Vik&lt;/a&gt;: This is the clearest evidence that no one should take these rankings seriously. In this example it’s super yappy and factually inaccurate, and yet the user voted for Llama 4. The rest aren’t any better.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It gets even worse: a former Meta employee took to Reddit and posted about how Meta allegedly manipulated the LMArena benchmark.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnn9w71clrjz7quztisu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnn9w71clrjz7quztisu.png" alt="image-10" width="680" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a Llama enthusiast, this was my 9/11. It’s OK if the model underperforms, but not being honest is a crime before the gods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5vgghzpi2uuvysa6ypd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5vgghzpi2uuvysa6ypd.png" alt="image-12" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The tokenizer terror
&lt;/h3&gt;

&lt;p&gt;The woes don’t end here. The tokenizer scene is even more grim.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://x.com/menhguin/status/1908614782930014360" rel="noopener noreferrer"&gt;Kalomaze&lt;/a&gt;: if at any point someone on your team says&lt;/p&gt;

&lt;p&gt;“yeah we need 10 special tokens for reasoning and 10 for vision and another 10 for image generation and 10 agent tokens and 10 post tr-”&lt;/p&gt;

&lt;p&gt;you should have slapped them this is what happens when that doesn’t happen&lt;/p&gt;

&lt;p&gt;&lt;a href="https://x.com/menhguin/status/1908614782930014360" rel="noopener noreferrer"&gt;Minh Nhat Nguyen&lt;/a&gt;: do not go into the llama tokenizer dot json. worst mistake of my life&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbo5ot5gywpgyt8qanawd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbo5ot5gywpgyt8qanawd.png" alt="image-11" width="449" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  There’s Still Hope
&lt;/h2&gt;

&lt;p&gt;It’s not all over; there is still hope for redemption. The 2T-parameter Behemoth model could partially restore Meta’s lost reputation. But for a model of that size, it has to be at least as good as Grok 3, or it’d be over for Meta and Llama.&lt;/p&gt;

&lt;p&gt;And there is reason for hope:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foljjhlh9ypmo8l76w7u5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foljjhlh9ypmo8l76w7u5.png" alt="Screenshot-2025-04-09-at-7.33.22-PM" width="800" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>productivity</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
