<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: CAMEL AI</title>
    <description>The latest articles on Forem by CAMEL AI (@camel-ai).</description>
    <link>https://forem.com/camel-ai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2728501%2F9d02f1ae-563b-4f1a-aa3d-975e00afeeb9.png</url>
      <title>Forem: CAMEL AI</title>
      <link>https://forem.com/camel-ai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/camel-ai"/>
    <language>en</language>
    <item>
      <title>Eigent: Open-source Cowork Meets MiniMax M2.1</title>
      <dc:creator>CAMEL AI</dc:creator>
      <pubDate>Tue, 27 Jan 2026 16:21:56 +0000</pubDate>
      <link>https://forem.com/camel-ai/eigentopen-source-cowork-meets-minimax-m21-28kb</link>
      <guid>https://forem.com/camel-ai/eigentopen-source-cowork-meets-minimax-m21-28kb</guid>
      <description>&lt;h2&gt;
  
  
  Abstract
&lt;/h2&gt;

&lt;p&gt;In real enterprise environments, many internal tools, dashboards, and legacy systems operate entirely in the browser, forming the backbone of daily business operations. To automate these complex systems, we introduce &lt;strong&gt;Eigent&lt;/strong&gt;, an open-source multi-agent workforce application that runs locally and can be fully set up from source, with a strong focus on browser automation.&lt;/p&gt;

&lt;p&gt;In this post, we’ll explore how Eigent, the open-source cowork, leverages CAMEL’s Workforce architecture and browser automation to handle complex, multi-step enterprise tasks. We’ll also take a closer look at MiniMax M2.1, analyzing its performance on real-world enterprise tasks and examining the architectural features that enable it to perform effectively in long-horizon, agentic browser automation scenarios.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Background: What Is Eigent and How It Supports MiniMax M2.1&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Eigent is an open-source cowork desktop application designed to unlock exceptional productivity. It is built on a multi-agent workforce architecture, supported by general capabilities such as browser automation, terminal automation, and MCP servers. This design enables agents in Eigent to perform tasks much like human workers: operating in real desktop environments, without the need for deep API integrations or constant workflow reconfiguration.&lt;/p&gt;

&lt;p&gt;As foundation models continue to advance, integrating them with Eigent’s open-source multi-agent system lets developers and enterprise users apply LLM capabilities directly to real-world use cases quickly and effectively. To use MiniMax M2.1, navigate to the Model Settings page in Eigent, locate the &lt;strong&gt;OpenAI Compatible&lt;/strong&gt; section, and enter your API key and URL. Once the model name is set to MiniMax-M2.1, you are ready to begin. Need help? Check out our guide on &lt;a href="https://platform.minimaxi.com/docs/api-reference/text-openai-api" rel="noopener noreferrer"&gt;configuring your MiniMax API key&lt;/a&gt;.&lt;/p&gt;
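
&lt;p&gt;As a rough illustration, the settings above amount to three values. The sketch below is illustrative only: the base URL is a placeholder (take the real one from the MiniMax API documentation linked above), and the &lt;code&gt;validate&lt;/code&gt; helper is our own, not part of Eigent.&lt;/p&gt;

```python
# Illustrative sketch of the values Eigent's "OpenAI Compatible"
# settings expect. The base URL below is a placeholder; use the real
# one from the MiniMax API documentation.
minimax_settings = {
    "api_key": "YOUR_MINIMAX_API_KEY",   # from your MiniMax account
    "base_url": "https://your-minimax-endpoint.example.com/v1",
    "model": "MiniMax-M2.1",             # model name must match exactly
}

def validate(settings):
    """Basic sanity check before saving the settings."""
    required = ("api_key", "base_url", "model")
    missing = [key for key in required if not settings.get(key)]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return True
```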

&lt;h2&gt;
  
  
  &lt;strong&gt;GitHub Repository &amp;amp; How to Set Up Eigent&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub Repository:&lt;/strong&gt; &lt;a href="https://github.com/eigent-ai/eigent" rel="noopener noreferrer"&gt;https://github.com/eigent-ai/eigent&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick Start: Setting Up the Environment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You have two ways to run Eigent: using the pre-compiled desktop app for immediate usage, or setting up the development environment to inspect the code and customize the agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A: The "Zero-Config" Desktop App&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For users who want to start automating tasks immediately without touching code:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Download the client from the &lt;a href="https://www.eigent.ai/" rel="noopener noreferrer"&gt;Official Website&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Install the &lt;code&gt;.dmg&lt;/code&gt; (macOS) or &lt;code&gt;.exe&lt;/code&gt; (Windows).&lt;/li&gt;
&lt;li&gt;Launch the app—the local backend starts automatically.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Option B: Developer Setup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To access the source code and run the system locally for development, follow these steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Prerequisites&lt;/strong&gt; Ensure you have Node.js (v18-22) and Python 3.10+ installed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Clone and Install&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repository&lt;/span&gt;
git clone https://github.com/eigent-ai/eigent.git
&lt;span class="nb"&gt;cd &lt;/span&gt;eigent

&lt;span class="c"&gt;# Install frontend dependencies&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Run the Application&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# From the project root, run in development mode&lt;/span&gt;
npm run dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once running, you can configure your LLM providers (Minimax M2.1, etc.) directly in the settings. For more detailed information on configuration, advanced features, and troubleshooting, please refer to our &lt;a href="https://docs.eigent.ai/get_started/welcome" rel="noopener noreferrer"&gt;&lt;strong&gt;Official Documentation&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Under the Hood: Eigent’s Full Stack and the CAMEL Workforce Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Eigent System Overview
&lt;/h3&gt;

&lt;p&gt;Eigent is a local-first desktop application with multi-agent orchestration, powered by the CAMEL Workforce as its core engine. The system implements a decoupled, full-stack architecture that operates entirely on the user's local infrastructure. This design ensures data sovereignty, eliminating the privacy risks associated with cloud-resident agent execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Frontend&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The user interface, built with React and TypeScript inside an Electron shell, serves as the control plane for agent configuration and workflow monitoring.&lt;/p&gt;

&lt;p&gt;Key technical components include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;State Management: Zustand is employed for handling transient application state, ensuring efficient reactivity.&lt;/li&gt;
&lt;li&gt;Visual Orchestration: React Flow visualizes the agent workspace, tracking agent execution in real time.&lt;/li&gt;
&lt;li&gt;Communication: The frontend communicates with the backend via secure local HTTP requests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. The Backend&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The core logic resides in a local Python server utilizing FastAPI and Uvicorn, which acts as the host environment for the CAMEL multi-agent framework.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runtime Environment: The backend runs on Python 3.10+, managed by uv for high-performance dependency resolution and environment isolation.&lt;/li&gt;
&lt;li&gt;Persistence Layer: PostgreSQL, interfaced via SQLModel/SQLAlchemy ORM, provides robust structured data storage for audit logs, workflow history, and agent states.&lt;/li&gt;
&lt;li&gt;Multi-Agent Framework: The CAMEL framework handles agent orchestration logic (e.g., the Workforce), interfacing with Large Language Models (LLMs), whether remote (e.g., MiniMax) or local (e.g., via vLLM), to run the agents. CAMEL also offers a rich set of toolkits, such as the browser, terminal, and document generation toolkits.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CAMEL Workforce: A Multi-Agent System Inspired by Organizational Structures
&lt;/h3&gt;

&lt;p&gt;At the heart of Eigent lies CAMEL Workforce, a multi-agent system architected to resolve complex, real-world tasks through decentralized cooperation. The system utilizes a strict Producer-Consumer pattern, mediated by an asynchronous message channel to manage dependency graphs efficiently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Agent Roles&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Coordinator Agent: Functions as the primary dispatcher. It maintains the global state and allocates subtasks to specific workers based on availability and capability.&lt;/li&gt;
&lt;li&gt;Task Agent: Responsible for the semantic decomposition of high-level objectives into executable, atomic units.&lt;/li&gt;
&lt;li&gt;Worker Agent: Serves as the specialized execution unit. Worker agents consume atomic subtasks and execute them using domain-specific tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Asynchronous Communication: The TaskChannel&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Decoupling between the coordination layer and the execution layer is achieved via the TaskChannel. This asynchronous message queue manages task distribution without blocking the main execution thread.&lt;/p&gt;

&lt;p&gt;Execution Flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Workforce initiates a task.&lt;/li&gt;
&lt;li&gt;Worker nodes poll for assignments.&lt;/li&gt;
&lt;li&gt;Upon completion, results are pushed back.&lt;/li&gt;
&lt;/ol&gt;
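
&lt;p&gt;The flow above can be sketched with a plain &lt;code&gt;asyncio.Queue&lt;/code&gt; standing in for the TaskChannel. Names and structure here are illustrative, not CAMEL’s actual API:&lt;/p&gt;

```python
import asyncio

async def worker(name, channel, results):
    # Worker nodes poll the channel for assignments.
    while True:
        task = await channel.get()
        if task is None:               # sentinel: no more work
            channel.task_done()
            break
        results.append((name, task, "DONE"))  # push the result back
        channel.task_done()

async def main():
    channel = asyncio.Queue()          # stands in for the TaskChannel
    results = []
    workers = [asyncio.create_task(worker(f"worker-{i}", channel, results))
               for i in range(2)]
    # The Workforce initiates tasks by publishing them to the channel.
    for task in ("search_flight", "search_hotel", "write_report"):
        await channel.put(task)
    for _ in workers:                  # one sentinel per worker
        await channel.put(None)
    await channel.join()
    await asyncio.gather(*workers)
    return results

completed = asyncio.run(main())
```

Because the queue decouples the two sides, neither the coordinator nor the workers block each other: workers simply consume whatever assignments appear.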

&lt;p&gt;&lt;strong&gt;3. Dynamic DAG Construction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Enterprise workflows are rarely linear. CAMEL Workforce implements a dynamic Directed Acyclic Graph (DAG) construction mechanism. When a high-level prompt is received (e.g., &lt;em&gt;"Create Travel Plan"&lt;/em&gt;), the Task Agent decomposes this objective into discrete nodes.&lt;/p&gt;

&lt;p&gt;The system explicitly maps dependencies, allowing the scheduler to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Execute independent nodes in parallel (e.g., &lt;em&gt;Search Flight Ticket&lt;/em&gt; and &lt;em&gt;Search Hotel&lt;/em&gt; run concurrently).&lt;/li&gt;
&lt;li&gt;Block dependent nodes until their predecessors reach a &lt;code&gt;DONE&lt;/code&gt; state.&lt;/li&gt;
&lt;/ul&gt;
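
&lt;p&gt;The scheduling idea can be sketched in a few lines: group tasks into "waves" where every task in a wave has all of its predecessors in a &lt;code&gt;DONE&lt;/code&gt; state. The task names mirror the travel-plan example; this is an illustration, not CAMEL’s actual scheduler.&lt;/p&gt;

```python
# Toy task DAG for the "Create Travel Plan" example.
dag = {
    "search_flight": [],                               # no predecessors
    "search_hotel": [],                                # parallel with flights
    "create_plan": ["search_flight", "search_hotel"],  # blocked until both DONE
}

def execution_waves(dag):
    """Group tasks into waves; tasks within one wave can run concurrently."""
    done, waves = set(), []
    while len(done) != len(dag):
        ready = sorted(task for task, deps in dag.items()
                       if task not in done and all(d in done for d in deps))
        if not ready:
            raise ValueError("cycle detected in task graph")
        waves.append(ready)
        done.update(ready)
    return waves
```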

&lt;p&gt;&lt;strong&gt;4. Fault-tolerant Mechanism&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Given the non-deterministic nature of LLMs, Eigent treats failures as expected state transitions rather than fatal exceptions. The architecture implements a robust recovery mechanism utilizing the following strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RETRY: Re-executes the sub-task on the same worker to handle transient errors.&lt;/li&gt;
&lt;li&gt;REPLAN: The Task Agent modifies the original sub-task based on the failure log before re-queueing the sub-task.&lt;/li&gt;
&lt;li&gt;REASSIGN: The sub-task is migrated from the current worker to a different agent with a compatible skill set.&lt;/li&gt;
&lt;li&gt;DECOMPOSE: If a task fails due to excessive complexity, it is recursively broken down into smaller subtasks.&lt;/li&gt;
&lt;/ul&gt;
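
&lt;p&gt;A minimal sketch of how such a strategy selection might look; the failure categories below are our own illustrative taxonomy, not CAMEL’s internal error types:&lt;/p&gt;

```python
def choose_recovery(failure_kind, attempt, max_retries=2):
    """Map a failed sub-task to one of the four recovery strategies."""
    if failure_kind == "transient" and max_retries > attempt:
        return "RETRY"        # re-run on the same worker
    if failure_kind == "bad_instruction":
        return "REPLAN"       # Task Agent rewrites the sub-task
    if failure_kind == "missing_capability":
        return "REASSIGN"     # migrate to a worker with compatible skills
    if failure_kind == "too_complex":
        return "DECOMPOSE"    # recursively split into smaller subtasks
    return "REPLAN"           # default: rethink the sub-task
```

The point is that every branch returns a next state rather than raising: a failure is routed, not fatal.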

&lt;h2&gt;
  
  
  &lt;strong&gt;Browser Automation Architecture in Eigent&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Yet, a multi-agent workforce architecture can only unlock real enterprise automation when paired with the growing strength of general-purpose capabilities such as browser automation. This is why we emphasize building agents that can operate directly within real business environments rather than relying solely on rigid API integrations.&lt;/p&gt;

&lt;p&gt;Eigent adopts a two-layer architecture that separates browser control from agent orchestration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The TypeScript layer&lt;/strong&gt; is responsible for all browser interactions. It leverages native Playwright APIs to perform DOM operations, capture structured snapshots, generate SoM screenshots, detect occlusions, and handle advanced browser logic directly within the JavaScript runtime. As Playwright is natively built in TypeScript, this layer gains access to cutting-edge features like &lt;code&gt;_snapshotForAI()&lt;/code&gt; and ensures better performance, reliability, and developer ergonomics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Python layer&lt;/strong&gt; handles AI orchestration. It manages LLM calls, agent decision-making, and task planning. This separation allows Python to focus on agent logic, where the Python ecosystem excels in AI and workflow orchestration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The two layers communicate asynchronously via WebSocket&lt;/strong&gt;, enabling non-blocking operations. Python sends browser operation requests, TypeScript executes them and returns results. The interaction is transparent to the end user and supports concurrent task execution.&lt;/li&gt;
&lt;/ul&gt;
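
&lt;p&gt;To make the request/response flow concrete, here is a sketch of the kind of JSON envelope the two layers might exchange over the WebSocket. Field names are illustrative; Eigent’s actual protocol may differ.&lt;/p&gt;

```python
import json

def browser_request(request_id, action, params):
    """Python side: serialize a browser operation request for TypeScript."""
    return json.dumps({"id": request_id, "action": action, "params": params})

def handle_result(raw):
    """Python side: parse the result the TypeScript layer pushes back."""
    msg = json.loads(raw)
    if msg.get("error"):
        raise RuntimeError(msg["error"])
    return msg["id"], msg["result"]

# A click request and a simulated reply from the TypeScript layer.
request = browser_request(1, "click", {"selector": "button#submit"})
reply = json.dumps({"id": 1, "result": "ok", "error": None})
```

Correlating replies by `id` is what lets multiple browser operations be in flight concurrently without blocking the Python side.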

&lt;p&gt;This architecture improves performance, enhances the precision of element interactions, and enables advanced capabilities like dynamic DOM filtering, viewport-aware snapshots, and in-browser SoM rendering. It avoids the limitations of Python-only implementations, such as high latency, limited access to browser internals, and complex image processing logic. By delegating browser tasks to the native execution context, Eigent ensures a robust foundation for agent-based enterprise automation.&lt;/p&gt;

&lt;p&gt;During multi-agent execution in enterprise automation scenarios, browser-based automation offers a natural advantage in process visibility. Every step is transparent, inspectable, and easy to debug, making it far more practical for complex and evolving workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Testing MiniMax M2.1 on Real-World Enterprise Tasks with Eigent Browser Automation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We have tested Eigent with MiniMax M2.1 on automating sales processes &lt;strong&gt;using Eigent’s browser automation capabilities&lt;/strong&gt;. The agents’ tasks cover various stages of a real-world sales cycle, including &lt;strong&gt;Lead Capture &amp;amp; Creation&lt;/strong&gt;, &lt;strong&gt;Qualification &amp;amp; Pipeline Management&lt;/strong&gt;, &lt;strong&gt;Quotation&lt;/strong&gt;, &lt;strong&gt;Negotiation&lt;/strong&gt;, &lt;strong&gt;Closing&lt;/strong&gt;, and &lt;strong&gt;Product Management&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Across experimental runs, Minimax M2.1 consistently shows three key strengths:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Handles complex page structures well, including iframes and nested elements:&lt;/strong&gt; It can reliably find the right content and buttons, even in complex layouts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checks its own actions to stay accurate and keep steps short:&lt;/strong&gt; It uses a feedback loop to correct mistakes and make sure the task is really done right.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uses tools efficiently and flexibly:&lt;/strong&gt; It avoids unnecessary steps and knows how to combine tools smartly when needed.&lt;/li&gt;
&lt;/ol&gt;
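
&lt;p&gt;The self-checking behavior in point 2 boils down to an act-verify-retry loop. Below is a toy sketch of that loop (our illustration, not how MiniMax M2.1 is implemented internally):&lt;/p&gt;

```python
def run_step(act, verify, max_attempts=3):
    """Run an action, verify the observed outcome, and retry on failure."""
    for _ in range(max_attempts):
        observation = act()
        if verify(observation):
            return observation          # the step is really done right
    raise RuntimeError("step failed verification after retries")

# Toy usage: an action that only succeeds on the second attempt.
state = {"tries": 0}

def flaky_click():
    state["tries"] += 1
    return "clicked" if state["tries"] >= 2 else "no-op"

outcome = run_step(flaky_click, lambda obs: obs == "clicked")
```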

&lt;p&gt;&lt;strong&gt;Task:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We have a new contact at Global Media - Jennifer Martinez (&lt;a href="mailto:jennifer.m@globalmedia.com"&gt;jennifer.m@globalmedia.com&lt;/a&gt;) is their new Senior Marketing Manager. Add her to our Salesforce and make sure she’s connected to the right company."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/NLuy4gZ-vsA"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;In this task, &lt;strong&gt;Minimax M2.1&lt;/strong&gt; was required to operate within a highly complex Salesforce interface to complete a realistic business workflow: adding a new contact, &lt;strong&gt;Jennifer Martinez&lt;/strong&gt; (Senior Marketing Manager), to &lt;strong&gt;Global Media&lt;/strong&gt;, and ensuring she was correctly associated with the appropriate company account. This involved navigating multiple UI layers, identifying the correct entry points, creating the contact, populating key fields, and validating the account linkage.&lt;/p&gt;

&lt;p&gt;The results show that &lt;strong&gt;Minimax M2.1&lt;/strong&gt; executed every step &lt;strong&gt;accurately and without error&lt;/strong&gt;, with no mis-clicks or workflow breakdowns. This demonstrates the model’s strong capability in understanding complex enterprise UIs, planning multi-step actions, and reliably executing end-to-end tasks—highlighting its robustness in real-world, browser-based enterprise automation scenarios.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Minimax M2.1 Improves Task Performance
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Minimax M2.1&lt;/strong&gt; emerges as a strong choice for autonomous enterprise agents. Built to excel in real-world complex workflows, M2.1 consistently handles long-horizon, multi-step tasks with reliability. It delivers a compelling combination of performance, efficiency, and versatility, making it a practical option for scaling agent-based automation in enterprise environments. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enhanced Reasoning and Workflow Continuity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the key strengths of M2.1 lies in its systematic improvements for real-world complex tasks. Compared to its predecessor, M2.1 produces more concise and efficient reasoning chains, improved responsiveness, and reduced token consumption—resulting in smoother execution of continuous workflows such as agentic task automation. &lt;/p&gt;

&lt;p&gt;Rather than relying on simple conversational history, Minimax M2.1 is designed for better context management across multiple steps. This enhanced structured reasoning helps maintain logical continuity during multi-step function calls and reduces the chance of errors later in the workflow, especially in browser-driven task sequences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent and Tool Generalization Capabilities&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;M2.1 exhibits strong performance across a variety of agent scaffolding frameworks and tooling environments. It generalizes reliably with different tools and supports integrated workflows, enhancing its utility in real office and enterprise automation tasks. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Robustness in Long-Horizon Planning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Enterprise automation often involves uncertainty—handling dynamic UI states, load delays, and unexpected interactions. Through its improved reasoning and execution efficiency, Minimax M2.1 demonstrates resilience in longer task sequences, making it well suited for agentic automation systems that require stability over many steps.&lt;/p&gt;

&lt;p&gt;While the gap between top-tier models can be small for standard queries, in scenarios where state retention, complex instruction following, and error recovery are crucial, Minimax M2.1’s enhancements provide a practical foundation for platforms like Eigent. Its ability to produce concise, efficient reasoning and maintain coherent task-level logic makes it an effective choice for complex, multi-step enterprise workflows. &lt;/p&gt;

&lt;p&gt;Eigent is fully open-source, and we invite developers, researchers, and enterprise teams to explore, extend, and contribute:&lt;/p&gt;

&lt;p&gt;👉 GitHub: &lt;a href="https://github.com/eigent-ai/eigent" rel="noopener noreferrer"&gt;https://github.com/eigent-ai/eigent&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👉 Huggingface: &lt;a href="https://huggingface.co/MiniMaxAI/MiniMax-M2.1" rel="noopener noreferrer"&gt;https://huggingface.co/MiniMaxAI/MiniMax-M2.1&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👉 Join our Discord community: &lt;a href="https://discord.camel-ai.org/" rel="noopener noreferrer"&gt;https://discord.camel-ai.org&lt;/a&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>eigent</category>
    </item>
    <item>
      <title>Brainwash Your Agent: How We Keep The Memory Clean</title>
      <dc:creator>CAMEL AI</dc:creator>
      <pubDate>Fri, 21 Nov 2025 11:44:47 +0000</pubDate>
      <link>https://forem.com/camel-ai/brainwash-your-agent-how-we-keep-the-memory-clean-24nn</link>
      <guid>https://forem.com/camel-ai/brainwash-your-agent-how-we-keep-the-memory-clean-24nn</guid>
      <description>&lt;p&gt;Written by Hesam&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Three techniques to cut context bloat, keep what matters, and dump the rest.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Your agent only forgets because you let it. You’re actually more in control of the agent’s intelligence than you think, and context engineering is the delicious secret sauce which allows that.&lt;/p&gt;

&lt;p&gt;Context engineering has been one of the major focuses of the engineering team at CAMEL. &lt;strong&gt;We are constantly thinking about ways to give control over the context to the developers&lt;/strong&gt;, allowing them to optimize the agent’s memory for maximum performance and efficiency. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyf3l59bpyep7gwwctkau.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyf3l59bpyep7gwwctkau.png" alt=" " width="800" height="737"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Context Engineering Doesn’t Have to Be Complex&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;It may sound like a complex term, but “context engineering” is actually founded on a very simple idea: Only feed the agent what is necessary to achieve its goal.&lt;/p&gt;

&lt;p&gt;As you pollute the context with low-signal redundant information, the model’s intelligence suffers a setback. This context rot hurts the agent’s abilities in various ways, e.g. recalling critical information, choosing the right tools, or following explicit prompt instructions. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can read &lt;a href="https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html" rel="noopener noreferrer"&gt;this blog post&lt;/a&gt; that explains how and why long contexts fail.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This blog post is not an explanation of context engineering techniques. There are many high-quality articles out there explaining the creative ways companies engineer their agent’s context, and I don’t intend to repeat the same information. But as you read more about context engineering and how it is practically applied in the industry, you begin to see very simple techniques you can easily learn and apply to your own agents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;…because if you develop agents, or if you work with them, you must take seriously the methods and techniques for optimizing how they perceive the context. Some of these techniques are low-hanging fruit: they don’t require extensive implementation changes to your working agents, yet they can affect performance and cost as much as the backend LLM that fuels your agents.&lt;/p&gt;

&lt;h3&gt;
  
  
  In This Blog Post
&lt;/h3&gt;

&lt;p&gt;…we explain &lt;strong&gt;three of the techniques&lt;/strong&gt; implemented in the CAMEL framework that keep agent memory clean and context sharp: &lt;strong&gt;Context Summarization, Workflow Memory,&lt;/strong&gt; and &lt;strong&gt;Tool Output Caching.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Yes, simple and intuitive in principle, and intricate in implementation.&lt;/p&gt;

&lt;p&gt;You’ll see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The real problems we hit with agentic workflows&lt;/li&gt;
&lt;li&gt;The methods we have used to optimize the context&lt;/li&gt;
&lt;li&gt;What remains to be done &lt;strong&gt;(and how you can help)&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We have also opened a number of issues in the CAMEL repo so you can jump in, challenge yourself, ship fixes, and make agents remember better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context Summarization: Keeping What Matters
&lt;/h2&gt;

&lt;p&gt;Let’s imagine a scenario which might sound familiar to you. You prompt your agent to build a simple text-to-emoji app that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;takes an input text from the user,&lt;/li&gt;
&lt;li&gt;calls a text-to-image model to create an emoji,&lt;/li&gt;
&lt;li&gt;shows it to the user,&lt;/li&gt;
&lt;li&gt;stores it in a PostgreSQL database,&lt;/li&gt;
&lt;li&gt;and finally, handles auth so users can log in to their accounts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent builds a perfect app: the UI looks good enough, the emoji images look good, and... oops, the images are not stored in the database; there has to be a bug. So the agent starts searching the web to find out why this is not working, checks the versions, and even takes a look at the official documentation. The process takes much longer than you expected, and now a simple sub-task has become the agent’s main problem and has taken 10 minutes to solve. &lt;/p&gt;

&lt;p&gt;The agent does find the root cause in the end, but there’s a problem here, and let’s look at the hypothetical context of our hypothetical agent to see what’s wrong:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyaihi0ds8do1xejhf540.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyaihi0ds8do1xejhf540.png" alt=" " width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A simple bug-fix or “side quest” may completely overtake the purpose and the token consumption of your agent. If you have used coding agents such as Cursor or Claude Code, you most definitely have experienced this derailment, and it is just as true for general purpose agents as well.&lt;/p&gt;

&lt;p&gt;This is the purpose of context summarization. It takes the conversation and distills it down to its most critical components, keeping what matters and throwing in the bin what doesn’t.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffe2ew9tbxfupbkjsqiq6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffe2ew9tbxfupbkjsqiq6.png" alt=" " width="800" height="575"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Context summarization is a common context-management technique used in a number of situations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent has used the majority (e.g. 80%) of its context window.&lt;/li&gt;
&lt;li&gt;The context has been derailed by side-quests and you want to refresh it.&lt;/li&gt;
&lt;li&gt;You want to reference this session in another run, so you need a summary of what happened.&lt;/li&gt;
&lt;/ul&gt;
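
&lt;p&gt;The first trigger above is easy to picture in code. This is a simplified sketch of a token-based trigger, not CAMEL’s actual implementation:&lt;/p&gt;

```python
def should_summarize(used_tokens, context_window, threshold=0.8):
    """True once usage reaches the threshold share of the context window."""
    return used_tokens / context_window >= threshold

def maybe_refresh(history, used_tokens, context_window, summarize):
    """Swap the history for one compact summary message when triggered."""
    if should_summarize(used_tokens, context_window):
        return [summarize(history)]
    return history
```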

&lt;p&gt;Summarization is a Swiss Army knife you can whip out in various scenarios, and it’s a must-have in your agentic kit.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Context Summarization is Used in CAMEL
&lt;/h3&gt;

&lt;p&gt;CAMEL provides &lt;strong&gt;three main approaches&lt;/strong&gt; to context summarization:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Automatic token-based summarization&lt;/strong&gt;: The ChatAgent monitors token usage and automatically triggers summarization.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual summarization API&lt;/strong&gt;: Explicit call by the developer, so you have full control even if you want to summarize the context when you see fit. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Toolkit-based summarization&lt;/strong&gt;: Agent-accessible tool for summarizing the full context, and also searching for the messages that have been summarized.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even though these approaches work slightly differently, the core summarization process follows the same pattern.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj539mnp776obh2oh6kw4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj539mnp776obh2oh6kw4.png" alt=" " width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, what’s the most critical part of this workflow? Of course, it’s the prompt. The summarization prompt is what tells the agent how to summarize the context, what to focus on, and how to handle uninformative bits. &lt;strong&gt;The prompt is what truly makes or breaks this method.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is an evolving area for us, and we’re constantly looking for ways to improve the prompts for maximum clarity and best outcomes, even though developers can also use their own custom prompts. We instruct the agent to extract key information from the conversation history, including: the user’s main request, the work that still needs to be done (necessary if you want to pass this to a fresh conversation), the work currently in progress, etc. In the case of token-limit context summarization, we also pass along a minimal list of user messages, which don’t consume many tokens but are highly informative, and &lt;strong&gt;reduce our reliance on the LLM summarization&lt;/strong&gt; to keep the full picture in mind (after all, LLM summaries can be unreliable or miss some bits, so we have to take cautionary measures).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/camel-ai/camel/issues/3371" rel="noopener noreferrer"&gt;[Enhance] tokenlimit Summarize up to the Last User Message · Issue #3371 · camel-ai/camel&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/camel-ai/camel/issues/3372" rel="noopener noreferrer"&gt;[Enhance] Context Summarizer Toolkit Prompt · Issue #3372 · camel-ai/camel&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/camel-ai/camel/issues/3373" rel="noopener noreferrer"&gt;[Enhance] ChatAgent Summarize Prompt · Issue #3373 · camel-ai/camel&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/camel-ai/camel/issues/3374" rel="noopener noreferrer"&gt;[Enhance] Unify Context Summarization Backend · Issue #3374 · camel-ai/camel&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Workflow Memory: Past Experiences Matter
&lt;/h2&gt;

&lt;p&gt;You ask your agent to get a list of the top free books on ML mathematics and then create a CSV with each book, a description, subject, prerequisites, link, etc. The agent searches the web, finds a few titles, but can’t read some of the books available on the &lt;a href="http://archive.org" rel="noopener noreferrer"&gt;archive.org&lt;/a&gt; website. It tries a few things, searches for a while, and finally figures out a way to do this successfully. The agent has spent five minutes figuring out what it was doing wrong, which is perfectly fine for a single agentic run, but it’s a problem if we need to do a similar task again in the future, especially if this is a recurring workflow. &lt;/p&gt;

&lt;p&gt;Workflow memory solves this problem with a simple idea: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Record what you learned about solving this task, so you have a clear strategy for similar problems in the future.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Intricate Details That Matter
&lt;/h3&gt;

&lt;p&gt;Behind the scenes, workflow memory is a wrapper around context summarization. The key is to keep this summary &lt;strong&gt;general enough&lt;/strong&gt; that it can be &lt;strong&gt;useful in similar tasks&lt;/strong&gt;, but also &lt;strong&gt;detailed enough&lt;/strong&gt; to be genuinely helpful &lt;strong&gt;in practice&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Here is a list of what we ask the agent to summarize and the prompt used to describe each:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task title: A short, generic title of the main task (e.g. Remind weekly meetings on Slack)&lt;/li&gt;
&lt;li&gt;Task description: One-paragraph summary of what the user asked for. No implementation details; just the outcome the user wants.&lt;/li&gt;
&lt;li&gt;Solving steps: Numbered, ordered actions the agent took to complete the task. Each step starts with a verb and is generic enough to be repeatable.&lt;/li&gt;
&lt;li&gt;Tools: Bullet list of tool or function calls used. For each: name → what it did → why it was useful (one line each). This field is explicitly for tool call messages or the MCP servers used.&lt;/li&gt;
&lt;li&gt;Failure and recovery strategy: [Optional] Bullet each incident with symptom, cause (if known), fix/workaround, verification of recovery. Leave empty if no failures.&lt;/li&gt;
&lt;li&gt;Notes and observations: [Optional] Anything not covered in previous fields that is critical to know for future executions of the task. Leave empty if no notes. Do not repeat any information, or mention trivial details. Only what is essential.&lt;/li&gt;
&lt;li&gt;Tags: 3-10 categorization tags that describe the workflow type, domain, and key capabilities. Use lowercase with hyphens. Tags should be broad, reusable categories to help with semantic matching to similar tasks.&lt;/li&gt;
&lt;/ul&gt;
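&lt;p&gt;To make this concrete, here is a minimal standalone sketch of what such a summary record could look like. The class and method names are illustrative, not CAMEL’s actual implementation; in the framework the summary content itself is produced by the LLM from the prompt fields above.&lt;/p&gt;

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowSummary:
    """Hypothetical container mirroring the fields listed above."""
    task_title: str
    task_description: str
    solving_steps: list          # numbered, verb-first, repeatable actions
    tools: list                  # "name: what it did / why it was useful"
    failure_recovery: list = field(default_factory=list)  # optional
    notes: str = ""              # optional, only essentials
    tags: list = field(default_factory=list)  # lowercase-with-hyphens

    def to_markdown(self) -> str:
        # Render the summary as a markdown workflow file, one section
        # per field, so it can be saved to disk and reloaded later.
        lines = [f"# {self.task_title}", "", self.task_description,
                 "", "## Solving steps"]
        lines += [f"{i}. {step}" for i, step in enumerate(self.solving_steps, 1)]
        lines += ["", "## Tools"] + [f"- {t}" for t in self.tools]
        if self.failure_recovery:
            lines += ["", "## Failure and recovery"]
            lines += [f"- {f}" for f in self.failure_recovery]
        if self.notes:
            lines += ["", "## Notes", self.notes]
        lines += ["", "Tags: " + ", ".join(self.tags)]
        return "\n".join(lines)
```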

&lt;h3&gt;
  
  
  Loading the Right Workflows
&lt;/h3&gt;

&lt;p&gt;How does the agent find the right workflow memory for the current task? We help the agent with three methods of filtering:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The developer can pass a specific session they find most relevant to the current task.&lt;/li&gt;
&lt;li&gt;Workflow memories are saved with the agent’s role_name as the filename (e.g. researcher_agent_workflow.md), so the same agent can find workflows it previously saved itself, which are most likely the ones it needs.&lt;/li&gt;
&lt;li&gt;The agent is provided with the full list of workflow information: the title, concise description, and tags of all workflows. It can then choose at most N workflows that are most relevant. This selection procedure is then wiped from memory to save context.&lt;/li&gt;
&lt;/ol&gt;
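&lt;p&gt;The third filtering step can be sketched as follows. This is a simplified standalone illustration: in the real framework the relevance judgment is made by the agent itself from the titles, descriptions, and tags, while here simple tag overlap stands in for it, and all names are hypothetical.&lt;/p&gt;

```python
def select_workflows(task_keywords, workflow_index, max_workflows=3):
    """Pick at most `max_workflows` candidates for the current task.

    workflow_index: list of dicts with 'title', 'description', 'tags',
    i.e. the lightweight metadata shown to the agent for selection.
    """
    scored = []
    for wf in workflow_index:
        # Tag overlap as a stand-in for the agent's semantic matching.
        overlap = len(set(task_keywords).intersection(wf["tags"]))
        if overlap:
            scored.append((overlap, wf["title"]))
    scored.sort(reverse=True)  # most overlapping tags first
    return [title for _, title in scored[:max_workflows]]
```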

&lt;p&gt;As you might have noticed, we have refrained from using RAG to retrieve the workflows. This was a conscious decision to avoid the unnecessary complexity and uncertainty that RAG brings, which is simply not needed for this use case. If we ever reach a point where we have so many workflow[.md] files that we need RAG, a critical principle of workflow memory is defeated: &lt;strong&gt;to have a handful of dynamic external memory files for each agent&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Workflows in Research
&lt;/h3&gt;

&lt;p&gt;Workflow memory is a new feature, and naturally, there’s much to learn and improve about it. Its effectiveness has been studied and benchmarked in a &lt;a href="https://arxiv.org/pdf/2409.07429" rel="noopener noreferrer"&gt;paper&lt;/a&gt; in which the authors report a significant gain on web-navigation tasks; you can read it to learn more about this technique.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9gtfp6ria84f9ti4d2ux.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9gtfp6ria84f9ti4d2ux.png" alt=" " width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are multiple areas of improvement when it comes to workflows, which, again, have been turned into bite-sized issues for interested developers:&lt;br&gt;
&lt;a href="https://github.com/camel-ai/camel/issues/3375" rel="noopener noreferrer"&gt;[Enhance] Workflow Memory Summarization Prompt · Issue #3375 · camel-ai/camel&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Tool Output Caching (A Cautionary Tale)
&lt;/h2&gt;

&lt;p&gt;Research papers tend to showcase only what works best. But not us. Tool output caching was another effort by CAMEL developers to keep the agentic context clean, but it was later reverted out of concern for information loss and performance degradation. While this is not a “failed” attempt, and it simply needs more refinement and testing, it pays off to learn about it as a cautionary tale of how over-engineering the context for the sake of “efficiency” may hinder the agent’s intellect.&lt;br&gt;
This represents a foundational challenge of memory management: &lt;strong&gt;token efficiency vs. accuracy.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Tool Outputs are Boring!
&lt;/h3&gt;

&lt;p&gt;Well, not exactly, but they are a challenge to handle. Tools are an essential part of what makes an agent an agent. However, while tool outputs are absolutely necessary, they are usually no longer useful once they have served their purpose.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Agent searches the web
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;camel.toolkits&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SearchToolkit&lt;/span&gt;
&lt;span class="n"&gt;tool_result_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SearchToolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search_google&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI agent frameworks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Returns: 4,250 characters of search results with snippets, URLs, metadata
&lt;/span&gt;
&lt;span class="c1"&gt;# Agent reads a large file
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;camel.toolkits&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FileToolkit&lt;/span&gt;
&lt;span class="n"&gt;tool_result_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FileToolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documentation.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Returns: 8,100 characters of markdown documentation
&lt;/span&gt;
&lt;span class="c1"&gt;# Agent makes 10 more tool calls
# Each subsequent LLM call includes all previous tool results
# → 60,000+ tokens of tool output in context
# → Context window polluted by stale tool data
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tool results are often needed only once, but they stay in the context forever. This is especially a time bomb in long-horizon, real-world tasks in which the agent makes numerous tool calls.&lt;/p&gt;

&lt;h3&gt;
  
  
  Saving the Tool Outputs Outside Context
&lt;/h3&gt;

&lt;p&gt;One way to handle this is to store tool outputs outside the LLM's context (like saving them to a markdown file on disk) and just keep an ID reference in context. That way, if you need the full output later, you can retrieve it by the ID.&lt;/p&gt;

&lt;p&gt;CAMEL’s implementation of this strategy is to simply:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor tool result sizes and flag any output longer than 2,000 characters&lt;/li&gt;
&lt;li&gt;Keep latest tool output full&lt;/li&gt;
&lt;li&gt;Cache older verbose outputs&lt;/li&gt;
&lt;li&gt;Replace full output with reference&lt;/li&gt;
&lt;li&gt;Include preview (first 160 chars)&lt;/li&gt;
&lt;li&gt;Provide retrieval instructions so the agent can load the full outputs if necessary&lt;/li&gt;
&lt;/ul&gt;
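&lt;p&gt;The steps above can be sketched in a few lines of standalone Python. The thresholds match the ones listed (a 2,000-character cache trigger and a 160-character preview), but the function names and cache format are illustrative, not CAMEL’s actual API.&lt;/p&gt;

```python
import uuid

CACHE_THRESHOLD = 2000   # cache outputs longer than this many characters
PREVIEW_CHARS = 160      # length of the inline preview kept in context
_cache = {}              # full outputs live here, outside the LLM context

def compact_tool_outputs(tool_messages):
    """tool_messages: raw tool output strings, oldest first."""
    compacted = []
    for i, output in enumerate(tool_messages):
        is_latest = i == len(tool_messages) - 1
        needs_cache = (not is_latest) and len(output) > CACHE_THRESHOLD
        if not needs_cache:
            # Keep short outputs and the latest output intact.
            compacted.append(output)
            continue
        cache_id = uuid.uuid4().hex[:8]
        _cache[cache_id] = output
        # Replace the verbose output with a reference, a preview, and
        # retrieval instructions for the agent.
        compacted.append(
            f"[cached output {cache_id}] preview: {output[:PREVIEW_CHARS]}... "
            f"(retrieve_output('{cache_id}') returns the full text)"
        )
    return compacted

def retrieve_output(cache_id):
    """Load a cached tool output back when the agent decides it needs it."""
    return _cache[cache_id]
```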

&lt;p&gt;In theory, this would drastically reduce token consumption. If your exposure to agents goes beyond courses and tutorial code, you know that tools are more complicated than a get_weather API. Web navigation tools like Playwright or browser automation agents can return extremely large outputs, sometimes even the entire DOM of a webpage, which can easily surpass 10,000 tokens.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7fjbstwixwdcuudhfaa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7fjbstwixwdcuudhfaa.png" alt=" " width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  A Pinch of Salt.
&lt;/h3&gt;

&lt;p&gt;Like any context engineering technique, you must be careful not to pay for extra efficiency with a drop in accuracy. Here are some ways tool output caching might have a negative effect.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Information loss:&lt;/strong&gt; Agent processes verbose tool output → system caches it and replaces with a preview + reference → agent later sees the preview and doesn’t think it needs the full output → makes decision based on incomplete data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cognitive Load on Agent:&lt;/strong&gt; Agent must recognize when/if full output is needed, call the retrieve function at the right time, track which cache IDs relate to which output, and also decide whether the preview is sufficient or not. &lt;strong&gt;This is an extra cognitive load that is not directly related to solving the user-provided task.&lt;/strong&gt; &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We believe a complete, foolproof implementation of tool output caching could be quite valuable. For the curious reader, we have created an issue to bring this feature back to life and make sure it serves the purpose of context hygiene. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/camel-ai/camel/issues/3376" rel="noopener noreferrer"&gt;[Enhance] Revive Tool Output Caching · Issue #3376 · camel-ai/camel&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The Future Road Ahead
&lt;/h1&gt;

&lt;p&gt;While these methods are backed by research and common industry best practices, we make it our mission to ensure they bring substantial gains to the CAMEL repository. We commit to this by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implementing new techniques to optimize the agent’s memory.&lt;/li&gt;
&lt;li&gt;Fixing and improving the existing methods.&lt;/li&gt;
&lt;li&gt;Benchmarking the existing methods and raising the bar.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What’s particularly fascinating about context engineering and memory management &lt;strong&gt;is the abundance of creative and novel techniques&lt;/strong&gt; made possible by how new this field is.&lt;/p&gt;

&lt;p&gt;The fact that an agent can become so much smarter and more efficient, not by changing the LLM or spending more money and compute, but simply by changing how the agent views the conversation and how its memory is managed, is pretty thrilling.&lt;/p&gt;

&lt;p&gt;What we covered in this blog post is only part of the effort CAMEL has put into improving agent memory. Our wish is that through this blog, you are &lt;strong&gt;more inspired&lt;/strong&gt; to work on this area of AI, and maybe even use this opportunity to &lt;strong&gt;start your open-source arc&lt;/strong&gt; by opening PRs for each issue, reviewing and contributing to the ones opened by others, or creating new issues if you find ways to fix or improve these techniques.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>opensource</category>
      <category>discuss</category>
    </item>
    <item>
      <title>How CAMEL Rebuilt Browser Automation: From Python to TypeScript for Reliable AI Agents</title>
      <dc:creator>CAMEL AI</dc:creator>
      <pubDate>Thu, 13 Nov 2025 13:47:04 +0000</pubDate>
      <link>https://forem.com/camel-ai/how-camel-rebuilt-browser-automation-from-python-to-typescript-for-reliable-ai-agents-2893</link>
      <guid>https://forem.com/camel-ai/how-camel-rebuilt-browser-automation-from-python-to-typescript-for-reliable-ai-agents-2893</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2or99ytzrtto8d9rngqm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2or99ytzrtto8d9rngqm.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
During the development of the CAMEL AI framework, we have been using browser automation tools to enable AI Agents to complete complex web tasks. The initial implementation was pure Python, based on Playwright Python bindings. However, as use cases increased, we gradually discovered some unavoidable pain points.&lt;/p&gt;

&lt;p&gt;The first issue was unstable snapshot quality. AI Agents need to understand web content to make correct decisions, and at that time we could only traverse the DOM tree and extract element information using JavaScript scripts ourselves. This process was not only error-prone but also frequently missed important interactive elements.&lt;/p&gt;

&lt;p&gt;The second pain point was the contradiction between cost and speed. When page content was complex, we needed to provide AI with visualized screenshots to help understand the layout, but the cost of image tokens was several times that of text, and processing speed was much slower. While pure text snapshots were cheap and accurate, they were prone to errors when encountering complex visual layouts. We needed a mechanism that could intelligently switch between the two.&lt;/p&gt;

&lt;p&gt;The third issue was the reliability of form filling. Real-world web forms come in all shapes and sizes: some input boxes are hidden in multiple layers of nesting, some dropdown menus are dynamically loaded, and some date pickers only show the actual input box after clicking. Relying purely on Playwright’s basic APIs made it difficult to handle these edge cases.&lt;/p&gt;

&lt;p&gt;So we decided to rethink this problem from an architectural level.&lt;/p&gt;

&lt;h2&gt;
  
  
  We decided to refactor to TypeScript, but why?
&lt;/h2&gt;

&lt;p&gt;CAMEL is a pure Python framework, and introducing Node.js would increase dependency complexity. However, after in-depth research, we found this was almost an inevitable choice.&lt;/p&gt;

&lt;p&gt;Playwright is essentially developed in TypeScript, and the Node.js version is a “first-class citizen”. Many advanced features are implemented first in Node.js and then ported to Python bindings. For example, the &lt;code&gt;_snapshotForAI()&lt;/code&gt; API we mentioned earlier can generate AI-optimized DOM snapshots, automatically handling ARIA attributes, element hierarchies, interactivity judgments, and other complex logic. If using pure Python, we would need to write over a thousand lines of JavaScript code ourselves to implement these features, and it would be difficult to guarantee quality.&lt;/p&gt;

&lt;p&gt;More importantly, browsers themselves run in a JavaScript environment. When we need to perform advanced operations within a page—such as detecting whether elements are occluded or dynamically injecting visual markers—executing directly in the browser’s JavaScript context is much more efficient than controlling indirectly from the Python side through the CDP protocol.&lt;/p&gt;

&lt;p&gt;So the final architecture is as follows:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TypeScript layer handles browser interaction&lt;/strong&gt;: Manages all logic related to Playwright and DOM operations, directly calling native APIs for optimal performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python layer handles AI orchestration&lt;/strong&gt;: Manages LLM calls, Agent decision-making, and task flow control, which is the Python ecosystem’s strength.&lt;/p&gt;

&lt;p&gt;The two communicate asynchronously through WebSocket, without blocking each other. The Python side initiates a browser operation request, the TypeScript side executes and returns the result, and the entire process is transparent to the user.&lt;/p&gt;
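&lt;p&gt;Conceptually, the Python side of this bridge looks something like the following sketch. The message shape and class names are assumptions for illustration, not CAMEL’s actual wire format, and a fake in-process transport stands in for the real WebSocket connection to the TypeScript layer.&lt;/p&gt;

```python
import asyncio
import itertools
import json

class BrowserBridge:
    """Python side: send a browser action request, await the result."""

    def __init__(self, transport):
        # `transport` is an async callable taking and returning JSON text;
        # in the real system this would be a WebSocket connection.
        self._transport = transport
        self._ids = itertools.count(1)

    async def call(self, action, **params):
        request = {"id": next(self._ids), "action": action, "params": params}
        raw = await self._transport(json.dumps(request))
        response = json.loads(raw)
        # Match the reply to the request so calls don't block each other.
        assert response["id"] == request["id"]
        return response["result"]

async def fake_typescript_side(raw):
    # Stand-in for the Node.js executor: echo the action back as a result.
    msg = json.loads(raw)
    return json.dumps({"id": msg["id"], "result": f"ok:{msg['action']}"})
```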

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjy3njs872bpnuuthqj9j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjy3njs872bpnuuthqj9j.png" alt=" " width="800" height="573"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Advantage&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Legacy Python Approach&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;TypeScript Framework&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Benefits&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Browser API Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python → JS bridge with overhead&lt;/td&gt;
&lt;td&gt;Direct native Playwright API calls&lt;/td&gt;
&lt;td&gt;• Lower latency• Better performance• Access to latest features&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Asynchronous Operations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited async support&lt;/td&gt;
&lt;td&gt;Native async/await throughout&lt;/td&gt;
&lt;td&gt;• Non-blocking operations• Better concurrency• Efficient resource usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Element Interaction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom JavaScript injection&lt;/td&gt;
&lt;td&gt;Native Playwright methods&lt;/td&gt;
&lt;td&gt;• More reliable• Better error handling• Cleaner code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Real-time Events&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Polling-based updates&lt;/td&gt;
&lt;td&gt;WebSocket event streaming&lt;/td&gt;
&lt;td&gt;• Instant updates• Lower resource usage• Better responsiveness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Type Safety&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runtime type checking only&lt;/td&gt;
&lt;td&gt;Compile-time type checking&lt;/td&gt;
&lt;td&gt;• Catch errors early• Better IDE support• Safer refactoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multiple language contexts&lt;/td&gt;
&lt;td&gt;Single runtime environment&lt;/td&gt;
&lt;td&gt;• Low-latency calls• Lower CPU usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Browser Features&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited to Python bindings&lt;/td&gt;
&lt;td&gt;Full Playwright API access&lt;/td&gt;
&lt;td&gt;• Playwright _snapshotForAI• Advanced debugging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Error Handling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cross-language error propagation&lt;/td&gt;
&lt;td&gt;Native error boundaries&lt;/td&gt;
&lt;td&gt;• Clearer stack traces• Better error recovery• Easier debugging&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Multi-Modal Output: Finding Balance Between Cost and Accuracy
&lt;/h2&gt;

&lt;p&gt;After having stable snapshot generation capabilities, we began thinking about the next question: when should we use text, and when should we use images?&lt;/p&gt;

&lt;p&gt;In actual use, we found that pure text snapshots are sufficient in most cases. For example, when filling out a login form, the AI only needs to know “here is a username input box ref=e123, a password input box ref=e124, and a submit button ref=e125” to complete the task accurately. Text tokens are cheap, processing is fast, and the information is very precise.&lt;/p&gt;

&lt;p&gt;But there are also scenarios where pure text loses critical information. For example, a complex dashboard with dozens of buttons and charts might have a text snapshot of thousands of lines, making it difficult for AI to understand which button is in which area. Or a visual design task like “move the blue button to the upper right corner” simply cannot be completed without a screenshot.&lt;/p&gt;

&lt;p&gt;So after decoupling all operational actions into primitive tools, we also provide &lt;code&gt;browser_get_page_snapshot&lt;/code&gt; and &lt;code&gt;browser_get_som_screenshot&lt;/code&gt; as optional actions to the agent, allowing the agent to freely switch between these two modes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Snapshot Optimization: Making AI See More Precisely
&lt;/h2&gt;

&lt;p&gt;Even with &lt;code&gt;_snapshotForAI()&lt;/code&gt;, we found there was still room for optimization in the generated snapshots.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 1: Interference from Decorative Elements
&lt;/h3&gt;

&lt;p&gt;Playwright snapshots include all ARIA-accessible elements, including many purely decorative icons, dividers, and decorative text. For example, a navigation bar might have 50 elements, but only 5 links are actually clickable, the rest are decorative. This noise distracts the AI’s attention and increases reasoning costs.&lt;/p&gt;

&lt;p&gt;Our approach was to add intelligent filtering logic on the Node.js side. By parsing the DOM hierarchy relationships through &lt;code&gt;snapshot-parser.ts&lt;/code&gt;, we identify the true “parent elements”. For example, if a button contains nested icons and text, we only keep the outermost button element and filter out the decorative child elements inside. This filtering is implemented through &lt;code&gt;filterClickableByHierarchy()&lt;/code&gt;, with rules including: if a link tag contains an img, remove the img and keep only the link; if a button contains generic elements, remove the generic and keep the button.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 2: Meaningless Off-Screen Elements
&lt;/h3&gt;

&lt;p&gt;Web pages are usually very long, and users can currently only see a small portion (viewport). But snapshots by default include all elements of the entire page, including parts that require scrolling to see. This is noise for AI—it shouldn’t try to click a button that hasn’t been scrolled to yet.&lt;/p&gt;

&lt;p&gt;We added a &lt;code&gt;viewportLimit&lt;/code&gt; parameter. When enabled, only elements within the current viewport are returned. This filtering is done on the browser side, calling the &lt;code&gt;isInViewport()&lt;/code&gt; function to check whether the element’s &lt;code&gt;getBoundingClientRect()&lt;/code&gt; is within the visible area. This way, the snapshot the AI sees is more focused, and decision quality is higher.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 3: Misleading from Element Occlusion
&lt;/h3&gt;

&lt;p&gt;This is a more subtle problem. Some elements exist in the DOM tree but are blocked by other elements, so users can’t see or click them. If we include these elements in the snapshot, the AI might try to click and then fail.&lt;/p&gt;

&lt;p&gt;In SoM screenshot mode, we implemented occlusion detection. Through the &lt;code&gt;checkElementVisibilityByCoords()&lt;/code&gt; function, for each element’s center point we call &lt;code&gt;document.elementsFromPoint(x, y)&lt;/code&gt;, a native browser API that returns all elements at that coordinate point, sorted by z-index from high to low. If our target element is not at the top layer, it means it’s occluded. We draw dashed boxes (instead of solid lines) for such elements, or filter them out directly.&lt;/p&gt;

&lt;p&gt;These optimizations all benefit from the TypeScript architecture. &lt;code&gt;document.elementsFromPoint()&lt;/code&gt; is implemented in the browser’s native C++, with extremely high performance. On the Python side, we could only fetch coordinate data and then judge occlusion ourselves with inefficient algorithms, which would be both slow and inaccurate.&lt;/p&gt;




&lt;h2&gt;
  
  
  SoM Screenshot: Why Inject in Browser Rather Than Post-Process?
&lt;/h2&gt;

&lt;p&gt;SoM screenshots present an interesting design challenge. We need to mark all interactive elements in screenshots, drawing a box and number for each element.&lt;/p&gt;

&lt;p&gt;The most intuitive approach is: take a screenshot first, then use Python’s PIL library to draw boxes on the image. The pure Python version did exactly this. But we found this solution has several problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Poor visual quality&lt;/strong&gt;. PIL is CPU software rendering, and the drawn lines have jagged edges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-platform inconsistency&lt;/strong&gt;. PIL depends on system font libraries, and the font paths are different on Windows, macOS, and Linux, often resulting in “cannot find arial.ttf” errors, ultimately having to use ugly default fonts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Most importantly&lt;/strong&gt;, when connecting to an existing browser in CDP mode, for screens with different resolutions, the zoom ratio needs to be adjusted to accurately map element coordinates to the correct positions on the image, which is very cumbersome.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Large data transmission&lt;/strong&gt;. The Python version needs to first execute a JavaScript script to collect all element information, including coordinates, attributes, metadata, etc., then serialize this data into JSON and send it back to Python. For a page with a huge number of elements, this JSON might contain many redundant fields (such as disabled, checked, expanded, etc., which SoM doesn’t need at all).&lt;/p&gt;

&lt;p&gt;The TypeScript version’s approach is: &lt;strong&gt;directly inject DOM elements as markers in the browser, then take a screenshot&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The specific process is as follows: execute JavaScript in the browser context through &lt;code&gt;page.evaluate()&lt;/code&gt;, create an &lt;code&gt;overlay&lt;/code&gt; div covering the entire page, with z-index set to the highest value. Then for each element that needs to be marked, create a &lt;code&gt;label&lt;/code&gt; div, set borders, background colors, and text, and position it correctly using CSS. Call &lt;code&gt;requestAnimationFrame()&lt;/code&gt; to wait for browser rendering to complete, then take a screenshot. After the screenshot is complete, call &lt;code&gt;page.evaluate()&lt;/code&gt; again to delete this overlay.&lt;/p&gt;

&lt;p&gt;The benefits of this approach are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visual perfection&lt;/strong&gt;. Markers are drawn by the browser rendering engine, with GPU acceleration, anti-aliasing, and perfect font rendering. We can use CSS to implement rounded corners (&lt;code&gt;border-radius&lt;/code&gt;), shadows (&lt;code&gt;box-shadow&lt;/code&gt;), and transparency (&lt;code&gt;opacity&lt;/code&gt;), resulting in very professional visual effects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-platform consistency&lt;/strong&gt;. The browser guarantees rendering consistency, with the same effect regardless of which operating system it runs on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Small data transmission&lt;/strong&gt;. We only need to pass in element refs and coordinates, the browser internally completes marker drawing, then returns a simple result object. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intelligent label positioning&lt;/strong&gt;. We implemented the &lt;code&gt;findLabelPosition()&lt;/code&gt; function, which tries multiple positions (above, below, left, right, diagonal of the element), checks for overlap with existing labels, and automatically selects the optimal position. This is very simple to implement on the browser side because we can directly manipulate DOM element positions. If using PIL on the Python side, we would need to maintain a complex coordinate recording system, which is very inefficient.&lt;/p&gt;

&lt;p&gt;Another detail: during visibility detection, we mark completely occluded elements as &lt;code&gt;hidden&lt;/code&gt; status and skip drawing labels directly; partially visible elements are marked as &lt;code&gt;partial&lt;/code&gt;, with dashed boxes and semi-transparent labels. This allows AI to distinguish which elements are truly interactive.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;&lt;em&gt;Our Previous SOM screenshot:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2u28iozct5d6k0mdccw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2u28iozct5d6k0mdccw.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Our Current SOM screenshot:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzzwcfckzn4x7vyl9zz6m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzzwcfckzn4x7vyl9zz6m.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Other popular open source SOM screenshot:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1qrzmgaw29iemdgjnuk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1qrzmgaw29iemdgjnuk.png" alt=" " width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Tool Registration Mechanism: Avoiding Context Explosion
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnabijb3xxjjhb5tya2fh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnabijb3xxjjhb5tya2fh.png" alt=" " width="800" height="1064"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Early versions had a hidden performance issue: SoM screenshots were stored directly in the agent’s context. If the agent needed to take multiple screenshots (for example, while browsing several pages), context usage grew rapidly.&lt;/p&gt;

&lt;p&gt;Our solution requires the agent to pass an instruction prompt when calling the &lt;code&gt;browser_get_som_screenshot&lt;/code&gt; tool. A textual description of the screenshot is kept in the context, while the image itself never occupies the agent’s context.&lt;/p&gt;

&lt;p&gt;This optimization requires coordination with CAMEL’s Agent registration mechanism. Through the &lt;code&gt;RegisteredAgentToolkit&lt;/code&gt; base class, we allow the Toolkit to access the Agent instance that registered it. Specifically, when &lt;code&gt;browser_get_som_screenshot&lt;/code&gt; is called, if a registered Agent is detected, the Agent’s &lt;code&gt;astep()&lt;/code&gt; method is automatically called, passing a Message object containing the image path. The Agent internally uses PIL to load the image and passes it to the multimodal LLM for analysis.&lt;/p&gt;

&lt;p&gt;This design achieves clear separation of responsibilities: the Toolkit is responsible for generating screenshots, and the Agent is responsible for analysis and understanding. The connection between them is only through the lightweight interface of file paths.&lt;/p&gt;
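&lt;p&gt;The registration pattern above can be sketched as follows. This is a minimal Python illustration, not CAMEL’s actual API: the names &lt;code&gt;RegisteredToolkit&lt;/code&gt;, &lt;code&gt;ScreenshotToolkit&lt;/code&gt;, and &lt;code&gt;step()&lt;/code&gt; are hypothetical stand-ins for &lt;code&gt;RegisteredAgentToolkit&lt;/code&gt;, the browser toolkit, and &lt;code&gt;astep()&lt;/code&gt;.&lt;/p&gt;

```python
# Minimal sketch of the registration pattern described above (all names
# hypothetical; CAMEL's real RegisteredAgentToolkit differs in detail).
# The toolkit hands the caller only a file path, while the registered
# agent analyzes the image out of band and returns a text summary.

class RegisteredToolkit:
    def __init__(self):
        self.agent = None  # set when an agent registers this toolkit

    def register(self, agent):
        self.agent = agent


class ScreenshotToolkit(RegisteredToolkit):
    def get_som_screenshot(self, instruction: str) -> str:
        path = "/tmp/som_screenshot.png"  # image saved to disk, not to context
        if self.agent is not None:
            # The agent loads the image itself (e.g. via PIL) and returns a
            # text description; only that description enters the context.
            return self.agent.step(instruction, image_path=path)
        return path


class StubAgent:
    def step(self, instruction, image_path):
        return f"analyzed {image_path}: {instruction}"
```

&lt;p&gt;The key property is that the return value flowing back into the conversation is plain text, so repeated screenshots no longer inflate the context.&lt;/p&gt;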




&lt;h2&gt;
  
  
  Form Filling Optimization: Dealing with Real-World Complexity
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gc8wstlxgzxg7mr8lnc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gc8wstlxgzxg7mr8lnc.png" alt=" " width="800" height="783"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Real-world web forms are far more complex than they first appear. We encountered many unusual scenarios during testing and iterated toward the current solution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Batch Processing of Multiple Input Boxes
&lt;/h3&gt;

&lt;p&gt;The simplest optimization: allow filling multiple input boxes in one command. When calling &lt;code&gt;browser_type()&lt;/code&gt;, pass a dictionary &lt;code&gt;{ref1: text1, ref2: text2}&lt;/code&gt; to fill them all at once. This avoids multiple round trips (each carrying snapshot updates) and reduces the risk of the page state changing mid-task.&lt;/p&gt;
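&lt;p&gt;A rough sketch of the batch-filling idea, in Python for illustration (the helper name and signature are hypothetical; the real implementation lives in the TypeScript browser toolkit):&lt;/p&gt;

```python
# Illustrative sketch of batch typing: one call fills several refs, so
# the agent pays for a single snapshot round trip instead of one per
# field. Fields that fail are reported back rather than aborting the run.

def batch_type(fill_fn, entries: dict) -> list:
    """Fill each ref with its text via fill_fn; return the refs that failed."""
    failed = []
    for ref, text in entries.items():
        try:
            fill_fn(ref, text)
        except Exception:
            failed.append(ref)
    return failed
```

&lt;p&gt;Returning the failed refs lets the agent retry only the problematic fields instead of re-filling the whole form.&lt;/p&gt;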

&lt;h3&gt;
  
  
  Intelligent Dropdown Menu Detection
&lt;/h3&gt;

&lt;p&gt;Some input boxes are actually dropdown menus in disguise. For example, a search box pops up a suggestion list as you type. Our &lt;code&gt;browser_type()&lt;/code&gt; implementation has a &lt;code&gt;shouldCheckDiff&lt;/code&gt; check: if the input box’s role suggests it might trigger a dropdown (combobox, searchbox, etc.), we capture a snapshot before and after typing, then compute the diff to extract newly appeared option elements. This diff is returned to the Agent, so it knows which options are now available to choose from.&lt;/p&gt;
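&lt;p&gt;The role heuristic can be expressed in a few lines. This Python sketch is illustrative only; the post names combobox and searchbox, and the full role set used in the actual TypeScript code may differ:&lt;/p&gt;

```python
# Role-based heuristic: only roles that commonly spawn suggestion lists
# trigger the more expensive before/after snapshot diff. The exact set
# here is an assumption beyond the combobox/searchbox examples in the post.

DROPDOWN_PRONE_ROLES = {"combobox", "searchbox", "listbox"}

def should_check_diff(role: str) -> bool:
    """Return True if typing into this role is likely to open a dropdown."""
    return role.lower() in DROPDOWN_PRONE_ROLES
```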

&lt;h3&gt;
  
  
  Snapshot Difference for Dynamic Content
&lt;/h3&gt;

&lt;p&gt;This function is implemented in &lt;code&gt;getSnapshotDiff()&lt;/code&gt;. It compares two snapshot texts and finds newly added elements of specific types (such as option or menuitem). We use a regular expression to match &lt;code&gt;[ref=...]&lt;/code&gt; markers: record all refs in the first snapshot, then find the refs that newly appear in the second. This way the agent can see that, say, five options appeared after clicking an input box.&lt;/p&gt;
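&lt;p&gt;The diff idea behind &lt;code&gt;getSnapshotDiff()&lt;/code&gt; can be rendered in Python roughly as follows (the real implementation is TypeScript; this is a simplified sketch using line-level matching):&lt;/p&gt;

```python
import re

# Collect [ref=...] ids from the first snapshot, then report lines in the
# second snapshot whose ref is new and whose line mentions an option-like
# role. This is the essence of the diff the Agent receives after typing.

REF_RE = re.compile(r"\[ref=([^\]]+)\]")
OPTION_ROLES = ("option", "menuitem")

def snapshot_diff(before: str, after: str) -> list:
    known = set(REF_RE.findall(before))
    added = []
    for line in after.splitlines():
        m = REF_RE.search(line)
        if m and m.group(1) not in known:
            if any(role in line for role in OPTION_ROLES):
                added.append(line.strip())
    return added
```

&lt;p&gt;Only the newly appeared option lines are surfaced, so the agent sees “these choices just appeared” without re-reading the whole snapshot.&lt;/p&gt;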

&lt;h3&gt;
  
  
  Special Handling for Read-Only Elements
&lt;/h3&gt;

&lt;p&gt;A date picker is usually a read-only input box that pops up a calendar component when clicked. Our &lt;code&gt;performType()&lt;/code&gt; function first checks the element’s &lt;code&gt;readonly&lt;/code&gt; attribute and &lt;code&gt;type&lt;/code&gt; (date, datetime-local, time, etc.). For such elements, it clicks first, waits 500ms, then looks for the actual input box among newly appeared elements (matched by placeholder).&lt;/p&gt;
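&lt;p&gt;The pre-type check can be sketched as a small predicate. Python for illustration; the function name is hypothetical, and the picker-type list beyond date/datetime-local/time is an assumption:&lt;/p&gt;

```python
# Decide whether an element must be clicked open before text can be typed
# into it: either it carries the readonly attribute, or its type is one of
# the native picker types that pop up a widget on interaction.

PICKER_TYPES = {"date", "datetime-local", "time"}

def needs_click_first(attrs: dict) -> bool:
    """attrs maps attribute names to values; readonly present means read-only."""
    if attrs.get("readonly") is not None:
        return True
    return attrs.get("type", "") in PICKER_TYPES
```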

&lt;h3&gt;
  
  
  Nested Input Box Search
&lt;/h3&gt;

&lt;p&gt;Some buttons don’t expose an editable field directly: after clicking, the input appears inside nested child elements rather than replacing the original one. Our strategy: if &lt;code&gt;fill()&lt;/code&gt; fails on a given ref, search its children for &lt;code&gt;input&lt;/code&gt;, &lt;code&gt;textarea&lt;/code&gt;, &lt;code&gt;[contenteditable]&lt;/code&gt;, etc. and try to fill those instead. This search uses Playwright’s &lt;code&gt;locator()&lt;/code&gt; API, which supports complex CSS selectors.&lt;/p&gt;
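&lt;p&gt;The fallback order can be sketched like this (Python for illustration; the helper name is hypothetical, and &lt;code&gt;&gt;&gt;&lt;/code&gt; stands in for Playwright’s chained-selector syntax):&lt;/p&gt;

```python
# If filling the ref itself fails, descend into its children looking for
# something editable, trying selectors in a fixed priority order. The
# try_fill callable stands in for a Playwright locator().fill() attempt
# that returns True on success.

CHILD_SELECTORS = ["input", "textarea", "[contenteditable]"]

def nested_fill(try_fill, ref: str, text: str) -> bool:
    candidates = [ref] + [f"{ref} >> {sel}" for sel in CHILD_SELECTORS]
    for selector in candidates:
        if try_fill(selector, text):
            return True
    return False
```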

&lt;h3&gt;
  
  
  Error Recovery Mechanism
&lt;/h3&gt;

&lt;p&gt;All of this logic is wrapped in try-catch blocks. If a step fails, we record detailed error information (the element’s tagName, id, className, placeholder, and other debugging details) but don’t throw immediately; instead, we try the next strategy. Only when every strategy fails do we return a descriptive error message to the Agent.&lt;/p&gt;
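&lt;p&gt;The strategy chain reduces to a simple pattern, sketched here in Python (names hypothetical): try each strategy in turn, accumulate failures with element context, and only report an error once everything has failed.&lt;/p&gt;

```python
# Try strategies in order; a success short-circuits the chain. Failures are
# recorded with element context so the final message is actionable for the
# Agent rather than a bare exception.

def run_strategies(strategies, element_info: dict):
    errors = []
    for name, strategy in strategies:
        try:
            return strategy()
        except Exception as exc:
            errors.append(f"{name}: {exc} (element={element_info})")
    detail = "; ".join(errors)
    return f"All strategies failed: {detail}"
```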

&lt;p&gt;These optimizations accumulated gradually through real-world use. Each time we encountered a form the Agent couldn’t handle, we analyzed which step failed and made targeted improvements.&lt;/p&gt;




&lt;h1&gt;
  
  
  &lt;strong&gt;Future Plans&lt;/strong&gt;
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Improved Page Snapshots&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A high-quality snapshot is fundamental to the efficiency of browser-use agents. An ideal snapshot should maintain a high signal-to-noise ratio and low redundancy, preserving meaningful information while filtering out noise.&lt;/p&gt;

&lt;p&gt;Current implementations still include numerous empty or generic elements, and the filtering of off-viewport elements remains imprecise. In some cases, invisible or irrelevant nodes are still captured, leading to unnecessary data and potential confusion for the model.&lt;/p&gt;

&lt;p&gt;Future work will focus on optimizing snapshot extraction to ensure that only semantically relevant and visually significant elements are retained. This includes accurate viewport filtering, hierarchical aggregation of non-interactive elements, and enhancing semantic representation for key UI components. The goal is to create a snapshot format that is compact yet information-dense, improving model understanding and token efficiency.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Support for Purely Visual Model Evaluation Environment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recently, several model providers have introduced purely visual, coordinate-based “computer-use” models, such as Gemini 2.5 Computer Use, which interact with GUIs directly through pixel-level clicks, similar to human users. Under this paradigm, performing computer-use tasks no longer requires complex engineering layers to expose element-level interaction APIs, which significantly simplifies the implementation of GUI interaction for model developers.&lt;/p&gt;

&lt;p&gt;To support this new paradigm, we can provide a simulation and evaluation environment where model-predicted click coordinates can be validated against ground-truth element positions derived from snapshots. This framework would enable accuracy assessment and reinforcement learning for purely visual GUI agents. Such an environment could serve as foundational infrastructure for training and benchmarking vision-based computer-use models, reducing engineering overhead for model developers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reliable Long-Term Memory Paradigm&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Browser-use is a specific form of GUI interaction where an agent operates within a web environment through sequential actions and feedback. As the action sequence grows, the accumulation of intermediate feedback poses challenges: exceeding context length limits, reducing efficiency of historical retrieval, and increasing the risk of information loss.&lt;/p&gt;

&lt;p&gt;This issue is even more pronounced for purely visual models, as visual context lacks structured semantics and cannot be easily indexed or summarized. Building a robust long-term memory paradigm is therefore essential.&lt;/p&gt;

&lt;p&gt;Future efforts will explore mechanisms for efficient retrieval and compression of historical states and feedback, combining semantic summarization, hierarchical caching, and adaptive context reconstruction. The ultimate goal is to enable agents to maintain awareness and continuity over extended sequences, preserving both behavioral traceability and task coherence.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
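&lt;p&gt;The evaluation idea in point 2 — validating model-predicted click coordinates against ground-truth element positions from snapshots — could look roughly like this. A hedged Python sketch; all names and the box format are illustrative assumptions:&lt;/p&gt;

```python
# Score a model-predicted click by checking whether it lands inside the
# ground-truth bounding box of the target element. Boxes hold x, y
# (top-left corner) plus width and height, in page pixels.

def click_hit(pred_x: float, pred_y: float, box: dict) -> bool:
    in_x = box["x"] + box["width"] >= pred_x >= box["x"]
    in_y = box["y"] + box["height"] >= pred_y >= box["y"]
    return in_x and in_y

def click_accuracy(predictions, boxes) -> float:
    """Fraction of predicted (x, y) clicks that land inside their target box."""
    hits = sum(1 for (x, y), box in zip(predictions, boxes) if click_hit(x, y, box))
    return hits / len(predictions)
```

&lt;p&gt;A hit rate like this could serve both as an evaluation metric and as a reward signal for reinforcement learning on purely visual GUI agents.&lt;/p&gt;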

&lt;p&gt;CAMEL's GitHub: &lt;a href="https://github.com/camel-ai/camel" rel="noopener noreferrer"&gt;https://github.com/camel-ai/camel&lt;/a&gt;&lt;br&gt;
Official Website: &lt;a href="https://www.camel-ai.org/" rel="noopener noreferrer"&gt;https://www.camel-ai.org/&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Thanks to the Community
&lt;/h1&gt;

&lt;p&gt;All of the above refactors are task-motivated. Through extensive testing and collaboration with community members, we gradually developed and implemented these ideas together. Since browser use is a highly engineering-oriented part of the Agent era, it can only be improved through rich and diverse testing. We are grateful to everyone in the community for their contributions.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>typescript</category>
      <category>agents</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
