<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: CAMEL-AI</title>
    <description>The latest articles on Forem by CAMEL-AI (@camelai).</description>
    <link>https://forem.com/camelai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F10303%2Fb4e814eb-07c5-444d-813d-16d08f362f9d.png</url>
      <title>Forem: CAMEL-AI</title>
      <link>https://forem.com/camelai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/camelai"/>
    <language>en</language>
    <item>
      <title>What Building a Hybrid Browser Toolkit Taught Us About the Web</title>
      <dc:creator>Nomadev</dc:creator>
      <pubDate>Wed, 01 Oct 2025 19:10:57 +0000</pubDate>
      <link>https://forem.com/camelai/what-building-a-hybrid-browser-toolkit-taught-us-about-the-web-2omo</link>
      <guid>https://forem.com/camelai/what-building-a-hybrid-browser-toolkit-taught-us-about-the-web-2omo</guid>
      <description>&lt;p&gt;If you’ve ever tried browser automation, you know the drill:&lt;br&gt;
You spin up Selenium, Playwright, or Puppeteer, point it at a page, and suddenly you’re wrestling with flaky selectors, weird screenshots, or the dreaded “element not found” even though it’s right there.&lt;/p&gt;

&lt;p&gt;I’ve been there. It feels like teaching a robot to surf the web by giving it a pair of oven mitts. Sure, it clicks and scrolls, but half the time it’s guessing.&lt;/p&gt;

&lt;p&gt;At CAMEL-AI, we ran into this wall too. Our original Camel BrowserToolkit was a first attempt at solving it. It did the basics — take screenshots, inject custom IDs, and click things. But it was… let’s say, not elegant. It worked more like asking an AI to click on pictures instead of actually understanding the page.&lt;/p&gt;

&lt;p&gt;That got us thinking:&lt;br&gt;
What if the toolkit could “see” the page like a human and understand the structure like a dev?&lt;/p&gt;
&lt;h2&gt;
  
  
  From Monolith to Hybrid
&lt;/h2&gt;

&lt;p&gt;The big shift came when we re-architected things. Instead of one heavy Python process, we now have a Hybrid setup using Python and TypeScript.&lt;/p&gt;

&lt;p&gt;Python is still your scripting layer. That means you can write automation in a language most of us are comfortable with.&lt;br&gt;
TypeScript is the engine under the hood. It runs Playwright natively, handles async operations, and talks directly to the browser.&lt;/p&gt;

&lt;p&gt;The two communicate over WebSockets. So Python gives high-level commands, while TypeScript executes them efficiently.&lt;/p&gt;
&lt;h2&gt;
  
  
  Introducing the CAMEL Hybrid Browser Toolkit
&lt;/h2&gt;

&lt;p&gt;Enter the &lt;strong&gt;Hybrid Browser Toolkit&lt;/strong&gt;. We've rebuilt the toolkit from the ground up as a TypeScript–Python hybrid. In this new design, TypeScript (running on Node.js) handles the browser directly via Playwright's fast native APIs, and Python remains your friendly front-end interface.&lt;/p&gt;

&lt;p&gt;What does that buy you? Faster performance, access to all the latest Playwright features (like the new &lt;code&gt;_snapshotForAI&lt;/code&gt;), and true async event-driven power – without sacrificing the ease of Python scripting.&lt;/p&gt;

&lt;p&gt;The result is a layered architecture: your Python code talks to a TypeScript server over WebSockets. The TypeScript layer manages browser instances, DOM queries, screenshots, etc., all in the same high-performance JavaScript environment. Python just sends commands and gets structured results.&lt;/p&gt;

&lt;p&gt;This split means lower latency and better concurrency. As one example, Node's Playwright doesn't spawn a fresh process for every browser window like the Python version did, so it can manage many tabs with far less CPU and memory overhead.&lt;/p&gt;

&lt;p&gt;In short, Python becomes the brain giving high-level instructions, and TypeScript is the muscle doing the work efficiently.&lt;/p&gt;
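
&lt;p&gt;To make the split concrete, here's a rough sketch of what one command/response round trip over that WebSocket bridge could look like. The message fields (&lt;code&gt;id&lt;/code&gt;, &lt;code&gt;action&lt;/code&gt;, &lt;code&gt;params&lt;/code&gt;) are illustrative assumptions, not the toolkit's actual wire format:&lt;/p&gt;

```python
import json

def make_command(command_id, action, params):
    # Hypothetical wire format; the toolkit's real message schema may differ.
    # Python only describes WHAT to do; the TypeScript side decides HOW.
    return json.dumps({"id": command_id, "action": action, "params": params})

def parse_response(raw):
    # The TypeScript side replies with a structured JSON result,
    # typically including the outcome plus an updated page snapshot.
    return json.loads(raw)

cmd = make_command(1, "click", {"ref": "5"})
reply = parse_response('{"id": 1, "result": "clicked", "snapshot": []}')
```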
&lt;h2&gt;
  
  
  What's Different Under the Hood
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F51cc44tjyrsao5rcc8we.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F51cc44tjyrsao5rcc8we.png" alt=" " width="800" height="573"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the legacy toolkit, every action that needed to find or click an element typically involved injecting a random ID into the page via a script, then querying it. That worked, but it felt hacky.&lt;/p&gt;

&lt;p&gt;In the hybrid toolkit, we leverage standard accessibility (ARIA) selectors and Playwright's new tools. Now you can do things like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[aria-label="Submit"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByRole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;button&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Submit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;snapshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_snapshotForAI&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="c1"&gt;// snapshot now has structured data on all elements and their ARIA roles&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Playwright's &lt;code&gt;_snapshotForAI()&lt;/code&gt; (an internal API) lets us get a rich DOM snapshot: every interactive element, its role (like button, link, textbox), labels, etc. We assign each element a ref ID and use those for all interactions. This replaces the old random-ID trick with a semantic mapping.&lt;/p&gt;

&lt;p&gt;It also means the same snapshot data fuels both text mode and the visual "set-of-marks" screenshots.&lt;/p&gt;
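
&lt;p&gt;Because the text snapshot is just lines like &lt;code&gt;- button "Submit" [ref=5]&lt;/code&gt;, mapping refs back to roles and names is a small parsing job. A minimal sketch, assuming that line grammar (the real snapshot format may carry more detail):&lt;/p&gt;

```python
import re

# Matches snapshot lines such as: - button "Submit" [ref=5]
LINE = re.compile(r'-\s+(\w+)\s+"([^"]*)"\s+\[ref=(\w+)\]')

def parse_snapshot(text):
    # Build a ref-to-element mapping from a textual snapshot,
    # so later actions can address elements by ref ID alone.
    elements = {}
    for line in text.splitlines():
        m = LINE.search(line)
        if m:
            role, name, ref = m.groups()
            elements[ref] = {"role": role, "name": name}
    return elements

snap = parse_snapshot('- link "Get Started" [ref=1]\n- button "Submit" [ref=5]')
```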

&lt;h3&gt;
  
  
  Set-of-Marks Screenshots
&lt;/h3&gt;

&lt;p&gt;Speaking of screenshots, the new toolkit's SoM (Set-of-Marks) screenshots are crisp and clever. We inject a small script into the page that outlines every clickable element with a little numbered marker (their ref ID).&lt;/p&gt;

&lt;p&gt;This isn't just a dumb screenshot – it knows about element overlap and tries not to mark hidden elements. If a button has an icon and text, it merges them into one mark. It even picks good positions for labels so they don't scribble over each other. (This injection-based approach in the browser is more reliable than our old memory-only screenshots.)&lt;/p&gt;
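
&lt;p&gt;The core idea of the injection can be pictured in plain Playwright terms: evaluate a snippet in the page that draws one numbered badge per element, then screenshot. This is a simplified stand-in for the toolkit's real injection logic, which additionally merges overlapping marks and skips hidden elements:&lt;/p&gt;

```python
# Simplified stand-in for the toolkit's SoM injection script.
SOM_SCRIPT = """
(function (refs) {
  refs.forEach(function (item, i) {
    var el = document.querySelector(item.selector);
    if (!el) { return; }
    var badge = document.createElement('span');
    badge.textContent = String(i + 1);
    badge.style.position = 'absolute';
    badge.style.background = 'yellow';
    var rect = el.getBoundingClientRect();
    badge.style.left = rect.left + 'px';
    badge.style.top = rect.top + 'px';
    document.body.appendChild(badge);
  });
})
"""

async def mark_and_shoot(page, refs, path):
    # page is a Playwright Page; refs is a list of {'selector': ...} dicts.
    # Playwright calls the function expression above with refs as its argument.
    await page.evaluate(SOM_SCRIPT, refs)
    await page.screenshot(path=path)
```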

&lt;h3&gt;
  
  
  Enhanced Stealth Mode
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9btju8x2a4jzcb9etyz2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9btju8x2a4jzcb9etyz2.png" alt=" " width="800" height="957"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We've also beefed up stealth mode. By default, Playwright can be detected by many sites (indeed, "stock" Playwright is often blocked by modern anti-bot measures).&lt;/p&gt;

&lt;p&gt;The new toolkit launches browsers with a full suite of anti-detection flags, customizable user agents, headers, etc. You can tweak a &lt;code&gt;StealthConfig&lt;/code&gt; object to set exactly which flags or headers to use. And we maintain this even across persistent contexts or CDP connections.&lt;/p&gt;

&lt;p&gt;The bottom line: you get a much more human-like browser fingerprint without extra work.&lt;/p&gt;
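
&lt;p&gt;The shape of that configuration can be pictured roughly like this. The field names below are illustrative only, not the toolkit's actual &lt;code&gt;StealthConfig&lt;/code&gt; schema (though &lt;code&gt;--disable-blink-features=AutomationControlled&lt;/code&gt; is a real Chromium flag commonly used for this purpose):&lt;/p&gt;

```python
from dataclasses import dataclass, field

@dataclass
class StealthConfigSketch:
    # Illustrative fields only; the real StealthConfig schema may differ.
    user_agent: str = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    )
    extra_headers: dict = field(default_factory=dict)
    chrome_flags: list = field(default_factory=lambda: [
        "--disable-blink-features=AutomationControlled",
        "--no-first-run",
    ])

    def launch_args(self):
        # Collect everything the browser launcher needs in one place.
        return {"args": list(self.chrome_flags), "user_agent": self.user_agent}

cfg = StealthConfigSketch(extra_headers={"Accept-Language": "en-US,en"})
```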

&lt;h3&gt;
  
  
  Memory-Efficient Screenshots
&lt;/h3&gt;

&lt;p&gt;Other small but nice improvements include how we handle screenshots and images. In the old toolkit, screenshots were held entirely in memory and passed around as objects. Now we save screenshots to disk and only pass around file paths.&lt;/p&gt;

&lt;p&gt;This keeps memory usage low, especially when you take many screenshots in a run. The agent can still request the image (and even run vision-based analysis on it), but the heavy data lives on disk.&lt;/p&gt;
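
&lt;p&gt;The pattern is simple: write the bytes once, then pass around only the path. A minimal sketch of that handoff (the file-naming convention here is an assumption):&lt;/p&gt;

```python
import tempfile
from pathlib import Path

def save_screenshot(png_bytes, directory, label):
    # Persist raw screenshot bytes to disk and return only the file path,
    # so callers never hold the image in memory longer than needed.
    out = Path(directory) / f"{label}_som.png"
    out.write_bytes(png_bytes)
    return str(out)

tmp = tempfile.mkdtemp()
path = save_screenshot(b"\x89PNG-fake-bytes", tmp, "page123")
```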

&lt;h3&gt;
  
  
  Smarter Form Filling
&lt;/h3&gt;

&lt;p&gt;We also made form-filling smarter. You can now send multiple inputs in one command, and the toolkit will try to find the right input fields (even if you accidentally point at a container).&lt;/p&gt;

&lt;p&gt;It watches for dropdowns appearing after you type and will return just the new options (a "diff" snapshot), so you don't get overwhelmed by the whole page again. If something goes wrong, the tool tries simple recovery steps too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features at a Glance
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Multi-Mode Operation:&lt;/strong&gt; The toolkit has three modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Text Mode:&lt;/strong&gt; DOM-based automation, returning textual snapshots of element lists.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual Mode:&lt;/strong&gt; Screenshot-based, with interactive elements highlighted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid Mode:&lt;/strong&gt; Smart switching between text and visual as needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;TypeScript Core:&lt;/strong&gt; All browser work is done in a Node.js/TypeScript server. That means native Playwright calls (no bridging) and full async/await support. We get TypeScript's compile-time checks and the latest APIs instantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Better Element Handling:&lt;/strong&gt; Use real ARIA selectors and Playwright locators instead of injected IDs. E.g. click by aria-label or role. Plus, &lt;code&gt;_snapshotForAI&lt;/code&gt; returns structured data with semantic roles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instant Snapshots:&lt;/strong&gt; Every action (click/type/etc.) that changes the page returns an updated snapshot by default, so you see the new state immediately in text mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advanced Screenshot (SoM):&lt;/strong&gt; Annotated screenshots with numbered marks for each element. Optionally, an AI can analyze the image (like "find all sign-up buttons").&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intelligent Typing:&lt;/strong&gt; Typing into fields automatically detects dropdowns (autocomplete) and only returns the new suggestions (diff snapshot). If you point to a container, it will find the actual input inside and type there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Powerful Stealth:&lt;/strong&gt; Multiple Chrome flags, custom user agent/headers, persistent context, etc., to reduce bot detection. (After all, many sites try to fingerprint automation.)&lt;br&gt;
&lt;strong&gt;Flexible Connections:&lt;/strong&gt; You can launch a fresh browser via Playwright, attach to an existing Chrome/Edge via CDP (Chrome DevTools Protocol), or even hook into an AI agent via the Model Context Protocol (MCP).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool Registry:&lt;/strong&gt; The toolkit neatly separates "tools" (actions) from the core. Screenshots go to files, not memory, so you can handle them in custom agents or pipelines without huge overhead.&lt;/p&gt;
&lt;h2&gt;
  
  
  Try It: Session &amp;amp; Navigation Tools
&lt;/h2&gt;

&lt;p&gt;Let's see some examples. First, create a toolkit instance and open the browser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;camel.toolkits&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HybridBrowserToolkit&lt;/span&gt;

&lt;span class="c1"&gt;# Launch a real browser (non-headless for debugging)
&lt;/span&gt;&lt;span class="n"&gt;toolkit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HybridBrowserToolkit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_open&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;    &lt;span class="c1"&gt;# "Browser opened."
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tabs: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_tabs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, Active: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;current_tab&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Initial Snapshot:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snapshot&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your first call must be &lt;code&gt;browser_open()&lt;/code&gt;. That spins up Chromium/Chrome/Edge and returns a snapshot of whatever the default page is (typically about:blank or your start URL). You'll get something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Result: Browser opened.
Tabs: 1, Active tab index: 0
Initial Snapshot:
- link "Get Started" [ref=1]
- link "Documentation" [ref=2]
- link "GitHub" [ref=3]
- ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now navigation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Open a new tab and navigate to example.com
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_visit_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Visiting example.com: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Snapshot:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snapshot&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tabs now: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_tabs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, Active: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;current_tab&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Go back and forward
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_back&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;      &lt;span class="c1"&gt;# go back in history
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_forward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;   &lt;span class="c1"&gt;# then forward again
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;browser_visit_page(url)&lt;/code&gt; opens the URL in a new tab and switches to it. Each call makes a new tab.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;browser_back()&lt;/code&gt; and &lt;code&gt;browser_forward()&lt;/code&gt; move in the history of the current tab. They both return the updated page snapshot and tab info.&lt;/p&gt;

&lt;p&gt;For example, after visiting a couple of pages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_visit_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_visit_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com/about&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_back&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Back: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, now at &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snapshot&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Page Inspection Tools
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff1zq5bvnp03nllzhcmvb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff1zq5bvnp03nllzhcmvb.png" alt=" " width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To see what's on the page without doing anything, use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_get_page_snapshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns a textual list of all interactive elements in the current tab (links, buttons, inputs, etc.), each with a &lt;code&gt;[ref=id]&lt;/code&gt;. By default it lists the full page, but you can initialize with &lt;code&gt;viewport_limit=True&lt;/code&gt; to only see elements visible on screen. E.g.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- link "Home" [ref=1]
- button "Sign In" [ref=2]
- textbox "Search..." [ref=3]
- link "Products" [ref=4]
- ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a visual view, try:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_get_som_screenshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# e.g. "Screenshot captured with 12 interactive elements (saved to: ./screenshots/page123_som.png)"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This takes a screenshot of the page and marks every element. You can also ask the toolkit to analyze it with an AI, e.g.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_get_som_screenshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;read_image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find all buttons for submitting forms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# e.g. "Screenshot captured... Agent analysis: Found 3 form buttons: [ref=5], [ref=9], [ref=12]"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Behind the scenes, it saved an image file and ran an agent (if requested) to look at it. The raw image path is in &lt;code&gt;result['screenshotPath']&lt;/code&gt; if you need it.&lt;/p&gt;

&lt;p&gt;To inspect tabs, use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tab_info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_get_tab_info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total tabs: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tab_info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_tabs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tab&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tab_info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tabs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; (current)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tab&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is_current&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tab&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; @ &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tab&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see each tab's ID, title, and URL. This is handy to pick a tab to switch to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Switch to tab by ID (the 'id' field from tab_info)
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_switch_tab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tab_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;some_tab_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Interaction Tools
&lt;/h2&gt;

&lt;p&gt;Now for real interactions:&lt;/p&gt;

&lt;h3&gt;
  
  
  Click an Element
&lt;/h3&gt;

&lt;p&gt;Click an element by its ref:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;   &lt;span class="c1"&gt;# e.g. "Clicked on button 'Submit'"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the click opened a new tab, result will include &lt;code&gt;newTabId&lt;/code&gt;, and &lt;code&gt;current_tab&lt;/code&gt;/&lt;code&gt;total_tabs&lt;/code&gt; will update accordingly. You can then &lt;code&gt;browser_switch_tab&lt;/code&gt; to it.&lt;/p&gt;
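
&lt;p&gt;A small helper makes that pattern explicit: inspect the click result and follow the new tab if one appeared. The result keys (&lt;code&gt;newTabId&lt;/code&gt;, &lt;code&gt;result&lt;/code&gt;) come from the description above; the helper itself, and the fake toolkit used to exercise it, are just sketches:&lt;/p&gt;

```python
import asyncio

async def click_and_follow(toolkit, ref):
    # Click an element; if the click spawned a new tab, switch to it.
    result = await toolkit.browser_click(ref=ref)
    new_tab = result.get("newTabId")
    if new_tab is not None:
        await toolkit.browser_switch_tab(tab_id=new_tab)
    return result

class FakeToolkit:
    # Stand-in for HybridBrowserToolkit, used only to exercise the helper.
    def __init__(self):
        self.switched_to = None
    async def browser_click(self, ref):
        return {"result": "Clicked", "newTabId": "tab-2"}
    async def browser_switch_tab(self, tab_id):
        self.switched_to = tab_id

fake = FakeToolkit()
outcome = asyncio.run(click_and_follow(fake, "5"))
```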

&lt;h3&gt;
  
  
  Type into Input Fields
&lt;/h3&gt;

&lt;p&gt;Type into an input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Single input
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hello world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the element with &lt;code&gt;ref=3&lt;/code&gt; triggers an autocomplete dropdown, the toolkit will detect it. Instead of returning the full page again, it gives you &lt;code&gt;result['diffSnapshot']&lt;/code&gt; containing just the new options (this is the "intelligent dropdown detection"). For example, typing "San" might return:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- option "San Francisco" [ref=23]
- option "San Diego" [ref=24]
- option "San Antonio" [ref=25]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
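&lt;p&gt;The option lines above follow a regular &lt;code&gt;- option "LABEL" [ref=ID]&lt;/code&gt; shape, so a small parser (a sketch, not part of the toolkit) can map labels to refs:&lt;/p&gt;

```python
import re

# Sketch: map option labels to refs in a diffSnapshot string
# of the form shown above: - option "San Francisco" [ref=23]
OPTION_RE = re.compile(r'-\s*option\s+"([^"]+)"\s+\[ref=([^\]]+)\]')

def parse_options(diff_snapshot):
    return dict(OPTION_RE.findall(diff_snapshot))

options = parse_options(
    '- option "San Francisco" [ref=23]\n'
    '- option "San Diego" [ref=24]\n'
)
print(options)  # {'San Francisco': '23', 'San Diego': '24'}
```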



&lt;p&gt;Then you can click one of those by ref. If you have multiple fields to fill, just pass a list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ref&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;John&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ref&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Doe&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ref&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;john.doe@example.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;details&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# shows success/failure per field
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
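&lt;p&gt;Each entry in &lt;code&gt;result['details']&lt;/code&gt; reports one field; a tiny helper (a sketch — the &lt;code&gt;ref&lt;/code&gt; and &lt;code&gt;success&lt;/code&gt; keys are assumed, not confirmed API) can collect the failures:&lt;/p&gt;

```python
# Sketch: list the refs that failed to fill.
# Assumes each detail entry is a dict with hypothetical
# 'ref' and 'success' keys.
def failed_refs(details):
    return [d["ref"] for d in details if not d.get("success", False)]

print(failed_refs([
    {"ref": "3", "success": True},
    {"ref": "4", "success": False},
    {"ref": "5", "success": True},
]))  # ['4']
```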



&lt;h3&gt;
  
  
  Select Dropdowns
&lt;/h3&gt;

&lt;p&gt;Select (for &lt;code&gt;&amp;lt;select&amp;gt;&lt;/code&gt; dropdowns):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;country-select&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;US&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You must provide the option's &lt;code&gt;value&lt;/code&gt; attribute, not its visible text. (If needed, you can call &lt;code&gt;browser_get_page_snapshot()&lt;/code&gt; first to see element refs.)&lt;/p&gt;

&lt;h3&gt;
  
  
  Enter Key
&lt;/h3&gt;

&lt;p&gt;Press Enter (to submit a form, etc.):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_enter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This simulates pressing Enter in the currently focused field. It's handy after typing search terms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scroll
&lt;/h3&gt;

&lt;p&gt;Scroll the page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_scroll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;direction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;down&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use "up" or "down", with optional pixel amount. It returns the new snapshot. You can loop scrolls to load more content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_scroll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;down&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snapshot&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;  &lt;span class="c1"&gt;# no new content
&lt;/span&gt;    &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snapshot&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Mouse Control
&lt;/h3&gt;

&lt;p&gt;Mouse control by coordinates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_mouse_control&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;control&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;click&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;350.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_mouse_control&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;control&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dblclick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;123.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;456.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_mouse_control&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;control&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;right_click&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Useful for canvas or image-map interactions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Drag and Drop
&lt;/h3&gt;

&lt;p&gt;Mouse drag-and-drop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_mouse_drag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;item-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trash-bin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Drag the element with &lt;code&gt;ref="item-5"&lt;/code&gt; onto &lt;code&gt;ref="trash-bin"&lt;/code&gt;. Handy for reordering lists or moving files in web UIs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Press Keys
&lt;/h3&gt;

&lt;p&gt;Press keys/combinations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_press_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tab&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_press_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Control+a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# select all
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_press_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alt+Left&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;   &lt;span class="c1"&gt;# back in history
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_press_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;F5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;         &lt;span class="c1"&gt;# refresh
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Send any key or combo. The toolkit uses Playwright's key syntax.&lt;/p&gt;
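&lt;p&gt;Playwright joins modifiers and the final key with &lt;code&gt;+&lt;/code&gt;, so if you build combos programmatically a one-line helper is enough (a sketch, not a toolkit API):&lt;/p&gt;

```python
# Sketch: build a key combo string in Playwright's syntax,
# e.g. ("Control", "Shift", "K") -> "Control+Shift+K".
def key_combo(*keys):
    return "+".join(keys)

print(key_combo("Control", "Shift", "K"))  # Control+Shift+K
# Hypothetical usage:
# await toolkit.browser_press_key(keys=[key_combo("Control", "a")])
```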

&lt;h2&gt;
  
  
  Tab Management
&lt;/h2&gt;

&lt;p&gt;Working with multiple tabs is easy:&lt;/p&gt;

&lt;h3&gt;
  
  
  Switch Tab
&lt;/h3&gt;

&lt;p&gt;Switch tab by ID (from &lt;code&gt;browser_get_tab_info&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_switch_tab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tab_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;some_tab_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This activates that tab and returns its snapshot.&lt;/p&gt;

&lt;h3&gt;
  
  
  Close Tab
&lt;/h3&gt;

&lt;p&gt;Close a tab:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_close_tab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tab_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;some_tab_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After closing, it returns info on the remaining tabs.&lt;/p&gt;

&lt;p&gt;You can, for instance, close every tab except the current one by iterating through them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tab_info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_get_tab_info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tab&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tab_info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tabs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;tab&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is_current&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_close_tab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tab_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tab&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Console Commands
&lt;/h3&gt;

&lt;p&gt;You can execute arbitrary JavaScript on the page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_console_exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;return window.location.href&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Current URL:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And view console logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;logs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_console_view&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;console_messages&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Advanced &amp;amp; Utility
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Wait for Manual Step
&lt;/h3&gt;

&lt;p&gt;Sometimes you need a human in the loop (e.g. to solve a CAPTCHA). Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_wait_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout_sec&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User resumed, snapshot after:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snapshot&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Wait timed out.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pauses execution and shows the last snapshot. When the user presses Enter (or the timeout elapses), it returns control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Combine It All
&lt;/h3&gt;

&lt;p&gt;Here's a mini example putting a few tools together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;toolkit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HybridBrowserToolkit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_open&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_visit_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Look for a product link and click it
&lt;/span&gt;    &lt;span class="n"&gt;snap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_get_page_snapshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Suppose ref=7 is "Products"
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Now add to cart and checkout
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;add-to-cart&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Fill checkout form
&lt;/span&gt;    &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ref&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ref&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alice@example.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ref&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1 Developer Way&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shipping&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standard&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_console_exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;return document.querySelector(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;form&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;).checkValidity()&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;place-order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was just a taste. The Hybrid Browser Toolkit provides all the basic navigation and interaction tools you'd expect, plus some powerful extras (like smart screenshots and AI-assisted analysis) to help you automate complex tasks smoothly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Operating Modes: Text vs. Visual vs. Hybrid
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Text Mode&lt;/strong&gt; is the default: every action returns a text snapshot. It's lightweight and great for pure data tasks (like scraping or filling forms). Each element is listed with a &lt;code&gt;[ref=ID]&lt;/code&gt; and a label. If you initialize with &lt;code&gt;full_visual_mode=True&lt;/code&gt;, then actions don't auto-return snapshots (fast mode); you can still call &lt;code&gt;browser_get_page_snapshot()&lt;/code&gt; manually when you need it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visual Mode&lt;/strong&gt; uses screenshots. The &lt;code&gt;browser_get_som_screenshot()&lt;/code&gt; tool we saw is the core of this mode. It's ideal for verifying layouts, catching visual glitches, or when a human needs to see something. You'll often toggle visual mode on when you need to confirm that a button is visible, or to show the agent exactly what's on screen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid Mode&lt;/strong&gt; is smart: it uses text mode by default, but seamlessly takes and interprets screenshots when needed (or as requested). For example, you might click through forms in text mode, then do one final screenshot with AI analysis to "spot check" the result.&lt;/p&gt;

&lt;p&gt;A good rule of thumb:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;Text Mode&lt;/strong&gt; for most automation (fast, headless, easy parsing).&lt;/li&gt;
&lt;li&gt;Switch to &lt;strong&gt;Visual Mode&lt;/strong&gt; when you need the UI context (e.g. for CAPTCHAs, complex UIs, or human verification).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Combine Both&lt;/strong&gt; as needed. E.g., click by refs in text mode, then verify with a screenshot.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Connection Modes: Playwright vs CDP vs MCP
&lt;/h2&gt;

&lt;p&gt;Finally, how do we connect to the browser?&lt;/p&gt;

&lt;h3&gt;
  
  
  Standard Playwright (default)
&lt;/h3&gt;

&lt;p&gt;The toolkit launches and manages its own browser instance. Just instantiate &lt;code&gt;HybridBrowserToolkit()&lt;/code&gt; and call &lt;code&gt;browser_open()&lt;/code&gt;. You can set &lt;code&gt;headless=True/False&lt;/code&gt;, &lt;code&gt;user_data_dir&lt;/code&gt; for persistence, timeouts, etc. Use this when you just want an isolated browser.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chrome DevTools Protocol (CDP)
&lt;/h3&gt;

&lt;p&gt;This lets you attach to an already running browser (Chrome/Edge/Chromium) that was started with &lt;code&gt;--remote-debugging-port&lt;/code&gt;. For example, start Chrome manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;google-chrome &lt;span class="nt"&gt;--remote-debugging-port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;9222 &lt;span class="nt"&gt;--user-data-dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/tmp/chrome-profile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://localhost:9222/json/version&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;webSocketDebuggerUrl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;toolkit_cdp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HybridBrowserToolkit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cdp_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# No need to call browser_open(); it's already running
&lt;/span&gt;&lt;span class="n"&gt;tab_info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit_cdp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_get_tab_info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Connected to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tab_info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_tabs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tabs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CDP is the same protocol the Chrome DevTools UI uses to talk to the browser (see &lt;a href="https://chromedevtools.github.io" rel="noopener noreferrer"&gt;chromedevtools.github.io&lt;/a&gt;), so any browser with remote debugging enabled can be controlled. You can even set &lt;code&gt;cdp_keep_current_page=True&lt;/code&gt; to make the toolkit use the current page instead of opening a new one.&lt;/p&gt;
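&lt;p&gt;The &lt;code&gt;webSocketDebuggerUrl&lt;/code&gt; lookup from the snippet above is worth wrapping in a tiny helper so a missing debugging port fails with a clear message instead of a bare &lt;code&gt;KeyError&lt;/code&gt;. This is a generic sketch; the sample payload below is illustrative, shaped like Chrome's real &lt;code&gt;/json/version&lt;/code&gt; response.&lt;/p&gt;

```python
def debugger_ws_url(version_info: dict, key: str = "webSocketDebuggerUrl") -> str:
    """Pull the browser-level WebSocket URL out of a /json/version payload."""
    try:
        return version_info[key]
    except KeyError:
        raise RuntimeError(
            "No webSocketDebuggerUrl in payload; was the browser started "
            "with --remote-debugging-port?"
        ) from None

# A payload shaped like Chrome's real response (values are illustrative):
sample = {
    "Browser": "Chrome/128.0.0.0",
    "webSocketDebuggerUrl": "ws://localhost:9222/devtools/browser/abc123",
}
print(debugger_ws_url(sample))
# ws://localhost:9222/devtools/browser/abc123
```

&lt;p&gt;The returned URL is exactly what you'd pass as &lt;code&gt;cdp_url&lt;/code&gt; when constructing the toolkit.&lt;/p&gt;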

&lt;h3&gt;
  
  
  MCP (Model Context Protocol)
&lt;/h3&gt;

&lt;p&gt;This is for connecting the toolkit to an AI assistant (like Claude Desktop) so the AI can call these browser tools as if they were native functions. Here's how to set it up:&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1. Install the MCP Server&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/camel-ai/browser_agent.git
&lt;span class="nb"&gt;cd &lt;/span&gt;browser_agent
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;2. Configure Claude Desktop&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Add to your Claude configuration file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;macOS&lt;/strong&gt;: &lt;code&gt;~/Library/Application Support/Claude/claude_desktop_config.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Windows&lt;/strong&gt;: &lt;code&gt;%APPDATA%\Claude\claude_desktop_config.json&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"hybrid-browser"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-m"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hybrid_browser_mcp.server"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
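&lt;p&gt;If you'd rather script this step, the merge can be done with a few lines of Python. This is a generic sketch (not part of the toolkit): it preserves any servers already in the file and writes the same entry as the JSON above.&lt;/p&gt;

```python
import json
from pathlib import Path

def add_mcp_server(config_path: Path, name: str, entry: dict) -> dict:
    """Merge one server entry into claude_desktop_config.json, keeping
    any servers that are already configured."""
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text() or "{}")
    config.setdefault("mcpServers", {})[name] = entry
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2))
    return config

# Mirrors the JSON block above; adjust config_path for your OS.
entry = {"command": "python", "args": ["-m", "hybrid_browser_mcp.server"]}
cfg = add_mcp_server(Path("claude_desktop_config.json"), "hybrid-browser", entry)
print(sorted(cfg["mcpServers"]))
```

&lt;p&gt;Point &lt;code&gt;config_path&lt;/code&gt; at the macOS or Windows location listed above for your setup.&lt;/p&gt;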



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2c2ksxayol6zt4klfv15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2c2ksxayol6zt4klfv15.png" alt=" " width="800" height="639"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3. Restart Claude Desktop&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fei6ln5wuog081xji0npb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fei6ln5wuog081xji0npb.png" alt=" " width="800" height="641"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After adding the configuration, completely restart Claude Desktop. The browser tools will appear when you click the 🔌 icon in the chat interface.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Available Browser Tools&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Once connected, you'll have access to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Navigation&lt;/strong&gt;: &lt;code&gt;browser_open&lt;/code&gt;, &lt;code&gt;browser_visit_page&lt;/code&gt;, &lt;code&gt;browser_back&lt;/code&gt;, &lt;code&gt;browser_forward&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interaction&lt;/strong&gt;: &lt;code&gt;browser_click&lt;/code&gt;, &lt;code&gt;browser_type&lt;/code&gt;, &lt;code&gt;browser_select&lt;/code&gt;, &lt;code&gt;browser_scroll&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Screenshots&lt;/strong&gt;: &lt;code&gt;browser_get_som_screenshot&lt;/code&gt; (captures page with clickable elements marked)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tab Management&lt;/strong&gt;: &lt;code&gt;browser_switch_tab&lt;/code&gt;, &lt;code&gt;browser_close_tab&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced&lt;/strong&gt;: &lt;code&gt;browser_console_exec&lt;/code&gt;, &lt;code&gt;browser_mouse_control&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Basic Usage Example&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Claude can now control browsers with simple commands:
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;browser_open&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;browser_visit_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;browser_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI automation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;browser_click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;submit-button&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;browser_get_som_screenshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;browser_close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;Customization&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Modify browser behavior in &lt;code&gt;browser_agent/config.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;BROWSER_CONFIG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;headless&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# Show browser window
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stealth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# Avoid bot detection
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enabled_tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[...]&lt;/span&gt; &lt;span class="c1"&gt;# Specify which tools to enable
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;In summary, the Hybrid Browser Toolkit is a major upgrade over the old screenshot-only BrowserToolkit. We still give you a friendly Python API to work with, but under the hood we're speaking the browser's native language via TypeScript.&lt;/p&gt;

&lt;p&gt;That means faster, more reliable interactions and access to shiny new features like Playwright's accessibility snapshots. Whether you need lightning-fast DOM scraping or human-like visual checks (or both!), this toolkit handles it.&lt;/p&gt;

&lt;p&gt;It also plays well with modern workflows. Want to connect to an existing Chrome? No problem (thanks to CDP). Want your AI agent to browse the web? Check out MCP integration.&lt;/p&gt;

&lt;p&gt;From practical navigation (click, type, scroll) to advanced tricks (Set-of-Marks screenshots, smart autocomplete typing, multi-tab management), everything's here.&lt;/p&gt;

&lt;p&gt;Give it a spin, and let us know what you build with it. Welcome to the new era of browser automation with CAMEL's Hybrid Browser Toolkit: it's like taking off those oven mitts and driving with all the precision you wanted, at full speed.&lt;/p&gt;

&lt;p&gt;Happy automating!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>javascript</category>
      <category>automation</category>
    </item>
    <item>
      <title>We hired AI to do Growth Engineering and here’s what happened</title>
      <dc:creator>Nomadev</dc:creator>
      <pubDate>Wed, 17 Sep 2025 10:16:00 +0000</pubDate>
      <link>https://forem.com/camelai/we-hired-ai-to-do-growth-engineering-and-heres-what-happened-4ad0</link>
      <guid>https://forem.com/camelai/we-hired-ai-to-do-growth-engineering-and-heres-what-happened-4ad0</guid>
      <description>&lt;p&gt;In open source projects, time is precious. Maintainers juggle bug fixes, feature requests, community support, and documentation, all while trying to keep code secure and releases organized. One repetitive but crucial task is &lt;strong&gt;reviewing pull requests and preparing release updates&lt;/strong&gt;. It's necessary, but it eats up hours that could be spent innovating.&lt;/p&gt;

&lt;p&gt;In our work at &lt;a href="https://www.camel-ai.org/" rel="noopener noreferrer"&gt;CAMEL-AI&lt;/a&gt;, open-source contributions move fast. Every week, our team spends time reviewing pull requests, highlighting key changes, and preparing release notes. It’s important work, but also repetitive — hours get lost in scanning PRs, checking impact, and formatting updates.&lt;/p&gt;

&lt;p&gt;This time, instead of doing it manually, we asked ourselves: what if a multi-agent system could take over this process?&lt;/p&gt;

&lt;p&gt;That’s when we decided to try it with Eigent and a custom MCP server for GitHub. The idea was simple: let AI agents handle the weekly workflow, from fetching PRs to summarizing them and even drafting release-ready notes and short posts.&lt;/p&gt;

&lt;p&gt;What if automation could handle the grunt work for you? That's where Eigent steps in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F20axnnnosg6htpsxqpkh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F20axnnnosg6htpsxqpkh.png" alt=" " width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.eigent.ai/" rel="noopener noreferrer"&gt;Eigent&lt;/a&gt;&lt;/strong&gt; is the world's first &lt;strong&gt;Multi-agent Workforce&lt;/strong&gt; desktop application, empowering you to build, manage, and deploy a custom AI workforce that can turn your most complex workflows into automated tasks. It's a &lt;strong&gt;modular, multi-agent system&lt;/strong&gt; that can break down complex tasks and handle them through specialized agents working in coordination.&lt;/p&gt;

&lt;p&gt;Eigent's &lt;strong&gt;multi-agent coordination platform&lt;/strong&gt; boosts productivity by turning your workflows into automated tasks. Built on the open-source CAMEL framework, it brings parallel execution, customization, and privacy to your AI automation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What can Eigent do for you?&lt;/strong&gt; For first-time readers, consider Eigent as a flexible agentic assistant. You can create different "workers" (AI agents) with domain-specific skills (e.g. coding, documentation, DevOps) and have them collaborate on tasks. Some examples of technical workflows Eigent can simplify include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub automation with AI agents:&lt;/strong&gt; Reviewing code changes, summarizing pull requests, triaging issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Release note generation:&lt;/strong&gt; Automatically compiling highlights of what's new in each release.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation and code analysis:&lt;/strong&gt; Extracting key points from docs or codebases, suggesting improvements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source workflows:&lt;/strong&gt; Keeping track of project activity, generating reports for contributors, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this guide, we'll show you &lt;strong&gt;how to configure a custom GitHub MCP server inside Eigent&lt;/strong&gt; and set up an agent workflow that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Fetches new pull requests from a repo&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extracts and analyzes PR data&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Formats the highlights into release-ready notes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generates a short social post (e.g. for Twitter/X)&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's dive into the step-by-step guide!&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Open Eigent and Navigate to MCP &amp;amp; Tools Settings
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39pmrfvrvei10w05gkcw.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39pmrfvrvei10w05gkcw.gif" alt=" " width="1024" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have Eigent running, begin by opening the &lt;strong&gt;Settings&lt;/strong&gt; panel. In the Settings, find and click on the &lt;strong&gt;"MCP &amp;amp; Tools"&lt;/strong&gt; section. This is where you can configure external tools and servers for your AI agents. We'll use this area to add a new custom MCP server for GitHub tasks.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Eigent's Settings interface. Navigate to the &lt;strong&gt;MCP &amp;amp; Tools&lt;/strong&gt; tab to configure external AI tools and servers.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;em&gt;MCP &amp;amp; Tools&lt;/em&gt; tab, you'll see a list of available tools and any configured MCP servers. By default, Eigent might include some basic tools (e.g. web search, code execution). To add our own, look for an &lt;strong&gt;"Add MCP Server"&lt;/strong&gt; button (usually a &lt;strong&gt;+&lt;/strong&gt; or a labeled button) and click it. This will open a dialog where you can input a JSON configuration for the new server.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 2: Add a Custom MCP Server via JSON Configuration
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuh4bf1hnw0ck18idk2i1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuh4bf1hnw0ck18idk2i1.png" alt=" " width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Eigent allows advanced users to add custom agent servers by providing a JSON config. In the &lt;strong&gt;Add MCP Server&lt;/strong&gt; dialog that opened, you'll see a text area to paste JSON. We're going to add a &lt;strong&gt;sequential-thinking&lt;/strong&gt; MCP server - this is a general-purpose AI reasoning engine that can coordinate tasks (perfect for breaking down complex prompts). We will also tie it into GitHub by providing the GitHub integration toolset and our credentials.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Adding a new MCP server via JSON configuration. Paste in the JSON definition for the &lt;strong&gt;sequential-thinking&lt;/strong&gt; server.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The JSON defines how Eigent should launch the external agent server. For our use case, we'll use Node's &lt;strong&gt;&lt;code&gt;npx&lt;/code&gt;&lt;/strong&gt; to run the &lt;strong&gt;Sequential Thinking&lt;/strong&gt; server package, and include the official GitHub MCP tool. Below is the JSON structure to use (as provided by Eigent's docs and examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sequential-thinking"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@modelcontextprotocol/server-sequential-thinking"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Configure the GitHub MCP Server Settings (Include Your PAT)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fvbac8wej56ng4m6qqz.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fvbac8wej56ng4m6qqz.gif" alt=" " width="720" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before finalizing the MCP server setup, include your &lt;strong&gt;GitHub Personal Access Token (PAT)&lt;/strong&gt; in the configuration. This token will allow the agent to authenticate with the GitHub API and fetch repository data. You should generate a PAT from your GitHub account (with at least read access to repos; for public repos a classic token with default public scopes is sufficient). In the JSON, we'll add an environment variable for the token and specify the GitHub toolset.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Configuring the GitHub MCP server by adding environment variables. Provide your &lt;strong&gt;GitHub PAT&lt;/strong&gt; in the JSON config so the agent can access the GitHub API.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To integrate the GitHub tools, modify the JSON as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add the GitHub MCP server container to the arguments.&lt;/li&gt;
&lt;li&gt;Set the environment variable for your token.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, you can extend the &lt;strong&gt;&lt;code&gt;"args"&lt;/code&gt;&lt;/strong&gt; array to include the GitHub server image and use the &lt;strong&gt;&lt;code&gt;"env"&lt;/code&gt;&lt;/strong&gt; field for the token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sequential-thinking"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@modelcontextprotocol/server-sequential-thinking"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"ghcr.io/github/github-mcp-server"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"GITHUB_PERSONAL_ACCESS_TOKEN"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ghp_yourGitHubTokenHere"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this configuration, we pass the official &lt;strong&gt;GitHub MCP server&lt;/strong&gt; (hosted at &lt;a href="http://ghcr.io/github/github-mcp-server" rel="noopener noreferrer"&gt;ghcr.io/github/github-mcp-server&lt;/a&gt;) as an argument to the sequential-thinking agent. The sequential agent will spin up the GitHub toolset internally. We also set &lt;code&gt;GITHUB_PERSONAL_ACCESS_TOKEN&lt;/code&gt; in the environment so the agent can authenticate to GitHub. &lt;em&gt;(Make sure to replace &lt;code&gt;"ghp_yourGitHubTokenHere"&lt;/code&gt; with your actual PAT.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once the JSON is ready, click &lt;strong&gt;Install&lt;/strong&gt; or &lt;strong&gt;Add&lt;/strong&gt; to save the MCP server. Eigent will download and initialize the server in the background. After a moment, you should see the new server listed in your MCP tools, indicating a successful installation.&lt;/p&gt;
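&lt;p&gt;A quick word of caution: avoid leaving the PAT hard-coded in saved JSON. A small, hypothetical helper can inject it from an environment variable right before you paste the config (the variable name &lt;code&gt;GITHUB_PAT&lt;/code&gt; here is an assumption of this sketch, not an Eigent convention):&lt;/p&gt;

```python
import json
import os

def inject_pat(config_json: str, env_var: str = "GITHUB_PAT") -> str:
    """Insert a PAT read from the environment into the MCP config,
    so the secret never lives in the saved JSON template."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set {env_var} before generating the config")
    config = json.loads(config_json)
    server = config["mcpServers"]["sequential-thinking"]
    server.setdefault("env", {})["GITHUB_PERSONAL_ACCESS_TOKEN"] = token
    return json.dumps(config, indent=2)

# Template mirroring the JSON above, minus the hard-coded token:
template = json.dumps({
    "mcpServers": {
        "sequential-thinking": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-sequential-thinking",
                     "ghcr.io/github/github-mcp-server"],
        }
    }
})
os.environ["GITHUB_PAT"] = "ghp_exampleToken"  # for demonstration only
print("GITHUB_PERSONAL_ACCESS_TOKEN" in inject_pat(template))
# True
```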

&lt;h3&gt;
  
  
  Step 4: Add a GitHub-Focused Worker (Agent) Using the New MCP Server
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fio64czclz7a3uw60i2op.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fio64czclz7a3uw60i2op.gif" alt=" " width="720" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that the MCP server is configured, we need to create a Worker that uses this server. In Eigent, a "Worker" is essentially an AI agent persona that can carry out tasks using a specified toolset or MCP server. Navigate back to the main &lt;strong&gt;Workforce&lt;/strong&gt; or &lt;strong&gt;Agents&lt;/strong&gt; screen (often the home screen showing your AI workers). Look for an &lt;strong&gt;"Add Worker"&lt;/strong&gt; or &lt;strong&gt;"+"&lt;/strong&gt; button to create a new agent.&lt;/p&gt;

&lt;p&gt;When the &lt;strong&gt;Add Worker&lt;/strong&gt; dialog appears, enter a name and description for your new agent. For example, name it &lt;strong&gt;"GitHub MCP"&lt;/strong&gt; and describe it as "Helps around GitHub Tasks". Most importantly, assign the &lt;strong&gt;Agent Tool&lt;/strong&gt; to the MCP server we just added (it might appear in a dropdown as "sequential-thinking" or whatever name you gave it). This ensures your new worker will utilize the GitHub-enabled sequential thinking agent.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Creating a new Worker agent for GitHub tasks. Give it a name (e.g. "GitHub PR Reviewer") and select the &lt;strong&gt;GitHub MCP&lt;/strong&gt; server as the agent's tool.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After filling in the details and selecting the correct MCP server, save the worker. You should now see a new agent in your AI workforce list. This agent is essentially your &lt;strong&gt;GitHub automation assistant&lt;/strong&gt;, equipped with the ability to reason through tasks and interact with GitHub data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Prompt the Agent to Summarize Pull Requests
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshzloyxfaii0wnvr5tyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshzloyxfaii0wnvr5tyy.png" alt=" " width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the GitHub-enabled agent up and running, it's time to put it to work. Open a chat or command interface with your new worker (in Eigent, clicking the worker might open a chat panel where you can give it instructions). We'll provide a task prompt asking the agent to review pull requests from a repository and summarize them.&lt;/p&gt;

&lt;p&gt;As an example, try a detailed prompt like this one:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Review the 30 latest pull requests from the repo &lt;a href="https://github.com/camel-ai/camel" rel="noopener noreferrer"&gt;https://github.com/camel-ai/camel&lt;/a&gt;. Select the top 5 by impact (lines changed, files touched, or discussion depth). For each selected PR, generate a release-ready update in this format: ✨ Feature: &amp;lt;catchy one-liner summary&amp;gt; 💡 Why it matters: &amp;lt;short bullet-point explanation&amp;gt; 🙏 Thanks @&amp;lt;GitHubAuthor&amp;gt;...&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Entering a prompt for the GitHub agent to review recent PRs and produce summaries. This complex instruction asks the AI to fetch the latest 30 PRs, pick the most impactful ones, and format a brief release note for each.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the chat, paste or type in the prompt (as shown above) and hit send. This instructs the agent to automate a common open-source workflow: analyzing recent pull requests in the &lt;strong&gt;camel-ai/camel&lt;/strong&gt; repo and preparing a synopsis of important changes. You can customize the repository URL or criteria as needed - for instance, use your own project's repo link. The key is that our agent now has the tools (via MCP) to fetch GitHub data and the reasoning ability to summarize it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Watch Eigent Automatically Break Down the Task and Fetch Data
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dfw49zliobuqvg5s6ld.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dfw49zliobuqvg5s6ld.png" alt=" " width="800" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you send the prompt, &lt;strong&gt;Eigent's multi-agent engine kicks in&lt;/strong&gt;. The request is fairly complex, but Eigent will handle it by dividing the work into manageable subtasks. Behind the scenes, the Sequential Thinking MCP server interprets the instruction and decides on a plan. It may do something like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetch the list of the latest 30 PRs from the specified repository (using the GitHub MCP tool).&lt;/li&gt;
&lt;li&gt;Analyze each PR's metadata (lines changed, files, comments) to determine "impact".&lt;/li&gt;
&lt;li&gt;Pick the top 5 PRs based on the criteria.&lt;/li&gt;
&lt;li&gt;For each of those PRs, compose a summary in the requested format (✨ Feature, 💡 Why it matters, 🙏 Thanks...).&lt;/li&gt;
&lt;li&gt;Possibly also prepare a condensed version for X (Twitter) if requested, or any additional subtasks inferred.&lt;/li&gt;
&lt;/ol&gt;
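&lt;p&gt;To make subtasks 2 and 3 concrete, here is a minimal Python sketch of how such an impact ranking could work. The field names, weights, and sample data are illustrative assumptions, not Eigent's actual planner logic:&lt;/p&gt;

```python
# Illustrative sketch (not Eigent internals): rank PRs by a simple "impact"
# score combining lines changed, files touched, and discussion depth,
# then keep the top N.

def impact_score(pr):
    # Weighted sum; these weights are arbitrary assumptions.
    return pr["additions"] + pr["deletions"] + 10 * pr["changed_files"] + 5 * pr["comments"]

def top_by_impact(prs, n=5):
    return sorted(prs, key=impact_score, reverse=True)[:n]

# Hypothetical metadata for three PRs:
prs = [
    {"number": 101, "additions": 40, "deletions": 5, "changed_files": 2, "comments": 1},
    {"number": 102, "additions": 900, "deletions": 120, "changed_files": 14, "comments": 9},
    {"number": 103, "additions": 10, "deletions": 2, "changed_files": 1, "comments": 30},
]
print([pr["number"] for pr in top_by_impact(prs, n=2)])  # [102, 103]
```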

&lt;p&gt;Eigent actually &lt;strong&gt;displays the subtask breakdown&lt;/strong&gt; in the interface, so you can see the agent's thought process. It might list steps it's taking, which makes it transparent and debug-friendly. For example, the agent may explicitly show a step to retrieve PR data and then a step to filter them by impact. This showcases Eigent's dynamic task planning: &lt;em&gt;"Eigent dynamically breaks down tasks and activates multiple agents to work in parallel, automating complex tasks much faster than traditional single-agent workflows"&lt;/em&gt; &lt;a href="https://docs.eigent.ai/get_started/welcome#:~:text=Eigent%20dynamically%20breaks%20down%20tasks,step%20scenarios%20with%20ease" rel="noopener noreferrer"&gt;Eigent Docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The GitHub agent (powered by the &lt;a href="https://github.com/github/github-mcp-server" rel="noopener noreferrer"&gt;MCP server&lt;/a&gt;) fetching repository data. Here the agent executed a subtask to retrieve PR details via the GitHub API, returning JSON data almost instantly.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In our case, the first subtask is to call GitHub and get details of the latest 30 PRs. The agent, using the GitHub MCP, does this in seconds and obtains a JSON array of PR info (IDs, titles, authors, lines changed, etc.). Next, the agent evaluates which PRs have the largest impact. Another subtask might involve sorting or filtering the list by those metrics. Once the top 5 PRs are identified, the agent generates the summary for each.&lt;/p&gt;
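&lt;p&gt;For reference, the first subtask corresponds to a standard GitHub REST request. The agent issues it through the GitHub MCP server, but the equivalent endpoint can be sketched directly (this only builds the request URL; no network call is made):&lt;/p&gt;

```python
# Offline sketch of the GitHub API request behind the first subtask.
# The endpoint and query parameters are part of the public GitHub REST API.
from urllib.parse import urlencode

def pulls_url(owner, repo, per_page=30):
    # List PRs in any state, newest first.
    query = urlencode({"state": "all", "per_page": per_page, "sort": "created", "direction": "desc"})
    return "https://api.github.com/repos/{}/{}/pulls?{}".format(owner, repo, query)

url = pulls_url("camel-ai", "camel")
print(url)
```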

&lt;p&gt;Finally, the agent produces the &lt;strong&gt;output&lt;/strong&gt;: a neatly formatted set of release-ready updates for the top 5 PRs. The result is typically presented in the chat as Markdown text (since we asked for a release update format). Each update might look like:&lt;/p&gt;

&lt;p&gt;✨ &lt;strong&gt;Feature:&lt;/strong&gt; Added comprehensive model table and requirements badges to docs&lt;/p&gt;

&lt;p&gt;💡 &lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provides quick, up-to-date model info right in the documentation&lt;/li&gt;
&lt;li&gt;Helps users assess at a glance what's available and what's required&lt;/li&gt;
&lt;li&gt;Elevates project transparency and onboarding experience&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🙏 Thanks @wendongfan for this integration&lt;/p&gt;

&lt;p&gt;PR link: &lt;a href="https://github.com/camel-ai/camel/pull/1343" rel="noopener noreferrer"&gt;https://github.com/camel-ai/camel/pull/1343&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(The above are illustrative examples.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You would see five such entries corresponding to the top PRs. The agent might also provide a shorter "X-posting" version (e.g. a tweet-worthy one-liner) if that was part of the prompt. The outcome is that you have, in a few moments, a draft of changelog/release notes highlights, complete with acknowledgments to contributors.&lt;/p&gt;
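&lt;p&gt;The release-note format shown above can also be produced mechanically once the PR metadata is in hand. Here is a small helper in that spirit; the dictionary fields are hypothetical placeholders, not real repository data:&lt;/p&gt;

```python
# Turn one PR's metadata into the ✨ / 💡 / 🙏 release-note format from the prompt.

def release_note(pr):
    lines = [
        "✨ Feature: {}".format(pr["summary"]),
        "💡 Why it matters:",
    ]
    for point in pr["why"]:
        lines.append("- {}".format(point))
    lines.append("🙏 Thanks @{}".format(pr["author"]))
    lines.append("PR link: {}".format(pr["url"]))
    return "\n".join(lines)

note = release_note({
    "summary": "Added model table to docs",
    "why": ["Quick model info in one place"],
    "author": "wendongfan",
    "url": "https://github.com/camel-ai/camel/pull/1343",
})
print(note)
```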

&lt;h3&gt;
  
  
  Empowering OSS Workflows with Agentic Automation
&lt;/h3&gt;

&lt;p&gt;In this tutorial, we configured Eigent to automate an open-source maintenance task—summarizing GitHub pull requests—using an AI agent. We introduced a custom &lt;strong&gt;&lt;a href="https://github.com/github/github-mcp-server" rel="noopener noreferrer"&gt;GitHub MCP server&lt;/a&gt;&lt;/strong&gt; into Eigent, created a dedicated worker, and successfully generated release note snippets from live repository data. The process demonstrates the power of &lt;strong&gt;agentic automation for OSS contributors&lt;/strong&gt;: instead of manually combing through PRs, maintainers can rely on AI agents to do the heavy lifting. By leveraging Eigent's &lt;strong&gt;MCP integration&lt;/strong&gt; and multi-agent coordination, even complex workflows (like triaging dozens of PRs) can be handled efficiently by AI, freeing you to focus on higher-level decisions.&lt;/p&gt;

&lt;p&gt;Eigent makes it approachable for both developers and non-developers to harness multi-agent AI. With a few simple steps, you can &lt;strong&gt;configure MCP for open-source workflows&lt;/strong&gt; and let your personalized AI workforce assist you. This was just one example—&lt;strong&gt;Eigent&lt;/strong&gt; can be tailored to many scenarios, from writing summaries and managing issues to testing code or updating documentation. As the platform evolves, the possibilities for GitHub automation with AI agents will only grow.&lt;/p&gt;

&lt;p&gt;Give Eigent a try in your own projects, and enjoy the productivity boost of having an AI-powered team on your side! The future of open-source collaboration might just be a mix of human passion and tireless AI assistants working together. 🚀&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Happy automating!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agentaichallenge</category>
      <category>powerfuldevs</category>
      <category>beginners</category>
    </item>
    <item>
      <title>The New Era of Automation: How OWL, CRAB, and MCP Are Bridging the Last Mile</title>
      <dc:creator>Nomadev</dc:creator>
      <pubDate>Mon, 14 Apr 2025 20:34:23 +0000</pubDate>
      <link>https://forem.com/camelai/the-new-era-of-automation-how-owl-crab-and-mcp-are-bridging-the-last-mile-231a</link>
      <guid>https://forem.com/camelai/the-new-era-of-automation-how-owl-crab-and-mcp-are-bridging-the-last-mile-231a</guid>
      <description>&lt;p&gt;The field of autonomous agents is experiencing a renaissance. These AI systems—designed to reason, interact with tools, and complete complex tasks—are making rapid and tangible progress. From cutting-edge research frameworks to powerful platforms enabling agents to manage incredibly intricate workflows. These systems are no longer just promising demos, they’re beginning to reshape how we think about digital labor and automation.&lt;/p&gt;

&lt;p&gt;A key enabler of this progress is the &lt;strong&gt;Model Context Protocol&lt;/strong&gt; (MCP), introduced by &lt;strong&gt;Anthropic&lt;/strong&gt;. MCP serves as a new standard for connecting AI assistants to the systems where data lives—including content repositories, business tools, and development environments. It has quickly gained traction, especially with Cursor and Windsurf's integration. OpenAI recently announced their support for MCP in their agent SDK, marking a significant step for the ecosystem. We have also integrated it into the CAMEL framework to embrace the MCP ecosystem.&lt;/p&gt;

&lt;p&gt;Despite these advancements, agents still face a fundamental limitation: they &lt;strong&gt;struggle with long-term decision-making and adaptation&lt;/strong&gt;. While they can execute well-scoped tasks, they falter on multi-step objectives that require learning, revising plans, or reacting to change. Current agents follow instructions but don’t truly evolve through experience.&lt;/p&gt;

&lt;p&gt;This gap stems from the static nature of internet training data. Language models learn from passive text, not from interaction. To gain real autonomy, agents must operate and evolve within &lt;strong&gt;environments&lt;/strong&gt;—digital or physical spaces where they can &lt;strong&gt;perceive, act, and learn from experience.&lt;/strong&gt; Only through this feedback loop can agents begin to improve through trial and error.&lt;/p&gt;

&lt;p&gt;To address this “last mile” challenge in agent automation, we introduce &lt;strong&gt;OWL&lt;/strong&gt; and &lt;strong&gt;CRAB&lt;/strong&gt;, two agent-automation projects, along with &lt;strong&gt;MCP&lt;/strong&gt; integration, all designed specifically for interactive environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  OWL: Optimized Workforce Learning
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/camel-ai/owl" rel="noopener noreferrer"&gt;OWL (Optimized Workforce Learning)&lt;/a&gt;, built on top of the CAMEL-AI Framework, is our recently released project for real-world task automation. OWL has shown promise in task automation, achieving an impressive average score of 58.18 on the GAIA benchmark—ranking #1 among open-source submissions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff34y6lykuw7dpvprr6y5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff34y6lykuw7dpvprr6y5.png" alt="Image description" width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://camel-ai.github.io/camel_asset/owl_gemini%202.5.mp4" rel="noopener noreferrer"&gt;Watch the video&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How OWL Works
&lt;/h2&gt;

&lt;p&gt;OWL is a multi-agent system for automating digital tasks through the use of a browser, terminal, code execution, function calls, and MCP tools. The project has integrated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Browser Automation:&lt;/strong&gt; Sophisticated browser interaction capabilities using the Playwright framework, allowing for scrolling, clicking, input handling, downloading, navigation, and more.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Online Search Capabilities:&lt;/strong&gt; Support for multiple search engines (including Google, DuckDuckGo, Baidu, Bocha, Wikipedia) enabling real-time information retrieval and knowledge acquisition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code Execution:&lt;/strong&gt; Ability to write and execute Python code using an interpreter, enabling programmatic solutions to complex problems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Document Parsing:&lt;/strong&gt; Advanced extraction of content from various document formats (Word, Excel, PDF, PowerPoint), with conversion to text or Markdown format.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multimodal Processing:&lt;/strong&gt; Robust handling of internet or local videos, images, and audio data through specialized toolkits (ImageAnalysisToolkit, VideoAnalysisToolkit, AudioAnalysisToolkit).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Extensive Toolkit Integration:&lt;/strong&gt; Access to a comprehensive set of built-in toolkits including ArxivToolkit, GitHubToolkit, GoogleMapsToolkit, and many more specialized tools built in the CAMEL framework.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The core of OWL’s functionality is built on the CAMEL framework’s RolePlaying module, which creates unique initial settings for different agents through predefined prompts. This system primarily utilizes two main agents:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. UserAgent:&lt;/strong&gt; Responsible for breaking down tasks and providing instructions&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. AssistantAgent:&lt;/strong&gt; Executes instructions using various pre-configured tools or tool agents&lt;/p&gt;

&lt;p&gt;This architecture enables OWL to handle complex workflows through dynamic agent interactions, making it particularly effective for task automation across diverse domains.&lt;/p&gt;

&lt;p&gt;Furthermore, OWL employs a multi-agent system with context isolation for handling long-horizon tasks. Specialized sub-agents maintain isolated context windows for their domain (e.g., WebAgent keeps browser interaction history separate from main agent context).&lt;/p&gt;
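&lt;p&gt;The idea of context isolation can be illustrated with a toy sketch (these are not OWL's actual classes): each sub-agent keeps its own message history, so low-level browser chatter never crowds the main agent's context window.&lt;/p&gt;

```python
# Illustrative sketch of per-agent context isolation.

class SubAgent:
    def __init__(self, name):
        self.name = name
        self.context = []          # isolated per-agent history

    def record(self, message):
        self.context.append(message)

main_agent = SubAgent("main")
web_agent = SubAgent("web")

web_agent.record("clicked search result 3")   # browser detail stays local
main_agent.record("subtask 1 complete")       # main agent sees only the summary

print(len(main_agent.context), len(web_agent.context))
```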

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41nxnwfj755cfx10v8u0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41nxnwfj755cfx10v8u0.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  OWL with MCP Integration
&lt;/h2&gt;

&lt;p&gt;MCP has emerged as the “USB interface” of the LLM field, becoming a universal solution for addressing AI information silos, with its ecosystem growing daily. OWL supports the MCP protocol to call MCP servers within its ecosystem, achieving more standardized and efficient tool invocation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9kwbrxb9khfnesqwdj9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9kwbrxb9khfnesqwdj9.png" alt="Image description" width="800" height="557"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Here’s a step-by-step guide to implementing MCP with OWL:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Setting Up MCP Servers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, install the required MCP servers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install MCP Playwright Server
npm install -g @executeautomation/playwright-mcp-server
npx playwright install-deps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Configure MCP Servers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a configuration file named &lt;code&gt;mcp_servers_config.json&lt;/code&gt; with the following structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["-y", "@executeautomation/playwright-mcp-server"]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
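&lt;p&gt;As a convenience (this helper is not part of OWL), you can sanity-check the configuration before handing it to the toolkit: load the JSON and confirm each server entry declares a command and an argument list.&lt;/p&gt;

```python
# Validate an MCP server config of the shape shown above.
import json

CONFIG = """
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["-y", "@executeautomation/playwright-mcp-server"]
    }
  }
}
"""  # inlined here for the example; normally read from mcp_servers_config.json

def validate(config_text):
    config = json.loads(config_text)
    servers = config["mcpServers"]
    for name, spec in servers.items():
        assert "command" in spec, name
        assert isinstance(spec["args"], list), name
    return sorted(servers)

print(validate(CONFIG))  # ['playwright']
```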



&lt;p&gt;&lt;strong&gt;3. Implementation in OWL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here’s how to integrate OWL with MCP in your code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio
import sys

from camel.models import ModelFactory
from camel.toolkits import MCPToolkit
from camel.types import ModelPlatformType, ModelType
from camel.societies import RolePlaying
from camel.logger import set_log_level

from owl.utils.enhanced_role_playing import arun_society

set_log_level(level="DEBUG")

async def main():
    # Initialize MCP toolkit and connect
    mcp_toolkit = MCPToolkit(config_path="mcp_servers_config.json")

    try:
        await mcp_toolkit.connect()

        # Get task from command line or use default
        task = sys.argv[1] if len(sys.argv) &amp;gt; 1 else (
            "Using a web browser, search Google Scholar for Andrew Ng's academic profile. Create a comprehensive report that includes: (1) his main research directions in AI and machine learning, (2) at least five of his most influential published papers with citation counts, (3) his affiliated institutions throughout his career, and (4) a summary of his impact on the field."
        )

        # Setup model
        model = ModelFactory.create(
            model_platform=ModelPlatformType.OPENAI,
            model_type=ModelType.GPT_4O,
        )

        # Create and run society
        society = RolePlaying(
            task_prompt=task,
            user_role_name="user",
            user_agent_kwargs={"model": model},
            assistant_role_name="assistant",
            assistant_agent_kwargs={
                "model": model,
                "tools": mcp_toolkit.get_tools(),
            },
        )

        answer, chat_history, token_count = await arun_society(society)
        print(f"\033[94mAnswer: {answer}\033[0m")

    finally:
        try:
            await mcp_toolkit.disconnect()
        except Exception:
            print("Disconnect failed")


if __name__ == "__main__":
    asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example Use Case
&lt;/h3&gt;

&lt;p&gt;Consider this task: “Using a web browser, search Google Scholar for Andrew Ng's academic profile. Create a comprehensive report that includes: (1) his main research directions in AI and machine learning, (2) at least five of his most influential published papers with citation counts, (3) his affiliated institutions throughout his career, and (4) a summary of his impact on the field.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The OWL framework with MCP can handle this by:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Utilizing autonomous agents to decompose and tackle different aspects of the task&lt;/li&gt;
&lt;li&gt;Leveraging the Playwright MCP Server to navigate academic websites and extract paper information&lt;/li&gt;
&lt;li&gt;Coordinating the agents through OWL’s role-playing mechanisms to complete the task&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Benefits of OWL + MCP Integration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Standardized Tool Access:&lt;/strong&gt; MCP offers a unified interface for interacting with tools and data sources.&lt;br&gt;
&lt;strong&gt;2. Ecosystem Expansion:&lt;/strong&gt; New MCP servers can be seamlessly integrated to enhance OWL’s capabilities.&lt;br&gt;
&lt;strong&gt;3. Security:&lt;/strong&gt; MCP’s architecture safeguards sensitive data through its robust design.&lt;br&gt;
&lt;strong&gt;4. Flexibility:&lt;/strong&gt; Users can easily switch between any AI models that support the MCP standard.&lt;br&gt;
&lt;strong&gt;5. Efficiency:&lt;/strong&gt; Development time for complex multi-agent systems is significantly reduced.&lt;/p&gt;

&lt;h2&gt;
  
  
  OWL’s Future Directions
&lt;/h2&gt;

&lt;p&gt;OWL’s development roadmap focuses on enhancing its capabilities in several key areas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Expanding Tool Integration:&lt;/strong&gt; Incorporating more specialized toolkits to address domain-specific challenges&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improving Multi-Agent Coordination with RL:&lt;/strong&gt; Incorporating environmental feedback to train the multi-agent systems with reinforcement learning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strengthening Reasoning Capabilities:&lt;/strong&gt; Developing more sophisticated planning and decision-making mechanisms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broadening Environment Compatibility:&lt;/strong&gt; Ensuring seamless operation across different computing environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The recent integration of MCPToolkit, FileWriteToolkit, and TerminalToolkit represents significant progress toward these goals, enhancing OWL agents with MCP tool calling, file writing capabilities, and terminal command execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  CRAB: Cross-environment Agent Benchmark
&lt;/h2&gt;

&lt;p&gt;CRAB, short for &lt;strong&gt;CR&lt;/strong&gt;oss-environment &lt;strong&gt;A&lt;/strong&gt;gent &lt;strong&gt;B&lt;/strong&gt;enchmark, is the first agent framework that supports cross-device task execution. This project aims to build a benchmark that enables agents to perform tasks across multiple environments. For instance, within the CRAB framework, an agent can read a message on a smartphone and then operate a PC based on the message content.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://crab.camel-ai.org/static/videos/demo3_calendar_to_vim.mp4" rel="noopener noreferrer"&gt;Crab Demo&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is an “Environment” in CRAB?
&lt;/h2&gt;

&lt;p&gt;The term &lt;em&gt;environment&lt;/em&gt; is crucial in CRAB. In the example above, there are two environments: an Ubuntu PC and an Android smartphone. In fact, an environment can be any device, application, or even a more complex multi-device system—as long as it has a well-defined action space and observation space.&lt;/p&gt;
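&lt;p&gt;A minimal sketch can make the definition concrete: an environment is anything exposing an action space, an observation space, and a step function. This illustrates the concept only; it is not CRAB's real API.&lt;/p&gt;

```python
# Toy model of "environment": action space, observation space, and a step function.
from dataclasses import dataclass, field

@dataclass
class Environment:
    name: str
    action_space: list = field(default_factory=list)       # e.g. ["tap", "swipe"]
    observation_space: list = field(default_factory=list)  # e.g. ["screenshot"]

    def step(self, action):
        assert action in self.action_space, action
        return {"env": self.name, "did": action}           # a toy observation

phone = Environment("android", ["tap", "swipe", "type"], ["screenshot"])
pc = Environment("ubuntu", ["click", "type", "run_command"], ["screenshot", "stdout"])

# Cross-environment handoff: read on the phone, then act on the PC.
obs = phone.step("tap")
result = pc.step("run_command")
print(obs["env"], result["env"])
```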

&lt;h2&gt;
  
  
  Why Cross-Environment Matters
&lt;/h2&gt;

&lt;p&gt;Cross-environment capability is a crucial consideration in our framework, enabling agents to interact with multiple devices or applications simultaneously. This involves coordinating across environments, leveraging information from one environment in another, and passing messages between them. Humans naturally navigate diverse environments—each with its own action/observation spaces and logic—to solve complex problems, and agents need the same ability. This stands in contrast to most existing agent benchmarks, which are typically limited to interactions within a single device or application.&lt;/p&gt;

&lt;p&gt;CRAB introduces the first cross-environment agent benchmark, &lt;strong&gt;CRAB Benchmark v0&lt;/strong&gt;, which includes 120 tasks spanning more than 20 applications on Ubuntu desktops and Android smartphones. We believe that scaling agent environments is a key step toward building capable and practical agents.&lt;/p&gt;

&lt;p&gt;The cross-environment capability unlocks tremendous potential for real-world applications. One exciting possibility is applying CRAB to IoT scenarios—imagine controlling all your devices through a single intelligent agent assistant. In industries such as networking and cloud computing, managing a large number of heterogeneous devices is a constant challenge. Our cross-environment paradigm offers a promising path forward in these domains.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0wu12mljzss9hwm4gnv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0wu12mljzss9hwm4gnv.png" alt="Image description" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next: CRAB’s Updating Directions
&lt;/h2&gt;

&lt;p&gt;We are actively improving CRAB and planning several key upgrades in the upcoming version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Usability&lt;/strong&gt;: Simplifying configuration and improving code readability. Introducing MCP (Model Context Protocol) for seamless integration with any model or framework.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensibility&lt;/strong&gt;: Adopting a modular design that makes it easy to add new environments or virtual device implementations. We’ll also introduce a plugin system to support easy customization of existing modules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Robustness&lt;/strong&gt;: Our current VM implementations rely on QEMU/KVM and the Google Android Emulator, which are not very stable and are Linux-dependent. We plan to switch to more stable and convenient alternatives such as Docker.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation&lt;/strong&gt;: Reducing the amount of manual labor needed to conduct experiments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ll be integrating more components into our official &lt;a href="https://github.com/camel-ai/crab" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Popular Benchmarks: OSWorld, WebArena, and more&lt;/li&gt;
&lt;li&gt;New Environments: Windows, macOS, iOS, web browsers, specific applications, OpenAI Gymnasium, etc.&lt;/li&gt;
&lt;li&gt;Visual Prompt Tools: OmniParser, Ferret-UI, Grounding DINO, etc.&lt;/li&gt;
&lt;li&gt;Advanced GUI models: OpenAI Operator, Claude computer use, etc.&lt;/li&gt;
&lt;li&gt;Multi-Agent Systems: Frameworks like CAMEL and OWL, protocols like MCP&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  OWL + CRAB: A Unified Agent Operating System
&lt;/h2&gt;

&lt;p&gt;The integration of OWL and CRAB creates a potent ecosystem for developing, testing, and scaling agents.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;OWL can execute complex, multi-step digital tasks using its sophisticated reasoning and toolkits within a defined environment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CRAB can provide and manage the diverse, interconnected environments (like PCs, smartphones, specific apps) where these tasks unfold, enabling agents to operate across previously siloed systems.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Complementary Capabilities&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;OWL&lt;/strong&gt; and &lt;strong&gt;CRAB&lt;/strong&gt; complement each other in several important ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Development and Evaluation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OWL provides the framework for building sophisticated multi-agent systems.&lt;/li&gt;
&lt;li&gt;CRAB offers standardized methods for evaluating their performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Task Automation and Environment Adaptation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OWL is good at automating complex tasks.&lt;/li&gt;
&lt;li&gt;CRAB ensures these capabilities work consistently across different environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Tool Integration and Benchmark Standardization&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OWL’s extensive toolkit integration is balanced by CRAB’s rigorous benchmarking approach.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Generation Potential
&lt;/h3&gt;

&lt;p&gt;Combining these projects enables the generation of high-quality training data. Once established, the environments can be used to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Create Diverse Scenarios:&lt;/strong&gt; Generate a wide range of task scenarios across different environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Capture Agent Interactions:&lt;/strong&gt; Record how agents navigate these scenarios, including both successful and unsuccessful approaches.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Develop Improvement Metrics:&lt;/strong&gt; Analyze interaction data to uncover patterns and strategies that correlate with better performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Train New Agent Models:&lt;/strong&gt; Use the synthetic data and identified success signatures to guide the training process through RLHF, targeted fine-tuning, and supervised learning.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This data generation capability creates a &lt;strong&gt;virtuous cycle&lt;/strong&gt; where agent performance continuously improves through iterative testing and refinement.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Critical Role of Environment in Agent Scaling
&lt;/h2&gt;

&lt;p&gt;CAMEL-AI has identified environment as one of the three key dimensions in the scaling laws of agents—alongside:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the number of agents&lt;/li&gt;
&lt;li&gt;memory capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This highlights how crucial &lt;strong&gt;environment design&lt;/strong&gt; is to advancing agent technology.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Environments Matter for Agent Scaling
&lt;/h3&gt;

&lt;p&gt;Environments provide the context in which agents operate and learn. They define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Action Space: What agents can do and how they interact with the world&lt;/li&gt;
&lt;li&gt;The Observation Space: What information agents can perceive&lt;/li&gt;
&lt;li&gt;The Reward Structure: How agent behaviors are reinforced&lt;/li&gt;
&lt;li&gt;The Task Complexity: The range of challenges agents must overcome&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As environments become more diverse and complex, they drive the development of more sophisticated agent capabilities. This creates a &lt;strong&gt;scaling effect&lt;/strong&gt;—better environments lead to better agents, which in turn can handle more complex environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cross-Environment Challenges
&lt;/h3&gt;

&lt;p&gt;The ability to operate across different environments represents a significant leap in agent capabilities. It requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Abstraction Skills: Understanding common principles that apply across environments&lt;/li&gt;
&lt;li&gt;Adaptation Mechanisms: Adjusting strategies based on environment-specific constraints&lt;/li&gt;
&lt;li&gt;Transfer Learning: Applying knowledge gained in one environment to another&lt;/li&gt;
&lt;li&gt;Meta-Learning: Learning how to learn in new environments quickly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CRAB’s focus on &lt;strong&gt;cross-environment benchmarking&lt;/strong&gt; directly addresses these challenges, providing a structured way to measure and improve these critical capabilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  Environment-Driven Intelligence
&lt;/h3&gt;

&lt;p&gt;CAMEL-AI’s hypothesis on the &lt;strong&gt;scaling laws of agents&lt;/strong&gt; emphasizes that &lt;strong&gt;intelligence emerges from the interplay between agents and their environments.&lt;/strong&gt; This aligns with Marvin Minsky’s Society of Mind concept—suggesting that intelligence is not monolithic, but emerges from diverse interactions. Environments serve as crucial testing grounds, stretching and refining agent capabilities. By developing increasingly complex environments, we drive the creation of more sophisticated agents—mirroring how &lt;strong&gt;human intelligence&lt;/strong&gt; evolved through natural and social interactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Directions in Environment Design
&lt;/h2&gt;

&lt;p&gt;As agent technology advances, environment design will likely focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increased Realism: Mimicking real-world complexity&lt;/li&gt;
&lt;li&gt;Dynamic Adaptation: Evolving in response to agent capabilities&lt;/li&gt;
&lt;li&gt;Multi-Agent Ecosystems: Encouraging rich agent-to-agent interactions&lt;/li&gt;
&lt;li&gt;Cross-Modal Integration: Combining sensory and interaction modalities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The combination of OWL's advanced agent capabilities and CRAB's rigorous environment specifications offers an ideal platform for exploring these frontiers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The integration of &lt;strong&gt;OWL, CRAB,&lt;/strong&gt; and &lt;strong&gt;MCP&lt;/strong&gt; represents a significant step forward in solving the &lt;em&gt;“last mile”&lt;/em&gt; challenge of agent automation.&lt;/p&gt;

&lt;p&gt;By creating environments where agents can learn from experience, operate across platforms, and leverage standardized tool interfaces, we’re building the foundation for truly autonomous systems. As these projects continue to evolve, they promise to unlock new possibilities for AI agents—from more effective task automation to cross-environment coordination and continuous improvement through interaction. &lt;strong&gt;The future of agent technology lies not just in better models, but in better environments&lt;/strong&gt;—environments that allow those models to learn, adapt, and grow through experience.&lt;/p&gt;

&lt;p&gt;Join us in exploring this frontier of AI research and development, where the boundaries between environments dissolve and agents gain the power to navigate our complex digital world with increasing autonomy and effectiveness. &lt;strong&gt;Ready to join? Click the &lt;a href="https://www.camel-ai.org/collaboration-questionnaire" rel="noopener noreferrer"&gt;link&lt;/a&gt; or paste it into your browser to apply now.&lt;/strong&gt;&lt;/p&gt;


&lt;p&gt;OWL GitHub: &lt;a href="https://github.com/camel-ai/owl" rel="noopener noreferrer"&gt;https://github.com/camel-ai/owl&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CRAB GitHub: &lt;a href="https://github.com/camel-ai/crab" rel="noopener noreferrer"&gt;https://github.com/camel-ai/crab&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>javascript</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>🐉 Loong: Synthesize Long CoTs at Scale through Verifiers</title>
      <dc:creator>Nomadev</dc:creator>
      <pubDate>Wed, 09 Apr 2025 18:07:51 +0000</pubDate>
      <link>https://forem.com/camelai/loong-synthesize-long-cots-at-scale-through-verifiers-27b4</link>
      <guid>https://forem.com/camelai/loong-synthesize-long-cots-at-scale-through-verifiers-27b4</guid>
      <description>&lt;p&gt;Recent Large Reasoning Models such as DeepSeek-R1 have demonstrated that general reasoning capabilities of LLMs greatly improve when base models undergo post-training with Reinforcement Learning (RL) with a verifiable reward. Mathematics and programming have particularly benefited from this approach, as these domains can be verified quite easily—allowing accurate interpretation of LLM responses and effective comparison to the ground truth on a semantic level. This idea that ease of verification is crucial to improving domain-specific capabilities has become widely accepted in the research community.&lt;/p&gt;

&lt;p&gt;Another critical prerequisite which is often overlooked is the abundance of &lt;strong&gt;high-quality datasets&lt;/strong&gt;, featuring questions paired with verified correct answers in the domains of Math and Coding. These curated datasets provided the necessary signal for models to learn to construct coherent &lt;strong&gt;Chains-of-Thought&lt;/strong&gt; (CoTs) leading reliably to correct answers.&lt;/p&gt;

&lt;p&gt;However, many other domains also require reliable reasoning—such as logic, graph theory, physics, and finance. These domains lack comparable datasets, and human-supervised data production at scale is prohibitively expensive. Without abundant correct answers to learn from, models cannot easily acquire domain-specific reasoning patterns. This raises a crucial question:  &lt;em&gt;Can similar reasoning performance be achieved in domains beyond math and programming?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this blog, we introduce Project &lt;strong&gt;Loong&lt;/strong&gt; - focusing on scaling up &lt;strong&gt;synthetic data generation&lt;/strong&gt; with &lt;strong&gt;verifiers&lt;/strong&gt; for a &lt;strong&gt;broad range&lt;/strong&gt; of domains. We believe that &lt;strong&gt;synthetic data generation&lt;/strong&gt; is essential—not only for addressing gaps in data-scarce domains, but also for enhancing reasoning capabilities in areas like math and programming by expanding dataset availability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing the Verification Gap in Synthetic Data for RL
&lt;/h2&gt;

&lt;p&gt;A natural gap exists between synthetic questions and their answers: the correctness of synthetic answers isn't inherently guaranteed. Closing this gap entirely would require human supervision, which is prohibitively expensive at scale. We therefore try to close it as much as possible without involving a human in the loop.&lt;/p&gt;

&lt;p&gt;To do this, we developed a multi-agent system that generates synthetic questions and corresponding answers from a seed dataset. These synthetic questions are then posed to the agent we want to train, and we employ various domain-specific verifiers to compare the agent's responses against the synthetic answers to check for semantic equivalence.&lt;/p&gt;
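&lt;p&gt;A toy sketch of that loop, with stub functions standing in for the LLM-based agents: a generator perturbs seed questions, a code-executing solver produces synthetic answers, and a verifier checks responses against them. Everything here is illustrative, not the actual CAMEL implementation.&lt;/p&gt;

```python
import math
import random

def generate_synthetic(seed_expr, rng):
    """Stand-in for the question-generating agents: mutate a seed expression
    into a new synthetic question (real generation is LLM-based)."""
    return f"({seed_expr}) + {rng.randint(1, 9)}"

def code_solver(expr):
    """Stand-in for a solver agent with a code interpreter: it computes the
    synthetic answer by executing code rather than free-form reasoning."""
    return eval(expr)  # the real system runs code in a sandbox

def verifier(response, synthetic_answer, tol=1e-9):
    """Domain-specific verifier: here, semantic equivalence is just numeric
    closeness; real verifiers compare responses on a semantic level."""
    return math.isclose(response, synthetic_answer, abs_tol=tol)

rng = random.Random(42)
seed_dataset = ["2 * 3 + 1", "10 / 4"]

# generate one synthetic (question, answer) pair per seed entry
synthetic = []
for seed in seed_dataset:
    question = generate_synthetic(seed, rng)
    synthetic.append((question, code_solver(question)))
```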

&lt;p&gt;One of our main ideas is grounded in a simple hypothesis: an LLM equipped with a code interpreter can solve questions significantly more reliably compared to one relying solely on its own chain-of-thought reasoning in natural language.&lt;/p&gt;

&lt;p&gt;This makes intuitive sense, as many fields beyond computer science—such as physics, neurophysiology, economics, and computational biology—frequently rely on code-based solutions to solve problems in their own domain.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Loong Environment
&lt;/h2&gt;

&lt;p&gt;Since we are mostly interested in doing RL, we have structured all components into a unified Gym-like &lt;strong&gt;environment&lt;/strong&gt;, providing a clear interface for RL experimentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hec7m1u449xnutfwbqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hec7m1u449xnutfwbqw.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our environment comprises three main components:&lt;/p&gt;

&lt;h3&gt;
  
  
  Seed Dataset
&lt;/h3&gt;

&lt;p&gt;We begin by manually collecting domain-specific datasets consisting of questions and ground-truth answers. Each question in the seed dataset is guaranteed to be solvable using code. If available, we also record the code that leads to the ground truth. The purpose of the seed dataset is not to serve as a large-scale training set in its own right, but to bootstrap the synthetic data generation process by seeding the generative process of the LLM.&lt;/p&gt;
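&lt;p&gt;As a hypothetical example of what such a seed entry might look like (the field names are ours, not the actual Loong schema; see the repository for the real format), each record pairs a question with a ground-truth answer and, where available, the code that reproduces it:&lt;/p&gt;

```python
import contextlib
import io

# An illustrative seed record; field names are hypothetical, not the actual
# Loong schema.
seed_record = {
    "domain": "graph_discrete_math",
    "question": "How many edges does a complete graph on 6 vertices have?",
    "final_answer": "15",
    # optional: code that reproduces the ground truth, recorded when available
    "rationale_code": "import math; print(math.comb(6, 2))",
}

def reproduce(record):
    """Re-run the recorded code and capture its printed output, which should
    reproduce the ground-truth answer."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(record["rationale_code"], {})
    return buffer.getvalue().strip()
```

&lt;p&gt;Requiring that &lt;code&gt;reproduce(seed_record)&lt;/code&gt; matches &lt;code&gt;final_answer&lt;/code&gt; is one way to enforce the "solvable using code" property at collection time.&lt;/p&gt;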

&lt;p&gt;&lt;strong&gt;Dataset Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The repository currently includes a total of &lt;strong&gt;3,551 questions&lt;/strong&gt; spanning &lt;strong&gt;8 diverse domains&lt;/strong&gt; (and growing):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Math:&lt;/strong&gt; 1,615 questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Physics:&lt;/strong&gt; 434 questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Computational Biology:&lt;/strong&gt; 304 questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finance:&lt;/strong&gt; 320 questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph &amp;amp; Discrete Math:&lt;/strong&gt; 179 questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logic:&lt;/strong&gt; 110 questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mathematical Programming:&lt;/strong&gt; 68 questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security &amp;amp; Safety:&lt;/strong&gt; 521 questions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Synthetic Data Generator
&lt;/h3&gt;

&lt;p&gt;Our Synthetic Data Generator can be seen as a black box: seeded by a seed dataset, it generates an arbitrary number of synthetic questions and synthetic answers based on that dataset. The environment makes no further assumptions about the inner workings of the generator, so any algorithm can be used under the hood for creating synthetic data. We currently support few-shot prompting over the seed data, as well as a multi-agent system, where we use &lt;a href="https://arxiv.org/abs/2212.10560" rel="noopener noreferrer"&gt;self-instruct&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/2304.12244" rel="noopener noreferrer"&gt;evol-instruct&lt;/a&gt;, or &lt;a href="https://github.com/camel-ai/camel/tree/master/camel/datagen" rel="noopener noreferrer"&gt;other data generation pipelines&lt;/a&gt; for generating questions and a solver agent for the synthetic answers.&lt;/p&gt;

&lt;p&gt;It is important to stress that we do not expect these synthetic answers to always be correct. While code execution should yield more correct solutions than a naive CoT thanks to its accurate computations, we are well aware that many synthetic answers will still be wrong.&lt;/p&gt;

&lt;p&gt;However, this is not a problem, since we never train on the raw synthetic data directly: we filter it in the next step and learn only from the filtered subset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verifier
&lt;/h3&gt;

&lt;p&gt;While the Synthetic Data Generator produces ample synthetic data, it's essential to filter out incorrect solutions before using them for training. To do this effectively, we validate synthetic answers using two independent approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Deriving one solution directly through the Synthetic Data Generator’s code execution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Independently generating another solution via natural-language Chain-of-Thought (CoT) reasoning.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If these independent solutions agree, it's highly likely that the answer is correct. Although rare, there's still a possibility of false positives (both approaches incorrectly agreeing). However, given the fundamentally different methods involved, we believe this will not occur often enough to be detrimental to model training.&lt;/p&gt;
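&lt;p&gt;The agreement-based filter can be sketched as follows; precomputed numbers stand in for the two solver calls, and numeric tolerance stands in for the real semantic comparison:&lt;/p&gt;

```python
import math

def solutions_agree(code_answer, cot_answer, tol=1e-6):
    """Accept a synthetic answer only when the code-execution solution and the
    independent natural-language CoT solution coincide."""
    return math.isclose(code_answer, cot_answer, abs_tol=tol)

def filter_synthetic(samples):
    """samples: (question, code_answer, cot_answer) triples; return the
    (question, answer) pairs where the two independent derivations agree."""
    return [(q, code_a) for q, code_a, cot_a in samples
            if solutions_agree(code_a, cot_a)]

candidates = [
    ("expected value of a fair die roll", 3.5, 3.5),   # agreement: keep
    ("probability of two heads in a row", 0.25, 0.5),  # disagreement: discard
]
kept = filter_synthetic(candidates)
```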

&lt;p&gt;Each environment also includes a &lt;strong&gt;verifier&lt;/strong&gt; that semantically compares the LLM response with the synthetic answer, ensuring they are effectively equivalent. This verification step is crucial for accurately filtering semantic equivalences, significantly reducing false negatives (cases where semantically correct answers would otherwise be wrongly rejected).&lt;/p&gt;

&lt;p&gt;The CoT-generating agent is the model we ultimately aim to train. During RL training, this agent receives positive rewards only when its final CoT-generated answer is semantically confirmed by the verifier to match the synthetic answer, thus ensuring it learns exclusively from likely-correct synthetic data.&lt;/p&gt;
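&lt;p&gt;A minimal sketch of this reward gating, with a toy string-normalising check standing in for the actual model-based semantic verifier:&lt;/p&gt;

```python
def semantic_verifier(response, reference):
    """Toy stand-in for the semantic verifier: normalise surface form before
    comparing (the real verifier is model-based, not string matching)."""
    def norm(s):
        return s.strip().lower().rstrip(".")
    return norm(response) == norm(reference)

def reward(cot_answer, synthetic_answer):
    """Sparse RL reward: positive only when the verifier confirms that the
    CoT-generated answer matches the synthetic answer."""
    return 1.0 if semantic_verifier(cot_answer, synthetic_answer) else 0.0
```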

&lt;p&gt;&lt;strong&gt;A code snippet to get started with the Loong Environment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The code snippet below shows a simplified version of how to use the Loong environment. Implementation details that do not aid a first-pass understanding have been omitted. For a detailed walkthrough of the single-step environment, please refer to this &lt;a href="https://github.com/camel-ai/loong/blob/main/cookbooks/env_with_generator.ipynb" rel="noopener noreferrer"&gt;cookbook&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from camel.environments import SingleStepEnv
from camel.datasets import FewShotGenerator, StaticDataset
from camel.verifiers import PythonVerifier
from camel.agents import ChatAgent
from datasets import load_dataset

# Load and initialize a seed dataset
dataset = load_dataset("camel-ai/loong", split="graph_discrete_math")
seed_dataset = StaticDataset(dataset)

# Set up the verifier
verifier = PythonVerifier(required_packages=["numpy", "networkx"])

# Define a model backend to use for the generator
model = ...

# Set up synthetic data generation
generator = FewShotGenerator(seed_dataset=seed_dataset, verifier=verifier, model=model)

# Initialize the Loong environment
env = SingleStepEnv(generator, verifier)

# Define the agent that shall interact with the environment
agent = ChatAgent()

# Example environment interaction (await requires an async context, e.g. a notebook)
obs = await env.reset()
agent_response = agent.step(obs.question)  # the agent answers the question
next_obs, reward, done, info = await env.step(agent_response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Contribute to Project Loong 🐉
&lt;/h2&gt;

&lt;p&gt;Researchers and developers can use the Loong environment to generate synthetic data across a variety of domains. We have already collected seed datasets for several domains, including Mathematics, Graph Theory, Mathematical Programming, and Logic. The seed data, as well as cookbooks, can be found on &lt;a href="https://github.com/camel-ai/loong" rel="noopener noreferrer"&gt;Github&lt;/a&gt;. We have also unified and uploaded all the seed datasets we collected to HuggingFace: &lt;a href="https://huggingface.co/datasets/camel-ai/loong" rel="noopener noreferrer"&gt;check here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Additionally, we encourage you to collect your own seed datasets and leverage Loong to generate synthetic data for your domain.&lt;/p&gt;

&lt;p&gt;We are currently working on using the environment that we built to do post-training on top of LLMs of different sizes to see whether we can see an improvement in the general as well as domain-specific reasoning capabilities. We are still experimenting with different reward setups, focusing mainly on accuracy rewards, following the approach of DeepSeek. More details, as well as our results will be released in our upcoming preprint paper.&lt;/p&gt;

&lt;p&gt;At CAMEL, we believe that environments are a vital component for improving domain-specific agent reasoning. If a problem can be framed clearly within an environment, agents have the potential to master it autonomously.&lt;/p&gt;

&lt;p&gt;With Loong, we aim to address a key challenge in synthetic data generation: &lt;strong&gt;ensuring data quality through verifiability.&lt;/strong&gt; Our goal with Loong is to make it easier to build reliable reasoning datasets in domains where curated data is scarce.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We invite researchers and developers to contribute seed datasets, verifiers, and ideas to help improve and extend our project. &lt;em&gt;Ready to join? Click the &lt;a href="https://www.camel-ai.org/collaboration-questionnaire" rel="noopener noreferrer"&gt;link&lt;/a&gt; or paste it into your browser to apply now.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Scaling Environments for Agents</title>
      <dc:creator>Nomadev</dc:creator>
      <pubDate>Mon, 07 Apr 2025 12:13:04 +0000</pubDate>
      <link>https://forem.com/camelai/scaling-environments-for-agents-h0h</link>
      <guid>https://forem.com/camelai/scaling-environments-for-agents-h0h</guid>
      <description>&lt;p&gt;At &lt;a href="//CAMEL-AI.org"&gt;CAMEL-AI.org&lt;/a&gt;, we are committed to pushing the boundaries of artificial intelligence through multi-agent systems. This blog post restates our mission, discusses current limitations and trends of AI agents, and outlines our initiative to build environments for the data-driven future of AI agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.camel-ai.org/launchweek-environments#Initiative" rel="noopener noreferrer"&gt;Mission: Finding the Scaling Laws of Agents&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzk0tjscao7bw1hasmrz2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzk0tjscao7bw1hasmrz2.jpg" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our mission has always been clear and unwavering: to uncover the scaling laws of agents and build the foundational infrastructure for multi-agent systems that can drive the future of artificial intelligence. From the beginning, we have been committed to exploring how agents scale in complexity, environments, and evolution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dimensions of Scaling Laws of Agents
&lt;/h3&gt;

&lt;p&gt;We focus on three key dimensions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Number of Agents:&lt;/strong&gt; How do agents behave when scaled to large numbers? What emergent abilities arise from their interactions? We aim to study these phenomena and uncover patterns that reveal new capabilities as agent systems grow in scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Environments:&lt;/strong&gt; How do we create environments designed to enable agents to learn complex reasoning, long-term decision-making, adaptive behavior, and allow agents to acquire new knowledge or skills through interaction? Our focus is on developing environments that simulate real-world complexity while providing reward signals that effectively drive agent learning and evolution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Evolution:&lt;/strong&gt; How can agents evolve through interactions within their environment? We are building reinforcement learning environments and memory systems for agents to create agents that can generalize across tasks, adapt to new challenges, and continuously improve through experience.&lt;/p&gt;

&lt;p&gt;In this blog, we are focusing on the importance of scaling environments. Environments are not just containers for agent activity; they are essentially the missing data for agents that cannot be acquired simply by scraping the internet. Environments provide the dynamic, interactive contexts necessary for agents to learn adaptive behaviors and develop long-term decision-making capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Rise of End-to-End Reinforcement Learning for LLM Agents
&lt;/h2&gt;

&lt;p&gt;The initial approach to making AI agents functional relied heavily on prompt engineering by crafting specific instructions to guide LLM agents. This involved techniques like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Role-Based Prompts: Instructing agents to follow predefined roles or personas to simulate specific behavior.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Few-Shot Prompting: Providing examples within prompts to teach agents how to use tools or perform complex reasoning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Output Formatting: Using tricks to ensure models generate structured outputs, such as JSON responses.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
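&lt;p&gt;A toy example of how these techniques combine in practice, and why they are fragile; the prompt and parser below are illustrative, not taken from any particular framework:&lt;/p&gt;

```python
import json

# A typical prompt combining the three techniques above: a role, a few-shot
# example, and an output-format instruction (illustrative only).
PROMPT = (
    "You are a task-planning agent.\n"
    'Example: {"tool": "search", "query": "weather in Paris"}\n'
    "Respond ONLY with a JSON object with keys 'tool' and 'query'."
)

def parse_agent_output(raw):
    """Parsing is the weak link: any deviation from the requested format,
    such as extra prose around the JSON, breaks the agent."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None

good = parse_agent_output('{"tool": "search", "query": "CAMEL-AI"}')
bad = parse_agent_output('Sure! Here is the JSON: {"tool": "search"}')
```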

&lt;p&gt;While these techniques are effective in prototyping agent systems, they come with significant limitations that hindered robustness, adaptability, and scalability. Prompt-based agents often fail when encountering complex or unforeseen scenarios. Their rigid behavior patterns make them ill-suited for tasks requiring dynamic decision-making. Prompts can unintentionally introduce biases or lead to hallucinated outputs, especially when interacting with tools or external components. Crafting effective prompts for increasingly complex tasks requires significant expertise, time, and trial-and-error, making it difficult to scale across diverse applications.&lt;/p&gt;

&lt;p&gt;These challenges underscore the need for a paradigm shift—moving away from reliance on pure prompt engineering toward end-to-end reinforcement learning for LLM agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Prompt Engineering to End-to-End Autonomy
&lt;/h2&gt;

&lt;p&gt;End-to-end RL for LLM agents has been considered a promising direction for addressing the shortcomings of prompt engineering. These agents are trained holistically on tasks, rather than relying on manually crafted prompts for every scenario.&lt;/p&gt;

&lt;p&gt;Recent advancements in RL for LLM agents have emerged from leading research labs and startups. Notable examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OpenAI's Operator&lt;/strong&gt; combines GPT-4o's vision capabilities with reinforcement learning, allowing it to interpret screenshots, interact with GUIs effectively, and perform web-based tasks such as ordering groceries, booking reservations, and creating memes without requiring custom API integrations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OpenAI's Deep Research&lt;/strong&gt; leverages reinforcement learning to autonomously navigate complex browsing and reasoning tasks across diverse domains. Trained with end-to-end reinforcement learning, it plans and executes multi-step trajectories, backtracking and adapting to real-time information as necessary.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;xAI's Grok 3&lt;/strong&gt;, trained on the Colossus supercluster with ten times the computational power of previous models, uses reinforcement learning to refine the chain-of-thought reasoning of Grok 3 (Think). It refines its problem-solving strategies by thinking for seconds to minutes, correcting errors, exploring alternatives, and delivering accurate answers across tasks including mathematics, coding, and world knowledge.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DeepSeek's R1&lt;/strong&gt; series models utilize RL to develop advanced reasoning capabilities. DeepSeek-R1-Zero first demonstrated that complex reasoning behaviors, such as extended chain-of-thought and self-correction, can emerge purely through RL without supervised fine-tuning. Building on this foundation, DeepSeek-R1 incorporates a small "cold-start" dataset alongside iterative RL and supervised fine-tuning to enhance output coherence and user-friendliness while maintaining state-of-the-art reasoning performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As the field continues to evolve, we foresee an increasing number of vertical agent startups incorporating reinforcement learning to train LLM agents to tackle specific industry challenges. For instance, a recent post from the Cursor team, creators of an AI-powered code editor, indicates that Cursor AI is working on building RL models in real-world coding environments to automate coding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Environment is the Missing “Data” for Agents
&lt;/h2&gt;

&lt;p&gt;We are excited about the future of RL for LLM agents, as AI already matches human capabilities in many tasks. RL offers a promising path to achieving superhuman intelligence, and we may witness more "Lee Sedol moments," like AlphaGo’s historic victory, in the area of LLM agents across different domains. However, its full potential remains unrealized because the critical “data” for effective agent training is missing: realistic, standardized environments. While internet data may offer vast amounts of information, it lacks the interactive, adaptive, and diverse settings required for an agent to learn long-term decision-making through trial and error. Agents trained solely on static internet data struggle to understand temporal dynamics and complex cause-and-effect relationships in the real world.&lt;/p&gt;

&lt;p&gt;Equally challenging is the design of robust reward functions. Without carefully crafted reward signals, it becomes difficult to train agents to exhibit desired behaviors. Developing dedicated verifiers to assess LLM responses can be instrumental in defining reward functions that ensure reward signals remain reliable and aligned with long-term objectives.&lt;/p&gt;
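&lt;p&gt;As one minimal sketch of such a dedicated verifier (an illustrative toy, not CAMEL's actual verifier API), we can execute an agent's code and grant reward only when its output matches the expected result:&lt;/p&gt;

```python
import subprocess
import sys

def code_verifier(candidate_code, expected_stdout, timeout=5):
    """A dedicated verifier: run the agent's code in a subprocess and compare
    its output to the expected result, yielding a reliable binary signal."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", candidate_code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and result.stdout.strip() == expected_stdout

def reward(candidate_code, expected_stdout):
    """Sparse reward aligned with the verifiable objective."""
    return 1.0 if code_verifier(candidate_code, expected_stdout) else 0.0
```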

&lt;p&gt;At &lt;a href="http://camel-ai.org/" rel="noopener noreferrer"&gt;CAMEL-AI.org&lt;/a&gt;, we believe that overcoming the challenges of reinforcement learning for LLM agents requires a community-driven approach. Our open-source framework is designed to facilitate global collaboration among researchers and developers, enabling the creation of scalable environments and robust reward mechanisms. Thanks to our contributors, we already have the foundational building blocks in place, including &lt;a href="https://github.com/camel-ai/camel/tree/master/camel/environments" rel="noopener noreferrer"&gt;environments&lt;/a&gt;, &lt;a href="https://github.com/camel-ai/camel/tree/master/camel/verifiers" rel="noopener noreferrer"&gt;verifiers&lt;/a&gt;, &lt;a href="https://github.com/camel-ai/camel/tree/master/camel/datagen" rel="noopener noreferrer"&gt;data generation pipelines&lt;/a&gt;, and &lt;a href="https://github.com/camel-ai/camel/tree/master/camel/toolkits" rel="noopener noreferrer"&gt;toolkits&lt;/a&gt; that are essential for further development.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Fill out this &lt;a href="https://www.camel-ai.org/launchweek-environments" rel="noopener noreferrer"&gt;form&lt;/a&gt; and join us in shaping a future where reinforcement learning reaches its full potential&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://www.camel-ai.org/launchweek-environments#Initiative" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtz9ylvzjv414o0q28uk.png" alt="Image description" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>openai</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>What’s Inside the Best Open-Source General AI Agent?</title>
      <dc:creator>Nomadev</dc:creator>
      <pubDate>Fri, 21 Mar 2025 08:47:58 +0000</pubDate>
      <link>https://forem.com/camelai/whats-inside-the-best-open-source-general-ai-agent-433f</link>
      <guid>https://forem.com/camelai/whats-inside-the-best-open-source-general-ai-agent-433f</guid>
      <description>&lt;p&gt;Last week felt like a wild for AI agents. If you’ve been following the space, you probably saw the buzz around &lt;a href="https://manus.im/" rel="noopener noreferrer"&gt;MANUS&lt;/a&gt;. Manus AI grabbed everyone’s attention. A general AI agent system (impressive, no doubt). But there was a catch: it wasn’t open-source, and you needed an invite just to try it out. Cool tech, but limited access.&lt;/p&gt;

&lt;p&gt;We wanted to change that.&lt;br&gt;
&lt;strong&gt;So we did.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We rolled out &lt;strong&gt;&lt;a href="https://github.com/camel-ai/owl" rel="noopener noreferrer"&gt;OWL&lt;/a&gt;&lt;/strong&gt;, an &lt;strong&gt;autonomous&lt;/strong&gt;, open-source general AI agent built on top of the CAMEL-AI framework. No paywalls.  100% open and ready to use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;And in just 5 days?&lt;/strong&gt;&lt;br&gt;
→ 11.2K+ GitHub stars&lt;br&gt;
→ Ranked #1 on GAIA among open source general agents&lt;br&gt;
→ A community that’s already building, testing, and scaling with it&lt;/p&gt;

&lt;p&gt;This is more than a project. OWL is &lt;strong&gt;our answer to the need for accessible, scalable, and autonomous agent frameworks&lt;/strong&gt;. We’ve made real advances in pushing the boundaries of what autonomous AI agents can do, without the barriers of closed systems.&lt;/p&gt;

&lt;p&gt;Let’s take a closer look at why OWL is making waves and why it might be exactly what you’ve been looking for.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Modern, Modular Tech Stack
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqkojs1xv20bvsblw4bm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqkojs1xv20bvsblw4bm.png" alt="Image description" width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At CAMEL-AI, we’re all about making life easier for developers, AI researchers, and anyone exploring multi-agent systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With OWL, you get:&lt;/strong&gt;&lt;br&gt;
✅ Seamless multi-agent orchestration, thanks to the CAMEL-AI framework&lt;br&gt;
✅ Built-in Docker support for easy deployment—whether you’re on cloud or local&lt;br&gt;
✅ A clean, modular Python setup for flexibility and fast prototyping&lt;/p&gt;

&lt;p&gt;👉 Ready to try OWL? Jump straight to &lt;a href="https://github.com/camel-ai/owl" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; or keep reading ↓&lt;/p&gt;

&lt;h2&gt;
  
  
  State-of-the-Art Model Support
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hli61b5ldbo9eua3s17.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hli61b5ldbo9eua3s17.png" alt="Image description" width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With OWL, you’re not tied to one model. We built it to be flexible, adaptable, and compatible with the best AI has to offer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud-based &amp;amp; Local Models&lt;/strong&gt;&lt;br&gt;
→ Supports &lt;a href="https://openai.com/index/hello-gpt-4o/" rel="noopener noreferrer"&gt;GPT-4o&lt;/a&gt;, &lt;a href="https://chat.qwen.ai/" rel="noopener noreferrer"&gt;Qwen&lt;/a&gt;, &lt;a href="https://mistral.ai/" rel="noopener noreferrer"&gt;Mistral&lt;/a&gt;, &lt;a href="https://claude.ai/" rel="noopener noreferrer"&gt;Claude 3.5 Sonnet&lt;/a&gt;, &lt;a href="https://www.deepseek.com/" rel="noopener noreferrer"&gt;DeepSeek&lt;/a&gt;, and more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run Locally (Privacy-First)&lt;/strong&gt;&lt;br&gt;
→ Use Ollama, vLLM, and SGLang for on-premises deployments—no cloud needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blazing-Fast Inference&lt;/strong&gt;&lt;br&gt;
→ Works with &lt;a href="https://groq.com/" rel="noopener noreferrer"&gt;Groq&lt;/a&gt; and &lt;a href="https://sambanova.ai/" rel="noopener noreferrer"&gt;SambaNova&lt;/a&gt; backends for lightning-fast performance.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://docs.camel-ai.org/key_modules/models.html" rel="noopener noreferrer"&gt;See Supported Models&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Toolkits That Do It All
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff72v4u0gckhx2npk23gs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff72v4u0gckhx2npk23gs.png" alt="Image description" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We built a fully autonomous multi-agent system, packed with &lt;strong&gt;30+ toolkits&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-Time Search &amp;amp; Extraction&lt;/strong&gt;&lt;br&gt;
→ Pull data from Google, Wikipedia, and scrape at scale with Firecrawl.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multimodal Processing&lt;/strong&gt;&lt;br&gt;
→ Analyze images, videos, and audio files effortlessly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Browser Automation&lt;/strong&gt;&lt;br&gt;
→ Automate browser tasks using Playwright, Zapier, and Browseruse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document Parsing &amp;amp; Code Execution&lt;/strong&gt;&lt;br&gt;
→ Handle Word, Excel, PDF files via Chunkr. Run Python code natively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Integration&lt;/strong&gt;&lt;br&gt;
→ Built-in support for Anthropic’s Model Context Protocol (MCP) for seamless tool interoperability.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://docs.camel-ai.org/key_modules/tools.html#built-in-toolkits" rel="noopener noreferrer"&gt;See the Full Toolkit List&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why OWL Is Different (And Why It Matters)
&lt;/h2&gt;

&lt;p&gt;We’re not here to compete on hype. We’re here to offer a fully autonomous, open-source alternative that anyone can build with.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;100% Open-Source&lt;/strong&gt; → no fees, no invite codes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runs Locally or in the Cloud&lt;/strong&gt; → privacy &amp;amp; scalability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ranks #1 on GAIA Benchmark&lt;/strong&gt; → among open-source agent frameworks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backed by an Active Community&lt;/strong&gt; → join us on Discord, Reddit, and WeChat&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;They promised the future of AI agents. &lt;strong&gt;We open-sourced it.&lt;/strong&gt; 🦉&lt;/p&gt;

&lt;p&gt;It’s clear the AI community has been waiting for something like this that anyone can build on, without gatekeeping.&lt;/p&gt;

&lt;p&gt;OWL is just our first step, and we’re beyond excited to see what you create with it.&lt;/p&gt;

&lt;p&gt;👉 Try OWL today → &lt;a href="https://github.com/camel-ai/owl" rel="noopener noreferrer"&gt;GitHub Link&lt;/a&gt;&lt;br&gt;
👉 Join the community → CAMEL-AI Discord&lt;/p&gt;

&lt;p&gt;From everyone at CAMEL-AI, thank you for your amazing support.&lt;br&gt;
Let’s keep building the future of open-source AI, together! 🐫🦉🚀&lt;/p&gt;

&lt;p&gt;Massive thanks to our incredible community; none of this would have been possible without your support.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>javascript</category>
      <category>python</category>
    </item>
    <item>
      <title>How Data Drives LLM Pretraining: Methods, Tips, and Best Practices</title>
      <dc:creator>Nomadev</dc:creator>
      <pubDate>Thu, 06 Mar 2025 07:21:41 +0000</pubDate>
      <link>https://forem.com/camelai/how-data-drives-llm-pretraining-methods-tips-and-best-practices-4pj</link>
      <guid>https://forem.com/camelai/how-data-drives-llm-pretraining-methods-tips-and-best-practices-4pj</guid>
      <description>&lt;h2&gt;
  
  
  How Data Fuels LLM Pretraining
&lt;/h2&gt;

&lt;p&gt;Data serves as the lifeblood of LLM pretraining, determining the extent of the model’s language understanding and its ability to generalize across tasks. The quality, diversity, and scale of the data directly influence the model’s performance. By processing billions of words from varied sources, LLMs learn to recognize patterns, interpret nuances, and adapt to different linguistic contexts.  &lt;/p&gt;

&lt;p&gt;Without rich and diverse data, the model’s capabilities are inherently limited, as it would struggle to generalize beyond the patterns seen during training.  &lt;/p&gt;

&lt;p&gt;For instance, a lack of diversity in data can lead to &lt;strong&gt;overfitting&lt;/strong&gt;, where the model excels in specific contexts but performs poorly in others.  &lt;/p&gt;




&lt;h2&gt;
  
  
  Exploring Data Types for LLM Pretraining
&lt;/h2&gt;

&lt;p&gt;LLMs primarily rely on &lt;strong&gt;unstructured textual data&lt;/strong&gt;, such as books, articles, and online content. These sources offer a wide range of language styles and topics, making them ideal for building general-purpose models. Web scraping is a common method to collect such data, often pulling from websites, blogs, forums, and other user-generated content.  &lt;/p&gt;

&lt;p&gt;While structured data such as tables or spreadsheets is less commonly used due to its lack of linguistic richness, some specific use cases may incorporate structured data when it’s highly relevant to a particular domain (e.g., &lt;strong&gt;medical records, scientific datasets&lt;/strong&gt;). These are typically less abundant in comparison to unstructured text data.  &lt;/p&gt;




&lt;h2&gt;
  
  
  Effective Data Collection and Preparation for LLMs
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgss7ftwsgbl2hxu11sdv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgss7ftwsgbl2hxu11sdv.png" alt="Image description" width="800" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The data collection process begins with clear objectives. For &lt;strong&gt;general-purpose LLMs&lt;/strong&gt;, the goal is to gather diverse and representative text data that covers a wide range of topics and styles. Web scraping from multiple sources ensures that this data is varied and can reflect different contexts, linguistic features, and domains.  &lt;/p&gt;

&lt;p&gt;The raw data often contains noise, irrelevant information, or repeated content, which must be filtered out to maintain quality.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.camel-ai.org" rel="noopener noreferrer"&gt;CAMEL-AI&lt;/a&gt; offers convenient integrations with &lt;strong&gt;popular extraction and data ingestion tools like &lt;a href="https://github.com/camel-ai/camel/blob/master/camel/toolkits/mineru_toolkit.py" rel="noopener noreferrer"&gt;MinerU&lt;/a&gt;, &lt;a href="https://docs.camel-ai.org/_modules/camel/loaders/unstructured_io.html" rel="noopener noreferrer"&gt;UIO&lt;/a&gt;, &lt;a href="https://github.com/camel-ai/camel/blob/master/camel/loaders/jina_url_reader.py" rel="noopener noreferrer"&gt;Jina Reader&lt;/a&gt;,  &lt;a href="https://github.com/camel-ai/camel/blob/master/camel/loaders/apify_reader.py" rel="noopener noreferrer"&gt;Apify&lt;/a&gt;&lt;/strong&gt;.  &lt;/p&gt;

&lt;p&gt;These tools help streamline the data collection process, reducing manual effort and enhancing data quality.  &lt;/p&gt;

&lt;p&gt;While bad sources can be discarded using heuristics (e.g., filtering out overly repetitive or obviously irrelevant text), irrelevant information is more challenging to remove on a large scale. A common approach involves &lt;strong&gt;monitoring the loss plot&lt;/strong&gt; during training.  &lt;/p&gt;

&lt;p&gt;When a sharp spike occurs, it often indicates problematic data. At this point, the dataset is revisited, and specific data points (e.g., content from subreddits like &lt;em&gt;&lt;a href="https://www.reddit.com/r/mmmmmmmmm/" rel="noopener noreferrer"&gt;https://www.reddit.com/r/mmmmmmmmm/&lt;/a&gt;&lt;/em&gt;) are removed, as they confuse the model and degrade its learning.  &lt;/p&gt;

&lt;p&gt;In the case of the subreddit mentioned, the repetitive content (e.g., dozens of “m’s” in a row) conflicts with the model’s learned pattern of language, leading to inefficiencies in training.  &lt;/p&gt;
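&lt;p&gt;The loss-spike heuristic above can be sketched in a few lines of Python. This is an illustrative toy, not production monitoring code, and the window size and spike factor are assumptions rather than values from any real training run:&lt;/p&gt;

```python
def find_loss_spikes(losses, window=50, factor=3.0):
    """Flag training steps whose loss jumps far above the recent average.

    A spike often points at a batch of problematic data (e.g. highly
    repetitive text), which can then be traced back and removed.
    """
    spikes = []
    for step in range(window, len(losses)):
        recent = losses[step - window:step]
        baseline = sum(recent) / window
        if losses[step] > factor * baseline:
            spikes.append(step)
    return spikes

# A smoothly decreasing loss curve with one injected spike at step 120.
curve = [2.0 - 0.001 * i for i in range(200)]
curve[120] = 9.5
print(find_loss_spikes(curve))  # → [120]
```

&lt;p&gt;In a real run the flagged steps would be mapped back to the batches (and ultimately the source documents) seen at those steps.&lt;/p&gt;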

&lt;h3&gt;
  
  
  Key steps include:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Define Clear Objectives&lt;/strong&gt;: Establish what topics and styles are needed.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Utilize Diverse Sources&lt;/strong&gt;: Collect data from websites, blogs, forums, and social media.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filter Out Noise&lt;/strong&gt;: Remove irrelevant, repetitive, or low-quality content.
&lt;/li&gt;
&lt;/ul&gt;
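&lt;p&gt;As a minimal sketch of the “Filter Out Noise” step, two cheap heuristics (a minimum length and a cap on how often the most common token may repeat) already catch content like the all-“m” subreddit mentioned above; the thresholds here are illustrative assumptions:&lt;/p&gt;

```python
from collections import Counter

def keep_document(text, min_words=5, max_top_ratio=0.3):
    """Cheap noise heuristics: drop documents that are too short or
    dominated by a single repeated token (like dozens of m's in a row)."""
    words = text.lower().split()
    if min_words > len(words):
        return False
    top_count = Counter(words).most_common(1)[0][1]
    return max_top_ratio >= top_count / len(words)

docs = [
    "mmm mmm mmm mmm mmm mmm",   # repetitive noise: dropped
    "ok",                        # too short: dropped
    "a short article about tokenization and data quality in pretraining",
]
print([d for d in docs if keep_document(d)])  # keeps only the last one
```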

&lt;p&gt;When using platforms like &lt;strong&gt;&lt;a href="https://www.camel-ai.org" rel="noopener noreferrer"&gt;CAMEL-AI&lt;/a&gt;&lt;/strong&gt;, data preparation becomes straightforward. CAMEL's integrated &lt;strong&gt;&lt;a href="https://docs.camel-ai.org/key_modules/retrievers.html" rel="noopener noreferrer"&gt;Retrievers&lt;/a&gt; and &lt;a href="https://docs.camel-ai.org/key_modules/memory.html" rel="noopener noreferrer"&gt;Memory Management techniques&lt;/a&gt;&lt;/strong&gt; allow developers to filter out noise, irrelevant information, and repetitive content to maintain dataset quality.  &lt;/p&gt;




&lt;h2&gt;
  
  
  Data Preprocessing and Tokenization Techniques
&lt;/h2&gt;

&lt;p&gt;Once the data is cleaned, it undergoes &lt;strong&gt;preprocessing&lt;/strong&gt;, which primarily involves &lt;strong&gt;tokenization&lt;/strong&gt;. Tokenization is the process of breaking down text into smaller, manageable units (&lt;strong&gt;tokens&lt;/strong&gt;), such as words or subwords.  &lt;/p&gt;

&lt;p&gt;These tokens are initially represented as &lt;strong&gt;one-hot encoded vectors&lt;/strong&gt;, where every entry is &lt;code&gt;0&lt;/code&gt;, except for one which is &lt;code&gt;1&lt;/code&gt;. The position of the &lt;code&gt;1&lt;/code&gt; in the vector corresponds to a specific token, allowing us to map the textual representation of a token to its vector representation.  &lt;/p&gt;

&lt;p&gt;Once the tokens are ready, they are passed through the model, where &lt;strong&gt;embedding representations&lt;/strong&gt; are learned during the training process. These embeddings capture the &lt;strong&gt;semantic properties&lt;/strong&gt; of tokens, but this occurs only after the initial tokenization and vectorization process.  &lt;/p&gt;
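&lt;p&gt;The one-hot-to-embedding step can be made concrete with a toy example in plain Python (the matrix values below are invented): multiplying a one-hot vector by the embedding matrix simply selects one row of that matrix.&lt;/p&gt;

```python
vocab = {"the": 0, "camel": 1, "walks": 2}

def one_hot(token_id, vocab_size):
    # Every entry is 0 except a single 1 at the token's position.
    vec = [0] * vocab_size
    vec[token_id] = 1
    return vec

# Toy 3x4 embedding matrix; in a real model these values are learned.
E = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8],
     [0.9, 1.0, 1.1, 1.2]]

def embed(one_hot_vec, matrix):
    # One-hot times matrix just selects the row where the 1 sits.
    return [sum(o * entry for o, entry in zip(one_hot_vec, column))
            for column in zip(*matrix)]

v = one_hot(vocab["camel"], len(vocab))
print(v)            # → [0, 1, 0]
print(embed(v, E))  # → [0.5, 0.6, 0.7, 0.8]
```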

&lt;p&gt;Common techniques for tokenization include &lt;a href="https://towardsdatascience.com/wordpiece-subword-based-tokenization-algorithm-1fbd14394ed7/" rel="noopener noreferrer"&gt;WordPiece&lt;/a&gt; or &lt;a href="https://towardsdatascience.com/wordpiece-subword-based-tokenization-algorithm-1fbd14394ed7/" rel="noopener noreferrer"&gt;Byte-Pair Encoding (BPE)&lt;/a&gt;, which break down words into smaller, more granular subword units to handle rare or unseen words more effectively. &lt;/p&gt;

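&lt;p&gt;To make BPE concrete, here is a single merge step over a toy character-level corpus. This is a simplified sketch of the algorithm, not a production tokenizer:&lt;/p&gt;

```python
from collections import Counter

def bpe_merge_step(words):
    """One Byte-Pair Encoding merge: find the most frequent adjacent
    symbol pair and fuse it everywhere it occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        return words, None
    best = max(pairs, key=pairs.get)
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while len(symbols) > i:
            if len(symbols) > i + 1 and (symbols[i], symbols[i + 1]) == best:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged, best

# Word frequencies, with each word pre-split into characters.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2}
corpus, pair = bpe_merge_step(corpus)
print(pair)    # ('l', 'o') ties with ('o', 'w') at count 7; the first wins
print(corpus)  # {('lo', 'w'): 5, ('lo', 'w', 'e', 'r'): 2}
```

&lt;p&gt;Repeating this step builds a subword vocabulary, so rare or unseen words can still be represented as sequences of known pieces.&lt;/p&gt;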

&lt;p&gt;In the &lt;a href="https://www.camel-ai.org" rel="noopener noreferrer"&gt;&lt;strong&gt;CAMEL-AI ecosystem&lt;/strong&gt;&lt;/a&gt;, you can efficiently experiment with various &lt;strong&gt;tokenization and embedding techniques&lt;/strong&gt; detailed in the &lt;strong&gt;&lt;a href="https://docs.camel-ai.org/key_modules/embeddings.html" rel="noopener noreferrer"&gt;Embeddings Module documentation&lt;/a&gt;&lt;/strong&gt;.  &lt;/p&gt;




&lt;h2&gt;
  
  
  Ensuring Quality Control and Dataset Balance in LLM Pretraining
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3h04b11putjwssn9dln5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3h04b11putjwssn9dln5.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;To ensure the model performs well across a variety of tasks, dataset quality control measures are essential. These include removing harmful or nonsensical text (such as hate speech, misinformation, or irrelevant content), ensuring diverse linguistic features, and de-duplicating content.  &lt;/p&gt;

&lt;p&gt;Balancing the dataset is crucial to ensure that it’s not &lt;strong&gt;overrepresented by any single type of text&lt;/strong&gt;.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Quality Control Tips
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Remove Harmful Content&lt;/strong&gt;: Filter out hate speech, misinformation, and irrelevant text.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;De-duplicate Data&lt;/strong&gt;: Ensure each piece of content is unique.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintain Genre Balance&lt;/strong&gt;: Combine informal social media posts with formal academic articles.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ensure Demographic Representation&lt;/strong&gt;: Actively include content from underrepresented groups to prevent bias.
&lt;/li&gt;
&lt;/ul&gt;
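&lt;p&gt;The de-duplication tip above can be sketched with exact content hashing; real pipelines typically layer near-duplicate detection (e.g. MinHash) on top, which this toy omits:&lt;/p&gt;

```python
import hashlib

def deduplicate(docs):
    """Exact de-duplication by normalized content hash: keep the first
    occurrence of each document, drop later copies."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["The camel walks.", "the camel walks.  ", "A different sentence."]
print(deduplicate(docs))  # → ['The camel walks.', 'A different sentence.']
```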

&lt;p&gt;Data is at the heart of training &lt;strong&gt;large language models&lt;/strong&gt;, shaping how they learn, adapt, and perform across different tasks. A well-curated dataset—rich in &lt;strong&gt;quality, diversity, and balance&lt;/strong&gt;—can make all the difference in achieving a powerful and reliable model.  &lt;/p&gt;




&lt;h2&gt;
  
  
  If this article helped you, let us know!
&lt;/h2&gt;

&lt;p&gt;Your feedback means a lot and helps us create even better content.  &lt;/p&gt;

&lt;p&gt;We’re also kicking off a &lt;strong&gt;Data Generation Blog Series&lt;/strong&gt;, where we’ll explore topics like &lt;strong&gt;Data Collection&lt;/strong&gt;, &lt;strong&gt;Post-Training&lt;/strong&gt;, &lt;strong&gt;Pretraining Data&lt;/strong&gt;, &lt;strong&gt;CoT Reasoning Data Generation&lt;/strong&gt;, and more.&lt;/p&gt;

&lt;p&gt;Stay tuned for what’s coming next!  &lt;/p&gt;




&lt;h2&gt;
  
  
  That's Everything 🚀
&lt;/h2&gt;

&lt;p&gt;Got questions about 🐫 CAMEL-AI? Join us on &lt;a href="https://discord.com/invite/CNcNpquyDc" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;!  &lt;/p&gt;

&lt;p&gt;Whether you want to share feedback, explore the latest in multi-agent systems, get support, or connect with others on exciting projects, we’d love to have you in the community! 🤝  &lt;/p&gt;

&lt;h3&gt;
  
  
  Check out some of our other work:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;🐫 &lt;strong&gt;Creating Your First CAMEL Agent&lt;/strong&gt; – &lt;a href="http://docs.camel-ai.org/cookbooks/basic_concepts/create_your_first_agent.html" rel="noopener noreferrer"&gt;Free Colab&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📊 &lt;strong&gt;Graph RAG Cookbook&lt;/strong&gt; – &lt;a href="https://docs.camel-ai.org/cookbooks/advanced_features/agents_with_rag.html" rel="noopener noreferrer"&gt;Free Colab&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🧑‍⚖️ &lt;strong&gt;Create A Hackathon Judge Committee with Workforce&lt;/strong&gt; – &lt;a href="https://docs.camel-ai.org/cookbooks/multi_agent_society/workforce_judge_committee.html" rel="noopener noreferrer"&gt;Free Colab  &lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔥 &lt;strong&gt;3 Ways to Ingest Data from Websites with Firecrawl &amp;amp; CAMEL&lt;/strong&gt; – &lt;a href="https://docs.camel-ai.org/cookbooks/data_processing/ingest_data_from_websites_with_Firecrawl.html" rel="noopener noreferrer"&gt;Free Colab&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🦥 &lt;strong&gt;Agentic SFT Data Generation with CAMEL and Mistral Models, Fine-Tuned with Unsloth&lt;/strong&gt; – &lt;a href="https://colab.research.google.com/drive/1lYgArBw7ARVPSpdwgKLYnp_NEXiNDOd-?usp=sharingg" rel="noopener noreferrer"&gt;Free Colab &lt;/a&gt; &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Thanks from everyone at 🐫 &lt;strong&gt;&lt;a href="https://www.camel-ai.org/" rel="noopener noreferrer"&gt;CAMEL-AI!&lt;/a&gt;&lt;/strong&gt; 🎉  &lt;/p&gt;


&lt;div class="ltag__user ltag__user__id__2728501"&gt;
    &lt;a href="/camel-ai" class="ltag__user__link profile-image-link"&gt;
      &lt;div class="ltag__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2728501%2F9d02f1ae-563b-4f1a-aa3d-975e00afeeb9.png" alt="camel-ai image"&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;div class="ltag__user__content"&gt;
    &lt;h2&gt;
&lt;a class="ltag__user__link" href="/camel-ai"&gt;Camel ai&lt;/a&gt;Follow
&lt;/h2&gt;
    &lt;div class="ltag__user__summary"&gt;
      &lt;a class="ltag__user__link" href="/camel-ai"&gt;https://camel-ai.org is working on finding the scaling laws of agents. The first and the best multi-agent framework. Discord: http://discord.camel-ai.org.&lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Agents with Human in the Loop : Everything You Need to Know</title>
      <dc:creator>Nomadev</dc:creator>
      <pubDate>Thu, 27 Feb 2025 10:21:25 +0000</pubDate>
      <link>https://forem.com/camelai/agents-with-human-in-the-loop-everything-you-need-to-know-3fo5</link>
      <guid>https://forem.com/camelai/agents-with-human-in-the-loop-everything-you-need-to-know-3fo5</guid>
      <description>&lt;p&gt;The rapid advancement of deep learning and Large Language Models (LLMs) has propelled AI agents from specialized tools to autonomous systems capable of handling complex, multi-step tasks. These agents demonstrate remarkable capabilities in language understanding, decision-making, and self-refinement. However, challenges such as hallucinated results, unreliable predictions, and lack of oversight limit their trustworthiness, particularly in high-stakes domains like robotics, software development, and decision automation.&lt;/p&gt;

&lt;p&gt;To enhance AI reliability, researchers have developed Human-in-the-Loop (HITL) frameworks, which integrate human expertise at key decision points to improve efficiency, accuracy, and accountability. HITL systems strike a balance between automation and human judgment, ensuring that AI escalates uncertain or critical decisions to experts while efficiently handling routine tasks autonomously. Conformal prediction, iterative feedback loops, and interactive validation are among the core techniques that empower HITL frameworks to minimize errors and increase adaptability in dynamic environments.&lt;/p&gt;

&lt;p&gt;This review explores the latest advancements in HITL techniques for multi-agent LLM systems, focusing on both research innovations and industrial applications. We examine state-of-the-art frameworks that implement human oversight mechanisms, as well as real-world deployments where HITL solutions enhance AI-driven workflows in robotics [1], software engineering [2], and autonomous agents. By analyzing these developments, we highlight the evolving role of human-AI collaboration in building more robust, transparent, and responsible AI systems.&lt;/p&gt;

&lt;h1&gt;
  
  
  Authors
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/drzekunguo" rel="noopener noreferrer"&gt;Zekun Guo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/nitpicker55555" rel="noopener noreferrer"&gt;Xiandan Zhang&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mugglejinx" rel="noopener noreferrer"&gt;Xiaotian Jin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/shuolucs" rel="noopener noreferrer"&gt;Shuo Lu&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/VC7100" rel="noopener noreferrer"&gt;Weisi Dai&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/nuerjibieke" rel="noopener noreferrer"&gt;Nujibieke&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/CharliGuo" rel="noopener noreferrer"&gt;Lanping Guo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Wendong-Fan" rel="noopener noreferrer"&gt;Wendong Fan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/lightaime" rel="noopener noreferrer"&gt;Guohao Li&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Table of Contents
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;Outline&lt;/li&gt;
&lt;li&gt;AI Agents / Human-in-the-loop background&lt;/li&gt;
&lt;li&gt;Human-In-The-Loop in Research Literature&lt;/li&gt;
&lt;li&gt;Current Human-in-the-loop solutions&lt;/li&gt;
&lt;li&gt;Summary of Human-in-the-loop&lt;/li&gt;
&lt;li&gt;Looking Ahead: The Future of Human-in-the-Loop AI&lt;/li&gt;
&lt;li&gt;Reference&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Human-In-The-Loop in Research Literature
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://robot-help.github.io" rel="noopener noreferrer"&gt;KnowNO Framework&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;In dynamic and unfamiliar environments, large models and robots often face a common problem: making overly confident yet incorrect predictions. A team of researchers from Princeton University and Google DeepMind addressed this issue by introducing the &lt;a href="https://arxiv.org/abs/2307.01928v2" rel="noopener noreferrer"&gt;KnowNo&lt;/a&gt; framework [1]. This system helps robots recognize when they’re uncertain and allows them to ask for help from humans when necessary, using a concept called conformal prediction (CP).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Does KnowNo Work?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The KnowNo framework integrates large language models (LLMs) and conformal prediction techniques in a structured pipeline. Here’s how it operates step by step:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5ofus4gdk1ol71d9wxs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5ofus4gdk1ol71d9wxs.png" alt="Image description" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Generating Candidate Plans:&lt;/strong&gt; The process begins with an LLM creating a list of possible action plans. These are presented in a multiple-choice question (MCQ) format, including an “E” option for “none of the above.” To come up with these options, the LLM considers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The robot’s observations (e.g., what it “sees” or detects in the environment).&lt;/li&gt;
&lt;li&gt;The task instructions provided by the user.&lt;/li&gt;
&lt;li&gt;Examples of how similar problems were solved before.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By combining these inputs, the LLM builds a rich context and generates a list of possible actions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Assessing Uncertainty and Narrowing Down Choices:&lt;/strong&gt; Using conformal prediction, the system evaluates the confidence level for each candidate plan. The goal is to identify a set of plausible actions while filtering out those deemed too uncertain. If the system narrows down to a single, high-confidence option, it proceeds to execute it. Otherwise, it knows it’s unsure and triggers the next step.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Seeking Human Help:&lt;/strong&gt; When the robot’s prediction isn’t confident enough, it asks a human for assistance. This ensures that even in ambiguous scenarios, the robot can rely on human expertise to move forward.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Executing the Plan:&lt;/strong&gt; Once the robot has a clear next step—whether determined autonomously or with human input—it executes the action.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
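&lt;p&gt;The decision logic in steps 2 and 3 can be sketched with a toy split conformal predictor. This is an illustrative reimplementation of the idea, not the authors’ code, and all probabilities below are invented:&lt;/p&gt;

```python
import math

def conformal_threshold(true_option_probs, epsilon=0.15):
    """Split conformal prediction: nonconformity is 1 - p(correct option)
    on held-out calibration tasks. The quantile q_hat then gives roughly
    (1 - epsilon) coverage on new tasks."""
    n = len(true_option_probs)
    scores = sorted(1.0 - p for p in true_option_probs)
    k = math.ceil((n + 1) * (1 - epsilon)) - 1  # 0-indexed quantile rank
    return scores[min(k, n - 1)]

def prediction_set(option_probs, q_hat):
    # Keep every candidate plan whose nonconformity stays within q_hat.
    return [opt for opt, p in option_probs.items() if q_hat >= 1.0 - p]

# Calibration: probability the model gave the correct plan on 8 held-out tasks.
q_hat = conformal_threshold([0.9, 0.8, 0.95, 0.7, 0.85, 0.6, 0.9, 0.75])

# New task: LLM scores for candidate plans A-E ("E" = none of the above).
plans = {"A": 0.70, "B": 0.25, "C": 0.03, "D": 0.01, "E": 0.01}
options = prediction_set(plans, q_hat)
if len(options) == 1:
    print(f"confident: execute plan {options[0]}")   # this branch runs here
else:
    print(f"uncertain between {options}: ask a human")
```

&lt;p&gt;When several plans survive the threshold, the set is ambiguous and the robot escalates to a human, which is exactly the “seek help when unsure” behavior KnowNo formalizes.&lt;/p&gt;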

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqxuc5v2oroy4fs9a0m5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqxuc5v2oroy4fs9a0m5.png" alt="Image description" width="800" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insights from KnowNo:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Smart Use of Uncertainty&lt;/strong&gt;: The framework transforms robot planning into a question‐answering format in which the LLM generates plans and CP techniques determine which ones are reliable. By explicitly handling uncertainty, the system can decide when it is safe to act independently and when to involve humans.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handling Complex Tasks&lt;/strong&gt;: KnowNo doesn’t stop at single decisions. For multi-step tasks, it aligns uncertainty across the entire sequence. This involves recalibrating predictions step by step to ensure consistency and minimize human intervention while maintaining accuracy.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;strong&gt;Experiment Highlights:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The researchers tested KnowNo in various scenarios, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simulated tabletop object rearrangement.&lt;/li&gt;
&lt;li&gt;Physical robots performing multi-step object manipulation.&lt;/li&gt;
&lt;li&gt;A mobile robotic arm operating in a kitchen setting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compared to baseline methods like Simple Set and Ensemble Set, KnowNo stood out by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Achieving target success rates even in challenging conditions.&lt;/li&gt;
&lt;li&gt;Reducing the need for human help without compromising task performance.&lt;/li&gt;
&lt;li&gt;Adapting well to different LLMs and task complexities.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Some Thoughts and Suggestions:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While KnowNo is an impressive step forward, a few areas could be improved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Handling Human Error&lt;/strong&gt;: The system assumes humans always provide accurate help, which may not hold true in real-world scenarios. Introducing a model to simulate or account for human mistakes could make the framework more robust.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficiency Concerns&lt;/strong&gt;: Generating and calibrating prediction sets for complex, multi-step tasks can be computationally expensive. Exploring more efficient calibration methods, such as hierarchical or incremental strategies, could significantly reduce the runtime costs.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;a href="https://arxiv.org/abs/2411.12924" rel="noopener noreferrer"&gt;&lt;strong&gt;The HULA framework: Bridging Automation and Human Expertise in Software Development&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/pdf/2411.12924" rel="noopener noreferrer"&gt;HULA&lt;/a&gt; (Human-in-the-loop LLM-based Agents) framework [2], proposed by researcher from Monash University and The University of Melbourne, enables software engineers to guide intelligent agents in software development tasks. By balancing automation with human expertise, HULA incorporates human feedback at every stage, improving the quality and efficiency of software development. The authors also showcase the integrations of the HULA framework into Atlassian JIRA.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsrn7qy7myvsuasoy19b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsrn7qy7myvsuasoy19b.png" alt="Image description" width="800" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The HULA framework consists of three main agents that collaborate to enhance the software development process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AI Planner Agent&lt;/strong&gt;: This agent identifies files related to the issue and formulates a coding plan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Coding Agent&lt;/strong&gt;: Based on the coding plan, this agent generates code changes that address the specified problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human Agent&lt;/strong&gt;: This role is fulfilled by software engineers who provide feedback on the performance of the AI agents and collaborate throughout the process.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The workflow of the HULA framework can be broken down into several key stages:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F03d6s85rvj22stfd1im0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F03d6s85rvj22stfd1im0.png" alt="Image description" width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Setting up a Task&lt;/strong&gt;: The software engineer selects a task and links it to the relevant code repository. Each task is accompanied by descriptive information that outlines its requirements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Planning&lt;/strong&gt;: The AI Planner Agent uses the task description to understand the work and its context. It identifies relevant files for the task, which the software engineer can review, edit, and confirm. After identifying the files, the AI Planner creates a coding plan to modify them and resolve the issue. The Human Agent then reviews this plan, provides additional instructions, and may regenerate it if needed. The Human Agent can also modify the list of relevant files and adjust the change plan. After several iterations, the Human Agent confirms the plan, allowing the process to proceed to the next stage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coding&lt;/strong&gt;: Once the software engineer approves the coding plan, the AI Coding Agent generates code changes for each file. The Human Agent reviews these changes and can provide further instructions if they don't meet expectations, prompting the AI Coding Agent to regenerate them. The AI agent also uses additional tools to optimize the code. This iterative process continues until the code passes validation or reaches the maximum number of attempts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raising a Pull Request&lt;/strong&gt;: Once the Human Agent agrees with the code changes, the generated code modifications are submitted as a pull request for review by other developers or processed as appropriate.&lt;/li&gt;
&lt;/ul&gt;
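&lt;p&gt;The stages above amount to two human-gated revision loops followed by a pull request. The sketch below is a hypothetical toy, with stub functions standing in for the AI Planner, AI Coding, and Human agents:&lt;/p&gt;

```python
def hula_workflow(issue, review_plan, review_code, max_attempts=3):
    """Toy version of the HULA loop: an AI planner proposes, a human
    reviews; an AI coder generates, the human reviews again; approved
    changes become a pull request. Reviewers return None to approve,
    or a string of feedback to request a revision."""
    plan = f"plan for: {issue}"                 # AI Planner Agent (stub)
    for _ in range(max_attempts):
        feedback = review_plan(plan)            # Human Agent reviews the plan
        if feedback is None:
            break
        plan = f"{plan} [revised: {feedback}]"  # planner regenerates
    code = f"code implementing ({plan})"        # AI Coding Agent (stub)
    for _ in range(max_attempts):
        feedback = review_code(code)            # Human Agent reviews the code
        if feedback is None:
            break
        code = f"{code} [revised: {feedback}]"
    return f"PR opened with: {code}"            # Raising a Pull Request

# A human who asks for one plan revision, then approves everything.
asked = {"plan": False}
def review_plan_once(plan):
    if not asked["plan"]:
        asked["plan"] = True
        return "also update the tests"
    return None

print(hula_workflow("fix login bug", review_plan_once, lambda code: None))
```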

&lt;p&gt;The team evaluated the HULA framework in three stages to measure its effectiveness:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;(1) An offline evaluation&lt;/strong&gt; of HULA without human feedback, fully &lt;strong&gt;automating&lt;/strong&gt; the process using SWE-Bench and an internal dataset of JIRA issues. This &lt;strong&gt;pre-deployment&lt;/strong&gt; evaluation ensures the HULA framework achieves acceptable performance before deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;(2) An online evaluation&lt;/strong&gt; of HULA augmented by human feedback using &lt;strong&gt;real-world&lt;/strong&gt; JIRA issues. Conducted in actual development practice with 45 software engineers at Atlassian, it provides further insight into HULA’s performance under &lt;strong&gt;actual usage conditions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;(3) An investigation&lt;/strong&gt; of practitioners' perceptions of the benefits and challenges of using HULA. The team conducted an online survey, which included 8 questions focusing on HULA's performance and 3 questions about user feedback.&lt;/p&gt;

&lt;p&gt;In the &lt;strong&gt;offline evaluation&lt;/strong&gt;, HULA's performance on SWE-Bench is &lt;strong&gt;comparable to SWE-agent Claude&lt;/strong&gt;, which ranks 6th on the SWE-Bench leaderboard. However, the authors found HULA achieves &lt;strong&gt;lower accuracy on the JIRA dataset&lt;/strong&gt; than on the SWE-Bench dataset. This suboptimal performance could be due to the increased diversity of input, in both programming languages and repositories. &lt;/p&gt;

&lt;p&gt;In the SWE-Bench dataset, issues typically had &lt;strong&gt;detailed descriptions with key information&lt;/strong&gt;, like module names or code snippets. Real-world JIRA issues in the internal dataset, however, usually rely on &lt;strong&gt;informal knowledge transfer&lt;/strong&gt;, like meetings or chats, instead of detailed documentation. As a result, in the &lt;strong&gt;online evaluation&lt;/strong&gt; with the Human Agent, &lt;strong&gt;8% of the JIRA issues had successfully merged HULA-assisted PRs&lt;/strong&gt; containing HULA-generated code in the code repositories.&lt;/p&gt;

&lt;p&gt;By comparing the offline and online evaluations, we conclude that the &lt;strong&gt;level of detail in the input can strongly affect the performance of LLM-based software development agents&lt;/strong&gt;. Practitioners, however, respond very positively when they can engage in the process by reviewing and enriching the issue descriptions. Furthermore, in the investigation, most participants agreed that the coding plan was accurate and the generated code was easy to read and modify, which helped &lt;strong&gt;reduce their initial development time and effort&lt;/strong&gt;. A few participants also acknowledged that &lt;strong&gt;HULA’s workflow could promote good documentation&lt;/strong&gt;, though it requires more effort to provide detailed issue descriptions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Current Human-in-the-loop solutions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.humanlayer.dev" rel="noopener noreferrer"&gt;&lt;strong&gt;HumanLayer&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2xmh9jps9oyil1z2d9t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2xmh9jps9oyil1z2d9t.png" alt="Image description" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;HumanLayer is a YC-backed company (F24 batch) that raised $500K in its pre-seed round. It provides an API and SDK that integrate human decision-making into AI agent workflows. With HumanLayer, an AI agent can request human approval at any step of its execution, while the product handles routing the requests or messages to the designated group through their preferred channel. It is framework-agnostic and can be easily integrated into any agent framework that has tool-calling functions.&lt;/p&gt;

&lt;p&gt;HumanLayer is designed to revolutionize the future of AI by empowering the next generation of Autonomous Agents. These agents are no longer reliant on human initiation; instead, they operate independently in what we call the “outer loop,” actively working toward their goals by utilizing a variety of tools and functions. Communication between humans and agents is now agent-initiated, occurring only when a critical function requires human approval or feedback. This shift unlocks a new level of efficiency and autonomy, allowing AI to evolve in ways that were once unimaginable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Approval Workflows&lt;/strong&gt;: Quickly integrate the SDK to ensure human oversight of critical function calls. Denied messages are fed back into the agent's context window, allowing agents to learn and to approve or deny automatically based on past human interactions.&lt;/p&gt;

&lt;p&gt;What elements can be controlled by a human in this workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Creation of approval request&lt;/li&gt;
&lt;li&gt;Pausing AI workflow until reaching decision outcome&lt;/li&gt;
&lt;li&gt;Execution of pre-defined tasks for rejection cases&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8dks4itrpleo5m747ywt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8dks4itrpleo5m747ywt.png" alt="Image description" width="800" height="152"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
    HumanLayer cloud for receiving approvals&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For this part of the functionality, the HumanLayer backend handles approval requests and routes them to the target groups for a decision.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Humans as Tools&lt;/strong&gt;: Integrate multiple human contact channels into the agent's toolchain so the AI agent can collect human feedback &lt;strong&gt;(not approval)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What elements can be controlled by a human in this workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Creation of approval request&lt;/li&gt;
&lt;li&gt;Pausing AI workflow until reaching decision outcome&lt;/li&gt;
&lt;li&gt;Passing the response back to the LLM&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The HumanLayer SDK handles message routing and collects the response/input from the human.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Custom Responses and Escalation&lt;/strong&gt;: Pre-fill response prompts for seamless human-machine interaction, and coordinate approvals across multiple teams and individuals.&lt;/p&gt;

&lt;p&gt;Users can define structured response options to guide or format human inputs.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;HumanLayer currently supports these communication channels in the dashboard settings: Slack, email, SMS, and WhatsApp. Users can configure advanced options such as direct integration with React applications or composite channels with custom rules.&lt;/p&gt;
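&lt;p&gt;The approval workflow above can be sketched in a few lines of plain Python. This is a simplified local illustration, not the HumanLayer SDK: the &lt;code&gt;reviewer&lt;/code&gt; callback stands in for HumanLayer's channel routing and its blocking wait for a human decision.&lt;/p&gt;

```python
from typing import Callable

def require_approval(human_decision: Callable[[str], tuple]):
    """Gate a tool call behind a human decision.

    `human_decision` receives a description of the pending call and
    returns (approved, comment). In a real deployment, this is the point
    where an SDK would route the request to Slack or email and block
    until a reply arrives.
    """
    def decorator(tool):
        def wrapper(*args, **kwargs):
            approved, comment = human_decision(f"{tool.__name__}{args}")
            if not approved:
                # The denial is returned to the agent so it can adjust its plan.
                return f"DENIED by human reviewer: {comment}"
            return tool(*args, **kwargs)
        return wrapper
    return decorator

# Simulated reviewer: rejects anything that looks destructive.
def reviewer(request: str) -> tuple:
    return ("delete" not in request, "destructive actions are not allowed")

@require_approval(reviewer)
def send_report(recipient: str) -> str:
    return f"report sent to {recipient}"

@require_approval(reviewer)
def delete_records(table: str) -> str:
    return f"records deleted from {table}"

print(send_report("ops@example.com"))   # approved, runs normally
print(delete_records("users"))          # blocked by the reviewer
```

&lt;p&gt;The key design point is that a denial is not an exception but a message fed back into the agent's context, so the agent can replan instead of crashing.&lt;/p&gt;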
&lt;h3&gt;
  
  
  &lt;a href="https://www.gotohuman.com" rel="noopener noreferrer"&gt;&lt;strong&gt;GotoHuman&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e24l7404qor22wumo86.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e24l7404qor22wumo86.jpg" alt="Image description" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GotoHuman is a human-in-the-loop solution designed to integrate human oversight into AI-driven workflows, ensuring accurate and context-aware decision-making.&lt;/p&gt;

&lt;p&gt;Some of the features are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Custom Review Forms:&lt;/strong&gt; Quickly create tailored forms to display content or capture human input, allowing your team to review AI-generated content, approve workflow steps, or provide necessary input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human Review Requests:&lt;/strong&gt; Utilize Python or TypeScript SDKs, or directly call the API, to request human reviews when AI-generated content or workflow steps require approval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human Decision Awaiting:&lt;/strong&gt; Review requests are automatically shared with your team in an authenticated environment. Short-lived public links can also be activated for external reviewers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response Reception:&lt;/strong&gt; Upon completion of a review, results are sent to your custom webhook, allowing your workflow to proceed seamlessly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;gotoHuman is designed to work with any AI framework, library, or model, providing flexibility in integration.  It offers SDKs for Python and TypeScript. By integrating gotoHuman, teams can maintain human supervision within AI workflows, enhancing safety, compliance, and precision.&lt;/p&gt;
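&lt;p&gt;The request/await/webhook cycle described above can be sketched as follows. This is a local simulation of the pattern, not the gotoHuman SDK; all names and payload fields are illustrative.&lt;/p&gt;

```python
import uuid

class ReviewQueue:
    """Minimal stand-in for a human-review service: a workflow files a
    review request, a reviewer resolves it, and the result is delivered
    to the workflow's webhook."""

    def __init__(self, webhook):
        self.webhook = webhook   # callable invoked with the review result
        self.pending = {}        # review_id -> submitted form data

    def request_review(self, form_data: dict) -> str:
        review_id = str(uuid.uuid4())
        self.pending[review_id] = form_data
        return review_id

    def resolve(self, review_id: str, decision: str, comment: str = ""):
        payload = self.pending.pop(review_id)
        # On completion, the result is pushed to the webhook so the
        # paused workflow can proceed.
        self.webhook({"id": review_id, "input": payload,
                      "decision": decision, "comment": comment})

results = []
queue = ReviewQueue(webhook=results.append)
rid = queue.request_review({"draft": "AI-generated release notes"})
queue.resolve(rid, decision="approved")
```

&lt;p&gt;In a real integration the webhook would be an HTTP endpoint in your workflow engine rather than an in-memory callback.&lt;/p&gt;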
&lt;h3&gt;
  
  
  &lt;a href="https://www.redouble.ai" rel="noopener noreferrer"&gt;Redouble AI&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0lwykjnywucndtetwxje.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0lwykjnywucndtetwxje.png" alt="Image description" width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Redouble AI is a young, YC-backed company that raised $500K in September 2024. Publicly available information about the company is limited. Currently, there are no published papers, open-source code, or user feedback accessible.&lt;/p&gt;

&lt;p&gt;Redouble AI is a solution for scaling human-in-the-loop AI workflows in regulated industries. It:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Dynamically learns from your unique domain-specific human feedback data.&lt;/li&gt;
&lt;li&gt;Provides recommendations on whether to send the output of your LLM pipeline to a human for review.&lt;/li&gt;
&lt;li&gt;Monitors the insights from your human reviewers at scale while also flagging suspicious reviews to ensure consistent final outputs.&lt;/li&gt;
&lt;li&gt;Integrates easily with your existing pipeline with just a couple of simple API calls.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  &lt;a href="https://modelcontextprotocol.io/introduction" rel="noopener noreferrer"&gt;Model Context Protocol (MCP)&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8i3i8rpphlplmiez7gtf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8i3i8rpphlplmiez7gtf.png" alt="Image description" width="800" height="623"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MCP (Model Context Protocol) is an open standard proposed by Anthropic, designed to provide a unified interface for AI assistants to interact with external systems (such as files, APIs, and databases), similar to how USB-C serves as a universal standard in hardware.&lt;/p&gt;

&lt;p&gt;It addresses the challenges of integrating AI models with heterogeneous data sources and tools, improving response accuracy and relevance through a standardized communication mechanism.&lt;/p&gt;

&lt;p&gt;In a human-in-the-loop setup, an AI agent can leverage MCP servers as integration tools within platforms like Slack to send notifications and seek human guidance before executing critical actions. &lt;/p&gt;

&lt;p&gt;For example, if the agent detects a potential scheduling conflict in an automated calendar update, it can use the Slack MCP server to send a message asking a human operator for suggestions or explicit approval before proceeding.&lt;/p&gt;
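&lt;p&gt;Concretely, MCP messages follow JSON-RPC 2.0, and tool invocations use the &lt;code&gt;tools/call&lt;/code&gt; method. Here is a sketch of the request an agent might send to a Slack MCP server in the scenario above; the tool name and argument fields are illustrative, not any specific server's schema.&lt;/p&gt;

```python
import json

def mcp_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request (MCP uses JSON-RPC 2.0)."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# Ask a human operator on Slack before applying a calendar change.
request = mcp_tool_call(
    1,
    "slack_post_message",          # illustrative tool name
    {"channel": "#ops-approvals",
     "text": "Detected a scheduling conflict. Proceed? (yes/no)"},
)
```

&lt;p&gt;The operator's reply comes back through the same protocol as the tool result, so the agent can treat "ask a human" like any other tool call.&lt;/p&gt;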

&lt;p&gt;&lt;strong&gt;MCP’s implementation involves:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;By deploying MCP servers (such as &lt;code&gt;read_file&lt;/code&gt; and &lt;code&gt;read_dir&lt;/code&gt; functions), AI can directly invoke external functions without repeatedly writing adapter code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supports bidirectional communication&lt;/strong&gt;: AI can access data as well as respond to tool-triggered actions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why do we need MCP? Because of the following industry challenges:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Silos&lt;/strong&gt;: AI models are constrained by fragmented data sources, making cross-system collaboration difficult.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inefficient Development&lt;/strong&gt;: Different systems require custom integrations (e.g., API variations, authentication methods), leading to redundant code and high maintenance costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Risks&lt;/strong&gt;: Separate security protocols for each platform complicate access management.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;To address these challenges, MCP provides:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified Interface&lt;/strong&gt;: Standardized tool invocation reduces the technical barrier for developers. MCP provides a universal specification supporting mainstream programming languages (such as Python and TypeScript), enabling developers to quickly build MCP clients or servers. For example, Claude’s desktop application includes built-in MCP support for "plug-and-play" functionality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-Term Compatibility&lt;/strong&gt;: Abstracting the protocol layer minimizes the maintenance burden caused by system API changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bidirectional data interaction&lt;/strong&gt;: AI can read/write external systems (e.g., database queries, file editing) via MCP. Some use cases are: Enterprise tools (Slack, Google Drive), development environments (Git, VS Code extensions), etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A tool ecosystem:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic provides SDKs and open-source libraries&lt;/strong&gt; (e.g., the &lt;code&gt;@modelcontextprotocol/*&lt;/code&gt; packages) to accelerate enterprise system integration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensibility&lt;/strong&gt;: Supports custom feature extensions to accommodate private deployments.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;By using MCP, we can achieve:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Reduction&lt;/strong&gt;: Cuts over 60% of custom integration development time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced Security&lt;/strong&gt;: Centralized access management reduces data leakage risks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ecosystem Collaboration&lt;/strong&gt;: Drives AI evolution from "closed inference" to "open system agents," laying the foundation for AGI deployment.&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  CAMEL with Human-In-The-Loop
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw7rloh95nv1bctzwazub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw7rloh95nv1bctzwazub.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/camel-ai/camel" rel="noopener noreferrer"&gt;&lt;strong&gt;CAMEL-AI&lt;/strong&gt;&lt;/a&gt; is an open-source community dedicated to finding the scaling laws of agents. CAMEL framework implements and supports various types of agents, tasks, prompts, models, and simulated environments. &lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Human-In-The-Loop&lt;/strong&gt; features in CAMEL facilitate collaborative interactions between AI agents and human participants. They are designed to simulate dynamic exchanges where AI agents take on specific roles (e.g., AI Assistant and AI User) to complete tasks, while a human acts as a critic or supervisor to guide the process. This framework is ideal for tasks requiring creativity, problem-solving, or iterative refinement.&lt;/p&gt;

&lt;p&gt;The current version of CAMEL supports two important abilities:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Human-In-The-Loop&lt;/strong&gt;: The ability for the agent to consult a human during execution of a task (by using &lt;code&gt;HumanToolkit&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;In the basic use case, the agent behaves like a chatbot that can consult the human directly:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;camel.toolkits&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HumanToolkit&lt;/span&gt;
&lt;span class="n"&gt;human_toolkit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HumanToolkit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;system_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;human_toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tools&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Test me on the capital of some country, and comment on my answer.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;The basic example turns our agent into an interactive chatbot. The true power of &lt;strong&gt;Human-in-the-Loop&lt;/strong&gt; shows when using multiple agents (the &lt;a href="https://docs.camel-ai.org/key_modules/workforce.html" rel="noopener noreferrer"&gt;&lt;strong&gt;Workforce&lt;/strong&gt;&lt;/a&gt; module in CAMEL). For example, this use case shows how agents can help design a travel plan and ask the user for feedback to modify it:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ignoring imports...
&lt;/span&gt;&lt;span class="n"&gt;human_toolkit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HumanToolkit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;search_toolkit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SearchToolkit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Make a travel plan for a 2-day trip to Paris. Let user decide the final schedule in the end.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# This agent researches and designs the travel plan.
&lt;/span&gt;&lt;span class="n"&gt;activity_research_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;system_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    You are a travel planner. You are given a task to make a travel plan for a 2-day trip to Paris.
    You need to research the activities and attractions in Paris and provide a travel plan.
    You should make a list of activities and attractions for each day.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;openai_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;search_toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tools&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# This agent reviews the plan, and consults user for feedback!
&lt;/span&gt;&lt;span class="n"&gt;review_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;system_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    You are a reviewer. You are given a travel plan and a budget. 
    You need to review the travel plan and budget and provide a review. 
    You should make comments and ask the user to adjust the travel plan and budget.
    You should ask the user to give suggestions for the travel plan and budget.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;openai_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;human_toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tools&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt; 
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;workforce&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_single_agent_worker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An agent that can do web searches&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;activity_research_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;add_single_agent_worker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A reviewer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;review_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;workforce&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Human approval&lt;/strong&gt;: The ability for the agent to ask for approval before executing certain tasks. The following example defines two tools for the agent to execute: one is normal, while the other is more sensitive and requires user approval.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;humanlayer.core.approval&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HumanLayer&lt;/span&gt;
&lt;span class="n"&gt;hl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HumanLayer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;humanlayer_api_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# add can be called without approval
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;normal_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Normal tasks for agent to execute&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;# but multiply must be approved by a human
&lt;/span&gt;&lt;span class="nd"&gt;@hl.require_approval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sensitive_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt; Sensitive task that requires user approval&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;For more details, see the CAMEL cookbook &lt;a href="https://docs.camel-ai.org/cookbooks/advanced_features/agents_with_human_in_loop_and_tool_approval.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary of Human-in-the-loop
&lt;/h2&gt;

&lt;p&gt;This post presents a comprehensive overview of recent developments in human-in-the-loop (HITL) approaches for multi-agent frameworks, highlighting their significance in enhancing AI decision-making by integrating human expertise. It covers a variety of methodologies that address different AI challenges, particularly in uncertainty management, software development, AI workflow oversight, and autonomous agents.&lt;/p&gt;

&lt;p&gt;Specifically, we reviewed the &lt;a href="https://arxiv.org/abs/2307.01928v2" rel="noopener noreferrer"&gt;KnowNo&lt;/a&gt; framework, a conformal prediction-based system for robotic planning that enables LLMs to assess uncertainty and request human intervention when necessary, reducing reliance on incorrect high-confidence predictions. We then examined the HULA framework, a human-in-the-loop LLM agent designed to assist in software development, particularly in issue tracking and code generation, by iteratively refining AI-generated outputs with human feedback. Additionally, we discussed HumanLayer, GotoHuman, and Redouble AI, which provide solutions for integrating human oversight into AI workflows, ensuring that AI agents consult humans for approvals or corrections before executing critical actions. Another key development is the Model Context Protocol (MCP) by Anthropic, which establishes a standardized interface for AI models to seamlessly interact with external data sources, addressing interoperability challenges in AI-driven workflows.&lt;/p&gt;

&lt;p&gt;The CAMEL framework, an open-source multi-agent framework, has integrated human-in-the-loop decision-making and human approval processes for AI agents, enhancing the adaptability and accountability of multi-agent systems. This approach shifts AI systems away from static, rule-based automation toward adaptive, self-correcting agents that engage humans strategically to improve decision-making.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking Ahead: The Future of Human-in-the-Loop AI
&lt;/h2&gt;

&lt;p&gt;Moving forward, we can expect the human-in-the-loop paradigm to become increasingly central to AI system design and deployment. Below are some key trends and possibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scaling HITL for Large-Scale Deployment:&lt;/strong&gt; Implementing HITL solutions across industries (e.g., law, healthcare, and finance) requires efficient human-AI collaboration models that balance automation with oversight. Existing frameworks must evolve to support more dynamic, real-time decision-making environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Beyond Human Approval—Towards Human-AI Synergy:&lt;/strong&gt; Current frameworks mostly involve humans in a corrective or oversight role, but future advancements could explore proactive collaboration, where humans and AI co-create solutions in real time, functioning as co-pilots to each other.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning from Human Feedback:&lt;/strong&gt; Future HITL systems should not only rely on human oversight but also learn from human decision-making patterns, expertise, and contextual judgments. By modeling human preference, AI can better anticipate when intervention is needed and reinforce its decision-making processes to align with human expertise.&lt;/li&gt;
&lt;/ul&gt;
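&lt;p&gt;The last point can be made concrete with a toy sketch: an agent records past human verdicts per action type and escalates to a human whenever an action has not yet earned enough trust. The thresholds and bookkeeping here are illustrative, not taken from any of the frameworks above.&lt;/p&gt;

```python
from collections import defaultdict

class EscalationPolicy:
    """Escalate to a human until an action has earned enough trust."""

    def __init__(self, min_samples: int = 5, approval_threshold: float = 0.9):
        self.min_samples = min_samples
        self.approval_threshold = approval_threshold
        self.history = defaultdict(list)   # action -> list of True/False verdicts

    def record(self, action: str, approved: bool):
        self.history[action].append(approved)

    def needs_human(self, action: str) -> bool:
        verdicts = self.history[action]
        if len(verdicts) < self.min_samples:
            return True                    # not enough evidence yet
        rate = sum(verdicts) / len(verdicts)
        return rate < self.approval_threshold

policy = EscalationPolicy()
for _ in range(5):
    policy.record("send_newsletter", approved=True)   # routine, always approved
policy.record("drop_table", approved=False)           # risky, was denied once
```

&lt;p&gt;After five approvals, &lt;code&gt;send_newsletter&lt;/code&gt; no longer needs review, while &lt;code&gt;drop_table&lt;/code&gt; (and any unseen action) still escalates.&lt;/p&gt;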

&lt;p&gt;The evolution of human-in-the-loop AI will drive greater autonomy, accountability, flexibility, and ethical alignment in AI systems. By leveraging the best of both worlds—high-speed, data-driven AI processing and carefully integrated human judgment—we can pave the way for more effective, trustworthy, and sustainable AI adoption.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Reference&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Ren, A. Z., Dixit, A., Bodrova, A., Singh, S., Tu, S., Brown, N., Xu, P., Takayama, L., Xia, F., Varley, J., Xu, Z., Sadigh, D., Zeng, A., &amp;amp; Majumdar, A. (2023). Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners. &lt;em&gt;arXiv preprint arXiv:2307.01928v2&lt;/em&gt;. Retrieved from &lt;a href="https://arxiv.org/abs/2307.01928v2" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2307.01928v2&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Takerngsaksiri, W., Pasuksmit, J., Thongtanunam, P., Tantithamthavorn, C., Zhang, R., Jiang, F., Li, J., Cook, E., Chen, K., &amp;amp; Wu, M. (2024). Human-In-the-Loop Software Development Agents. &lt;em&gt;arXiv preprint arXiv:2411.12924&lt;/em&gt;. Retrieved from &lt;a href="https://arxiv.org/pdf/2411.12924" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2411.12924&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;HumanLayer: &lt;a href="https://www.humanlayer.dev/" rel="noopener noreferrer"&gt;https://www.humanlayer.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Gotohuman: &lt;a href="https://www.gotohuman.com/" rel="noopener noreferrer"&gt;https://www.gotohuman.com/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Redouble AI: &lt;a href="https://www.ycombinator.com/companies/redouble-ai" rel="noopener noreferrer"&gt;https://www.ycombinator.com/companies/redouble-ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Model Context Protocol (MCP): &lt;a href="https://www.anthropic.com/news/model-context-protocol" rel="noopener noreferrer"&gt;https://www.anthropic.com/news/model-context-protocol&lt;/a&gt;, &lt;a href="https://x.com/alexalbert__/status/1861079762506252723" rel="noopener noreferrer"&gt;https://x.com/alexalbert__/status/1861079762506252723&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;CAMEL human-in-the-loop critic example: &lt;a href="https://github.com/camel-ai/camel/blob/master/examples/ai_society/role_playing_with_human.py" rel="noopener noreferrer"&gt;https://github.com/camel-ai/camel/blob/master/examples/ai_society/role_playing_with_human.py&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Camel human-in-loop cookbook: &lt;a href="https://docs.camel-ai.org/cookbooks/advanced_features/agents_with_human_in_loop_and_tool_approval.html" rel="noopener noreferrer"&gt;https://docs.camel-ai.org/cookbooks/advanced_features/agents_with_human_in_loop_and_tool_approval.html&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  That's Everything 🚀
&lt;/h2&gt;

&lt;p&gt;Got questions about 🐫 CAMEL-AI? Join us on Discord!  &lt;/p&gt;

&lt;p&gt;Whether you want to share feedback, explore the latest in multi-agent systems, get support, or connect with others on exciting projects, we’d love to have you in the community! 🤝  &lt;/p&gt;

&lt;h3&gt;
  
  
  Check out some of our other work:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;🐫 &lt;strong&gt;Creating Your First CAMEL Agent&lt;/strong&gt; – &lt;a href="http://docs.camel-ai.org/cookbooks/basic_concepts/create_your_first_agent.html" rel="noopener noreferrer"&gt;Free Colab&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📊 &lt;strong&gt;Graph RAG Cookbook&lt;/strong&gt; – &lt;a href="https://docs.camel-ai.org/cookbooks/advanced_features/agents_with_rag.html" rel="noopener noreferrer"&gt;Free Colab&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🧑‍⚖️ &lt;strong&gt;Create A Hackathon Judge Committee with Workforce&lt;/strong&gt; – &lt;a href="https://docs.camel-ai.org/cookbooks/multi_agent_society/workforce_judge_committee.html" rel="noopener noreferrer"&gt;Free Colab  &lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔥 &lt;strong&gt;3 Ways to Ingest Data from Websites with Firecrawl &amp;amp; CAMEL&lt;/strong&gt; – &lt;a href="https://docs.camel-ai.org/cookbooks/data_processing/ingest_data_from_websites_with_Firecrawl.html" rel="noopener noreferrer"&gt;Free Colab&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🦥 &lt;strong&gt;Agentic SFT Data Generation with CAMEL and Mistral Models, Fine-Tuned with Unsloth&lt;/strong&gt; – &lt;a href="https://colab.research.google.com/drive/1lYgArBw7ARVPSpdwgKLYnp_NEXiNDOd-?usp=sharing" rel="noopener noreferrer"&gt;Free Colab&lt;/a&gt; &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Thanks from everyone at 🐫 &lt;strong&gt;&lt;a href="https://www.camel-ai.org/" rel="noopener noreferrer"&gt;CAMEL-AI!&lt;/a&gt;&lt;/strong&gt; 🎉  &lt;/p&gt;


&lt;div class="ltag__user ltag__user__id__2728501"&gt;
    &lt;a href="/camel-ai" class="ltag__user__link profile-image-link"&gt;
      &lt;div class="ltag__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2728501%2F9d02f1ae-563b-4f1a-aa3d-975e00afeeb9.png" alt="camel-ai image"&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;div class="ltag__user__content"&gt;
    &lt;h2&gt;
&lt;a class="ltag__user__link" href="/camel-ai"&gt;Camel ai&lt;/a&gt;
&lt;/h2&gt;
    &lt;div class="ltag__user__summary"&gt;
      &lt;a class="ltag__user__link" href="/camel-ai"&gt;https://camel-ai.org is working on finding the scaling laws of agents. The first and the best multi-agent framework. Discord: http://discord.camel-ai.org.&lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>tutorial</category>
      <category>openai</category>
    </item>
  </channel>
</rss>
