<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nomadev</title>
    <description>The latest articles on Forem by Nomadev (@thenomadevel).</description>
    <link>https://forem.com/thenomadevel</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F746775%2F31f80a29-e4d4-475b-9cd6-557690ca7e01.jpg</url>
      <title>Forem: Nomadev</title>
      <link>https://forem.com/thenomadevel</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/thenomadevel"/>
    <language>en</language>
    <item>
      <title>Everything you need to know about AI this week</title>
      <dc:creator>Nomadev</dc:creator>
      <pubDate>Sat, 04 Oct 2025 20:40:45 +0000</pubDate>
      <link>https://forem.com/thenomadevel/everything-you-need-to-know-about-ai-this-week-4akl</link>
      <guid>https://forem.com/thenomadevel/everything-you-need-to-know-about-ai-this-week-4akl</guid>
      <description>&lt;p&gt;Hey there, &lt;a href="https://x.com/thenomadevel" rel="noopener noreferrer"&gt;Nomadev&lt;/a&gt; here! If you're reading this, you're probably as excited about AI as I am. Every week brings something new, with fresh models, smarter agents, and ideas that push the limits of what we can build.&lt;/p&gt;

&lt;p&gt;So I thought I’d make it easier for you to stay in the loop.&lt;br&gt;
Here’s a quick roundup of everything that happened in AI this week, covering updates from OpenAI, Anthropic, Nvidia, IBM, and more.&lt;/p&gt;

&lt;p&gt;Whether you’re a developer, a researcher, or just an AI enthusiast who loves seeing how fast this space moves, this is your quick catch-up. Let’s dive in! 🚀&lt;/p&gt;
&lt;h3&gt;
  
  
  1. &lt;a href="https://x.com/OpenAI/status/1973075422058623274" rel="noopener noreferrer"&gt;OpenAI Sora 2&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cugpnwo3dctzlfrjvcz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cugpnwo3dctzlfrjvcz.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sora 2 is OpenAI’s next big leap in video generation, turning simple text prompts into high-fidelity, cinematic visuals.&lt;br&gt;&lt;br&gt;
Whether you want to create storyboards, concept videos, or just explore visual storytelling with AI, Sora 2 makes it feel effortless.&lt;br&gt;&lt;br&gt;
This isn't just an upgrade, it's a preview of how creative workflows will change.&lt;/p&gt;

&lt;p&gt;Here’s how Sora 2 makes things magical:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Realistic Motion&lt;/strong&gt;: Smooth, consistent movement across scenes with fewer artifacts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scene Coherence&lt;/strong&gt;: Objects and subjects stay consistent throughout the generated video.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Camera Control&lt;/strong&gt;: Prompts can influence angle, zoom, and transitions for more directed storytelling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better Prompts = Better Results&lt;/strong&gt;: Try describing mood, lighting, and movement for high-quality outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built for Creators&lt;/strong&gt;: Ideal for product ads, mood boards, music videos, and explainer visuals.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OpenAI is gradually rolling it out to trusted partners.&lt;br&gt;&lt;br&gt;
In the meantime, follow the latest updates from &lt;a href="https://x.com/OpenAI/status/1973075422058623274" rel="noopener noreferrer"&gt;OpenAI on X&lt;/a&gt; and start crafting prompts of your own.&lt;/p&gt;

&lt;p&gt;Sora 2 isn't just a video tool, it's your creative assistant on turbo mode.&lt;/p&gt;


&lt;h3&gt;
  
  
  2. &lt;a href="https://x.com/Alibaba_Qwen/status/1974289216113947039" rel="noopener noreferrer"&gt;Qwen3-VL-30B-A3B Instruct &amp;amp; Thinking&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fektoosfpoqrv42zv6awa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fektoosfpoqrv42zv6awa.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alibaba’s Qwen team launched Qwen3-VL-30B-A3B Instruct &amp;amp; Thinking, a new generation of large vision-language models.&lt;br&gt;&lt;br&gt;
With just 3B active parameters, it delivers powerhouse performance while staying lightweight and efficient.&lt;br&gt;&lt;br&gt;
This release brings all the capabilities of Qwen3-VL in a smaller but sharper package.&lt;/p&gt;

&lt;p&gt;Here’s why Qwen3-VL-30B-A3B stands out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Big performance, small footprint&lt;/strong&gt;: With only 3B active parameters, it rivals GPT‑5 Mini and Claude Sonnet 4 on many benchmarks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wide skill range&lt;/strong&gt;: STEM problem solving, VQA, OCR, Video understanding, and Agent tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent wins&lt;/strong&gt;: Outperforms bigger models across multiple evaluation benchmarks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All-in-one model&lt;/strong&gt;: Instruct and “Thinking” variants available for different reasoning tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimized for developers&lt;/strong&gt;: A more accessible, high-quality open model for experimentation and deployment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Qwen3-VL-30B-A3B isn't just another model release, it's a serious competitor bringing high-end performance into a smaller footprint.&lt;br&gt;&lt;br&gt;
Follow their updates on &lt;a href="https://x.com/Alibaba_Qwen/status/1974289216113947039" rel="noopener noreferrer"&gt;Qwen's official X account&lt;/a&gt; to stay ahead.&lt;/p&gt;
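
&lt;p&gt;If you want to poke at it yourself, here’s a minimal sketch using the Hugging Face &lt;code&gt;transformers&lt;/code&gt; image-text-to-text pipeline. The model ID (and the need for a recent &lt;code&gt;transformers&lt;/code&gt; release with Qwen3-VL support) are assumptions, so check the model card:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: querying Qwen3-VL via the transformers pipeline.
# Assumes a recent transformers release with Qwen3-VL support and that
# the Hugging Face model ID below is correct.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="Qwen/Qwen3-VL-30B-A3B-Instruct",  # assumed model ID
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/photo.jpg"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]

print(pipe(text=messages, max_new_tokens=128))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;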


&lt;h3&gt;
  
  
  3. &lt;a href="https://x.com/AntLingAGI/status/1972711364876697612" rel="noopener noreferrer"&gt;AntLing Ring-1T: 1 Trillion Open-Source Thinking Model&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdhhebalx0g1olinungz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdhhebalx0g1olinungz.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AntLingAGI just launched Ring-1T, the &lt;strong&gt;first open-source model with 1 trillion parameters&lt;/strong&gt; focused on deep thinking and reasoning tasks.&lt;br&gt;&lt;br&gt;
Early benchmarks show strong results across math, logic, and natural language problems, with impressive performance on high-difficulty benchmarks.&lt;/p&gt;

&lt;p&gt;Here’s why Ring-1T is a big deal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1 Trillion Parameters&lt;/strong&gt;: One of the largest open-source models ever released.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong benchmarks&lt;/strong&gt;: Scored 92.6 on AIME25, 84.5 on HMMT25, and 94.7 on CF (Codeforces-style) tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IMO-level math&lt;/strong&gt;: Solved IMO25 Q3 in one shot, and showed progress on Q1, Q2, Q4, and Q5.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open and evolving&lt;/strong&gt;: Still improving, and already competitive with top models like GPT-5 Thinking and DeepSeek Terminus.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Designed for thinkers&lt;/strong&gt;: Built for reasoning-heavy tasks like coding competitions, math olympiads, and logic-based prompts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're exploring reasoning agents or want to go beyond general chat models, Ring-1T is worth a test run.  &lt;/p&gt;


&lt;h3&gt;
  
  
  4. &lt;a href="https://x.com/claudeai/status/1972706807345725773" rel="noopener noreferrer"&gt;Claude Sonnet 4.5: Best Model for Agentic Coding&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4bjjeu9tua8m7u7pexb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4bjjeu9tua8m7u7pexb.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Anthropic launched Claude Sonnet 4.5, their &lt;strong&gt;strongest model yet&lt;/strong&gt; for building agentic systems and reasoning with computers.&lt;br&gt;&lt;br&gt;
It’s designed specifically to help developers create smarter agents, solve logic-heavy tasks, and write more reliable code.&lt;/p&gt;

&lt;p&gt;Here’s what makes Claude Sonnet 4.5 stand out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Top-tier agentic coding&lt;/strong&gt;: Scored 77.2% on SWE-bench Verified (real-world code fixes), rising to 82.0% with parallel test-time compute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Big jump in logic tasks&lt;/strong&gt;: Terminal-Bench, high school math, and graduate-level reasoning scores have all improved.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool-use ready&lt;/strong&gt;: Claude Sonnet 4.5 shows strong performance on τ²-bench, which measures how well agents use tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best at math and reasoning&lt;/strong&gt;: Scored 100% on AIME 2025 (with tool use), 83.4% on GPQA Diamond, and leads in multilingual Q&amp;amp;A tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built for real-world agents&lt;/strong&gt;: Especially good at using computers, coding agents, and handling complex toolchains.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building serious AI agents or systems that interact with files, tools, or structured data, Claude Sonnet 4.5 is worth a deep dive.&lt;br&gt;&lt;br&gt;
Check out &lt;a href="https://x.com/claudeai/status/1972706807345725773" rel="noopener noreferrer"&gt;Anthropic's official post&lt;/a&gt; for more benchmarks.&lt;/p&gt;
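
&lt;p&gt;Want to try it from your own code? Here’s a minimal sketch using Anthropic’s official Python SDK; the exact model ID string is my assumption, so double-check their docs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: calling Claude Sonnet 4.5 via the Anthropic Python SDK.
# Reads ANTHROPIC_API_KEY from the environment; the model ID is assumed.
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-5",  # assumed model ID; verify against the docs
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Write a Python function that merges two sorted lists."}
    ],
)

print(message.content[0].text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;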


&lt;h3&gt;
  
  
  5. &lt;a href="https://x.com/Zai_org/status/1973034639708344767" rel="noopener noreferrer"&gt;GLM-4.6: Agentic, Reasoning, and Coding Powerhouse&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmum3floi1dsnaftafkh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmum3floi1dsnaftafkh.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Zhipu AI introduced GLM‑4.6, a flagship model built for real-world coding, reasoning, and long-context understanding.&lt;br&gt;&lt;br&gt;
It’s designed for developers building advanced agentic systems, and performs especially well in tasks that require tool use, search, and logic.&lt;/p&gt;

&lt;p&gt;Why GLM‑4.6 is worth a look:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent-ready&lt;/strong&gt;: Optimized for agentic workflows, coding, and task planning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handles long context&lt;/strong&gt;: Supports input lengths of up to 200,000 tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Competitive benchmarks&lt;/strong&gt;: Scores on AIME 25, GPQA, HLE, and Terminal-Bench match or exceed Claude Sonnet 4.5 and DeepSeek.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong in reasoning and tools&lt;/strong&gt;: Excels in long reasoning chains, tool use tasks, and multistep problem solving.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built for devs&lt;/strong&gt;: Targeted toward real coding and automation use cases, not just chatbot-style interaction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can follow updates and access the API from &lt;a href="https://x.com/Zai_org/status/1973034639708344767" rel="noopener noreferrer"&gt;Zhipu's official post&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
GLM‑4.6 is shaping up to be one of the most versatile open models in the agent ecosystem.&lt;/p&gt;


&lt;h3&gt;
  
  
  6. Coral v1: Launch and Monetize AI Agents
&lt;/h3&gt;

&lt;p&gt;
&lt;iframe class="tweet-embed" id="tweet-1973071657821675594-847" src="https://platform.twitter.com/embed/Tweet.html?id=1973071657821675594"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Coral Protocol launched v1, unlocking a full-stack system to &lt;strong&gt;orchestrate, monetize, and deploy AI agents&lt;/strong&gt; at scale.&lt;br&gt;&lt;br&gt;
It aims to power the “agent economy” by letting anyone turn AI agents into reusable and monetizable software units.&lt;/p&gt;

&lt;p&gt;Here’s how Coral v1 makes it happen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No lock-in&lt;/strong&gt;: Reuse AI agents built with any framework or language.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Launch-ready&lt;/strong&gt;: Deploy agents without having to build infrastructure from scratch.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open marketplace&lt;/strong&gt;: Publish agents for others to use, collaborate, or buy.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in monetization&lt;/strong&gt;: Sell or license your agents and earn from contributions.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built for scale&lt;/strong&gt;: Perfect for solo devs and startups looking to ship quickly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're working on AI products and want to plug into a growing ecosystem, Coral v1 might be the missing link.&lt;/p&gt;


&lt;h3&gt;
  
  
  7. &lt;a href="https://x.com/NotebookLM/status/1973892897163653429" rel="noopener noreferrer"&gt;NotebookLM: Customizable Chat Experience&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjvhxmgx83f8ynd3njyo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjvhxmgx83f8ynd3njyo.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;NotebookLM introduced a new personalization layer for its chat interface, letting users &lt;strong&gt;tailor the conversation flow&lt;/strong&gt; to better fit their learning and working styles.&lt;br&gt;&lt;br&gt;
Whether you're using it as a study guide or an assistant, it now adapts more deeply to your preferences.&lt;/p&gt;

&lt;p&gt;Here’s what’s new in NotebookLM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Custom response length&lt;/strong&gt;: Choose how short or long the answers should be.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adjustable conversation style&lt;/strong&gt;: Toggle between concise, explanatory, or question-driven replies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning Guide mode&lt;/strong&gt;: A new style that tests and reinforces your understanding of the material.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus on depth&lt;/strong&gt;: Designed to go beyond summarization and promote deeper reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Great for students and researchers&lt;/strong&gt;: Ideal for learning workflows or guided comprehension.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can explore the updates from &lt;a href="https://x.com/NotebookLM/status/1973892897163653429" rel="noopener noreferrer"&gt;NotebookLM's announcement here&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
This update brings NotebookLM closer to becoming a personalized tutor in your pocket.&lt;/p&gt;


&lt;h3&gt;
  
  
  8. &lt;a href="https://x.com/perplexity_ai/status/1973795224960032857" rel="noopener noreferrer"&gt;Comet by Perplexity: Now Available Globally&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;
&lt;iframe class="tweet-embed" id="tweet-1973795224960032857-745" src="https://platform.twitter.com/embed/Tweet.html?id=1973795224960032857"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Comet, Perplexity’s personal AI assistant, is now officially available to everyone worldwide.&lt;br&gt;&lt;br&gt;
After 84 days of waitlist-only access, Comet is now open to all, and millions of users have already joined to explore a new way of using the internet.&lt;/p&gt;

&lt;p&gt;Here’s what Comet brings to the table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Personal AI assistant&lt;/strong&gt;: Designed to help you search, think, and learn faster online.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conversational search&lt;/strong&gt;: Combines real-time information retrieval with chat-style interaction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Waitlist is over&lt;/strong&gt;: No more invites needed. Anyone can try it out right now.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus on usability&lt;/strong&gt;: Built for everyday internet users who want smarter answers, faster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI-enhanced browsing&lt;/strong&gt;: Think of it as a more intelligent search engine that talks back.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can watch the official announcement and explore Comet from &lt;a href="https://x.com/perplexity_ai/status/1973795224960032857" rel="noopener noreferrer"&gt;Perplexity's post&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
Whether you're researching or just curious, Comet gives you a fresh way to browse.&lt;/p&gt;


&lt;h3&gt;
  
  
  9. DeepSeek-V3.2-Exp: Faster, Smarter, Cheaper
&lt;/h3&gt;

&lt;p&gt;
&lt;iframe class="tweet-embed" id="tweet-1972604768309871061-543" src="https://platform.twitter.com/embed/Tweet.html?id=1972604768309871061"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;DeepSeek released V3.2-Exp, their latest &lt;strong&gt;experimental model&lt;/strong&gt; built on the V3.1-Terminus architecture.&lt;br&gt;&lt;br&gt;
It introduces &lt;strong&gt;DeepSeek Sparse Attention (DSA)&lt;/strong&gt;, enabling faster inference and training, especially for long-context processing.&lt;/p&gt;

&lt;p&gt;Here’s what’s new:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sparse Attention mechanism&lt;/strong&gt;: Improves speed and efficiency on large inputs.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live across platforms&lt;/strong&gt;: Available now via App, Web, and API.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;50%+ cheaper&lt;/strong&gt;: Major price cuts for API usage.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focused on real-world use&lt;/strong&gt;: Great for apps needing fast and scalable model calls.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're optimizing cost, speed, and context length in your LLM workflows, DeepSeek-V3.2-Exp is worth exploring.&lt;/p&gt;
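
&lt;p&gt;DeepSeek’s API is OpenAI-compatible, so trying V3.2-Exp is mostly a matter of pointing the OpenAI SDK at their endpoint. A minimal sketch (the model name is an assumption; check DeepSeek’s docs for the current one):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: calling DeepSeek's OpenAI-compatible API.
# Reads the API key from an environment variable; the model name is assumed.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)

resp = client.chat.completions.create(
    model="deepseek-chat",  # assumed to route to the latest model (V3.2-Exp)
    messages=[{"role": "user", "content": "Summarize sparse attention in two sentences."}],
)

print(resp.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;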


&lt;h3&gt;
  
  
  10. Granite 4.0 by IBM: Lightweight Models for Local Use
&lt;/h3&gt;

&lt;p&gt;
&lt;iframe class="tweet-embed" id="tweet-1973784183492485277-155" src="https://platform.twitter.com/embed/Tweet.html?id=1973784183492485277"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;IBM just released &lt;strong&gt;Granite 4.0&lt;/strong&gt;, a new series of small language models built for agentic tasks, RAG, and document analysis.&lt;br&gt;&lt;br&gt;
What’s most exciting is the &lt;strong&gt;Micro (3.4B)&lt;/strong&gt; variant that runs &lt;strong&gt;entirely in your browser using WebGPU&lt;/strong&gt;, with no server needed.&lt;/p&gt;

&lt;p&gt;Why Granite 4.0 matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runs locally&lt;/strong&gt;: The Micro model can run 100% on-device in your browser with &lt;code&gt;Transformers.js&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy-friendly&lt;/strong&gt;: No data is sent to a server, and it can work offline after loading.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast and efficient&lt;/strong&gt;: Designed for quick in-browser inference and lightweight deployments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Great for edge cases&lt;/strong&gt;: Perfect for building low-latency, privacy-sensitive apps.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're looking for compact models that work offline, Granite 4.0 opens up exciting new possibilities.&lt;/p&gt;
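
&lt;p&gt;The in-browser story runs on &lt;code&gt;Transformers.js&lt;/code&gt;, but if you’d rather kick the tires server-side first, here’s a minimal Python sketch with the &lt;code&gt;transformers&lt;/code&gt; library; the model ID is an assumption, so check IBM’s Hugging Face page:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: running Granite 4.0 Micro locally with transformers.
# The model ID is assumed; see IBM's Hugging Face organization for the
# exact name. Needs enough RAM/VRAM for a ~3.4B-parameter model.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="ibm-granite/granite-4.0-micro",  # assumed model ID
    device_map="auto",  # requires the accelerate package
)

messages = [{"role": "user", "content": "List three uses for a small local LLM."}]
out = pipe(messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])  # assistant reply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;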


&lt;h3&gt;
  
  
  11. Nano Banana by Google DeepMind: Production-Ready Image Model
&lt;/h3&gt;

&lt;p&gt;
&lt;iframe class="tweet-embed" id="tweet-1973781293977735435-473" src="https://platform.twitter.com/embed/Tweet.html?id=1973781293977735435"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Google DeepMind released a full guide for using &lt;strong&gt;Nano Banana&lt;/strong&gt;, a production-ready image generation model that’s part of the Gemini 2.5 Flash stack.&lt;br&gt;&lt;br&gt;
It’s designed for developers who need dynamic image outputs with more control and flexibility.&lt;/p&gt;

&lt;p&gt;Why Nano Banana stands out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Image-only mode&lt;/strong&gt;: Generate just visuals without extra text or padding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creative freedom&lt;/strong&gt;: Specify aspect ratios and fine-tune composition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built for scale&lt;/strong&gt;: Suitable for production environments, not just demos.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini-compatible&lt;/strong&gt;: Works seamlessly within Google’s Gemini AI ecosystem.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building visually dynamic apps or want to embed reliable image generation into your product flow, Nano Banana is worth checking out.&lt;/p&gt;
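
&lt;p&gt;Image generation goes through the Gemini API. Here’s a minimal sketch with Google’s &lt;code&gt;google-genai&lt;/code&gt; Python SDK; the exact model ID is an assumption, so verify it in the Gemini docs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: generating an image with the Gemini API (google-genai SDK).
# Reads GEMINI_API_KEY from the environment; the model ID is assumed.
from google import genai

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",  # assumed "Nano Banana" model ID
    contents="A watercolor banana wearing sunglasses, product-shot style",
)

# Image bytes come back as inline data parts alongside any text parts.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        with open("banana.png", "wb") as f:
            f.write(part.inline_data.data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;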


&lt;h3&gt;
  
  
  12. Unlock the Full Potential of Your Mac with Spec
&lt;/h3&gt;

&lt;p&gt;
&lt;iframe class="tweet-embed" id="tweet-1973753629367836979-824" src="https://platform.twitter.com/embed/Tweet.html?id=1973753629367836979"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Spec brings AI memory and smart automation to your Mac.&lt;br&gt;&lt;br&gt;
It proactively drafts replies, summarizes documents, and helps you stay organized across all your tools — before you even ask.&lt;/p&gt;

&lt;p&gt;Why Spec feels like a brain for your Mac:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cross-app memory&lt;/strong&gt;: Connects knowledge across iMessage, Slack, email, and calendar.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No more context switching&lt;/strong&gt;: Everything feels unified and accessible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Helpful before you ask&lt;/strong&gt;: Anticipates needs and assists proactively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boosts productivity&lt;/strong&gt;: Great for people juggling comms, docs, and deadlines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your daily workflow lives across multiple apps, Spec might be the assistant you didn't know you needed.&lt;/p&gt;


&lt;h3&gt;
  
  
  13. Composer: The First AI Agent for Document Processing
&lt;/h3&gt;

&lt;p&gt;
&lt;iframe class="tweet-embed" id="tweet-1973039396539465904-512" src="https://platform.twitter.com/embed/Tweet.html?id=1973039396539465904"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Composer is built to tackle one of the most common but painful workflows — document processing.&lt;br&gt;&lt;br&gt;
It promises &lt;strong&gt;production-grade accuracy in under 10 minutes&lt;/strong&gt; with no complex setup.&lt;/p&gt;

&lt;p&gt;Here’s why Composer is exciting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;99% accuracy&lt;/strong&gt;: Some early users reported near-perfect results on schema-heavy docs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minutes, not hours&lt;/strong&gt;: Gets up and running in under 10 minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built for scale&lt;/strong&gt;: Designed for teams that process lots of structured documents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentic optimization&lt;/strong&gt;: Learns and adapts to recurring document types and formats.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're in finance, HR, legal, or ops — Composer could be your new best friend.&lt;/p&gt;


&lt;h3&gt;
  
  
  14. C1 by Thesys: Generative UI for LLMs
&lt;/h3&gt;

&lt;p&gt;
&lt;iframe class="tweet-embed" id="tweet-1972924633499554246-577" src="https://platform.twitter.com/embed/Tweet.html?id=1972924633499554246"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;C1 is a powerful new API that lets LLMs respond with &lt;strong&gt;rich interactive UIs&lt;/strong&gt;, not just plain text.&lt;br&gt;&lt;br&gt;
Created by Thesys, it aims to change how AI interfaces work — moving beyond chat into charts, forms, and cards.&lt;/p&gt;

&lt;p&gt;What makes C1 different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Interactive responses&lt;/strong&gt;: LLMs can now return UI components like charts or inputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Great for apps&lt;/strong&gt;: Useful in dashboards, admin panels, education tools, and assistants.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plug-and-play&lt;/strong&gt;: Easily integrates with existing LLM pipelines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Launch perks&lt;/strong&gt;: They offered up to 5M tokens free on Product Hunt.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've been dreaming of "ChatGPT meets Notion-style UI," this API brings it closer to reality.&lt;/p&gt;


&lt;h3&gt;
  
  
  15. CrewAI Launches AMP: The OS for AI Agents
&lt;/h3&gt;

&lt;p&gt;
&lt;iframe class="tweet-embed" id="tweet-1974174579863261540-584" src="https://platform.twitter.com/embed/Tweet.html?id=1974174579863261540"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;CrewAI just launched &lt;strong&gt;AMP&lt;/strong&gt;, their Agent Management Platform, designed to be the &lt;strong&gt;operating system for AI agents in production&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
It’s already seeing major adoption with Fortune 500 companies and public use cases.&lt;/p&gt;

&lt;p&gt;What AMP brings to the table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent OS&lt;/strong&gt;: Centralized interface to deploy, monitor, and manage AI agents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Massive scale&lt;/strong&gt;: 100K+ executions in 15 days, 30+ live use cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise-ready&lt;/strong&gt;: Used by large corporations and public companies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflow visualizer&lt;/strong&gt;: Clean UI to build complex agent pipelines visually.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re scaling agentic systems and want production-level observability and orchestration, AMP is built for you.&lt;/p&gt;




&lt;p&gt;That’s it for this week’s AI updates.&lt;/p&gt;

&lt;p&gt;If you found something useful in here, feel free to drop a message or tag me on X. I’ll be doing this every week, so you can always come back for a quick catch-up.&lt;/p&gt;

&lt;p&gt;Follow &lt;a href="https://x.com/thenomadevel" rel="noopener noreferrer"&gt;@thenomadevel&lt;/a&gt; on X for the full thread and more updates like this.&lt;/p&gt;

&lt;p&gt;See you next week.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>beginners</category>
    </item>
    <item>
      <title>What Building a Hybrid Browser Toolkit Taught Us About the Web</title>
      <dc:creator>Nomadev</dc:creator>
      <pubDate>Wed, 01 Oct 2025 19:10:57 +0000</pubDate>
      <link>https://forem.com/camelai/what-building-a-hybrid-browser-toolkit-taught-us-about-the-web-2omo</link>
      <guid>https://forem.com/camelai/what-building-a-hybrid-browser-toolkit-taught-us-about-the-web-2omo</guid>
      <description>&lt;p&gt;If you’ve ever tried browser automation, you know the drill:&lt;br&gt;
You spin up Selenium, Playwright, or Puppeteer, point it at a page, and suddenly you’re wrestling with flaky selectors, weird screenshots, or the dreaded “element not found” even though it’s right there.&lt;/p&gt;

&lt;p&gt;I’ve been there. It feels like teaching a robot to surf the web by giving it a pair of oven mitts. Sure, it clicks and scrolls, but half the time it’s guessing.&lt;/p&gt;

&lt;p&gt;At CAMEL-AI, we ran into this wall too. Our original Camel BrowserToolkit was a first attempt at solving it. It did the basics — take screenshots, inject custom IDs, and click things. But it was… let’s say, not elegant. It worked more like asking an AI to click on pictures instead of actually understanding the page.&lt;/p&gt;

&lt;p&gt;That got us thinking:&lt;br&gt;
What if the toolkit could “see” the page like a human and understand the structure like a dev?&lt;/p&gt;
&lt;h2&gt;
  
  
  From Monolith to Hybrid
&lt;/h2&gt;

&lt;p&gt;The big shift came when we re-architected things. Instead of one heavy Python process, we now have a Hybrid setup using Python and TypeScript.&lt;/p&gt;

&lt;p&gt;Python is still your scripting layer. That means you can write automation in a language most of us are comfortable with.&lt;br&gt;
TypeScript is the engine under the hood. It runs Playwright natively, handles async operations, and talks directly to the browser.&lt;/p&gt;

&lt;p&gt;The two communicate over WebSockets. So Python gives high-level commands, while TypeScript executes them efficiently.&lt;/p&gt;
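
&lt;p&gt;To make that split concrete, here’s a purely illustrative sketch of a command/response pair; the message shape below is hypothetical, since the real wire format is an internal detail of the toolkit:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Purely illustrative: a hypothetical command/response pair between the
# Python layer and the TypeScript browser server. The actual wire format
# is internal to the toolkit and may differ.
import json

command = {
    "id": 42,                    # correlates the response with the request
    "action": "click",           # high-level command issued from Python
    "params": {"ref": "5"},      # element ref from the latest snapshot
}

response = {
    "id": 42,
    "result": "clicked",
    "snapshot": '- link "Home" [ref=1]\n- button "Sign In" [ref=2]',
}

print(json.dumps(command))
print(json.dumps(response))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
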
&lt;h2&gt;
  
  
  Introducing the CAMEL Hybrid Browser Toolkit
&lt;/h2&gt;

&lt;p&gt;Enter the &lt;strong&gt;Hybrid Browser Toolkit&lt;/strong&gt;. We've rebuilt the toolkit from the ground up as a TypeScript–Python hybrid. In this new design, TypeScript (running on Node.js) handles the browser directly via Playwright's fast native APIs, and Python remains your friendly front-end interface.&lt;/p&gt;

&lt;p&gt;What does that buy you? Faster performance, access to all the latest Playwright features (like the new &lt;code&gt;_snapshotForAI&lt;/code&gt;), and true async event-driven power – without sacrificing the ease of Python scripting.&lt;/p&gt;

&lt;p&gt;The result is a layered architecture: your Python code talks to a TypeScript server over WebSockets. The TypeScript layer manages browser instances, DOM queries, screenshots, etc., all in the same high-performance JavaScript environment. Python just sends commands and gets structured results.&lt;/p&gt;

&lt;p&gt;This split means lower latency and better concurrency. As one example, Node's Playwright doesn't spawn a fresh process for every browser window like the Python version did, so it can manage many tabs with far less CPU and memory overhead.&lt;/p&gt;

&lt;p&gt;In short, Python becomes the brain giving high-level instructions, and TypeScript is the muscle doing the work efficiently.&lt;/p&gt;
&lt;h2&gt;
  
  
  What's Different Under the Hood
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F51cc44tjyrsao5rcc8we.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F51cc44tjyrsao5rcc8we.png" alt=" " width="800" height="573"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the legacy toolkit, every action that needed to find or click an element typically involved injecting a random ID into the page via a script, then querying it. That worked, but it felt hacky.&lt;/p&gt;

&lt;p&gt;In the hybrid toolkit, we leverage standard accessibility (ARIA) selectors and Playwright's new tools. Now you can do things like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[aria-label="Submit"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByRole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;button&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Submit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;snapshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_snapshotForAI&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="c1"&gt;// snapshot now has structured data on all elements and their ARIA roles&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Playwright's &lt;code&gt;_snapshotForAI()&lt;/code&gt; (an internal API) lets us get a rich DOM snapshot: every interactive element, its role (like button, link, textbox), labels, etc. We assign each element a ref ID and use those for all interactions. This replaces the old random-ID trick with a semantic mapping.&lt;/p&gt;

&lt;p&gt;It also means the same snapshot data fuels both text mode and the visual "set-of-marks" screenshots.&lt;/p&gt;

&lt;h3&gt;
  
  
  Set-of-Marks Screenshots
&lt;/h3&gt;

&lt;p&gt;Speaking of screenshots, the new toolkit's SoM (Set-of-Marks) screenshots are crisp and clever. We inject a small script into the page that outlines every clickable element with a little numbered marker (their ref ID).&lt;/p&gt;

&lt;p&gt;This isn't just a dumb screenshot – it knows about element overlap and tries not to mark hidden elements. If a button has an icon and text, it merges them into one mark. It even picks good positions for labels so they don't scribble over each other. (This injection-based approach in the browser is more reliable than our old memory-only screenshots.)&lt;/p&gt;

&lt;h3&gt;
  
  
  Enhanced Stealth Mode
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9btju8x2a4jzcb9etyz2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9btju8x2a4jzcb9etyz2.png" alt=" " width="800" height="957"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We've also beefed up stealth mode. By default, Playwright can be detected by many sites (indeed, "stock" Playwright is often blocked by modern anti-bot measures).&lt;/p&gt;

&lt;p&gt;The new toolkit launches browsers with a full suite of anti-detection flags, customizable user agents, headers, etc. You can tweak a &lt;code&gt;StealthConfig&lt;/code&gt; object to set exactly which flags or headers to use. And we maintain this even across persistent contexts or CDP connections.&lt;/p&gt;
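
&lt;p&gt;As a rough idea of what that looks like (the field names below are hypothetical, not the real &lt;code&gt;StealthConfig&lt;/code&gt; API; see the toolkit reference for the actual shape):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch only: these field names are assumptions, not the
# real StealthConfig API. Consult the CAMEL docs for the actual shape.
stealth_config = {
    "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
    "extra_headers": {"Accept-Language": "en-US,en;q=0.9"},
    "browser_args": ["--disable-blink-features=AutomationControlled"],
}
# Something like this is consumed when launching the browser, keeping the
# fingerprint consistent across persistent contexts and CDP connections.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;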

&lt;p&gt;The bottom line: you get a much more human-like browser fingerprint without extra work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory-Efficient Screenshots
&lt;/h3&gt;

&lt;p&gt;Other small but nice improvements include how we handle screenshots and images. In the old toolkit, screenshots were held entirely in memory and passed around as objects. Now we save screenshots to disk and only pass around file paths.&lt;/p&gt;

&lt;p&gt;This keeps memory usage low, especially when you take many screenshots in a run. The agent can still request the image (and even run vision-based analysis on it), but the heavy data lives on disk.&lt;/p&gt;

&lt;h3&gt;
  
  
  Smarter Form Filling
&lt;/h3&gt;

&lt;p&gt;We also made form-filling smarter. You can now send multiple inputs in one command, and the toolkit will try to find the right input fields (even if you accidentally point at a container).&lt;/p&gt;

&lt;p&gt;It watches for dropdowns appearing after you type and will return just the new options (a "diff" snapshot), so you don't get overwhelmed by the whole page again. If something goes wrong, the tool tries simple recovery steps too.&lt;/p&gt;
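
&lt;p&gt;Here’s roughly what that flow looks like; the &lt;code&gt;browser_type&lt;/code&gt;/&lt;code&gt;browser_click&lt;/code&gt; names and signatures below are inferred from the ref-based examples in this post, so treat them as approximate:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Approximate sketch: typing into a field by its snapshot ref, then
# clicking a suggestion. Tool names/signatures are inferred, not exact.
result = await toolkit.browser_type(ref="3", text="San Fran")

# If an autocomplete dropdown appeared, the toolkit returns just the new
# options as a diff snapshot instead of the whole page again.
print(result["snapshot"])   # e.g. '- option "San Francisco, CA" [ref=17]'

await toolkit.browser_click(ref="17")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;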

&lt;h2&gt;
  
  
  Key Features at a Glance
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Multi-Mode Operation:&lt;/strong&gt; The toolkit has three modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Text Mode:&lt;/strong&gt; DOM-based automation, returning textual snapshots of element lists.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual Mode:&lt;/strong&gt; Screenshot-based, with interactive elements highlighted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid Mode:&lt;/strong&gt; Smart switching between text and visual as needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;TypeScript Core:&lt;/strong&gt; All browser work is done in a Node.js/TypeScript server. That means native Playwright calls (no bridging) and full async/await support. We get TypeScript's compile-time checks and the latest APIs instantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Better Element Handling:&lt;/strong&gt; Use real ARIA selectors and Playwright locators instead of injected IDs. E.g. click by aria-label or role. Plus, &lt;code&gt;_snapshotForAI&lt;/code&gt; returns structured data with semantic roles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instant Snapshots:&lt;/strong&gt; Every action (click/type/etc.) that changes the page returns an updated snapshot by default, so you see the new state immediately in text mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advanced Screenshot (SoM):&lt;/strong&gt; Annotated screenshots with numbered marks for each element. Optionally, an AI can analyze the image (like "find all sign-up buttons").&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intelligent Typing:&lt;/strong&gt; Typing into fields automatically detects dropdowns (autocomplete) and only returns the new suggestions (diff snapshot). If you point to a container, it will find the actual input inside and type there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Powerful Stealth:&lt;/strong&gt; Multiple Chrome flags, custom user agent/headers, persistent context, etc., to reduce bot detection. (After all, many sites try to fingerprint automation.)&lt;br&gt;
&lt;strong&gt;Flexible Connections:&lt;/strong&gt; You can launch a fresh browser via Playwright, attach to an existing Chrome/Edge via CDP (Chrome DevTools Protocol), or even hook into an AI agent via the Model Context Protocol (MCP).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool Registry:&lt;/strong&gt; The toolkit neatly separates "tools" (actions) from the core. Screenshots go to files, not memory, so you can handle them in custom agents or pipelines without huge overhead.&lt;/p&gt;
&lt;h2&gt;
  
  
  Try It: Session &amp;amp; Navigation Tools
&lt;/h2&gt;

&lt;p&gt;Let's see some examples. First, create a toolkit instance and open the browser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;camel.toolkits&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HybridBrowserToolkit&lt;/span&gt;

&lt;span class="c1"&gt;# Launch a real browser (non-headless for debugging)
&lt;/span&gt;&lt;span class="n"&gt;toolkit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HybridBrowserToolkit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_open&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;    &lt;span class="c1"&gt;# "Browser opened."
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tabs: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_tabs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, Active: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;current_tab&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Initial Snapshot:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snapshot&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your first call must be &lt;code&gt;browser_open()&lt;/code&gt;. That spins up Chromium/Chrome/Edge and returns a snapshot of whatever the default page is (typically about:blank or your start URL). You'll get something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Result: Browser opened.
Tabs: 1, Active tab index: 0
Initial Snapshot:
- link "Get Started" [ref=1]
- link "Documentation" [ref=2]
- link "GitHub" [ref=3]
- ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now navigation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Open a new tab and navigate to example.com
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_visit_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Visiting example.com: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Snapshot:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snapshot&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tabs now: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_tabs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, Active: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;current_tab&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Go back and forward
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_back&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;      &lt;span class="c1"&gt;# go back in history
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_forward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;   &lt;span class="c1"&gt;# then forward again
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;browser_visit_page(url)&lt;/code&gt; opens the URL in a new tab and switches to it. Each call makes a new tab.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;browser_back()&lt;/code&gt; and &lt;code&gt;browser_forward()&lt;/code&gt; move in the history of the current tab. They both return the updated page snapshot and tab info.&lt;/p&gt;

&lt;p&gt;For example, after visiting a couple of pages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_visit_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_visit_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com/about&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_back&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Back: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, now at &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snapshot&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Page Inspection Tools
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff1zq5bvnp03nllzhcmvb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff1zq5bvnp03nllzhcmvb.png" alt=" " width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To see what's on the page without doing anything, use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_get_page_snapshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns a textual list of all interactive elements in the current tab (links, buttons, inputs, etc.), each with a &lt;code&gt;[ref=id]&lt;/code&gt;. By default it lists the full page, but you can initialize with &lt;code&gt;viewport_limit=True&lt;/code&gt; to only see elements visible on screen. E.g.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- link "Home" [ref=1]
- button "Sign In" [ref=2]
- textbox "Search..." [ref=3]
- link "Products" [ref=4]
- ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a visual view, try:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_get_som_screenshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# e.g. "Screenshot captured with 12 interactive elements (saved to: ./screenshots/page123_som.png)"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This takes a screenshot of the page and marks every element. You can also ask the toolkit to analyze it with an AI, e.g.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_get_som_screenshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;read_image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find all buttons for submitting forms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# e.g. "Screenshot captured... Agent analysis: Found 3 form buttons: [ref=5], [ref=9], [ref=12]"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Behind the scenes, it saved an image file and ran an agent (if requested) to look at it. The raw image path is in &lt;code&gt;result['screenshotPath']&lt;/code&gt; if you need it.&lt;/p&gt;

&lt;p&gt;To inspect tabs, use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tab_info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_get_tab_info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total tabs: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tab_info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_tabs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tab&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tab_info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tabs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; (current)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tab&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is_current&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tab&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; @ &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tab&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see each tab's ID, title, and URL. This is handy to pick a tab to switch to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Switch to tab by ID (the 'id' field from tab_info)
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_switch_tab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tab_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;some_tab_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Interaction Tools
&lt;/h2&gt;

&lt;p&gt;Now for real interactions:&lt;/p&gt;

&lt;h3&gt;
  
  
  Click an Element
&lt;/h3&gt;

&lt;p&gt;Click an element by its ref:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;   &lt;span class="c1"&gt;# e.g. "Clicked on button 'Submit'"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the click opened a new tab, the result will include &lt;code&gt;newTabId&lt;/code&gt;, and &lt;code&gt;current_tab&lt;/code&gt;/&lt;code&gt;total_tabs&lt;/code&gt; will update accordingly. You can then &lt;code&gt;browser_switch_tab&lt;/code&gt; to it.&lt;/p&gt;
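
&lt;p&gt;A short sketch of that flow, assuming the &lt;code&gt;newTabId&lt;/code&gt; key described above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;result = await toolkit.browser_click(ref="4")  # e.g. a link that opens a new tab
new_tab = result.get('newTabId')               # present only if a tab was opened
if new_tab:
    await toolkit.browser_switch_tab(tab_id=new_tab)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;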

&lt;h3&gt;
  
  
  Type into Input Fields
&lt;/h3&gt;

&lt;p&gt;Type into an input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Single input
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hello world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the element with ref=3 triggers an autocomplete dropdown, the toolkit will detect it. Instead of returning the full page again, it gives you &lt;code&gt;result['diffSnapshot']&lt;/code&gt; containing just the new options (this is the "intelligent dropdown detection"). For example, typing "San" might return:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- option "San Francisco" [ref=23]
- option "San Diego" [ref=24]
- option "San Antonio" [ref=25]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
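
&lt;p&gt;You can then click one of those options by its ref. A sketch of the round trip, using the refs from the example output above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;res = await toolkit.browser_type(ref="3", text="San")
if 'diffSnapshot' in res:              # a dropdown appeared
    print(res['diffSnapshot'])         # just the new options
await toolkit.browser_click(ref="23")  # choose "San Francisco"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;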



&lt;p&gt;If you have multiple fields to fill, just pass a list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ref&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;John&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ref&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Doe&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ref&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;john.doe@example.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;details&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# shows success/failure per field
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Select Dropdowns
&lt;/h3&gt;

&lt;p&gt;Select (for &lt;code&gt;&amp;lt;select&amp;gt;&lt;/code&gt; dropdowns):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;country-select&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;US&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You must provide the option's &lt;code&gt;value&lt;/code&gt; attribute, not its visible text. (If needed, call &lt;code&gt;browser_get_page_snapshot()&lt;/code&gt; first to see element refs.)&lt;/p&gt;

&lt;h3&gt;
  
  
  Enter Key
&lt;/h3&gt;

&lt;p&gt;Enter key (submit form etc.):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_enter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This simulates pressing Enter in the currently focused field. It's handy after typing search terms.&lt;/p&gt;
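
&lt;p&gt;For example, a search flow might look like this (a sketch reusing the search box ref from the snapshot example earlier):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;await toolkit.browser_type(ref="3", text="hybrid browser toolkit")
await toolkit.browser_enter()  # submits the focused search box
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;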

&lt;h3&gt;
  
  
  Scroll
&lt;/h3&gt;

&lt;p&gt;Scroll the page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_scroll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;direction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;down&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use "up" or "down", with optional pixel amount. It returns the new snapshot. You can loop scrolls to load more content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_scroll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;down&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snapshot&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;  &lt;span class="c1"&gt;# no new content
&lt;/span&gt;    &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snapshot&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Mouse Control
&lt;/h3&gt;

&lt;p&gt;Mouse control by coordinates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_mouse_control&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;control&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;click&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;350.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_mouse_control&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;control&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dblclick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;123.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;456.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_mouse_control&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;control&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;right_click&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Useful for canvas or image-map interactions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Drag and Drop
&lt;/h3&gt;

&lt;p&gt;Mouse drag-and-drop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_mouse_drag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;item-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trash-bin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This drags the element with ref="item-5" onto ref="trash-bin". Handy for reordering lists or moving files in web UIs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Press Keys
&lt;/h3&gt;

&lt;p&gt;Press keys/combinations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_press_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tab&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_press_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Control+a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# select all
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_press_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alt+Left&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;   &lt;span class="c1"&gt;# back in history
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_press_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;F5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;         &lt;span class="c1"&gt;# refresh
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Send any key or combo. The toolkit uses Playwright's key syntax.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tab Management
&lt;/h2&gt;

&lt;p&gt;Working with multiple tabs is easy:&lt;/p&gt;

&lt;h3&gt;
  
  
  Switch Tab
&lt;/h3&gt;

&lt;p&gt;Switch tab by ID (from &lt;code&gt;browser_get_tab_info&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_switch_tab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tab_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;some_tab_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This activates that tab and returns its snapshot.&lt;/p&gt;

&lt;h3&gt;
  
  
  Close Tab
&lt;/h3&gt;

&lt;p&gt;Close a tab:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_close_tab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tab_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;some_tab_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After closing, it returns info on the remaining tabs.&lt;/p&gt;

&lt;p&gt;You can, for instance, close every tab except the current one by iterating through them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tab_info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_get_tab_info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tab&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tab_info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tabs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;tab&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is_current&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_close_tab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tab_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tab&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Console Commands
&lt;/h3&gt;

&lt;p&gt;You can execute arbitrary JavaScript on the page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_console_exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;return window.location.href&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Current URL:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And view console logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;logs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_console_view&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;console_messages&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Advanced &amp;amp; Utility
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Wait for Manual Step
&lt;/h3&gt;

&lt;p&gt;Sometimes you need a human in the loop (e.g. to solve a CAPTCHA). Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_wait_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout_sec&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User resumed, snapshot after:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snapshot&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Wait timed out.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pauses execution and shows the last snapshot. When the user presses Enter (or the timeout elapses), it returns control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Combine It All
&lt;/h3&gt;

&lt;p&gt;Here's a mini example putting a few tools together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;toolkit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HybridBrowserToolkit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_open&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_visit_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Look for a product link and click it
&lt;/span&gt;    &lt;span class="n"&gt;snap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_get_page_snapshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Suppose ref=7 is "Products"
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Now add to cart and checkout
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;add-to-cart&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Fill checkout form
&lt;/span&gt;    &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ref&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ref&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alice@example.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ref&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1 Developer Way&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shipping&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standard&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_console_exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;return document.querySelector(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;form&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;).checkValidity()&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;place-order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was just a taste. The Hybrid Browser Toolkit provides all the basic navigation and interaction tools you'd expect, plus some powerful extras (like smart screenshots and AI-assisted analysis) to help you automate complex tasks smoothly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Operating Modes: Text vs. Visual vs. Hybrid
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Text Mode&lt;/strong&gt; is the default: every action returns a text snapshot. It's lightweight and great for pure data tasks (like scraping or filling forms). Each element is listed with a &lt;code&gt;[ref=ID]&lt;/code&gt; and a label. If you initialize with &lt;code&gt;full_visual_mode=True&lt;/code&gt;, then actions don't auto-return snapshots (fast mode); you can still call &lt;code&gt;browser_get_page_snapshot()&lt;/code&gt; manually when you need it.&lt;/p&gt;
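
&lt;p&gt;A minimal fast-mode sketch, assuming the &lt;code&gt;full_visual_mode&lt;/code&gt; flag behaves as described above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Fast mode: actions skip the automatic snapshot
toolkit = HybridBrowserToolkit(full_visual_mode=True)
await toolkit.browser_open()
await toolkit.browser_visit_page("https://example.com")
snapshot = await toolkit.browser_get_page_snapshot()  # fetch one on demand
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;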

&lt;p&gt;&lt;strong&gt;Visual Mode&lt;/strong&gt; uses screenshots. The &lt;code&gt;browser_get_som_screenshot()&lt;/code&gt; tool we saw is the core of this mode. It's ideal for verifying layouts, catching visual glitches, or when a human needs to see something. You'll often toggle visual mode on when you need to confirm that a button is visible, or to show the agent exactly what's on screen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid Mode&lt;/strong&gt; is smart: it uses text mode by default, but seamlessly takes and interprets screenshots when needed (or as requested). For example, you might click through forms in text mode, then do one final screenshot with AI analysis to "spot check" the result.&lt;/p&gt;
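
&lt;p&gt;For instance, a text-mode form flow with one visual spot check at the end might look like this (a sketch built from the tools shown earlier; the refs are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Work through the form in text mode...
await toolkit.browser_click(ref="checkout")
await toolkit.browser_type(ref="email", text="alice@example.com")
# ...then one visual pass to spot-check the result
check = await toolkit.browser_get_som_screenshot(
    read_image=True,
    instruction="Confirm the order summary shows the right item",
)
print(check['result'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;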

&lt;p&gt;A good rule of thumb:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;Text Mode&lt;/strong&gt; for most automation (fast, headless, easy parsing).&lt;/li&gt;
&lt;li&gt;Switch to &lt;strong&gt;Visual Mode&lt;/strong&gt; when you need the UI context (e.g. for CAPTCHAs, complex UIs, or human verification).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Combine Both&lt;/strong&gt; as needed. E.g., click by refs in text mode, then verify with a screenshot.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Connection Modes: Playwright vs CDP vs MCP
&lt;/h2&gt;

&lt;p&gt;Finally, how do we connect to the browser?&lt;/p&gt;

&lt;h3&gt;
  
  
  Standard Playwright (default)
&lt;/h3&gt;

&lt;p&gt;The toolkit launches and manages its own browser instance. Just construct &lt;code&gt;HybridBrowserToolkit()&lt;/code&gt; and call &lt;code&gt;browser_open()&lt;/code&gt;. You can set &lt;code&gt;headless=True/False&lt;/code&gt;, &lt;code&gt;user_data_dir&lt;/code&gt; for persistence, timeouts, etc. Use this when you just want an isolated browser.&lt;/p&gt;
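
&lt;p&gt;A minimal sketch (the option names are the ones mentioned above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Isolated browser with a persistent profile
toolkit = HybridBrowserToolkit(headless=True, user_data_dir="./my_profile")
await toolkit.browser_open()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;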

&lt;h3&gt;
  
  
  Chrome DevTools Protocol (CDP)
&lt;/h3&gt;

&lt;p&gt;This lets you attach to an already running browser (Chrome/Edge/Chromium) that was started with &lt;code&gt;--remote-debugging-port&lt;/code&gt;. For example, start Chrome manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;google-chrome &lt;span class="nt"&gt;--remote-debugging-port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;9222 &lt;span class="nt"&gt;--user-data-dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/tmp/chrome-profile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://localhost:9222/json/version&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;webSocketDebuggerUrl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;toolkit_cdp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HybridBrowserToolkit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cdp_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# No need to call browser_open(); it's already running
&lt;/span&gt;&lt;span class="n"&gt;tab_info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;toolkit_cdp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;browser_get_tab_info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Connected to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tab_info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_tabs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tabs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CDP is the same protocol Chrome DevTools itself uses to talk to the browser (see &lt;a href="https://chromedevtools.github.io" rel="noopener noreferrer"&gt;chromedevtools.github.io&lt;/a&gt;), so any browser with remote debugging enabled can be controlled. You can even set &lt;code&gt;cdp_keep_current_page=True&lt;/code&gt; to make the toolkit use the current page instead of opening a new one.&lt;/p&gt;
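
&lt;p&gt;A sketch, reusing the &lt;code&gt;ws&lt;/code&gt; URL from the snippet above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Reuse the page the attached browser already has open
# (ws is the webSocketDebuggerUrl fetched earlier)
toolkit = HybridBrowserToolkit(cdp_url=ws, cdp_keep_current_page=True)
snapshot = await toolkit.browser_get_page_snapshot()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;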

&lt;h3&gt;
  
  
  MCP (Model Context Protocol)
&lt;/h3&gt;

&lt;p&gt;This is for connecting the toolkit to an AI assistant (like Claude) so the AI can call these browser tools as if they were native functions. Here's how to set it up:&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1. Install the MCP Server&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/camel-ai/browser_agent.git
&lt;span class="nb"&gt;cd &lt;/span&gt;browser_agent
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;2. Configure Claude Desktop&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Add to your Claude configuration file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;macOS&lt;/strong&gt;: &lt;code&gt;~/Library/Application Support/Claude/claude_desktop_config.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Windows&lt;/strong&gt;: &lt;code&gt;%APPDATA%\Claude\claude_desktop_config.json&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"hybrid-browser"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-m"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hybrid_browser_mcp.server"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2c2ksxayol6zt4klfv15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2c2ksxayol6zt4klfv15.png" alt=" " width="800" height="639"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3. Restart Claude Desktop&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fei6ln5wuog081xji0npb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fei6ln5wuog081xji0npb.png" alt=" " width="800" height="641"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After adding the configuration, completely restart Claude Desktop. The browser tools will appear when you click the 🔌 icon in the chat interface.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Available Browser Tools&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Once connected, you'll have access to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Navigation&lt;/strong&gt;: &lt;code&gt;browser_open&lt;/code&gt;, &lt;code&gt;browser_visit_page&lt;/code&gt;, &lt;code&gt;browser_back&lt;/code&gt;, &lt;code&gt;browser_forward&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interaction&lt;/strong&gt;: &lt;code&gt;browser_click&lt;/code&gt;, &lt;code&gt;browser_type&lt;/code&gt;, &lt;code&gt;browser_select&lt;/code&gt;, &lt;code&gt;browser_scroll&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Screenshots&lt;/strong&gt;: &lt;code&gt;browser_get_som_screenshot&lt;/code&gt; (captures page with clickable elements marked)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tab Management&lt;/strong&gt;: &lt;code&gt;browser_switch_tab&lt;/code&gt;, &lt;code&gt;browser_close_tab&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced&lt;/strong&gt;: &lt;code&gt;browser_console_exec&lt;/code&gt;, &lt;code&gt;browser_mouse_control&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Basic Usage Example&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Claude can now control browsers with simple commands:
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;browser_open&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;browser_visit_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;browser_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI automation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;browser_click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;submit-button&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;browser_get_som_screenshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;browser_close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;Customization&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Modify browser behavior in &lt;code&gt;browser_agent/config.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;BROWSER_CONFIG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;headless&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# Show browser window
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stealth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# Avoid bot detection
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enabled_tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[...]&lt;/span&gt; &lt;span class="c1"&gt;# Specify which tools to enable
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;In summary, the Hybrid Browser Toolkit is a major upgrade over the old screenshot-only BrowserToolkit. We still give you a friendly Python API to work with, but under the hood we're speaking the browser's native language via TypeScript.&lt;/p&gt;

&lt;p&gt;That means faster, more reliable interactions and access to shiny new features like Playwright's accessibility snapshots. Whether you need lightning-fast DOM scraping or human-like visual checks (or both!), this toolkit handles it.&lt;/p&gt;

&lt;p&gt;It also plays well with modern workflows. Want to connect to an existing Chrome? No problem (thanks to CDP). Want your AI agent to browse the web? Check out MCP integration.&lt;/p&gt;

&lt;p&gt;From practical navigation (click, type, scroll) to advanced tricks (Set-of-Marks screenshots, smart autocomplete typing, multi-tab management), everything's here.&lt;/p&gt;

&lt;p&gt;Give it a spin, and let us know what you build with it. Welcome to the new era of browser automation with CAMEL's Hybrid Browser Toolkit – it's like taking off those gloves and driving with all the precision you wanted, at full speed.&lt;/p&gt;

&lt;p&gt;Happy automating!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>javascript</category>
      <category>automation</category>
    </item>
    <item>
      <title>We hired AI to do Growth Engineering and here’s what happened</title>
      <dc:creator>Nomadev</dc:creator>
      <pubDate>Wed, 17 Sep 2025 10:16:00 +0000</pubDate>
      <link>https://forem.com/camelai/we-hired-ai-to-do-growth-engineering-and-heres-what-happened-4ad0</link>
      <guid>https://forem.com/camelai/we-hired-ai-to-do-growth-engineering-and-heres-what-happened-4ad0</guid>
      <description>&lt;p&gt;In open source projects, time is precious. Maintainers juggle bug fixes, feature requests, community support, and documentation, all while trying to keep code secure and releases organized. One repetitive but crucial task is &lt;strong&gt;reviewing pull requests and preparing release updates&lt;/strong&gt;. It's necessary, but it eats up hours that could be spent innovating.&lt;/p&gt;

&lt;p&gt;In our work at &lt;a href="https://www.camel-ai.org/" rel="noopener noreferrer"&gt;CAMEL-AI&lt;/a&gt;, open-source contributions move fast. Every week, our team spends time reviewing pull requests, highlighting key changes, and preparing release notes. It’s important work, but also repetitive — hours get lost in scanning PRs, checking impact, and formatting updates.&lt;/p&gt;

&lt;p&gt;This time, instead of doing it manually, we asked ourselves: what if a multi-agent system could take over this process?&lt;/p&gt;

&lt;p&gt;That’s when we decided to try it with Eigent and a custom MCP server for GitHub. The idea was simple: let AI agents handle the weekly workflow, from fetching PRs to summarizing them and even drafting release-ready notes and short posts.&lt;/p&gt;

&lt;p&gt;What if automation could handle the grunt work for you? That's where Eigent steps in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F20axnnnosg6htpsxqpkh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F20axnnnosg6htpsxqpkh.png" alt=" " width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.eigent.ai/" rel="noopener noreferrer"&gt;Eigent&lt;/a&gt;&lt;/strong&gt; is the world's first &lt;strong&gt;Multi-agent Workforce&lt;/strong&gt; desktop application, empowering you to build, manage, and deploy a custom AI workforce that can turn your most complex workflows into automated tasks. It's a &lt;strong&gt;modular, multi-agent system&lt;/strong&gt; that can break down complex tasks and handle them through specialized agents working in coordination.&lt;/p&gt;

&lt;p&gt;Eigent's &lt;strong&gt;multi-agent coordination platform&lt;/strong&gt; boosts productivity by turning your workflows into automated tasks. Built on the open-source CAMEL framework, it brings parallel execution, customization, and privacy to your AI automation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What can Eigent do for you?&lt;/strong&gt; For first-time readers, consider Eigent as a flexible agentic assistant. You can create different "workers" (AI agents) with domain-specific skills (e.g. coding, documentation, DevOps) and have them collaborate on tasks. Some examples of technical workflows Eigent can simplify include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub automation with AI agents:&lt;/strong&gt; Reviewing code changes, summarizing pull requests, triaging issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Release note generation:&lt;/strong&gt; Automatically compiling highlights of what's new in each release.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation and code analysis:&lt;/strong&gt; Extracting key points from docs or codebases, suggesting improvements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source workflows:&lt;/strong&gt; Keeping track of project activity, generating reports for contributors, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this guide, we'll show you &lt;strong&gt;how to configure a custom GitHub MCP server inside Eigent&lt;/strong&gt; and set up an agent workflow that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Fetches new pull requests from a repo&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extracts and analyzes PR data&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Formats the highlights into release-ready notes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generates a short social post (e.g. for Twitter/X)&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's dive into the step-by-step guide!&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Open Eigent and Navigate to MCP &amp;amp; Tools Settings
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39pmrfvrvei10w05gkcw.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39pmrfvrvei10w05gkcw.gif" alt=" " width="1024" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have Eigent running, begin by opening the &lt;strong&gt;Settings&lt;/strong&gt; panel. In the Settings, find and click on the &lt;strong&gt;"MCP &amp;amp; Tools"&lt;/strong&gt; section. This is where you can configure external tools and servers for your AI agents. We'll use this area to add a new custom MCP server for GitHub tasks.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Eigent's Settings interface. Navigate to the &lt;strong&gt;MCP &amp;amp; Tools&lt;/strong&gt; tab to configure external AI tools and servers.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;em&gt;MCP &amp;amp; Tools&lt;/em&gt; tab, you'll see a list of available tools and any configured MCP servers. By default, Eigent might include some basic tools (e.g. web search, code execution). To add our own, look for an &lt;strong&gt;"Add MCP Server"&lt;/strong&gt; button (usually a &lt;strong&gt;+&lt;/strong&gt; or a labeled button) and click it. This will open a dialog where you can input a JSON configuration for the new server.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 2: Add a Custom MCP Server via JSON Configuration
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuh4bf1hnw0ck18idk2i1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuh4bf1hnw0ck18idk2i1.png" alt=" " width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Eigent allows advanced users to add custom agent servers by providing a JSON config. In the &lt;strong&gt;Add MCP Server&lt;/strong&gt; dialog that opened, you'll see a text area to paste JSON. We're going to add a &lt;strong&gt;sequential-thinking&lt;/strong&gt; MCP server - this is a general-purpose AI reasoning engine that can coordinate tasks (perfect for breaking down complex prompts). We will also tie it into GitHub by providing the GitHub integration toolset and our credentials.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Adding a new MCP server via JSON configuration. Paste in the JSON definition for the &lt;strong&gt;sequential-thinking&lt;/strong&gt; server.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The JSON defines how Eigent should launch the external agent server. For our use case, we'll use Node's &lt;strong&gt;&lt;code&gt;npx&lt;/code&gt;&lt;/strong&gt; to run the &lt;strong&gt;Sequential Thinking&lt;/strong&gt; server package, and include the official GitHub MCP tool. Below is the JSON structure to use (as provided by Eigent's docs and examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sequential-thinking"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@modelcontextprotocol/server-sequential-thinking"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Configure the GitHub MCP Server Settings (Include Your PAT)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fvbac8wej56ng4m6qqz.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fvbac8wej56ng4m6qqz.gif" alt=" " width="720" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before finalizing the MCP server setup, include your &lt;strong&gt;GitHub Personal Access Token (PAT)&lt;/strong&gt; in the configuration. This token will allow the agent to authenticate with the GitHub API and fetch repository data. You should generate a PAT from your GitHub account (with at least read access to repos; for public repos a classic token with default public scopes is sufficient). In the JSON, we'll add an environment variable for the token and specify the GitHub toolset.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Configuring the GitHub MCP server by adding environment variables. Provide your &lt;strong&gt;GitHub PAT&lt;/strong&gt; in the JSON config so the agent can access the GitHub API.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To integrate the GitHub tools, modify the JSON as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add the GitHub MCP server container to the arguments.&lt;/li&gt;
&lt;li&gt;Set the environment variable for your token.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, you can extend the &lt;strong&gt;&lt;code&gt;"args"&lt;/code&gt;&lt;/strong&gt; array to include the GitHub server image and use the &lt;strong&gt;&lt;code&gt;"env"&lt;/code&gt;&lt;/strong&gt; field for the token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sequential-thinking"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@modelcontextprotocol/server-sequential-thinking"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"ghcr.io/github/github-mcp-server"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"GITHUB_PERSONAL_ACCESS_TOKEN"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ghp_yourGitHubTokenHere"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this configuration, we pass the official &lt;strong&gt;GitHub MCP server&lt;/strong&gt; (hosted at &lt;a href="http://ghcr.io/github/github-mcp-server" rel="noopener noreferrer"&gt;ghcr.io/github/github-mcp-server&lt;/a&gt;) as an argument to the sequential-thinking agent. The sequential agent will spin up the GitHub toolset internally. We also set &lt;code&gt;GITHUB_PERSONAL_ACCESS_TOKEN&lt;/code&gt; in the environment so the agent can authenticate to GitHub. &lt;em&gt;(Make sure to replace &lt;code&gt;"ghp_yourGitHubTokenHere"&lt;/code&gt; with your actual PAT.)&lt;/em&gt;&lt;/p&gt;
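
&lt;p&gt;Before saving, you can optionally sanity-check the token from a terminal. This is a generic GitHub API call, not part of Eigent: if the PAT is valid, the API returns your user profile as JSON.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Optional sanity check: a valid PAT returns your GitHub profile as JSON
curl -s -H "Authorization: Bearer ghp_yourGitHubTokenHere" https://api.github.com/user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;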

&lt;p&gt;Once the JSON is ready, click &lt;strong&gt;Install&lt;/strong&gt; or &lt;strong&gt;Add&lt;/strong&gt; to save the MCP server. Eigent will download and initialize the server in the background. After a moment, you should see the new server listed in your MCP tools, indicating a successful installation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Add a GitHub-Focused Worker (Agent) Using the New MCP Server
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fio64czclz7a3uw60i2op.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fio64czclz7a3uw60i2op.gif" alt=" " width="720" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that the MCP server is configured, we need to create a Worker that uses this server. In Eigent, a "Worker" is essentially an AI agent persona that can carry out tasks using a specified toolset or MCP server. Navigate back to the main &lt;strong&gt;Workforce&lt;/strong&gt; or &lt;strong&gt;Agents&lt;/strong&gt; screen (often the home screen showing your AI workers). Look for an &lt;strong&gt;"Add Worker"&lt;/strong&gt; or &lt;strong&gt;"+"&lt;/strong&gt; button to create a new agent.&lt;/p&gt;

&lt;p&gt;When the &lt;strong&gt;Add Worker&lt;/strong&gt; dialog appears, enter a name and description for your new agent. For example, name it &lt;strong&gt;"GitHub MCP"&lt;/strong&gt; and describe it as "Helps around GitHub Tasks". Most importantly, assign the &lt;strong&gt;Agent Tool&lt;/strong&gt; to the MCP server we just added (it might appear in a dropdown as "sequential-thinking" or whatever name you gave it). This ensures your new worker will utilize the GitHub-enabled sequential thinking agent.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Creating a new Worker agent for GitHub tasks. Give it a name (e.g. "GitHub PR Reviewer") and select the &lt;strong&gt;GitHub MCP&lt;/strong&gt; server as the agent's tool.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After filling in the details and selecting the correct MCP server, save the worker. You should now see a new agent in your AI workforce list. This agent is essentially your &lt;strong&gt;GitHub automation assistant&lt;/strong&gt;, equipped with the ability to reason through tasks and interact with GitHub data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Prompt the Agent to Summarize Pull Requests
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshzloyxfaii0wnvr5tyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshzloyxfaii0wnvr5tyy.png" alt=" " width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the GitHub-enabled agent up and running, it's time to put it to work. Open a chat or command interface with your new worker (in Eigent, clicking the worker might open a chat panel where you can give it instructions). We'll provide a task prompt asking the agent to review pull requests from a repository and summarize them.&lt;/p&gt;

&lt;p&gt;As an example, try a detailed prompt like this one:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Review the 30 latest pull requests from the repo &lt;a href="https://github.com/camel-ai/camel" rel="noopener noreferrer"&gt;https://github.com/camel-ai/camel&lt;/a&gt;. Select the top 5 by impact (lines changed, files touched, or discussion depth). For each selected PR, generate a release-ready update in this format: ✨ Feature: &amp;lt;catchy one-liner summary&amp;gt; 💡 Why it matters: &amp;lt;short bullet-point explanation&amp;gt; 🙏 Thanks @&amp;lt;GitHubAuthor&amp;gt;...&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Entering a prompt for the GitHub agent to review recent PRs and produce summaries. This complex instruction asks the AI to fetch the latest 30 PRs, pick the most impactful ones, and format a brief release note for each.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the chat, paste or type in the prompt (as shown above) and hit send. This instructs the agent to automate a common open-source workflow: analyzing recent pull requests in the &lt;strong&gt;camel-ai/camel&lt;/strong&gt; repo and preparing a synopsis of important changes. You can customize the repository URL or criteria as needed - for instance, use your own project's repo link. The key is that our agent now has the tools (via MCP) to fetch GitHub data and the reasoning ability to summarize it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Watch Eigent Automatically Break Down the Task and Fetch Data
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dfw49zliobuqvg5s6ld.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dfw49zliobuqvg5s6ld.png" alt=" " width="800" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you send the prompt, &lt;strong&gt;Eigent's multi-agent engine kicks in&lt;/strong&gt;. The request is fairly complex, but Eigent will handle it by dividing the work into manageable subtasks. Behind the scenes, the Sequential Thinking MCP server interprets the instruction and decides on a plan. It may do something like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetch the list of the latest 30 PRs from the specified repository (using the GitHub MCP tool).&lt;/li&gt;
&lt;li&gt;Analyze each PR's metadata (lines changed, files, comments) to determine "impact".&lt;/li&gt;
&lt;li&gt;Pick the top 5 PRs based on the criteria.&lt;/li&gt;
&lt;li&gt;For each of those PRs, compose a summary in the requested format (✨ Feature, 💡 Why it matters, 🙏 Thanks...).&lt;/li&gt;
&lt;li&gt;Possibly also prepare a condensed version for X (Twitter) if requested, or handle any additional subtasks it infers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Eigent actually &lt;strong&gt;displays the subtask breakdown&lt;/strong&gt; in the interface, so you can see the agent's thought process. It might list steps it's taking, which makes it transparent and debug-friendly. For example, the agent may explicitly show a step to retrieve PR data and then a step to filter them by impact. This showcases Eigent's dynamic task planning: &lt;em&gt;"Eigent dynamically breaks down tasks and activates multiple agents to work in parallel, automating complex tasks much faster than traditional single-agent workflows"&lt;/em&gt; &lt;a href="https://docs.eigent.ai/get_started/welcome#:~:text=Eigent%20dynamically%20breaks%20down%20tasks,step%20scenarios%20with%20ease" rel="noopener noreferrer"&gt;Eigent Docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The GitHub agent (powered by the &lt;a href="https://github.com/github/github-mcp-server" rel="noopener noreferrer"&gt;MCP server&lt;/a&gt;) fetching repository data. Here the agent executed a subtask to retrieve PR details via the GitHub API, returning JSON data almost instantly.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In our case, the first subtask is to call GitHub and get details of the latest 30 PRs. The agent, using the GitHub MCP, does this in seconds and obtains a JSON array of PR info (IDs, titles, authors, lines changed, etc.). Next, the agent evaluates which PRs have the largest impact. Another subtask might involve sorting or filtering the list by those metrics. Once the top 5 PRs are identified, the agent generates the summary for each.&lt;/p&gt;
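
&lt;p&gt;To make the "impact" criterion concrete, here's a rough shell sketch of that ranking step using the public GitHub REST API and &lt;code&gt;jq&lt;/code&gt;. This is illustrative only (it is not the agent's actual code), and it makes one extra call per PR because the list endpoint doesn't return line counts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative sketch: rank the 30 most recent closed PRs by lines changed
TOKEN="ghp_yourGitHubTokenHere"   # replace with your PAT
REPO="camel-ai/camel"

for n in $(curl -s -H "Authorization: Bearer $TOKEN" \
    "https://api.github.com/repos/$REPO/pulls?state=closed&amp;amp;per_page=30" | jq '.[].number'); do
  # The list endpoint omits line counts, so fetch each PR's details individually
  curl -s -H "Authorization: Bearer $TOKEN" "https://api.github.com/repos/$REPO/pulls/$n" |
    jq '{number, title, author: .user.login, impact: (.additions + .deletions)}'
done | jq -s 'sort_by(.impact) | reverse | .[:5]'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;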

&lt;p&gt;Finally, the agent produces the &lt;strong&gt;output&lt;/strong&gt;: a neatly formatted set of release-ready updates for the top 5 PRs. The result is typically presented in the chat as Markdown text (since we asked for a release update format). Each update might look like:&lt;/p&gt;

&lt;p&gt;✨ &lt;strong&gt;Feature:&lt;/strong&gt; Added comprehensive model table and requirements badges to docs&lt;/p&gt;

&lt;p&gt;💡 &lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provides quick, up-to-date model info right in the documentation&lt;/li&gt;
&lt;li&gt;Helps users assess at a glance what's available and what's required&lt;/li&gt;
&lt;li&gt;Elevates project transparency and onboarding experience&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🙏 Thanks @wendongfan for this integration&lt;/p&gt;

&lt;p&gt;PR link: &lt;a href="https://github.com/camel-ai/camel/pull/1341" rel="noopener noreferrer"&gt;https://github.com/camel-ai/camel/pull/1341&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(The above are illustrative examples.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You would see five such entries corresponding to the top PRs. The agent might also provide a shorter "X-posting" version (e.g. a tweet-worthy one-liner) if that was part of the prompt. The outcome is that you have, in a few moments, a draft of changelog/release notes highlights, complete with acknowledgments to contributors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Empowering OSS Workflows with Agentic Automation
&lt;/h3&gt;

&lt;p&gt;In this tutorial, we configured Eigent to automate an open-source maintenance task—summarizing GitHub pull requests—using an AI agent. We introduced a custom &lt;strong&gt;&lt;a href="https://github.com/github/github-mcp-server" rel="noopener noreferrer"&gt;GitHub MCP server&lt;/a&gt;&lt;/strong&gt; into Eigent, created a dedicated worker, and successfully generated release note snippets from live repository data. The process demonstrates the power of &lt;strong&gt;agentic automation for OSS contributors&lt;/strong&gt;: instead of manually combing through PRs, maintainers can rely on AI agents to do the heavy lifting. By leveraging Eigent's &lt;strong&gt;MCP integration&lt;/strong&gt; and multi-agent coordination, even complex workflows (like triaging dozens of PRs) can be handled efficiently by AI, freeing you to focus on higher-level decisions.&lt;/p&gt;

&lt;p&gt;Eigent makes it approachable for both developers and non-developers to harness multi-agent AI. With a few simple steps, you can &lt;strong&gt;configure MCP for open-source workflows&lt;/strong&gt; and let your personalized AI workforce assist you. This was just one example—&lt;strong&gt;Eigent&lt;/strong&gt; can be tailored to many scenarios, from writing summaries and managing issues to testing code or updating documentation. As the platform evolves, the possibilities for GitHub automation with AI agents will only grow.&lt;/p&gt;

&lt;p&gt;Give Eigent a try in your own projects, and enjoy the productivity boost of having an AI-powered team on your side! The future of open-source collaboration might just be a mix of human passion and tireless AI assistants working together. 🚀&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Happy automating!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agentaichallenge</category>
      <category>powerfuldevs</category>
      <category>beginners</category>
    </item>
    <item>
      <title>A Guide to Building a Fully Local AI Workforce for FREE</title>
      <dc:creator>Nomadev</dc:creator>
      <pubDate>Wed, 03 Sep 2025 09:19:03 +0000</pubDate>
      <link>https://forem.com/thenomadevel/a-guide-to-building-a-fully-local-ai-workforce-for-free-50g8</link>
      <guid>https://forem.com/thenomadevel/a-guide-to-building-a-fully-local-ai-workforce-for-free-50g8</guid>
      <description>&lt;p&gt;If you're looking to deploy powerful AI tools, you've likely faced a key challenge: how to unlock their full potential without compromising the security of your sensitive data.&lt;/p&gt;

&lt;p&gt;Most platforms push everything to the cloud, which feels convenient at first but quickly raises red flags when you're dealing with customer records, financial data, or internal IP. You want the power of multi-agent systems, but you also want privacy, control, and the ability to run everything on your own machine.&lt;/p&gt;

&lt;p&gt;That's where &lt;strong&gt;Eigent&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Eigent?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.eigent.ai/" rel="noopener noreferrer"&gt;Eigent&lt;/a&gt; is a local-first multi-agent desktop application. Instead of sending your data to external servers, it runs everything on your computer. You get full visibility into what's happening and the confidence that your files, credentials, and logs stay with you.&lt;/p&gt;

&lt;p&gt;Think of it as building your own AI workforce. You can spin up different agents, each with their own skills: a search agent that combs the web, a developer agent that runs code, a document agent that writes and edits files, and even multimodal agents that handle images and audio. Eigent coordinates them for you so they can tackle tasks in parallel, hand things off when needed, and deliver polished results.&lt;/p&gt;

&lt;p&gt;In this guide we're going to show you exactly how to set it up locally. By the end you'll have Eigent running on your desktop with agents ready to work together on your terms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before getting started, make sure you have the following in place:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Node.js (v18 or newer) and npm:&lt;/strong&gt; Eigent is a Node/Electron application. Install Node.js (18–22 is recommended) if you haven't already. &lt;em&gt;Tip:&lt;/em&gt; You can download Node from the official site or use a version manager.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory and Hardware:&lt;/strong&gt; At least &lt;strong&gt;8 GB of RAM&lt;/strong&gt; is recommended for smooth performance. Eigent can run entirely on CPU if you're connecting to external APIs (like OpenAI or Anthropic). If you want to run &lt;strong&gt;large models locally&lt;/strong&gt; on your machine, having a capable &lt;strong&gt;GPU&lt;/strong&gt; (e.g. an NVIDIA RTX card) will make a big difference in speed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operating System:&lt;/strong&gt; Eigent supports local deployment on major OSes (Windows, macOS). The steps below are OS-agnostic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker:&lt;/strong&gt; Install Docker if you haven't already, and make sure the Docker daemon is running; a quick check follows after this list (&lt;a href="https://docs.docker.com/get-docker/" rel="noopener noreferrer"&gt;Docker Installation Guide&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
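
&lt;p&gt;If you're not sure whether Docker is ready, two quick commands (generic Docker CLI, nothing Eigent-specific) confirm the client is installed and the daemon is responding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Confirm the Docker CLI is installed
docker --version

# Confirm the daemon is actually running (this errors out if it isn't)
docker info --format '{{.ServerVersion}}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;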

&lt;h2&gt;
  
  
  1. Clone the Repo &amp;amp; Start the PostgreSQL Backend
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhb8vgok58euv8pome1l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhb8vgok58euv8pome1l.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, clone the Eigent repo and install its dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/eigent-ai/eigent.git
&lt;span class="nb"&gt;cd &lt;/span&gt;eigent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will give you the full source code on your machine. Next, switch into the &lt;strong&gt;server&lt;/strong&gt; directory and launch Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;server
&lt;span class="c"&gt;# Copy .env.example to .env (or create .env according to .env.example)&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command uses the provided &lt;strong&gt;&lt;code&gt;docker-compose.yml&lt;/code&gt;&lt;/strong&gt; to spin up two containers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;PostgreSQL database&lt;/strong&gt; (the Eigent data store)&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Eigent API server&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both run locally on your machine (e.g. &lt;strong&gt;&lt;code&gt;localhost:3001&lt;/code&gt;&lt;/strong&gt; for the API). The screenshot above shows Docker pulling the images and starting the &lt;strong&gt;&lt;code&gt;eigent_postgres&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;eigent_api&lt;/code&gt;&lt;/strong&gt; containers.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Docker Compose brings up the Postgres database and API server locally.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;By default, Docker will create a volume for PostgreSQL, so all database files are stored on your disk (not in memory).&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Verify Local Data Storage
&lt;/h2&gt;

&lt;p&gt;At this point, &lt;strong&gt;everything is running locally&lt;/strong&gt;. The PostgreSQL container (&lt;strong&gt;&lt;code&gt;eigent_postgres&lt;/code&gt;&lt;/strong&gt;) holds the database. You can double-check by listing your Docker containers or using a tool like &lt;strong&gt;&lt;code&gt;psql&lt;/code&gt;&lt;/strong&gt; inside the container. Everything Eigent does (agent messages, user data, task logs, etc.) will be written to that local Postgres instance. No data is sent anywhere outside your machine.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;All Eigent data is stored in the local Dockerized PostgreSQL database.&lt;/em&gt;&lt;/p&gt;
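
&lt;p&gt;For example, assuming the default container name from the compose file and the stock &lt;code&gt;postgres&lt;/code&gt; user (adjust both to match your &lt;code&gt;.env&lt;/code&gt;), you could verify the containers and peek at the tables like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# List running containers; you should see eigent_postgres and eigent_api
docker ps

# Open psql inside the database container and list the tables
docker exec -it eigent_postgres psql -U postgres -c '\dt'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;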

&lt;p&gt;This ensures privacy by design. As stated in the docs, a key advantage of self-hosting is data privacy – you keep &lt;em&gt;sensitive data within your own infrastructure&lt;/em&gt;. In fact, when you use this setup, no workspace or login information ever leaves your local network. Eigent is fully local by default, so you can audit and trust that your data stays put.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Modify &lt;code&gt;.env.development&lt;/code&gt; for Local Proxy
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9n16si2umrx93w3zvol6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9n16si2umrx93w3zvol6.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we need to tell the front-end to use the local back-end instead of any cloud service. In the project root (&lt;strong&gt;&lt;code&gt;eigent/.env.development&lt;/code&gt;&lt;/strong&gt;), enable the local proxy settings. Open &lt;strong&gt;&lt;code&gt;.env.development&lt;/code&gt;&lt;/strong&gt; in a text editor and make sure it contains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VITE_BASE_URL=/api
VITE_PROXY_URL=http://localhost:3001
VITE_USE_LOCAL_PROXY=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By setting &lt;strong&gt;&lt;code&gt;VITE_USE_LOCAL_PROXY=true&lt;/code&gt;&lt;/strong&gt; and pointing &lt;strong&gt;&lt;code&gt;VITE_PROXY_URL&lt;/code&gt;&lt;/strong&gt; to &lt;strong&gt;&lt;code&gt;http://localhost:3001&lt;/code&gt;&lt;/strong&gt;, you configure the front-end to send all API calls to your local Docker backend. The screenshot above shows the relevant lines in the &lt;strong&gt;&lt;code&gt;.env.development&lt;/code&gt;&lt;/strong&gt; file.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Edit &lt;code&gt;.env.development&lt;/code&gt;: set &lt;code&gt;VITE_PROXY_URL&lt;/code&gt; to &lt;code&gt;http://localhost:3001&lt;/code&gt; and &lt;code&gt;VITE_USE_LOCAL_PROXY=true&lt;/code&gt; to enable local mode.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Make sure to remove any leading &lt;strong&gt;&lt;code&gt;#&lt;/code&gt;&lt;/strong&gt; or comment markers on those lines so they take effect. With this configuration, the front-end app will proxy requests to your local server rather than the external demo API.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Run the Frontend App
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdd42ixs59xrtycgn4elk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdd42ixs59xrtycgn4elk.png" alt=" " width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now go back to the repo root and install the JavaScript dependencies, then start the development server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ..
npm &lt;span class="nb"&gt;install
&lt;/span&gt;npm run dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will launch the Eigent front-end locally. By default it runs on &lt;a href="http://localhost:3000" rel="noopener noreferrer"&gt;http://localhost:3000&lt;/a&gt;. With the &lt;strong&gt;&lt;code&gt;.env&lt;/code&gt;&lt;/strong&gt; changes, the front-end will contact the API at &lt;strong&gt;&lt;code&gt;http://localhost:3001&lt;/code&gt;&lt;/strong&gt; – all within your machine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; eigent@&lt;span class="k"&gt;*&lt;/span&gt; dev
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; vite

  VITE vX.X.X  ready &lt;span class="k"&gt;in &lt;/span&gt;Y ms

  ➜  Local:   http://localhost:3000/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No special cloud credentials are needed here – it's just a normal Node development build.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Access the Eigent UI Locally
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fifzaow95oyvgxbpn8pqu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fifzaow95oyvgxbpn8pqu.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Eigent's login screen, served locally. Although sign-in is required, this instance is self-hosted and no external service is involved.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Rest assured, this login is purely for the local app – your credentials and data are saved in the local Postgres database you started, not some cloud server. In other words, even though the UI presents an OAuth-style login, &lt;strong&gt;all authentication and user data lives on your machine&lt;/strong&gt;. The documentation emphasizes this local-first setup: "Your data stays on your own device, addressing privacy and security concerns". Once logged in, you'll reach the main dashboard where you can create custom agents, define workflows, and configure tools.&lt;/p&gt;

&lt;p&gt;For example, the tools/settings page lets you enable or disable built-in integrations (web search, Google docs, Slack, etc.), and the model selection screen (shown below) lets you pick or configure your preferred LLM. Everything from here on – agent messages, tool outputs, knowledge bases – will remain in your PostgreSQL database and local filesystem unless you explicitly export it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvllaxjms2s6ehfjdjyjv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvllaxjms2s6ehfjdjyjv.png" alt=" " width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Eigent UI lets you configure integrated tools (Slack, web search, etc.) on your local instance.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Choose which models or APIs to use for agents in the local Eigent setup.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; when running in local mode, users need to set up their own API keys or endpoints for models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Watch the Full Tutorial
&lt;/h2&gt;

&lt;p&gt;Prefer a visual guide? We've recorded a step-by-step walkthrough that takes you through the entire process, from spinning up Docker to logging into Eigent locally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=Di3LtslB2a4" rel="noopener noreferrer"&gt;YouTube Tutorial: Local Eigent Setup&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;And that's it, you've just spun up your very own &lt;strong&gt;Eigent AI workforce, fully local and self-hosted.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No cloud lock-in, no data leakage, just agents running on your terms.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://github.com/eigent-ai/eigent" rel="noopener noreferrer"&gt;&lt;strong&gt;Clone the repo and build it yourself&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you run into issues, have feature requests, or just want to share what you're building, we'd love to hear from you.&lt;/p&gt;

&lt;p&gt;Join the conversation on our &lt;a href="https://discord.com/invite/CNcNpquyDc" rel="noopener noreferrer"&gt;&lt;strong&gt;Discord community&lt;/strong&gt;&lt;/a&gt;, the team and other builders hang out there to answer questions, swap ideas, and collaborate on new workflows.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>javascript</category>
      <category>ai</category>
    </item>
    <item>
      <title>How Not to Be Replaced by AI (A Developer’s Guide)</title>
      <dc:creator>Nomadev</dc:creator>
      <pubDate>Thu, 31 Jul 2025 18:51:51 +0000</pubDate>
      <link>https://forem.com/thenomadevel/how-not-to-be-replaced-by-ai-a-developers-guide-4cpg</link>
      <guid>https://forem.com/thenomadevel/how-not-to-be-replaced-by-ai-a-developers-guide-4cpg</guid>
      <description>&lt;p&gt;Hello there, fellow developers! It's &lt;a href="https://x.com/thenomadevel" rel="noopener noreferrer"&gt;Nomadev&lt;/a&gt; here, and today we're diving into a topic on every coder’s mind: how not to get replaced by AI. With AI tools getting smarter by the day, you might be seeing scary headlines about tech giants replacing coders with AI. In fact, roughly half of people are worried they’ll lose their job to AI. But take a deep breath – the reality is more nuanced. AI is changing our jobs, not outright deleting them&lt;/p&gt;

&lt;p&gt;The key is learning to thrive alongside these new tools instead of being outpaced by them. So buckle up, and let’s explore some strategies (not too hard) to future-proof your dev career in the age of AI. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;AI is advancing in leaps and bounds, but smart developers can ride the wave by focusing on what makes us uniquely human.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Embrace AI as Your Coding Sidekick
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftde2dp0ayidt6ze1b70x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftde2dp0ayidt6ze1b70x.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rather than fearing AI, treat it as your sidekick. The best developers use new tools to their advantage – and AI is no exception. Think of coding AIs (like &lt;a href="https://cursor.com/" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;, &lt;a href="https://github.com/features/copilot" rel="noopener noreferrer"&gt;GitHub Copilot&lt;/a&gt;, &lt;a href="https://chatgpt.com/codex" rel="noopener noreferrer"&gt;Codex&lt;/a&gt;, or &lt;a href="https://www.anthropic.com/claude-code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;) as power-ups that automate the boring stuff and boost your productivity.&lt;/p&gt;

&lt;p&gt;This frees you up to focus on the interesting parts of development. As one expert put it, AI is a valuable collaborator that can handle routine work, allowing you to focus on more complex problem-solving and creative aspects of coding.&lt;/p&gt;

&lt;p&gt;In practice, this means you should &lt;strong&gt;leverage AI tools in your workflow&lt;/strong&gt;. Don’t be the developer who says “I don’t need Copilot” while everyone else speeds ahead using it. In fact, &lt;strong&gt;60% of engineering leaders are already rolling out AI coding assistants to their teams&lt;/strong&gt; – it’s becoming the new normal. So, play around with these tools: use them to generate a first draft of a function, or to get ideas for solving a bug, or to automate writing documentation. By collaborating with AI, you can code faster and smarter. The saying in tech circles is &lt;strong&gt;“AI won’t replace developers&lt;/strong&gt;, but developers &lt;strong&gt;who use AI&lt;/strong&gt; will replace those who don’t.” So hop on the bandwagon and make AI your ally.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Be a Problem Solver, Not a Code Monkey
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqtah3ldp2rx8ww6mscw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqtah3ldp2rx8ww6mscw.jpg" alt=" " width="736" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If your coding routine is just blindly following specs or copy-pasting Stack Overflow answers, that’s exactly the kind of work AI can do. Basic “code monkey” tasks – implementing straightforward CRUD apps or boilerplate heavy lifting – are &lt;strong&gt;ripe for automation&lt;/strong&gt;. In contrast, the developers who survive and thrive will be the ones doing what AI &lt;strong&gt;can’t&lt;/strong&gt; do well: truly solving problems. As one discussion noted, “AI will replace software engineers who only copy-paste, but not those who find solutions to problems or understand design.”&lt;br&gt;
In other words, focus on being the creative problem-solver and critical thinker on your team, not just the person who turns coffee into code by rote.&lt;/p&gt;

&lt;p&gt;What does this mean in practice? It means honing your skills in &lt;strong&gt;system design&lt;/strong&gt;, &lt;strong&gt;architecture&lt;/strong&gt;, and &lt;strong&gt;debugging complex issues&lt;/strong&gt;. AI is great at generating code from prompts, but it has no real understanding of why that code should exist. It doesn’t grasp high-level design principles or the nuances of your specific product and users.&lt;br&gt;
&lt;strong&gt;Human brains excel at dealing with ambiguity, inventing new approaches, and making judgment calls.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Pro tip: If you often find yourself waiting for someone to tell you exactly what to do, flip the script. Start taking initiative in defining how to solve a task. The more you practice this, the more you transition from “the one being instructed” (replaceable) to “&lt;strong&gt;the one giving instructions&lt;/strong&gt;” (invaluable).)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In short, be the developer who solves problems, not just the one who writes whatever code they’re told to write. That mindset shift will make you much harder to replace.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Double Down on Your Human Skills (Creativity, Context and Communication)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvpmbaxj8jmvu4w77laf9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvpmbaxj8jmvu4w77laf9.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AI might be superhuman in cranking out code or analyzing data, but it’s still artificial. There are core “human” skills that give you an edge. One is &lt;strong&gt;creativity&lt;/strong&gt;. Sure, generative AI can mix and mash patterns from its training data, but it’s not truly inventing something new the way a human can.&lt;/p&gt;

&lt;p&gt;As a developer, you can come up with creative solutions, innovative algorithms, or clever workarounds that aren’t obvious from past data. Cultivate that creativity by experimenting with new technologies, doing hackathon projects, or simply practicing thinking outside the box in your implementations. &lt;/p&gt;

&lt;p&gt;Another human advantage is &lt;strong&gt;contextual understanding&lt;/strong&gt;. You as a developer understand your project’s purpose, your users’ needs, and your company’s goals. AI doesn’t grasp the “big picture” or the why behind the code. Use that to your benefit: involve yourself in product discussions, user feedback, and domain knowledge. If you know why a feature matters and how users will interact with it, you can design and tweak it in ways an AI-generated solution wouldn’t foresee. Being able to align technology with real-world needs – essentially translating business requirements into tech – makes you incredibly valuable. &lt;/p&gt;

&lt;p&gt;Don’t forget &lt;strong&gt;communication and teamwork&lt;/strong&gt; skills, either. Coding is often a team sport. Explaining your ideas, listening to others, and collaborating effectively are things no AI can do with human nuance. For example, discussing trade-offs with a product manager or doing a code review that sensitively educates a junior colleague – those require human empathy and communication. In the future, as routine coding gets more automated, skills like communication, mentorship, and leadership will likely become even more important, not less. Developers who can connect with people (whether teammates or users) will always be in demand. &lt;/p&gt;

&lt;p&gt;AI can even paint by numbers, but true creativity and innovation are still uniquely human traits. Focus on design and ideas that go beyond the data an AI was trained on. &lt;/p&gt;

&lt;p&gt;In summary, &lt;strong&gt;sharpen the human elements of your craft.&lt;/strong&gt; Creativity, context, and communication – these are your moat against automation.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Never Stop Learning and Adapting
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbdvf2976hotxre2lpmd4.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbdvf2976hotxre2lpmd4.gif" alt=" " width="220" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The tech world moves fast, and AI is only accelerating that pace. To avoid obsolescence, adopt a &lt;strong&gt;lifelong learning mindset.&lt;/strong&gt; That means continuously updating your skills, learning new languages or frameworks when needed, and crucially, &lt;strong&gt;learning about AI itself.&lt;/strong&gt; The good news is that AI can actually help you learn faster (hello, AI tutors and interactive docs!). But it won’t matter if you don’t take the initiative. &lt;/p&gt;

&lt;p&gt;A clear real-world sign: companies like Amazon have literally warned their developers to &lt;strong&gt;“upskill now”&lt;/strong&gt; because AI is advancing fast. In other words, standing still is not an option. Make a habit of staying up-to-date on industry trends – read blogs, try out new tools, maybe take an online course on machine learning or prompt engineering. Even if AI automates some parts of coding, entirely new roles and opportunities will emerge for those who understand it. By riding the wave of automation and picking up the skills that become more valuable, you ensure you’re surfing ahead of the break, not getting wiped out.&lt;/p&gt;

&lt;p&gt;Importantly, &lt;strong&gt;keep your fundamentals strong&lt;/strong&gt; as well. One instructor noted that programmers still need solid fundamental knowledge to effectively use AI and understand its output.&lt;br&gt;
This means you shouldn’t skip learning algorithms, data structures, or how to debug and test thoroughly – these give you the intuition to catch AI’s mistakes and improve on its suggestions. In fact, a savvy developer uses AI to generate a solution, then uses their human skills to refine and correct it. That synergy can produce better results than either a human or AI alone. &lt;/p&gt;

&lt;p&gt;Finally, be &lt;strong&gt;adaptable&lt;/strong&gt; in your career. The projects and tech stacks you work on in five years might look very different from today’s. And that’s okay – if you’re adaptable. Be open to new roles that might emerge. Maybe in the future “AI facilitator” or “prompt engineer” becomes a common part of dev teams, or maybe you find yourself integrating AI APIs into every app. The more you roll with these changes and pick up new capabilities, the more indispensable you become. As one set of experts concluded, staying ahead by continuously learning and being willing to incorporate AI tools into your workflow makes you versatile and ready for whatever the industry throws your way.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Conclusion: Team Human + AI for the Win
&lt;/h2&gt;

&lt;p&gt;At the end of the day, the future of programming isn’t AI vs. developers – it’s AI with developers. The most likely scenario is that AI becomes a powerful tool in the developer’s toolbox, handling chunks of code and tedious tasks, while humans provide direction, creativity, and critical oversight. We’re already seeing that dynamic: AI frees developers to tackle more creative and complex work, essentially leveling up the kind of problems we focus on. Development roles will evolve (you might write fewer lines of trivial code and spend more time orchestrating AI or fine-tuning architecture), but human developers will remain necessary. &lt;/p&gt;

&lt;p&gt;So, don’t panic about being replaced. Instead, evolve with the technology. Keep coding, keep learning, and find your unique value in this new landscape. The fact that you’re reading this shows you care about your growth – and that mindset is your biggest asset. Embrace AI, sharpen your human strengths, and you’ll do just fine. In fact, you’ll do better than fine - you’ll be leading the charge in whatever the future of software development looks like. &lt;/p&gt;

&lt;p&gt;Last but not least, remember that the journey is more fun with community. Stay curious, keep experimenting, and never hesitate to seek out knowledge (or share it). And yes, one great way to stay updated with the latest AI and dev insights is to follow &lt;a href="https://x.com/thenomadevel" rel="noopener noreferrer"&gt;@thenomadevel&lt;/a&gt; – I’ll be right there learning and sharing alongside you. 😉 &lt;/p&gt;

&lt;p&gt;Until next time, happy coding and keep rocking that uniquely human creativity! You’ve got this. 🙌&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s Connect and Build Together
&lt;/h2&gt;

&lt;p&gt;Here’s how we can collaborate:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open to DevRel partnerships to help brands grow through educational content.
&lt;/li&gt;
&lt;li&gt;Have an AI MVP idea or need consultancy services for AI-based applications and research projects? Let’s make it happen!
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dn31r00hemkraqq4lfq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dn31r00hemkraqq4lfq.png" alt=" " width="800" height="862"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📧 &lt;strong&gt;Drop a mail at&lt;/strong&gt;: &lt;a href="mailto:thenomadevel@gmail.com"&gt;thenomadevel@gmail.com&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>beginners</category>
      <category>python</category>
    </item>
    <item>
      <title>How I Got an AI Agent to Read and Reply on WhatsApp Automatically</title>
      <dc:creator>Nomadev</dc:creator>
      <pubDate>Tue, 27 May 2025 06:38:37 +0000</pubDate>
      <link>https://forem.com/thenomadevel/how-i-got-an-ai-agent-to-read-and-reply-on-whatsapp-automatically-am1</link>
      <guid>https://forem.com/thenomadevel/how-i-got-an-ai-agent-to-read-and-reply-on-whatsapp-automatically-am1</guid>
      <description>&lt;p&gt;Hey! If you're into building smart, real-time AI that feels like magic but runs on clean logic — you're in the right place. I'm &lt;a href="https://x.com/thenomadevel" rel="noopener noreferrer"&gt;Nomadev&lt;/a&gt;, and in this guide, we’re connecting an AI agent to WhatsApp so it can actually read, reply, and even reason using OWL and MCP.&lt;/p&gt;

&lt;p&gt;So we went ahead and built one using &lt;a href="https://docs.camel-ai.org/" rel="noopener noreferrer"&gt;CAMEL-AI’s&lt;/a&gt; &lt;a href="https://github.com/camel-ai/owl" rel="noopener noreferrer"&gt;OWL&lt;/a&gt; multi-agent framework and a WhatsApp MCP server. In this post, I’ll walk you through exactly how to do it, what tools are involved, and how everything fits together.&lt;/p&gt;

&lt;p&gt;By the end, you’ll have a real-time WhatsApp assistant that can read messages, understand context, use tools (like search) and respond intelligently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkquj51twk8bwpbnywyp5.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkquj51twk8bwpbnywyp5.gif" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What We’re Building (and Why It’s Cool)
&lt;/h2&gt;

&lt;p&gt;Imagine sending a message to WhatsApp like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What’s the weather in Tokyo this weekend?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And your AI assistant replies a few seconds later like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Looks like 23°C and mostly sunny. Pack those shades.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;No need to open a browser or app — your agent handled it in the background.&lt;/p&gt;

&lt;p&gt;We’re making that possible by plugging OWL into WhatsApp using a Model Context Protocol (MCP) server.&lt;/p&gt;

&lt;p&gt;Let’s break down how it all works.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s This MCP Thing?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://modelcontextprotocol.io/introduction" rel="noopener noreferrer"&gt;Model Context Protocol (MCP)&lt;/a&gt; is like a universal translator for LLMs.&lt;br&gt;
Instead of hardcoding how an AI talks to every app or service, MCP gives us a clean way to plug tools (like WhatsApp) into AI systems with zero mess.&lt;/p&gt;

&lt;p&gt;Here’s how the pieces play together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP Server →&lt;/strong&gt; Adapts a tool (like WhatsApp) into a format AI can understand&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Client →&lt;/strong&gt; Lives on the AI side and sends/receives data to/from the server&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Host →&lt;/strong&gt; Runs the whole show (in our case, that’s OWL)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;📞 Think of it like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OWL is the person making the call (MCP host)&lt;/li&gt;
&lt;li&gt;The phone they use is the MCP client&lt;/li&gt;
&lt;li&gt;The friend on the other end (WhatsApp tool) is the MCP server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You now have a modular, secure, and elegant way to let AI interact with apps like WhatsApp.&lt;/p&gt;


&lt;h2&gt;
  
  
  A Quick Look at OWL (Optimized Workforce Learning)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff34y6lykuw7dpvprr6y5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff34y6lykuw7dpvprr6y5.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OWL is CAMEL-AI’s framework for building multi-agent systems that think and collaborate.&lt;br&gt;
Instead of one lonely agent trying to do everything, OWL lets agents role-play and delegate.&lt;/p&gt;

&lt;p&gt;In our case:&lt;/p&gt;

&lt;p&gt;One agent plays the user, orchestrating the query the user asked.&lt;/p&gt;

&lt;p&gt;Another agent plays the assistant, handling tool calls and staying aligned with the user agent.&lt;/p&gt;

&lt;p&gt;And thanks to real-time messaging support, it feels natural. The assistant keeps context, remembers past replies, and actually gets what you’re trying to say (even across multiple messages).&lt;/p&gt;


&lt;h2&gt;
  
  
  🛠️ Let’s Build: WhatsApp AI Assistant with OWL
&lt;/h2&gt;

&lt;p&gt;✅ Prereqs&lt;br&gt;
Before we dive in, here’s what you’ll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go (for running the WhatsApp bridge)&lt;/li&gt;
&lt;li&gt; Python 3.10+&lt;/li&gt;
&lt;li&gt; OpenAI API Key (or any LLM setup that OWL supports)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  🔧 Step 1: Clone the Code
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# OWL framework (multi-agent brain)
[git clone https://github.com/camel-ai/owl.git](git clone https://github.com/camel-ai/owl.git)

# WhatsApp MCP integration (the WhatsApp bridge)
[git clone https://github.com/lharries/whatsapp-mcp.git](git clone https://github.com/lharries/whatsapp-mcp.git)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The OWL repo includes the full WhatsApp demo under &lt;a href="https://github.com/camel-ai/owl/tree/main/community_usecase/Whatsapp-MCP" rel="noopener noreferrer"&gt;community_usecase/Whatsapp-MCP&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  🔁 Step 2: Fire Up the WhatsApp Bridge
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd whatsapp-mcp/whatsapp-bridge
go mod download
go run main.go
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You’ll see a QR code pop up.&lt;br&gt;
Scan it using your WhatsApp (just like WhatsApp Web) to link your account.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Keep this bridge running in a separate terminal. It’s your live connection to WhatsApp.&lt;/p&gt;
&lt;h3&gt;
  
  
  🧩 Step 3: Configure MCP
&lt;/h3&gt;

&lt;p&gt;Create a file called &lt;code&gt;mcp_config_whatsapp.json&lt;/code&gt; like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "mcpServers": {
    "whatsapp": {
      "command": "&amp;lt;PATH_TO_UVICORN&amp;gt;",
      "args": [
        "&amp;lt;PATH_TO_WHATSAPP_MCP_SERVER_MAIN.py&amp;gt;",
        "--connect_serial_host",
        "--only_one"
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lets OWL know how to launch and connect to the WhatsApp server.&lt;br&gt;
Just swap in the actual file paths where needed.&lt;/p&gt;
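&lt;p&gt;For the curious: under the hood, OWL consumes this file through CAMEL’s &lt;code&gt;MCPToolkit&lt;/code&gt;. Here’s a minimal sketch of that wiring, using the same toolkit API shown later in this post (the filename matches Step 3):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from camel.toolkits import MCPToolkit

# Point the toolkit at the Step 3 config; connect() launches the
# configured MCP server(s) and exposes their tools to the agents.
# Run this inside an async function, since connect() is a coroutine.
mcp_toolkit = MCPToolkit(config_path="mcp_config_whatsapp.json")
await mcp_toolkit.connect()

tools = mcp_toolkit.get_tools()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;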
&lt;h3&gt;
  
  
  🧠 Step 4: Launch the OWL Agent
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd owl
python community_usecase/Whatsapp-MCP/app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Starts OWL’s multi-agent brain&lt;/li&gt;
&lt;li&gt;Launches the WhatsApp MCP server via Uvicorn&lt;/li&gt;
&lt;li&gt;Connects everything together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now try messaging your WhatsApp account (from another phone or friend).&lt;br&gt;
Your AI agent will reply — in real-time — directly inside WhatsApp. No extra apps or dashboards.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/Uw708nIwMqU"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  🧪 Behind the Scenes
&lt;/h3&gt;

&lt;p&gt;Under the hood, this is what’s happening:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Message arrives in WhatsApp&lt;/li&gt;
&lt;li&gt;WhatsApp MCP server receives it&lt;/li&gt;
&lt;li&gt;OWL’s assistant agent reads it via MCP&lt;/li&gt;
&lt;li&gt;Agent reasons about the best reply&lt;/li&gt;
&lt;li&gt;The reply gets sent back through the server&lt;/li&gt;
&lt;li&gt;You see it in WhatsApp&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  🧵 Bonus: The Python Behind It
&lt;/h3&gt;

&lt;p&gt;Want to peek inside the code that powers this?&lt;br&gt;
We’ve got role construction, tool config, and async orchestration — all wrapped in one OWL script.&lt;/p&gt;

&lt;p&gt;(You can find the full script inside the OWL repo → &lt;a href="https://github.com/camel-ai/owl/blob/main/community_usecase/Whatsapp-MCP/app.py" rel="noopener noreferrer"&gt;community_usecase/Whatsapp-MCP/app.py&lt;/a&gt;)&lt;/p&gt;
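&lt;p&gt;If you just want the shape of it, here’s a minimal sketch of that pattern, assuming the Step 3 config file and the same &lt;code&gt;MCPToolkit&lt;/code&gt;/&lt;code&gt;RolePlaying&lt;/code&gt; APIs from the CAMEL framework. The task prompt is made up, and the real &lt;code&gt;app.py&lt;/code&gt; does more:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio

from camel.societies import RolePlaying
from camel.toolkits import MCPToolkit
from owl.utils.enhanced_role_playing import arun_society


async def main():
    # Launch the WhatsApp MCP server declared in the Step 3 config
    mcp_toolkit = MCPToolkit(config_path="mcp_config_whatsapp.json")
    await mcp_toolkit.connect()
    try:
        # User agent decomposes the task; assistant agent calls the tools
        society = RolePlaying(
            task_prompt="Read my unread WhatsApp messages and draft replies.",
            user_role_name="user",
            assistant_role_name="assistant",
            assistant_agent_kwargs={"tools": mcp_toolkit.get_tools()},
        )
        answer, chat_history, token_count = await arun_society(society)
        print(answer)
    finally:
        await mcp_toolkit.disconnect()


if __name__ == "__main__":
    asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;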




&lt;h3&gt;
  
  
  Pro Tips &amp;amp; Troubleshooting
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bridge not scanning?&lt;/strong&gt; Run it again to get a fresh QR code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No reply showing?&lt;/strong&gt; Check that the Python and Go processes are both running&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrong path in config?&lt;/strong&gt; Triple-check your main.py and Uvicorn paths&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent feels slow?&lt;/strong&gt; First message might take a few seconds to process&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🚀 Final Thoughts
&lt;/h3&gt;

&lt;p&gt;This isn’t just a hacky integration — it’s a modular system.&lt;br&gt;
You can swap out WhatsApp for Slack, Gmail, Notion, or any other MCP server and keep the same AI logic in OWL.&lt;/p&gt;

&lt;p&gt;Today it's WhatsApp. Tomorrow?&lt;br&gt;
Your agent could be trading stocks, controlling IoT devices, or managing your entire workflow.&lt;/p&gt;

&lt;p&gt;Want to dive deeper into MCP servers and discover other integrations?&lt;br&gt;
🔗 check out this blog: &lt;a href="https://dev.to/thenomadevel/7-mcp-sites-every-ai-dev-should-bookmark-2gno"&gt;7 MCP Sites Every AI Dev Should Bookmark&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🧘‍♂️ Wrap-Up
&lt;/h3&gt;

&lt;p&gt;So if you’ve ever wanted an AI that can manage your WhatsApp like a helpful teammate (while you stay focused or just vibe to some lo-fi), now you know how to build it.&lt;/p&gt;

&lt;p&gt;Catch you soon with more agent tricks, smart setups, and relaxed dev energy.&lt;/p&gt;

&lt;p&gt;—&lt;a href="https://x.com/thenomadevel" rel="noopener noreferrer"&gt; Nomadev&lt;/a&gt;&lt;br&gt;
Follow me on X for more AI experiments and automation builds 🚀&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>So I Hooked My AI Agent Up with Notion. Here's What Happened.</title>
      <dc:creator>Nomadev</dc:creator>
      <pubDate>Fri, 23 May 2025 13:46:23 +0000</pubDate>
      <link>https://forem.com/thenomadevel/so-i-hooked-my-ai-agent-up-with-notion-heres-what-happened-21oi</link>
      <guid>https://forem.com/thenomadevel/so-i-hooked-my-ai-agent-up-with-notion-heres-what-happened-21oi</guid>
      <description>&lt;p&gt;Hey there, I’m &lt;a href="https://x.com/thenomadevel" rel="noopener noreferrer"&gt;Nomadev&lt;/a&gt; and today we’re doing something futuristic and cozy: getting your AI agent to update Notion for you. Yup, hands in your pockets, favorite lo-fi track on, and let’s automate like a boss.&lt;/p&gt;

&lt;p&gt;With &lt;a href="https://www.camel-ai.org/" rel="noopener noreferrer"&gt;CAMEL-AI’s&lt;/a&gt; open-source general-purpose agent and Notion’s Model Context Protocol (MCP) server, your agent can now read from and write to your Notion workspace, all in a single prompt.&lt;/p&gt;

&lt;p&gt;Let’s walk through how to make it happen. By the end, your OWL agent will be searching for Notion pages and updating them based on your prompts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqv37leudgbg1few1dbs9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqv37leudgbg1few1dbs9.jpg" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s CAMEL-AI, OWL, and MCP?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;CAMEL-AI&lt;/strong&gt; is the first multi-agent framework; it helps you build multi-agent systems that work together smartly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/camel-ai/owl" rel="noopener noreferrer"&gt;OWL (Optimized Workforce Learning)&lt;/a&gt;&lt;/strong&gt; is one of the best general-purpose open-source agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP&lt;/strong&gt; is the bridge: a protocol that lets your agent talk to external tools (like &lt;a href="https://www.notion.com/" rel="noopener noreferrer"&gt;Notion&lt;/a&gt;) securely.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Prep Notion for Your Agent
&lt;/h2&gt;

&lt;p&gt;Before we code, let’s give our agent some polite access to your Notion workspace.&lt;/p&gt;

&lt;p&gt;🔹 Create a Notion Integration&lt;br&gt;
Go to Notion → Settings &amp;amp; Members → Integrations → “+ New Integration”&lt;br&gt;
Name it something cool like MyMCP Bot, set it to Internal, and save.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5oyvzoiy8a9pkh5qm9hr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5oyvzoiy8a9pkh5qm9hr.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🔹 Scope It Down (Safety First)&lt;br&gt;
Only enable “Read content” to keep things chill and privacy-safe. You can enable editing later once you’re confident.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F46kwp78u15rd7vp01gxz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F46kwp78u15rd7vp01gxz.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🔹 Copy the Token (Keep it Safe)&lt;br&gt;
This is your API key. It starts with secret_ or ntn_. Guard it like your Netflix password.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frwzo6n1h0bfc61cxh3ec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frwzo6n1h0bfc61cxh3ec.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🔹 Share a Page with It&lt;br&gt;
Pick the page (e.g., “Travel Itinerary”) and connect the integration to it. That’s how you let your agent in.&lt;/p&gt;
&lt;h2&gt;
  
  
  🛠️ Step 2: Hook CAMEL’s MCPToolkit to Notion
&lt;/h2&gt;

&lt;p&gt;Now let’s tell CAMEL where our Notion MCP server lives. Create a JSON file; let’s call it &lt;code&gt;mcp_config.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "mcpServers": {
    "notionApi": {
      "command": "npx",
      "args": ["-y", "@notionhq/notion-mcp-server"],
      "env": {
        "OPENAPI_MCP_HEADERS": "{\"Authorization\": \"Bearer ntn_****\", \"Notion-Version\": \"2022-06-28\" }"
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;In Python:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from camel.toolkits import MCPToolkit
from owl.utils.enhanced_role_playing import OwlRolePlaying, arun_society

mcp_toolkit = MCPToolkit(config_path="mcp_config.json")
await mcp_toolkit.connect()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just like that, your OWL agent is now equipped to use Notion like a pro assistant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Build an OWL Agent That Updates Notion
&lt;/h2&gt;

&lt;p&gt;Here’s the vibe: you give the OWL agent a task (like "Add 10 European travel destinations to a Notion page"), and the agent figures out what tools to use and when.&lt;/p&gt;

&lt;p&gt;No micromanaging. Just chill orchestration.&lt;/p&gt;

&lt;p&gt;Here’s a snippet to get that going:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;default_task = """
Find the page titled 'Travel Itinerary'
Add a list of Top 10 travel destinations in Europe, their descriptions, and best time to visit.
"""

tools = [*mcp_toolkit.get_tools()]
society = await construct_society(default_task, tools)
await execute_notion_task(society)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Behind the scenes, CAMEL’s OWL orchestrator handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Finding the page&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Appending the list&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reasoning and retrying if anything fails&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All from that single natural language prompt. Pretty slick, right?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🎬 Want to See It in Action?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Watch how OWL agents update Notion pages like pros — no extra hustle:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/O1Po_n9M3DY"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Try it out here: &lt;a href="https://github.com/camel-ai/owl/tree/main/community_usecase/Notion-MCP" rel="noopener noreferrer"&gt;GitHub Community Use Case&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Future Use Cases (Yes, It Gets Cooler)
&lt;/h3&gt;

&lt;p&gt;Now that your agent can read/write Notion, here’s what’s possible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Summarize and log meetings in Notion.&lt;/li&gt;
&lt;li&gt;Cross-reference docs from across tools.&lt;/li&gt;
&lt;li&gt;Update your weekly planner with new tasks.&lt;/li&gt;
&lt;li&gt;Build your own AI that reads docs + writes notes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The best part? It’s all under your control. Minimal permissions. Clear logs. Agent tools only run when they’re allowed.&lt;/p&gt;




&lt;h3&gt;
  
  
  🧘‍♂️ Wrap-Up
&lt;/h3&gt;

&lt;p&gt;We just gave our OWL agent the power to talk to Notion securely using MCP and made it do useful stuff autonomously.&lt;/p&gt;

&lt;p&gt;It’s not just automation. It’s intelligent, role-based, LLM-orchestrated chill-tech.&lt;/p&gt;

&lt;p&gt;So if you’ve ever dreamed of an AI assistant that updates your Notion pages while you sip your iced coffee, now you know how to build it.&lt;/p&gt;

&lt;p&gt;Catch you soon with more tech, tools, and chill!&lt;/p&gt;

&lt;p&gt;— &lt;a href="https://x.com/thenomadevel" rel="noopener noreferrer"&gt;Nomadev&lt;/a&gt;&lt;br&gt;
Follow me on X for more agent hacks and cozy AI builds 🚀&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>beginners</category>
      <category>opensource</category>
    </item>
    <item>
      <title>7 MCP Sites Every AI Dev Should Bookmark</title>
      <dc:creator>Nomadev</dc:creator>
      <pubDate>Thu, 15 May 2025 04:35:19 +0000</pubDate>
      <link>https://forem.com/thenomadevel/7-mcp-sites-every-ai-dev-should-bookmark-2gno</link>
      <guid>https://forem.com/thenomadevel/7-mcp-sites-every-ai-dev-should-bookmark-2gno</guid>
      <description>&lt;p&gt;Hey everyone, it’s &lt;a href="https://x.com/thenomadevel" rel="noopener noreferrer"&gt;Nomadev&lt;/a&gt; here with a brand-new roundup! 🚀 &lt;/p&gt;

&lt;p&gt;If you caught my last post on &lt;a href="https://dev.to/thenomadevel/a2a-vs-mcp-connecting-ai-agents-and-tools-4a8"&gt;A2A vs MCP: Connecting AI Agents and Tools&lt;/a&gt;, you know I’m a bit biased toward MCP. Today, I’m super excited to share &lt;strong&gt;7 Must-Know MCP Hubs Every AI Developer Should Explore&lt;/strong&gt;: these are the go-to spots I use to level up my agents with new tools. &lt;/p&gt;

&lt;p&gt;Alright, so let's dive in without any more waiting. 🔥&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2w44q3pag6fk7bsdyh25.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2w44q3pag6fk7bsdyh25.gif" alt="Image description" width="1000" height="500"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. &lt;a href="https://smithery.ai/" rel="noopener noreferrer"&gt;Smithery&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbng4hh6f7fbndjwtj1k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbng4hh6f7fbndjwtj1k.png" alt="Image description" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Discovery meets power with Smithery. This registry puts thousands of MCP servers at your fingertips, including web scrapers, database connectors and desktop automation so you can instantly bolt new capabilities onto your AI agents. Want to fetch live data or execute code without breaking a sweat? Smithery makes it happen in just a few clicks.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. &lt;a href="https://mcpservers.org/" rel="noopener noreferrer"&gt;Awesome MCP Servers  &lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fslbnq5pc2g4evto6q9m8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fslbnq5pc2g4evto6q9m8.png" alt="Image description" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Awesome MCP Servers is a community-curated treasure trove of open-source MCP endpoints, covering everything from web scraping and Git integrations to cloud storage and database tools. &lt;br&gt;
They even added a dedicated Remote Servers section, so you can plug into high-quality hosted MCP services like Sentry, Intercom, and PayPal without running anything locally.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. &lt;a href="https://www.aci.dev/" rel="noopener noreferrer"&gt;ACI.dev&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7m0ldfisnuqeso2cofr2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7m0ldfisnuqeso2cofr2.png" alt="Image description" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This platform connects your AI agents to 500+ pre-built integrations—from Gmail and Notion to Slack and HubSpot—through a single, unified MCP server. Built-in multi-tenant authentication and granular permissions mean you can spin up production-ready workflows without juggling OAuth tokens or custom security layers. &lt;/p&gt;

&lt;p&gt;Need to let end users authorize agents or manage secrets securely? ACI.dev handles the heavy lifting so you can focus on building smarter agents.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. &lt;a href="https://mcp.camel-ai.org/" rel="noopener noreferrer"&gt;MCP Hub by CAMEL-AI&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fog0rc4grx8d8hxfxvpww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fog0rc4grx8d8hxfxvpww.png" alt="Image description" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Discovery meets customization with the CAMEL-AI MCP Hub. This official directory lets you explore and compare first-party and community-driven MCP servers such as web browsers, database adapters, video summarizers and secure code sandboxes. &lt;/p&gt;

&lt;p&gt;Need an Astra DB adapter you can install with a single npm command or a Cloudflare Workers toolkit to deploy at the edge? The CAMEL-AI MCP Hub provides ready-to-use configurations and examples so you can plug powerful tools straight into your agents without reinventing the wheel.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. &lt;a href="https://www.pulsemcp.com/" rel="noopener noreferrer"&gt;PulseMCP&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fld8694c09tao4o8d84oy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fld8694c09tao4o8d84oy.png" alt="Image description" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This hub keeps you up to date on everything MCP — from a searchable directory of 4,300+ servers and 250+ clients to real-world use-case spotlights and weekly ecosystem news. Want to discover the latest error-tracking integrations, browse vetted VS Code plugins, or see how others are automating workflows? PulseMCP curates it all in one place and updates daily so you never miss a beat.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Klavis AI
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetqv0cldpim54luuedjz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetqv0cldpim54luuedjz.png" alt="Image description" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This API-first service lets you spin up production-grade MCP servers—everything from web search and video processing to report generation—with a single REST call. Provision and manage your servers programmatically, handle millions of interactions with built-in monitoring, and offload the heavy lifting of hosting and scaling. Klavis AI makes it seamless to plug powerful, managed MCP capabilities into your agents in minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. &lt;a href="https://mcp.composio.dev/" rel="noopener noreferrer"&gt;Composio &lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwix32012lh5sx80cfh4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwix32012lh5sx80cfh4.png" alt="Image description" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Orchestration meets simplicity with Composio. This platform brings 100+ managed MCP servers under one roof, complete with built-in authentication and seamless scaling. Connect your agents to Gmail, GitHub, Notion, Slack, and more in a few clicks, then drag, drop, and chain actions to build end-to-end workflows—no custom glue code or token juggling required. Composio lets you automate complex processes with a few lines of JSON.&lt;/p&gt;




&lt;p&gt;And that’s a wrap on my top 7 MCP hubs! These platforms cover everything from discovery to enterprise orchestration—and trust me, you can’t afford to miss them if you’re serious about MCP features.&lt;/p&gt;

&lt;p&gt;🔥 &lt;strong&gt;Try them out&lt;/strong&gt;, plug them into your agents, and let me know which one becomes your favorite! Drop your thoughts or any other hubs I might’ve missed in the comments below.&lt;/p&gt;

&lt;p&gt;Until next time, keep experimenting and build something awesome! 💪&lt;/p&gt;

&lt;p&gt;— &lt;a href="https://x.com/thenomadevel" rel="noopener noreferrer"&gt;Nomadev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>javascript</category>
      <category>ai</category>
      <category>beginners</category>
    </item>
    <item>
      <title>A2A vs MCP: Connecting AI Agents and Tools</title>
      <dc:creator>Nomadev</dc:creator>
      <pubDate>Sun, 11 May 2025 11:35:59 +0000</pubDate>
      <link>https://forem.com/thenomadevel/a2a-vs-mcp-connecting-ai-agents-and-tools-4a8</link>
      <guid>https://forem.com/thenomadevel/a2a-vs-mcp-connecting-ai-agents-and-tools-4a8</guid>
      <description>&lt;p&gt;The AI ecosystem is buzzing with new standards for how models, tools, and agents interconnect. Two big newcomers are Google’s A2A (Agent-to-Agent) protocol and &lt;a href="https://www.anthropic.com/news/model-context-protocol" rel="noopener noreferrer"&gt;Anthropic’s&lt;/a&gt; MCP (Model Context Protocol). At a high level, A2A sets out to standardize how complete AI agents talk and collaborate, while MCP standardizes how a language model hooks up to tools and data sources. &lt;/p&gt;

&lt;p&gt;In practice, the two have different goals and architectures but can actually complement each other (and even overlap) in building agent-based systems. Let’s dive into what each protocol is, how they’re designed, and why both matter for the future of AI products.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is A2A (Agent-to-Agent)?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvaw9hz4iqu6kmpnwupg9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvaw9hz4iqu6kmpnwupg9.png" alt="Image description" width="800" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/google/A2A" rel="noopener noreferrer"&gt;A2A&lt;/a&gt; is Google’s open spec for agents to talk to each other. In this vision, each agent is a self-contained AI service (an LLM plus any tools or functions it uses), and A2A defines a standard communication framework for those agents. In practice, A2A uses a client–server style architecture where one agent can discover and delegate tasks to another, exchanging messages and “artifacts” as the work progresses. &lt;/p&gt;

&lt;p&gt;Each agent publishes an agent card (a JSON profile listing its capabilities and endpoints) so others can find it. Under the hood A2A uses HTTP/HTTPS with JSON-RPC calls for structured requests and Server-Sent Events (or WebSockets) for streaming replies, all packaged in JSON. &lt;/p&gt;

&lt;p&gt;In short, A2A lets AI agents (say, a “travel planner” bot) talk to other specialized agents (e.g. “flight-booking agent”, “hotel-finder agent”) in a consistent way. The protocol even handles task hand-offs, progress updates, error handling, and follow-up questions. In the words of Google’s documentation, “A2A is a standardized communication framework that enables AI agents to interact with each other in a consistent, predictable way.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key components of A2A include:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxczp88n8evd0id2mu0f7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxczp88n8evd0id2mu0f7.png" alt="Image description" width="800" height="664"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Agent Cards:&lt;/strong&gt; JSON profiles advertising what an agent can do (its name, provider, endpoint, and capabilities).&lt;/p&gt;
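&lt;p&gt;For illustration, an agent card might look something like this (a hypothetical sketch based on the description above; the exact A2A schema may differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "flight-booking-agent",
  "description": "Searches and books flights",
  "provider": { "organization": "Example Travel Co." },
  "url": "https://agents.example.com/a2a",
  "capabilities": { "streaming": true },
  "skills": [
    { "id": "search_flights", "description": "Find flights for given dates" }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;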

&lt;p&gt;&lt;strong&gt;Tasks and Messages:&lt;/strong&gt; A “task” is a job handed off to an agent; it has lifecycle states (submitted, running, input-required, completed, etc.) and carries message envelopes and artifacts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Client–Server Model:&lt;/strong&gt; Agents act as clients or servers dynamically. One agent (client) may assign tasks to another (server), but roles can shift as agents collaborate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supported Patterns:&lt;/strong&gt; A2A handles asynchronous or long-running tasks (streaming partial results), multimodal payloads, clarifications, and standardized error formats.&lt;/p&gt;

&lt;p&gt;In short, A2A is about agent-to-agent collaboration. It answers questions like “How does Agent A discover Agent B and ask it to do something on my behalf?” For example, &lt;a href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/" rel="noopener noreferrer"&gt;Google’s own blog&lt;/a&gt; shows a workflow where an HR assistant agent finds candidate-sourcing and scheduling agents, asks clarifying questions, and merges the results into one solution for the user.&lt;/p&gt;

&lt;p&gt;This makes possible complex, multi-agent solutions (like planning a multi-city trip or automating a finance workflow) by reusing specialized agents as building blocks.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is MCP (Model Context Protocol)?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgqqqx0z91853yvkqknt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgqqqx0z91853yvkqknt.png" alt="Image description" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In contrast, &lt;a href="https://modelcontextprotocol.io" rel="noopener noreferrer"&gt;MCP (Model Context Protocol)&lt;/a&gt; is designed to connect an LLM-based application to data, APIs, and tools in a standardized way. Think of MCP as a “USB-C port” for AI: just as USB-C provides a common plug for many devices, MCP provides a common interface for LLMs to access external knowledge and functionality.&lt;/p&gt;

&lt;p&gt;In MCP, there are MCP hosts or clients (like a chat app or IDE that has an AI assistant) and MCP servers (small services that expose particular data sources or APIs). A typical setup: an AI host (e.g. a chatbot using Claude or GPT) has one or more MCP client components that each maintain a 1:1 connection to an MCP server. Those servers each grant the model access to a specific resource. For example, an MCP server might wrap Google Drive, Slack history, a database, or a web API. When the AI model needs context (e.g. “show me my last five emails”), the host asks the relevant MCP client, which sends a JSON-RPC request to the server. &lt;/p&gt;

&lt;p&gt;The server returns the data, which the AI can then use in its reasoning or response. Anthropic describes MCP as “an open standard that enables developers to build secure, two-way connections between their data sources and AI-powered tools.”&lt;/p&gt;

&lt;p&gt;The MCP architecture is simple: applications that want data act as MCP clients, while each data source implements an MCP server. The protocol uses JSON-RPC 2.0 messages for all calls.&lt;/p&gt;
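&lt;p&gt;To give a feel for the wire format, here is roughly what one of those calls looks like. The &lt;code&gt;tools/call&lt;/code&gt; method name follows the MCP spec; the tool name and arguments below are invented for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Request from the MCP client (hypothetical tool and arguments)
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "list_recent_emails",
    "arguments": { "limit": 5 }
  }
}

# Response from the MCP server
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "content": [ { "type": "text", "text": "1. Re: Q3 roadmap ..." } ]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;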

&lt;p&gt;&lt;strong&gt;Key elements include:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jvfvtfgtehefdefz2oz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jvfvtfgtehefdefz2oz.png" alt="Image description" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Clients/Hosts:&lt;/strong&gt; These live inside the AI application (e.g. Claude Desktop, a chatbot, or an IDE extension). They initiate connections and send requests for data or actions via MCP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Servers:&lt;/strong&gt; Lightweight servers (often open-source) that expose a particular capability or dataset. Each server declares a schema: what prompts or tool uses it supports. There are already MCP servers for things like Google Drive, Slack, GitHub, databases, even custom code run via Puppeteer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local vs Remote Sources:&lt;/strong&gt; MCP servers can access local resources (your files, databases, local services) securely, as well as remote APIs over the Internet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capability Negotiation:&lt;/strong&gt; On connecting, the client and server exchange metadata. The server tells the client what prompts, actions, or tools it can provide (like a function signature).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session State:&lt;/strong&gt; MCP maintains a session so the server can remember previous exchanges. It focuses on exchanging context and coordination between client and server.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Differences
&lt;/h2&gt;

&lt;p&gt;In practice, MCP lets any AI app plug into arbitrary data easily. Anthropic notes that MCP “provides a universal, open standard for connecting AI systems with data sources, replacing fragmented integrations with a single protocol.”&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;A2A (Agent-to-Agent)&lt;/th&gt;
&lt;th&gt;MCP (Model Context Protocol)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Focus &amp;amp; Scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Treats each system as a full agent (LLM + tools). Defines how agents discover one another, delegate tasks, and collaborate on multi-agent workflows.&lt;/td&gt;
&lt;td&gt;Assumes one side is an LLM application (host) and the other a data/tool provider (server). Standardizes how a model communicates with external tools and data sources.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mesh client–server among agents. Agents publish “agent cards” (JSON profiles) and call each other’s APIs via JSON-RPC or REST. Emphasizes task lifecycle, streaming, and peer discovery.&lt;/td&gt;
&lt;td&gt;Hub-and-spoke: the AI host (hub) connects to multiple MCP servers (spokes). No peer discovery—clients know which server to call based on context. Uses JSON-RPC 2.0 over HTTP (or stdio).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Flow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data moves between agents as part of delegated tasks (e.g. Agent A asks Agent B to perform work and return a result).&lt;/td&gt;
&lt;td&gt;Data flows between a model and a static resource (e.g. “get this document” or “query that database”).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Standards &amp;amp; Tech&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JSON, HTTP/HTTPS, JSON-RPC, Server-Sent Events for streaming, agent cards, artifacts, built-in enterprise-grade auth (OAuth, OIDC, API keys).&lt;/td&gt;
&lt;td&gt;JSON, HTTP/HTTPS or stdio, JSON-RPC 2.0. Early versions used simple API keys; newer implementations adopt OAuth 2.0/DCR for stronger security.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Intended Use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;“Cross-vendor discovery” on the public internet—agents from different organizations working together (“Who can do X for me?”).&lt;/td&gt;
&lt;td&gt;Reliable connection of your AI application to internal or SaaS data (“I need data X—connect me to the server that has it”).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overlap &amp;amp; Complementarity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Handles high-level agent orchestration. In practice, A2A agents often rely on MCP servers for tool/data access and may register themselves as MCP “resources” for discovery.&lt;/td&gt;
&lt;td&gt;Handles low-level model-to-tool wiring. Complements A2A by feeding agents the context and capabilities they need during multi-agent workflows. Can run inside A2A task payloads for seamless integration.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  Strengths and Challenges
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhrmk0kmrk7xw2v3dl8ex.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhrmk0kmrk7xw2v3dl8ex.png" alt="Image description" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A2A Strengths:&lt;/strong&gt; It enables modular agent networks. Agents from different vendors can interoperate if they speak A2A. It has enterprise-grade features out of the box (OAuth/OIDC auth, streaming, error semantics) for building real applications.&lt;/p&gt;

&lt;p&gt;It supports asynchronous workflows and partial results, which is crucial for long tasks. By standardizing discovery and capabilities, A2A greatly simplifies multi-agent architectures – think of it as a common rail that diverse agents can tap into.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A2A Challenges:&lt;/strong&gt; It’s new and still evolving. Every agent must implement the protocol correctly (card publishing, message handling, etc.), and network security between agents becomes important. Discovery at scale can be tricky (e.g. public versus private agent directories). There are also open questions about governance: who runs the index of agents, who certifies them, etc. And while A2A includes security mechanisms, real-world deployments will need careful design of “authorization boundaries” – deciding which agents can talk to which.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Strengths:&lt;/strong&gt; It fills a huge gap: AI apps typically need to talk to data, and MCP makes that pluggable. It reduces duplication by providing pre-built connectors. Once an ecosystem of servers grows, any model can switch data sources without rewriting prompts. It also natively handles conversational state and context sharing, so data requests can be grounded in ongoing interactions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Challenges:&lt;/strong&gt; Early on, MCP had weak security defaults (simple API keys, wide OAuth scopes), raising concerns about prompt injection and over-privileged access. &lt;/p&gt;

&lt;p&gt;Anthropic and others are addressing these (OAuth, fine-grained tokens, etc.), but it remains an area of scrutiny. MCP also isn’t built for agent discovery or orchestration – it won’t solve how you find an agent, just how you fetch data. Some vendors point out that MCP by itself doesn’t handle long-running multi-agent workflows or arbitration of tasks&lt;/p&gt;

&lt;h2&gt;
  
  
  Adoption and “Protocol Wars”
&lt;/h2&gt;

&lt;p&gt;I’ve been watching the rise of A2A and MCP with great interest. It feels a bit like being back in the early days of the web, when everyone was debating HTTP vs. FTP. On the A2A side, Google scored a big win out of the gate: over &lt;strong&gt;50 tech partners&lt;/strong&gt; (think MongoDB, Atlassian, SAP, PayPal, Cohere) have already published sample multi-agent demos with LangGraph and Intuitive AI. It’s powerful to imagine an enterprise where every AI microservice (HR, finance, analytics) just “speaks A2A” out of the box for task automation.&lt;/p&gt;

&lt;p&gt;Meanwhile, MCP isn’t far behind. I genuinely believe &lt;strong&gt;MCP has a bright future&lt;/strong&gt;: its promise of plugging any LLM into any data source (Slack, Google Drive, Postgres) with a single JSON-RPC call is irresistible. Microsoft’s Copilot Studio support is a strong signal that the industry wants those standardized connectors.&lt;/p&gt;

&lt;p&gt;But these aren’t the only players. IBM and the BeeAI project are championing &lt;strong&gt;ACP (Agent Communication Protocol)&lt;/strong&gt;, focusing on RESTful agent messaging and fine-grained permissions. With A2A, MCP, ACP and likely more on the horizon, it’s natural to wonder if we’re headed for a “protocol war.” Will one standard win, or will we settle into a hybrid world? Personally, I’m betting on &lt;strong&gt;complementarity&lt;/strong&gt;: each protocol shines in its own layer, and smart teams will weave them together rather than pick just one.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Future: Convergence or Coexistence?
&lt;/h2&gt;

&lt;p&gt;Looking forward, I see two equally exciting paths:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Convergence&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Imagine A2A agents exposing their skills as MCP servers, and MCP gaining richer task semantics borrowed from A2A (or ACP). We’d end up with a unified “agent + tool” fabric where discovery, delegation, and data access all happen through a single, battle-tested API surface.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Coexistence &amp;amp; Specialization&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
More likely in the near term: MCP becomes the go-to for model-to-tool wiring, while A2A (and ACP) dominate agent orchestration. Your next AI product, whether it’s a team-building assistant or a finance-reporting bot, will probably implement &lt;strong&gt;MCP internally&lt;/strong&gt; to tap into documents and databases, and &lt;strong&gt;A2A externally&lt;/strong&gt; to collaborate with other agents.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For anyone building AI today, here’s my takeaway: &lt;strong&gt;learn both&lt;/strong&gt;. Mastering MCP means your model can instantly plug into a rich ecosystem of data and tools. Embracing A2A ensures your agents can discover, delegate, and scale across organizational boundaries. Together, they’ll be as fundamental to tomorrow’s AI stacks as REST and gRPC are to today’s.&lt;/p&gt;

&lt;p&gt;If you love exploring the latest in tech and open source as much as I do, come say hi to me, Nomadev, on Twitter. Let’s exchange ideas, share insights, and keep the hustle spirit alive! 🚀  &lt;/p&gt;

&lt;p&gt;I’m also part of &lt;a href="https://www.camel-ai.org/" rel="noopener noreferrer"&gt;CAMEL-AI&lt;/a&gt;, and we’ve been hard at work integrating some really killer MCP features into our platform. 🤖 Stay tuned; we’ll be officially announcing them next week. Happy coding!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvffvfcmcwwmw9wdsffo.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvffvfcmcwwmw9wdsffo.jpeg" alt="Image description" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://x.com/thenomadevel" rel="noopener noreferrer"&gt;Catch up with Nomadev on X!&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
    <item>
      <title>The New Era of Automation: How OWL, CRAB, and MCP Are Bridging the Last Mile</title>
      <dc:creator>Nomadev</dc:creator>
      <pubDate>Mon, 14 Apr 2025 20:34:23 +0000</pubDate>
      <link>https://forem.com/camelai/the-new-era-of-automation-how-owl-crab-and-mcp-are-bridging-the-last-mile-231a</link>
      <guid>https://forem.com/camelai/the-new-era-of-automation-how-owl-crab-and-mcp-are-bridging-the-last-mile-231a</guid>
      <description>&lt;p&gt;The field of autonomous agents is experiencing a renaissance. These AI systems—designed to reason, interact with tools, and complete complex tasks—are making rapid and tangible progress. From cutting-edge research frameworks to powerful platforms enabling agents to manage incredibly intricate workflows. These systems are no longer just promising demos, they’re beginning to reshape how we think about digital labor and automation.&lt;/p&gt;

&lt;p&gt;A key enabler of this progress is the &lt;strong&gt;Model Context Protocol&lt;/strong&gt; (MCP), introduced by &lt;strong&gt;Anthropic&lt;/strong&gt;. MCP serves as a new standard for connecting AI assistants to the systems where data lives—including content repositories, business tools, and development environments. It has quickly gained traction, especially with Cursor and Windsurf's integration. OpenAI recently announced their support for MCP in their agent SDK, marking a significant step for the ecosystem. We have also integrated it into the CAMEL framework to embrace the MCP ecosystem.&lt;/p&gt;

&lt;p&gt;Despite these advancements, agents still face a fundamental limitation: they &lt;strong&gt;struggle with long-term decision-making and adaptation&lt;/strong&gt;. While they can execute well-scoped tasks, they falter on multi-step objectives that require learning, revising plans, or reacting to change. Current agents follow instructions but don’t truly evolve through experience.&lt;/p&gt;

&lt;p&gt;This gap stems from the static nature of internet training data. Language models learn from passive text, not from interaction. To gain real autonomy, agents must operate and evolve within &lt;strong&gt;environments&lt;/strong&gt;—digital or physical spaces where they can &lt;strong&gt;perceive, act, and learn from experience.&lt;/strong&gt; Only through this feedback loop can agents begin to improve through trial and error.&lt;/p&gt;

&lt;p&gt;To address this “last mile” challenge in agent automation, we introduce &lt;strong&gt;OWL&lt;/strong&gt; and &lt;strong&gt;CRAB&lt;/strong&gt;, two agent automation projects, along with &lt;strong&gt;MCP&lt;/strong&gt; integration, designed specifically for interactive environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  OWL: Optimized Workforce Learning
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/camel-ai/owl" rel="noopener noreferrer"&gt;OWL (Optimized Workforce Learning)&lt;/a&gt;, built on top of the CAMEL-AI Framework, is our recently released project for real-world task automation. OWL has shown promise in task automation, achieving an impressive average score of 58.18 on the GAIA benchmark—ranking #1 among open-source submissions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff34y6lykuw7dpvprr6y5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff34y6lykuw7dpvprr6y5.png" alt="Image description" width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://camel-ai.github.io/camel_asset/owl_gemini%202.5.mp4" rel="noopener noreferrer"&gt;Watch the video&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How OWL Works
&lt;/h2&gt;

&lt;p&gt;OWL is a multi-agent system for automating digital tasks through the use of a browser, terminal, code execution, function calls, and MCP tools. The project has integrated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Browser Automation:&lt;/strong&gt; Sophisticated browser interaction capabilities using the Playwright framework, allowing for scrolling, clicking, input handling, downloading, navigation, and more.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Online Search Capabilities:&lt;/strong&gt; Support for multiple search engines (including Google, DuckDuckGo, Baidu, Bocha, Wikipedia) enabling real-time information retrieval and knowledge acquisition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code Execution:&lt;/strong&gt; Ability to write and execute Python code using an interpreter, enabling programmatic solutions to complex problems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Document Parsing:&lt;/strong&gt; Advanced extraction of content from various document formats (Word, Excel, PDF, PowerPoint), with conversion to text or Markdown format.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multimodal Processing:&lt;/strong&gt; Robust handling of internet or local videos, images, and audio data through specialized toolkits (ImageAnalysisToolkit, VideoAnalysisToolkit, AudioAnalysisToolkit).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Extensive Toolkit Integration:&lt;/strong&gt; Access to a comprehensive set of built-in toolkits including ArxivToolkit, GitHubToolkit, GoogleMapsToolkit, and many more specialized tools built in the CAMEL framework.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The core of OWL’s functionality is built on the CAMEL framework’s RolePlaying module, which creates unique initial settings for different agents through predefined prompts. This system primarily utilizes two main agents:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. UserAgent:&lt;/strong&gt; Responsible for breaking down tasks and providing instructions&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. AssistantAgent:&lt;/strong&gt; Executes instructions using various pre-configured tools or tool agents&lt;/p&gt;

&lt;p&gt;This architecture enables OWL to handle complex workflows through dynamic agent interactions, making it particularly effective for task automation across diverse domains.&lt;/p&gt;

&lt;p&gt;Furthermore, OWL employs a multi-agent system with context isolation for handling long-horizon tasks. Specialized sub-agents maintain isolated context windows for their domain (e.g., WebAgent keeps browser interaction history separate from main agent context).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41nxnwfj755cfx10v8u0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41nxnwfj755cfx10v8u0.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  OWL with MCP Integration
&lt;/h2&gt;

&lt;p&gt;MCP has emerged as the “USB interface” of the LLM field, becoming a universal solution for addressing AI information silos, with its ecosystem growing daily. OWL supports the MCP protocol to call MCP servers within its ecosystem, achieving more standardized and efficient tool invocation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9kwbrxb9khfnesqwdj9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9kwbrxb9khfnesqwdj9.png" alt="Image description" width="800" height="557"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Here’s a step-by-step guide to implementing MCP with OWL:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Setting Up MCP Servers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, install the required MCP servers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install MCP Playwright Server
npm install -g @executeautomation/playwright-mcp-server
npx playwright install-deps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Configure MCP Servers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a configuration file named &lt;code&gt;mcp_servers_config.json&lt;/code&gt; with the following structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["-y", "@executeautomation/playwright-mcp-server"]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Implementation in OWL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here’s how to integrate OWL with MCP in your code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio
import sys

from camel.models import ModelFactory
from camel.toolkits import MCPToolkit
from camel.types import ModelPlatformType, ModelType
from camel.societies import RolePlaying
from camel.logger import set_log_level

from owl.utils.enhanced_role_playing import arun_society

set_log_level(level="DEBUG")

async def main():
    # Initialize MCP toolkit and connect
    mcp_toolkit = MCPToolkit(config_path="mcp_servers_config.json")

    try:
        await mcp_toolkit.connect()

        # Get task from command line or use default
        task = sys.argv[1] if len(sys.argv) &amp;gt; 1 else (
            "Using a web browser, search Google Scholar for Andrew Ng's academic profile. Create a comprehensive report that includes: (1) his main research directions in AI and machine learning, (2) at least five of his most influential published papers with citation counts, (3) his affiliated institutions throughout his career, and (4) a summary of his impact on the field."
        )

        # Setup model
        model = ModelFactory.create(
            model_platform=ModelPlatformType.OPENAI,
            model_type=ModelType.GPT_4O,
        )

        # Create and run society
        society = RolePlaying(
            task_prompt=task,
            user_role_name="user",
            user_agent_kwargs={"model": model},
            assistant_role_name="assistant",
            assistant_agent_kwargs={
                "model": model,
                "tools": mcp_toolkit.get_tools(),
            },
        )

        answer, chat_history, token_count = await arun_society(society)
        print(f"\033[94mAnswer: {answer}\033[0m")

    finally:
        try:
            await mcp_toolkit.disconnect()
        except Exception:
            print("Disconnect failed")


if __name__ == "__main__":
    asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example Use Case
&lt;/h3&gt;

&lt;p&gt;Consider this task: “Using a web browser, search Google Scholar for Andrew Ng's academic profile. Create a comprehensive report that includes: (1) his main research directions in AI and machine learning, (2) at least five of his most influential published papers with citation counts, (3) his affiliated institutions throughout his career, and (4) a summary of his impact on the field.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The OWL framework with MCP can handle this by:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Utilizing autonomous agents to decompose and tackle different aspects of the task&lt;/li&gt;
&lt;li&gt;Leveraging the Playwright MCP Server to navigate academic websites and extract paper information&lt;/li&gt;
&lt;li&gt;Coordinating the agents through OWL’s role-playing mechanisms to complete the task&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Benefits of OWL + MCP Integration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Standardized Tool Access:&lt;/strong&gt; MCP offers a unified interface for interacting with tools and data sources.&lt;br&gt;
&lt;strong&gt;2. Ecosystem Expansion:&lt;/strong&gt; New MCP servers can be seamlessly integrated to enhance OWL’s capabilities.&lt;br&gt;
&lt;strong&gt;3. Security:&lt;/strong&gt; MCP’s architecture safeguards sensitive data through its robust design.&lt;br&gt;
&lt;strong&gt;4. Flexibility:&lt;/strong&gt; Users can easily switch between any AI models that support the MCP standard.&lt;br&gt;
&lt;strong&gt;5. Efficiency:&lt;/strong&gt; Development time for complex multi-agent systems is significantly reduced.&lt;/p&gt;

&lt;h2&gt;
  
  
  OWL’s Future Directions
&lt;/h2&gt;

&lt;p&gt;OWL’s development roadmap focuses on enhancing its capabilities in several key areas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Expanding Tool Integration:&lt;/strong&gt; Incorporating more specialized toolkits to address domain-specific challenges&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improving Multi-Agent Coordination with RL:&lt;/strong&gt; Incorporating environmental feedback to train the multi-agent systems with reinforcement learning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strengthening Reasoning Capabilities:&lt;/strong&gt; Developing more sophisticated planning and decision-making mechanisms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broadening Environment Compatibility:&lt;/strong&gt; Ensuring seamless operation across different computing environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The recent integration of MCPToolkit, FileWriteToolkit, and TerminalToolkit represents significant progress toward these goals, enhancing OWL agents with MCP tool calling, file writing capabilities, and terminal command execution.&lt;/p&gt;
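
&lt;p&gt;As a rough sketch, wiring these toolkits into the assistant agent might look like the following; the constructor arguments are assumptions, so consult the camel docs for the current signatures.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from camel.toolkits import FileWriteToolkit, TerminalToolkit

# Combine MCP tools (from the earlier example) with file writing and
# terminal execution; all three feed the same tools list.
tools = [
    *mcp_toolkit.get_tools(),                               # MCP tool calling
    *FileWriteToolkit(output_dir="./reports").get_tools(),  # file writing
    *TerminalToolkit().get_tools(),                         # terminal commands
]

assistant_agent_kwargs = {"model": model, "tools": tools}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;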

&lt;h2&gt;
  
  
  CRAB: Cross-environment Agent Benchmark
&lt;/h2&gt;

&lt;p&gt;CRAB, short for &lt;strong&gt;CR&lt;/strong&gt;oss-environment &lt;strong&gt;A&lt;/strong&gt;gent &lt;strong&gt;B&lt;/strong&gt;enchmark, is the first agent framework that supports cross-device task execution. The project aims to build a benchmark that enables agents to perform tasks across multiple environments. For instance, within the CRAB framework, an agent can read a message on a smartphone and then operate a PC based on the message content.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://crab.camel-ai.org/static/videos/demo3_calendar_to_vim.mp4" rel="noopener noreferrer"&gt;Crab Demo&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is an “Environment” in CRAB?
&lt;/h2&gt;

&lt;p&gt;The term &lt;em&gt;environment&lt;/em&gt; is crucial in CRAB. In the example above, there are two environments: an Ubuntu PC and an Android smartphone. In fact, an environment can be any device, application, or even a more complex multi-device system—as long as it has a well-defined action space and observation space.&lt;/p&gt;
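
&lt;p&gt;A conceptual sketch of that definition in Python (deliberately not the real CRAB API): an environment is anything exposing a named action space and an observation function.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical model of an environment: any device or app that fits this
# shape (well-defined actions plus observations) qualifies.
@dataclass
class Environment:
    name: str
    actions: Dict[str, Callable] = field(default_factory=dict)  # action space
    observe: Callable[[], bytes] = bytes                        # observation space

android = Environment(
    name="android",
    actions={"tap": lambda x, y: None, "read_message": lambda: "..."},
    observe=lambda: b"screenshot-bytes",
)

ubuntu = Environment(
    name="ubuntu",
    actions={"click": lambda x, y: None, "type_text": lambda text: None},
    observe=lambda: b"screenshot-bytes",
)

# A cross-environment agent observes in one environment (read a message on
# the phone) and acts in another (operate the PC), as in the demo above.
message = android.actions["read_message"]()
ubuntu.actions["type_text"](message)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;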

&lt;h2&gt;
  
  
  Why Cross-Environment Matters
&lt;/h2&gt;

&lt;p&gt;Cross-environment capability is a crucial consideration in our framework, enabling agents to interact simultaneously with multiple devices or applications. This involves coordinating across environments, passing messages between them, and leveraging information from one to act in another. The capability is vital because it mirrors how humans solve complex problems: we naturally move between diverse environments, each with its own action/observation space and logic. Most existing agent benchmarks, by contrast, are limited to interactions within a single device or application.&lt;/p&gt;

&lt;p&gt;CRAB introduces the first cross-environment agent benchmark, &lt;strong&gt;CRAB Benchmark v0&lt;/strong&gt;, which includes 120 tasks spanning more than 20 applications on Ubuntu desktops and Android smartphones. We believe that scaling agent environments is a key step toward building capable and practical agents.&lt;/p&gt;

&lt;p&gt;The cross-environment capability unlocks tremendous potential for real-world applications. One exciting possibility is applying CRAB to IoT scenarios—imagine controlling all your devices through a single intelligent agent assistant. In industries such as networking and cloud computing, managing a large number of heterogeneous devices is a constant challenge. Our cross-environment paradigm offers a promising path forward in these domains.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0wu12mljzss9hwm4gnv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0wu12mljzss9hwm4gnv.png" alt="Image description" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next: CRAB’s Updating Directions
&lt;/h2&gt;

&lt;p&gt;We are actively improving CRAB and planning several key upgrades in the upcoming version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Usability&lt;/strong&gt;: Simplifying configuration and improving code readability. Introducing MCP (Model Context Protocol) for seamless integration with any model or framework.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensibility&lt;/strong&gt;: Adopting a modular design that makes it easy to add new environments or virtual device implementations. We’ll also introduce a plugin system to support easy customization of existing modules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Robustness&lt;/strong&gt;: Our current VM implementations rely on QEMU/KVM and the Google Android Emulator, which are not very stable and are Linux-dependent. We plan to switch to more stable and convenient alternatives like Docker.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation&lt;/strong&gt;: Reducing the amount of manual labor needed to conduct experiments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ll be integrating more components into our official &lt;a href="https://github.com/camel-ai/crab" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Popular Benchmarks: OSWorld, WebArena, and more&lt;/li&gt;
&lt;li&gt;New Environments: Windows, macOS, iOS, web browsers, specific applications, OpenAI Gymnasium, etc.&lt;/li&gt;
&lt;li&gt;Visual Prompt Tools: OmniParser, Ferret-UI, Grounding DINO, etc.&lt;/li&gt;
&lt;li&gt;Advanced GUI models: OpenAI Operator, Claude Computer Use, etc.&lt;/li&gt;
&lt;li&gt;Multi-Agent Systems: Frameworks like CAMEL and OWL, protocols like MCP&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  OWL + CRAB: A Unified Agent Operating System
&lt;/h2&gt;

&lt;p&gt;The integration of OWL and CRAB creates a potent ecosystem for developing, testing, and scaling agents.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;OWL can execute complex, multi-step digital tasks using its sophisticated reasoning and toolkits within a defined environment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CRAB can provide and manage the diverse, interconnected environments (like PCs, smartphones, specific apps) where these tasks unfold, enabling agents to operate across previously siloed systems.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Complementary Capabilities&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;OWL&lt;/strong&gt; and &lt;strong&gt;CRAB&lt;/strong&gt; complement each other in several important ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Development and Evaluation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OWL provides the framework for building sophisticated multi-agent systems.&lt;/li&gt;
&lt;li&gt;CRAB offers standardized methods for evaluating their performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Task Automation and Environment Adaptation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OWL is good at automating complex tasks.&lt;/li&gt;
&lt;li&gt;CRAB ensures these capabilities work consistently across different environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Tool Integration and Benchmark Standardization&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OWL’s extensive toolkit integration is balanced by CRAB’s rigorous benchmarking approach.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Generation Potential
&lt;/h3&gt;

&lt;p&gt;Combining these projects enables the generation of high-quality training data. Once established, the environments can be used to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Create Diverse Scenarios:&lt;/strong&gt; Generate a wide range of task scenarios across different environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Capture Agent Interactions:&lt;/strong&gt; Record how agents navigate these scenarios, including both successful and unsuccessful approaches.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Develop Improvement Metrics:&lt;/strong&gt; Analyze interaction data to uncover patterns and strategies that correlate with better performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Train New Agent Models:&lt;/strong&gt; Use the synthetic data and identified success signatures to guide the training process through RLHF, targeted fine-tuning, and supervised learning.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This data generation capability creates a &lt;strong&gt;virtuous cycle&lt;/strong&gt; where agent performance continuously improves through iterative testing and refinement.&lt;/p&gt;
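
&lt;p&gt;As a minimal illustration of the capture step (the function and schema are illustrative, not an existing API), each run can be appended to a JSONL file with its success label so that both positive and negative trajectories feed the cycle:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

# Hypothetical trajectory recorder: one JSON line per task run, keeping the
# per-step observations/actions and the final success signal for later
# filtering (e.g. RLHF or targeted fine-tuning).
def record_trajectory(task_id, steps, success, path="trajectories.jsonl"):
    with open(path, "a") as f:
        f.write(json.dumps({
            "task_id": task_id,
            "steps": steps,      # e.g. [{"observation": ..., "action": ...}]
            "success": success,
        }) + "\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;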

&lt;h2&gt;
  
  
  The Critical Role of Environment in Agent Scaling
&lt;/h2&gt;

&lt;p&gt;CAMEL-AI has identified environment as one of the three key dimensions in the scaling laws of agents—alongside:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the number of agents&lt;/li&gt;
&lt;li&gt;memory capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This highlights how crucial &lt;strong&gt;environment design&lt;/strong&gt; is to advancing agent technology.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Environments Matter for Agent Scaling
&lt;/h3&gt;

&lt;p&gt;Environments provide the context in which agents operate and learn. They define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Action Space: What agents can do and how they interact with the world&lt;/li&gt;
&lt;li&gt;The Observation Space: What information agents can perceive&lt;/li&gt;
&lt;li&gt;The Reward Structure: How agent behaviors are reinforced&lt;/li&gt;
&lt;li&gt;The Task Complexity: The range of challenges agents must overcome&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As environments become more diverse and complex, they drive the development of more sophisticated agent capabilities. This creates a &lt;strong&gt;scaling effect&lt;/strong&gt;—better environments lead to better agents, which in turn can handle more complex environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cross-Environment Challenges
&lt;/h3&gt;

&lt;p&gt;The ability to operate across different environments represents a significant leap in agent capabilities. It requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Abstraction Skills: Understanding common principles that apply across environments&lt;/li&gt;
&lt;li&gt;Adaptation Mechanisms: Adjusting strategies based on environment-specific constraints&lt;/li&gt;
&lt;li&gt;Transfer Learning: Applying knowledge gained in one environment to another&lt;/li&gt;
&lt;li&gt;Meta-Learning: Learning how to learn in new environments quickly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CRAB’s focus on &lt;strong&gt;cross-environment benchmarking&lt;/strong&gt; directly addresses these challenges, providing a structured way to measure and improve these critical capabilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  Environment-Driven Intelligence
&lt;/h3&gt;

&lt;p&gt;CAMEL-AI’s hypothesis on the &lt;strong&gt;scaling laws of agents&lt;/strong&gt; emphasizes that &lt;strong&gt;intelligence emerges from the interplay between agents and their environments.&lt;/strong&gt; This aligns with Marvin Minsky’s Society of Mind concept—suggesting that intelligence is not monolithic, but emerges from diverse interactions. Environments serve as crucial testing grounds, stretching and refining agent capabilities. By developing increasingly complex environments, we drive the creation of more sophisticated agents—mirroring how &lt;strong&gt;human intelligence&lt;/strong&gt; evolved through natural and social interactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Directions in Environment Design
&lt;/h2&gt;

&lt;p&gt;As agent technology advances, environment design will likely focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increased Realism: Mimicking real-world complexity&lt;/li&gt;
&lt;li&gt;Dynamic Adaptation: Evolving in response to agent capabilities&lt;/li&gt;
&lt;li&gt;Multi-Agent Ecosystems: Encouraging rich agent-to-agent interactions&lt;/li&gt;
&lt;li&gt;Cross-Modal Integration: Combining sensory and interaction modalities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The combination of OWL's advanced agent capabilities and CRAB's rigorous environment specifications offers an ideal platform for exploring these frontiers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The integration of &lt;strong&gt;OWL, CRAB,&lt;/strong&gt; and &lt;strong&gt;MCP&lt;/strong&gt; represents a significant step forward in solving the &lt;em&gt;“last mile”&lt;/em&gt; challenge of agent automation.&lt;/p&gt;

&lt;p&gt;By creating environments where agents can learn from experience, operate across platforms, and leverage standardized tool interfaces, we’re building the foundation for truly autonomous systems. As these projects continue to evolve, they promise to unlock new possibilities for AI agents—from more effective task automation to cross-environment coordination and continuous improvement through interaction. &lt;strong&gt;The future of agent technology lies not just in better models, but in better environments&lt;/strong&gt;—environments that allow those models to learn, adapt, and grow through experience.&lt;/p&gt;

&lt;p&gt;Join us in exploring this frontier of AI research and development—where the boundaries between environments dissolve, and agents gain the power to navigate our complex digital world with increasing autonomy and effectiveness. &lt;strong&gt;Ready to join? Click the &lt;a href="https://www.camel-ai.org/collaboration-questionnaire" rel="noopener noreferrer"&gt;link&lt;/a&gt; or paste it into your browser to apply now.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OWL GitHub: &lt;a href="https://github.com/camel-ai/owl" rel="noopener noreferrer"&gt;https://github.com/camel-ai/owl&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CRAB GitHub: &lt;a href="https://github.com/camel-ai/crab" rel="noopener noreferrer"&gt;https://github.com/camel-ai/crab&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>javascript</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>🐉 Loong: Synthesize Long CoTs at Scale through Verifiers</title>
      <dc:creator>Nomadev</dc:creator>
      <pubDate>Wed, 09 Apr 2025 18:07:51 +0000</pubDate>
      <link>https://forem.com/camelai/loong-synthesize-long-cots-at-scale-through-verifiers-27b4</link>
      <guid>https://forem.com/camelai/loong-synthesize-long-cots-at-scale-through-verifiers-27b4</guid>
      <description>&lt;p&gt;Recent Large Reasoning Models such as DeepSeek-R1 have demonstrated that general reasoning capabilities of LLMs greatly improve when base models undergo post-training with Reinforcement Learning (RL) with a verifiable reward. Mathematics and programming have particularly benefited from this approach, as these domains can be verified quite easily—allowing accurate interpretation of LLM responses and effective comparison to the ground truth on a semantic level. This idea that ease of verification is crucial to improving domain-specific capabilities has become widely accepted in the research community.&lt;/p&gt;

&lt;p&gt;Another critical prerequisite which is often overlooked is the abundance of &lt;strong&gt;high-quality datasets&lt;/strong&gt;, featuring questions paired with verified correct answers in the domains of Math and Coding. These curated datasets provided the necessary signal for models to learn to construct coherent &lt;strong&gt;Chains-of-Thought&lt;/strong&gt; (CoTs) leading reliably to correct answers.&lt;/p&gt;

&lt;p&gt;However, many other domains also require reliable reasoning—such as logic, graph theory, physics, and finance. These domains lack comparable datasets, and human-supervised data production at scale is prohibitively expensive. Without abundant correct answers to learn from, models cannot easily acquire domain-specific reasoning patterns. This raises a crucial question:  &lt;em&gt;Can similar reasoning performance be achieved in domains beyond math and programming?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this blog, we introduce Project &lt;strong&gt;Loong&lt;/strong&gt; - focusing on scaling up &lt;strong&gt;synthetic data generation&lt;/strong&gt; with &lt;strong&gt;verifiers&lt;/strong&gt; for a &lt;strong&gt;broad range&lt;/strong&gt; of domains. We believe that &lt;strong&gt;synthetic data generation&lt;/strong&gt; is essential—not only for addressing gaps in data-scarce domains, but also for enhancing reasoning capabilities in areas like math and programming by expanding dataset availability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing the Verification Gap in Synthetic Data for RL
&lt;/h2&gt;

&lt;p&gt;A natural gap exists between synthetic questions and their answers, as the correctness of synthetic answers isn't inherently guaranteed. Closing this gap entirely would require human supervision, which is prohibitively expensive at scale. We therefore try to close it as much as possible without involving a human in the loop.&lt;/p&gt;

&lt;p&gt;To do this, we developed a multi-agent system that generates synthetic questions and corresponding answers from a seed dataset. These synthetic questions are then posed to the agent we want to train, and we employ various domain-specific verifiers to compare the agent's responses against the synthetic answers to check for semantic equivalence.&lt;/p&gt;

&lt;p&gt;One of our main ideas is grounded in a simple hypothesis: an LLM equipped with a code interpreter can solve questions significantly more reliably compared to one relying solely on its own chain-of-thought reasoning in natural language.&lt;/p&gt;

&lt;p&gt;This makes intuitive sense, as many fields beyond computer science—such as physics, neurophysiology, economics, and computational biology—frequently rely on code-based solutions to solve problems in their own domain.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Loong Environment
&lt;/h2&gt;

&lt;p&gt;Since we are mostly interested in doing RL, we have structured all components into a unified Gym-like &lt;strong&gt;environment&lt;/strong&gt;, providing a clear interface for RL experimentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hec7m1u449xnutfwbqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hec7m1u449xnutfwbqw.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our environment comprises three main components:&lt;/p&gt;

&lt;h3&gt;
  
  
  Seed Dataset
&lt;/h3&gt;

&lt;p&gt;We begin by manually collecting domain-specific datasets consisting of questions and ground-truth answers. Each question in the seed dataset is guaranteed to be solvable using code. If available, we also record the code that leads to the ground truth. The purpose of the seed dataset is not to serve as a large-scale training set, but to bootstrap the synthetic data generation process by seeding the generative process of the LLM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dataset Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The repository currently includes a total of &lt;strong&gt;3,551 questions&lt;/strong&gt; spanning &lt;strong&gt;8 diverse domains&lt;/strong&gt; (and growing):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Math:&lt;/strong&gt; 1,615 questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Physics:&lt;/strong&gt; 434 questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Computational Biology:&lt;/strong&gt; 304 questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finance:&lt;/strong&gt; 320 questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph &amp;amp; Discrete Math:&lt;/strong&gt; 179 questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logic:&lt;/strong&gt; 110 questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mathematical Programming:&lt;/strong&gt; 68 questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security &amp;amp; Safety:&lt;/strong&gt; 521 questions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Synthetic Data Generator
&lt;/h3&gt;

&lt;p&gt;Our Synthetic Data Generator can be seen as a black box that is seeded by a seed dataset and generates an arbitrary number of synthetic questions and synthetic answers based on that seed dataset. The environment makes no further assumptions about the inner workings of the generator, which means any algorithm can be used under the hood for creating synthetic data. We currently support few-shot prompting over the seed data, as well as a multi-agent system, where we use &lt;a href="https://arxiv.org/abs/2212.10560" rel="noopener noreferrer"&gt;self-instruct&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/2304.12244" rel="noopener noreferrer"&gt;evol-instruct&lt;/a&gt;, or &lt;a href="https://github.com/camel-ai/camel/tree/master/camel/datagen" rel="noopener noreferrer"&gt;other data generation pipelines&lt;/a&gt; for generating questions and a solver agent for the synthetic answers.&lt;/p&gt;

&lt;p&gt;It is important to stress that we do not expect these synthetic answers to always be correct. While we assume code execution will yield more correct solutions than naive CoT reasoning, since it provides accurate computations, we are well aware that many synthetic answers will still be wrong.&lt;/p&gt;

&lt;p&gt;However, this is not a problem since we don’t learn from this raw synthetic data. We will further filter it in the next step and only learn from this filtered synthetic data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verifier
&lt;/h3&gt;

&lt;p&gt;While the Synthetic Data Generator produces ample synthetic data, it's essential to filter out incorrect solutions before using them for training. To do this effectively, we validate synthetic answers using two independent approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Deriving one solution directly through the Synthetic Data Generator’s code execution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Independently generating another solution via natural-language Chain-of-Thought (CoT) reasoning.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If these independent solutions agree, it's highly likely that the answer is correct. Although rare, there's still a possibility of false positives (both approaches incorrectly agreeing). However, given the fundamentally different methods involved, we believe this will not occur often enough to be detrimental to model training.&lt;/p&gt;
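
&lt;p&gt;A simplified sketch of this agreement filter (the names are illustrative, not the Loong API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Keep a synthetic QA pair only when the code-executed solution and the
# independently generated CoT solution agree; on disagreement we discard,
# since training on a possibly wrong answer is worse than dropping it.
def keep_sample(question, code_answer, cot_answer, semantically_equal):
    if semantically_equal(code_answer, cot_answer):
        return {"question": question, "answer": code_answer}
    return None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;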

&lt;p&gt;Each environment also includes a &lt;strong&gt;verifier&lt;/strong&gt; that semantically compares the LLM response with the synthetic answer, ensuring they are effectively equivalent. This verification step is crucial for accurately filtering semantic equivalences, significantly reducing false negatives (cases where semantically correct answers would otherwise be wrongly rejected).&lt;/p&gt;

&lt;p&gt;The CoT-generating agent is the model we ultimately aim to train. During RL training, this agent receives positive rewards only when its final CoT-generated answer is semantically confirmed by the verifier to match the synthetic answer, thus ensuring it learns exclusively from likely-correct synthetic data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A code snippet to get started with the Loong Environment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The code snippet below shows a simplified version of how to use the Loong environment. Implementation details that are not conducive to improving the understanding on a cursory level have been omitted. For a detailed explanation on how to use the single step environment, please refer to this &lt;a href="https://github.com/camel-ai/loong/blob/main/cookbooks/env_with_generator.ipynb" rel="noopener noreferrer"&gt;cookbook.&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from camel.environments import SingleStepEnv
from camel.datasets import FewShotGenerator, StaticDataset
from camel.verifiers import PythonVerifier
from camel.agents import ChatAgent
from datasets import load_dataset

# Load and initialize a seed dataset
dataset = load_dataset("camel-ai/loong", split="graph_discrete_math")
seed_dataset = StaticDataset(dataset)

# Set up the verifier
verifier = PythonVerifier(required_packages=["numpy", "networkx"])

# Define a model backend to use for the generator
model = ...

# Set up synthetic data generation
generator = FewShotGenerator(seed_dataset=seed_dataset, verifier=verifier, model=model)

# Initialize the Loong environment
env = SingleStepEnv(generator, verifier)

# Define the agent that shall interact with the environment
agent = ChatAgent()

# Example environment interaction (note: the awaits below must run inside
# an async function, e.g. one driven by asyncio.run)
obs = await env.reset()
agent_response = agent.step(obs.question)  # a step for the agent
next_obs, reward, done, info = await env.step(agent_response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Contribute to Project Loong 🐉
&lt;/h2&gt;

&lt;p&gt;Researchers and developers can use the Loong environment to generate synthetic data across a variety of domains. We have already collected seed datasets for a few domains, including Mathematics, Graph Theory, Mathematical Programming, and Logic. The seed data, as well as cookbooks, can be found on &lt;a href="https://github.com/camel-ai/loong" rel="noopener noreferrer"&gt;Github&lt;/a&gt;. We have also unified and uploaded all the seed datasets we collected to HuggingFace: &lt;a href="https://huggingface.co/datasets/camel-ai/loong" rel="noopener noreferrer"&gt;check here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Additionally, we encourage you to collect your own seed datasets and leverage Loong to generate synthetic data for your domain.&lt;/p&gt;

&lt;p&gt;We are currently using the environment we built to post-train LLMs of different sizes, testing whether this improves their general as well as domain-specific reasoning capabilities. We are still experimenting with different reward setups, focusing mainly on accuracy rewards, following the approach of DeepSeek. More details, as well as our results, will be released in our upcoming preprint paper.&lt;/p&gt;

&lt;p&gt;At CAMEL, we believe that environments are a vital component for improving domain-specific agent reasoning. If a problem can be framed clearly within an environment, agents have the potential to master it autonomously.&lt;/p&gt;

&lt;p&gt;With Loong, we aim to address a key challenge in synthetic data generation: &lt;strong&gt;ensuring data quality through verifiability.&lt;/strong&gt; Our goal with Loong is to make it easier to build reliable reasoning datasets in domains where curated data is scarce.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We invite researchers and developers to contribute seed datasets, verifiers, and ideas to help improve and extend our project. &lt;em&gt;Ready to join? Click the &lt;a href="https://www.camel-ai.org/collaboration-questionnaire" rel="noopener noreferrer"&gt;link&lt;/a&gt; or paste it into your browser to apply now.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Scaling Environments for Agents</title>
      <dc:creator>Nomadev</dc:creator>
      <pubDate>Mon, 07 Apr 2025 12:13:04 +0000</pubDate>
      <link>https://forem.com/camelai/scaling-environments-for-agents-h0h</link>
      <guid>https://forem.com/camelai/scaling-environments-for-agents-h0h</guid>
<description>&lt;p&gt;At &lt;a href="https://www.camel-ai.org/" rel="noopener noreferrer"&gt;CAMEL-AI.org&lt;/a&gt;, we are committed to pushing the boundaries of artificial intelligence through multi-agent systems. This blog post restates our mission, discusses current limitations and trends of AI agents, and outlines our initiative to build environments for the data-driven future of AI agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.camel-ai.org/launchweek-environments#Initiative" rel="noopener noreferrer"&gt;Mission: Finding the Scaling Laws of Agents&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzk0tjscao7bw1hasmrz2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzk0tjscao7bw1hasmrz2.jpg" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our mission has always been clear and unwavering: to uncover the scaling laws of agents and build the foundational infrastructure for multi-agent systems that can drive the future of artificial intelligence. From the beginning, we have been committed to exploring how agents scale in complexity, environments, and evolution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dimensions of Scaling Laws of Agents
&lt;/h3&gt;

&lt;p&gt;We focus on three key dimensions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Number of Agents:&lt;/strong&gt; How do agents behave when scaled to large numbers? What emergent abilities arise from their interactions? We aim to study these phenomena and uncover patterns that reveal new capabilities as agent systems grow in scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Environments:&lt;/strong&gt; How do we create environments designed to enable agents to learn complex reasoning, long-term decision-making, adaptive behavior, and allow agents to acquire new knowledge or skills through interaction? Our focus is on developing environments that simulate real-world complexity while providing reward signals that effectively drive agent learning and evolution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Evolution:&lt;/strong&gt; How can agents evolve through interactions within their environment? We are building reinforcement learning environments and memory systems for agents to create agents that can generalize across tasks, adapt to new challenges, and continuously improve through experience.&lt;/p&gt;

&lt;p&gt;In this blog, we are focusing on the importance of scaling environments. Environments are not just containers for agent activity; they are essentially the missing data for agents that cannot be acquired simply by scraping the internet. Environments provide the dynamic, interactive contexts necessary for agents to learn adaptive behaviors and develop long-term decision-making capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Rise of End-to-End Reinforcement Learning for LLM Agents
&lt;/h2&gt;

&lt;p&gt;The initial approach to making AI agents functional relied heavily on prompt engineering by crafting specific instructions to guide LLM agents. This involved techniques like the following, combined in the short sketch after this list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Role-Based Prompts: Instructing agents to follow predefined roles or personas to simulate specific behavior.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Few-Shot Prompting: Providing examples within prompts to teach agents how to use tools or perform complex reasoning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Output Formatting: Using tricks to ensure models generate structured outputs, such as JSON responses.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
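
&lt;p&gt;For concreteness, here is a toy prompt (wording entirely hypothetical) that stacks all three techniques:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# A toy prompt combining the three techniques above (hypothetical wording):
prompt = (
    "You are a travel-booking assistant.\n"                 # role-based prompt
    "Example:\n"                                            # few-shot example
    "  User: Book a table for two in Paris.\n"
    '  Assistant: {"tool": "book_restaurant", "seats": 2}\n'
    "Respond ONLY with a JSON object of the same shape.\n"  # output formatting
    "User: Reserve a rental car in Berlin for Friday."
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;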

&lt;p&gt;While these techniques are effective for prototyping agent systems, they come with significant limitations that hinder robustness, adaptability, and scalability. Prompt-based agents often fail when encountering complex or unforeseen scenarios. Their rigid behavior patterns make them ill-suited for tasks requiring dynamic decision-making. Prompts can unintentionally introduce biases or lead to hallucinated outputs, especially when interacting with tools or external components. And crafting effective prompts for increasingly complex tasks requires significant expertise, time, and trial-and-error, making it difficult to scale across diverse applications.&lt;/p&gt;

&lt;p&gt;These challenges underscore the need for a paradigm shift—moving away from reliance on pure prompt engineering toward end-to-end reinforcement learning for LLM agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Prompt Engineering to End-to-End Autonomy
&lt;/h2&gt;

&lt;p&gt;End-to-end RL for LLM agents has been considered a promising direction for addressing the shortcomings of prompt engineering. These agents are trained holistically on tasks, rather than relying on manually crafted prompts for every scenario.&lt;/p&gt;

&lt;p&gt;Recent advancements in RL for LLM agents have emerged from leading research labs and startups. Notable examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OpenAI's Operator&lt;/strong&gt; combines GPT-4o's vision capabilities with reinforcement learning, allowing it to interpret screenshots, interact with GUIs effectively, and perform web-based tasks such as ordering groceries, booking reservations, and creating memes without requiring custom API integrations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OpenAI's Deep Research&lt;/strong&gt; leverages reinforcement learning to autonomously navigate complex browsing and reasoning tasks across diverse domains. Trained with end-to-end reinforcement learning, it plans and executes multi-step trajectories, backtracking and adapting to real-time information as necessary.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;xAI's Grok 3&lt;/strong&gt; was trained on the Colossus supercluster with ten times the computational power of previous models, and Grok 3 (Think) uses reinforcement learning to refine its chain-of-thought reasoning. It improves its problem-solving strategies by thinking for seconds to minutes, correcting errors, exploring alternatives, and delivering accurate answers across various tasks, including mathematics, coding, and world knowledge.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DeepSeek's R1 series&lt;/strong&gt; utilizes RL to develop advanced reasoning capabilities. Initially, DeepSeek-R1-Zero demonstrated that complex reasoning behaviors, such as extended chain-of-thought and self-correction, could emerge purely through RL without supervised fine-tuning. Building upon this foundation, DeepSeek-R1 incorporates a small "cold-start" dataset alongside iterative RL and supervised fine-tuning to enhance output coherence and user-friendliness while maintaining state-of-the-art reasoning performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As the field continues to evolve, we foresee an increasing number of vertical agent startups incorporating reinforcement learning to train LLM agents to tackle specific industry challenges. For instance, a recent post from the Cursor team, creators of an AI-powered code editor, indicates that Cursor AI is working on building RL models in real-world coding environments to automate coding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Environment is the Missing “Data” for Agents
&lt;/h2&gt;

&lt;p&gt;We are excited about the future of RL for LLM agents, as AI already matches human capabilities in many tasks. RL offers a promising path to achieving superhuman intelligence, and we may witness more "Lee Sedol moments," like AlphaGo’s historic victory, in the area of LLM agents across different domains. However, its full potential remains unrealized because the critical “data” for effective agent training is missing: realistic, standardized environments. While internet data may offer vast amounts of information, it lacks the interactive, adaptive, and diverse settings required for an agent to learn long-term decision-making through trial and error. Agents trained solely on static internet data struggle to understand temporal dynamics and complex cause-and-effect relationships in the real world.&lt;/p&gt;

&lt;p&gt;Equally challenging is the design of robust reward functions. Without carefully crafted reward signals, it becomes difficult to train agents to exhibit desired behaviors. Developing dedicated verifiers to assess LLM responses can be instrumental in defining reward functions that ensure reward signals remain reliable and aligned with long-term objectives.&lt;/p&gt;
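
&lt;p&gt;Concretely, a dedicated verifier can turn an open-ended LLM response into a scalar reward. A hedged sketch, with the verifier interface assumed for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical verifier-backed reward: a small shaping term for producing
# any output at all, plus the main accuracy reward when the verifier
# confirms semantic equivalence with the ground truth.
def compute_reward(response, ground_truth, verify):
    reward = 0.0
    if response.strip():                # minimal format check
        reward += 0.1
    if verify(response, ground_truth):  # assumed semantic-equivalence check
        reward += 1.0
    return reward
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;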

&lt;p&gt;At &lt;a href="http://camel-ai.org/" rel="noopener noreferrer"&gt;CAMEL-AI.org&lt;/a&gt;, we believe that overcoming the challenges of reinforcement learning for LLM agents requires a community-driven approach. Our open-source framework is designed to facilitate global collaboration among researchers and developers, enabling the creation of scalable environments and robust reward mechanisms. Thanks to our contributors, we already have the foundational building blocks in place, including &lt;a href="https://github.com/camel-ai/camel/tree/master/camel/environments" rel="noopener noreferrer"&gt;environments&lt;/a&gt;, &lt;a href="https://github.com/camel-ai/camel/tree/master/camel/verifiers" rel="noopener noreferrer"&gt;verifiers&lt;/a&gt;, &lt;a href="https://github.com/camel-ai/camel/tree/master/camel/datagen" rel="noopener noreferrer"&gt;data generation pipelines&lt;/a&gt;, and &lt;a href="https://github.com/camel-ai/camel/tree/master/camel/toolkits" rel="noopener noreferrer"&gt;toolkits&lt;/a&gt; that are essential for further development.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Fill out this &lt;a href="https://www.camel-ai.org/launchweek-environments" rel="noopener noreferrer"&gt;form&lt;/a&gt; and join us in shaping a future where reinforcement learning reaches its full potential.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://www.camel-ai.org/launchweek-environments#Initiative" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtz9ylvzjv414o0q28uk.png" alt="Image description" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>openai</category>
      <category>computerscience</category>
    </item>
  </channel>
</rss>
