<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mininglamp</title>
    <description>The latest articles on Forem by Mininglamp (@mininglamp).</description>
    <link>https://forem.com/mininglamp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3846168%2F6a138840-d665-4ba6-aedf-1b5c492035c4.png</url>
      <title>Forem: Mininglamp</title>
      <link>https://forem.com/mininglamp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mininglamp"/>
    <language>en</language>
    <item>
      <title>How Mano-P Achieves #1 on OSWorld: Architecture, Benchmarks, and Edge Deployment</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Tue, 14 Apr 2026 12:15:51 +0000</pubDate>
      <link>https://forem.com/mininglamp/how-mano-p-achieves-1-on-osworld-architecture-benchmarks-and-edge-deployment-4p81</link>
      <guid>https://forem.com/mininglamp/how-mano-p-achieves-1-on-osworld-architecture-benchmarks-and-edge-deployment-4p81</guid>
      <description>&lt;p&gt;Open-source GUI agents have been gaining traction, but most still rely on cloud inference, DOM parsing, or CLI hooks. Mano-P takes a different approach: pure vision-driven GUI automation that runs entirely on edge devices. And the benchmark results back it up — #1 on OSWorld among specialized models.&lt;/p&gt;

&lt;p&gt;This article breaks down the architecture, benchmark data, and edge deployment performance, all from the project's public README and technical report.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  OSWorld: #1 Among Specialized Models
&lt;/h3&gt;

&lt;p&gt;OSWorld is the standard benchmark for evaluating GUI agents on real computer tasks. Mano-P 1.0-72B achieves a &lt;strong&gt;58.2% success rate&lt;/strong&gt;, ranking first among all specialized GUI agent models. For context:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;OSWorld Success Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mano-P 1.0-72B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;58.2%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;opencua-72b&lt;/td&gt;
&lt;td&gt;45.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gap&lt;/td&gt;
&lt;td&gt;+13.2 percentage points&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's not a marginal improvement — it's a 29% relative gain over the second-place model.&lt;/p&gt;
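&lt;p&gt;The absolute and relative framings of the gap are easy to conflate; a two-line check using the table's numbers shows how they relate:&lt;/p&gt;

```python
# Absolute gap (percentage points) vs. relative gain, from the table above.
mano, runner_up = 58.2, 45.0
abs_gap = mano - runner_up                 # difference in success rates
rel_gain = (mano - runner_up) / runner_up  # improvement relative to 2nd place
print(f"{abs_gap:.1f} pp, {rel_gain:.0%} relative")  # 13.2 pp, 29% relative
```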

&lt;h3&gt;
  
  
  WebRetriever: Beating Cloud Giants
&lt;/h3&gt;

&lt;p&gt;On the WebRetriever Protocol I benchmark, Mano-P scores &lt;strong&gt;41.7 NavEval&lt;/strong&gt;, which puts it ahead of:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;NavEval Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mano-P 1.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;41.7&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Pro Computer Use&lt;/td&gt;
&lt;td&gt;40.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 4.5 Computer Use&lt;/td&gt;
&lt;td&gt;31.3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Worth noting: Gemini and Claude are cloud-based services with massive compute budgets. Mano-P achieves comparable or better results while running on local hardware.&lt;/p&gt;

&lt;h3&gt;
  
  
  13 Benchmarks, SOTA Across the Board
&lt;/h3&gt;

&lt;p&gt;Beyond OSWorld and WebRetriever, Mano-P holds SOTA positions across 13 benchmarks spanning GUI grounding, perception &amp;amp; cognition, context learning, and pruning efficiency. The full benchmark data is available in the &lt;a href="https://github.com/Mininglamp-AI/Mano-P#benchmark-performance" rel="noopener noreferrer"&gt;README&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture: Why Pure Vision?
&lt;/h2&gt;

&lt;p&gt;Most GUI agents fall into one of these categories:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;How It Works&lt;/th&gt;
&lt;th&gt;Limitation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DOM/HTML Parsing&lt;/td&gt;
&lt;td&gt;Read page structure directly&lt;/td&gt;
&lt;td&gt;Web-only, breaks on native apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CDP + CLI&lt;/td&gt;
&lt;td&gt;Chrome DevTools Protocol + shell commands&lt;/td&gt;
&lt;td&gt;Browser-dependent, fragile&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud Computer Use&lt;/td&gt;
&lt;td&gt;Send screenshots to cloud API&lt;/td&gt;
&lt;td&gt;Privacy concerns, latency, API costs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pure Vision (Mano-P)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;See the screen, understand it, act&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Requires capable on-device model&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Mano-P chose pure vision. No DOM access, no browser hooks, no platform-specific APIs. The model looks at the screen — the same pixels a human sees — and decides what to click, type, or scroll.&lt;/p&gt;

&lt;p&gt;This is harder to build, but the payoff is generality: the same model works across any GUI application, any platform, without integration work per app.&lt;/p&gt;

&lt;h2&gt;
  
  
  Training Methodology: Mano-Action
&lt;/h2&gt;

&lt;p&gt;The technical backbone is &lt;strong&gt;Mano-Action&lt;/strong&gt;, a bidirectional self-reinforcement learning framework. The training follows three stages:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Supervised Fine-Tuning (SFT)&lt;/strong&gt;&lt;br&gt;
Starting from a base vision-language model, fine-tune on curated GUI interaction datasets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2: Offline Reinforcement Learning&lt;/strong&gt;&lt;br&gt;
Learn from recorded interaction trajectories, optimizing action quality without live environment access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 3: Online Reinforcement Learning&lt;/strong&gt;&lt;br&gt;
The model interacts with real GUI environments, receiving feedback and iterating. This is where the "think-act-verify" loop reasoning mechanism comes in — the model plans an action, executes it, verifies the result, and adjusts.&lt;/p&gt;

&lt;p&gt;The bidirectional aspect means Text→Action and Action→Text consistency are both optimized, creating a tighter loop between understanding and execution.&lt;/p&gt;
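&lt;p&gt;The report doesn't publish the agent loop itself, but the think-act-verify description maps onto a simple control structure. A minimal sketch, with a toy environment standing in for real screenshots and actions (all names here are illustrative, not Mano-P's actual API):&lt;/p&gt;

```python
# Minimal sketch of a think-act-verify loop; the function names and the toy
# "typing" environment are illustrative, not Mano-P's real code.

def think_act_verify(observe, plan, execute, verify, max_steps=20):
    """Execute one step at a time: analyze, act, verify, re-plan."""
    trace = []
    for _ in range(max_steps):
        screen = observe()            # a screenshot in the real system
        step = plan(screen)           # model proposes next action, or None
        if step is None:              # planner decides the goal is met
            return True, trace
        execute(step)                 # click / type / scroll
        ok = verify(observe(), step)  # did the action have its effect?
        trace.append((step, ok))      # failed checks inform the next plan()
    return False, trace

# Toy environment: type a target string one character at a time.
state = {"text": ""}
goal = "hi"
observe = lambda: state["text"]
plan = lambda screen: None if screen == goal else ("type", goal[len(screen)])
execute = lambda step: state.update(text=state["text"] + step[1])
verify = lambda screen, step: screen.endswith(step[1])

done, trace = think_act_verify(observe, plan, execute, verify)
print(done, len(trace))  # True 2
```

&lt;p&gt;The key property is that each iteration re-plans from the freshest observation, so a failed or mis-targeted action is corrected on the next step rather than derailing a precomputed sequence.&lt;/p&gt;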
&lt;h2&gt;
  
  
  Edge Optimization: Running on Apple M4
&lt;/h2&gt;

&lt;p&gt;The 72B model delivers SOTA benchmarks, but the edge story is equally important. Through &lt;strong&gt;mixed-precision quantization&lt;/strong&gt; and a novel visual token pruning technique called &lt;strong&gt;GSPruning&lt;/strong&gt;, Mano-P achieves practical performance on consumer hardware.&lt;/p&gt;
&lt;h3&gt;
  
  
  GSPruning: Preserving What Matters
&lt;/h3&gt;

&lt;p&gt;GSPruning (Global Spatial Pruning) is designed specifically for vision-language models processing high-resolution interfaces. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Preserves global spatial structure through anchor points&lt;/li&gt;
&lt;li&gt;Identifies semantic outliers for critical UI elements&lt;/li&gt;
&lt;li&gt;Achieves 2-3× throughput speedup with minimal performance loss&lt;/li&gt;
&lt;/ul&gt;
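&lt;p&gt;The exact scoring and anchor-selection rules aren't published in the article, but the bullet points suggest a scheme along these lines (a hypothetical sketch, not the actual GSPruning algorithm):&lt;/p&gt;

```python
# Hypothetical sketch of anchor-preserving token pruning: keep evenly spaced
# "anchor" tokens for global spatial structure, then spend the rest of the
# budget on the highest-importance tokens. Not the real GSPruning method.
import random

def prune_tokens(scores, keep_ratio=0.25, n_anchors=16):
    """Return sorted indices of the tokens to keep."""
    n = len(scores)
    budget = max(int(n * keep_ratio), n_anchors)
    # anchors: evenly spaced positions that preserve global spatial layout
    anchors = {round(i * (n - 1) / (n_anchors - 1)) for i in range(n_anchors)}
    # fill the remaining budget with the highest-importance non-anchor tokens
    ranked = sorted((i for i in range(n) if i not in anchors),
                    key=lambda i: scores[i], reverse=True)
    return sorted(anchors | set(ranked[: budget - len(anchors)]))

random.seed(0)
scores = [random.random() for _ in range(1024)]  # stand-in importance scores
kept = prune_tokens(scores)
print(len(kept) / len(scores))  # 0.25
```

&lt;p&gt;In a real VLM the importance scores would presumably come from attention or saliency signals; the fixed anchors here only illustrate the "preserve global spatial structure" idea.&lt;/p&gt;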

&lt;p&gt;On the Online-Mind2Web benchmark, GSPruning at 25% token retention achieves a success rate of 0.400 on Qwen3VL-4B, compared to 0.425 at full tokens — roughly a 6% relative drop while running significantly faster.&lt;/p&gt;
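&lt;p&gt;Relative and absolute views of that pruning cost, from the numbers above:&lt;/p&gt;

```python
# Relative vs. absolute drop at 25% token retention (Online-Mind2Web numbers).
full_tokens, pruned = 0.425, 0.400
rel_drop = (full_tokens - pruned) / full_tokens
abs_drop = full_tokens - pruned
print(f"{rel_drop:.1%} relative, {abs_drop:.3f} absolute")  # 5.9% relative, 0.025 absolute
```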
&lt;h3&gt;
  
  
  M4 Pro Performance
&lt;/h3&gt;

&lt;p&gt;The 4B quantized model (w4a16) on Apple M4 Pro with 64GB RAM:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prefill Speed&lt;/td&gt;
&lt;td&gt;476 tokens/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decode Speed&lt;/td&gt;
&lt;td&gt;76 tokens/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peak Memory&lt;/td&gt;
&lt;td&gt;4.3 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prefill Time (4K context)&lt;/td&gt;
&lt;td&gt;8.6s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;4.3 GB peak memory means this runs comfortably alongside other applications. No dedicated GPU server required.&lt;/p&gt;
&lt;h3&gt;
  
  
  Hardware Requirements
&lt;/h3&gt;

&lt;p&gt;Two deployment options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Direct&lt;/strong&gt;: Mac mini or MacBook with Apple M4 chip, 32GB+ RAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Computing Stick&lt;/strong&gt;: Any Mac + Mano-P computing stick via USB 4.0+&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Data Privacy: The Edge Advantage
&lt;/h2&gt;

&lt;p&gt;In &lt;strong&gt;Local Mode&lt;/strong&gt;, all processing happens on-device:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Screenshots never leave the device&lt;/li&gt;
&lt;li&gt;✅ Task descriptions stay local&lt;/li&gt;
&lt;li&gt;✅ No cloud API calls&lt;/li&gt;
&lt;li&gt;✅ Full source code is open for audit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cloud Mode&lt;/strong&gt; is available as a fallback (screenshots sent to &lt;code&gt;mano.mininglamp.com&lt;/code&gt;), but the local-first architecture means sensitive workflows can run with zero data exposure.&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Three usage forms are available:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLI (for terminal users):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew tap HanningWang/tap
brew &lt;span class="nb"&gt;install &lt;/span&gt;mano-cua

mano-cua run &lt;span class="s2"&gt;"Open WeChat and tell FTY the meeting is postponed"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Python SDK (planned):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mano_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ManoClient&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ManoClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search for AI news on Xiaohongshu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;ClawHub Skill (for AI agents):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;clawhub &lt;span class="nb"&gt;install &lt;/span&gt;mano-cua
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Skill form is designed for AI agents like Claude Code or OpenClaw — the agent automatically invokes Mano-P when GUI operations are needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Mano-P is being released in three phases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phase 1&lt;/strong&gt; (now): Mano-CUA Skills — for agent enthusiasts to build CUA task workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 2&lt;/strong&gt; (coming): Local models + SDK — for developers with high security requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 3&lt;/strong&gt; (planned): Training methods + pruning techniques — for researchers who want to train their own GUI-VLA models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project is Apache 2.0 licensed. Full source, benchmarks, and documentation: &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;github.com/Mininglamp-AI/Mano-P&lt;/a&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>opensource</category>
      <category>agents</category>
      <category>benchmark</category>
    </item>
    <item>
      <title>Open-Sourcing Mano-P Today: Pure Vision GUI Agent, OSWorld #1, Apache 2.0</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Mon, 13 Apr 2026 10:29:19 +0000</pubDate>
      <link>https://forem.com/mininglamp/open-sourcing-mano-p-today-pure-vision-gui-agent-osworld-1-apache-20-3c0h</link>
      <guid>https://forem.com/mininglamp/open-sourcing-mano-p-today-pure-vision-gui-agent-osworld-1-apache-20-3c0h</guid>
      <description>&lt;h2&gt;
  
  
  Article
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Mano-P: Open-Sourcing the #1 GUI Agent on OSWorld — Pure Vision, On-Device, Apache 2.0
&lt;/h3&gt;

&lt;p&gt;Today we're open-sourcing &lt;strong&gt;Mano-P&lt;/strong&gt; under Apache 2.0.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub: &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;https://github.com/Mininglamp-AI/Mano-P&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Mano-P?
&lt;/h3&gt;

&lt;p&gt;Mano-P is a pure-vision GUI agent. It looks at your screen — literally a screenshot — understands the UI elements, and performs actions. No Chrome DevTools Protocol (CDP). No HTML parsing. No accessibility API. Just vision.&lt;/p&gt;

&lt;p&gt;The name: &lt;em&gt;Mano&lt;/em&gt; is Spanish for "hand." Current AI agents can interact with computers, but most of them operate like a lobster's claw — functional, but clumsy. Mano-P is designed to give agents a proper hand.&lt;/p&gt;

&lt;p&gt;The P stands for four things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Power&lt;/strong&gt; — 13 multimodal benchmark SOTAs, #1 among proprietary models on OSWorld&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Private&lt;/strong&gt; — Runs on-device, your data never leaves your machine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Public&lt;/strong&gt; — Apache 2.0, fully open source starting today&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personal&lt;/strong&gt; — A foundation for building your own personalized AI&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Does This Matter?
&lt;/h3&gt;

&lt;p&gt;There are broadly four approaches to GUI automation today:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Limitation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Traditional RPA&lt;/td&gt;
&lt;td&gt;UiPath&lt;/td&gt;
&lt;td&gt;Coordinate-based, breaks when UI changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Browser CUA&lt;/td&gt;
&lt;td&gt;CDP-based agents&lt;/td&gt;
&lt;td&gt;Limited to browsers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud Computer Use&lt;/td&gt;
&lt;td&gt;Claude CU, etc.&lt;/td&gt;
&lt;td&gt;Requires uploading screen data to the cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pure Vision GUI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Mano-P&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Requires a capable on-device model&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most current GUI agents either depend on browser protocols (limiting them to Chrome) or run in the cloud (requiring you to stream your screen to a remote server). Mano-P takes a different approach: it processes raw screen captures through a vision model to understand and interact with any GUI — desktop apps, browsers, 3D tools, professional software, anything with a graphical interface.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmarks
&lt;/h3&gt;

&lt;p&gt;Numbers first.&lt;/p&gt;

&lt;h4&gt;
  
  
  OSWorld: #1 Among Proprietary Models
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mano-P 72B: 58.2%&lt;/strong&gt; success rate&lt;/li&gt;
&lt;li&gt;Second place (opencua-72b): 45.0%&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gap: +13.2 percentage points&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Ranks &lt;strong&gt;5th overall&lt;/strong&gt; (the four models above it are 100B+ general-purpose LLMs like GPT, Claude, and Gemini)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  13 Multimodal Benchmark SOTAs
&lt;/h4&gt;

&lt;p&gt;Selected results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ScreenSpot-V2&lt;/td&gt;
&lt;td&gt;93.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MMBench&lt;/td&gt;
&lt;td&gt;87.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UI-Vision&lt;/td&gt;
&lt;td&gt;46.6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Web Navigation
&lt;/h4&gt;

&lt;p&gt;On WebRetriever Protocol I NavEval:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Mano-P: 41.7&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Gemini 2.5 Pro CU: 40.9&lt;/li&gt;
&lt;li&gt;Claude 4.5 CU: 31.3&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  On-Device Performance
&lt;/h3&gt;

&lt;p&gt;The 4B quantized model (w4a16) running on Apple M4 Pro:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prefill&lt;/td&gt;
&lt;td&gt;476 tokens/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decode&lt;/td&gt;
&lt;td&gt;76 tokens/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peak memory&lt;/td&gt;
&lt;td&gt;4.3 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Your screenshots, your instructions, your data — all processed locally. Nothing uploaded to any cloud server.&lt;/p&gt;

&lt;p&gt;For anyone dealing with sensitive environments (enterprise systems, personal data, financial interfaces), this isn't a nice-to-have. It's a requirement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Technical Highlights
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pure-Vision GUI Understanding&lt;/strong&gt;: No CDP, no DOM parsing. The model directly processes screen captures to identify UI elements, understand layout, and locate interaction targets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mano-Action Bidirectional Self-Enhancement&lt;/strong&gt;: Inspired by cycle consistency in GANs. The training loop runs in both directions — Text→Action and Action→Text — enforcing consistency between them. This significantly improves generalization to unseen UIs.&lt;/p&gt;
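&lt;p&gt;The article doesn't give the training objective, but a cycle-consistency loss in this spirit can be sketched with tiny linear maps standing in for the Text→Action and Action→Text models (illustrative only, not the actual Mano-Action loss):&lt;/p&gt;

```python
# Illustrative cycle-consistency objective: Text-to-Action (f) and
# Action-to-Text (g) should round-trip. 2x2 linear maps stand in for models.
def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(2)) for i in range(2)]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cycle_loss(f, g, text, action):
    forward  = mse(f(text), action)        # Text to Action
    backward = mse(g(action), text)        # Action to Text
    cycle_t  = mse(g(f(text)), text)       # Text, Action, back to Text
    cycle_a  = mse(f(g(action)), action)   # Action, Text, back to Action
    return forward + backward + cycle_t + cycle_a

# With g the exact inverse of f and action = f(text), the loss vanishes.
W     = [[2.0, 1.0], [1.0, 1.0]]           # det = 1, so the inverse is exact
W_inv = [[1.0, -1.0], [-1.0, 2.0]]
f = lambda v: matvec(W, v)
g = lambda v: matvec(W_inv, v)
text = [0.5, -1.5]
action = f(text)
print(cycle_loss(f, g, text, action))  # 0.0
```

&lt;p&gt;In the toy check the backward map is the exact inverse of the forward map, so every term vanishes; training pushes the two learned models toward that mutually consistent regime.&lt;/p&gt;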

&lt;p&gt;&lt;strong&gt;Three-Stage Training Pipeline&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;SFT&lt;/strong&gt; (Supervised Fine-Tuning): Learn basic screen-to-action mapping from human demonstrations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline RL&lt;/strong&gt;: Learn from collected trajectories (both successes and failures) without live interaction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Online RL&lt;/strong&gt;: Direct interaction with real GUI environments for continued policy improvement&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;GSPruning&lt;/strong&gt; (Guided Structural Pruning): Compresses visual token retention to &lt;strong&gt;approximately 25%&lt;/strong&gt;, boosting throughput &lt;strong&gt;2-3×&lt;/strong&gt; with minimal accuracy loss. This is what makes on-device inference practical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Think-Act-Verify Loop&lt;/strong&gt;: Instead of generating a full action sequence upfront, Mano-P executes one step at a time: analyze the screen → perform an action → verify the result → plan the next step. Far more robust than one-shot planning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Open-Source Roadmap
&lt;/h3&gt;

&lt;p&gt;We're releasing Mano-P in three stages:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Phase 1&lt;/td&gt;
&lt;td&gt;Mano-CUA Skill — ready to use&lt;/td&gt;
&lt;td&gt;✅ &lt;strong&gt;Today&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phase 2&lt;/td&gt;
&lt;td&gt;Local model + SDK — zero cloud dependency&lt;/td&gt;
&lt;td&gt;🔜 Coming soon&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phase 3&lt;/td&gt;
&lt;td&gt;Training methods + GSPruning + quantization&lt;/td&gt;
&lt;td&gt;📋 Planned&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The end goal: the entire stack — training, pruning, quantization, deployment — fully open to the community.&lt;/p&gt;

&lt;h3&gt;
  
  
  Get Started
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Option 1: CLI (mano-cua)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew tap HanningWang/tap &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; brew &lt;span class="nb"&gt;install &lt;/span&gt;mano-cua
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mano-cua run &lt;span class="s2"&gt;"Open WeChat and tell FTY the meeting is postponed"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Option 2: Agent Skill (mano-skill)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;clawhub &lt;span class="nb"&gt;install &lt;/span&gt;mano-cua
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once installed, your agent can autonomously invoke Mano-P for GUI operations — no manual triggering needed. The agent decides when GUI interaction is required and calls Mano-P automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 3: Python SDK (mano-client)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Coming soon. Watch the repo for updates.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Can You Do With It?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mano-afk: Fully Autonomous App Building&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Give it a natural language requirement. Mano-P autonomously handles the entire pipeline: requirements analysis → architecture → code → test → fix → verify. No human in the loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Personal AI: Learning Your Habits&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mano-P can learn and adapt to your personal operational style. A fun example: it can play mahjong following &lt;em&gt;your&lt;/em&gt; specific strategy — not the optimal strategy, but &lt;em&gt;your&lt;/em&gt; style. That's what Personal AI means in practice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why "Personal AI"?
&lt;/h3&gt;

&lt;p&gt;When a capable GUI agent runs on your local device, handles your data without uploading anything, and operates all the software you use daily — it stops being a "tool" and starts becoming a truly personal AI.&lt;/p&gt;

&lt;p&gt;With Mano-P open-sourced under Apache 2.0, every developer, every team, every organization can build their own version. This is the beginning.&lt;/p&gt;




&lt;h3&gt;
  
  
  ⭐ Star, Clone, Try It
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;https://github.com/Mininglamp-AI/Mano-P&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apache 2.0. Issues, PRs, and contributions welcome.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>automation</category>
    </item>
    <item>
      <title>SOTA on 13 Benchmarks: Mininglamp Open-Sources GUI-VLA Model Mano-P 1.0</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Mon, 13 Apr 2026 04:41:27 +0000</pubDate>
      <link>https://forem.com/mininglamp/open-sourcing-mano-p-today-pure-vision-gui-agent-osworld-1-apache-20-32eg</link>
      <guid>https://forem.com/mininglamp/open-sourcing-mano-p-today-pure-vision-gui-agent-osworld-1-apache-20-32eg</guid>
      <description>&lt;h1&gt;
  
  
  SOTA on 13 Benchmarks: Mininglamp Open-Sources GUI-VLA Model Mano-P 1.0
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Original article: &lt;a href="https://mp.weixin.qq.com/s/eWnQTvY0OiuHzujJE32kPw" rel="noopener noreferrer"&gt;WeChat (Chinese)&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;Mano-P&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Mininglamp Technology has officially open-sourced Mano-P 1.0, an in-house GUI-aware agent model. Mano-P handles GUI perception, understanding, planning, action, and verification — all through pure vision. It can directly understand and operate desktop applications, web interfaces, and complex graphical workflows, and it runs locally on Apple M4 devices.&lt;/p&gt;

&lt;p&gt;Mano-P moves AI beyond "look but don't touch." It executes complex tasks across platforms directly in real graphical interfaces. The project is released under Apache 2.0 with full source code available for audit, commercial use, and modification.&lt;/p&gt;

&lt;p&gt;By combining pure visual understanding with local execution, Mano-P enables individual developers and organizations to build personalized AI at low cost while maintaining full data sovereignty.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vision-Only: Solving the Last Mile for Complex Workflows
&lt;/h2&gt;

&lt;p&gt;Most automation today depends on underlying API calls, the Chrome DevTools Protocol (CDP), or HTML parsing. These approaches fall short when dealing with non-standard applications or cross-system workflows. Mano-P takes a fundamentally different approach: pure visual understanding as the core paradigm. It requires no external APIs or protocols, and can directly understand and operate desktop software, 3D applications, and specialized professional tools — breaking free from the browser-centric ecosystem.&lt;/p&gt;

&lt;p&gt;Mano-P also serves as an execution backbone for existing agent ecosystems. It integrates seamlessly as a skill into AI agents like OpenClaw. With this integration, agents can navigate across multiple windows and cross-application workflows, performing clicks, text input, window switching, and visual verification in a closed loop.&lt;/p&gt;

&lt;p&gt;This addresses a long-standing bottleneck in agent workflows: the need for human intervention. Mano-P enables not just automated build-and-test pipelines, but autonomous execution of complex business scenarios end-to-end.&lt;/p&gt;

&lt;h2&gt;
  
  
  SOTA on 13 Benchmarks: Raising the Bar for GUI-Specific Models
&lt;/h2&gt;

&lt;p&gt;Mano-P ships in two versions: a full 72B model that pushes the performance ceiling, and a 4B quantized model (w4a16) optimized for on-device deployment.&lt;/p&gt;

&lt;p&gt;The 72B version achieves SOTA results across 13 authoritative multimodal benchmarks, covering GUI Grounding, CUA (Computer Use Agent), multimodal perception and cognition, video understanding, and long-context learning. It sets a new performance standard for on-device GUI agents.&lt;/p&gt;

&lt;p&gt;On the OSWorld proprietary-model benchmark, Mano-P 72B reaches a 58.2% task success rate — ranking first among proprietary models and leading the second-place opencua-72b (45.0%) by 13.2 percentage points. It also tops ScreenSpot-V2, MMBench, UI-Vision, and other evaluation suites.&lt;/p&gt;

&lt;p&gt;These results are driven by architectural innovation. Mano-P uses a three-stage progressive training pipeline: SFT (supervised fine-tuning), offline reinforcement learning, and online reinforcement learning. Combined with a proprietary GSPruning visual token pruning technique, this delivers a significant leap in on-device inference efficiency.&lt;/p&gt;

&lt;p&gt;On Apple M4 Pro hardware, the 4B quantized model achieves 476 tokens/s prefill speed and 76 tokens/s decode speed, with peak memory usage of just 4.3 GB — well within the constraints of mainstream edge devices.&lt;/p&gt;

&lt;h2&gt;
  
  
  On-Device Deployment: Air-Gapped Data Protection
&lt;/h2&gt;

&lt;p&gt;As AI moves deeper into core business processes, data privacy and compliance become critical. Mano-P supports fully local on-device deployment with zero cloud data transmission. Its "pure vision + local execution" architecture enables physical isolation between data processing and external networks.&lt;/p&gt;

&lt;p&gt;In local mode, the model runs directly on Mac mini or MacBook (M4 chip or later, 32 GB+ RAM), or via a Mano-P compute stick connected over USB 4.0. Screenshots, business data, and task instructions all stay in a local closed loop, eliminating cloud transmission risk at the source.&lt;/p&gt;

&lt;p&gt;Mano-P also handles long-running tasks autonomously offline. Even without network access, it can independently drive complex business workflows, including mid-process decision-making and error correction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Full Open-Source Strategy: Accelerating the Personalized AI Ecosystem
&lt;/h2&gt;

&lt;p&gt;Mano-P is released under Apache 2.0 with complete client source code open for audit, commercial use, and derivative work.&lt;/p&gt;

&lt;p&gt;To lower the barrier to entry, Mano-P offers three ready-to-use deployment modes covering different tech stacks. No complex API key setup required — users can build high-performance GUI agents with minimal configuration.&lt;/p&gt;

&lt;p&gt;As a first step, Mininglamp is open-sourcing the Mano-CUA core skill. Users can configure it into OpenClaw or Claude Code to build smarter CUA task workflows and eliminate human-intervention bottlenecks.&lt;/p&gt;

&lt;p&gt;The Mano-CUA local model and SDK components are expected to be open-sourced within the month, targeting developers with high-security requirements. Users will be able to call locally deployed GUI-VLA models to build custom skills and tools, with all CUA operations executing on local Mac hardware — nothing uploaded to external servers.&lt;/p&gt;

&lt;p&gt;Looking ahead, Mininglamp plans to fully open-source the underlying training methods, token pruning techniques, and mixed-precision quantization schemes behind Mano-P, enabling developers to build custom local GUI-VLA models tailored to their own use cases.&lt;/p&gt;

&lt;p&gt;From technical breakthroughs to ecosystem building, Mano-P tightly integrates GUI perception, visual operation, local execution, and open-source collaboration. It establishes a solid technical foundation for on-device agents and charts a concrete path toward Personalized AI.&lt;/p&gt;




&lt;p&gt;GitHub: &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;https://github.com/Mininglamp-AI/Mano-P&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Original article (Chinese): &lt;a href="https://mp.weixin.qq.com/s/eWnQTvY0OiuHzujJE32kPw" rel="noopener noreferrer"&gt;https://mp.weixin.qq.com/s/eWnQTvY0OiuHzujJE32kPw&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>automation</category>
    </item>
    <item>
      <title>The Evolution of GUI Agents: From RPA Scripts to AI That Sees Your Screen</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Thu, 09 Apr 2026 07:12:14 +0000</pubDate>
      <link>https://forem.com/mininglamp/the-evolution-of-gui-agents-from-rpa-scripts-to-ai-that-sees-your-screen-4mkc</link>
      <guid>https://forem.com/mininglamp/the-evolution-of-gui-agents-from-rpa-scripts-to-ai-that-sees-your-screen-4mkc</guid>
      <description>&lt;p&gt;In 2020, if you wanted to automate a desktop app, you'd write an RPA script — record mouse movements, hardcode coordinates, and pray the UI never changed.&lt;/p&gt;

&lt;p&gt;In 2024, if you wanted an AI to operate a browser, you'd use a CDP-based agent — one that reads the DOM, parses HTML, and executes tasks inside Chrome.&lt;/p&gt;

&lt;p&gt;In 2026, there's a model that looks at a screenshot, understands the interface, and clicks, types, and switches windows like a human — no API needed, no HTML parsing, no knowledge of the underlying tech stack.&lt;/p&gt;

&lt;p&gt;These three stages represent three paradigm shifts in GUI automation over the past few years.&lt;/p&gt;

&lt;p&gt;Let's break down how we got here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generation 1: RPA — Record and Replay
&lt;/h2&gt;

&lt;p&gt;Traditional RPA (UiPath, Blue Prism, Automation Anywhere) boils down to one idea: record what a human does, then replay it.&lt;/p&gt;

&lt;p&gt;Under the hood, it's simulating mouse and keyboard events at the OS level. Early versions used coordinate-based targeting — change the resolution and everything breaks. Later iterations added control tree recognition (Windows UI Automation, macOS Accessibility API) and image matching.&lt;/p&gt;
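The record-and-replay idea can be sketched in a few lines of plain Python. This is an illustrative simulation, not any real RPA engine's API: events are stored as raw coordinates, which is exactly why the approach is brittle.

```python
# Minimal sketch of Generation-1 "record and replay": events are stored
# as hardcoded coordinates, so any layout change silently invalidates
# the script. All names here are illustrative, not from a real product.

RECORDING = [
    ("click", 412, 305),   # "Submit" button position at recording time
    ("type", "hello"),     # typed into whatever field happens to have focus
    ("click", 412, 360),   # "Confirm" button position at recording time
]

def replay(recording, dispatch):
    """Replay recorded events through a dispatch callable.

    In a real engine, `dispatch` would wrap OS-level mouse/keyboard
    injection; here it can be any function, which keeps this testable.
    """
    for event in recording:
        dispatch(*event)

# The brittleness in one sentence: if the UI shifts by a few pixels, the
# same coordinates now hit the wrong control, and replay() cannot notice.
```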

&lt;p&gt;RPA still powers automation at banks, insurance companies, and government systems today. But for developers, it has structural problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Brittle&lt;/strong&gt;: Change one pixel in the UI and the script breaks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero understanding&lt;/strong&gt;: It doesn't know &lt;em&gt;what&lt;/em&gt; it's doing — just mechanically repeating&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High maintenance&lt;/strong&gt;: Every UI change requires re-recording&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited scope&lt;/strong&gt;: Cross-application, cross-platform workflows are painful&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RPA was always "automation for non-technical users," not something that excited developers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generation 2: Browser CUA — DOM-Based Agents
&lt;/h2&gt;

&lt;p&gt;In 2024–2025, LLMs got good enough to understand web pages. A new class of solutions emerged:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use Chrome DevTools Protocol (CDP) to grab the page DOM&lt;/li&gt;
&lt;li&gt;Feed DOM/HTML fragments to an LLM for comprehension&lt;/li&gt;
&lt;li&gt;LLM outputs action instructions (click element X, fill form Y)&lt;/li&gt;
&lt;li&gt;Execute via CDP&lt;/li&gt;
&lt;/ol&gt;
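The four steps above can be sketched as a single loop iteration, with the CDP calls and the LLM stubbed out. Every function name here is illustrative; a real agent would make network calls at steps 1 and 2–3.

```python
# Sketch of the Browser-CUA loop: DOM in, action out. Stubs stand in
# for CDP and the cloud LLM so the control flow is visible and testable.

def get_dom_snapshot():
    """Stand-in for a CDP DOM snapshot call."""
    return '<button id="buy">Buy now</button><input id="qty" value="">'

def ask_llm(dom, task):
    """Stand-in for a cloud LLM call: maps DOM + task to an action.

    A real agent ships the (possibly huge) DOM to a hosted model; here
    the decision is hardcoded to keep the sketch self-contained.
    """
    if "buy" in task.lower() and 'id="buy"' in dom:
        return {"action": "click", "selector": "#buy"}
    return {"action": "noop"}

def run_step(task):
    dom = get_dom_snapshot()          # 1. grab the DOM via CDP
    decision = ask_llm(dom, task)     # 2-3. LLM reads DOM, emits an action
    return decision                   # 4. would be executed via CDP

# The structural limits are visible here: this only works where a DOM
# exists, and the DOM (with any sensitive content) leaves the machine.
```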

&lt;p&gt;The improvement was real: LLMs brought &lt;em&gt;understanding&lt;/em&gt; instead of mechanical replay. But the limitations were equally clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Locked inside the browser&lt;/strong&gt;: CDP is a Chrome protocol. Desktop apps, native apps, games, 3D tools — none of them work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Depends on HTML structure&lt;/strong&gt;: Complex or dynamically rendered pages produce massive, unreliable DOM trees&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data security&lt;/strong&gt;: DOM content (including your login state and sensitive data) gets sent to a cloud LLM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For developers, this solved "browser automation" but not "general GUI automation."&lt;/p&gt;

&lt;h2&gt;
  
  
  Generation 3: Pure-Vision GUI Agents — See the Screen, Not the Code
&lt;/h2&gt;

&lt;p&gt;Starting in late 2025, a fundamentally different approach matured: models that take a screenshot as input and output actions like "click at (x, y)" or "type 'hello world'."&lt;/p&gt;

&lt;p&gt;The key difference from everything before: &lt;strong&gt;no dependency on any underlying protocol or interface.&lt;/strong&gt; No CDP, no Accessibility API, no need to know what framework the app was built with. Input is a screenshot. Output is an action.&lt;/p&gt;

&lt;p&gt;Coverage is theoretically unlimited — any application with a graphical interface can be operated. Desktop software, browsers, games, 3D modeling tools, even apps inside a remote desktop session.&lt;/p&gt;
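The contract is narrow enough to write down: screenshot bytes in, one action out. The action schema and model stub below are assumptions for illustration, not Mano-P's actual interface.

```python
# Pure-vision interface in a nutshell: pixels in, one action out.
# No DOM, no accessibility tree, no app-specific protocol.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str                 # "click" | "type" | "key" (illustrative schema)
    x: int = 0
    y: int = 0
    text: str = ""

def fake_vla_model(screenshot: bytes, instruction: str) -> Action:
    """Stand-in for a GUI-VLA forward pass (grounding + action head).

    A real model localizes the target element from pixels alone; this
    stub returns a fixed click so the contract stays visible.
    """
    return Action(kind="click", x=640, y=360)

def step(grab_screen, instruction):
    shot = grab_screen()                       # pixels only
    return fake_vla_model(shot, instruction)   # e.g. click at (x, y)
```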

&lt;p&gt;The technical challenges are significant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GUI Grounding&lt;/strong&gt;: The model needs to precisely locate and understand interface elements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-step planning&lt;/strong&gt;: Complex tasks require sequences of actions with memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error recovery&lt;/strong&gt;: When something goes wrong, the model needs to detect the anomaly and self-correct&lt;/li&gt;
&lt;/ul&gt;
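The three challenges above compose into one control loop: grounding is delegated to the model, memory is the action history, and error recovery is a verify-then-retry step. This is a generic sketch, not Mano-P's actual executor.

```python
# Generic multi-step agent loop with memory and error recovery.
# `model`, `grab_screen`, and `verify` are caller-supplied stand-ins.

def run_task(model, grab_screen, verify, max_steps=5):
    history = []                       # multi-step memory
    for _ in range(max_steps):
        action = model(grab_screen(), history)
        history.append(action)
        if verify():                   # did the screen reach the goal state?
            return history             # success: return the action trace
        # Otherwise loop again: the model sees its own history alongside
        # the fresh screenshot and can self-correct on the next step.
    return None                        # gave up after max_steps
```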

&lt;p&gt;This approach splits into two paths — &lt;strong&gt;cloud&lt;/strong&gt; (screenshots sent to remote servers) and &lt;strong&gt;on-device&lt;/strong&gt; (inference runs locally). Same technique, completely different data flow.&lt;/p&gt;

&lt;h2&gt;
  
  
  On-Device Pure-Vision: Where It Gets Interesting
&lt;/h2&gt;

&lt;p&gt;Let me use a concrete example to show where on-device GUI agents stand today.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;Mano-P 1.0&lt;/a&gt; is a GUI-VLA (Vision-Language-Action) agent model purpose-built for on-device deployment. Pure vision, no CDP, no HTML parsing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmark results
&lt;/h3&gt;

&lt;p&gt;On &lt;a href="https://os-world.github.io/" rel="noopener noreferrer"&gt;OSWorld&lt;/a&gt; — the academic community's standard benchmark for desktop GUI agents — the Mano-P 72B model achieved a &lt;strong&gt;58.2% success rate&lt;/strong&gt;, ranking &lt;strong&gt;#1 among specialized models globally&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For context: the other four models in the top 5 are all 100B+ general-purpose models. A 72B model purpose-built for GUI scenarios beating them says something about the efficiency of specialized models vs. the brute-force approach.&lt;/p&gt;

&lt;p&gt;Across a broader evaluation, Mano-P hit SOTA on &lt;strong&gt;13 benchmark leaderboards&lt;/strong&gt; spanning GUI grounding, perception, video understanding, and in-context learning.&lt;/p&gt;

&lt;h3&gt;
  
  
  On-device performance
&lt;/h3&gt;

&lt;p&gt;The 4B quantized model (w4a16) runs at &lt;strong&gt;476 tokens/s prefill, 76 tokens/s decode&lt;/strong&gt; on Apple M4 Pro, with peak memory of just &lt;strong&gt;4.3GB&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That means on an M4 Mac mini or MacBook with 32GB RAM, you can run an OSWorld-champion-level GUI agent &lt;strong&gt;entirely on-device&lt;/strong&gt;. No data ever leaves your machine.&lt;/p&gt;
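A quick back-of-envelope from the published throughput numbers shows what those rates mean per agent step. The token counts for a "typical" step are assumptions for illustration:

```python
# Latency estimate from the reported M4 Pro numbers for the 4B w4a16
# model: 476 tok/s prefill, 76 tok/s decode.
PREFILL_TPS = 476
DECODE_TPS = 76

def step_latency(prompt_tokens, output_tokens):
    """Seconds for one agent step: prefill the prompt, decode the action."""
    return prompt_tokens / PREFILL_TPS + output_tokens / DECODE_TPS

# Assuming a 2,000-token screenshot+instruction prompt and a 30-token
# action output: ~4.2s prefill + ~0.4s decode, i.e. a few seconds per step.
latency = step_latency(2000, 30)
```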

&lt;p&gt;One command to install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;mano-cua
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No API key. No cloud config. No worrying about where your screenshots end up.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Comparison Table Developers Actually Want
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Traditional RPA&lt;/th&gt;
&lt;th&gt;Browser CUA&lt;/th&gt;
&lt;th&gt;Cloud Computer Use&lt;/th&gt;
&lt;th&gt;On-Device GUI Agent (Mano-P)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Perception&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Coordinates / control tree / image matching&lt;/td&gt;
&lt;td&gt;DOM / HTML parsing&lt;/td&gt;
&lt;td&gt;Cloud screenshot + vision model&lt;/td&gt;
&lt;td&gt;Local screenshot + vision model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Coverage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single app&lt;/td&gt;
&lt;td&gt;Browser only&lt;/td&gt;
&lt;td&gt;Theoretically all platforms&lt;/td&gt;
&lt;td&gt;All platforms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Understanding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Yes (HTML-based)&lt;/td&gt;
&lt;td&gt;Yes (vision-based)&lt;/td&gt;
&lt;td&gt;Yes (vision-based)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data flow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local&lt;/td&gt;
&lt;td&gt;DOM sent to cloud&lt;/td&gt;
&lt;td&gt;Screenshots uploaded to cloud&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Data never leaves device&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Robustness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low (breaks on UI change)&lt;/td&gt;
&lt;td&gt;Medium (depends on DOM stability)&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local RPA engine&lt;/td&gt;
&lt;td&gt;Browser + API&lt;/td&gt;
&lt;td&gt;Cloud API + network&lt;/td&gt;
&lt;td&gt;Local device (e.g., M4 Mac + 32GB)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;There's a frequently overlooked distinction here: cloud Computer Use and on-device GUI agents use the same technique (pure vision), but the data flow is completely different.&lt;/p&gt;

&lt;p&gt;Cloud solutions send your screenshots — everything on your screen, including code, emails, and credentials — to a remote server. For many developers, that's a non-starter.&lt;/p&gt;

&lt;p&gt;On-device solutions run inference locally. Screenshots processed locally. Actions executed locally. This isn't "we added encryption" level security — it's &lt;strong&gt;physically eliminating the possibility of data leakage&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why On-Device Only Became Possible Now
&lt;/h2&gt;

&lt;p&gt;Two changes made this viable:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware&lt;/strong&gt;: Apple's M4 unified memory architecture gave consumer devices the foundation to run medium-scale models. M4 + 32GB unified memory + high-bandwidth memory bus — this was workstation-grade hardware two years ago.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model compression&lt;/strong&gt;: Mano-P's GSPruning visual token pruning + w4a16 quantization keeps the 4B model at 4.3GB peak memory with 476 tokens/s throughput. That's a fully usable inference speed.&lt;/p&gt;
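The 4.3GB figure is roughly what w4a16 predicts: 4-bit weights for ~4B parameters plus runtime overhead. The parameter count and overhead split below are assumptions, not published breakdowns:

```python
# Sanity check: are 4-bit weights consistent with a 4.3GB peak?
PARAMS = 4e9          # ~4B parameters (assumed from the model name)
BITS_PER_WEIGHT = 4   # w4a16: 4-bit weights, 16-bit activations

weights_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9   # bytes -> GB: ~2.0 GB
# The remaining ~2.3 GB of the reported 4.3 GB peak would cover 16-bit
# activations, the KV cache, and runtime buffers.
```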

&lt;h2&gt;
  
  
  What's the Endgame?
&lt;/h2&gt;

&lt;p&gt;When an AI agent can see any screen, understand intent, and operate any graphical interface, it has &lt;strong&gt;the same software-usage capability as a human user&lt;/strong&gt;. It doesn't need APIs, doesn't wait for integrations, doesn't learn each tool's SDK.&lt;/p&gt;

&lt;p&gt;The implications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Long-tail software gets activated&lt;/strong&gt;: Millions of professional tools with no API can suddenly be operated by agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-application workflows become possible&lt;/strong&gt;: Design in Figma, compile in Terminal, deploy in browser — all via GUI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The walls between software break down&lt;/strong&gt;: No data export/import needed — the agent just operates at the interface level&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With benchmark scores above 50% on complex desktop tasks, we're watching GUI agents cross from "lab demo" to "developer-usable."&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Mano-P 1.0 is open source under Apache 2.0.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;mano-cua
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;github.com/Mininglamp-AI/Mano-P&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your take — is on-device the right path for GUI agents, or is cloud compute still the pragmatic choice? Drop your thoughts below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>automation</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
