aimodels-fyi

Posted on • Originally published at aimodels.fyi

UFO2: Desktop Automation Evolved Beyond Brittle Prototypes

This is a Plain English Papers summary of a research paper called UFO2: Desktop Automation Evolved Beyond Brittle Prototypes. If you like these kinds of analyses, you should join AImodels.fyi or follow us on Twitter.

From Prototypes to Practicality: Introducing UFO2 as a Desktop AgentOS

Computer-Using Agents (CUAs) have emerged as a promising approach for automating complex desktop workflows through natural language. However, despite their potential, most existing CUAs remain conceptual prototypes with significant limitations. They suffer from shallow OS integration, relying on screenshot-based interactions that prove fragile in real-world scenarios. Furthermore, these agents disrupt user workflows by monopolizing the desktop during automation tasks.

UFO2 addresses these fundamental issues by reimagining desktop automation as a first-class operating system abstraction. Unlike previous CUAs that treat automation as a layer atop screenshots, UFO2 is architected as a deeply integrated, multiagent execution environment that embeds OS capabilities, application-specific introspection, and domain-aware planning into the core automation loop.

Comparison between traditional CUAs and the UFO2 approach. While existing CUAs rely solely on screenshot-based interaction, UFO2 provides deep OS integration through specialized agents and API access.

This transformation from prototype to practical system represents a crucial advancement in the field of desktop automation, offering significantly improved robustness, efficiency, and user experience.

The Evolution of Desktop Automation

The Fragility of Traditional Desktop Automation

For decades, desktop automation has relied on brittle techniques to replicate human interactions with GUI-based applications. Commercial Robotic Process Automation (RPA) tools like UiPath, Automation Anywhere, and Microsoft Power Automate operate by recording and replaying mouse movements, keystrokes, or rule-based scripts. These systems rely heavily on surface-level GUI cues (pixel regions, window titles), offering little introspection into application state.

While widely deployed in enterprise settings, traditional RPA systems exhibit poor robustness and scalability. Even minor UI updates—such as reordering buttons or relabeling menus—can silently break automation scripts. Maintaining correctness requires frequent human intervention. Furthermore, these tools lack semantic understanding of application workflows and cannot reason about or adapt to novel tasks. As a result, RPA tools remain constrained to narrow, repetitive workflows in stable environments, far from general-purpose automation.

Rise of Computer-Using Agents

Recent advances in large language models (LLMs) and multimodal perception have enabled a new class of automation systems, referred to as Computer-Using Agents (CUAs). CUAs aim to generalize across applications and tasks by leveraging LLMs to interpret user instructions, perceive GUI layouts, and synthesize actions such as clicks and keystrokes. Early CUAs like UFO demonstrated that multimodal models (e.g., GPT-4V) could map natural language requests to sequences of GUI actions with no hand-crafted scripts. More recent industry prototypes, including Claude-3.5 (Computer Use) and OpenAI Operator, have pushed the envelope further, performing realistic desktop workflows across multiple applications.

These CUAs represent a promising evolution from static RPA scripts to adaptive, general-purpose automation. However, despite their sophistication, current CUAs largely remain research prototypes, constrained by architectural and systems-level limitations that impede real-world deployment.

Systems Challenges in CUAs

Current CUAs fall short in three fundamental ways, which stem from missing operating system abstractions:

Lack of OS-Level Integration: Most CUAs interact with the system through screenshots and low-level input emulation (mouse and keyboard events). They ignore rich system interfaces such as accessibility APIs, application process state, and native inter-process communication mechanisms. This superficial interaction model limits reliability and efficiency—every action must be inferred from pixels rather than structured state.

Absence of Application Introspection: CUAs typically operate as generalists with limited awareness of application-specific capabilities. They treat all interfaces uniformly, lacking the ability to leverage built-in APIs or vendor documentation. As a result, they cannot reason over high-level concepts unless such flows are explicitly embedded in the model. This rigidity limits their generalization and makes maintenance expensive.

Disruptive and Unsafe Execution Model: Most CUAs drive automation directly on the user's desktop session, hijacking the real mouse and keyboard. This design prevents users from interacting with their system during execution, introduces interference risk, and violates isolation principles fundamental to safe system design. Long-running tasks—especially those involving multiple LLM queries—can monopolize the session for minutes at a time.

Missing Abstraction: OS Support for Automation

Despite growing demand for intelligent, language-driven automation, existing operating systems offer no first-class abstraction for exposing GUI application control to external agents. In contrast to system calls, files, or sockets, GUI workflows remain opaque and non-programmable. As a result, both RPA and CUA systems are forced to operate as ad-hoc layers atop the GUI, with no unified substrate for execution, coordination, or introspection.

UFO2 argues that automation should be elevated to a system primitive. It addresses these limitations by embedding automation as a deeply integrated OS abstraction—exposing GUI controls, application APIs, and task orchestration as programmable, inspectable, and composable system services.

Inside UFO2: A New Architecture for Desktop Automation

UFO2 as a System Substrate for Automation

Overview of UFO2's architecture showing the central HostAgent coordinating with specialized AppAgents across multiple applications.

UFO2 provides a structured runtime environment for task-oriented automation on Windows desktops. Deployed as a local daemon, UFO2 enables users to issue natural language requests that are translated into coordinated workflows spanning multiple GUI applications. The system provides core abstractions for orchestration, introspection, control execution, and agent collaboration—exposing these as system-level services analogous to those in traditional OSes.

At the heart of UFO2 is a central control plane, the HostAgent, responsible for parsing user intent, managing system state, and dispatching subtasks to a collection of specialized runtime modules called AppAgents. Each AppAgent is dedicated to a particular application (e.g., Excel, Outlook, File Explorer) and encapsulates all logic needed to observe and control that application, including API bindings, UI detectors, and knowledge bases. These modules act as isolated execution contexts with application-specific semantics.

Upon receiving a user request, HostAgent decomposes it into a series of subtasks, each mapped to the application best suited to fulfill it. If the corresponding application is not already running, HostAgent launches it using native Windows APIs and instantiates the corresponding AppAgent. Execution proceeds through a structured loop: each AppAgent continuously observes the application state (via accessibility APIs and vision-based detectors), reasons about the next operation using a ReAct-style planning cycle, and invokes the appropriate action—either a GUI event or a native API call. This loop continues until the subtask terminates, either successfully or due to an unrecoverable error.
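The decompose-and-dispatch flow above can be sketched in Python. The class and field names here are illustrative, not UFO2's actual API, and each AppAgent's internal observe-reason-act loop is collapsed into a single call:

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    description: str
    application: str        # e.g. "notepad", "excel" -- illustrative names
    status: str = "PENDING"

class HostAgent:
    """Minimal sketch of the dispatch loop described above (names hypothetical)."""

    def __init__(self, app_agents):
        self.app_agents = app_agents   # application name -> AppAgent-like callable
        self.blackboard = []           # shared, append-only memory

    def run(self, subtasks):
        for task in subtasks:
            agent = self.app_agents.get(task.application)
            if agent is None:
                task.status = "FAIL"   # no runtime available for this application
                continue
            # Each AppAgent runs its own observe-reason-act loop internally;
            # here that loop is collapsed into a single call returning a result.
            result = agent(task.description)
            task.status = "FINISH"
            self.blackboard.append((task.application, result))
        return subtasks
```

Results land on the shared blackboard, so a later subtask in another application can read what an earlier one produced.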

UFO2 implements shared memory and control flow via a global blackboard interface, allowing HostAgent and AppAgents to exchange intermediate results, dependency states, and execution metadata. This architecture supports complex workflows across application boundaries—for instance, extracting data from a spreadsheet and using it to populate fields in a web form—without requiring hand-crafted scripts or coordination logic. Crucially, all interactions occur within a virtualized, picture-in-picture (PiP) desktop environment, ensuring process-level isolation and safe multi-application concurrency.

UFO2 adopts a centralized multiagent runtime to support both reliability and extensibility. The centralized HostAgent acts as a control plane, simplifying task-level orchestration, error handling, and lifecycle management. Meanwhile, each AppAgent is architected as a loosely coupled executor that encapsulates deep, application-specific functionality.

This design creates a scalable, pluggable runtime substrate for GUI automation—abstracting away the complexity of heterogeneous interfaces and providing a unified system interface to structured application behavior.

HostAgent: System-Level Orchestration and Execution Control

The HostAgent architecture showing its role as a control-plane orchestrator that manages task decomposition and AppAgent coordination.

The HostAgent serves as the centralized control plane of UFO2. It is responsible for interpreting user-specified goals, decomposing them into structured subtasks, instantiating and dispatching AppAgent modules, and coordinating their progress across the system. HostAgent provides system-level services for introspection, planning, application lifecycle management, and multi-agent synchronization.

Operating atop the native Windows substrate, HostAgent monitors active applications, issues shell commands to spawn new processes as needed, and manages the creation and teardown of application-specific AppAgent instances. All coordination occurs through a persistent state machine, which governs the transitions across execution phases.

The finite-state machine that governs HostAgent's behavior, showing transitions between CONTINUE, ASSIGN, PENDING, FINISH, and FAIL states.

The HostAgent maintains two classes of persistent state: Private State (tracks user intent, plan progress, and the control flow of the current session) and Shared Blackboard (a concurrent, append-only memory space that facilitates transparent agent communication). This separation ensures that local context remains encapsulated, while global coordination is visible and consistent across the system.
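The state machine can be sketched as a transition table over the five named states. The states come from the paper; the exact edges below are a plausible guess, not taken from the source:

```python
from enum import Enum

class HostState(Enum):
    CONTINUE = "continue"
    ASSIGN = "assign"
    PENDING = "pending"
    FINISH = "finish"
    FAIL = "fail"

# Illustrative transition table: the paper names the states, but these
# particular edges are an assumption for the sketch.
TRANSITIONS = {
    HostState.CONTINUE: {HostState.ASSIGN, HostState.PENDING,
                         HostState.FINISH, HostState.FAIL},
    HostState.ASSIGN:   {HostState.CONTINUE, HostState.FAIL},
    HostState.PENDING:  {HostState.CONTINUE, HostState.FAIL},
    HostState.FINISH:   set(),   # terminal
    HostState.FAIL:     set(),   # terminal
}

def step(current: HostState, nxt: HostState) -> HostState:
    """Advance the FSM, rejecting transitions the table does not allow."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {nxt}")
    return nxt
```

Encoding the legal edges explicitly is what lets the orchestrator treat an unexpected transition as an error rather than silently continuing.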

By abstracting away the complexity of managing concurrent, stateful, cross-application workflows in desktop environments, HostAgent enables modular execution, coordinated progress, and robust task lifecycle management.

AppAgent: Application-Specialized Execution Runtime

The AppAgent architecture showing its perception layer, reasoning process, and execution components for application-specific automation.

The AppAgent is the core execution runtime in UFO2, responsible for carrying out individual subtasks within a specific Windows application. Each AppAgent functions as an isolated, application-specialized worker process launched and orchestrated by the central HostAgent. Unlike monolithic CUAs that treat all GUI contexts uniformly, each AppAgent is tailored to a single application and operates with deep knowledge of its API surface, control semantics, and domain logic.

Upon receiving a subtask and execution context from the HostAgent, the AppAgent initializes a ReAct-style control loop, where it iteratively senses the current application state, reasons about the next step, and executes either a GUI or API-based action. This hybrid execution layer—implemented via a Puppeteer interface—enables reliable control over dynamic and complex UIs by favoring structured APIs whenever available, while retaining fallback to GUI-based interaction when necessary.

Each AppAgent fuses multiple streams of perception:

  • Visual Input: Captures GUI screenshots for layout understanding and control grounding
  • Semantic Metadata: Extracted from Windows UIA APIs, including control type, label, hierarchy, and enabled state
  • Symbolic Annotation: Uses Set-of-Mark (SoM) techniques to annotate controls directly on screenshots

Based on this state, the AppAgent produces a structured output specifying the target control, action type, arguments, reasoning trace, and its current state in the local FSM.
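A minimal sketch of such a structured output, with field names and values invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class AgentAction:
    """One structured step emitted by an AppAgent (field names illustrative)."""
    control_id: str   # target control, e.g. a Set-of-Mark annotation label
    operation: str    # action type, e.g. "click", "set_text", or an API name
    args: dict        # operation arguments
    thought: str      # reasoning trace behind the choice
    state: str        # next local FSM state: CONTINUE / PENDING / FINISH / FAIL

action = AgentAction(
    control_id="37",
    operation="set_text",
    args={"text": "Q3 totals"},
    thought="The cell is empty; fill it before saving.",
    state="CONTINUE",
)
```

Because the output is structured rather than free text, the runtime can validate the target control and FSM state before anything touches the GUI.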

The finite-state machine controlling AppAgent execution, showing transitions between CONTINUE, PENDING, FINISH, and FAIL states.

Each AppAgent maintains a local finite-state machine that governs its behavior within the assigned application context. This bounded execution model isolates failures to the current task and enables safe preemption, retry, or delegation. The FSM also supports interruptible workflows, which can resume from intermediate checkpoints.

To support rapid onboarding of new applications, UFO2 exposes an SDK that encapsulates the development and maintenance of AppAgents. Developers can register application-specific APIs via a declarative interface, including function metadata, argument schemas, and prompt bindings. Domain-specific help documents and patch notes can be ingested into a searchable knowledge base that agents query at runtime.

As a per-application execution runtime, each AppAgent provides modular, domain-aware control that surpasses generic GUI agents in both efficiency and robustness. Its hybrid perception-action loop, plugin-based extensibility, and local fault containment enable UFO2 to scale to large application ecosystems with minimal system-wide disruption.

Hybrid Control Detection

The hybrid control detection pipeline that fuses UIA metadata with vision-based grounding for comprehensive control identification.

Reliable perception of GUI elements is fundamental to enabling AppAgents to interact with application interfaces in a deterministic and safe manner. However, real-world GUI environments exhibit substantial heterogeneity: some applications expose well-structured accessibility data via Windows UI Automation (UIA) APIs, while others—especially legacy or custom applications—render critical controls using non-standard toolkits that bypass UIA entirely.

To address this disparity, UFO2 introduces a hybrid control detection subsystem that fuses UIA-based metadata with vision-based grounding to construct a unified and comprehensive control graph for each application window. This design ensures both coverage and reliability, forming a resilient perceptual foundation for downstream action planning and execution.

When available, UIA offers a semantically rich and high-precision interface for enumerating on-screen controls. The detection pipeline first queries the accessibility tree to extract controls satisfying a set of runtime predicates (e.g., is_visible(), is_enabled()). To cover controls that are invisible to UIA or custom-rendered, UFO2 integrates OmniParser-v2, a vision-based grounding model designed for fast and accurate GUI parsing.

UFO2 unifies these two streams by performing deduplication based on bounding-box overlap. Visual detections with Intersection-over-Union (IoU) greater than 10% against any UIA-derived control are discarded. Remaining visual-only detections are converted into pseudo-UIA objects using a lightweight UIAWrapper abstraction. This approach bridges the gap between structured accessibility trees and pixel-level perception, ensuring reliable control detection across diverse interface styles.

Unified GUI–API Action Orchestrator

Puppeteer architecture showing how it selects between GUI actions and API calls based on availability and context.

AppAgents can interact with application environments that expose two distinct classes of interfaces: GUI frontends, which are universally observable but often brittle; and native APIs, which are high-fidelity but require explicit integration. To unify these heterogeneous execution backends under a single runtime abstraction, UFO2 introduces the Puppeteer, a modular execution orchestrator that dynamically selects between GUI-level automation and application-specific APIs for each action step. This design significantly improves task robustness, latency, and maintainability. Tasks that would otherwise require long GUI interaction sequences can often be collapsed into a single API call, reducing both execution time and the surface area for failure.

Puppeteer supports a lightweight API registration mechanism that enables developers to expose high-level operations in target applications. APIs are registered using a simple Python decorator interface. Each function is wrapped with metadata—name, argument schema, and application binding—and automatically incorporated into the AppAgent's runtime action space.

At runtime, UFO2 prompts the AppAgent to employ a decision-making policy that chooses the most appropriate execution path for each operation. If a semantically equivalent API is available, Puppeteer prefers it over GUI automation for reliability and atomicity. If an API fails or is unavailable, the system gracefully falls back to GUI-based control via simulated clicks or keystrokes.
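The registration-plus-fallback pattern might be sketched as follows. The decorator name, registry shape, and example function are illustrative, not UFO2's actual SDK:

```python
API_REGISTRY = {}   # (application, operation name) -> callable

def register_api(application, name):
    """Decorator-style registration, loosely modeled on the mechanism above."""
    def wrap(fn):
        API_REGISTRY[(application, name)] = fn
        return fn
    return wrap

@register_api("excel", "set_cell")
def excel_set_cell(cell, value):
    # Stand-in for a real COM/API call into the application.
    return f"API: {cell} <- {value}"

def execute(application, name, gui_fallback, **kwargs):
    """Prefer a registered API; fall back to GUI automation if it is
    missing or raises."""
    api = API_REGISTRY.get((application, name))
    if api is not None:
        try:
            return api(**kwargs)
        except Exception:
            pass            # fall through to the GUI path
    return gui_fallback(**kwargs)
```

The same call site works whether or not an API exists for the target application, which is exactly what lets one inference step collapse a long GUI sequence into a single API invocation.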

This hybrid execution model not only improves system performance and stability but also lays the groundwork for sustainable integration of application-specific capabilities in future desktop agents.

Continuous Knowledge Integration Substrate

The knowledge substrate in UFO2 combines static documentation with dynamic execution history to enhance agent capabilities.

Unlike traditional CUAs, which rely heavily on static training corpora, UFO2 introduces a persistent and extensible knowledge substrate that supports runtime augmentation of application-specific understanding. This substrate enables each AppAgent to retrieve, interpret, and apply external documentation and prior execution traces without requiring retraining of the underlying models. This hybrid memory design functions analogously to an OS-level metadata manager, abstracting over two key knowledge flows: static references (e.g., user manuals) and dynamic experience (e.g., execution logs).

Most real-world desktop applications expose substantial task-level documentation via user guides, help menus, or online tutorials. UFO2 capitalizes on this resource by offering a one-click interface to parse and ingest such documents into an application-specific vector store. Documents are structured as JSON records with a natural language description in the request field and detailed execution guidance in the guidance field.
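A record of that shape might look like the following, with contents invented for illustration; a round-trip through json confirms the two-field layout:

```python
import json

# Illustrative help-document record with the two fields named above.
record = {
    "request": "Highlight every cell above a threshold in a spreadsheet",
    "guidance": "Select the range, open the conditional-formatting menu, "
                "choose a 'greater than' rule, enter the threshold, and confirm.",
}

serialized = json.dumps(record)
restored = json.loads(serialized)
```

The natural-language `request` field is what gets embedded and matched against user tasks; the `guidance` field is what the agent reads once a record is retrieved.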

Beyond static knowledge, UFO2 continuously learns from its own execution history. Each automation run produces structured logs—including natural language task descriptions, executed action sequences, application screenshots, and final outcomes. Periodically, these logs are mined offline by a summarization module that distills successful trajectories into reusable Example records.

At the system level, the knowledge substrate acts as a Retrieval-Augmented Generation (RAG) layer that bridges the gap between pre-trained language models and application-specific requirements. Because both help documents and examples are indexed with semantic embeddings, the retrieval pipeline is fast, interpretable, and cache-friendly. Additionally, versioned indexing ensures that knowledge artifacts can evolve alongside software updates, preventing model obsolescence and supporting robust execution across long deployment cycles.
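Retrieval over the indexed records can be sketched with a toy bag-of-words similarity standing in for the real sentence embeddings:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; UFO2 uses Sentence Transformers instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, records, k=1):
    """Return the k records whose 'request' field is most similar to the query."""
    q = embed(query)
    ranked = sorted(records,
                    key=lambda r: cosine(q, embed(r["request"])),
                    reverse=True)
    return ranked[:k]
```

Swapping `embed` for a real embedding model changes nothing structurally: retrieval is still a nearest-neighbor lookup over the `request` field.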

By integrating static and experiential knowledge into a unified RAG pipeline, UFO2 transforms CUAs from brittle, training-time constructs into dynamic, evolving agents. This substrate plays a foundational role in enabling sustainable automation across complex, heterogeneous application ecosystems.

Speculative Multi-Action Execution

The speculative multi-action execution process showing batched inference with runtime validation and sequential execution.

Conventional CUAs suffer from a fundamental execution bottleneck: each automation step is executed in isolation, requiring a full LLM inference for every single GUI action. This step-wise inference loop introduces excessive latency, inflates system resource usage, and increases cumulative error rates—especially when interacting with complex or multi-phase workflows.

To overcome these limitations, UFO2 introduces a system-level optimization called speculative multi-action execution, inspired by classical ideas from speculative execution in processor design and instruction pipelining. Rather than issuing one action per LLM call, UFO2 speculatively generates a batch of likely next steps using a single inference pass and validates their applicability at runtime through tight OS integration.

The speculative executor operates in three stages:

  1. Action Prediction: The AppAgent issues a single LLM query to predict multiple plausible actions under its current context. Each predicted step includes a target control, intended operation, and rationale.

  2. Runtime Validation: For each action, the system consults the Windows UIA API to verify the action's preconditions (e.g., is_enabled(), is_visible()). This check ensures that each target control is still valid and interactive.

  3. Sequential Execution and Early Exit: Actions are executed in order, halting immediately if any validation fails due to interface change (e.g., control no longer exists or is disabled). The executor then reports a partial result set and prompts the agent to replan.

This strategy drastically reduces LLM invocation frequency and amortizes the cost of action planning across multiple steps, while preserving the correctness guarantees of per-step validation. Validation through OS APIs rather than visual detection ensures high-confidence safety checks with minimal computational overhead.
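The three stages reduce to a short loop. Here `is_valid` stands in for the UIA precondition checks and `apply` for the actual GUI/API dispatch; both are hypothetical callbacks:

```python
def speculative_execute(predicted_actions, is_valid, apply):
    """Run a batch of LLM-predicted actions, validating each against live
    state and stopping at the first precondition failure (early exit)."""
    done = []
    for action in predicted_actions:
        if not is_valid(action):   # e.g. UIA is_enabled()/is_visible() checks
            break                  # interface changed; caller should replan
        done.append(apply(action))
    return done
```

One LLM call yields the whole batch; the cheap per-step validation is what preserves correctness when the interface drifts mid-sequence.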

Picture-in-Picture Interface

A key design objective of UFO2 is to deliver high-throughput automation while preserving the responsiveness and usability of the primary desktop environment. Existing CUAs often monopolize the user's workspace, seizing mouse and keyboard control for extended periods and making the system effectively unusable during task execution. To overcome this, UFO2 introduces a Picture-in-Picture (PiP) interface: a lightweight, virtualized desktop window powered by Remote Desktop loopback, enabling fully isolated agent execution in parallel with active user workflows.

Virtualized User Environment with Minimal Disruption

Unlike conventional CUAs that operate in the main desktop session, the PiP interface presents a resizable, movable window containing a fully functional replica of the user's desktop. Internally, this is implemented via Windows' native Remote Desktop Protocol (RDP) loopback, creating a distinct virtual session hosted on the same machine. Applications launched within the PiP session inherit the user's identity, credentials, settings, and network context, ensuring consistency with foreground operations.

From the user's perspective, the PiP window behaves like a sandboxed workspace: the automation executes in the background, visible but unobtrusive. The user retains full control of the primary desktop and can minimize or reposition the PiP window at will. This enables UFO2 to perform long-running or repetitive workflows (e.g., data entry, batch file processing) without blocking user interaction or degrading responsiveness.

Robust Input and State Isolation

To ensure robust separation between agent actions and user activities, UFO2 leverages the RDP subsystem to maintain distinct input queues and device contexts across sessions. Mouse and keyboard events generated within the PiP desktop are fully scoped to that session and cannot interfere with the primary desktop. Similarly, GUI changes and focus transitions are restricted to the virtual environment.

This level of input isolation is critical for preventing accidental interference—either by the user or the agent—and ensures that automation sequences remain stable, even during simultaneous foreground activity. The architecture also supports controlled error recovery: failures or unexpected UI states within the PiP session do not propagate to the primary desktop, preserving the integrity of the user's environment.

Secure Cross-Session Coordination

Although visually and operationally distinct, the PiP session must remain logically connected to the host environment. To enable this, UFO2 establishes a secure inter-process communication (IPC) channel between the PiP agent runtime and a host-side coordinator. This is implemented using Windows Named Pipes, authenticated and encrypted using per-session credentials.

This IPC layer supports two-way messaging:

  • From the host to the PiP: task assignment, progress polling, cancellation, and user clarifications
  • From the PiP to the host: status updates, completion reports, and exception notifications

Users interact with the automation pipeline through a lightweight frontend panel on the host desktop, enabling real-time visibility and partial control without needing to directly access the PiP window. This transparent yet secure communication channel ensures trust and usability, particularly in long-running or partially supervised workflows.
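The message envelope for this channel might be modeled as below. Field names are illustrative, and the actual transport (a Windows named pipe) is omitted; only serialization is shown:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class PipeMessage:
    """Envelope for host <-> PiP traffic (field names illustrative)."""
    sender: str    # "host" or "pip"
    kind: str      # e.g. "assign", "status", "cancel", "completion", "exception"
    payload: dict

def encode(msg: PipeMessage) -> bytes:
    return json.dumps(asdict(msg)).encode("utf-8")

def decode(raw: bytes) -> PipeMessage:
    return PipeMessage(**json.loads(raw.decode("utf-8")))
```

A self-describing envelope like this lets the host-side coordinator multiplex task assignment, progress polling, and exception reports over one authenticated pipe.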

System-Level Implications

The PiP interface represents more than a UX refinement—it is a system-level abstraction that reconciles concurrency, usability, and safety. It decouples automation execution from foreground interactivity, introduces a new isolation primitive for GUI-based agents, and simplifies failure recovery by sandboxing side effects. By exploiting existing RDP capabilities with minimal system overhead, the PiP interface offers a practical and backwards-compatible approach to scalable desktop automation.

Implementation and Specialized Engineering Design

UFO2 is implemented as a full-stack desktop automation framework spanning over 30,000 lines of Python and C# code. Python serves as the core runtime environment for agent orchestration, control logic, and API integration, while C# supports GUI development, debugging interfaces, and Windows-specific operations such as the Picture-in-Picture desktop. To support retrieval-augmented reasoning, UFO2 leverages Sentence Transformers for embedding-based document and experience retrieval.

Beyond its core functionality, UFO2 incorporates multiple specialized engineering components that target critical systems goals: composability, interactivity, debuggability, and scalable deployment.

Multi-Round Task Execution

Unlike stateless one-shot agents, UFO2 adopts a session-based execution model to support iterative, interactive workflows. Each Session maintains persistent contextual memory—including intermediate results, task progress, and application state—across multiple Rounds of execution. Users can refine prior instructions, launch follow-up tasks, or intervene when agents encounter ambiguous or unsafe operations.

This multi-round interaction paradigm facilitates progressive convergence on complex tasks while preserving transparency and human oversight. It enables UFO2 to support human-in-the-loop refinement strategies, bridging static LLM workflows with dynamic user guidance.

Safeguard Mechanism

While automation substantially boosts productivity, any CUA carries inherent risks of executing unsafe actions that may adversely affect user data or system stability. Examples include deleting critical files, terminating applications prematurely (resulting in unsaved data loss), or activating sensitive devices such as webcams without explicit consent. These actions pose severe risks, potentially causing irrecoverable damage or security breaches.

To mitigate such risks, UFO2 incorporates an explicit safeguard mechanism, designed to actively detect potentially dangerous actions. Specifically, whenever an AppAgent identifies an action matching predefined risk criteria, it transitions into a dedicated PENDING state, pausing execution and actively prompting the user for confirmation. Only upon receiving explicit user consent does the agent proceed; otherwise, the action is aborted to prevent harm. The definition and scope of what constitutes a risky action are fully customizable through a straightforward prompt-based interface, enabling users and system administrators to precisely tailor safeguard behavior according to their organization's specific risk policies.
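The check-then-pause pattern can be sketched as a guard function. The keyword list and confirmation callback are illustrative stand-ins for UFO2's prompt-based risk criteria and PENDING-state user prompt:

```python
# Illustrative risk patterns; UFO2 lets these be customized via prompts.
RISKY_KEYWORDS = ("delete", "format drive", "terminate", "webcam")

def guard(action_description, confirm):
    """Return True if the action may proceed. Risky actions pause and
    require explicit confirmation via the `confirm` callback."""
    lowered = action_description.lower()
    if any(keyword in lowered for keyword in RISKY_KEYWORDS):
        return confirm(action_description)   # user answers yes/no
    return True
```

In the real system the pause corresponds to the AppAgent entering its PENDING state rather than a synchronous callback, but the decision logic is the same: nothing risky runs without explicit consent.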

Through this proactive safety-checking framework, UFO2 significantly reduces the likelihood of executing harmful operations, thus enhancing overall system safety, user trust, and robustness in real-world deployments.

Everything-as-an-AppAgent

To support ecosystem extensibility, UFO2 introduces an agent registry mechanism that encapsulates arbitrary third-party components as pluggable AppAgents. Through a simple registration API, external automation solutions—such as domain-specific copilots or proprietary tools—can be wrapped with lightweight compatibility shims that expose a unified interface to the HostAgent.

This design enables HostAgent to treat native and external AppAgents interchangeably, dispatching subtasks based on capability and specialization. The researchers found that even minimal wrappers (e.g., for OpenAI Operator) lead to tangible performance gains, highlighting the system's modularity and its ability to incorporate diverse execution backends with minimal engineering overhead.

AgentOS-as-a-Service

UFO2 adopts a client-server architecture to support practical deployment at scale. A lightweight client resides on the user's machine and is responsible for GUI operations and application-side sensing. Meanwhile, a centralized server (running on-premises or in the cloud) hosts the HostAgent/AppAgent logic, orchestrates workflows, and handles LLM queries.

This separation of control and execution offers several systems-level benefits:

  • Security: Sensitive orchestration and model execution are isolated from user devices
  • Maintainability: Server-side updates propagate without modifying the client
  • Scalability: The system can support multiple concurrent clients with centralized scheduling and load management

The client-server boundary enforces a clean service abstraction, promoting modularity and simplifying rollout in enterprise environments.

Comprehensive Logging and Debugging Infrastructure

Robust observability is essential for diagnosing failures and supporting ongoing system improvement. To this end, UFO2 implements a comprehensive logging and debugging framework. Each session captures fine-grained traces of execution: prompts, LLM outputs, control metadata, UI state snapshots, and error events.

At the end of each session, UFO2 compiles these artifacts into a structured, Markdown-formatted execution log. Developers can inspect action-by-action agent decisions, visualize interface state transitions, and replay behavior for debugging. The framework also supports prompt editing and selective replay for targeted hypothesis testing, significantly accelerating the debugging cycle.

This observability layer functions as a lightweight provenance system for agent behavior, fostering transparency, accountability, and rapid iteration during deployment.

Automated Task Evaluator

To provide structured feedback and facilitate continuous improvement, UFO2 includes an automated task evaluation engine based on LLM-as-a-judge. The evaluator parses session traces—including actions, rationales, and screenshots—and applies Chain-of-Thought reasoning to decompose tasks into evaluation criteria.

It assigns partial scores and synthesizes an overall result: success, partial, or failure. This structured outcome feeds into downstream dashboards and debugging tools. It also supports self-monitoring and offline analysis of failure cases, closing the loop between execution, diagnosis, and improvement.
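The aggregation step can be sketched as below. An LLM judge (stubbed out here) would produce the per-criterion scores; the thresholds mapping averaged scores to success/partial/failure are illustrative assumptions, as the paper does not specify them.

```python
def overall_result(criterion_scores: dict) -> str:
    """Synthesize per-criterion partial scores (0.0 to 1.0, as would be
    produced by an LLM-as-a-judge) into an overall verdict.
    Thresholds here are hypothetical."""
    avg = sum(criterion_scores.values()) / len(criterion_scores)
    if avg >= 0.99:
        return "success"
    if avg > 0.0:
        return "partial"
    return "failure"

# Hypothetical decomposition of "rename the report and keep its contents":
scores = {"file_renamed": 1.0, "content_preserved": 1.0, "saved_to_folder": 0.0}
print(overall_result(scores))  # partial
```

Distinguishing "partial" from outright failure is what makes the verdicts useful for offline failure analysis: a partially completed task points at the specific criterion that broke, rather than at the task as a whole.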

Evaluation

UFO2 was tested rigorously across more than 20 Windows applications, including office suites, file explorers, and custom enterprise tools, to assess performance, efficiency, and robustness.

Experimental Setup

Deployment Environment: The benchmark environments were hosted on isolated VMs with 8 AMD Ryzen 7 CPU cores and 8 GB of memory, matching typical deployment conditions. All GPT-family models (GPT-4V, GPT-4o, o1, and Operator) were accessed via Azure OpenAI services, while the OmniParser-v2 vision model operated on a separate virtual machine provisioned with an NVIDIA A100 80GB GPU.

Benchmarks: UFO2 was evaluated using two established Windows-centric automation benchmarks:

  • Windows Agent Arena (WAA): Consists of 154 live automation tasks across 15 commonly used Windows applications
  • OSWorld-W: A targeted subset of the OSWorld benchmark specifically tailored for Windows, comprising 49 live tasks

Baselines: UFO2 was compared with five representative state-of-the-art CUAs, each leveraging GPT-4o as the inference engine:

  • UFO: A pioneering multiagent, GUI-focused automation system designed explicitly for Windows
  • NAVI: A single-agent baseline from WAA, utilizing screenshots and accessibility data for GUI understanding
  • OmniAgent: Employs OmniParser for visual grounding combined with GPT-based action planning
  • Agent S: Features a multiagent architecture with experience-driven hierarchical planning
  • Operator: A recent, high-performance CUA from OpenAI, simulating human-like mouse and keyboard interactions

Evaluation Metrics:

  • Success Rate (SR): Percentage of tasks successfully completed
  • Average Completion Steps (ACS): Average number of LLM-involved action inference steps required per task
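The two metrics are straightforward to compute from per-task records; the sketch below uses hypothetical outcome data purely to show the arithmetic.

```python
def success_rate(results: list) -> float:
    """SR: fraction of tasks completed successfully, as a percentage."""
    return 100.0 * sum(results) / len(results)

def avg_completion_steps(steps: list) -> float:
    """ACS: mean number of LLM-involved action inference steps per task.
    Fewer steps means fewer model calls, so lower latency and cost."""
    return sum(steps) / len(steps)

outcomes = [True, False, True, True]   # hypothetical per-task success flags
steps    = [6, 12, 4, 9]               # hypothetical LLM call counts per task
print(f"SR={success_rate(outcomes):.1f}%  ACS={avg_completion_steps(steps):.2f}")
# SR=75.0%  ACS=7.75
```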

Success Rate Comparison

| Agent | Model | WAA | OSWorld-W |
|---|---|---|---|
| UFO | GPT-4o | 19.5% | 12.2% |
| NAVI | GPT-4o | 13.3% | 10.2% |
| OmniAgent | GPT-4o | 19.5% | 8.2% |
| Agent S | GPT-4o | 18.2% | 12.2% |
| Operator | computer-use | 20.8% | 14.3% |
| UFO2-base | GPT-4o | 23.4% | 16.3% |
| UFO2-base | o1 | 25.3% | 16.3% |
| UFO2 | GPT-4o | 27.9% | 28.6% |
| UFO2 | o1 | 30.5% | 32.7% |

Table 1. Comparison of success rates (SR) across agents on WAA and OSWorld-W benchmarks.

Notably, even the basic configuration (UFO2-base), which relies solely on standard UI Automation and GUI-driven actions, consistently surpasses prior state-of-the-art CUAs. Specifically, with GPT-4o, UFO2-base achieves an SR of 23.4% on WAA, outperforming the best existing baseline, Operator (20.8%), by 2.6 percentage points.

The complete version of UFO2, incorporating hybrid GUI–API action execution, advanced visual grounding, and continuous knowledge integration, further amplifies these performance gains. With GPT-4o, UFO2 achieves a 27.9% SR on WAA, exceeding Operator by a substantial 7.1 percentage points. The performance gap becomes even more pronounced on OSWorld-W, where UFO2 achieves a 28.6% SR compared to Operator's 14.3%, doubling its success rate.

Per-application-category success rates on WAA and OSWorld-W:

| Agent | Model | Office (WAA) | Web Browser (WAA) | Windows System (WAA) | Coding (WAA) | Media & Video (WAA) | Windows Utils (WAA) | Office (OSWorld-W) | Cross-App (OSWorld-W) |
|---|---|---|---|---|---|---|---|---|---|
| UFO | GPT-4o | 0.0% | 23.3% | 33.3% | 29.2% | 33.3% | 8.3% | 18.5% | 4.5% |
| NAVI | GPT-4o | | | | | | | | |

Click here to read the full summary of this paper
