<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Matthew Diakonov</title>
    <description>The latest articles on Forem by Matthew Diakonov (@m13v).</description>
    <link>https://forem.com/m13v</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2148851%2F9c366493-ec5b-4e3a-9298-3d34d060305a.jpeg</url>
      <title>Forem: Matthew Diakonov</title>
      <link>https://forem.com/m13v</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/m13v"/>
    <language>en</language>
    <item>
      <title>Building a macOS Desktop Agent with Claude - How AI Wrote Most of Its Own Code</title>
      <dc:creator>Matthew Diakonov</dc:creator>
      <pubDate>Wed, 18 Mar 2026 04:00:55 +0000</pubDate>
      <link>https://forem.com/m13v/building-a-macos-desktop-agent-with-claude-how-ai-wrote-most-of-its-own-code-1440</link>
      <guid>https://forem.com/m13v/building-a-macos-desktop-agent-with-claude-how-ai-wrote-most-of-its-own-code-1440</guid>
      <description>&lt;h1&gt;Building a macOS Desktop Agent with Claude&lt;/h1&gt;

&lt;p&gt;Here is something that sounds circular but actually works: using an AI coding assistant to build an AI desktop agent.&lt;/p&gt;

&lt;p&gt;Fazm is a macOS app that can see and control your screen. It uses ScreenCaptureKit to grab frames, accessibility APIs to click and type things, and Whisper for voice input. The interesting part is that Claude wrote most of the Swift code itself.&lt;/p&gt;

&lt;h2&gt;How It Works in Practice&lt;/h2&gt;

&lt;p&gt;The key was getting the architecture figured out first. Once we had clear CLAUDE.md files describing the project structure, the component boundaries, and the conventions, Claude got surprisingly good at writing native Mac code.&lt;/p&gt;

&lt;p&gt;A typical development session looks like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Describe the feature in plain language&lt;/li&gt;
&lt;li&gt;Claude reads the existing codebase and writes the implementation&lt;/li&gt;
&lt;li&gt;Build, test, iterate&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For something like adding a new accessibility API interaction - say, reading the contents of a specific text field in a specific app - Claude can look at how existing interactions work and extend the pattern. The Swift type system helps a lot here because the compiler catches most mistakes before runtime.&lt;/p&gt;
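&lt;p&gt;As a sketch of what such an extension looks like (the function name and error handling here are illustrative, not Fazm's actual code), reading the focused text field's value through the accessibility tree comes down to two attribute lookups:&lt;/p&gt;

```swift
// Sketch: read the value of the focused text element in the frontmost app.
// Requires the Accessibility permission (System Settings > Privacy &amp; Security).
import AppKit
import ApplicationServices

func focusedFieldValue() -> String? {
    guard AXIsProcessTrusted(),  // user has granted Accessibility access
          let app = NSWorkspace.shared.frontmostApplication else { return nil }

    let appElement = AXUIElementCreateApplication(app.processIdentifier)

    // Hop 1: app element -> currently focused UI element.
    var focused: CFTypeRef?
    guard AXUIElementCopyAttributeValue(appElement,
            kAXFocusedUIElementAttribute as CFString, &amp;focused) == .success
    else { return nil }

    // Hop 2: focused element -> its AXValue attribute (the text content).
    var value: CFTypeRef?
    guard AXUIElementCopyAttributeValue(focused as! AXUIElement,
            kAXValueAttribute as CFString, &amp;value) == .success
    else { return nil }
    return value as? String
}
```

&lt;p&gt;Once one wrapper like this exists in the codebase, Claude can pattern-match new attribute reads (titles, roles, children) from it, and the compiler flags any mismatch in the CF bridging.&lt;/p&gt;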

&lt;h2&gt;What Claude Is Good At&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Boilerplate and patterns.&lt;/strong&gt; SwiftUI views, async/await pipelines, accessibility API wrappers - once Claude sees one example, it can produce correct variations quickly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API integration.&lt;/strong&gt; Given Apple's documentation and existing usage in the codebase, Claude writes correct ScreenCaptureKit and accessibility API code on the first try more often than not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test scaffolding.&lt;/strong&gt; Setting up XCTest cases for the agent's action pipeline is tedious work that Claude handles well.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What Required Human Architecture&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The overall pipeline design.&lt;/strong&gt; How screen capture, LLM processing, and action execution chain together needed human thinking about latency, error handling, and state management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy decisions.&lt;/strong&gt; What data stays local, what gets sent to the LLM, how voice recordings are handled - these are product decisions, not code decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The accessibility API strategy.&lt;/strong&gt; The decision to use the accessibility tree instead of screenshot-based OCR was a fundamental architecture choice that shaped everything downstream. We explain the tradeoffs between these two approaches in &lt;a href="https://fazm.ai/blog/how-ai-agents-see-your-screen-dom-vs-screenshots" rel="noopener noreferrer"&gt;how AI agents see your screen&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The CLAUDE.md Pattern&lt;/h2&gt;

&lt;p&gt;The most valuable thing we did was maintaining detailed CLAUDE.md files. These files tell Claude:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What each module does and where it lives&lt;/li&gt;
&lt;li&gt;What conventions the codebase follows&lt;/li&gt;
&lt;li&gt;What Swift patterns to use (and which to avoid)&lt;/li&gt;
&lt;li&gt;How to run builds and tests&lt;/li&gt;
&lt;/ul&gt;
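&lt;p&gt;To make that concrete, a stripped-down CLAUDE.md covering those four points might look like this (the module names are invented for illustration, not Fazm's actual layout):&lt;/p&gt;

```markdown
# CLAUDE.md

## Modules
- `Capture/` - ScreenCaptureKit frame grabbing
- `Actions/` - accessibility API wrappers for click, type, and read

## Conventions
- Swift concurrency (async/await) throughout; no Combine
- One type per file, file named after the type

## Build and test
- Build: `xcodebuild -scheme App build`
- Test: `xcodebuild -scheme App test`
```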

&lt;p&gt;This sounds like documentation, and it is - but it is documentation optimized for an AI reader rather than a human one. The result is that any new Claude session can pick up where the last one left off without re-discovering the codebase from scratch. We expanded on this idea significantly in &lt;a href="https://fazm.ai/blog/claude-code-architecture-handoff-pattern" rel="noopener noreferrer"&gt;the HANDOFF.md pattern&lt;/a&gt;, which covers context window management across sessions.&lt;/p&gt;

&lt;h2&gt;Running Multiple Agents in Parallel&lt;/h2&gt;

&lt;p&gt;For larger features, we run multiple Claude Code sessions simultaneously. Each agent works on an isolated scope - one might handle the UI layer while another works on the data pipeline. The rule is simple: no two agents edit the same file.&lt;/p&gt;

&lt;p&gt;This works surprisingly well when the architecture has clean module boundaries. Each agent reads the shared CLAUDE.md for context but writes to its own set of files. We wrote a dedicated post on &lt;a href="https://fazm.ai/blog/multi-agent-parallel-development" rel="noopener noreferrer"&gt;running parallel AI agents on one codebase&lt;/a&gt; with the full playbook on tmux, branch isolation, and scope assignment.&lt;/p&gt;
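&lt;p&gt;One way to enforce that rule mechanically is a git worktree per agent, so each session gets its own branch and working directory. The commands below are a generic sketch (repo path and branch names are invented), not Fazm's exact setup:&lt;/p&gt;

```shell
# One worktree and branch per agent: parallel sessions can never
# write to the same checkout. Names here are illustrative.
set -eu
repo="${TMPDIR:-/tmp}/agent-demo"
rm -rf "$repo" "$repo-ui" "$repo-data"
git init -q "$repo"
git -C "$repo" -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "init"
# Agent 1 owns the UI layer; agent 2 owns the data pipeline.
git -C "$repo" worktree add -b agent-ui "$repo-ui"
git -C "$repo" worktree add -b agent-data "$repo-data"
git -C "$repo" worktree list
```

&lt;p&gt;Each agent then runs in its own tmux pane pointed at its own worktree, and merging the results back is an ordinary git merge.&lt;/p&gt;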




&lt;p&gt;&lt;em&gt;This post is based on our experience shared in &lt;a href="https://www.reddit.com/r/ClaudeAI/" rel="noopener noreferrer"&gt;r/ClaudeAI&lt;/a&gt;. Fazm is &lt;a href="https://github.com/m13v/fazm" rel="noopener noreferrer"&gt;open source on GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Keep Reading&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/native-macos-development-claude-vs-web" rel="noopener noreferrer"&gt;Building Native macOS Apps with Claude Is a Different Beast Than Web Dev&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/codex-vs-claude-code-comparison" rel="noopener noreferrer"&gt;Codex vs Claude Code - A Practical Comparison for Real Development&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/claude-daily-use-cases-voice-desktop" rel="noopener noreferrer"&gt;What People Actually Use Claude For Daily - Tool Use, Voice Control, and Desktop Automation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>macos</category>
      <category>programming</category>
    </item>
    <item>
      <title>The 10 Best AI Agents for Desktop Automation in 2026</title>
      <dc:creator>Matthew Diakonov</dc:creator>
      <pubDate>Wed, 18 Mar 2026 04:00:47 +0000</pubDate>
      <link>https://forem.com/m13v/the-10-best-ai-agents-for-desktop-automation-in-2026-1ok6</link>
      <guid>https://forem.com/m13v/the-10-best-ai-agents-for-desktop-automation-in-2026-1ok6</guid>
      <description>&lt;h1&gt;The 10 Best AI Agents for Desktop Automation in 2026&lt;/h1&gt;

&lt;p&gt;AI agents that control your computer are no longer experimental. In 2026, there are real, production-ready tools that can automate desktop workflows - clicking buttons, filling forms, navigating browsers, writing code, and managing files. Some work only inside a browser. Others control your entire operating system. A few are open source. Most are not.&lt;/p&gt;

&lt;p&gt;We tested the leading options and ranked them based on what actually matters: scope of control, speed and reliability, privacy, input methods, memory, pricing, and platform support. Whether you are looking for a browser copilot or a full desktop automation agent, this guide will help you find the right tool.&lt;/p&gt;

&lt;h2&gt;What Makes a Great AI Desktop Agent?&lt;/h2&gt;

&lt;p&gt;Before getting into the rankings, here are the criteria we used to evaluate each tool. Not every agent needs to excel in all of these, but the best ones perform well across most.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scope of control.&lt;/strong&gt; Does the agent only work inside a browser, or can it control your entire desktop - native apps, file system, system settings? The broader the scope, the more tasks you can automate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed and reliability.&lt;/strong&gt; How fast does the agent execute tasks? Does it use screenshot-based control (slower, less reliable) or direct DOM/API interaction (faster, more precise)? Does it frequently misclick or get stuck?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Privacy model.&lt;/strong&gt; AI agents can see everything on your screen. Where does that data go? Is it processed locally or sent to cloud servers? Can you audit the code?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input method.&lt;/strong&gt; Text-only, or does the agent support voice commands? Voice input is significantly faster for delegating tasks, especially during calls or when your hands are busy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory and context.&lt;/strong&gt; Does the agent remember your preferences, contacts, and past interactions? A good memory layer means less explaining over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing.&lt;/strong&gt; Free, subscription, usage-based, or some combination? Open source or proprietary?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Platform support.&lt;/strong&gt; macOS only, Windows only, cross-platform, or cloud-based?&lt;/p&gt;

&lt;h2&gt;The 10 Best AI Agents for Desktop Automation&lt;/h2&gt;

&lt;h3&gt;1. Fazm&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; An open-source, local-first AI computer agent for macOS that controls your entire desktop through voice commands. Fazm sits as a floating toolbar, listens via push-to-talk, and executes real actions on your screen - clicking, typing, navigating, filling forms, writing code, and managing files across any app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full desktop control - any app, any window, any file, not just browsers&lt;/li&gt;
&lt;li&gt;Voice-first push-to-talk interface with natural language understanding&lt;/li&gt;
&lt;li&gt;Direct browser DOM control via extension - no screenshot-and-guess loop&lt;/li&gt;
&lt;li&gt;Personal knowledge graph that learns your contacts, preferences, and workflows over time&lt;/li&gt;
&lt;li&gt;Local-first architecture - screen analysis stays on your machine&lt;/li&gt;
&lt;li&gt;Open source on &lt;a href="https://github.com/m13v/fazm" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Works with your existing browser - Chrome, Safari, Arc, Firefox&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Power users who want full desktop automation with voice control, privacy, and zero subscription cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free and open source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Platforms:&lt;/strong&gt; macOS (Apple Silicon and Intel). Windows on the roadmap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Broadest scope of any tool on this list - controls your entire computer, not just a browser&lt;/li&gt;
&lt;li&gt;Voice input is genuinely faster than typing instructions for most tasks&lt;/li&gt;
&lt;li&gt;DOM-based browser control is faster and more reliable than screenshot approaches&lt;/li&gt;
&lt;li&gt;Memory layer reduces friction over time - less explaining, more doing&lt;/li&gt;
&lt;li&gt;Local processing means your screen data never leaves your machine&lt;/li&gt;
&lt;li&gt;Completely free with open-source code you can audit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;macOS only right now - no Windows or Linux support yet&lt;/li&gt;
&lt;li&gt;Requires installing a browser extension for DOM control&lt;/li&gt;
&lt;li&gt;Younger project compared to tools backed by OpenAI or Perplexity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why it is number one:&lt;/strong&gt; Fazm is the only tool on this list that combines full desktop control, voice input, DOM-based browser automation, a persistent memory layer, local-first privacy, and open-source transparency - all for free. This native approach is a deliberate architectural choice - see our post on &lt;a href="https://fazm.ai/blog/native-desktop-agent-vs-cloud-vm" rel="noopener noreferrer"&gt;native desktop agents vs cloud VMs&lt;/a&gt; for why it matters. Most other agents are limited to browser tabs or require cloud processing of your screen. Fazm operates at the OS level, which means it can automate tasks that browser-only tools simply cannot touch.&lt;/p&gt;




&lt;h3&gt;2. ChatGPT Atlas&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; OpenAI's Chromium-based web browser with ChatGPT built in. It features a sidebar assistant and an agent mode where ChatGPT takes over the browser cursor to complete multi-step web tasks like booking travel, filling forms, and navigating complex workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent mode automates multi-step web tasks&lt;/li&gt;
&lt;li&gt;Powered by OpenAI's latest models (GPT-4o and beyond)&lt;/li&gt;
&lt;li&gt;Sidebar chat for summaries, rewrites, and Q&amp;amp;A on any page&lt;/li&gt;
&lt;li&gt;Integrated with ChatGPT's conversation history and memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; ChatGPT Plus subscribers who want browser automation backed by OpenAI's models. For a detailed three-way comparison, see &lt;a href="https://fazm.ai/blog/chatgpt-atlas-vs-perplexity-comet-vs-fazm" rel="noopener noreferrer"&gt;ChatGPT Atlas vs Perplexity Comet vs Fazm&lt;/a&gt; or our &lt;a href="https://fazm.ai/compare/chatgpt-atlas" rel="noopener noreferrer"&gt;ChatGPT Atlas comparison page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Requires ChatGPT Plus at $20/month. Pro tier at $200/month for heavier usage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Platforms:&lt;/strong&gt; macOS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backed by OpenAI's best-in-class language models&lt;/li&gt;
&lt;li&gt;Familiar interface for existing ChatGPT users&lt;/li&gt;
&lt;li&gt;Solid at complex web research and multi-step browser tasks&lt;/li&gt;
&lt;li&gt;No extension needed - agent runs inside its own browser&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser only - cannot control desktop apps, files, or native software&lt;/li&gt;
&lt;li&gt;Screenshot-based automation is slower and less reliable than DOM control&lt;/li&gt;
&lt;li&gt;Pages are sent to OpenAI's servers for processing - privacy concern for sensitive data&lt;/li&gt;
&lt;li&gt;Requires switching to Atlas browser - your Chrome extensions and bookmarks do not carry over&lt;/li&gt;
&lt;li&gt;No voice input&lt;/li&gt;
&lt;li&gt;$20/month minimum&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;3. Perplexity Comet&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; Perplexity's AI-powered Chromium browser with two modes - Comet Assistant for search and Q&amp;amp;A, and Comet Agent for multi-step web automation. Its standout strength is built-in Perplexity search with AI-synthesized answers and source citations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Best-in-class AI search with citations directly in the browser&lt;/li&gt;
&lt;li&gt;Agent mode for automating web tasks like shopping, booking, and form filling&lt;/li&gt;
&lt;li&gt;Comet Assistant sidecar for summarizing tabs and answering questions&lt;/li&gt;
&lt;li&gt;Cross-platform availability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Researchers and information workers who need AI-powered search with light browser automation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Limited free searches. Full access requires Perplexity Pro at $20/month or Max at $200/month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Platforms:&lt;/strong&gt; macOS, Windows, Android, iOS - the broadest platform support on this list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Perplexity search is genuinely excellent for research tasks&lt;/li&gt;
&lt;li&gt;Broadest platform support of any tool listed here&lt;/li&gt;
&lt;li&gt;Agent mode handles common web automation tasks well&lt;/li&gt;
&lt;li&gt;Clean, fast browsing experience&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser only - no desktop app or file system access&lt;/li&gt;
&lt;li&gt;Screenshot-based agent mode shares the same speed and reliability limits as Atlas&lt;/li&gt;
&lt;li&gt;Browsing data sent to Perplexity servers&lt;/li&gt;
&lt;li&gt;Agent mode is secondary to the search experience - less polished than Atlas for automation&lt;/li&gt;
&lt;li&gt;Requires switching browsers&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;4. Simular&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; An AI-powered autonomous agent for macOS that can perceive, reason about, and execute tasks on your computer. Simular goes beyond browser-only agents by interacting with the full macOS environment, using advanced vision models to understand and control interfaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full desktop and browser automation on macOS&lt;/li&gt;
&lt;li&gt;Vision-based interface understanding that adapts to layout changes&lt;/li&gt;
&lt;li&gt;Task recording and replay for repeatable workflows&lt;/li&gt;
&lt;li&gt;Tops industry benchmarks across browser, computer, and smartphone agent tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; macOS power users who want desktop-wide automation with strong vision-based understanding. For a detailed head-to-head, see our &lt;a href="https://fazm.ai/compare/simular-ai" rel="noopener noreferrer"&gt;Simular AI comparison page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free tier available. Simular Plus and Simular Pro tiers for heavier usage (hosted servers with 200 agent hours included, additional compute at $0.10/agent hour).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Platforms:&lt;/strong&gt; macOS (version 15+, Apple Silicon required).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full desktop control, not just browser&lt;/li&gt;
&lt;li&gt;Strong benchmark performance across multiple agent categories&lt;/li&gt;
&lt;li&gt;Task recording lets you create reusable automations&lt;/li&gt;
&lt;li&gt;Adapts to interface changes without breaking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires Apple Silicon - no Intel Mac support&lt;/li&gt;
&lt;li&gt;Vision-based approach is slower than direct DOM control for browser tasks&lt;/li&gt;
&lt;li&gt;Pricing can add up with heavy usage&lt;/li&gt;
&lt;li&gt;Less transparent about data handling compared to open-source alternatives&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;5. Highlight AI&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; A desktop AI assistant that observes your screen and provides contextual answers, summaries, and meeting transcriptions. Unlike most tools on this list, Highlight is primarily a read-only observer rather than an active automation agent - it watches what you do and helps you understand it, but does not take actions on your behalf.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Screen awareness - ask questions about anything visible on your screen&lt;/li&gt;
&lt;li&gt;Automatic meeting transcription and summaries from system audio&lt;/li&gt;
&lt;li&gt;Cross-app context - works across any application without switching windows&lt;/li&gt;
&lt;li&gt;MCP integration for connecting to tools like Slack, Notion, and GitHub&lt;/li&gt;
&lt;li&gt;Privacy-focused local processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Knowledge workers who want an always-on AI assistant for meetings, screen Q&amp;amp;A, and context recall - not desktop automation. We go deeper on this distinction in our &lt;a href="https://fazm.ai/blog/highlight-ai-vs-fazm" rel="noopener noreferrer"&gt;Highlight AI vs Fazm comparison&lt;/a&gt; and our &lt;a href="https://fazm.ai/compare/highlight-ai" rel="noopener noreferrer"&gt;Highlight AI comparison page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free to use. Paid plans are expected, with pricing tied to the volume of text processed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Platforms:&lt;/strong&gt; macOS, Windows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Excellent meeting transcription and summarization&lt;/li&gt;
&lt;li&gt;Works across all apps without configuration&lt;/li&gt;
&lt;li&gt;Low friction - just install and it starts observing&lt;/li&gt;
&lt;li&gt;Processes data locally on your device&lt;/li&gt;
&lt;li&gt;Cross-platform support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not an automation agent - it observes, answers, and summarizes but does not click, type, or execute multi-step workflows&lt;/li&gt;
&lt;li&gt;Meeting-focused feature set may not justify installation for non-meeting-heavy users&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;6. BrowserOS&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; An open-source Chromium fork that runs AI agents natively inside the browser. BrowserOS positions itself as a privacy-first, open-source alternative to Atlas and Comet, with support for 11+ AI providers including local models via Ollama and LM Studio.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open-source agentic browser (AGPL-3.0 license)&lt;/li&gt;
&lt;li&gt;Supports 11+ AI providers - OpenAI, Anthropic, Google, Moonshot Kimi, OpenRouter (500+ models), and local models&lt;/li&gt;
&lt;li&gt;Agents access the DOM, execute JavaScript, capture screenshots, fill forms, and navigate pages&lt;/li&gt;
&lt;li&gt;Local-first option - run entirely on your machine with Ollama or LM Studio&lt;/li&gt;
&lt;li&gt;Compatible with Chrome extensions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Privacy-conscious users and developers who want an open-source AI browser with model flexibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free and open source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Platforms:&lt;/strong&gt; macOS, Windows, Linux.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Genuinely open source with active community (4.3k+ GitHub stars)&lt;/li&gt;
&lt;li&gt;Choose your own AI provider, including fully local models&lt;/li&gt;
&lt;li&gt;Chrome extension compatibility means you keep your existing tools&lt;/li&gt;
&lt;li&gt;Cross-platform support&lt;/li&gt;
&lt;li&gt;No subscription fees&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser only - cannot control desktop apps or files&lt;/li&gt;
&lt;li&gt;Requires technical comfort to set up local models&lt;/li&gt;
&lt;li&gt;Chromium fork means another browser to manage&lt;/li&gt;
&lt;li&gt;Younger project - agent reliability is still maturing&lt;/li&gt;
&lt;li&gt;Community-driven development pace may be slower than venture-backed competitors&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;7. Composio (Open ChatGPT Atlas)&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; An open-source Chrome extension that replicates ChatGPT Atlas-style browser automation. Built by Composio, it combines visual browser automation (using Gemini's computer use capabilities) with a Tool Router that connects directly to 500+ SaaS APIs for tasks like sending Slack messages, creating GitHub issues, or searching Gmail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Two modes: Browser Tools (visual automation with screenshots) and Tool Router (direct API calls to 500+ services)&lt;/li&gt;
&lt;li&gt;Sidebar chat interface within Chrome&lt;/li&gt;
&lt;li&gt;No backend required - runs entirely in the browser extension&lt;/li&gt;
&lt;li&gt;Safety features with confirmation dialogs for sensitive actions&lt;/li&gt;
&lt;li&gt;Open source on &lt;a href="https://github.com/ComposioHQ/open-chatgpt-atlas" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Developers who want a free, open-source Atlas alternative with direct API integrations for SaaS tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free and open source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Platforms:&lt;/strong&gt; Any platform that runs Chrome (macOS, Windows, Linux).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A free, open-source alternative to Atlas&lt;/li&gt;
&lt;li&gt;Tool Router is clever - direct API calls are faster and more reliable than visual automation for supported services&lt;/li&gt;
&lt;li&gt;500+ SaaS integrations out of the box&lt;/li&gt;
&lt;li&gt;No browser switch required - it is a Chrome extension&lt;/li&gt;
&lt;li&gt;Confirmation dialogs add a safety layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser only - no desktop automation&lt;/li&gt;
&lt;li&gt;Visual automation mode uses the slower screenshot-analyze-click loop&lt;/li&gt;
&lt;li&gt;Requires your own API keys for AI providers&lt;/li&gt;
&lt;li&gt;More of a developer tool than an end-user product - setup is not turnkey&lt;/li&gt;
&lt;li&gt;Extension-based approach has inherent limitations compared to a full browser&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;8. Bytebot&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; An open-source, self-hosted AI desktop agent that runs inside a containerized Linux environment (Docker). Bytebot gives AI its own computer - a full desktop where it can use any application, process documents, navigate websites, and complete multi-step workflows through natural language.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full desktop environment running in Docker - any application, not just browsers&lt;/li&gt;
&lt;li&gt;Natural language task control&lt;/li&gt;
&lt;li&gt;Adaptive AI vision that understands interfaces semantically&lt;/li&gt;
&lt;li&gt;Two modes: Autonomous (hands-off) and Takeover (manual intervention)&lt;/li&gt;
&lt;li&gt;Self-hosted - your data, your keys, your security policies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Developers and technical users who want a self-hosted, containerized AI agent for server-side automation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free and open source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Platforms:&lt;/strong&gt; Any platform that runs Docker (macOS, Windows, Linux).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full desktop environment, not browser-limited&lt;/li&gt;
&lt;li&gt;Self-hosted means complete data control&lt;/li&gt;
&lt;li&gt;Docker-based deployment is quick - running in minutes&lt;/li&gt;
&lt;li&gt;Autonomous and takeover modes give flexibility&lt;/li&gt;
&lt;li&gt;No subscription fees&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs in a containerized Linux environment, not your actual desktop - you cannot automate your personal Mac or Windows apps directly&lt;/li&gt;
&lt;li&gt;Primarily for server-side/headless automation, not interactive desktop use&lt;/li&gt;
&lt;li&gt;Requires Docker knowledge and infrastructure&lt;/li&gt;
&lt;li&gt;The GitHub repository was archived in March 2026, raising questions about ongoing maintenance&lt;/li&gt;
&lt;li&gt;Screenshot-based vision approach for interface interaction&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;9. Macro&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; A workspace super app that unifies tasks, documents, and AI workflows into a single platform. Macro combines document management, AI chat, and productivity features with plans to add persistent AI agents for ongoing workflows. It is less of a desktop automation agent and more of an AI-powered workspace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI chat that works across multiple documents - SEC filings, legal transcripts, research papers&lt;/li&gt;
&lt;li&gt;Auto-generated structured reports and visual diagrams from uploaded documents&lt;/li&gt;
&lt;li&gt;Keyboard-driven interface with rapid triage for emails, DMs, and to-dos&lt;/li&gt;
&lt;li&gt;Planned: persistent AI agents for project management and document drafting&lt;/li&gt;
&lt;li&gt;Google Cloud-powered web search integration (coming soon)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Knowledge workers who want an AI-powered workspace for document analysis and task management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free tier available. Premium plans for advanced features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Platforms:&lt;/strong&gt; macOS, web.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strong document analysis and multi-document chat capabilities&lt;/li&gt;
&lt;li&gt;Clean, keyboard-driven interface designed for speed&lt;/li&gt;
&lt;li&gt;AI features are well-integrated into the workspace experience&lt;/li&gt;
&lt;li&gt;Useful for research-heavy and document-heavy workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not a desktop automation agent - it does not control your mouse, click buttons, or automate other apps&lt;/li&gt;
&lt;li&gt;AI agents are still planned, not shipped&lt;/li&gt;
&lt;li&gt;Workspace approach means you need to adopt their platform rather than automating your existing tools&lt;/li&gt;
&lt;li&gt;Limited automation capabilities compared to actual agent tools on this list&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;10. Agent Zero&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; An open-source autonomous AI agent framework that runs in a self-contained Dockerized Linux environment. Agent Zero is designed for advanced experimentation - it can use and create its own tools, learn from past interactions, spawn subordinate agents for complex tasks, and self-correct when things go wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fully autonomous agent that can create and use its own tools&lt;/li&gt;
&lt;li&gt;Persistent memory system with AI-filtered retrieval of relevant past interactions&lt;/li&gt;
&lt;li&gt;Multi-agent cooperation - spawns subordinate agents for complex task delegation&lt;/li&gt;
&lt;li&gt;Integrated browser and private search engine for web research&lt;/li&gt;
&lt;li&gt;Extensible framework - integrate any LLM, modify behaviors, add capabilities&lt;/li&gt;
&lt;li&gt;Open source on &lt;a href="https://github.com/agent0ai/agent-zero" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Developers and AI enthusiasts who want a flexible, extensible agent framework for experimentation and custom automation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free and open source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Platforms:&lt;/strong&gt; Any platform that runs Docker (macOS, Windows, Linux).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highly extensible - build custom agent behaviors and tools&lt;/li&gt;
&lt;li&gt;Multi-agent cooperation enables complex task decomposition&lt;/li&gt;
&lt;li&gt;Persistent memory improves over time&lt;/li&gt;
&lt;li&gt;Active community (3.4k+ GitHub stars) and ongoing development&lt;/li&gt;
&lt;li&gt;No vendor lock-in - bring your own LLM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Framework, not a product - requires significant setup and configuration&lt;/li&gt;
&lt;li&gt;Runs in Docker, not on your actual desktop&lt;/li&gt;
&lt;li&gt;Steep learning curve for non-developers&lt;/li&gt;
&lt;li&gt;Experimental by nature - reliability varies depending on task complexity&lt;/li&gt;
&lt;li&gt;Not designed for end-user desktop automation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Comparison Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Voice&lt;/th&gt;
&lt;th&gt;Open Source&lt;/th&gt;
&lt;th&gt;Pricing&lt;/th&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Privacy&lt;/th&gt;
&lt;th&gt;Browser Control&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fazm&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full desktop&lt;/td&gt;
&lt;td&gt;Push-to-talk&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;macOS&lt;/td&gt;
&lt;td&gt;Local-first&lt;/td&gt;
&lt;td&gt;DOM-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ChatGPT Atlas&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Browser only&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;$20/mo+&lt;/td&gt;
&lt;td&gt;macOS&lt;/td&gt;
&lt;td&gt;Cloud (OpenAI)&lt;/td&gt;
&lt;td&gt;Screenshot-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Perplexity Comet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Browser only&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;$20/mo+&lt;/td&gt;
&lt;td&gt;macOS, Windows, Android, iOS&lt;/td&gt;
&lt;td&gt;Cloud (Perplexity)&lt;/td&gt;
&lt;td&gt;Screenshot-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Simular&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full desktop&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Free tier + paid&lt;/td&gt;
&lt;td&gt;macOS (Silicon)&lt;/td&gt;
&lt;td&gt;Cloud&lt;/td&gt;
&lt;td&gt;Vision-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Highlight AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Observe only&lt;/td&gt;
&lt;td&gt;Voice Q&amp;amp;A&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Free (premium coming)&lt;/td&gt;
&lt;td&gt;macOS, Windows&lt;/td&gt;
&lt;td&gt;Local&lt;/td&gt;
&lt;td&gt;N/A (no actions)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BrowserOS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Browser only&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (AGPL-3.0)&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;macOS, Windows, Linux&lt;/td&gt;
&lt;td&gt;Local option&lt;/td&gt;
&lt;td&gt;DOM + screenshot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Composio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Browser only&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Chrome (any OS)&lt;/td&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;td&gt;Screenshot + API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bytebot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Container desktop&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Docker (any OS)&lt;/td&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;td&gt;Screenshot-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Macro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Workspace&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Free tier + paid&lt;/td&gt;
&lt;td&gt;macOS, web&lt;/td&gt;
&lt;td&gt;Cloud&lt;/td&gt;
&lt;td&gt;N/A (no browser control)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Zero&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Container desktop&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Docker (any OS)&lt;/td&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;td&gt;Integrated browser&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  How We Evaluated
&lt;/h2&gt;

&lt;p&gt;We tested each tool across several real-world tasks to understand practical performance, not just feature lists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test tasks included:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Booking a flight on a travel site (multi-step form with date pickers, filters, and payment)&lt;/li&gt;
&lt;li&gt;Replying to a specific email in Gmail&lt;/li&gt;
&lt;li&gt;Extracting data from a webpage into a spreadsheet&lt;/li&gt;
&lt;li&gt;Filing an expense report using data from a PDF&lt;/li&gt;
&lt;li&gt;Creating a code file in VS Code and running a test&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What we measured:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task completion rate - did the agent finish the job without getting stuck?&lt;/li&gt;
&lt;li&gt;Speed - how long from command to completion?&lt;/li&gt;
&lt;li&gt;Accuracy - did it click the right things and fill in the correct data?&lt;/li&gt;
&lt;li&gt;Recovery - when something went wrong, could the agent correct itself?&lt;/li&gt;
&lt;li&gt;Setup time - how long from download to first successful automation?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Methodology notes:&lt;/strong&gt; We ran each test multiple times across different days to account for variability. Browser-only tools were only tested on web-based tasks (they cannot do the desktop tasks). We used each tool's recommended configuration and latest available version as of March 2026.&lt;/p&gt;

&lt;p&gt;Not every tool is designed for every test. Highlight AI, for example, is an observer - it is not trying to book flights or fill forms. We evaluated each tool against its stated purpose and compared across categories where tools overlap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The AI desktop agent landscape in 2026 is genuinely useful but still fragmented. The right tool depends on what you need to automate and how much you care about privacy, voice control, and scope.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you want the broadest automation with voice and privacy&lt;/strong&gt;, Fazm is the clear pick. It is the only tool that controls your entire desktop, responds to voice commands, processes data locally, and costs nothing. The tradeoff is macOS-only support for now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If your work lives in a browser and you want a polished experience&lt;/strong&gt;, ChatGPT Atlas and Perplexity Comet both deliver. Atlas has stronger automation. Comet has better search. Both cost $20/month and both are browser-only. Newer entrants like &lt;a href="https://fazm.ai/compare/claude-cowork" rel="noopener noreferrer"&gt;Claude Cowork&lt;/a&gt; take a cloud VM approach, while &lt;a href="https://fazm.ai/compare/perplexity-personal-computer" rel="noopener noreferrer"&gt;Perplexity Personal Computer&lt;/a&gt; runs on dedicated Mac Mini hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you want an open-source AI browser&lt;/strong&gt;, BrowserOS is the most promising option - cross-platform, model-flexible, and genuinely community-driven.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you are looking at Apple's built-in AI&lt;/strong&gt;, &lt;a href="https://fazm.ai/compare/apple-intelligence" rel="noopener noreferrer"&gt;Apple Intelligence&lt;/a&gt; ships with every Mac but is limited to Siri, Writing Tools, and in-app suggestions - it does not offer full desktop automation. For a comparison of lightweight agents in the Apple ecosystem, see &lt;a href="https://fazm.ai/compare/sky" rel="noopener noreferrer"&gt;Sky vs Fazm&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you are a developer building custom automation&lt;/strong&gt;, Composio, Bytebot, and Agent Zero each offer different angles on the same idea: open-source frameworks for building your own agent workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you need a screen-aware assistant (not an agent)&lt;/strong&gt;, Highlight AI is excellent at what it does - observing, summarizing, and answering questions about your screen - even if it does not take actions.&lt;/p&gt;

&lt;p&gt;The trajectory of this space is clear: agents are getting more capable, more reliable, and more integrated into how we work. The question is no longer whether AI agents can automate your desktop, but which one fits the way you work. If you are coming from traditional automation tools, our posts on alternatives to &lt;a href="https://fazm.ai/blog/alfred-alternative-ai" rel="noopener noreferrer"&gt;Alfred&lt;/a&gt;, &lt;a href="https://fazm.ai/blog/keyboard-maestro-alternative-ai" rel="noopener noreferrer"&gt;Keyboard Maestro&lt;/a&gt;, &lt;a href="https://fazm.ai/blog/automator-alternative-mac-2026" rel="noopener noreferrer"&gt;Automator&lt;/a&gt;, and &lt;a href="https://fazm.ai/blog/zapier-alternative-desktop-agent" rel="noopener noreferrer"&gt;Zapier&lt;/a&gt; explain how AI agents compare to what you are using today. For open-source options specifically, see our &lt;a href="https://fazm.ai/blog/open-source-ai-agents-mac-2026" rel="noopener noreferrer"&gt;open-source AI agents for Mac roundup&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can download Fazm for free at &lt;a href="https://fazm.ai/download" rel="noopener noreferrer"&gt;fazm.ai/download&lt;/a&gt;, explore the source code on &lt;a href="https://github.com/m13v/fazm" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, or join the waitlist at &lt;a href="https://fazm.ai" rel="noopener noreferrer"&gt;fazm.ai&lt;/a&gt; for early access to upcoming features.&lt;/p&gt;

&lt;h2&gt;
  
  
  More on This Topic
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/open-source-ai-agents-mac-2026" rel="noopener noreferrer"&gt;Open-Source AI Agents You Can Run Locally on Your Mac in 2026&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/zapier-alternative-desktop-agent" rel="noopener noreferrer"&gt;Zapier Alternative for Desktop: Why AI Agents Beat Cloud Automation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/mcp-servers-beyond-chat-desktop-automation" rel="noopener noreferrer"&gt;Using MCP Servers for Desktop Automation, Not Just Chat&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


</description>
      <category>ai</category>
      <category>productivity</category>
      <category>automation</category>
      <category>tools</category>
    </item>
    <item>
      <title>You Do Not Need an MCP Server for Every Mac App - Accessibility APIs as a Universal Interface</title>
      <dc:creator>Matthew Diakonov</dc:creator>
      <pubDate>Wed, 18 Mar 2026 04:00:11 +0000</pubDate>
      <link>https://forem.com/m13v/you-do-not-need-an-mcp-server-for-every-mac-app-accessibility-apis-as-a-universal-interface-27bn</link>
      <guid>https://forem.com/m13v/you-do-not-need-an-mcp-server-for-every-mac-app-accessibility-apis-as-a-universal-interface-27bn</guid>
      <description>&lt;h1&gt;
  
  
  You Do Not Need an MCP Server for Every Mac App
&lt;/h1&gt;

&lt;p&gt;The Model Context Protocol is great for connecting AI agents to external services. But when it comes to controlling native Mac apps, there is a simpler approach that most people overlook.&lt;/p&gt;

&lt;p&gt;Instead of building a separate MCP server for Mail, another for Calendar, another for Finder, and another for every other app you want your agent to use - just use the macOS accessibility API. One interface, every app.&lt;/p&gt;

&lt;h2&gt;
  
  
  The MCP Per-App Problem
&lt;/h2&gt;

&lt;p&gt;The typical setup for an AI agent that controls Mac apps looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MCP server for browser automation&lt;/li&gt;
&lt;li&gt;MCP server for file system operations&lt;/li&gt;
&lt;li&gt;MCP server for email&lt;/li&gt;
&lt;li&gt;MCP server for calendar&lt;/li&gt;
&lt;li&gt;Custom MCP server for each additional app&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each one needs to be built, configured, maintained, and kept in sync. Managing 10+ MCP servers is genuinely painful. Configuration files, version mismatches, servers that crash silently.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Accessibility API Alternative
&lt;/h2&gt;

&lt;p&gt;Every well-built Mac app exposes its UI through the accessibility framework. This is the same interface that screen readers like VoiceOver use. It gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read any element&lt;/strong&gt; on screen - buttons, text fields, menus, labels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Perform actions&lt;/strong&gt; - click, type, select, scroll&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Navigate the UI hierarchy&lt;/strong&gt; - find elements by role, label, or position&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Works across all apps&lt;/strong&gt; - one API, not one-per-app&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An AI agent that speaks the accessibility API can control Mail, Calendar, Finder, Safari, Terminal, Xcode, Slack, and any other app without a single line of app-specific integration code.&lt;/p&gt;
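&lt;p&gt;As a quick illustration (not Fazm's own code), macOS ships a scriptable front end to this same accessibility layer through System Events. Assuming your terminal has Accessibility permission and Calculator is running, a single line can press a button in another app:&lt;/p&gt;

```shell
# Click a button in Calculator through the accessibility layer (System Events).
# Requires Accessibility permission for your terminal in System Settings.
osascript -e 'tell application "System Events" to tell process "Calculator" to click button "1" of window 1'
```

&lt;p&gt;The same pattern works against any app that exposes its UI tree - swap the process name and the element query.&lt;/p&gt;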

&lt;h2&gt;
  
  
  How to Explore It
&lt;/h2&gt;

&lt;p&gt;The Accessibility Inspector is built into Xcode, and most people do not even know it exists. Open it from Xcode &amp;gt; Open Developer Tool &amp;gt; Accessibility Inspector. Point it at any app and you can see the entire UI tree - every element, every label, every available action.&lt;/p&gt;

&lt;p&gt;This is the best free macOS automation tool nobody talks about. Before building an MCP server for a specific app, open the Accessibility Inspector and see if the app already exposes everything you need through the accessibility tree. It usually does.&lt;/p&gt;
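&lt;p&gt;For a command-line taste of the same tree, here is a sketch that assumes TextEdit is running and your terminal has Accessibility permission (element names and roles vary per app):&lt;/p&gt;

```shell
# List the accessibility roles of the top-level elements in TextEdit's front window.
# Requires Accessibility permission for your terminal in System Settings.
osascript -e 'tell application "System Events" to tell process "TextEdit" to get role of every UI element of front window'
```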

&lt;h2&gt;
  
  
  When You Still Need MCP
&lt;/h2&gt;

&lt;p&gt;Accessibility APIs are for UI-level interaction. If you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API-level data access&lt;/strong&gt; (reading a database, querying an API)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background processing&lt;/strong&gt; (running without a visible window)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-machine operations&lt;/strong&gt; (controlling a remote server)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then MCP is the right tool. The sweet spot is using accessibility APIs for local app control and MCP for everything else.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Fazm uses accessibility APIs as its primary interface for controlling macOS apps. &lt;a href="https://github.com/m13v/fazm" rel="noopener noreferrer"&gt;Open source on GitHub&lt;/a&gt;. Discussed in &lt;a href="https://www.reddit.com/r/ClaudeAI/" rel="noopener noreferrer"&gt;r/ClaudeAI&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Keep Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/claude-as-execution-layer-markdown-mcp" rel="noopener noreferrer"&gt;Using Claude as an Execution Layer - Markdown Specs, MCP Tools, No Traditional Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/skills-vs-mcp-vs-plugins-explained" rel="noopener noreferrer"&gt;Skills vs MCP vs Plugins - What's the Difference?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/mcp-server-management-fewer-is-better" rel="noopener noreferrer"&gt;I Installed 20 MCP Servers and Everything Got Worse - Why Fewer Is Better&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>a11y</category>
      <category>macos</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Building Native macOS Apps with Claude Is a Different Beast Than Web Dev</title>
      <dc:creator>Matthew Diakonov</dc:creator>
      <pubDate>Wed, 18 Mar 2026 03:59:25 +0000</pubDate>
      <link>https://forem.com/m13v/building-native-macos-apps-with-claude-is-a-different-beast-than-web-dev-1d7n</link>
      <guid>https://forem.com/m13v/building-native-macos-apps-with-claude-is-a-different-beast-than-web-dev-1d7n</guid>
      <description>&lt;h1&gt;
  
  
  Building Native macOS Apps with Claude Is a Different Beast Than Web Dev
&lt;/h1&gt;

&lt;p&gt;If you have used Claude to build a React app, you know it is remarkably good. Drop in a description, get working code. But try building a native macOS app in Swift and the experience changes completely.&lt;/p&gt;

&lt;p&gt;The reason is simple - training data. There are millions of React tutorials, Stack Overflow answers, and open source repos. For AppKit? Maybe a few thousand relevant posts, many of them outdated. SwiftUI is better but still has gaps, especially for macOS-specific features that differ from iOS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Claude Hallucinates
&lt;/h2&gt;

&lt;p&gt;The most common failure mode is Claude inventing APIs that do not exist. It will confidently write &lt;code&gt;NSWindow.setFloatingBehavior(.alwaysOnTop)&lt;/code&gt; - a method that sounds right but has never existed. The real approach involves setting &lt;code&gt;window.level&lt;/code&gt; to &lt;code&gt;.floating&lt;/code&gt; and configuring the collection behavior.&lt;/p&gt;

&lt;p&gt;Accessibility APIs are even worse. Claude will suggest &lt;code&gt;AXUIElementCopyAttributeValue&lt;/code&gt; calls with attribute names that are close to real ones but slightly wrong. &lt;code&gt;kAXTitleAttribute&lt;/code&gt; exists, but Claude sometimes uses &lt;code&gt;kAXLabelAttribute&lt;/code&gt; (which does not) or mixes up the attribute constants with their string values.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works
&lt;/h2&gt;

&lt;p&gt;The fix is a detailed CLAUDE.md file. Not a generic one - a file that contains actual working code snippets for the patterns your app uses.&lt;/p&gt;

&lt;p&gt;Include things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How your app creates and manages windows&lt;/li&gt;
&lt;li&gt;Working accessibility API call patterns with correct attribute names&lt;/li&gt;
&lt;li&gt;SwiftUI view patterns that compile on macOS (not iOS)&lt;/li&gt;
&lt;li&gt;Which APIs require specific entitlements or permissions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When Claude has concrete examples of working code in context, it extends those patterns correctly. Without them, it falls back to interpolating from its training data - and for native macOS, that training data has too many gaps.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Investment Pays Off
&lt;/h2&gt;

&lt;p&gt;Building the CLAUDE.md takes time upfront. But once it covers your core patterns, Claude can extend them reliably. The second accessibility API wrapper takes 30 seconds. The twentieth SwiftUI view follows the same structure automatically. The key is giving Claude ground truth rather than letting it guess.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fazm is an open source macOS AI agent. &lt;a href="https://github.com/m13v/fazm" rel="noopener noreferrer"&gt;Open source on GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/building-with-claude-code-macos-agent" rel="noopener noreferrer"&gt;Building a macOS Desktop Agent with Claude - How AI Wrote Most of Its Own Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/native-swift-menu-bar-ai-agent" rel="noopener noreferrer"&gt;Why Native Swift Menu Bar Apps Are the Right UI for AI Agents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/claude-daily-use-cases-voice-desktop" rel="noopener noreferrer"&gt;What People Actually Use Claude For Daily - Tool Use, Voice Control, and Desktop Automation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


</description>
      <category>swift</category>
      <category>macos</category>
      <category>webdev</category>
      <category>ai</category>
    </item>
    <item>
      <title>Build a Local-First AI Agent with Ollama - No API Keys, No Cloud, No Signup</title>
      <dc:creator>Matthew Diakonov</dc:creator>
      <pubDate>Wed, 18 Mar 2026 03:59:24 +0000</pubDate>
      <link>https://forem.com/m13v/build-a-local-first-ai-agent-with-ollama-no-api-keys-no-cloud-no-signup-274f</link>
      <guid>https://forem.com/m13v/build-a-local-first-ai-agent-with-ollama-no-api-keys-no-cloud-no-signup-274f</guid>
      <description>&lt;h1&gt;
  
  
  Build a Local-First AI Agent with Ollama
&lt;/h1&gt;

&lt;p&gt;The most common friction point with AI tools is setup. Create an account. Add a credit card. Generate an API key. Configure rate limits. Handle billing alerts.&lt;/p&gt;

&lt;p&gt;What if you could skip all of that?&lt;/p&gt;

&lt;p&gt;With Ollama running on your Mac, you can run AI models locally with zero cloud dependency. No account. No API key. No credit card. No data leaving your machine. Just download and run.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Ollama&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;ollama

&lt;span class="c"&gt;# Pull a model&lt;/span&gt;
ollama pull qwen2.5:14b

&lt;span class="c"&gt;# It is running&lt;/span&gt;
ollama list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the entire setup. The model runs on your Apple Silicon GPU. Inference stays on your machine. Your data never touches a remote server.&lt;/p&gt;
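&lt;p&gt;Ollama also exposes a local HTTP API on port 11434 by default, so an agent can talk to it with no SDK at all. A minimal sketch, assuming the model pulled above:&lt;/p&gt;

```shell
# Send a prompt to the local Ollama server; the request never leaves localhost.
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:14b",
  "prompt": "List three file-organization tasks a desktop agent could automate.",
  "stream": false
}'
```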

&lt;h2&gt;
  
  
  What Works Well Locally
&lt;/h2&gt;

&lt;p&gt;For desktop automation tasks - the kind where an agent fills in forms, navigates apps, and executes multi-step workflows - local models in the 7-14B range are surprisingly capable. They handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Action planning.&lt;/strong&gt; "Open Safari, go to this URL, click this button" - straightforward sequences that smaller models handle reliably.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text extraction.&lt;/strong&gt; Reading structured data from screen content and reformatting it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple reasoning.&lt;/strong&gt; Deciding which app to open, which field to fill, what value to enter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where local models struggle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complex multi-step reasoning.&lt;/strong&gt; A 20-step workflow with branching logic might need a larger model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nuanced writing.&lt;/strong&gt; Drafting a sensitive email or crafting a specific tone - cloud models are still better here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision tasks.&lt;/strong&gt; Local vision models exist but are significantly behind cloud offerings.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Hybrid Approach
&lt;/h2&gt;

&lt;p&gt;You do not have to choose one or the other. Fazm supports both local models via Ollama and cloud models like Claude. The practical approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local for routine tasks.&lt;/strong&gt; Form filling, app navigation, file organization - run these on Ollama with zero latency and complete privacy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud for complex tasks.&lt;/strong&gt; Multi-step reasoning, nuanced text generation, vision-heavy workflows - use Claude when accuracy matters more than privacy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your choice, per task.&lt;/strong&gt; There is no reason to commit to one approach for everything.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting Started with Fazm + Ollama
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Install Ollama and pull a model&lt;/li&gt;
&lt;li&gt;Download &lt;a href="https://github.com/m13v/fazm" rel="noopener noreferrer"&gt;Fazm&lt;/a&gt; and build it&lt;/li&gt;
&lt;li&gt;Set the model provider to Ollama in settings&lt;/li&gt;
&lt;li&gt;Start automating - fully local, fully private, no API keys&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Fazm supports both Ollama (local) and Claude (cloud) for maximum flexibility. &lt;a href="https://github.com/m13v/fazm" rel="noopener noreferrer"&gt;Open source on GitHub&lt;/a&gt;. Discussed in &lt;a href="https://www.reddit.com/r/ollama/" rel="noopener noreferrer"&gt;r/ollama&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  More on This Topic
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/ollama-local-vision-monitoring" rel="noopener noreferrer"&gt;Using Ollama for Local Vision Monitoring on Apple Silicon&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/local-llm-trend-real-workflows" rel="noopener noreferrer"&gt;Local LLMs Are Not Just for Inference Anymore - Real Workflows on Your Machine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/agentic-ai-local-first-vs-cloud" rel="noopener noreferrer"&gt;Most AI Agent Development Is Cloud-First - Here's Why Local-First Is Better&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


</description>
      <category>ai</category>
      <category>ollama</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>What Is an AI Desktop Agent? Everything You Need to Know in 2026</title>
      <dc:creator>Matthew Diakonov</dc:creator>
      <pubDate>Wed, 18 Mar 2026 03:55:06 +0000</pubDate>
      <link>https://forem.com/m13v/what-is-an-ai-desktop-agent-everything-you-need-to-know-in-2026-4ic9</link>
      <guid>https://forem.com/m13v/what-is-an-ai-desktop-agent-everything-you-need-to-know-in-2026-4ic9</guid>
      <description>&lt;h1&gt;
  
  
  What Is an AI Desktop Agent? Everything You Need to Know in 2026
&lt;/h1&gt;

&lt;p&gt;An AI desktop agent is software that can see your screen, understand what is on it, and take real actions on your computer - clicking buttons, typing text, navigating between applications, and completing multi-step tasks on your behalf. You tell it what you want in plain language, and it figures out the steps and executes them, just like a human assistant sitting at your keyboard.&lt;/p&gt;

&lt;p&gt;That is the core idea. But like most things in AI right now, the details matter a lot. The term "AI agent" gets thrown around loosely, and it is easy to confuse desktop agents with chatbots, copilots, browser extensions, and traditional automation tools. They are fundamentally different, and understanding those differences will save you from choosing the wrong tool for the job.&lt;/p&gt;

&lt;h2&gt;
  
  
  How AI Desktop Agents Differ from Other AI Tools
&lt;/h2&gt;

&lt;p&gt;The AI landscape is crowded with tools that sound similar but work in very different ways. Here is how AI desktop agents compare to what you are probably already using.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chatbots (ChatGPT, Claude, Gemini)
&lt;/h3&gt;

&lt;p&gt;Chatbots are incredibly smart. They can write essays, analyze data, debug code, and answer complex questions. But they live inside a text window. When a chatbot tells you "go to Settings, click Privacy, then toggle off Location Services," you still have to do every single step yourself. The chatbot answers your question - it does not act on it. There is a wall between the AI's intelligence and your actual computer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Copilots (GitHub Copilot, Microsoft Copilot)
&lt;/h3&gt;

&lt;p&gt;Copilots sit inside a specific application and suggest actions. GitHub Copilot suggests code as you type. Microsoft Copilot suggests edits in Word or formulas in Excel. They are useful, but they are reactive - they wait for you to accept or reject their suggestions. You are still the one clicking, editing, and navigating. A copilot whispers advice. A desktop agent does the work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Browser Extensions (ChatGPT Atlas, various AI assistants)
&lt;/h3&gt;

&lt;p&gt;Some AI tools work as browser extensions or browser-based agents. They can interact with web pages - filling forms, clicking buttons, navigating sites. But they are confined to the browser. They cannot open Finder, interact with native Mac apps, manage local files, switch between desktop applications, or do anything outside the browser window. For anyone whose workflow involves more than just web apps, that is a significant limitation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traditional Automation (Zapier, IFTTT, Make)
&lt;/h3&gt;

&lt;p&gt;Cloud automation platforms connect web services through APIs. They are great at tasks like "when I get a Slack message with a specific keyword, create a Jira ticket." But they operate entirely in the cloud, connecting service to service. They cannot interact with your desktop, see your screen, or control applications that do not have a public API. They also require you to build workflows step by step in advance - you need to know exactly what triggers what, and program it manually.&lt;/p&gt;

&lt;h3&gt;
  
  
  AI Desktop Agents
&lt;/h3&gt;

&lt;p&gt;An AI desktop agent combines the intelligence of a chatbot with the ability to actually control your entire computer. It sees what is on your screen, understands the context, and takes action across any application - browser, native apps, files, system settings. You describe what you want in natural language, and the agent plans and executes the steps itself.&lt;/p&gt;

&lt;p&gt;The key difference is scope and autonomy. A chatbot advises. A copilot suggests. A browser agent acts within one app. An AI desktop agent operates across your entire desktop, handling multi-app workflows that would otherwise require you to manually click through dozens of screens.&lt;/p&gt;

&lt;h2&gt;
  
  
  How AI Desktop Agents Work
&lt;/h2&gt;

&lt;p&gt;Under the hood, an AI desktop agent follows a loop of perceive, plan, and act. Here is a simplified breakdown of what happens every time you give a command.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Screen Understanding
&lt;/h3&gt;

&lt;p&gt;The agent needs to know what is on your screen before it can do anything useful. There are two main approaches to this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Screenshot-based perception&lt;/strong&gt; takes a picture of your screen and sends it to a vision model that interprets the image - identifying buttons, text fields, menus, and other elements by looking at the pixels. This is flexible but slow and sometimes inaccurate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured access&lt;/strong&gt; reads the underlying data directly. For web pages, this means reading the DOM (Document Object Model) - the structured blueprint of every element on the page. For native macOS apps, it means reading the accessibility tree that the operating system maintains. This approach is faster, more accurate, and more private because no screenshots leave your machine.&lt;/p&gt;
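&lt;p&gt;As a rough sketch of the structured approach, here is how an agent might search a tree of UI elements for a specific control - the tree shape and field names are simplified stand-ins, not the actual macOS accessibility API:&lt;/p&gt;

```python
# Simplified stand-in for an accessibility tree: real macOS AX nodes
# carry roles, titles, and frames like this, but the actual API differs.
TREE = {
    "role": "window",
    "title": "Mail",
    "children": [
        {"role": "button", "title": "Reply", "frame": (40, 120, 80, 30), "children": []},
        {"role": "textfield", "title": "Search", "frame": (200, 20, 300, 24), "children": []},
    ],
}

def find_element(node, role, title):
    """Depth-first search of the tree - no pixels, no vision model."""
    if node["role"] == role and node["title"] == title:
        return node
    for child in node.get("children", []):
        found = find_element(child, role, title)
        if found:
            return found
    return None

reply = find_element(TREE, "button", "Reply")
# The agent now knows exactly where the element is, with no
# screenshot ever leaving the machine.
```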

&lt;p&gt;Most modern desktop agents use a hybrid approach. We wrote a detailed breakdown of &lt;a href="https://fazm.ai/blog/how-ai-agents-see-your-screen-dom-vs-screenshots" rel="noopener noreferrer"&gt;how AI agents see your screen using DOM control versus screenshots&lt;/a&gt; if you want to go deeper on the technical side.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Intent Processing
&lt;/h3&gt;

&lt;p&gt;Once the agent understands what is on screen, it sends your command to a large language model (LLM) for planning. The LLM interprets your natural language instruction - "reply to Sarah's email and tell her the meeting is moved to Thursday" - and breaks it into a sequence of concrete steps: open the email app, find Sarah's email, click reply, type the message, click send.&lt;/p&gt;

&lt;p&gt;This is where the intelligence lives. The LLM does not just follow a script. It reasons about what needs to happen, adapts to the current state of your screen, and handles situations it has never seen before.&lt;/p&gt;
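&lt;p&gt;The output of this planning step is essentially a list of concrete actions. Here is a toy sketch, with a stub standing in for the real LLM call - the action names and steps are illustrative, not any product's actual format:&lt;/p&gt;

```python
# Toy planner: in a real agent the step list comes back from an LLM;
# here a stub stands in for the model so the shape of the output is clear.

def plan(command: str, screen_state: dict) -> list[dict]:
    """Return concrete steps for the executor. These steps are what a
    model might produce for the email example - illustrative only."""
    return [
        {"action": "open_app",   "target": "Mail"},
        {"action": "find_email", "sender": "Sarah"},
        {"action": "click",      "target": "Reply"},
        {"action": "type",       "text": "The meeting is moved to Thursday."},
        {"action": "click",      "target": "Send"},
    ]

steps = plan("reply to Sarah's email and tell her the meeting is moved to Thursday",
             {"frontmost_app": "Finder"})
```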

&lt;h3&gt;
  
  
  3. Action Execution
&lt;/h3&gt;

&lt;p&gt;The agent carries out each planned step by controlling your mouse and keyboard - or, with DOM-based access, by interacting with UI elements directly at the programmatic level. It clicks buttons, types text, scrolls pages, switches between apps, and navigates menus.&lt;/p&gt;

&lt;p&gt;After each action, the agent checks the screen again to verify the result and plan the next step. Did the click work? Did a new page load? Did an error appear? This feedback loop lets the agent adapt in real time rather than blindly following a pre-determined script.&lt;/p&gt;
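&lt;p&gt;The whole cycle can be sketched as a short loop - the three callbacks below stand in for the real screen reader, planner, and input controller, and all names are illustrative:&lt;/p&gt;

```python
# Minimal sketch of the perceive -> plan -> act loop described above.

def run_agent(goal, perceive, plan_next, execute, max_steps=20):
    for _ in range(max_steps):
        state = perceive()               # what is on screen right now?
        step = plan_next(goal, state)    # model picks the next action
        if step is None:                 # planner reports the goal is met
            return True
        execute(step)                    # click / type / scroll
        # Looping back re-perceives, so a failed action is noticed
        # on the next pass instead of being silently ignored.
    return False                         # step budget exhausted

# Tiny fake environment to show the feedback loop closing:
screen = {"app": "Finder"}
def perceive(): return dict(screen)
def plan_next(goal, state):
    return None if state["app"] == goal else {"action": "open_app", "target": goal}
def execute(step): screen["app"] = step["target"]

done = run_agent("Safari", perceive, plan_next, execute)
```

&lt;p&gt;The step budget is a deliberate safety choice: an agent that cannot verify progress should stop, not loop forever.&lt;/p&gt;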

&lt;h2&gt;
  
  
  What Can an AI Desktop Agent Do?
&lt;/h2&gt;

&lt;p&gt;The short answer: anything you can do with a mouse and keyboard. The longer answer involves some practical examples that show where these tools really shine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fill Out Forms Across Apps
&lt;/h3&gt;

&lt;p&gt;Expense reports, CRM entries, job applications, insurance forms - any repetitive form-filling task. The agent knows your information (name, address, company, common details) and can populate fields across any application without you re-entering the same data for the hundredth time.&lt;/p&gt;
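&lt;p&gt;A minimal sketch of the idea, assuming a saved profile and a made-up set of field labels - the agent matches the labels it sees on screen against data it already knows:&lt;/p&gt;

```python
# Profile-driven form filling: map on-screen field labels to known data.
# The profile, aliases, and labels are all illustrative.
PROFILE = {"name": "Ada Lovelace", "email": "ada@example.com",
           "company": "Analytical Engines"}

ALIASES = {
    "full name": "name", "name": "name",
    "email address": "email", "e-mail": "email", "email": "email",
    "employer": "company", "company": "company",
}

def fill_form(field_labels):
    """Return {label: value} for every field the profile can answer."""
    filled = {}
    for label in field_labels:
        key = ALIASES.get(label.strip().lower())
        if key:
            filled[label] = PROFILE[key]
    return filled

values = fill_form(["Full Name", "Email Address", "Employer", "Favorite color"])
# "Favorite color" is skipped: the agent only fills what it actually knows.
```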

&lt;h3&gt;
  
  
  Move Data Between Desktop and Web Apps
&lt;/h3&gt;

&lt;p&gt;Copy data from a spreadsheet into a web-based project management tool. Extract information from emails and add it to a local database. Grab content from a PDF and paste it into a document. These cross-app workflows are &lt;a href="https://fazm.ai/blog/cross-app-workflows-ai-desktop-agent" rel="noopener noreferrer"&gt;where desktop agents save the most time&lt;/a&gt; because they eliminate the manual copy-paste-switch-paste cycle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automate Repetitive Workflows
&lt;/h3&gt;

&lt;p&gt;Any task you do more than twice a week in roughly the same way is a candidate for automation. Organizing files, sorting emails, updating records, compiling reports. Our post on &lt;a href="https://fazm.ai/blog/boring-automation-tasks-ai-agent" rel="noopener noreferrer"&gt;boring automation tasks that AI agents handle best&lt;/a&gt; covers the most common examples.&lt;/p&gt;

&lt;h3&gt;
  
  
  Research and Data Gathering
&lt;/h3&gt;

&lt;p&gt;Need to compare prices across five vendors, compile a list of contacts, or pull information from multiple websites into a single document? An agent handles the tedious navigation while you focus on the analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of AI Desktop Agents
&lt;/h2&gt;

&lt;p&gt;Not all AI desktop agents are built the same way. The architecture matters because it affects speed, privacy, reliability, and what the agent can actually control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud VM Agents
&lt;/h3&gt;

&lt;p&gt;Products like &lt;a href="https://fazm.ai/compare/claude-cowork" rel="noopener noreferrer"&gt;Claude Cowork&lt;/a&gt; and &lt;a href="https://fazm.ai/compare/perplexity-personal-computer" rel="noopener noreferrer"&gt;Perplexity Personal Computer&lt;/a&gt; run your tasks on a virtual machine in the cloud. The agent operates a remote desktop that you watch via a video feed. This approach works on any operating system and does not require local software, but it introduces latency and privacy concerns (your screen data lives on someone else's server), and it cannot interact with your local files or native apps directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Native Desktop Agents
&lt;/h3&gt;

&lt;p&gt;Native agents run directly on your computer and interact with your actual desktop environment. &lt;a href="https://fazm.ai" rel="noopener noreferrer"&gt;Fazm&lt;/a&gt; is an example - it runs natively on macOS, uses the accessibility API and DOM control for fast and accurate interactions, and processes screen data locally on your machine. Native agents can control everything on your desktop, including local files and apps that have no web interface.&lt;/p&gt;

&lt;p&gt;We wrote a detailed comparison of &lt;a href="https://fazm.ai/blog/native-desktop-agent-vs-cloud-vm" rel="noopener noreferrer"&gt;native desktop agents versus cloud VM approaches&lt;/a&gt; if you are trying to decide between the two.&lt;/p&gt;

&lt;h3&gt;
  
  
  Browser-Only Agents
&lt;/h3&gt;

&lt;p&gt;Browser-only agents like ChatGPT Atlas operate within the browser and can automate web-based tasks effectively. They are simpler to set up since they do not need system-level permissions, but they cannot interact with anything outside the browser window. For people whose work lives entirely in web apps, this might be enough. For everyone else, it is a significant limitation.&lt;/p&gt;

&lt;p&gt;For a broader look at how these products compare on features, speed, and privacy, check out our &lt;a href="https://fazm.ai/blog/best-ai-agents-desktop-automation-2026" rel="noopener noreferrer"&gt;roundup of the best AI agents for desktop automation in 2026&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Privacy and Safety
&lt;/h2&gt;

&lt;p&gt;Letting software control your computer raises legitimate questions about privacy and safety. Here is what to look for when evaluating any AI desktop agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Local vs Cloud Processing
&lt;/h3&gt;

&lt;p&gt;The biggest privacy question is where your screen data gets processed. Screenshot-based agents send images of your screen to cloud servers for analysis. Those images contain everything visible on your display - emails, documents, passwords, financial information.&lt;/p&gt;

&lt;p&gt;Agents that use local processing - reading the DOM or accessibility tree on your machine - keep your screen content on your device. Only the intent (what you want to do) gets sent to an AI model for planning, not images of what is on your screen.&lt;/p&gt;

&lt;p&gt;This distinction matters a lot if you work with sensitive information. We explore the full argument in &lt;a href="https://fazm.ai/blog/why-local-first-ai-agents-are-the-future" rel="noopener noreferrer"&gt;why local-first AI agents are the future&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Permission Models and Bounded Tools
&lt;/h3&gt;

&lt;p&gt;Good AI desktop agents do not operate with unlimited access. They use permission models that let you control what the agent can and cannot do. Can it send emails on your behalf, or only draft them? Can it delete files, or only read and create them? Can it make purchases, or only add items to a cart?&lt;/p&gt;

&lt;p&gt;The concept of &lt;a href="https://fazm.ai/blog/ai-agent-trust-bounded-tools-approval" rel="noopener noreferrer"&gt;bounded tools and approval workflows&lt;/a&gt; is becoming standard in the industry. The best agents ask for confirmation before taking high-impact actions and let you set boundaries upfront so the agent stays within safe limits.&lt;/p&gt;
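&lt;p&gt;A bounded-tool policy can be sketched as a gate that every action passes through - the action names and policy table here are illustrative, not any product's actual API:&lt;/p&gt;

```python
# Every action passes through a gate that can allow, block, or
# require explicit user approval. Policy entries are illustrative.
POLICY = {
    "read_file":   "allow",
    "draft_email": "allow",
    "send_email":  "ask",     # high-impact: confirm first
    "delete_file": "deny",    # outside the agreed boundary
}

def gate(action, approve):
    """approve is a callback that asks the user; True means proceed."""
    rule = POLICY.get(action, "deny")   # default-deny for unknown actions
    if rule == "allow":
        return True
    if rule == "ask":
        return approve(action)
    return False

can_read   = gate("read_file", approve=lambda a: False)   # allowed outright
can_delete = gate("delete_file", approve=lambda a: True)  # deny wins over approval
can_send   = gate("send_email", approve=lambda a: True)   # only because the user said yes
```

&lt;p&gt;The default-deny fallback is the important design choice: an action the policy has never heard of is blocked, not waved through.&lt;/p&gt;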

&lt;h3&gt;
  
  
  Open Source Transparency
&lt;/h3&gt;

&lt;p&gt;One of the strongest signals that an AI agent takes privacy seriously is whether its code is open source. When the codebase is public, you can inspect exactly what data is collected, where it is sent, and how it is stored. There is no "trust us" - you can verify it yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;If this is your first time trying an AI desktop agent, the setup is simpler than you might expect. You do not need a technical background, and most agents are ready to use within a few minutes of downloading them.&lt;/p&gt;

&lt;p&gt;Our &lt;a href="https://fazm.ai/blog/first-ai-computer-agent-beginners-guide" rel="noopener noreferrer"&gt;complete beginner's guide to setting up your first AI computer agent&lt;/a&gt; walks through everything step by step - choosing an agent, granting permissions, running your first tasks, and building up to more complex workflows.&lt;/p&gt;

&lt;p&gt;The learning curve is real but short. Most people go from skeptical to dependent within about a week of regular use, once the agent learns their patterns and they learn how to communicate effectively with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;AI desktop agents represent a genuine shift in how people interact with computers. Instead of learning where every button lives in every application and clicking through the same menus hundreds of times, you describe what you need and the agent handles the execution.&lt;/p&gt;

&lt;p&gt;They are not chatbots that advise. They are not copilots that suggest. They are autonomous software that sees your screen, understands context, and takes action across your entire desktop - any app, any workflow, any task you can do with a mouse and keyboard.&lt;/p&gt;

&lt;p&gt;The technology is here, it works, and it is improving fast. The question is not whether AI desktop agents will become a standard part of computer use - it is how quickly you start using one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ready to try it?&lt;/strong&gt; &lt;a href="https://fazm.ai" rel="noopener noreferrer"&gt;Fazm&lt;/a&gt; is free, open source, and built natively for macOS. Download it at &lt;a href="https://fazm.ai/download" rel="noopener noreferrer"&gt;fazm.ai/download&lt;/a&gt; or star the project on &lt;a href="https://github.com/m13v/fazm" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. You can also explore our detailed comparisons with &lt;a href="https://fazm.ai/compare/apple-intelligence" rel="noopener noreferrer"&gt;Apple Intelligence&lt;/a&gt;, &lt;a href="https://fazm.ai/compare/simular-ai" rel="noopener noreferrer"&gt;Simular AI&lt;/a&gt;, and other agents to find the right fit for your workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Keep Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/ai-agent-vs-chatbot-vs-copilot" rel="noopener noreferrer"&gt;AI Agent vs Chatbot vs Copilot: What Is the Difference?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/first-ai-computer-agent-beginners-guide" rel="noopener noreferrer"&gt;How to Set Up Your First AI Computer Agent (Complete Beginner's Guide)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/how-ai-agents-see-your-screen-dom-vs-screenshots" rel="noopener noreferrer"&gt;How AI Agents Actually See Your Screen: DOM Control vs Screenshots Explained&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>explainer</category>
      <category>automation</category>
    </item>
    <item>
      <title>How to Set Up Your First AI Computer Agent (Complete Beginner's Guide)</title>
      <dc:creator>Matthew Diakonov</dc:creator>
      <pubDate>Wed, 18 Mar 2026 03:54:20 +0000</pubDate>
      <link>https://forem.com/m13v/how-to-set-up-your-first-ai-computer-agent-complete-beginners-guide-1dag</link>
      <guid>https://forem.com/m13v/how-to-set-up-your-first-ai-computer-agent-complete-beginners-guide-1dag</guid>
      <description>&lt;h1&gt;
  
  
  How to Set Up Your First AI Computer Agent (Complete Beginner's Guide)
&lt;/h1&gt;

&lt;p&gt;You have probably seen the demos by now. Someone talks to their computer, and the computer just... does things. It opens apps, clicks buttons, fills out forms, sends emails - all on its own. It looks like magic. It also looks like something that would take a computer science degree to set up.&lt;/p&gt;

&lt;p&gt;It does not. Setting up your first AI computer agent is genuinely straightforward, and you can be running your first automated task in under ten minutes. This guide will walk you through everything from scratch - no technical background required.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Exactly Is an AI Computer Agent?
&lt;/h2&gt;

&lt;p&gt;Before you set anything up, let's make sure we are on the same page about what an AI computer agent actually is, because it is easy to confuse with things that sound similar but work very differently.&lt;/p&gt;

&lt;p&gt;An AI computer agent is software that can perform real actions on your computer. It moves your mouse, clicks buttons, types text, navigates between apps, fills in forms, and completes multi-step tasks - all based on instructions you give it, usually in plain English (or by voice).&lt;/p&gt;

&lt;p&gt;Think of it as a very capable assistant who is sitting at your computer, looking at your screen, and operating it for you. You say what you need done, and the agent figures out the steps and executes them.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Is This Different from Things You Already Use?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;It is not a chatbot.&lt;/strong&gt; Tools like ChatGPT and Claude are amazing at generating text, answering questions, and reasoning through problems. But they live inside a chat window. They tell you &lt;em&gt;what&lt;/em&gt; to do - they do not actually &lt;em&gt;do&lt;/em&gt; it. You still have to take the answer, switch to the right app, and manually carry out every step yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It is not Siri or Alexa.&lt;/strong&gt; Voice assistants can set timers, play music, and check the weather. But ask Siri to reply to a specific email, fill out an expense report, or book a flight on Kayak, and it cannot help you. These assistants handle a fixed set of simple commands - not open-ended computer tasks. Even &lt;a href="https://fazm.ai/compare/apple-intelligence" rel="noopener noreferrer"&gt;Apple Intelligence&lt;/a&gt;, which adds on-device AI features to macOS, does not cross this line - it enhances existing apps but does not control your computer autonomously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It is not traditional automation like Automator or Keyboard Maestro.&lt;/strong&gt; Those tools require you to program exact sequences of steps in advance. They are powerful but rigid - you need to know exactly what you want to automate and build the workflow yourself, step by step. An AI computer agent understands natural language and figures out the steps on its own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An AI computer agent combines the intelligence of a chatbot with the ability to actually control your computer.&lt;/strong&gt; You describe what you want in plain language. The agent plans the steps, then executes them on your screen - clicking, typing, and navigating just like a human would, except faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing Your First AI Computer Agent
&lt;/h2&gt;

&lt;p&gt;There are several AI computer agents available right now. Here is a quick overview to help you pick the right one for your situation.&lt;/p&gt;

&lt;h3&gt;
  
  
  If You Want the Easiest Free Option: Fazm
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://fazm.ai" rel="noopener noreferrer"&gt;Fazm&lt;/a&gt; is open source, free, and built specifically for macOS. It sits as a floating toolbar on your screen and takes voice commands through push-to-talk. It can control your entire desktop - not just the browser - including native apps, files, and documents. Fazm uses &lt;a href="https://fazm.ai/blog/how-ai-agents-see-your-screen-dom-vs-screenshots" rel="noopener noreferrer"&gt;direct browser DOM control instead of screenshots&lt;/a&gt;, which makes it significantly faster and more reliable than most alternatives.&lt;/p&gt;

&lt;h3&gt;
  
  
  If You Are Already Paying for ChatGPT Plus: ChatGPT Atlas
&lt;/h3&gt;

&lt;p&gt;ChatGPT Atlas is OpenAI's computer agent built into ChatGPT. It works through a text sidebar in your browser and can automate browser-based tasks. The limitation is that it only works inside the browser - it cannot control native Mac apps, manage files on your computer, or handle desktop-level tasks. It also costs $20/month as part of ChatGPT Plus.&lt;/p&gt;

&lt;h3&gt;
  
  
  If You Mainly Need Research: Perplexity Comet
&lt;/h3&gt;

&lt;p&gt;Perplexity Comet is a search-focused AI browser that can automate some web tasks. It is excellent for research-heavy workflows but limited in scope compared to a full desktop agent. It requires a Perplexity Pro subscription.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For this tutorial, we will use Fazm.&lt;/strong&gt; It is free, it works across your entire Mac (not just the browser), and it has the broadest range of capabilities. Everything we cover here will apply to any AI agent, but the specific setup steps will follow Fazm.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step-by-Step Setup with Fazm
&lt;/h2&gt;

&lt;p&gt;Let's get you up and running. This whole process takes about five minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Download Fazm
&lt;/h3&gt;

&lt;p&gt;You have two options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Download the app directly&lt;/strong&gt; from &lt;a href="https://fazm.ai/download" rel="noopener noreferrer"&gt;fazm.ai/download&lt;/a&gt;. This works on both Apple Silicon (M1, M2, M3, M4) and Intel Macs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clone from GitHub&lt;/strong&gt; if you prefer to build from source: &lt;a href="https://github.com/m13v/fazm" rel="noopener noreferrer"&gt;github.com/m13v/fazm&lt;/a&gt;. This is totally optional - the downloadable app works perfectly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most people, just grab the download from the website.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Install the App
&lt;/h3&gt;

&lt;p&gt;This works like any other Mac app. Open the downloaded file and drag Fazm into your Applications folder. Then open it from Applications (or Spotlight - press Command+Space and type "Fazm").&lt;/p&gt;

&lt;p&gt;The first time you open it, macOS might show a security warning since Fazm is not from the App Store. If that happens, go to &lt;strong&gt;System Settings &amp;gt; Privacy &amp;amp; Security&lt;/strong&gt; and click &lt;strong&gt;Open Anyway&lt;/strong&gt; next to the Fazm notification. This is standard for open-source Mac apps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Grant Permissions
&lt;/h3&gt;

&lt;p&gt;When Fazm launches for the first time, it will ask for three macOS permissions. Each one is necessary for the agent to work, and here is why:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accessibility Permission&lt;/strong&gt; - This lets Fazm control your mouse and keyboard. Without it, the agent can plan actions but cannot actually execute them. Go to &lt;strong&gt;System Settings &amp;gt; Privacy &amp;amp; Security &amp;gt; Accessibility&lt;/strong&gt; and toggle Fazm on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microphone Permission&lt;/strong&gt; - This is for voice commands. Fazm uses push-to-talk, so it only listens when you activate it - it is not always listening in the background. You will see a standard macOS microphone permission dialog.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Screen Recording Permission&lt;/strong&gt; - This lets Fazm see what is on your screen so it knows what app you are in, what is on the page, and where to click. Importantly, Fazm processes screen data locally on your machine. Your screen content is never sent to any external server. Go to &lt;strong&gt;System Settings &amp;gt; Privacy &amp;amp; Security &amp;gt; Screen Recording&lt;/strong&gt; and toggle Fazm on.&lt;/p&gt;

&lt;p&gt;After granting permissions, you may need to restart Fazm for everything to take effect. Just quit the app (right-click its icon in the menu bar and choose Quit) and open it again.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Get Familiar with the Interface
&lt;/h3&gt;

&lt;p&gt;Once Fazm is running, you will see a small floating toolbar on your screen. This is the main interface. It stays on top of your other windows so it is always accessible.&lt;/p&gt;

&lt;p&gt;The toolbar is minimal by design. There is no complicated dashboard to learn. The core interaction is simple: press the keyboard shortcut to activate push-to-talk, speak your command, and watch Fazm work.&lt;/p&gt;

&lt;p&gt;You can also type commands directly into the toolbar if you prefer text input over voice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Test That Everything Works
&lt;/h3&gt;

&lt;p&gt;Before diving into real tasks, let's make sure everything is connected. Try this simple command - either speak it using push-to-talk or type it:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Open Safari"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If Fazm opens Safari, you are good to go. If nothing happens, double-check that all three permissions are granted in System Settings and that you restarted Fazm after granting them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your First 5 Tasks (Progressive Difficulty)
&lt;/h2&gt;

&lt;p&gt;Now for the fun part. We will work through five tasks that gradually increase in complexity. By the end, you will have a solid feel for how AI computer agents work and what they can handle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Task 1 (Easy): Open a Website
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Say:&lt;/strong&gt; &lt;em&gt;"Open Safari and go to google.com"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you will see:&lt;/strong&gt; Fazm opens Safari (or brings it to the front if it is already open), clicks the address bar, types "google.com," and hits Enter. The Google homepage loads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why start here:&lt;/strong&gt; This confirms that Fazm can control your browser. It is a simple, low-stakes test.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If it does not work:&lt;/strong&gt; Make sure Accessibility permission is enabled. Fazm needs this to control mouse and keyboard actions. Also check that the Fazm browser extension is installed if prompted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Task 2 (Easy): Do a Web Search
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Say:&lt;/strong&gt; &lt;em&gt;"Search for the weather in San Francisco"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you will see:&lt;/strong&gt; Fazm opens your browser, navigates to a search engine, types the query, and hits Enter. You will see the search results page with the weather for San Francisco.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; This shows that Fazm can handle a task with a clear goal but without you specifying every single step. You did not say "open Safari, click the address bar, type google.com, click the search box, type weather in San Francisco, press Enter." You just said what you wanted, and Fazm figured out the steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If it does not work:&lt;/strong&gt; Try being slightly more specific, like "Open Safari and search Google for the weather in San Francisco." As Fazm learns your habits, you will be able to use shorter, more natural commands.&lt;/p&gt;

&lt;h3&gt;
  
  
  Task 3 (Medium): Send an Email
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Say:&lt;/strong&gt; &lt;em&gt;"Send an email to Jake saying I'll be 10 minutes late to the meeting"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you will see:&lt;/strong&gt; Fazm opens your email client (Gmail in the browser or Apple Mail), starts a new message, fills in Jake's email address (if it knows Jake from previous interactions - if not, it will ask or search your contacts), types the subject line and message body, and sends it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this is a step up:&lt;/strong&gt; This involves multiple actions across different parts of an app - composing, addressing, writing, and sending. It also shows how the memory layer works. The first time, you might need to say Jake's full email address. Next time, Fazm will remember.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If it does not work:&lt;/strong&gt; If Fazm does not know who Jake is, add more detail: "Send an email to &lt;a href="mailto:jake@example.com"&gt;jake@example.com&lt;/a&gt; saying I'll be 10 minutes late." After this, Fazm will associate the name Jake with that email address for future commands.&lt;/p&gt;

&lt;h3&gt;
  
  
  Task 4 (Medium): Multi-Step Browser Research
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Say:&lt;/strong&gt; &lt;em&gt;"Find the cheapest flight to New York next weekend"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you will see:&lt;/strong&gt; Fazm opens a travel website like Google Flights or Kayak, enters your departure city (which it may already know from your location or past searches), sets New York as the destination, picks the dates for next weekend, searches for flights, and sorts by price. You will see the results on screen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this is useful:&lt;/strong&gt; This is a multi-step task that would normally involve a lot of clicking, typing, and waiting. The agent handles the entire flow while you watch. You can stop it at any point if you want to take over or adjust the search.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If it does not work:&lt;/strong&gt; Break it into two parts. First: "Open Google Flights." Then: "Search for flights from [your city] to New York departing Saturday and returning Sunday." As you use Fazm more, it will learn your home airport and travel preferences so you can go back to the shorter version.&lt;/p&gt;

&lt;h3&gt;
  
  
  Task 5 (Advanced): Multi-App Workflow
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Say:&lt;/strong&gt; &lt;em&gt;"Summarize my unread emails and create a to-do list in Notes"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you will see:&lt;/strong&gt; Fazm opens your email, scans your unread messages, identifies the ones that need action, switches to the Notes app, creates a new note, and writes a summary of your emails along with a to-do list of action items.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this is powerful:&lt;/strong&gt; This task spans two completely different apps and requires the agent to read, interpret, and synthesize information - not just click buttons. This is the kind of workflow that really shows the value of a desktop-level AI agent versus a browser-only tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If it does not work:&lt;/strong&gt; Start with a simpler version: "Open my email and tell me how many unread messages I have." Once that works, try: "Summarize my three most recent unread emails." Build up to the full workflow gradually.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tips for Getting Better Results
&lt;/h2&gt;

&lt;p&gt;AI computer agents are powerful, but they work best when you know how to communicate with them effectively. Here are the practices that make the biggest difference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Be Specific but Natural
&lt;/h3&gt;

&lt;p&gt;You do not need to use special syntax or robotic phrasing. Talk to Fazm the way you would talk to a capable assistant sitting next to you. "Can you reply to that email from Sarah and tell her the meeting is moved to Wednesday" is a perfectly good command.&lt;/p&gt;

&lt;p&gt;That said, specificity helps for complex tasks. "Book a flight" is vague. "Book a direct flight to Tokyo next Thursday, departing after 10am, economy class" gives the agent everything it needs to get it right on the first try.&lt;/p&gt;

&lt;h3&gt;
  
  
  Start Simple, Then Build Complexity
&lt;/h3&gt;

&lt;p&gt;If you jump straight to complex multi-app workflows, you might get frustrated. Start with single-app, single-action tasks. Get comfortable with how the agent operates. Then gradually combine actions and work across multiple apps.&lt;/p&gt;

&lt;p&gt;Think of it like learning to drive. You start in a parking lot, not on the highway.&lt;/p&gt;

&lt;h3&gt;
  
  
  Let the Memory Layer Learn Your Preferences
&lt;/h3&gt;

&lt;p&gt;Fazm's memory layer builds a personal knowledge graph from your interactions. The more you use it, the less you need to explain. In the first week, you might need to spell out details - email addresses, preferred websites, file locations. By the fourth week, Fazm already knows your contacts, your favorite tools, and your workflow patterns.&lt;/p&gt;

&lt;p&gt;Do not fight this process by repeating information Fazm already has. Trust the memory and keep your commands short. If Fazm has already learned who Sarah is, just say "Reply to Sarah" - you do not need to re-explain every time.&lt;/p&gt;
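&lt;p&gt;The learning pattern looks roughly like this - a plain dictionary stands in for the real knowledge graph, and the names and addresses are made up:&lt;/p&gt;

```python
# First command spells out the detail; later commands use the short name.
memory = {}

def resolve_contact(name, explicit_email=None):
    """Use a remembered address, learning it when given explicitly."""
    if explicit_email:
        memory[name.lower()] = explicit_email   # learn once
    return memory.get(name.lower())

# First time: the user spells it out, and the agent remembers.
first = resolve_contact("Jake", "jake@example.com")
# Next time: "Reply to Jake" is enough.
later = resolve_contact("Jake")
```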

&lt;h3&gt;
  
  
  Use It Consistently for a Week Before Judging
&lt;/h3&gt;

&lt;p&gt;AI computer agents improve dramatically with use. The experience in the first hour is not representative of the experience after a week. Give it time to learn your patterns, and give yourself time to learn how to communicate with it effectively.&lt;/p&gt;

&lt;p&gt;Most people report a noticeable difference after three to five days of regular use. The commands get shorter, the results get more accurate, and the overall flow becomes second nature.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Issues and How to Fix Them
&lt;/h2&gt;

&lt;p&gt;Every new tool has a learning curve. Here are the most common issues people run into and how to resolve them.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Agent Clicks the Wrong Thing
&lt;/h3&gt;

&lt;p&gt;This usually happens when your command is ambiguous. If there are multiple buttons or links that could match your intent, the agent has to guess. Fix this by being more specific about what you want. Instead of "click the button," try "click the blue Submit button at the bottom of the form."&lt;/p&gt;

&lt;p&gt;Over time, Fazm learns the specific interfaces you use regularly and gets much better at navigating them accurately.&lt;/p&gt;

&lt;h3&gt;
  
  
  Voice Commands Are Not Recognized
&lt;/h3&gt;

&lt;p&gt;First, check that Microphone permission is enabled in &lt;strong&gt;System Settings &amp;gt; Privacy &amp;amp; Security &amp;gt; Microphone&lt;/strong&gt;. If it is enabled and commands are still not recognized, try speaking a bit more clearly and at a steady pace. Background noise can also interfere - if you are in a noisy environment, try moving to a quieter spot or using text input instead.&lt;/p&gt;

&lt;p&gt;Also make sure you are pressing and holding the push-to-talk shortcut while speaking. Fazm does not listen continuously - it only captures audio while the shortcut is held down.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Task Takes Too Long
&lt;/h3&gt;

&lt;p&gt;If the agent seems to be taking a roundabout path to complete a task, it might be because the instruction was too broad. Break complex tasks into smaller, more specific steps. Instead of "organize all my files," try "move all the PDFs from my Downloads folder to my Documents folder."&lt;/p&gt;

&lt;p&gt;Smaller, well-defined tasks execute faster and more reliably than large, ambiguous ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  Permission Errors or the Agent Cannot Control Apps
&lt;/h3&gt;

&lt;p&gt;If Fazm seems unable to interact with certain apps or features, permissions are almost always the cause. Go to &lt;strong&gt;System Settings &amp;gt; Privacy &amp;amp; Security&lt;/strong&gt; and verify that Fazm has Accessibility, Screen Recording, and Microphone permissions enabled.&lt;/p&gt;

&lt;p&gt;Some macOS updates can reset permissions, so if things were working before and suddenly stop, check this first.&lt;/p&gt;

&lt;p&gt;If you recently installed Fazm and granted permissions but things are not working, try restarting the app. Some permissions require a restart to take effect.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Automate Next
&lt;/h2&gt;

&lt;p&gt;Once you are comfortable with the basics, here are some areas where AI computer agents really shine. Each of these can save significant time every week.&lt;/p&gt;

&lt;h3&gt;
  
  
  Email Workflows
&lt;/h3&gt;

&lt;p&gt;Go beyond single replies. Try commands like "Archive all newsletters from this week," "Draft a follow-up to everyone I met at the conference last Tuesday," or "Flag all emails from clients that mention a deadline." Email management is where most people see the biggest time savings - often 30 to 45 minutes per day. See our post on &lt;a href="https://fazm.ai/blog/boring-automation-tasks-ai-agent" rel="noopener noreferrer"&gt;the most satisfying tasks to automate&lt;/a&gt; for more ideas.&lt;/p&gt;

&lt;h3&gt;
  
  
  Form Filling and Data Entry
&lt;/h3&gt;

&lt;p&gt;Expense reports, CRM updates, compliance forms, job applications - any form you fill out repeatedly is a candidate for automation. Fazm's memory layer means it already knows your name, address, company details, and other common form fields, so you do not have to re-enter them every time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Research Tasks
&lt;/h3&gt;

&lt;p&gt;Need to compare pricing across competitors? Find the best-reviewed restaurants in a new city? Compile a list of potential vendors? Research tasks that involve visiting multiple websites, extracting information, and organizing it are a perfect fit for AI agents. A task that would take an hour of tab-switching becomes a single voice command.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scheduled Automations
&lt;/h3&gt;

&lt;p&gt;Fazm can run recurring tasks automatically. Set up workflows like "Every Monday, compile the team's GitHub activity into a summary email" or "Every morning, check my inbox and flag anything urgent." This is where automation moves from reactive (you ask for something) to proactive (it happens automatically).&lt;/p&gt;
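
&lt;p&gt;At its core, a recurring automation is just a schedule entry plus a due-check. The following Python sketch is purely illustrative - the tuple format and the &lt;code&gt;tasks_due&lt;/code&gt; helper are hypothetical, not Fazm's actual scheduler:&lt;/p&gt;

```python
from datetime import datetime

# Hypothetical sketch: a recurring automation as (weekday, hour, command).
# Weekday follows Python's convention: Monday == 0.
SCHEDULED_TASKS = [
    (0, 9, "compile the team's GitHub activity into a summary email"),  # Mondays, 9am
]

def tasks_due(now):
    """Return the commands whose weekday and hour match the given time."""
    due = []
    for weekday, hour, command in SCHEDULED_TASKS:
        if now.weekday() == weekday and now.hour == hour:
            due.append(command)
    return due
```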

&lt;h3&gt;
  
  
  Code Writing and Development
&lt;/h3&gt;

&lt;p&gt;If you write code, voice-controlled agents can create files, write functions, run tests, commit changes, and navigate your IDE - all from voice commands. It is not just dictation. The agent understands the structure of your project and makes intelligent decisions about where to write code and how to structure it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Privacy Question
&lt;/h2&gt;

&lt;p&gt;If you are going to let software control your computer, privacy matters. Here is how Fazm handles it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Screen analysis happens locally on your Mac.&lt;/strong&gt; When Fazm looks at your screen to understand what app you are in and where to click, that processing happens on your machine using &lt;a href="https://fazm.ai/blog/on-device-ai-apple-silicon-desktop-agent" rel="noopener noreferrer"&gt;on-device AI on Apple Silicon&lt;/a&gt;. Your screen content is never uploaded to a third-party server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your knowledge graph stays local.&lt;/strong&gt; The memory layer - which stores your contacts, preferences, file information, and workflow patterns - lives entirely on your Mac. It never leaves your machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Only intent is sent to the cloud.&lt;/strong&gt; When you give a command, the &lt;em&gt;intent&lt;/em&gt; (what you want to do) is sent to an AI model for action planning. But the actual screen content, document contents, and personal details stay local.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fazm is fully open source.&lt;/strong&gt; The entire codebase is available on &lt;a href="https://github.com/m13v/fazm" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. You can inspect exactly how your data is handled, what is sent where, and how everything works. There is nothing hidden. We explain &lt;a href="https://fazm.ai/blog/why-local-first-ai-agents-are-the-future" rel="noopener noreferrer"&gt;why local-first architecture matters for privacy&lt;/a&gt; in a dedicated post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started Today
&lt;/h2&gt;

&lt;p&gt;The learning curve for AI computer agents is real but short. Most people go from "this is weird" to "I cannot live without this" within a few days. Here is your quick-start checklist:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Download Fazm&lt;/strong&gt; from &lt;a href="https://fazm.ai/download" rel="noopener noreferrer"&gt;fazm.ai/download&lt;/a&gt; - it is free and open source&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grant the three permissions&lt;/strong&gt; (Accessibility, Microphone, Screen Recording) in System Settings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try the five tasks&lt;/strong&gt; from this guide, starting with the easy ones&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use it daily for a week&lt;/strong&gt; to let the memory layer learn your patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Star the project on GitHub&lt;/strong&gt; at &lt;a href="https://github.com/m13v/fazm" rel="noopener noreferrer"&gt;github.com/m13v/fazm&lt;/a&gt; to follow development and contribute&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The way we interact with computers is changing. Instead of learning where every button is and clicking through the same menus hundreds of times, you can just say what you need and let the computer handle it. AI computer agents are not replacing you - they are handling the tedious parts so you can focus on work that actually matters.&lt;/p&gt;

&lt;p&gt;The tools are here. They are free. They are open source. The only question is which repetitive task you want to eliminate first.&lt;/p&gt;

&lt;h2&gt;
  
  
  More on This Topic
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/getting-started-ai-automation-daily-life" rel="noopener noreferrer"&gt;How to Actually Start Using AI in Your Daily Life (Without Getting Overwhelmed)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/what-is-ai-desktop-agent" rel="noopener noreferrer"&gt;What Is an AI Desktop Agent? Everything You Need to Know in 2026&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/ai-agent-vs-chatbot-vs-copilot" rel="noopener noreferrer"&gt;AI Agent vs Chatbot vs Copilot: What Is the Difference?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;



</description>
      <category>ai</category>
      <category>beginners</category>
      <category>tutorial</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How LLMs Can Control Your Computer - Voice-Driven, Local, No API Keys</title>
      <dc:creator>Matthew Diakonov</dc:creator>
      <pubDate>Wed, 18 Mar 2026 03:53:34 +0000</pubDate>
      <link>https://forem.com/m13v/how-llms-can-control-your-computer-voice-driven-local-no-api-keys-1c7f</link>
      <guid>https://forem.com/m13v/how-llms-can-control-your-computer-voice-driven-local-no-api-keys-1c7f</guid>
      <description>&lt;h1&gt;
  
  
  How LLMs Can Control Your Computer
&lt;/h1&gt;

&lt;p&gt;Most people interact with LLMs through chat interfaces. Type a question, get an answer. But there is a much more interesting use case: letting an LLM actually control your computer.&lt;/p&gt;

&lt;p&gt;Not generating code for you to run. Not suggesting what to click. Actually moving the mouse, typing in text fields, navigating between apps, and completing multi-step workflows autonomously.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;A desktop agent powered by an LLM needs three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Perception&lt;/strong&gt; - the ability to see what is on the screen and understand the current state of the UI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Planning&lt;/strong&gt; - the ability to break a high-level instruction ("update the CRM with call notes") into a sequence of concrete actions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution&lt;/strong&gt; - the ability to actually perform those actions (click buttons, type text, switch apps)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The LLM handles the planning step. It takes the current screen state as input and outputs a structured action plan. The perception and execution layers are handled by native APIs - ScreenCaptureKit for screen capture and accessibility APIs for UI interaction. We cover the technical implementation of these APIs in our post on &lt;a href="https://fazm.ai/blog/building-macos-ai-agent-swift-screencapturekit" rel="noopener noreferrer"&gt;building a macOS AI agent in Swift&lt;/a&gt;.&lt;/p&gt;
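
&lt;p&gt;The three layers above compose into a simple loop. Here is a minimal Python sketch with stubbed-out functions - the function names and the action format are assumptions made for illustration, not Fazm's actual API:&lt;/p&gt;

```python
# Illustrative perceive/plan/execute loop for an LLM-driven desktop agent.

def perceive():
    """Stand-in for the perception layer (screen capture + accessibility tree)."""
    return {"frontmost_app": "Contacts", "visible_buttons": ["Save", "Cancel"]}

def plan(instruction, screen_state):
    """Stand-in for the LLM planning step: turn intent into structured actions."""
    return [
        {"action": "click", "target": "Save"},
        {"action": "type", "text": instruction},
    ]

def execute(actions, log):
    """Stand-in for the execution layer (accessibility APIs drive the UI)."""
    for step in actions:
        log.append(step["action"])

def run_agent(instruction):
    log = []
    state = perceive()
    actions = plan(instruction, state)
    execute(actions, log)
    return log
```

&lt;p&gt;The important property is the separation: the LLM only ever sees a description of screen state and only ever emits structured actions, so the native perception and execution layers can be swapped or sandboxed independently.&lt;/p&gt;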

&lt;h2&gt;
  
  
  Why Voice Changes Everything
&lt;/h2&gt;

&lt;p&gt;Typing instructions to an LLM-powered agent defeats the purpose. If you are already at your keyboard, you might as well just do the task yourself.&lt;/p&gt;

&lt;p&gt;Voice input changes the equation. You can tell the agent what to do while walking to the kitchen, while on a phone call, or while working on something else entirely. The agent becomes ambient - always available, never requiring you to switch contexts.&lt;/p&gt;

&lt;p&gt;Push-to-talk is the right interaction model. An always-listening microphone invites privacy concerns and false activations. Holding a single keyboard shortcut while you speak, then releasing it to execute, keeps you in control.&lt;/p&gt;
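
&lt;p&gt;A minimal sketch of that push-to-talk model, assuming a simple buffer-while-held design (illustrative only, not Fazm's implementation):&lt;/p&gt;

```python
# Push-to-talk state machine: audio is buffered only between key-down and
# key-up, then handed off as a single command on release.

class PushToTalk:
    def __init__(self):
        self.recording = False
        self.buffer = []

    def key_down(self):
        self.recording = True
        self.buffer = []

    def feed_audio(self, chunk):
        # Audio arriving while the shortcut is not held is dropped, never stored.
        if self.recording:
            self.buffer.append(chunk)

    def key_up(self):
        self.recording = False
        # In a real agent, transcription and execution would happen here.
        return " ".join(self.buffer)
```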

&lt;h2&gt;
  
  
  Local vs Cloud
&lt;/h2&gt;

&lt;p&gt;Running the LLM locally means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No API keys.&lt;/strong&gt; Download the app, open it, start using it. No account creation, no billing setup, no rate limits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No latency.&lt;/strong&gt; The roundtrip to a cloud API adds 500ms-2s per action. For a multi-step workflow, that adds up to a noticeably sluggish experience.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No privacy concerns.&lt;/strong&gt; Your screen content, voice recordings, and file contents never leave your machine.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With Ollama and models like Qwen running on Apple Silicon, local inference is fast enough for practical desktop automation. You trade some accuracy for complete independence from cloud services. Our post on &lt;a href="https://fazm.ai/blog/on-device-ai-apple-silicon-desktop-agent" rel="noopener noreferrer"&gt;on-device AI on Apple Silicon&lt;/a&gt; goes deeper into what models run well locally and the latency tradeoffs.&lt;/p&gt;

&lt;p&gt;That said, Fazm also supports Claude and other cloud models for users who want maximum accuracy and do not mind the cloud dependency. The choice is yours.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;Here is a typical workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You press the hotkey and say "send Sarah the meeting notes from today's standup"&lt;/li&gt;
&lt;li&gt;The agent reads the current screen to understand context&lt;/li&gt;
&lt;li&gt;It opens your email client, finds Sarah's contact, drafts the email with the meeting notes it observed earlier, and sends it&lt;/li&gt;
&lt;li&gt;Total time: 15 seconds instead of 2 minutes of manual app-switching and typing&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The boring tasks - CRM updates, form filling, file organization, email triage - are where this shines. Not because the AI is smarter than you, but because these tasks do not deserve your attention in the first place. We compiled a list of &lt;a href="https://fazm.ai/blog/boring-automation-tasks-ai-agent" rel="noopener noreferrer"&gt;the most satisfying tasks to automate&lt;/a&gt; based on real user feedback.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>automation</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Native Mac Speech-to-Text That Runs Locally - Privacy, Speed, and No Cloud</title>
      <dc:creator>Matthew Diakonov</dc:creator>
      <pubDate>Wed, 18 Mar 2026 03:52:35 +0000</pubDate>
      <link>https://forem.com/m13v/native-mac-speech-to-text-that-runs-locally-privacy-speed-and-no-cloud-16ip</link>
      <guid>https://forem.com/m13v/native-mac-speech-to-text-that-runs-locally-privacy-speed-and-no-cloud-16ip</guid>
      <description>&lt;h1&gt;
  
  
  Native Mac Speech-to-Text That Runs Locally
&lt;/h1&gt;

&lt;p&gt;A Reddit thread about testing "a native, private and very fast speech-to-text app" on Mac drew a lot of interest. The appeal is obvious: you talk, it types, and nothing leaves your machine. No cloud API calls, no latency, no subscription fees, no privacy concerns.&lt;/p&gt;

&lt;p&gt;For AI desktop agents, local speech-to-text is not just a nice feature - it is foundational. If you are using voice to control an agent that manages your desktop, sending audio to a cloud API means every command you speak travels to a server somewhere. That includes everything visible on your screen that you might reference out loud - passwords, financial data, private conversations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speed Changes the Interaction Model
&lt;/h2&gt;

&lt;p&gt;Cloud-based transcription adds 200-500ms of latency per utterance. That does not sound like much, but it is enough to break the feeling of direct control. When you say "move this file to the projects folder" and there is a half-second delay before anything happens, it feels like talking to a phone tree. When transcription is instant, it feels like the agent is listening.&lt;/p&gt;

&lt;p&gt;Local models running on Apple Silicon have gotten remarkably good. Whisper variants optimized for M-series chips can transcribe in near real-time with accuracy comparable to cloud services for most common speech patterns. The tradeoff is usually with accents and specialized vocabulary, but for command-and-control usage it works well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integration with Desktop Agents
&lt;/h2&gt;

&lt;p&gt;The real power comes when local speech-to-text feeds directly into a desktop agent. You speak a command, it gets transcribed locally, the agent interprets it, and executes the action - all without touching the internet. This is the architecture behind &lt;a href="https://fazm.ai/blog/automate-mac-voice-commands-ai" rel="noopener noreferrer"&gt;voice-controlled Mac automation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For a &lt;a href="https://fazm.ai/blog/native-swift-menu-bar-ai-agent" rel="noopener noreferrer"&gt;native menu bar agent&lt;/a&gt;, local transcription means the voice interface is always available, even offline. You can dictate notes, trigger automations, and control apps entirely by voice while on a plane or in a location with no connectivity.&lt;/p&gt;

&lt;p&gt;The shift from cloud to local speech processing is not about being anti-cloud. It is about removing unnecessary dependencies from a workflow that should be instant and private.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fazm is an open source macOS AI agent. &lt;a href="https://github.com/m13v/fazm" rel="noopener noreferrer"&gt;Open source on GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  More on This Topic
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/why-local-first-ai-agents-are-the-future" rel="noopener noreferrer"&gt;Why Local-First AI Agents Are the Future (And Why It Matters for Your Privacy)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/apple-silicon-mlx-local-ml" rel="noopener noreferrer"&gt;Apple Silicon and MLX - Running ML Models Locally Without Cloud APIs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/llm-powered-desktop-agent-voice-local" rel="noopener noreferrer"&gt;How LLMs Can Control Your Computer - Voice-Driven, Local, No API Keys&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


</description>
      <category>ai</category>
      <category>macos</category>
      <category>privacy</category>
      <category>speechrecognition</category>
    </item>
    <item>
      <title>On-Device AI on Apple Silicon - What It Means for Desktop Agents</title>
      <dc:creator>Matthew Diakonov</dc:creator>
      <pubDate>Wed, 18 Mar 2026 03:52:28 +0000</pubDate>
      <link>https://forem.com/m13v/on-device-ai-on-apple-silicon-what-it-means-for-desktop-agents-4mho</link>
      <guid>https://forem.com/m13v/on-device-ai-on-apple-silicon-what-it-means-for-desktop-agents-4mho</guid>
      <description>&lt;h1&gt;
  
  
  On-Device AI on Apple Silicon
&lt;/h1&gt;

&lt;p&gt;Apple Silicon changed what is possible for local AI. The unified memory architecture means ML models can run on the GPU without copying data between CPU and GPU memory. For a desktop agent that needs to process screen content in real-time, this matters a lot.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Runs Locally Now
&lt;/h2&gt;

&lt;p&gt;On an M1 with 16GB of RAM, you can comfortably run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;WhisperKit&lt;/strong&gt; for voice transcription - fast enough for real-time push-to-talk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama with 7-13B parameter models&lt;/strong&gt; for action planning - usable latency for simple tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision models&lt;/strong&gt; for screen understanding - when accessibility APIs are not enough&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On an M4 Pro with 48GB, the picture gets much better:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;32B parameter models&lt;/strong&gt; run at interactive speeds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple models simultaneously&lt;/strong&gt; - transcription and planning can run in parallel without contention&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overnight batch processing&lt;/strong&gt; - the agent can process files, organize documents, and handle backlog tasks while you sleep&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Latency Question
&lt;/h2&gt;

&lt;p&gt;Cloud APIs add 500ms-2s per request. For a desktop agent that might need 5-10 LLM calls to complete a single task, that is anywhere from 2.5 to 20 seconds of waiting. Local inference on Apple Silicon cuts this to near-zero for smaller models.&lt;/p&gt;

&lt;p&gt;The tradeoff is accuracy. A local 13B model is not as capable as Claude for complex multi-step reasoning. But for straightforward desktop automation - filling forms, navigating menus, extracting text - smaller models are usually sufficient. Our post on &lt;a href="https://fazm.ai/blog/llm-powered-desktop-agent-voice-local" rel="noopener noreferrer"&gt;how LLMs control your computer&lt;/a&gt; covers the full architecture of voice-driven, local-first desktop agents.&lt;/p&gt;
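
&lt;p&gt;The arithmetic behind that waiting-time range is simple enough to spell out:&lt;/p&gt;

```python
# Back-of-the-envelope: total time spent waiting on the model across one task.
def total_wait_seconds(num_calls, per_call_latency_s):
    return num_calls * per_call_latency_s

# 10 LLM calls at 2 s of cloud round-trip each: 20 s of pure waiting.
worst_case = total_wait_seconds(10, 2.0)
# Even the optimistic end (5 calls at 500 ms) costs 2.5 s per task.
best_case = total_wait_seconds(5, 0.5)
```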

&lt;h2&gt;
  
  
  The Privacy Argument
&lt;/h2&gt;

&lt;p&gt;A desktop agent sees everything on your screen. Every password, every private message, every financial document. Running the AI model locally means none of that data leaves your machine.&lt;/p&gt;

&lt;p&gt;This is not a theoretical concern. Screenshot-based cloud agents upload images of your screen to remote servers every few seconds. If your screen shows your bank account, that screenshot is now on someone else's server.&lt;/p&gt;

&lt;p&gt;Local inference eliminates this entirely. Your screen content stays in your RAM, gets processed by your GPU, and the results stay on your machine. We make the full case for this architecture in &lt;a href="https://fazm.ai/blog/why-local-first-ai-agents-are-the-future" rel="noopener noreferrer"&gt;why local-first AI agents are the future&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What About Apple Intelligence?
&lt;/h2&gt;

&lt;p&gt;Apple's own on-device AI initiative - &lt;a href="https://fazm.ai/compare/apple-intelligence" rel="noopener noreferrer"&gt;Apple Intelligence&lt;/a&gt; - ships with macOS Sequoia and runs models directly on the Neural Engine. It powers Writing Tools, Smart Replies, and an upgraded Siri. But it is not a desktop agent. Apple Intelligence cannot click buttons, fill forms, navigate browsers, or automate multi-step workflows across apps. It is a set of in-app AI features, not autonomous computer control. For users who want to go beyond what Apple's built-in AI offers, a dedicated desktop agent fills the gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Fazm Uses Apple Silicon
&lt;/h2&gt;

&lt;p&gt;Fazm is designed to take advantage of Apple Silicon's unified memory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Voice input goes through WhisperKit locally&lt;/li&gt;
&lt;li&gt;Screen capture uses ScreenCaptureKit (hardware-accelerated) - see our &lt;a href="https://fazm.ai/blog/building-macos-ai-agent-swift-screencapturekit" rel="noopener noreferrer"&gt;deep dive into ScreenCaptureKit and accessibility APIs&lt;/a&gt; for implementation details&lt;/li&gt;
&lt;li&gt;You choose between local models via Ollama or cloud models like Claude&lt;/li&gt;
&lt;li&gt;The accessibility tree is processed entirely on-device&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is that the most privacy-sensitive operations - capturing your screen and understanding your voice - always happen locally, regardless of which LLM you use for action planning.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>apple</category>
      <category>machinelearning</category>
      <category>macos</category>
    </item>
    <item>
      <title>Typing Instructions to an AI Agent Is Backwards - Voice First Is the Answer</title>
      <dc:creator>Matthew Diakonov</dc:creator>
      <pubDate>Wed, 18 Mar 2026 02:45:21 +0000</pubDate>
      <link>https://forem.com/m13v/typing-instructions-to-an-ai-agent-is-backwards-voice-first-is-the-answer-3n6f</link>
      <guid>https://forem.com/m13v/typing-instructions-to-an-ai-agent-is-backwards-voice-first-is-the-answer-3n6f</guid>
      <description>&lt;h1&gt;
  
  
  Stop Typing to Your Agent
&lt;/h1&gt;

&lt;p&gt;Think about what happens when you use a typical AI coding agent. You type a detailed prompt. Wait for it to work. Read the output. Type corrections. Wait again. Your hands are on the keyboard the entire time, dedicated to managing the agent.&lt;/p&gt;

&lt;p&gt;That defeats the purpose. The agent is supposed to give you time back, not consume it in a different way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Voice Changes the Dynamic
&lt;/h2&gt;

&lt;p&gt;When you can speak to your agent, your hands are free. You can be reviewing a design in Figma while telling the agent to fix a build error. You can be eating lunch while directing a refactor. You can be on a walk while the agent handles your email backlog.&lt;/p&gt;

&lt;p&gt;The interaction model shifts from "sitting at your desk managing the agent" to "living your life while the agent handles tasks in the background." That's a fundamentally different value proposition.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It's Hard to Get Right
&lt;/h2&gt;

&lt;p&gt;Voice-first interaction needs three things to work well: reliable speech-to-text, good intent parsing from natural speech, and a way to handle ambiguity without stopping everything to ask clarifying questions.&lt;/p&gt;

&lt;p&gt;Natural speech is messy. People say "um," change direction mid-sentence, and use vague references. A voice-first agent needs to handle "fix that thing from earlier, you know, the one that was breaking" and figure out what "that thing" refers to from context.&lt;/p&gt;

&lt;p&gt;Local speech-to-text models running on Apple Silicon are now fast enough to make this practical. You don't need to send audio to a cloud API, which solves both the latency and privacy concerns.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Right Default
&lt;/h2&gt;

&lt;p&gt;Text input should still exist for precision work - complex code snippets, exact file paths, specific configuration values. But the default interaction mode should be voice. Speak your intent, let the agent execute, check the results when you're ready.&lt;/p&gt;

&lt;p&gt;The agents that win the daily-use battle will be the ones you talk to, not the ones you type to.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fazm is an open source macOS AI agent. &lt;a href="https://github.com/m13v/fazm" rel="noopener noreferrer"&gt;Open source on GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>ux</category>
      <category>voicecontrol</category>
    </item>
    <item>
      <title>AI Agent vs Chatbot vs Copilot: What Is the Difference?</title>
      <dc:creator>Matthew Diakonov</dc:creator>
      <pubDate>Wed, 18 Mar 2026 02:45:00 +0000</pubDate>
      <link>https://forem.com/m13v/ai-agent-vs-chatbot-vs-copilot-what-is-the-difference-2de4</link>
      <guid>https://forem.com/m13v/ai-agent-vs-chatbot-vs-copilot-what-is-the-difference-2de4</guid>
      <description>&lt;p&gt;Chatbots talk. Copilots suggest. Agents act. If you only remember one thing from this article, let it be that. A chatbot answers your questions in text. A copilot watches what you are doing and offers suggestions that you then execute yourself. An AI agent takes action on your behalf - it clicks, types, navigates between apps, and completes tasks end-to-end without waiting for you to do the work.&lt;/p&gt;

&lt;p&gt;These three categories get thrown around constantly, and the lines between them are starting to blur. But understanding the core differences matters - especially if you are trying to figure out which tool will actually save you time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Chatbot&lt;/th&gt;
&lt;th&gt;Copilot&lt;/th&gt;
&lt;th&gt;AI Agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary function&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Answers questions&lt;/td&gt;
&lt;td&gt;Suggests next steps&lt;/td&gt;
&lt;td&gt;Executes tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;User involvement&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You do everything&lt;/td&gt;
&lt;td&gt;You approve suggestions&lt;/td&gt;
&lt;td&gt;Agent does the work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Examples&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ChatGPT, Claude chat&lt;/td&gt;
&lt;td&gt;GitHub Copilot, Cursor&lt;/td&gt;
&lt;td&gt;Fazm, Claude Cowork&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Text conversation&lt;/td&gt;
&lt;td&gt;Within one app&lt;/td&gt;
&lt;td&gt;Entire computer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Autonomy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Learning curve&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low to medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Information and ideas&lt;/td&gt;
&lt;td&gt;Productivity in one tool&lt;/td&gt;
&lt;td&gt;Multi-step workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What Is a Chatbot?
&lt;/h2&gt;

&lt;p&gt;A chatbot is a conversational interface powered by a language model. You type a question, and it types an answer. That is the entire interaction loop.&lt;/p&gt;

&lt;p&gt;Modern chatbots like ChatGPT, Claude, and Gemini are remarkably capable at what they do. You can ask them to explain a concept, draft an email, summarize a document, brainstorm ideas, write code, or analyze data you paste in. The quality of their responses has improved dramatically over the past few years, and for many tasks they are genuinely useful.&lt;/p&gt;

&lt;p&gt;But there is a fundamental limitation: chatbots can only talk. They cannot do anything outside the chat window. If you ask a chatbot to "schedule a meeting with Sarah for next Tuesday," it will write you a nice reply explaining how to schedule the meeting. It will not actually open your calendar and create the event.&lt;/p&gt;

&lt;p&gt;This means you are still the one doing the work. The chatbot gives you text, and then you have to take that text and act on it yourself - copy the email draft into your email client, take the code snippet and paste it into your editor, manually follow the steps it outlined.&lt;/p&gt;

&lt;p&gt;For information retrieval, brainstorming, and content generation, chatbots are excellent. For actually getting things done, they are only the first step.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is a Copilot?
&lt;/h2&gt;

&lt;p&gt;A copilot is an AI assistant embedded inside a specific application. Unlike a chatbot that lives in its own window, a copilot sits alongside you in the tool you are already using and offers contextual suggestions based on what you are doing right now.&lt;/p&gt;

&lt;p&gt;The most well-known example is GitHub Copilot, which watches you write code and suggests completions in real time. As you type a function name, it predicts the body. As you write a comment describing what you want, it generates the code below. Cursor takes this further by letting you chat with your codebase and apply suggested edits.&lt;/p&gt;

&lt;p&gt;Other copilots include Microsoft 365 Copilot (embedded in Word, Excel, and PowerPoint), Notion AI (built into Notion), and various design copilots in tools like Figma.&lt;/p&gt;

&lt;p&gt;Copilots are a clear step up from chatbots in terms of practical utility. Because they are embedded in your workflow, they understand your current context - the file you are editing, the spreadsheet you are working on, the document you are writing. Their suggestions are more relevant because they can see what you are doing.&lt;/p&gt;

&lt;p&gt;The limitation is twofold. First, copilots are confined to a single application. GitHub Copilot cannot help you with your email. Notion AI cannot edit your spreadsheet. Each copilot is locked into its host app. Second, copilots only suggest - they do not act. You still have to review each suggestion and accept or reject it. The human stays in the loop for every action.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is an AI Agent?
&lt;/h2&gt;

&lt;p&gt;An AI agent is software that can take independent action on your computer to complete tasks. Instead of answering questions or making suggestions, an agent actually does the work - clicking buttons, filling in forms, switching between applications, and navigating multi-step workflows.&lt;/p&gt;

&lt;p&gt;This is the key breakthrough that separates agents from chatbots and copilots: the ability to act. If you tell an AI agent to "schedule a meeting with Sarah for next Tuesday at 2pm," the agent opens your calendar app, creates a new event, fills in the details, adds Sarah as an attendee, and saves it. You watch it happen, or you walk away and come back when it is done.&lt;/p&gt;

&lt;p&gt;Agents can work across your entire computer, not just one app. A single task might involve opening a browser, looking up information, switching to a spreadsheet to enter data, then moving to an email client to send a summary. The agent handles all of that as one continuous workflow. For a deeper look at how this works, see our explanation of &lt;a href="https://fazm.ai/blog/cross-app-workflows-ai-desktop-agent" rel="noopener noreferrer"&gt;cross-app workflows with AI desktop agents&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;How do agents actually interact with your screen? There are two main approaches - screenshot-based vision and direct DOM control. We wrote a detailed breakdown in &lt;a href="https://fazm.ai/blog/how-ai-agents-see-your-screen-dom-vs-screenshots" rel="noopener noreferrer"&gt;how AI agents see your screen&lt;/a&gt;.&lt;/p&gt;
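&lt;p&gt;Whichever sensing approach an agent uses, most share the same high-level structure: observe the screen, ask a model for the next action, perform it, and repeat until the task is done. Here is a minimal conceptual sketch of that loop in Python - &lt;code&gt;capture_screen&lt;/code&gt;, &lt;code&gt;ask_model&lt;/code&gt;, and &lt;code&gt;perform&lt;/code&gt; are hypothetical stubs standing in for real screen-capture, model, and input-control APIs, not any particular product's implementation.&lt;/p&gt;

```python
# Conceptual sketch of the observe-decide-act loop a desktop agent runs.
# capture_screen, ask_model, and perform are HYPOTHETICAL stubs standing
# in for real screen-capture, LLM, and input-synthesis APIs.

def capture_screen():
    # A real agent would return a screenshot or an accessibility tree here.
    return "calendar app visible, 'New Event' button at top left"

def ask_model(goal, observation):
    # A real agent would send the goal and observation to an LLM;
    # here we return a canned action for illustration.
    if "New Event" in observation:
        return {"type": "click", "target": "New Event"}
    return {"type": "done"}

def perform(action):
    # A real agent would synthesize a click or keystroke here.
    print(f"performing {action['type']} on {action.get('target', '-')}")

def run_agent(goal, max_steps=10):
    # Loop until the model reports the task is complete (or we hit the cap).
    for _ in range(max_steps):
        observation = capture_screen()
        action = ask_model(goal, observation)
        if action["type"] == "done":
            break
        perform(action)

run_agent("schedule a meeting with Sarah for next Tuesday at 2pm")
```

&lt;p&gt;The loop is deliberately simple, but it captures why agents differ from chatbots and copilots: the model's output is fed back into the computer as actions rather than shown to you as text.&lt;/p&gt;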

&lt;p&gt;Agents like &lt;a href="https://fazm.ai/blog/what-is-ai-desktop-agent" rel="noopener noreferrer"&gt;Fazm&lt;/a&gt; run locally on your Mac and use a combination of these techniques to control applications, respond to voice commands, and execute tasks while keeping your data private.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use Each
&lt;/h2&gt;

&lt;p&gt;Choosing the right tool depends on what you are trying to accomplish. Here is a practical guide.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use a chatbot when you need information
&lt;/h3&gt;

&lt;p&gt;If your goal is to understand something, get ideas, or generate text, a chatbot is the right tool. Need to research a topic? Ask a chatbot. Want help drafting a blog post? Ask a chatbot. Need to understand a complex concept? Ask a chatbot. The interaction is purely informational - you are trading prompts for knowledge.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use a copilot when you need help inside one app
&lt;/h3&gt;

&lt;p&gt;If you are deep in a specific tool and want an AI assistant that understands your context, a copilot is the right choice. Writing code and want autocomplete that understands your codebase? Use a coding copilot. Editing a long document and want AI-powered rewriting? Use a writing copilot. The copilot accelerates your work within that single application.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use an AI agent when you need a task done across apps
&lt;/h3&gt;

&lt;p&gt;If you have a multi-step task that spans several applications - or if you simply want the work done for you rather than getting suggestions - an AI agent is what you need. Think of data entry that involves copying information between a browser and a spreadsheet, file management that requires renaming, moving, and organizing across folders, or repetitive workflows that you run the same way every time. These are where agents shine.&lt;/p&gt;

&lt;p&gt;For a practical walkthrough of what this looks like, check out our &lt;a href="https://fazm.ai/blog/first-ai-computer-agent-beginners-guide" rel="noopener noreferrer"&gt;beginner's guide to using your first AI computer agent&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future: Convergence
&lt;/h2&gt;

&lt;p&gt;The boundaries between these three categories are already blurring. ChatGPT now has an "agent mode" that can browse the web and take actions. Claude can operate a virtual computer. Google's Gemini is gaining the ability to interact with apps on Android.&lt;/p&gt;

&lt;p&gt;We are moving toward a world where every AI interface will have some degree of agency. The chatbot that only talked will learn to act. The copilot confined to one app will break free. The standalone agent will become more conversational. Our &lt;a href="https://fazm.ai/blog/best-ai-agents-desktop-automation-2026" rel="noopener noreferrer"&gt;roundup of the best AI agents for desktop automation&lt;/a&gt; tracks how quickly this space is evolving.&lt;/p&gt;

&lt;p&gt;But the core distinction still matters today. When you evaluate an AI tool, ask yourself: does it just tell me things, does it suggest things, or does it actually do things? The answer tells you which category it falls into - and whether it will genuinely save you time or just give you more text to act on yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started with AI Agents
&lt;/h2&gt;

&lt;p&gt;If you have been using chatbots and copilots and want to experience what a true AI agent can do, the best way is to try one. Fazm is a free AI agent for Mac that takes voice commands and executes tasks directly on your computer - no copy-pasting required.&lt;/p&gt;

&lt;p&gt;Read our &lt;a href="https://fazm.ai/blog/first-ai-computer-agent-beginners-guide" rel="noopener noreferrer"&gt;beginner's guide to your first AI computer agent&lt;/a&gt; for a step-by-step walkthrough. Or &lt;a href="https://fazm.ai" rel="noopener noreferrer"&gt;download Fazm&lt;/a&gt; and start by giving it a simple task: "open Safari and search for the weather today." Once you see an AI agent actually doing the work instead of just talking about it, the difference becomes obvious.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/what-is-ai-desktop-agent" rel="noopener noreferrer"&gt;What Is an AI Desktop Agent? Everything You Need to Know in 2026&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/first-ai-computer-agent-beginners-guide" rel="noopener noreferrer"&gt;How to Set Up Your First AI Computer Agent (Complete Beginner's Guide)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fazm.ai/blog/how-ai-agents-see-your-screen-dom-vs-screenshots" rel="noopener noreferrer"&gt;How AI Agents Actually See Your Screen: DOM Control vs Screenshots Explained&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
