<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: antonis mixail</title>
    <description>The latest articles on Forem by antonis mixail (@antonis_mixail_bf37de483b).</description>
    <link>https://forem.com/antonis_mixail_bf37de483b</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3817649%2Fe687e6fa-426b-4e31-89a2-b4e797e42aae.png</url>
      <title>Forem: antonis mixail</title>
      <link>https://forem.com/antonis_mixail_bf37de483b</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/antonis_mixail_bf37de483b"/>
    <language>en</language>
    <item>
      <title>We Gave AI Eyes and Hands on Windows - Here's How</title>
      <dc:creator>antonis mixail</dc:creator>
      <pubDate>Wed, 11 Mar 2026 01:52:32 +0000</pubDate>
      <link>https://forem.com/antonis_mixail_bf37de483b/we-gave-ai-eyes-and-hands-on-windows-heres-how-5688</link>
      <guid>https://forem.com/antonis_mixail_bf37de483b/we-gave-ai-eyes-and-hands-on-windows-heres-how-5688</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmmi7fqirf1i3pfz188mn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmmi7fqirf1i3pfz188mn.png" alt=" " width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We Gave AI Eyes and Hands on Windows — Here's How&lt;/p&gt;

&lt;p&gt;AI coding assistants can write 500 lines of code in seconds. But ask them to click a button? They're blind.&lt;/p&gt;

&lt;p&gt;While building &lt;a href="https://orbination.com" rel="noopener noreferrer"&gt;https://orbination.com&lt;/a&gt; — our AI platform launching next month — we hit this wall hard. Our agents needed to interact with the actual Windows desktop.&lt;br&gt;
  Open apps. Click through dialogs. Read what's on screen. Test their own work.&lt;/p&gt;

&lt;p&gt;Nothing out there did what we needed. So we built it.&lt;/p&gt;

&lt;p&gt;The Problem&lt;/p&gt;

&lt;p&gt;We tried the screenshot approach first. Take a screenshot, send it to the AI, let it figure out where to click.&lt;/p&gt;

&lt;p&gt;It was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slow — each screenshot costs thousands of vision tokens&lt;/li&gt;
&lt;li&gt;Unreliable — the AI guesses coordinates from pixels&lt;/li&gt;
&lt;li&gt;Expensive — 15 screenshots to navigate one menu&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We needed the AI to know what's on screen, not guess from images.&lt;/p&gt;

&lt;p&gt;The Solution: Read the Actual UI&lt;/p&gt;

&lt;p&gt;Windows has a built-in accessibility layer called UIAutomation. Every button, input field, menu item, and checkbox exposes itself to this system with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exact text/label&lt;/li&gt;
&lt;li&gt;Exact position (bounding rectangle)&lt;/li&gt;
&lt;li&gt;Control type (button, input, text, tab...)&lt;/li&gt;
&lt;li&gt;Interaction patterns (can I click it? type in it? toggle it?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of sending screenshots, we send this:&lt;/p&gt;

&lt;p&gt;[button] "Save" @ 450,320&lt;br&gt;
  [input]  "Search..." @ 200,60&lt;br&gt;
  [tab]    "Settings" @ 120,35&lt;/p&gt;

&lt;p&gt;Three lines of text instead of a 1MB image. The AI knows exactly what to click and where.&lt;/p&gt;
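
&lt;p&gt;A minimal sketch of producing that dump with the managed System.Windows.Automation wrapper (the real server batches reads with CacheRequest and adds error handling; this is illustrative, not the project's code):&lt;/p&gt;

```csharp
using System;
using System.Runtime.InteropServices;
using System.Windows.Automation;   // UIAutomationClient / UIAutomationTypes assemblies

class UiDump
{
    [DllImport("user32.dll")]
    static extern IntPtr GetForegroundWindow();

    static void Main()
    {
        // Walk every descendant of the foreground window in one pass.
        var root = AutomationElement.FromHandle(GetForegroundWindow());
        foreach (AutomationElement el in
                 root.FindAll(TreeScope.Descendants, Condition.TrueCondition))
        {
            var r = el.Current.BoundingRectangle;                       // exact on-screen rect
            string kind = el.Current.ControlType.LocalizedControlType;  // "button", "edit", ...
            Console.WriteLine($"[{kind}] \"{el.Current.Name}\" @ {(int)r.X},{(int)r.Y}");
        }
    }
}
```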

&lt;p&gt;We wrapped this into an MCP server — the open protocol that connects AI assistants to external tools. Single .NET 8 executable. No Python. No Node.js. No Selenium.&lt;/p&gt;

&lt;p&gt;Then Reality Hit&lt;/p&gt;

&lt;p&gt;Problem 1: Dark Themes&lt;/p&gt;

&lt;p&gt;Half the apps we use have dark themes. Standard OCR on a dark screenshot?&lt;/p&gt;

&lt;p&gt;0 lines detected.&lt;/p&gt;

&lt;p&gt;We built automatic dark theme enhancement:&lt;/p&gt;

&lt;p&gt;if (IsDarkImage(bitmap))&lt;br&gt;
  {&lt;br&gt;
      // Invert colors + boost contrast 1.4x&lt;br&gt;
      using var enhanced = EnhanceForOcr(bitmap);&lt;br&gt;
      result = RunOcrEngine(enhanced, language);&lt;br&gt;
  }&lt;br&gt;
  else&lt;br&gt;
  {&lt;br&gt;
      // Light theme: OCR the original directly&lt;br&gt;
      result = RunOcrEngine(bitmap, language);&lt;br&gt;
  }&lt;/p&gt;

&lt;p&gt;Check darkness first, enhance once, OCR once. Single pass.&lt;/p&gt;

&lt;p&gt;Result: 0 → 37 lines detected on draw.io's dark interface.&lt;/p&gt;
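
&lt;p&gt;The darkness check itself can be as simple as a sampled-luminance average. A hedged sketch using System.Drawing (Windows-only); the sampling stride and threshold here are illustrative, not the project's exact values:&lt;/p&gt;

```csharp
using System.Drawing;

static bool IsDarkImage(Bitmap bmp)
{
    double total = 0;
    int samples = 0;
    // Sample an 8px grid instead of every pixel to keep the check cheap.
    for (int y = 0; bmp.Height > y; y += 8)
    {
        for (int x = 0; bmp.Width > x; x += 8)
        {
            Color c = bmp.GetPixel(x, y);
            // Rec. 601 perceived luminance, range 0..255
            total += 0.299 * c.R + 0.587 * c.G + 0.114 * c.B;
            samples++;
        }
    }
    double avg = samples > 0 ? total / samples : 255;
    return 90 > avg;   // below ~35% brightness: treat as a dark theme
}
```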

&lt;p&gt;Problem 2: Web Apps Inside Iframes&lt;/p&gt;

&lt;p&gt;UIAutomation can't see inside web-rendered dialogs and iframes. A dark-themed "OK" button in a web dialog? Invisible to UIAutomation.&lt;/p&gt;

&lt;p&gt;We built an automatic OCR fallback into click_element:&lt;/p&gt;

&lt;p&gt;click_element "OK"&lt;br&gt;
    → UIAutomation: not found&lt;br&gt;
    → Capture window → detect dark theme → enhance → OCR → find "OK" → click center&lt;br&gt;
    → Result: Clicked "OK" @ 523,418 (OCR fallback) ✓&lt;/p&gt;

&lt;p&gt;One tool call. The AI doesn't even know which strategy was used — it just works.&lt;/p&gt;
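
&lt;p&gt;The fallback chain reads roughly like this. Every helper name below is a hypothetical stand-in for the server's internals; only the ordering mirrors the flow above:&lt;/p&gt;

```csharp
using System.Drawing;

static bool ClickElement(string label)
{
    // 1. Fast path: exact hit in the accessibility tree, no pixels involved.
    if (TryUiAutomationClick(label))
        return true;

    // 2. Vision fallback: capture, enhance if dark, OCR, click the match.
    using var shot = CaptureForegroundWindow();                       // PrintWindow-style capture
    using var img = IsDarkImage(shot) ? EnhanceForOcr(shot)
                                      : (Bitmap)shot.Clone();
    Point? hit = FindTextCenter(RunOcr(img), label);                  // center of the OCR line match
    if (hit == null)
        return false;
    MouseClick(hit.Value.X, hit.Value.Y);                             // SendInput click at that point
    return true;
}
```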

&lt;p&gt;Problem 3: Multi-Monitor DPI Chaos&lt;/p&gt;

&lt;p&gt;Three monitors, different scaling, negative coordinates for left-side displays. GetWindowRect returns different values depending on DPI awareness.&lt;/p&gt;

&lt;p&gt;One line fixed it:&lt;/p&gt;

&lt;p&gt;SetProcessDpiAwarenessContext(DPI_AWARENESS_CONTEXT_PER_MONITOR_AWARE_V2);&lt;/p&gt;
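
&lt;p&gt;That line needs a small P/Invoke shim, since the context values are pseudo-handles rather than a .NET enum. A sketch of the declaration (-4 is DPI_AWARENESS_CONTEXT_PER_MONITOR_AWARE_V2 in the Windows SDK headers, available since Windows 10 1703):&lt;/p&gt;

```csharp
using System;
using System.Runtime.InteropServices;

static class DpiSetup
{
    // Pseudo-handle from the Windows SDK headers: -4 = PER_MONITOR_AWARE_V2.
    static readonly IntPtr PerMonitorAwareV2 = new IntPtr(-4);

    [DllImport("user32.dll")]
    static extern bool SetProcessDpiAwarenessContext(IntPtr value);

    // Call once at startup, before any window or monitor queries.
    public static void Apply() => SetProcessDpiAwarenessContext(PerMonitorAwareV2);
}
```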

&lt;p&gt;Problem 4: Speed&lt;/p&gt;

&lt;p&gt;Navigating draw.io required 15+ individual tool calls. Click menu, wait, click submenu, wait, select all, paste, wait, click OK...&lt;/p&gt;

&lt;p&gt;We built run_sequence — batch multiple actions in a single MCP call:&lt;/p&gt;

&lt;p&gt;focus "Chrome"&lt;br&gt;
  wait 300&lt;br&gt;
  hotkey ctrl+a&lt;br&gt;
  wait 200&lt;br&gt;
  hotkey ctrl+v&lt;br&gt;
  wait 500&lt;/p&gt;

&lt;p&gt;One tool call instead of six. The round-trip latency between AI and tools is the real bottleneck, not the actions themselves.&lt;/p&gt;
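
&lt;p&gt;A batched sequence like that can be dispatched with a tiny line-oriented interpreter. This is a hypothetical sketch covering only the three verbs in the example; FocusWindow and SendHotkey are stand-ins, and the real run_sequence grammar may differ:&lt;/p&gt;

```csharp
using System;
using System.Threading;

static void RunSequence(string script)
{
    foreach (string raw in script.Split('\n'))
    {
        string line = raw.Trim();
        if (line.Length == 0) continue;
        string[] parts = line.Split(new[] { ' ' }, 2);   // verb + argument
        switch (parts[0])
        {
            case "wait":   Thread.Sleep(int.Parse(parts[1])); break;   // milliseconds
            case "focus":  FocusWindow(parts[1].Trim('"')); break;     // hypothetical helper
            case "hotkey": SendHotkey(parts[1]); break;                // hypothetical helper
            default:       throw new InvalidOperationException("unknown op: " + line);
        }
    }
}
```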

&lt;p&gt;Problem 5: Which Windows Are Actually Visible?&lt;/p&gt;

&lt;p&gt;The AI would try to interact with windows hidden behind other windows. We built grid-based occlusion detection:&lt;/p&gt;

&lt;p&gt;Chrome (chrome)     @ -2060,-1461  ← 100% visible&lt;br&gt;
  VS Code (Code)      @ -1500,-800   ← 65% visible&lt;br&gt;
  Explorer (explorer) @ -1400,-700   ← 0% visible [OCCLUDED]&lt;/p&gt;

&lt;p&gt;24px grid cells, process windows front-to-back, calculate visible fraction. The AI now knows which windows are actually usable.&lt;/p&gt;
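
&lt;p&gt;The fraction calculation can be sketched as a front-to-back pass over a cell grid; the struct and field names are illustrative, and the bounding-box offset handles the negative multi-monitor coordinates shown above:&lt;/p&gt;

```csharp
using System;

struct Win { public int X, Y, W, H; }

// windows ordered front (index 0) to back; returns visible fraction per window
static double[] VisibleFractions(Win[] windows, int cell)
{
    // Bounding box of all windows (offset handles negative coordinates).
    int minX = int.MaxValue, minY = int.MaxValue, maxX = int.MinValue, maxY = int.MinValue;
    foreach (var w in windows)
    {
        minX = Math.Min(minX, w.X); minY = Math.Min(minY, w.Y);
        maxX = Math.Max(maxX, w.X + w.W); maxY = Math.Max(maxY, w.Y + w.H);
    }
    int cols = (maxX - minX + cell - 1) / cell;
    int rows = (maxY - minY + cell - 1) / cell;
    int[,] owner = new int[rows, cols];            // which window claimed each cell
    for (int r = 0; rows > r; r++)
        for (int c = 0; cols > c; c++)
            owner[r, c] = -1;

    var result = new double[windows.Length];
    for (int i = 0; windows.Length > i; i++)       // front-to-back
    {
        var w = windows[i];
        int visible = 0, totalCells = 0;
        for (int y = w.Y; w.Y + w.H > y; y += cell)
            for (int x = w.X; w.X + w.W > x; x += cell)
            {
                int r = (y - minY) / cell, c = (x - minX) / cell;
                totalCells++;
                if (owner[r, c] == -1) { owner[r, c] = i; visible++; }  // first claim wins
            }
        result[i] = totalCells > 0 ? (double)visible / totalCells : 0;
    }
    return result;
}
```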

&lt;p&gt;The Biggest Lesson&lt;/p&gt;

&lt;p&gt;Don't screenshot. OCR first.&lt;/p&gt;

&lt;p&gt;When we switched from screenshot-first to OCR-first workflows, everything improved:&lt;/p&gt;

&lt;p&gt;┌────────────────────────────────┬─────────────┬───────┬─────────────┐&lt;br&gt;
  │            Approach            │   Tokens    │ Speed │ Reliability │&lt;br&gt;
  ├────────────────────────────────┼─────────────┼───────┼─────────────┤&lt;br&gt;
  │ Screenshot → guess coordinates │ ~5000/image │ Slow  │ Low         │&lt;br&gt;
  ├────────────────────────────────┼─────────────┼───────┼─────────────┤&lt;br&gt;
  │ OCR → exact text + coordinates │ ~200/scan   │ Fast  │ High        │&lt;br&gt;
  └────────────────────────────────┴─────────────┴───────┴─────────────┘&lt;/p&gt;

&lt;p&gt;We embedded this as ServerInstructions in the MCP server itself. Every AI client that connects receives the optimal workflow automatically:&lt;/p&gt;

&lt;p&gt;options.ServerInstructions = """&lt;br&gt;
      Observation Priority:&lt;br&gt;
      1. ocr_window — exact text with click coordinates&lt;br&gt;
      2. get_window_details — UI elements with types&lt;br&gt;
      3. list_windows — window visibility %&lt;br&gt;
      4. screenshot_to_file — ONLY for final verification&lt;br&gt;
      """;&lt;/p&gt;

&lt;p&gt;The AI learns the right approach on connection. No configuration needed.&lt;/p&gt;

&lt;p&gt;The Demo&lt;/p&gt;

&lt;p&gt;We told Claude Code: "Create an architecture diagram in draw.io."&lt;/p&gt;

&lt;p&gt;The AI:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;list_windows — found Chrome&lt;/li&gt;
&lt;li&gt;navigate_to_url — opened app.diagrams.net&lt;/li&gt;
&lt;li&gt;ocr_window — read the storage dialog, saw "Create New Diagram"&lt;/li&gt;
&lt;li&gt;click_element "Create New Diagram" — UIAutomation click&lt;/li&gt;
&lt;li&gt;click_element "Create" — created blank diagram&lt;/li&gt;
&lt;li&gt;set_clipboard — prepared the XML&lt;/li&gt;
&lt;li&gt;click_menu_item "Extras" &amp;gt; "Edit Diagram" — one call for menu navigation&lt;/li&gt;
&lt;li&gt;run_sequence — Ctrl+A, Ctrl+V to paste XML&lt;/li&gt;
&lt;li&gt;click_element "OK" — applied the diagram&lt;/li&gt;
&lt;li&gt;ocr_window — verified all elements rendered correctly&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Dark-themed web app. Multiple dialogs. File save confirmation. All handled autonomously.&lt;/p&gt;

&lt;p&gt;What's In The Box&lt;/p&gt;

&lt;p&gt;45+ tools organized into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vision — scan_desktop, list_windows, ocr_window, get_window_details&lt;/li&gt;
&lt;li&gt;Interaction — click_element (with OCR fallback), interact, fill_form, click_menu_item&lt;/li&gt;
&lt;li&gt;Batch — run_sequence, focus_and_hotkey&lt;/li&gt;
&lt;li&gt;Input — mouse_click, keyboard_hotkey, paste_text&lt;/li&gt;
&lt;li&gt;Capture — screenshot_to_file, screenshot_window (PrintWindow API — works when obscured)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Under the hood:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Win32 P/Invoke (EnumWindows, SendInput, PrintWindow)&lt;/li&gt;
&lt;li&gt;UIAutomation with CacheRequest (single cross-process call per window)&lt;/li&gt;
&lt;li&gt;Windows.Media.Ocr with dark theme auto-enhancement&lt;/li&gt;
&lt;li&gt;Grid-based window occlusion analysis&lt;/li&gt;
&lt;li&gt;30-second smart cache with per-window refresh&lt;/li&gt;
&lt;li&gt;Per-monitor DPI awareness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This Is v1&lt;/p&gt;

&lt;p&gt;This is the first open-source release. It works — we use it daily building Orbination. But it's v1.&lt;/p&gt;

&lt;p&gt;With community input, here's what we think could make it great:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chrome DevTools Protocol integration for direct browser control&lt;/li&gt;
&lt;li&gt;Better OCR text matching (word boundaries, fuzzy matching)&lt;/li&gt;
&lt;li&gt;Linux/macOS ports using AT-SPI and the macOS Accessibility API&lt;/li&gt;
&lt;li&gt;Workflow recording and replay&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We believe desktop control is the missing layer for truly autonomous AI agents. Not agents that just generate text — agents that use computers like humans do.&lt;/p&gt;

&lt;p&gt;Orbination launches next month. This is the first piece we're sharing.&lt;/p&gt;

&lt;p&gt;Get It&lt;/p&gt;

&lt;p&gt;git clone &lt;a href="https://github.com/amichail-1/Orbination-AI-Desktop-Vision-Control.git" rel="noopener noreferrer"&gt;https://github.com/amichail-1/Orbination-AI-Desktop-Vision-Control.git&lt;/a&gt;&lt;br&gt;
  cd Orbination-AI-Desktop-Vision-Control/DesktopControlMcp&lt;br&gt;
  dotnet build -c Release&lt;br&gt;
  claude mcp add desktop-control -- "bin\Release\net8.0-windows\DesktopControlMcp.exe"&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/amichail-1/Orbination-AI-Desktop-Vision-Control" rel="noopener noreferrer"&gt;https://github.com/amichail-1/Orbination-AI-Desktop-Vision-Control&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MIT License. Built by &lt;a href="https://leia.gr" rel="noopener noreferrer"&gt;https://leia.gr&lt;/a&gt; for &lt;a href="https://orbination.com" rel="noopener noreferrer"&gt;https://orbination.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>dotnet</category>
      <category>opensource</category>
      <category>automation</category>
    </item>
  </channel>
</rss>
