<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: 22Gstudios</title>
    <description>The latest articles on Forem by 22Gstudios (@22gstudios).</description>
    <link>https://forem.com/22gstudios</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3903408%2Fb0a6b164-4d1a-414f-af9e-9fe082b49bcb.png</url>
      <title>Forem: 22Gstudios</title>
      <link>https://forem.com/22gstudios</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/22gstudios"/>
    <language>en</language>
    <item>
      <title>Why I'm not open sourcing my Android voice agent</title>
      <dc:creator>22Gstudios</dc:creator>
      <pubDate>Thu, 07 May 2026 02:08:13 +0000</pubDate>
      <link>https://forem.com/22gstudios/why-im-not-open-sourcing-my-android-voice-agent-30ab</link>
      <guid>https://forem.com/22gstudios/why-im-not-open-sourcing-my-android-voice-agent-30ab</guid>
      <description>&lt;p&gt;I built &lt;a href="https://x.com/22Gstudios/status/2051377769414791582" rel="noopener noreferrer"&gt;LetItDo&lt;/a&gt;, a voice agent for Android that finishes real tasks. WhatsApp messages, Spotify, grocery orders, the boring 47-tap routines.&lt;/p&gt;

&lt;p&gt;Every other day someone asks: are you going to open source it?&lt;/p&gt;

&lt;p&gt;The honest answer is no.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open source is not free distribution
&lt;/h2&gt;

&lt;p&gt;The most common reason indie devs open source: "I will get more users by going open source." It is rarely true.&lt;/p&gt;

&lt;p&gt;Open source attracts contributors and users who want a free tool. It does not magically attract people who pay. Linear, Notion, Superhuman, Cal.com all started closed. Cal.com later open sourced AFTER it had paying customers, as a marketing move with infrastructure already in place.&lt;/p&gt;

&lt;p&gt;If your product has zero distribution, open sourcing it gives you zero contributors and zero users with extra steps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open source kills monetization optionality
&lt;/h2&gt;

&lt;p&gt;The moment LetItDo is on GitHub:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Someone wraps it, hosts it, and charges for it&lt;/li&gt;
&lt;li&gt;Customers who would have paid me $5/month build their own from source&lt;/li&gt;
&lt;li&gt;Acquirers stop being interested&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The "AI makes everything copyable" argument
&lt;/h2&gt;

&lt;p&gt;People say that with Claude Code, anyone can build LetItDo in a week. Half right. AI compresses build time; it does not eliminate the moat.&lt;/p&gt;

&lt;p&gt;The remaining moats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distribution and brand&lt;/li&gt;
&lt;li&gt;Specific UX choices&lt;/li&gt;
&lt;li&gt;The skill capture loop (data, not code)&lt;/li&gt;
&lt;li&gt;OEM-specific quirks (Vivo, Xiaomi, Samsung battery savers)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Open sourcing compresses a competitor's copy time from weeks to days. For an unvalidated product, that is the worst possible time to give it away.&lt;/p&gt;

&lt;h2&gt;
  
  
  The general rule
&lt;/h2&gt;

&lt;p&gt;Open source AFTER you have a profitable product, AS marketing. Not BEFORE you have users, AS distribution. The latter never works the way founders hope.&lt;/p&gt;

&lt;p&gt;If you are an indie dev about to MIT-license a product you spent 60 hours on, ask: do I want to build a business or build a side project?&lt;/p&gt;

&lt;h2&gt;
  
  
  Try LetItDo
&lt;/h2&gt;

&lt;p&gt;Demo: &lt;a href="https://x.com/22Gstudios/status/2051377769414791582" rel="noopener noreferrer"&gt;https://x.com/22Gstudios/status/2051377769414791582&lt;/a&gt;&lt;br&gt;
Early access: &lt;a href="https://tally.so/r/jaGvx9" rel="noopener noreferrer"&gt;https://tally.so/r/jaGvx9&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have shipped indie products and made the open source decision differently, I want to hear why.&lt;/p&gt;

</description>
      <category>indiehacking</category>
      <category>business</category>
      <category>opensource</category>
      <category>startup</category>
    </item>
    <item>
      <title>I built a voice agent for Android in a weekend. Here's what actually worked</title>
      <dc:creator>22Gstudios</dc:creator>
      <pubDate>Tue, 05 May 2026 03:21:20 +0000</pubDate>
      <link>https://forem.com/22gstudios/i-built-a-voice-agent-for-android-in-a-weekend-heres-what-actually-worked-4akj</link>
      <guid>https://forem.com/22gstudios/i-built-a-voice-agent-for-android-in-a-weekend-heres-what-actually-worked-4akj</guid>
      <description>&lt;p&gt;Yesterday I posted this on X: &lt;a href="https://x.com/22Gstudios/status/2051377769414791582" rel="noopener noreferrer"&gt;https://x.com/22Gstudios/status/2051377769414791582&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LetItDo is a voice agent for Android that actually finishes tasks. Built solo in two and a half days, on top of an existing Auto.js fork called AutoX, with Charm's Crush as the agent runtime. The architecture ended up closer to what production agents like Perplexity Comet use than I expected, and the bugs that bit me were not the ones I planned for.&lt;/p&gt;

&lt;p&gt;I wanted my phone to do the boring stuff. Send a WhatsApp message to a contact. Open Spotify and play a song. Scroll Instagram and like a few posts. Stuff Siri and Google Assistant pretend to do but don't actually finish.&lt;/p&gt;

&lt;p&gt;Here's the honest writeup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this could even run on a phone
&lt;/h2&gt;

&lt;p&gt;Most agent runtimes are Python. Python is hostile to Android: no good way to ship the interpreter in your APK without dragging in 50MB+ of CPython and fighting NDK quirks. I needed the agent to run on the user's phone, not on some server they'd have to host.&lt;/p&gt;

&lt;p&gt;Charm's Crush is written in Go. Go cross-compiles to Android arm64 in one command. The whole runtime fits in a &lt;code&gt;libcrush.so&lt;/code&gt; native library that I bundle into the APK alongside AutoX. The agent runs entirely on-device. The only network call is to the user's chosen LLM API.&lt;/p&gt;

&lt;h2&gt;
  
  
  Untethered, no laptop, no ADB at runtime
&lt;/h2&gt;

&lt;p&gt;LetItDo runs untethered. No USB cable, no ADB connection at runtime, no laptop hosting the agent. The closest research projects (AppAgent from Tencent, Mobile-Agent from Alibaba, DroidBot-GPT) all require the agent to live on a laptop and control the phone via ADB. Their architecture works fine for research demos but breaks the moment the user isn't sitting at a desk with a USB-C cable.&lt;/p&gt;

&lt;p&gt;LetItDo is a regular Android app the user installs once and uses with their voice. The only one-time setup is &lt;code&gt;adb shell pm grant com.letitdo.v7 android.permission.WRITE_SECURE_SETTINGS&lt;/code&gt; for the OEM-survival trick I'll cover below. After that, no ADB. The phone is the whole stack.&lt;/p&gt;
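
&lt;p&gt;As a small aside, the app can verify at startup that the one-time grant is still in place before it relies on the self-heal trick below. A hedged sketch in AutoX-style JS (the toast wording is mine; &lt;code&gt;checkSelfPermission&lt;/code&gt; is standard Android):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;importClass(android.content.pm.PackageManager);

// One-time setup check: was WRITE_SECURE_SETTINGS granted via adb?
// Without it, the self-heal path described later cannot re-enable
// the accessibility service after an OEM battery manager kills it.
var granted = context.checkSelfPermission("android.permission.WRITE_SECURE_SETTINGS")
           == PackageManager.PERMISSION_GRANTED;

if (!granted) {
  toast("Run the adb grant command once, then restart LetItDo.");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;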

&lt;h2&gt;
  
  
  The first wrong intuition: vision is the answer
&lt;/h2&gt;

&lt;p&gt;I copied the architecture from browser-harness, which is a small Python harness that connects an LLM to your real Chrome via CDP. It works because the LLM has vision. The agent calls &lt;code&gt;capture_screenshot&lt;/code&gt;, the host renders the PNG to the model, the model picks pixel coordinates, the harness calls &lt;code&gt;click_at_xy&lt;/code&gt;. There is one click primitive. No selectors. The whole loop is built around the assumption that the model can see.&lt;/p&gt;

&lt;p&gt;I tried to translate this to Android. Took screenshots. Sent them to qwen-plus. The model replied "I cannot see images." Because qwen-plus is text only.&lt;/p&gt;

&lt;p&gt;Production agents have already chosen the answer. Comet calls &lt;code&gt;Accessibility.getFullAXTree&lt;/code&gt; first, with screenshots only as a fallback. OpenAI Operator uses a hybrid (AX tree primary, vision for charts and captchas). Browser-harness leans on vision because its LLM has eyes. I copied the wrong template. The right one is whatever Comet does, even if you have a vision model, because a cheap-first cascade beats one-shot vision on cost, latency, and reliability.&lt;/p&gt;

&lt;p&gt;So LetItDo's interaction layer became a cascade, cheapest first:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;exact a11y text/desc/id (free, instant)&lt;/li&gt;
&lt;li&gt;substring a11y (textContains/descContains)&lt;/li&gt;
&lt;li&gt;fuzzy a11y tree (Levenshtein on dumped nodes, catches STT typos like Shawn → Shaun)&lt;/li&gt;
&lt;li&gt;OCR (Paddle, on-device, ~2s, fallback for Canvas/WebView)&lt;/li&gt;
&lt;li&gt;vision (multimodal LLM, opt-in, only when 1-4 fail)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each layer earns its slot in maybe 1% of cases, the ones the layer above it misses. The cascade exists because no single layer is right for every surface.&lt;/p&gt;
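
&lt;p&gt;To make the cascade concrete, here is a stripped-down sketch of the first three layers in AutoX-style JS. The function name, the threshold, and &lt;code&gt;levenshtein&lt;/code&gt; are illustrative, not the shipped code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Layers 1-3 of the cascade: exact, substring, then fuzzy a11y match.
// OCR (layer 4) and vision (layer 5) only run if this returns null.
// levenshtein() is assumed to be a small local helper, not an AutoX built-in.
function find_node(query) {
  // 1. exact text / content-description / id match (free, instant)
  var node = text(query).findOnce() || desc(query).findOnce() || id(query).findOnce();
  if (node) return node;

  // 2. substring match, catches truncated labels
  node = textContains(query).findOnce() || descContains(query).findOnce();
  if (node) return node;

  // 3. fuzzy match over dumped nodes, catches STT typos like Shawn vs Shaun
  var best = null;
  var bestDist = 3;                          // illustrative threshold
  var candidates = textMatches(/.+/).find(); // every node with non-empty text
  for (var i = 0; i &amp;lt; candidates.size(); i++) {
    var n = candidates.get(i);
    var d = levenshtein(query.toLowerCase(), n.text().toLowerCase());
    if (d &amp;lt; bestDist) { bestDist = d; best = n; }
  }
  return best;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;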

&lt;h2&gt;
  
  
  The bug that took 90 seconds and 11 round-trips to find
&lt;/h2&gt;

&lt;p&gt;I told the agent: "send hello to Shaun on WhatsApp." It opened WhatsApp. Then it tapped what looked like the right element. The chat did not open. It tapped again. It scrolled. It went into Settings somehow. After 11 round-trips and 90 seconds it gave up.&lt;/p&gt;

&lt;p&gt;The actual problem: WhatsApp's chat list has the contact name "Shaun" rendered as a TextView that is &lt;code&gt;clickable=false&lt;/code&gt;. The clickable element is a parent LinearLayout four levels up the tree. The avatar to the left of the name has content-description "Shaun picture" and IS clickable, but tapping it opens the profile preview, not the chat.&lt;/p&gt;

&lt;p&gt;When the agent fuzzy-matched "Shawn" (STT typo of Shaun) against the screen, OCR found the text glyph. The agent clicked at the glyph's bounding box center. Android's hit testing routed that to whichever clickable ancestor wanted it, and on Vivo's WhatsApp build that turned out to be the avatar's tap zone, not the row's. So we tapped the profile icon and opened a contact preview instead of the chat.&lt;/p&gt;

&lt;p&gt;The fix was a five-line walk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;tap_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;findOne&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;cur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cur&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clickable&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="nx"&gt;cur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Find the text node. Walk up the parent chain. Stop at the first clickable ancestor. Click that, not the leaf. &lt;code&gt;AccessibilityService.performAction(ACTION_CLICK)&lt;/code&gt; fires on the row container. Chat opens. 12 seconds.&lt;/p&gt;

&lt;p&gt;This is the exact pattern Comet uses on the web. Their accessibility tree parser walks up from text nodes to clickable ancestors before reporting click targets to the model. I had to rediscover it for Android because I started from the wrong template.&lt;/p&gt;

&lt;h2&gt;
  
  
  The other bug: bounds were always null
&lt;/h2&gt;

&lt;p&gt;The structured tree dump I had been shipping for two days was returning nodes without coordinates. Every "smoke test" I had run actually used a different code path (AutoX's UiObject, which has working &lt;code&gt;.bounds()&lt;/code&gt;) instead of raw AccessibilityNodeInfo (which doesn't). The function name is the same. The return shape is different.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Wrong:&lt;/span&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="py"&gt;bounds&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getBoundsInScreen&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;// returns void, not Rect&lt;/span&gt;

&lt;span class="c1"&gt;// Right:&lt;/span&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="py"&gt;rect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="n"&gt;android&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;graphics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Rect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getBoundsInScreen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;getBoundsInScreen&lt;/code&gt; takes an out-parameter. Calling it bare returns nothing. Every node in my tree dump had cx and cy as null. None of my "tests" caught it because they exercised the other code path. The second I actually filtered for cx and got back zero results out of 220 nodes, the bug was obvious.&lt;/p&gt;
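
&lt;p&gt;For reference, the corrected dump path looks roughly like this. A sketch with illustrative field names, not the shipped tree-dump schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// "n" is a raw AccessibilityNodeInfo from the tree walk.
// getBoundsInScreen fills an out-parameter Rect; its return value is void.
function nodeToEntry(n) {
  var rect = new android.graphics.Rect();
  n.getBoundsInScreen(rect);
  return {
    text: String(n.getText() || ""),
    desc: String(n.getContentDescription() || ""),
    clickable: n.isClickable(),
    cx: rect.centerX(),   // real coordinates now, not null
    cy: rect.centerY()
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;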

&lt;p&gt;This is a personal lesson, not a technical one. Smoke-test every helper on the device the day you write it, before you build anything on top of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The OEM problem nobody talks about
&lt;/h2&gt;

&lt;p&gt;Vivo, Xiaomi, Oppo, OnePlus, Huawei phones aggressively kill background services to save battery. Android's accessibility service is one of the services they kill. So even when the user grants accessibility access to your app, the OEM's battery manager turns it off later. The app keeps running. Its permissions look fine in Settings. But &lt;code&gt;auto.service&lt;/code&gt; is null. Every script throws "Accessibility service is not started."&lt;/p&gt;

&lt;p&gt;This is also what kills Panda (an open-source Android voice agent in this space). Their issue tracker has #275 about Xiaomi/Huawei battery management as an unresolved roadmap item, plus a Reddit complaint that Android revokes Panda's permissions every few hours with no recovery.&lt;/p&gt;

&lt;p&gt;The fix is mildly nuclear:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Request &lt;code&gt;WRITE_SECURE_SETTINGS&lt;/code&gt; via ADB once at install (&lt;code&gt;adb shell pm grant com.letitdo.v7 android.permission.WRITE_SECURE_SETTINGS&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Watchdog WorkManager fires every 15 minutes. Reads the secure setting &lt;code&gt;enabled_accessibility_services&lt;/code&gt;. If our component isn't in the list, write it back.&lt;/li&gt;
&lt;li&gt;Pre-flight check before each agent run. If the service isn't bound (verified via local TCP ping to our bridge), call &lt;code&gt;heal()&lt;/code&gt; which writes the setting and waits up to 5s for the system to rebind.&lt;/li&gt;
&lt;li&gt;Mid-flight retry. If the agent's run_script call fails AND &lt;code&gt;auto.service&lt;/code&gt; is null when the call returns, heal once and retry the same script.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In practice users see nothing. The accessibility service stays bound across OEM kill cycles. They speak a command, the agent runs, no setup ceremony.&lt;/p&gt;
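
&lt;p&gt;The heal step itself is tiny. A hedged sketch of the idea (the component name is illustrative, and the real code adds logging plus the wait-for-rebind):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Re-add our AccessibilityService to the secure setting when an OEM
// battery manager silently strips it. Requires WRITE_SECURE_SETTINGS.
var Secure = android.provider.Settings.Secure;
var COMPONENT = "com.letitdo.v7/.ScriptAccessibilityService";  // illustrative name

function heal() {
  var resolver = context.getContentResolver();
  var enabled = String(Secure.getString(resolver, "enabled_accessibility_services") || "");

  if (enabled.indexOf(COMPONENT) == -1) {
    var updated = enabled.length ? enabled + ":" + COMPONENT : COMPONENT;
    Secure.putString(resolver, "enabled_accessibility_services", updated);
    Secure.putString(resolver, "accessibility_enabled", "1");
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;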

&lt;p&gt;Is this hostile to Google's design? A little. Google bans Sova (another Android voice agent) from the Play Store specifically because Sova uses the accessibility API for "universal automation." LetItDo will probably never reach the Play Store either. Both apps have to live as sideloads. Sova self-hosts the APK. I'll do the same when I open early access.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skill capture: the agent writes its own playbooks
&lt;/h2&gt;

&lt;p&gt;This is the part I'm most happy with.&lt;/p&gt;

&lt;p&gt;The first time the agent solves "turn on flashlight," it flails. It tries AutoX's &lt;code&gt;device.flash&lt;/code&gt; which doesn't exist on this device. It tries opening the quick settings panel and tapping the torch tile. It tries hardware key shortcuts. After about ten attempts it lands on &lt;code&gt;android.hardware.camera2.CameraManager.setTorchMode(cameraId, true)&lt;/code&gt; and the flashlight turns on.&lt;/p&gt;

&lt;p&gt;Crush has a built-in &lt;code&gt;write&lt;/code&gt; tool. The system prompt tells the agent: after a successful task, write a SKILL.md to the skills directory describing what worked. The agent does this on its own, with no prompting beyond the system message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="o"&gt;---&lt;/span&gt;
&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;flashlight&lt;/span&gt;
&lt;span class="nx"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Turn on/off the device flashlight using CameraManager.setTorchMode.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="nx"&gt;Use&lt;/span&gt; &lt;span class="nx"&gt;when&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="nx"&gt;asks&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;turn&lt;/span&gt; &lt;span class="nx"&gt;on&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;off&lt;/span&gt; &lt;span class="nx"&gt;flashlight&lt;/span&gt; &lt;span class="nx"&gt;or&lt;/span&gt; &lt;span class="nx"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;---&lt;/span&gt;

&lt;span class="nx"&gt;Turn&lt;/span&gt; &lt;span class="nx"&gt;on&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="nf"&gt;importClass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;android&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hardware&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;camera2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;CameraManager&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;importClass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;android&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;cm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getSystemService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;CAMERA_SERVICE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;cameraId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getCameraIdList&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="nx"&gt;cm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setTorchMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cameraId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nl"&gt;Gotcha&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;device&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;flash&lt;/span&gt; &lt;span class="nx"&gt;may&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="nx"&gt;exist&lt;/span&gt; &lt;span class="nx"&gt;on&lt;/span&gt; &lt;span class="nx"&gt;all&lt;/span&gt; &lt;span class="nx"&gt;devices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="nx"&gt;Use&lt;/span&gt; &lt;span class="nx"&gt;CameraManager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;setTorchMode&lt;/span&gt; &lt;span class="nx"&gt;instead&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Crush's progressive disclosure injects all skill metadata into the system prompt at session start. When a relevant skill matches, the body gets loaded. Verified in the logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INFO Skill turn summary component=skills
prompt_len=24 active_total=7
loaded_total=1 loaded_this_turn=[flashlight]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next "turn on flashlight" command: 14 seconds total, single round-trip, exact recipe replay. From 90s to 14s the second time.&lt;/p&gt;

&lt;p&gt;The first time the skill loop closed end-to-end, I sat there for a minute. The agent had spent 90 seconds flailing on "turn on flashlight", then wrote itself a SKILL.md when it finally got CameraManager.setTorchMode working. On the next prompt it loaded the skill, ran the cached recipe verbatim, and finished in 14 seconds. From the outside it looks like nothing. But that's the thing improving itself, on a phone, without me touching it. After that I knew this was real.&lt;/p&gt;

&lt;h2&gt;
  
  
  AutoX is the other half of the stack
&lt;/h2&gt;

&lt;p&gt;If Crush is the agent brain, AutoX is the body. It's an Auto.js fork that's been quietly maintained for years. Out of the box it gave me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A bound AccessibilityService running in a separate &lt;code&gt;:script&lt;/code&gt; process. This is what lets us read and tap the UI tree.&lt;/li&gt;
&lt;li&gt;A Rhino JavaScript engine with full access to Android's Java APIs via &lt;code&gt;importClass&lt;/code&gt;. The agent writes JS that calls &lt;code&gt;android.hardware.camera2.CameraManager&lt;/code&gt; directly. No native bridge to maintain.&lt;/li&gt;
&lt;li&gt;A scripting surface (&lt;code&gt;text("Send").findOne()&lt;/code&gt;, &lt;code&gt;click(x, y)&lt;/code&gt;, &lt;code&gt;app.launch("com.spotify.music")&lt;/code&gt;, &lt;code&gt;device.width&lt;/code&gt;, &lt;code&gt;setClip("hi")&lt;/code&gt;, &lt;code&gt;http.get(url)&lt;/code&gt;) that already covers most automation primitives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Screenshots without MediaProjection.&lt;/strong&gt; This is the big one. The standard Android way to grab the screen is &lt;code&gt;MediaProjection&lt;/code&gt;, which pops up a "Start recording or casting?" dialog every single capture. That kills any voice-agent UX. AutoX's &lt;code&gt;auto.takeScreenshot()&lt;/code&gt; uses an accessibility-API path that doesn't trigger the prompt. The user grants accessibility once at install; nothing else interrupts them. Vision flows just work.&lt;/li&gt;
&lt;li&gt;Bundled Paddle OCR. ~2s per screen, on-device, no network. We use it as layer 4 of the cascade.&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;:script&lt;/code&gt; process boundary that keeps accessibility crashes from killing the main app.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without AutoX I'd have written all of this myself: the accessibility service binding, the JS-to-Android bridge, the screenshot capture without MediaProjection (which is its own sharp-edged research project), the gesture dispatcher. Probably two more weekends of pure scaffolding work.&lt;/p&gt;
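
&lt;p&gt;To give a feel for that scripting surface: this is roughly the kind of script the agent ends up writing for "play Blinding Lights on Spotify". A sketch, not a captured run; Spotify's actual labels vary by version, and &lt;code&gt;tap_text&lt;/code&gt; is the walk-up helper from earlier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Launch, wait, search, tap. Labels and timings are illustrative.
app.launch("com.spotify.music");
sleep(2000);                     // let the app settle

var search = desc("Search").findOne(5000) || text("Search").findOne(5000);
if (search) search.click();

var box = className("android.widget.EditText").findOne(3000);
if (box) box.setText("Blinding Lights");
sleep(1500);                     // wait for results to render

// walks up to the first clickable ancestor before clicking
tap_text("Blinding Lights");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;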

&lt;h2&gt;
  
  
  What I had to build vs what I got from AutoX and Crush
&lt;/h2&gt;

&lt;p&gt;LetItDo is mostly the glue between two existing projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Crush gives the agent&lt;/strong&gt;: a working LLM loop, OpenAI-compatible multi-provider support (OpenAI, Anthropic, Google, DashScope, Groq, Cerebras, OpenRouter, local Ollama), the Agent Skills standard with progressive disclosure, conversation compaction so long sessions don't blow up the context window, sub-agent spawning, and the MCP tool calling protocol.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AutoX gives the phone&lt;/strong&gt;: a bound AccessibilityService, a Rhino JS engine with full access to Android's Java APIs, screenshots without MediaProjection prompts, on-device Paddle OCR, gesture dispatch, and a scripting surface that already covers most of what an automation agent needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I actually built&lt;/strong&gt;: the bridge that lets Crush's MCP tools call into AutoX's accessibility surface. The JS helpers the agent uses to discover and tap UI elements (&lt;code&gt;read_screen&lt;/code&gt;, &lt;code&gt;tap_text&lt;/code&gt; with walk-up-to-clickable, the fuzzy cascade). The OEM survival mechanism. The voice frontend, the result UI, the skill seeding, the on-device service watchdog. Two and a half days of glue and one critical insight (a11y tree first, vision second).&lt;/p&gt;

&lt;p&gt;If LetItDo is interesting, AutoX and Crush deserve most of the credit. I'm being explicit about this because it's the truth and because it tells you what's actually novel here: not the agent, not the phone control, but the combination plus the OEM trick.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest speed numbers
&lt;/h2&gt;

&lt;p&gt;Single-action tasks like "turn on flashlight" floor at 12-18s. Two LLM round-trips per task (decide → run_script → summarize), each ~5-7s on Qwen DashScope. The visible action itself is 30-50ms. The structural floor is the round-trip count. Persistent daemon between prompts saves ~2s of cold-start. Switching to Groq or Cerebras for sub-1s inference saves another 8s. Neither shipped yet.&lt;/p&gt;

&lt;p&gt;Multi-action tasks like Instagram engagement feel faster than single-action because the agent batches: one &lt;code&gt;run_script&lt;/code&gt; with a for-loop over 5 reels = 1 LLM round-trip for 5 actions. Visible activity hides the LLM wait.&lt;/p&gt;
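
&lt;p&gt;Roughly what one batched &lt;code&gt;run_script&lt;/code&gt; payload looks like (coordinates and timings are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Five reels in one script: one LLM round-trip, five visible actions.
var cx = Math.floor(device.width / 2);
var cy = Math.floor(device.height / 2);

for (var i = 0; i &amp;lt; 5; i++) {
  // double-tap the centre of the reel to like it
  click(cx, cy);
  sleep(150);
  click(cx, cy);
  sleep(800);

  // swipe up to the next reel
  swipe(cx, Math.floor(device.height * 0.8), cx, Math.floor(device.height * 0.2), 300);
  sleep(1500);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;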

&lt;h2&gt;
  
  
  What hasn't shipped
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Persistent Crush daemon between prompts. Right now every voice command spawns a fresh process. Could be ~0s cold-start with a long-running daemon listening on stdin or a socket.&lt;/li&gt;
&lt;li&gt;Vision pipeline with cost meter. Vision works (qwen3-vl-plus, gpt-4o, gemini-flash all forward image content correctly) but burns tokens. There's no usage display in the UI yet.&lt;/li&gt;
&lt;li&gt;Cross-prompt memory. Each Crush invocation is a fresh session. Saying "send hi to Shaun" then "make it three exclamation marks" doesn't work; the second prompt has no idea what "it" refers to.&lt;/li&gt;
&lt;li&gt;Play Store distribution. Same accessibility-policy reason Sova got banned will likely catch LetItDo. Sideload only.&lt;/li&gt;
&lt;li&gt;iOS. iOS has no equivalent of AccessibilityService for third-party apps. The whole architecture is non-portable.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;If you want early access, the waitlist is here: &lt;a href="https://tally.so/r/jaGvx9" rel="noopener noreferrer"&gt;https://tally.so/r/jaGvx9&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Open questions I'd like feedback on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Is "untethered Android voice agent" a category, or just a feature Google will eventually ship in Gemini? (Their Android AppFunctions API in Feb 2026 suggests yes.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Should LetItDo's skill library be private to the user (current state), community-shared (PR-style like browser-harness), or auto-synced via cloud for everyone's benefit (network effects but governance nightmares)?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Vision is gated behind model selection. qwen3-vl-plus works, qwen-plus doesn't. Is the right answer to require vision for everyone, or is the a11y-tree-first design good enough that vision stays optional?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you've got opinions, the comments or my DMs are open.&lt;/p&gt;

&lt;p&gt;The closing lesson if you're starting something similar: smoke-test every helper on the device the day you write it. The bug that wasted my Day 3 was a function that had been silently returning null bounds for 48 hours. None of my "tests" caught it because they all hit a different code path. Two days of debugging that should have been two minutes if I'd checked output once. Speed of iteration on a real device is the whole game.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>android</category>
      <category>automation</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Streak apps taught me one missed day undoes six weeks. The science says it doesn't, so I rebuilt my habits.</title>
      <dc:creator>22Gstudios</dc:creator>
      <pubDate>Wed, 29 Apr 2026 03:06:43 +0000</pubDate>
      <link>https://forem.com/22gstudios/streak-apps-taught-me-one-missed-day-undoes-six-weeks-the-science-says-it-doesnt-so-i-rebuilt-my-53ol</link>
      <guid>https://forem.com/22gstudios/streak-apps-taught-me-one-missed-day-undoes-six-weeks-the-science-says-it-doesnt-so-i-rebuilt-my-53ol</guid>
      <description>&lt;h2&gt;
  
  
  Five apps. One missed Tuesday. Quit.
&lt;/h2&gt;

&lt;p&gt;I deleted Streaks, Habitica, Way of Life, Fabulous, and Habit Coach AI in the same month. Not because they were bad apps. Because every single one of them used the same loop, and that loop was breaking me.&lt;/p&gt;

&lt;p&gt;Build a streak. Miss a day. Watch the counter go to zero. Feel like the last six weeks were a lie. Open the app the next morning and not check anything in, because what's the point now. Quietly delete it from the home screen a week later.&lt;/p&gt;

&lt;p&gt;It is wild how much guilt a single number can carry.&lt;/p&gt;

&lt;p&gt;The bit that finally got to me was the asymmetry. A run of forty good days felt like nothing, just background. One missed Tuesday felt like a personal indictment. The app was teaching me that progress is fragile and failure is permanent. That is the opposite of what I wanted to learn.&lt;/p&gt;

&lt;p&gt;So I went looking for the actual research on how habits form, expecting to find some nuance. What I found was much stranger than that.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the research actually says
&lt;/h2&gt;

&lt;p&gt;The most cited paper in this space is Lally et al. (2010), &lt;em&gt;European Journal of Social Psychology&lt;/em&gt;. They followed 96 people forming new habits at University College London for 12 weeks. Two of their findings sit completely outside the worldview of every streak app I had ever used.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Habit formation follows an asymptotic curve, not a streak.&lt;/strong&gt; The average time for a new behaviour to feel automatic was 66 days, but the range was 18 to 254 days, depending on the habit and the person. The path to "automatic" is a curve that bends slowly, not a chain of identical links.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing a single day had no statistically significant effect on the final outcome.&lt;/strong&gt; One miss does not break a habit. The curve barely moves. The thing that streak counters spend all their visual design screaming about, the broken chain, simply does not show up in the data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Read those two sentences again. Every single streak app I had used was built on a model the most cited paper in the field quietly contradicts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I built
&lt;/h2&gt;

&lt;p&gt;I built a small thing called Imperfectly around those two findings.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Your personal Lally curve.&lt;/strong&gt; Instead of a streak, you see a curve fitted to your own check-ins. The shape of the curve is the progress, not the count. A missed day is a tiny dip, not a reset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An estimated day to automaticity.&lt;/strong&gt; A projected date based on your data so far, with the asymptotic curve doing the math. It moves closer when you're consistent and slides back a little when you're not. It never crashes to zero.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A soft message when you miss a day.&lt;/strong&gt; No fire emoji going dark. No "you broke your chain." Just a note that says, in effect, "the curve barely moved, keep going."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No streak counter. No gamification. No signup, no email required.&lt;/p&gt;
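
&lt;p&gt;For the technically curious, the projection can fall out of a standard exponential-saturation curve of the kind Lally et al. fitted. A sketch of the idea, not Imperfectly's exact implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Asymptotic habit curve: automaticity(t) = A * (1 - exp(-k * t)).
// Fit A and k to the user's check-ins, then project when the curve
// reaches 95% of its asymptote, which gives t = ln(20) / k.
function daysToAutomaticity(k) {
  return Math.log(20) / k;
}

// Illustrative: a fitted k of 0.045 per day projects roughly 66 days,
// in line with the average Lally et al. reported.
daysToAutomaticity(0.045);   // about 66.6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;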

&lt;p&gt;You can try it here: &lt;a href="https://imperfectly.cc/" rel="noopener noreferrer"&gt;https://imperfectly.cc/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I am genuinely curious about
&lt;/h2&gt;

&lt;p&gt;The thing I cannot tell from the inside is whether this design actually feels better, or whether it just relocates the anxiety. Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does seeing a curve (instead of a streak) feel motivating?&lt;/li&gt;
&lt;li&gt;Or does the projected day to automaticity start to feel like a soft deadline, the same way a streak counter does?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you try it for a few days, I would love to hear which way it lands for you. The whole thing only works if it feels like permission, not pressure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;p&gt;Lally, P., van Jaarsveld, C. H. M., Potts, H. W. W., &amp;amp; Wardle, J. (2010). How are habits formed: Modelling habit formation in the real world. &lt;em&gt;European Journal of Social Psychology&lt;/em&gt;, 40(6), 998-1009.&lt;/p&gt;

</description>
      <category>habits</category>
      <category>productivity</category>
      <category>psychology</category>
    </item>
  </channel>
</rss>
