Forem: Elliot Gao

Mobile RPA on Android Without Root

Elliot Gao — Tue, 26 May 2026 12:31:10 +0000

Mobile RPA usually starts with a simple sentence:

"We need to do the same thing in this Android app every day."

Maybe it is checking an account. Maybe it is downloading a report. Maybe it is entering a code, reading a status, or moving data between an app and an internal system.

If the app has no API, automation falls back to the UI.

You can automate many Android app workflows without root. The trick is to keep the workflow close to what a human sees: tap labels, fill fields, wait for screens, and save evidence when something breaks.

Quick answer

A no-root Android RPA flow, written here with the Handsets CLI (hs), can look like this:

hs use
hs go com.example.app
hs wait "Sign in" --timeout 15s
hs fill "Email" "$APP_EMAIL"
hs fill "Password" "$APP_PASSWORD"
hs tap "Continue" --visible --unique
hs wait "Dashboard" --timeout 20s

It runs through normal Android debugging access over adb. No root. No custom ROM. No app integration.

Why root is the wrong default

Root can make automation powerful, but it also changes the device.

For business workflows, that is often a problem:

The device no longer behaves like a normal user device.
Some apps detect rooted environments.
Security assumptions change.
Maintenance becomes harder.
Enterprise teams get nervous.

If the workflow can be driven through visible UI, start there.

Model the workflow as states

Good mobile RPA scripts are not just tap sequences. They are state transitions:

Launch app
Wait for Sign in
Fill credentials
Tap Continue
Wait for Dashboard
Open Reports
Download file
Capture confirmation

In shell:

hs go com.example.app
hs wait "Sign in"
hs fill "Email" "$APP_EMAIL"
hs fill "Password" "$APP_PASSWORD"
hs tap "Continue"
hs wait "Dashboard"

The wait commands are as important as the actions. They make the workflow resilient to slow devices and network delays.
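The same state list can be kept as data and turned into commands. Here is a minimal Python sketch, assuming only the hs verbs shown above (go, wait, fill, tap). The Step type, to_command, and run_workflow are hypothetical helpers for illustration, not part of Handsets.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Step:
    """One state transition: an hs verb plus its arguments."""
    verb: str                 # "go", "wait", "fill", or "tap"
    args: tuple[str, ...]


def to_command(step: Step) -> list[str]:
    """Build the argv for one step, e.g. ["hs", "wait", "Dashboard"]."""
    return ["hs", step.verb, *step.args]


def run_workflow(
    steps: Sequence[Step],
    runner: Callable[[list[str]], int] = lambda argv: subprocess.run(argv).returncode,
) -> int:
    """Run steps in order; return the index of the first failing step, or -1."""
    for index, step in enumerate(steps):
        if runner(to_command(step)) != 0:
            return index
    return -1


# The login flow from the shell example, expressed as data.
LOGIN = [
    Step("go", ("com.example.app",)),
    Step("wait", ("Sign in",)),
    Step("fill", ("Email", "user@example.com")),
    Step("tap", ("Continue",)),
    Step("wait", ("Dashboard",)),
]
```

Returning the index of the failing step tells you which state the workflow never reached, which is usually the first thing you want to know.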

Prefer labels over coordinates

Coordinate automation is tempting:

adb shell input tap 540 860

But RPA workflows need to survive small layout changes.

Use visible labels instead:

hs tap "Download"
hs wait "Saved"

When labels repeat, tighten the selector:

hs tap 'Button:has-text("Download")' --visible --unique

If there are multiple matches, --unique fails instead of tapping the wrong one.

Capture proof

Business automation needs evidence.

At the end of a successful run:

hs see --size 768 "/tmp/run-success.jpg"
hs ui > "/tmp/run-ui.txt"

On failure, install a trap before the workflow commands so it can fire when one of them fails:

ARTIFACTS="/tmp/mobile-rpa-$$"
mkdir -p "$ARTIFACTS"
trap 'hs ui > "$ARTIFACTS/ui.txt"; hs see --size 768 "$ARTIFACTS/screen.jpg"; hs logs --tail 200 > "$ARTIFACTS/logs.txt"; echo "$ARTIFACTS"' ERR

That gives you a screenshot, a UI dump, and recent logs for debugging.
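If the job is driven from Python rather than bash, the same three captures can be wrapped in one function. This is a sketch under assumptions: it uses only the hs ui, hs see --size, and hs logs --tail invocations shown above, and save_artifacts and its runner parameter are hypothetical names.

```python
import subprocess
from pathlib import Path
from typing import Callable

Runner = Callable[[list[str]], str]


def _run(argv: list[str]) -> str:
    """Run a command and return its stdout; a non-zero exit does not raise."""
    return subprocess.run(argv, capture_output=True, text=True).stdout


def save_artifacts(directory: Path, runner: Runner = _run) -> list[Path]:
    """Save a UI dump, a screenshot, and recent logs into `directory`."""
    directory.mkdir(parents=True, exist_ok=True)
    ui_path = directory / "ui.txt"
    screen_path = directory / "screen.jpg"
    logs_path = directory / "logs.txt"

    ui_path.write_text(runner(["hs", "ui"]))
    # hs see is given the output path and writes the image itself.
    runner(["hs", "see", "--size", "768", str(screen_path)])
    logs_path.write_text(runner(["hs", "logs", "--tail", "200"]))
    return [ui_path, screen_path, logs_path]
```

Because the default runner does not raise on a failing command, one broken capture does not stop the other two from being attempted.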

Handle OTP and notification flows

Many mobile workflows involve one-time passwords, push notifications, or deep links.

When the app shows a code field:

hs wait "Enter the code"
hs fill "Code" "$OTP_CODE"
hs tap "Verify"

When a push opens the app:

hs wait com.example.app --timeout 15s
hs wait "Approve" --timeout 15s
hs tap "Approve"

The important part is still the same: wait for real UI state instead of sleeping.

Where LLM agents fit

Some RPA workflows are too variable for a fixed script.

An LLM agent can help decide the next action when the screen changes. But the tool surface should still be small:

tap   Button    "Approve"  #approve  540,860
fill  EditText  "Code"     #otp      540,640

The agent should choose labels and actions, not raw pixels. That keeps the run auditable.
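One way to enforce that is to check the agent's proposed action against the rows it was shown before running anything. The sketch below assumes the row layout seen in the samples above (verb, class, quoted label, #id, x,y); the real format may carry more fields. Row, parse_rows, and check_action are hypothetical helpers.

```python
import re
from typing import NamedTuple

# Matches rows such as:  tap   Button    "Approve"  #approve  540,860
# The layout is inferred from the examples, not from a format spec.
ROW = re.compile(
    r'^(?P<verb>\w+)\s+(?P<cls>\w+)\s+"(?P<label>[^"]*)"\s+#(?P<id>\S+)\s+\d+,\d+'
)


class Row(NamedTuple):
    verb: str
    cls: str
    label: str
    node_id: str


def parse_rows(table: str) -> list[Row]:
    """Parse an action table into rows, skipping lines that do not match."""
    rows = []
    for line in table.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append(Row(match["verb"], match["cls"], match["label"], match["id"]))
    return rows


def check_action(verb: str, label: str, rows: list[Row]) -> str:
    """Return "ok", "NOT_FOUND", or "AMBIGUOUS" for a proposed action."""
    matches = [row for row in rows if row.verb == verb and row.label == label]
    if not matches:
        return "NOT_FOUND"
    if len(matches) > 1:
        return "AMBIGUOUS"
    return "ok"
```

A rejected proposal can go back to the model with the reason attached, and the rejection itself becomes part of the audit trail.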

Limitations

No-root mobile RPA still follows Android's security model.

Some things may require device-owner policy, app integration, or root:

Reading private app data.
Bypassing secure windows.
Changing protected system settings.
Automating apps that intentionally block accessibility.

For many workflows, though, visible UI automation is enough.

FAQ

Can Android RPA run without root?

Yes. Many Android workflows can be automated without root by using ADB, visible UI labels, text input, waits, screenshots, and logs.

Is this the same as Appium?

No. Appium is a full mobile testing framework. A CLI workflow with Handsets is smaller and better suited to scripts, RPA jobs, and agent loops.

Can this run on real devices?

Yes. It works on real Android devices and emulators as long as adb can reach the device.

What should I log for compliance?

At minimum: start time, device/session id, action timeline, final status, screenshots for important states, and logs around failures.

Originally published at https://handsets.dev/blog/mobile-rpa-android-without-root/.

How to Debug LLM-Driven Android Automation Runs

Elliot Gao — Tue, 26 May 2026 12:21:43 +0000

LLM-driven Android automation fails in strange ways.

The model may tap the wrong label. The screen may change between observation and action. A keyboard may cover the button. A permission dialog may appear. The app may still be loading. The UI dump may expose two identical "Continue" buttons.

If all you saved is the final screenshot, debugging is painful.

You need a run trace.

Quick answer

For every Android agent step, save:

the compact UI dump
the screenshot when needed
the model's chosen action
the actual device command
the result or structured error
recent logs
the top package/activity

The minimum useful trace looks like this:

observe: tap Button "Continue" #continue 540,860
model:   tap "Continue"
action:  hs tap "Continue" --visible --unique
result:  ok
wait:    hs wait "Dashboard" --timeout 15s
result:  TIMEOUT

That is much easier to debug than "the agent failed."

The failure modes

Android agent failures usually fall into a few buckets.

Failure	What it means
`NOT_FOUND`	The target label or selector was not visible
`AMBIGUOUS`	More than one visible node matched
`TIMEOUT`	The expected next state never appeared
`SECURE_WINDOW`	Android blocked screenshots for the current window
Wrong action	The model chose a bad label or command
Stale observation	The UI changed after the model saw it

Good tooling should preserve which bucket happened.

If everything becomes "click failed", the agent cannot recover intelligently.

Save the UI dump before the action

The UI dump is the agent's view of the world.

Save it before each model decision:

hs ui > run/0007-ui.txt

For LLM agents, a compact action table is usually better than full XML:

fill  EditText  "Email"     #email     540,540
fill  EditText  "Password"  #password  540,640  [password]
tap   Button    "Continue"  #continue  540,860

When a model picks the wrong action, this file tells you whether the model had a reasonable choice.

Save screenshots selectively

Screenshots are valuable, but you do not need a full-resolution PNG on every step.

For most agent debugging:

hs see --size 768 run/0007-screen.jpg

Use screenshots when:

the UI dump has too little information
the app renders custom controls
visual layout matters
a failure needs human review

Use the text UI as the default. Use screenshots as evidence.

Record the model action separately

Do not only save the final command.

Save what the model actually emitted:

{
  "step": 7,
  "model_action": "tap \"Continue\"",
  "tool_call": ["hs", "tap", "Continue", "--visible", "--unique"],
  "reason": "The login form is filled and Continue is visible."
}

This matters because the bug may be in translation:

The model chose the right label, but the tool call used the wrong selector.
The model chose a coordinate when a label was available.
The model ignored an ambiguity warning.

Keep the model layer and tool layer separate.

Prefer structured errors

Exit codes and error codes are better than stderr scraping.

Handsets has common exit codes:

0  ok
2  NOT_FOUND
3  TIMEOUT
4  AMBIGUOUS

In JSON mode, preserve the structured error:

hs --json tap "Continue" --visible --unique

Then your agent can decide:

NOT_FOUND: dump UI again or scroll
AMBIGUOUS: ask for a narrower selector
TIMEOUT: capture screenshot and logs
SECURE_WINDOW: continue without screenshot
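That decision table is small enough to write down directly. The sketch below uses only the four exit codes listed above. SECURE_WINDOW is left out because no exit code is given for it here. Outcome, classify, and next_move are hypothetical names.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    """Exit codes listed above."""
    OK = 0
    NOT_FOUND = 2
    TIMEOUT = 3
    AMBIGUOUS = 4


# Recovery hints copied from the list above.
RECOVERY = {
    Outcome.NOT_FOUND: "dump UI again or scroll",
    Outcome.AMBIGUOUS: "ask for a narrower selector",
    Outcome.TIMEOUT: "capture screenshot and logs",
}


def classify(exit_code: int) -> Optional[Outcome]:
    """Map an exit code to a known outcome, or None if it is not listed."""
    try:
        return Outcome(exit_code)
    except ValueError:
        return None


def next_move(exit_code: int) -> str:
    """Pick the recovery hint for one finished command."""
    outcome = classify(exit_code)
    if outcome is Outcome.OK:
        return "continue"
    if outcome is None:
        return "stop and report unknown exit code"
    return RECOVERY[outcome]
```

Unknown codes stop the run instead of being folded into a generic failure, so the bucket is never lost.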

Keep logs close to the failing step

Android logs are noisy. A small tail near the failure is usually enough:

hs logs --tail 200 > run/0007-logcat.txt

Pair logs with the UI dump and screenshot from the same step. Otherwise you end up with artifacts that are technically present but hard to correlate.

A simple artifact layout

Use numbered files:

run/
  0001-ui.txt
  0001-action.json
  0001-result.json
  0002-ui.txt
  0002-screen.jpg
  0002-action.json
  0002-result.json
  0002-logcat.txt

This is not fancy. That is the point.

Before building a dashboard, make the run inspectable with plain files.
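A couple of small helpers are enough to keep the naming consistent. This is a sketch: artifact_path and write_action are hypothetical names, and the JSON fields mirror the action record shown earlier.

```python
import json
from pathlib import Path


def artifact_path(run_dir: Path, step: int, kind: str, ext: str) -> Path:
    """Return e.g. run/0007-ui.txt for step=7, kind="ui", ext="txt"."""
    return run_dir / f"{step:04d}-{kind}.{ext}"


def write_action(run_dir: Path, step: int, model_action: str,
                 tool_call: list[str], reason: str) -> Path:
    """Write the model's action and the translated tool call for one step."""
    run_dir.mkdir(parents=True, exist_ok=True)
    path = artifact_path(run_dir, step, "action", "json")
    record = {
        "step": step,
        "model_action": model_action,
        "tool_call": tool_call,
        "reason": reason,
    }
    path.write_text(json.dumps(record, indent=2))
    return path
```

Zero-padded step numbers keep the files in run order in any directory listing.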

Replay is the next step

Once you have traces, replay becomes possible.

The useful replay is not pixel-perfect video. It is a timeline:

Step 1: observed Sign in
Step 2: tapped Sign in
Step 3: filled Email
Step 4: filled Password
Step 5: tapped Continue
Step 6: timed out waiting for Dashboard

For teams, this timeline becomes the product. It lets an engineer see whether the model, the tool, or the app caused the failure.
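Rendering that timeline from saved steps takes very little code. In this sketch, TraceStep and its three fields are a hypothetical record shape, not a Handsets format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceStep:
    verb: str      # "observe", "tap", "fill", or "wait"
    target: str    # the label acted on or waited for
    result: str    # "ok" or an error name such as "TIMEOUT"


PAST_TENSE = {"observe": "observed", "tap": "tapped", "fill": "filled"}


def describe(step: TraceStep) -> str:
    """Turn one recorded step into a short human-readable phrase."""
    if step.verb == "wait" and step.result == "TIMEOUT":
        return f"timed out waiting for {step.target}"
    if step.verb == "wait" and step.result == "ok":
        return f"reached {step.target}"
    if step.result == "ok":
        return f"{PAST_TENSE.get(step.verb, step.verb)} {step.target}"
    return f"{step.verb} {step.target} failed with {step.result}"


def render_timeline(steps: list[TraceStep]) -> list[str]:
    """Number the steps from 1 and describe each one."""
    return [f"Step {n}: {describe(s)}" for n, s in enumerate(steps, start=1)]
```

The error name stays in the rendered line, so the timeline still says which failure bucket a step landed in.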

FAQ

Why are LLM Android agents hard to debug?

Because failures can come from the model, the app, the Android UI state, the automation tool, or timing. A final screenshot does not tell you which layer failed.

Should I save screenshots for every step?

Not always. Save compact UI dumps for every step. Add screenshots for visual states, failures, and custom-rendered screens.

What is the most important artifact?

The pre-action UI dump. It shows what the model saw when it chose the action.

How does this help reliability?

Structured traces let you build targeted recovery: scroll on NOT_FOUND, narrow selectors on AMBIGUOUS, capture logs on TIMEOUT, and avoid retrying blindly.

Android Device Cloud for LLM Agents

Elliot Gao — Tue, 26 May 2026 12:21:40 +0000

Browser agents have a clear infrastructure model now.

You create a browser session, give it to a model, watch the actions, collect logs, and tear it down when the run ends.

Android agents need the same thing, but the device is harder.

An Android session is not just a webpage. It has an OS, apps, permissions, push notifications, keyboards, secure screens, package state, and a UI tree that was not designed for language models.

If you want reliable Android agents, you eventually need an Android device cloud built for agents, not just a generic mobile testing grid.

Quick answer

An Android device cloud for LLM agents should provide:

Ephemeral Android sessions.
Fast tap, fill, swipe, and wait actions.
Compact UI observations for prompts.
Screenshots only when visual context matters.
Logs and UI dumps for every step, plus any screenshots taken.
Session recording and replay.
Isolation between runs.
A Python/API surface simple enough for agent loops.

The device is the runtime. The UI dump is the observation. The action API is the actuator.

Why mobile agents need different infrastructure

Traditional device clouds were built for test suites.

The core workflow is:

Upload an app.
Start a test.
Run a framework such as Appium, Espresso, or XCTest.
Collect a report.

That model is useful, but LLM agents behave differently.

An agent may inspect the screen dozens or hundreds of times. It may need to retry actions, ask for a screenshot, inspect notifications, or recover from an unexpected permission dialog.

The infrastructure has to support a tight loop:

observe -> decide -> act -> wait -> observe

If each loop step is slow, verbose, or hard to debug, the agent becomes expensive and unreliable.
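The loop itself is short. The sketch below keeps the device and the model behind three injected callables, so nothing in it is specific to Handsets; agent_loop and the type aliases are hypothetical names. Waits are treated as just another command the model can choose.

```python
from typing import Callable

Observe = Callable[[], str]            # returns the compact UI table
Decide = Callable[[str], list[str]]    # returns an argv, or [] when done
Act = Callable[[list[str]], int]       # returns the command's exit code


def agent_loop(observe: Observe, decide: Decide, act: Act,
               max_steps: int = 50) -> list[tuple[list[str], int]]:
    """Run observe -> decide -> act until the model stops or a step fails."""
    history: list[tuple[list[str], int]] = []
    for _ in range(max_steps):
        screen = observe()
        command = decide(screen)
        if not command:      # the model signals that the task is finished
            break
        code = act(command)
        history.append((command, code))
        if code != 0:        # stop on the first failure; the caller inspects history
            break
    return history
```

The max_steps cap is there because a model that never signals completion would otherwise keep a device session busy indefinitely.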

Observation: compact first, visual second

Most Android agent loops start with a screenshot or UIAutomator XML.

Both are useful. Neither should be the only observation.

Screenshots are great for visual layout, but they are heavy. XML is structured, but it contains a lot of layout noise.

For agents, a better default observation is an action table:

fill  EditText  "Email"     #email     540,540
fill  EditText  "Password"  #password  540,640  [password]
tap   Button    "Continue"  #continue  540,860

That tells the model what it can do. Add a screenshot when the text UI is not enough:

hs ui
hs see --size 768 /tmp/screen.jpg

This keeps the prompt smaller and the action easier to audit.

Action: labels beat pixels

A generic device cloud can expose raw taps:

{ "type": "tap", "x": 540, "y": 860 }

That is sometimes necessary, but it is not the best default.

For agent runs, label-based actions are easier to understand:

{ "type": "tap", "text": "Continue" }

The run transcript is now readable. A human can see what the agent intended. A retry policy can distinguish NOT_FOUND from AMBIGUOUS from TIMEOUT.
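A thin translation layer can turn both payload shapes into commands. This sketch assumes the two tap payloads shown above; the fill payload with a "value" key is a hypothetical extension, and to_argv is a made-up helper name. The label path maps to hs tap and the coordinate path falls back to adb shell input tap.

```python
def to_argv(action: dict) -> list[str]:
    """Translate an action payload into a command line."""
    kind = action.get("type")
    if kind == "tap" and "text" in action:
        # Label-based tap: fail instead of guessing when the match is unclear.
        return ["hs", "tap", action["text"], "--visible", "--unique"]
    if kind == "tap" and "x" in action and "y" in action:
        # Coordinate tap: last resort, sent through raw adb.
        return ["adb", "shell", "input", "tap", str(action["x"]), str(action["y"])]
    if kind == "fill" and "text" in action and "value" in action:
        # Hypothetical payload shape: the field label plus the value to type.
        return ["hs", "fill", action["text"], action["value"]]
    raise ValueError(f"unsupported action: {action!r}")
```

Keeping this translation in one function also gives you one place to log when an agent fell back to coordinates.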

Debugging is the product

The hard part of agent infrastructure is not only running the device. It is understanding why a run failed.

A useful Android agent cloud should keep a timeline:

Step	Data
Observation	UI table, screenshot, top activity
Model output	Intended action and reasoning if available
Action	Tap/fill/swipe/wait payload
Result	Success or structured error
Artifacts	Logs, screenshot, UI dump

When the agent fails, you should be able to replay the path:

1. saw "Sign in"
2. tapped "Sign in"
3. filled "Email"
4. filled "Password"
5. tapped "Continue"
6. timed out waiting for "Dashboard"

Without that, every failure becomes a mystery screenshot.

Isolation matters

Android sessions carry state:

Installed apps.
Login sessions.
Runtime permissions.
Clipboard.
Notifications.
System settings.
Cached data.

An agent cloud has to decide how sessions are reset. Emulators are easier to snapshot. Real devices are harder but closer to production.

For most agent experiments, emulator sessions are enough. For mobile RPA or app-store reality checks, real devices become important.

Where Handsets fits

Handsets is not a full device cloud by itself.

It is the control plane a cloud can build on:

no-root device actions through ADB
compact UI dumps
label-based selectors
screenshots and logs
Python and subprocess integration
a terminal UI for human debugging

The local loop looks like:

hs use
hs ui
hs tap "Continue"
hs wait "Dashboard"

A hosted version would wrap that in session management, auth, billing, isolation, recording, and replay.

FAQ

Is an Android device cloud the same as Appium cloud testing?

Not exactly. Appium clouds are usually optimized for test suites. An Android agent cloud needs lower-latency observations, compact prompt-friendly UI output, and better step-by-step replay for model-driven runs.

Do LLM agents need real Android devices?

Sometimes. Emulators are enough for many app flows and experiments. Real devices matter when hardware behavior, OEM skins, push delivery, biometrics, or production-like behavior matters.

Why not just use screenshots?

Screenshots are useful, but they are expensive and ambiguous. A compact UI table gives the model actionable labels and controls. Use screenshots as an additional observation, not the only one.

Does this require root?

No. A useful Android agent runtime can operate through ADB and the shell user for normal UI automation. Some protected screens and app-private data remain protected.

How to Automate Android Without Appium

Elliot Gao — Mon, 25 May 2026 09:00:39 +0000

You do not need Appium for every Android automation task.

Appium is the right tool when you need a full WebDriver-based mobile testing framework. But many Android workflows are smaller than that. You may only need to open an app, tap visible buttons, type into fields, wait for a result, and collect a screenshot on failure.

For those jobs, a CLI can be enough.

Handsets lets you automate Android from the terminal without root and without installing a visible helper app on the phone.

Quick answer

The fastest way to automate Android without Appium is:

hs use
hs tap "Continue"
hs fill "Email" "you@example.com"
hs wait "Dashboard"

That gives you the core automation loop: connect to the device, act on visible UI labels, and wait for the next state. You still use normal Android debugging access. You do not need root, WebDriver, or an Appium server.

What you need

You need:

An Android phone or emulator.
USB debugging enabled.
adb on your PATH.
Handsets installed on your host machine.

Install:

curl -fsSL https://raw.githubusercontent.com/elliotgao2/handsets/main/install.sh | bash

Connect:

hs use

Now you can control the device with commands.

Tap by text

Raw adb can tap coordinates:

adb shell input tap 540 860

That is brittle. It depends on screen size, density, orientation, and layout.

With Handsets, tap the visible label:

hs tap "Continue"

Use --visible and --unique when a script should fail rather than guess:

hs tap "Continue" --visible --unique --timeout 5s

This is the main difference from raw adb shell input tap. The script says what it means. It does not encode where a button happened to be on one device.

Fill fields

Use fill for text fields:

hs fill "Email" "you@example.com"
hs fill "Password" "$PASSWORD"
hs tap "Sign in"

If labels are repeated, use selectors:

hs fill 'EditText:below(TextView[text=Email])' "you@example.com"
hs fill 'EditText:below(TextView[text=Password])' "$PASSWORD"

The selector syntax is CSS-like and built for real Android UI trees.

Wait for the next screen

Do not write sleeps unless you truly need a fixed delay.

Instead of:

hs tap "Continue"
sleep 5

Use:

hs tap "Continue"
hs wait "Dashboard" --timeout 15s

This makes the script faster on fast devices and more reliable on slow devices.

A complete script

Here is a no-Appium Android login script:

#!/usr/bin/env bash
set -euo pipefail

hs use
hs go com.example.app
hs wait "Sign in" --timeout 15s

hs fill "Email" "$APP_EMAIL"
hs fill "Password" "$APP_PASSWORD"
hs tap "Continue" --visible --unique
hs wait "Dashboard" --timeout 20s

Add failure artifacts by installing a trap near the top of the script, before the first hs command:

ARTIFACTS="/tmp/android-run-$$"
mkdir -p "$ARTIFACTS"
trap 'hs ui > "$ARTIFACTS/ui.txt"; hs see --size 768 "$ARTIFACTS/screen.jpg"; hs logs --tail 200 > "$ARTIFACTS/logs.txt"; echo "$ARTIFACTS"' ERR

Now the script leaves behind the UI, screenshot, and logs when something breaks.

Can I just use adb?

Sometimes, yes.

Raw adb is great for low-level device commands:

adb shell am start -n com.example/.MainActivity
adb shell input keyevent BACK
adb shell wm size

It becomes awkward when you need semantic UI automation:

Tap the button labeled "Continue".
Fill the password field below "Password".
Wait until "Dashboard" appears.
Fail if there are two matching buttons.

That is where a higher-level tool helps. Handsets still uses adb underneath, but it gives you label-based actions and structured failure modes.

When Appium is still better

Use Appium if you need:

iOS support.
WebDriver compatibility.
A full test framework.
Cloud device farm integrations.
Rich reports and recorders.
A large QA ecosystem.

Those are real strengths.

But if your goal is Android-only CLI automation, Appium may be more stack than you need.

Why this matters for LLM agents

LLM-driven Android automation benefits from a small text interface.

Handsets can print the current screen as an action table:

fill  EditText  "Email"     #email     540,540
fill  EditText  "Password"  #password  540,640  [password]
tap   Button    "Continue"  #continue  540,860

That is easier for a model to consume than a full XML tree. It also reduces prompt size for long trajectories.

Summary

You can automate Android without Appium if your workflow is:

Android only.
CLI-first.
Label-based.
No-root.
Scripted from shell or Python.

Start with:

hs use
hs ui
hs tap "Continue"
hs wait "Welcome"

That covers more Android automation work than you might expect.

FAQ

Do I need Appium to automate Android?

No. Appium is useful for full mobile test frameworks, especially cross-platform suites, but Android can also be automated from the command line with adb and tools like Handsets.

Can I automate Android without root?

Yes. For normal UI automation, root is not required. You can tap, type, swipe, inspect visible UI, wait for text, and capture screenshots when the current app allows it.

Is this better than Appium?

It depends. Handsets is better for Android-only CLI scripts, LLM agents, and fast tap-heavy flows. Appium is better for cross-platform QA infrastructure and WebDriver-based test suites.

Can I run this in CI?

Yes, as long as your CI runner can access an Android emulator or connected device with adb. The commands are shell-friendly and return normal exit codes.

Handsets vs Appium: Which Android Automation Tool Should You Use?

Elliot Gao — Mon, 25 May 2026 08:55:10 +0000

Appium is the default answer for mobile automation.

It is mature, cross-platform, WebDriver-compatible, and supported by a large ecosystem. If a QA team needs one framework for Android and iOS, reports, Selenium-style infrastructure, and cloud device farms, Appium is usually the right place to start.

Handsets solves a smaller problem.

It is an Android-only CLI for driving phones from shell scripts, Python, or LLM agents. It does not try to be a test-management platform. It tries to make tap, fill, wait, screenshots, and UI inspection fast enough that the automation layer disappears from the critical path.

The short version:

Use Appium when you need a full cross-platform mobile test framework.
Use Handsets when you need fast Android UI control from the command line, especially for tap-heavy scripts and LLM agents.

If you searched for "Handsets vs Appium" or "Appium alternative for Android automation", the practical answer is this: Appium is the safer default for broad QA infrastructure, while Handsets is the sharper tool for Android-only automation where speed, scripting, and prompt size matter.

Best answer by use case

Use case	Better choice	Why
Cross-platform Android + iOS test suite	Appium	One WebDriver-style framework for both platforms
Android-only shell automation	Handsets	Small CLI, no server ceremony, easy CI scripts
LLM-driven Android agent	Handsets	Compact UI table and low per-action latency
Enterprise device farm with reports	Appium	Larger ecosystem and reporting integrations
Tap-heavy RPA workflow	Handsets	Warm daemon path keeps repeated calls cheap
Existing Selenium/WebDriver team	Appium	Familiar mental model and tooling

That table is the whole comparison in one place. The rest of this post explains the tradeoffs.

Quick comparison

Need	Appium	Handsets
Android support	Yes	Yes
iOS support	Yes	No
Protocol	WebDriver / HTTP	Length-prefixed frames over `adb forward`
Install on device	Driver/helper APKs	One small jar, no visible app
Root required	No	No
Tap by visible text	Yes	Yes
CLI-first workflow	Not really	Yes
LLM-friendly UI dump	No, usually XML/page source	Yes, compact action table
Typical tap latency	100-500 ms	2-7 ms after daemon warmup
Best fit	QA infrastructure	Scripts, agents, fast Android control

Appium is broader. Handsets is narrower and faster.

That is the tradeoff.

Setup difference

An Appium setup usually has several moving parts:

Install Node.js.
Install Appium.
Install the Android driver.
Start the Appium server.
Configure desired capabilities.
Connect a client library.
Run a test session.

That is normal for a full framework. It is also more machinery than you want for a small script.

Handsets starts from the terminal:

curl -fsSL https://raw.githubusercontent.com/elliotgao2/handsets/main/install.sh | bash
hs use
hs tap "Continue"

The device side is a small jar started through app_process as the Android shell user. There is no root step and no visible app to install.

API difference

An Appium test usually looks like WebDriver:

el = driver.find_element("xpath", "//*[@text='Continue']")
el.click()

Handsets keeps the same action as a CLI verb:

hs tap "Continue"

Or from Python:

from handsets import Session

with Session() as d:
    d.tap("Continue", visible=True, unique=True)
    d.wait(text="Welcome", timeout="15s")

The difference is not just syntax. It changes how easy it is to compose automation from shell scripts, CI jobs, and LLM tool calls.

Performance difference

Appium's architecture is designed around WebDriver. That buys compatibility and ecosystem support, but every action passes through an HTTP session layer.

For normal test suites, that overhead is often fine. A test that waits for screens, network calls, animations, and assertions will not notice every 100 ms.

For tap-heavy workflows, it matters.

In Handsets benchmarks, a warm tap("Continue") including text lookup runs in roughly 2-7 ms. Appium calls commonly land around 100-500 ms depending on the device, driver, and session state.

That difference matters when:

An LLM agent takes many small actions.
A script taps through hundreds of rows.
A mobile RPA flow spends most of its time in UI actions.
You want fast failure feedback in a CLI loop.

It matters less when your test spends most of its time waiting on network requests, animations, or backend state. In those suites, Appium's overhead may be a small part of total runtime.
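To put numbers on that, here is the arithmetic for a run of 200 UI actions. The per-action ranges are the ones quoted above. The run length is a hypothetical figure chosen for illustration, and the totals cover action overhead only, not time spent waiting on the app.

```python
def total_seconds(actions: int, per_action_ms: tuple[float, float]) -> tuple[float, float]:
    """Return (best, worst) total overhead in seconds for `actions` calls."""
    low_ms, high_ms = per_action_ms
    return actions * low_ms / 1000, actions * high_ms / 1000


APPIUM_MS = (100, 500)   # per-action range quoted above
HANDSETS_MS = (2, 7)     # warm-daemon range quoted above
ACTIONS = 200            # hypothetical run length

if __name__ == "__main__":
    print("Appium:  ", total_seconds(ACTIONS, APPIUM_MS))    # 20 to 100 seconds
    print("Handsets:", total_seconds(ACTIONS, HANDSETS_MS))  # 0.4 to 1.4 seconds
```

If the same run also spends a few minutes waiting on network calls, either figure is a small share of the total.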

UI dump difference

Appium usually exposes the Android UI tree as page source. That is useful for tools, but verbose for LLM agents.

Handsets has a compact UI table:

fill  EditText  "Email"     #email     540,540
fill  EditText  "Password"  #password  540,640  [password]
tap   Button    "Continue"  #continue  540,860

For one Settings screen, a UIAutomator XML dump measured 5,762 tokens. The compact Handsets table measured 729 tokens. The model still gets the labels and actions it needs.

That matters if your Android automation is driven by an LLM.
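The arithmetic is worth spelling out. The two token counts below are the measured figures quoted above. The 50-step run, and the assumption that every screen costs about as much as that Settings screen, are hypothetical.

```python
XML_TOKENS = 5762     # UIAutomator XML dump, figure quoted above
TABLE_TOKENS = 729    # compact table, figure quoted above
STEPS = 50            # hypothetical trajectory length


def observation_tokens(per_step: int, steps: int) -> int:
    """Total observation tokens if every step costs `per_step` tokens."""
    return per_step * steps


ratio = XML_TOKENS / TABLE_TOKENS        # about 7.9x smaller
saving = 1 - TABLE_TOKENS / XML_TOKENS   # about 87% fewer tokens

if __name__ == "__main__":
    print(f"ratio:  {ratio:.1f}x")
    print(f"saving: {saving:.0%}")
    print("XML over run:  ", observation_tokens(XML_TOKENS, STEPS))    # 288100
    print("table over run:", observation_tokens(TABLE_TOKENS, STEPS))  # 36450
```

Real screens vary in size, so treat the per-run totals as an order-of-magnitude estimate, not a measurement.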

When Appium is better

Choose Appium if you need:

Android and iOS in one framework.
WebDriver compatibility.
Cloud device farm integrations.
Recorders and reporting.
A mature QA ecosystem.
Team workflows built around Selenium-style tests.

Appium is not slow because it is bad. It is slower because it solves a bigger problem.

When Handsets is better

Choose Handsets if you need:

Fast Android-only automation.
Shell-first commands.
No-root device control.
Label-based tapping without coordinate scripts.
A small tool surface for LLM agents.
Python or subprocess integration without a WebDriver server.

The core loop is small:

hs use
hs ui
hs tap "Sign in"
hs fill "Email" "you@example.com"
hs fill "Password" "$PASSWORD"
hs tap "Continue"
hs wait "Dashboard"

That is the lane Handsets is built for.

Recommendation

If you are building a company-wide mobile QA platform, start with Appium.

If you are building Android-only scripts, LLM agents, CLI automation, RPA flows, or fast smoke checks, Handsets is worth trying first.

The tools are not enemies. They are optimized for different jobs.

FAQ

Is Handsets a full Appium replacement?

No. Handsets is Android-only and CLI-first. It does not replace Appium for iOS, WebDriver infrastructure, cloud device farms, or report-heavy QA platforms.

Is Handsets faster than Appium?

For small Android UI actions, yes. A warm Handsets tap, including text lookup, typically takes 2-7 ms, while Appium actions commonly take around 100-500 ms depending on setup and device state.

Does Handsets require root?

No. Handsets runs through adb and a small device-side daemon under the Android shell user. The phone does not need to be rooted.

Can I use Handsets from Python?

Yes. You can use the Python package with from handsets import Session, or call hs --json from any language that can run a subprocess.

Which tool should I choose for LLM agents?

For Android-only LLM agents, Handsets is usually the better fit because it can provide a compact action table instead of a large XML tree, and because each action has low overhead.

Originally published at https://handsets.dev/blog/handsets-vs-appium/.

A Terminal UI for Driving Android Apps

Elliot Gao — Mon, 25 May 2026 08:55:09 +0000

Most Android automation tools make you choose between two awkward modes.

You can write scripts, which are repeatable but a slow way to explore an unfamiliar app:

hs ui
hs tap "Continue"
hs fill "Email" "you@example.com"

Or you can use a visual tool, which is easier to explore but often separate from the thing you later automate.

hs tui is the missing middle: a terminal UI that lets you drive an Android app from the keyboard while showing the same action rows you would put in a script.

It is not a remote desktop. It is not a recorder. It is a live, keyboard-driven inspector for Android's interactive UI.

What it does

Run:

hs use
hs tui

The TUI opens in your terminal and shows the current interactive elements on the device:

fill  EditText  "Email"     #email     540,540
fill  EditText  "Password"  #password  540,640  [password]
tap   Button    "Continue"  #continue  540,860

Move with the keyboard. Press Enter to act. If the selected row is a button, it taps. If it is an input field, it opens a small text modal and fills the field.

The useful part is that the TUI speaks the same vocabulary as the CLI:

tap Button "Continue" #continue
fill EditText "Email" #email
swipe up
back

Once the flow works by hand in hs tui, you already know what the script should look like.

Why a terminal UI?

Android automation has a discovery problem.

When a script fails, you often ask:

What does the device see right now?
What text is actually exposed?
Is this button clickable?
Is there more than one matching "Continue"?
Did the screen change after the tap?

The usual answer is to bounce between commands:

hs ui
hs tap "Continue"
hs ui
hs see screen.jpg

That works, but it has friction. You are copying labels out of one command and pasting them into another.

The terminal UI removes that loop. It keeps the UI list on screen and lets you act on the highlighted row.

Keyboard model

The controls are intentionally boring:

Key	Action
`↑` / `↓` or `j` / `k`	Move through interactive elements
`Enter`	Tap or fill the selected element
`PgDn` / `PgUp`	Swipe the device
`Shift+J` / `Shift+K`	Swipe faster
`←` / `Esc`	Android back
`q`	Quit

That is enough for a surprising amount of app navigation.

The point is not to replace touch. The point is to make exploratory Android automation feel like using a terminal tool instead of a mouse, a screenshot viewer, and a pile of copy-paste.

Live UI, not stale dumps

The TUI watches the device state in the background. It polls the accessibility tree and refreshes the list as the app changes.

That matters because Android screens are not static:

Keyboards appear and disappear.
Lists scroll.
Buttons enable after validation.
Loading states replace content.
Animations keep the app from becoming "idle".

Traditional scripts often wait for idle, dump the tree, act, and repeat. That is safe, but it makes exploration feel choppy.

hs tui keeps the display live so you can tap, type, swipe, and watch the list update.

The rows are the API

The TUI is built on the same compact UI model as hs ui.

Each row is an action-shaped description:

tap   Button    "Continue"  #continue  540,860
fill  EditText  "Email"     #email     540,540

That format is doing two jobs:

It is readable enough for a human in a terminal.
It is structured enough to turn into automation.

This is the main design choice. The TUI does not show the full XML tree because that is not what you act on. It shows the controls you can use.

Why this helps scripting

Many automation flows start with exploration:

Open the app.
Find the sign-in path.
Learn the labels.
Discover which waits are needed.
Turn that into a script.

Without a TUI, you do that with repeated dumps and screenshots.

With hs tui, you can walk the app once from the keyboard, then write the script using the labels you saw:

hs use
hs tap "Sign in"
hs fill "Email" "$APP_EMAIL"
hs fill "Password" "$APP_PASSWORD"
hs tap "Continue"
hs wait "Dashboard"

The manual path and the scripted path share the same model.

Why this helps LLM agents

LLM agents need good observations and cheap actions.

A screenshot is useful, but it is heavy and often makes the model infer text visually. A full Android XML dump is faithful, but it can be thousands of tokens of layout noise.

The action table is smaller:

fill  EditText  "Email"     #email     540,540
fill  EditText  "Password"  #password  540,640  [password]
tap   Button    "Continue"  #continue  540,860

The TUI uses this same representation, which both humans and agents can read. That makes it a useful debugging surface for agent runs: if the model picked the wrong label, you can open the same screen and see what choices it had.

Implementation notes

hs tui is a sibling binary to the core CLI.

The main hs binary stays small. When you run hs tui, it locates and launches handsets-tui with the current daemon host and port.

The TUI itself is built with:

Rust
ratatui
crossterm
the same length-prefixed Handsets wire protocol
the same interactive-node filtering used by hs ui

The device side is still the normal Handsets daemon: one small jar running as the Android shell user through adb. No root required.
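
"Length-prefixed" means each message is preceded by its size, so the reader knows how many bytes to wait for. The article does not give the Handsets header format, so the Python sketch below assumes a 4-byte big-endian length. That header size and byte order, and the names write_frame, read_exact, and read_frame, are illustrative assumptions.

```python
import io
import struct

# Assumption: 4-byte unsigned big-endian length, then the payload bytes.
# The real Handsets header size and byte order are not stated in this article.
HEADER = struct.Struct(">I")

def write_frame(stream, payload: bytes) -> None:
    """Write one frame: length header followed by the payload."""
    stream.write(HEADER.pack(len(payload)))
    stream.write(payload)

def read_exact(stream, n: int) -> bytes:
    """Read exactly n bytes, looping because a stream may return short reads."""
    buf = b""
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:
            raise EOFError("stream closed mid-frame")
        buf += chunk
    return buf

def read_frame(stream) -> bytes:
    """Read one frame and return its payload."""
    (length,) = HEADER.unpack(read_exact(stream, HEADER.size))
    return read_exact(stream, length)

# Round-trip through an in-memory stream.
demo = io.BytesIO()
write_frame(demo, b"hello")
demo.seek(0)
assert read_frame(demo) == b"hello"
```

The same pattern works over a socket file object. Only the header definition would need to change to match the real protocol.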

What it is not

It is not meant to replace Appium, Espresso, or a full QA platform.

It is also not a pixel-perfect remote desktop. If you need a visual mirror, hs see can open the viewer or save screenshots.

hs tui is for the moment before and between scripts: when you want to drive the device quickly, learn the UI, and turn that knowledge into repeatable automation.

Originally published at https://handsets.dev/blog/android-terminal-ui/.

Stop Wasting Tokens on Android Automation

Elliot Gao — Sun, 24 May 2026 15:19:41 +0000

Most LLM-driven Android automation starts by showing the model a screen.

That sounds reasonable. A human looks at the phone, decides what to tap, and taps it. Give the model the same view.

The problem is that "the same view" is expensive.

A full screenshot is expensive. A raw Android UI XML dump is also expensive, just in a quieter way. The model reads thousands of tokens of layout machinery before it reaches the handful of labels that matter:

Email
Password
Continue

For one step, that waste is easy to ignore. For a 50-step mobile agent trajectory, it becomes the bill.

The loop

An Android agent usually does this:

Read the current screen.
Decide what to do.
Tap, type, or swipe.
Wait for the next screen.
Repeat.

The first step is where the token leak begins.

If you use uiautomator dump, the model gets XML like this:

<node index="0" text="" resource-id=""
      class="android.widget.FrameLayout"
      package="com.google.android.apps.nexuslauncher"
      content-desc=""
      checkable="false" checked="false"
      clickable="false" enabled="true"
      focusable="false" focused="false"
      scrollable="false" long-clickable="false"
      password="false" selected="false"
      bounds="[0,0][1440,3120]">

That is one layout node. It says almost nothing an agent can act on.

It is not a bug in UIAutomator. XML is a faithful serialization of the accessibility tree. Faithful is not the same as useful.

The numbers

On a few ordinary Android screens, the difference looks like this:

screen	UIAutomator XML	Handsets `hs ui -i`	reduction
Launcher home	3,153 tokens	246 tokens	12.8x
Settings home	5,762 tokens	729 tokens	7.9x
Settings -> Apps	4,050 tokens	320 tokens	12.7x

Token counts are from tiktoken with the GPT-4 encoding. The deeper write-up is "An Android UI Dump for LLMs".

The short version: a typical screen that costs 4,000-6,000 tokens as XML can often be represented in a few hundred tokens as an action table.

Across 50 steps at those per-screen sizes, that is the difference between sending roughly 200k-300k tokens of screen state and sending roughly 12k-36k.

The agent usually makes the same decision either way.
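
The arithmetic is easy to check. This Python sketch uses only the per-screen token counts from the table above. The helper name trajectory_tokens and the 50-step default are just for illustration, and the totals count screen state only, not prompts or model replies.

```python
# Per-screen token counts copied from the table: (UIAutomator XML, hs ui -i).
SCREENS = {
    "Launcher home": (3153, 246),
    "Settings home": (5762, 729),
    "Settings -> Apps": (4050, 320),
}

def trajectory_tokens(per_step: int, steps: int = 50) -> int:
    """Screen-state tokens sent over a trajectory of `steps` observations."""
    return per_step * steps

for name, (xml, table) in SCREENS.items():
    print(
        f"{name}: {xml / table:.1f}x smaller, "
        f"{trajectory_tokens(xml):,} vs {trajectory_tokens(table):,} tokens over 50 steps"
    )
```

For these three screens, 50 steps of XML come to between 157,650 and 288,100 tokens, while 50 steps of the action table come to between 12,300 and 36,450.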

What the model actually needs

For UI automation, the model does not need a DOM-shaped tree.

It needs a list of things it can act on:

fill  EditText  "Email"     #email     540,540
fill  EditText  "Password"  #password  540,640  [password]
tap   Button    "Continue"  #continue  540,860

That table gives the model the useful facts:

What action is available.
What label a human sees.
What type of control it is.
Where the tool will tap or type.

The model can now answer:

tap "Continue"

It does not have to parse layout ancestors, negative booleans, fully-qualified class names, or four-number bounds rectangles.

The rule

For LLM tool output, the optimization rule is simple:

Do not serialize facts the model cannot use in its next action.

Android XML violates that rule constantly:

clickable="false" on nodes the agent will never click.
enabled="true" repeated on almost every node.
Empty FrameLayout and LinearLayout containers.
Full class names like android.widget.TextView.
Bounds rectangles when the agent only needs a tap point.
Attribute names repeated on every node, like keys in a JSON dump, when the reader is a language model, not a parser.

Handsets drops the defaults, shortens the names, computes the center point, and keeps the labels.

The result is not a smaller XML file. It is a different interface:

hs ui
hs tap "Continue"
hs wait "Dashboard"
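
To make the reduction concrete, here is a rough Python sketch of that kind of filter applied to a uiautomator-style dump. It is not the Handsets implementation. The rules (keep clickable nodes and EditText fields, skip disabled ones) and the name compact are assumptions, and SAMPLE is made-up input.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

BOUNDS = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")

def compact(xml_text: str) -> List[str]:
    """Turn a UI XML dump into short action rows (assumed filtering rules)."""
    rows = []
    for node in ET.fromstring(xml_text).iter("node"):
        a = node.attrib
        widget = a.get("class", "").rsplit(".", 1)[-1]  # shorten the class name
        editable = widget == "EditText"
        if a.get("clickable") != "true" and not editable:
            continue  # drop layout containers and inert nodes
        m = BOUNDS.fullmatch(a.get("bounds", ""))
        if m is None or a.get("enabled") == "false":
            continue
        x1, y1, x2, y2 = map(int, m.groups())
        label = a.get("text") or a.get("content-desc") or ""
        rid = a.get("resource-id", "").rsplit("/", 1)[-1]
        action = "fill" if editable else "tap"
        # Emit the center of the bounds rectangle as the tap point.
        rows.append(f'{action}  {widget}  "{label}"  #{rid}  {(x1 + x2) // 2},{(y1 + y2) // 2}')
    return rows

SAMPLE = """<hierarchy rotation="0">
  <node class="android.widget.FrameLayout" text="" resource-id="" content-desc=""
        clickable="false" enabled="true" bounds="[0,0][1080,1920]">
    <node class="android.widget.EditText" text="Email" content-desc=""
          resource-id="com.example.app:id/email"
          clickable="true" enabled="true" bounds="[40,480][1040,600]" />
    <node class="android.widget.Button" text="Continue" content-desc=""
          resource-id="com.example.app:id/continue"
          clickable="true" enabled="true" bounds="[40,800][1040,920]" />
  </node>
</hierarchy>"""

for row in compact(SAMPLE):
    print(row)
```

The FrameLayout disappears and the two controls come out as one line each, with the center point computed from the bounds.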

Screenshots are still useful

This is not an argument against screenshots.

Screenshots are useful when layout matters, when visual state matters, or when an app renders important information without accessible labels.

But screenshots are a poor default for every step. They are large, slow to move, and often force the model to do OCR-like work for text that Android already exposes.

A better loop is:

hs ui > /tmp/screen.txt
hs see --size 768 /tmp/screen.jpg   # only when visual context matters

Give the model the text UI first. Add the image when the text is not enough.

That usually saves tokens and makes the action easier to audit.
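
One way to decide when the text is not enough is a simple check on the rows themselves. The Python heuristic below is an assumption, not something Handsets does: it asks for a screenshot when the screen exposes no rows at all, or when some actionable row has an empty label. The name needs_screenshot is hypothetical.

```python
import re
from typing import List

LABEL = re.compile(r'"([^"]*)"')

def needs_screenshot(rows: List[str]) -> bool:
    """Return True when the text UI alone is unlikely to be enough (assumed heuristic)."""
    if not rows:
        return True  # nothing actionable was exposed as text
    for row in rows:
        m = LABEL.search(row)
        if m is None or not m.group(1).strip():
            return True  # an actionable control has no readable label
    return False
```

A screen full of unlabeled icon buttons would trip this check. A normal sign-in form would not.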

Why this matters more for agents than tests

Traditional mobile tests do not care much about token count. A test runner is not paying to read XML.

LLM agents are different. Every loop step has a context budget and a cost. If half the prompt is a UI tree full of dead layout nodes, the model is spending attention on junk.

This shows up in three places:

Cost: repeated screen state dominates long trajectories.
Latency: large prompts take longer to send and process.
Reliability: shorter action-oriented context leaves less room for the model to latch onto irrelevant structure.

The best tool output for an agent is not the most complete representation of the system. It is the smallest representation that preserves the next correct action.

The practical pattern

For Android, the pattern looks like this:

hs use
hs ui
hs tap "Sign in"
hs fill "Email" "you@example.com"
hs fill "Password" "$PASSWORD"
hs tap "Continue"
hs wait "Dashboard"

For an LLM, the important handoff is even smaller:

Here is the current Android UI. Pick the next action by label.

fill  EditText  "Email"     #email     540,540
fill  EditText  "Password"  #password  540,640  [password]
tap   Button    "Continue"  #continue  540,860

The model does not need to know that these nodes live inside three nested FrameLayouts. It needs to know that "Continue" is a button.
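
Because the handoff is this small, the reply is easy to validate before anything touches the device. The Python sketch below checks that a reply such as tap "Continue" names exactly one row in the table. The reply format and the name check_reply are assumptions for illustration.

```python
import re
from typing import List, Tuple

LABEL = re.compile(r'"([^"]*)"')
# Assumed reply shape: an action word, then a quoted label. Anything after it
# (for example a value to fill in) is ignored here.
REPLY = re.compile(r'^\s*(tap|fill)\s+"([^"]*)"')

def check_reply(reply: str, rows: List[str]) -> Tuple[str, str]:
    """Return (action, label) if the reply names exactly one row, else raise ValueError."""
    m = REPLY.match(reply)
    if m is None:
        raise ValueError(f"unparseable reply: {reply!r}")
    action, label = m.groups()
    matches = []
    for row in rows:
        found = LABEL.search(row)
        if row.split()[:1] == [action] and found and found.group(1) == label:
            matches.append(row)
    if len(matches) != 1:
        raise ValueError(f"{len(matches)} rows match {action} {label!r}")
    return action, label

ROWS = [
    'fill  EditText  "Email"     #email     540,540',
    'fill  EditText  "Password"  #password  540,640  [password]',
    'tap   Button    "Continue"  #continue  540,860',
]
print(check_reply('tap "Continue"', ROWS))
```

A reply that names a label not on screen, or one that matches two rows, fails here instead of producing a wrong tap.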