<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Fenix</title>
    <description>The latest articles on Forem by Fenix (@fenix_23505d14df386c00ced).</description>
    <link>https://forem.com/fenix_23505d14df386c00ced</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3823217%2F42aca0c2-69b9-4be7-b0f2-519a6b241dc4.jpg</url>
      <title>Forem: Fenix</title>
      <link>https://forem.com/fenix_23505d14df386c00ced</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/fenix_23505d14df386c00ced"/>
    <language>en</language>
    <item>
      <title>Codex /goal and OpenGUI: long-running tasks need state</title>
      <dc:creator>Fenix</dc:creator>
      <pubDate>Tue, 05 May 2026 01:55:19 +0000</pubDate>
      <link>https://forem.com/fenix_23505d14df386c00ced/codex-goal-and-opengui-long-running-tasks-need-state-1c61</link>
      <guid>https://forem.com/fenix_23505d14df386c00ced/codex-goal-and-opengui-long-running-tasks-need-state-1c61</guid>
      <description>&lt;p&gt;Long-running agents tend to fail in the second half.&lt;/p&gt;

&lt;p&gt;The first step is often fine. Fix a CI failure, open an app, tap a button, search for a keyword. Models can produce a reasonable first action. The trouble starts around step ten: what has already happened, where the task is stuck, what the original boundary was, and when the task is allowed to stop. Those details slide out of context.&lt;/p&gt;

&lt;p&gt;Codex CLI 0.128.0 added &lt;code&gt;/goal&lt;/code&gt;. The release note describes a persisted goal workflow: app-server APIs, model tools, runtime continuation, and TUI controls for create, pause, resume, and clear. Simon Willison compared it to OpenAI's version of a Ralph loop: set a goal for Codex, then let it keep executing, checking, and correcting until the goal is done or the budget runs out.&lt;/p&gt;

&lt;p&gt;In the context of long-running tasks, the change is about where the goal lives. It moves from text in a single prompt to state that can be resumed, paused, cleared, and referenced again later.&lt;/p&gt;
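
&lt;p&gt;One way to picture that shift: the goal stops being a sentence in the prompt and becomes a record with a lifecycle. A minimal Python sketch of the idea (the field names are illustrative, not Codex's actual app-server API):&lt;/p&gt;

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a persisted goal record. The names here are
# assumptions for illustration, not Codex's real data model.
@dataclass
class Goal:
    text: str                  # target, boundary, acceptance condition
    status: str = "active"     # "active" | "paused" | "cleared"
    history: list = field(default_factory=list)

    def pause(self):
        if self.status == "active":
            self.status = "paused"

    def resume(self):
        if self.status == "paused":
            self.status = "active"

    def clear(self):
        self.status = "cleared"

    def note(self, event):
        # Progress survives between turns because it lives in the
        # record, not in the model's context window.
        self.history.append(event)

goal = Goal("fix the failing tests, keep the diff small, end with npm test")
goal.note("reproduced the failure in one test file")
goal.pause()
goal.resume()
```

&lt;p&gt;The point is not the data structure itself but where it lives: the runtime can re-read it at step ten as easily as at step one.&lt;/p&gt;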

&lt;h2&gt;
  
  
  Why coding agents need a goal
&lt;/h2&gt;

&lt;p&gt;Take a CI failure. The immediate failure may be one broken test. The agent changes the test, then the implementation, then adjusts a type because the code now feels awkward. Each step can be justified, but the final diff is much larger than the original problem.&lt;/p&gt;

&lt;p&gt;Code generation is rarely the hard part here. The run has no stable constraint attached to it. The original goal may have been as small as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/goal fix the current failing tests, keep the diff as small as possible, and finish by running npm test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/goal address the review comments on this PR, leave unrelated files untouched, and end with a summary of the changes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That kind of goal carries the target, the boundary, and the acceptance condition. It tells the agent where to go, what not to touch, and when to stop.&lt;/p&gt;

&lt;p&gt;Without that state, the agent is easily pulled around by the current error. A type looks awkward, so it changes the type. A test is hard to write, so it changes the test. The structure feels messy, so it refactors. Each local move can make sense, while the whole task drifts.&lt;/p&gt;
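
&lt;p&gt;Read that way, a goal is a constraint the run checks before every step, not advice read once. A hedged sketch of such a boundary check (illustrative only; this is not how Codex enforces &lt;code&gt;/goal&lt;/code&gt;):&lt;/p&gt;

```python
# Illustrative guard: test each proposed edit against the goal's
# boundary before applying it. Paths and budget are made-up examples.
GOAL = {
    "allowed_paths": ("src/auth/", "tests/auth/"),
    "max_changed_lines": 80,
}

def within_boundary(edit, changed_so_far):
    in_scope = edit["path"].startswith(GOAL["allowed_paths"])
    budget_ok = GOAL["max_changed_lines"] >= changed_so_far + edit["lines"]
    return in_scope and budget_ok

# A tempting but out-of-scope refactor is rejected; the small fix passes.
drift = {"path": "src/types/core.ts", "lines": 12}
fix = {"path": "src/auth/session.ts", "lines": 9}
assert not within_boundary(drift, 0)
assert within_boundary(fix, 0)
```

&lt;p&gt;Each local move still gets judged on its own, but against a constraint that does not drift with the current error.&lt;/p&gt;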

&lt;h2&gt;
  
  
  On phones, the hard part is screen state
&lt;/h2&gt;

&lt;p&gt;OpenGUI works on a different kind of long-running task: letting AI operate a real Android phone.&lt;/p&gt;

&lt;p&gt;Repository: &lt;a href="https://github.com/Core-Mate/open-gui" rel="noopener noreferrer"&gt;https://github.com/Core-Mate/open-gui&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a codebase, state can still land in files, tests, and diffs. On a phone, state is a live screen.&lt;/p&gt;

&lt;p&gt;For example, ask the phone to open X, search for discussions about mobile AI agents, collect the main points, and summarize what people care about. As a sentence, this looks simple. On the phone, it becomes a series of state checks: is the app open, is this the home page, is the search box focused, did the results finish loading, did a login prompt, permission prompt, or follow recommendation appear in the middle.&lt;/p&gt;

&lt;p&gt;The loop of screenshot, tap, screenshot can only carry short tasks. If the screen does not change, the system has to decide whether the tap missed, the network is slow, the page is loading, or the action has no visible feedback. If the page jumps somewhere else, it also has to decide whether to go back, retry, or continue from the new page.&lt;/p&gt;
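
&lt;p&gt;The minimum the loop needs is a way to tell "the screen changed" from "nothing happened yet". A small Python sketch; the timeout and polling interval are assumed parameters, not OpenGUI's actual values:&lt;/p&gt;

```python
import time

def act_and_observe(do_action, screenshot, timeout=5.0, poll=0.5):
    """Perform one action, then wait for the screen to visibly change."""
    before = screenshot()
    do_action()
    deadline = time.monotonic() + timeout
    while deadline > time.monotonic():
        after = screenshot()
        if after != before:
            return ("changed", after)   # hand the new state to the planner
        time.sleep(poll)
    # Unchanged within the timeout: a missed tap, a slow network, or an
    # action with no visible feedback -- someone upstream must decide.
    return ("unchanged", before)
```

&lt;p&gt;The jumped-somewhere-else case needs the complement: comparing &lt;code&gt;after&lt;/code&gt; against the screen the plan expected, not only against &lt;code&gt;before&lt;/code&gt;.&lt;/p&gt;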

&lt;p&gt;So a goal on mobile has to answer a few concrete questions: which step is the task on, whether the current screen supports the next step, where to recover after a failure, and when the run can end.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenGUI turns the goal into a state flow
&lt;/h2&gt;

&lt;p&gt;I ran OpenGUI and read through the source. It connects the backend graph, device connection, and Android-side action execution instead of leaving phone automation as a script.&lt;/p&gt;

&lt;p&gt;On the backend, the main entry point is &lt;code&gt;server/apps/backend/src/modules/graph-agent/graph/mobile-agent.graph.ts&lt;/code&gt;. Complex tasks go through &lt;code&gt;Plan Supervisor&lt;/code&gt;, where the plan is split into executable subtasks. Concrete actions enter &lt;code&gt;executor.graph.ts&lt;/code&gt;, the device execution subgraph. The execution result goes back to the supervisor, which decides whether to continue, retry, replan, or hand off to &lt;code&gt;Summarizer&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;On Android, actions are applied to the real device. &lt;code&gt;client/core_accessibility/.../GestureService.kt&lt;/code&gt; executes GUI actions such as taps and typing. The device keeps a WebSocket connection to the backend, and &lt;code&gt;client/core_network/.../StandbySocketManager.kt&lt;/code&gt; handles the standby connection. Feishu/Lark, Telegram, and a REST API can sit outside this as remote task entry points, turning the phone from a local demo device into something that can receive work.&lt;/p&gt;

&lt;p&gt;OpenGUI spreads the goal across several pieces of state: the plan document, current subtask, device screenshot, execution result, failure classification, and final summary. After each device action, the backend gets fresh device state and decides the next move.&lt;/p&gt;

&lt;p&gt;A simple script assumes the page will follow the expected order. OpenGUI assumes the page will change, so the executor has to keep reporting real state back to the backend.&lt;/p&gt;
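
&lt;p&gt;The supervisor's move after each action can be sketched as a small mapping from execution result to next step. This mirrors the continue, retry, replan, and summarize choices above; the field names are my assumptions, not the actual state shape in &lt;code&gt;mobile-agent.graph.ts&lt;/code&gt;:&lt;/p&gt;

```python
# Hypothetical supervisor step; result fields and labels are illustrative.
def supervise(result, attempts, max_retries=2):
    if result["status"] == "ok":
        return "summarize" if result["subtask_is_last"] else "continue"
    if result["status"] == "screen_mismatch":
        # The page is not where the plan expected it to be; repeating
        # the same tap will not help, so replan from the observed screen.
        return "replan"
    if max_retries > attempts:
        return "retry"
    return "replan"
```

&lt;p&gt;The useful property is that every branch consumes fresh device state rather than a remembered assumption about the screen.&lt;/p&gt;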

&lt;h2&gt;
  
  
  The cost
&lt;/h2&gt;

&lt;p&gt;Putting the goal into a graph makes the system heavier.&lt;/p&gt;

&lt;p&gt;You have to maintain task state, keep WebSocket connections alive, handle device standby, send execution results and screenshots back, and design state transitions for continue, retry, cancel, and summarize. Model calls and screenshot analysis also cost money. The longer the task runs, the more that cost becomes an engineering concern instead of a small detail.&lt;/p&gt;

&lt;p&gt;But on mobile, it is hard to avoid this cost. Real apps show popups, hang on loading screens, misread taps, and send users to completely different pages. A prompt loop alone quickly turns into a screenshot-based &lt;code&gt;while true&lt;/code&gt; loop.&lt;/p&gt;

&lt;p&gt;OpenGUI puts that complexity into the system. A bad tap becomes an execution result for the supervisor to consume. The device keeps reporting state. It behaves more like a worker than a screen being clicked. The design is heavier, but it gives long-running tasks a place to be debugged, recovered, and reviewed.&lt;/p&gt;

&lt;p&gt;The first use cases I would try are community research, mobile flow testing, ops tasks, and App-only workflows that web automation cannot reach. These tasks may not need the strongest model, but they do need an execution system that can keep following the goal, see failure, and send state back.&lt;/p&gt;

&lt;p&gt;In coding agents, Codex &lt;code&gt;/goal&lt;/code&gt; keeps the goal as recoverable state. On phones, OpenGUI connects task progress, device feedback, and failure handling into a state flow. A long-running agent has to keep track of the run, not only execute the next step.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI Codex 0.128.0 release: &lt;a href="https://github.com/openai/codex/releases/tag/rust-v0.128.0" rel="noopener noreferrer"&gt;https://github.com/openai/codex/releases/tag/rust-v0.128.0&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Simon Willison: &lt;a href="https://simonwillison.net/2026/Apr/30/codex-goals/" rel="noopener noreferrer"&gt;https://simonwillison.net/2026/Apr/30/codex-goals/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;OpenGUI: &lt;a href="https://github.com/Core-Mate/open-gui" rel="noopener noreferrer"&gt;https://github.com/Core-Mate/open-gui&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>cli</category>
      <category>llm</category>
    </item>
    <item>
      <title>OpenGUI: OpenClaw on the phone</title>
      <dc:creator>Fenix</dc:creator>
      <pubDate>Mon, 04 May 2026 13:24:24 +0000</pubDate>
      <link>https://forem.com/fenix_23505d14df386c00ced/openguishou-ji-shang-de-openclaw-38od</link>
      <guid>https://forem.com/fenix_23505d14df386c00ced/openguishou-ji-shang-de-openclaw-38od</guid>
      <description>&lt;p&gt;OpenGUI 是一个让 AI 操作真实 Android 手机的项目。&lt;/p&gt;

&lt;p&gt;项目地址：&lt;a href="https://github.com/Core-Mate/open-gui" rel="noopener noreferrer"&gt;https://github.com/Core-Mate/open-gui&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenClaw 把 AI 接到桌面环境，OpenGUI 则把类似的执行层放到了 Android 手机上。它面向的是手机 App 里的任务：点击、输入、截图、阅读页面、执行流程、返回结果。&lt;/p&gt;

&lt;p&gt;很多任务天然发生在移动端：X、Reddit、Hacker News、Telegram、微信、小红书，还有不少只在 App 内完整存在的业务流程。网页自动化覆盖不到这些场景。&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic architecture
&lt;/h2&gt;

&lt;p&gt;OpenGUI consists of two parts: a backend and an Android client.&lt;/p&gt;

&lt;p&gt;The backend understands the task, generates a plan, supervises execution, and summarizes the result; the Android client connects to the backend and performs GUI actions on a real device. Beyond tapping the screen, it also has to handle task state, device state, and recovery after failures.&lt;/p&gt;

&lt;p&gt;A few pieces are visible in the repository:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;task planning, the Executor Graph, review, and summarization on the backend&lt;/li&gt;
&lt;li&gt;AccessibilityService-based action execution on the Android side&lt;/li&gt;
&lt;li&gt;devices staying connected over WebSocket&lt;/li&gt;
&lt;li&gt;remote trigger entry points such as Feishu/Lark, Telegram, and the REST API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once this is running, the phone can stay on standby and, like a remote worker, receive tasks, execute them, and send results back.&lt;/p&gt;

&lt;h2&gt;
  
  
  Local setup
&lt;/h2&gt;

&lt;p&gt;Locally you need an Android development environment and a connected Android device.&lt;/p&gt;

&lt;p&gt;Start the backend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;server
./start.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the Android client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;client
./start.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The backend script prepares the dependent services, database, and API; the client script builds the APK, installs it on the connected Android device, and launches the app.&lt;/p&gt;

&lt;p&gt;On the phone you still have to manually confirm USB debugging, the Accessibility Service, the overlay permission, and model API keys / bot credentials. Keeping permission grants as a manual step is the sensible choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it gets hard
&lt;/h2&gt;

&lt;p&gt;The trouble with a phone agent usually shows up in the state handling that follows. Tapping the screen is only the beginning.&lt;/p&gt;

&lt;p&gt;Take a very simple task:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Open X, search for recent discussions about mobile AI agents, collect the main points, and summarize what people mainly care about.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The task looks small, but plenty of uncertain things happen on a phone: the app may be stuck on an old page, the tap may miss the search box, the results page may load slowly, and login, permission, or follow-recommendation popups can appear in the middle.&lt;/p&gt;

&lt;p&gt;So a mobile agent cannot just look at a screenshot and tap once. It also has to know which step the task is on, whether the current screen matches expectations, how to recover after a wrong tap, whether to retry when the page does not change, and how to collect the execution result at the end.&lt;/p&gt;

&lt;p&gt;I ran it myself and read through the OpenGUI source carefully. The approach holds up: the backend graph manages task state and plans, the Executor Graph dispatches concrete steps to the phone, the Android side executes actions through AccessibilityService, and WebSocket carries device state and execution results back.&lt;/p&gt;

&lt;p&gt;This puts the phone inside the execution loop. The backend decides whether the task should continue, retry, or end; the phone feeds the real screen and action results back.&lt;/p&gt;

&lt;p&gt;This design is far more dependable than a plain script. The phone can stand by, execute, and report back, which is much closer to a mobile worker.&lt;/p&gt;

&lt;p&gt;The first uses I can think of are community information gathering, mobile flow testing, ops task execution, and the App-only flows that web automation cannot reach.&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>OpenGUI: OpenClaw for phones</title>
      <dc:creator>Fenix</dc:creator>
      <pubDate>Mon, 04 May 2026 13:23:21 +0000</pubDate>
      <link>https://forem.com/fenix_23505d14df386c00ced/opengui-openclaw-for-phones-31d8</link>
      <guid>https://forem.com/fenix_23505d14df386c00ced/opengui-openclaw-for-phones-31d8</guid>
      <description>&lt;p&gt;OpenGUI is a project that lets AI operate a real Android phone.&lt;/p&gt;

&lt;p&gt;Repository: &lt;a href="https://github.com/Core-Mate/open-gui" rel="noopener noreferrer"&gt;https://github.com/Core-Mate/open-gui&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenClaw connects AI to a desktop environment. OpenGUI brings a similar execution layer to Android. It is aimed at tasks inside mobile apps: tapping, typing, taking screenshots, reading screens, moving through flows, and returning results.&lt;/p&gt;

&lt;p&gt;A lot of work already happens on phones: X, Reddit, Hacker News, Telegram, WeChat, Xiaohongshu, and plenty of business flows that only really exist inside apps. Web automation does not reach those surfaces.&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic architecture
&lt;/h2&gt;

&lt;p&gt;OpenGUI has two main parts: a backend and an Android client.&lt;/p&gt;

&lt;p&gt;The backend understands the task, creates a plan, supervises execution, and summarizes the result. The Android client connects to the backend and performs GUI actions on a real device. Beyond tapping the screen, it also has to handle task state, device state, and recovery after failures.&lt;/p&gt;

&lt;p&gt;You can see a few pieces in the repo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;task planning, Executor Graph, review, and summarization on the backend&lt;/li&gt;
&lt;li&gt;AccessibilityService-based action execution on Android&lt;/li&gt;
&lt;li&gt;WebSocket connections for keeping devices online&lt;/li&gt;
&lt;li&gt;remote entry points through Feishu/Lark, Telegram, and REST API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once it is running, the phone can stay on standby, receive a task like a remote worker, execute it, and send results back.&lt;/p&gt;

&lt;h2&gt;
  
  
  Local setup
&lt;/h2&gt;

&lt;p&gt;You need an Android development environment and a connected Android device.&lt;/p&gt;

&lt;p&gt;Start the backend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;server
./start.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the Android client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;client
./start.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The backend script prepares the services, database, and API. The client script builds the APK, installs it on the connected Android device, and launches the app.&lt;/p&gt;

&lt;p&gt;Some phone-side steps still need manual approval: USB debugging, Accessibility Service, overlay permission, and model API keys or bot credentials. Keeping those steps explicit makes sense.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it gets hard
&lt;/h2&gt;

&lt;p&gt;The hard part of a phone agent usually starts after the first tap.&lt;/p&gt;

&lt;p&gt;Take a simple task:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Open X, search for recent discussions about mobile AI agents, collect the main points, and summarize what people care about.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That sounds small, but the phone can be in many different states. The app may open on an old page. The search box may not receive focus. Results may load slowly. A login prompt, permission prompt, or follow recommendation can appear in the middle.&lt;/p&gt;

&lt;p&gt;So a mobile agent cannot just look at a screenshot and tap once. It has to know where the task is, whether the current screen matches expectations, how to recover after a bad tap, when to retry after no visible change, and how to collect the final result.&lt;/p&gt;
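
&lt;p&gt;One way to make those requirements concrete is to attach an expected screen and a recovery anchor to every subtask. A Python sketch of that structure (illustrative; OpenGUI's actual plan schema may differ):&lt;/p&gt;

```python
# Hypothetical per-step expectations for the X search task above.
PLAN = [
    {"step": "open_app",   "expect_after": "home",    "recover_to": "open_app"},
    {"step": "tap_search", "expect_after": "search",  "recover_to": "open_app"},
    {"step": "type_query", "expect_after": "results", "recover_to": "tap_search"},
]

def next_step(index, observed_screen):
    if observed_screen == PLAN[index]["expect_after"]:
        return index + 1                          # advance to the next step
    # Wrong screen (popup, old page, missed tap): jump back to the
    # step's recovery anchor instead of blindly retrying.
    names = [s["step"] for s in PLAN]
    return names.index(PLAN[index]["recover_to"])

assert next_step(1, "search") == 2          # search box focused: continue
assert next_step(2, "login_prompt") == 1    # popup appeared: recover
```

&lt;p&gt;With that in place, "where is the task" becomes a lookup, not something the model has to reconstruct from a screenshot.&lt;/p&gt;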

&lt;p&gt;I ran OpenGUI and also spent some time reading the source. The approach is pretty good: the backend graph manages task state and plans, the Executor Graph sends concrete steps to the phone, the Android side performs actions through AccessibilityService, and WebSocket sends device state and execution results back.&lt;/p&gt;

&lt;p&gt;This puts the phone inside the execution loop. The backend decides whether to continue, retry, or finish; the phone reports what actually happened on screen.&lt;/p&gt;

&lt;p&gt;This is much more practical than a plain script. The phone can stand by, execute, and report back. It starts to look like a mobile worker.&lt;/p&gt;

&lt;p&gt;The first use cases I can imagine are community research, mobile flow testing, ops tasks, and App-only workflows that web automation cannot touch.&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>NexAgent: a self-evolving AI agent built on Elixir/OTP</title>
      <dc:creator>Fenix</dc:creator>
      <pubDate>Sat, 14 Mar 2026 01:37:33 +0000</pubDate>
      <link>https://forem.com/fenix_23505d14df386c00ced/nexagent-a-self-evolving-ai-agent-built-on-elixirotp-121n</link>
      <guid>https://forem.com/fenix_23505d14df386c00ced/nexagent-a-self-evolving-ai-agent-built-on-elixirotp-121n</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;OpenClaw showed the world the future of AI Agents. But it got me wondering: if an Agent is going to stick with me for a decade, what should its architecture look like?&lt;br&gt;
&lt;strong&gt;Links&lt;/strong&gt;: NexAgent GitHub: &lt;a href="https://github.com/gofenix/nex-agent" rel="noopener noreferrer"&gt;https://github.com/gofenix/nex-agent&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OpenClaw's &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;310k stars&lt;/a&gt; are well-deserved.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It proves that the "personal AI Agent" direction is right—that everyday users are willing to pay for an "AI that can actually do work." As a long-time AI observer, seeing that red lobster logo take over the internet makes me genuinely happy. It means Agents are finally moving from a niche toy to the mainstream.&lt;/p&gt;

&lt;p&gt;I spun one up myself. But using it surfaced some interesting issues that got me thinking: &lt;strong&gt;If an Agent is meant to be a 24/7 "companion" rather than just a "tool"—and if it needs to get smarter over time—how should its underlying architecture change?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There's no single right answer. OpenClaw proved the demand using TypeScript/Node.js. I wanted to explore a different path using Elixir/OTP.&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://github.com/gofenix/nex-agent" rel="noopener noreferrer"&gt;NexAgent&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This isn't meant to compete with OpenClaw. It's an experiment focused entirely on the "long-running Agent" niche.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  My Experience Raising a Lobster
&lt;/h2&gt;

&lt;p&gt;Like everyone else, I found OpenClaw on Twitter.&lt;/p&gt;

&lt;p&gt;The red lobster logo, &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;310k stars&lt;/a&gt;, the "AI digital employee" pitch—it felt straight out of Iron Man (hello, JARVIS).&lt;/p&gt;

&lt;p&gt;I installed it right away, hooked up a Telegram Bot, and gave it a task: "Check my GitHub Issues every morning at 8 AM and ping the high-priority ones to Lark."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Early Days: Pure Magic&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Waking up to find my Issues neatly categorized.&lt;/li&gt;
&lt;li&gt;Asking "Any bugs today?" on Telegram and seeing it actually remember yesterday's context.&lt;/li&gt;
&lt;li&gt;Felt a solid 20% bump in my quality of life.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;That Led Me to a Different Question&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What if I don't just want to &lt;em&gt;use&lt;/em&gt; an Agent, but &lt;em&gt;raise&lt;/em&gt; it long-term?&lt;/li&gt;
&lt;li&gt;It needs 24/7 rock-solid uptime.&lt;/li&gt;
&lt;li&gt;It should compound its intelligence, not wipe the slate clean on every reboot.&lt;/li&gt;
&lt;li&gt;It needs to self-evolve instead of waiting for author updates.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s when another tech stack came to mind.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Elixir/OTP?
&lt;/h2&gt;

&lt;p&gt;Choosing TypeScript/Node.js for OpenClaw was the right move. It dramatically lowered the barrier to entry and brought more than 310k people into the project. That's how open source wins.&lt;/p&gt;

&lt;p&gt;But I kept wondering: &lt;strong&gt;if the endgame is a "system that never goes down", what else is out there?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That led me to Elixir and OTP. Not for novelty's sake, but because OTP (Open Telecom Platform) was literally built for telecom switches: systems that &lt;em&gt;must&lt;/em&gt; run 24/7, stay resilient, and support hot code upgrades.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Node.js Approach&lt;/th&gt;
&lt;th&gt;OTP Approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Process Management&lt;/td&gt;
&lt;td&gt;Single process + external restart&lt;/td&gt;
&lt;td&gt;Supervision tree auto-restart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory Isolation&lt;/td&gt;
&lt;td&gt;Same process space&lt;/td&gt;
&lt;td&gt;Each task runs in isolated process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hot Updates&lt;/td&gt;
&lt;td&gt;Restart the service&lt;/td&gt;
&lt;td&gt;Zero-downtime hot reload&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error Recovery&lt;/td&gt;
&lt;td&gt;Manual intervention&lt;/td&gt;
&lt;td&gt;Auto-recovery + graceful degradation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;It’s not about which stack is better—it’s about &lt;strong&gt;optimizing for different use cases&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenClaw optimizes for accessibility, putting Agents in everyone's hands.&lt;/li&gt;
&lt;li&gt;NexAgent optimizes for extreme stability, exploring what long-term AI companionship looks like.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Core Experiments with NexAgent
&lt;/h2&gt;

&lt;p&gt;I rewrote the Agent's core in Elixir and ran a few tests:&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment 1: Uptime
&lt;/h3&gt;

&lt;p&gt;I left NexAgent running on my local machine for an extended period:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stability&lt;/strong&gt;: Rock solid, zero memory leaks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: Consistently fast, no degradation over time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resilience&lt;/strong&gt;: When a tool crashed, it restarted instantly without taking down the main loop.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Zero manual restarts. OTP's supervision tree makes the system far easier to operate over long periods.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment 2: Hot Reloading in the Wild
&lt;/h3&gt;

&lt;p&gt;My AMap weather tool suddenly broke, throwing API permission errors.&lt;/p&gt;

&lt;p&gt;The Agent self-diagnosed the issue: the API key was bound to iOS, but it was making server-side calls. It autonomously patched its own source code, swapping the logic to read a Web Service key instead.&lt;/p&gt;

&lt;p&gt;Four minutes later, it successfully pulled the weather for Shenzhen. No server restart. The chat session never dropped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here's the actual screenshot:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxe04pax02fs2v3r94dmu.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxe04pax02fs2v3r94dmu.jpeg" alt="Agent auto-fixing AMap weather tool" width="800" height="773"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From diagnosing the bug to writing the fix and hot-reloading the module—zero human intervention.&lt;/p&gt;
&lt;h3&gt;
  
  
  Experiment 3: Self-Evolution Pipeline
&lt;/h3&gt;

&lt;p&gt;NexAgent ships with a built-in pipeline for self-improvement:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reflect&lt;/strong&gt;: Read the source code of any internal module.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyze&lt;/strong&gt;: Figure out what's broken and draft a fix.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Upgrade&lt;/strong&gt;: Apply the patch and hot-reload on the fly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This means that when tool logic needs to change, for example because an external API changes its response format, the Agent can inspect its own code, patch it, and update itself in memory without me having to restart the daemon.&lt;/p&gt;

&lt;p&gt;The mechanism works flawlessly. I'm currently exploring more complex "fully autonomous repair" scenarios.&lt;/p&gt;


&lt;h2&gt;
  
  
  Under the Hood of NexAgent
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Supervision Trees: Crash and Recover
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="c1"&gt;# lib/nex/agent/application.ex&lt;/span&gt;
&lt;span class="n"&gt;children&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="no"&gt;NexAgent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;InfrastructureSupervisor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="no"&gt;NexAgent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;WorkerSupervisor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="no"&gt;NexAgent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;Gateway&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="no"&gt;Supervisor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_link&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;children&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;strategy:&lt;/span&gt; &lt;span class="ss"&gt;:rest_for_one&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If the infrastructure tier crashes, all dependent Workers restart. If a single tool crashes, only that specific tool's process restarts. The main Agent loop keeps running.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Process Isolation: Each Task Runs in Its Own Process
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="c1"&gt;# lib/nex/agent/tool/registry.ex:181&lt;/span&gt;
&lt;span class="no"&gt;Task&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;Supervisor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_child&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;NexAgent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;ToolTaskSupervisor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;
  &lt;span class="n"&gt;tool_module&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Every single tool execution gets its own lightweight process. Crashes don't affect the main loop.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Hot Reloading: Upgrades on the Fly
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="c1"&gt;# lib/nex/agent/code_upgrade.ex:39&lt;/span&gt;
&lt;span class="n"&gt;with&lt;/span&gt; &lt;span class="ss"&gt;:ok&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;maybe_validate_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="ss"&gt;:ok&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;create_backup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="ss"&gt;:ok&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;write_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hot_reload&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;compile_and_load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="ss"&gt;:ok&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;maybe_health_check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;%{&lt;/span&gt;&lt;span class="ss"&gt;version:&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;hot_reload:&lt;/span&gt; &lt;span class="n"&gt;hot_reload&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;
    &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rollback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Auto rollback on failure&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  4. Dual-Layer Memory
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MEMORY.md&lt;/strong&gt;: Long-term state (project context, user quirks).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HISTORY.md&lt;/strong&gt;: Grep-able conversation logs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Powered by &lt;strong&gt;async consolidation&lt;/strong&gt;: When a chat gets too long, the Agent spins up a background process to summarize the history and extract facts into long-term memory. Zero latency impact on your active chat.&lt;/p&gt;
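
&lt;p&gt;The shape of that consolidation step is easy to sketch. NexAgent's real module is Elixir; this Python version is a stand-in, with &lt;code&gt;summarize&lt;/code&gt; playing the role of the model call:&lt;/p&gt;

```python
import threading

class Memory:
    """Toy dual-layer memory: a hot transcript plus consolidated facts."""
    def __init__(self, limit=100):
        self.history, self.long_term = [], []
        self.limit = limit

    def record(self, turn, summarize):
        self.history.append(turn)
        if len(self.history) > self.limit:
            # Hand the old transcript to a background worker and keep
            # chatting on a fresh one -- the active session never waits.
            chunk, self.history = self.history[:], []
            worker = threading.Thread(
                target=lambda: self.long_term.append(summarize(chunk)))
            worker.start()
            return worker
```

&lt;p&gt;The active chat keeps appending to the fresh transcript while the worker folds the old one into long-term memory.&lt;/p&gt;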


&lt;h2&gt;
  
  
  The Six-Layer Evolution Model
&lt;/h2&gt;

&lt;p&gt;NexAgent doesn't just evolve in one way. It has six layers of growth:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;SOUL&lt;/strong&gt;: Personality and core values.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;USER&lt;/strong&gt;: Your profile and how you like to collaborate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MEMORY&lt;/strong&gt;: Long-term context and project domain knowledge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SKILL&lt;/strong&gt;: Reusable workflows it has learned.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TOOL&lt;/strong&gt;: Hardcoded integrations and tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CODE&lt;/strong&gt;: The actual Elixir source code.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every layer compounds over time, and each can evolve independently.&lt;/p&gt;


&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;Want to take it for a spin?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Install Elixir (~&amp;gt; 1.18)&lt;/span&gt;
&lt;span class="c"&gt;# 2. Clone repo&lt;/span&gt;
git clone https://github.com/gofenix/nex-agent.git
&lt;span class="nb"&gt;cd &lt;/span&gt;nex-agent
mix deps.get

&lt;span class="c"&gt;# 3. Initialize&lt;/span&gt;
mix nex.agent onboard

&lt;span class="c"&gt;# 4. Configure config file&lt;/span&gt;

&lt;span class="c"&gt;# 5. Start gateway&lt;/span&gt;
mix nex.agent gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More docs: &lt;a href="https://github.com/gofenix/nex-agent" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;OpenClaw opened our eyes to what AI Agents can do. That's a massive win for the whole industry.&lt;/p&gt;

&lt;p&gt;NexAgent is simply probing a specific niche: &lt;strong&gt;If an Agent is meant to be a long-term companion, how should we build it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;310k people are raising lobsters&lt;/a&gt;, experiencing what it feels like to have AI at their fingertips.&lt;/p&gt;

&lt;p&gt;I'm planting a tree, waiting for the day it grows into a canopy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two different paths, one shared goal: weaving AI seamlessly into our lives.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
