Forem: Mininglamp

Apple Silicon's AI Ceiling Is Higher Than You Think

Mininglamp — Tue, 26 May 2026 10:33:58 +0000

The consensus narrative around Apple Silicon and local AI inference goes something like this: impressive hardware, hobbyist-grade software, fundamentally memory-bandwidth-bound, ceiling already visible. This narrative is wrong—or at minimum, premature. The architectural headroom in Apple's Unified Memory Architecture (UMA) remains substantially underexploited by current inference frameworks, and recent work from Mininglamp Technology's open-source Cider SDK demonstrates that the compute ceiling sits considerably higher than the community assumes.

This article dissects why the ceiling is higher, how activation quantization unlocks it, and what the benchmark data actually shows.

Apple Silicon UMA: Why the Architecture Suits Inference Better Than You Think

Apple Silicon's UMA is not simply "shared RAM." It is a cache-coherent fabric where CPU, GPU, and Neural Engine access an identical physical address space with zero-copy semantics. On an M5 Pro with 64GB unified memory, the system delivers 307 GB/s of memory bandwidth—shared across all compute units without the PCIe bottleneck that plagues discrete GPU setups.

For LLM inference specifically, this creates three structural advantages:

Zero-copy weight access. Weights loaded once are visible to GPU compute kernels without DMA transfers. No host-to-device copies, no pinned memory gymnastics.
Bandwidth amortization across compute units. The Neural Engine, GPU, and CPU can pipeline different phases of inference (embedding lookup → attention → FFN) without serializing on memory bus contention in the way multi-device setups must.
Large context without OOM cliffs. 64-128GB unified pools mean 70B-class models fit entirely in memory with room for KV-cache growth—something that requires multi-GPU on NVIDIA platforms.

The bottleneck, then, is not the hardware. It is how efficiently software uses the available compute throughput. Current frameworks leave massive headroom on the table by treating Apple Silicon GPUs as bandwidth-limited devices when they are, in fact, compute-capable devices running compute-starved kernels.

MLX's Current State: Weight Quantization and the Prefill Bottleneck

Apple's MLX framework has become the de facto inference engine for Apple Silicon. It handles weight-only quantization elegantly: W4A16 (4-bit weights, 16-bit activations) and W8A16 (8-bit weights, 16-bit activations) are first-class citizens with optimized Metal kernels.

How weight-only quantization works in MLX:

In W4A16, each weight tensor is quantized offline to 4-bit integers with per-group scale and zero-point parameters (typically group size 32 or 128). At inference time, the kernel dequantizes weights on-the-fly back to FP16 before computing the matrix multiplication against FP16 activations. This halves (W8) or quarters (W4) the memory footprint of weights, directly reducing memory bandwidth pressure during the decode phase where each token generation requires a full model pass.

The decode phase—generating one token at a time—is purely memory-bandwidth-bound (small batch, large weight reads). Weight quantization addresses this perfectly. MLX's W4A16 decode speeds are genuinely impressive on Apple Silicon.

But prefill is a different beast entirely.

During prefill (processing the entire input prompt), the computation profile shifts dramatically. With thousands of input tokens processed simultaneously, the matrix multiplications become large GEMMs (General Matrix-Matrix Multiplications) where compute throughput—not just bandwidth—becomes the limiting factor. The activation matrices are wide (sequence_length × hidden_dim), and multiplying FP16 activations against dequantized-to-FP16 weights means every GEMM operates at FP16 arithmetic intensity.

This is where MLX hits its ceiling. On an M5 Pro processing 4516 tokens of context, MLX W8A16 takes 2.839 seconds for prefill. The GPU's INT8 tensor operation units sit completely idle during this phase—unused compute capacity that exists in hardware but is unreachable by the current software stack.

The prefill bottleneck matters because it directly impacts time-to-first-token (TTFT), which dominates perceived latency in agentic workflows, RAG pipelines, and any application that processes substantial context before generating output.

Activation Quantization: The Hard Problem MLX Doesn't Solve

Weight Quantization vs. Activation Quantization: The Fundamental Difference

Weight quantization is an offline problem. Model weights are static tensors—their distribution is known at calibration time, fixed forever after. You can spend hours finding optimal scale factors, per-channel ranges, and outlier handling strategies. The quantized representation is computed once, stored, and deployed.

Activation quantization is an online problem. Activations are computed dynamically at every layer, for every input, at every inference step. Their distributions shift based on input content, sequence position, attention patterns, and layer depth. You cannot pre-compute optimal quantization parameters because you don't know what the activations will look like until they arrive.

Why Activation Quantization Is Harder

Three properties make activations notoriously difficult to quantize:

Dynamic range instability. Unlike weights, which occupy a stable distribution learned during training, activation tensors exhibit input-dependent magnitude shifts. A token attending to a rare pattern might produce activation values 10-100x larger than typical tokens in the same sequence. These outliers, if clipped, destroy model accuracy; if accommodated in the quantization range, they waste precision for the majority of values.

Channel-wise heterogeneity. Different channels (feature dimensions) in activation tensors often have dramatically different ranges. Channel 42 might span [-0.1, 0.1] while channel 1337 spans [-50, 50]. A single per-tensor scale factor cannot serve both without catastrophic precision loss in the narrow-range channels.

Accumulation sensitivity. In matrix multiplications, quantization errors accumulate across the reduction dimension. For a GEMM with reduction dimension K=4096, each output element sums 4096 products. Even small per-element quantization noise (each ±0.01) can accumulate into significant output error, especially when the products are correlated rather than random.

Static vs. Dynamic Quantization Approaches

Static quantization pre-calibrates activation ranges using representative data. Scale factors are fixed at deployment. Advantage: zero runtime overhead for range computation. Disadvantage: any input that deviates from calibration distribution gets clipped or underutilized precision.

Dynamic quantization computes activation statistics (min/max or percentile) at runtime for each tensor. Advantage: adapts perfectly to every input. Disadvantage: the statistics computation itself adds latency—for large activation tensors, computing min/max across millions of elements is non-trivial.

The practical engineering challenge is finding the sweet spot: enough dynamic adaptation to preserve accuracy, with low enough overhead to actually deliver speedups.

Granularity: Per-Tensor vs. Per-Channel vs. Per-Group

Per-tensor quantization uses a single scale/zero-point for the entire activation tensor. Simplest to implement, cheapest computationally, worst for accuracy when channels have heterogeneous ranges.

Per-channel quantization assigns independent scale factors to each channel (feature dimension). Handles heterogeneous ranges well, but requires the GEMM kernel to support mixed scaling—the accumulation must account for different scales per output channel. This is where hardware-specific kernel design becomes critical.

Per-group quantization (e.g., group size 64 or 128) subdivides channels into groups, each with independent scale factors. It sits between per-tensor and per-channel: better accuracy than per-tensor, more flexibility than strict per-channel, but requires kernel support for grouped dequantization during accumulation.

The choice between these granularities is not purely about accuracy—it's a hardware co-design question. Which granularity can the target hardware's GEMM units exploit without introducing pipeline stalls or register pressure?

Cider SDK: INT8 Activation Quantization for Apple Silicon

Mininglamp Technology's Cider SDK answers this hardware co-design question specifically for Apple Silicon's M5+ GPU architecture. Rather than treating activation quantization as a framework-agnostic algorithm, Cider is engineered as an MLX enhancement layer that exploits hardware capabilities MLX currently leaves untouched.

INT8 TensorOps Kernel Design

The core contribution is a set of Metal compute kernels that perform INT8×INT8 matrix multiplications using Apple Silicon's dedicated integer tensor operation units. These units, available on M5-generation chips and newer, can execute 8-bit integer multiply-accumulate operations at significantly higher throughput than the FP16 ALUs used by standard MLX kernels.

Cider's kernel pipeline works as follows:

Dynamic quantization pass. For each activation tensor entering a linear layer, compute per-channel (or per-group) scale factors using a fast min/max reduction kernel.
Activation quantization. Map FP16 activations to INT8 using the computed scale factors. This is a memory-bandwidth-light operation (one pass, streaming).
INT8 GEMM execution. The quantized activation tensor is multiplied against pre-quantized INT8 weights using Metal's integer tensor operations. The accumulation happens in INT32 to prevent overflow.
Dequantization and rescaling. The INT32 accumulator output is rescaled using the product of activation and weight scale factors, producing FP16 output for the next layer.

The key engineering insight is that steps 1-2 (quantization overhead) are bandwidth-bound micro-operations, while step 3 (the actual GEMM) runs at nearly 2x the arithmetic throughput of FP16. The net effect is a substantial prefill speedup where GEMMs dominate total compute time.

Conditional Compilation for M5+ Hardware

Cider uses conditional compilation to detect Apple Silicon generation at build time. On M5+ hardware where INT8 TensorOps are available, the optimized kernel path activates. On older hardware (M1-M4), Cider falls back gracefully to standard MLX execution—no crashes, no silent accuracy loss, just baseline MLX performance.

This design decision reflects engineering pragmatism: INT8 tensor operations are a hardware feature, not a software emulation target. Attempting to simulate them on older generations would produce slowdowns, not speedups.

Three Granularity Options: Performance vs. Accuracy Tradeoffs

Cider exposes three activation quantization granularities, each with distinct performance characteristics measured against MLX W4A16 baseline on prefill:

Granularity	Prefill Speedup vs. MLX W4A16	Accuracy Impact	Use Case
Per-channel	1.8x	Lowest degradation	Production deployment, accuracy-critical
Per-group gs=128	1.5x	Moderate	Balanced default for most workloads
Per-group gs=64	1.3x	Minimal	Maximum accuracy preservation

The inverse relationship between granularity fineness and speedup is instructive. Per-channel quantization uses fewer scale factors and allows the INT8 GEMM to operate on larger contiguous blocks without rescaling interrupts. Per-group gs=64 requires more frequent scale factor lookups and partial accumulations, introducing pipeline bubbles.

Developers choose the granularity based on their accuracy/latency tradeoff requirements. For agentic applications where TTFT dominates UX, per-channel's 1.8x is transformative. For tasks where output quality cannot degrade (medical, legal), gs=64 still delivers meaningful improvement.

Integration with MLX Execution Graph

Critically, Cider is not a fork of MLX—it is a plugin layer. It works with all existing MLX models without requiring model re-export or custom weight formats. The integration point is at the linear layer level: Cider intercepts MLX's GEMM dispatch during prefill, routes eligible operations through the INT8 kernel path, and returns results to the standard MLX execution graph.

This means any model available in MLX format—Llama, Qwen, Mistral, Phi, Gemma—gets Cider acceleration without modification. No special quantization recipes, no model-specific tuning, no breaking changes to existing MLX workflows.

Benchmarks: What the Numbers Actually Show

Full benchmark on Apple M5 Pro, 64GB RAM, 307 GB/s bandwidth. Context length: 4516 tokens.

Configuration	Prefill Time	Decode Speed	Notes
MLX W8A16	2.839s	80.1 tok/s	Baseline—FP16 activations
Cider W8A8	2.519s	79.5 tok/s	INT8 activations enabled
Delta	-12.7%	-0.7%	Prefill gains, decode neutral

Interpreting the Results

Why prefill improves: The 4516-token prefill involves large GEMMs where compute throughput matters. INT8 TensorOps deliver higher effective TFLOPS for these operations. The 12.7% improvement represents the net gain after subtracting quantization overhead (dynamic scale computation + INT8 conversion).

Why decode barely changes: Single-token decode is a batch-1 operation. The GEMM degenerates into a matrix-vector multiply that is purely memory-bandwidth-bound regardless of numeric precision. INT8 activations don't help because the bottleneck is weight loading, not arithmetic. The -0.7% difference is within measurement noise—Cider introduces no decode regression.

The 1.4-2.2x prefill speedup range (cited from Cider's README, measured across different models and configurations against MLX W4A16) reflects the broader performance envelope. The W8A8 vs. W8A16 comparison above is the most conservative case—same weight precision, isolating pure activation quantization benefit. Against W4A16 baselines (where weight dequantization adds further overhead), Cider's advantage widens substantially.

What This Implies for Real Applications

A 12.7% prefill reduction on 4516 tokens translates to ~320ms saved per inference call. In an agentic loop that processes context 10-20 times per task (tool calls, reflection steps, context window re-reads), that compounds to 3-6 seconds of wall-clock improvement per agent task. For RAG applications processing retrieved documents, the speedup applies to every retrieval-augmented generation call.

Mano-P: Where Cider Meets a Full On-Device AI Stack

Cider does not exist in isolation. It is a component of Mano-P, Mininglamp Technology's open-source on-device AI agent framework designed specifically for Apple Silicon Macs.

Mano-P's architecture treats the Mac as a complete AI workstation: model inference (via MLX + Cider), tool orchestration, memory management, and multi-agent coordination—all running locally. No API calls to external services, no data leaving the device, no per-token billing.

The Cider integration within Mano-P means that agentic workflows—where the model processes large contexts repeatedly (screen captures, document analysis, multi-step reasoning)—benefit from activation quantization at every inference call. The 1.4-2.2x prefill improvement compounds across agent loops, materially reducing end-to-end task completion time.

This is the broader thesis Mininglamp Technology is demonstrating: Apple Silicon is not a hobbyist platform with a visible ceiling. It is a production-grade AI inference substrate whose compute capabilities are systematically underutilized by current software. Cider proves the ceiling is higher. Mano-P builds the full stack that exploits it.

Conclusion: The Ceiling Is a Software Problem

Apple Silicon's AI inference ceiling is not set by hardware bandwidth or compute capacity. It is set by how intelligently software exploits the available hardware features. INT8 TensorOps on M5+ chips represent concrete, shipping silicon that the dominant inference framework (MLX) does not yet utilize.

Mininglamp Technology's Cider SDK—Apache 2.0 licensed, compatible with all MLX models, zero-modification deployment—demonstrates that meaningful performance remains extractable through hardware-aware kernel engineering. The 1.4-2.2x prefill improvements are not theoretical projections; they are measured results on production hardware.

The ceiling is higher than you think. The tools to reach it are open source.

Cider SDK is open-sourced under Apache 2.0 by Mininglamp Technology. It requires Apple Silicon M5 or newer for INT8 TensorOps acceleration.

GUI Agents vs RPA: Different Architectures for Different Problems

Mininglamp — Tue, 26 May 2026 10:28:48 +0000

Desktop automation has reached an inflection point. For two decades, Robotic Process Automation (RPA) dominated enterprise workflow automation through deterministic scripting. Today, a fundamentally different architecture—vision-language-action (VLA) GUI agents—challenges the assumption that automation requires brittle, hand-coded selectors. These are not competing products on the same spectrum; they represent distinct architectural paradigms optimized for different problem classes.

This article dissects both architectures at the systems level, examines where each fails, and analyzes how Mano-P, an open-source GUI agent project by Mininglamp Technology, implements the VLA paradigm with on-device inference.

The Structural Fragility of RPA

RPA tools—UiPath, Automation Anywhere, Blue Prism—operate on a selector-action model. Each automation step identifies a UI element via DOM path, CSS selector, accessibility attribute, or pixel coordinate, then executes a predefined action. This architecture carries four compounding failure modes:

DOM Coupling and Selector Fragility. A single UI update—renamed button ID, restructured div hierarchy, relocated modal—breaks the entire downstream chain. Enterprise RPA deployments report 30-40% of maintenance effort goes to selector repair after application updates. This is not a bug; it is the architectural consequence of coupling automation logic to implementation-specific element identifiers rather than semantic intent.

Maintenance Scaling. The relationship between automation count and maintenance burden is superlinear. Each new bot adds not just its own maintenance surface but interaction complexity with shared UI elements. Organizations with 200+ bots frequently employ dedicated "bot repair" teams larger than the original development team.

Cross-Application Boundaries. RPA operates within single-application contexts. Workflows spanning multiple applications require explicit handoff logic—clipboard operations, file watchers, inter-process communication hacks. A task trivial for a human ("copy this table from the PDF into the spreadsheet, then email it") becomes a fragile multi-stage pipeline with failure modes at every boundary.

Semantic Blindness. RPA has no understanding of what it is doing. It cannot distinguish a "Submit" button from a "Cancel" button except by selector match. When an application presents an unexpected dialog ("Are you sure you want to delete all records?"), a selector-based bot either crashes or, worse, proceeds with the wrong action. There is no reasoning layer to evaluate whether the current screen state matches the expected workflow context.

Three Generations of Desktop Automation Architecture

The evolution from scripted automation to intelligent agents follows a clear architectural progression:

┌─────────────────────────────────────────────────────────────┐
│  Generation 1: Selector-Action (RPA)                        │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐              │
│  │ Selector │───▶│  Action  │───▶│ Selector │───▶ ...      │
│  │ (brittle)│    │(hardcoded)│   │ (brittle)│              │
│  └──────────┘    └──────────┘    └──────────┘              │
│  Failure mode: any UI change breaks the chain               │
├─────────────────────────────────────────────────────────────┤
│  Generation 2: Vision + LLM (Set-of-Marks, early agents)   │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐              │
│  │Screenshot│───▶│ LLM Plan │───▶│Click x,y │───▶ ...     │
│  │ + Labels │    │(per-step) │   │(no verify)│              │
│  └──────────┘    └──────────┘    └──────────┘              │
│  Failure mode: no grounding, no error recovery              │
├─────────────────────────────────────────────────────────────┤
│  Generation 3: VLA Unified Model (Mano-P)                   │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐              │
│  │  Visual  │───▶│ Reason + │───▶│  Action  │───▶ Verify  │
│  │ Encoding │    │  Ground  │    │ Predict  │     ──┐     │
│  └──────────┘    └──────────┘    └──────────┘       │     │
│       ▲                                              │     │
│       └──────────────────────────────────────────────┘     │
│  Key: closed-loop perception-reasoning-action-verification  │
└─────────────────────────────────────────────────────────────┘

Generation 1 treats automation as scripting. Generation 2 adds perception but remains open-loop—screenshot in, coordinate out, no verification that the action succeeded. Generation 3, implemented in Mano-P, closes the loop: the same model that perceives the screen also reasons about intent, predicts actions, and verifies outcomes before proceeding.

Mano-P's VLA Architecture: A Deep Dive

Mano-P, open-sourced by Mininglamp Technology under Apache 2.0, implements a unified Vision-Language-Action architecture where visual perception, language reasoning, and action prediction occur within a single model forward pass rather than as separate pipeline stages.

Vision-Language-Action Unified Model

The VLA architecture unifies three traditionally separate capabilities into a single transformer backbone:

Visual Encoding. Raw screen frames are encoded through a vision transformer that produces spatial feature maps preserving both fine-grained element details (button text, icon shape) and global layout structure (window arrangement, relative positioning). Unlike Set-of-Marks approaches that overlay numbered labels onto screenshots, Mano-P's visual encoder learns to ground elements directly from pixel space—eliminating the information loss and visual clutter of annotation-based methods.

Language Reasoning. The language component serves dual functions: (1) interpreting the user's natural language task description and maintaining multi-turn dialogue context, and (2) generating explicit reasoning traces ("thinking") before committing to actions. This is not prompt engineering on top of a general LLM—the language reasoning is jointly trained with visual grounding and action prediction, creating shared representations where linguistic concepts ("the submit button in the bottom-right corner") directly map to spatial features in the visual encoding.

Action Prediction. The action head produces structured outputs—click coordinates, text input, keyboard shortcuts, scroll operations—grounded in the visual scene. Critically, actions are predicted from the model's internal visual representation, not from external element identifiers. This means the same "click the blue submit button" task executes correctly regardless of whether the button's DOM ID changed, its CSS class was renamed, or it moved 50 pixels to the right in a redesign.

The unified architecture means these three capabilities share gradient flow during training. Visual features that help action prediction get reinforced; language representations that improve visual grounding get strengthened. This is fundamentally different from pipeline architectures where each component is optimized independently.

Three-Stage Training Pipeline

Mano-P's training follows a carefully designed progression that mirrors how humans learn complex tasks:

Stage 1: Supervised Fine-Tuning (Behavior Cloning). The model learns from expert demonstrations—recorded sequences of (screen state, reasoning, action) tuples collected from human operators completing real tasks. This establishes baseline competency: the model learns what correct action sequences look like for common workflows. However, behavior cloning alone produces a model that imitates the mean of demonstrations without understanding why certain actions are better than others.

Stage 2: Offline Reinforcement Learning (Advantage Learning). Using pre-collected trajectories (both successful and failed), the model learns to distinguish good actions from bad ones without additional environment interaction. The advantage function estimates how much better a particular action is compared to the average policy at that state. This stage is critical for sample efficiency—it extracts maximum learning signal from existing data before expensive online exploration. The model learns failure recovery patterns: what to do when a click misses, when a dialog appears unexpectedly, when a page loads slowly.

Stage 3: Online Reinforcement Learning (Environment Interaction). The model interacts with live environments (real operating systems, real applications) and receives reward signals based on task completion. This stage handles the distribution shift between demonstration data and real-world conditions—applications update, screen resolutions vary, timing differs. Online RL fine-tunes the policy to handle edge cases that never appeared in demonstrations, producing robust behavior under novel conditions.

This three-stage pipeline—SFT → Offline RL → Online RL—progressively builds from imitation to understanding to adaptation. Each stage addresses a specific limitation of the previous one.

Think-Act-Verify Loop

Unlike open-loop systems that predict an action and immediately move to the next step, Mano-P implements a closed-loop mechanism:

┌─────────┐     ┌─────────┐     ┌─────────┐     ┌──────────┐
│  THINK  │────▶│   ACT   │────▶│ VERIFY  │────▶│  THINK   │
│         │     │         │     │         │     │  (next)  │
│ Reason  │     │ Execute │     │ Confirm │     │          │
│ about   │     │ grounded│     │ expected│     │ Continue │
│ current │     │ action  │     │ outcome │     │ or retry │
│ state   │     │         │     │ achieved│     │          │
└─────────┘     └─────────┘     └─────────┘     └──────────┘

Think: The model generates explicit reasoning about the current screen state, the overall task progress, and what action should come next. This reasoning trace is not just for interpretability—it actively improves action quality by forcing the model to articulate its understanding before committing.

Act: Based on the reasoning, the model predicts and executes a grounded action (click, type, scroll, keyboard shortcut). Actions are specified in the visual coordinate space of the current frame.

Verify: After action execution, the model captures the resulting screen state and evaluates whether the expected outcome occurred. Did the button click actually navigate to the expected page? Did the text input appear in the correct field? If verification fails, the loop returns to THINK with updated context about the failure mode, enabling error recovery without human intervention.

This closed-loop architecture is what separates GUI agents from sophisticated screen scrapers. The verification step means Mano-P can handle the non-determinism of real desktop environments—network latency, animation delays, unexpected popups—without pre-programmed exception handlers.

GSPruning: Efficient Inference Without Accuracy Loss

Running a VLA model on consumer hardware requires aggressive inference optimization. Mininglamp Technology developed GSPruning (Geometric-Semantic Pruning) specifically for GUI agent workloads, addressing the unique challenge of pruning visual tokens while preserving spatial grounding accuracy.

Standard token pruning methods (attention-based, random dropping) catastrophically degrade GUI agent performance because they disrupt spatial relationships—the model can no longer accurately predict where to click if tokens representing spatial structure are removed arbitrarily.

GSPruning solves this through two complementary mechanisms:

Anchor-Based Spatial Structure Preservation. The algorithm identifies "anchor tokens"—visual tokens that serve as spatial reference points for the broader scene (window corners, toolbar boundaries, prominent UI landmarks). These anchors are never pruned, maintaining the geometric scaffold that enables accurate coordinate prediction. Remaining tokens are pruned based on redundancy with nearby anchors, ensuring spatial density stays uniform rather than creating gaps that distort coordinate mapping.

Semantic Outlier Detection. Tokens whose semantic content is highly atypical relative to their spatial neighborhood are preserved regardless of pruning pressure. A notification badge on an otherwise uniform toolbar, a highlighted menu item among gray siblings, an error message in a standard form—these semantically salient tokens carry disproportionate task-relevant information. Standard importance-based pruning often removes them (they have low attention mass because they are atypical), but GSPruning explicitly protects them.

The combined effect: 2-3x throughput improvement with minimal accuracy degradation. On a MacBook Pro M5 Pro, this translates to approximately 80 tokens/second decode speed—fast enough for real-time interactive use without cloud dependency.

Mano-Action: Bidirectional Self-Reinforcement

Mano-P's architecture includes a bidirectional data flywheel between the agent model and the action prediction component. Successfully completed tasks generate new high-quality training data for the action predictor; improved action prediction enables the agent to complete harder tasks, which generates even richer training data. This self-reinforcement mechanism means the model improves with deployment—each successful real-world task execution contributes to future capability, without requiring manual data collection or annotation.

Benchmark Performance

The architectural advantages manifest in benchmark results:

Benchmark	Mano-P (72B)	Comparison
OSWorld	58.2%	72B internal benchmark model
WebRetriever NavEval (Protocol I)	41.7	vs Gemini 2.5 Pro: 40.9, Claude 4.5 Sonnet: 31.3

The open-source release is a 4B parameter model—deliberately sized for on-device deployment rather than maximum benchmark scores. The WebRetriever Protocol I result of 41.7 on NavEval demonstrates that Mano-P outperforms Gemini 2.5 Pro (40.9) and significantly exceeds Claude 4.5 Sonnet (31.3) on real-world web navigation tasks.

Cider SDK: On-Device Quantization Engine

Running a VLA model locally requires more than model architecture innovation—it demands inference engine optimization at the hardware level. Mininglamp Technology's open-source Cider SDK provides production-grade quantization specifically tuned for Apple Silicon's Unified Memory Architecture (UMA).

W8A8 and W4A8 Activation Quantization. Cider implements weight-and-activation quantization (not weight-only) that exploits Apple Silicon's hardware integer units. W8A8 (8-bit weights, 8-bit activations) achieves approximately 12.7% prefill speedup with negligible accuracy loss. W4A8 (4-bit weights, 8-bit activations) pushes further for memory-constrained deployments.

1.4-2.2x End-to-End Speedup. Across different model configurations and hardware targets, Cider delivers 1.4-2.2x throughput improvement over naive FP16 inference. Combined with GSPruning's 2-3x token throughput gain, the full stack achieves real-time GUI agent performance on consumer laptops.

UMA-Aware Memory Management. Unlike discrete GPU systems where data must cross PCIe boundaries, Apple Silicon's unified memory allows CPU and GPU to share the same physical memory. Cider's memory allocator exploits this—model weights, KV cache, and visual features coexist in a single address space without copy overhead, reducing both latency and peak memory footprint.

The critical privacy implication: data never leaves the machine. Screen frames, task descriptions, reasoning traces, action sequences—everything stays in local memory. There is no telemetry, no cloud dependency for inference, no API calls that transmit screen content to external servers. For enterprises handling sensitive documents, financial data, or personal information, this is not a feature—it is a requirement.

When to Use Which Architecture

The choice between RPA and GUI agents is not about "old vs new"—it is about matching the automation architecture to the problem characteristics:

Dimension	RPA (Selector-Action)	GUI Agent (VLA)
Best for	Stable, high-volume, single-app workflows	Cross-app, UI-volatile, reasoning-required tasks
Failure mode	Silent breakage on UI change	Graceful degradation with error recovery
Maintenance	Linear-to-superlinear with bot count	Model update covers all tasks simultaneously
Cross-app	Requires explicit integration	Native—same model operates any application
Speed	Millisecond actions (no reasoning)	Seconds per step (perception + reasoning)
Determinism	100% deterministic (when working)	Probabilistic (verify loop adds reliability)
Setup cost	Per-workflow scripting	One model deployment, natural language tasks

RPA remains optimal for stable, high-volume, latency-sensitive workflows within a single application that rarely updates—payroll processing in legacy systems, mainframe data entry, report generation from stable internal tools. These are problems where the rigidity of selector-based automation is a feature (guaranteed determinism) rather than a bug.

GUI agents excel where RPA structurally cannot: workflows spanning multiple applications, tasks requiring visual understanding of unstructured content, environments that update frequently, and scenarios where the automation must handle unexpected states gracefully.

Architectural Convergence

The future likely involves hybrid deployments: RPA handles the stable, high-throughput inner loops while GUI agents manage the cross-application orchestration, exception handling, and dynamic adaptation layers. The architectures are complementary at the systems level, even as they compete at the individual task level.

Mano-P's open-source availability (Apache 2.0) and on-device architecture lower the barrier to evaluating where VLA-based automation fits within existing enterprise automation stacks. The 4B parameter open-source model runs on a MacBook—evaluation requires no cloud infrastructure, no API keys, no data leaving the organization.

Mano-P on GitHub — Apache 2.0, on-device GUI agent
Cider SDK on GitHub — Quantization engine for Apple Silicon

Harness Tells Your Agent What to Do. GUI Agents Let It Actually Do It.

Mininglamp — Mon, 25 May 2026 10:05:58 +0000

The Rise of Harness Engineering

Harness Engineering has become the defining conversation in AI agent development this quarter. Anthropic published "Effective Harnesses for Long-Running Agents." OpenAI released their own take on constraining agent behavior through software engineering practices. The thesis is straightforward: wrap your AI agent in a structured control layer—task routing, approval gates, verification loops, and retrospectives—so it behaves reliably over extended sessions.

The pattern makes intuitive sense. An unconstrained agent is a liability. A harnessed agent is a tool. The community has responded: open-source harness frameworks are emerging, giving teams reusable scaffolding for decision-level reliability.

But here's the question no one is asking loudly enough: after the harness decides what to do, how does the agent actually do it?

What Harness Solves

A harness framework operates at the decision layer. It answers:

What should the agent do next?
In what order should tasks execute?
When should it pause for human review?
How do we verify the outcome before moving on?

Think of it as the prefrontal cortex of your agent system—planning, sequencing, gating. Frameworks like cow-harness already provide open-source implementations of these patterns: task decomposition, approval workflows, retry logic, and audit trails.

This is genuinely valuable. Without a harness, agents hallucinate plans, skip steps, and compound errors. With one, they become predictable and auditable.

But predictable planning is not the same as reliable execution.

The Execution Gap

Consider a real scenario. Your harnessed agent determines the next action: "Open the CRM, navigate to the customer record for Acme Corp, and update the contract renewal date to June 15."

The harness has done its job. The decision is correct. The approval gate passed. Now... how does the agent physically perform this action?

Current execution methods each carry fundamental limitations:

CLI tools — Powerful but narrow. Only works for systems that expose command-line interfaces. Most enterprise software does not.

API calls — The gold standard when available. But many critical business systems—legacy ERPs, proprietary desktop apps, government portals—simply have no API. Or the API covers 20% of what the GUI exposes.

DOM manipulation — Works for web apps, breaks on desktop. Requires knowledge of the target app's internal structure. One frontend update can invalidate your selectors.

RPA scripts — The enterprise workaround. Record a macro, replay it. Brittle by nature: a single UI change—a moved button, a renamed field, a new modal dialog—breaks the entire flow. Maintenance cost scales linearly with the number of automations.

The common thread: all of these methods require a pre-existing technical interface to the target system. They assume the system was designed to be automated, or that someone has reverse-engineered a way in.

In enterprise reality, the most critical systems are often GUI-only black boxes. No API. No CLI. No stable DOM. Just a screen that a human clicks through.

This is the execution gap. Harness frameworks have nothing to say about it.

Vision-Based GUI Agents as the Execution Layer

What if the agent could interact with software the same way a human does—by looking at the screen and clicking?

That's exactly what vision-based GUI agents do:

Input: A screenshot of the current screen state
Understanding: A vision-language model identifies UI elements—buttons, text fields, menus, labels—and comprehends their spatial relationships and semantic meaning
Output: Precise mouse coordinates and keyboard actions to accomplish the intended task

The key property: zero dependency on target system internals. The agent doesn't need an API, a DOM tree, or accessibility hooks. It sees pixels and acts on them. This works across:

Web applications
Native desktop software
Remote desktop sessions
Terminal UIs
Even systems running in virtual machines

If a human can operate it by looking at a monitor, a vision-based GUI agent can too.

Putting It Together: Harness + GUI Agent

This is where the architecture becomes complete. The harness provides the brain—deciding what to do, when to pause, how to verify. The GUI agent provides the hands—executing actions on any visual interface.

Mano-P is an open-source GUI agent built for exactly this role. Developed by Mininglamp Technology under the Apache 2.0 license, Mano-P implements a Vision-Language-Action (VLA) architecture designed to serve as the execution layer in agentic systems.

The name encodes the philosophy: "Mano" is Spanish for "hand"—the part that acts. "P" stands for Private—your data never leaves the device.

Architecture: Think-Act-Verify

Mano-P operates through an inference loop that mirrors how a careful human operator works:

Think — Observe the current screen state, reason about what UI elements are present, and determine the next action
Act — Execute the precise mouse/keyboard operation
Verify — Capture the resulting screen state and confirm the action had the intended effect

This loop provides built-in error detection. If a click lands on the wrong element or a form doesn't submit, the verify step catches it immediately—enabling retry or escalation back to the harness layer.

On-Device Performance

Mano-P is designed for local execution. The quantized 4B model runs on consumer hardware:

Minimum: Apple M4 chip + 32GB RAM (Mac mini or MacBook)
Performance on M5 Pro: ~80 tokens/s decode speed

The Cider SDK provides W8A8 activation quantization, delivering approximately 12.7% prefill acceleration compared to the W8A16 baseline, and 1.4x–2.2x prefill speedup versus MLX native W4A16. This means real-time interaction with GUIs—no cloud round-trip, no latency spikes.

Benchmark Results

On the OSWorld benchmark—the standard evaluation for GUI agent capabilities across real operating system tasks—Mano-P 1.0-72B achieved a 58.2% success rate, ranking #1 among specialized GUI agent models.

For web navigation specifically, the WebRetriever Protocol I achieved a 41.7 NavEval score, demonstrating reliable multi-step web interaction.

Mano-AFK: The Full Automation Loop

To demonstrate how harness-level planning connects to GUI-level execution, Mininglamp Technology built Mano-AFK—an end-to-end autonomous development pipeline:

Natural language requirement → PRD generation → Architecture design → Code generation → Deployment → E2E testing (Mano-P's visual model drives the browser to test the deployed app) → Bug detection → Fix → Retest

This is the harness + GUI agent pattern in its most complete form. The planning layer decomposes a vague requirement into structured development phases. The GUI agent handles the parts that require visual interaction—browser testing, UI verification, visual bug detection—without any test framework dependencies.

Privacy by Design

In local execution mode, all processing happens on-device. Screenshots are captured and analyzed locally. Model inference runs locally. No data transits to external servers. For organizations handling sensitive information—financial records, medical data, classified documents—this is not a feature. It's a requirement.

Honest Limitations

No technology is universally optimal. Vision-based GUI agents have real tradeoffs:

Overhead on simple web tasks — For well-structured web applications with clean APIs or stable DOM trees, direct API calls or DOM manipulation will always be faster than screenshot-based interaction. If you have a good API, use it.

Accuracy ceiling on complex UIs — The 4B on-device model handles standard interfaces well but can struggle with extremely dense or unconventional UI layouts. The 72B model pushes accuracy significantly higher but requires more compute.

Best suited for specific scenarios:

Legacy enterprise systems with no API
Cross-platform automation spanning web and desktop
Data-sensitive workflows requiring strictly local execution
Systems where UI changes frequently (vision adapts; scripts break)
Remote desktop environments where DOM access is impossible

The right architecture uses the right tool for each target. API calls where APIs exist. DOM methods for stable web apps. And vision-based GUI agents for everything else—which, in most enterprises, is a surprisingly large surface.

Conclusion

The AI agent stack is crystallizing into two distinct layers:

The Brain — Harness frameworks that constrain, route, verify, and audit agent decisions. This is a solved problem with active open-source development.

The Hands — Execution layers that translate decisions into physical actions on real systems. For GUI-bound systems, vision-based agents are the only approach that scales without per-system integration work.

Harness tells the agent what to do. GUI agents let it actually do it. Together, they close the automation loop.

Mano-P is Apache 2.0 licensed and available on GitHub: https://github.com/Mininglamp-AI/Mano-P

Feedback and contributions welcome.⭐

Agent Execution Environments: Cloud Sandbox vs Local GUI vs Hybrid

Mininglamp — Fri, 22 May 2026 10:18:34 +0000

When teams start building AI agents, most of the early energy goes into prompts, models, and tool definitions. Which model should we use? How do we structure the tool-calling loop? What's the right retry strategy?

These are all reasonable questions. But there's another question that usually shows up late — often too late — and shapes everything else:

Where should your AI agent actually run?

The execution environment isn't just an infrastructure detail. It determines what your agent can and can't access, how sensitive data moves (or doesn't), what hardware costs look like at scale, and how much your users are willing to trust the system. Get this decision right early, and a lot of other choices fall into place naturally. Get it wrong, and you're refactoring core architecture six months in.

Let's walk through the three main approaches.

Environment 1: Cloud Sandbox

The most common starting point for agent deployment today is the cloud sandbox model. You spin up an isolated virtual machine or container in the cloud — services like E2B, Modal, or Manus handle the orchestration — and your agent operates entirely within that environment.

How it works

When a task arrives, the platform provisions a clean runtime (often in seconds). The agent gets a shell, a browser, maybe a filesystem and some pre-installed tools. It executes its plan, produces output, and the environment is torn down. From the agent's perspective, it has a full operating system to work with. From the infrastructure perspective, nothing persists between runs unless you explicitly pass state.

What it's good at

Cloud sandboxes shine when the work is web-native. Scraping, form submission, browser automation, API interactions — anything that lives on the public internet is fair game. The isolation model is also excellent for security: if an agent misbehaves or encounters a malicious input, the blast radius is contained to a throwaway VM.

Scalability is another genuine strength. You can run dozens or hundreds of concurrent agent sessions without worrying about resource contention on a shared machine. For demos, CI pipelines, and batch processing workflows, this is hard to beat.

The real constraints

The limitations become visible when your actual work isn't web-native.

Cloud agents can't open your Excel spreadsheet, interact with your internal ERP, or paste results into the desktop app your ops team uses every day. They operate on a synthetic environment — not your environment. Any data that needs to flow into the agent (files, credentials, internal documents) has to leave your machine first.

For many enterprise workflows, that data boundary is the dealbreaker. Sending sensitive customer data or internal business records to a third-party cloud runtime creates compliance exposure that legal teams won't sign off on. And even when data sensitivity isn't the concern, there's a latency and cost dimension: every session spins up a billable runtime, and for long-running tasks the economics can get uncomfortable.

Best fit

Cloud sandboxes are the right choice for: web-only automation, exploratory prototyping, public-data tasks, and workloads where horizontal scale matters more than local access.

Environment 2: Local GUI Agent

Local GUI agents work on a different model entirely. Instead of operating inside a synthetic cloud environment, the agent runs directly on a real desktop — your Mac, your Windows workstation, your on-premises server. It sees the actual screen. It interacts with actual apps. It operates in the environment where your work already lives.

How it works

The agent captures the screen (via screenshots, accessibility APIs, or both), reasons about what it sees, and produces actions — mouse clicks, keyboard input, application-specific commands. The entire loop happens locally: perception, reasoning, action, and observation.

This architecture requires more from the hardware, but it also removes entire categories of constraint. If you can do it by hand on your computer, a local GUI agent can learn to do it too.

What it's good at

The primary advantage is full environment access. Cross-application workflows — copy from a PDF, paste into a spreadsheet, trigger a report in your accounting software, email the result — are natural fits. These tasks are awkward or impossible in cloud sandboxes but routine for local agents.

Data locality is the other major win. When the model and the agent runtime both live on-device, sensitive information never leaves the machine. There's no outbound API call carrying your customer records. Compliance teams have a much easier conversation. For industries with strict data residency requirements — healthcare, finance, defense — local execution isn't just convenient, it's sometimes the only path forward.

There's also an economics angle worth noting. Local models, once running on capable hardware, cost nothing per inference. A cloud-based agent making hundreds of tool calls per session has per-token costs that add up. A local agent on good hardware has roughly fixed compute costs regardless of session count.

Mano-P's architecture: local model inference, screen perception, and action execution all happen on-device.

The real constraints

Local GUI execution has real requirements. You need hardware capable of running capable models — ideally something with a good GPU or a high-bandwidth unified memory architecture (modern Apple Silicon machines, for instance, are well-suited for this). During agent execution, the screen is occupied. If your workflow involves a human using the same machine simultaneously, you'll need to think about scheduling.

And there's a tooling maturity gap. Cloud sandbox providers have years of polished developer experience. Local GUI agent frameworks are newer, and the rough edges show. Documentation is spottier, error handling is less standardized, and debugging a "the agent clicked the wrong button" failure requires different muscle memory than debugging a web automation script.

Best fit

Local GUI agents belong in: enterprise desktop automation, privacy-sensitive workflows, cross-application tasks, long-running automations where per-inference cost matters, and any environment where data residency is non-negotiable.

Environment 3: Hybrid

The hybrid model tries to get the best of both. The most common configuration is a cloud-hosted reasoning layer (the "brain") combined with local execution capabilities (the "hands"). The model runs remotely; actions execute locally. Alternatively: a local model handles most reasoning, with cloud fallback for tasks requiring more capacity.

How it works

In the cloud-brain/local-hands pattern, tool calls route through a local daemon that has access to the desktop environment. The model sees a clean API; the local runtime translates high-level actions into actual screen interactions. In the local-brain/cloud-fallback pattern, a capable local model handles the majority of reasoning, escalating to a remote model when confidence is low or the task is out-distribution.

What it's good at

Flexibility, primarily. Teams that need to handle a wide range of task types — some web-native, some desktop-native — without maintaining two completely separate pipelines. Hybrid architectures also make it easier to right-size compute: fast local models for simple reasoning, large remote models for complex planning.

The real constraints

Complexity is the honest cost of hybrid. Two environments mean two failure domains, two latency contributions, two sets of credentials to manage. The seam between cloud reasoning and local action introduces a synchronization challenge — what happens when the cloud model issues an action that the local daemon can't execute because the target application isn't open? These edge cases are manageable, but they require deliberate design.

For teams just getting started, hybrid is often premature optimization. Pick one environment, get it working well, and evolve toward hybrid when a specific need drives it.

How to Choose: A Decision Framework

Rather than declaring a universal winner, here's a practical checklist:

Question	If Yes →	If No →
Does the task require local app access?	Local GUI	Cloud Sandbox
Is data leaving the machine a compliance concern?	Local GUI	Either
Do you need to scale to 100+ concurrent sessions?	Cloud Sandbox	Either
Is the task entirely web-based?	Cloud Sandbox	Local GUI
Do you have capable local hardware?	Local GUI viable	Cloud Sandbox
Are you building a demo or prototype?	Cloud Sandbox	Consider Local
Cross-app workflow (multiple desktop apps)?	Local GUI	Either

A simpler heuristic: if the task touches local files, local apps, or sensitive data, start with local GUI. If it's web-only and needs to scale, start with cloud sandbox. Move to hybrid when the seam becomes visible and worth engineering.

A Note on Mano-P

We've been building in this space at MiningLamp Technology with Mano-P, an open-source local GUI agent (Apache 2.0). A few specifics that might be useful context for the discussion above:

On the benchmark side, Mano-P's 72B evaluation configuration ranks #1 in the proprietary model category on OSWorld with a 58.2% task completion rate. The open-source release is the 4B quantized version, optimized for real-world on-device deployment.

OSWorld benchmark results — Mano-P 72B evaluation configuration leads the proprietary category at 58.2%. The open-source 4B version is what developers actually deploy.

On the hardware side, Mano-P 1.0-4B running on Apple M5 Pro (64GB, Cider SDK) achieves ~80 tokens/s decode with W8A16 quantization; W8A8 activation quantization speeds up prefill by ~12.7% (source: README Performance Evaluation). The minimum requirement is an M4 chip with 32GB RAM — consumer-grade hardware that makes local agent execution realistic.

The project is on GitHub if you want to dig into the architecture or try it locally: https://github.com/Mininglamp-AI/Mano-P

Why On-Device AI Is Quietly Winning Over Cloud Inference — Three Reasons You Didn't See Coming

Mininglamp — Fri, 22 May 2026 09:46:11 +0000

I noticed something odd a few months ago. Several engineers I respect — people building serious AI pipelines, not hobbyists — quietly shifted from API-based inference back toward running models locally. Not because of some principled stance. Not because they read a blog post. Because they hit real problems and local inference solved them faster than any API change could.

Nobody announced this. There was no "local AI is back" wave on Twitter. It just... happened.

That got me thinking: if experienced engineers are making this choice in silence, the reasons probably aren't the ones being loudly debated. It's not "privacy is important" in the abstract. It's specific, concrete pain points that don't make good conference talks but absolutely dictate engineering decisions.

Here are the three that actually moved the needle.

Reason 1: The Regulatory Pressure Nobody Talks About Openly

Everyone vaguely knows that GDPR exists. Fewer people have internalized what it means when your AI system processes user data through a third-party cloud endpoint.

When you send a user's screen content, text input, or behavioral data to a cloud inference API, you've just created a data transfer to a third-party processor. Under GDPR Article 28, that processor needs a Data Processing Agreement. Under GDPR Chapter V, if that server is outside the EU, you need Standard Contractual Clauses or an adequacy decision. Under China's PIPL, cross-border data transfer requires a government-filed security assessment for anything above certain thresholds.

This is not hypothetical. GDPR enforcement has been escalating steadily — the Irish DPC alone fined Meta €1.2 billion in May 2023 for EU-US data transfer violations. CCPA enforcement in California continues to expand. China's Personal Information Protection Law (PIPL), in effect since November 2021, is tightening cross-border data transfer requirements with mandatory security assessments.

Here's the trap developers fall into: your AI vendor's privacy policy is not your compliance shield.

When your application sends data to an inference API and something goes wrong, regulators look at you — the data controller — not the API provider. The fact that the API provider has good security practices is relevant but not sufficient. You still need to demonstrate lawful basis, purpose limitation, data minimization, and cross-border transfer compliance for every single inference call that processes personal data.

For applications involving GUI automation, document processing, customer service interactions, or anything that touches user-generated content — that's basically every inference call.

Running inference on-device eliminates this exposure cleanly. The data never leaves the user's hardware. There's no cross-border transfer. The DPA requirement with an AI vendor disappears. The compliance surface collapses dramatically.

I've watched legal teams add 3-6 months to product timelines trying to untangle the regulatory implications of cloud inference for EU or China deployments. On-device inference sidesteps the entire conversation. For teams that ship to regulated markets, that timeline compression is worth a lot.

[IMAGE: A diagram showing data flow comparison — cloud inference with multiple regulatory checkpoints (GDPR, CCPA, PIPL) vs. on-device inference where data stays local]

Reason 2: Latency Isn't Just About Speed — It's About Determinism

The average latency numbers for cloud inference look reasonable. Sub-200ms for most major providers, often well under 100ms for smaller models. When someone benchmarks cloud inference, those are the numbers they publish.

The number that actually matters for production systems is P99. Or even P99.9.

Cloud inference latency is variable in ways that are difficult to predict and nearly impossible to bound. A 50ms average can have a 2000ms P99 due to cold starts, regional capacity fluctuations, network path changes, or provider-side throttling. This isn't a criticism of cloud providers — it's inherent to shared infrastructure at scale.

For many applications, this variability is fine. A chatbot that occasionally takes 2 seconds instead of 0.2 seconds is annoying but functional.

For GUI automation agents, variability kills reliability.

When an agent is navigating a UI — clicking buttons, reading screen state, deciding what to do next — it's executing a feedback loop. Each inference call determines the next action, which changes the screen state, which feeds back into the next inference call. The entire loop depends on predictable timing. If one inference step takes 20x longer than expected, the agent may be acting on stale screen state, may miss UI transitions, or may time out waiting for an action to complete.

This isn't a latency optimization problem. It's a determinism problem. The agent needs to be able to reason about timing as part of its control logic.

On-device inference gives you P99 you can actually plan around. On Apple Silicon with appropriate quantization, you get consistent throughput that's bounded by local hardware — not by whatever is happening on a shared inference cluster on the other side of the planet. You can profile it, characterize it, and build your agent's timing assumptions around real measurements.

For GUI automation specifically, the reliability improvement from this determinism is often more impactful than the raw latency numbers suggest. We've observed this pattern repeatedly: switching from cloud inference to on-device inference doesn't just make an agent faster — it makes it work in scenarios where it was previously failing intermittently and unpredictably.

[IMAGE: A latency distribution graph comparing cloud inference (wide spread, long tail) vs. on-device inference (tight distribution, predictable P99)]

Reason 3: The Cost Crossover Most People Missed

This one requires some arithmetic, but it's worth doing.

Cloud inference pricing has been dropping steadily. For context, GPT-4-class inference that cost $0.03/1K tokens in 2023 is now available at a fraction of that from multiple providers. For many use cases, cloud inference is cheap.

But "cheap per call" and "cheap at scale" are different calculations.

Three things happened in the last 18 months that changed the math for on-device inference:

First: W4A8 and W8A8 quantization techniques matured significantly. A model running W4A8 quantization on Apple Silicon achieves quality within a few percentage points of full-precision while running at dramatically higher throughput. This isn't theoretical — it's in production, measurable, and reproducible.

Second: Apple M4 silicon arrived with a substantially improved Neural Engine and memory bandwidth profile. A 4B quantized model on Apple Silicon now achieves throughput that would have required a much larger machine a year ago.

Third: The "zero marginal cost" nature of on-device inference becomes meaningful at enterprise scale.

Here's the calculation people miss: for applications where inference is happening continuously — monitoring, automation agents, real-time assistance — the cost per hour of cloud inference adds up in a way that the per-call pricing obscures.

If you're running an autonomous agent that makes 10 inference calls per minute during active use, and a user is active for 6 hours per day, that's 3,600 inference calls per day per user. At even $0.001 per call (which is optimistic for capable models), that's $3.60/user/day — $1,314/user/year. For a B2B product with 500 users, you're looking at $657,000/year in pure inference costs, scaling linearly with usage.

The break-even against on-device depends on hardware costs and usage patterns, but for enterprise deployments with heavy inference usage, the crossover typically arrives in 12-18 months. After that point, every inference call is essentially free.

This doesn't mean on-device always wins on cost — for bursty, low-volume use cases, cloud inference is clearly more economical. But for continuous-use automation and monitoring applications, the TCO calculation has quietly flipped, and many teams haven't updated their mental model to account for it.

What This Means for Builders

None of this means cloud inference is going away. Cloud inference will remain the right choice for many workloads — burst capacity, the largest models, multi-modal tasks that require more than local hardware can provide, and anywhere the regulatory and latency considerations I've described don't apply.

But the decision is no longer "cloud by default, local if you're weird about privacy." The calculus is more nuanced now:

If you process personal data from users in the EU, California, or China, you need to do the compliance math honestly before assuming cloud inference is viable.
If you're building agent loops where timing matters, P99 latency from cloud inference may be silently causing reliability failures you're attributing to other causes.
If you have sustained, high-volume inference at enterprise scale, you may be past the cost crossover already and not realize it.

The engineers I mentioned at the start didn't arrive at local inference through ideology. They arrived through debugging. They found the compliance lawyers, the intermittent timeouts, the bills that didn't look right.

That's usually how actual engineering decisions get made.

A Project Worth Watching

One example of this shift playing out in practice: Mano-P, an open-source GUI-VLA agent from MiningLamp Technology that runs fully on-device (Apache 2.0, GitHub).

The performance numbers are interesting as a concrete data point for what on-device inference can actually deliver today: Mano-P 1.0-4B running on Apple M5 Pro (64GB, Cider SDK) achieves ~80 tokens/s decode with W8A16 quantization; enabling W8A8 activation quantization speeds up prefill by ~12.7%. The 72B evaluation configuration (not open-sourced — used for benchmarking only) reached 58.2% on the OSWorld benchmark (proprietary model category). The open-source 4B version is what developers actually deploy and run locally.

If you're building in the GUI automation or edge agent space and want to see what current hardware can actually do, it's worth a look:

brew tap Mininglamp-AI/tap && brew install mano-cua

[IMAGE: Screenshot of Mano-P running an on-device GUI task on a MacBook, showing the agent interface and live task execution]

The quiet shift I noticed among those engineers isn't a trend piece. It's just people solving real problems with the best available tools — and the best available tools for a growing set of problems now happen to run locally.

That's worth paying attention to.

The VLA Testing Pipeline in Mano-AFK: When AI Agents QA Their Own Work

Mininglamp — Thu, 21 May 2026 11:23:31 +0000

AI coding tools have gotten remarkably good at generating code. You describe what you want, and within minutes you have functions, components, even entire applications scaffolded out. But there's a question that rarely gets asked in the excitement: who tests it?

Writing code accounts for maybe 30% of shipping software. The remaining 70% — defining requirements, deploying, testing, finding bugs, fixing them, and verifying the fixes — is where most projects quietly stall. Every AI coding assistant today stops at some variation of "here's the code, good luck." The developer is still left to deploy it, test it manually, discover the bugs, explain the bugs back to the AI, wait for fixes, and re-test.

That workflow isn't autonomous development. It's autocomplete with extra steps.

The Testing Gap Nobody Talks About

Most engineering teams rely on a layered testing strategy: linting catches syntax errors, unit tests verify individual functions, and API tests confirm that endpoints return the right data. These layers are well-understood, well-automated, and widely adopted.

But here's the uncomfortable reality: all three can pass while the application is completely broken for end users.

A button's onClick handler might correctly call an API endpoint that returns valid JSON — and the unit test, API test, and linter will all report green. Meanwhile, the button itself is hidden behind a CSS overflow, or renders off-screen on mobile, or navigates to a blank page because the frontend routing is misconfigured. The backend works. The tests pass. The user sees nothing.

This is the E2E testing gap. It's the difference between "the code compiles" and "the software ships." And it's the hardest layer to automate, because it requires something most test frameworks don't have: the ability to actually look at the application and interact with it the way a human would.

Why Traditional E2E Testing Falls Short

Tools like Selenium and Playwright have been the go-to for browser-based E2E testing for years. They work by programmatically controlling a browser through DOM selectors — clicking elements by their CSS class, filling inputs by their HTML id, asserting text content by XPath.

The problem is fragility. DOM-based selectors break whenever the UI changes. A designer renames a class, a framework update restructures the component tree, a developer switches from a <div> to a <button> — and the entire test suite fails, not because the application is broken, but because the selectors are stale.

This creates a maintenance burden that scales linearly with application complexity. Large teams often dedicate entire QA engineers just to keep Selenium tests from becoming red noise. Smaller teams simply skip E2E testing altogether.

There's a more fundamental issue, too. DOM-based testing can only verify what's programmatically accessible. It can check that a text node contains "Success" but it can't tell you that the success message is rendered in white text on a white background. It can verify that an image element exists but not that the image actually loaded. It operates on structure, not on what the user actually sees.

VLA: Giving Agents Eyes

Vision-Language-Action (VLA) models change this equation. A VLA model takes a screenshot of the application, understands what it sees through visual reasoning, and generates concrete actions — click coordinates, text input, scroll directions — based on that understanding.

The key difference from DOM-based automation: VLA operates on pixels, not selectors. It doesn't need to know that the "Submit" button is a <button class="btn-primary">. It sees a button labeled "Submit" and clicks it, exactly as a human tester would. If the button moves to a different position on the page, the VLA model still finds it. If the framework changes from React to Vue, the visual interface stays the same and the tests still work.

This makes VLA-based testing inherently more robust than selector-based approaches. But it also enables something selector-based tools fundamentally cannot do: visual validation. A VLA model can verify that a chart actually renders with the correct data, that a color-coded status indicator is the right color, that a modal overlay is visible and properly positioned. It tests what the user experiences, not what the DOM describes.

Mano-P's benchmark performance across multiple evaluation dimensions, including GUI grounding and visual understanding tasks.

The Full Pipeline: Build → Test → Fix → Repeat

Individual testing capability is useful. But the real value emerges when visual testing becomes part of a fully autonomous development pipeline — where an AI agent doesn't just write code, but also deploys it, tests it with real browser interactions, and fixes whatever breaks.

Here's what that pipeline looks like in practice:

Step 1: Requirements first. Before a single line of code is written, a structured PRD (Product Requirements Document) is generated with acceptance criteria. Every test case traces back to a specific requirement. Every bug fix maps to an AC number. This eliminates the most common failure mode of AI-generated code: "it works, but it doesn't match the intent."

Step 2: Build and deploy. Code is generated, dependencies are installed, and the application is deployed to a local development server — all without human intervention.

Step 3: Layered testing. The pipeline runs lint checks first (fast, catches syntax issues), then API tests (verifies backend logic), then E2E tests using a VLA model to open the app in a browser, navigate through user flows, and verify that the interface matches the acceptance criteria.

Step 4: Fix loop. When tests fail, the agent reads the failure report, inspects the relevant code, makes targeted fixes, re-deploys, and re-tests. This loop can run for multiple iterations — catching not just the initial bug but also regressions introduced by the fix itself.

The entire cycle — from "build me a budget tracker" to "here's your running app with a test report" — runs without human involvement.

Adversary Review: Why the Builder Shouldn't Test Itself

There's a well-known principle in software engineering: the person who writes the code shouldn't be the only one testing it. Developers have blind spots about their own work. They unconsciously avoid testing the edge cases they didn't think of during implementation.

The same principle applies to AI agents. When a single agent builds and tests, it tends to generate tests that validate its own assumptions rather than challenging them. The tests pass not because the code is correct, but because the tests are aligned with the same reasoning that produced the code.

A more robust approach uses separation of concerns:

A Build Agent writes the code, handles deployment, and fixes bugs
An Adversary Agent independently reviews the PRD and source code to find problems the builder missed
A Main Agent triages each finding through code inspection, API tests, or E2E verification

The adversary operates without knowledge of the builder's implementation decisions. It reads the requirements, reads the code, and asks: "What could go wrong that the builder didn't consider?" This catches usability gaps, data integrity issues, inconsistent behavior across features, and missing edge cases that automated tests alone would miss.

Self-Evolution: Getting Smarter Over Projects

Most AI coding tools treat every project as a fresh start. The context window resets, lessons from previous sessions are lost, and the same mistakes get repeated.

A self-evolving pipeline maintains persistent knowledge across projects through two mechanisms:

Build rules — When a bug takes multiple fix iterations to resolve, the lesson is extracted and applied to all future projects. "Always add loading states to async data fetches" isn't a generic best practice; it's a specific rule learned from a specific failure.
Preference accumulation — Layout patterns, color schemes, component choices, and architectural preferences converge over time. The tenth project reflects accumulated understanding of what the developer actually wants, not just what they described in a single prompt.

This is a meaningful shift from stateless code generation to something that develops institutional memory.

Mano-AFK: An Open-Source Implementation

At Mininglamp, we built Mano-AFK as an open-source implementation of this full pipeline. It takes a natural language description, generates a PRD with acceptance criteria, builds the application, deploys it locally, runs layered testing (lint → API → E2E → adversary review), and iterates through fix loops — up to 10 rounds — until all tests pass or a detailed report is generated.

The E2E testing layer is powered by Mano-P, Mininglamp's on-device VLA model. Mano-P runs entirely on local hardware — the 4B quantized model achieves 76 tokens/s decode speed on an M4 Pro with just 4.3 GB peak memory. No screenshots leave the device, no API keys are required, and there's zero per-test cost. It uses pure vision to understand GUI interfaces without relying on DOM parsing or accessibility trees, which means it works across web apps, desktop software, and any application with a visual interface.

Mano-P's GUI grounding benchmark results — the ability to accurately locate and interact with UI elements is foundational to reliable visual testing.

For teams that prefer cloud-based testing, Mano-AFK also supports Claude CUA as an alternative backend. The local mode with Mano-P is recommended for development workflows where privacy, latency, and cost matter.

What This Means for Development Workflows

The combination of VLA-based visual testing, adversary review, and self-evolving build rules points toward a future where "AI-assisted development" means more than code generation. It means AI agents that can participate in the full software lifecycle — including the 70% that happens after the code is written.

We're still early. VLA models aren't perfect at visual understanding, adversary review can produce false positives, and self-evolution needs many project cycles to show meaningful improvement. But the direction is clear: autonomous development pipelines that close the loop between writing code and shipping software.

Both Mano-AFK and Mano-P are open source and available on GitHub. If this approach to autonomous testing resonates with your workflow, we'd welcome you to try them out and share your experience. ⭐

Three Open-Source Projects That Turn Your Mac Into a Private AI Workstation

Mininglamp — Tue, 19 May 2026 12:05:56 +0000

The idea of running AI agents entirely on your laptop used to be a joke. A fun thought experiment you'd entertain over coffee before switching back to your cloud API dashboard and watching the bills pile up.

In 2026, it's a real workflow.

Not a demo. Not a "technically possible if you squint" proof of concept. An actual, production-grade stack where a vision-language model sees your screen, operates your apps, accelerates inference on Apple Silicon, and builds entire applications from a product spec — all without a single byte leaving your machine.

At Mininglamp Technology, we've been building toward this with three open-source projects. Each solves a distinct piece of the on-device AI puzzle. Together, they form something we think is genuinely new: a complete private AI workstation stack that runs on a Mac.

Let's walk through them.

1. Mano-P: The Agent That Sees Your Screen

Repo: github.com/Mininglamp-AI/Mano-P

Most "AI agents" are glorified API wrappers. They read text, call tools, and hope the tool's interface hasn't changed since the prompt was written. Mano-P takes a fundamentally different approach: it's a GUI-VLA (Vision-Language-Action) model that perceives your screen the way a human does — by looking at it.

Mano-P comes in two sizes:

72B (cloud/server): The full model, currently ranked #1 on OSWorld with a score of 58.2% — a significant lead over the second-place opencua-72b at 45.0%.
4B (local): A distilled model designed to run entirely on-device. On an M5 Pro, it decodes at roughly ~80 tokens/second with a peak memory footprint of just 4.3GB. It runs on M4 chips with 32GB RAM.

What makes this interesting isn't just the benchmark numbers — it's the interaction model. Mano-P doesn't need custom integrations or tool definitions. It sees buttons, text fields, menus, and dialogs the same way you do. Tell it "open Safari and find the latest Hacker News post about Rust," and it navigates the GUI visually, clicking and typing as needed.

The 72B model also includes WebRetriever, a web navigation component that scores 41.7 on NavEval — ahead of Gemini 2.5 Pro (40.9) and Claude 4.5 (31.3). Web browsing as a first-class agent capability, not an afterthought.

Why This Matters

The traditional approach to computer-use agents is brittle. You build tool adapters, maintain API schemas, and pray that the next macOS update doesn't break your Accessibility API hooks. A vision-first agent sidesteps all of that. If a human can use the app, Mano-P can use the app.

2. Cider: Inference Acceleration for Apple Silicon

Repo: github.com/Mininglamp-AI/cider

Running a 4B model at 80 tok/s on a Mac doesn't happen by accident. It requires an inference engine that actually understands Apple Silicon's hardware characteristics. That's what Cider is.

Cider is an inference acceleration SDK built specifically for Apple's M-series chips. Its key contribution is activation quantization — specifically W8A8 and W4A8 schemes — which fills a gap that MLX currently doesn't cover. MLX supports weight-only quantization (W4A16, W8A16), but activations stay in full precision. Cider quantizes both weights and activations, which unlocks substantially better throughput.

The Numbers

On an M5 Pro, Cider delivers 1.4–2.2x faster inference compared to MLX W4A16, depending on the quantization granularity you choose:

Quantization	Granularity	Speedup vs MLX W4A16
W8A8 / W4A8	Per-channel	1.8x (fastest)
W8A8 / W4A8	Per-group (gs=128)	1.5x
W8A8 / W4A8	Per-group (gs=64)	1.3x

There's a tradeoff between speed and accuracy, as you'd expect. On the CUA Benchmark (M5, 16GB), W8A16 quantization maintains 58.0% accuracy while W8A8 comes in at 54.0%. Depending on your use case, that 4-point delta may or may not matter — for many agentic workflows, the speed gain is worth it.

Why Not Just Use MLX?

This isn't about replacing MLX. MLX is excellent at what it does. But weight-only quantization hits a wall when you need both low memory and high throughput for real-time agent interactions. Activation quantization is the next lever, and right now, Cider is the open-source option that pulls it on Apple Silicon.

Think of it this way: MLX gives you the foundation. Cider fills the gap in activation quantization that lets you push throughput further on the same hardware.

3. Mano-AFK: The Autonomous App Builder

Repo: github.com/Mininglamp-AI/mano-afk

This is where things get wild.

Mano-AFK takes a PRD (Product Requirements Document) and turns it into a working application. Not a skeleton. Not boilerplate. A deployed, tested application — with zero human intervention in the loop.

Here's the pipeline:

Read the PRD — Parse requirements, extract features, identify tech stack
Write the code — Generate the full application
Deploy it — Spin up a local or containerized environment
Test it visually — Using Mano-P's vision model to actually look at the running app
Find bugs — Compare what's on screen to what the PRD specified
Fix them — Modify code, redeploy, retest

The critical piece here is step 4. Most code-generation tools "test" by running unit tests they also generated — which is roughly as useful as grading your own homework. Mano-AFK uses Mano-P's vision capabilities to perform visual testing: it loads the app, looks at the screen, and verifies that the UI actually matches the spec. A button that's supposed to be blue but renders as white? Caught. A form that submits but shows no confirmation? Caught.

This closes the loop in a way that pure code generation can't. The vision model acts as an independent quality gate that evaluates the artifact, not just the source.

What It's Good For

Mano-AFK shines for internal tools, prototypes, and MVPs where the cost of human QA exceeds the cost of iteration cycles. It's not going to replace your engineering team on a complex distributed system. But for "I need a dashboard that shows these metrics with these filters by Thursday"? It's remarkably capable.

The Stack: Model → Accelerator → Builder

Here's where the three projects become more than the sum of their parts.

┌─────────────────────────────────────────────┐
│              Your Mac (M4+ / 32GB)          │
│                                             │
│  ┌──────────┐  ┌──────────┐  ┌───────────┐ │
│  │  Mano-P  │  │  Cider   │  │ Mano-AFK  │ │
│  │  (Agent) │──│  (Accel) │──│ (Builder) │ │
│  │  4B VLA  │  │  W8A8    │  │ PRD→App   │ │
│  └──────────┘  └──────────┘  └───────────┘ │
│                                             │
│  Data stays here. Always.                   │
└─────────────────────────────────────────────┘

Mano-P provides the vision-language-action intelligence — the ability to see, understand, and act on screen content. Cider accelerates inference so that intelligence runs at interactive speeds on consumer hardware. Mano-AFK orchestrates multi-step autonomous workflows, using Mano-P as both its brain and its eyes.

The result is a stack where:

Your AI agent perceives and operates your entire desktop
Inference is fast enough for real-time interaction (not "wait 30 seconds per action" fast — actually fast)
Autonomous workflows can build, deploy, and quality-test applications without human involvement
Nothing leaves your machine. No API calls to external servers. No telemetry. No data exfiltration vectors. Your code, your screen content, your documents — they stay on your Mac.

That last point matters more than people think. Enterprise teams working with proprietary code, healthcare organizations handling patient data, legal teams reviewing confidential documents — these groups can't use cloud AI agents, period. An on-device stack isn't a nice-to-have for them. It's the only option.

Hardware Requirements

Let's be clear about what you need: Apple M4 with 32GB of RAM is the minimum for running the 4B model at usable speeds. An M5 Pro will give you the best experience. This isn't a "runs on any Mac" situation — you need the unified memory bandwidth and Neural Engine capabilities of recent Apple Silicon.

The Bigger Picture

We're not claiming this replaces cloud AI. The 72B model exists for a reason — some workloads need that scale, and running it requires serious hardware. What we are saying is that the gap between "cloud-only" and "runs on your laptop" has narrowed dramatically, and for a growing category of workflows, the on-device option is not just viable but preferable.

The three forces driving this:

Model distillation has gotten remarkably good. The 4B Mano-P retains enough capability from its 72B parent to handle real-world GUI tasks.
Apple Silicon's unified memory architecture is uniquely suited to LLM inference. High memory bandwidth + large unified pool = exactly what transformer decoding needs.
Activation quantization (via Cider) closes the remaining throughput gap. Weight-only quantization was the easy win; activation quantization is the hard one that makes real-time interaction possible.

The open-source angle matters here too. These aren't black-box binaries. You can inspect the model weights, audit the inference engine, verify that nothing phones home. For privacy-sensitive deployments, "trust us" isn't good enough. "Read the code" is.

Get Started

All three projects are released under Apache 2.0 — use them commercially, fork them, contribute back, or just kick the tires.

Mano-P (GUI Vision-Language-Action Agent): github.com/Mininglamp-AI/Mano-P
Cider (Apple Silicon Inference Acceleration): github.com/Mininglamp-AI/cider
Mano-AFK (Autonomous App Builder): github.com/Mininglamp-AI/mano-afk

If you build something with them, we'd love to hear about it. File an issue, open a PR, or just star the repos if you think this direction is worth pursuing.

The future of AI workstations isn't in the cloud. It's on your desk.

Mininglamp Technology builds AI infrastructure for enterprises. Our open-source projects focus on on-device AI agents, inference optimization, and autonomous software development. Learn more at github.com/Mininglamp-AI.

Agent vs Skill vs MCP vs Tool: The 4-Layer Stack Every AI Developer Should Know

Mininglamp — Thu, 14 May 2026 11:10:04 +0000

The Terminology Problem

The AI agent ecosystem has a vocabulary collision. "Tool" means one thing in LangChain, another in AutoGPT, and something else entirely in Claude's function-calling docs. "Skill" and "agent" are similarly overloaded—an "agent" might be a simple prompt wrapper or a fully autonomous system that books flights and deploys code. "MCP" arrived in late 2024 and added yet another term to the mix.

This matters architecturally. When layers are conflated, testing becomes harder, reuse drops, and swapping a model means rewriting half the system. A function that orchestrates 15 steps gets called a "tool." A prompt that strings together API calls gets called an "agent." The result is codebases where nothing is composable.

A 4-layer mental model resolves most of the confusion—similar to how the OSI model gave networking a shared vocabulary, or how MVC clarified web application structure. It's not a rigid specification, but a framework for making architectural discussions more productive.

The 4-Layer Stack

From bottom to top:

Layer 1: Tools — The Atoms

A tool is a single, stateless function that performs one atomic operation. It clicks a button, reads a file, calls an API, or captures a screenshot. Tools have no memory, no planning capability, and no awareness of why they're being called.

Key properties:

Deterministic (or close to it)
Testable in isolation
Composable — designed to be called by higher layers
Environment-specific — a click() on macOS differs in implementation from click() on Android, even if the interface is identical

Examples:

screenshot() — captures the current screen
click(x, y) — clicks at coordinates
read_file(path) — returns file contents
http_get(url) — fetches a URL

Tools are the smallest composable unit. They accept input, perform one action, and return a result. No side quests. The web analogy: individual HTTP endpoints. A GET /users/:id doesn't know about business logic—it fetches a row from a database and returns it.

Layer 2: MCP (Model Context Protocol) — The Connectors

MCP is a standardized transport layer for tool discovery and invocation across process boundaries. Think of it as GraphQL or gRPC for AI systems—it defines how tools are discovered, described, and called, not what they do.

Before MCP, every agent framework had its own tool integration spec. Building a tool for LangChain meant rebuilding it for AutoGPT. Building it for CrewAI meant doing it again. MCP standardizes three things:

Discovery: "What tools are available on this server?"
Schema: "What parameters does this tool accept? What does it return?"
Transport: stdio, HTTP, or WebSocket—the calling code picks the transport

MCP is about interoperability, not intelligence. An MCP server exposes tools; it never decides when to use them. The calling agent makes all decisions. An MCP server is a waiter that presents the menu and takes orders—it doesn't choose the meal.

When MCP adds value: Tools living in different processes or machines. Multiple agents or frameworks sharing the same tool set. Tool authors who want to write once and have it work across LangChain, Claude, OpenAI Assistants, and others.

When MCP adds overhead without benefit: Everything runs in-process and only one agent consumes the tools. In that case, direct function calls are simpler.

Layer 3: Skills — The Playbooks

A skill is a reusable, multi-step procedure that combines tools to accomplish a meaningful task. The web analogy: a service-layer module. A PlaceOrderUseCase orchestrates inventory checks, payment processing, and notifications—it's not a single endpoint but a choreography of endpoints.

"Fill out a web form" is a skill: it involves locating fields, typing values, handling dropdowns, scrolling, and clicking submit. Each step invokes tools, but the sequence, branching logic, and error recovery are the skill's contribution.

Examples:

"Navigate to Settings > Privacy > Clear Cache" (UI navigation)
"Search for a flight, compare prices, select the cheapest" (multi-step research)
"Read an Excel file, extract key metrics, generate a summary" (data analysis)
"Log into a service, check account status, export a report" (multi-app workflow)

Skills are portable when the underlying tool layer provides the required primitives. A "fill web form" skill works on any OS as long as click, type, and screenshot tools are available underneath.

The skill is the natural unit of reuse. A 3-line function and a 300-line multi-step workflow serve fundamentally different purposes; separating them clarifies what's testable in isolation (tools) versus what requires integration testing (skills). Skills can also be shared across agents—one agent might use a "file analysis" skill in a data pipeline context, another in a customer support context.

Layer 4: Agent — The Decision-Maker

An agent is the autonomous reasoning entity that decides what to do, when, and why. It observes the environment (via tools), reasons about the next action (via its language model), selects the appropriate skill, monitors execution, and adapts when things fail.

An agent owns:

Goal decomposition — breaking "book me a flight to Tokyo" into subtasks
Skill selection — choosing which playbook fits the current subtask
Error recovery — detecting failures and trying alternatives
Memory — tracking what's been done across a session
Termination judgment — knowing when the goal is achieved

Agents are model-powered. Replace the model, and the agent's capability ceiling changes. But in well-layered architecture, skills and tools remain valid regardless of which model drives the agent. This is the key insight: the agent is the most volatile layer (models improve quarterly), while tools and skills are the most stable (click is still click).

How the Layers Compose

Agent (decides what to do)
  ↓ selects
Skill (knows how to do it)
  ↓ invokes via
MCP (discovers and routes)
  ↓ calls
Tool (executes one atomic action)

This separation enables:

Swappable models — upgrade the agent's LLM without touching skills or tools
Portable skills — move a skill from cloud to edge by swapping the tool layer
Testable tools — unit-test each tool independently, integration-test each skill
Interoperable infrastructure — MCP means tools work with any compliant agent

A Real-World Example: Mano-P

Mano-P is Mininglamp Technology's open-source on-device GUI agent for macOS. It illustrates how the Agent and Skill layers work together in a local-first, privacy-preserving architecture.

It is pure vision-driven—understanding screens via screenshots, with no dependency on DOM trees, accessibility APIs, or HTML scraping. A local 4B-parameter model runs the entire inference loop on-device.

At the Tool layer: Screen capture, mouse click, keyboard input, scroll—all native macOS operations. No cloud calls for any action primitive.

At the Skill layer: Multi-step workflows for desktop tasks—form filling, app navigation, data extraction—compose the native tools into reliable sequences. These are packaged as mano-skill, a format callable by external orchestrators like Claude Code or OpenClaw agents.

At the Agent layer: The vision-language model observes screenshots and decides the next action autonomously. On Apple M4 + 32GB RAM, it runs at 76 tok/s using the Cider SDK (MLX inference acceleration with W8A8 activation quantization). Data never leaves the device—no screenshots uploaded to cloud APIs, no keystrokes logged remotely.

On the OSWorld benchmark, Mano-P ranked #1 in the proprietary model category with 58.2% accuracy—demonstrating that smaller local models with well-separated architecture can compete with cloud-dependent systems on real desktop tasks.

Installation:

brew tap Mininglamp-AI/tap && brew install mano-cua

Apache 2.0 licensed. Hardware requirement: Apple M4 chip + 32GB RAM.

When to Use What

Not every project needs all four layers:

Tools alone — deterministic automation with fixed sequences (cron jobs, CI pipelines, simple scripts).

Tools + MCP — tools live in different processes or machines; multiple agents share the same tool set.

Tools + MCP + Skills — multi-step workflows with conditional logic and error recovery; reusable procedures across different agents.

Full stack (Agent + Skill + MCP + Tool) — goals are ambiguous or user-specified at runtime; the environment is dynamic; autonomous operation over extended sessions is needed.

Building from the bottom up tends to work well. Get tools right first. Add MCP when interop is needed. Compose skills when workflows emerge. Add an agent when autonomous reasoning becomes necessary.

Common Architecture Smells

Patterns worth recognizing early:

Monolithic prompts — tools, skills, and orchestration logic all in one system message. Hard to test or debug individual pieces. Hard to reuse across projects.
"Tools" that maintain state — a function doing 15 things with internal state is a skill in disguise. Recognizing this improves testability and makes the codebase legible.
MCP everywhere — wrapping every in-process function call in MCP transport adds complexity without interoperability gains. MCP shines at boundaries, not within a single process.
Platform logic in skills — skills containing OS-specific code instead of delegating to tools lose portability. The fix: push platform specifics down into the tool layer where they belong.
Agent without skills — putting all multi-step logic directly in the agent's prompt creates a brittle system that breaks when the model changes or the prompt grows too long.

Summary

The 4-layer model—Tool, MCP, Skill, Agent—provides a vocabulary for answering recurring design questions:

Where does this logic belong?
What's reusable vs. environment-specific?
What can be tested in isolation?
What changes when the model is swapped?
What survives a model upgrade without modification?

These are the same separation-of-concerns questions that web development answered with MVC, service layers, and API gateways. The AI agent stack is working through equivalent patterns now. The projects that age well will be the ones with clean boundaries between layers—where upgrading the LLM doesn't require rewriting the skill library, and swapping from macOS to Linux only means changing the tool implementations.

Mano-P is open-source at github.com/Mininglamp-AI/Mano-P. If you find this useful, a ⭐ on GitHub helps the project reach more developers.

Why One Giant Model Ruling Everything Is a Bad Idea

Mininglamp — Wed, 13 May 2026 09:56:36 +0000

The Narrative Everyone Accepted Without Questioning

There's a story the AI industry has been telling itself for the past few years, and it goes something like this: bigger is better, and the biggest wins. More parameters. More data. More compute. The leaderboard rewards scale, venture capital rewards scale, and so the entire field marches in one direction — upward.

But spend enough time in the trenches — dealing with real deployment constraints, real failure modes, and real questions about who controls what — and this narrative starts to look, at best, incomplete.

What if scaling up is only half the story? What if the other half — scaling out — is not just a fallback for teams who can't afford the big model, but a fundamentally different architecture that solves problems the monolithic approach structurally cannot?

The Internet Is Changing at the Infrastructure Level

Here's something that doesn't get discussed enough: the internet itself is undergoing a quiet paradigm shift.

The old internet was designed to connect human attention. Search engines, social feeds, recommendation algorithms — they all competed for the same scarce resource: the roughly 16 waking hours each person has per day. The entire ad-tech economy was built on this bottleneck.

The emerging internet connects agent compute. Software agents don't sleep. They don't get bored. They don't have a finite attention span that advertisers fight over. When AI agents become the primary consumers and producers of internet traffic — not just humans browsing pages — the architecture of the network itself needs to change.

This isn't a distant future. It's already happening. API calls between services are growing faster than human page views. Autonomous agents are booking meetings, writing code, filing reports, and negotiating with other agents. The internet is transitioning from a human attention marketplace to an agent cooperation network.

And this transition raises a profound question: should that cooperation network be controlled by a single model, or distributed across many?

Why Scaling Up Alone Is Structurally Risky

To be clear: large models are not inherently bad. They're remarkable achievements. Frontier systems demonstrate capabilities that seemed impossible five years ago. The research behind them is genuinely impressive.

But as an architectural strategy for the entire field, the "one model to rule them all" approach has structural risks that don't go away by throwing more compute at them:

Extreme centralization. Training frontier models costs hundreds of millions of dollars. Only a handful of organizations on Earth can play this game. That means the most powerful AI capabilities are concentrated in very few hands. Whatever your politics, this level of concentration should give you pause.

Black-box decision making. When a single 2-trillion-parameter model makes a decision, good luck auditing why. Interpretability research is making progress, but the field is nowhere near being able to trace a complex reasoning chain through a monolithic transformer with confidence. For high-stakes domains — medicine, law, finance — "trust me, the big model said so" isn't going to cut it.

Diminishing returns on investment. The scaling laws that powered the last generation of breakthroughs are showing signs of flattening in certain domains. Training costs are growing faster than capability gains. At some point, the next 10x in compute doesn't buy 10x in usefulness — it buys marginally better benchmark scores that don't translate to real-world value.

Single points of failure. When an entire AI strategy depends on one provider's API staying up, staying affordable, and staying aligned with the user's interests... that's one policy change away from a very bad week.

None of these are reasons to abandon large models. They're reasons to ask: is there a complementary approach?

Scaling Out: A Different Architectural Bet

An alternative gaining traction in the industry: instead of making one model infinitely large, connect many specialized models over the internet and let them cooperate on tasks.

Consider how the internet itself succeeded. It didn't win by building one giant supercomputer that everyone connects to. It won by creating a protocol that lets millions of different machines — each with their own capabilities, owners, and purposes — collaborate. The genius was in the connection, not the concentration.

Scaling Out applies the same principle to AI. Different agents, potentially running different models optimized for different tasks, coordinate over network protocols to accomplish complex goals. A planning agent delegates to a code-writing agent, which delegates to a testing agent, which reports back. Each agent is independently deployable, replaceable, and auditable.

The advantages mirror those of distributed systems in general:

Resilience. No single agent failure takes down the whole system.
Specialization. Each agent can be optimized for its specific task rather than being a jack-of-all-trades.
Auditability. The communication between agents is inspectable. The reasoning chain is explicit in the messages, not buried in hidden layers.
Accessibility. No billion-dollar GPU cluster required to participate. A well-tuned 7B model running on modest hardware can be a valuable node in an agent network.

MOA vs. MoE: The Difference That Matters

Anyone familiar with Mixture of Experts (MoE) might be thinking: "This is already solved. MoE architectures route different inputs to different expert sub-networks within a single model."

That's true, but there's a crucial distinction.

In MoE, the routing happens inside the model. It's an internal optimization — a way to make a single model more efficient. The experts share weights, share a training process, and share an operator. From the outside, it's still one black box. There's no way to inspect which expert handled a query, no way to audit the expert's reasoning independently, and no way to replace one expert without retraining the whole system.

Mixture of Agents (MOA) is architecturally different. Each agent is a separate system — potentially running a different model, operated by a different team, connected over the internet. The "routing" is explicit: an orchestrator delegates tasks to agents based on their declared capabilities, and the communication happens over observable channels.

This means:

White-box cooperation. Every message between agents is inspectable. It can be logged, audited, replayed. There's no hidden routing decision buried in a softmax layer.
Independent governance. Each agent can have its own safety constraints, access controls, and compliance requirements. A medical agent can enforce HIPAA. A financial agent can enforce SOX. These constraints don't need to be negotiated inside a single model's RLHF training.
Traceable accountability. When something goes wrong, it's possible to point to exactly which agent made which decision based on which inputs. Try doing that with a trillion-parameter monolith.
Evolvability. Swap out one agent for a better version without touching the rest of the system. Upgrade incrementally. No need for a six-month retraining cycle.

MoE is an optimization technique for building better monoliths. MOA is an architectural pattern for building systems of cooperation. They solve different problems at different levels of the stack.

The Bigger Picture: Democratized AI Research

There's one more angle that doesn't get enough attention: what happens when the barrier to contributing to AI systems is lowered?

Right now, a domain expert — a biologist, a materials scientist, a climate researcher — who wants to leverage AI for their field has limited options: (a) fine-tune someone else's foundation model if the budget allows, or (b) hope that the general-purpose model happens to know enough about the niche.

In a Scaling Out world, there's option (c): build a specialized agent for a specific domain and plug it into the network. That agent doesn't need to be a frontier model. It needs to be good at its specific thing — identifying protein structures, simulating material properties, parsing climate data — and able to communicate its results to other agents that handle the parts it can't.

This is how scientific collaboration works among humans. No single scientist knows everything. Progress happens when specialists communicate effectively. There's no reason AI-assisted research should be different.

Imagine an internet where thousands of domain-specific AI agents — each built by experts in their respective fields — cooperate on complex research problems. A genomics agent identifies candidate genes. A chemistry agent predicts binding affinities. A literature agent surfaces relevant prior work. An experiment-design agent proposes validation studies. Each one is modest in isolation. Together, they're formidable.

This isn't just a technical architecture. It's a statement about who gets to participate in the AI revolution. If the only path forward is "build a bigger model," then only the richest organizations get a seat at the table. If the path forward is "build a specialized agent and connect it," then every domain expert in the world is a potential contributor.

Where Does This Leave Us?

Scaling Out does not replace Scaling Up. Large foundation models will continue to be valuable — as general-purpose reasoning engines, as pre-training bases for fine-tuning, as components within larger agent systems. The question isn't "which one wins." It's "what's the right mix, and who decides?"

The more likely future looks less like one omniscient oracle and more like an internet of cooperating specialists. Not because distributed systems are trendy, but because the problems that actually need solving — scientific discovery, complex engineering, personalized medicine, climate adaptation — are too varied, too specialized, and too important to trust to any single system, no matter how large.

The monolithic model is a cathedral. The agent network is a bazaar. History suggests which one adapts faster.

What's your take? Is this overcomplicating things — will a sufficiently large model really handle everything? Or does the distributed approach resonate with how you think about building reliable systems? If you've experimented with multi-agent architectures, what worked and what didn't?

The HN Post That Got 1,700 Upvotes: Local AI Needs to Be the Norm.Why "Local AI" Just Became the Default for Developers

Mininglamp — Tue, 12 May 2026 09:45:13 +0000

The HN Post That Got 1,700 Upvotes: Local AI Needs to Be the Norm

In early 2025, a post titled "Local AI needs to be the norm" hit the front page of Hacker News and stayed there. It collected 1,763 upvotes and over 800 comments. No product launch, no benchmark claim, no drama — just a statement that resonated with a large number of developers simultaneously.

The comments weren't the usual HN contrarianism either. Most of them were agreements, expansions, and stories of people already running models locally for daily work. Reading through that thread felt less like a debate and more like a census.

Something shifted. This article is an attempt to understand what, why, and where it leads.

The Cloud Assumption Is Cracking

For the past two years, the default mental model for AI has been: send your data to a powerful server, get results back. OpenAI, Anthropic, Google — they all operate on this assumption. You pay per token, your data traverses the internet, and the model lives somewhere you'll never see.

This worked fine when models were enormous and consumer hardware was weak. GPT-4 at launch required infrastructure that no individual could replicate. The cloud wasn't just convenient — it was the only option.

But hardware caught up faster than most expected. Apple's M-series chips turned laptops into credible inference machines. The M4 Pro can run a 4-billion parameter quantized model at 476 tokens per second for prefill and 76 tokens per second for decode, using 4.3GB of peak memory. That's not a toy — that's production-grade speed for most interactive use cases.

Meanwhile, the model side moved just as fast. Quantization techniques (GGUF, AWQ, GPTQ) made it possible to shrink models dramatically without proportional quality loss. A well-quantized 7B model today outperforms the full-precision 13B models of 18 months ago on most practical tasks.

The gap between "what you can run locally" and "what you need from the cloud" is narrowing every quarter.

Why Developers Care About Local

The HN thread was revealing because it surfaced the actual motivations, not the marketing ones. Here's what kept coming up:

Privacy isn't paranoia. Developers working on proprietary codebases, medical data, legal documents, or internal communications can't send that to third-party APIs without violating policies, NDAs, or regulations. This isn't about tinfoil hats — it's about professional responsibility. A developer at a bank can't pipe customer data to OpenAI's API, no matter how good the model is.

Latency is UX. A local model responds in milliseconds. No network round-trip, no queue, no cold start. For code completion, text editing, or any interactive workflow, the difference between 50ms and 500ms is the difference between a tool that feels invisible and one that interrupts your flow.

Cost compounds. API pricing looks cheap per call, but it adds up. A team of 10 developers making moderate use of GPT-4 for coding assistance can easily spend $2,000-5,000/month. A local model on existing hardware costs nothing after setup. For startups and indie developers, this matters enormously.

Offline availability. Planes, trains, bad WiFi, rural areas, classified environments — there are many contexts where internet access is unreliable or prohibited. Local models work everywhere your hardware goes.

Control and reproducibility. When you run a model locally, you know exactly which version, which weights, which quantization you're using. Cloud APIs change without notice. Models get updated, deprecated, or have their behavior modified. Local inference gives you a frozen, reproducible environment.

None of these are theoretical. They're daily realities for working developers.

What's notable is that these motivations cut across experience levels and company sizes. A solo indie developer cares about cost. A staff engineer at a Fortune 500 cares about compliance. A researcher cares about reproducibility. A journalist in a hostile regime cares about privacy as a survival matter. Local AI serves all of them with the same architecture.

The Ecosystem That Made It Possible

Local AI didn't become practical because of one breakthrough. It happened because an entire ecosystem matured simultaneously:

llama.cpp made inference accessible. Georgi Gerganov's C++ implementation proved you could run large language models on consumer hardware without Python, without CUDA, without a GPU cluster. It was a proof of concept that became infrastructure.

Ollama made it approachable. Download a model, run it with one command, expose an API. Ollama did for local LLMs what Docker did for containers — it removed the setup friction that kept most developers from trying.

Apple's MLX framework brought first-party support. Apple clearly sees on-device AI as a strategic differentiator. MLX is optimized for Apple Silicon in ways that third-party frameworks can't match, and Apple Intelligence's architecture is explicitly local-first with cloud as fallback.

Hugging Face's ecosystem provided the models. The proliferation of open-weight models (Llama, Mistral, Phi, Qwen, Gemma) meant developers had real choices. Competition drove quality up and size down.

Quantization research made the math work. Papers like GPTQ, AWQ, and QuIP# showed that aggressive quantization (4-bit, even 2-bit) could preserve model quality for most practical tasks. This was the key that unlocked consumer hardware — you don't need 70B parameters if 7B quantized gets you 90% of the way there.

The result: in 2024-2025, running a competent local model went from "impressive hack" to "standard developer workflow." The HN post didn't create this trend — it named something that was already happening.

It's worth noting how fast this moved. In early 2023, running any useful model locally required a beefy NVIDIA GPU and considerable technical skill. By late 2024, a MacBook Air could run a 7B model with no configuration beyond installing Ollama. That's a two-year journey from "research project" to "commodity tool."

Apple's Bet Tells You the Direction

Apple's approach to AI is worth studying because Apple doesn't make speculative bets. They ship what they believe will be the default in 3-5 years.

Apple Intelligence is architecturally local-first. The on-device model handles most requests. Only when a task exceeds local capability does it route to Private Cloud Compute — and even then, Apple designed PCC so that data is processed in a stateless enclave that even Apple employees can't access.

This isn't just a privacy story. It's an architecture story. Apple is betting that the future of AI interaction is:

Most inference happens on-device
The cloud is a capability fallback, not the default
Users shouldn't have to think about where processing happens

The MLX framework, the Neural Engine improvements in each chip generation, the Core ML optimizations — these are multi-year, multi-billion-dollar investments. Apple doesn't spend that money on trends they think will reverse.

When the largest company in the world builds its AI strategy around local inference, that's a signal worth paying attention to.

From Local Models to Local Agents

Here's where the conversation gets interesting, and where the HN thread didn't fully go.

Running a model locally is valuable, but it's still fundamentally a chat interface. You ask, it answers. The model is a brain in a jar — it can think, but it can't act.

The next logical step is obvious: if you can run inference locally, why not run agents locally?

An agent doesn't just generate text — it perceives your screen, understands context, and takes actions. It clicks buttons, fills forms, navigates applications, moves files. The gap between "AI that tells you how to do something" and "AI that does it for you" is the gap between a language model and an agent.

Cloud-based agents have a fundamental problem: they need to see your screen. That means streaming your desktop to a remote server continuously. Every document you open, every email you read, every private message — all sent to someone else's infrastructure. Even if you trust the provider today, you're creating a surveillance surface that didn't need to exist.

Local agents solve this elegantly. The model runs on your machine. It perceives your screen locally. It acts locally. Your data never leaves your device because there's nowhere else for it to go.

This is where the "local AI as norm" argument becomes strongest. For chat and text generation, privacy concerns are manageable — you can be careful about what you paste into a prompt. But for agents that continuously observe your workflow? Local-only isn't a preference; it's a requirement for anyone who takes security seriously.

The Technical Puzzle of On-Device Agents

Building a local agent is harder than running a local chatbot. The challenges are specific:

Vision understanding. The agent needs to interpret screenshots — understand UI elements, read text, recognize buttons, comprehend layouts. This requires vision-language models that are both capable and small enough to run locally.

Action grounding. Seeing a button is different from knowing how to click it. The agent needs to map visual understanding to precise coordinates and actions. This is a harder problem than it sounds — UI elements are dynamic, vary across applications, and don't come with semantic labels accessible to the model.

Speed. An agent that takes 10 seconds to decide what to click is useless for interactive workflows. Inference needs to be fast enough that the agent feels responsive, not laggy.

Reliability. Unlike a chatbot where a bad response is just annoying, an agent that clicks the wrong button can cause real damage. Accuracy matters more when the model has agency.

These constraints push toward a specific architecture: small, fast, vision-capable models that are optimized for action prediction rather than general conversation. You don't need GPT-4-level reasoning for most UI interactions — you need precise, fast, visual understanding.

Why Vision-Only Matters

There are two approaches to building GUI agents:

Accessibility-tree based: Parse the application's DOM or accessibility API to get structured data about UI elements. Feed that structure to the model.
Vision-only: Give the model a screenshot. Let it figure out what's on screen the same way a human would — by looking.

The accessibility approach seems easier, but it's brittle. Not all applications expose clean accessibility trees. Electron apps, games, custom UI frameworks, remote desktops — they all have incomplete or missing accessibility data. You're building on an abstraction that the underlying applications don't reliably provide.

Vision-only is harder to build but more robust in deployment. If a human can see it and interact with it, a vision-based agent can too. No dependency on application internals, no platform-specific APIs, no breaking when an app updates its UI framework.

This mirrors how humans actually interact with computers. We don't read the DOM — we look at the screen and click what looks right. A vision-only agent generalizes the same way.

The Convergence

Put the pieces together:

Local inference is fast enough for interactive use
Vision-language models are small enough to run on consumer hardware
Developers want their data to stay local
Agents are the natural evolution beyond chatbots
Vision-only approaches generalize across applications

The convergence point is clear: on-device AI agents that see your screen, understand your intent, and act locally — with zero data leaving your machine.

This isn't a prediction about 2030. The hardware exists today. The models exist today. The demand — as that HN post demonstrated — has been here for a while.

Where We're Putting Our Work

At Mininglamp Technology, we've been building toward this convergence with Mano-P — an open-source, on-device GUI agent that runs locally on Mac.

Mano-P takes the vision-only approach: it perceives your screen through screenshots and executes actions directly, with no data leaving your device. On the OSWorld benchmark, it achieves 58.2% accuracy — currently ranked #1. The 4B quantized model runs on an M4 Pro at 476 tokens/s prefill and 76 tokens/s decode, with 4.3GB peak memory usage. It's licensed under Apache 2.0.

We built it because we believe the argument in that HN post is correct: local AI should be the norm. And local agents are where that norm leads.

If this direction resonates with how you think about AI tooling, the repo is open. Contributions and stars are always appreciated.

Full-Stack On-Device GUI Agent — Mano-P Model + Cider + AFK, All Open Source

Mininglamp — Wed, 06 May 2026 11:06:58 +0000

Full-Stack On-Device GUI Agent — Mano-P Model + Cider + AFK, All Open Source

Introduction

GUI automation (Computer Use Agent) is becoming a key capability in the AI agent ecosystem. However, most existing solutions rely on cloud-based inference — every screenshot captured during task execution must be uploaded to a remote server for visual understanding. This creates significant data privacy concerns, especially in enterprise and security-sensitive environments.

Today, we are officially open-sourcing the Mano-P 1.0-4B local model, the Cider inference acceleration SDK, and Mano-AFK (an end-to-end automated app builder) — bringing a complete on-device GUI agent stack to Apple Silicon.

All screenshots and task data stay on your device. No cloud APIs required.

What is Mano-P

Mano-P is an open-source GUI-VLA (Vision-Language-Action) agent designed for edge devices. "Mano" means "hand" in Spanish, and "P" stands for Private — we believe individuals and organizations should be able to create their own private AI.

Built on the full Mano technical framework (Mano Technical Report), Mano-P uses a three-stage progressive training pipeline (SFT → Offline RL → Online RL) with a think-act-verify reasoning loop to achieve high-precision GUI understanding and operation.

Benchmark results (Mano-P 1.0-72B):

OSWorld (Specialized GUI Agent Models): 58.2% success rate, ranked #1
WebRetriever Protocol I: 41.7 NavEval score

Mano-P 1.0-4B Local Model

The Mano-P 1.0-4B model runs directly on Apple Silicon devices with no internet connection required.

Hardware Requirements:

Apple M4 chip or above (Mac mini / MacBook)
32GB+ unified memory
Alternatively: Mano-P compute stick via USB 4.0

Performance (Apple M5 Pro, 64GB RAM):

W8A16: Prefill 2.839s, Decode ~80 tokens/s
W8A8 (with Cider): Prefill 2.519s, Decode ~79.5 tokens/s
~12.7% prefill speedup with Cider W8A8

Privacy: In local mode, all inference runs on-device via MLX. No screenshots or task descriptions are transmitted over the network.

Download:

🤗 HuggingFace
🪄 ModelScope

Cider — INT8 Activation Quantization SDK for MLX

Cider is an open-source inference acceleration SDK for macOS, built on Apple MLX.

Why Cider Exists

MLX's built-in quantization is weight-only: QuantizedLinear dequantizes weights to FP16 and runs FP16 GEMM. MLX does not provide a true W8A8 inference path where both weights and activations are quantized to INT8 for computation.

Cider fills this gap with custom Metal kernels that implement fused quantize-matmul-dequant primitives, exposed as MLX custom primitives with full lazy evaluation support.

Supported Modes

W8A8: INT8 symmetric weights + INT8 per-token activation quantization → TensorOps matmul2d
W4A8: INT4 packed weights + INT8 per-token activation quantization → Unpack → TensorOps

Performance (Apple M5 Pro)

End-to-end VLM acceleration: Cider W8A8 achieves 1.4x–2.2x prefill speedup vs MLX native W4A16, while maintaining comparable decode speed.

Compatibility

Cider works with any MLX model, not just Mano-P. It also provides non-invasive compatibility patches for mlx_vlm (verified on v0.4.3), fixing several issues with Qwen3-VL multi-image inference.

Conditional Compilation

INT8 TensorOps C++ extensions build only on Apple M5+. On M4 devices, Cider installs as a pure Python package with is_available() returning False. Use CIDER_FORCE_BUILD=1 to override.

Source: github.com/Mininglamp-AI/cider

Mano-AFK — End-to-End App Builder

Mano-AFK is an automated application construction pipeline powered by Mano-P. From a single natural language description, it autonomously handles:

Requirements clarification → Architecture design → Code generation → Deployment → E2E GUI testing → Bug fixing → Delivering a working application

The E2E testing phase uses Mano-P as the local visual model backend, driving real browsers for GUI automation testing. When tests fail, the system automatically locates defects, fixes code, and re-verifies — forming a complete build-test-fix loop entirely on-device.

CUA Benchmark

Test environment: Mano-P 4B on MacBook Pro M5 (16GB unified memory), 100 tasks across 5 auto-built web applications.

W8A16: 58.0% accuracy, avg 6.1 steps, ~1,253 tok/s prefill
W8A8 (Cider): 54.0% accuracy, avg 6.93 steps, ~1,453 tok/s prefill

Note: On 16GB devices, W8A8 requires storing both original and INT8 weights, nearly doubling weight memory. Memory pressure may offset prefill gains. We recommend 4GB+ free memory beyond model size for full W8A8 benefit.

Source: github.com/Mininglamp-AI/mano-afk

Getting Started

# Install CLI
brew tap Mininglamp-AI/tap
brew install mano-cua

# Set up local mode
mano-cua check
mano-cua install-sdk
mano-cua install-model

# Run locally
mano-cua run "Open Safari and search Python" --local

Open Source Roadmap

Mano-P follows a phased open-source strategy:

Phase 1 (Released): Mano-CUA Skills — for Agent enthusiasts using OpenClaw, Claude Code, etc.
Phase 2 (This Release): Local model + Cider SDK — for developers with high security requirements
Phase 3 (Coming Soon): Training methods, pruning, and quantization techniques — for developers with custom model training needs

Dual Launch! Mininglamp Technology Open-Sources Cider On-Device Inference Acceleration Framework and Mano-P On-Device Model

Mininglamp — Wed, 06 May 2026 10:05:15 +0000

Mininglamp Technology has officially open-sourced its self-developed Cider inference acceleration SDK (Software Development Kit) and the on-device GUI agent model Mano-P. Following the earlier open-sourcing of the Mano-CUA skill, this release of the Mano-P model vividly demonstrates the immense potential of on-device models in real-world business workflows. Meanwhile, the Cider framework addresses computation operators and hardware invocation mechanisms at the foundational level, empowering on-device large models to run smoothly on macOS local compute with greater efficiency and lower memory footprint.

GitHub-Mano-P
Cider SDK

Mano-P: Validating the Deployment Potential of On-Device Agents

Mano-P is Mininglamp Technology's self-developed on-device GUI-VLA agent model. It understands and operates graphical interfaces through pure vision, without relying on traditional API integrations or being limited to browser scenarios. Instead, it can directly interact with desktop software, web-based systems, and more complex graphical workflows.

Complex graphical interface interactions inherently demand robust multimodal visual understanding capabilities from the model. The model must continuously process screenshots at high frequency, precisely locate minuscule UI elements, and execute subsequent actions based on visual feedback. Under traditional cloud-based large model architectures, the token cost incurred by such high-frequency visual interactions is extraordinarily high.

In contrast, the 4B-parameter Mano-P on-device model not only achieves accuracy comparable to cloud-based large models on CUA tasks but also completely eliminates the otherwise prohibitive cloud API call costs. In fully offline local mode, all application screenshots, interaction processes, and task data are strictly confined to the user's local device, making privacy protection a matter of "physical isolation" by design.

Cider: An On-Device Inference Acceleration Framework for Apple Silicon

The core metrics that truly determine the usability of on-device models are local inference speed, hardware utilization, memory footprint, integration cost, and long-term stability. If inference speed is too slow, the AI interaction experience suffers significantly; if memory usage is too high, the model becomes difficult to deploy widely on mainstream devices; if integration costs remain prohibitive, enterprises and developers struggle to rapidly incorporate on-device capabilities into their business pipelines.

Cider was born precisely to address these challenges. As a self-developed and open-sourced SDK from Mininglamp Technology, Cider is built on the Apple MLX ecosystem, purpose-built for macOS and Apple Silicon. It precisely fills the gaps in the native MLX framework regarding activation quantization and specific tensor computation capabilities, serving as a highly efficient on-device inference framework designed for the broad open-source model ecosystem.

Currently, the native Apple MLX architecture already supports weight quantization modes such as W4A16 and W8A16. Building upon this foundation, Cider further provides W8A8 and W4A8 inference paths. Through deep integration of online activation quantization, INT8 TensorOps computation, quantized matrix multiplication, and dequantization pipelines, Cider fully unleashes the underlying computational potential of Apple Silicon, enabling open-source models not merely to "run on Mac" but to operate smoothly with higher efficiency and lower memory consumption.

In benchmark testing, Cider's operator speed in W8A8 mode achieves approximately 1.4x to 1.9x improvement over native MLX mode, with specific performance varying by Batch Size. In W4A8 mode, Cider further reduces weight memory footprint by 50% compared to W8A8 mode while matching the computational speed of native MLX's full-precision W4A16 approach in high-concurrency scenarios.

For the Qwen3-VL series of mainstream vision-language models, Cider demonstrates highly significant acceleration in end-to-end prefill scenarios. Under varying prompt lengths, compared to native MLX W8A16 mode, Cider's W8A8 PC mode delivers approximately 17% to 22% prefill speed improvement for the Qwen3-VL-4B model; for the Qwen3-VL-2B model, this speedup leaps to approximately 57% to 61%.

Additionally, Cider has performed deep optimization and non-invasive fixes for technical challenges such as RoPE position handling in multi-image inference, substantially improving inference stability for complex visual tasks. Since visual interaction tasks typically require processing longer contexts, more complex screenshot information, and denser inference requests, this magnitude of performance improvement is particularly critical for on-device VLMs and GUI agents.

Furthermore, Cider actively explores heterogeneous collaboration between the Apple Neural Engine and GPU on the M4 chip. For a long time, on-device large model inference has primarily relied on GPUs, while the potential of the Neural Engine in Apple chips has remained largely untapped. By introducing an ANE+GPU heterogeneous tensor parallelism mechanism, Cider enables both types of compute units to work in concert, achieving an additional approximately 3% to 16% acceleration in certain test scenarios.

Minimal Integration, Enabling Local Acceleration for More Open-Source Models

Cider seamlessly supports any LLM model, covering Qwen, Llama, Mistral, as well as VLM models such as Qwen3-VL, with a built-in OpenAI-compatible VLM inference service. Enterprises and developers need not rewrite model architectures—with only minimal code adaptation, integration can be achieved effortlessly.

During the prefill phase, Cider supports enabling W8A8 INT8 TensorOps to dramatically boost computation speed; during the decode phase, the framework intelligently falls back to the original weight path, effectively avoiding unnecessary additional overhead.

Whether enterprises aim to deploy highly customized local large language models within their internal networks, or developers are committed to building vertical-domain private AI application ecosystems, Cider provides a robust, reliable, and highly extensible underlying inference infrastructure.

Toward Private AI: Building Local Intelligence Infrastructure

In the past, most large model applications relied on cloud computing. Cloud-based models offer stronger scalability, but in enterprise scenarios, data transmission costs, privacy security, API call expenses, and network dependency have become issues that cannot be ignored. Particularly in scenarios involving internal systems, core business processes, sensitive interface screenshots, and task data, on-device AI brings the model closer to where data originates, reducing transmission risks while improving response speed and autonomous controllability.

By enhancing local inference efficiency, Cider brings "data never leaves the device" closer to a truly viable engineering solution. When local models achieve better inference performance, enterprises gain the confidence to explore private AI deployment across more scenarios—such as local intelligent assistants, enterprise internal Agents, offline task execution, on-device multimodal analysis, and automated workflows with high confidentiality requirements.

Going forward, Mininglamp Technology will also open-source the complete Mano-Action training methodology and related tools, helping enterprises and developers train customized GUI agent models based on their own data, or develop new training techniques on top of Mano-Action, fully empowering enterprise customization and algorithmic innovation.

Mininglamp Technology is extending its deep expertise in intelligent agents, multimodal models, and enterprise-grade AI applications further down to the foundations of underlying inference frameworks and on-device model development. We are committed to providing developers and enterprise users with a complete, out-of-the-box private AI infrastructure, enabling AI to truly achieve private deployment, low-cost operation, and trustworthy real-world implementation.