Forem: Matt Macosko

I Just Watched One Hacker Catch Up to a Trillion-Dollar Data Center

Matt Macosko — Sun, 10 May 2026 05:16:13 +0000

▶ Watch the companion video — three engines, one prompt, one MacBook

Yesterday Salvatore Sanfilippo — the guy who wrote Redis 15 years ago and ran it solo for over a decade — published a few thousand lines of C code and quietly changed what counts as possible on a personal laptop.

The project is called ds4. It's a hand-written native inference engine, Metal kernels and all, built for one specific model: DeepSeek V4 Flash. A 284-billion-parameter Mixture-of-Experts model with a 1-million-token context window. Until last week, that lived inside the kind of GPU clusters that bill more per hour than my truck.

I'm running it on the laptop I'm typing this on.

What I actually did

Today I gave the same prompt to three different AI engines. The same prompt, on the same MacBook:

"Build an animated northern lights scene in a single HTML file — mountains, pine trees, twinkling stars, and a flowing aurora."

Three engines:

DeepSeek V4 Flash running locally through ds4
Cloud Claude through my Max plan, hitting Anthropic's data center
Gemma 4 31B running locally through MLX

Then I watched what came out.

Engine	Time	Output	Hosted on
DeepSeek V4 Flash (ds4 local)	103 s	3,259 tokens	my laptop
Cloud Claude (Max plan)	192 s	~3,500 tokens	Anthropic's GPUs
Gemma 4 31B (MLX local)	131 s	1,992 tokens	my laptop

Local-first DeepSeek beat cloud Claude on raw wall-clock time. Let that one sit for a second.

What the auroras actually looked like

Three completely different interpretations of the same prompt — which is the part that surprised me.

DeepSeek's aurora was a flowing ribbon of teal and lavender drifting over a dense pine forest. The trees got a subtle green underglow from the sky. Most cohesive of the three.
Cloud Claude's went the most cinematic — magenta and teal aurora bands draped across jagged mountains, with a soft luminescent dusting along the peaks. It looks like a desktop wallpaper.
Gemma's went minimalist — a single sweeping streak of green and violet against a starry sky, with a clean line-drawing mountain silhouette. Stylized, almost graphic.

Each one runs forever. None of them needed a network call.

What this means for the cloud AI bill

I've written before about how my Claude Max plan returns roughly 40x value compared to API rates. That math is still true. Anthropic is still subsidizing the difference.

But this changes the calculation. If the cloud got expensive tomorrow — or if my data needed to stay on-device — I now have a credible escape hatch. Not "good enough for testing." Actually frontier-class. Actually long context. Actually integrated with Claude Code.

The hybrid is what I'm settling into:

claude for the hardest reasoning tasks I do all day
claude-ds4 for routine work, long-context document review, and anything I'd rather keep on this machine

The meter at the top of my screen no longer goes up for most of what I do.

Three things `ds4` does that nobody else has bundled

Asymmetric 2-bit quantization. Only the routed Mixture-of-Experts experts get compressed to 2-bit — about 90% of the weight footprint. Every quality-critical path (shared experts, attention projections, output head) stays at full precision. The result is an 81 GB file that calls tools cleanly.
KV cache moved to disk. Modern Apple SSDs are fast enough that "the KV cache must live in RAM" is just no longer true in 2026. ds4 writes session state to disk and reuses it across restarts. The first 25k-token Claude Code system prompt gets prefilled exactly once, ever, and replays from cache after that.
Pure Metal native code. No PyTorch, no TensorFlow, no llama.cpp wrapper layer. The hot path is Metal compute kernels written for one specific 284B model. M3 Max gets ~27 tokens/sec; M5 Max hits ~32.

What I'm doing next

The voice agent stack I've been building — wake-word listening, voiceprint filtering, transcripts piped straight into Claude Code — has been waiting for exactly this. Cloud Claude was good enough to test with, but routing every utterance through someone else's API meant every long agent run had a meter on it.

This week the meter goes away. Same agent, same tools, same cloned voice — running on a single laptop, off-cloud, with a context window five times longer than what I had on the Max plan.

I'll write that one up next.

For now, if you've got 128 GB of RAM and a free hour: clone antirez/ds4, run make, run ./download_model.sh q2, and see for yourself.

May 9, 2026. The day a single C file caught up to the data centers.

Companion repo: github.com/antirez/ds4 · Hugging Face weights: antirez/deepseek-v4-gguf · Model card: deepseek-ai/DeepSeek-V4-Flash · Companion video: [YouTube link TBD]

Companion video on YouTube: https://youtu.be/7l8-s8xkpms. For local-AI consulting for compliance-sensitive firms (law, medical, finance), see AirGap AI.

HumanEval on a MacBook — 81.7% pass@1, Wi-Fi off

Matt Macosko — Wed, 29 Apr 2026 18:11:49 +0000

The M5 Max MacBook Pro with 128 GB of unified memory is the first laptop that can hold a frontier-class coding agent entirely in RAM. No GPU rack. No cloud. No subscription.

I just ran HumanEval on it. Wi-Fi off the entire run.

81.7% pass@1 on the full 164-problem benchmark
Qwen 3 Coder 30B-A3B-Instruct (8-bit MLX)
14 minutes wall-clock, $0/month after the model download

YouTube walkthrough (three real problems, code streaming live, tests going green):
https://www.youtube.com/watch?v=muq7VdgxqRk

Why this number matters

The Qwen team didn't publish HumanEval scores for any Qwen3-Coder variant — they consider the benchmark saturated and went straight to agentic ones (SWE-bench Verified, BFCL, Aider-Polyglot). For the 30B variant — the one that actually fits on a laptop — there were no published HumanEval/MBPP numbers. Until this run.

I also ran MBPP (sanitized): 83.3% pass@1 on a 168-problem sample. Pass rate stable since n=120; full 427-run was impractical because a few outlier tasks induce very long model responses (10+ minutes each).

Methodology

Setting	Value
Benchmark	HumanEval — 164 Python tasks (full)
Metric	pass@1 (first attempt only)
Temperature	0.0 — deterministic
Sampling	single sample per problem, no best-of-N
Execution	Python subprocess, 10s timeout
Hardware	M5 Max MacBook Pro · 128 GB unified memory
Model	Qwen3-Coder-30B-A3B-Instruct-MLX-8bit
Network	Wi-Fi OFF the entire run
Wall clock	14 minutes

For context — Qwen3-Coder 480B's official agentic benchmarks

The Qwen team's published numbers for the 480B flagship sibling (the bigger sibling of the 30B running on this MacBook):

Benchmark	Qwen3-Coder 480B	Claude Sonnet 4	GPT-4.1
SWE-bench Verified (500-turn)	69.6	70.4	—
Terminal-Bench	37.5	35.5	25.3
BFCL-v3	68.7	73.3	62.9
Aider-Polyglot	61.8	56.4	52.4

Source: Qwen team's official blog.

Why the offline part matters

If a tool needs the internet, three things are true:

Someone else can read what you sent.
Someone else can charge you for it.
Someone else can take it away.

If the same tool runs locally, none of those are true. That's a different category of software — and for law firms, medical practices, and accountants handling client material, it's the only legal one.

Reproduce it yourself

Open-source launchers: github.com/nicedreamzapp/claude-code-local
HumanEval dataset: github.com/openai/human-eval
Hardware: any M-series MacBook with ≥32 GB RAM (128 GB Max preferred for full 8-bit weights)
Total monthly cost: $0 after the model download

For law firms, medical practices, and accountants who want help getting this stack running on their own hardware — that's what AirGap is. 14-day pilot, fixed scope, the data never leaves your machines.

— matt

Originally published at Marijuana Union. For premium vaporizers visit iNeedHemp, wholesale at Nice Dreamz, and seeds at Tribe Seed Bank. Explore the 3D cannabis marketplace at The Farmstand.

Pulling 10x My Subscription Value Out of Claude — While Quietly Building the Backup Plan

Matt Macosko — Sun, 26 Apr 2026 18:06:41 +0000

The math, visualized: every blue bar is one day's API-equivalent token consumption. The green dashed line at the bottom is what I actually paid (pro-rated). April 14 alone — $454 in one day — was more than four months of subscription.

Every Sunday night I watch the meter tick toward 100% again. That's been the rhythm for months — five days of heavy work, one day of cleanup, one day of waiting for the weekly reset. I'm on Claude Max — usually the 5x tier at $100 a month — and I burn through nearly every token they give me.

Out of curiosity I ran the math last week. I'm not sure I should have.

Over the last three weeks, the tokens I've put through Claude Code added up to about $2,976 worth of API usage at Anthropic's published rates. Pro-rated, my subscription cost over that same window was about $70. One Tuesday in mid-April, I spent $454 of token-equivalent value in a single day — more than four months of subscription, in one sitting.

The math doesn't quite work. Which is exactly what I want to talk about.

The honest comparison

If I were paying per-token through the API to do the work I'm doing — coding, writing, agent loops, the whole pipeline — I'd be looking at roughly $1,000 a week at the rate I'm running. The 5x Max subscription pro-rates to about $23 a week.

That's a ~40x multiplier. The headline says 10x because I wanted to be conservative. The real number is about four times that.

Why is Anthropic eating the difference?

This is the part I keep turning over. Anthropic isn't a charity — they have GPU bills, payroll, and investors who eventually want margins. So why hand power users $1,000 of compute for $23 a week?

From where I sit, three reasons line up:

Habit formation beats price optimization. A heavy API user shops on price every quarter. A subscriber builds workflows, custom agents, deep tool integrations — and wakes up two years later unable to leave. Subscriptions are sticky. API calls are mercenary.

Predictable revenue is worth a discount. Investors love subscription math. $100/mo × 100,000 power users is a number you can model. Lumpy API usage isn't.

Land-grab while the war is hot. OpenAI, Google, and Anthropic are all fighting for the same developer seats right now. Subsidizing the heavy users means those users go tell other developers "you have to try this." I've done it myself a dozen times. That word-of-mouth is cheaper than ads.

So the subsidy is rational. It's also temporary.

The tells it's already tightening

A year ago there were no weekly limits on Claude Pro or Max. The tier was straightforward: pay, use it. Now there's a weekly cap that resets on a rolling schedule, and Opus has its own quota separated from Sonnet.

That's the rug being pulled tighter, one notch every few months. Anyone who's watched this pattern in cloud services or streaming knows where it ends — not with the deal disappearing, but with the deal getting expensive enough that you have to think before you use it. I'd rather adjust now than be surprised.

What I built quietly on the side

I love Claude Max. I'd be foolish not to. But I wouldn't bet my business on a subsidized line item with a tightening cap, so I built insurance.

My primary machine is an M5 Max MacBook Pro with 128 GB of RAM. On it I run Gemma 4 31B as my daily local driver — fast, capable enough for editing and routine code work, and it doesn't phone home. The bigger picture is a three-node mesh: the M5 as workstation, a Mac Mini as workhorse, and a small VPS as the gateway. Voice cloning, content drafting, simple agent loops — those run locally now.

They're not as smart as Opus. They don't need to be. The split: cloud Claude does the hard, frontier-grade thinking. Local models do the steady-state work — the kind that, if Anthropic doubled prices tomorrow, I wouldn't want to be paying premium rates for anyway. I'd rather pay once for hardware I own than be one pricing email away from a forced workflow change.

The honest takeaway

Right now is one of the strangest pricing windows in software history. You can rent the most capable AI on the planet for less than a decent meal out and use it like a senior engineer who never sleeps. Use it hard, build the workflows, learn everything the subsidy will let you.

The people still working at this pace in eighteen months won't be the ones who picked a side. They'll be the ones running both — getting maximum leverage from cloud Claude while making sure their work doesn't depend on it lasting.

Forty times my money's worth. I'm grateful. I'm also paying attention.

Matt Macosko runs Divine Tribe and Nice Dreamz LLC out of Northern California. He writes about local AI, ambient computing, and what it actually takes to run a small business with frontier-grade tools.

Originally published at Nice Dreamz Wholesale. For local-AI consulting for compliance-sensitive firms (law, medical, finance), see AirGap AI.

Free AI on a MacBook vs $100-a-Month Claude Code — Hexagon Shootout

Matt Macosko — Thu, 23 Apr 2026 04:32:47 +0000

▶ Watch the race on YouTube: https://www.youtube.com/watch?v=2KeTDDodE0A

April 22, 2026. Anthropic's Claude Code Max plan jumped to $100 a month. I ran a live three-way AI race on the exact same prompt — Gemma 31B local, Llama 70B local, and Claude cloud — on a single MacBook, to see how close a free local stack gets to the paid cloud. Two of three contestants finished with zero cloud calls.

If you just want the video, it's here: FREE AI on a MacBook vs Claude Cloud — Hexagon Shootout.

If you want the repo, it's here: github.com/nicedreamzapp/claude-code-local.

Keep reading for the setup, the numbers, and the three things that surprised me.

The setup — same prompt, three contestants

Hardware: M5 Max MacBook Pro, 128 GB unified memory, Apple Silicon.

Gemma 31B — local, Apple MLX, 4-bit quantized (Google's code-specialized model)
Llama 70B — local, Apple MLX, 8-bit quantized (Meta's generalist)
Claude cloud — the real Anthropic API, using Claude Code unchanged

Same prompt to every contestant:

Build a single HTML file with inline JavaScript that shows a ball bouncing inside a rotating hexagon. Include gravity and realistic bounce physics.

Simple enough that the answer should be a few kilobytes of code. Interesting enough that it exposes how well a model handles real math — collision detection against rotating geometry, energy conservation, boundary clamping. When models trip, they trip here.

Every run was recorded end-to-end with a live stats panel: elapsed seconds, output bytes, tokens-per-second. No cherry-picking, no post-hoc edits to the physics code, no "here's what it SHOULD have said." What you see is what came out.

The results

Contestant	Time to ship working HTML	Tokens/sec	Cloud calls
Claude cloud	22 s	N/A (data center)	yes (via API)
Gemma 31B local	56 s	~30	zero
Llama 70B local	2:17	~11	zero

Claude cloud finished first — it's a data center somewhere. Gemma 31B finished clean in under a minute with working physics. Llama 70B took the longest and produced the most verbose output, but also landed a working demo in the end.

The headline isn't that one is "best." It's that two of the three ran with Wi-Fi that could have been off the entire time. That's the number that matters for anyone dealing with NDAs, PHI, client files, or just a flight without connectivity.

Three things that surprised me

1. Bigger isn't better when "bigger" is a generalist

I went in expecting Llama 70B to beat Gemma 31B on code quality. It's more than twice the parameter count. Gemma beat Llama cleaner and faster on this specific task.

Why: Gemma 4 is a Google model fine-tuned heavily for coding and math. Llama 3.3 70B is Meta's generalist — it's excellent at conversation, reasoning, creative writing, but it wasn't tuned to punch above its weight on HTML canvas physics.

If you're buying a local model for coding, you're better off with a 30B that's code-tuned than a 70B that's general. Don't count parameters, read the model card.

2. Claude Code's harness chokes local models

Claude Code (the CLI agent) sends a 29,000-token system prompt with 60 tool schemas in every request. That's tuned for the cloud — where a frontier model can happily chew through 30K tokens of context before even starting. On a local 70B, that prefill takes a minute or two before generation begins.

When I bypassed Claude Code and hit the MLX server directly with just the prompt, Llama 70B's wall-clock time dropped from 7+ minutes to under 2.

The tradeoff: without Claude Code's harness you lose the Write/Edit/Bash tool-use loop, so you can't use Claude Code as an agent, only as a generator. For research, benchmarking, or any single-shot prompt, direct is way faster. For actual coding sessions, the overhead is real but it's what buys you the agent loop.

3. Circle-approximation collision is the cheat code

All three models eventually produced a bouncing ball. The ones that worked used circle-approximation collision — treat the hexagon as a circle of its apothem radius for collision purposes, reflect velocity when the ball exceeds that radius, clamp the ball back to exactly inside. Five lines of math, reliable, hexagon can rotate as wildly as you want.

The ones that failed tried to do proper polygon-edge collision — compute the six edges of the rotating hexagon each frame, compute point-to-line distance for each, reflect off the appropriate edge. That's the "right" way, and it fails constantly because floating-point error lets the ball slip through edges during the rotation, and then the model doesn't know how to clamp it back.

I wouldn't have predicted this. The "simple" approximation is strictly better for the demo because it can't leak. For anything more complex than one ball, the polygon approach is necessary — but for a benchmark, approximation wins.

Who should care

Developers on laptops with 64+ GB of Apple Silicon unified memory: you can run this today, your hardware already supports it.
Anyone dealing with confidential work — lawyers, accountants, doctors, contractors handling NDAs or PHI: the cost isn't $0 vs $100, it's "does your data leave the machine" vs "does it not."
Frequent flyers and people who travel to places with bad internet: a 70B model on a laptop keeps working when the plane's Wi-Fi is $18 and throttled.
Anyone curious whether Apple's bet on unified memory was actually about AI: it was.

How to run it yourself

The repo is MIT licensed and open source. Full setup is in the README:

→ github.com/nicedreamzapp/claude-code-local

The project pairs a native-MLX Anthropic-API-compatible server with Claude Code. Point Claude Code at localhost:4000 and the official CLI talks to your local model as if it were the cloud API. Swap models with one env var. Ship code without the subscription.

Around 2,000 stars in the first month. If it's useful, a star helps.

TL;DR

Claude cloud: $100/mo, 22 seconds to a working hexagon.
Gemma 31B on my MacBook: $0, 56 seconds to a working hexagon.
Llama 70B on my MacBook: $0, 2:17 to a working hexagon.
Two of three ran with zero cloud calls.
Free AI on Apple Silicon is real, now, for a huge slice of what people use cloud APIs for.

The receipts, in video form: youtube.com/watch?v=2KeTDDodE0A

Originally published at Marijuana Union. For premium vaporizers visit iNeedHemp, wholesale at Nice Dreamz, and seeds at Tribe Seed Bank. Explore the 3D cannabis marketplace at The Farmstand.

The Era of Hunched-Over-A-Screen Computing Is Ending — Heres Whats Replacing It

Matt Macosko — Wed, 22 Apr 2026 09:54:00 +0000

Look around any coffee shop, any office, any living room. Everyone is bent forward at the same angle, staring into a glowing rectangle, with one hand on a small slab and the other on a bigger slab. The whole posture is wrong. We know it’s wrong — that’s why ergonomic chairs are a $2 billion industry — but we keep doing it because the computers we built require it.

I think we’re at the end of that era. Not because somebody invented a magic new screen. Because computing itself is finally able to leave the rectangle.

I call what’s coming ambient computing. The phrase isn’t new, but most uses of it are about smart speakers or watches — small devices that ask you to look at them too. That’s not what I mean. I mean a way of working with computers that doesn’t require you to face a screen at all. Where the machine listens, talks back, sees what you see, and the keyboard becomes optional rather than mandatory.

The pieces of it are already shipping. They just haven’t been assembled.

What ambient computing actually looks like

Sitting in a hot tub a few weeks ago, I sent a text from my phone: “find me the best rated electric guitar at this price range, screenshot it, and text it back to me.” Two minutes later my phone buzzed with the screenshot. The Mac on my desk had searched, found, captured, and sent back, while I stayed in the tub.

That’s an ambient-computing moment. No screen. No keyboard. The computer was a participant in what I was doing rather than the thing I had to stop and walk over to.

The same week, I had a hands-free coding session — speaking into the room, hearing a cloned version of my own voice narrate what the AI was doing, course-correcting verbally. No mouse. No keyboard. No screen-watching. The work got done. The AI told me when it was done. I went on with my day.

Both of these worked on hardware I already owned. A MacBook Pro on the desk. An iPhone in my pocket. The pieces that turned them into an ambient system are open source and free.

Three pieces that already exist

1. Local AI. A current MacBook Pro can run a 70-billion-parameter language model entirely on the GPU side of its unified memory. That model is good enough to write code, draft documents, summarize content, and run multi-step tool-using workflows. It does this with no internet and no subscription. The model lives on the machine; the inference happens on the machine.

The fact that this is true on consumer hardware is a recent development. It wasn’t true two years ago. And it’s the foundation of everything else.

2. On-device speech. Apple’s SFSpeechRecognizer — the same engine that powers the dictation feature in macOS — runs entirely on your Mac. You can wrap it in a continuous-listening daemon and have it transcribe everything you say into a target window, no cloud round-trip. Pair it with a local TTS engine running a cloned version of your own voice (the cloning runs on the Mac too) and you have full speech in, full speech out, neither end touching a network.

3. Phone-as-remote. iMessage on a Mac can be driven by AppleScript. That means anything your Mac can do — search, code, browse, compose — can be triggered by a text from your phone. The phone becomes a remote for the more powerful machine, and the more powerful machine handles the heavy lift while you’re somewhere else.

Stack those three together and you have a workflow where:

– You can ask the Mac to do something while you’re nowhere near it.

– You can hold a spoken conversation with it without typing or looking.

– It can produce real work — code, documents, research, video — and deliver it back to wherever you are.

That’s ambient computing. Not Siri. Not Alexa. The full deal.

Why this matters now

Two arguments. The boring one: the bodily cost of screen-and-keyboard computing is real and accumulating. Carpal tunnel, posture damage, eye strain, the chair-and-desk economy that exists to patch over the damage we’re doing to ourselves. We’ve been pretending this is fine for thirty years. It’s not.

The interesting one: ambient computing is what makes a different relationship with the machine possible. When the computer is something you face for eight hours a day, it occupies a specific role in your life — interrupt-driven, attention-stealing, mostly adversarial to whatever else you wanted to be doing. When the computer is something you talk to in passing, hand things off to, and check back on later, it occupies a completely different role. It becomes a colleague rather than a chore.

We’re not going to fully arrive there in 2026. But the building blocks are shipping in 2026, and the people who set them up now will look up in two years and realize their working life feels different.

The catch

For now, all of this requires being on a Mac. Specifically, an Apple Silicon Mac with enough unified memory to run a real model — practically, that means an M2 Max / M3 Max / M4 Pro / M5 Max with 32 GB minimum, 64 GB+ for the bigger models. That’s an expensive piece of hardware.

But it’s a piece of hardware most professionals already own, or could justify. And it’s the only piece you need. There’s no recurring AI subscription. No hosting bill. No phone-home telemetry that compromises the whole privacy story.

The gear that gets you into ambient computing is gear you might already have. You just haven’t connected the pieces yet.

What I’m building toward

The longer arc, for me, is robotics. Specifically a Lego-like modular system where you clip together small parts to build whatever the moment needs — a robot arm, a camera mount, a wheeled base — all driven by the same local AI vision system that runs everywhere else in the stack. That’s a few years out.

In the meantime, I’m shipping the parts of the system that work today. The local-AI server is open source (claude-code-local). The voice loop is open source (NarrateClaude). The browser agent is open source (browser-agent). The phone bridge is open source. The iPhone object-detection app that’s part of the same vision is on the App Store (RealTime AI Cam) for free.

If any of this resonates — if you’ve been quietly tired of being chained to a screen, or you can feel the future being built but haven’t been able to put your finger on what it is — clone something, run it, and tell me what you find. Most of the work ahead is figuring out which pieces fit where, and that’s not work I can do alone.

The era of hunched-over-a-screen is ending. The next era is being built in the open, on commodity hardware, by people who decided to stop waiting for someone else to do it.

— Matt

Part of the Nice Dreamz lineup. If you want this set up inside a firm or practice — private, on-device, no cloud — that’s AirGap AI.

Originally published at Marijuana Union. For premium vaporizers visit iNeedHemp, wholesale at Nice Dreamz, and seeds at Tribe Seed Bank. Explore the 3D cannabis marketplace at The Farmstand.

What Its Actually Like to Code By Voice — With the AI Replying In My Own Cloned Voice

Matt Macosko — Wed, 22 Apr 2026 09:53:21 +0000

The closest analogy I can give for what this feels like is having a quiet co-worker in the room who happens to sound exactly like you. You think out loud. They respond out loud. You both work on the same code. Neither of you is touching a keyboard.

It’s still a little uncanny. But it’s also the most natural way to work I’ve found in twenty-plus years of writing software.

The setup runs entirely on my MacBook. Apple’s on-device speech recognition listens for me. A local language model thinks. A cloned-voice text-to-speech says the response back. Nothing leaves the laptop. Nothing requires a network. The whole loop is on-device, and that turns out to matter for reasons I didn’t expect.

How it actually works

A compiled Swift binary wraps Apple’s SFSpeechRecognizer — the same engine that powers macOS dictation — in a continuous-listening daemon. It transcribes everything I say into the active terminal window where Claude Code is running. End-of-utterance is detected by a stability heuristic: if the recognized text stops changing for about 2.5 seconds, the recognizer treats that sentence as final and submits it.

That submission gets injected into Claude Code via AppleScript, addressed to a specific window by ID so it can’t leak into whatever else is open. Claude Code processes the request against a local language model running on MLX (Apple’s native ML framework). The response comes back as text in the terminal — and a separate launcher pipes that text into a TTS engine running a cloned version of my voice. The reply plays through the speakers. The listener auto-pauses while audio is playing, so the model’s spoken reply never gets picked up as a new prompt. Then it resumes listening for the next thing I say.

End-to-end latency, on a current MacBook, is around two to three seconds. Fast enough to feel like a conversation. Slow enough that you notice it’s a different kind of pacing than typing.

What surprises you the first hour

The first thing that surprised me is how much I already narrate while coding. The interior monologue — “okay let me look at the test, that’s failing because the path is wrong, let me grep for the constant, oh it’s in a different file, fix that here…” — turns out to be most of how I work anyway. Speaking it out loud changed nothing about my reasoning. It just routed it to a different output channel.

The second thing that surprised me is how much faster context-switching gets. When you type, you have to break to compose. When you speak, you can just keep going. “That’s done — now check the function signature in the parent class — yeah okay update the docstring to match — git status — looks good, commit it.” Five tasks, no pause, no posture change.

The third surprise is the physical difference. After half a day of voice-driven work I’m not stiff. My eyes aren’t tired. I haven’t held a clamshell wrist position for hours. There’s a real bodily cost to the way we normally use computers, and removing it feels like removing a weight you didn’t know was there.

Why on-device matters specifically for this

You can build a voice-driven coding setup with cloud APIs. Whisper for speech-in, ElevenLabs for speech-out, GPT or Claude for the brain. Many people do. The result is a tool that works great until your wifi gets weird, your API key hits a rate limit, your monthly bill arrives, or you realize you’ve just sent every word you said in front of your laptop today to three different vendors’ servers.

The on-device version doesn’t have any of those failure modes. It works on a plane. It works in a Faraday cage. It works when the rest of the internet is on fire. The bill is a one-time hardware purchase, not a perpetual subscription. And nothing — no audio, no text, no inference request — ever crosses the network. For me that’s the difference between an interesting demo and a tool I actually use day-to-day.

The cloned voice is a real thing, not a gimmick

The cloned voice is the part everyone reacts to first. When the AI reads its response in your own voice, your nervous system files it under “internal monologue” rather than “external announcement.” It’s a smoother experience than a stranger’s TTS voice and it doesn’t pull your attention the same way.

But it works because the cloning is also on-device. The voice clone trains and runs locally — Pocket TTS in my case, but other local TTS engines slot in if you have a preference. Cloud voice services would mean my own voice (and everything I make it say) is sitting on someone’s server. Not interested.

Where it falls short today

Three real limitations:

1. Domain-specific vocabulary. Apple’s recognizer is excellent at general English, less excellent at obscure software terms, library names, and acronyms. “Refactor the YOLOv8 inference loop” often comes through as “refactor the yellow vate inference loop.” The fix is a custom vocabulary file you can register with the recognizer; that closes most of the gap but takes setup time.

2. Background noise. A quiet office is fine. A coffee shop is workable. A hot tub with the jets running, surprisingly, also fine. A room with kids and a dog is harder. The continuous-listen mode is robust but not magic.

3. Long pauses. If you stop talking to think for 20 seconds, the recognizer will sometimes finalize a partial sentence that wasn’t done yet, and you have to restart it. Workable but a real friction point I’m still iterating on.

None of these are fundamental. All of them get better as the recognition models get better, which they’re doing every macOS release.

What this unlocks for me personally

Coding while pacing the room. Coding while cooking. Coding while in the hot tub. (Yes, really. The mac is on a desk; my voice carries; I check back in by walking up to the screen when something needs visual confirmation.) Holding voice work sessions that last for hours without my body breaking down.

It also lets me work in spaces that aren’t desks. Most of my best thinking happens away from a screen anyway — the keyboard part was always the bottleneck. Removing it doesn’t make me think differently. It makes the time I spend actually capturing the thinking more honest about how that thinking happens.

If you want to try it

The local-AI server is at github.com/nicedreamzapp/claude-code-local. The voice listener and dispatcher are at github.com/nicedreamzapp/NarrateClaude. Both are MIT-licensed, both run entirely on a MacBook with Apple Silicon, and both ship with double-click launchers so the install is closer to “set up an app” than “build a system.”

You will spend an evening getting it tuned to your voice and your vocabulary. After that, it just works. And the working life it produces is, in my experience, qualitatively different from screen-and-keyboard.

— Matt

Part of the Nice Dreamz lineup. If you’re a firm exploring private on-device AI, AirGap AI is the engagement I do for setting this up inside law / medical / accounting practices.

Originally published at Marijuana Union. For premium vaporizers visit iNeedHemp, wholesale at Nice Dreamz, and seeds at Tribe Seed Bank. Explore the 3D cannabis marketplace at The Farmstand.

A Field Guide to Ambient Computing — The Words for the Thing Thats Coming

Matt Macosko — Wed, 22 Apr 2026 09:53:15 +0000

I’ve been building toward something I don’t have a clean name for yet. The pieces — local AI, on-device speech, the phone-as-remote, hands-free voice loops — exist and ship today. The pattern they form, when you put them together, doesn’t have a settled vocabulary. So I’m coining one. Not because I want to invent jargon, but because a thing without a name doesn’t exist for most people, and this thing is going to exist for everyone within a few years.

Here’s the working glossary. Use these terms. Argue with them. Replace them with better ones if you can think of any. The goal is to give shape to a way of working with computers that’s already real but not yet common.

The Rectangle

Noun. The screen. Short for the glowing rectangle of damage. The thing you’re hunched in front of right now. The thing that defines posture, attention, work pace, and back pain for most people who use computers professionally.

“I want to spend less of my day inside the Rectangle.”

The Rectangle is not bad. It’s just been the only option for so long that people forgot it was an option at all rather than the default state of computing. Once you have a working alternative — and you do now — the Rectangle stops being a given. It becomes one of several places work can happen.

Off-Screen Work

Noun. Productive computing done without facing a screen. The opposite of screen work, not the absence of work. Hands-free voice coding is off-screen work. Texting your Mac from the hot tub and getting back a finished research summary is off-screen work. Listening to your AI narrate a long task in your own cloned voice while you walk around your house is off-screen work.

“Half my morning was off-screen. I shipped more than I usually do at a desk.”

Ambient Computing

Noun. Computing that happens around you instead of in front of you. The machine listens, talks back, sees what you point a camera at, and the keyboard becomes optional rather than mandatory. Ambient computing isn’t smart speakers. Smart speakers ask you to talk to a brand. Ambient computing is your own machine, doing your own work, in your own voice, in the room with you.

“I’m building toward ambient computing — a stack you can talk to, hand things off to, and check back on later.”

To Airgap (verb)

Transitive verb. To configure an AI workflow such that it runs entirely on local hardware with no outbound network traffic — making client data, prompts, and responses physically incapable of leaving the machine.

“We airgapped the firm’s drafting workflow last week. Nothing they paste into the AI hits the internet.”

This is the verb form of AirGap AI, the consulting practice I run for firms in regulated industries. It’s also the right word for what you do when you set up your own local-AI stack on a MacBook and turn the wifi off to prove it works. Both of those count.

Hand-Off Computing

Noun. A workflow where you give the computer a task, walk away, and it tells you when it’s done. Distinct from interactive computing, where you sit and wait, and from background computing, which you forget about until it crashes. In hand-off computing the machine knows you walked away, finishes the work, and notifies you back through whatever channel you set up — usually a text to your phone.

“I just hand-off it the data analysis and go make breakfast. It buzzes my phone when the report’s ready.”

The Two-Slab Posture

Noun. The body shape you assume when working with a laptop and a phone simultaneously — head tilted forward, shoulders rounded, both hands pulled in toward the body. The dominant posture of professional computing in 2026, and the source of much of the chronic pain that office workers attribute to “stress.” The Two-Slab Posture is what off-screen work makes optional.

“By 4pm every day I’m locked in the Two-Slab Posture and I can feel it in my neck.”

Backpack Supercomputer

Noun. A current-generation laptop with enough on-device compute to run frontier-class AI models locally. Specifically: an Apple Silicon MacBook Pro with 64+ GB of unified memory, of the M2 / M3 / M4 / M5 Max generation. The phrase emphasizes that this hardware fits in a backpack while delivering performance that would have required a server rack five years ago.

“My M5 Max is a backpack supercomputer. I take it to client offices and it runs a 70-billion-parameter model on the train ride.”

The Ambient Stack

Noun. The minimum set of pieces required to assemble an ambient-computing setup. As of 2026, on Apple hardware:

A local LLM running through an MLX-native server. (claude-code-local.)
A continuous-listen daemon wrapping SFSpeechRecognizer. (NarrateClaude.)
A cloned-voice TTS for spoken responses in the user’s own voice.
A phone bridge so iMessage can drive the Mac. (Custom AppleScript.)
A local browser agent for web tasks. (browser-agent.)

Five components. All open source. All run on hardware you may already own. The entire stack costs $0 in recurring fees once installed.

To Whisper-Code

Verb. To do programming work via voice, with the AI replying in the developer’s own cloned voice. Distinct from voice dictation (which still requires you to be at the screen to read the result). Whisper-coding is an end-to-end conversation about code, in audio, where the developer never has to look at the screen unless they choose to verify something visually.

“I whisper-coded the fix while pacing the kitchen. Saw the diff after lunch and it was right.”

The Cloned-Voice Loop

Noun. A feedback loop where the AI’s spoken responses are rendered in a TTS clone of the user’s own voice. This makes the response feel less like an external announcement and more like internal monologue, which the human nervous system processes more naturally and at lower cognitive load than a stranger’s voice. The loop runs on-device for the same privacy reasons as the rest of the ambient stack.

“After a week with the cloned-voice loop, hearing a stranger TTS feels jarring.”

Hot-Tub Coding

Noun. Sending a coding or research task to your Mac from your phone while not being at the Mac. Originally literal — sending the task while in a hot tub — now a generic term for any phone-driven hand-off computing session. The hallmark of hot-tub coding is that the human is doing something else entirely while the computer is doing the work the human ordered.

“That whole feature was hot-tub coded. I never sat down to write any of it.”

Local-First AI

Adjective phrase. A system architecture in which AI inference defaults to running on the user’s own device, with cloud as a fallback used only when local can’t handle the task — not the other way around. The cultural and technical opposite of cloud-first AI, which has been the default since 2019. Local-first AI is the architecture every privacy-sensitive industry is now going to need, whether they realize it yet or not.

“For NDA work, the only sane architecture is local-first AI.”

A Note On Ownership

These words don’t belong to me. If you find them useful, use them. If you build on the stack and coin a better word for something I’ve named here, I’ll switch to using yours. The point of giving things names is to make them discussable, not to lock down a vocabulary.

But I do want the pattern they describe to take hold. We have, in 2026, the technology to fundamentally change how working with computers feels — physically, mentally, ergonomically, financially. Most people don’t know it yet because the pattern doesn’t have a name they recognize. These are the names I think will help.

If any of this resonates: clone the open-source stack, assemble your own ambient setup, and tell me what you find. The next decade of computing isn’t being decided in any one company’s roadmap. It’s being decided by who shows up and starts using the parts that already exist.

— Matt

Part of the Nice Dreamz lineup. If you’re a firm that wants the ambient stack installed and air-gapped for compliance reasons, AirGap AI is the engagement I do for that.

Originally published at Marijuana Union. For premium vaporizers visit iNeedHemp, wholesale at Nice Dreamz, and seeds at Tribe Seed Bank. Explore the 3D cannabis marketplace at The Farmstand.

Your Medical Practice Is Probably Using Cloud AI on PHI Right Now — Heres the HIPAA Problem Nobody Is Talking About

Matt Macosko — Wed, 22 Apr 2026 09:27:41 +0000

Walk into any small medical practice today and ask the front-desk staff if they’ve ever pasted a chart note into ChatGPT to “rewrite this so the patient understands it” or to “summarize this lab result.” A lot of them will say yes. Some will say no but their browser history says otherwise. A few will look genuinely surprised that anyone’s asking.

Here’s what they’re not thinking about: protected health information (PHI) under HIPAA includes more than the obvious identifiers. It includes anything that, in combination, could identify a patient — symptoms plus visit date, lab values plus condition, even free-text descriptions if specific enough. Once that text leaves the practice’s network and lands in a cloud AI service, the practice has technically engaged that AI vendor as a business associate, and a Business Associate Agreement (BAA) is required. The big AI providers offer BAAs only on enterprise tiers — usually $$$$ a month. Most practices using ChatGPT or Claude on patient data have no BAA in place at all.

That’s a HIPAA breach waiting to be discovered. And it’s already everywhere.

Why this is suddenly a real problem

For years, the assumption was that nobody on staff would use a “ChatGPT” on actual patient data — it’d be obvious that PHI shouldn’t go to a third-party server. That assumption no longer holds. The tools are too useful. Practice managers, MAs, billers, NPs, even physicians are routinely pasting things in: prior-auth letters, patient instructions, summary letters to referring providers, insurance appeal templates, lab interpretations.

Each of those sessions, on the major cloud AI providers, is potentially a HIPAA-reportable event. The practice doesn’t know it. The vendor doesn’t know who the patient is. But under the regulation, the disclosure happened the moment the text crossed the network boundary without a BAA in place.

OCR enforcement actions in the past few years have been heavy on exactly this kind of “we didn’t realize we were using a third-party processor” finding. Penalties for unintentional disclosure under the HIPAA Privacy Rule start at $137 per violation and can reach $68,928 per violation depending on culpability — and “violations” can be counted per record disclosed. A single staff member pasting 30 patient summaries into ChatGPT over a quarter is, by the regulation’s math, 30 violations.

Most practices are not okay if they’re audited tomorrow.

The fix that actually works: on-device AI

The practice doesn’t have to give up AI. It just has to keep it on the practice’s own machines, where there’s no third-party processor relationship.

Modern Apple Silicon Macs — the kind a lot of practices already have at the front desk or in clinician offices — can run open-weight language models locally. The model runs in the Mac’s unified memory using Apple’s MLX framework. Prompts and responses never touch a network connection. There’s no API key, no vendor account, no outbound traffic to log or audit.

For HIPAA purposes, this changes the legal posture entirely. Software running on a covered entity’s own hardware, with no data transmission outside the entity’s secured environment, is not a third-party disclosure. It’s the same legal category as a Word document — local software processing data the practice already has lawful access to.

The HIPAA Security Rule still applies (the Mac itself needs to be physically secured, encrypted at rest, with access controls), but those are the same controls the practice already runs for its EMR workstation. No new vendor risk. No new BAA. No quarterly compliance review of the AI provider’s SOC 2 report.

What it looks like inside a practice

A typical small-practice install:

2-5 MacBooks (front desk, clinician workstations, billing). Most practices already have these.
An open-source MLX server installed on each, running a 31B or 70B language model.
A simple chat interface on the desktop. Looks and feels like ChatGPT. Behaves the same. Just doesn’t phone anywhere.
A one-page HIPAA AI Use Policy documenting that the practice’s AI tools run on-premises with no third-party data processors. This goes in the practice’s compliance binder.
An hour of staff training on what tasks make sense for the local AI vs. what should still go through the EMR.

After install, the practice’s AI usage is HIPAA-clean. Nothing to add to the BAA log. Nothing to disclose to patients. Nothing to argue about in an audit.

The specific wins for a medical practice

Patient instructions in plain language. Convert “post-op care: keep wound site dry x 5 days, rotate dressing q12h, NSAIDs prn” into a paragraph the patient will actually read. Local model, no PHI exposure.
Prior auth letters. Drafting these from chart notes is a huge time sink. Local AI can generate the first draft from the relevant note, with the chart never leaving the practice.
Insurance appeals. Same pattern. The AI sees the denial letter and the relevant clinical history; the practice’s data stays local.
Letters to referring providers. Clean, professional, fast — without sending the patient’s chart to a cloud LLM.
Patient education content customized to the practice (not the same generic handouts every other clinic uses).

None of these are dramatic. All of them are time savers worth tens of hours per month per provider. And every one of them is safe to do with on-device AI in a way that’s genuinely not safe to do with cloud AI.

What the cost actually looks like

A small practice using cloud AI properly (with a BAA-covered enterprise tier) is looking at $40-100 per user per month, plus the legal and compliance overhead of vetting the vendor and adding them to the BAA log. That’s $5,000-15,000+ per year in subscription cost for a 5-person practice, before any compliance staff time.

A one-time on-device install for the same practice runs $8,000 to $15,000 all-in (hardware aside — most practices already have the Macs). After that: zero recurring AI subscription cost. The AI runs on hardware the practice already owns, indefinitely.

The financial argument is real, but it’s secondary to the compliance argument. The compliance argument is: on-device AI is the only AI configuration that doesn’t create a HIPAA business-associate relationship. That’s not a marginal advantage. That’s a categorical difference.

Who should be looking at this now

Solo and small group practices doing primary care, behavioral health, dermatology, OB/GYN, mental health, dentistry — anywhere clinicians are tempted to use AI on chart text.
Therapy and counseling practices where session notes are particularly sensitive and where most cloud AI tools are an obvious non-starter.
Concierge / direct-pay practices where patients explicitly chose the practice for higher privacy expectations than chain medicine offers.
Practices that already had a HIPAA scare — a near-miss, an OCR letter, a malware incident — where the leadership now takes data flow questions seriously.
Any practice in California, New York, Massachusetts, or other states with privacy laws that exceed HIPAA in scope.

If the practice’s leadership doesn’t know exactly what AI tools the staff are currently using on patient text, that’s the answer to “should we look at this.” The fix is not to ban AI (it’ll go underground), it’s to give the practice an AI that doesn’t create a vendor-risk problem.

I do on-device AI installations for small medical and therapy practices — fixed-fee, one week start to finish, including the HIPAA AI Use Policy and staff training. If your practice is quietly accumulating AI usage without a clear compliance posture, this is the cleanest fix on the market.

More detail: AirGap AI — book a 15-minute call from that page and I’ll walk through whether on-device is the right fit for your specific setup.

— Matt Macosko, Nice Dreamz LLC

The open-source software the install is built on is public at github.com/nicedreamzapp/claude-code-local — you or your IT contractor can review exactly what runs on practice hardware.

Originally published at Marijuana Union. For premium vaporizers visit iNeedHemp, wholesale at Nice Dreamz, and seeds at Tribe Seed Bank. Explore the 3D cannabis marketplace at The Farmstand.

Three Generations of Running Claude Code Locally on a MacBook — What I Actually Learned

Matt Macosko — Wed, 22 Apr 2026 07:43:25 +0000

Timely update — April 22, 2026. XDA Developers reports Anthropic is A/B-testing a Pro plan that doesn’t include Claude Code — affecting ~2% of new signups, with the pricing page updated to show Claude Code unchecked in the test variant. Current Pro users keep access for now, but the signal is clear: if you want Claude Code in the cloud, the cheapest path is moving toward the $100+ Max tier. What’s below is the free local version that runs exactly the same Claude Code CLI against a model on your own Mac, no subscription needed.

I spent a weekend trying to get Claude Code running against a local model on my Mac. I ended up rewriting the whole setup three times before I had something that didn’t embarrass me on a real coding task. Here’s what went wrong and what finally worked.

The project is open source — it’s at github.com/nicedreamzapp/claude-code-local if you want to skip the story and just run it.

Why do this at all

Claude Code is great, but every call you make sends your code to Anthropic’s cloud. For a lot of what I do — NDA work, client code, things that legally shouldn’t leave the room — that’s a non-starter. I don’t want to turn the AI off, I want to run it somewhere the data doesn’t travel. On a MacBook with 128 GB of unified memory, that’s not hypothetical anymore. You can fit a 70B+ parameter model in RAM and get real work done without a single packet leaving the box.

Gen 1: Ollama + a translation proxy

The obvious first move. Ollama speaks the OpenAI API. Claude Code speaks the Anthropic API. So you write a little Python proxy in the middle that translates one to the other.

It worked. It was also painfully slow — 30 tokens/second, and a real coding task took 133 seconds. The proxy was doing two API translations per turn, and every tool call meant serializing and re-deserializing JSON twice. For anything more complex than “write a loop” the thing spent more time shuffling bytes than running inference.

Honestly, Gen 1 was mostly useful for proving the idea was sound. Nothing more.

Gen 2: llama.cpp + TurboQuant

I tried to fix the speed problem by swapping Ollama for llama.cpp and adding Google Research’s TurboQuant KV cache compression. That got me to 41 tok/s — a real improvement on the model side. But Claude Code tasks still took 133 seconds, because the proxy was the bottleneck, not the model.

This was the lesson I needed: you can’t fix a translation-overhead problem by making the translator’s client faster.

Gen 3: kill the proxy entirely

This is where things clicked. Instead of making Ollama or llama.cpp speak Anthropic’s API through a middleman, I wrote a native MLX server that speaks the Anthropic API directly.

No proxy. No translation layer. Claude Code connects to localhost:4000 thinking it’s talking to api.anthropic.com, and the server routes straight into MLX (Apple’s native Metal-based ML framework) running the model on the GPU side of unified memory.

Same Claude Code task: 17.6 seconds. That’s 7.5× faster than anything I had before, and it came from deleting code, not adding it.

Throughput with Qwen 3.5 122B (a mixture-of-experts model where only 10B of the 122B params activate per token) hits 65 tok/s on an M5 Max. That’s faster than cloud Opus and, depending on how you measure, close to cloud Sonnet. At zero dollars a month and with nothing leaving the machine.

What I didn’t expect

Three things surprised me:

1. Tool-call recovery mattered more than raw speed. Small local models sometimes emit garbled tool calls — XML syntax mixed with JSON keys, that sort of thing. Claude Code just silently retries when it can’t parse a call, and without a recovery layer you get infinite loops of “let me try that for you” that never actually run anything. Writing a parser that catches the common garbles and re-infers tool names from parameter keys turned out to be the difference between a demo and something I’d actually use.

2. The 10K-token Claude Code harness prompt is too big for local models. Claude Code sends a giant system prompt with every request that’s tuned for cloud Claude. Local models see it and often respond with “I am not able to execute this task.” I added an auto-detection that recognizes a Claude Code coding session (presence of Bash/Read/Edit/Write tools) and swaps in a 100-token version tuned for local models. Prompt token count drops by 99% and the refusal problem goes away.

3. KV cache quantization bits matter. 4-bit KV cache saves memory, but small models lose coherence on long conversations. Bumping to 8-bit (starting at token 1024) fixed the “wait, what were we doing?” drift without a meaningful memory hit.

Where I ended up

The repo ships three model options now — Gemma 4 31B (fast, fits a 64 GB Mac), Llama 3.3 70B (slowest but smartest, full 8-bit precision), and Qwen 3.5 122B (fastest throughput, MoE sparsity). Same server, same API, one env var swaps the model.

It also has a browser agent that drives Brave via Chrome DevTools (so local AI can actually do research, not just write code), an iMessage pipeline for when I’m away from the Mac, and — the part I’m proudest of — a hands-free voice loop where Apple’s on-device SFSpeechRecognizer listens and a cloned-voice TTS replies. Both halves of that loop run locally. Your voice never leaves the laptop.

If you want to try it

git clone https://github.com/nicedreamzapp/claude-code-local
cd claude-code-local
bash setup.sh

setup.sh auto-detects your RAM, picks an appropriate model, downloads it (one-time, 18-75 GB), installs the MLX server, and drops a launcher on your Desktop. Double-click and you’re coding locally.

The full source is MIT. It’s ~1000 lines of Python plus a few shell launchers. No dependencies I couldn’t audit from scratch, and zero outbound network calls from any of it (Claude Code’s own binary makes one non-blocking startup handshake to Anthropic that you can firewall off with no loss of function — documented in the README).

From where I sit, the interesting piece isn’t the speed numbers. It’s that “your code never leaves the machine” stopped being aspirational. If you’re a lawyer, an accountant, a doctor, or anyone whose work comes with confidentiality obligations, this is the version of AI coding you can actually use.

On-device isn’t just a MacBook thing

This project is one of two I’ve shipped in the on-device-AI space. The other one is RealTime AI Camera — a free iPhone app that detects all 601 object classes from Open Images V7 fully offline, at an average of 10 FPS. Every other iPhone detection app I’ve seen caps out at the standard 80 COCO classes because the bigger model is much harder to get running on-device. I spent weeks on the PyTorch → CoreML conversion, hallucination tuning across the extra 521 classes, and the memory-bandwidth bottleneck in the camera pipeline — wrote up the whole build here if that’s your lane.

Different hardware, different model size, same philosophy: your data doesn’t have to leave the device for the AI to be useful. claude-code-local is the MacBook version of that idea. RealTime AI Camera is the iPhone version.

— Matt

More of my open-source lineup: nicedreamzwholesale.com/software. If you want this set up inside a firm or practice, that’s AirGap AI — book a 15-min call there.

Originally published at Marijuana Union. For premium vaporizers visit iNeedHemp, wholesale at Nice Dreamz, and seeds at Tribe Seed Bank. Explore the 3D cannabis marketplace at The Farmstand.

This Is What a Robot Can See Now — 601 Objects, Live, Offline, on Your iPhone

Matt Macosko — Wed, 22 Apr 2026 07:43:19 +0000

Hold up a banana. Your phone says “banana.” Hold up a ukulele. It says “ukulele.” A stapler, a french horn, a goose, a CT scanner, a waffle iron — it names them all, live, at 10 frames per second, with the internet turned off.

That’s the part I keep coming back to. We’ve quietly crossed a line where a machine running on the 6-ounce thing in your pocket can recognize 601 different objects in the world around it without phoning anywhere. No cloud. No account. No waiting. That’s an extraordinary amount of sight to hand to a piece of consumer hardware, and it’s available right now, for free.

I built RealTime AI Camera to show people what that actually feels like. The app is on the App Store, it’s free, and the source is on GitHub. Point it at your kitchen and watch it label everything in real time. Most people don’t realize how far the on-device models have come until they see it happening in their own hand.

Here’s why it’s a bigger deal than it sounds — and what it took to actually build it.

80 things vs 601 things

Pretty much every iPhone object detection app you’ve ever used recognizes the same 80 things. That’s the default COCO class set that ships with YOLO — person, car, dog, cup, laptop, banana, the obvious ones. For years that’s been the practical ceiling on-device. If you wanted your app to detect anything beyond that, you sent the frame to a server, which meant goodbye offline, goodbye privacy.

RealTime AI Camera uses the much bigger Open Images V7 class set: the original 80 plus 521 more. Musical instruments by type. Animals you’ve never heard of. Kitchen appliances, tools, clothing specifics, scientific instruments, transportation, furniture subtypes. A model that understands the shape of “french horn” distinct from “tuba,” “goose” distinct from “duck,” “stapler” distinct from “hole punch” — all running on your phone, all offline.

The fact that you can walk around a house and have a device narrate the world back to you at 10 FPS without a network connection is a genuinely new thing. I keep meeting people who assume any “AI” on their phone must be sending data somewhere. Watching them flip airplane mode on and watch the app keep working is sometimes the first moment they actually believe on-device AI is real.

What the app actually looks like

Screenshots from the shipping app on iPhone — object detection, OCR, offline translation, LiDAR depth overlays.

Why 80 → 601 isn’t just “more classes”

Going from 80 classes to 601 is not a linear problem. The model has to learn all of it, and then it has to fit on an iPhone, and then it has to run fast enough to feel real-time.

Every one of those requirements individually is solvable. Stacking them together is where it gets mean.

The PyTorch → CoreML conversion

The weights started life in PyTorch. Apple doesn’t run PyTorch natively — everything on-device goes through CoreML, which means a conversion step. For an 80-class YOLO, that conversion is mostly automatic. For a 601-class YOLOv8 with Open Images V7 weights, I spent days on it. Some ops translate cleanly, some fall through into “custom layer” territory, some silently produce a model that runs but outputs garbage. The first few versions I converted ran at full FPS and detected nothing accurately — the tensors were all being reshaped wrong at inference time.

The published model is on HuggingFace: divinetribe/yolov8n-oiv7-coreml. If you want to skip the conversion pain and just use the model, that’s the shortcut I wish I’d had.

Hallucination at scale

When your class count is 7.5× bigger, you don’t just get 7.5× more possible correct detections — you get many multiples more wrong ones. The model starts seeing “musical instrument” in every shadow, “building” in every wall, “person” in every coat on a chair. At 80 classes the false-positive rate is manageable because the class boundaries are mostly clean. At 601, every low-confidence detection is a roll of the dice across a much bigger space of wrong answers.

Fixing this took iteration on three knobs: confidence threshold, NMS (non-max suppression) threshold, and per-class filtering. Some classes in Open Images V7 are just noisy — the ground truth in the training set was fuzzy to begin with. For the app, I tuned the thresholds conservatively enough that the user sees confident detections and not a constant light show of wrong guesses. That conservatism costs some recall, but the experience is night and day better than “technically detecting 601 things.”

Screens across multiple iPhones

This part I didn’t expect to be hard. SwiftUI is supposed to make responsive layout easy. In practice, running a live camera feed with overlaid bounding boxes + OCR text + LiDAR distance badges + a control strip, across iPhone SE through iPhone 17 Pro Max, is a real problem. Aspect ratios differ. The notch and Dynamic Island move. LiDAR is only on Pro models so the UI needs to gracefully degrade when it’s missing. Older iPhones can’t sustain 10 FPS on the bigger 601-class model so the app has to throttle.

Most of the hours on the “done except for…” list at the end of the project were UI hours, not ML hours. Nobody tells you that until you’ve lived it.

The bottleneck

The performance bottleneck on iPhone is not the Neural Engine — it’s memory bandwidth between the camera pipeline and the inference step. CoreML + Metal Performance Shaders + the Neural Engine are all fast. Moving pixel buffers from AVCaptureSession into a format the model wants, at 30+ FPS without dropping frames, is where you lose time. I ended up doing a lot of zero-copy plumbing — reusing the CVPixelBuffer that the camera hands you, avoiding intermediate UIImage conversions, keeping everything on-GPU through the pipeline. Average 10 FPS on the 601-class model across the iPhone lineup came from that plumbing, not from model optimization.

The shipped app has four features that all run in that same pipeline:

Object detection — YOLOv8, 601 classes, Open Images V7
English OCR — on-device printed text recognition (Vision framework)
Spanish → English translation — offline, rule-based + dictionary (no cloud translation)
LiDAR distance — per-object depth, on Pro models

All of it on one device, no network, no account, no ads.

Why this matters beyond the app

On-device AI is having a moment. Everybody’s talking about it, most shipped products still cheat by falling back to cloud when the model isn’t up to the job. RealTime AI Camera is a small proof that you can actually ship a nontrivial AI experience that never leaves the device, for free, and have it work.

This is the same thesis behind my other big open-source project, claude-code-local — running Anthropic’s Claude Code CLI against a local AI model on a MacBook, zero cloud calls, full coding experience. Different target (MacBook vs iPhone), different model sizes (31-122 billion params vs ~10 million), same philosophy: your data doesn’t need to leave the machine for the AI to be useful.

If you’re building in this space and thinking about on-device-first, I’d love to hear what you’re running into. Open an issue on the RealTimeAICam repo, or drop me a line.

Try it

Free on the App Store: RealTime AI Cam

Source on GitHub: github.com/nicedreamzapp/RealTimeAICam

Model on HuggingFace: divinetribe/yolov8n-oiv7-coreml

— Matt

Part of the Nice Dreamz Apps lineup — private, on-device AI tools. If you want this kind of thing set up inside a firm or practice, that’s AirGap AI.

Originally published at Marijuana Union. For premium vaporizers visit iNeedHemp, wholesale at Nice Dreamz, and seeds at Tribe Seed Bank. Explore the 3D cannabis marketplace at The Farmstand.

Cloud AI Coding Costs Keep Climbing — How to Pay $0 and Still Use Claude Code

Matt Macosko — Wed, 22 Apr 2026 07:42:41 +0000

Update — April 22, 2026. XDA Developers reported yesterday that Anthropic is A/B-testing a version of the Pro plan that doesn’t include Claude Code — affecting about 2% of new signups. Current Pro users keep it; the pricing page has been updated to show Claude Code unchecked for the test variant. If the test rolls out fully, the cheapest way to use Claude Code jumps from the $20 Pro tier to the $100+ Max tier. Either way, the direction is clear: the cost of cloud AI coding is going up, not down. What’s below is how to keep using Claude Code regardless.

Every few weeks there’s another headline about an AI company raising prices, tightening rate limits, or putting a favorite tool behind a higher tier. If you’re a developer who uses Claude Code or a similar agent every day, the math starts getting uncomfortable. What used to be a $20/month habit can quietly become $100/month — per seat — before you notice. Multiply by a small team and it’s a real line item.

I got tired of watching that number climb so I built a way around it. Claude Code still works great. You just run it against a local AI on your own Mac instead of the cloud.

The project is open source and free: github.com/nicedreamzapp/claude-code-local. Bash one script, double-click a launcher, and you’re coding against a local model that costs nothing to run beyond electricity.

Why the cloud bill keeps growing

Cloud AI pricing isn’t stable and shouldn’t be treated like it is. The models keep getting more capable, and each jump in capability gets repriced into a higher tier. Context windows expand and the per-token cost goes up to match. “Included in your plan” features get quietly moved up a tier. None of this is wrong — the cost to serve a frontier model is real — but the net effect on a developer’s monthly AI spend is a steady upward drift you can’t plan for.

Meanwhile, the machine on your desk has gotten radically more capable at running those same kinds of models. An M-series MacBook with 32+ GB of RAM can now comfortably run a coding model at 15–65 tokens per second. That’s real, useful inference speed. Five years ago this wasn’t possible. Today it’s ordinary.

The question stops being “can my Mac run this?” and becomes “why am I paying a subscription for something my Mac can do by itself?”

How the local setup works

Claude Code — Anthropic’s CLI coding agent — talks to api.anthropic.com by default. It sends your prompts and code to the cloud, gets responses back, runs tool calls. Fast, reliable, costs money.

The workaround is a tiny server that runs on your Mac and pretends to be the Anthropic API. Claude Code thinks it’s talking to the cloud. It’s actually talking to localhost:4000, which hands the request to an MLX-powered local model running on your Apple Silicon GPU. The response comes back formatted exactly like the real API would format it. Claude Code doesn’t know the difference.

I went through three generations of this before it actually worked well:

Gen 1 — Ollama + a translation proxy. 30 tok/s, 133 seconds per task. Worked, but slow.
Gen 2 — llama.cpp with KV cache compression + same proxy. 41 tok/s. Still 133 seconds per task because the proxy was the bottleneck, not the model.
Gen 3 — Killed the proxy entirely. Wrote a native MLX server that speaks Anthropic’s API directly. 65 tok/s and 17.6 seconds per task. 7.5× faster than anything I had before, and it came from deleting code rather than adding it.

The key insight: when you’re running everything locally, every layer of translation or RPC between Claude Code and the model is overhead that buys you nothing. Direct is always faster than proxied.

What it costs

Setup — once, ~20 minutes:

git clone https://github.com/nicedreamzapp/claude-code-local
cd claude-code-local && bash setup.sh
Double-click the launcher that lands on your Desktop.

Ongoing — $0/month. Electricity maybe adds a dollar or two to your power bill if you’re running inference constantly. That’s it. No API key. No usage tier. No rate limits. No dashboard to check.

Running Claude Code against a local model also means:

Your code never leaves your Mac. For NDA work, client files, or anything under compliance rules, this isn’t a “nice to have” — it’s the difference between being able to use AI and having to turn it off.
Works without internet. Plane, train, coffee shop with bad wifi, client office with firewall restrictions — if the machine is running, the AI is running.
No surprise bill. You won’t wake up to a $400 month because you left a loop running overnight.

The tradeoff

To be fair: local models aren’t cloud Claude. Cloud Sonnet and Opus are still more capable on really hard reasoning tasks. For 80-90% of day-to-day coding — writing functions, refactoring, explaining code, running tool calls, doing multi-step edits — the local models are good enough that I’ve stopped reaching for cloud. For the hardest 10-15% of problems, cloud still wins.

The honest recommendation: run local as your default, and keep a cloud subscription only if you’re reaching for it weekly. Most developers will find they rarely need to.

Where the ceiling is going

Local models keep getting better faster than cloud pricing falls. The 31B and 70B open-weight models available today are what people paid cloud premiums for 18 months ago. In another 18 months, what’s running on your Mac will be indistinguishable in capability from what the cloud was charging $200/month for this year.

That’s the real story. Not “cloud is bad.” Cloud is fine. It’s that local has caught up, and the cost math has flipped. If you’re a developer who cares about the bill, or a firm whose files can’t legally leave the premises, or just someone who doesn’t want to be surprised by next quarter’s pricing page — this is the version that makes sense now.

Try it: github.com/nicedreamzapp/claude-code-local. MIT license. Zero dependencies you can’t audit. Run it, or don’t, but either way you’ll have one less subscription to worry about.

— Matt

Part of the Nice Dreamz lineup. If you’re a firm (law, accounting, medical, therapy) that can’t legally put client files through cloud AI, AirGap AI is the private on-device setup I do for firms — book a 15-min call there.

Originally published at Marijuana Union. For premium vaporizers visit iNeedHemp, wholesale at Nice Dreamz, and seeds at Tribe Seed Bank. Explore the 3D cannabis marketplace at The Farmstand.

If Your Law Firm Is Using Cloud AI on Client Files, You Probably Have a Problem

Matt Macosko — Wed, 22 Apr 2026 07:42:35 +0000

Most lawyers I talk to are already quietly using AI — ChatGPT, Claude, Copilot — for drafting, research, and summarization. It’s useful. It saves hours. And in almost every case, the firm never formally decided whether it was allowed. A paralegal or an associate started, then a partner tried it, then it was everywhere, and nobody circled back to the ethics question.

Here’s the uncomfortable part: under ABA Model Rule 1.6, the duty of confidentiality applies to everything relating to a client’s representation, not just privileged communications. Anything you paste into a cloud AI service is data you’ve handed to a third-party processor. Some states have issued specific guidance warning that AI use requires the same diligence as any other vendor — data handling agreements, security audits, informed client consent.

Most firms are doing zero of that. They’re just pasting.

What “on-device AI” actually means

There is another way to run AI tools inside a firm — locally, on the machines the firm already owns, with no frame of the document ever touching a network.

This isn’t hypothetical. Apple Silicon MacBooks from roughly 2022 onward can run open-weight coding and writing models at useful speeds, entirely inside their own unified memory. No API call. No cloud round-trip. No “your data may be used to improve the service” footnote.

The technical term is air-gapped AI. Same underlying capability — large-language-model drafting, document review, summarization — minus the third-party data processor relationship. Because nothing leaves the Mac, there’s nothing to disclose to the client, nothing to document under a vendor-risk framework, nothing that can be subpoenaed from a cloud provider’s server. The surface area for a breach is the machine itself, which you already secure as part of normal firm operations.

Why this wasn’t practical a year ago

For years the “local AI” story was a lab-bench curiosity. You could technically run a language model on your laptop, but it was slow, it couldn’t follow complex instructions, and the output quality was visibly worse than what cloud tools produced. Clients would notice.

That changed faster than most firms realize. The current generation of open-weight models running on Apple Silicon produces output that’s indistinguishable from cloud Claude or GPT for most legal-adjacent tasks — drafting correspondence, summarizing long documents, extracting facts from depositions, plain-language explanation of statutes, cite checking against uploaded documents. The models have caught up. The hardware has caught up. The thing that hasn’t caught up is firms’ awareness that the option exists.

The setup, in plain English

A typical firm install looks like this:

A handful of MacBooks (or an existing one per attorney who does heavy drafting).
An open-source MLX server that runs a 31- to 70-billion-parameter model on each machine.
A CLI or chat interface that matches what the attorneys are already used to — Claude Code for tech-heavy work, a local chat app for drafting.
A one-page firm policy that documents “our AI runs on premises, does not use third-party processors, complies with [your state’s] duty of confidentiality guidance.”

The technical install is half a day per machine. The policy paperwork and a lunch-and-learn for the attorneys takes another day. After that it runs like any other piece of firm infrastructure — quietly in the background.

What it costs vs what cloud AI costs

A cloud AI subscription for a 10-attorney firm runs anywhere from $200 to $1,500 per month depending on tier and usage. Annual: $2,400 to $18,000. That’s ongoing forever, and each vendor price increase flows straight through to the firm’s overhead.

A one-time install of on-device AI across the same 10 attorneys is a fixed engagement — typically $8,000 to $15,000 all-in, hardware aside (most firms already have the Macs). After that, zero recurring AI spend. Year two onward the firm’s marginal AI cost is the electricity to run the MacBook.

The financial case is real, but it’s not actually the strongest argument. The strongest argument is: if a regulatory body ever asks how your firm handles client data in AI tools, “we run it on-device, no third-party processor” is the only answer that ends the conversation.

Who this is for

This isn’t the right fit for every firm. If your practice doesn’t handle confidential client information, or you’ve already done the vendor-risk work and your clients have signed off on cloud AI, cloud is fine and probably cheaper to start.

Where on-device is specifically the right move:

Solo and small firms handling family law, trusts/estates, criminal defense, or any practice where client files carry heavy privacy expectations.
Firms with specific-client confidentiality agreements that explicitly prohibit third-party data processors.
Regulated practice areas (healthcare-adjacent, financial-services-adjacent, government contracting) where the compliance overhead of vendor AI is genuinely disproportionate.
Firms in California, New York, Florida, Texas, or any state whose bar has issued specific AI guidance that firms haven’t actually complied with yet.

How to think about the decision

If you’re a partner reading this, the easiest diagnostic is: pull your associates and paralegals into a room and ask a yes/no question — “In the last 30 days, have you pasted client work into ChatGPT, Claude, or similar?” If the answer is anything other than a clear no, you already have an on-device-AI-worth-evaluating moment on your hands, whether you knew it or not.

The fix is not banning AI — banning it just moves the usage underground and makes it worse. The fix is giving attorneys a sanctioned AI that doesn’t create a vendor-risk problem. That’s what on-device is.

I do firm installations of exactly this setup — fixed-fee, one week start to finish, including the policy paperwork and attorney training. If your firm is quietly accumulating AI usage without a formal stance, or you’ve been waiting for the tech to get good enough to deploy safely, it’s there now.

More detail on the service: AirGap AI. Book a 15-minute call from that page and I’ll tell you in plain terms whether on-device is the right fit for your specific practice.

— Matt Macosko, Nice Dreamz LLC

The open-source setup I use for installs is public at github.com/nicedreamzapp/claude-code-local if you or your IT person want to review exactly what runs on firm hardware.

Originally published at Marijuana Union. For premium vaporizers visit iNeedHemp, wholesale at Nice Dreamz, and seeds at Tribe Seed Bank. Explore the 3D cannabis marketplace at The Farmstand.

Forem: Matt Macosko

I Just Watched One Hacker Catch Up to a Trillion-Dollar Data Center

What I actually did

What the auroras actually looked like

What this means for the cloud AI bill

Three things ds4 does that nobody else has bundled

What I'm doing next

HumanEval on a MacBook — 81.7% pass@1, Wi-Fi off

Why this number matters

Methodology

For context — Qwen3-Coder 480B's official agentic benchmarks

Why the offline part matters

Reproduce it yourself

Pulling 10x My Subscription Value Out of Claude — While Quietly Building the Backup Plan

The honest comparison

Why is Anthropic eating the difference?

The tells it's already tightening

What I built quietly on the side

The honest takeaway

Free AI on a MacBook vs $100-a-Month Claude Code — Hexagon Shootout

The setup — same prompt, three contestants

The results

Three things that surprised me

1. Bigger isn't better when "bigger" is a generalist

2. Claude Code's harness chokes local models

3. Circle-approximation collision is the cheat code

Who should care

How to run it yourself

TL;DR

The Era of Hunched-Over-A-Screen Computing Is Ending — Heres Whats Replacing It

What ambient computing actually looks like

Three pieces that already exist

Why this matters now

The catch

What I’m building toward

What Its Actually Like to Code By Voice — With the AI Replying In My Own Cloned Voice

How it actually works

What surprises you the first hour

Why on-device matters specifically for this

The cloned voice is a real thing, not a gimmick

Where it falls short today

What this unlocks for me personally

If you want to try it

A Field Guide to Ambient Computing — The Words for the Thing Thats Coming

The Rectangle

Off-Screen Work

Ambient Computing

To Airgap (verb)

Hand-Off Computing

The Two-Slab Posture

Backpack Supercomputer

The Ambient Stack

To Whisper-Code

The Cloned-Voice Loop

Hot-Tub Coding

Local-First AI

A Note On Ownership

Your Medical Practice Is Probably Using Cloud AI on PHI Right Now — Heres the HIPAA Problem Nobody Is Talking About

Why this is suddenly a real problem

The fix that actually works: on-device AI

What it looks like inside a practice

The specific wins for a medical practice

What the cost actually looks like

Who should be looking at this now

Three Generations of Running Claude Code Locally on a MacBook — What I Actually Learned

Why do this at all

Gen 1: Ollama + a translation proxy

Gen 2: llama.cpp + TurboQuant

Gen 3: kill the proxy entirely

What I didn’t expect

Where I ended up

If you want to try it

On-device isn’t just a MacBook thing

This Is What a Robot Can See Now — 601 Objects, Live, Offline, on Your iPhone

80 things vs 601 things

What the app actually looks like

Why 80 → 601 isn’t just “more classes”

The PyTorch → CoreML conversion

Hallucination at scale

Screens across multiple iPhones

Three things `ds4` does that nobody else has bundled