Forem: Ooi Yee Fei

Communicate Ideas Visually: Let AI Run the Feedback Loop

Ooi Yee Fei — Tue, 26 May 2026 15:40:30 +0000

Share a rich HTML doc, gather everyone's comments on it, let AI converge and facilitate continuously

I shared htmldrop earlier — a Claude Code skill + CLI that turns any HTML file into a one-command shareable link. But sharing is only half the problem. The other half is discussion — and today that goes to die in buried chat threads, or worse, in competing V1/V2/V3 documents that never actually respond to each other. Someone sends a spec; you reply with your own AI-polished counter-spec; now there are two confident docs talking past each other.

This is the other half — making the shared document the place the conversation actually happens, and then making it easy to act on that conversation.

A link and a password is all it takes to get in — and the page lands as a place to talk, not just read. From there, anyone you've shared it with reacts right on the document: highlight a sentence to comment, select a passage, or drag a box around a diagram — every note pinned to the exact part it's about, no account needed.

Then the part I care about most. Getting comments is easy; knowing what to do with them isn't — some are clean fixes, some are real forks in the road. So you bring the whole thread back to your coding agent: it folds in the clear wins (a vague metric becomes a concrete target) and, on the genuine disagreements, lays out both sides and leaves the call to you. Re-publish and the same link shows the next version, comments intact — the document evolves from the conversation instead of forking into V1/V2/V3 copies that never answer each other.

Example of how it looks like after convergence:

It's open source, works with whatever coding agent you already use (Claude Code, Codex, Gemini CLI, Antigravity…), and your-own-data — self-host the comment backend, or just keep every comment in your repo as the record.

If you want to try it yourself, here's the detailed walkthrough:

Read the full writeup on Build Signals

Communicating Ideas Shouldn’t Be This Hard: An HTML Sharing Skill for Claude Code

Ooi Yee Fei — Sat, 16 May 2026 16:13:38 +0000

Thariq recently wrote about HTML being a more effective output format from AI agents — his core point being that it helps humans "stay in the loop." Karpathy echoed the sentiment. The argument is simple: when agents produce output for humans, richer formats win. HTML is visual, navigable, and people actually read it. Markdown beyond 100 lines? Most don't.

I agree. But something else has been on my mind.

As tech and non-tech folks alike start using Claude Code and other AI tools to generate documents for each other — specs, proposals, analysis — I've noticed we're getting better at producing ideas and worse at understanding each other's. Everyone generates. Few truly read. Everyone wants to be understood; few pause to understand first.

"Seek first to understand, then to be understood." — Stephen Covey

I think this principle is harder to practice now than ever. AI validates whatever framing we give it. We feel understood by the machine, so we stop trying to understand the human on the other end.

The Pain Point

This tool won't solve that communication problem. But it aligns with what Thariq is saying — if we're going to pass ideas around, the least we can do is make them easy to consume.

What I kept seeing:

People generate beautiful HTML files with Claude Code
But then they're stuck — it's a local file
Sharing means "here's an attachment" or "double-click to open in Chrome"
Non-technical teammates can't easily view or access them

So I built htmldrop — an open source CLI + Claude Code skill that makes sharing a one-command operation.

Two Ways to Use It

Mode 1: CLI — for terminal users:

npm install -g @yeefeiooi/htmldrop && htmldrop init   # one-time (email + verify)
htmldrop push my-spec.html                 # → shareable link
htmldrop push secret.html --password xyz   # → encrypted, password-protected
htmldrop push internal.html --noindex      # → public but hidden from crawlers

Mode 2: Claude Code skill — conversational, guided flow:

Just say: "Share this HTML file"
Claude guides you: public or private? password? block crawlers? → done.

The skill is especially useful for non-technical users — no flags to remember, just plain English. It asks what you need and handles the rest.

Privacy

Password protection encrypts the entire HTML content (AES-256) before upload. Even if someone finds the URL, they see a password prompt and encrypted gibberish. No AI agent or crawler can index the content.

Open Source

Both are fully open source (MIT). No platform lock-in, no subscription, no data on someone else's server. Use it, break it, give feedback, contribute.

I wrote a longer piece about the communication angle — how AI is making us better at producing but worse at understanding each other, and where I think this needs to go next.

Read the full writeup →

Sentrix: An AI SRE Copilot That Debates Its Own Scaling Decisions

Ooi Yee Fei — Mon, 09 Mar 2026 12:17:32 +0000

Every SRE team has the same nightmare: it's 3am, traffic spikes, and nobody predicted it. By the time CloudWatch alerts fire, customers are already frustrated and revenue is lost.

I built Sentrix — an AI-powered SRE copilot that predicts infrastructure problems before they happen and autonomously scales your cloud resources. But what makes it different isn't just prediction — it's debate.

Three Agents, One Decision
Instead of a single AI making decisions, three Bedrock Claude agents argue about every scaling call:

AGENT_SRE fights for reliability: "Scale now, we can't risk downtime."
AGENT_FINANCE pushes back on cost: "That's 5x the replicas — do we really need all of them?"
AGENT_ARBITER synthesizes both: "Scale to 3x now, monitor for 5 minutes, then reassess."
The result is decisions that balance reliability and cost — and every decision is scored 5 minutes later via a Step Functions feedback loop. The scores become thought signatures that feed back into future analysis. The AI literally learns what works for your specific infrastructure.

What it looks like in practice
I ran Sentrix through a full incident lifecycle: traffic spike → cost optimization → regional degradation → cascading AWS failure → cross-cloud GCP failover → autonomous recovery.

Watch the demo:

During a 4536% traffic surge, the brain detected it in milliseconds and scaled EKS from 2 to 10 pods — no human needed. When traffic normalized, the Finance agent argued for scaling down, and the system optimized from 10 back to 5 pods. When AWS regions cascaded, all three agents unanimously agreed on GCP failover. The feedback loop scored that decision 100/100.

The whole system runs on a single AWS CDK stack — Lambda, Bedrock, EKS, DynamoDB, Step Functions, EventBridge, CloudFront — deployed in 5 minutes.

Full writeup
The full post covers the architecture, severity-based model selection (Haiku for low severity, Sonnet for critical), the thought signature self-evolution mechanism, and a phase-by-phase demo walkthrough with screenshots.

Read the full writeup on Build Signals

I submitted Sentrix to the AWS 10,000 AIdeas competition. The top 300 most-liked articles advance to the next round. If you found this interesting, a like on the article would genuinely help — it takes 2 seconds.

Like the article on AWS Builder Center

Your AI Chief Product Officer: Claude Code Skill for Solo Builders

Ooi Yee Fei — Tue, 10 Feb 2026 15:37:17 +0000

Building alone means wearing every hat — including product manager. I keep hitting the same walls: competitor research is exhausting, feature prioritization feels like guesswork, and I don't have a PM team to lean on.

So I built a Product Management skill for Claude Code. It handles the PM workflow I was doing manually — researching competitors, scoring feature gaps, generating PRDs, and creating GitHub Issues — all from the terminal.

The key piece is a WINNING filter that scores every feature gap on pain, timing, execution capability, and defensibility. It takes 50+ potential features and narrows them to 3–5 high-conviction priorities. No more building something just because "it would be cool."

It also pairs with spec-kit for implementation handoff — the PRD creates a GitHub Issue, and spec-kit picks it up from there. One workflow, no context switching.

What it looks like in practice

I ran /pm:prd on a new feature module for one of my products. Claude analyzed the codebase, identified reusable components (approval workflows, audit trails, role-based access), and generated a full PRD with user stories, functional requirements, and architecture reuse analysis.

The PRD was saved locally and created as a GitHub Issue automatically — with MVP scope, P0/P1/P2 priorities, and a breakdown of what I could reuse vs. build new.

From "I want to add this feature" to a structured GitHub Issue with a technical spec ready to generate — in one session.

Full writeup

The full post covers all the commands (/pm:analyze, /pm:landscape, /pm:gaps, /pm:prd), the WINNING filter scoring breakdown, the spec-kit integration, data storage, and a step-by-step walkthrough with screenshots.

Read the full writeup on Build Signals

Links:

Starting my own newsletter — Build Signals

Ooi Yee Fei — Mon, 09 Feb 2026 08:30:10 +0000

I've been writing about Claude Code skills, LLM dev workflows, and lessons from building products solo. Thanks to everyone who subscribed on Medium — I appreciate you following along, especially now early on when I was just figuring things out. Going forward, I'll be publishing on my own newsletter — Build Signals.

Same content, just a more direct way to stay connected. I'll still post on Medium, but full writeups will go out on Build Signals first.

If you'd like to follow along:

Subscribe to Build Signals

[AI-Powered ASL Communication App] - Part 2: Custom ASL Model on EKS + Claude Facilitation via Context Engineering

Ooi Yee Fei — Sat, 31 Jan 2026 15:41:24 +0000

Part 1 covered training an ASL recognition model on EKS. This part focuses on deploying that model for inference and designing a real-world conversation flow using Bedrock Claude with context engineering.

The Problem

A deaf person and a hearing person want to have a conversation. No interpreter available.

Deaf person signs → hearing person needs audio
Hearing person speaks → deaf person needs text

Seems straightforward - just translate between modalities. But there's a deeper challenge most people miss:

ASL isn't English in sign form. Grammar, word order, and expression differ fundamentally. Direct translation doesn't work.
ML models aren't perfect. Our model is 65% accurate. Users need guidance when detection fails.
Vocabulary is limited. 100 signs vs thousands of English words. Users need help staying within what the system understands.

This is where AI facilitation matters. Claude doesn't just translate - it understands conversation context, suggests responses within vocabulary constraints, and keeps the dialogue flowing even when the recognition model stumbles. It bridges the gap between imperfect ML and usable conversation.

The app runs in a browser. No downloads, no accounts.

Some sample demo images, more detailed in below discussion:

System Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│                            Browser (React)                               │
│  ┌──────────────┐  ┌──────────────┐  ┌─────────────┐  ┌──────────────┐  │
│  │   Webcam     │  │  MediaPipe   │  │  Speech     │  │ Conversation │  │
│  │   Feed       │──│  Hand Track  │  │  Recognition│  │ View + TTS   │  │
│  └──────────────┘  └──────┬───────┘  └──────┬──────┘  └──────────────┘  │
└────────────────────────────┼────────────────┼────────────────────────────┘
                             │                │
                             │ landmarks      │ text (browser API)
                             ▼                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      CloudFront + Lambda                            │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │  /v1/asl/predict  →  EKS Inference (PoseLSTM)                 │  │
│  │  /v1/suggestions  →  Bedrock Claude (context-aware prompts)  │  │
│  └──────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘

Flow:

Deaf → Hearing: Webcam → MediaPipe → EKS model → Text → Browser TTS (audio)
Hearing → Deaf: Microphone → Browser Speech Recognition → Text display

Deploying the Inference Endpoint

EKS Setup for Inference

Training used g6.12xlarge (4 L4 GPUs). For inference, that's overkill. The model is 2M params - runs fine on CPU.

Key decisions:

CPU inference - Model is ~2M params, no GPU needed
Minimal resources - 256Mi memory, 100m CPU request
FastAPI server - Async-friendly, good for ML serving

The Inference API

Input: 32 frames of hand landmarks (126 features per frame)
Output: predicted sign, confidence score, top-5 predictions
Latency: ~40ms per prediction

Why Lambda + CloudFront (Not Direct EKS)

Lambda as a proxy instead of exposing EKS directly:

Security - EKS stays in private subnet, no public exposure
Caching - CloudFront caches repeated requests (same sign = same response)
Cost - Lambda scales to zero, only pay for actual requests
Auth - CloudFront + OAC handles authentication without custom code

User → CloudFront → Lambda → EKS (private)
         ↓
      (caching)

Conversation Facilitation: The Design Challenge

Sign recognition is solved. The harder problem: making the conversation actually work.

The Communication Gap

ASL is not English - Grammar, word order, and concepts differ
Vocabulary mismatch - Our model knows 100 signs. Real conversations need more.
Context matters - "BOOK" could mean "I want a book" or "I'm reading a book" depending on context

Solution: Context Engineering with Bedrock Claude

Claude handles conversation facilitation - not translation, but generating contextually relevant response suggestions.

Key requirement: Claude needs to know not just what was said, but who said it and how they communicate.

// suggestionService.ts - the actual implementation

interface ConversationTurn {
  type: 'asl' | 'voice';  // Who said it and how
  text: string;            // What was communicated
}

export async function getLLMSuggestions(
  conversationHistory: ConversationTurn[],
  mode: 'asl' | 'voice'
): Promise<string[]> {
  const prompt = buildSuggestionPrompt(conversationHistory, mode);
  const suggestions = await callClaudeBedrock(prompt);
  return suggestions;
}

Context Engineering: The Four Key Decisions

1. Sliding Window (Last 6 Turns)

We don't send the entire conversation history. Just the last 6 turns.

conversationContext = history
  .slice(-6)  // Only recent context
  .map(turn => `[${turn.type === 'asl' ? 'Deaf user' : 'Hearing user'}]: ${turn.text}`)
  .join('\n');

Why 6? Enough context to understand the topic, not so much that it confuses the model or wastes tokens. Most conversations have natural topic shifts every 4-6 exchanges anyway.

2. Mode-Aware Prompting

The same conversation needs different suggestions depending on who's responding next:

const modeContext = isASLMode
  ? `The deaf user will respond using ASL signs. Suggest common ASL signs from the WLASL-100 vocabulary.
Available signs include: HELLO, HELP, YES, NO, THANK-YOU, PLEASE, GOOD, WANT, NEED, LIKE, GO, EAT, DRINK...
Keep suggestions to single words or short phrases that are actual ASL signs.`
  : `The hearing user will respond by speaking. Suggest natural, conversational responses.
Keep suggestions brief (under 8 words each) and appropriate for the context.`;

This is critical. ASL suggestions must be constrained to signs the model can actually recognize. Voice suggestions can be natural language.

3. Turn Type Labeling

Each message is tagged with who sent it:

// In the prompt, turns look like:
// [Deaf user (ASL)]: HELLO
// [Hearing user (Voice)]: Hi! How are you?
// [Deaf user (ASL)]: GOOD

Claude can track the conversation flow and understand the back-and-forth pattern. This helps it generate appropriate responses for each party.

4. Response Caching

Same context = same suggestions. No need to hit the API twice.

const suggestionCache = new Map<string, { suggestions: string[]; timestamp: number }>();
const CACHE_TTL_MS = 30000; // 30 seconds

// Cache key is the full context hash
const cacheKey = `${mode}:${conversationHistory.map(t => `${t.type}:${t.text}`).join('|')}`;

30-second TTL means rapid back-and-forth doesn't hammer the API, but suggestions stay fresh as the conversation evolves.

Additional Design Decisions

Graceful fallbacks - If API fails, default suggestions like ['YES', 'NO', 'HELP', 'THANK-YOU'] are shown
Vocabulary display - UI shows all 100 supported signs so users know what's possible
JSON output format - Claude returns suggestions as JSON array for reliable parsing

Real-Time Flow

Deaf user signs → Hearing user hears:

1. Webcam captures video stream
   ↓
2. MediaPipe extracts hand landmarks (client-side, ~10ms)
   ↓
3. Collect 32 frames of landmarks (~1 second of signing)
   ↓
4. Send to /v1/asl/predict (EKS)
   ↓
5. Model returns prediction + confidence (~40ms)
   ↓
6. If confidence ≥ 60%: show confirmation dialog (threshold balances false positives vs missed detections)
   ↓
7. User confirms → sign added to conversation
   ↓
8. Browser TTS speaks the sign to hearing user
   ↓
9. Bedrock generates suggestions for next response

Hearing user speaks → Deaf user reads:

1. Browser Speech Recognition API captures audio
   ↓
2. Real-time transcription (browser-native)
   ↓
3. User confirms → text added to conversation
   ↓
4. Deaf user reads the message
   ↓
5. Bedrock generates ASL sign suggestions (vocabulary-constrained)

Total latency: ~500ms for ASL detection flow. Speech recognition is near-instant (browser-native).

How This Differs from Other ASL Apps

Most ASL apps:

Teaching ASL (educational)
Translating ASL to text (one-way)
Avatar-based signing (uncanny valley)

VoxSign:

Bidirectional (both parties can initiate)
Context-aware (Claude tracks conversation flow)
Confirmation-based (handles model uncertainty)
Suggestion-driven (guides users within vocabulary limits)

Unsolved Challenges

100 signs isn't enough - Real conversations need 500+ signs minimum
Two-hand signs are harder - Model struggles with signs requiring hand interaction
No sentence-level understanding - We detect individual signs, not full ASL sentences
One-way ASL - Deaf user reads text, doesn't see ASL video (video generation was too slow)

Key Takeaways

Context engineering > prompt engineering - Deciding what to send Claude (last 6 turns, turn types, mode context) matters more than prompt wording. The sliding window was a bigger win than any prompt change.
Mode-aware prompting is essential - Same conversation, different constraints. ASL suggestions must be vocabulary-constrained; voice suggestions can be natural language.
Confirmation UX handles model uncertainty - User confirms predictions before adding to conversation.
Cache aggressively - Same context = same suggestions. 30-second TTL saved API costs and improved latency.
Bedrock is cost-effective - ~$0.003 per conversation turn. Context engineering to reduce token count pays off directly.

ClaudeCode Streak Skill Got a Telegram Bot: Check In From Your Phone

Ooi Yee Fei — Mon, 05 Jan 2026 23:58:56 +0000

This is a follow-up to my first Streak post where I introduced the Claude Code skill for tracking personal challenges.

A few weeks ago, I shared the Streak skill - a Claude Code skill for tracking any personal challenge. The response was better than I expected. People actually found it useful. I saw some started to try it out and raise for features.

More importantly, I just completed my first 30-day challenge using it.

What 30 Days of Tracking Taught Me

After the challenge ended, I asked Claude Code to do a retrospective analysis. Not just "here's what you did" - but a comprehensive review of patterns, habits, what worked, what didn't.

The insights were... humbling.

Claude actually analyzed thoroughly and proactively surfaced many aspects of retrospective insights - without me prompting or guiding it through. It covered things I built, impact and wins to measure performance and results over time (Win Rate even! Claude was brutal on that one). It also automatically did Tech Stack Analysis without me asking, identifying patterns of tools I use, projects I built, things I left in backlog lists dusted but forgot, and Emerging Patterns that were new and surprising to me - stuff I picked up outside my comfort zone that I hadn't consciously noticed.

There was a Top 10 Learnings section ranked by impact, work-related insights like product decisions, feature prioritization, inspiration and new ideas that cross-influenced from my builds. And of course, What Worked, What Didn't Work, and Recommendations for Next Steps.

All this from the data I'd been logging daily. For someone who's the least planned and structured person, having Claude surface these patterns was eye-opening.

My experiment has been on technical work I did, but I'm excited to double down on effort with more aspects of life - fitness, habits, learning - and do more cross-analysis to understand how each area impacts and affects the others.

But... there was a small challenge.

The Problem: Friction

Here's what started bothering me after week 2.

Some days got busy. Really busy. And even with the best intentions, I'd forget to check in. Not because I didn't want to - but because the friction was too high.

To log a check-in, I had to:

Open terminal
Navigate to the right directory
Start Claude Code
Remember the command
Actually do the check-in

That's 4 steps before the actual work. On a hectic day, steps 1–4 just… didn't happen. And those I have to track back and log to fill up the days , which was okay and I still managed to do all 30-days logging in the end, but is not ideal.

I had a calendar reminder option, but that just reminded me to open terminal. The friction was still there.

I wanted something that could:

Ping me when a check-in was due
Let me check in right there, without opening anything else
Work from my phone (because that's always in my hand)

Exploring Options

I started thinking about chat-based notifications. Something that lives where I already am:

Platform	Pros	Cons
Slack	Already use it for work	Paid for full API, mixes with work
WhatsApp	Everyone has it	Business API is complex, costs money
Discord	Free, good bot support	Don't use it daily
Telegram	Free, excellent bot API	Need to install app

Telegram won. Free API, easy bot creation, works on all devices, and the bot API is surprisingly good.

So I built it.

Added: Claude Code Streak Skill Telegram Bot

The bot does everything the Claude Code skill does - but from your phone or laptop chat app:

┌─────────────────────────────────────────────┐
│           @mystreak_bot                     │
│                                             │
│  /list    → See all your challenges         │
│  /switch  → Change active challenge         │
│  /streak  → Interactive check-in            │
│  /stats   → View progress & streaks         │
│  /insights → Cross-challenge patterns       │
│  /new     → Create new challenge            │
└─────────────────────────────────────────────┘

What It Looks Like

Main Menu:

Welcome to Streak Bot!

Your active challenge: morning-workout
Status: 5-day streak

[Check In]  [List Challenges]
[Stats]     [Insights]

Check-in Flow:

Check In: morning-workout

How did it go today?

[Great]  [Good]  [Okay]  [Struggled]

What did you work on?
> Push day - bench press, overhead press, tricep dips

Any notes or learnings?
> Finally hit 60kg bench! Form felt solid.

Session 15 logged!
Current streak: 6 days

List Challenges:

Your Challenges:

ACTIVE:
* morning-workout (Fitness) - 6 day streak
  learn-rust (Learning) - 3 day streak
  read-12-books (Learning) - due today

PAUSED:
  meditation-habit (Habit) - paused 2 weeks ago

Use /switch [name] to change active challenge

One Bot, All Challenges

Here's an important design decision: one bot manages ALL your challenges.

Not one bot per challenge. Not separate bots for fitness vs learning vs work.

Why? Because your challenges are interconnected.

Your morning workout affects your coding productivity. Your learning enables your building. Your meditation habit influences your creative work.

By keeping everything in one place, the bot can detect patterns across challenges:

Cross-Challenge Insight:

Your "morning-workout" sessions correlate with
higher productivity in "learn-rust".

Sessions where you worked out first show 40%
more concepts covered.

This is the whole point of Streak - not just tracking individual things, but understanding how they connect.

How It Works: Both Stay in Sync

The Telegram bot and Claude Code skill read/write the same files:

┌─────────────────────────────────────────────────────────────┐
│                    .streak/ folder                           │
│                   (Source of Truth)                          │
└─────────────────────┬───────────────────────┬───────────────┘
                      │                       │
                 reads/writes            reads/writes
                      │                       │
                      ▼                       ▼
        ┌─────────────────────┐   ┌─────────────────────┐
        │   Claude Code       │   │   Telegram Bot      │
        │   (Terminal)        │   │   (Phone)           │
        │                     │   │                     │
        │   Deep work:        │   │   Quick check-ins:  │
        │   - Planning        │   │   - Log progress    │
        │   - Research        │   │   - View stats      │
        │   - Analysis        │   │   - Switch context  │
        └─────────────────────┘   └─────────────────────┘

Check in from Telegram in the morning. Do deep analysis in Claude Code later. They see the same data.

Optional GitHub sync: If you want to access from multiple devices, commit your .streak/ folder and enable git sync. The bot can auto-pull before reading and auto-push after check-ins.

Push Notifications

This is the feature that I wanted to test out

The bot doesn't just wait for you to message it - it proactively pings you when challenges are due:

🔔 Streak Check-in Reminder

❗ Overdue:
• morning-workout (2d overdue)

📅 Due Today:
• learn-rust (streak: 5 days)
• read-24-books (weekly check-in)

Tap /streak to check in

[✓ Check In Now]  [📋 List All]

No more forgetting. No more "I'll do it later." The notification lands on your phone at your configured time (default 9 AM), and you can check in right there with two taps.

Configure your notification time:

# In your .env file
TIMEZONE=Asia/Singapore    # Your timezone
NOTIFICATION_HOUR=9        # 9 AM
NOTIFICATION_MINUTE=0      # On the hour

This is the friction-killer I was looking for. Bot runs 24/7 in Docker, pings me every morning, I tap "Check In Now", done in 30 seconds.

Setup: Easier Than You Think

Prerequisites (5 minutes, manual by you, one-time)

1. Create a Telegram bot:

Open Telegram, message @BotFather
Send /newbot
Pick a name: My Streak Bot
Pick a username: mystreak_bot (must end in bot)
Save the token it gives you

2. Get your chat ID:

Message @userinfobot on Telegram
Save the number it returns

3. Message your bot:

Find your bot by username
Send /start (required before it can message you)

4. Create .env file in your project folder:

cat > .env << EOF
TELEGRAM_BOT_TOKEN=your-token-here
ALLOWED_USERS=your-chat-id-here
TIMEZONE=Asia/Singapore
EOF

Deploy Automatically(One Command)

/streak-telegram

That's it. The command:

Verifies your credentials
Copies the bot files
Adds .env to .gitignore
Starts Docker container
Shows you management commands

The bot runs in the background and auto-restarts on reboot.

Management

docker-compose logs -f      # View logs
docker-compose restart      # Restart bot
docker-compose down         # Stop bot

Real Examples

Example 1: Sarah's Reading Challenge

Sarah wants to read 24 books this year. She created a Learning challenge:

Challenge: read-24-books
Type: Learning
Goal: 2 books per month
Cadence: Weekly check-in

Her typical week in Telegram:

Sunday evening:

/streak

How's the reading going?

> Finished "Atomic Habits" - great stuff on habit
> stacking. Started "Deep Work" yesterday.

Progress on milestone?
> 4/24 books done

Session 8 logged!
Current streak: 8 weeks

Mid-week reminder (from bot):

Streak Reminder

Your challenge "read-24-books" is due for
check-in in 2 days.

Current streak: 8 weeks

[Check In Now]  [Remind Later]

Example 2: Mike's Fitness Journey

Mike's tracking a strength training habit:

Challenge: strength-training
Type: Fitness
Goal: Consistent 4x/week lifting
Cadence: Daily

Morning check-in:

/streak

What workout today?
> Pull day - deadlifts, rows, pull-ups, bicep curls

How did your body feel?
> Good energy. Deadlift felt heavy but form was solid.
> Hit 100kg for 3 reps - new PR!

Any notes?
> Need more sleep. Was yawning between sets.

Session logged! PR detected - nice work!

Stats view:

/stats

strength-training Statistics

Sessions: 23
Current Streak: 12 days
Longest Streak: 12 days (current!)
Average: 4.2 sessions/week

Best Days: Tuesday, Thursday, Saturday
Common Notes: "energy", "form", "sleep"

Achievements:
  First Flame (3-day streak)
  On Fire (7-day streak)
  Dedicated (10 sessions)

Example 3: Cross-Challenge Insights

Alex runs three challenges: morning-meditation (Habit), learn-python (Learning), and side-project (Building).

After a month, the /insights command surfaces:

Cross-Challenge Patterns:

1. Meditation → Productivity Link
   Days with morning meditation show 35% longer
   coding sessions in both learn-python and
   side-project.

2. Skill Transfer Detected
   Your "learn-python" Session 12 (learned async/await)
   directly enabled "side-project" Session 8
   (built async API client).

3. Best Sequence
   Your most productive days follow this pattern:
   meditation → workout → coding
   Consider making this your default routine.

My New Workflow

Here's how I use both together now:

Morning (phone):

Wake up, check Telegram
Quick /streak check-in for morning routine
See what's due today

During work (terminal):

Claude Code for deep work sessions
Research, planning, analysis
Detailed check-ins with context

Evening (phone):

Quick log of what got done
/stats to see progress
Plan tomorrow's focus

The friction is gone. I haven't missed a check-in in 2 weeks.

Try It Out

If you're already using Streak:

# Update the skill
/plugin update ccc-skills@ccc

# Set up Telegram (5 min)
# 1. @BotFather → /newbot → save token
# 2. @userinfobot → save chat ID
# 3. Create .env file

# Deploy
/streak-telegram

If you're new to Streak:

# Install
/plugin marketplace add ooiyeefei/ccc
/plugin install ccc-skills@ccc

# Create your first challenge
/streak-new

# Add Telegram (optional but recommended)
/streak-telegram

What's Next

A few things I'm thinking about:

Notification timing - Smart reminders based on your check-in patterns, not just fixed schedules
Voice check-ins - Telegram supports voice messages. Could be useful for quick logs
Weekly digests - Automated summary sent every Sunday
Multi-device sync - Better GitHub integration for teams or multiple devices

But for now, the basics work. And that's enough.

The goal was simple: reduce friction so much that checking in becomes automatic.

Telegram on my phone. Always there. One tap to log progress. No terminal, no directory navigation, no startup time.

For someone as unstructured as me, that's the difference between a system that sticks and another forgotten tool.

Give it a try. Let me know what breaks - or what features you'd love to see!

Got ideas? Found a bug? Want a feature that would make this work better for your use case? Open an issue on GitHub - I'd love to hear what challenge types you're tracking and how the tool can better support them. The best features come from real users with real needs.

Links:

[AI-Powered ASL Communication App] - Part 1: Cost-Effective Sign Detection Model Training on AWS EKS

Ooi Yee Fei — Sat, 03 Jan 2026 23:47:40 +0000

Recently I had an idea for an app that helps with ASL sign communication - wanted to experiment and see if it's feasible. But first, I need a model that can detect signs well enough. I'm not an ML scientist - more of an AI engineer. So ML training is something I'm still learning.

I started by researching existing models. Couldn't find a ready-to-use one that works for my use case:

Google SignGemma - Not released to public yet
SignLLM - Interesting approach but designed for a different workflow (sign-to-text translation vs real-time detection)

Most existing solutions focus on fingerspelling (alphabet) rather than word-level signs. So I explored training my own model with an ASL dataset.

Initial results:

Challenges

Three things I needed to figure out:

No suitable existing model - Nothing I could just download and use for word-level ASL detection
Dataset selection - Which dataset has enough samples per class to train reliably?
Training approach - How to get effective results without massive compute?

Dataset Selection: Why WLASL-100

I evaluated several datasets:

MS-ASL - Looks promising but requires downloading from YouTube. Many videos are now unavailable. Gave up.
WLASL-2000 - 2000 classes but the class-to-sample ratio is terrible. Some signs have only 3-5 videos. Not enough for training.
WLASL-100 - 100 classes with more samples per class. Better balance.

Went with WLASL-100 (Word-Level American Sign Language). The tradeoff: smaller vocabulary but more reliable training.

Model Selection: Why Pose LSTM (Not VideoMAE or VLMs)

I started with VideoMAE - seemed like the good choice for video understanding. But:

86M parameters is overkill for ~750 training samples
Fine-tuning took forever, results were mediocre (~40%)
Inference was slow for real-time detection

Then I tried a simpler approach: extract hand landmarks with MediaPipe, feed into a lightweight LSTM. This made sense because:

ASL signs are primarily about hand positions and movements
MediaPipe gives us 21 landmarks per hand (x, y, z coords) = 126 features
Much smaller input than raw video frames
Can focus the model on what actually matters

Model Experiments: The Failures Matter

Tried several approaches before finding what works:

Model	Params	Result	What Happened
VideoMAE (fine-tuned)	86M	~40%	Too heavy, slow inference
Pose LSTM v1	~2M	51.52%	Decent baseline
Pose LSTM v2	14M	0.61%	Massive overfitting
Pose LSTM v2-lite	~1M	58.79%	Stripped down, worked
Pose LSTM v3	4.6M	1.82%	FocalLoss + too many params
Pose LSTM v3-enhanced	~2M	65%+	Final model

Key learnings:

Bigger model != better results (especially with ~750 training samples)
14M params on 750 samples = disaster
MediaPipe hand landmarks + BiLSTM + attention pooling = sweet spot
Label smoothing and mixup augmentation helped

AWS Infrastructure: EKS + Spot Instances

Here's what I set up:

Cluster Config:

AWS EKS in us-east-1
Node group: g6.12xlarge (4x L4 GPUs each, 24GB VRAM per GPU)
Used spot instances - ~70% cost savings (~$1.72/hr vs $5.67/hr on-demand)
Training time: ~1.5-2 hours per run

Why g6.12xlarge:

L4 GPUs are newer and cheaper than older V100s
4 GPUs per instance = can run multi-GPU DDP training
Spot availability is good for this instance type

Checkpoint Strategy for Spot Instances

Spot instances can be terminated anytime. My solution:

# Save checkpoint every epoch to S3
class CheckpointCallback:
    def __init__(self, s3_bucket, s3_prefix):
        self.s3_bucket = s3_bucket
        self.s3_prefix = s3_prefix

    def on_epoch_end(self, trainer, epoch, metrics):
        # Save locally
        checkpoint_path = self._save_checkpoint(trainer, epoch)

        # Upload to S3 immediately
        self._upload_to_s3(checkpoint_path)

Key points:

Checkpoint every epoch (~5-10 min intervals)
Upload to S3 immediately after each checkpoint
Max loss on spot interruption: one epoch of training
Resume from S3 checkpoint on new instance

S3 bucket structure:

s3://asl-model-checkpoints/
  └── runs/
      └── 2025-12-26-v3-enhanced/
          ├── checkpoint-epoch-001/
          ├── checkpoint-epoch-002/
          └── best/

How I Train

The actual training flow:

Data preprocessing
- Download WLASL videos
- Extract 32 frames uniformly sampled
- Run MediaPipe to get hand landmarks (21 points x 2 hands x 3 coords = 126 features)
- Compute velocity features (frame-to-frame differences)
Model architecture (v3-enhanced)
- Input: 126 features (hand landmarks)
- BiLSTM with 3 layers, hidden_dim=384
- Attention pooling (learns which frames matter)
- Classifier head with layer norm + dropout
Training config

   batch_size = 32
   learning_rate = 1e-3  # with warmup
   epochs = 50
   early_stopping_patience = 10
   label_smoothing = 0.1

Augmentations
- Random scaling (0.9-1.1)
- Random rotation (-15 to +15 degrees)
- Mixup (alpha=0.2)
- No horizontal flip (would change sign meaning)

Results

Final model (v3-enhanced):

Validation accuracy: 30–40% → 65%+ (from VideoMAE baseline)
Top-5 accuracy: ~90%
Inference time: <50ms per video
Model size: 86M → ~2M params (43x smaller)

Not state-of-the-art, but good enough for a demo app.

Cost Breakdown

For one successful training run:

g6.12xlarge spot: ~$1.72/hr x 2 hours = ~$3.44
S3 storage (checkpoints + data): ~$0.50/month
Total per experiment: under $5

I ran maybe 15-20 experiments total during development. Spot instances saved me hundreds of dollars.

What I'd Do Differently

Start with simpler models first (I wasted time on VideoMAE)
More aggressive data augmentation earlier
Consider synthetic data generation for underrepresented classes
Set up proper MLflow experiment tracking from day one

What's Next

This post covered the ML training part. Next, I'll write about deploying the model and building the app:

User signs → Webcam capture → MediaPipe → Model inference → Predicted gloss → TTS → Audio output

Part 2 will cover the AI engineering side - FastAPI server, EKS deployment, and how it all connects.

Building AI Video Generation Pipelines with AWS Lambda Durable Functions

Ooi Yee Fei — Wed, 24 Dec 2025 07:27:05 +0000

At re:Invent 2025, AWS announced Lambda Durable Functions — a new capability that lets you write long-running, stateful workflows as simple sequential code while the SDK handles checkpointing, retries, and state management automatically.

I wanted to test it with some relevant use case, so I tested with a content generation platform that transforms product photos into social media content using Gemini for image and video generation. This post covers why I chose Lambda Durable Functions and the patterns that made it work.

The Problem: AI Video Generation is Slow

Video generation with models like Veo 3.1 takes ~90 (or longer) seconds. That's a problem when:

API Gateway times out at 29 seconds
Lambda's synchronous invocation limit is 15 minutes
Users expect a response, not a "check back later"

Traditional solutions involve Step Functions, SQS queues, or webhook callbacks. All require orchestration code, state management, and error handling boilerplate.

What is Lambda Durable Functions?

Lambda Durable Functions uses a checkpoint-and-replay model:

Execute & Checkpoint — The function runs, and the SDK saves progress at each context.step() call
Wait & Suspend — When encountering a wait or external call, the function terminates gracefully, preserving state
Resume & Replay — On the next invocation, completed steps are skipped using their checkpointed results

Key capabilities:

Workflows up to 1 year — Overcomes the 15-minute limit
Automatic retries — Handles failures with exponential backoff from the last successful step
No infrastructure — No Step Functions state machines, no queues, no additional services

The SDK (@aws/durable-execution-sdk-js) wraps your handler and manages the checkpointing transparently.

Architecture Overview

In my implementation, the Lambda function handles multiple workflow modes:

GENERATE_VIDEO — Image → Veo 3.1 → Video URL (with polling)
GENERATE_IMAGE — Prompt + Reference → S3 URL
GENERATE_CONTENT — Image → Content Strategy JSON
UPLOAD_IMAGE — Base64 → S3 (no durable steps needed)

Pattern 1: Checkpointed Polling

Video generation is async. You start a job, get an operation ID, and poll until completion. Here's the pattern:

export const handler = withDurableExecution(
  async (event, context: DurableContext) => {
    // Step 1: Start the operation
    const operationId = await context.step('start-generation', async () => {
      return await startVideoGeneration(prompt, imageBase64);
    });

    // Step 2-N: Poll with unique step names
    let result = null;
    let pollCount = 0;

    while (!result && pollCount < 60) {
      pollCount++;

      const status = await context.step(`poll-${pollCount}`, async () => {
        return await checkOperationStatus(operationId);
      });

      if (status.done) {
        result = status.videoUrl;
        break;
      }

      // Durable wait — function suspends here
      await context.step(`wait-${pollCount}`, async () => {
        await sleep(10000);
        return true;
      });
    }

    return { videoUrl: result };
  }
);

Each poll iteration has a unique step name (poll-1, poll-2, etc.). If Lambda times out mid-poll, it resumes from the last completed step on the next invocation.

Pattern 2: Large Payload Handling

Durable function checkpoints have a 256KB limit. Generated images easily exceed this. The solution: upload to S3 first, return the URL.

// Wrong — returns large base64, checkpoint fails
const imageData = await context.step('generate-image', async () => {
  return await generateImage(prompt); // ~500KB base64
});

// Correct — upload to S3, return small URL
const imageUrl = await context.step('generate-and-upload', async () => {
  const { imageBase64, mimeType } = await generateImage(prompt);
  const url = await uploadToS3(imageBase64, mimeType);
  return url; // ~200 characters
});

This pattern applies to any step that produces large outputs.

Pattern 3: Selective Durability

Not every operation needs checkpointing. Quick operations like generating presigned URLs or uploading base64 to S3 can run without durable steps:

if (mode === 'UPLOAD_IMAGE') {
  // No context.step() — runs directly, no checkpoint overhead
  return await handleUploadImage(body);
}

if (mode === 'GENERATE_VIDEO') {
  // Uses context.step() for long-running polling
  return await handleGenerateVideoWorkflow(body, context);
}

Reserve durable steps for operations that:

Take more than a few seconds
Involve external API calls that might fail
Need retry capability

CDK Infrastructure

Deploying durable functions requires enabling the feature on the Lambda:

const workflow = new lambda.Function(this, 'Workflow', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  timeout: Duration.minutes(15),
  memorySize: 1024,
});

// Enable durable execution
const cfnFunction = workflow.node.defaultChild as lambda.CfnFunction;
cfnFunction.addPropertyOverride('DurableConfig', {
  Enabled: true,
});

The function URL provides direct HTTPS access without API Gateway, avoiding the 29-second timeout entirely.

Lessons Learned

1. Step names help with debugging

While the SDK tracks execution sequence (so identical step names work), using descriptive unique names like poll-status-1, poll-status-2 makes CloudWatch logs and debugging much easier.

2. Step outputs are serialized

Everything returned from context.step() must be JSON-serializable. No functions, no circular references, no Buffers.

3. Cold starts add latency

Each resume is a new Lambda invocation. For workflows with many steps, cold starts accumulate. Consider provisioned concurrency for latency-sensitive workflows.

4. Debugging requires understanding replay

Console logs in completed steps won't appear on replay — the step returns its cached result immediately. Add logging outside steps or use CloudWatch to trace the full history.

When to Use Durable Functions vs Step Functions

|        Use Case.         | Durable Functions | Step Functions |
| - - - - - - - - - -- - - | - - - - - - - - - | - - - - - - - -|
| Simple linear workflows  |          ✓        |                |
| Complex branching logic  |                   |        ✓       |
| Visual workflow designer |                   |        ✓       |
| Code-first development   |          ✓        |                |
| Sub-second coordination  |                   |        ✓       |
| Long waits (hours/days)  |          ✓        |        ✓       |

Durable Functions shine when you want to write workflows as regular code without learning a new DSL or managing state machine definitions.

Results

With Lambda Durable Functions, the video generation pipeline:

Handles 30-90 second video generation without timeout issues
Automatically retries failed API calls from the last checkpoint
Scales to concurrent requests without queue management
Costs nothing while waiting (Lambda suspends between polls)

The entire backend is a single Lambda function. No Step Functions, no SQS, no orchestration infrastructure.

Custom Claude Code Skill: Auto-Generating / Updating Architecture Diagrams with Excalidraw

Ooi Yee Fei — Sun, 21 Dec 2025 08:34:33 +0000

I maintain several GCP infrastructure projects with Terraform. Every time I onboard someone or need to explain the architecture, I face the same problem: my diagrams are either outdated or does not even exist.

The Problem

If you use any coding agent you can generally ask Claude or Gemini to scan your codecase and generate architecture diagram anytime. I tried different approaches:
Mermaid diagrams - Great for code-based diagrams, but they get cluttered fast. A Terraform stack with 15+ resources becomes unreadable. And they don't support the freeform layout I need.

Manual Excalidraw - Beautiful, collaborative, easy to share. But I have to manually update it every time infrastructure changes. Which means it's always outdated.

What I wanted: Claude Code analyzes my codebase and generates an Excalidraw diagram automatically. High-level, clean, with proper arrows.

Most importantly, as I continue working on the project, Claude Code can update the diagram, by understanding the up-to-date codebase, comparing against the old diagram which give useful contexts sometimes.

Exploring Solutions

I considered different approaches to make this convenient:

Ad-hoc prompts — Just ask Claude Code when I need an update
Claude Code Skill — Teach Claude the Excalidraw format once, reuse everywhere
Plugin with slash command — /excalidraw to trigger generation
MCP server — More complex, but could watch for file changes
Hooks — Auto-trigger diagram updates after certain git operations

I started with Claude Code Skills — the simplest to implement. Skills are markdown files that teach Claude how to perform specific tasks. No scripts, no servers, just knowledge files. Perfect for experimenting with quality and feasibility before investing in more complex solutions.

Results

Here’s what it generated for an EKS AI/ML Terraform stack — Karpenter, GPU workloads, Rafay orchestration — from a single prompt:

Full Writeup

The full post covers how the skill generates valid .excalidraw JSON with proper labels and elbow arrows, installation, usage commands, and more examples.

Read the full writeup on Build Signals

Links:

GitHub (For the Skill)
GitHub (For the Sample repo I used)
Excalidraw
Claude Code

Beyond Coding: Your Accountability Buddy with Claude Code Skill

Ooi Yee Fei — Mon, 15 Dec 2025 00:11:45 +0000

After months of demanding work and endless context-switching recently, it got me thinking: out of the 1001 personal learning topics, build ideas, and reading lists I've been wanting to tackle over the past 7-8 months - what have I actually achieved? How am I progressing? Have I drifted? Are those still the priorities I remember? What even are they anymore?

I realized I'd lost track. And time keeps slipping by.

So I wondered: for someone who's not the most organized person - someone who tends to be random and follows what they want to do impromptu - how can I have a better, systematic way to handle and track all this? Something that gives me insights when I need them, lets me reflect on where I'm heading, and tells me if I'm still on track. But most importantly, keeps the flexibility and openness I need to explore new things as they come.

My first thought: not another task tracker. Not the 999th document or note tool or app that I've tried and given up on (or forgotten about) within a week. How can I make this more fun? Something that suits my way of thinking - how I actually feel motivated?
Challenge. That was my first thought. But how do I build it? I need help. Since I'm a big fan of Claude Code day-to-day, is there something I can do with it differently?

So I decided to give it a try. I had no idea what would work - only time would tell. If this system makes me remember it and stick with it for more than a week, it works. I started brainstorming and designing with Claude. One thing on my long list was exploring how Claude Code skills could be used in different scenarios. No apps. No tools. Just a Claude Code skill - simple enough that I can "inject" it into my beloved Claude Code to help me with this.

After letting the brainstorming juice flow, I ended up with this skill: GitHub

What it does: every day I check in, log progress, track ideas, and try to maintain momentum. But my original workflow was locked into a specific folder structure and only worked for "building" type challenges.

What if someone wanted to track a fitness challenge? A reading goal? A meditation habit?

I decided to build a Claude Code skill that could track any type of challenge. After discussing with friends, there seems to be a working pattern that's open, flexible, and adaptable enough regardless of challenge type - fitness, learning, habits. We all have some aspect of our life we hope to improve. It seems to fit just right with what Claude Code skills can do.

How It Ended Up

My previous /daily-checkin workflow worked great for my 30-day AI/ML challenge. It had:

challenge-log.md for tracking progress
daily-context.md for setting up each session
ideas-backlog.md for things to try - whenever my random brain pops up with an idea, I just throw it in the backlog. Or if I'm lazy, I tell Claude in scattered-brain chatting style and it logs it in a structured way for me.
preferences.md for my stack and tools

But it was hardcoded for tech challenges. Questions like "what did you ship?" and "tech stack used?" are useless if you're tracking a workout routine or trying to read 12 books this year.

I wanted something that:

Works for any challenge type (learning, building, fitness, creative, habits)
Asks the right questions based on what you're tracking
Keeps the same useful file structure but adapts the content
Detects connections across different challenges (compound learning)

The Approach

Instead of building separate trackers for each domain, I realized the file structure could stay universal - only the content needs to adapt.

Think of it like this: A preferences file is useful whether you're tracking code or workouts. For code, it stores your stack and tools. For fitness, it stores your equipment and workout types. For food, maybe cuisine and diet type. Same purpose, different content.

This led to the "type-adaptive" design:

Same files for everyone (preferences.md, backlog.md, today.md, etc.)
Different sections filled in based on challenge type
Guided creation flow that asks type-specific questions

How It Works

The full post covers challenge types, type-adaptive preferences, cross-challenge insights, installation, usage commands, and real examples (learning + fitness challenges) — plus what I learned after 2 weeks of using it.

Read the full writeup on Build Signals

Links:

From Video to Voiceover in Seconds: Running MLX Swift on ARM-Based iOS Devices

Ooi Yee Fei — Fri, 05 Dec 2025 06:09:27 +0000

I started exploring ARM-based AI applications - how generative AI and machine learning models can run locally on ARM devices. Video editing friction sparked the idea: how can I cut down the time to polish demo videos?

That led to ScriptCraft - an iOS app that transcribes video, cleans up the script with an on-device LLM, generates new narration, and exports the final video. No cloud APIs. No uploads. Just your phone doing the work.

The Idea

I wanted to repurpose video content quickly. Record something, get an AI-polished transcript, hear it narrated professionally, export. No cloud APIs. No waiting for uploads. Just your phone doing the work.

How it works:

Import a video
Transcribe the audio with on-device speech recognition
Enhance the transcript with a local LLM
Generate narration via TTS
Replace the original audio and export

This gives an end-to-end ready video that is refined and ready to share.

MLX Swift: The Promise and The Reality

Apple's MLX framework lets you run ML models on device. MLX Swift brings this to iOS. I wanted to use it for the transcript enhancement step - clean up filler words, fix grammar, make it more readable.

The model: Qwen 0.5B 4-bit quantized. Small enough for mobile, supposedly capable enough for basic text tasks.

Setting it up looked straightforward:

let modelConfiguration = ModelConfiguration.configuration(id: "mlx-community/Qwen2.5-0.5B-Instruct-4bit")
let model = try await LLMModelFactory.shared.loadContainer(configuration: modelConfiguration)

Setting Up Physical Device Testing

MLX requires Metal GPU for inference. I used an iPhone 13 mini, which supports Metal.

Device setup:

Enable Developer Mode: Settings > Privacy & Security > Developer Mode. Restart and confirm.
Match Xcode version to iOS version to avoid build errors.
For local servers, configure with your Mac's network IP (e.g., http://10.0.0.100:5055) and bind to all interfaces with --host 0.0.0.0.

The Transcription Challenge

iOS has a built-in speech recognizer. Use it:

let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
let request = SFSpeechURLRecognitionRequest(url: audioURL)
request.requiresOnDeviceRecognition = true

let result = try await recognizer.recognitionTask(with: request)

Worked great for short clips. For a 3-minute video? It returned maybe 30 seconds of text.

The issue: SFSpeechRecognizer has limits. Apple doesn't document them clearly, but around 1 minute of audio seems to be the practical ceiling for a single request.

My fix: chunk the audio.

private let chunkDuration: TimeInterval = 30.0

func transcribe(asset: AVAsset) async throws -> String {
    let duration = CMTimeGetSeconds(asset.duration)
    var chunks: [String] = []

    for startTime in stride(from: 0, to: duration, by: chunkDuration) {
        let chunkResult = try await transcribeChunk(asset: asset, from: startTime)
        chunks.append(chunkResult)
    }

    return chunks.joined(separator: " ")
}

Better, but still inconsistent. Some chunks came back empty. The video had mixed audio sources (voice + background music from editing). The on-device recognizer struggles with that.

Added a fallback:

// Try on-device first
let onDeviceResult = try await transcribeWithMode(url: url, onDevice: true)
if !onDeviceResult.isEmpty {
    return onDeviceResult
}

// Fallback to server-based (uses Apple's servers)
return try await transcribeWithMode(url: url, onDevice: false)

What I learned: On-device speech recognition is good for clean audio, short clips. For real-world content with mixed sources, you need fallbacks.

The Hallucination Problem

Got transcription working. Fed it to Qwen 0.5B. The output?

"Sure, here's a rewritten version of your transcript with improved flow and engagement:

[Completely fabricated content about topics never mentioned in the original]"

The model hallucinated. My original prompt asked it to "enhance and improve" the script. The 0.5B model interpreted that as "make stuff up."

The fix was embarrassingly simple: ask for less.

private func buildPrompt(transcript: String) -> String {
    """
    Clean up this transcript. ONLY fix grammar and remove filler words.
    DO NOT add any new information or content.

    IMPORTANT: Output the same content, just cleaned. Do not invent or add anything.

    Original: \(transcript)

    Cleaned:
    """
}

No more "enhance." No more "improve." Just "clean up." The model stopped inventing.

Think of it like this: A 0.5B model is like a meticulous proofreader - excellent at catching typos and cleaning up grammar, but ask them to ghostwrite your memoir and they'll start making up your childhood. Keep the job description tight.

What I learned: Small LLMs need small tasks. The 0.5B model can clean text. It cannot creatively rewrite. Know your model's limits and prompt accordingly.

Audio Playback: One More Gotcha

Generated the narration. Called AVAudioPlayer.play(). Silence.

Turns out iOS needs explicit permission to play audio:

let audioSession = AVAudioSession.sharedInstance()
try audioSession.setCategory(.playback, mode: .default)
try audioSession.setActive(true)

Without this, audio plays in simulator but not on device (unless headphones are connected). Another simulator-vs-device difference.

The Final Pipeline

Video Import
    |
[Extract Audio] --> AVAssetExportSession
    |
[Chunk Audio] --> 30-second segments
    |
[Transcribe] --> SFSpeechRecognizer (on-device + server fallback)
    |
[Enhance] --> MLX Swift + Qwen 0.5B (cleanup only)
    |
[Generate Speech] --> Kokoro TTS via mlx-audio
    |
[Compose Video] --> AVMutableComposition (original video + new audio)
    |
Export

Each step has its own service class. Each handles its own errors. The view coordinates them.

What Surprised Me

MLX Swift works. Once you're on a real device with Metal GPU, inference is fast. The 0.5B model runs in under a second for short texts.

iOS has a lot of ML built in. SFSpeechRecognizer, Vision framework, Natural Language framework. You can build surprisingly capable apps without any external models.

MLX needs Metal. Set up physical device testing early when working with on-device ML.

What I'd Do Differently

Explore ARM-native frameworks first. iOS has powerful built-in ML capabilities - SFSpeechRecognizer, Vision, Natural Language, Core ML. Understand what's already optimized for ARM before adding external models.
Simpler prompts first. Start with "clean this text" and only add complexity if the model handles it.
Test with real content. My test videos were clean screen recordings. Real videos have background noise, music, multiple speakers. Test with messy content early.

The Takeaway

On-device ML is powerful but unforgiving. The APIs exist. The models exist. But the gap between "works in simulator" and "works on device" is larger than I expected.

The reward: an app that processes video entirely locally. No cloud uploads. No API costs. No latency.

Worth the debugging sessions? Absolutely.

Tech Stack

Platform:      iOS 18+ (requires Metal GPU)
Language:      Swift
ML Framework:  MLX Swift
LLM:           Qwen 0.5B 4-bit (mlx-community)
Speech:        SFSpeechRecognizer (Apple)
TTS:           Kokoro via mlx-audio (server)
Video:         AVFoundation