<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Koustubh</title>
    <description>The latest articles on Forem by Koustubh (@koustubh).</description>
    <link>https://forem.com/koustubh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3594821%2F51594cd8-8b43-4605-9cf9-99e4bf0e4676.png</url>
      <title>Forem: Koustubh</title>
      <link>https://forem.com/koustubh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/koustubh"/>
    <language>en</language>
    <item>
      <title>What OpenClaw Teaches Us About Personal AI Security</title>
      <dc:creator>Koustubh</dc:creator>
      <pubDate>Sun, 08 Feb 2026 03:58:13 +0000</pubDate>
      <link>https://forem.com/koustubh/why-your-ai-assistant-should-run-at-home-lessons-from-openclaw-2o1n</link>
      <guid>https://forem.com/koustubh/why-your-ai-assistant-should-run-at-home-lessons-from-openclaw-2o1n</guid>
      <description>&lt;p&gt;OpenClaw and gharasathi are both local-first AI assistants. Both run on your own hardware. Both handle personal data. In February 2026, OpenClaw had a very bad month — and the lessons aren't what you might expect.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Context: gharasathi's Setup
&lt;/h2&gt;

&lt;p&gt;gharasathi runs on a ByteNUC mini PC using Talos Linux — an immutable, minimal OS built for Kubernetes. No SSH. No shell. No package manager.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRCCiAgICBzdWJncmFwaCBCeXRlTlVDWyJCeXRlTlVDICgkMjAwIMK3IDZHQiBSQU0gwrcgMlRCKSJdCiAgICAgICAgc3ViZ3JhcGggVGFsb3NbIlRhbG9zIExpbnV4IChpbW11dGFibGUpIl0KICAgICAgICAgICAgc3ViZ3JhcGggSzhzWyJLdWJlcm5ldGVzIl0KICAgICAgICAgICAgICAgIE5TMVsiYWFwbGEtbWFoaXRpc2F0aGE8YnIvPk5lbzRqIl0KICAgICAgICAgICAgICAgIE5TMlsiYWFwbGEtZGhhbjxici8-RmluYW5jZSAoR28pIl0KICAgICAgICAgICAgICAgIE5TM1siYWFwbGEtaHVzaGFyPGJyLz5MTE0gKFB5dGhvbiArIE9sbGFtYSkiXQogICAgICAgICAgICAgICAgTlM0WyJnaGFyYXNhdGhpLXdlYjxici8-UmVhY3QiXQogICAgICAgICAgICAgICAgTlM1WyJzdG9yYWdlPGJyLz5NaW5JTyJdCiAgICAgICAgICAgIGVuZAogICAgICAgIGVuZAogICAgZW5kCiAgICBQaG9uZVsiaVBob25lIl0gLS0-fCJMQU4gb25seSJ8IEs4cwogICAgTGFwdG9wWyJCcm93c2VyIl0gLS0-fCJMQU4gb25seSJ8IEs4cw%3D%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRCCiAgICBzdWJncmFwaCBCeXRlTlVDWyJCeXRlTlVDICgkMjAwIMK3IDZHQiBSQU0gwrcgMlRCKSJdCiAgICAgICAgc3ViZ3JhcGggVGFsb3NbIlRhbG9zIExpbnV4IChpbW11dGFibGUpIl0KICAgICAgICAgICAgc3ViZ3JhcGggSzhzWyJLdWJlcm5ldGVzIl0KICAgICAgICAgICAgICAgIE5TMVsiYWFwbGEtbWFoaXRpc2F0aGE8YnIvPk5lbzRqIl0KICAgICAgICAgICAgICAgIE5TMlsiYWFwbGEtZGhhbjxici8-RmluYW5jZSAoR28pIl0KICAgICAgICAgICAgICAgIE5TM1siYWFwbGEtaHVzaGFyPGJyLz5MTE0gKFB5dGhvbiArIE9sbGFtYSkiXQogICAgICAgICAgICAgICAgTlM0WyJnaGFyYXNhdGhpLXdlYjxici8-UmVhY3QiXQogICAgICAgICAgICAgICAgTlM1WyJzdG9yYWdlPGJyLz5NaW5JTyJdCiAgICAgICAgICAgIGVuZAogICAgICAgIGVuZAogICAgZW5kCiAgICBQaG9uZVsiaVBob25lIl0gLS0-fCJMQU4gb25seSJ8IEs4cwogICAgTGFwdG9wWyJCcm93c2VyIl0gLS0-fCJMQU4gb25seSJ8IEs4cw%3D%3D" alt="diagram" width="436" height="904"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LAN only. No port forwarding. No public IP. The only way to reach these services is to be on my home Wi-Fi.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happened to OpenClaw
&lt;/h2&gt;

&lt;p&gt;OpenClaw (formerly Clawdbot) is an open-source AI agent that went viral in early 2026 — 150K+ GitHub stars. It's &lt;strong&gt;also local-first&lt;/strong&gt;: you install it on your machine, it stores data locally, and it connects to LLMs for task automation.&lt;/p&gt;

&lt;p&gt;But OpenClaw is far more ambitious than gharasathi. It can execute shell commands, control browsers, send emails, and automate multi-step workflows. It integrates with WhatsApp, Telegram, and Slack. It has a community marketplace (ClawHub) for third-party skills. gharasathi just queries a database and explains the results in natural language.&lt;/p&gt;

&lt;p&gt;That difference in scope turned out to matter a lot. In February 2026, multiple security teams published findings within days of each other:&lt;/p&gt;

&lt;h3&gt;
  
  
  CVE-2026-25253: One-Click RCE (CVSS 8.8)
&lt;/h3&gt;

&lt;p&gt;A browser-based attack that let attackers hijack any OpenClaw instance — &lt;strong&gt;even ones running only on localhost&lt;/strong&gt;. A user visits a crafted webpage, JavaScript steals the gateway token via WebSocket, and the attacker gains full operator access: disable security features, escape Docker containers, execute arbitrary commands on the host. (&lt;a href="https://thehackernews.com/2026/02/openclaw-bug-enables-one-click-remote.html" rel="noopener noreferrer"&gt;The Hacker News&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;This is worth pausing on. OpenClaw was running locally. It was listening on localhost. And it was still compromised — because the victim's own browser initiated the connection. &lt;strong&gt;"Local" alone doesn't mean "secure."&lt;/strong&gt;&lt;/p&gt;
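&lt;p&gt;Why does the victim's own browser matter so much? Browsers attach an &lt;code&gt;Origin&lt;/code&gt; header to every WebSocket handshake but do not block cross-origin WebSocket connections, so a localhost service has to validate that header itself. Here's a minimal sketch of such a check (illustrative only, not code from OpenClaw or gharasathi; the allowlist is hypothetical):&lt;/p&gt;

```python
# Illustrative sketch, not code from OpenClaw or gharasathi.
# Browsers send an Origin header with every WebSocket handshake but do
# not block cross-origin WebSockets, so the server must reject origins
# it doesn't recognize before completing the upgrade.

from urllib.parse import urlparse

ALLOWED_HOSTS = {"localhost", "127.0.0.1"}  # hypothetical allowlist

def is_allowed_origin(origin_header):
    """True only if the handshake came from a page this host serves."""
    if not origin_header:      # non-browser clients omit Origin;
        return False           # safest default is an explicit match
    return urlparse(origin_header).hostname in ALLOWED_HOSTS

# A crafted page at attacker.example can open ws://localhost:PORT,
# but its Origin header gives it away:
assert is_allowed_origin("http://localhost:3000")
assert not is_allowed_origin("https://attacker.example")
```

&lt;p&gt;Origin checking is one mitigation among several; binding tokens to sessions and requiring authentication on the gateway close the rest of the gap.&lt;/p&gt;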

&lt;h3&gt;
  
  
  Misconfigured Instances Leaking Data
&lt;/h3&gt;

&lt;p&gt;Users who exposed their OpenClaw instances to the internet — against best practices — were found leaking API keys, chat histories, and credentials. Some had zero authentication. (&lt;a href="https://www.trendmicro.com/en_us/research/26/b/what-openclaw-reveals-about-agentic-assistants.html" rel="noopener noreferrer"&gt;Trend Micro&lt;/a&gt;)&lt;/p&gt;

&lt;h3&gt;
  
  
  341 Malicious Skills in ClawHub
&lt;/h3&gt;

&lt;p&gt;Security researchers found &lt;strong&gt;341 malicious packages&lt;/strong&gt; on ClawHub, OpenClaw's community marketplace. These impersonated legitimate tools but contained Atomic Stealer malware and ClawHavoc C2 implants — harvesting SSH keys, browser passwords, crypto wallet keys, and API tokens. (&lt;a href="https://thehackernews.com/2026/02/researchers-find-341-malicious-clawhub.html" rel="noopener noreferrer"&gt;The Hacker News&lt;/a&gt;, &lt;a href="https://www.theregister.com/2026/02/05/openclaw_skills_marketplace_leaky_security/" rel="noopener noreferrer"&gt;The Register&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Comparison
&lt;/h2&gt;

&lt;p&gt;It would be easy to say "gharasathi is local, therefore safe" — but OpenClaw is local too. The RCE vulnerability worked on localhost. "Runs on your machine" is not a security strategy.&lt;/p&gt;

&lt;p&gt;The actual differences that matter are about &lt;strong&gt;scope&lt;/strong&gt; and &lt;strong&gt;attack surface&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;OpenClaw&lt;/th&gt;
&lt;th&gt;gharasathi&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What can the AI do?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Execute code, send emails, control browsers, shell access&lt;/td&gt;
&lt;td&gt;Query Neo4j (read-only), explain results&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;External integrations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;WhatsApp, Telegram, Slack, browser automation&lt;/td&gt;
&lt;td&gt;None — LAN-only chat interface&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Plugin/skill marketplace&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ClawHub (2,800+ community skills)&lt;/td&gt;
&lt;td&gt;None — all code in monorepo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud API keys&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GPT-4, Claude, Gemini keys in local config&lt;/td&gt;
&lt;td&gt;None — local Ollama only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Blast radius if compromised&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full host access, all connected services&lt;/td&gt;
&lt;td&gt;Read access to household data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The RCE vulnerability (CVE-2026-25253) is a browser-based WebSocket attack. Could something similar affect gharasathi? Honestly — yes, in principle. Any service with a web interface is potentially vulnerable to browser-based attacks. I won't pretend otherwise.&lt;/p&gt;

&lt;p&gt;But the &lt;strong&gt;blast radius&lt;/strong&gt; is fundamentally different. If someone compromised OpenClaw, they got shell access, email sending, browser control, and every API key stored locally. If someone compromised gharasathi's chat interface, they'd get... read-only access to my grocery spending and photo metadata. The LLM can't execute code. It can't send messages. It can't modify data. It queries Neo4j and formats the response.&lt;/p&gt;

&lt;h3&gt;
  
  
  This Is a Trade-Off, Not a Win
&lt;/h3&gt;

&lt;p&gt;I want to be clear: OpenClaw's broader scope is exactly what makes it useful. People use it to automate real workflows — managing emails, booking flights, controlling browsers, orchestrating multi-step tasks across services. gharasathi can't do any of that. It answers questions about household data. That's it.&lt;/p&gt;

&lt;p&gt;OpenClaw with 150K+ GitHub stars solves problems that gharasathi doesn't even attempt. The shell access, the messaging integrations, the plugin marketplace — those features exist because users need them. Removing them isn't a security strategy anyone building a general-purpose AI agent can adopt.&lt;/p&gt;

&lt;p&gt;gharasathi gets away with a narrow scope because it was designed for a narrow purpose: surface household data in natural language. That's a fundamentally different ambition from what OpenClaw is trying to do, and comparing them on security alone misses the point. The security comparison is only meaningful if you understand that these tools serve very different needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Actually Learned
&lt;/h2&gt;

&lt;p&gt;Three takeaways from watching OpenClaw's February:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. "Local-first" is necessary but not sufficient.&lt;/strong&gt; Running on your own hardware avoids cloud data exposure, but it doesn't protect against browser-based attacks or supply chain compromises. Don't confuse deployment model with security model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Scope is your best defense.&lt;/strong&gt; The most effective security decision in gharasathi isn't the network topology — it's that the LLM can only read data and explain it. No code execution. No external integrations. No plugins. Every capability you add is attack surface you have to defend.&lt;/p&gt;
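&lt;p&gt;One way to make "the LLM can only read" a hard guarantee rather than a convention is to gate every query before it reaches the database. A sketch of the idea (illustrative, not gharasathi's actual code):&lt;/p&gt;

```python
# Illustrative sketch, not gharasathi's actual code: reject any Cypher
# containing a write clause before it ever reaches Neo4j, so the query
# path is read-only by construction rather than by convention.

import re

WRITE_CLAUSES = {"CREATE", "MERGE", "DELETE", "DETACH", "SET", "REMOVE", "DROP"}

def assert_read_only(cypher):
    """Return the query unchanged, or raise if it could write."""
    tokens = set(re.findall(r"[A-Za-z]+", cypher.upper()))
    forbidden = tokens.intersection(WRITE_CLAUSES)
    if forbidden:
        raise ValueError(f"write clause not allowed: {sorted(forbidden)}")
    return cypher

assert_read_only("MATCH (t:Transaction) RETURN sum(t.amount)")  # passes
try:
    assert_read_only("MATCH (t:Transaction) SET t.amount = 0")
except ValueError:
    pass  # rejected before reaching the database
```

&lt;p&gt;String filtering is a blunt instrument; a read-only database account, where your Neo4j edition supports one, enforces the same guarantee at the database itself and is the stronger layer.&lt;/p&gt;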

&lt;p&gt;&lt;strong&gt;3. Marketplaces need guardrails.&lt;/strong&gt; Plugin ecosystems are how tools like OpenClaw scale — 2,800 community skills is an incredible achievement. But 341 of them were malicious, and they ran with the same permissions as the core agent. The lesson isn't "don't have a marketplace" — it's that plugins touching personal data need sandboxing, permission scoping, and review processes that match the sensitivity of what they can access.&lt;/p&gt;
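&lt;p&gt;What might that permission scoping look like? A deliberately tiny, hypothetical sketch (neither ClawHub nor gharasathi works this way today; the skill names and capability strings are invented):&lt;/p&gt;

```python
# Hypothetical sketch; neither ClawHub nor gharasathi works this way
# today, and the skill names and capability strings are invented.
# Each skill declares up front what it may touch; the host denies the rest.

DECLARED_CAPABILITIES = {
    "weather-lookup": {"net.fetch"},
    "photo-tagger":   {"fs.read"},
}

def check_permission(skill, capability):
    """Grant a capability only if the skill declared it at install time."""
    return capability in DECLARED_CAPABILITIES.get(skill, set())

assert check_permission("weather-lookup", "net.fetch")
assert not check_permission("weather-lookup", "fs.read")   # undeclared
assert not check_permission("wallet-stealer", "fs.read")   # unknown skill
```

&lt;p&gt;The point is the shape, not the implementation: a skill that only declared "fetch weather" should never be able to read SSH keys, no matter what its code tries.&lt;/p&gt;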

&lt;h2&gt;
  
  
  Where This Stands
&lt;/h2&gt;

&lt;p&gt;gharasathi is still an MVP. The architecture works: graph database for structured household data, local LLM for natural language, Kubernetes for orchestration, all on a $200 mini PC.&lt;/p&gt;

&lt;p&gt;What's proven:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Neo4j handles household data relationships elegantly&lt;/li&gt;
&lt;li&gt;A small LLM (phi3:mini) works for structured query-and-explain&lt;/li&gt;
&lt;li&gt;K8s resource limits keep 5 services running in 6GB RAM&lt;/li&gt;
&lt;li&gt;Minimal scope limits blast radius even if something goes wrong&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What's not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The LLM can't do real analysis (needs hardware upgrade for larger models)&lt;/li&gt;
&lt;li&gt;Photo processing isn't implemented yet&lt;/li&gt;
&lt;li&gt;It's a one-household system, not a product&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Idea Scales
&lt;/h3&gt;

&lt;p&gt;The current MVP handles finances and photos, but the core idea — put your household data in a graph, give it a natural language interface — extends to almost anything: health records, home energy usage, kids' school schedules, vehicle maintenance, meal planning. Each new data domain is just another node type in Neo4j with relationships to what's already there. The query patterns stay the same: start at a node, walk relationships, explain the result.&lt;/p&gt;
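&lt;p&gt;Concretely, "another node type" is a couple of Cypher statements. The labels, properties, and relationships below are invented for illustration, not gharasathi's actual schema:&lt;/p&gt;

```python
# Hypothetical sketch: the labels, properties, and relationships below
# are invented for illustration, not gharasathi's actual schema. A new
# domain is one new node label plus relationships into the graph.

ADD_SERVICE = """
MERGE (v:Vehicle {vin: $vin})
MERGE (s:Service {date: $date, kind: $kind, cost: $cost})
MERGE (v)-[:RECEIVED]->(s)
"""

# The read side is the same start-at-a-node, walk-relationships
# pattern the finance queries already use:
LAST_SERVICE = """
MATCH (v:Vehicle {vin: $vin})-[:RECEIVED]->(s:Service)
RETURN s.date, s.kind, s.cost
ORDER BY s.date DESC LIMIT 1
"""
```

&lt;p&gt;Because the read side follows the existing walk-the-graph pattern, the LLM-facing pipeline doesn't change when a domain is added.&lt;/p&gt;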

&lt;p&gt;The interface scales too. Right now it's chat. But the same LangGraph pipeline that answers a typed question could answer a spoken one — swap the text input for speech-to-text and the response for text-to-speech. The architecture doesn't change. The hardware constraints are the real bottleneck for now, not the design.&lt;/p&gt;

&lt;p&gt;I'm not open-sourcing gharasathi. The codebase contains patterns specific to my household's financial accounts and data structure. Releasing it would require sanitizing all of that, and "privacy-first project accidentally leaks developer's banking patterns" is exactly the kind of irony I'd rather avoid.&lt;/p&gt;

&lt;p&gt;Your most personal data — your finances, your photos, your family's memories — deserves to stay personal. Not discoverable on Shodan. Not accessible through a marketplace plugin you didn't audit. Just yours, on a box in your garage.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 3 of a 3-part series. Use the series navigation above to read Part 1 (Architecture &amp;amp; Neo4j) and Part 2 (LLM Model Selection).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.trendmicro.com/en_us/research/26/b/what-openclaw-reveals-about-agentic-assistants.html" rel="noopener noreferrer"&gt;Trend Micro: What OpenClaw Reveals About Agentic Assistants&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://thehackernews.com/2026/02/openclaw-bug-enables-one-click-remote.html" rel="noopener noreferrer"&gt;The Hacker News: OpenClaw Bug Enables One-Click RCE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.theregister.com/2026/02/02/openclaw_security_issues/" rel="noopener noreferrer"&gt;The Register: OpenClaw Ecosystem Security Issues&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.theregister.com/2026/02/05/openclaw_skills_marketplace_leaky_security/" rel="noopener noreferrer"&gt;The Register: Easy to Backdoor OpenClaw Skills&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://thehackernews.com/2026/02/researchers-find-341-malicious-clawhub.html" rel="noopener noreferrer"&gt;The Hacker News: 341 Malicious ClawHub Skills&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.crowdstrike.com/en-us/blog/what-security-teams-need-to-know-about-openclaw-ai-super-agent/" rel="noopener noreferrer"&gt;CrowdStrike: What Security Teams Need to Know About OpenClaw&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>selfhosted</category>
      <category>security</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Running an LLM on 6GB RAM — Model Selection for Edge AI</title>
      <dc:creator>Koustubh</dc:creator>
      <pubDate>Sun, 08 Feb 2026 03:58:12 +0000</pubDate>
      <link>https://forem.com/koustubh/running-an-llm-on-6gb-ram-model-selection-for-edge-ai-3chi</link>
      <guid>https://forem.com/koustubh/running-an-llm-on-6gb-ram-model-selection-for-edge-ai-3chi</guid>
      <description>&lt;p&gt;The hardest part of building a self-hosted AI isn't the architecture — it's choosing the right LLM for your actual use case &lt;em&gt;and&lt;/em&gt; your actual hardware. A multimodal model that understands images is pointless when your task is explaining bank transactions. A model that tops benchmarks is useless if it takes 20 seconds to respond on your CPU. Here's what happened when I learned both lessons.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hardware
&lt;/h2&gt;

&lt;p&gt;The entire gharasathi system runs on a ByteNUC mini PC:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Intel (no discrete GPU)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6GB total&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Disk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2TB SSD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Talos Linux (immutable, K8s-native)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~$200&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;6GB RAM. That's it. And it needs to run the OS, Kubernetes, Neo4j, four microservices, &lt;em&gt;and&lt;/em&gt; an LLM.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Memory Budget
&lt;/h2&gt;

&lt;p&gt;Every megabyte matters. Here's how the 6GB is carved up, based on the actual K8s resource limits in my deployment manifests:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FcGllIHRpdGxlIDZHQiBSQU0gQnVkZ2V0CiAgICAiT2xsYW1hIChMTE0pIiA6IDMKICAgICJOZW80aiIgOiAxCiAgICAiRmFzdEFQSSArIFB5dGhvbiIgOiAxCiAgICAiYWFwbGEtZGhhbiAoR28pIiA6IDAuNQogICAgIks4cyArIE9TICsgb3RoZXIiIDogMC41" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FcGllIHRpdGxlIDZHQiBSQU0gQnVkZ2V0CiAgICAiT2xsYW1hIChMTE0pIiA6IDMKICAgICJOZW80aiIgOiAxCiAgICAiRmFzdEFQSSArIFB5dGhvbiIgOiAxCiAgICAiYWFwbGEtZGhhbiAoR28pIiA6IDAuNQogICAgIks4cyArIE9TICsgb3RoZXIiIDogMC41" alt="diagram" width="645" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;K8s requests&lt;/th&gt;
&lt;th&gt;K8s limits&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ollama sidecar&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2Gi&lt;/td&gt;
&lt;td&gt;3Gi&lt;/td&gt;
&lt;td&gt;LLM inference engine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FastAPI (aapla-hushar)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;512Mi&lt;/td&gt;
&lt;td&gt;1Gi&lt;/td&gt;
&lt;td&gt;Python service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Neo4j&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;512Mi&lt;/td&gt;
&lt;td&gt;1Gi&lt;/td&gt;
&lt;td&gt;Graph database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;aapla-dhan&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;256Mi&lt;/td&gt;
&lt;td&gt;512Mi&lt;/td&gt;
&lt;td&gt;Go finance service&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The LLM service gets ~3GB total (Ollama + FastAPI together in one K8s pod via sidecar pattern). After the Python runtime, FastAPI, and LangChain dependencies eat ~500MB, the actual model has roughly &lt;strong&gt;2.5GB&lt;/strong&gt; to work with.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Model Selection Journey
&lt;/h2&gt;

&lt;p&gt;I evaluated 15+ open-source models across four tiers. The question: what's the best LLM you can run in ~2.5GB?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBTdGFydFsiQnVkZ2V0OiB-Mi41R0IgZm9yIG1vZGVsIl0gLS0-IFExeyJGaXRzIGluIG1lbW9yeT8ifQogICAgUTEgLS0-fE5vfCBSZWplY3RbIlBoaS00IDE0QiDCtyBRd2VuMzo4Yjxici8-TWlzdHJhbCA3QiDCtyBMbGFtYSAzLjMiXQogICAgUTEgLS0-fFllc3wgUTJ7IkNvbnRleHQgd2luZG93PyJ9CiAgICBRMiAtLT58IjwgMzJLInwgU21hbGxbIlNtb2xMTTIgwrcgR2VtbWEgMi0yQiJdCiAgICBRMiAtLT58IjMySysifCBRM3siUmVhc29uaW5nIHF1YWxpdHk_In0KICAgIFEzIC0tPnxCYXNpY3wgTWlkWyJRd2VuMi41LTEuNUIiXQogICAgUTMgLS0-fFN0cm9uZ3wgV2lubmVyWyJUb3AgY2FuZGlkYXRlcyJd" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBTdGFydFsiQnVkZ2V0OiB-Mi41R0IgZm9yIG1vZGVsIl0gLS0-IFExeyJGaXRzIGluIG1lbW9yeT8ifQogICAgUTEgLS0-fE5vfCBSZWplY3RbIlBoaS00IDE0QiDCtyBRd2VuMzo4Yjxici8-TWlzdHJhbCA3QiDCtyBMbGFtYSAzLjMiXQogICAgUTEgLS0-fFllc3wgUTJ7IkNvbnRleHQgd2luZG93PyJ9CiAgICBRMiAtLT58IjwgMzJLInwgU21hbGxbIlNtb2xMTTIgwrcgR2VtbWEgMi0yQiJdCiAgICBRMiAtLT58IjMySysifCBRM3siUmVhc29uaW5nIHF1YWxpdHk_In0KICAgIFEzIC0tPnxCYXNpY3wgTWlkWyJRd2VuMi41LTEuNUIiXQogICAgUTMgLS0-fFN0cm9uZ3wgV2lubmVyWyJUb3AgY2FuZGlkYXRlcyJd" alt="diagram" width="693" height="930"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The top 5 candidates from that process (Gemma3-4B made the shortlist despite sitting over the ~2.5GB budget):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Key Strength&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen3:4b&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.5GB&lt;/td&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;Rivals 72B performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Phi-3-mini&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.2GB&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;Strong reasoning (Microsoft)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Llama 3.2-3B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.0GB&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;Instruction following&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen2.5-3B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.9GB&lt;/td&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;Good coding/math&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemma3-4B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.3GB&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;Multimodal (images)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The research was clear: &lt;strong&gt;Qwen3:4b&lt;/strong&gt; was the winner. 2.5GB, 256K context window, benchmarks rivaling 72B models. A no-brainer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reality Disagreed
&lt;/h2&gt;

&lt;p&gt;I deployed Qwen3:4b to the ByteNUC. It fit in memory. It loaded fine. Then I asked it a question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 1: Painfully slow.&lt;/strong&gt; On a CPU-only Intel NUC with 6GB RAM, Qwen3:4b took 15-20+ seconds for simple responses. For a household chat assistant, that's unusable. You ask "how much did I spend on groceries?" and wait long enough to check the bank app yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 2: Hallucinations in Chinese.&lt;/strong&gt; Qwen3 is a multilingual model supporting 100+ languages. On constrained hardware with limited context, the model would occasionally bleed into Chinese mid-response. Great for a multilingual product. Confusing for a household assistant that only needs English.&lt;/p&gt;

&lt;p&gt;The benchmarks didn't lie — Qwen3:4b is a remarkable model. But benchmarks run on A100 GPUs with 80GB VRAM, not on a $200 mini PC with shared system RAM and no GPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Downgrade
&lt;/h2&gt;

&lt;p&gt;I switched to &lt;strong&gt;phi3:mini&lt;/strong&gt; (Microsoft's Phi-3 Mini, 3.8B parameters, 2.2GB):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Qwen3:4b&lt;/th&gt;
&lt;th&gt;phi3:mini&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.5GB&lt;/td&gt;
&lt;td&gt;2.2GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speed on ByteNUC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;15-20s+&lt;/td&gt;
&lt;td&gt;3-7s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language stability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Occasional Chinese bleed&lt;/td&gt;
&lt;td&gt;Stable English&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Benchmark ranking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Actually usable?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;phi3:mini is less capable on paper. Smaller context window. Lower benchmark scores. But it responds in seconds, stays in English, and gives coherent answers about household data. That's what matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson: benchmarks ≠ real-world performance on constrained hardware.&lt;/strong&gt; Test on your actual target device, not on specs.&lt;/p&gt;
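&lt;p&gt;The test that would have caught this earlier is embarrassingly simple: time real prompts on the real box. A sketch (the &lt;code&gt;ask&lt;/code&gt; callable is a stand-in for whatever invokes the local model end-to-end):&lt;/p&gt;

```python
# Minimal sketch of testing on the target device instead of trusting
# benchmark tables. `ask` stands in for whatever calls the local model
# end-to-end; run this on the actual box with real prompts.

import time

def median_latency(ask, prompts):
    """Median wall-clock seconds per prompt."""
    samples = []
    for p in prompts:
        t0 = time.perf_counter()
        ask(p)
        samples.append(time.perf_counter() - t0)
    samples.sort()
    return samples[len(samples) // 2]

# Works against any callable; a stub stands in for the model here:
assert median_latency(len, ["how much", "did I", "spend"]) >= 0.0
```

&lt;p&gt;Five minutes of this on the ByteNUC would have disqualified Qwen3:4b before any benchmark comparison mattered.&lt;/p&gt;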

&lt;h3&gt;
  
  
  A Note on Choosing Models for Constrained Hardware
&lt;/h3&gt;

&lt;p&gt;A few things I wish I'd prioritized earlier in the evaluation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quantized models matter.&lt;/strong&gt; On hardware like this, quantized variants (Q4_0, Q4_K_M) are far more practical than full-precision weights. They reduce memory footprint and improve inference speed with minimal quality loss. All the Ollama models above are already quantized — that's how a 4B parameter model fits in 2.5GB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Match the model to the task, not the leaderboard.&lt;/strong&gt; A multimodal model that understands images is wasted when your use case is explaining bank transactions. A model with a 256K context window is overkill when your prompts are 200 tokens. Pick for what you actually need.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU inference is slow — plan for it.&lt;/strong&gt; Even with phi3:mini, a full request can take on the order of &lt;strong&gt;30 seconds&lt;/strong&gt; end-to-end on the ByteNUC: the 3-7 second figure above is model inference alone, and intent classification, the Neo4j query, and response generation add the rest. That's livable for a household assistant you check a few times a day, but it rules out anything conversational or real-time. If you need snappy responses, you need a GPU or a smaller model.&lt;/li&gt;
&lt;/ul&gt;
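&lt;p&gt;The arithmetic behind the first bullet is worth seeing once. Weights dominate the footprint, so a rough estimate is parameters times bits per weight:&lt;/p&gt;

```python
# Back-of-envelope sketch: weights dominate an LLM's footprint, so
# size is roughly parameters times bits per weight. Real footprints
# add KV cache, quantization scales, and runtime overhead.

def weight_gb(params, bits_per_weight):
    return params * bits_per_weight / 8 / 1e9

q4   = weight_gb(3.8e9, 4)    # phi3:mini at 4-bit: ~1.9 GB
fp16 = weight_gb(3.8e9, 16)   # same weights at half precision: ~7.6 GB

assert round(q4, 1) == 1.9
assert round(fp16, 1) == 7.6
```

&lt;p&gt;The gap between that 1.9GB estimate and phi3:mini's 2.2GB footprint is roughly the overhead the comment describes; the fp16 figure shows why unquantized weights were never an option on this box.&lt;/p&gt;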

&lt;h2&gt;
  
  
  Why the LLM Barely Matters
&lt;/h2&gt;

&lt;p&gt;Here's the counterintuitive insight that makes a smaller model viable: &lt;strong&gt;the LLM is just the natural language interface&lt;/strong&gt;. Neo4j does the heavy lifting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBVc2VyWyInSG93IG11Y2ggZGlkIEk8YnIvPnNwZW5kIG9uIGdyb2Nlcmllcz8nIl0gLS0-IENsYXNzaWZ5WyJLZXl3b3JkPGJyLz5jbGFzc2lmaWVyIl0KICAgIENsYXNzaWZ5IC0tPnxmaW5hbmNlfCBUb29sWyJGaW5hbmNlIHRvb2xzPGJyLz4oZGV0ZXJtaW5pc3RpYykiXQogICAgVG9vbCAtLT58Q3lwaGVyIHF1ZXJ5fCBEQlsoIk5lbzRqIildCiAgICBEQiAtLT58c3RydWN0dXJlZCBkYXRhfCBMTE1bInBoaTM6bWluaTxici8-KGV4cGxhaW4gcmVzdWx0cykiXQogICAgTExNIC0tPiBVc2VyMlsiJ1lvdSBzcGVudCAkODQ3IG9uPGJyLz5ncm9jZXJpZXMgdGhpcyBtb250aCciXQ%3D%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBVc2VyWyInSG93IG11Y2ggZGlkIEk8YnIvPnNwZW5kIG9uIGdyb2Nlcmllcz8nIl0gLS0-IENsYXNzaWZ5WyJLZXl3b3JkPGJyLz5jbGFzc2lmaWVyIl0KICAgIENsYXNzaWZ5IC0tPnxmaW5hbmNlfCBUb29sWyJGaW5hbmNlIHRvb2xzPGJyLz4oZGV0ZXJtaW5pc3RpYykiXQogICAgVG9vbCAtLT58Q3lwaGVyIHF1ZXJ5fCBEQlsoIk5lbzRqIildCiAgICBEQiAtLT58c3RydWN0dXJlZCBkYXRhfCBMTE1bInBoaTM6bWluaTxici8-KGV4cGxhaW4gcmVzdWx0cykiXQogICAgTExNIC0tPiBVc2VyMlsiJ1lvdSBzcGVudCAkODQ3IG9uPGJyLz5ncm9jZXJpZXMgdGhpcyBtb250aCciXQ%3D%3D" alt="diagram" width="1448" height="94"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The architecture follows a &lt;strong&gt;"tools first, LLM explains"&lt;/strong&gt; pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Intent classification&lt;/strong&gt; is deterministic keyword matching — no LLM needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data retrieval&lt;/strong&gt; uses specialized tools that run Cypher queries against Neo4j&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The LLM only translates&lt;/strong&gt;: it turns pre-fetched structured results into a natural-language response&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The LLM never calculates, never queries, never decides what data to fetch. It receives pre-fetched results and writes a human-readable response. For that job, phi3:mini is more than sufficient.&lt;/p&gt;

&lt;p&gt;From the actual code (&lt;code&gt;aapla-hushar/src/agents/graph.py&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Deterministic routing — no LLM involved
&lt;/span&gt;&lt;span class="n"&gt;finance_keywords&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spend&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expense&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;budget&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transaction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;last_message&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;finance_keywords&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Tools fetch data first, then LLM explains
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_spending_summary&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Cypher query to Neo4j
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a financial assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ONLY interpret this data:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Ollama sidecar runs alongside FastAPI in the same K8s pod — localhost communication, zero network latency between the Python service and the LLM.&lt;/p&gt;
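&lt;p&gt;As a sketch of what that localhost hop looks like from the Python side (the endpoint shape is Ollama's standard &lt;code&gt;/api/generate&lt;/code&gt;; the helper names and model default are my assumptions, not the project's code):&lt;/p&gt;

```python
# Minimal sketch: a Python service talking to an Ollama sidecar in the
# same pod. The call never leaves localhost. Helper names are hypothetical.
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"  # sidecar, same pod

def build_payload(prompt: str, model: str = "phi3:mini") -> bytes:
    """JSON body for Ollama's /api/generate endpoint (non-streaming)."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ask(prompt: str) -> str:
    """Send the prompt to the sidecar and return the generated text."""
    req = request.Request(OLLAMA_URL, data=build_payload(prompt),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```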

&lt;h2&gt;
  
  
  Future: Better LLM Requires Better Hardware
&lt;/h2&gt;

&lt;p&gt;The current setup works for structured queries against Neo4j. But for actual data &lt;em&gt;analysis&lt;/em&gt; — "what trends do you see in my spending?" or "suggest ways to save money" — the model needs more reasoning capability and a larger context window.&lt;/p&gt;

&lt;p&gt;That means upgrading the ByteNUC's RAM first: a 7B or 8B model needs 8-16GB of total system memory. Until then, phi3:mini handles the structured query-and-explain pattern well.&lt;/p&gt;
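&lt;p&gt;The sizing is rough arithmetic: quantized weights cost roughly half a gigabyte per billion parameters at 4-bit, plus runtime overhead. A back-of-envelope sketch (the overhead factor is an assumption, not a measurement):&lt;/p&gt;

```python
# Back-of-envelope RAM estimate for a quantized local model.
# weights_gb ~= params (billions) * bits_per_weight / 8; the 1.2x factor
# for KV cache and runtime overhead is an assumption, not a measurement.
def estimated_ram_gb(params_billion: float, bits_per_weight: int = 4,
                     overhead: float = 1.2) -> float:
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb * overhead, 1)

print(estimated_ram_gb(3.8))  # phi3:mini-class model at 4-bit: ~2.3
print(estimated_ram_gb(8.0))  # an 8B model at 4-bit: ~4.8
```

&lt;p&gt;At 4-bit quantization an 8B model's working set approaches 5GB, leaving almost nothing on a 6GB box once the OS, K8s, and Neo4j take their share.&lt;/p&gt;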

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The model runs. The queries work. But is "runs locally" enough to keep your data safe?&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Part 3&lt;/strong&gt;, I look at what happened to OpenClaw — another local-first AI assistant — and why the real security lesson isn't about where your AI runs, but what it's allowed to do.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 2 of a 3-part series. Use the series navigation above to read Part 1 (Architecture &amp;amp; Neo4j) and Part 3 (What OpenClaw Teaches Us).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>selfhosted</category>
      <category>llm</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>gharasathi (घरासाठी) — A Privacy-First Household AI Running on a $200 Mini PC</title>
      <dc:creator>Koustubh</dc:creator>
      <pubDate>Sun, 08 Feb 2026 03:58:11 +0000</pubDate>
      <link>https://forem.com/koustubh/gharasathi-ghraasaatthii-a-privacy-first-household-ai-running-on-a-200-mini-pc-4hnm</link>
      <guid>https://forem.com/koustubh/gharasathi-ghraasaatthii-a-privacy-first-household-ai-running-on-a-200-mini-pc-4hnm</guid>
      <description>&lt;p&gt;gharasathi ("for home" in Marathi) is a privacy-first household AI that connects finances, photos, and memories — running entirely on a $200 mini PC in my garage. No cloud. No subscriptions. No data leaving the house.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Household data is scattered everywhere. Bank transactions in three different apps. Photos split across iCloud and Google Photos. Bills in email inboxes or individual company portals. Memories in your head.&lt;/p&gt;

&lt;p&gt;Every "smart" assistant that promises to unify this — Alexa, Google Home, ChatGPT — requires shipping your most intimate data to someone else's servers. Your spending patterns. Your family photos. Your location history. All flowing through infrastructure you don't control, governed by privacy policies that change quarterly.&lt;/p&gt;

&lt;p&gt;I wanted something different: a private AI that ties all household data together and runs entirely on my home network. Something I could ask "How much did we spend during the Christmas trip?" and get an answer by traversing actual data, not hallucinating one. Something where the answer to "where is my data?" is always "in the living room."&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The system is a set of microservices on Kubernetes. All services are named in Marathi — a language spoken in western India.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBDbGllbnRzCiAgICAgICAgaU9TWyJnaGFyYXNhdGhpLWlvczxici8-KFN3aWZ0KSJdCiAgICAgICAgV2ViWyJnaGFyYXNhdGhpLXdlYjxici8-KFJlYWN0KSJdCiAgICBlbmQKICAgIHN1YmdyYXBoIEs4c1siVGFsb3MgSzhzIMK3IEJ5dGVOVUMiXQogICAgICAgIExMTVsiYWFwbGEtaHVzaGFyPGJyLz5QeXRob24gwrcgTGFuZ0dyYXBoIl0KICAgICAgICBGaW5hbmNlWyJhYXBsYS1kaGFuPGJyLz5HbyJdCiAgICAgICAgTWVtb3JpZXNbImFhcGx5YS1hdGh2YW5pPGJyLz5HbyJdCiAgICAgICAgREJbKCJOZW80aiIpXQogICAgICAgIE9CSlsoIk1pbklPIildCiAgICBlbmQKICAgIGlPUyAmIFdlYiA8LS0-IExMTQogICAgTExNIC0tPiBEQgogICAgRmluYW5jZSAtLT4gREIKICAgIE1lbW9yaWVzIC0tPiBEQgogICAgTWVtb3JpZXMgLS0-IE9CSgogICAgRmluYW5jZSAtLi0-fENEUiBBUEl8IEJhbmtbIkJhbmsgQVBJcyJd" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBDbGllbnRzCiAgICAgICAgaU9TWyJnaGFyYXNhdGhpLWlvczxici8-KFN3aWZ0KSJdCiAgICAgICAgV2ViWyJnaGFyYXNhdGhpLXdlYjxici8-KFJlYWN0KSJdCiAgICBlbmQKICAgIHN1YmdyYXBoIEs4c1siVGFsb3MgSzhzIMK3IEJ5dGVOVUMiXQogICAgICAgIExMTVsiYWFwbGEtaHVzaGFyPGJyLz5QeXRob24gwrcgTGFuZ0dyYXBoIl0KICAgICAgICBGaW5hbmNlWyJhYXBsYS1kaGFuPGJyLz5HbyJdCiAgICAgICAgTWVtb3JpZXNbImFhcGx5YS1hdGh2YW5pPGJyLz5HbyJdCiAgICAgICAgREJbKCJOZW80aiIpXQogICAgICAgIE9CSlsoIk1pbklPIildCiAgICBlbmQKICAgIGlPUyAmIFdlYiA8LS0-IExMTQogICAgTExNIC0tPiBEQgogICAgRmluYW5jZSAtLT4gREIKICAgIE1lbW9yaWVzIC0tPiBEQgogICAgTWVtb3JpZXMgLS0-IE9CSgogICAgRmluYW5jZSAtLi0-fENEUiBBUEl8IEJhbmtbIkJhbmsgQVBJcyJd" alt="diagram" width="786" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;aapla-dhan&lt;/strong&gt; (आपलं धन)&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;Finance — syncs bank transactions, loans, investments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;aaplya-athvani&lt;/strong&gt; (आपल्या आठवणी)&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;Memories — photo sync, tagging, storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;aapla-hushar&lt;/strong&gt; (आपला हुशार)&lt;/td&gt;
&lt;td&gt;Python + LangGraph&lt;/td&gt;
&lt;td&gt;AI chat — intent routing, agents, Ollama sidecar&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;aapla-mahitisatha&lt;/strong&gt; (आपला माहिती साठा)&lt;/td&gt;
&lt;td&gt;Neo4j&lt;/td&gt;
&lt;td&gt;Graph database for all structured data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;gharasathi-ios&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Swift/SwiftUI&lt;/td&gt;
&lt;td&gt;Native iOS app&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;gharasathi-web&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;React/TypeScript&lt;/td&gt;
&lt;td&gt;Browser interface&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Everything runs on a single &lt;a href="https://www.amazon.com.au/Windows-Computer-Supporting-Ethernet-BYTENUC/dp/B09GK2QSY1" rel="noopener noreferrer"&gt;ByteNUC mini PC&lt;/a&gt; — 6GB RAM, 2TB disk, &lt;strong&gt;no GPU&lt;/strong&gt; — running Talos Linux. The entire stack — OS, K8s, database, LLM, and all services — fits in that 6GB with CPU-only inference.&lt;/p&gt;

&lt;p&gt;A core design principle: services &lt;em&gt;write&lt;/em&gt; data to Neo4j, and the LLM &lt;em&gt;reads&lt;/em&gt; from it. The AI layer never modifies your data — it only queries and explains. This separation means the LLM can be swapped, restarted, or upgraded without any risk to your actual records.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Neo4j?
&lt;/h2&gt;

&lt;p&gt;Household data is inherently a graph. People own accounts. Accounts generate transactions. Transactions happen at places. Photos feature people at events. Events are held at places. Places contain other places.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZ3JhcGggTFIKICAgIFBbUGVyc29uXSAtLT58T1dOU3wgQVtBY2NvdW50XQogICAgQSAtLT58U09VUkNFX09GfCBUW1RyYW5zYWN0aW9uXQogICAgVCAtLT58T0NDVVJSRURfQVR8IFBMW1BsYWNlXQogICAgVCAtLT58UkVMQVRFRF9UT3wgRVtFdmVudF0KICAgIFBIW1Bob3RvXSAtLT58RkVBVFVSRVN8IFAKICAgIFBIIC0tPnxUQUtFTl9BVHwgUEwKICAgIFBIIC0tPnxQQVJUX09GfCBFCiAgICBQIC0tPnxBVFRFTkRFRHwgRQogICAgRSAtLT58SEVMRF9BVHwgUEwKICAgIFBMIC0tPnxMT0NBVEVEX0lOfCBQTDJbUmVnaW9uXQ%3D%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZ3JhcGggTFIKICAgIFBbUGVyc29uXSAtLT58T1dOU3wgQVtBY2NvdW50XQogICAgQSAtLT58U09VUkNFX09GfCBUW1RyYW5zYWN0aW9uXQogICAgVCAtLT58T0NDVVJSRURfQVR8IFBMW1BsYWNlXQogICAgVCAtLT58UkVMQVRFRF9UT3wgRVtFdmVudF0KICAgIFBIW1Bob3RvXSAtLT58RkVBVFVSRVN8IFAKICAgIFBIIC0tPnxUQUtFTl9BVHwgUEwKICAgIFBIIC0tPnxQQVJUX09GfCBFCiAgICBQIC0tPnxBVFRFTkRFRHwgRQogICAgRSAtLT58SEVMRF9BVHwgUEwKICAgIFBMIC0tPnxMT0NBVEVEX0lOfCBQTDJbUmVnaW9uXQ%3D%3D" alt="diagram" width="1628" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The schema has &lt;strong&gt;6 node types&lt;/strong&gt; and &lt;strong&gt;17 relationship types&lt;/strong&gt; — from &lt;code&gt;OWNS&lt;/code&gt; and &lt;code&gt;SOURCE_OF&lt;/code&gt; to &lt;code&gt;FEATURES&lt;/code&gt;, &lt;code&gt;SIMILAR_TO&lt;/code&gt;, and &lt;code&gt;LOCATED_IN&lt;/code&gt;. This density of connections is exactly what graph databases excel at.&lt;/p&gt;

&lt;p&gt;Consider what "our Sydney trip" means in data terms. It's an Event node connected to: Transaction nodes (what we spent), Photo nodes (what we captured), Person nodes (who went), and Place nodes (where we went) — which themselves link upward via &lt;code&gt;LOCATED_IN&lt;/code&gt; to "Sydney" to "NSW" to "Australia." In a relational database, that's a normalized nightmare. In a graph, it's just… the shape of the data.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Query That Sells It
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;"How much did we spend during the Christmas trip?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In a relational database, this requires JOINing across 4+ tables: events, transactions, places, and a trip-transaction mapping table. In Neo4j, it's a single traversal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;e:&lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;e.title&lt;/span&gt; &lt;span class="ow"&gt;CONTAINS&lt;/span&gt; &lt;span class="s1"&gt;'Christmas'&lt;/span&gt; &lt;span class="ow"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;e.startDate.year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2025&lt;/span&gt;
&lt;span class="k"&gt;OPTIONAL&lt;/span&gt; &lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:PART_OF&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;photo:&lt;/span&gt;&lt;span class="n"&gt;Photo&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;OPTIONAL&lt;/span&gt; &lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:RELATED_TO&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;t:&lt;/span&gt;&lt;span class="n"&gt;Transaction&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;OPTIONAL&lt;/span&gt; &lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:ATTENDED&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;person:&lt;/span&gt;&lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;e.title&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;person.name&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;attendees&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;photo&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;photos&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t.amount&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;totalSpent&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One query. It starts at the Christmas event node, walks outward along relationships, and gathers everything connected: who attended, how many photos were taken, and the total cost. No JOINs. No subqueries. The relationships &lt;em&gt;are&lt;/em&gt; the schema.&lt;/p&gt;

&lt;p&gt;This pattern repeats across every use case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"What are my top spending categories?"&lt;/strong&gt; — aggregate Transaction nodes by category&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Show photos from the reef trip"&lt;/strong&gt; — traverse &lt;code&gt;Photo → PART_OF → Event&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Find photos similar to this one"&lt;/strong&gt; — vector similarity search on embeddings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"What did we do last year?"&lt;/strong&gt; — walk Events by date, gather connected Transactions and Photos&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Show spending by location"&lt;/strong&gt; — traverse &lt;code&gt;Transaction → OCCURRED_AT → Place&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The query file has 25+ patterns covering financial analysis, photo search, people lookup, event timelines, and cross-domain insights. Every one of them follows the same shape: start at a node, walk relationships, aggregate what you find.&lt;/p&gt;
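&lt;p&gt;That common shape is easy to see even outside Cypher. A toy in-memory version of "start at a node, walk relationships, aggregate" (made-up data, purely illustrative):&lt;/p&gt;

```python
# Toy adjacency list standing in for the graph: each neighbour of an
# event is (relationship_type, (node_label, node_value)). Data is made up.
edges = {
    ("Event", "Christmas Trip"): [
        ("RELATED_TO", ("Transaction", 120.0)),
        ("RELATED_TO", ("Transaction", 80.5)),
        ("ATTENDED", ("Person", "Asha")),
        ("ATTENDED", ("Person", "Ravi")),
    ],
}

def summarize(event: str) -> dict:
    """Start at the event node, walk its relationships, aggregate."""
    neighbours = edges[("Event", event)]
    spent = sum(n[1] for _, n in neighbours if n[0] == "Transaction")
    people = [n[1] for _, n in neighbours if n[0] == "Person"]
    return {"totalSpent": spent, "attendees": people}

print(summarize("Christmas Trip"))
```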

&lt;h3&gt;
  
  
  Vector Search Built In
&lt;/h3&gt;

&lt;p&gt;Neo4j 5.11+ supports native vector indexes. Every Photo, Transaction, and Event node carries a 1536-dimension embedding vector. This enables semantic search without a separate vector database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;db.index.vector.queryNodes&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'photo_embedding'&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p.embedding&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;YIELD&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;node.filename&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node.aiDescription&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Find sunset photos" doesn't need exact keyword matching — it searches by meaning. No separate Pinecone or Weaviate instance needed — one database handles both structured queries and semantic search.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dual Storage: Graph + Object
&lt;/h3&gt;

&lt;p&gt;One thing Neo4j &lt;em&gt;shouldn't&lt;/em&gt; store is binary files. Photos and videos go to MinIO — a self-hosted, S3-compatible object store. Neo4j holds the metadata (who's in the photo, where it was taken, AI-generated description, embedding vector) while MinIO holds the actual JPEG. The Photo node's &lt;code&gt;storagePath&lt;/code&gt; property links the two.&lt;/p&gt;

&lt;p&gt;This keeps Neo4j lean — critical when you're running it in 1GB of RAM — while MinIO happily stores terabytes of photos on the 2TB disk.&lt;/p&gt;
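&lt;p&gt;A minimal sketch of that split: a content-addressed object key plus a metadata record whose &lt;code&gt;storagePath&lt;/code&gt; points at it. The bucket layout and key scheme here are illustrative, not the project's real ones:&lt;/p&gt;

```python
# Sketch of the graph + object split: the metadata record is what the
# graph database would hold; storagePath points into the object store.
# Key layout is hypothetical, not the project's actual scheme.
import hashlib

def photo_record(filename: str, data: bytes, people: list) -> dict:
    # Content-addressed key keeps the object store deduplicated.
    key = hashlib.sha256(data).hexdigest()[:16]
    return {
        "filename": filename,
        "features": people,                  # graph relationships
        "storagePath": f"photos/{key}.jpg",  # pointer into the object store
        "sizeBytes": len(data),              # the binary itself never enters the graph
    }

rec = photo_record("sunset.jpg", b"...jpeg bytes...", ["Asha"])
print(rec["storagePath"])
```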

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The architecture makes sense on paper. But how do you run Neo4j + an LLM + 4 microservices on a machine with &lt;strong&gt;only 6GB RAM&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Part 2&lt;/strong&gt;, I cover the model selection journey — where research recommended one model, reality disagreed, and I had to learn the hard way that benchmarks don't mean much on constrained hardware.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 1 of a 3-part series on building a privacy-first household AI. Use the series navigation above to read Part 2 (LLM Model Selection) and Part 3 (Privacy &amp;amp; Lessons from OpenClaw).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>selfhosted</category>
      <category>neo4j</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Building Apps with AI: Deep Dive into beads Workflow</title>
      <dc:creator>Koustubh</dc:creator>
      <pubDate>Tue, 20 Jan 2026 22:25:56 +0000</pubDate>
      <link>https://forem.com/koustubh/building-apps-with-ai-deep-dive-into-beads-workflow-28h1</link>
      <guid>https://forem.com/koustubh/building-apps-with-ai-deep-dive-into-beads-workflow-28h1</guid>
      <description>&lt;h1&gt;
  
  
  Building Apps with AI: Deep Dive into beads Workflow
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Part 2 of 2: JSONL Memory, Real Examples, and Honest Drawbacks&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Recap
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="//./part1-introduction-to-beads.md"&gt;Part 1&lt;/a&gt;, I introduced &lt;code&gt;beads&lt;/code&gt; — a git-native issue tracker designed for AI-assisted development. We looked at the Mission House app and the basic workflow. Now let’s go a bit deeper.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scope and Assumptions
&lt;/h2&gt;

&lt;p&gt;This post reflects a &lt;strong&gt;solo, AI-assisted development workflow&lt;/strong&gt; on a small but non-trivial codebase (dozens of tasks, explicit dependencies, multiple external APIs).&lt;/p&gt;

&lt;p&gt;Assumptions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The AI agent has read access to the full issue graph&lt;/li&gt;
&lt;li&gt;Execution efficiency matters more than prolonged design deliberation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For ambiguous product discovery, multi-team coordination, or regulated environments, spec-driven approaches may be a better &lt;em&gt;first&lt;/em&gt; step.&lt;/p&gt;




&lt;h2&gt;
  
  
  The JSONL Advantage: Compact and Queryable
&lt;/h2&gt;

&lt;p&gt;Every beads issue is stored as a single line of JSON in &lt;code&gt;.beads/issues.jsonl&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mission-house-ogp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Implement myschool.edu.au scraper"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Create Puppeteer-based scraper..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"closed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"priority"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"close_reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NAPLAN scraper implemented in server.js"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"depends_on_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mission-house-5mv"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compare this to a typical markdown task file that might span dozens of lines with headers, descriptions, and nested checklists for the same information.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Compact Matters
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtlitj4buzjvu4u10ebq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtlitj4buzjvu4u10ebq.png" alt="Compact vs verbose comparison" width="328" height="1356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The AI gets structured data it can query, not prose it must interpret:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;bd ready&lt;/code&gt;&lt;/strong&gt; - What's unblocked and highest priority?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;bd blocked&lt;/code&gt;&lt;/strong&gt; - What's waiting on other work?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;bd show &amp;lt;id&amp;gt;&lt;/code&gt;&lt;/strong&gt; - Full details on one issue&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;bd stats&lt;/code&gt;&lt;/strong&gt; - Project health at a glance&lt;/li&gt;
&lt;/ul&gt;
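&lt;p&gt;To make "structured data it can query" concrete, here is an illustrative sketch of what a &lt;code&gt;bd ready&lt;/code&gt;-style query could compute from the JSONL file — open issues whose dependencies are all closed. This is not beads' actual implementation:&lt;/p&gt;

```python
# Illustrative only: compute "ready" issues (open, all blockers closed)
# from a JSONL issue file, one JSON object per line.
import json

def ready_issues(jsonl_lines):
    issues = [json.loads(line) for line in jsonl_lines if line.strip()]
    status = {i["id"]: i["status"] for i in issues}
    for i in issues:
        deps = [d["depends_on_id"] for d in i.get("dependencies", [])]
        if i["status"] == "open" and all(status.get(d) == "closed" for d in deps):
            yield i["id"]

lines = [
    '{"id": "a", "status": "closed"}',
    '{"id": "b", "status": "open", "dependencies": [{"depends_on_id": "a"}]}',
    '{"id": "c", "status": "open", "dependencies": [{"depends_on_id": "b"}]}',
]
print(list(ready_issues(lines)))  # b is unblocked; c still waits on b
```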

&lt;h3&gt;
  
  
  Close Reasons: Implementation Memory
&lt;/h3&gt;

&lt;p&gt;When you close an issue, you document what was actually built:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bd close mission-house-ogp &lt;span class="nt"&gt;--reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"NAPLAN scraper implemented in server.js, handles terms acceptance and score extraction"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not just status — it’s ground truth.&lt;/p&gt;

&lt;p&gt;Specs capture intent.&lt;br&gt;
Close reasons capture reality.&lt;/p&gt;
&lt;h3&gt;
  
  
  Real Example: Session Continuity
&lt;/h3&gt;

&lt;p&gt;Here's what happened when I resumed work on NAPLAN scoring after a break:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session 1 (ended with):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bd close mission-house-ogp &lt;span class="nt"&gt;--reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"NAPLAN scores integration complete: scraper implemented in server.js"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Session 2 (started with):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bd ready

mission-house-6t1 &lt;span class="o"&gt;[&lt;/span&gt;P2] &lt;span class="o"&gt;[&lt;/span&gt;task] open - Display NAPLAN scores &lt;span class="k"&gt;in &lt;/span&gt;UI
  └─ Blocked by: mission-house-ogp &lt;span class="o"&gt;(&lt;/span&gt;closed&lt;span class="o"&gt;)&lt;/span&gt;, mission-house-0ch &lt;span class="o"&gt;(&lt;/span&gt;closed&lt;span class="o"&gt;)&lt;/span&gt;
  └─ All blockers resolved - ready to work!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude immediately knew:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The scraper was done (from ogp's close reason)&lt;/li&gt;
&lt;li&gt;The schema was updated (from 0ch's close reason)&lt;/li&gt;
&lt;li&gt;The next logical step was UI display&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;No manual context re-establishment was needed&lt;/strong&gt;, because dependencies and implementation details were already encoded.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hierarchy: Epics → Features → Tasks
&lt;/h2&gt;

&lt;p&gt;We organized Mission House using a three-level hierarchy:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx8y1flixp9hmrho9yyt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx8y1flixp9hmrho9yyt.png" alt="Epics Features Tasks hierarchy" width="784" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Structure Works
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Typical Count&lt;/th&gt;
&lt;th&gt;Lifetime&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Epic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Strategic goal, multiple sessions&lt;/td&gt;
&lt;td&gt;2-5 per project&lt;/td&gt;
&lt;td&gt;Weeks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Feature&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User-facing capability&lt;/td&gt;
&lt;td&gt;5-15 per epic&lt;/td&gt;
&lt;td&gt;Days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Task&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single implementation unit&lt;/td&gt;
&lt;td&gt;3-10 per feature&lt;/td&gt;
&lt;td&gt;Hours&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The AI works at the &lt;strong&gt;task level&lt;/strong&gt; but understands the &lt;strong&gt;feature and epic context&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real Issues from Mission House
&lt;/h2&gt;

&lt;p&gt;Let me show you actual issues from our project to illustrate different patterns:&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 1: Task with Clear Dependencies
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mission-house-73p"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Calculate Flinders Street Station travel time"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Calculate travel time to Flinders Street Station during peak hours on a working day from: (a) the nearest train station, (b) the property address directly."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"closed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"priority"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"issue_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"close_reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Implemented MapsService.getTravelToFlinders() with peak hour scheduling. Calculates transit, driving, walking routes from property and via nearest station"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"depends_on_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mission-house-utk"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"blocks"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What the AI learned from this:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can't calculate commute until "Find nearest train station" (utk) is done&lt;/li&gt;
&lt;li&gt;Implementation went into &lt;code&gt;MapsService.getTravelToFlinders()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Peak hour scheduling was added&lt;/li&gt;
&lt;li&gt;Multiple route types were implemented&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pattern 2: Bug with Acceptance Criteria
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mission-house-v4e"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Fix naplan score web scraping logic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Naplan score web scraping not working as expected. Check the requirements document"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"acceptance_criteria"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Naplan score written in json file as in the requirements document"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"closed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"priority"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"issue_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bug"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"close_reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Implemented naplan_quality metric: added benchmark constants, quality calculation function, and UI display in both hub-spoke view and compare page radar chart"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Priority 0 (P0)&lt;/strong&gt; is the highest priority level. The AI knew to work on this first.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 3: Tombstone (Deleted Issue)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mission-house-ck6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Implement spider/radar chart visualization"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tombstone"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"deleted_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-01-17T23:15:21.909135+11:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"deleted_by"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"batch delete"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"delete_reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"batch delete"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"original_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"task"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tombstones preserve history while removing clutter. The AI knows this was deleted and won't try to work on it.&lt;/p&gt;
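&lt;p&gt;As a minimal sketch of how a consumer might honor tombstones when reading &lt;code&gt;issues.jsonl&lt;/code&gt; directly (the helper below is illustrative, not bd's own code):&lt;/p&gt;

```python
import json

# Illustrative helper (not part of bd): parse issues.jsonl lines and skip
# tombstones, so deleted issues stay in history but out of the working view.
def active_issues(jsonl_lines):
    issues = [json.loads(line) for line in jsonl_lines if line.strip()]
    return [i for i in issues if i.get("status") != "tombstone"]

lines = [
    '{"id": "mission-house-ogp", "status": "closed"}',
    '{"id": "mission-house-ck6", "status": "tombstone"}',
]
print([i["id"] for i in active_issues(lines)])  # prints ['mission-house-ogp']
```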




&lt;h2&gt;
  
  
  beads vs. agent-os (an SDD Framework)
&lt;/h2&gt;

&lt;p&gt;Spec-Driven Development (SDD) is a methodology - different tools implement it differently. Let's compare beads to &lt;a href="https://buildermethods.com/agent-os/workflow" rel="noopener noreferrer"&gt;agent-os&lt;/a&gt;, one popular SDD framework.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: This comparison is specific to agent-os. Other SDD implementations may work differently.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Two Different Philosophies
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;agent-os&lt;/strong&gt; follows a six-phase workflow:&lt;/p&gt;

&lt;p&gt;Plan Product → Shape Spec → Write Spec → Create Tasks → Implement → Orchestrate&lt;/p&gt;

&lt;p&gt;It uses layered context (Standards/Product/Specs) in markdown files. Tasks are &lt;em&gt;derived from specs&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;beads&lt;/strong&gt; is task-first:&lt;/p&gt;

&lt;p&gt;Create issues → Add dependencies → Run &lt;code&gt;bd ready&lt;/code&gt; → Implement&lt;/p&gt;

&lt;p&gt;No ongoing spec phase was required. A lightweight requirements document seeded the task graph, after which dependencies were tracked as explicit graph edges rather than implied through prose.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0ry92oa6txf79e6hbxm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0ry92oa6txf79e6hbxm.png" alt="agent-os vs beads comparison" width="784" height="53"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparison Table (agent-os vs beads)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;agent-os&lt;/th&gt;
&lt;th&gt;beads&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Philosophy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Spec-first&lt;/td&gt;
&lt;td&gt;Task-first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Persistence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ MD files in git&lt;/td&gt;
&lt;td&gt;✅ JSONL in git&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context layers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Standards/Product/Specs&lt;/td&gt;
&lt;td&gt;Flat issue list&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Task creation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Derived from specs&lt;/td&gt;
&lt;td&gt;Created directly (or through a file input)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dependencies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Implicit in spec narrative&lt;/td&gt;
&lt;td&gt;Explicit graph edges&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;"What's next?"&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Derived from spec phase&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;bd ready&lt;/code&gt; computes it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Upfront design&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Required (spec phases)&lt;/td&gt;
&lt;td&gt;Optional&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Complex features needing design&lt;/td&gt;
&lt;td&gt;Iterative, fast-moving work&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  When to Use Which
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Use agent-os (or similar SDD frameworks) when:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Complex features&lt;/strong&gt; - You need to think through architecture before coding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team alignment&lt;/strong&gt; - Specs help communicate intent to other humans&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stakeholder buy-in&lt;/strong&gt; - Non-technical people need to review plans&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regulated industries&lt;/strong&gt; - Formal specs may be required&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Use beads when:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fast iteration&lt;/strong&gt; - You want to jump straight to tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clear requirements&lt;/strong&gt; - You already know what to build&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependency-heavy work&lt;/strong&gt; - Many tasks blocking each other&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Solo or AI-assisted&lt;/strong&gt; - Less need for human-readable specs&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Can You Use Both?
&lt;/h3&gt;

&lt;p&gt;Yes. You could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use SDD's planning phases to think through architecture&lt;/li&gt;
&lt;li&gt;Export tasks to beads for execution with graph-based tracking&lt;/li&gt;
&lt;li&gt;Keep high-level context in a README, detailed execution in beads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;My approach:&lt;/strong&gt; For Mission House, I skipped formal specs and went straight to beads. The requirements doc was enough context - I didn't need a full SDD workflow for a personal project.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Drawbacks: What Didn't Work
&lt;/h2&gt;

&lt;p&gt;Let me be honest about the challenges:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Learning Curve
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zwlv5p1jtk781lg6ubg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zwlv5p1jtk781lg6ubg.png" alt="Learning curve" width="784" height="43"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The CLI commands take time to internalize. &lt;code&gt;bd dep add A B&lt;/code&gt; means "A depends on B" (B blocks A) - I got this backwards several times.&lt;/p&gt;
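&lt;p&gt;A tiny sketch that pins down the argument order (the helper is hypothetical; the field names follow the JSONL shape shown in this post):&lt;/p&gt;

```python
# Hypothetical helper mirroring "bd dep add A B": A is the dependent
# issue, B is the blocker, so B must close before A becomes ready.
def dep_add(issues, a, b):
    issues[a].setdefault("dependencies", []).append({"depends_on_id": b})

issues = {"A": {"status": "open"}, "B": {"status": "open"}}
dep_add(issues, "A", "B")  # reads left to right: "A depends on B"
blockers = [d["depends_on_id"] for d in issues["A"]["dependencies"]]
print(blockers)  # prints ['B'] - B blocks A, not the other way around
```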

&lt;h3&gt;
  
  
  2. Sync Conflicts
&lt;/h3&gt;

&lt;p&gt;When working across multiple branches or machines, sync conflicts can occur:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bd list
💡 Tip: Run &lt;span class="s1"&gt;'bd sync'&lt;/span&gt; to resolve &lt;span class="nb"&gt;sync &lt;/span&gt;conflict
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix is usually simple (&lt;code&gt;bd sync --from-main&lt;/code&gt;), but it's an extra step that spec documents don't have.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Over-Granularity Temptation
&lt;/h3&gt;

&lt;p&gt;It's tempting to create a task for everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Too granular - don't do this&lt;/span&gt;
bd create &lt;span class="nt"&gt;--title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Add import statement for React"&lt;/span&gt;
bd create &lt;span class="nt"&gt;--title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Create empty component file"&lt;/span&gt;
bd create &lt;span class="nt"&gt;--title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Add basic JSX structure"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Better:&lt;/strong&gt; One task for "Create React component for X with basic structure"&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Daemon Startup Delays
&lt;/h3&gt;

&lt;p&gt;Occasionally the beads daemon takes time to start:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bd list
Warning: Daemon took too long to start &lt;span class="o"&gt;(&amp;gt;&lt;/span&gt;5s&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt; Running &lt;span class="k"&gt;in &lt;/span&gt;direct mode.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not a blocker, but noticeable.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. No Visual Dashboard
&lt;/h3&gt;

&lt;p&gt;This turned out to be only partly true: several community-built dashboards are listed &lt;a href="https://github.com/steveyegge/beads/blob/main/docs/COMMUNITY_TOOLS.md" rel="noopener noreferrer"&gt;here&lt;/a&gt;, and they make visualizing the graph much easier.&lt;/p&gt;




&lt;h2&gt;
  
  
  Advanced Features We Used
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Bulk Operations
&lt;/h3&gt;

&lt;p&gt;When we had duplicate issues, we cleaned up with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bd delete mission-house-ck6 mission-house-d1s mission-house-hqs &lt;span class="nt"&gt;--reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"batch delete"&lt;/span&gt; &lt;span class="nt"&gt;--force&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This created tombstones preserving the history.&lt;/p&gt;

&lt;h3&gt;
  
  
  Priority System
&lt;/h3&gt;

&lt;p&gt;beads uses P0-P4 priorities:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Our Usage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P0&lt;/td&gt;
&lt;td&gt;Critical&lt;/td&gt;
&lt;td&gt;Blocking bugs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P1&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Core features&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P2&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Most tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P3&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Nice-to-haves&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P4&lt;/td&gt;
&lt;td&gt;Backlog&lt;/td&gt;
&lt;td&gt;Future ideas&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Close Reasons
&lt;/h3&gt;

&lt;p&gt;Always close with a reason:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bd close mission-house-qvs &lt;span class="nt"&gt;--reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Added Google Maps API with Places and Geometry libraries. MapsService provides geocoding, directions, nearest station search, and autocomplete"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This becomes searchable context for future sessions.&lt;/p&gt;
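&lt;p&gt;Because close reasons live in plain JSONL, they can be queried with a few lines of code (a hedged sketch; the search helper is illustrative, not a bd command):&lt;/p&gt;

```python
import json

# Illustrative helper: find issues whose close_reason mentions a term.
def search_close_reasons(jsonl_lines, term):
    hits = []
    for line in jsonl_lines:
        if not line.strip():
            continue
        issue = json.loads(line)
        if term.lower() in issue.get("close_reason", "").lower():
            hits.append(issue["id"])
    return hits

lines = ['{"id": "mission-house-qvs", "close_reason": "Added Google Maps API"}']
print(search_close_reasons(lines, "google maps"))  # prints ['mission-house-qvs']
```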

&lt;h3&gt;
  
  
  Integrations
&lt;/h3&gt;

&lt;p&gt;beads also supports &lt;a href="https://github.com/steveyegge/beads/discussions/430" rel="noopener noreferrer"&gt;syncing with Jira&lt;/a&gt; if your team needs to keep stakeholders updated in traditional issue trackers.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Timeline Visualization
&lt;/h2&gt;

&lt;p&gt;Here's how our project actually progressed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15u5xewy9t4809ekwq1a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15u5xewy9t4809ekwq1a.png" alt="Project timeline" width="784" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total active development time:&lt;/strong&gt; ~3 hours across 2 days&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;beads and SDD frameworks like agent-os represent different philosophies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SDD frameworks&lt;/strong&gt; say: "Think first, spec it out, then derive tasks"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;beads&lt;/strong&gt; says: "Create tasks directly, let the graph handle prioritization"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Neither is universally better. SDD frameworks shine when you need upfront design and human-readable documentation. beads shines when you want to move fast with automatic dependency resolution.&lt;/p&gt;

&lt;p&gt;What makes beads unique:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Task-first workflow&lt;/strong&gt; - Skip straight to issue creation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph-based dependencies&lt;/strong&gt; - Explicit edges, not prose to interpret&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic prioritization&lt;/strong&gt; - &lt;code&gt;bd ready&lt;/code&gt; computes what's next via graph traversal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compact JSONL&lt;/strong&gt; - High signal-to-noise ratio as projects grow&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Mission House project went from idea to working app in about 3 hours of active development, spread across multiple sessions. The graph kept track of what was blocked, what was ready, and what was done - no spec documents required.&lt;/p&gt;

&lt;p&gt;Choose the approach that fits your project. Or use both.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/steveyegge/beads" rel="noopener noreferrer"&gt;beads GitHub Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://koustubh25.github.io/mission-house/" rel="noopener noreferrer"&gt;Mission House Demo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/koustubh25/mission-house" rel="noopener noreferrer"&gt;Mission House Source&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://claude.ai" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Thanks for reading! If you try beads on your next AI-assisted project, I'd love to hear how it goes.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>projectmanagement</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Building Apps with AI: How beads Changed My Development Workflow</title>
      <dc:creator>Koustubh</dc:creator>
      <pubDate>Tue, 20 Jan 2026 22:25:54 +0000</pubDate>
      <link>https://forem.com/koustubh/building-apps-with-ai-how-beads-changed-my-development-workflow-2p7</link>
      <guid>https://forem.com/koustubh/building-apps-with-ai-how-beads-changed-my-development-workflow-2p7</guid>
      <description>&lt;h1&gt;
  
  
  Building Apps with AI: How &lt;code&gt;beads&lt;/code&gt; Changed My Development Workflow
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Part 1 of 2: From Spec Documents to Living Issue Trackers&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I built a real estate comparison app using Claude Code and a tool called &lt;a href="https://github.com/steveyegge/beads" rel="noopener noreferrer"&gt;beads&lt;/a&gt; - a git-native issue tracker designed for AI-assisted development. This post explores how beads transformed my workflow from writing lengthy spec documents to having a living, breathing project tracker that my AI assistant actually understands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; &lt;a href="https://github.com/koustubh25/mission-house" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Live Demo:&lt;/strong&gt; &lt;a href="https://koustubh25.github.io/mission-house/" rel="noopener noreferrer"&gt;Mission House App&lt;/a&gt; - This is a SPA with no backend, so it requires a Google Maps Platform API key to be provided on the UI to render fully. Have a look at the snapshots in the source code repository above to get an idea of the app.&lt;/p&gt;


&lt;h2&gt;
  
  
  Spec-Driven Development (SDD)
&lt;/h2&gt;

&lt;p&gt;If you've used AI coding assistants for larger projects, you've likely encountered Spec-Driven Development (SDD) - a methodology where detailed specifications drive the implementation process. I've written about &lt;a href="https://dev.to/koustubh/part-1-spec-driven-development-building-predictable-ai-assisted-software-19ne"&gt;SDD in detail previously&lt;/a&gt; - it's a powerful approach that works well for many projects. Different frameworks implement SDD differently.&lt;/p&gt;

&lt;p&gt;One popular implementation is &lt;a href="https://buildermethods.com/agent-os/workflow" rel="noopener noreferrer"&gt;agent-os&lt;/a&gt;, which formalizes SDD into a six-phase workflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7c88qdgj24h8hicnvs7z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7c88qdgj24h8hicnvs7z.png" alt="agent-os SDD workflow" width="784" height="49"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;agent-os uses layered context (Standards → Product → Specs) stored in markdown files like &lt;code&gt;mission.md&lt;/code&gt;, &lt;code&gt;roadmap.md&lt;/code&gt;, and &lt;code&gt;tech-stack.md&lt;/code&gt;. Tasks are &lt;em&gt;derived from specs&lt;/em&gt;, not created directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;agent-os works well, but has trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Spec-first philosophy&lt;/strong&gt; - You write specs before creating tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks derived from prose&lt;/strong&gt; - The AI interprets specs to generate tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layered context&lt;/strong&gt; - Rich documentation, but more files to maintain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequential phases&lt;/strong&gt; - Structured workflow from planning to orchestration&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8t4ogx01ded88kpu149.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8t4ogx01ded88kpu149.png" alt="SDD interpretation flow" width="784" height="54"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SDD frameworks like agent-os are great for complex features that need upfront design. But what if you want to skip straight to task management?&lt;/p&gt;

&lt;p&gt;The key difference is where &lt;em&gt;control flow&lt;/em&gt; lives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In SDD, control flow is implicit in prose.&lt;/li&gt;
&lt;li&gt;In beads, control flow is explicit in a DAG.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To be clear: SDD shines when architectural intent must be stabilized early — beads optimizes for execution once intent is roughly understood.&lt;/p&gt;


&lt;h2&gt;
  
  
  Enter &lt;code&gt;beads&lt;/code&gt;: Task-First with Graph-Based Dependencies
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/steveyegge/beads" rel="noopener noreferrer"&gt;Beads&lt;/a&gt; is what its creator Steve Yegge calls "a drop-in cognitive upgrade for your coding agents." Instead of the spec-first approach, beads is &lt;strong&gt;task-first&lt;/strong&gt; - you create issues directly, with explicit dependencies stored as graph edges.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Compact JSONL, Not Verbose Markdown
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;.beads/
├── issues.jsonl    &lt;span class="c"&gt;# All issues in ONE compact file&lt;/span&gt;
├── config.yaml     &lt;span class="c"&gt;# Project configuration&lt;/span&gt;
└── db.sqlite       &lt;span class="c"&gt;# Local cache for fast queries&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;One line per issue. No walls of prose. The AI can quickly parse the entire project state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mission-house-ogp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Implement scraper"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"closed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"depends_on_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mission-house-5mv"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Automatic "What's Next?" via Graph Traversal
&lt;/h3&gt;

&lt;p&gt;This is the killer feature. Instead of the AI parsing prose to figure out the order of execution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bd ready  &lt;span class="c"&gt;# Shows only unblocked, high-priority tasks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The graph database computes eligibility automatically based on explicit status, dependencies, and user-defined priority.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftx3lrexsjz4cx7nz46wv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftx3lrexsjz4cx7nz46wv.png" alt="bd ready flow" width="784" height="43"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Explicit Dependencies = Enforced Execution Order
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bd dep add mission-house-abc mission-house-xyz
&lt;span class="c"&gt;# "abc depends on xyz" (xyz blocks abc)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't prose that might be ignored - it's a graph edge. The AI literally cannot see &lt;code&gt;abc&lt;/code&gt; in &lt;code&gt;bd ready&lt;/code&gt; until &lt;code&gt;xyz&lt;/code&gt; is closed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Start: Installing beads
&lt;/h2&gt;

&lt;p&gt;For macOS users:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew tap steveyegge/beads
brew &lt;span class="nb"&gt;install &lt;/span&gt;bd

&lt;span class="c"&gt;# Initialize in your project&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;your-project
bd init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Other options: &lt;code&gt;npm install -g @beads/bd&lt;/code&gt; or &lt;code&gt;go install github.com/steveyegge/beads/cmd/bd@latest&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That's it. You now have a &lt;code&gt;.beads&lt;/code&gt; directory in your repo.&lt;/p&gt;

&lt;p&gt;Also install the beads Claude Code plugin by following &lt;a href="https://github.com/steveyegge/beads/blob/main/docs/PLUGIN.md#install-plugin" rel="noopener noreferrer"&gt;these instructions&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The App: Mission House
&lt;/h2&gt;

&lt;p&gt;Before diving deeper into beads, let me briefly introduce what we built. &lt;strong&gt;Mission House&lt;/strong&gt; is a property comparison tool for Melbourne house hunters. It helps answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which catchment schools serve this address?&lt;/li&gt;
&lt;li&gt;How do their NAPLAN scores compare to each other?&lt;/li&gt;
&lt;li&gt;How long is the commute to the CBD?&lt;/li&gt;
&lt;li&gt;How do these 4 properties compare on a radar chart?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnsm31ddlwsc6ght1cs9e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnsm31ddlwsc6ght1cs9e.png" alt="Mission House architecture" width="784" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The interesting part isn't the app itself - it's &lt;strong&gt;how we built it using beads&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  How I Used beads: A Real Example
&lt;/h2&gt;

&lt;p&gt;Although beads is task-first, I didn’t start from a blank slate.&lt;/p&gt;

&lt;p&gt;I began with a lightweight &lt;a href="https://github.com/koustubh25/mission-house/blob/main/docs/requirements.md" rel="noopener noreferrer"&gt;requirements.md&lt;/a&gt; that described:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;core user flows&lt;/li&gt;
&lt;li&gt;data sources (e.g. NAPLAN, Google Maps)&lt;/li&gt;
&lt;li&gt;output expectations (comparison metrics, charts)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I then asked Claude Code (with beads installed) to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read requirements.md&lt;/li&gt;
&lt;li&gt;Propose epics, features, and tasks&lt;/li&gt;
&lt;li&gt;Encode them directly into beads issues with explicit dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, requirements existed, but they were treated as input, not as a continuously consulted execution artifact.&lt;/p&gt;

&lt;p&gt;Once the task graph existed, beads became the primary source of truth.&lt;/p&gt;

&lt;p&gt;Here's where the magic happens. When I started my next Claude Code session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; bd ready

mission-house-5mv [P1] [task] open - Set up frontend project structure
  └─ No blockers - ready to work!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude immediately knew what to work on. No spec re-reading, no context reconstruction - just:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; bd update mission-house-5mv --status=in_progress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And we're coding.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Dependency Graph
&lt;/h2&gt;

&lt;p&gt;Here's what our project looked like after the initial planning:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytub7b1xz1rrw86zol9t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytub7b1xz1rrw86zol9t.png" alt="Dependency graph" width="784" height="222"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every arrow represents a &lt;code&gt;bd dep add&lt;/code&gt; command. The AI knows it can't work on "Web Scraper" until "URL Input Form" is done.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Coming in Part 2
&lt;/h2&gt;

&lt;p&gt;In the next post, I'll dive deep into:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The JSONL Advantage&lt;/strong&gt; - How storing issues in plain text gives AI assistants "long memory"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflow Patterns&lt;/strong&gt; - Epics → Features → Tasks hierarchy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real Issue Examples&lt;/strong&gt; - Actual JSON from our project&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;beads vs. Spec-Driven Development&lt;/strong&gt; - A detailed comparison&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Drawbacks&lt;/strong&gt; - What didn't work so well&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Features&lt;/strong&gt; - Tombstones, sync branches, and multi-session workflows&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Task-first vs. spec-first&lt;/strong&gt; - beads lets you create issues directly; SDD requires specs first&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph-based dependencies&lt;/strong&gt; - Explicit edges, not prose to interpret&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;bd ready&lt;/code&gt; is the killer feature&lt;/strong&gt; - Automatic prioritization via graph traversal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enforced execution order&lt;/strong&gt; - Blocked tasks are invisible until unblocked&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Continue to &lt;a href="./part2-beads-deep-dive.md"&gt;Part 2: Deep Dive into beads Workflow →&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>devtools</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Part 5: Building Station Station - Should You Use Spec-Driven Development?</title>
      <dc:creator>Koustubh</dc:creator>
      <pubDate>Tue, 04 Nov 2025 05:46:51 +0000</pubDate>
      <link>https://forem.com/koustubh/part-5-building-station-station-should-you-use-spec-driven-development-2i7b</link>
      <guid>https://forem.com/koustubh/part-5-building-station-station-should-you-use-spec-driven-development-2i7b</guid>
      <description>&lt;p&gt;We've covered a lot in this series. In Part 1, we introduced Spec-Driven Development. In Part 2, we explored the Station Station project—8 features solving a real hybrid work compliance problem. In Part 3, we walked through the agent-os workflow. In Part 4, we got honest about the challenges and limitations.&lt;/p&gt;

&lt;p&gt;Now for the decision: &lt;strong&gt;Should you use SDD for your next project?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This part gives you a decision framework based on real experience, not theory. No marketing fluff—just practical guidance on when SDD makes sense and when it's overkill.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Actually Built
&lt;/h2&gt;

&lt;p&gt;Let's recap what the SDD approach delivered for Station Station:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Project:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Personal web application for tracking office attendance via Melbourne Myki transit data&lt;/li&gt;
&lt;li&gt;8 features across 3 phases (Foundation, Data Layer, Integration &amp;amp; UI)&lt;/li&gt;
&lt;li&gt;Live and deployed at &lt;a href="https://koustubh25.github.io/station-station/" rel="noopener noreferrer"&gt;https://koustubh25.github.io/station-station/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fully autonomous daily execution via GitHub Actions&lt;/li&gt;
&lt;li&gt;Zero hosting costs (GitHub Pages + GitHub Actions free tier)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Tech:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~6,000 lines of code (3,500 Python + 2,500 React)&lt;/li&gt;
&lt;li&gt;Python backend with Playwright for browser automation and Cloudflare bypass&lt;/li&gt;
&lt;li&gt;React frontend with Tailwind CSS v4, responsive mobile-first design&lt;/li&gt;
&lt;li&gt;Lighthouse scores 95+ across the board&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Timeline:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~4-5 hours on specs&lt;/li&gt;
&lt;li&gt;~2 days on implementation&lt;/li&gt;
&lt;li&gt;~4-6 hours debugging hard problems (Cloudflare, timezone, multi-layer integration)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: ~3 days for 8 features shipped and deployed&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Result:&lt;/strong&gt;&lt;br&gt;
I use this app regularly to check my attendance compliance. It solves my actual problem. All planned features are complete. I can (and did) add new features weeks later by creating new specs and following the same workflow.&lt;/p&gt;
&lt;h2&gt;
  
  
  When to Use Spec-Driven Development
&lt;/h2&gt;

&lt;p&gt;Based on the Station Station experience, here's when SDD is worth the upfront investment:&lt;/p&gt;
&lt;h3&gt;
  
  
  ✅ Use SDD When:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. You're building a complete project, not just prototyping&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you want to actually ship something and maintain it, SDD helps you finish. The structure prevents the "60% complete and abandoned" problem that plagues side projects.&lt;/p&gt;

&lt;p&gt;Station Station could have easily become another abandoned project. Authentication alone took 2 days to solve. Without the roadmap keeping me focused on the goal, I might have given up after Cloudflare kept blocking me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. You'll be coming back to the code later&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If there's any gap between coding sessions (days, weeks, months), the documentation is invaluable. The spec tells you what you were building and why. The task list shows what's done and what's next.&lt;/p&gt;

&lt;p&gt;I added the &lt;code&gt;manualAttendanceDates&lt;/code&gt; feature a week after the initial deployment. The existing specs told me exactly how the system worked, where to add the new field, and what components would be affected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. You're working solo or on a small team&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SDD provides structure when you don't have teammates to keep you accountable. The roadmap prevents scope creep. The task breakdown prevents getting overwhelmed.&lt;/p&gt;

&lt;p&gt;For Station Station, I was the only developer. The workflow kept me organized and prevented me from jumping between random features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The project involves multiple components or layers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When your project has backend + frontend, or data extraction + processing + visualization, specs help you think through the integration points upfront.&lt;/p&gt;

&lt;p&gt;Station Station has Python backend, GitHub Actions automation, and React frontend. The specs documented how data flows between these layers, which made debugging multi-layer issues much easier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. You want to learn a structured development process&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're tired of chaotic development and want to build better habits, SDD provides a framework. The first project has a learning curve, but future projects benefit from the workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. You're solving a non-trivial problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Simple CRUD apps or one-off scripts don't benefit much from SDD. But if your problem has complexity (Cloudflare bypass, API reverse engineering, browser automation), the structure helps you tackle it systematically.&lt;/p&gt;
&lt;h3&gt;
  
  
  ❌ Skip SDD When:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. You're doing quick experiments or throwaway code&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're testing an idea and will likely discard the code, specs are overkill. Just write code and see if the idea works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. You have a crystal-clear mental model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you've built this exact thing 10 times before and know every step, specs won't add much value. You already have the structure in your head.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The project is extremely simple&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A single-file script, a basic static site, or a trivial automation doesn't need specs. Just write it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. You're under extreme time pressure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you need something working in the next 2 hours, don't spend 30 minutes on a spec. But recognize you're trading speed now for maintenance pain later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. You're learning a completely new technology&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're learning React for the first time, just following tutorials and experimenting might be better than trying to spec everything out. Learn first, then apply structure to real projects.&lt;/p&gt;
&lt;h2&gt;
  
  
  Decision Framework
&lt;/h2&gt;

&lt;p&gt;Here's a simple decision tree:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Is this a real project you want to finish and maintain?
├─ No → Skip SDD, just code
└─ Yes ↓

Will you be working on this over multiple sessions?
├─ No → Skip SDD unless project is complex
└─ Yes ↓

Does the project involve multiple components/layers?
├─ No, single component → SDD optional
└─ Yes, multiple layers ↓

Are you working solo or small team?
├─ No, large team with existing processes → Evaluate SDD fit
└─ Yes ↓

→ USE SDD. The upfront investment will pay off.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Honest Trade-offs
&lt;/h2&gt;

&lt;p&gt;Let's be real about what you're signing up for:&lt;/p&gt;

&lt;h3&gt;
  
  
  What You Give Up:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Time to first code:&lt;/strong&gt; Specs take time. You'll spend 30 minutes to several hours documenting before you write a single line of code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flexibility to "just try things":&lt;/strong&gt; SDD encourages thinking before coding. If you like to experiment your way to a solution, the structure might feel constraining.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simplicity:&lt;/strong&gt; You're adding agent-os to your workflow. There's a learning curve. The first project takes longer.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You Get:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Actually finishing projects:&lt;/strong&gt; Structure prevents abandonment. The roadmap keeps you focused. The task breakdown prevents overwhelm.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resumability:&lt;/strong&gt; Come back weeks later and know exactly where you left off. No re-learning your own codebase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Systematic debugging:&lt;/strong&gt; When things break, the spec tells you what should happen. The task breakdown shows you where to look.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documented decisions:&lt;/strong&gt; Future you (or future contributors) can understand why things were built a certain way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Less decision fatigue:&lt;/strong&gt; The spec tells you what to build next. No "what should I work on today?" paralysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Bottom Line:
&lt;/h3&gt;

&lt;p&gt;SDD trades &lt;strong&gt;upfront time&lt;/strong&gt; for &lt;strong&gt;higher completion rate&lt;/strong&gt; and &lt;strong&gt;better maintainability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you care more about finishing than starting, SDD is worth it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Get Started
&lt;/h2&gt;

&lt;p&gt;If you've decided SDD might work for your next project, here's how to begin:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Try It With a Real Project
&lt;/h3&gt;

&lt;p&gt;Don't practice with a tutorial. Pick an actual problem you want to solve. Station Station worked because I genuinely needed to track my attendance.&lt;/p&gt;

&lt;p&gt;Your first SDD project will be slower. That's normal. The second project will be much faster once you internalize the workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Start With Agent-OS
&lt;/h3&gt;

&lt;p&gt;Agent-os is the tool I used for Station Station. It's built for Claude and provides the complete workflow: product creation, spec shaping, spec writing, task breakdown, and implementation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent-OS Repository:&lt;/strong&gt; &lt;a href="https://github.com/cyanheads/agent-os" rel="noopener noreferrer"&gt;https://github.com/cyanheads/agent-os&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Getting Started Guide:&lt;/strong&gt; Check the repository README for setup instructions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Try Station Station Yourself
&lt;/h3&gt;

&lt;p&gt;Station Station is open source and free to use. If you're a Melbourne train commuter with hybrid work requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Live Application:&lt;/strong&gt; &lt;a href="https://koustubh25.github.io/station-station/" rel="noopener noreferrer"&gt;https://koustubh25.github.io/station-station/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Repository:&lt;/strong&gt; &lt;a href="https://github.com/koustubh25/station-station" rel="noopener noreferrer"&gt;https://github.com/koustubh25/station-station&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Two options to get started:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Get onboarded to the existing app&lt;/strong&gt; - See your attendance on the same GUI. Check the &lt;a href="https://github.com/koustubh25/station-station#option-1-get-onboarded-to-the-existing-app" rel="noopener noreferrer"&gt;README for onboarding instructions&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fork and deploy your own&lt;/strong&gt; - Complete control and privacy. The &lt;a href="https://github.com/koustubh25/station-station#option-2-fork-and-deploy-your-own" rel="noopener noreferrer"&gt;README has full deployment instructions&lt;/a&gt; using GitHub Actions and GitHub Pages (all free).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The repository also includes complete specs for all 8 features and task breakdowns, so you can study the SDD approach in action.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Other SDD Tools
&lt;/h3&gt;

&lt;p&gt;Agent-os isn't the only way to do Spec-Driven Development. Other tools exist like &lt;a href="https://github.com/Fission-AI/OpenSpec" rel="noopener noreferrer"&gt;OpenSpec&lt;/a&gt;, &lt;a href="https://github.com/github/spec-kit" rel="noopener noreferrer"&gt;Spec Kit&lt;/a&gt;, and others. However, I found some too simple (lacking the structure I needed) and others too verbose (overwhelming with process overhead). Agent-os struck a good balance for my workflow—structured enough to keep me organized, but not so heavy that it gets in the way of actually building.&lt;/p&gt;

&lt;p&gt;Your preferences might differ. If agent-os doesn't feel right, explore the alternatives.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Spec-Driven Development isn't revolutionary. It's structure. It's documentation. It's thinking before coding.&lt;/p&gt;

&lt;p&gt;It won't make you code faster. It won't eliminate bugs. It won't replace your judgment.&lt;/p&gt;

&lt;p&gt;But it might help you &lt;strong&gt;finish&lt;/strong&gt; instead of abandon. It might help you &lt;strong&gt;resume&lt;/strong&gt; instead of restart. It might help you &lt;strong&gt;debug systematically&lt;/strong&gt; instead of randomly.&lt;/p&gt;

&lt;p&gt;For Station Station, that was enough. In 3 days, I went from "I need to track attendance" to a fully deployed application solving my real problem. Two weeks later, I added new features without re-learning the codebase. A month later, the app is still running autonomously, requiring zero maintenance.&lt;/p&gt;

&lt;p&gt;Your mileage may vary. Your projects are different. Your workflow preferences are different.&lt;/p&gt;

&lt;p&gt;But if you're tired of abandoned side projects, forgotten codebases, and chaotic development, maybe give SDD a try. Pick a real problem. Write a spec. Follow the workflow. See if it works for you.&lt;/p&gt;

&lt;p&gt;And if you do try it—or if you've been using SDD and have your own experiences to share—I'd love to hear about it. Drop a comment, open a GitHub discussion, or reach out.&lt;/p&gt;

&lt;p&gt;Thanks for reading this series. Now go build something.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Station Station Live App:&lt;/strong&gt; &lt;a href="https://koustubh25.github.io/station-station/" rel="noopener noreferrer"&gt;https://koustubh25.github.io/station-station/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Station Station GitHub:&lt;/strong&gt; &lt;a href="https://github.com/koustubh25/station-station" rel="noopener noreferrer"&gt;https://github.com/koustubh25/station-station&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent-OS GitHub:&lt;/strong&gt; &lt;a href="https://github.com/cyanheads/agent-os" rel="noopener noreferrer"&gt;https://github.com/cyanheads/agent-os&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenSpec GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Fission-AI/OpenSpec" rel="noopener noreferrer"&gt;https://github.com/Fission-AI/OpenSpec&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>specdriven</category>
      <category>agentos</category>
      <category>productivity</category>
      <category>ai</category>
    </item>
    <item>
      <title>Part 4: Building Station Station - Where SDD Helped (and Where It Didn't)</title>
      <dc:creator>Koustubh</dc:creator>
      <pubDate>Tue, 04 Nov 2025 05:45:52 +0000</pubDate>
      <link>https://forem.com/koustubh/part-4-building-station-station-where-sdd-helped-and-where-it-didnt-3f7d</link>
      <guid>https://forem.com/koustubh/part-4-building-station-station-where-sdd-helped-and-where-it-didnt-3f7d</guid>
      <description>&lt;p&gt;In Parts 1-3, we covered Spec-Driven Development, the Station Station project, and the agent-os workflow. We saw a structured process that delivered 8 features, fully deployed and working. But I've been painting a rosy picture. Let me be honest about the challenges.&lt;/p&gt;

&lt;p&gt;This part is about the real development experience: Where did the structured SDD approach actually help? Where did I still struggle despite having specs and tasks? What problems can structure solve, and what problems require good old-fashioned debugging?&lt;/p&gt;

&lt;p&gt;If you're considering SDD for your next project, this is the part you need to read, because understanding what structure can and can't solve is critical to setting realistic expectations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenge 1: Cloudflare Authentication Bypass
&lt;/h2&gt;

&lt;p&gt;Let me start with the most frustrating part of the entire project: getting authentication to work with the Myki portal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Context:&lt;/strong&gt;&lt;br&gt;
The whole project depends on accessing Myki transaction data. No authentication = no data = no project. This was the critical blocker. Everything else was blocked until this worked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt;&lt;br&gt;
The Myki portal is protected by Cloudflare Turnstile, which actively detects and blocks headless browsers. My first attempt using standard Playwright headless mode failed immediately with the "Verifying you are human..." overlay blocking form access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Spec:&lt;/strong&gt;&lt;br&gt;
"Use Playwright to authenticate with Myki portal and extract session tokens for API calls."&lt;/p&gt;

&lt;p&gt;Simple requirement, right? But the spec didn't capture the complexity of Cloudflare bot detection.&lt;/p&gt;
&lt;h3&gt;
  
  
  How SDD Helped
&lt;/h3&gt;

&lt;p&gt;The structured approach provided a framework for tackling this beast:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clear success criteria:&lt;/strong&gt; "Extract Bearer token from authentication response" - The spec told the AI (and me) exactly what success looked like, even if neither of us knew how to get there yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task breakdown kept me focused:&lt;/strong&gt; Instead of having the AI try to solve "make authentication work" all at once, the tasks broke it down into pieces:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Launch browser with Playwright&lt;/li&gt;
&lt;li&gt;Navigate to login page&lt;/li&gt;
&lt;li&gt;Fill in credentials&lt;/li&gt;
&lt;li&gt;Submit form&lt;/li&gt;
&lt;li&gt;Extract authentication tokens&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When the AI's implementation of step 3 (fill in credentials) was blocked by Cloudflare, the task breakdown showed me exactly where the problem was.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documentation of attempts:&lt;/strong&gt; Each failed approach got documented in the spec as "out of scope" or "doesn't work because..." This prevented me from asking the AI to retry the same failed approaches days later.&lt;/p&gt;
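&lt;p&gt;For illustration, the "extract Bearer token" success criterion boils down to a parsing step like this - the header value below is made up, and the real capture happens inside the Playwright session, which isn't shown here:&lt;/p&gt;

```python
# Hypothetical parsing step for the spec's success criterion. The captured
# header value is invented; only the "Bearer " prefix handling is the point.

def extract_bearer_token(headers):
    auth = headers.get("authorization", "")
    prefix = "Bearer "
    if auth.startswith(prefix):
        return auth[len(prefix):]
    return None

captured = {"authorization": "Bearer eyJhbGciOiJIUzI1NiJ9.example"}
print(extract_bearer_token(captured))
```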
&lt;h3&gt;
  
  
  What SDD Couldn't Solve
&lt;/h3&gt;

&lt;p&gt;But here's the brutal truth: &lt;strong&gt;specs don't solve hard technical problems for you.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The AI tried implementing authentication multiple ways based on the spec. Each attempt failed. Over two days, I kept iterating:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 1: AI implements standard headless Playwright&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;playwright&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result: Blocked by Cloudflare immediately. "Verifying you are human..."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 2: I ask AI to try headed mode (visible browser)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;playwright&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result: Better, but still detected as automation. Random CAPTCHA challenges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 3: I ask AI to try user-agent and header spoofing&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result: Cloudflare is smarter than that. Still blocked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 4: AI tries stealth mode plugins&lt;/strong&gt;&lt;br&gt;
Result: Helped a bit, but not consistent. Sometimes worked, sometimes didn't.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Solution (After Two Days of Frustration)
&lt;/h3&gt;

&lt;p&gt;What finally worked: &lt;strong&gt;Browser profile trust signals&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After researching Cloudflare bypass techniques, I figured out the solution and told the AI to implement it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create empty Chrome profile directory structure
&lt;/span&gt;&lt;span class="n"&gt;profile_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_empty_chrome_profile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# Launch with profile - appears as "real" browser to Cloudflare
&lt;/span&gt;&lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;playwright&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch_persistent_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;user_data_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;profile_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The profile directory contains just enough metadata (Cookies, Preferences, History files - all empty) to make Playwright look like a legitimate Chrome browser instead of automation.&lt;/p&gt;
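&lt;p&gt;A minimal sketch of what a &lt;code&gt;create_empty_chrome_profile()&lt;/code&gt; helper could look like, assuming the trick is just seeding the files Chrome normally writes so the profile passes as a used browser - which files Cloudflare actually checks is an assumption on my part:&lt;/p&gt;

```python
# Assumption: an on-disk profile with the usual Default/ files is enough to
# read as a "real" browser. The helper is hypothetical, matching the name
# used in the snippet above; file names match what Chrome creates.
import json
import tempfile
from pathlib import Path

def create_empty_chrome_profile():
    profile_dir = Path(tempfile.mkdtemp(prefix="chrome-profile-"))
    default = profile_dir / "Default"
    default.mkdir()
    (default / "Cookies").touch()       # empty placeholder files
    (default / "History").touch()
    (default / "Preferences").write_text(json.dumps({}))
    return str(profile_dir)

profile = create_empty_chrome_profile()
print(sorted(p.name for p in (Path(profile) / "Default").iterdir()))
```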

&lt;h3&gt;
  
  
  Why This Was So Frustrating
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The spec couldn't help because:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This required deep knowledge of browser fingerprinting and bot detection&lt;/li&gt;
&lt;li&gt;Solutions aren't documented well (Cloudflare actively tries to prevent bypass)&lt;/li&gt;
&lt;li&gt;Trial and error was the only way to find what worked&lt;/li&gt;
&lt;li&gt;Each attempt took 5-10 minutes to test (AI implements → I run the code → see if blocked)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What I actually needed to do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Research how Cloudflare Turnstile detects automation&lt;/li&gt;
&lt;li&gt;Learn that browser profiles affect fingerprinting&lt;/li&gt;
&lt;li&gt;Try approach after approach until something worked&lt;/li&gt;
&lt;li&gt;Debug headless browser issues by inspecting what Cloudflare was detecting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this came from the spec. The spec told the AI &lt;em&gt;what&lt;/em&gt; to achieve. But figuring out &lt;em&gt;how&lt;/em&gt; required me to research, experiment, find the solution, and then tell the AI to implement it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Takeaway
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What SDD provided:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear goal to work toward (extract Bearer token)&lt;/li&gt;
&lt;li&gt;Focus on one step at a time instead of being overwhelmed&lt;/li&gt;
&lt;li&gt;Documentation of failed approaches to avoid repetition&lt;/li&gt;
&lt;li&gt;Motivation to keep going (this was Task 1 on a roadmap of 8 features - couldn't give up)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What SDD couldn't provide:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Technical solution to Cloudflare bypass&lt;/li&gt;
&lt;li&gt;Knowledge of browser fingerprinting&lt;/li&gt;
&lt;li&gt;Shortcuts to avoid trial-and-error debugging&lt;/li&gt;
&lt;li&gt;The actual working approach (that required research and experimentation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Would specialized subagents have helped?&lt;/strong&gt;&lt;br&gt;
Honestly, I don't know—I didn't try them for this problem. Agent-os has advanced features like specialized research agents and orchestrated task execution that might have helped with researching Cloudflare bypass techniques. But I was using the standard workflow, so I can't say whether those advanced features would have shortened the two-day struggle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson learned:&lt;/strong&gt;&lt;br&gt;
SDD gives you structure to tackle hard problems systematically. But &lt;strong&gt;hard problems are still hard&lt;/strong&gt;. Structure doesn't replace technical knowledge, research, and persistence. It just gives you a framework to keep trying without getting lost.&lt;/p&gt;

&lt;p&gt;When I finally got authentication working after two days, having it documented as "Task 1: COMPLETE ✓" with detailed notes on the working approach was incredibly valuable. Future features could reference "see Task 1 for Cloudflare bypass pattern." Without that documentation, I might have forgotten the solution by the time I needed to debug it again.&lt;/p&gt;
&lt;h2&gt;
  
  
  Challenge 2: The Multi-Layer Bug
&lt;/h2&gt;

&lt;p&gt;After finally getting authentication working, I ran into a different kind of frustration: the &lt;code&gt;manualAttendanceDates&lt;/code&gt; feature bug.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Context:&lt;/strong&gt;&lt;br&gt;
This requirement came much later, after the entire end-to-end system was already working. The app was successfully tracking attendance based on Myki transactions, deployed and running. But then I realized I needed a way to record office attendance on days when I drove to work instead of taking the train. No Myki transaction = no automatic detection. The solution was to add a &lt;code&gt;manualAttendanceDates&lt;/code&gt; config field where I could explicitly list dates I was in the office.&lt;/p&gt;

&lt;p&gt;This was an enhancement to an already-working system, not part of the original implementation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Spec:&lt;/strong&gt;&lt;br&gt;
Clear requirements. Well-defined tasks. Config schema documented. Expected behavior spelled out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Implementation:&lt;/strong&gt;&lt;br&gt;
Feature got implemented according to the tasks. Initial version deployed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt;&lt;br&gt;
Manual dates weren't showing up correctly. But it wasn't just one bug—it was multiple issues across different parts of the system.&lt;/p&gt;
&lt;h3&gt;
  
  
  How SDD Helped
&lt;/h3&gt;

&lt;p&gt;Having the spec gave me a debugging roadmap:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Check the spec requirements&lt;/strong&gt; - What should happen? "Manual dates should appear on the calendar with the same styling as PTV-detected dates"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Follow the task breakdown&lt;/strong&gt; - Config parsing → Python backend → GitHub Actions workflow → JSON output → Frontend rendering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trace the data flow&lt;/strong&gt; - The spec documented the exact data structure at each layer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I could systematically check each layer by reviewing the code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python code:&lt;/strong&gt; Reviewing the implementation, I could see it wasn't properly merging manual dates with PTV-detected dates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflow file:&lt;/strong&gt; Looking at GitHub Actions, I realized it needed updates to handle the new field&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UI:&lt;/strong&gt; Checking the frontend code, I spotted where it needed changes to render manual dates correctly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This wasn't a single bug, but multiple integration issues across three different components.&lt;/p&gt;
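&lt;p&gt;The missing merge in the Python layer amounts to a set union - a hedged sketch, with the &lt;code&gt;manualAttendanceDates&lt;/code&gt; field name from the config and everything else illustrative rather than the project's actual code:&lt;/p&gt;

```python
# manualAttendanceDates is the config field from the article; the dates and
# the merge function itself are illustrative of the fix, not the real code.

def merge_attendance(detected_dates, config):
    manual = config.get("manualAttendanceDates", [])
    return sorted(set(detected_dates) | set(manual))

detected = ["2025-10-01", "2025-10-03"]                           # Myki touch-ons
config = {"manualAttendanceDates": ["2025-10-02", "2025-10-03"]}  # drove to work

print(merge_attendance(detected, config))
```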
&lt;h3&gt;
  
  
  What SDD Couldn't Solve
&lt;/h3&gt;

&lt;p&gt;Even with perfect specs, I still had to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Actually review the code&lt;/strong&gt; across all three layers to spot the issues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Give the AI specific hints&lt;/strong&gt; - "Look at how manual dates are merged in the Python code," "Check the workflow file," "The UI might not be checking the right field"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Understand the integration points&lt;/strong&gt; - recognizing that adding one field means touching Python backend, GitHub Actions workflow, and React frontend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connect the dots across components&lt;/strong&gt; - understanding how the config file flows through Python processing, gets written to JSON, picked up by GitHub Actions, and rendered by the UI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The spec told me &lt;em&gt;what&lt;/em&gt; should happen. It didn't tell me that integrating a new field into an existing multi-layer system would require touching all these different pieces. The AI could implement each fix once I pointed it to the right location, but finding those locations required me to review the code and understand the full data flow.&lt;/p&gt;
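To illustrate the kind of integration logic involved, here's a hypothetical sketch of merging manually entered dates with detected ones. The function and field names are invented for this example (the real merge lived in the Python backend), but the de-duplication concern is the same one the bug exposed:

```javascript
// Hypothetical merge: combine PTV-detected dates with manually added ones,
// de-duplicating so each day is counted once. Names are illustrative only.
function mergeAttendanceDates(detectedDates, manualDates) {
  // Set removes duplicates; lexicographic sort of YYYY-MM-DD strings
  // is also chronological order.
  return [...new Set([...detectedDates, ...manualDates])].sort();
}

mergeAttendanceDates(['2024-11-01', '2024-11-04'], ['2024-11-04', '2024-11-06']);
// → ['2024-11-01', '2024-11-04', '2024-11-06']
```

The sketch is trivial on its own; the hard part was knowing that this merge, the workflow file, and the frontend all had to agree on the same field.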
&lt;h3&gt;
  
  
  The Takeaway
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What SDD provided:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear expected behavior to test against&lt;/li&gt;
&lt;li&gt;Systematic way to isolate which layer was failing&lt;/li&gt;
&lt;li&gt;Documentation of the intended data structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What still required human debugging:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understanding multi-layer integration issues&lt;/li&gt;
&lt;li&gt;Recognizing when different components had different assumptions&lt;/li&gt;
&lt;li&gt;Finding the exact line where the mismatch occurred&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Challenge 3: Timezone Handling
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Context:&lt;/strong&gt;&lt;br&gt;
Calendar dates were displaying incorrectly—off by 1 day from the actual values in &lt;code&gt;attendance.json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Spec:&lt;/strong&gt;&lt;br&gt;
"Display attended days on calendar matching the dates in the JSON file exactly."&lt;/p&gt;

&lt;p&gt;Simple, right? Match the dates. But there was a subtle problem.&lt;/p&gt;
&lt;h3&gt;
  
  
  How SDD Helped
&lt;/h3&gt;

&lt;p&gt;The spec was clear about &lt;em&gt;what&lt;/em&gt; should happen (dates must match), which made it obvious when they didn't. Without a spec, I might have thought "close enough" or missed the off-by-one bug entirely.&lt;/p&gt;

&lt;p&gt;The task breakdown also helped isolate the problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task: "Parse attendance dates from JSON"&lt;/li&gt;
&lt;li&gt;Task: "Mark calendar tiles for attended dates"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The bug was in the date parsing task, not the calendar rendering task. Task isolation made debugging faster.&lt;/p&gt;
&lt;h3&gt;
  
  
  What SDD Couldn't Solve
&lt;/h3&gt;

&lt;p&gt;The spec said "match the dates exactly" but didn't specify &lt;em&gt;how&lt;/em&gt; to handle timezones. The bug was subtle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Initial implementation - caused timezone conversion&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dateString&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;T&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="c1"&gt;// For dates near midnight, UTC conversion shifts the date!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem: JavaScript's &lt;code&gt;toISOString()&lt;/code&gt; converts to UTC. For dates near midnight, this can shift the date forward or backward. Nov 1, 2024 01:00 AEDT becomes Oct 31, 2024 14:00 UTC—wrong day!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why the spec didn't prevent this:&lt;/strong&gt;&lt;br&gt;
The spec didn't say "use local timezone, not UTC" because I didn't think about timezones when writing it. The requirement seemed obvious: match the dates. But "obvious" hides assumptions.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Fix
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Corrected implementation - uses local timezone&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dateString&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLocaleDateString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;en-CA&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// YYYY-MM-DD format&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
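To make the difference concrete, here's a minimal standalone sketch (illustrative dates, not taken from the project) showing how the two approaches diverge for a timestamp just after local midnight in a timezone east of UTC:

```javascript
// A local-time date: Nov 1, 2024 at 01:00 in whatever timezone this runs in.
// Note: the month argument is 0-indexed, so 10 = November.
const date = new Date(2024, 10, 1, 1, 0);

// Buggy approach: toISOString() converts to UTC first. East of UTC
// (e.g. AEDT, UTC+11), 01:00 local is still the previous day in UTC.
const utcDay = date.toISOString().split('T')[0];

// Fixed approach: toLocaleDateString('en-CA') formats the *local* date
// as YYYY-MM-DD, so the calendar day never shifts.
const localDay = date.toLocaleDateString('en-CA');

console.log(utcDay, localDay); // east of UTC these differ; localDay is always "2024-11-01"
```

Running this on a machine set to Melbourne time prints an October date for `utcDay` and `2024-11-01` for `localDay`, which is exactly the off-by-one the spec's "match exactly" criterion caught.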

&lt;h3&gt;
  
  
  The Takeaway
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What SDD provided:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear success criteria (dates must match exactly)&lt;/li&gt;
&lt;li&gt;Quick detection that something was wrong&lt;/li&gt;
&lt;li&gt;Task isolation to narrow down where the bug was&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What SDD didn't prevent:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Subtle implementation details (timezone handling)&lt;/li&gt;
&lt;li&gt;Hidden assumptions in "obvious" requirements&lt;/li&gt;
&lt;li&gt;Need for domain knowledge (how JavaScript handles dates)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson learned:&lt;/strong&gt;&lt;br&gt;
Good specs need to surface non-obvious assumptions. "Match the dates" should have been "Match the dates using local timezone to avoid UTC conversion issues." But you often don't know to specify this until you've been bitten by the bug.&lt;/p&gt;
&lt;h2&gt;
  
  
  Challenge 4: Third-Party Library Integration
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Context:&lt;/strong&gt;&lt;br&gt;
Integrating the &lt;code&gt;date-holidays&lt;/code&gt; npm package to automatically detect Victoria public holidays.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Spec:&lt;/strong&gt;&lt;br&gt;
"Use date-holidays library to fetch Victoria public holidays. Display them on the calendar with red text."&lt;/p&gt;
&lt;h3&gt;
  
  
  How SDD Helped
&lt;/h3&gt;

&lt;p&gt;The spec documented exactly which library to use and what the expected behavior was. When the integration didn't work as expected, I could reference the spec to confirm what was supposed to happen.&lt;/p&gt;
&lt;h3&gt;
  
  
  What SDD Couldn't Solve
&lt;/h3&gt;

&lt;p&gt;The library returned dates in an unexpected format: &lt;code&gt;"YYYY-MM-DD HH:MM:SS"&lt;/code&gt; strings instead of JavaScript Date objects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Initial attempt based on spec&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;holidayDate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;holiday&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// Assumed library returns Date objects - it doesn't!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why the spec didn't prevent this:&lt;/strong&gt;&lt;br&gt;
The spec said "use the library" but didn't document the exact return format because I hadn't investigated the library deeply when writing the spec. I assumed standard Date objects.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Fix
&lt;/h3&gt;

&lt;p&gt;I had to inspect the actual library output, recognize the format mismatch, and handle it explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Working solution after investigation&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dateString&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;holiday&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// "2025-11-04"&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;day&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dateString&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;-&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Number&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;holidayDate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;month&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;day&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Uses local timezone&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
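Once the quirk was known, the parsing logic could be wrapped in a small defensive helper. This is an illustrative sketch (the function name and validation are mine, not from the project) of guarding against the string format the library actually returns, so a future format change fails loudly instead of silently shifting dates:

```javascript
// Parse a "YYYY-MM-DD HH:MM:SS" holiday string into a local-timezone Date.
// Throws early if the input doesn't match the expected shape.
function parseHolidayDate(raw) {
  const dateString = String(raw).substring(0, 10); // e.g. "2025-11-04"
  if (!/^\d{4}-\d{2}-\d{2}$/.test(dateString)) {
    throw new Error(`Unexpected holiday date format: ${raw}`);
  }
  const [year, month, day] = dateString.split('-').map(Number);
  return new Date(year, month - 1, day); // local timezone, no UTC shift
}

parseHolidayDate('2025-11-04 00:00:00'); // → local Date for Nov 4, 2025
```

Centralizing the quirk in one validated function also gives the spec something concrete to reference next time: "holiday dates arrive as local-time strings; parse them with the shared helper."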



&lt;h3&gt;
  
  
  The Takeaway
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What SDD provided:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Documentation of which library to use (no decision paralysis)&lt;/li&gt;
&lt;li&gt;Clear requirement for what should be displayed (public holidays with red text)&lt;/li&gt;
&lt;li&gt;Task to test the integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What SDD didn't prevent:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runtime surprises from third-party libraries&lt;/li&gt;
&lt;li&gt;Need to investigate actual library behavior vs documented behavior&lt;/li&gt;
&lt;li&gt;Format mismatch that only appears when you run the code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson learned:&lt;/strong&gt;&lt;br&gt;
Specs can't predict every third-party library quirk. You discover these by running code and inspecting actual output. The structured approach helps you document the quirks once you find them, so future tasks can reference the pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenge 5: User Preferences vs Developer Assumptions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Context:&lt;/strong&gt;&lt;br&gt;
During development, weekends were displaying in red text (standard react-calendar behavior). I assumed this was confusing since it wasn't attendance data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Plan:&lt;/strong&gt;&lt;br&gt;
Remove the red weekend styling via CSS override. Seemed like a clean UI improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Actually Happened:&lt;/strong&gt;&lt;br&gt;
Before implementing the change, I asked the user (myself, wearing the user hat instead of developer hat). Response: "I really liked keeping weekends and public holidays in red."&lt;/p&gt;

&lt;p&gt;I almost removed a feature I valued because I was thinking like a developer, not a user.&lt;/p&gt;

&lt;h3&gt;
  
  
  How SDD Helped
&lt;/h3&gt;

&lt;p&gt;The structured workflow created natural checkpoints for user feedback:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After completing a task group, review with user&lt;/li&gt;
&lt;li&gt;Before removing functionality, validate with user&lt;/li&gt;
&lt;li&gt;Spec updates require user approval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this structure, I would have just removed the feature mid-coding session without stopping to think "Should I ask about this?"&lt;/p&gt;

&lt;h3&gt;
  
  
  What SDD Couldn't Solve
&lt;/h3&gt;

&lt;p&gt;SDD doesn't tell you what users want. It creates opportunities to ask, but you still have to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actually ask the question&lt;/li&gt;
&lt;li&gt;Listen to the answer&lt;/li&gt;
&lt;li&gt;Override your own assumptions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Takeaway
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What SDD provided:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Natural review checkpoints to get user feedback&lt;/li&gt;
&lt;li&gt;Process that encourages "ask before removing"&lt;/li&gt;
&lt;li&gt;Documentation of decisions (why we kept the feature)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What SDD didn't prevent:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Making wrong assumptions in the first place&lt;/li&gt;
&lt;li&gt;Need for actual user communication&lt;/li&gt;
&lt;li&gt;Temptation to "just fix it" without asking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson learned:&lt;/strong&gt;&lt;br&gt;
Structure creates opportunities for better decisions, but you still have to take advantage of those opportunities. The review checkpoint is useless if you skip it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where SDD Provided the Most Value
&lt;/h2&gt;

&lt;p&gt;After going through these challenges, here's where the structured approach actually helped:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Clear Success Criteria
&lt;/h3&gt;

&lt;p&gt;Every bug was obvious because the spec defined success. "Dates should match exactly" meant off-by-one was clearly wrong. Without specs, I might have rationalized it: "Close enough, probably a display thing."&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Systematic Debugging
&lt;/h3&gt;

&lt;p&gt;Task breakdown gave me a debugging roadmap. Instead of randomly checking files, I could trace the data flow through the task list:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Config parsing (Task 1)&lt;/li&gt;
&lt;li&gt;Backend processing (Task 2)&lt;/li&gt;
&lt;li&gt;JSON output (Task 3)&lt;/li&gt;
&lt;li&gt;Frontend rendering (Task 4)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Check each layer systematically until you find the broken one.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Documentation of Decisions
&lt;/h3&gt;

&lt;p&gt;When I came back a week later to add a new feature, the specs told me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why certain approaches were chosen&lt;/li&gt;
&lt;li&gt;What assumptions were made (and documented)&lt;/li&gt;
&lt;li&gt;How data flows through the system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this documentation, I would have re-learned the codebase every time.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Review Checkpoints
&lt;/h3&gt;

&lt;p&gt;The workflow forced me to pause and review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After each task group&lt;/li&gt;
&lt;li&gt;Before major changes&lt;/li&gt;
&lt;li&gt;When user feedback was needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These pauses prevented rushing ahead with wrong assumptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where SDD Couldn't Replace Human Judgment
&lt;/h2&gt;

&lt;p&gt;But let's be honest about what structure can't solve:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Multi-Layer Integration Issues
&lt;/h3&gt;

&lt;p&gt;Specs describe individual components well. But when components need to work together, you still have to understand the full picture. The manualAttendanceDates bug required understanding backend + frontend + data contract all at once.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Hidden Assumptions and Edge Cases
&lt;/h3&gt;

&lt;p&gt;"Match the dates" seemed clear until timezone conversion bit me. "Use the library" seemed clear until format mismatches appeared. Good specs surface assumptions, but you often don't know what to surface until you've been bitten.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Third-Party Library Quirks
&lt;/h3&gt;

&lt;p&gt;Specs can't predict runtime behavior of external dependencies. You discover these by running code, inspecting output, and debugging when things don't work as documented.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. User Preferences and Domain Knowledge
&lt;/h3&gt;

&lt;p&gt;Structure can't tell you what users value or what domain-specific constraints matter. You still need actual user communication and domain expertise.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest ROI of SDD
&lt;/h2&gt;

&lt;p&gt;Let's be real about the time investment:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time spent on specs:&lt;/strong&gt; ~4-5 hours across all features&lt;br&gt;
&lt;strong&gt;Time spent on implementation:&lt;/strong&gt; ~2 days&lt;br&gt;
&lt;strong&gt;Time spent debugging:&lt;/strong&gt; ~4-6 hours (timezone, manualAttendanceDates, library integration)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total:&lt;/strong&gt; ~3 days for 8 features shipped and deployed&lt;/p&gt;

&lt;h3&gt;
  
  
  Where SDD Saved Time
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Debugging was faster:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Systematic task-by-task checking vs random file jumping&lt;/li&gt;
&lt;li&gt;Spec told me what should happen vs guessing&lt;/li&gt;
&lt;li&gt;Data flow documented vs reverse-engineering it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Resumability was huge:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Came back a week later, knew exactly where I left off&lt;/li&gt;
&lt;li&gt;Spec reminded me why decisions were made&lt;/li&gt;
&lt;li&gt;Task list showed what was done and what was next&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fewer forgotten requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Everything documented upfront vs relying on memory&lt;/li&gt;
&lt;li&gt;Edge cases captured in spec vs discovered in production&lt;/li&gt;
&lt;li&gt;Complete feature set shipped vs "80% done" abandonment&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Where SDD Cost Time
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Upfront spec creation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4-5 hours thinking and documenting&lt;/li&gt;
&lt;li&gt;But this is thinking time I'd need anyway, just formalized&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Learning the workflow:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First project with agent-os had a learning curve&lt;/li&gt;
&lt;li&gt;Second project would be faster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Maintaining documentation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When specs changed, had to update docs&lt;/li&gt;
&lt;li&gt;But this paid off when resuming work later&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;Spec-Driven Development isn't magic. It won't prevent bugs, eliminate debugging, or replace human judgment. But it provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structure&lt;/strong&gt; when you'd otherwise be lost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation&lt;/strong&gt; when you'd otherwise forget&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checkpoints&lt;/strong&gt; when you'd otherwise rush ahead with wrong assumptions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resumability&lt;/strong&gt; when you'd otherwise re-learn the codebase&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The challenges I faced—timezone bugs, library quirks, multi-layer issues—would have happened with or without SDD. The difference is how I dealt with them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without SDD:&lt;/strong&gt; Random debugging, forgotten context, abandoned projects&lt;br&gt;
&lt;strong&gt;With SDD:&lt;/strong&gt; Systematic debugging, documented decisions, completed features&lt;/p&gt;

&lt;p&gt;That's the honest ROI.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;We've now seen the complete picture: the workflow (Part 3) and where it actually helps vs where you still struggle (Part 4). You know the realistic benefits and the honest limitations.&lt;/p&gt;

&lt;p&gt;In Part 5, we'll wrap up with a decision framework: when should you use SDD for your project, and when is a simpler approach better? We'll also cover how to get started with Station Station yourself (it's free and open source), and where to go from here.&lt;/p&gt;

&lt;p&gt;If you're ready to make the call on whether Spec-Driven Development fits your workflow, Part 5 has the answers.&lt;/p&gt;

</description>
      <category>specdriven</category>
      <category>debugging</category>
      <category>ai</category>
      <category>claude</category>
    </item>
    <item>
      <title>Part 3: Building Station Station - Agent-OS Workflow in Action</title>
      <dc:creator>Koustubh</dc:creator>
      <pubDate>Tue, 04 Nov 2025 05:43:12 +0000</pubDate>
      <link>https://forem.com/koustubh/part-3-building-station-station-agent-os-workflow-in-action-1fp9</link>
      <guid>https://forem.com/koustubh/part-3-building-station-station-agent-os-workflow-in-action-1fp9</guid>
      <description>&lt;p&gt;In Parts 1 and 2, I introduced Spec-Driven Development and showed you the finished Station Station project—8 features, live on GitHub Pages, solving my real hybrid work compliance problem. But how did we actually get there? What does the agent-os workflow look like in practice?&lt;/p&gt;

&lt;p&gt;This part walks you through the complete development process, using real examples from Station Station. No theoretical abstractions—just the actual workflow I followed to go from "I need to track my office attendance" to a deployed web application.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five Phases of Agent-OS
&lt;/h2&gt;

&lt;p&gt;Agent-os structures development into five distinct phases, each with specific deliverables and human review checkpoints. For this project, I used &lt;strong&gt;Claude&lt;/strong&gt; (Anthropic's AI) as the AI assistant throughout the entire workflow—from spec shaping to implementation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AGENT-OS DEVELOPMENT WORKFLOW
════════════════════════════

1. Create Product  →  2. Shape Spec  →  3. Write Spec  →  4. Write Tasks  →  5. Implement
   Mission &amp;amp; Roadmap   Requirements      Technical Spec    Task Breakdown    AI + Human Code
         ↓                  ↓                  ↓                 ↓                 ↓
   [Human Review]     [Human Review]     [Human Review]    [Human Review]    [Human Review]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the review checkpoints? That's deliberate. You're not waiting until the end to discover the AI misunderstood your requirements. You're validating assumptions at each phase before moving forward.&lt;/p&gt;

&lt;p&gt;Let me show you what each phase actually looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: Create Product
&lt;/h2&gt;

&lt;p&gt;The first step isn't writing code—it's defining what you're building and why. You start with just a raw idea, and agent-os helps you shape it into a structured product plan.&lt;/p&gt;

&lt;p&gt;Here's how it actually works:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You start with a simple idea:&lt;/strong&gt;&lt;br&gt;
"I need to track my office attendance using my Myki train card data to meet my company's 50% hybrid work requirement."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent-os asks clarifying questions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What problem are you solving? Who's the target user?&lt;/li&gt;
&lt;li&gt;What are your key constraints? (Budget, timeline, technical preferences)&lt;/li&gt;
&lt;li&gt;What features are must-haves vs nice-to-haves?&lt;/li&gt;
&lt;li&gt;Do you have preferred technologies or deployment platforms?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;You answer honestly:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Problem: I need proactive visibility into my attendance, not reactive manager notifications&lt;/li&gt;
&lt;li&gt;User: Primarily me, but could be useful for other hybrid workers&lt;/li&gt;
&lt;li&gt;Constraints: Zero hosting costs, mobile-first (I'll check this on my phone)&lt;/li&gt;
&lt;li&gt;Tech: Python (I'm comfortable with it), React (modern and fast), GitHub Pages (free)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Agent-os generates three documents:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;mission.md&lt;/strong&gt; - The "why" behind the project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Problem: Hybrid workers need transparency into office attendance compliance&lt;/li&gt;
&lt;li&gt;Solution: Automated attendance tracking using Melbourne's Myki transit data&lt;/li&gt;
&lt;li&gt;Target user: Me (and anyone else commuting via train for hybrid work)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;roadmap.md&lt;/strong&gt; - The "what" we're building:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;8 features broken into 3 phases&lt;/li&gt;
&lt;li&gt;Phase 1: Foundation (auth + API discovery)&lt;/li&gt;
&lt;li&gt;Phase 2: Data Layer (extraction + processing)&lt;/li&gt;
&lt;li&gt;Phase 3: Integration &amp;amp; UI (frontend + config)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;tech-stack.md&lt;/strong&gt; - The "how" we'll build it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backend: Python + Playwright for browser automation&lt;/li&gt;
&lt;li&gt;Frontend: React + Tailwind CSS v4&lt;/li&gt;
&lt;li&gt;Deployment: GitHub Pages (static site, zero hosting costs)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This took about 30 minutes—mostly answering questions and reviewing the generated documents. You don't write these from scratch; the AI creates them based on your answers. But you review and refine them to ensure they match your vision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human role here:&lt;/strong&gt; You provide the raw idea and answer questions about your constraints, priorities, and technical preferences. AI can't guess that you want zero hosting costs or mobile-first design—that's domain knowledge only you have.&lt;/p&gt;
&lt;h2&gt;
  
  
  Phase 2: Shape the Spec
&lt;/h2&gt;

&lt;p&gt;Once you have the product defined, you pick a feature from the roadmap and start spec shaping. This is where agent-os really shines—the spec-shaper agent asks clarifying questions to gather detailed requirements.&lt;/p&gt;

&lt;p&gt;For the React frontend feature, the shaping process went like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spec-shaper:&lt;/strong&gt; "What data format will the frontend consume?"&lt;br&gt;
&lt;strong&gt;Me:&lt;/strong&gt; "JSON from GitHub raw URL, generated by Python backend"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spec-shaper:&lt;/strong&gt; "What visualizations do you need?"&lt;br&gt;
&lt;strong&gt;Me:&lt;/strong&gt; "Calendar view with attended days marked, monthly bar chart showing attendance trends, summary statistics"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spec-shaper:&lt;/strong&gt; "Mobile or desktop priority?"&lt;br&gt;
&lt;strong&gt;Me:&lt;/strong&gt; "Mobile-first. I'll be checking this on my phone most mornings."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spec-shaper:&lt;/strong&gt; "Should the calendar be read-only or interactive?"&lt;br&gt;
&lt;strong&gt;Me:&lt;/strong&gt; "Interactive—clicking an attended day should show which station and what time I tapped on."&lt;/p&gt;

&lt;p&gt;This back-and-forth surfaces implementation details you might not think about upfront. By the end of shaping, you have a clear picture of what "build the frontend" actually means.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important: This is iterative.&lt;/strong&gt; The spec-shaper generates a draft spec based on your answers. You review it. If something's not quite right—maybe it misunderstood your intent, or you realized you forgot to mention a key requirement—you provide more input. The spec gets refined. You review again. This continues until you're satisfied.&lt;/p&gt;

&lt;p&gt;For Station Station, I went through 2-3 refinement rounds on some specs. The first draft might have missed that I wanted public holidays automatically displayed on the calendar. I'd point that out, and the spec would be updated to include it. No starting over—just iterative improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human role here:&lt;/strong&gt; Answer questions honestly about your use case, then review and refine the generated spec. The AI can't guess that you'll primarily use this on mobile, or that you care more about quick glances than detailed analytics. And if the first draft misses something, that's fine—just keep refining until it's right.&lt;/p&gt;
&lt;h2&gt;
  
  
  Phase 3: Write the Spec
&lt;/h2&gt;

&lt;p&gt;Now the spec-writer agent takes your answers from the shaping phase and generates a detailed technical specification. Here's what the actual spec looked like for the frontend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Specification: Attendance Tracker Frontend UI&lt;/span&gt;

&lt;span class="gu"&gt;## Goal&lt;/span&gt;
Build a responsive static React web application to visualize work attendance data
from the Myki attendance tracker JSON output, enabling users to view attendance
statistics, explore monthly calendars with marked attended days, analyze trends
through bar charts, and filter data by date ranges across mobile and desktop devices.

&lt;span class="gu"&gt;## User Stories&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; As a user, I want to see a monthly calendar with my attended days visually marked
  so that I can quickly identify when I was at the office
&lt;span class="p"&gt;-&lt;/span&gt; As a user, I want to view monthly attendance percentages in a bar chart so that I
  can understand my attendance trends over time
&lt;span class="p"&gt;-&lt;/span&gt; As a user, I want to filter data by date range so that I can focus on specific
  time periods like quarters or financial years

&lt;span class="gu"&gt;## Specific Requirements&lt;/span&gt;

&lt;span class="gs"&gt;**Calendar View Component**&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Display monthly grid calendar showing current month by default
&lt;span class="p"&gt;-&lt;/span&gt; Provide previous/next month navigation buttons for browsing history
&lt;span class="p"&gt;-&lt;/span&gt; Mark attended days with red visual indicators (red background circle or dot)
&lt;span class="p"&gt;-&lt;/span&gt; Make attended days clickable to show detail modal or tooltip
&lt;span class="p"&gt;-&lt;/span&gt; Display timestamp and target station name when attended day is clicked
&lt;span class="p"&gt;-&lt;/span&gt; Use react-calendar library for calendar functionality
&lt;span class="p"&gt;-&lt;/span&gt; Ensure keyboard navigation support for accessibility
&lt;span class="p"&gt;-&lt;/span&gt; Mobile-optimized with touch-friendly date selection

&lt;span class="gs"&gt;**Monthly Bar Chart Visualization**&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Display one bar per month showing attendance percentage (0-100%)
&lt;span class="p"&gt;-&lt;/span&gt; Use Recharts library for rendering responsive bar charts
&lt;span class="p"&gt;-&lt;/span&gt; Color bars in red theme to match attended day indicators
&lt;span class="p"&gt;-&lt;/span&gt; Include tooltips showing exact percentage, working days, and days attended on hover
&lt;span class="p"&gt;-&lt;/span&gt; Ensure chart is fully responsive and readable on mobile screens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This continues for 9 specific requirement areas, totaling about 100 lines of detailed specifications. The spec-writer captured my shaping answers and translated them into implementable requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; Notice the specificity. Not "build a chart" but "use Recharts library, red theme, responsive on mobile, tooltips on hover." That level of detail lets the AI implement without guessing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Just like shaping, this is iterative too.&lt;/strong&gt; The spec-writer generates a detailed spec based on the shaped requirements. You review it carefully. Maybe you notice it specified the wrong color theme, or it didn't include a requirement for error handling, or the accessibility requirements aren't strong enough. You provide feedback, and the spec gets updated. Review again. Refine again. Keep going until the spec accurately represents what you want built.&lt;/p&gt;

&lt;p&gt;For the frontend spec, I noticed the first draft didn't specify what should happen when the JSON fetch fails. I asked for better error handling requirements—retry option, user-friendly messages, graceful degradation. The spec was updated to include those details. Same with accessibility—I pushed for stronger requirements around keyboard navigation and screen reader support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human role here:&lt;/strong&gt; Review the spec thoroughly and keep refining until it's right. Did it capture your intent? Are there edge cases missing? Requirements that don't make sense? This is your last chance to catch misunderstandings before code gets written, so it's worth taking the time to get it right.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 4: Write Tasks
&lt;/h2&gt;

&lt;p&gt;With an approved spec, the task-writer agent breaks it into granular, actionable tasks. Here's how the frontend spec became 6 task groups with 40+ individual tasks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Task Breakdown: Attendance Tracker Frontend UI&lt;/span&gt;

&lt;span class="gu"&gt;## Task List&lt;/span&gt;

&lt;span class="gu"&gt;### Task Group 1: Initial Project Setup&lt;/span&gt;
&lt;span class="gs"&gt;**Dependencies:**&lt;/span&gt; None
&lt;span class="p"&gt;
-&lt;/span&gt; [x] 1.1 Create new Vite React project
&lt;span class="p"&gt;  -&lt;/span&gt; Run: &lt;span class="sb"&gt;`npm create vite@latest attendance-tracker -- --template react`&lt;/span&gt;
&lt;span class="p"&gt;  -&lt;/span&gt; Navigate into project directory
&lt;span class="p"&gt;  -&lt;/span&gt; Install base dependencies: &lt;span class="sb"&gt;`npm install`&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; [x] 1.2 Install and configure Tailwind CSS
&lt;span class="p"&gt;  -&lt;/span&gt; Install: &lt;span class="sb"&gt;`npm install -D tailwindcss postcss autoprefixer`&lt;/span&gt;
&lt;span class="p"&gt;  -&lt;/span&gt; Initialize: &lt;span class="sb"&gt;`npx tailwindcss init -p`&lt;/span&gt;
&lt;span class="p"&gt;  -&lt;/span&gt; Configure tailwind.config.js with content paths and custom 'attended' color (#ef4444)
&lt;span class="p"&gt;  -&lt;/span&gt; Add Tailwind directives to src/index.css
&lt;span class="p"&gt;
-&lt;/span&gt; [x] 1.3 Install required libraries
&lt;span class="p"&gt;  -&lt;/span&gt; Chart library: &lt;span class="sb"&gt;`npm install recharts`&lt;/span&gt;
&lt;span class="p"&gt;  -&lt;/span&gt; Calendar library: &lt;span class="sb"&gt;`npm install react-calendar`&lt;/span&gt;
&lt;span class="p"&gt;  -&lt;/span&gt; Date picker: &lt;span class="sb"&gt;`npm install react-datepicker`&lt;/span&gt;

&lt;span class="gu"&gt;### Task Group 2: Data Fetching and Processing&lt;/span&gt;
&lt;span class="gs"&gt;**Dependencies:**&lt;/span&gt; Task Group 1
&lt;span class="p"&gt;
-&lt;/span&gt; [x] 2.1 Write 2-6 focused tests for data utilities
&lt;span class="p"&gt;  -&lt;/span&gt; Test JSON fetch success scenario
&lt;span class="p"&gt;  -&lt;/span&gt; Test error handling for network failures
&lt;span class="p"&gt;  -&lt;/span&gt; Test date filtering calculation
&lt;span class="p"&gt;
-&lt;/span&gt; [x] 2.2 Create data fetching utility
&lt;span class="p"&gt;  -&lt;/span&gt; File: src/utils/dataFetcher.js
&lt;span class="p"&gt;  -&lt;/span&gt; Implement fetchAttendanceData() function
&lt;span class="p"&gt;  -&lt;/span&gt; URL: https://raw.githubusercontent.com/koustubh25/station-station/main/output/attendance.json
&lt;span class="p"&gt;  -&lt;/span&gt; Use cache: 'no-cache' for fresh data
&lt;span class="p"&gt;  -&lt;/span&gt; Handle network errors with descriptive messages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each task is concrete enough that I could hand it to any developer (or AI) and they'd know exactly what to build. Dependencies are explicit—you can't build the calendar component until data fetching works.&lt;/p&gt;
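&lt;p&gt;To see how implementable these tasks are, here is a rough Python sketch of the logic behind tasks 2.1 and 2.2 (the real project implements this in JavaScript as &lt;code&gt;src/utils/dataFetcher.js&lt;/code&gt;; the function names below are illustrative):&lt;/p&gt;

```python
from datetime import date

# URL taken from task 2.2; everything else here is an illustrative sketch.
ATTENDANCE_URL = ("https://raw.githubusercontent.com/koustubh25/"
                  "station-station/main/output/attendance.json")

def filter_by_range(iso_days, start, end):
    """Keep only the ISO date strings falling inside [start, end] inclusive."""
    def in_range(s):
        d = date.fromisoformat(s)
        return not (start > d or d > end)
    return [s for s in iso_days if in_range(s)]

def describe_fetch_error(exc):
    """Translate a raw network exception into a user-friendly message,
    as the error-handling requirement asks for."""
    return "Could not load attendance data (%s). Please retry." % exc
```

&lt;p&gt;With input &lt;code&gt;["2025-01-05", "2025-02-10"]&lt;/code&gt; and a January range, only the January date survives; the error helper keeps low-level exception text out of the UI.&lt;/p&gt;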

&lt;p&gt;&lt;strong&gt;Human role here:&lt;/strong&gt; Review the task breakdown. Is anything missing? Are tasks sequenced correctly? Do the dependencies make sense? Sometimes AI misses edge cases or creates circular dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 5: Implement Tasks
&lt;/h2&gt;

&lt;p&gt;This is where AI assistance actually writes code. But it's not fully autonomous—there are specific checkpoints where human review is critical.&lt;/p&gt;

&lt;p&gt;For Station Station, the implementation flow looked like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AI implements task&lt;/strong&gt; - The implementer agent writes code according to the task spec&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI runs tests&lt;/strong&gt; - Verifies the implementation works (if tests exist)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human reviews output&lt;/strong&gt; - You check the code, test it manually, and approve or request changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Move to next task&lt;/strong&gt; - Repeat for each task in the breakdown&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's where I learned an important lesson about agent-os CLI permissions. The AI can read files, write files, and run tests, but certain operations require your explicit approval:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Git commits&lt;/strong&gt; - You review and commit changes yourself&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git pushes&lt;/strong&gt; - You decide when to push to remote&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflow triggers&lt;/strong&gt; - You manually kick off CI/CD pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is by design. You maintain control over version history and deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But here's what I discovered:&lt;/strong&gt; After gaining confidence in the AI-generated code—once I'd reviewed a few implementations and saw they were solid—I started allowing the AI to do &lt;code&gt;git push&lt;/code&gt; and use the &lt;code&gt;gh&lt;/code&gt; CLI to view or trigger GitHub Actions workflows. This let the AI work more autonomously: push code, trigger the build, check if tests passed, and if they failed, fix the issues and try again.&lt;/p&gt;

&lt;p&gt;The workflow became: AI implements → AI pushes → AI triggers workflow → AI monitors results → If failures, AI fixes and repeats. I'd check in periodically, but for well-defined tasks, the AI could iterate autonomously until everything passed.&lt;/p&gt;

&lt;p&gt;This isn't the default (and probably shouldn't be for unfamiliar projects), but once you've established trust through review, you can grant more autonomy where it makes sense.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Agent-os also has an &lt;code&gt;orchestrate-tasks&lt;/code&gt; command that provides even more advanced multi-agent coordination and autonomous task execution. But that's beyond the scope of this blog—we'll cover it in detail in a future post. For Station Station, the standard task-by-task implementation workflow was sufficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human role here:&lt;/strong&gt; Code review, manual testing, and deployment decisions. AI can generate the boilerplate, but you verify it works in your specific context.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Complete Agent-OS Workflow
&lt;/h2&gt;

&lt;p&gt;Now that we've walked through all five phases, let's see how they fit together into a complete cycle.&lt;/p&gt;

&lt;p&gt;The agent-os workflow follows a structured, iterative cycle. Notice the feedback loop where human review catches issues that require debugging before the feature is complete. This isn't full automation—it's a partnership where AI handles implementation and humans guide the architecture and review the results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fkoustubh25%2Fstation-station%2Fmain%2Fagent-os%2Fspecs%2F2025-11-03-technical-blog-sdd%2Fplanning%2Fvisuals%2Fagent-os-workflow-diagram.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fkoustubh25%2Fstation-station%2Fmain%2Fagent-os%2Fspecs%2F2025-11-03-technical-blog-sdd%2Fplanning%2Fvisuals%2Fagent-os-workflow-diagram.png" alt="Agent-OS workflow diagram showing iterative cycle: Create Product, Shape Spec, Write Spec, Write Tasks, Implement Tasks, Human Review with feedback loop for debugging and refinement until feature is complete" width="800" height="1925"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Human Review at Key Decision Points
&lt;/h3&gt;

&lt;p&gt;This sequence diagram reveals the continuous human-AI collaboration throughout development. Review happens at multiple stages, not just at the end. Each phase includes a human checkpoint where you validate the AI's work before proceeding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fkoustubh25%2Fstation-station%2Fmain%2Fagent-os%2Fspecs%2F2025-11-03-technical-blog-sdd%2Fplanning%2Fvisuals%2Fagent-os-task-execution-flow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fkoustubh25%2Fstation-station%2Fmain%2Fagent-os%2Fspecs%2F2025-11-03-technical-blog-sdd%2Fplanning%2Fvisuals%2Fagent-os-task-execution-flow.png" alt="Sequence diagram showing Agent-OS task execution flow: Human provides feature idea to Spec Writer, who gathers requirements and generates detailed spec. Task Writer breaks spec into tasks for Human approval. Task Implementer executes each task, writes tests, and submits for Review. Human reviews results and either approves or provides guidance for fixes. Process loops until feature is complete" width="800" height="1036"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The key insight: this is continuous collaboration, not "AI does everything then human reviews at the end." You're involved throughout, making architectural decisions, reviewing outputs, and course-correcting when needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Iterative Reality
&lt;/h2&gt;

&lt;p&gt;Here's what the spec doesn't show: iteration. The workflow diagrams make it look linear, but reality is messier.&lt;/p&gt;

&lt;p&gt;For Station Station, I went through multiple rounds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Spec refinement:&lt;/strong&gt; Realized mid-development I needed manual attendance dates (for days I drove to work instead of taking the train). Went back and updated the spec.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task adjustments:&lt;/strong&gt; Some tasks were too large and got broken into smaller chunks. Others were unnecessary and got removed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementation bugs:&lt;/strong&gt; AI couldn't fix the &lt;code&gt;manualAttendanceDates&lt;/code&gt; field bug after several attempts. I had to review the code, identify the issue location, then let AI implement the fix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New specs added later:&lt;/strong&gt; The initial roadmap had 8 features, but I added more specs later for enhancements like security improvements and manual attendance features. You don't have to plan everything upfront—you can always create new specs for additional features as needs emerge.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The workflow provides structure, but you'll loop back. That's normal. The key difference from ad-hoc AI chat is that when you loop back, you update the spec or tasks—so the system stays consistent. Future features can reference the updated spec instead of inheriting outdated assumptions.&lt;/p&gt;

&lt;p&gt;And when you think of new features? Just create a new spec and go through the same workflow. The product documentation evolves with your project.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Makes This Different
&lt;/h2&gt;

&lt;p&gt;The structured approach of agent-os SDD provides several key benefits:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clear Direction Throughout&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every feature starts with documented requirements, not assumptions&lt;/li&gt;
&lt;li&gt;The roadmap gives you a clear view of what's done and what's next&lt;/li&gt;
&lt;li&gt;When you solve one problem (like Cloudflare bypass), the spec tells you exactly what comes next&lt;/li&gt;
&lt;li&gt;No more "I got this working, but now what?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Persistent Context&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Because you have a record of all specs, tasks, and their completion status, the AI can pick up exactly where you left off—even weeks later&lt;/li&gt;
&lt;li&gt;Come back after a break: "Task 7 is complete, Task 8 is next, here's what needs to be done"&lt;/li&gt;
&lt;li&gt;No context loss, no re-explaining what you've already built&lt;/li&gt;
&lt;li&gt;The documentation serves as persistent memory across sessions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Easier Debugging&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When something breaks, you can reference the spec to understand intended behavior&lt;/li&gt;
&lt;li&gt;Task breakdown makes it easy to isolate which component is failing&lt;/li&gt;
&lt;li&gt;Specs document edge cases and requirements that are easy to forget during implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Iterative Refinement&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Update specs as you learn—they evolve with your understanding&lt;/li&gt;
&lt;li&gt;Add new specs for new features without disrupting existing work&lt;/li&gt;
&lt;li&gt;Each iteration is documented, so you can see why decisions were made&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The time investment is front-loaded. Spec creation took longer than just prompting Claude to "build a frontend." But I shipped all 8 planned features. The debugging was easier. The resumability was huge. And when I added new features a week later, the specs told me exactly where to hook them in.&lt;/p&gt;

&lt;p&gt;That's the ROI of Spec-Driven Development—not faster initial code generation, but fewer surprises, clearer direction, and maintainable progress.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;We've seen the agent-os workflow in action: creating products, shaping specs, writing detailed specifications, breaking down tasks, and implementing with AI assistance. We have a structured process that transforms vague ideas into working code.&lt;/p&gt;

&lt;p&gt;But this is the part where I need to be honest about limitations. The workflow isn't magic. AI still struggles with certain problems, and some features require significant human intervention. In Part 4, we'll dive into the real challenges: debugging stories where AI failed, the collaboration spectrum between AI and human, and how to recognize when Spec-Driven Development is overkill for your project.&lt;/p&gt;

&lt;p&gt;If you're wondering whether this structured approach is always worth it—or where it breaks down—Part 4 has the answers.&lt;/p&gt;

</description>
      <category>specdriven</category>
      <category>agentos</category>
      <category>ai</category>
      <category>claude</category>
    </item>
    <item>
      <title>Part 2: Building Station Station - A Real-World SDD Case Study</title>
      <dc:creator>Koustubh</dc:creator>
      <pubDate>Tue, 04 Nov 2025 05:41:57 +0000</pubDate>
      <link>https://forem.com/koustubh/part-2-building-station-station-a-real-world-sdd-case-study-15p4</link>
      <guid>https://forem.com/koustubh/part-2-building-station-station-a-real-world-sdd-case-study-15p4</guid>
      <description>&lt;p&gt;In Part 1, I introduced Spec-Driven Development and explained how it provides a structured alternative to the AI chat trial-and-error loop. Now let's get concrete. This is the story of building Station Station—a real application that solves a real problem I face as a Melbourne train commuter working hybrid.&lt;/p&gt;

&lt;p&gt;This isn't a contrived demo or a toy project. It's a fully functional web application I use regularly, built in about 3 days using the agent-os SDD workflow, now live at &lt;a href="https://koustubh25.github.io/station-station/" rel="noopener noreferrer"&gt;https://koustubh25.github.io/station-station/&lt;/a&gt;. In this part, I'll walk you through the problem it solves, the technical architecture, the 8-feature roadmap we shipped, and the real challenges we encountered along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Hybrid Work Compliance Meets Train Commuting
&lt;/h2&gt;

&lt;p&gt;My company, like many post-pandemic workplaces, has a hybrid policy: 50% office attendance required. Sounds reasonable, right? But here's the catch—there's no automated way to track whether I'm meeting that threshold.&lt;/p&gt;

&lt;p&gt;My manager periodically notifies me of my compliance status—"You're at 42% this month, need to increase office days"—but by then, I'm already behind. I have no visibility into my attendance until someone tells me I'm non-compliant.&lt;/p&gt;

&lt;p&gt;The real problem isn't just manual tracking—it's &lt;strong&gt;reactive&lt;/strong&gt; tracking. I want to proactively manage my hybrid schedule: "It's mid-month and I'm at 48%, so I should go to the office this week to stay on track." Or "I'm at 55%, so I can safely work from home the rest of the month."&lt;/p&gt;

&lt;p&gt;This kind of strategic planning requires real-time visibility into my attendance data, not periodic manager notifications. The advantages of tracking this myself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Autonomy&lt;/strong&gt;: Control my schedule instead of reacting to manager feedback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strategic planning&lt;/strong&gt;: Make informed WFH decisions based on current compliance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid surprises&lt;/strong&gt;: Never get caught off-guard by non-compliance notifications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility&lt;/strong&gt;: Balance office days with personal needs (appointments, errands, weather)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly the kind of repetitive, rule-based task that should be automated. But the Myki portal doesn't offer attendance tracking—it's designed for viewing transactions, not inferring work patterns. No third-party tools exist because the Myki API is undocumented and protected by Cloudflare bot detection.&lt;/p&gt;

&lt;p&gt;Perfect problem space for a personal automation project.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Automated Attendance Tracking from Transit Data
&lt;/h2&gt;

&lt;p&gt;Station Station automatically determines office attendance by analyzing Myki transaction data. The core logic is simple: if I tapped on or off at my designated work station on any given day, that counts as an office day. The application handles the complexity of data extraction, date calculations, and visualization.&lt;/p&gt;

&lt;p&gt;Here's the architecture:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Extraction Layer (Python + Playwright)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Headless browser automation bypasses Cloudflare Turnstile bot detection using browser profile trust signals&lt;/li&gt;
&lt;li&gt;Authenticated session extracts cookies and Bearer tokens for API calls&lt;/li&gt;
&lt;li&gt;Transaction history retrieved for user-specified date ranges&lt;/li&gt;
&lt;li&gt;Data parsed and normalized into structured JSON format&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Processing Layer (Python)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Attendance logic analyzes transactions to identify work station visits&lt;/li&gt;
&lt;li&gt;Configurable rules support:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Work days&lt;/strong&gt;: Default Monday-Friday, customizable per user&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skip dates&lt;/strong&gt;: Exclude specific dates (sick leave, planned vacation, public holidays)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual attendance dates&lt;/strong&gt;: Override for days you attended work without taking the train (drove to work, carpooled, etc.)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Monthly aggregations calculate attendance statistics and percentages&lt;/li&gt;

&lt;li&gt;JSON storage provides data persistence with optional GitHub backup for version history&lt;/li&gt;

&lt;/ul&gt;
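&lt;p&gt;A minimal Python sketch of that processing logic, assuming a simplified schema (the field and function names here are illustrative, not the project's actual code):&lt;/p&gt;

```python
import calendar
from datetime import date

def attended_days(taps, target_station, manual_dates):
    """taps: iterable of (iso_date, station_name) from Myki transactions.
    A day counts if any tap on/off touched the target station, or the
    user manually marked it (e.g. drove to work that day)."""
    days = {d for d, station in taps if station == target_station}
    return days | set(manual_dates)

def month_percentage(year, month, attended, skip_dates, work_days=(0, 1, 2, 3, 4)):
    """Attendance percentage for one month: attended working days divided
    by all working days, with skip dates excluded from both sides."""
    working = []
    for day in range(1, calendar.monthrange(year, month)[1] + 1):
        d = date(year, month, day)
        if d.weekday() in work_days and d.isoformat() not in skip_dates:
            working.append(d.isoformat())
    if not working:
        return 0.0
    hits = [d for d in working if d in attended]
    return round(100.0 * len(hits) / len(working), 1)
```

&lt;p&gt;Skip dates come out of the denominator as well as the numerator, so a planned absence never counts against the percentage.&lt;/p&gt;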

&lt;p&gt;&lt;strong&gt;Presentation Layer (React + Tailwind CSS v4)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Interactive calendar displays attended days with color-coded indicators&lt;/li&gt;
&lt;li&gt;Monthly bar chart visualizes attendance trends over time&lt;/li&gt;
&lt;li&gt;Summary dashboard shows attendance percentage, working days, and days attended/missed&lt;/li&gt;
&lt;li&gt;Date range filtering allows focused analysis of specific periods&lt;/li&gt;
&lt;li&gt;Public holidays automatically detected and displayed for context&lt;/li&gt;
&lt;li&gt;Fully responsive design optimized for mobile (my primary use case)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Deployment (GitHub Pages)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Static site deployment—no backend hosting required&lt;/li&gt;
&lt;li&gt;React app consumes JSON data generated by Python backend&lt;/li&gt;
&lt;li&gt;GitHub Actions automates data updates and deployment&lt;/li&gt;
&lt;li&gt;Zero ongoing hosting costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fully Autonomous Operation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the crucial part: once configured, the system runs completely autonomously. Every day, GitHub Actions automatically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Spins up a Docker container&lt;/li&gt;
&lt;li&gt;Authenticates with the Myki portal&lt;/li&gt;
&lt;li&gt;Fetches new transaction data&lt;/li&gt;
&lt;li&gt;Calculates updated attendance statistics&lt;/li&gt;
&lt;li&gt;Commits the updated JSON to the repository&lt;/li&gt;
&lt;li&gt;Deploys the refreshed dashboard to GitHub Pages&lt;/li&gt;
&lt;/ol&gt;
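&lt;p&gt;That daily cycle maps onto a scheduled GitHub Actions workflow. Here is a hedged sketch (the file name, script entry point, and cron time are illustrative; the secret names are the ones the project's configuration uses):&lt;/p&gt;

```yaml
# .github/workflows/update-attendance.yml -- illustrative sketch
name: Update attendance
on:
  schedule:
    - cron: "0 20 * * *"    # once a day (UTC); adjust for Melbourne time
  workflow_dispatch: {}     # allow manual runs for testing
jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Fetch Myki data and recalculate attendance
        run: python main.py   # entry-point name is an assumption
        env:
          MYKI_USERNAME: ${{ secrets.MYKI_USERNAME }}
          MYKI_PASSWORD: ${{ secrets.MYKI_PASSWORD }}
          MYKI_CARDNUMBER: ${{ secrets.MYKI_CARDNUMBER }}
      - name: Commit refreshed JSON
        run: |
          git config user.name "github-actions"
          git add output/attendance.json
          git commit -m "Update attendance data" || echo "no changes"
          git push
```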

&lt;p&gt;I don't touch anything. I just open the dashboard URL on my phone each morning and see my current attendance percentage. No manual data entry, no clicking through the Myki portal, no spreadsheet updates. The app handles everything from data extraction to visualization without human intervention.&lt;/p&gt;

&lt;p&gt;And it's completely free. GitHub Actions provides 2,000 free minutes per month for private repositories (unlimited for public repos), GitHub Pages hosting is free, and the entire stack runs on free tiers. Zero ongoing costs for a fully automated daily workflow.&lt;/p&gt;

&lt;p&gt;The beauty of this architecture is its separation of concerns: Python handles the complex authentication and data extraction (runs locally or in GitHub Actions), React handles the user-friendly visualization (deployed statically), and JSON serves as the contract between them.&lt;/p&gt;
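&lt;p&gt;To make that contract concrete, here is a hypothetical shape for the generated file (the project's real schema may differ; all field names here are illustrative):&lt;/p&gt;

```json
{
  "summary": { "percentage": 52.4, "workingDays": 21, "daysAttended": 11 },
  "attendedDates": ["2025-06-02", "2025-06-03", "2025-06-10"],
  "manualAttendanceDates": ["2025-06-05"],
  "skipDates": ["2025-06-09"]
}
```

&lt;p&gt;As long as the Python side keeps emitting this shape, the React side never needs to know how the data was obtained.&lt;/p&gt;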

&lt;h2&gt;
  
  
  The Roadmap: 8 Features Across 3 Phases
&lt;/h2&gt;

&lt;p&gt;Here's exactly what we built, broken down into three logical phases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1: Foundation (Authentication &amp;amp; API Discovery)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Myki Authentication &amp;amp; Cloudflare Bypass&lt;/strong&gt; (Large, 2 days)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser automation with Playwright to bypass Cloudflare Turnstile detection&lt;/li&gt;
&lt;li&gt;Profile-based trust signals (copy Chrome profile data) to appear as legitimate user&lt;/li&gt;
&lt;li&gt;Multi-step authentication flow to extract session cookies and Bearer token&lt;/li&gt;
&lt;li&gt;Critical blocker—nothing else could work until this was solved&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Transaction History API Reverse Engineering&lt;/strong&gt; (Medium, 1 day)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network traffic analysis to identify undocumented Myki API endpoints&lt;/li&gt;
&lt;li&gt;Request header mapping (x-ptvwebauth, x-verifytoken, authorization Bearer)&lt;/li&gt;
&lt;li&gt;Response parsing to extract transaction data&lt;/li&gt;
&lt;li&gt;Session persistence handling (Bearer token expires after ~20 minutes)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Phase 2: Data Layer (Extraction &amp;amp; Processing)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Myki SDK / Data Retrieval - Browser-based&lt;/strong&gt; (Medium, 1 day)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser-based scraping as fallback (API reverse engineering had limitations)&lt;/li&gt;
&lt;li&gt;Transaction data parsing (station names, timestamps, tap on/off events)&lt;/li&gt;
&lt;li&gt;Date range query support for historical data retrieval&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Card Selection &amp;amp; Date Range Handling&lt;/strong&gt; (Small, &amp;lt;1 day)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Programmatic Myki card selection from user's registered cards&lt;/li&gt;
&lt;li&gt;Configurable date range parameters for transaction queries&lt;/li&gt;
&lt;li&gt;Input validation and error handling&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Attendance Logic &amp;amp; JSON Storage&lt;/strong&gt; (Medium, 1 day)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Station-based attendance detection (checks if user visited their designated work station)&lt;/li&gt;
&lt;li&gt;Daily and monthly attendance aggregations&lt;/li&gt;
&lt;li&gt;Skip dates support for planned absences&lt;/li&gt;
&lt;li&gt;Manual attendance override for non-PTV commutes (drove to work)&lt;/li&gt;
&lt;li&gt;Structured JSON output with separate sections for clarity&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Phase 3: Integration &amp;amp; UI (Visualization &amp;amp; Configuration)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;GitHub Integration for Data Backup&lt;/strong&gt; (Small, &amp;lt;1 day)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Optional GitHub repository integration for data version control&lt;/li&gt;
&lt;li&gt;Automated commits and pushes via GitHub Actions&lt;/li&gt;
&lt;li&gt;Historical tracking and audit trail&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;React Frontend Dashboard&lt;/strong&gt; (Medium, 1 day)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Interactive calendar component with attended days highlighted&lt;/li&gt;
&lt;li&gt;Monthly statistics bar chart for trend analysis&lt;/li&gt;
&lt;li&gt;Summary dashboard (attendance %, working days, days attended/missed)&lt;/li&gt;
&lt;li&gt;Date range filtering with react-datepicker integration&lt;/li&gt;
&lt;li&gt;Public holidays display using date-holidays library&lt;/li&gt;
&lt;li&gt;Fully responsive mobile-first design with Tailwind CSS v4&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Live at: &lt;a href="https://koustubh25.github.io/station-station/" rel="noopener noreferrer"&gt;https://koustubh25.github.io/station-station/&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configuration Management &amp;amp; User Setup&lt;/strong&gt; (Small, &amp;lt;1 day)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub secrets for secure credential storage (MYKI_USERNAME, MYKI_PASSWORD, MYKI_CARDNUMBER)&lt;/li&gt;
&lt;li&gt;Config files for attendance settings (target station, skip dates, manual dates)&lt;/li&gt;
&lt;li&gt;Multi-user support with user-specific configuration&lt;/li&gt;
&lt;li&gt;Setup documentation and README&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Cloudflare bypass was the critical path item. Once we solved that authentication challenge in the first two days, the rest of the features flowed in rapid succession. This is the power of proper sequencing—identifying technical blockers early and solving them first unlocks everything downstream.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The application is live at &lt;a href="https://koustubh25.github.io/station-station/" rel="noopener noreferrer"&gt;https://koustubh25.github.io/station-station/&lt;/a&gt;. You'll see my actual attendance data (with privacy considerations—no personal identifiers exposed). The dashboard shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Attendance Calendar&lt;/strong&gt;: Interactive monthly view with attended days highlighted in green&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monthly Statistics&lt;/strong&gt;: Bar chart showing attendance trends across months&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summary Dashboard&lt;/strong&gt;: At-a-glance metrics (attendance percentage, working days, days attended/missed)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Date Range Filtering&lt;/strong&gt;: Focus on specific periods for detailed analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Public Holidays&lt;/strong&gt;: Automatically displayed for context (Victoria, Australia)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The repository is public at &lt;a href="https://github.com/koustubh25/station-station" rel="noopener noreferrer"&gt;https://github.com/koustubh25/station-station&lt;/a&gt; (if you want to deploy this yourself—more on that in Part 5). The frontend is fully responsive, so try it on your phone. That's where I use it most—quick morning check to see my monthly progress.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Technical Challenges
&lt;/h2&gt;

&lt;p&gt;Let me highlight the three biggest technical challenges we faced. These weren't abstract problems—they were real blockers that required careful problem-solving.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Cloudflare Turnstile Bypass&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Myki portal is protected by Cloudflare Turnstile, which actively detects and blocks headless browsers. Our first attempt using standard Playwright headless mode failed immediately with the "Verifying you are human..." overlay blocking form access.&lt;/p&gt;

&lt;p&gt;The solution: browser profile trust signals. Playwright launches with an empty Chrome profile directory structure (Cookies, Preferences, History, Web Data, Login Data files), which provides just enough profile metadata to appear as a real browser without containing any actual user data. This runs on GitHub Actions runners, so there's no risk to user credentials—it's a fresh environment each time. Cloudflare recognizes the profile structure as legitimate traffic. Running in headed (visible) mode rather than headless also helps avoid automation detection signals.&lt;/p&gt;

&lt;p&gt;This was the critical blocker for the entire project. Without solving Cloudflare bypass, we couldn't access any Myki data. It took domain knowledge of browser fingerprinting and creative problem-solving—not something AI could generate from patterns alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. API Reverse Engineering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Myki transaction API is completely undocumented. We had to analyze browser network traffic to identify endpoints, required headers, and authentication patterns.&lt;/p&gt;

&lt;p&gt;Key discoveries through network analysis:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bearer token location: extracted from authentication response &lt;code&gt;data.token&lt;/code&gt; field&lt;/li&gt;
&lt;li&gt;Required headers: &lt;code&gt;x-ptvwebauth&lt;/code&gt;, &lt;code&gt;x-verifytoken&lt;/code&gt;, &lt;code&gt;x-passthruauth&lt;/code&gt;, &lt;code&gt;authorization: Bearer &amp;lt;token&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Token expiration: ~20 minutes, requires re-authentication for long-running sessions&lt;/li&gt;
&lt;li&gt;Request structure: specific query parameters for date ranges and card selection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's a snippet of the per-user configuration that drives the extraction:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;myki_config.json&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;structure&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"users"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"username"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"targetStation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Your Work Station Name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"skipDates"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"2025-12-25"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-01-01"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"startDate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-04-15"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"manualAttendanceDates"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"2025-11-03"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AI could implement the API client once we documented these patterns in the spec. But identifying which headers were critical, understanding token lifetime, and mapping the authentication flow required human analysis of network traffic.&lt;/p&gt;
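&lt;p&gt;To make those discoveries concrete, here's a minimal sketch of how a client might cache the bearer token and assemble the required headers. The class name, the refresh margin, and the callable-based authentication hook are illustrative assumptions; the undocumented endpoint URLs are deliberately omitted.&lt;/p&gt;

```python
# Illustrative sketch of token caching plus the discovered header set.
import time

TOKEN_LIFETIME_SECONDS = 20 * 60  # observed ~20-minute expiry

class MykiClient:
    def __init__(self, authenticate):
        # authenticate: callable performing the login and returning the
        # auth response as a dict (shape assumed from the notes above).
        self._authenticate = authenticate
        self._token = None
        self._token_time = 0.0

    def _ensure_token(self):
        # Re-authenticate before the bearer token hits its ~20-minute expiry.
        stale = time.time() - self._token_time > TOKEN_LIFETIME_SECONDS - 60
        if self._token is None or stale:
            response = self._authenticate()
            self._token = response["data"]["token"]  # token lives in data.token
            self._token_time = time.time()
        return self._token

    def headers(self, web_auth, verify_token, passthru_auth):
        """Build the header set the transaction API requires."""
        token = self._ensure_token()
        return {
            "x-ptvwebauth": web_auth,
            "x-verifytoken": verify_token,
            "x-passthruauth": passthru_auth,
            "authorization": f"Bearer {token}",
        }
```

&lt;p&gt;The caching matters for long-running extraction jobs: without it, a session that outlives the token silently starts getting auth failures mid-run.&lt;/p&gt;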

&lt;p&gt;&lt;strong&gt;3. Transaction Parsing and Attendance Logic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Inferring office attendance from train transactions required clear, simple logic. The core rule: "Did the user visit their work station on this date?"&lt;/p&gt;

&lt;p&gt;Our solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If any transaction on date D involves the user's designated target station, count D as attended&lt;/li&gt;
&lt;li&gt;Card type, fare type, transaction status - none of that matters for attendance detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skip dates&lt;/strong&gt;: Explicitly exclude dates that shouldn't count (sick leave, planned vacation). Public holidays are automatically detected and excluded using Victoria, Australia's holiday calendar. For example, if I'm sick on a Tuesday, I add that date to &lt;code&gt;skipDates&lt;/code&gt; so it doesn't count against my attendance percentage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual attendance dates&lt;/strong&gt;: Override for days I attended work but didn't take the train. For instance, if I drove to work or carpooled, there's no Myki transaction, but I still want it counted as an office day. I add that date to &lt;code&gt;manualAttendanceDates&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This handles the edge cases that real-world commuting brings. Some weeks I drive because I need my car for errands after work. Some days I'm sick or on vacation. The configuration system makes it easy to adjust attendance tracking to match reality, not just train transactions.&lt;/p&gt;

&lt;p&gt;This domain-specific logic required human understanding of the problem space and business rules. AI could generate the code structure, but the attendance rules came from me knowing how I actually commute and what counts as "attendance."&lt;/p&gt;
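&lt;p&gt;The rules above are simple enough to sketch in a few lines of Python. The transaction and config shapes here are my assumptions for illustration, and the real code also auto-detects Victorian public holidays, which this sketch skips:&lt;/p&gt;

```python
# Sketch of the attendance rule: target-station match, minus skip dates,
# plus manual overrides. Data shapes are illustrative.
from datetime import date

def attended_dates(transactions, config):
    """Return the set of dates counted as office attendance.

    transactions: iterable of (date, station_name) pairs
    config: dict with targetStation, skipDates, manualAttendanceDates
    """
    skip = {date.fromisoformat(d) for d in config.get("skipDates", [])}
    manual = {date.fromisoformat(d) for d in config.get("manualAttendanceDates", [])}
    target = config["targetStation"]

    # Any transaction at the target station counts the whole day, regardless
    # of card type, fare type, or transaction status.
    train_days = {d for d, station in transactions if station == target}

    # Manual overrides add days (drove or carpooled); skip dates remove them
    # (sick leave, vacation).
    return (train_days | manual) - skip
```

&lt;p&gt;Applying skip dates last means they win even over a stray transaction on that day, which is the behaviour you want when, say, you tapped on while sick but went straight home.&lt;/p&gt;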

&lt;h2&gt;
  
  
  What We Actually Built
&lt;/h2&gt;

&lt;p&gt;So what does ~3 days of structured development look like in practice? Here's what we ended up with:&lt;/p&gt;

&lt;p&gt;The codebase is about 6,000 lines—roughly 3,500 lines of Python for the backend extraction and processing logic, and 2,500 lines of React/CSS for the frontend dashboard. That's 8 UI components, 3 custom React hooks, and a handful of utility modules.&lt;/p&gt;

&lt;p&gt;Performance-wise, it's pretty snappy. Lighthouse scores are in the 95+ range across the board (accessibility hits 100). The bundle is 65KB gzipped, loads in under a second on 3G, and becomes interactive in under 2 seconds. Not bad for a dashboard with charts and calendars. The frontend doesn't do any complex calculations—all the heavy lifting (attendance logic, date processing) happens in Python during the data extraction phase. The React app just renders pre-computed JSON.&lt;/p&gt;

&lt;p&gt;I built this in about 3 days of active development. That includes writing the specs upfront, breaking down tasks, implementing features, and deploying to GitHub Pages. The upfront time spent on specs definitely paid off—I had way fewer "wait, what was I supposed to build here?" moments compared to my usual side projects.&lt;/p&gt;

&lt;p&gt;Is this faster than just prompting Claude directly? Honestly, probably not for the initial code generation. But debugging was way easier because I could reference the spec when something didn't work. And when I came back to add the manual attendance feature a week later, the spec told me exactly where to hook it in. That's where the structured approach actually saves time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;We've seen the Station Station project: the problem it solves, the architecture, the 8-feature roadmap, and the technical challenges. We have a live application built in 3 days using agent-os SDD.&lt;/p&gt;

&lt;p&gt;But how did we actually build this? How does the agent-os workflow translate requirements into specs, specs into tasks, and tasks into working code? That's what Part 3 will explore in detail.&lt;/p&gt;

&lt;p&gt;In Part 3, we'll walk through the complete agent-os workflow: creating a product, shaping specifications, writing detailed specs, breaking down tasks, and implementing features with AI assistance. We'll use actual Station Station examples to show how each phase works and where human review happens.&lt;/p&gt;

&lt;p&gt;If you're curious how a structured workflow can make AI assistance predictable instead of frustrating, Part 3 is where we dive deep into the mechanics.&lt;/p&gt;

</description>
      <category>specdriven</category>
      <category>agentos</category>
      <category>react</category>
      <category>python</category>
    </item>
    <item>
      <title>Part 1: Spec-Driven Development - Building Predictable AI-Assisted Software</title>
      <dc:creator>Koustubh</dc:creator>
      <pubDate>Tue, 04 Nov 2025 05:40:36 +0000</pubDate>
      <link>https://forem.com/koustubh/part-1-spec-driven-development-building-predictable-ai-assisted-software-19ne</link>
      <guid>https://forem.com/koustubh/part-1-spec-driven-development-building-predictable-ai-assisted-software-19ne</guid>
      <description>&lt;p&gt;You know that feeling when you're using AI to generate code, and it seems to understand exactly what you want? The AI suggests a complete implementation. You accept it, run your tests, and... something's off. So you refine your prompt with more context. The AI generates a different approach. Tests still fail. You add more details to your conversation, but now the context is getting long and the AI starts hallucinating—confidently suggesting functions that don't exist, mixing up variable names, or contradicting its earlier recommendations. Before you know it, you're caught in a trial-and-error loop, spending more time debugging and re-prompting than you would have writing it yourself.&lt;/p&gt;

&lt;p&gt;I've been there. And I've found a better way.&lt;/p&gt;

&lt;p&gt;This is the first part of a series where I'll share how I built Station Station—a personal project for tracking Melbourne train commuters' office attendance—using Spec-Driven Development (SDD). This isn't a promotional piece. I'm going to show you the actual challenges, the times when AI failed spectacularly, and the structured approach that made AI assistance genuinely productive instead of frustrating. By the end of this series, you'll have a clear framework for deciding when SDD makes sense for your projects and when it's overkill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vibe Coding vs Spec-Driven Development
&lt;/h2&gt;

&lt;p&gt;That trial-and-error loop I described? It's often called &lt;strong&gt;"vibe coding"&lt;/strong&gt;—chatting with AI, trying what it suggests, debugging when it breaks, and iterating until something works. No upfront planning, no structure, just vibing with the AI and seeing where it takes you. For quick experiments and throwaway scripts, vibe coding is perfectly fine. But for real projects you want to finish and maintain? The lack of structure becomes a problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spec-Driven Development takes a different approach&lt;/strong&gt;: structure first, then code. Instead of chatting your way to a solution, you invest time upfront to document what you're building, why you're building it, and what success looks like. Then you let AI implement those documented requirements. The AI still does the heavy lifting, but within guardrails you've defined.&lt;/p&gt;

&lt;p&gt;The contrast is striking. Traditional AI chat often becomes a trial-and-error loop—you provide a vague prompt, get generated code, test it, realize it doesn't quite work, and start over. Spec-Driven Development follows a predictable path: gather requirements, write detailed specifications, break into tasks, implement systematically, and produce reviewable output. You spend more time upfront on specs, but save time on debugging and rework.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fkoustubh25%2Fstation-station%2Fmain%2Fagent-os%2Fspecs%2F2025-11-03-technical-blog-sdd%2Fplanning%2Fvisuals%2Fsdd-vs-traditional-comparison.png%3Fv%3D2" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fkoustubh25%2Fstation-station%2Fmain%2Fagent-os%2Fspecs%2F2025-11-03-technical-blog-sdd%2Fplanning%2Fvisuals%2Fsdd-vs-traditional-comparison.png%3Fv%3D2" alt="Side-by-side comparison: Traditional AI Chat shows trial-and-error loop with vague prompts leading to repeated attempts until code works. Spec-Driven Development shows linear progression from requirements to specification to implementation to reviewable output" width="1321" height="1508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Spec-Driven Development?
&lt;/h2&gt;

&lt;p&gt;Spec-Driven Development is exactly what it sounds like: you write detailed specifications before any code gets generated. Instead of throwing vague prompts at an AI and hoping for the best, you invest time upfront to document exactly what you want to build, why you're building it, who it's for, and what success looks like.&lt;/p&gt;

&lt;p&gt;Here's the core insight: AI is incredibly good at implementing well-defined specifications, but it's terrible at mind-reading. When you provide a structured spec with clear requirements, user stories, and acceptance criteria, AI can generate code that actually works. When you give it fuzzy requirements through conversational chat, you get fuzzy results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools for Implementing SDD
&lt;/h2&gt;

&lt;p&gt;There are several tools that help implement Spec-Driven Development workflows. I've experimented with a few, including &lt;strong&gt;&lt;a href="https://github.com/Fission-AI/OpenSpec" rel="noopener noreferrer"&gt;OpenSpec&lt;/a&gt;&lt;/strong&gt;, which takes a different angle on the problem.&lt;/p&gt;

&lt;p&gt;OpenSpec focuses on change proposals for existing systems. Its workflow centers around proposing, reviewing, and implementing changes to established codebases, with support for multiple AI tools through the AGENTS.md convention. It's particularly strong when you're working with a team using different AI assistants or making incremental updates to existing projects.&lt;/p&gt;

&lt;p&gt;For Station Station, I chose &lt;strong&gt;&lt;a href="https://buildermethods.com/agent-os" rel="noopener noreferrer"&gt;agent-os&lt;/a&gt;&lt;/strong&gt;, which is optimized for building complete products from scratch. It handles the full product lifecycle—from initial idea through deployment—with deep Claude integration and sophisticated multi-agent orchestration. The spec format is requirements-based with explicit goals, user stories, and technical implementation details. They've recently introduced &lt;a href="https://buildermethods.com/agent-os/version-2" rel="noopener noreferrer"&gt;version 2&lt;/a&gt; with enhanced capabilities.&lt;/p&gt;

&lt;p&gt;Since I was building a greenfield personal project solo with Claude, agent-os was the natural fit.&lt;/p&gt;

&lt;p&gt;The important thing isn't which SDD tool you use—it's that you use &lt;em&gt;some&lt;/em&gt; structure to align human intent with AI implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Agent-OS Workflow
&lt;/h2&gt;

&lt;p&gt;For Station Station, the agent-os workflow followed five phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create Product&lt;/strong&gt;: Define your product mission, target users, and core value proposition&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shape Spec&lt;/strong&gt;: Gather requirements through structured AI-human dialogue where the AI asks clarifying questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write Specs&lt;/strong&gt;: Convert those requirements into detailed technical specifications with explicit scope boundaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write Tasks&lt;/strong&gt;: Break the spec into granular, actionable implementation tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement Tasks&lt;/strong&gt;: AI-assisted coding with human review at key checkpoints&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The magic isn't in any of these individual steps—product managers have been writing specs for decades. The magic is in how this structure channels AI's strengths while keeping humans in control of architecture and quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Station Station Case Study
&lt;/h2&gt;

&lt;p&gt;Over this series, I'll use Station Station as a concrete example of SDD in action. Here's the quick version: I'm a Melbourne train commuter working hybrid, and my company has a 50% office attendance policy. But there's no automated way to track whether I'm meeting that threshold. I have to manually review my Myki (Melbourne's public transport smart card) transaction history and count which days I tapped on at my work station.&lt;/p&gt;

&lt;p&gt;Tedious? Absolutely. Perfect problem for automation? You bet.&lt;/p&gt;

&lt;p&gt;Station Station automatically determines office attendance by analyzing Myki transaction data. If you tapped on/off at your designated work station, it counts as an office day. The app presents monthly statistics, attendance calendars, and lets you export the data for compliance tracking. It's live at &lt;a href="https://koustubh25.github.io/station-station/" rel="noopener noreferrer"&gt;https://koustubh25.github.io/station-station/&lt;/a&gt;. You can use or deploy this tool yourself for free—I'll show you how in the final part of this series.&lt;/p&gt;

&lt;p&gt;Here's what makes it a good SDD case study:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real complexity&lt;/strong&gt;: Bypassing Cloudflare bot detection, reverse-engineering undocumented APIs, handling timezone edge cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;8 completed features&lt;/strong&gt;: Built incrementally across 3 phases using the agent-os workflow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Honest challenges&lt;/strong&gt;: There were bugs AI couldn't fix. I'll show you exactly where human intervention was critical.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-world application&lt;/strong&gt;: ~6,300 lines of code, fully responsive React dashboard with Lighthouse score 95+&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This wasn't a toy project. It's a real application I use daily, built with real constraints, encountering real problems. In Part 2, I'll walk through the project in detail—the problem it solves, the technical challenges, and the 8-feature roadmap. In Part 3, we'll dive deep into the agent-os workflow. In Part 4, I'll share the honest truth about where AI failed and why.&lt;/p&gt;

&lt;p&gt;But before we get there, let's set realistic expectations.&lt;/p&gt;

&lt;h2&gt;
  
  
  SDD Is Not Full Automation
&lt;/h2&gt;

&lt;p&gt;Here's the critical thing to understand: Spec-Driven Development is not about handing requirements to an AI and walking away while it builds your app. It's a structured human-AI partnership where you maintain control of architecture and quality while the AI handles implementation and boilerplate.&lt;/p&gt;

&lt;p&gt;Think of it this way: you're the architect and reviewer, the AI is your implementation assistant. You design the system, write the specs, approve the task breakdown, and review the code. The AI generates the boilerplate, implements standard patterns, writes tests, and handles the repetitive work you'd rather not do manually.&lt;/p&gt;

&lt;p&gt;There's a collaboration spectrum, which I'll detail in Part 4:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tier 1 - AI Can Handle Alone&lt;/strong&gt;: Boilerplate generation, standard CRUD operations, test scaffolding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier 2 - AI + Human Review Required&lt;/strong&gt;: Complex business logic, external API integration, cross-file refactoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier 3 - Human Must Lead&lt;/strong&gt;: Debugging multi-layered issues, architectural decisions, security implementations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most real development work spans all three tiers. SDD helps you identify upfront which tier each task falls into, so you know where to invest your review energy and where AI can run autonomously.&lt;/p&gt;

&lt;p&gt;I learned this the hard way. When I added manual attendance date tracking to Station Station, the AI couldn't fix a date-handling bug even after several debugging rounds. I had to review the code myself, identify the specific function where the problem lived, and guide the AI to the solution. The AI was perfectly capable of implementing the fix—once I identified what needed fixing.&lt;/p&gt;

&lt;p&gt;That's the partnership. AI accelerates implementation. Humans guide architecture and troubleshoot the hard problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;In Part 1, we've established what Spec-Driven Development is, why it's more predictable than ad-hoc AI chat, and how it compares to other approaches like OpenSpec. We've introduced Station Station as a real-world case study with honest challenges included.&lt;/p&gt;

&lt;p&gt;In Part 2, we'll explore the Station Station project in detail: the problem it solves, the 8 features I shipped, the technical challenges like Cloudflare bypass and API reverse engineering, and the metrics that prove this approach works in production.&lt;/p&gt;

&lt;p&gt;If you're tired of the AI chat trial-and-error loop, if you want to leverage AI assistance without sacrificing code quality, or if you're just curious how a structured spec can turn AI from unpredictable to reliable—stick around. This series will show you exactly how it works, warts and all.&lt;/p&gt;

</description>
      <category>specdriven</category>
      <category>ai</category>
      <category>productivity</category>
      <category>claude</category>
    </item>
  </channel>
</rss>
