<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Steve Harlow</title>
    <description>The latest articles on Forem by Steve Harlow (@steve_harlow_0dbc0e910b6d).</description>
    <link>https://forem.com/steve_harlow_0dbc0e910b6d</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3648455%2F134510f9-fcbe-4f4a-aa40-6d640fafb234.jpg</url>
      <title>Forem: Steve Harlow</title>
      <link>https://forem.com/steve_harlow_0dbc0e910b6d</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/steve_harlow_0dbc0e910b6d"/>
    <language>en</language>
    <item>
      <title>Building a Transparent AI Pipeline: 59 Weeks of Automated Political Scoring with Claude API</title>
      <dc:creator>Steve Harlow</dc:creator>
      <pubDate>Thu, 19 Mar 2026 03:25:45 +0000</pubDate>
      <link>https://forem.com/steve_harlow_0dbc0e910b6d/building-a-transparent-ai-pipeline-59-weeks-of-automated-political-scoring-with-claude-api-33of</link>
      <guid>https://forem.com/steve_harlow_0dbc0e910b6d/building-a-transparent-ai-pipeline-59-weeks-of-automated-political-scoring-with-claude-api-33of</guid>
      <description>&lt;p&gt;I've been running an automated AI pipeline for over a year that ingests news articles, clusters them into political events, and scores each event on two independent axes. Here's how it works, what I learned, and why I made everything transparent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Political events have two dimensions that are rarely measured together:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;How much institutional damage&lt;/strong&gt; does this cause? (democratic health)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How much media attention&lt;/strong&gt; does it get? (distraction economics)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When these are wildly mismatched — high damage, low attention — something important is being missed. I built &lt;a href="https://distractionindex.org" rel="noopener noreferrer"&gt;The Distraction Index&lt;/a&gt; to detect these gaps automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;News Sources (GDELT + GNews + Google News RSS)
    ↓ every 4 hours
Ingestion Pipeline (/api/ingest)
    ↓ dedup + store
Clustering (Claude Haiku) → group articles into events
    ↓
Dual-Axis Scoring (Claude Sonnet) → Score A + Score B
    ↓
Weekly Freeze → immutable snapshot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tech stack:&lt;/strong&gt; Next.js 16 (App Router), Supabase (PostgreSQL), Claude API, Vercel&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Two Models?
&lt;/h3&gt;

&lt;p&gt;Cost optimization was critical. Running everything through Sonnet would cost ~$300/month. Instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Haiku&lt;/strong&gt; handles article clustering (~$0.25/1M tokens) — it groups articles by topic similarity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Sonnet&lt;/strong&gt; handles scoring (~$3/1M tokens) — it evaluates institutional impact using structured prompts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: &lt;strong&gt;~$30/month&lt;/strong&gt; for a production pipeline processing articles every 4 hours.&lt;/p&gt;
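
&lt;p&gt;The arithmetic behind that figure is simple enough to sketch. The per-million-token prices are the approximate figures above; the token volumes are hypothetical placeholders, not measured numbers:&lt;/p&gt;

```python
# Back-of-envelope cost model for the Haiku/Sonnet split. Prices are the
# approximate figures quoted above; token volumes are hypothetical.
HAIKU_PRICE_PER_M = 0.25   # ~$/1M tokens, clustering
SONNET_PRICE_PER_M = 3.00  # ~$/1M tokens, scoring

def monthly_cost(haiku_tokens_m, sonnet_tokens_m):
    """Estimate monthly spend given millions of tokens per model."""
    return haiku_tokens_m * HAIKU_PRICE_PER_M + sonnet_tokens_m * SONNET_PRICE_PER_M

estimate = monthly_cost(40, 6)  # e.g. 40M Haiku tokens, 6M Sonnet tokens
```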

&lt;h2&gt;
  
  
  The Dual Scoring System
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Score A: Constitutional Damage (0-100)
&lt;/h3&gt;

&lt;p&gt;Seven weighted governance drivers, each scored 0-5:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Driver&lt;/th&gt;
&lt;th&gt;Weight&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Judicial Independence&lt;/td&gt;
&lt;td&gt;0.18&lt;/td&gt;
&lt;td&gt;Court stacking, ruling defiance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Press Freedom&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;Journalist targeting, access restrictions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voting Rights&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;Disenfranchisement, election interference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Environmental Policy&lt;/td&gt;
&lt;td&gt;0.12&lt;/td&gt;
&lt;td&gt;Regulatory rollbacks, enforcement gaps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Civil Liberties&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;Due process, privacy, free assembly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;International Norms&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;Treaty violations, alliance damage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fiscal Governance&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;Budget manipulation, oversight bypass&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The weighted sum is then multiplied by severity modifiers (durability × reversibility × precedent) and by mechanism/scope modifiers.&lt;/p&gt;
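
&lt;p&gt;A minimal sketch of how the weighted drivers and severity modifiers could combine. The function shape and modifier arithmetic here are assumptions for illustration; the exact published formula is at &lt;a href="https://distractionindex.org/methodology" rel="noopener noreferrer"&gt;/methodology&lt;/a&gt;:&lt;/p&gt;

```python
# Hedged sketch of Score A: weighted 0-5 driver ratings scaled to 0-100, then
# multiplied by severity modifiers. The function shape and modifier math are
# assumptions; the published formula lives at distractionindex.org/methodology.
WEIGHTS = {
    "judicial_independence": 0.18,
    "press_freedom": 0.15,
    "voting_rights": 0.15,
    "environmental_policy": 0.12,
    "civil_liberties": 0.15,
    "international_norms": 0.10,
    "fiscal_governance": 0.15,
}

def score_a(drivers, durability=1.0, reversibility=1.0, precedent=1.0):
    """drivers maps each driver name to a 0-5 rating."""
    base = sum(WEIGHTS[name] * rating for name, rating in drivers.items())
    severity = durability * reversibility * precedent
    return min(100.0, (base / 5.0) * 100.0 * severity)

uniform = score_a({name: 4 for name in WEIGHTS})  # uniform 4/5, neutral modifiers
```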

&lt;h3&gt;
  
  
  Score B: Distraction/Hype (0-100)
&lt;/h3&gt;

&lt;p&gt;Two-layer model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Layer 1 (55%):&lt;/strong&gt; Raw media hype — volume, social amplification, cross-platform spread, emotional framing, celebrity involvement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 2 (45%):&lt;/strong&gt; Strategic manipulation indicators — timing relative to damage events, coordinated messaging, deflection patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Layer 2 is modulated by an intentionality score (0-15). Low intentionality → Layer 2 weight drops to 10%.&lt;/p&gt;
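
&lt;p&gt;A hedged sketch of the blend: the 55/45 split and the 10% fallback come from the description above, but the intentionality threshold used below, and the idea that Layer 1 absorbs the freed-up weight, are assumed placeholders:&lt;/p&gt;

```python
# Hedged sketch of Score B's two-layer blend. The 55/45 split and 10% fallback
# are from the article; the threshold of 8 and the assumption that Layer 1
# absorbs the remaining weight are placeholders, not published values.
def score_b(raw_hype, strategic, intentionality):
    """raw_hype, strategic: 0-100 layer scores; intentionality: 0-15."""
    layer2_weight = 0.45 if intentionality >= 8 else 0.10
    return (1.0 - layer2_weight) * raw_hype + layer2_weight * strategic
```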

&lt;h3&gt;
  
  
  Classification
&lt;/h3&gt;

&lt;p&gt;Events are classified by dominance margin:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Damage&lt;/strong&gt; (List A): Score A exceeds Score B by ≥10 points&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distraction&lt;/strong&gt; (List B): Score B exceeds Score A by ≥10 points&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Noise&lt;/strong&gt; (List C): Neither dominates&lt;/li&gt;
&lt;/ul&gt;
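
&lt;p&gt;The classification rule above is only a few lines of code:&lt;/p&gt;

```python
# Classification by dominance margin, as described above.
def classify(score_a, score_b, margin=10):
    if score_a - score_b >= margin:
        return "damage"       # List A
    if score_b - score_a >= margin:
        return "distraction"  # List B
    return "noise"            # List C

classify(72, 35)  # damage: A exceeds B by 37
classify(40, 85)  # distraction: B exceeds A by 45
classify(50, 55)  # noise: neither dominates
```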

&lt;h2&gt;
  
  
  The Smokescreen Index
&lt;/h2&gt;

&lt;p&gt;The most interesting feature: automatic pairing of high-distraction events with concurrent high-damage events.&lt;/p&gt;

&lt;p&gt;When a B-dominant event (media spectacle) co-occurs with an A-dominant event (institutional harm) that received less coverage, the system flags it as a potential smokescreen.&lt;/p&gt;
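
&lt;p&gt;A minimal sketch of that pairing logic. The published criteria are more involved; comparing raw coverage counts, as below, is an assumption for illustration:&lt;/p&gt;

```python
# Hedged sketch of smokescreen pairing: match each B-dominant spectacle with
# concurrent A-dominant harms that drew less coverage. The real pairing
# criteria are in the published methodology; this comparison is an assumption.
def smokescreen_pairs(events):
    """events: dicts with 'id', 'list' ('A', 'B', or 'C'), and 'coverage'."""
    spectacles = [e for e in events if e["list"] == "B"]
    harms = [e for e in events if e["list"] == "A"]
    return [
        (b["id"], a["id"])
        for b in spectacles
        for a in harms
        if b["coverage"] > a["coverage"]  # the harm got less attention
    ]
```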

&lt;p&gt;&lt;strong&gt;210+ pairs identified&lt;/strong&gt; across 59 weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Radical Transparency
&lt;/h2&gt;

&lt;p&gt;Every scoring formula, weight, and AI prompt is published at &lt;a href="https://distractionindex.org/methodology" rel="noopener noreferrer"&gt;/methodology&lt;/a&gt;. This was a deliberate design choice — if you're scoring political events, your methodology must be auditable.&lt;/p&gt;

&lt;p&gt;Key transparency features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Immutable weekly snapshots&lt;/strong&gt; — once a weekly snapshot freezes, scores cannot be silently changed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Append-only corrections&lt;/strong&gt; — post-freeze corrections are timestamped and linked to the original&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Published prompts&lt;/strong&gt; — the exact Claude prompts used for scoring are documented&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open source&lt;/strong&gt; — &lt;a href="https://github.com/sgharlow/distraction" rel="noopener noreferrer"&gt;full codebase on GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Publishing your prompts is terrifying
&lt;/h3&gt;

&lt;p&gt;When your prompt templates are public, anyone can argue with your framing. That's the point — but it requires thick skin and a willingness to iterate.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Immutability prevents model drift
&lt;/h3&gt;

&lt;p&gt;Without frozen snapshots, you can't tell if score changes come from real-world changes or model updates. Immutability is essential for longitudinal analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The two-axis approach reveals patterns
&lt;/h3&gt;

&lt;p&gt;Single-dimension scoring (left/right, reliable/unreliable) misses the key insight: damage and distraction are independent variables. Some events are both. Some are neither.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Cost optimization matters for indie projects
&lt;/h3&gt;

&lt;p&gt;The Haiku-for-clustering, Sonnet-for-scoring split keeps costs at ~$30/month. Without this, the project wouldn't be sustainable as a solo effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;After 59 weeks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1,500+&lt;/strong&gt; scored events&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;11,800+&lt;/strong&gt; ingested articles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;210+&lt;/strong&gt; smokescreen pairs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;288&lt;/strong&gt; tests passing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1,071&lt;/strong&gt; pages indexed&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Live site:&lt;/strong&gt; &lt;a href="https://distractionindex.org" rel="noopener noreferrer"&gt;distractionindex.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Methodology:&lt;/strong&gt; &lt;a href="https://distractionindex.org/methodology" rel="noopener noreferrer"&gt;distractionindex.org/methodology&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source code:&lt;/strong&gt; &lt;a href="https://github.com/sgharlow/distraction" rel="noopener noreferrer"&gt;github.com/sgharlow/distraction&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'd love feedback on the scoring methodology. What would you weight differently? What blind spots do you see?&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>civictech</category>
      <category>ai</category>
      <category>nextjs</category>
    </item>
    <item>
      <title>Building AccessBrowse: Voice-Driven Web Browsing with Gemini AI</title>
      <dc:creator>Steve Harlow</dc:creator>
      <pubDate>Thu, 12 Mar 2026 05:13:20 +0000</pubDate>
      <link>https://forem.com/steve_harlow_0dbc0e910b6d/building-accessbrowse-voice-driven-web-browsing-with-gemini-ai-5cj4</link>
      <guid>https://forem.com/steve_harlow_0dbc0e910b6d/building-accessbrowse-voice-driven-web-browsing-with-gemini-ai-5cj4</guid>
      <description>&lt;h1&gt;
  
  
  Building AccessBrowse: Voice-Driven Web Browsing with Gemini AI
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;This blog post was created for the Gemini Live Agent Challenge hackathon.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#GeminiLiveAgentChallenge&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: The Web Is Built for Eyes
&lt;/h2&gt;

&lt;p&gt;Imagine trying to find an apartment on Zillow without seeing the screen. A sighted user glances at a map, scans listings, clicks filters — all in seconds. A visually impaired user faces a different reality: tab through dozens of invisible elements, hope the screen reader can parse Zillow's React-rendered layout, and guess which button says "Apply Filters" versus "Clear All." Over 2.2 billion people worldwide live with visual impairment, and for many of them, this is what browsing the web feels like every day.&lt;/p&gt;

&lt;p&gt;Despite decades of work on web accessibility standards, screen readers still fundamentally struggle with the modern web. Dynamic JavaScript layouts, single-page applications that never trigger a page reload, and interactive elements that rely on visual context rather than semantic markup create constant barriers. The root issue is architectural: most assistive tools try to understand websites through their source code — parsing the DOM, following ARIA roles, reading element text. This works for well-structured, semantic HTML. It fails on the messy, JavaScript-heavy reality of how websites are actually built.&lt;/p&gt;

&lt;p&gt;We wanted to try a different approach: what if we looked at websites the same way a sighted person does — visually — and combined that with natural voice conversation?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Approach: Coordinate-Based Browsing
&lt;/h2&gt;

&lt;p&gt;The core insight behind AccessBrowse is simple: instead of parsing the DOM, take a screenshot and ask an AI model what it sees. This is exactly what Gemini Computer Use is designed for.&lt;/p&gt;

&lt;p&gt;Here is the action loop at the heart of AccessBrowse:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The user speaks a request ("Find me apartments in Seattle under $1000 on Zillow")&lt;/li&gt;
&lt;li&gt;Gemini Live API receives the voice input and decides to call the &lt;code&gt;browse_web&lt;/code&gt; tool&lt;/li&gt;
&lt;li&gt;The backend requests a screenshot from the Chrome extension&lt;/li&gt;
&lt;li&gt;The screenshot is sent to Gemini Computer Use (&lt;code&gt;gemini-2.5-computer-use&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;The model analyzes the visual content and returns the next action with precise coordinates&lt;/li&gt;
&lt;li&gt;The content script translates coordinates to viewport pixels and executes the action&lt;/li&gt;
&lt;li&gt;Steps 3-6 repeat until the task is complete&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The coordinates come back on a &lt;strong&gt;normalized 1000×1000 grid&lt;/strong&gt; — (0, 0) is the top-left corner, (1000, 1000) is the bottom-right. This abstraction is powerful because it works regardless of the actual viewport size. Whether the browser window is 1280 pixels wide or 1920 pixels wide, coordinate (500, 300) always refers to the same relative position on the page.&lt;/p&gt;

&lt;p&gt;On the extension side, translating these coordinates to actual DOM interactions is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;coordinate&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;innerWidth&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;coordinate&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;innerHeight&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;elementFromPoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This &lt;code&gt;document.elementFromPoint()&lt;/code&gt; approach eliminates the fragility of CSS selector matching. The model does not need to guess at class names or XPath expressions. It looks at the page, identifies the button or form field visually, and returns where to click. In practice, this works on sites ranging from Zillow's complex map-and-list layout to Amazon's product grid to CNN's news feed.&lt;/p&gt;
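
&lt;p&gt;The backend side of the screenshot-analyze-act loop can be sketched as follows; the helper names are placeholders standing in for the extension bridge and the Gemini Computer Use call, not the project's actual API:&lt;/p&gt;

```python
# Hedged sketch of the screenshot-analyze-act loop. take_screenshot,
# next_action, and execute are placeholders for the extension bridge, the
# Gemini Computer Use call, and the content-script action, respectively.
def browse(goal, take_screenshot, next_action, execute, max_steps=10):
    for _ in range(max_steps):
        shot = take_screenshot()          # capture the current page
        action = next_action(goal, shot)  # model returns the next action
        if action["type"] == "done":
            return action.get("result")
        execute(action)                   # click/type at model coordinates
    raise TimeoutError("task did not complete within max_steps")
```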

&lt;h2&gt;
  
  
  The Voice Pipeline: Gemini Live API
&lt;/h2&gt;

&lt;p&gt;AccessBrowse uses Gemini Live API (&lt;code&gt;gemini-2.5-flash-native-audio&lt;/code&gt;) for real-time bidirectional voice streaming. The connection is established using the Google GenAI SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;LOCATION&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LiveConnectConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;response_modalities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AUDIO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TEXT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;speech_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SpeechConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;voice_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;VoiceConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;prebuilt_voice_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PrebuiltVoiceConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;voice_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Aoede&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;system_instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;)]),&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tool_obj&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;live&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;LIVE_API_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;live&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;__aenter__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The session is a full bidi-stream: audio frames flow in (16kHz PCM from the microphone), and audio responses flow out (24kHz PCM from Gemini). The 24kHz output was a deliberate choice — for a product where the user experience is entirely audio-driven, voice quality matters enormously. The difference between 16kHz and 24kHz is clearly audible in consonant clarity and natural intonation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tool Calling Within Live Sessions
&lt;/h3&gt;

&lt;p&gt;One of the most interesting engineering challenges was tool calling within the streaming session. When Gemini decides the user wants to browse a website, it emits a &lt;code&gt;tool_call&lt;/code&gt; in the response stream. The backend must:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Receive the tool call (with function name and arguments)&lt;/li&gt;
&lt;li&gt;Execute the tool (which may involve 10+ steps of screenshot-analyze-act)&lt;/li&gt;
&lt;li&gt;Return the tool result as a &lt;code&gt;LiveClientToolResponse&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Resume processing the audio stream as Gemini speaks the result&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This happens within a single async event loop. The &lt;code&gt;browse_web&lt;/code&gt; tool execution can take 30+ seconds (multiple screenshots, model calls, and DOM actions), and during this time the Live API connection must stay alive. We solve this with a keepalive task that sends 100ms of silence at 200ms intervals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_audio_keepalive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_live&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_live&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LiveClientRealtimeInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;media_chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Blob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;silence_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/pcm;rate=16000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;)]&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Audio Pipeline: Web Audio API in a Chrome Extension
&lt;/h2&gt;

&lt;p&gt;Chrome MV3 extensions have a constraint that made the audio pipeline interesting: service workers cannot access the DOM, which means no &lt;code&gt;AudioContext&lt;/code&gt; or &lt;code&gt;getUserMedia()&lt;/code&gt; in the service worker. The solution is an &lt;strong&gt;offscreen document&lt;/strong&gt; — a hidden page created by the extension specifically for audio processing.&lt;/p&gt;

&lt;p&gt;The offscreen document handles both microphone capture and audio playback:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capture:&lt;/strong&gt; &lt;code&gt;getUserMedia()&lt;/code&gt; with 16kHz sample rate, mono, echo cancellation enabled. Audio frames are captured via a &lt;code&gt;ScriptProcessorNode&lt;/code&gt;, converted from Float32 to Int16 PCM, base64-encoded, and sent to the service worker via &lt;code&gt;chrome.runtime.sendMessage()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Playback:&lt;/strong&gt; Incoming 24kHz PCM audio from Gemini is decoded from base64, converted to Float32, loaded into an &lt;code&gt;AudioBuffer&lt;/code&gt; at 24000Hz sample rate, and played through a queued source node system for smooth sequential playback.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key detail is the &lt;strong&gt;sample rate asymmetry&lt;/strong&gt;: input is 16kHz (what Gemini Live API expects for speech input) while output is 24kHz (what Gemini outputs for higher quality). The offscreen document uses two separate &lt;code&gt;AudioContext&lt;/code&gt; instances at different sample rates to handle this cleanly.&lt;/p&gt;
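
&lt;p&gt;The Float32-to-Int16 conversion is worth seeing concretely. The extension does this in JavaScript; this Python version is illustrative only:&lt;/p&gt;

```python
# Hedged illustration of the Float32-to-Int16 PCM step before base64 encoding.
# The extension performs this in JavaScript; this version just shows the math.
import array
import base64

def float32_to_pcm16_b64(samples):
    """samples: floats in [-1.0, 1.0], one per audio frame."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(s * 32767) for s in clipped]
    return base64.b64encode(array.array("h", ints).tobytes()).decode("ascii")
```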

&lt;h2&gt;
  
  
  Deployment: Cloud Run
&lt;/h2&gt;

&lt;p&gt;The FastAPI backend is deployed to Google Cloud Run using a single deploy script (&lt;code&gt;deploy.sh&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy accessbrowse &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--source&lt;/span&gt; ./backend &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--project&lt;/span&gt; &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--allow-unauthenticated&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--memory&lt;/span&gt; 1Gi &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--timeout&lt;/span&gt; 300
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cloud Run is well-suited for this workload because the backend is stateful (WebSocket sessions) but not long-lived — sessions typically last 5-10 minutes. The 300-second timeout accommodates long browsing sessions, and 1Gi of memory is sufficient for the async Python server handling up to 3 concurrent sessions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Vision-based interaction is more robust than DOM parsing.&lt;/strong&gt; Gemini Computer Use consistently identifies form fields, buttons, and links from screenshots — even on pages with complex layouts, overlapping elements, or minimal semantic markup. The coordinate-based approach eliminates an entire category of bugs related to CSS selector matching. We tested on Zillow (complex map + list layout), Amazon (product grids with dynamic loading), and CNN (news feeds with overlapping images and text). In all cases, the model correctly identified interactive elements from visual inspection alone. Traditional selector-based automation would have required custom adapters for each site.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool calling within a live bidi-stream requires careful async orchestration.&lt;/strong&gt; The GenAI SDK's &lt;code&gt;client.aio.live.connect()&lt;/code&gt; context manager is elegant for basic voice conversations, but executing tools that take 30+ seconds (like a multi-step browse action) within the streaming session required solving several subtle problems. The connection would time out without keepalive audio. The event loop had to simultaneously handle incoming audio, outgoing tool results, and background screenshot processing. We ended up building a pattern where &lt;code&gt;asyncio.Event&lt;/code&gt; objects coordinate state between the WebSocket handler, the Live API session, and the tool executor — three concurrent async loops that must stay synchronized. Documentation around this flow was minimal, so we relied heavily on experimentation and reading the SDK source.&lt;/p&gt;
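
&lt;p&gt;The keepalive-around-a-slow-tool pattern can be sketched as follows; the names are illustrative, and the actual project wires this into the WebSocket handler:&lt;/p&gt;

```python
# Hedged sketch of coordinating a slow tool call with an asyncio.Event, as
# described above. Names are illustrative, not the project's actual code.
import asyncio

async def run_tool_with_keepalive(tool, send_silence, interval=0.2):
    done = asyncio.Event()

    async def keepalive():
        while not done.is_set():
            await send_silence()          # keep the Live API stream alive
            await asyncio.sleep(interval)

    task = asyncio.create_task(keepalive())
    try:
        return await tool()               # slow multi-step browse action
    finally:
        done.set()
        await task
```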

&lt;p&gt;&lt;strong&gt;Audio quality is a feature, not a nice-to-have.&lt;/strong&gt; For users who rely on voice as their primary interface, the difference between 16kHz and 24kHz output is immediately noticeable. Investing in the higher sample rate — and building the dual-&lt;code&gt;AudioContext&lt;/code&gt; pipeline in the offscreen document to handle the sample rate asymmetry — was one of the best decisions we made. The Aoede voice at 24kHz delivers natural intonation and clear consonants that make long browsing sessions comfortable rather than fatiguing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chrome MV3 constraints push you toward better architecture.&lt;/strong&gt; The service worker's lack of DOM access forced us to use an offscreen document for audio, which actually created a cleaner separation of concerns. Message passing between extension components is strictly JSON-serializable, which meant base64 encoding for audio and image data — not elegant, but it forced us to think carefully about data flow boundaries. The result is a four-module architecture (service worker, content script, offscreen document, sidepanel) where each component has a single responsibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Watch the demo:&lt;/strong&gt; &lt;a href="https://youtu.be/1BBzOFUTdKw" rel="noopener noreferrer"&gt;https://youtu.be/1BBzOFUTdKw&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AccessBrowse is open source: &lt;a href="https://github.com/sgharlow/accessbrowse" rel="noopener noreferrer"&gt;github.com/sgharlow/accessbrowse&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The README includes step-by-step setup instructions. You need a Google Cloud account with Vertex AI enabled, Python 3.12+, Node.js 20+, and Google Chrome.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with Google Gemini Live API, Gemini Computer Use, Cloud Run, and Vertex AI for the &lt;a href="https://googleliveagentchallenge.devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#GeminiLiveAgentChallenge&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>ai</category>
    </item>
    <item>
      <title>Building a Transparent AI Pipeline: 59 Weeks of Automated Political Scoring with Claude API</title>
      <dc:creator>Steve Harlow</dc:creator>
      <pubDate>Sun, 22 Feb 2026 22:26:36 +0000</pubDate>
      <link>https://forem.com/steve_harlow_0dbc0e910b6d/building-a-transparent-ai-pipeline-59-weeks-of-automated-political-scoring-with-claude-api-4100</link>
      <guid>https://forem.com/steve_harlow_0dbc0e910b6d/building-a-transparent-ai-pipeline-59-weeks-of-automated-political-scoring-with-claude-api-4100</guid>
      <description>&lt;p&gt;I've been running an automated AI pipeline for over a year that ingests news articles, clusters them into political events, and scores each event on two independent axes. Here's how it works, what I learned, and why I made everything transparent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Political events have two dimensions that are rarely measured together:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;How much institutional damage&lt;/strong&gt; does this cause? (democratic health)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How much media attention&lt;/strong&gt; does it get? (distraction economics)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When these are wildly mismatched — high damage, low attention — something important is being missed. I built &lt;a href="https://distractionindex.org" rel="noopener noreferrer"&gt;The Distraction Index&lt;/a&gt; to detect these gaps automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;News Sources (GDELT + GNews + Google News RSS)
    ↓ every 4 hours
Ingestion Pipeline (/api/ingest)
    ↓ dedup + store
Clustering (Claude Haiku) → group articles into events
    ↓
Dual-Axis Scoring (Claude Sonnet) → Score A + Score B
    ↓
Weekly Freeze → immutable snapshot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tech stack:&lt;/strong&gt; Next.js 16 (App Router), Supabase (PostgreSQL), Claude API, Vercel&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Two Models?
&lt;/h3&gt;

&lt;p&gt;Cost optimization was critical. Running everything through Sonnet would cost ~$300/month. Instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Haiku&lt;/strong&gt; handles article clustering (~$0.25/1M tokens) — it groups articles by topic similarity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Sonnet&lt;/strong&gt; handles scoring (~$3/1M tokens) — it evaluates institutional impact using structured prompts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: &lt;strong&gt;~$30/month&lt;/strong&gt; for a production pipeline processing articles every 4 hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Dual Scoring System
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Score A: Constitutional Damage (0-100)
&lt;/h3&gt;

&lt;p&gt;Seven weighted governance drivers, each scored 0-5:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Driver&lt;/th&gt;
&lt;th&gt;Weight&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Judicial Independence&lt;/td&gt;
&lt;td&gt;0.18&lt;/td&gt;
&lt;td&gt;Court stacking, ruling defiance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Press Freedom&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;Journalist targeting, access restrictions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voting Rights&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;Disenfranchisement, election interference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Environmental Policy&lt;/td&gt;
&lt;td&gt;0.12&lt;/td&gt;
&lt;td&gt;Regulatory rollbacks, enforcement gaps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Civil Liberties&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;Due process, privacy, free assembly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;International Norms&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;Treaty violations, alliance damage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fiscal Governance&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;Budget manipulation, oversight bypass&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The weighted driver sum is then multiplied by severity modifiers (durability × reversibility × precedent) and by mechanism/scope modifiers.&lt;/p&gt;
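&lt;p&gt;In code, the driver roll-up looks roughly like this. The weights come from the table above; collapsing the severity and mechanism/scope modifiers into single multipliers is a simplification for illustration, not the site's exact formula:&lt;/p&gt;

```python
# Sketch of Score A, assuming the weighted 0-5 driver sum is rescaled to
# 0-100 and then multiplied by the modifiers. Modifier handling here is
# illustrative; the published methodology has the exact constants.

DRIVER_WEIGHTS = {
    "judicial_independence": 0.18,
    "press_freedom": 0.15,
    "voting_rights": 0.15,
    "environmental_policy": 0.12,
    "civil_liberties": 0.15,
    "international_norms": 0.10,
    "fiscal_governance": 0.15,
}

def score_a(driver_scores: dict, severity: float = 1.0, scope: float = 1.0) -> float:
    """driver_scores maps driver name to a 0-5 rating.

    severity stands in for durability * reversibility * precedent;
    scope stands in for the mechanism/scope modifiers.
    """
    base = sum(w * driver_scores.get(d, 0) for d, w in DRIVER_WEIGHTS.items())
    return min(100.0, (base / 5.0) * 100 * severity * scope)
```

&lt;p&gt;For example, an event rated 5 on judicial independence alone contributes 18 points before modifiers.&lt;/p&gt;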

&lt;h3&gt;
  
  
  Score B: Distraction/Hype (0-100)
&lt;/h3&gt;

&lt;p&gt;Two-layer model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Layer 1 (55%):&lt;/strong&gt; Raw media hype — volume, social amplification, cross-platform spread, emotional framing, celebrity involvement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 2 (45%):&lt;/strong&gt; Strategic manipulation indicators — timing relative to damage events, coordinated messaging, deflection patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Layer 2 is modulated by an intentionality score (0-15). Low intentionality → Layer 2 weight drops to 10%.&lt;/p&gt;
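&lt;p&gt;The blend can be sketched as a weighted average whose Layer 2 weight collapses when intentionality is low. The 55/45 split, the 0-15 intentionality scale, and the 10% fallback are from the description above; the exact low-intentionality cutoff (here, below 5) is an assumption:&lt;/p&gt;

```python
# Sketch of the two-layer Score B blend. The threshold below which
# intentionality counts as "low" is an assumed value.

def score_b(hype: float, manipulation: float, intentionality: int) -> float:
    """hype and manipulation are 0-100 sub-scores; intentionality is 0-15."""
    layer2 = 0.45 if intentionality >= 5 else 0.10  # low intent: drop to 10%
    return (1.0 - layer2) * hype + layer2 * manipulation
```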

&lt;h3&gt;
  
  
  Classification
&lt;/h3&gt;

&lt;p&gt;Events are classified by dominance margin:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Damage&lt;/strong&gt; (List A): Score A exceeds Score B by ≥10 points&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distraction&lt;/strong&gt; (List B): Score B exceeds Score A by ≥10 points&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Noise&lt;/strong&gt; (List C): Neither dominates&lt;/li&gt;
&lt;/ul&gt;
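&lt;p&gt;The classification rule is a few lines of code (a minimal sketch; the label strings are illustrative):&lt;/p&gt;

```python
# Dominance-margin classification: a 10-point margin decides the list.

def classify(score_a: float, score_b: float, margin: float = 10.0) -> str:
    if score_a - score_b >= margin:
        return "damage"       # List A
    if score_b - score_a >= margin:
        return "distraction"  # List B
    return "noise"            # List C
```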

&lt;h2&gt;
  
  
  The Smokescreen Index
&lt;/h2&gt;

&lt;p&gt;The most interesting feature: automatic pairing of high-distraction events with concurrent high-damage events.&lt;/p&gt;

&lt;p&gt;When a B-dominant event (media spectacle) co-occurs with an A-dominant event (institutional harm) that received less coverage, the system flags it as a potential smokescreen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;210+ pairs identified&lt;/strong&gt; across 59 weeks.&lt;/p&gt;
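&lt;p&gt;A naive version of the pairing pass, assuming each event carries a week, a classification, and a coverage measure (the field names here are hypothetical):&lt;/p&gt;

```python
# Hypothetical smokescreen pairing: for each distraction-dominant event,
# find damage-dominant events in the same week that drew less coverage.

def smokescreen_pairs(events: list) -> list:
    damages = [e for e in events if e["class"] == "damage"]
    pairs = []
    for b in (e for e in events if e["class"] == "distraction"):
        for a in damages:
            if a["week"] == b["week"] and b["coverage"] > a["coverage"]:
                pairs.append((b["id"], a["id"]))
    return pairs
```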

&lt;h2&gt;
  
  
  Radical Transparency
&lt;/h2&gt;

&lt;p&gt;Every scoring formula, weight, and AI prompt is published at &lt;a href="https://distractionindex.org/methodology" rel="noopener noreferrer"&gt;/methodology&lt;/a&gt;. This was a deliberate design choice — if you're scoring political events, your methodology must be auditable.&lt;/p&gt;

&lt;p&gt;Key transparency features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Immutable weekly snapshots&lt;/strong&gt; — once a week's snapshot is frozen, its scores cannot be silently changed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Append-only corrections&lt;/strong&gt; — post-freeze corrections are timestamped and linked to the original&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Published prompts&lt;/strong&gt; — the exact Claude prompts used for scoring are documented&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open source&lt;/strong&gt; — &lt;a href="https://github.com/sgharlow/distraction" rel="noopener noreferrer"&gt;full codebase on GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Publishing your prompts is terrifying
&lt;/h3&gt;

&lt;p&gt;When your prompt templates are public, anyone can argue with your framing. That's the point — but it requires thick skin and a willingness to iterate.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Immutability prevents model drift
&lt;/h3&gt;

&lt;p&gt;Without frozen snapshots, you can't tell if score changes come from real-world changes or model updates. Immutability is essential for longitudinal analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The two-axis approach reveals patterns
&lt;/h3&gt;

&lt;p&gt;Single-dimension scoring (left/right, reliable/unreliable) misses the key insight: damage and distraction are independent variables. Some events are both. Some are neither.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Cost optimization matters for indie projects
&lt;/h3&gt;

&lt;p&gt;The Haiku-for-clustering, Sonnet-for-scoring split keeps costs at ~$30/month. Without this, the project wouldn't be sustainable as a solo effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;After 59 weeks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1,500+&lt;/strong&gt; scored events&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;11,800+&lt;/strong&gt; ingested articles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;210+&lt;/strong&gt; smokescreen pairs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;288&lt;/strong&gt; tests passing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1,071&lt;/strong&gt; pages indexed&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Live site:&lt;/strong&gt; &lt;a href="https://distractionindex.org" rel="noopener noreferrer"&gt;distractionindex.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Methodology:&lt;/strong&gt; &lt;a href="https://distractionindex.org/methodology" rel="noopener noreferrer"&gt;distractionindex.org/methodology&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source code:&lt;/strong&gt; &lt;a href="https://github.com/sgharlow/distraction" rel="noopener noreferrer"&gt;github.com/sgharlow/distraction&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'd love feedback on the scoring methodology. What would you weight differently? What blind spots do you see?&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>nextjs</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Building a Transparent AI Pipeline: 59 Weeks of Automated Political Scoring with Claude API</title>
      <dc:creator>Steve Harlow</dc:creator>
      <pubDate>Sun, 22 Feb 2026 07:25:44 +0000</pubDate>
      <link>https://forem.com/steve_harlow_0dbc0e910b6d/building-a-transparent-ai-pipeline-59-weeks-of-automated-political-scoring-with-claude-api-4a38</link>
      <guid>https://forem.com/steve_harlow_0dbc0e910b6d/building-a-transparent-ai-pipeline-59-weeks-of-automated-political-scoring-with-claude-api-4a38</guid>
      <description>&lt;p&gt;I've been running an automated AI pipeline for over a year that ingests news articles, clusters them into political events, and scores each event on two independent axes. Here's how it works, what I learned, and why I made everything transparent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Political events have two dimensions that are rarely measured together:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;How much institutional damage&lt;/strong&gt; does this cause? (democratic health)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How much media attention&lt;/strong&gt; does it get? (distraction economics)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When these are wildly mismatched -- high damage, low attention -- something important is being missed. I built &lt;a href="https://distractionindex.org" rel="noopener noreferrer"&gt;The Distraction Index&lt;/a&gt; to detect these gaps automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The pipeline runs every 4 hours:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ingestion&lt;/strong&gt; - Fetch articles from 3 news sources (GDELT, GNews, Google News RSS)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dedup + Store&lt;/strong&gt; - Remove duplicates, persist to Supabase&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clustering&lt;/strong&gt; - Group articles into events using Claude Haiku&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dual-Axis Scoring&lt;/strong&gt; - Score each event on two axes using Claude Sonnet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weekly Freeze&lt;/strong&gt; - Create immutable snapshot every Sunday&lt;/li&gt;
&lt;/ol&gt;
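&lt;p&gt;The dedup step can be as simple as normalizing each article URL and keeping the first occurrence. The normalization rules below (case-fold the host; drop scheme, query, fragment, and trailing slash) are an assumption, not the project's actual logic:&lt;/p&gt;

```python
# Assumed URL-normalization dedup: articles pointing at the same page
# under different schemes or tracking parameters collapse to one entry.
from urllib.parse import urlsplit

def normalize(url: str) -> str:
    parts = urlsplit(url)
    return f"{parts.netloc.lower()}{parts.path.rstrip('/')}"

def dedup(articles: list) -> list:
    seen, unique = set(), []
    for article in articles:
        key = normalize(article["url"])
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique
```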

&lt;p&gt;&lt;strong&gt;Tech stack:&lt;/strong&gt; Next.js 16 (App Router), Supabase (PostgreSQL), Claude API, Vercel&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Two Models?
&lt;/h3&gt;

&lt;p&gt;Cost optimization was critical. Running everything through Sonnet would cost ~$300/month. Instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Haiku&lt;/strong&gt; handles article clustering (~$0.25/1M tokens) -- it groups articles by topic similarity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Sonnet&lt;/strong&gt; handles scoring (~$3/1M tokens) -- it evaluates institutional impact using structured prompts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: &lt;strong&gt;~$30/month&lt;/strong&gt; for a production pipeline processing articles every 4 hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Dual Scoring System
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Score A: Constitutional Damage (0-100)
&lt;/h3&gt;

&lt;p&gt;Seven weighted governance drivers, each scored 0-5:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Driver&lt;/th&gt;
&lt;th&gt;Weight&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Judicial Independence&lt;/td&gt;
&lt;td&gt;0.18&lt;/td&gt;
&lt;td&gt;Court stacking, ruling defiance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Press Freedom&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;Journalist targeting, access restrictions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voting Rights&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;Disenfranchisement, election interference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Environmental Policy&lt;/td&gt;
&lt;td&gt;0.12&lt;/td&gt;
&lt;td&gt;Regulatory rollbacks, enforcement gaps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Civil Liberties&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;Due process, privacy, free assembly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;International Norms&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;Treaty violations, alliance damage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fiscal Governance&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;Budget manipulation, oversight bypass&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The weighted driver sum is then multiplied by severity modifiers (durability x reversibility x precedent) and by mechanism/scope modifiers.&lt;/p&gt;
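&lt;p&gt;In code, the driver roll-up looks roughly like this. The weights come from the table above; collapsing the severity and mechanism/scope modifiers into single multipliers is a simplification for illustration, not the site's exact formula:&lt;/p&gt;

```python
# Sketch of Score A, assuming the weighted 0-5 driver sum is rescaled to
# 0-100 and then multiplied by the modifiers. Modifier handling here is
# illustrative; the published methodology has the exact constants.

DRIVER_WEIGHTS = {
    "judicial_independence": 0.18,
    "press_freedom": 0.15,
    "voting_rights": 0.15,
    "environmental_policy": 0.12,
    "civil_liberties": 0.15,
    "international_norms": 0.10,
    "fiscal_governance": 0.15,
}

def score_a(driver_scores: dict, severity: float = 1.0, scope: float = 1.0) -> float:
    """driver_scores maps driver name to a 0-5 rating.

    severity stands in for durability * reversibility * precedent;
    scope stands in for the mechanism/scope modifiers.
    """
    base = sum(w * driver_scores.get(d, 0) for d, w in DRIVER_WEIGHTS.items())
    return min(100.0, (base / 5.0) * 100 * severity * scope)
```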

&lt;h3&gt;
  
  
  Score B: Distraction/Hype (0-100)
&lt;/h3&gt;

&lt;p&gt;Two-layer model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Layer 1 (55%):&lt;/strong&gt; Raw media hype -- volume, social amplification, cross-platform spread, emotional framing, celebrity involvement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 2 (45%):&lt;/strong&gt; Strategic manipulation indicators -- timing relative to damage events, coordinated messaging, deflection patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Layer 2 is modulated by an intentionality score (0-15). Low intentionality means Layer 2 weight drops to 10%.&lt;/p&gt;

&lt;h3&gt;
  
  
  Classification
&lt;/h3&gt;

&lt;p&gt;Events are classified by dominance margin:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Damage&lt;/strong&gt; (List A): Score A exceeds Score B by 10+ points&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distraction&lt;/strong&gt; (List B): Score B exceeds Score A by 10+ points&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Noise&lt;/strong&gt; (List C): Neither dominates&lt;/li&gt;
&lt;/ul&gt;
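&lt;p&gt;The classification rule is a few lines of code (a minimal sketch; the label strings are illustrative):&lt;/p&gt;

```python
# Dominance-margin classification: a 10-point margin decides the list.

def classify(score_a: float, score_b: float, margin: float = 10.0) -> str:
    if score_a - score_b >= margin:
        return "damage"       # List A
    if score_b - score_a >= margin:
        return "distraction"  # List B
    return "noise"            # List C
```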

&lt;h2&gt;
  
  
  The Smokescreen Index
&lt;/h2&gt;

&lt;p&gt;The most interesting feature: automatic pairing of high-distraction events with concurrent high-damage events.&lt;/p&gt;

&lt;p&gt;When a B-dominant event (media spectacle) co-occurs with an A-dominant event (institutional harm) that received less coverage, the system flags it as a potential smokescreen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;210+ pairs identified&lt;/strong&gt; across 59 weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Radical Transparency
&lt;/h2&gt;

&lt;p&gt;Every scoring formula, weight, and AI prompt is published at &lt;a href="https://distractionindex.org/methodology" rel="noopener noreferrer"&gt;/methodology&lt;/a&gt;. This was a deliberate design choice -- if you're scoring political events, your methodology must be auditable.&lt;/p&gt;

&lt;p&gt;Key transparency features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Immutable weekly snapshots&lt;/strong&gt; -- once a week's snapshot is frozen, its scores cannot be silently changed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Append-only corrections&lt;/strong&gt; -- post-freeze corrections are timestamped and linked to the original&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Published prompts&lt;/strong&gt; -- the exact Claude prompts used for scoring are documented&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open source&lt;/strong&gt; -- &lt;a href="https://github.com/sgharlow/distraction" rel="noopener noreferrer"&gt;full codebase on GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Publishing your prompts is terrifying
&lt;/h3&gt;

&lt;p&gt;When your prompt templates are public, anyone can argue with your framing. That's the point -- but it requires thick skin and a willingness to iterate.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Immutability prevents model drift
&lt;/h3&gt;

&lt;p&gt;Without frozen snapshots, you can't tell if score changes come from real-world changes or model updates. Immutability is essential for longitudinal analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The two-axis approach reveals patterns
&lt;/h3&gt;

&lt;p&gt;Single-dimension scoring (left/right, reliable/unreliable) misses the key insight: damage and distraction are independent variables. Some events are both. Some are neither.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Cost optimization matters for indie projects
&lt;/h3&gt;

&lt;p&gt;The Haiku-for-clustering, Sonnet-for-scoring split keeps costs at ~$30/month. Without this, the project wouldn't be sustainable as a solo effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;After 59 weeks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1,500+&lt;/strong&gt; scored events&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;11,800+&lt;/strong&gt; ingested articles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;210+&lt;/strong&gt; smokescreen pairs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;288&lt;/strong&gt; tests passing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1,071&lt;/strong&gt; pages indexed&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Live site:&lt;/strong&gt; &lt;a href="https://distractionindex.org" rel="noopener noreferrer"&gt;distractionindex.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Methodology:&lt;/strong&gt; &lt;a href="https://distractionindex.org/methodology" rel="noopener noreferrer"&gt;distractionindex.org/methodology&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source code:&lt;/strong&gt; &lt;a href="https://github.com/sgharlow/distraction" rel="noopener noreferrer"&gt;github.com/sgharlow/distraction&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'd love feedback on the scoring methodology. What would you weight differently? What blind spots do you see?&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>nextjs</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Why Most AI Coding Sessions Fail (And How to Fix It)</title>
      <dc:creator>Steve Harlow</dc:creator>
      <pubDate>Thu, 08 Jan 2026 01:22:10 +0000</pubDate>
      <link>https://forem.com/steve_harlow_0dbc0e910b6d/why-most-ai-coding-sessions-fail-and-how-to-fix-it-4dn1</link>
      <guid>https://forem.com/steve_harlow_0dbc0e910b6d/why-most-ai-coding-sessions-fail-and-how-to-fix-it-4dn1</guid>
      <description>&lt;h3&gt;
  
  
  The data behind AI-generated code quality—and a framework that enforces discipline
&lt;/h3&gt;




&lt;h2&gt;
  
  
  The Promise vs. Reality
&lt;/h2&gt;

&lt;p&gt;AI coding assistants are everywhere. GitHub reports &lt;a href="https://medium.com/@aminsiddique95/ai-is-writing-46-of-all-code-github-copilots-real-impact-on-15-million-developers-787d789fcfdc" rel="noopener noreferrer"&gt;15 million developers now use Copilot&lt;/a&gt;—a 400% increase in one year. Stack Overflow's 2024 survey found &lt;a href="https://www.secondtalent.com/resources/github-copilot-statistics/" rel="noopener noreferrer"&gt;63% of professional developers use AI in their workflow&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The productivity gains are real. Microsoft's research shows developers using Copilot achieve &lt;a href="https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/" rel="noopener noreferrer"&gt;26% higher productivity&lt;/a&gt; and code &lt;a href="https://medium.com/@aminsiddique95/ai-is-writing-46-of-all-code-github-copilots-real-impact-on-15-million-developers-787d789fcfdc" rel="noopener noreferrer"&gt;55% faster&lt;/a&gt; in controlled tests.&lt;/p&gt;

&lt;p&gt;But here's what the headlines don't tell you:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI-generated code creates 1.7x more issues than human-written code.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's from &lt;a href="https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report" rel="noopener noreferrer"&gt;CodeRabbit's analysis of 470 GitHub pull requests&lt;/a&gt;. The breakdown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1.75x more logic and correctness errors&lt;/li&gt;
&lt;li&gt;1.64x more code quality and maintainability issues&lt;/li&gt;
&lt;li&gt;1.57x more security findings&lt;/li&gt;
&lt;li&gt;1.42x more performance problems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://www.qodo.ai/reports/state-of-ai-code-quality/" rel="noopener noreferrer"&gt;Google's 2024 DORA report&lt;/a&gt; found that increased AI use correlates with a &lt;strong&gt;7.2% decrease in delivery stability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And perhaps most damning: &lt;a href="https://www.qodo.ai/reports/state-of-ai-code-quality/" rel="noopener noreferrer"&gt;only 3.8% of developers&lt;/a&gt; report both low hallucination rates AND high confidence in shipping AI code without human review.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Specific Failure Patterns
&lt;/h2&gt;

&lt;p&gt;After tracking my own AI coding sessions for 6 months, I identified 13 specific ways they fail. Here are the top 5:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Mock Data That Never Dies
&lt;/h3&gt;

&lt;p&gt;AI assistants love mock data. It makes demos look great and code compile cleanly.&lt;/p&gt;

&lt;p&gt;The problem? &lt;strong&gt;Mocks survive to production.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In my logs, sessions where mock data existed past 30 minutes had an 84% chance of shipping with fake data still in place.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Interface Drift
&lt;/h3&gt;

&lt;p&gt;You start with a clean API contract. Midway through the session, the AI suggests "just a small change" to the interface.&lt;/p&gt;

&lt;p&gt;Three changes later, your frontend is broken, your tests fail, and you've lost 2 hours.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.gitclear.com/ai_assistant_code_quality_2025_research" rel="noopener noreferrer"&gt;GitClear's 2025 research&lt;/a&gt; shows code churn—changes to recently written code—has increased dramatically since AI adoption, suggesting this pattern is widespread.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Scope Creep
&lt;/h3&gt;

&lt;p&gt;"While I'm in here, let me also refactor this..."&lt;/p&gt;

&lt;p&gt;What starts as a 50-line change becomes 500 lines across 15 files. Now nothing works, and you can't isolate what broke.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The "Almost Done" Trap
&lt;/h3&gt;

&lt;p&gt;The AI reports the feature is "complete." Tests pass locally. You feel good.&lt;/p&gt;

&lt;p&gt;Then you deploy, and it breaks immediately because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Environment variables weren't configured&lt;/li&gt;
&lt;li&gt;Error handling was added but never tested&lt;/li&gt;
&lt;li&gt;A dependency was mocked that doesn't exist in production&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Security Blind Spots
&lt;/h3&gt;

&lt;p&gt;Studies show &lt;a href="https://www.netcorpsoftwaredevelopment.com/blog/ai-generated-code-statistics" rel="noopener noreferrer"&gt;48% of AI-generated code contains security vulnerabilities&lt;/a&gt;. Earlier GitHub Copilot research found &lt;a href="https://www.techradar.com/pro/security/ai-generated-code-contains-more-bugs-and-errors-than-human-output" rel="noopener noreferrer"&gt;40% of generated programs had insecure code&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The AI writes syntactically correct code. It doesn't understand your threat model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Happens
&lt;/h2&gt;

&lt;p&gt;The core issue isn't that AI is "bad at coding." It's that &lt;strong&gt;AI lacks accountability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When you ask Claude or Copilot to write code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It doesn't know if your tests actually run&lt;/li&gt;
&lt;li&gt;It can't verify its changes didn't break the build&lt;/li&gt;
&lt;li&gt;It assumes you'll catch the mocks, the drift, the scope creep&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prompt engineering helps, but prompts are suggestions. The AI can claim "I removed all mocks" while mocks still exist in your codebase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need enforcement, not suggestions.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Framework Solution
&lt;/h2&gt;

&lt;p&gt;I built the &lt;a href="https://github.com/sgharlow/ai-control-framework" rel="noopener noreferrer"&gt;AI Control Framework&lt;/a&gt; to enforce discipline through external scripts—validators that check the actual state of your project, not what the AI claims.&lt;/p&gt;

&lt;h3&gt;
  
  
  Contract Freezing
&lt;/h3&gt;

&lt;p&gt;At session start, interfaces (API specs, database schemas, type definitions) get SHA256-hashed.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./freeze-contracts.sh
✓ api/openapi.yaml: sha256:a1b2c3...
✓ db/schema.sql: sha256:d4e5f6...
Contracts frozen.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Any change during the session triggers an immediate alert:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./check-contracts.sh
✗ CONTRACT VIOLATION: api/openapi.yaml changed
Hash expected: a1b2c3...
Hash found: x7y8z9...
STOP: Submit Contract Change Request or revert.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This catches interface drift &lt;em&gt;before&lt;/em&gt; it breaks your frontend.&lt;/p&gt;
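&lt;p&gt;The freeze/check cycle is essentially a hash manifest. A minimal sketch in Python (file names and manifest layout are illustrative; the framework's actual scripts are bash):&lt;/p&gt;

```python
# Hash each contract file at session start, then compare later. Any file
# whose current hash differs from the frozen one is a violation.
import hashlib
import json
import pathlib

def freeze(paths: list, manifest: str = "contracts.json") -> None:
    hashes = {p: hashlib.sha256(pathlib.Path(p).read_bytes()).hexdigest()
              for p in paths}
    pathlib.Path(manifest).write_text(json.dumps(hashes))

def check(manifest: str = "contracts.json") -> list:
    """Return the paths whose contents changed since the freeze."""
    frozen = json.loads(pathlib.Path(manifest).read_text())
    return [p for p, h in frozen.items()
            if hashlib.sha256(pathlib.Path(p).read_bytes()).hexdigest() != h]
```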

&lt;h3&gt;
  
  
  30-Minute Mock Timeout
&lt;/h3&gt;

&lt;p&gt;Mocks are allowed for the first 30 minutes—enough time to explore an approach.&lt;/p&gt;

&lt;p&gt;After 30 minutes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./detect-mocks.sh
⚠ MOCK TIMEOUT: 2 mocks detected after 30-minute limit
- src/api/users.ts:42 → mockUserData
- src/services/auth.ts:18 → fakeTok...
ACTION REQUIRED: Replace with real service calls.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This forces the "connect to real services" conversation early, when it's still cheap to pivot.&lt;/p&gt;
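&lt;p&gt;Detection itself can be a simple identifier scan. This sketch flags mock/fake/stub-prefixed names in source files; the pattern list is an assumption, and the real script's heuristics and 30-minute clock are not shown:&lt;/p&gt;

```python
# Assumed mock detector: scan source files for identifiers that look like
# mocks. Extensions and the regex are illustrative choices.
import pathlib
import re

MOCK_PATTERN = re.compile(r"\b(mock|fake|stub)[A-Za-z0-9_]*", re.IGNORECASE)

def find_mocks(root: str, exts: tuple = (".ts", ".js", ".py")) -> list:
    """Return (path, line number, identifier) for each suspected mock."""
    hits = []
    for path in pathlib.Path(root).rglob("*"):
        if path.suffix not in exts:
            continue
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            match = MOCK_PATTERN.search(line)
            if match:
                hits.append((str(path), lineno, match.group(0)))
    return hits
```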

&lt;h3&gt;
  
  
  Scope Limits
&lt;/h3&gt;

&lt;p&gt;Hard stops at 5 files changed and 200 lines added per session.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./check-scope.sh
Files changed: 6/5 ✗
Lines added: 240/200 ✗
SCOPE EXCEEDED: Ship current work (if DRS ≥ 85) or revert.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This forces incremental, deployable chunks instead of massive, risky changesets.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployability Rating Score (DRS)
&lt;/h3&gt;

&lt;p&gt;A 0-100 score calculated from 13 components:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./drs-calculate.sh
═══════════════════════════════════════
DEPLOYABILITY SCORE: 87/100
═══════════════════════════════════════
✓ Contract Integrity     (8/8)
✓ No Mocks               (8/8)
✓ Tests Passing          (7/7)
✓ Security Validation   (16/18)
✓ Error Handling         (4/4)
⚠ Prod Readiness        (12/15)

✅ READY TO DEPLOY (DRS ≥ 85)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When DRS hits 85+, you &lt;strong&gt;know&lt;/strong&gt; the code is production-ready. No guessing.&lt;/p&gt;
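&lt;p&gt;The roll-up behind that report is just points earned over points possible, gated at 85. A sketch (the sample output above shows six components; the full framework tracks 13, and the exact weighting may differ):&lt;/p&gt;

```python
# Illustrative DRS roll-up: sum earned points against component maxima,
# rescale to 0-100, and gate deployability at 85.

def drs(components: dict) -> tuple:
    """components maps name to (earned, maximum); returns (score, ready)."""
    earned = sum(e for e, _ in components.values())
    maximum = sum(m for _, m in components.values())
    score = round(100 * earned / maximum)
    return score, score >= 85  # 85+ means ready to deploy
```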




&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;After implementing this framework across 6 projects:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Time to deploy&lt;/td&gt;
&lt;td&gt;3-5 days&lt;/td&gt;
&lt;td&gt;4-6 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rework rate&lt;/td&gt;
&lt;td&gt;67%&lt;/td&gt;
&lt;td&gt;12%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Breaking changes per feature&lt;/td&gt;
&lt;td&gt;4.2&lt;/td&gt;
&lt;td&gt;0.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Works on my machine" incidents&lt;/td&gt;
&lt;td&gt;Weekly&lt;/td&gt;
&lt;td&gt;Rare&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The framework doesn't slow you down. It prevents the 3-5 day rework cycles that happen when you deploy code that isn't ready.&lt;/p&gt;




&lt;h2&gt;
  
  
  Industry Context
&lt;/h2&gt;

&lt;p&gt;The research supports this approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.qodo.ai/reports/state-of-ai-code-quality/" rel="noopener noreferrer"&gt;44% of developers who say AI degrades code quality blame context issues&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://linearb.io/blog/is-github-copilot-worth-it" rel="noopener noreferrer"&gt;Microsoft reports it takes ~11 weeks&lt;/a&gt; for developers to fully realize AI productivity gains&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.gitclear.com/ai_assistant_code_quality_2025_research" rel="noopener noreferrer"&gt;GitClear found code duplication increased 8x in 2024&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem isn't AI capability. It's &lt;strong&gt;discipline&lt;/strong&gt;—and discipline requires enforcement.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Clone and install
git clone https://github.com/sgharlow/ai-control-framework.git
./ai-control-framework/install.sh /path/to/your/project

# Run your first DRS check
cd /path/to/your/project
./ai-framework/reference/bash/drs-calculate.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The framework works with any AI assistant that can read files: Claude Code, Cursor, Copilot, Aider.&lt;/p&gt;

&lt;p&gt;It's &lt;a href="https://github.com/sgharlow/ai-control-framework/blob/main/LICENSE" rel="noopener noreferrer"&gt;MIT licensed&lt;/a&gt; and has &lt;a href="https://github.com/sgharlow/ai-control-framework" rel="noopener noreferrer"&gt;100% test coverage&lt;/a&gt; (136/136 tests passing).&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;AI coding assistants are powerful. But power without discipline leads to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Beautiful code that breaks on deploy&lt;/li&gt;
&lt;li&gt;"Almost done" sessions that need 3 more days&lt;/li&gt;
&lt;li&gt;Mock data that survives to production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stop hoping AI code will work. Start knowing it will deploy.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sgharlow/ai-control-framework" rel="noopener noreferrer"&gt;Try the AI Control Framework →&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;p&gt;All statistics in this article are sourced from:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/" rel="noopener noreferrer"&gt;GitHub Blog - Research on Copilot Productivity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report" rel="noopener noreferrer"&gt;CodeRabbit - State of AI vs Human Code Generation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.gitclear.com/ai_assistant_code_quality_2025_research" rel="noopener noreferrer"&gt;GitClear - AI Code Quality 2025&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.qodo.ai/reports/state-of-ai-code-quality/" rel="noopener noreferrer"&gt;Qodo - State of AI Code Quality&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@aminsiddique95/ai-is-writing-46-of-all-code-github-copilots-real-impact-on-15-million-developers-787d789fcfdc" rel="noopener noreferrer"&gt;Medium - Copilot's Impact on 15M Developers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://linearb.io/blog/is-github-copilot-worth-it" rel="noopener noreferrer"&gt;LinearB - Is GitHub Copilot Worth It?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.techradar.com/pro/security/ai-generated-code-contains-more-bugs-and-errors-than-human-output" rel="noopener noreferrer"&gt;TechRadar - AI Code Security Issues&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Have you struggled with AI coding assistant reliability? Let me know in the comments what patterns you've seen.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>devops</category>
      <category>programming</category>
    </item>
    <item>
      <title>Kiroween Hackathon Submitted!</title>
      <dc:creator>Steve Harlow</dc:creator>
      <pubDate>Fri, 05 Dec 2025 19:12:43 +0000</pubDate>
      <link>https://forem.com/steve_harlow_0dbc0e910b6d/kiroween-hackathon-submitted-nhi</link>
      <guid>https://forem.com/steve_harlow_0dbc0e910b6d/kiroween-hackathon-submitted-nhi</guid>
      <description>&lt;p&gt;Just making a quick post about my latest hackathon entry.&lt;/p&gt;

&lt;p&gt;This repo is a remake of a legacy scrum planning webapp. I used Kiro to migrate the connection infrastructure from unreliable WebRTC to AWS serverless tech: &lt;a href="https://github.com/sgharlow/scrum-web-reborn" rel="noopener noreferrer"&gt;https://github.com/sgharlow/scrum-web-reborn&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Transformation Story&lt;br&gt;
From 50% to 99.5% Connectivity: 100x Fewer Failed Connections&lt;/p&gt;

&lt;p&gt;The original Scrum Facilitator app relied on peer-to-peer (P2P) WebRTC connections, where users connected directly to each other through their browsers. While this approach worked in ideal conditions, it failed catastrophically in real-world scenarios, achieving only a 50% connection success rate. The core problem was NAT traversal: corporate firewalls blocked UDP traffic, symmetric NAT prevented direct connections, mobile carrier-grade NAT made connections impossible, and free TURN relay servers were unreliable and rate-limited. Even when connections did work, they were fragile: if the facilitator's browser closed or their network hiccupped, the entire session's data was lost, since there was no server to persist state.&lt;/p&gt;

&lt;p&gt;Scrum Reborn solves this by replacing the entire P2P architecture with AWS AppSync (managed GraphQL with WebSocket subscriptions) and DynamoDB. All communication now flows over HTTPS (port 443), which works everywhere: no NAT traversal needed, no firewall issues, no TURN servers. AppSync provides guaranteed message delivery with automatic reconnection, while DynamoDB serves as the single source of truth with ACID transactions.&lt;/p&gt;

&lt;p&gt;The result: 99.5%+ connectivity, sub-250ms latency, and zero data loss. The success rate roughly doubled (99.5% ÷ 50% = 1.99x); put the other way around, the failure rate dropped from 50% to 0.5%, which is 100x fewer failed connections. Either way, an unreliable prototype became a production-ready collaboration platform that distributed teams can actually depend on.&lt;/p&gt;
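&lt;p&gt;As a quick sanity check on those reliability numbers, here is the arithmetic in TypeScript. The percentages come from the post; the variable names are mine:&lt;/p&gt;

```typescript
// Percentages from the post; variable names are mine.
const successBefore = 0.50;  // WebRTC P2P: 50% of connections succeeded
const successAfter = 0.995;  // AppSync over HTTPS: 99.5% succeed

// Two different ways to describe the same change:
const successGain = successAfter / successBefore;             // how much more often connections succeed
const failureDrop = (1 - successBefore) / (1 - successAfter); // how much rarer failures became

console.log(successGain.toFixed(2)); // "1.99" -> roughly 2x more successes
console.log(failureDrop.toFixed(0)); // "100"  -> 100x fewer failures
```

&lt;p&gt;Dividing success rates gives 1.99x; dividing failure rates gives 100x. Both are fair summaries of the same migration, as long as the framing is stated.&lt;/p&gt;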

&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Planning Poker: Real-time story estimation with Fibonacci voting&lt;/li&gt;
&lt;li&gt;Retrospectives: Collaborative retro boards with voting&lt;/li&gt;
&lt;li&gt;Presence Tracking: Always know who's in the room&lt;/li&gt;
&lt;li&gt;99.5%+ Connectivity: No more NAT traversal issues&lt;/li&gt;
&lt;li&gt;Sub-250ms Latency: Updates feel instant&lt;/li&gt;
&lt;li&gt;Serverless Architecture: Scales automatically with AWS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend: React 19 + TypeScript + Vite&lt;/li&gt;
&lt;li&gt;Backend: AWS AppSync (GraphQL) + Lambda + DynamoDB&lt;/li&gt;
&lt;li&gt;Auth: AWS Cognito User Pools&lt;/li&gt;
&lt;li&gt;Infrastructure: AWS CDK (TypeScript)&lt;/li&gt;
&lt;li&gt;Real-time: GraphQL Subscriptions over WebSocket&lt;/li&gt;
&lt;/ul&gt;
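&lt;p&gt;To make the Planning Poker feature concrete, here is a minimal TypeScript sketch of how a round of Fibonacci votes might be tallied. This is illustrative only, not code from the scrum-web-reborn repo; the function name and the "ties size up" rule are my assumptions:&lt;/p&gt;

```typescript
// Illustrative sketch (not from the repo): tally one planning-poker round.
const FIB_SCALE = [1, 2, 3, 5, 8, 13, 21];

// Returns the most-voted estimate; ties resolve to the larger estimate,
// following the common "when in doubt, size up" convention.
function tallyVotes(votes: number[]): number {
  const counts: { [estimate: number]: number } = {};
  for (const v of votes) {
    if (FIB_SCALE.indexOf(v) === -1) {
      throw new Error("not on the Fibonacci scale: " + v);
    }
    counts[v] = (counts[v] || 0) + 1;
  }
  let winner = 0;
  for (const v of votes) {
    const better =
      counts[v] > (counts[winner] || 0) ||
      (counts[v] === (counts[winner] || 0) && v > winner);
    if (better) winner = v;
  }
  return winner;
}

console.log(tallyVotes([3, 5, 5, 8])); // 5 (simple majority)
console.log(tallyVotes([5, 8]));       // 8 (tie sizes up)
```

&lt;p&gt;In the real app the votes would arrive over AppSync GraphQL subscriptions and the tally would be persisted in DynamoDB, but the scoring logic itself stays this small.&lt;/p&gt;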

</description>
      <category>hookedonkiro</category>
      <category>kirodotdev</category>
      <category>devpost</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
