Forem: BaoDev Studio

Postgres JSONB indexes: GIN vs BTREE on the same column

BaoDev Studio — Mon, 25 May 2026 13:15:00 +0000

caught this in production last quarter and the answer is more boring than i expected: GIN and BTREE on the same JSONB column solve different problems, and the right choice depends on the SHAPE of your queries, not the size of the data.

heres the actual benchmark + when each one wins.

the setup

table: users with 2.3M rows. one column attributes JSONB. typical row contains:

{
  "plan": "pro",
  "country": "ID",
  "signup_source": "organic",
  "feature_flags": ["beta_search", "new_billing"],
  "preferences": {"theme": "dark", "lang": "id"}
}

queries i run regularly:

Q1: WHERE attributes->>'plan' = 'pro' — find all pro users
Q2: WHERE attributes @> '{"feature_flags": ["beta_search"]}' — find users with a feature flag
Q3: WHERE attributes->>'country' = 'ID' AND attributes->>'plan' = 'pro' — pro users in indonesia
Q4: WHERE attributes ? 'preferences' — users who have the preferences key at all

four queries. different optimal indexes for each.

the three index options

-- option A: GIN on whole column
CREATE INDEX idx_attrs_gin ON users USING GIN (attributes);

-- option B: GIN with jsonb_path_ops (faster for @>, but no support for ?, ?|, ?&)
CREATE INDEX idx_attrs_gin_path ON users USING GIN (attributes jsonb_path_ops);

-- option C: BTREE on extracted scalar
CREATE INDEX idx_attrs_plan ON users ((attributes->>'plan'));

actual EXPLAIN ANALYZE results (2.3M rows)

Q1: WHERE attributes->>'plan' = 'pro'

no index: 480ms (seq scan)
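GIN (option A): 220ms (bitmap scan, recheck). GIN is generic, returns false positives, needs the recheck step. caveat: neither GIN operator class can serve ->> equality directly; the planner only uses a GIN index here when the predicate is rewritten as containment, attributes @> '{"plan": "pro"}'
GIN path_ops (option B): doesn't work for ->> extraction. ignored.
BTREE on (attributes->>'plan') (option C): 8ms (index scan, no recheck). winner, ~27x faster than GIN and 60x faster than the seq scan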

Q2: WHERE attributes @> '{"feature_flags": ["beta_search"]}'

no index: 510ms (seq scan)
GIN: 14ms (bitmap scan) — solid
GIN path_ops: 6ms — winner. path_ops loses other operators but is 2.3x faster for @> specifically because it stores hashed paths only.
BTREE on extracted: cant express this query. n/a.

Q3: WHERE plan='pro' AND country='ID' (compound on JSONB)

BTREE on plan only: 120ms (index on plan, then filter for country in heap)
BTREE composite on ((attributes->>'plan'), (attributes->>'country')): 3ms — winner.

if you find yourself running compound queries on JSONB scalars, the COMPOSITE BTREE on extracted columns beats every other option for that exact shape.
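
the composite index DDL isnt spelled out above, so heres a minimal sketch of generating it from a migration script. assumptions: the users table + attributes column from the setup; the helper name expressionIndexDDL and the index name idx_attrs_plan_country are made up for illustration.

```javascript
// Hypothetical helper (not from the benchmark): builds the CREATE INDEX
// statement for a BTREE expression index over text-extracted JSONB keys.
function expressionIndexDDL(indexName, table, column, keys) {
  // Names are inlined into SQL, so only allow plain identifiers.
  const ident = /^[A-Za-z_][A-Za-z0-9_]*$/;
  for (const name of [indexName, table, column, ...keys]) {
    if (!ident.test(name)) throw new Error(`unsafe identifier: ${name}`);
  }
  // Every expression needs its own pair of parentheses in the column list.
  const exprs = keys.map((key) => `(${column}->>'${key}')`).join(", ");
  return `CREATE INDEX ${indexName} ON ${table} (${exprs});`;
}

const compositeDDL = expressionIndexDDL(
  "idx_attrs_plan_country",
  "users",
  "attributes",
  ["plan", "country"],
);
console.log(compositeDDL);
// CREATE INDEX idx_attrs_plan_country ON users ((attributes->>'plan'), (attributes->>'country'));
```

with plan as the leading column, a BTREE like this should also be usable for predicates on plan alone (the Q1 shape).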

Q4: WHERE attributes ? 'preferences'

GIN (option A): 18ms — works because the ? operator is supported
GIN path_ops: doesnt work for the ? operator. path_ops indexes @> (and the jsonpath operators @? and @@), not key existence.
BTREE on extracted: cant express. n/a.

the actual lesson

GIN with default ops is the safe default if you don't know your query shape yet. handles @>, ?, ?|, ?&. flexibility tax: ~2x slower than path_ops on @>, way slower than BTREE on ->>.

GIN with jsonb_path_ops is the specialist if you ONLY use @> (contains). most apps doing feature flag / array-contains queries fit here.

BTREE on extracted scalar (((attributes->>'X'))) was the fastest for that specific scalar predicate in every test above. price: one BTREE index per accessed scalar field.

compound BTREE on multiple extracted scalars is the god mode for queries shaped like Q3. costs you write-amplification but reads are 40-100x faster than alternatives.

what i actually use in production

after benchmarking:

composite BTREE on the 2 high-cardinality scalars (plan, country) that show up in 80% of WHERE clauses → 3ms for the hot path
one GIN path_ops index for the @> feature_flags queries → 6ms for flag lookups
no general-purpose GIN. the ? operator queries (Q4) are rare; we accept seq scan for them.

trade-off: 1 composite BTREE + 1 GIN = 2 indexes on a single JSONB column. write amplification per row: ~30% slower INSERTs vs no indexes. acceptable because writes are 1% of our traffic.

the rule of thumb i give other engineers

if you know the query shape and it's equality on a scalar, use a BTREE on the extracted scalar. if you don't know the shape yet, use a default GIN. only add jsonb_path_ops if you've measured + you ONLY use @>.
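
that rule of thumb, plus the operator support from the benchmarks, as a rough sketch. the function name and the "->>=" label (shorthand for equality on an extracted scalar) are made up for illustration, and the mapping only covers the operators tested in this post.

```javascript
// Operators covered by this post. "->>=" is a made-up label for
// "equality on attributes->>'key'"; it is not a real Postgres operator.
const KEY_EXISTENCE_OPS = ["?", "?|", "?&"];
const KNOWN_OPS = new Set(["->>=", "@>", ...KEY_EXISTENCE_OPS]);

function suggestIndexes(operators) {
  const ops = new Set(operators);
  for (const op of ops) {
    if (!KNOWN_OPS.has(op)) throw new Error(`operator not covered here: ${op}`);
  }
  // Query shape unknown: default GIN is the safe starting point.
  if (ops.size === 0) return ["GIN (default ops)"];

  const picks = [];
  if (ops.has("->>=")) picks.push("BTREE on extracted scalar");
  if (KEY_EXISTENCE_OPS.some((op) => ops.has(op))) {
    // Default ops handle ?, ?|, ?& and also @>.
    picks.push("GIN (default ops)");
  } else if (ops.has("@>")) {
    // Only @> in play: the specialist operator class is enough.
    picks.push("GIN (jsonb_path_ops)");
  }
  return picks;
}

console.log(suggestIndexes(["->>="]));    // [ 'BTREE on extracted scalar' ]
console.log(suggestIndexes(["@>"]));      // [ 'GIN (jsonb_path_ops)' ]
console.log(suggestIndexes(["@>", "?"])); // [ 'GIN (default ops)' ]
```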

GIN-on-everything is the lazy answer that costs you 20-50x on simple equality queries. and most JSONB queries in business code are simple equality queries.

one more thing

the BTREE on (attributes->>'plan') index works ONLY if your query writes the predicate as attributes->>'plan' = 'pro'. if you write (attributes->'plan')::text = '"pro"' (note the double quotes inside the literal: casting JSONB to text keeps the JSON quoting), it won't use the index. lost an hour to this once. the expression in the predicate must match the indexed expression exactly.
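
the quoting difference is easy to see with a JavaScript analogy. this is not Postgres itself, just the same JSON-vs-text distinction; the variable names are made up.

```javascript
// Analogy only: JSON.stringify on a string keeps the JSON quotes,
// the same way casting a JSONB string to text does.
const attributes = { plan: "pro" };

const likeArrowArrow = attributes.plan;                 // like attributes->>'plan'
const likeCastToText = JSON.stringify(attributes.plan); // like (attributes->'plan')::text

console.log(likeArrowArrow);                    // pro
console.log(likeCastToText);                    // "pro"
console.log(likeArrowArrow === likeCastToText); // false
```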

removed 3 vite plugins, my build dropped 4 seconds. heres which

BaoDev Studio — Thu, 21 May 2026 15:09:10 +0000

audited my react starter last week and found 8 vite plugins. three of them were doing more harm than good. removed all three. dev server start went from 6.3s to 2.4s. production build from 18s to 14s.

vite-plugin-pwa

added by npm create vite@latest in some templates i copy-pasted. stayed because nobody questioned it. i wasnt shipping a PWA — never. every build generated service worker + manifest artifacts i never deployed.

cost: ~1.2s added to prod build. also fragmented CSS chunks because the precaching list was rebuilt on every change.

removed it. nothing replaces it. if i need a PWA later, ill add it back for that project specifically.

vite-plugin-eslint

this was the easy cut. runs ESLint on every change during dev. on a 200-file project, each save triggers ~800ms of ESLint processing. the overlay also masked real vite errors when both fired together — twice in one week, i spent 20 minutes hunting an HMR issue that was hidden behind a stale ESLint overlay.

cost: ~2.1s added to dev start, plus the cumulative drag.

ESLint now runs in pnpm lint (separate command), in CI on every PR, and in a pre-commit hook on staged files only. dev server doesnt know ESLint exists. ive lost zero linting coverage.

unplugin-auto-import

controversial one. this plugin auto-imports useState, useEffect, ref, etc. without import statements. saves typing. felt like a quality-of-life win.

removed it because every imported symbol becomes a magic string. when a new contributor opened a component file, they couldnt grep the file or use IDE Go-to-Definition reliably — the plugin generated a .d.ts with global declarations that the IDE handled inconsistently.

cost: only ~0.4s on dev startup. but the readability tax was real.

explicit imports now. the first 3-5 lines of every file are imports. file gets longer. but anyone — including future-me at 2am — can know exactly where every symbol comes from.

audit pattern that worked

the vite plugin ecosystem rewards "add this, get convenience for free." but each plugin runs on every change, so the cost of that convenience compounds in build time.

list every plugin in vite.config.js. ask: "if i removed this today, would i notice within a week?"

answers split cleanly. plugins still in my config: react, react-router, vite-imagetools. everything else is the next candidate for removal.
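
to put rough numbers next to the "would i notice?" question, one option is timing each plugin's transform hook. a minimal sketch, assuming plugins are plain objects with a name and an optional transform function; the helper name withTransformTiming is made up, and it only measures transform, not startup hooks.

```javascript
// Hypothetical audit helper: wraps a plugin's `transform` hook and adds the
// wall-clock time spent in it to `totals` (a Map of plugin name -> ms).
// Only the plain-function form of the hook is wrapped; anything else is
// returned untouched. Other hooks (config, buildStart, ...) are not measured.
function withTransformTiming(plugin, totals) {
  if (typeof plugin.transform !== "function") return plugin;
  const original = plugin.transform;
  const record = (start) => {
    const spent = performance.now() - start;
    totals.set(plugin.name, (totals.get(plugin.name) || 0) + spent);
  };
  return {
    ...plugin,
    transform(...args) {
      const start = performance.now();
      const result = original.apply(this, args);
      // Hooks may be sync or async; record once the work has finished.
      if (result && typeof result.then === "function") {
        return Promise.resolve(result).finally(() => record(start));
      }
      record(start);
      return result;
    },
  };
}

// Usage with a fake plugin (no Vite needed to try it):
const pluginTotals = new Map();
const fakePlugin = { name: "fake-upper", transform: (code) => code.toUpperCase() };
const timedPlugin = withTransformTiming(fakePlugin, pluginTotals);
console.log(timedPlugin.transform("const a = 1;", "/src/a.js")); // CONST A = 1;
console.log(pluginTotals.has("fake-upper")); // true
```

in a real config the wrapper would be mapped over the plugins array; plugin factories that return arrays of plugins would need flattening first.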

an honest list of what AI agents cant do in 2026

BaoDev Studio — Thu, 21 May 2026 12:38:53 +0000

i run 35 specialized claude code agents across my projects. most of whats written about AI agents in 2026 is either marketing (look how much they can do) or doom (look how much theyll replace). both miss the practical layer: where do these agents consistently fail, even with the best prompts, the best context, the best tools?

this is that list. drawn from running these agents across 3 production codebases for the last 6 months. specific failures, not abstract concerns.

judgment under partial information

biggest single category. AI agents fail when the right action requires waiting, choosing not to act, or saying "i need more info."

client message: "can you make the dashboard faster?" agent reads the request, looks at the dashboard code, identifies three optimization opportunities, starts implementing. senior reads the same message, asks: "faster for whom? on what data volume? slow on initial load or on filter operations? whats the SLA?"

the agents confident execution costs hours of work that might solve the wrong problem. the seniors pause costs 5 minutes of clarification. the pause is the right move 80% of the time.

ive tried building this into agent prompts ("ask 3 clarifying questions before starting"). it works sometimes. but agents ask FORMULAIC questions, not THE question that disambiguates. knowing which question to ask is itself judgment under partial information.

this manifests as:

deciding when a feature is done vs needs another iteration
picking which 1-2 of 5 AI-generated draft replies are worth sending
STOPPING the addition of flags/options to a configurable system
subtractive thinking — "remove this rather than build around it"

concrete failure case: building a multi-tenant data isolation layer for a saas project last quarter. agent kept adding configuration flags for edge cases ("what if a tenant wants flag A but not B?"). by flag #7 the system was unmaintainable. i deleted 5 flags and replaced them with a default-secure single mode. config space went from 128 combinations to 4. senior judgment was "stop adding, start removing."
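
the 128 → 4 numbers fall straight out of the flag count, assuming each flag is an independent on/off switch:

```javascript
// Each independent boolean flag doubles the number of configurations.
const combinations = (flagCount) => 2 ** flagCount;

console.log(combinations(7));     // 128 (seven flags)
console.log(combinations(7 - 5)); // 4   (after deleting five)
```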

common thread: right move is restraint. agents are calibrated for action.

reading the codebase context thats not in the prompt

agents are good at search. agents are bad at synthesis from large context.

concrete: asked an agent to refactor a slow 40-line function. the rewrite was technically correct. but the original contained try/catch with comment // don't remove — handles malformed JSON from legacy webhook v1. the rewrite "cleaned up" that try/catch.

agent saw 40 lines. actual scope was the whole webhook chain, the legacy contract, the production data that occasionally hits the malformed path. none of that was in the prompt.

deployed the rewrite to staging. crashed within 6 hours when the daily webhook v1 batch fired. rolled back, restored the original try/catch, added a regression test that explicitly fires malformed JSON. lesson cost ~3 hours and a degraded staging window.
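
a minimal sketch of what that kind of regression test pins down. the function name, return shape, and sample payload are made up; the real handler does more than parse.

```javascript
// Hypothetical stand-in for the parsing step of the webhook handler.
function parseWebhookBody(rawBody) {
  try {
    return { ok: true, payload: JSON.parse(rawBody) };
  } catch {
    // don't remove: legacy webhook v1 occasionally sends malformed JSON
    return { ok: false, reason: "malformed_json" };
  }
}

// Regression check: a truncated body must be reported, not thrown.
const truncatedBody = '{"event": "order.created", "id": 42';
const parsedResult = parseWebhookBody(truncatedBody);
console.log(parsedResult); // { ok: false, reason: 'malformed_json' }
```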

this isnt fixed by more context tokens. the context that matters is implicit — "this comment was load-bearing", "this duplication was intentional", "this naming convention was chosen for a reason". agent reads the lines but doesnt have the memory of why theyre there.

related: agents over-abstract. asked one to extract a pattern shared by 3 functions. it produced a beautiful generic helper that the 4th similar function — written 2 weeks later — could never quite fit. the 3 specific implementations were better than the 1 generic abstraction. agent has no "predict the 4th case" capability.

reading PEOPLE

this one i underestimated. agents are bad at reading tone in human messages.

specifics:

client going silent for 3 days is a strong signal: possibly losing interest, possibly stuck on internal decision, possibly got a competing quote. agents read silence as "no update yet" and continue per plan.
"can we add X?" (genuine question) vs "can we add X?" (testing whether u'll push back on scope creep) is invisible to agents. senior knows from timing, prior conversations, how it was phrased.
tone for difficult convos — scope-creep pushback, missed-deadline notes, refund discussions — agent versions are either too soft (gets walked over) or too corporate (loses earned trust).

specific exchange from last month. a client asked for a "small change" 6 weeks into a project. agent drafted a polite, structured reply explaining the change-request process. i sent something different: "sure, let me think about whether this needs a CR or if it fits the current scope — give me 24 hours."

agent reply was formally correct. actual right reply was warmer + bought thinking time. the relationship needed the warmth more than it needed the formality.

i now never let agents send client comms without human review. tone-reading is unreliable enough that the risk isnt worth it.

eval / judgment about correctness

this is where i most expected agents to excel + where theyre most disappointing.

building LLM-based products requires evals. what does "good" mean? what threshold do we ship at? which test cases matter? upstream of implementation, heavily judgment-based.

agents do badly:

generate exhaustive test cases but cant tell me which 5 matter most for product viability
measure whats measurable (BLEU, semantic similarity, response length) instead of what matters (does the customer find this helpful)
cant design human-in-the-loop eval samples — recommend either fully-automated or fully-manual, never the right hybrid

specific case: building the support agent eval harness for the e-commerce project (last quarter). agent suggested measuring response accuracy via semantic similarity to a "golden answer" set. that would have been wrong in 2 ways. first, the golden answers themselves were judgment calls. second, the actual metric that mattered was "did the customer ask a follow-up that suggests they were confused." the eval design needed real customer conversation data + human classification of "this was helpful" / "this missed." cant be done from training data.

eval-design is the failure i expect least progress on in 2026. requires judgment about what humans value. not in training data.

bonus failure: estimating real-world performance

agent says "this query should be fast" based on indexed columns. in production with cold cache + network jitter + concurrent load, it takes 800ms. agents are bad at production reality because they reason from the code, not from operational behavior.

ive seen agents recommend caching strategies that look correct on paper, but ignore the cache invalidation cost when the cached data changes 50x/day. or recommend "just add an index" without thinking about write amplification on a write-heavy table.

senior knows "this looks fast but will be slow in production for THESE reasons" because senior has seen production reality. agent has seen the docs.

what this means in practice

i keep the 35 agents because the 70% they do well saves real time. but i architect the workflow so the 30% they cant do has explicit human handoffs:

"should we start?" decision (judgment under partial info): human only
cross-codebase refactors where load-bearing weirdness lives: human-driven, agents as implementation tools
client-facing communication: human review minimum, often human-authored
eval design and threshold-setting: human-authored, agents run the harness
production-readiness assessments: human walks through the operational model, agent helps document it

the hype frames this as "AI will do everything." the doom frames this as "AI will replace everything." the practical layer is neither: AI does 70% of any workflow that doesnt require judgment under uncertainty. the 30% that does is exactly where senior engineers earn their living.

if ur building agent systems in 2026, plan the workflow around what they cant do, not what they can. the wont-do list is more load-bearing than the will-do list.

vite HMR is silently the reason ur laptop fan wont stop

BaoDev Studio — Thu, 21 May 2026 12:19:02 +0000

ur working on a react app. ur fan kicks on. u assume chrome is the culprit, or slack, or the LSP. close tabs, kill apps, fan keeps spinning.

actual cause in 6 out of 10 projects i audit on macOS: vite's hot module replacement (HMR) doing way more work than needed. default config keeps websocket connections alive, polls file changes, rebuilds bundles aggressively. on a multi-monitor M-series macbook this lands as a consistent fan-on state even when ur not typing.

confirm its HMR first

quit ur dev server. wait 30 seconds. does the fan calm down?

yes → its HMR. no → look elsewhere (chrome tabs, docker, slack helper).

one-line fix

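vite doesnt read an HMR env var on its own, but server.hmr accepts false, so u can wire one up. start the dev server like this when u dont need live reload:

HMR=off pnpm dev

and read the variable in vite.config.js (without this, the env var does nothing):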

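import { defineConfig } from 'vite';

export default defineConfig({
  server: {
    // HMR=off disables hot module replacement; any other value keeps it on
    hmr: process.env.HMR !== 'off',
  },
});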

with HMR off, vite still serves files but stops the HMR websocket. (the file watcher is a separate setting, server.watch.) CPU drops 30-50% on my M2 air. fan stops in 60 seconds.

when to keep it on

active feature dev: yes, u want HMR. saves 5-10s per save.

code reviews, doc reading, screencast watching while dev server is technically running but ur not editing: HMR is pure overhead. run with HMR=off.

i added this as a package.json script:

{
  "scripts": {
    "dev": "vite",
    "dev:cold": "HMR=off vite"
  }
}

then pnpm dev:cold for any session where im not actively editing the frontend.

why this is silent

vites docs treat HMR as always-on. perf cost is documented in their FAQ but not the getting-started flow. most react tutorials install vite + dont mention the HMR escape hatch. result: devs ship apps assuming "vite is fast" without realizing their dev session keeps the fan on 8 hours a day.

net

across 4 projects, this default flip moved my macbook battery from 4h to 6h on a typical dev day. real number, not a guess.

took 30 seconds to add. pays back every session 🤷

How a one-person studio writes 35 Claude Code agents that don't fight each other

BaoDev Studio — Wed, 20 May 2026 07:27:14 +0000

Last Friday afternoon the quality-gate agent reviewed a PR from backend-developer and rejected it with a 312-word critique. Fair feedback. The PR went back, backend-developer rewrote three functions, re-submitted. quality-gate rejected it again. Same 312-word critique. Same three functions.

I was watching this and realized backend-developer had been told to "improve test coverage" by quality-gate in the previous turn and had written tests. quality-gate's second pass was now objecting to those tests, because they overlapped with what backend-developer had earlier been instructed to skip. The agents were in a loop. Neither was wrong. Both were operating on the spec they had been handed.

This is what happens when you let 35 specialized agents act on the same codebase without rules. They don't fight humans. They fight each other.

The orchestration problem

I keep 35 agents in ~/.claude/agents/. They include backend-developer, frontend-developer, postgres-pro, golang-pro, quality-gate, flow-architect, security-engineer, test-automator, client-communicator, cfo, cto, ceo, inbox-monitor, and 22 others. Most invocations involve 2 to 4 of them in a chain. About 1 in 7 sessions hits a real conflict like the one above.

The problem is not that any agent is wrong. The problem is that 35 specialists with 35 specs will pull the codebase in 35 directions if you do not constrain who decides what.

This is a writeup of the three patterns that mostly work, the three that do not, and the one problem I have not solved.

Three patterns that work

1. Single source of truth, per concern, written down

For every concern that more than one agent touches, there is exactly one file that owns the answer.

CLAUDE.md at the project root owns: build commands, deploy folder convention, test framework. Every agent reads this. None of them argue with it.
masterings/secure-code-patterns.md owns the 40 rules about input validation, secrets handling, SQL safety. security-engineer and quality-gate both reference the same file. They cannot disagree about a pattern because they are reading the same checklist.
FreelanceOS/baseline-form.md owns the 22 test cases that any form must pass. frontend-developer implements them. quality-gate verifies them. The list of 22 is the contract.

The original loop I described above happened because there was no source of truth for "what's the minimum test coverage for backend code". quality-gate had its opinion. backend-developer had its opinion. Once I wrote the rule into CLAUDE.md ("statement coverage minimum 85% on new code, integration tests over unit tests for DB-touching paths"), the loop stopped on the next session.

2. Explicit ownership: one agent per task class

If two agents could plausibly own a task, neither does. I assign it to a third agent who is a layer up.

Concrete: who owns Postgres performance? Could be backend-developer (the SQL is part of the API code). Could be postgres-pro (it is a database concern). If I dispatch a slow-query investigation to backend-developer, the answer comes back with application-layer caching. If I dispatch it to postgres-pro, the answer comes back with an index rewrite. Both are correct. Neither is the right level.

The fix is to dispatch the question to flow-architect first. flow-architect reads the trace, decides whether this is an app-layer fix or a database-layer fix, and then dispatches the specific work to the right specialist with a clear scope. The specialists never fight because they are receiving non-overlapping work.

This is a router pattern, not a coordination pattern. The router is itself an agent.
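
A minimal sketch of that routing step, assuming a trace that reports how much of a request's time was spent in the database. The trace fields (dbTimeMs, totalTimeMs), the 50% threshold, and the function names are illustrative assumptions, not the actual flow-architect logic.

```javascript
// Hypothetical specialists: each receives only the scope the router assigns.
const specialists = {
  "postgres-pro": (task) => `postgres-pro: query/index work for ${task.id}`,
  "backend-developer": (task) => `backend-developer: app-layer work for ${task.id}`,
};

// The router decides the layer first. The threshold is an assumption.
function route(task) {
  const dbShare = task.trace.dbTimeMs / task.trace.totalTimeMs;
  return dbShare > 0.5
    ? { owner: "postgres-pro", scope: "database" }
    : { owner: "backend-developer", scope: "application" };
}

// Exactly one specialist gets the task, so their work never overlaps.
function dispatch(task) {
  const { owner, scope } = route(task);
  return { owner, scope, result: specialists[owner](task) };
}

const slowQueryTask = { id: "slow-report", trace: { dbTimeMs: 900, totalTimeMs: 1000 } };
console.log(dispatch(slowQueryTask).owner); // postgres-pro
```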

3. Locked context per agent invocation

Before dispatching an agent that will write to the codebase, I cache the relevant context in Redis with an explicit key:

redis-cli SET "agent:ctx:backend-developer:2026-05-20-feature-x" \
  "$(cat current-task.md schema.sql relevant-files.txt)" EX 3600

The agent reads from that key at the start of its run. If a parallel agent dispatch is happening, they see the same frozen context. The thing that used to fight them is the floor moving while they walked on it. The thing that stops them fighting is the floor not moving.

This pattern came out of a 2026-04 incident where quality-gate ran simultaneously with refactor-agent, and the file refactor-agent was rewriting got reviewed by quality-gate mid-rewrite. quality-gate flagged the half-finished code as broken. It was. Once locked-context was enforced, that class of bug disappeared.
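
The frozen-floor idea can be sketched without Redis. This is an illustration under assumptions: a Map stands in for the Redis key space, the expiry is omitted, and the function names and file contents are made up.

```javascript
// In-memory stand-in for Redis so the sketch runs anywhere; a real setup
// would use SET/GET with an expiry as in the redis-cli command above.
const contextStore = new Map();

// Freeze a copy of the context under `key`. If the key already exists,
// the existing snapshot wins, so parallel dispatches share one view.
function lockContext(key, files) {
  if (!contextStore.has(key)) {
    contextStore.set(key, Object.freeze({ ...files }));
  }
  return contextStore.get(key);
}

function readContext(key) {
  const snapshot = contextStore.get(key);
  if (!snapshot) throw new Error(`no locked context for ${key}`);
  return snapshot;
}

const workingTree = {
  "current-task.md": "add feature x",
  "schema.sql": "CREATE TABLE orders (id int);",
};
const ctxKey = "agent:ctx:backend-developer:2026-05-20-feature-x";
lockContext(ctxKey, workingTree);

// Another agent rewrites a file mid-run; the locked snapshot is unaffected.
workingTree["schema.sql"] = "-- half-finished rewrite";
console.log(readContext(ctxKey)["schema.sql"]); // CREATE TABLE orders (id int);
```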

Three patterns that do NOT work

1. "Let the agents negotiate"

I tried this for two weeks. backend-developer proposes, quality-gate reviews, they go back and forth until they agree. In theory clean. In practice the agents do not negotiate. They restate their original position more politely. After three or four turns, one of them gives in not because the argument was better but because it was running out of context window.

The decision quality from "exhausted agent gives in" is worse than the decision quality from "router agent decides upfront". Negotiation is the expensive way to lose.

2. "Run multiple agents in parallel and pick the best output"

This sounds safer. Run backend-developer-A, backend-developer-B, backend-developer-C in parallel, take the version with the highest quality-gate score.

Three problems. Token cost is 3x. The quality gain is small because the three runs share most of the same biases (they are the same agent reading the same spec, so they tend to converge on similar answers rather than diverge). And the picker becomes a single point of failure. If quality-gate has a blind spot, all three "winners" share it.

I keep one specialist per task. Cheaper. The output quality difference vs the parallel-and-pick version is within noise on the projects I run.

3. "Agent voting"

Same problem as parallel-and-pick, with extra coordination cost. Skipped after one week.

What I was wrong about

I assumed that as agent count grew the coordination overhead would scale linearly. More agents = more rules to write = more fights to mediate.

The real curve has a kink. Up to about 10 agents, a flat dispatcher works. You hold the dispatch logic in your head and assign work by intuition. Above 12 agents you cannot hold the roster in working memory anymore, and a flat dispatcher loses to a routing agent. So coordination overhead does not go up linearly with agent count. It is roughly flat from 1 to 10, then steps up at the routing-agent threshold, then is roughly flat again from 12 to whatever ceiling.

I jumped from 8 to 22 to 35 agents in three months. The middle period was painful. The jump from 22 to 35 was much easier because the routing infrastructure was already there.

The one problem I have not solved

Agents occasionally regress each other's work. A new instance of backend-developer, dispatched two weeks after the last one, sometimes deletes a workaround the previous instance had added (with no comment because comments are noise). The trace looks like the workaround "appeared from nowhere" and the new agent removes it as dead code. The workaround was load-bearing.

I have partially mitigated this with structured commit messages that explain why a workaround exists, and by writing a test that breaks if the workaround is removed. But the gap is real. The agents do not yet read git history before deletion. The discipline lives in the prompt and the tests, and prompts get truncated.

If I solve this it will probably be by making one of the agents read git blame on any line it touches before recommending deletion. That is on the list.

The shape that emerged

What looks like 35 independent agents is, in practice, a layered system:

1 router (flow-architect) decides what kind of work a task is.
5 to 7 specialists per layer (backend, frontend, DB, security, devops, testing) execute scoped work.
1 reviewer (quality-gate) verifies against the agreed checklist.
Three orthogonal C-levels (cto, cfo, ceo) handle cross-cutting strategy questions that should not block the engineering loop.
The remaining 18 to 20 are domain agents that are dispatched rarely (e.g., postgres-pro only for hard DB problems, tls-config-agent only when certs come up).

The pattern that prevents fights is not "more rules". It is "fewer overlapping responsibilities, an explicit router, a frozen context per dispatch". Three things written down. Most of the conflicts you would otherwise spend a week debugging do not happen.

If you are starting an agent stack today, the order I would build it in is: write CLAUDE.md first, then 5 specialists, then one router, then add the rest. Trying to coordinate 35 specialists without a router and a written source of truth is the slow way to learn the same lesson.

AI-assisted development cost breakdown — real numbers from 3 projects

BaoDev Studio — Mon, 18 May 2026 13:27:40 +0000

Most of what's written about AI-assisted development cost is theoretical. "Could save 50%". "Up to 10x faster". "Game-changing efficiency". Useful for slide decks, useless for budgeting.

Here are three projects from the last few months with the actual numbers. Hours billed, tokens consumed, what AI agents did, what they did not do, and what the final invoice looked like. Names are abstracted (NDA-protected) but the numbers are real.

Project A: SaaS dashboard MVP

The ask: a B2B SaaS dashboard for a small ops team. Backend in Go, frontend in Next.js, Postgres for data, Stripe for billing. Standard CRUD plus 8 reports plus a roles-and-permissions layer.

What was quoted

Quoted at 4 weeks. The discovery call surfaced 22 distinct features. I priced it as a fixed-rate engagement based on previous similar projects, with a stated ±20% accuracy band.

Quoted price: $9,800
Quoted timeline: 28 days
Quoted hours (internal estimate): 78 hours

What was actually delivered

Delivered: 23 days (5 days under quote)
Actual hours billed: 71 hours
Claude Sonnet token cost (development only, not runtime): $141
Lines of application code shipped: ~14,200
Lines of test code shipped: ~4,400 (statement coverage: 91%)
Bugs found in client UAT: 3 (1 critical, 2 cosmetic)
Critical bug fix time: 4 hours including regression test

Where the time actually went

Of the 71 hours billed, the breakdown was roughly:

Senior architecture decisions and spec refinement: ~9 hours
Agent-supervised feature build (CRUD, reports, billing, auth): ~38 hours
Direct senior code on the harder bits (roles-and-permissions logic, Stripe webhook retry, multi-tenant data isolation): ~14 hours
Test writing and review: ~6 hours
Deployment, monitoring setup, runbook: ~4 hours

The agents handled the boring 70% well. The senior owned the 30% that required real judgment.

What this teaches about pricing

A traditional freelance team at $75-100/hr would have charged ~$10,000-15,000 for this scope and delivered in 5-7 weeks. The agent-assisted version landed at $9,800, in 23 days, with 91% test coverage. The savings ratio is real but the marginal token cost ($141) is so small it does not even show up on the invoice as a line item. Clients care about the total number and the delivery date, not the token accounting.

Project B: Bug fix sprint on a Go service

The ask: an existing production Go service was crashing under specific concurrent load. The team had been hunting the bug for two weeks without success. The need was triage, fix, regression test, and a runbook so it would not happen again.

What was quoted

Quoted price: $1,200 fixed
Quoted timeline: 2 days
Quoted hours (internal estimate): 14 hours

This was priced as a sprint because the bug was contained and the team had already done the partial reproduction work.

What was actually delivered

Delivered: 6 hours total elapsed
Actual hours billed: 6 hours
Claude Sonnet token cost: $8.30
Outcome: race condition in a worker-pool goroutine fixed; regression test added; root cause documented in the team runbook

Where the time actually went

Reading the existing code and the team's investigation notes: ~1.5 hours (human only; agents are bad at "read this 8000-line repo and tell me what's wrong")
Running the race detector on suspicious paths (go test -race -count=10 ./pool/...): ~0.5 hours
Reproducing the bug locally with a stress harness: ~1 hour
Writing the fix (mutex around the shared resource that was being mutated outside the worker's own slot): ~0.5 hours
Writing a deterministic regression test for the race: ~1 hour
Writing the runbook entry: ~1.5 hours

What this teaches about pricing

Bug fix sprints are the wrong place to apply heavy AI assistance. They are 80% reading and 20% writing. The reading is human work; agents cannot reliably hold 8,000 lines of context and reason about the system's invariants. The writing portion (the fix itself, the regression test, the runbook) is where AI helps, but it is the smaller half.

Pricing reality: I bill bug fix sprints close to a senior hourly rate even though the elapsed hours are short. The work that makes the fix possible is the years of experience that lets a senior glance at a stack trace and know to look at the worker pool. Token cost? $8. The studio's overhead for that hour of focused work? Multiples of that.

Project C: LLM-powered support agent for an e-commerce site

The ask: an e-commerce business wanted an AI support agent that answered customer questions from their product catalog plus a small knowledge base. Integration with their existing helpdesk for fallback handoff to humans.

What was quoted

Quoted price: $5,400 for phase 1 (the bot + the integration)
Phase 2 (analytics dashboard, multi-language support, retainer) deferred until phase 1 shipped
Quoted timeline: 3 weeks
Quoted hours (internal estimate): 44 hours

What was actually delivered

Delivered phase 1: 18 days (3 days under quote)
Actual hours billed: 41 hours
Claude Sonnet token cost (development only): $96
Phase 2 ongoing as a $1,800/month retainer

Where the time actually went

The work split was different from the SaaS dashboard because the actual product was an LLM agent. Hours roughly:

Knowledge base preprocessing and chunking strategy: ~6 hours (human only; this is judgment work — which fields to include, how to handle product variants)
Retrieval-augmented generation pipeline: ~10 hours (agents wrote most of it, senior reviewed embeddings model choice and reranking strategy)
Helpdesk integration (webhook in, conversation handoff API out): ~8 hours
Frontend chat widget: ~6 hours (agents handled almost all of this)
Eval harness (sample 50 real customer questions, measure helpfulness, accuracy, escalation rate): ~9 hours (mostly human work — eval design is hard to delegate)
Deployment and monitoring: ~2 hours

What this teaches about pricing

LLM projects have a specific cost shape. The retrieval pipeline and integration compress heavily with agent help (60% efficiency gain). The eval harness, the knowledge base preprocessing, and the prompt engineering decisions stay heavily human (negligible efficiency gain from AI tools, because the work is judgment, not boilerplate).

The blended rate ends up being similar to the SaaS dashboard rate, even though the project felt more "AI-native". This surprised me. The first three LLM projects I built I priced too low because I assumed AI tools would compress the LLM-engineering work the most. They actually compressed the surrounding infrastructure work the most. The LLM-engineering itself stays expensive because it is judgment-heavy.

What three projects show in aggregate

Project	Quoted	Delivered hours	Token cost	Days early/late	Margin vs traditional
SaaS dashboard	$9,800	71h	$141	5 days early	~35% reduction
Bug fix sprint	$1,200	6h	$8	Same week	Same price; faster
LLM support agent	$5,400	41h	$96	3 days early	~25% reduction

Three patterns:

Boilerplate-heavy projects compress the most. The SaaS dashboard had ~35% time savings vs traditional because most of the build was standard CRUD plus reports. Agents handle that well.
Reading-heavy projects compress the least. The bug fix sprint saw almost no compression of the thinking time. The fix took 30 minutes; the understanding took 5 hours. Agents do not yet help much with understanding existing systems.
Judgment-heavy projects compress unevenly. The LLM agent project compressed the integration work but not the eval design. Token costs were tiny in all three cases ($8 to $141), making them invisible on invoices.

The token-cost question, answered with real numbers

Three projects, total token spend: $245.30.

Three projects, total client invoices: $16,400.

Token cost as a percentage of revenue: 1.5%.

This is the part of AI-assisted development that hype articles get most wrong. Tokens are not the meaningful expense. The meaningful expense is the senior judgment that directs the agents and reviews their output. If anyone tries to sell you AI development priced primarily on "low token cost", the math is upside down. Real AI-assisted projects bill the senior judgment, not the tokens.
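The percentage above can be re-derived from the aggregate table. A minimal sketch in Go, using the table's rounded per-project figures (so the token total comes out at $245 rather than $245.30); the type and function names are illustrative only:

```go
package main

import "fmt"

// project holds one row of the aggregate table (whole dollars).
type project struct {
	name    string
	invoice int
	tokens  int
}

// tokenShare returns total token spend as a percentage of total invoices.
func tokenShare(ps []project) float64 {
	var invoices, tokens int
	for _, p := range ps {
		invoices += p.invoice
		tokens += p.tokens
	}
	if invoices == 0 {
		return 0
	}
	return 100 * float64(tokens) / float64(invoices)
}

func main() {
	ps := []project{
		{"SaaS dashboard", 9800, 141},
		{"Bug fix sprint", 1200, 8},
		{"LLM support agent", 5400, 96},
	}
	fmt.Printf("token share: %.1f%%\n", tokenShare(ps)) // prints "token share: 1.5%"
}
```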

What does NOT compress with AI agents

Independent of project type, these consistently took the same amount of time as a fully-human project would have:

Discovery calls and spec refinement. Agents do not run client calls.
Stakeholder communication during the build. Async status emails, slack threads, the project-status documents.
Debugging that requires reproducing client-specific environment issues.
Compliance reviews (HIPAA, GDPR, PCI). The standards have not changed because AI helps you write code.
Code review of changes that touch the riskiest 5% of the codebase (auth, billing, data deletion paths).

A studio that quotes 50% off because "AI" without specifying which parts of the work compressed is either overestimating the savings or underestimating the work.

The take

AI-assisted development is real. The compression is real. The numbers above are not marketing copy; they came off three actual invoices and three actual token spends in 2026. Token cost is real but small. Calendar time savings are real and meaningful.

The savings are not uniform across project types. Boilerplate-heavy work compresses dramatically. Reading-heavy work barely compresses. Judgment-heavy work compresses unevenly.

The right way to price an AI-assisted project is the same as the right way to price any project: count the hours by category (boilerplate vs reading vs judgment), apply the right multiplier per category, add a token line item for transparency, and quote a fixed price with a ±20% accuracy band.
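That recipe can be sketched in a few lines of Go. The multipliers below are placeholder assumptions, not figures from the three invoices; they stand for "fraction of traditional hours still needed with agent support" and would have to come from your own project history. The names `Category`, `Quote`, and `price` are hypothetical.

```go
package main

import "fmt"

// Category is the kind of work an hour estimate belongs to.
type Category string

const (
	Boilerplate Category = "boilerplate"
	Reading     Category = "reading"
	Judgment    Category = "judgment"
)

// multipliers are ASSUMED placeholder values, not numbers from the post.
var multipliers = map[Category]float64{
	Boilerplate: 0.4,
	Reading:     1.0,
	Judgment:    0.8,
}

// Quote is a fixed price with an accuracy band around it.
type Quote struct {
	Low, Mid, High float64
}

// price converts traditional hour estimates per category into a quote:
// adjusted hours x rate, plus a token line item, with a +/- band.
func price(hours map[Category]float64, rate, tokens, band float64) Quote {
	var adjusted float64
	for c, h := range hours {
		m, ok := multipliers[c]
		if !ok {
			m = 1.0 // unknown category: assume no compression
		}
		adjusted += h * m
	}
	mid := adjusted*rate + tokens
	return Quote{Low: mid * (1 - band), Mid: mid, High: mid * (1 + band)}
}

func main() {
	hours := map[Category]float64{Boilerplate: 40, Reading: 10, Judgment: 20}
	q := price(hours, 125, 150, 0.20)
	fmt.Printf("quote: $%.0f (range $%.0f to $%.0f)\n", q.Mid, q.Low, q.High)
}
```

Keeping the token cost as its own argument mirrors the "token line item for transparency" idea: it shows up in the quote without being mixed into the hourly rate.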

The wrong way is to quote 50% off and hope you can deliver on it.

If you want the full reasoning behind the cost variables and the worked example for a similar dashboard project, the deeper budgeting framework is at the canonical URL in the post header.

How to budget a software development project (without the spreadsheet theater)

BaoDev Studio — Mon, 18 May 2026 10:34:10 +0000

The first software project I quoted as a freelancer was supposed to take three weeks. It took eleven.

The spec was four bullet points, the budget was a number I picked because it sounded reasonable, and the client was patient until they weren't. By week six everyone was unhappy and nobody was being dishonest. I just had no idea how to budget a software development project.

Most online cost calculators don't help. They ask three questions and return a number that depends on nothing. Real budgeting is six variables in a trench coat, and the trench coat is held together by how well-defined your scope actually is.

Here's what I'd tell my past self.

The six variables that actually move the number

Every software development cost estimate is some flavor of hours × rate. The interesting question is where the hour count comes from.

Six things change it:

1. Project scope: the number of distinct features the deliverable must support. Not "a dashboard with reports" but "a dashboard with 6 specific reports, each with 3 filter dimensions and CSV export". The first phrasing is a fantasy. The second is bid-able.

2. Technical complexity: does the code have to handle real-time updates, custom algorithms, regulated data (HIPAA, PCI, GDPR), high concurrency? Each of those adds 30-100% to the matching feature's hour count because of the engineering trade-offs and the verification load.

3. Integration surface: every third-party API the project touches adds roughly 4-12 hours. Stripe is fine. A B2B partner's custom SOAP endpoint with no sandbox is six days of wasted weekends.

4. Team seniority: a senior engineer at $100-150/hr will deliver in a third of the hours of a mid-level at $40-60/hr. The math sometimes makes seniors cheaper per project. The math more often doesn't, because mid-levels are cheaper per hour. Read the actual scope before choosing.

5. Timeline pressure: compressing a project timeline by 50% adds roughly 30-50% to total cost, not because engineers work faster but because the team adds people, communication overhead climbs, and weekends start to be billable. Brooks' Law from 1975 is still true in 2026.

6. Scope volatility: every change request after kickoff costs 2-4x what it would have cost in the original scope. A button added in week 1 is 30 minutes. The same button added in week 4, after the page redesign is "almost done", is 2-3 hours plus regression testing.

A budget that doesn't ask about all six is a budget that's going to surprise someone.
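To make the hours × rate idea concrete, here is a rough Go sketch that turns three of the six variables (complexity, integration surface, timeline pressure) into a low/high range using the rules of thumb above. Seniority and scope volatility are left out to keep it short. The types and function names are made up for illustration, and the feature list is invented.

```go
package main

import "fmt"

// Feature is one scoped deliverable with a base hour estimate.
type Feature struct {
	Name      string
	BaseHours float64
	Complex   bool // real-time, regulated data, custom algorithms, high concurrency
}

// Range is a low/high pair (hours or dollars).
type Range struct{ Low, High float64 }

// estimateHours applies two rules of thumb from the list above:
// a complex feature costs 30-100% more, each third-party API adds 4-12 hours.
func estimateHours(features []Feature, integrations int) Range {
	var r Range
	for _, f := range features {
		lo, hi := f.BaseHours, f.BaseHours
		if f.Complex {
			lo, hi = f.BaseHours*1.3, f.BaseHours*2.0
		}
		r.Low += lo
		r.High += hi
	}
	r.Low += 4 * float64(integrations)
	r.High += 12 * float64(integrations)
	return r
}

// cost turns hours into money; a timeline compressed by half adds 30-50%.
func cost(h Range, rate float64, rushed bool) Range {
	c := Range{Low: h.Low * rate, High: h.High * rate}
	if rushed {
		c.Low *= 1.3
		c.High *= 1.5
	}
	return c
}

func main() {
	features := []Feature{
		{Name: "reports", BaseHours: 20},
		{Name: "live order feed", BaseHours: 10, Complex: true},
	}
	h := estimateHours(features, 2)
	c := cost(h, 100, false)
	fmt.Printf("hours: %.0f to %.0f, cost: $%.0f to $%.0f\n", h.Low, h.High, c.Low, c.High)
}
```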

The accuracy ceiling, and why it's lower than people think

Here's the part nobody likes hearing.

A well-scoped estimate for a defined deliverable (one that the studio has built something similar to before) is typically accurate within ±20%. So if I quote 100 hours, the real number is probably 80-120. That's the honest ceiling for honest work.

For a greenfield product, where the spec is "we want a SaaS for X but we're still figuring out what X is", the accuracy collapses to ±40-60%. So if you hear a quote of "$15,000-25,000" for a new product, the real ending number is probably anywhere between $9,000 and $40,000.
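The arithmetic behind both ranges is small enough to sketch. This Go snippet assumes the ±40-60% band is read as "40% off the low end, 60% on top of the high end", which is the reading that reproduces the $9,000 and $40,000 figures; `plausibleRange` is a hypothetical helper name.

```go
package main

import "fmt"

// plausibleRange widens a quoted range (whole units) by an uncertainty band:
// downPct off the low end, upPct on top of the high end.
func plausibleRange(quoteLow, quoteHigh, downPct, upPct int) (low, high int) {
	low = quoteLow * (100 - downPct) / 100
	high = quoteHigh * (100 + upPct) / 100
	return low, high
}

func main() {
	// Greenfield quote of $15,000-25,000.
	low, high := plausibleRange(15000, 25000, 40, 60)
	fmt.Printf("greenfield: $%d to $%d\n", low, high) // $9000 to $40000

	// Well-scoped 100-hour estimate at +/-20%.
	lo, hi := plausibleRange(100, 100, 20, 20)
	fmt.Printf("well-scoped: %d to %d hours\n", lo, hi) // 80 to 120 hours
}
```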

I was wrong about this for the first two years. I thought tighter estimates were a sign of better engineering. They were a sign of clients who hadn't started discovering what they actually wanted.

The fix isn't tighter estimates. The fix is one of three things:

Lock the spec earlier with a paid discovery sprint (2-5 days, charged separately) before quoting the build.
Use time-and-materials with a not-to-exceed cap instead of fixed price.
Quote phase 1 as a fixed price and explicitly defer phase 2 estimation until phase 1 ships.

All three are honest. Pretending you can quote a greenfield product within ±10% is not.

What changes when AI agents enter the workflow

This part has been overhyped and underhyped in equal measure.

The naive overhype is "AI writes 90% of the code, projects ship 10x faster, budget collapses". That has not been my experience.

The naive underhype is "AI is just autocomplete with more steps". That isn't true either.

What actually happens, in numbers I can defend from running a studio with this workflow for the past year:

A mid-complexity integration (webhook pipeline, a few UI screens, basic auth, deploy) used to take 18-22 hours of subcontractor time. Senior reviews on top added 2-3 hours. Total: 20-25 hours.

The same scope, with a proper agent workflow, lands closer to 6-8 hours total. Not because agents write better code than the subcontractor did. The reason is the review-fix-verify loop. Agents catch their own obvious mistakes when the feedback cycle is wired correctly. The "first draft is rough, second draft is shipping-quality" cycle collapses from days into hours.

Token cost is real but tiny. A mid-complexity project consumes $12-18 of Claude tokens. The subcontractor equivalent was $400-600 in labor. The ratio looks dramatic but the meaningful gain is calendar time, not cost — most clients care about shipping in 6 days more than they care about a $400 line item.

Bigger projects scale similarly. A SaaS MVP that traditionally takes 4 weeks of senior engineering can land in 3 weeks with agent support. Token cost across that project is around $140. Lines of code: about 14,000 (without counting tests). Bugs caught in client UAT: typically 2-4, of which one is critical and the rest are cosmetic.

What AI agents do NOT do: tell the client they're wrong about a product decision, catch the subtle business-rule violation in a 200-line file, decide what NOT to build. Those are still human work. The studio model is "one senior plus an agent workforce", not "an AI subscription replacing engineers".

A worked example

Let's budget a real project. The ask: a small e-commerce dashboard for a Shopify store owner who wants daily metrics, abandoned cart recovery emails, and a returns workflow.

Phase 1: scope it out.

Daily metrics dashboard (revenue, orders, top products, repeat customer rate): 4 reports, 3 filters each, CSV export
Abandoned cart email automation: Shopify webhook + email service (Postmark) integration
Returns workflow: admin can mark return, send pre-paid label link via Postmark, refund processed via Shopify Admin API

Phase 2: estimate hours per variable.

Without AI agents:

Dashboard frontend (Next.js): ~24 hours (3 hours per report × 4 reports, plus shared filter component and CSV export)
API + Shopify integration: ~16 hours (webhooks + admin API client + error handling)
Email automation: ~12 hours (template, Postmark client, queue + retry logic)
Returns workflow: ~14 hours (UI + state machine + label generation + refund flow)
Auth + deployment + testing: ~14 hours
Total: 80 hours

With agent support, same scope:

Dashboard frontend: ~10 hours (agents handle the boilerplate, senior reviews architecture and UX details)
API + Shopify integration: ~6 hours
Email automation: ~4 hours
Returns workflow: ~6 hours
Auth + deployment + testing: ~6 hours
Total: 32 hours

Phase 3: apply rate.

At a senior rate of $125/hr:

Traditional: 80 hours × $125 = $10,000
Agent-assisted: 32 hours × $125 + $200 in tokens = $4,200

Phase 4: apply uncertainty.

Both numbers are ±20% because the scope is well-defined. So the honest quote is:

Traditional: $8,000-12,000, delivered in 3-4 weeks
Agent-assisted: $3,400-5,000, delivered in 1-2 weeks

Roughly a 58% cost reduction ($10,000 down to $4,200), on 60% fewer hours. The temptation is to claim this is the typical case. It's not. This works when scope is defined, the integration partners are well-documented (Shopify is excellent here), and the senior knows what to verify. Greenfield products with 20 features and shifting requirements don't compress as cleanly.
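For anyone who wants to check the worked example, the four phases reduce to a few lines of Go. The hour figures, the $125 rate, the $200 token line, and the ±20% band are taken from the example; the type and function names are illustrative.

```go
package main

import "fmt"

// lineItem is one row of the hour breakdown, before and after agent support.
type lineItem struct {
	name                string
	traditional, agents int
}

var items = []lineItem{
	{"Dashboard frontend", 24, 10},
	{"API + Shopify integration", 16, 6},
	{"Email automation", 12, 4},
	{"Returns workflow", 14, 6},
	{"Auth + deployment + testing", 14, 6},
}

// totals sums both hour columns.
func totals(rows []lineItem) (traditional, agents int) {
	for _, r := range rows {
		traditional += r.traditional
		agents += r.agents
	}
	return traditional, agents
}

// band returns price -/+ pct percent, in whole dollars.
func band(price, pct int) (low, high int) {
	return price * (100 - pct) / 100, price * (100 + pct) / 100
}

func main() {
	const rate, tokens = 125, 200
	th, ah := totals(items)
	tp, ap := th*rate, ah*rate+tokens
	tl, thi := band(tp, 20)
	al, ahi := band(ap, 20)
	fmt.Printf("traditional: %dh, $%d ($%d to $%d)\n", th, tp, tl, thi)
	fmt.Printf("agent-assisted: %dh, $%d ($%d to $%d)\n", ah, ap, al, ahi)
	fmt.Printf("cost reduction: %d%%\n", 100*(tp-ap)/tp)
}
```

The exact band for the agent-assisted price is $3,360 to $5,040, which the quote rounds to $3,400-5,000, and the exact reduction in price works out to 58%.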

How to budget if you're the client, not the studio

Three pragmatic tactics if you're commissioning the work:

1. Ask for the breakdown. A quote that says "$8,000 for a dashboard" is not a quote. A quote that says "24 hours dashboard, 16 hours integration, 12 hours email, 14 hours returns, 14 hours auth+deploy, $125/hr = $10,000, ±20%" is a quote. The breakdown protects both sides.

2. Run the calculator on yourself first. Before you ask a vendor, write the feature list yourself and put a rough hour count next to each one. You don't need to be right. You just need to know what feels wrong when the vendor's number arrives. A 4-feature project quoted at 200 hours probably has scope you didn't communicate. A 12-feature project quoted at 30 hours probably has scope the vendor didn't read.

3. Reserve 20% for unknown unknowns. Whatever number you arrive at, add 20% to the budget you tell stakeholders. Not because vendors are dishonest. Because the spec isn't done until the product ships. You'll either spend the buffer or you won't, and the conversation in week 6 will be much calmer if you reserved it.

The take

Budgeting a software development project well isn't a formula. It's a discipline.

The discipline is: define scope, count features, pick a rate, apply the right multiplier for complexity, and admit upfront how much the spec might still drift. The variables that move the number are knowable. The honesty about your uncertainty is the part that protects you.

If you want to play with the inputs, the calculator at baodev.studio/estimate.html shows the same logic with a UI. It's not a quote. It's a defensible reference for the budget conversation you're about to have with whoever is funding the project.

If you want a real quote, the intake form takes 7 minutes and produces a number within 24 hours. The number comes with the breakdown, which is the only part that matters.

I ran `go test -race` after 3 months. It found 8 things.

BaoDev Studio — Mon, 18 May 2026 02:44:01 +0000

8 race conditions. That's what three months of "I'll add -race later" bought me.

The codebase is a Go backend for a freelance studio automation tool. Around 4,000 lines of application code, a handful of goroutines managing job queues, email polling, and an agent dispatch loop. Perfectly ordinary stuff. I had been telling myself -race was "too slow for CI." It runs in 11s for a 4k-line service. I was wrong.

What the detector actually outputs

When you hit a real data race, the output looks like this:

==================
WARNING: DATA RACE
Read at 0x00c0001b4030 by goroutine 18:
  github.com/baodev/flos/internal/dispatch.(*Router).getHandler()
      /home/runner/work/flos/internal/dispatch/router.go:94 +0x6c

Previous write at 0x00c0001b4030 by goroutine 7:
  github.com/baodev/flos/internal/dispatch.(*Router).Register()
      /home/runner/work/flos/internal/dispatch/router.go:61 +0x84

Goroutine 18 (running) created at:
  github.com/baodev/flos/internal/dispatch.(*Router).Start()
      /home/runner/work/flos/internal/dispatch/router.go:112 +0x1e0
==================

File and line numbers, both goroutines, the moment of creation. It tells you exactly where to look.

The representative case

The most embarrassing one: a map[string]HandlerFunc being read by worker goroutines while a registration goroutine could still be writing to it. Classic. The map wasn't behind a mutex because I "registered everything at startup." Except one code path registered a handler lazily on first use.

 type Router struct {
-    handlers map[string]HandlerFunc
+    handlers map[string]HandlerFunc
+    mu       sync.RWMutex
 }

 func (r *Router) Register(name string, fn HandlerFunc) {
+    r.mu.Lock()
+    defer r.mu.Unlock()
     r.handlers[name] = fn
 }

 func (r *Router) getHandler(name string) HandlerFunc {
+    r.mu.RLock()
+    defer r.mu.RUnlock()
     return r.handlers[name]
 }

12 lines changed. Bug had been live since the initial commit in February.
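For readers who want to try this locally, here is a self-contained sketch of the patched Router plus a small driver that registers and looks up handlers concurrently. The `HandlerFunc` signature, `NewRouter`, and `hammer` are stand-ins invented for the example; the real code has more around them.

```go
package main

import (
	"fmt"
	"sync"
)

// HandlerFunc is a stand-in signature; the real one is not shown in the post.
type HandlerFunc func(payload string) string

// Router guards its handler map with an RWMutex, as in the diff above.
type Router struct {
	mu       sync.RWMutex
	handlers map[string]HandlerFunc
}

func NewRouter() *Router {
	return &Router{handlers: make(map[string]HandlerFunc)}
}

func (r *Router) Register(name string, fn HandlerFunc) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.handlers[name] = fn
}

func (r *Router) getHandler(name string) HandlerFunc {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.handlers[name]
}

// hammer registers and looks up handlers from separate goroutines,
// mimicking lazy registration while workers are already reading.
func hammer(r *Router, workers int) int {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		name := fmt.Sprintf("job-%d", i)
		wg.Add(2)
		go func() {
			defer wg.Done()
			r.Register(name, func(p string) string { return name + ":" + p })
		}()
		go func() {
			defer wg.Done()
			_ = r.getHandler(name) // may be nil if the writer has not run yet
		}()
	}
	wg.Wait()
	r.mu.RLock()
	defer r.mu.RUnlock()
	return len(r.handlers)
}

func main() {
	r := NewRouter()
	fmt.Println("registered:", hammer(r, 50)) // registered: 50
}
```

Run it with `go run -race`. With the two lock calls removed, the same driver gives the detector concurrent map reads and writes to report.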

Adding it to CI is one line

If you're on GitHub Actions and not already running this, add it to your test job:

- name: Test with race detector
  run: go test -race -count=1 -timeout=120s ./...

Or if you run tests via a Makefile:

test-race:
    go test -race -count=1 -timeout=120s ./...

The -count=1 flag disables the test result cache so every CI run actually executes. Without it, Go can return cached results even with -race, which defeats the point.

What the other 7 were

I won't detail all of them. Mostly they were the same pattern: shared state accessed from spawned goroutines, written once somewhere "safe" and read everywhere else, with no synchronization because the write "always finished first." The race detector disagreed with that assumption on 7 separate occasions.
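One common remedy for the "written once, read everywhere" pattern is sync.Once, which guarantees that every reader observes the completed write. This is a generic sketch, not the actual fix applied to those seven findings; `Config`, `lazyConfig`, and `PollInterval` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// Config is a hypothetical piece of shared state.
type Config struct{ PollInterval int }

// lazyConfig initialises its value exactly once; every caller of Get
// is guaranteed to see the finished write.
type lazyConfig struct {
	once  sync.Once
	cfg   *Config
	loads int // how many times the initialiser ran; only written inside once.Do
}

func (l *lazyConfig) Get() *Config {
	l.once.Do(func() {
		l.loads++
		l.cfg = &Config{PollInterval: 30}
	})
	return l.cfg
}

// readFromWorkers calls Get from n goroutines and reports the sum of the
// values they saw plus how often the initialiser actually ran.
func readFromWorkers(l *lazyConfig, n int) (sum, loads int) {
	results := make([]int, n) // each goroutine writes only its own slot
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = l.Get().PollInterval
		}(i)
	}
	wg.Wait()
	for _, v := range results {
		sum += v
	}
	return sum, l.loads
}

func main() {
	sum, loads := readFromWorkers(&lazyConfig{}, 8)
	fmt.Println("sum:", sum, "initialised:", loads, "time(s)") // sum: 240 initialised: 1 time(s)
}
```

The other honest option is to finish every write before the first goroutine starts, since starting a goroutine is itself a synchronization point. The trouble in this codebase was that "before" turned out not to be true on every code path.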

Two of them were in test helpers, not production code. Still real races — test helpers spin goroutines too, and a flaky test that fails once every 40 runs is its own kind of tax.

The honest accounting

Three months of technical debt on a four-person equivalent codebase (it's mostly me and agents). Eight findings in 11 seconds of wall time. One of those findings was in the agent dispatch path that runs on every job — meaning every job that completed without incident was getting lucky with goroutine scheduling.

That's the uncomfortable part about race conditions: they don't fail loudly. They fail intermittently, or they corrupt state silently, or they don't fail at all on your machine because your CPU happens to schedule goroutines in a forgiving order.

The race detector doesn't care about your scheduler's mood.

Running it weekly now. Should have been in CI from day one — -race exists precisely because humans are bad at reasoning about concurrent memory access under load.

How a one-person dev studio runs with autonomous AI agents

BaoDev Studio — Sat, 16 May 2026 15:22:48 +0000

TL;DR (for the impatient)

BaoDev.studio is a one-person dev studio paired with a fleet of autonomous AI agents.
The workflow is built around Claude Code and ~35 specialized agents that operate like a fractional engineering team.
Pricing: $800 for a sprint (1–5 days), $4,500 for a system build (2–8 weeks), $3,000/month retainer (3-month minimum).
Delivery times are honest. Token costs are real. AI agents do not replace senior judgment — they multiply it.
This post: the stack, the actual numbers, the trade-offs, and what AI still cannot do.

The gap nobody talks about

A founder needs an MVP, a payments integration, or a Go service that does not fall over. Two options usually surface:

Option 1: Freelancers on Upwork / Sribu / Kalibrr. Cheap. Sometimes great. Often a coin flip. Someone billing 8 clients in parallel, copy-pasting from Stack Overflow, disappearing for 3 days when something breaks. $15-30/hour and a "1 week project" becomes 6 weeks.

Option 2: An agency. Senior people, real processes, predictable output. Also $20,000 minimum, 8-week kickoff, and a project manager emailing status updates while the actual work waits in a sprint queue.

The gap in the middle — "senior engineering quality, this month, for less than the price of a used car" — is where most actual SME projects live. Nobody was serving it well, because the unit economics of doing senior work at freelance prices have not existed.

Until AI agents got good enough.

What BaoDev actually does all day

This is not "vibe coding". Not pasting prompts into ChatGPT and pretending. The studio runs an instrumented engineering pipeline that looks something like this:

1. Intake. Client fills the form on baodev.studio describing what they want. The studio reviews. If a project is outside what can be delivered well, the answer is no — saying no honestly is worth more than saying yes to fail later.

2. Plan and contract. A planning agent drafts the PLAN.md (14 sections: scope, architecture, risks, milestones, dependencies). A senior engineer reviews and edits. A contract is signed before any code is written.

3. Build. Most people imagine the AI just writes everything. It does not. The actual flow:

A flow-architect agent maps every page → API → DB → job → notification connection and flags gaps.
A backend-developer or frontend-developer writes the first pass.
A quality-gate reviews every output before it hits the codebase.
An integration-test-agent writes tests that hit a real database, not mocks.
A security-engineer runs OWASP checks before deploy.

When something is non-obvious — a tricky algorithm, an architectural fork, a client decision — a human takes over. Agents are good at executing the boring 80%. Senior judgment owns the 20% where it matters.

4. Ship. The CI pipeline includes the agents as checks. Every PR runs 0qa, scoring on completeness, security, performance, and tests. Score has to be ≥90 with zero critical findings or the PR does not merge.
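The merge rule itself is simple to express. In this Go sketch the `Review` shape is hypothetical, and averaging the four dimensions into one score is an assumption; the only parts taken from the pipeline description are the four dimension names, the threshold of 90, and the zero-critical-findings rule.

```go
package main

import "fmt"

// Review is a hypothetical shape for a QA report, each dimension scored 0-100.
type Review struct {
	Completeness     int
	Security         int
	Performance      int
	Tests            int
	CriticalFindings int
}

// Score averages the four dimensions (an assumption; the real weighting is not stated).
func (r Review) Score() int {
	return (r.Completeness + r.Security + r.Performance + r.Tests) / 4
}

// CanMerge applies the gate: score of at least 90 and zero critical findings.
func (r Review) CanMerge() bool {
	return r.Score() >= 90 && r.CriticalFindings == 0
}

func main() {
	clean := Review{Completeness: 95, Security: 92, Performance: 90, Tests: 91}
	risky := Review{Completeness: 98, Security: 97, Performance: 96, Tests: 99, CriticalFindings: 1}
	fmt.Println(clean.Score(), clean.CanMerge()) // 92 true
	fmt.Println(risky.Score(), risky.CanMerge()) // 97 false
}
```

The second case is the one that matters: a high score does not buy back a critical finding.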

This stack is ~35 agents, all custom-built and version-controlled in the same repo. The roster lives in ~/.claude/agents/. They run on Claude Code locally — no cloud bill, no quota anxiety.

Real numbers from real projects

Vague case studies are useless. Concrete data points instead:

Project: SaaS dashboard MVP (Next.js + Go API + Postgres + Stripe)

Quoted: $4,500, 4 weeks
Delivered: $4,500, 3 weeks 2 days
Token cost end-to-end: ~$140 of Claude usage
Lines shipped: ~14,000 (without counting tests)
Tests: ~4,200 lines, 91% statement coverage
Bugs found in client UAT: 3 (1 critical fixed in 4 hours; 2 cosmetic)

Project: Bug fix sprint (Go service, race condition in worker pool)

Quoted: $800, 2 days
Delivered: $800, 6 hours
Token cost: ~$8
Outcome: race fixed, regression test added, root cause documented in the runbook

Project: AI integration (LLM-powered support agent for a small e-commerce site)

Quoted: $4,500, 6 weeks (deferred to phase 2 by client after week 3 — normal scope-creep)
Delivered phase 1: $4,500, 3 weeks
Token cost (dev only, not runtime): ~$200
Phase 2 ongoing as retainer

The unit economics work because token cost is a rounding error against engineering time. The constraint is not "how cheaply can the AI write code" — it is "how reliably can a senior direct it without re-doing everything by hand". That part had to be invented.

What AI agents cannot do

Honest list. These do not get better with bigger models.

Tell a client they are wrong. A founder asks for a feature that will tank their conversion rate. Agents will build it. A senior pushes back.
Pick the right database. Postgres or Mongo, Redis or Memcached, monolith or microservices — these are architectural decisions tied to business stage and team future. Agents pick whatever you suggest. They will not catch the suggestion that is wrong for your stage.
Read code politically. Some refactors are technically clean and politically dead. Agents do not know your CTO has a feud with the previous lead.
Catch the subtle copy-paste bug. Agents will sometimes write a 200-line file that compiles, runs, passes tests, and is silently wrong because they copy-pasted a constant from a similar service.
Decide what NOT to build. Scope discipline is a senior trait. Agents are happy to scaffold a feature you do not need.

This is why the model is not "buy an AI subscription instead of a developer". It is "buy a senior developer who has an AI workforce". The agents do not replace humans — they let one senior ship the work of 4-6 mid-level engineers without the overhead of 4-6 people.

The pricing model

Three engagement types, all flat:

Sprint — $800 per deliverable, 1–5 business days. Use this for one isolated thing: a feature, a bug, an integration, a migration. No ongoing commitment.
System build — $4,500+, 2–8 weeks. Full system delivered: backend, frontend, database, deployment, tests, docs. Fixed scope, fixed price, fixed timeline. SLA signed, system delivered.
Ongoing retainer — $3,000/month, 3-month minimum. Reserved engineering capacity for continuous work — new features, on-call, code review, iterative product.

No hourly billing. No "+30% if the project runs over". Estimates are honest, with a buffer. If a project hits its deadline early, the client gets the work earlier. If something was missed in scoping, the studio absorbs it — not a change order.

Bilingual EN/ID. Asynchronous communication by default. Email-first. Async beats meetings 9 times out of 10.

Why this post exists

Two reasons:

To find clients who fit. If your project lives in the SME band — between Upwork-cheap and agency-expensive — the intake form on baodev.studio is the right starting point. Services page has full pricing. Open-source showcase projects are linked from the projects page.
To document a workflow that was not possible 18 months ago. The economics of senior engineering + AI agents are real, and they are reshaping who can sustainably run a small studio. Worth contributing to that conversation.

If this resonates, the intake form on baodev.studio is the right next step. Every legitimate inquiry gets a response within 24 hours, in English or Bahasa. If the project is a fit, a contract is signed within a week. If not, a referral comes instead.

BaoDev.studio. Senior engineering paired with autonomous AI agents. Production-grade systems. No agency overhead.

Originally published at baodev.studio