Forem: Ryosuke Tsuji

Human-on-the-Loop: AI Reviewing AI PRs at cortex (769 PRs/month, while raising the quality bar)

Ryosuke Tsuji — Tue, 26 May 2026 14:35:43 +0000

Hi, I'm Ryan, CTO at airCloset.

Disclaimer: "cortex" in this article is the internal codename for an AI platform built in-house at airCloset. It is unrelated to existing commercial services like Snowflake Cortex or Palo Alto Networks Cortex.

In Part 1 (intro) I covered the high level -- AI driving both PR reviews and incident response on top of cortex. In Part 2 (Product Graph) I went deep on cpg, the unified knowledge graph that fuses code, docs, DB schemas and infra into a single business-aware index.

This post is about the automated PR review pipeline -- AI reviews the PR, a separate AI applies the fixes, and the system merges automatically once policy gates pass. The usual critiques of AI-assisted development ("the reviewer becomes the bottleneck" and "AI code drops the quality bar") don't really apply here. The rest of this post unpacks why.

Series

#	Theme	Key scene	Article
1	Series intro: cortex harness	PRs merging unattended / incidents fixed before anyone notices	ai-harness-intro
2	Product Graph (cpg)	Code / docs / DB / infra unified into one graph	cortex-product-graph
3	Auto PR review	webhook -> AI review -> auto-fix -> squash merge	This article ← you are here
4	Alert-Fix + observability + auto-added guardrails	Alert -> AI investigates -> fix PR + new lint/type gate -> auto redeploy + recurrence blocked	Coming soon
5	Scaling the harness from cortex to toC services	Non-engineer contributions in practice + scaling cortex's harness to the whole product org	Coming soon

Start with last month's numbers

769 PRs merged.

Median time to merge: 31 minutes.

Human review involvement per PR: near-zero.

That's a typical 30 days on cortex (Apr 21 -- May 21).

Every one of those 769 PRs had an AI reviewer as the first reviewer, with an average of 10.8 review-fix loop iterations per PR (max 56). 1 in 5 merged within 10 minutes, roughly half within 30 minutes. What humans do now is look at review outcomes and tune the review prompt and the guidelines themselves -- this is human-on-the-loop, not human-in-the-loop. Humans operate on the policy layer, not the execution layer.

Past 30 days
PRs merged	769
AI reviewer coverage	100%
Avg review iterations / PR	10.8
Max review iterations	56
Per-PR human review	~0%
Median time-to-merge	31 min
Merged within 10 min	20%
Merged within 30 min	49%

This is a typical month on cortex now.

The common refrain is that "AI speeds up writing but reviews still bottleneck" and "AI-written code lowers quality." cortex's pipeline is built so that neither failure mode can take hold. Let me break it down.

How the review bottleneck stops forming

The conventional wisdom: the reviewer becomes the bottleneck

As AI writes faster, the load on whoever reviews the output grows proportionally. Anthropic's blog post (How Anthropic teams use Claude Code) reports the same pattern -- the bottleneck has shifted from writing to reviewing, and senior engineers' work has moved from writing code toward integrating and reviewing AI output.

cortex hit exactly this. The moment we ran Claude Code at full throttle, writing speed jumped by an order of magnitude or more. Meanwhile the human time available to read and approve PRs only grew linearly. If the reviewer (=me) took a day off, the whole org stalled -- a classic single point of failure.

cortex's answer: move the reviewer role to AI as well

Part 1 and Part 2 kept returning to the same question: "how far do you push the harness?" cortex went all-in: the AI writes the code, the AI reviews the code. What humans keep their hands on is "tuning the prompts and guidelines themselves" -- not making decisions inside each individual PR, but watching the system from above and adjusting.

Three conditions had to hold for this to work:

The AI reviewer has enough context

A generic AI reviewer only sees the PR diff. The diff alone hides business meaning, upstream/downstream dependencies, and prior incident history. cortex feeds the Product Graph (cpg) from Part 2 -- a knowledge graph that fuses code, docs, DB schemas, and infra into one structure, with each node carrying business role and upstream/downstream dependencies -- into the AI reviewer, so it can trace impact into code that the PR didn't even touch. It catches:

- Missed upstream/downstream fixes
- Missed doc updates
- Tests that should have been updated but weren't

Diff-only AI review can never reach this territory.

Reviews are not improvisational

If reviews shift day to day, the team gets confused, and the AI can't be told what "correct" looks like. We enforce consistency by passing an explicit review-guideline document as the mandatory citation source for every review (we open-sourced a snapshot, see below).

False positives don't blanket-block merges

Treating every false positive as Critical breaks the workflow. We control this with a severity hierarchy (Critical / Major / Minor / Nit) plus strict no-downgrade rules.

So: the cpg from Part 2 solves "what context the AI sees," the review guidelines solve "what the AI should do" as Guides (pre-execution control), and the severity ladder + no-downgrade rules solve "what the AI must not do" as Sensors (post-execution control). This maps cleanly onto Martin Fowler's Guides / Sensors taxonomy (introduced back in Part 1).

One more upstream layer: before any of those three kicks in, a 500-lines-per-file lint keeps every file in any PR small enough to fit in a single AI session. That alone keeps AI review from breaking down, and unlike a human reviewer, the AI doesn't lose focus. There are plenty of other lints in front of the AI reviewer too, but the full picture belongs to Part 4 (Alert-Fix + observability + auto-added guardrails).
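As a rough illustration of that upstream gate, a max-lines check can be very small. The sketch below is an assumption-laden stand-in, not the actual cortex lint: the names `MAX_LINES`, `countLines`, and `findOversizedFiles` are hypothetical, and files are passed in as an in-memory map so the example stays self-contained.

```typescript
// Hypothetical sketch of a max-lines-per-file gate; not the actual cortex lint.
const MAX_LINES = 500; // threshold named in the article

interface OversizedFile {
  path: string;
  lines: number;
}

// Count lines in a file's content; an empty file counts as 0 lines.
function countLines(content: string): number {
  if (content.length === 0) return 0;
  const parts = content.split("\n");
  // A trailing newline terminates the last line rather than starting a new one.
  return content.charAt(content.length - 1) === "\n" ? parts.length - 1 : parts.length;
}

// Return every file whose line count exceeds the limit.
function findOversizedFiles(
  files: Map<string, string>,
  maxLines: number = MAX_LINES,
): OversizedFile[] {
  const oversized: OversizedFile[] = [];
  files.forEach((content, path) => {
    const lines = countLines(content);
    if (lines > maxLines) oversized.push({ path, lines });
  });
  return oversized;
}
```

A gate like this would fail the PR whenever `findOversizedFiles` returns a non-empty list, before the AI reviewer is ever invoked.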

How the auto-review system is wired

The implementation is a script running on each developer's machine. GitHub webhooks land on an in-house Event Relay server, get persisted to Firestore, and each developer's machine subscribes as an SSE client. On reconnect, Last-Event-ID replays anything missed -- zero event loss, single webhook registration. Reviewer-mode machines stay always-on, so any incoming review fires immediately. Author mode runs in the background on the PR author's own machine, alongside their normal dev work.
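The replay behavior described above can be sketched as follows. This is a minimal model under stated assumptions: an in-memory array stands in for the Firestore-backed log, event ids are simple increasing integers, and the names `EventLog`, `replayAfter`, and `toSseFrame` are hypothetical. Only the `Last-Event-ID` header and the `id:` / `event:` / `data:` frame fields come from the Server-Sent Events format itself.

```typescript
// Hypothetical in-memory stand-in for the relay's persisted event log.
// The article's relay persists to Firestore; an array keeps the sketch runnable.
interface RelayEvent {
  id: number; // increasing id, sent to clients as the SSE event id
  type: string; // webhook event name (illustrative)
  payload: unknown; // raw webhook body
}

class EventLog {
  private events: RelayEvent[] = [];
  private nextId = 1;

  // Persist an incoming webhook and assign it the next id.
  append(type: string, payload: unknown): RelayEvent {
    const event: RelayEvent = { id: this.nextId++, type, payload };
    this.events.push(event);
    return event;
  }

  // Events a reconnecting client has not seen yet.
  // `lastEventId` is the value of the Last-Event-ID request header, if present.
  replayAfter(lastEventId: string | undefined): RelayEvent[] {
    const parsed = lastEventId === undefined ? 0 : Number.parseInt(lastEventId, 10);
    // An unparseable id is treated as "seen nothing", so the full log is replayed.
    const cursor = Number.isNaN(parsed) ? 0 : parsed;
    return this.events.filter((e) => e.id > cursor);
  }
}

// Serialize one event in the text/event-stream wire format.
function toSseFrame(event: RelayEvent): string {
  return `id: ${event.id}\nevent: ${event.type}\ndata: ${JSON.stringify(event.payload)}\n\n`;
}
```

Because every frame carries an `id:`, a client that drops and reconnects reports the last id it received, and the relay only needs to stream whatever `replayAfter` returns before resuming live delivery.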

How we ended up with Event Relay

The current setup wasn't the original design.

First: GitHub webhook → smee.io → each machine
Then: GitHub webhook → Cloudflare Tunnel → each machine
Now: GitHub webhook → in-house Event Relay with Firestore persistence → SSE to each machine

Both smee.io and Cloudflare Tunnel ran into connection drops and missed deliveries, which caused real misses for us. Switching to the in-house Event Relay brought event loss to zero (Firestore persistence + Last-Event-ID replay), and the relay turned into a general-purpose layer we could reuse.

The webhook ingestion for Alert-Fix (covered in Part 4) actually goes through the exact same Event Relay. GitHub, Grafana, and other webhook sources get consolidated through one relay, and each machine's SSE client subscribes to whichever events it cares about. Having a single general-purpose webhook relay is a piece of infra that keeps paying off in unexpected ways -- worth investing in early.

When the reviewer's machine receives an event, the script spawns claude -p and walks through 9 dimensions (Graph / Architecture / Security / Test / Doc / Impact / Observability / AI-Antipattern / Recurrence) sequentially, then reads the verdict marker the AI emitted at the end and posts APPROVE or REQUEST_CHANGES via gh pr review.

A few notes:

- Modes split the role -- the same script started with --mode reviewer becomes the reviewer process; with --mode author it becomes the PR-author response process. The machine of whoever is assigned as reviewer runs reviewer mode; the machine of whoever opened the PR runs author mode. Event Relay multicasts the events, and each machine reacts in a distributed way.
- Per-PR worktree isolation -- author mode merges origin/main into a fresh worktree before spawning the AI. Multiple PRs can be handled in parallel without file state contaminating across them.
- 9 dimensions checked sequentially in one session -- not parallel sub-agents. A single claude -p session walks the 9 dimensions while keeping context shared, which also catches cross-dimension contradictions.
- Review guidelines: public snapshot -- air-closet/cortex-review-guidelines (JP/EN). The live guidelines are inside cortex (private repo) and evolve daily; the public repo is a snapshot extracted for reference.
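To make the per-PR worktree isolation concrete, here is a small sketch that only plans the git invocations rather than running them. The helper name `planWorktree`, the `pr-<number>` directory naming, and the exact command sequence are assumptions for illustration; the post only states that origin/main is merged into a fresh worktree before the AI is spawned.

```typescript
// Hypothetical helper: builds the git commands for preparing an isolated
// per-PR worktree. Paths and naming are illustrative, not cortex's layout.
interface WorktreePlan {
  worktreePath: string;
  commands: string[][]; // each entry is the argv for one git invocation
}

function planWorktree(baseDir: string, prNumber: number, branch: string): WorktreePlan {
  const worktreePath = `${baseDir}/pr-${prNumber}`;
  return {
    worktreePath,
    commands: [
      // Refresh remote refs so origin/main and the PR branch are current.
      ["git", "fetch", "origin"],
      // Check the PR branch out into its own directory.
      ["git", "worktree", "add", worktreePath, branch],
      // Bring the branch up to date with main inside that worktree only.
      ["git", "-C", worktreePath, "merge", "origin/main"],
    ],
  };
}
```

Since each PR gets its own directory, two author-mode runs can proceed side by side without sharing a working tree.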

:::message alert
Guidelines alone scale only to projects in the tens-of-thousands-of-lines range. At cortex's scale (over 1M lines of code), the knowledge graph from Part 2 (cpg) is a hard prerequisite. Porting the guidelines without cpg won't reproduce the same review quality -- the AI reviewer simply can't navigate the codebase fast enough to reason about impact.
:::

Why sequential single-session review, not parallel sub-agents

We initially tried splitting the 9 dimensions across parallel sub-agents. Three problems emerged: cpg / guidelines / PR diff got injected 9 times (token cost balloons), cross-dimension findings couldn't reference each other (a [Test] issue rooted in a [Graph] violation gets dropped in isolation), and aggregating 9 outputs into a single verdict required its own machinery.

A single sequential session fixes all three: one cpg/guideline load, earlier findings stay in context for later dimensions (cross-dimension consistency comes for free), and one verdict marker at the end is the entire aggregation step.

We also swap CLAUDE.md to a review-specific version at startup. The default CLAUDE.md is dense with development-time context (Product Graph ops, prod-data safety, MCP ordering) -- noise for a reviewer. The review-specific version centers on severity, no-downgrade, and the verdict marker spec, keeping AI attention on the review task.

Cutting wasted context improves judgment precision and lowers token cost at the same time.

Operational knobs

A few filters and toggles we apply in actual use:

- Draft (WIP) PRs are excluded. GitHub Draft state is received but skipped; review starts firing once the author flips it to Ready for Review.
- Specific PRs can be targeted manually. The webhook is the normal trigger, but you can also kick off a review against a specific PR number from the CLI -- useful after a CI failure or for re-checking a single PR.
- Auto-merge is the PR author's call. Whether the pipeline runs through to auto-merge after APPROVE + CI green is set by the PR author. Default is on; for changes that go directly to prod, the author can flip it off and hit merge themselves.
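Two of these knobs reduce to small predicates. The sketch below assumes the standard GitHub `pull_request` webhook fields (`action`, `pull_request.draft`); the list of triggering actions, the function names, and the `autoMerge` flag are hypothetical and only illustrate the rules stated above.

```typescript
// Subset of GitHub's pull_request webhook payload used by this sketch.
interface PullRequestEvent {
  action: string; // e.g. "opened", "synchronize", "ready_for_review"
  pull_request: { number: number; draft: boolean };
}

// Actions that can start a review; this list is an assumption for illustration.
const REVIEW_ACTIONS = ["opened", "reopened", "synchronize", "ready_for_review"];

// Draft PRs are received but skipped; everything else reviews on relevant actions.
function shouldReview(event: PullRequestEvent): boolean {
  if (event.pull_request.draft) return false;
  return REVIEW_ACTIONS.indexOf(event.action) !== -1;
}

interface MergeInputs {
  verdict: "APPROVE" | "REQUEST_CHANGES";
  ciGreen: boolean;
  autoMerge?: boolean; // author's toggle (hypothetical field); defaults to on
}

// Auto-merge only when the author left it on, the review approved, and CI is green.
function shouldAutoMerge({ verdict, ciGreen, autoMerge = true }: MergeInputs): boolean {
  return autoMerge && verdict === "APPROVE" && ciGreen;
}
```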

Output structure: tags and severity

Every auto-review comment is structured as tag + severity + concrete example.

Tags (dimensions)

Tag	Dimension	Primary target
`[Graph]`	Product Graph integrity	`@graph-*` JSDoc, node dependencies, doc consistency
`[Doc]`	Doc consistency	Doc updates that should follow code changes, doc placement
`[Impact]`	Impact analysis	Missed upstream/downstream fixes, `via:` field inconsistency
`[Security]`	Security	Auth, input validation, secrets
`[Architecture]`	Composable Architecture	app/package boundaries, dependency direction
`[Test]`	Test quality	Coverage, matchers, naming
`[Observability]`	Observability	Structured logging, no-truncate rules
`[AI-Antipattern]`	AI-generated code traps	Hallucinated APIs, fallback overuse, dead code
`[Recurrence]`	Recurrence prevention	Bug-fix triage (lint / horizontal rollout / new guideline)

Severity

Severity	Criteria	Action
Critical	Security, data corruption, prod-risk, doc inconsistency, missing `@graph-*`, quality-bar relaxation	`REQUEST_CHANGES`
Major	Spec violation, Composable Architecture violation, missing tests	`REQUEST_CHANGES`
Minor	Naming, maintainability, light refactor	`REQUEST_CHANGES` (must be resolved)
Nit	Style preference, minor inconsistency	`APPROVE` (comment only)
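The two tables above translate fairly directly into types. The following is a sketch, not cortex's actual data model: the `Finding` shape and the function names `verdictFor` and `formatFinding` are hypothetical, while the tag names, severity levels, and the "anything above Nit blocks" rule are taken from the tables.

```typescript
type Severity = "Critical" | "Major" | "Minor" | "Nit";

type Tag =
  | "Graph"
  | "Doc"
  | "Impact"
  | "Security"
  | "Architecture"
  | "Test"
  | "Observability"
  | "AI-Antipattern"
  | "Recurrence";

// Hypothetical shape for one review finding.
interface Finding {
  tag: Tag;
  severity: Severity;
  message: string;
}

type Verdict = "APPROVE" | "REQUEST_CHANGES";

// Per the severity table: only Nit findings (or none at all) allow approval.
function verdictFor(findings: Finding[]): Verdict {
  const blocking = findings.some((f) => f.severity !== "Nit");
  return blocking ? "REQUEST_CHANGES" : "APPROVE";
}

// Render a finding as "[Tag] message (Severity)".
function formatFinding(f: Finding): string {
  return `[${f.tag}] ${f.message} (${f.severity})`;
}
```

Note that Minor is deliberately on the blocking side here, matching the "must be resolved" entry in the table.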

The single most important rule is "no downgrade":

"Following existing patterns" is not a valid reason to downgrade (existing violations are a separate problem to fix, not justification for new violations).
"Will be addressed in a separate PR" or "incrementally" is not a valid reason to drop Critical/Major to Nit.
"Leave a TODO/FIXME" is not a valid deferral path.

This is spelled out in severity.md, and the AI literally cites it back when it issues REQUEST_CHANGES.

Example: PR migrating the embedding model from gemini-embedding-001 to gemini-embedding-2

A real review comment is more useful than an abstract description. Here's a typical one. This is a feature PR (feat(meet): dual-write embeddings to new 'embedding' column (v2)) that was merged on 2026-05-19, after 6 review-fix iterations across 1.5 hours.

First review (07:35:25):

The dual-write approach, the idempotent migration script, and the Promise.all parallelization all look sound. Please address the 3 Critical and 2 Minor findings below before merging.

Critical

[Graph] Missing @graph-business tag (x3)

graph-integrity.md requires @graph-business on every top-level declaration under the app layer (severity: Critical).
All 3 new declarations below are missing it (detailed comments in separate threads).
Existing code is also missing it, but "following existing patterns" is not a valid downgrade reason (severity.md).

generateEmbeddingV2 (v2 embedding generation function)

EMBEDDING_MODEL_V2 (model name config)

EMBEDDING_LOCATION_V2 (region config)

[Graph] embedMeetContent's @graph-connects doesn't reflect generateEmbeddingV2

The JSDoc on embedMeetContent has @graph-connects generateEmbedding [calls] Generate embedding, but no corresponding @graph-connects line has been added for the newly introduced generateEmbeddingV2 call.
The graph will be missing an edge to generateEmbeddingV2.

   * @graph-connects generateEmbedding [calls] Generate embedding
+  * @graph-connects generateEmbeddingV2 [calls] v2 embedding generation (dual-write)
   * @graph-connects insertMeetChunks [calls] Insert chunks into BQ

[Doc] Corresponding BigQuery schema doc is not updated

The "BigQuery schema" section in the related doc is missing the new embedding column.
Both graph-integrity.md and severity.md define doc inconsistency as Critical.

 | `created_at`  | TIMESTAMP   | Created at                              |
+| `embedding`   | FLOAT64[]   | Embedding vector (v2: gemini-embedding-2) |

Minor

[Test] textEmbeddingV2 value is not asserted

objectContaining allows extra fields, so the test still passes even when the v2 value is never set.

         textEmbedding: [0.1, 0.2, 0.3],
+        textEmbeddingV2: [0.1, 0.2, 0.3],

[Test] No isolated scenario for "v2 returns null"

generateEmbeddingV2: mockGenerateEmbedding reuses the v1 mock, so the case "v2 returns null while v1 succeeds" is not independently verified.



The takeaway is the precision of the details.

- File + line numbers are concrete.
- Suggested fixes are in diff format (copy-paste ready).
- Source guideline (graph-integrity.md / severity.md) is cited explicitly.
- The typical excuse ("existing code has the same problem") is pre-emptively closed.
- The review ends with a machine-readable verdict marker -- the trigger that moves the PR into REQUEST_CHANGES state.

After this, the PR author (usually another AI running on the author's machine) pushes a fix, and the reviewer re-reviews. The next review confirms all 3 Criticals are actually resolved, raises the next Major / Critical, and so on. 6 iterations in 1.5 hours, finally APPROVE, auto-merge.

[Figure: timeline of the review-fix iterations for this PR]

With a human reviewer, this is "Critical x3 -> wait until tomorrow for the fix -> re-review the day after" -- 2 to 3 days per PR. cortex closes it in 90 minutes.

The difference between human review and auto review is not just speed. A single AI session walks all 9 dimensions in order and cites the guideline each time, which makes it much harder to miss the "deep" findings humans drop because their attention drifted -- doc consistency, recurrence-prevention judgments, weak matchers.

[Figure: side-by-side comparison of human review and auto review]

This is why the review bottleneck never forms here.

Evolving the guidelines: catching the moments AI gets it wrong, then fixing the rules

The review guidelines I've been referring to are not a static document. Running this in production surfaces recurring patterns where the AI mis-judges a specific class of issue. Each time that happens, we don't add a comment to the individual PR; we rewrite the guideline so the AI behaves correctly next time -- this is the meta-layer humans actually operate on.

A few concrete failures we hit on cortex, and how we closed each one by changing the rule, not the PR.

1. AI was downgrading because "existing code has the same issue"

Early on, immediately after flagging a violation the AI would add "however, since existing code has the same violation, I'm downgrading this to Nit" and self-downgrade. The result: violations on newly added code kept dropping to Nit, and the system kept emitting Approve.

We closed this by adding the no-downgrade rule to severity.md:

"Following existing patterns" is not a valid downgrade reason: if existing code violates a guideline, new code following that pattern still gets flagged at the same severity. Deferral language like "consider during the next refactor" is not accepted.

That wasn't enough on its own. Over time other excuse patterns surfaced -- "will be addressed in a separate PR," "will be addressed in the next session," "out of scope," "incrementally" -- so we added those as forbidden downgrade categories too. We also explicitly forbade deferring via TODO/FIXME comments in code. The mindset is: close every typical excuse path preemptively.

2. The final verdict had 3 options, and "comment-only" left PRs in limbo

The final verdict at the end of every review was originally APPROVE / REQUEST_CHANGES / COMMENT (approve / request changes / comment-only). When the AI picked COMMENT -- for example when only Minor issues existed -- the script took no action, the PR sat in review-pending forever, and ultimately someone had to manually pick it up. Classic anti-pattern, and it kept happening.

We collapsed the verdict to 2 options. Anything Minor or above is REQUEST_CHANGES, a missing verdict marker defaults to REQUEST_CHANGES (safe side), and only Nit-only or no findings (with CI passing) yields APPROVE. The principle: "if the judgment is ambiguous, fail-safe by defaulting to the blocking side (REQUEST_CHANGES)." Going all-in on that design eliminated the stuck-PR class entirely.
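The fail-safe rule can be sketched as a small parser. The marker format used here (`VERDICT: ...` on the last line) is purely an assumption for illustration, since the real cortex marker is not shown in this post; `resolveVerdict` and `FinalVerdict` are hypothetical names as well.

```typescript
type FinalVerdict = "APPROVE" | "REQUEST_CHANGES";

// Hypothetical marker format; the real cortex marker is not shown in the article.
// The marker must be the last thing in the review output.
const VERDICT_MARKER = /VERDICT:\s*(APPROVE|REQUEST_CHANGES)\s*$/;

function resolveVerdict(reviewOutput: string, ciPassing: boolean): FinalVerdict {
  const match = VERDICT_MARKER.exec(reviewOutput.trim());
  // Missing or malformed marker: fail safe to the blocking side.
  if (match === null) return "REQUEST_CHANGES";
  // APPROVE only counts when CI is passing.
  if (match[1] === "APPROVE" && ciPassing) return "APPROVE";
  return "REQUEST_CHANGES";
}
```

With only two outcomes and a blocking default, there is no code path in which the script reads a review and then does nothing, which is what removes the stuck-PR state.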

3. Checklist items had no severity, so the AI's judgment kept drifting

Originally, each guideline (graph-integrity.md, testing.md, etc.) was just a bulleted checklist. Items like "Is the test name descriptive?" or "Are mocks minimized?" were listed, but without per-item severity. As a result, the same violation could land as Major in one PR and Nit in another, depending on the session.

We converted every guideline's checklist into a severity / scope / criterion table:

Severity	Scope	Criterion
Critical	All PRs	Missing `@graph-business`
Major	App layer only	Missing tests
Minor	Shared packages only	More than 3 function args
Nit	All PRs	Naming inconsistency

The scope column is a machine-decidable filter for which paths a check applies to, so the AI reviewer doesn't trigger irrelevant items on PRs outside that scope. Simply moving the checklist into a table made judgments noticeably more reproducible.
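To show what "machine-decidable" can mean in practice, here is a sketch that filters rules by the paths a PR touches. The rule rows mirror the example table; the path prefixes (`apps/`, `packages/`), the `Rule` shape, and the function names are assumptions, since the post does not say how scopes map to directories.

```typescript
type RuleSeverity = "Critical" | "Major" | "Minor" | "Nit";
type Scope = "all" | "app" | "shared";

interface Rule {
  severity: RuleSeverity;
  scope: Scope;
  criterion: string;
}

// Hypothetical path prefixes for each scope; the article does not list them.
const SCOPE_PREFIXES: Record<"app" | "shared", string> = {
  app: "apps/",
  shared: "packages/",
};

// A rule applies if its scope is "all" or the PR touches a path in its scope.
function appliesTo(rule: Rule, changedPaths: string[]): boolean {
  if (rule.scope === "all") return true;
  const prefix = SCOPE_PREFIXES[rule.scope];
  return changedPaths.some((p) => p.startsWith(prefix));
}

function applicableRules(rules: Rule[], changedPaths: string[]): Rule[] {
  return rules.filter((rule) => appliesTo(rule, changedPaths));
}

// The example rows from the table above.
const EXAMPLE_RULES: Rule[] = [
  { severity: "Critical", scope: "all", criterion: "Missing @graph-business" },
  { severity: "Major", scope: "app", criterion: "Missing tests" },
  { severity: "Minor", scope: "shared", criterion: "More than 3 function args" },
  { severity: "Nit", scope: "all", criterion: "Naming inconsistency" },
];
```

For a PR that only touches shared packages, the "Missing tests" row drops out before the reviewer ever sees it, so the same PR yields the same set of checks in every session.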

4. The existing guidelines didn't catch AI-specific traps

After running this for a while we noticed AI-generated code has its own cluster of antipatterns that security.md / testing.md couldn't catch:

- Calling APIs that don't exist (hallucinated APIs -- something like user.findOrCreate() that looks plausible but isn't actually defined)
- Swallowing errors and returning fallback values (e.g., silently returning an empty array when an upstream API fails)
- Leaving unused functions (a refactor adds the new function but doesn't delete the old one, leaving dead code)
- Expanding the modification scope beyond what was asked (you ask it to change one function and it reformats the whole file)
- Adding unnecessary backward-compatibility code (creating a deprecated alias for an internal-only function)

There's a distinct class of "mistakes only AIs make."

We added a dedicated ai-antipattern.md for this. Reviews now pick these up explicitly under the [AI-Antipattern] tag. Reviewing AI output requires designing around AI-specific traps -- you don't get there just by porting human review heuristics onto an AI.

5. The AI tries to relax "the standard itself"

The last and most important pattern. When the AI was writing fix PRs, occasionally instead of fixing the guideline violation it would write a PR that relaxes the guideline. For example:

- Lower the test coverage threshold to avoid writing more tests
- Narrow the in-house lint rule's scope to make the violation go away
- Soften the guideline doc language from "recommended" to "preferred" to weaken the binding constraint

And the AI builds a formally-coherent justification: "existing code already violates this, so let's adjust the standard to match the implementation." Left unchecked, the AI gradually walks the quality bar down.

We closed this by adding "quality-bar relaxation" as a Critical in severity.md:

A PR that relaxes the quality bar -- guideline doc, lint rule, coverage threshold -- must not be Approved by the AI reviewer. It is sent back with REQUEST_CHANGES. A human reviewer's approval is required. "Existing code already violates this" is not a valid justification for relaxation.

This is the one explicit boundary where we deliberately do not give the AI autonomous Approve authority. Whether the standard itself moves is a human decision. It's the meta-level safety valve for the "AI reviewing AI" architecture.

Evolving the guidelines is the meta-layer humans actually operate on

The common thread: "when the AI gets it wrong, don't override the individual PR -- rewrite the guideline so the fix propagates forward."

- AI escapes via "existing code has the same issue" -> add no-downgrade rule
- AI picks "comment-only" and PR stalls -> collapse to 2-option verdict
- AI's judgment drifts -> add severity / scope columns to every item
- AI falls into its own traps -> add the AI-Antipattern category
- AI tries to relax the standard -> classify standard-relaxation as Critical, require human Approve

As long as this loop turns, the guideline is a living document that absorbs the failure patterns AI produces in production. Don't try to write the perfect guideline up front. Catch the moment AI gets it wrong, and write the rule for that moment. That's the actual mechanism behind "quality doesn't drop even when humans aren't inside the loop."

And one more thread. Right now, the trigger for "AI got it wrong, time to rewrite the guideline" is still mostly a human judgment, but parts of that maintenance are gradually becoming automatable too. Alert-Fix (the subject of Part 4) -- where AI investigates production incidents, opens a fix PR, runs it through auto-review, and auto-redeploys -- requires every fix PR to declare one of {add lint, add guideline, horizontal rollout} under the [Recurrence] lens. So the AI is increasingly participating in the maintenance of its own review criteria, with humans still in the loop on adoption. I'll come back to this in Part 4.

Auto-fix: a separate AI applies the changes and pushes

Once REQUEST_CHANGES lands, the same script running on the PR author's machine, but in author mode, picks up the event and starts working.

[REQUEST_CHANGES detected]
   | SSE push via Event Relay
[Author mode boots on PR author's machine]
   | Merge origin/main into a worktree
   |  (lockfile resolved up front, remaining conflicts handled by AI)
   | Read the auto-review comment as context
   | Run claude -p inside the worktree
   | Commit + push the changes
   | New SHA is delivered back to the reviewer's machine via Event Relay -> re-review

Two design choices matter here.

- Reviewer and author run on different machines in different sessions -- reviewer mode and author mode are the same script, but they run on different machines in different processes. "Is the original critique correct?" is judged independently. Unlike a single AI fixing its own complaints, the judgment passes between two separate sessions.
- All iteration stays inside the same PR -- we don't spawn a new PR. The "fix the root cause, no deferrals" rule from Part 2 and the review guidelines kicks in here: if the AI tries to escape via TODO/FIXME or by splitting work out into a separate PR, the next review rejects it.

Auto-merge + parallel deploy

Once auto-review returns APPROVE and CI is fully green, the auto-merge script runs and squash-merges the PR.

[Auto review APPROVE + CI green]
   |
auto-merge script
   | squash merge to main
   |
[main updated]
   |
Turborepo build (affected packages only)
   |
Pulumi up (multiple stacks in parallel)
   |- API services
   |- pipeline services
   |- MCP servers
   `- infra
   |
[Deploy complete]
   |
cpg index rebuilt (only changed nodes regenerate embeddings -- see Part 2)

The Pulumi deploys for the individual stacks run in parallel, so deploying 9 stacks at once finishes in about 8-12 minutes. End to end, merge-to-production averages 10-15 minutes.
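The fan-out step can be sketched without shelling out to anything. In the snippet below the per-stack deploy is an injected callback, so `deployAll`, `DeployFn`, and `DeployResult` are hypothetical names and nothing here reflects how cortex actually invokes Pulumi; it only shows the "start every stack at once, collect per-stack results" pattern.

```typescript
// The per-stack deploy is injected so the sketch does not shell out;
// a real setup would run the Pulumi deployment for that stack here.
type DeployFn = (stack: string) => Promise<void>;

interface DeployResult {
  stack: string;
  ok: boolean;
  error?: string;
}

// Start every stack's deploy at once and wait for all of them to settle.
// One failing stack is reported in its result rather than aborting the others.
async function deployAll(stacks: string[], deploy: DeployFn): Promise<DeployResult[]> {
  return Promise.all(
    stacks.map(async (stack): Promise<DeployResult> => {
      try {
        await deploy(stack);
        return { stack, ok: true };
      } catch (err) {
        return { stack, ok: false, error: String(err) };
      }
    }),
  );
}
```

Because the deploys overlap, total wall-clock time is bounded by the slowest stack rather than the sum of all of them, which is consistent with 9 stacks finishing in roughly the time of one long deploy.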

This compounds nicely with auto-fix PRs. Incident alert -> Alert-Fix identifies root cause -> opens a fix PR -> auto review pass -> auto merge -> auto deploy runs as a single closed loop without human involvement (covered in Part 4).

The numbers, in more detail

Unpacking the headline numbers a bit further.

Depth of the review-fix loop

Across 769 PRs in 30 days, the average per PR was 10.8 review iterations, max 56. The fact that the average is past 10 means the first review almost always surfaces at least one finding.

The embedding-model migration PR shown earlier needed 6 iterations to merge, somewhat below the 10.8 average. What would take a human reviewer days, cortex resolved in 90 minutes for that PR, and the median PR merges in 31 minutes.

What the auto reviewer typically flags

The most common findings out of the first review:

- [Graph] Missing @graph-business -- a prerequisite cpg leans on (from Part 2). The classic finding on newly added declarations.
- [Doc] Doc inconsistency -- code changed but the corresponding docs/ section was not updated.
- [Test] Weak matchers -- objectContaining weakening value assertions, single-property checks via toBe.
- [Observability] Unstructured error logs -- event field or required keys deviating from the structured-log spec.
- [Recurrence] No recurrence-prevention action -- a bug-fix PR description not declaring which of {lint / horizontal rollout / add guideline / nothing} applies.

These are categories human reviewers frequently miss in practice, especially doc consistency and recurrence-prevention checks. The AI reviewer applies them mechanically on every PR.

Actual false-positive rate

It's not zero. A few times a month we get "this is Nit, not Major" type misjudgments. The fix path is the one described above -- not a comment on the individual PR, but a guideline edit that corrects the judgment for all subsequent reviews.

What changed / Bridge to Part 4

Over the past six months, the engineer's role on cortex shifted from "writer" and "reviewer" to "operator" -- the human running the system, not acting inside each individual decision.

- AI writes the code (Claude Code)
- AI reviews the code (auto review)
- A different AI applies the fixes (author mode running on the PR author's machine)
- AI decides when to merge (auto-merge script)
- Deploys go in parallel (Turborepo + Pulumi)

What stays in human hands: "what to build at all (product / requirements)," "is this direction actually right (architectural judgment)," "which guideline to add and where," and "look at the reviews and adjust prompts and guidelines accordingly." High-abstraction work -- not individual decisions, but watching the whole system from above and steering. From human-in-the-loop to human-on-the-loop, you could say.

The widely-reported phenomena -- "AI lowers quality," "the reviewer becomes the bottleneck" -- happen when the harness is extended on the writer side only, and the reviewer side is left to humans. If writing speeds up and reviewing doesn't, of course it bottlenecks. Of course things get missed.

cortex is the opposite. We extended the harness on the reviewer side first, before fully extending it on the writer side. Anthropic's observation that the bottleneck shifts from writing to reviewing is exactly right -- which is precisely why "move the reviewer role to AI as well" is the answer cortex chose.

"The AI writes the code, the AI reviews the code." That's the core of cortex's auto-review pipeline. Quality drop and review bottleneck are functions of how far you extend the harness -- they are not inherent to AI-assisted development.

Up next in Part 4: Alert-Fix + observability + auto-added guardrails -- a pipeline where a production alert (observed via OTel/Faro/Prometheus) triggers AI investigation, an AI-authored fix PR plus a new lint/type gate, auto-review, auto-merge, and auto-redeploy. The fix and a recurrence-prevention guardrail land together, so the same class of incident structurally can't fire again. If auto review protects quality at PR time, Part 4 protects it at production time, while growing the quality gates themselves.

The headline number above includes auto-fix-flavored PRs (= Alert-Fix output). For certain classes of incidents, the fix is already merged before anyone has time to react -- that's where cortex sits today. See you next time.

The Heart of the AI Harness: A Knowledge Graph of the AI, by the AI, for the AI (Series Part 2)

Ryosuke Tsuji — Tue, 19 May 2026 14:16:20 +0000

Hi, I'm Ryan, CTO at airCloset.

Disclaimer: "cortex" and "cortex-product-graph" referenced in this article are internal code names for an AI platform developed in-house at airCloset. They are unrelated to existing commercial services such as Snowflake Cortex or Palo Alto Networks Cortex.

In Part 1 (Series Intro), I wrote about how AI handles PR reviews and incident response on top of a platform we call cortex. At the center of that flywheel is the Product Graph (implementation name: cortex-product-graph, or cpg) — a unified knowledge graph of code, docs, DB schemas, and infrastructure definitions, queryable through semantic search.

In Part 1, I described cpg at a high level: "all of cortex is indexed in one graph." This post goes deeper — how it's built, why we landed on this design, and what actually changed once it was in place.

Series Index

#	Theme	Key scene	Article
1	Series intro: cortex's harness	PRs auto-merge / incidents self-heal before you notice	ai-harness-intro
2	Product Graph (cpg)	Code, docs, DB, infra unified into one graph	this post ← you are here
3	AI PR review	webhook → AI review → auto-fix → squash merge	coming
4	Alert-Fix + observability + auto-added guardrails	Alert → AI investigates → fix PR + new lint/type gate → auto redeploy + recurrence blocked	coming
5	Scaling the harness from cortex to toC services	Non-engineer contributions in practice + scaling cortex's harness to the whole product org	coming

Start with One Scene

"I want to change the calculation logic behind the 'bug rate' KPI on the dashboard. Where is it, and what might break?" — imagine that question comes up before you touch any code.

When you ask an AI this directly, with no function name and no file path given, it hits cpg with a semantic search and pulls the relevant nodes in one shot. What comes back isn't just functions — it includes BigQuery tables and API endpoints alongside the code. And at the end of the response, there's a "next action candidates (Runbook)" block that tells the AI to re-probe starting from the BQ table with the most reads and writes flowing through it.

The final answer looks like this:

Calculation site: calculateRatePer100pt / calculateBugCount — both pure functions with no I/O side effects; safe to change in isolation
Writers (upstream): syncKpiMetrics / writeKpiMetrics / backfillKpiMetrics all write to the kpi_bug_rate_per_100pt table; these are the real aggregation batch jobs
Readers (downstream): BigQueryKpiRepository.getSummaryByDate reads via BigQuery → /kpi/bugs API → KPI dashboard page
Related docs: docs/generator/kpi.md defines bug rate; updating the code without updating docs would leave them stale

"Update the docs together, and schedule the deploy when the aggregation batch isn't running" — that's a decision you can make with confidence.

I personally know all this — I wrote it. But that's exactly the problem: anyone else who wanted to touch this had to track me down. Three months ago, "finding out where something lives and what would break" meant finding me. Now, this same investigation is done by PMO members (non-engineers) using cpg on their own. grep didn't get them there; documentation didn't get them there. One natural-language question did.

What makes that possible is cpg — a graph where you can follow "what you want to do" in plain language to the relevant nodes in one or two hops, even when you don't know the function name. The Runbook structure — where the tool's return value itself contains the next tool call to make — is what lets the AI re-select its starting point and drill deeper on its own.

That's the setup. Now let me explain how it's built.

What Static Analysis Alone Couldn't Do

cortex has a separate system that graph-analyzes the production codebase using static analysis (I'll write about this in its own post — just touching it here). It parses JS/TS code with AST analysis across our external-facing production repos, automatically extracting function call graphs, API endpoints, DB access patterns, and event pub/sub relationships.

This works well for what it does, and we still use it actively in the production repos. But when we tried applying the same approach to cortex itself, it didn't get us where we wanted to go.

Three specific gaps:

No context: nodes exist but carry no meaning. "What is this API for?" and "Why does this column exist?" aren't in the graph. Ask "where is the code that calculates the KPI bug rate?" and the search misses unless the function name happens to resemble the question.
No entry point: you already have to know the file path or function name before a search can start. "Let me go find it" doesn't work.
Explosion after 1–2 hops: starting from any node, related nodes multiply exponentially within a couple of hops, far exceeding what an AI can process in one context window. Trace results become too long to use.

The summary: mechanically accurate, but no semantic weighting. To be genuinely useful to AI, you need one more layer: "what matters, and why things are connected."

Meanwhile, DB Graph Was Working

Around the same time, a different approach — the DB Graph MCP we'd built — was working exactly as intended.

DB Graph is an MCP server with access to 15 schemas and 991 tables inside cortex, supporting semantic search over tables and columns with AI-generated descriptions. A natural-language query like "tables related to return processing confirmation" would find semantically connected nodes even when the table name doesn't contain those words.

After thinking about why this worked, the answer became clear: DB Graph has a business-context description attached to every node, and that description is what feeds into the embeddings. That semantic weight is what "finding by meaning" actually runs on.

The static-analysis code graph had none of that. Type relationships and call graphs exist, but "why this function exists" was never written anywhere.

The Hypothesis — Bring DB Graph's Essence into the Code Graph

The hypothesis was simple:

"A business-context description on every node, loaded into embeddings" — if that's the core of why DB Graph works, then doing the same thing for the code graph should structurally overcome the limits of static analysis.

The problem was: where do you put the "business context"?

The options we considered:

Location	Example	Problem
External docs	Design docs / wiki / Notion	Separate from code. Drifts instantly. Nobody maintains it.
External metadata	Sidecar YAML / `*.meta.json`	Dual-management. Breaks on rename.
Dedicated graph DB	Write annotations directly into Neo4j / Neptune	Dual-management again. Doesn't show up in PR diffs — unreviewable.
TypeScript decorator	`@GraphNode({...})` in code	Lives in the transpiled output = runtime dependency. Can't be extracted by AST alone.
DSL file	Custom `.graph` file format	High learning cost. No editor support out of the box.
JSDoc comments	`@graph-business` / `@graph-connects`	Physically co-located with the code. Extractable by AST alone. Zero runtime dependency.

The choice of JSDoc over decorators was intentional:

Zero runtime dependency: decorators survive into the transpiled output and can affect runtime behavior. JSDoc has no executable runtime semantics; with production builds that strip comments, it leaves no runtime artifact.
Generalizes beyond TypeScript: the same @graph-* syntax can extend to Pulumi definitions in infra/ and Markdown frontmatter in docs/. Decorators are locked to TypeScript syntax.
Single AST pass: ts-morph can walk declarations and extract JSDoc in one scan. Decorators sometimes require type resolution, which slows builds.
Shows up naturally in PR diffs: JSDoc sits directly above the code it annotates, so when code changes, the JSDoc diff appears in the same file. Reviewers can't miss it.
Doubles as documentation for both humans and AI: JSDoc already serves as IDE hover text and AI-readable context. Putting @graph-business there means it simultaneously explains the declaration to a human reading the code, and gives a coding AI semantic context about the surrounding functions. Graph metadata that also functions as inline documentation.

Note that the essence of this design is using parseable annotations co-located with code as the SSoT — TypeScript / JSDoc is just one implementation. The same pattern works in any language with comparable comment + AST primitives: Python docstrings + ast, Go comments + go/ast, Rust /// + syn. What matters isn't where you write the annotations, but the invariant: "physically co-located with the code, extractable by AST alone."

Same goes for the monorepo: this pattern doesn't depend on cortex being a monorepo. If anything, its real value shows when repositories are split and AI can't easily follow code across them. In a monorepo, the AI can still grep / read files across the whole tree; in a multi-repo, the cross-repo calls and data flows are the hard part to follow. Run the same build per repo, emit nodes / edges, aggregate into a central graph, and those cross-repo connections become reachable in one hop. We actually run a parallel knowledge graph over our external-facing production repos (multi-repo) using the same pattern — more on that in a separate post.

The Approach — Abandon Code Inference, Make JSDoc the SSoT

The code graph's problem was that its nodes carried no meaning. The answer is simple: embed the meaning directly in the code.

For cortex's own code graph, we completely abandoned the approach of inferring graph structure from code. Instead:

Every declaration — function / class / method / API / Page / Cron / etc. — gets a dedicated JSDoc tag. The graph is assembled from those.

This means the SSoT (Single Source of Truth) for business context becomes the code itself. There's no gap between docs and code, because the JSDoc in the code is the authoritative source. The structural problem of "AI makes mistakes because docs are stale" is resolved at the level of where the data lives.

Placing the two side by side — "a graph from code inference alone" versus "a knowledge graph with JSDoc as SSoT" — makes the difference in what's carried on each node immediately visible:

Here's a concrete example of the tags (from cpg's own source):

/**
 * Set embeddings on nodes in place.
 * Compares textForEmbedding against existing BQ data; only re-generates
 * for nodes where the text has changed.
 *
 * @graph-stack product-graph
 * @graph-domain Engineering
 * @graph-business Compares hash of textForEmbedding against existing BQ nodes; re-generates
 *   embedding only for nodes where text has changed. Unchanged nodes reuse BQ embeddings.
 * @graph-connects cortex.product_graph_nodes [queries, via:id] read existing embeddings
 * @graph-connects vertex-ai-embedding [calls] generate embeddings for changed nodes
 */
export async function generateEmbeddings(
  nodes: ProductGraphNode[],
  options: { force?: boolean } = {},
): Promise<void> { ... }

What each tag does:

Tag	Role
`@graph-node`	Explicitly declares node type (defaults to Function)
`@graph-stack`	The infra stack this declaration belongs to
`@graph-domain`	Business domain (comma-separated, multiple allowed)
`@graph-business`	What this declaration specifically does — the body of the embedding input
`@graph-connects`	Connection targets (multiple allowed; `via:` for parameter-level tracking; `none` to explicitly declare no connections)

The key is that @graph-business feeds directly into the embedding input. It's not the node name — it's a natural-language sentence that carries semantic weight into search. In practice, almost all of these sentences are written by AI: during the normal flow of writing code in cortex, the AI writes the JSDoc alongside the code (and thanks to the ESLint enforcement below, it doesn't forget).
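To make the tag syntax concrete, here is a minimal sketch of pulling @graph-* tags out of a raw JSDoc comment string. It is not cpg's parser: the real build walks declarations with ts-morph, whereas this version uses plain regular expressions so it runs without dependencies. Only the name ParsedGraphTags comes from this article; the field names, GraphConnection, parseConnection, and parseGraphTags are assumptions made for illustration.

```typescript
// Hypothetical shapes: only the name ParsedGraphTags appears in the article.
interface GraphConnection {
  target: string;    // e.g. "cortex.product_graph_nodes", or "none"
  edgeType?: string; // e.g. "queries", "calls"
  via?: string;      // parameter named after "via:"
  note: string;      // free text after the bracket
}

interface ParsedGraphTags {
  node?: string;
  stack?: string;
  domains: string[];
  business?: string;
  connects: GraphConnection[];
}

function parseConnection(value: string): GraphConnection {
  // Assumed shape: <target> [<edgeType>, via:<param>] <free text>; the bracket is optional.
  const m = /^(\S+)(?:\s*\[([^\]]*)\])?\s*(.*)$/.exec(value);
  const conn: GraphConnection = { target: m?.[1] ?? value, note: m?.[3] ?? "" };
  for (const part of (m?.[2] ?? "").split(",").map((p) => p.trim())) {
    if (part.startsWith("via:")) conn.via = part.slice(4);
    else if (part !== "") conn.edgeType = part;
  }
  return conn;
}

function parseGraphTags(comment: string): ParsedGraphTags {
  const tags: ParsedGraphTags = { domains: [], connects: [] };
  // Drop the /** */ frame and the leading " * " of every line.
  const body = comment
    .replace(/^\s*\/\*\*/, "")
    .replace(/\*\/\s*$/, "")
    .split("\n")
    .map((line) => line.replace(/^\s*\*?\s?/, ""))
    .join("\n");
  // Split before each line that starts with "@", so continuation lines
  // stay attached to the tag they belong to.
  for (const chunk of body.split(/\n(?=@)/)) {
    const m = /^@graph-(\w+)\s*([\s\S]*)$/.exec(chunk.trim());
    if (!m) continue; // plain description, or a non-graph tag such as @param
    const name = m[1] ?? "";
    const value = (m[2] ?? "").replace(/\s+/g, " ").trim();
    if (name === "node") tags.node = value.replace(/[{}]/g, "");
    else if (name === "stack") tags.stack = value;
    else if (name === "domain") {
      tags.domains = value.split(",").map((d) => d.trim()).filter((d) => d !== "");
    } else if (name === "business") tags.business = value;
    else if (name === "connects") tags.connects.push(parseConnection(value));
  }
  return tags;
}

// The generateEmbeddings comment from this article, slightly shortened.
const sampleComment = `/**
 * Set embeddings on nodes in place.
 *
 * @graph-stack product-graph
 * @graph-domain Engineering
 * @graph-business Compares hash of textForEmbedding against existing BQ nodes; re-generates
 *   embedding only for nodes where text has changed. Unchanged nodes reuse BQ embeddings.
 * @graph-connects cortex.product_graph_nodes [queries, via:id] read existing embeddings
 * @graph-connects vertex-ai-embedding [calls] generate embeddings for changed nodes
 */`;

const sampleTags = parseGraphTags(sampleComment);
```

Here sampleTags.business holds the two-line description joined into one sentence, which is the text that would go on to the embedding step.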

Making Omissions Physically Impossible

This design collapses the moment someone leaves a tag out. One function without @graph-business = that function is invisible to semantic search. One without @graph-connects = the data flow through that function is absent from the graph.

So we built enforcement that makes omissions physically impossible:

5 ESLint plugins — tag presence validation, syntax validation, naming convention enforcement (stack / domain allowlists), @graph-connects required, @graph-connects none misuse detection (flags when none appears on code that calls external services)
Automated PR review (Part 1 ③): missing tags are flagged as [Graph] Critical; doc inconsistencies are flagged as [Doc] Critical

The result: "write a declaration → business context is always written with it" holds as an invariant. Add a function → its meaning and connections are necessarily in its JSDoc.
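The checks listed above can be sketched as one plain validation function. This is not the ESLint plugin code and does not use the ESLint rule API; it only mirrors the five checks described here. DeclarationTags, validateGraphTags, the callsExternalService flag, and the allowlist contents are all assumptions for illustration.

```typescript
// Hypothetical input: tags already extracted for one declaration.
interface DeclarationTags {
  stack?: string;
  domains: string[];
  business?: string;
  connects: string[];            // raw targets; ["none"] declares "no connections"
  callsExternalService: boolean; // assumed to be derived from the AST elsewhere
}

// Illustrative allowlists; the real lists are not shown in this article.
const ALLOWED_STACKS = new Set(["product-graph", "code-graph", "mcp-db-graph-server"]);
const ALLOWED_DOMAINS = new Set(["Engineering"]);

function validateGraphTags(tags: DeclarationTags): string[] {
  const errors: string[] = [];

  // Tag presence and naming allowlists.
  if (!tags.stack) errors.push("missing @graph-stack");
  else if (!ALLOWED_STACKS.has(tags.stack)) errors.push("unknown stack: " + tags.stack);

  if (tags.domains.length === 0) errors.push("missing @graph-domain");
  for (const domain of tags.domains) {
    if (!ALLOWED_DOMAINS.has(domain)) errors.push("unknown domain: " + domain);
  }

  if (!tags.business || tags.business.trim() === "") errors.push("missing @graph-business");

  // @graph-connects is required; "none" must be stated explicitly.
  if (tags.connects.length === 0) {
    errors.push("missing @graph-connects (write 'none' if there are no connections)");
  }

  // "none" misuse: combined with real targets, or used on code that calls out.
  if (tags.connects.includes("none")) {
    if (tags.connects.length > 1) errors.push("@graph-connects none mixed with other targets");
    if (tags.callsExternalService) errors.push("@graph-connects none on code that calls an external service");
  }
  return errors;
}
```

An empty array means the declaration passes; anything else is the kind of message a lint rule or reviewer would surface.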

One honest note: forcing humans to write up to five JSDoc tags on every declaration would blow up in code review within three days. Writing a @graph-business sentence per function, enumerating @graph-connects exhaustively, checking the naming allowlists: that's genuinely tedious at scale.

This works because AI writes the code. Writing four required JSDoc tags (plus optional @graph-node when the default Function type isn't enough) is rounding error on top of writing the code itself. With ESLint and automated review in the feedback loop, the AI doesn't miss tags — and human reviewers only need to check "is this tag factually correct?" not "is it there?"

This design is one that can't realistically be maintained when humans write code, but becomes viable the moment AI does. It's an AI-first design. The premise of AI-first development is what lets business context be fixed in code as the SSoT.

Where Hallucination Happens Shifts

Viewed from another angle, what's going on here is that the location of hallucination shifts. Where you contain hallucination is, I think, fundamental to AI harness design.

As I wrote elsewhere, when you combine AI with a graph system, "hallucination doesn't disappear — it just changes location." For cpg, here's where it lands:

Graph build / query phase: No fresh LLM generation. Once reviewed metadata lands in the graph, the ts-morph AST pass, the BigQuery MERGE, and the MCP query responses are all deterministic.
JSDoc writing phase: This is the entry point for hallucination. Whether @graph-business is factually accurate, or whether @graph-connects is exhaustively listed — these can go wrong since the AI is writing them.

But the entry point is locked down by automated PR review. Missing tags get [Graph] Critical; factual drift gets [Doc] Critical. When something's wrong, either the AI that wrote the code or another reviewer AI catches it and fixes it.

The result: once data lands in the graph, it can be treated as deterministically sourced from reviewed code, not as a fresh generated answer that might hallucinate on every query. AI agents calling cpg don't have to guard against "this might be a generated lie" on every returned node or edge. The tools can be designed as "return facts only" without compromise.

Build — AST to Graph via ts-morph

Once JSDoc is established as the SSoT, the rest is mechanics: extract it and assemble the graph. The implementation:

AST-analyze JS/TS with ts-morph — walk every declaration (function / class / method / type / enum / variable / expression statement / export default / etc.)
Extract @graph-* tags from JSDoc — collect the four required tags plus optional @graph-node and normalize into a ParsedGraphTags structure
Generate nodes — use qualifiedName = "<filePath>:<name>" as the node ID
Generate edges — one edge per @graph-connects entry, with via: / cardinality and other metadata preserved
Generate embeddings — send @graph-business text to Vertex AI Embedding (gemini-embedding-2) and vectorize it
Load into BigQuery — MERGE all nodes / edges into cortex.product_graph_nodes / cortex.product_graph_edges
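The node and edge generation steps can be sketched roughly as follows. The qualifiedName format, the Function default, and the { kind: "node" } / { kind: "edge" } ndjson shape come from this article; ParsedDeclaration, buildGraph, toNdjson, the other field names, and the placeholder business text are assumptions, and target resolution is deliberately left out (targets stay raw strings).

```typescript
// Hypothetical input: one entry per declaration, tags already parsed.
interface ParsedDeclaration {
  filePath: string;
  name: string;
  nodeType?: string; // from @graph-node; omitted means Function
  business: string;
  connects: { target: string; edgeType: string; via?: string }[];
}

interface GraphNode {
  kind: "node";
  id: string;
  nodeType: string;
  textForEmbedding: string;
}

interface GraphEdge {
  kind: "edge";
  sourceId: string;
  targetId: string; // raw target string; resolving it is a later step
  edgeType: string;
  via?: string;
}

function buildGraph(decls: ParsedDeclaration[]): { nodes: GraphNode[]; edges: GraphEdge[] } {
  const nodes: GraphNode[] = [];
  const edges: GraphEdge[] = [];
  for (const decl of decls) {
    // qualifiedName = "<filePath>:<name>" doubles as the node ID.
    const id = decl.filePath + ":" + decl.name;
    nodes.push({
      kind: "node",
      id,
      nodeType: decl.nodeType ?? "Function",
      textForEmbedding: decl.business,
    });
    for (const conn of decl.connects) {
      if (conn.target === "none") continue; // explicit "no connections"
      const edge: GraphEdge = { kind: "edge", sourceId: id, targetId: conn.target, edgeType: conn.edgeType };
      if (conn.via !== undefined) edge.via = conn.via;
      edges.push(edge);
    }
  }
  return { nodes, edges };
}

// One JSON object per line, nodes first, then edges.
function toNdjson(graph: { nodes: GraphNode[]; edges: GraphEdge[] }): string {
  const records: (GraphNode | GraphEdge)[] = [...graph.nodes, ...graph.edges];
  return records.map((record) => JSON.stringify(record)).join("\n");
}

const exampleGraph = buildGraph([
  {
    filePath: "apps/graph/product/src/parsers/jsdoc-parser.ts",
    name: "parseJSDocExports",
    business: "Placeholder description used only for this sketch.",
    connects: [
      { target: "extractTagsFromNode", edgeType: "calls" },
      { target: "filesystem", edgeType: "reads_from", via: "filePath" },
    ],
  },
]);
```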

Because @graph-business goes directly into the embedding input, querying "code that calculates the KPI bug rate" in natural language returns a hit based on semantic proximity of the description — even when the function name contains neither "bug" nor "rate."
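"Semantic proximity" here means vector similarity between the query embedding and each node's description embedding. A toy sketch with hand-made three-dimensional vectors (real embeddings have far more dimensions and come from the embedding model; the vectors, cosineSimilarity, and rankBySimilarity below are made up for illustration):

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as unrelated
  return dot / Math.sqrt(normA * normB);
}

interface EmbeddedNode {
  id: string;
  embedding: number[];
}

// Rank nodes by similarity to the query, highest first.
function rankBySimilarity(query: number[], nodes: EmbeddedNode[], topK: number): { id: string; score: number }[] {
  return nodes
    .map((node) => ({ id: node.id, score: cosineSimilarity(query, node.embedding) }))
    .sort((p, q) => q.score - p.score)
    .slice(0, topK);
}

// Toy vectors: pretend axis 0 means "KPI calculation" and axis 1 means "embedding pipeline".
const toyNodes: EmbeddedNode[] = [
  { id: "calculateRatePer100pt", embedding: [1, 0, 0] },
  { id: "generateEmbeddings", embedding: [0, 1, 0] },
];
const toyQuery = [0.9, 0.1, 0]; // stands in for "code that calculates the KPI bug rate"
const toyRanking = rankBySimilarity(toyQuery, toyNodes, 1);
```

The function name never enters the comparison; only the vectors derived from the descriptions do.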

The overall flow: the three tracks (apps/ / infra/ / docs/) each go through their own parser, are merged into a single node set by the generator, and only nodes whose text has changed are sent to Vertex AI before being stored in BigQuery:

Build Cost Is Effectively Zero

The build runs automatically on push to main via GitHub Actions, using a differential embedding approach:

Compare textForEmbedding of each BQ node against the new text
Unchanged nodes reuse their existing BQ embeddings
Only changed nodes go to Vertex AI

A typical push changes a few dozen nodes, so cost is under $0.001. Full regeneration (for recovery, triggered via workflow_dispatch) is ~$0.075 for 8,000+ nodes.
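A rough sketch of the differential selection, plus a back-of-the-envelope check on the cost figures above. It compares the text directly rather than a hash, and the per-node price is simply derived from the two numbers in this section ($0.075 for 8,000 nodes), not taken from a price list. planEmbeddings, estimateCostUsd, and the interfaces are hypothetical names.

```typescript
interface FreshNode {
  id: string;
  textForEmbedding: string;
  embedding?: number[];
}

interface StoredNode {
  textForEmbedding: string;
  embedding: number[];
}

// Reuses stored embeddings in place and returns only the nodes that
// still need a call to the embedding API.
function planEmbeddings(fresh: FreshNode[], stored: Map<string, StoredNode>, force = false): FreshNode[] {
  const toRegenerate: FreshNode[] = [];
  for (const node of fresh) {
    const previous = stored.get(node.id);
    if (!force && previous !== undefined && previous.textForEmbedding === node.textForEmbedding) {
      node.embedding = previous.embedding; // unchanged text: reuse
    } else {
      toRegenerate.push(node); // new node, changed text, or forced rebuild
    }
  }
  return toRegenerate;
}

// Derived from the article's own figures: about $0.075 per 8,000 nodes.
const COST_PER_NODE_USD = 0.075 / 8000;

function estimateCostUsd(changedNodes: number): number {
  return changedNodes * COST_PER_NODE_USD;
}
```

At that rate, even 100 changed nodes come to roughly $0.0009, which is consistent with "under $0.001" for a typical push of a few dozen nodes.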

Why BigQuery, Not a Graph Database

When people hear "knowledge graph," they often imagine a dedicated graph DB (Neo4j, Neptune, Memgraph, etc.). cortex runs on just two BigQuery tables (product_graph_nodes / product_graph_edges). Three reasons:

Different cost structure — dedicated graph DBs set a floor of "always-on cluster cost"; for the current implementation, BQ is storage + on-demand queries only. Even with continuous AI traffic, it's clearly cheaper than running a server 24/7.
Vector search / cosine similarity / SQL in the same place — BQ has VECTOR_SEARCH and ML.DISTANCE, so semantic search over @graph-business embeddings, filter by node properties, and adjacent-node JOINs can all live in one query. That matters when "semantic search + property filter + neighbor JOIN" is the standard access pattern.
Migration-ready for GQL once BQ Graph goes GA — BQ already has Graph in BigQuery in Preview; once it ships GA, you can put a graph view over the existing tables and likely shift to MATCH (n)-[e]->(m) queries in GQL. The current table design is already migration-ready.
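To illustrate the second point, here is a hedged sketch of a helper that assembles one query combining vector search, a property filter, and a neighbor JOIN. VECTOR_SEARCH and its top_k / distance_type arguments are BigQuery features; the column names embedding, sourceId, and edgeType, the @queryEmbedding / @nodeType parameters, and the function buildSemanticSearchSql are assumptions, and the SQL has not been run against cortex's actual tables.

```typescript
interface SearchOptions {
  topK: number;
  nodeType?: string; // optional property filter, bound as @nodeType
}

// Builds the SQL text only; executing it (and binding @queryEmbedding as
// an ARRAY<FLOAT64> parameter) is left to whichever BigQuery client is in use.
function buildSemanticSearchSql(options: SearchOptions): string {
  if (!Number.isInteger(options.topK) || options.topK <= 0) {
    throw new Error("topK must be a positive integer");
  }
  const lines = [
    "SELECT base.id AS id, base.nodeType AS nodeType, distance, e.edgeType, e.targetId",
    "FROM VECTOR_SEARCH(",
    "  TABLE cortex.product_graph_nodes, 'embedding',",
    "  (SELECT @queryEmbedding AS embedding),",
    "  top_k => " + options.topK + ", distance_type => 'COSINE')",
    // Neighbor JOIN: pull outgoing edges of every hit in the same query.
    "LEFT JOIN cortex.product_graph_edges AS e ON e.sourceId = base.id",
  ];
  if (options.nodeType !== undefined) {
    lines.push("WHERE base.nodeType = @nodeType");
  }
  lines.push("ORDER BY distance");
  return lines.join("\n");
}
```

The point of the sketch is the shape: one statement, one system, no second store to keep in sync.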

In short: get the graph DB's future strength (GQL) while running on plain BQ tables today. Compared to adding a graph DB on top of a generic RAG stack (pgvector / Pinecone / etc.), there are fewer systems to operate and less to learn.

The Core Part Is Available as an Open-Source Sample

The "parse JSDoc annotations with AST analysis and output a graph" part is small enough to reproduce cleanly, so I published it as a working sample:

🔗 graph-jsdoc-extractor

It's a ~500-line library that extracts @graph-* and outputs ndjson of { kind: "node", ... } / { kind: "edge", ... } objects. Comes with a pnpm run example that runs end-to-end. For those who just want to see the output format without cloning, the built ndjson is checked in: examples/sample/output.ndjson.

This is intentionally just the "turn code into a graph" part. The real value in cortex starts when docs and DB schemas land on the same graph — that's the next section.

Connections — Landing Docs and DB on the Same Graph

Looking at the sample ndjson, a @graph-connects users [reads_from, via:id] entry has users stored as a raw string in targetId. Leaving that as-is means it's just a string. Resolving users into a rich node carrying column definitions, partition info, and per-column descriptions — that's where the resolution power of search takes a real step forward.

cortex does this in three directions.

1. DB Schemas as Nodes in the Same Graph

cpg ingests not just code but cortex's DB schemas in the same build. A @graph-connects users [queries, via:id] on the code side gets resolved at build time into a rich Table node carrying column definitions, partition metadata, and descriptions (if the same-named stub exists, its internals are replaced while its ID and all inbound edges survive).
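A minimal sketch of that stub-to-rich promotion, assuming the stub's ID is simply the table name and that "internals" means node type plus properties. cortex's real resolver is more involved; promoteStubs and the interfaces below are hypothetical.

```typescript
interface GraphNode {
  id: string;
  nodeType: string;
  properties: Record<string, unknown>;
}

interface GraphEdge {
  sourceId: string;
  targetId: string;
  edgeType: string;
}

// Replaces a stub's internals with the rich node of the same ID.
// Edges are never touched: they reference nodes by ID, so every inbound
// edge to the stub now points at the rich node.
function promoteStubs(nodes: GraphNode[], richTables: GraphNode[]): GraphNode[] {
  const richById = new Map<string, GraphNode>();
  for (const rich of richTables) richById.set(rich.id, rich);

  const merged = nodes.map((node) => {
    const rich = richById.get(node.id);
    return rich ? { id: node.id, nodeType: rich.nodeType, properties: rich.properties } : node;
  });

  // Rich tables that no code references yet are still added to the graph.
  const knownIds = new Set(nodes.map((node) => node.id));
  for (const rich of richTables) {
    if (!knownIds.has(rich.id)) merged.push(rich);
  }
  return merged;
}

// Example: a code node, the stub created by its @graph-connects, and the
// rich Table node coming from the schema definition.
const codeSideNodes: GraphNode[] = [
  { id: "apps/example.ts:loadUser", nodeType: "Function", properties: {} }, // hypothetical path
  { id: "users", nodeType: "BigQueryTable", properties: {} },
];
const schemaSideNodes: GraphNode[] = [
  { id: "users", nodeType: "Table", properties: { columns: ["id", "email"], description: "Registered users" } },
];
const exampleEdges: GraphEdge[] = [
  { sourceId: "apps/example.ts:loadUser", targetId: "users", edgeType: "queries" },
];
const promoted = promoteStubs(codeSideNodes, schemaSideNodes);
```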

The key point: table and column descriptions aren't AI-generated annotations attached after the fact — they're pulled directly from the description fields in the Pulumi schema definitions. Here's what that looks like (excerpt from cpg's own table definition):

export const productGraphNodesTable = new gcp.bigquery.Table('cortex-prod-product-graph-nodes', {
  datasetId: 'cortex',
  tableId: 'product_graph_nodes',
  description:
    'Product Graph nodes — unified knowledge graph of code + DB + docs. ' +
    'Auto-generated from JSDoc @graph-* tags',
  schema: JSON.stringify([
    { name: 'id', type: 'STRING', mode: 'REQUIRED',
      description: 'Unique node ID (graphId:nodeType:filePath:name format)' },
    { name: 'nodeType', type: 'STRING', mode: 'REQUIRED',
      description: 'Node type — ApiEndpoint, BigQueryTable, Function, Module, Document, etc.' },
    { name: 'qualifiedName', type: 'STRING',
      description: 'Fully qualified name — filePath:exportName format' },
    // ...
  ]),
});

Both the table-level and column-level descriptions are taken directly from the Pulumi definition and become the embedding input for semantic search. The same philosophy as cpg's JSDoc, "write the description at the place the thing is defined," runs all the way through the DB layer. Fix a Pulumi description → semantic search improves. Same mechanics as fixing a JSDoc.

2. Docs Auto-Promoted to Nodes via Directory Convention

Markdown files under docs/ also land in the graph. The mechanism is simple: the directory structure is conventionalized so that which stack and domain each doc belongs to is deterministically resolvable:

docs/{category}/{name}.md

Examples from cpg itself:

docs/product-graph/README.md → stack: product-graph, domain: Engineering
docs/code-graph/README.md → stack: code-graph, domain: Engineering
docs/mcp/db-graph/README.md → stack: mcp-db-graph-server, domain: Engineering

Each file is ingested as a Document node in the graph, and a documented_by edge is auto-generated from code nodes whose @graph-stack matches the doc's stack. Code under apps/graph/product/ all carries @graph-stack product-graph, so it's automatically linked to docs/product-graph/README.md. Change code → related docs are already linked.
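The linking step can be sketched as a join on stack name. The three path-to-stack pairs are the examples given above; how cortex actually derives a stack from a path is not spelled out here (the third example shows it is not just the directory name), so this sketch uses an explicit lookup table. linkDocs, CodeNode, DocEdge, and the second code node are hypothetical.

```typescript
// Doc path -> stack, copied from the three examples in this section.
const DOC_STACKS: Record<string, string> = {
  "docs/product-graph/README.md": "product-graph",
  "docs/code-graph/README.md": "code-graph",
  "docs/mcp/db-graph/README.md": "mcp-db-graph-server",
};

interface CodeNode {
  id: string;
  stack: string; // value of @graph-stack
}

interface DocEdge {
  sourceId: string;
  targetId: string;
  edgeType: "documented_by";
}

// Emits one documented_by edge per (code node, doc) pair sharing a stack.
function linkDocs(codeNodes: CodeNode[], docStacks: Record<string, string>): DocEdge[] {
  const edges: DocEdge[] = [];
  for (const docPath of Object.keys(docStacks)) {
    const stack = docStacks[docPath];
    for (const node of codeNodes) {
      if (node.stack === stack) {
        edges.push({ sourceId: node.id, targetId: docPath, edgeType: "documented_by" });
      }
    }
  }
  return edges;
}

const docEdges = linkDocs(
  [
    { id: "apps/graph/product/src/parsers/jsdoc-parser.ts:parseJSDocExports", stack: "product-graph" },
    { id: "apps/example/unrelated.ts:doSomething", stack: "some-other-stack" }, // hypothetical node
  ],
  DOC_STACKS,
);
```

Because the edge is derived from two things that already exist (the tag and the path), nobody has to remember to maintain the link by hand.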

This means an AI reviewer can answer "did this code change leave related docs stale?" in one graph hop (that's the source of the [Doc] Critical comments from Part 1).

3. Infrastructure Definitions as Nodes

@graph-* tags go on Pulumi code in infra/ too. An example from cortex's own graph infrastructure:

/**
 * @graph-node {CronSchedule}
 * @graph-stack code-graph
 * @graph-domain Engineering
 * @graph-business graph-boundary-daily: runs cross-repository boundary analysis at 7:00 AM JST
 *   daily (auto-detecting API, DB, and Event connections across repos)
 * @graph-connects graph-index-job [triggers] trigger Cloud Run Job
 */
new gcp.cloudscheduler.Job(`${prefix}-graph-boundary-schedule`, { ... });

This becomes a CronSchedule node in the graph, connected to the target CloudRunJob node by a triggers edge. The Pulumi definition is itself a graph entry point — "what code runs in this cron?" is now answerable by graph traversal.

Result: Four Layers on One Graph

Adding the three together, the node types in the graph look like this:

Node type	Source
Function / Class / Method	Code (JSDoc)
ApiEndpoint / Page	Code (JSDoc `@graph-node`)
BigQueryTable / FirestoreCollection (stub)	Code `@graph-connects` targets
Table / Column / Schema (rich)	Schema files defined in Pulumi
Document	Directory parser over `docs/`
CronSchedule / PubSubTopic / CloudRunService	`infra/` JSDoc

Edge types correspondingly:

Edge type	Role
calls / queries / reads_from / writes_to / publishes / triggers	code → other nodes (`@graph-connects`)
documented_by	code → Document (auto-generated on stack match)
HAS_TABLE / HAS_COLUMN	Schema → Table → Column (DB side)
shares_topic	Between boundary nodes sharing a topic

Code ↔ DB ↔ docs ↔ infra — all reachable in one hop on the same graph. This is what "Product Graph" means: cortex's unified knowledge graph.

Here's an actual visualization of a slice of cpg itself. Starting from generateEmbeddings (code), you can see cortex.product_graph_nodes (BigQueryTable) with its columns, the Pulumi table definition resource, docs/product-graph/README.md, external services like Vertex AI, and a separate layer's graph-boundary-daily (CronSchedule) — all connected by edges on the same node set:

Where the Sample Stops

graph-jsdoc-extractor intentionally leaves out:

Resolving @graph-connects targets to real node IDs (cortex uses a seven-stage resolver; the rules are project-specific)
Same-name merging (cortex promotes DB-schema-side rich nodes to replace stubs; the merge source is project-specific)
The docs directory convention parser (cortex's docs/{category}/{name}.md convention is cortex-specific)
Embedding generation (Vertex AI setup is up to you)

These are parts where the right answer differs per project — naming conventions, where docs live, which embedding model to use, when to promote a stub to a rich node. Baking one answer into the sample library would make it harder to use, not easier. The sample draws the line at JSDoc → graph structure, and this article's job is "here's how we did it in cortex — translate it to your project's context."

MCP Tool Design and the Runbook Pattern

The graph is now assembled. Next: how AI uses it.

cpg runs as an MCP server (cortex-product-graph). From the AI's side, three tools are visible, applying the three-layer tool design (search / detail / traverse) from the Agentic Graph RAG MCP post directly to cpg:

Tool	Role
`search_product_graph_nodes`	Find entry points (vector search + name search)
`get_product_graph_node_detail`	Deterministically fetch detail by ID
`trace_product_graph_connections`	BFS subgraph traversal (`via_filter` for parameter-level tracking)

The three layers only show you what's in the graph. For jumping from graph nodes to the actual data they point to, supplementary tools live in the same MCP:

Supplementary tool	Role
`read_file`	Pass a node's `path` property directly to fetch source (Function / Class / Method / ApiEndpoint / Document — any code-origin node carries `path`)
`grep_code`	Pattern search across the repository
`git_blame`	Last author, commit, and timestamp per line
`query_product_graph_bq`	Direct SQL against BigQuery. Find a BQTable node in the graph, then jump to its live data (executed via user OAuth, so BQ IAM applies as-is)
`read_firestore` / `write_firestore`	Read/write Firestore collections. Find a FirestoreCollection node in the graph, then go to the live documents (Firestore access follows the same user / environment permission boundary; cpg provides the entry point, not a bypass around IAM)
`list_product_graph_stacks` / `list_product_graph_domains`	Lists all stack / domain names present in the graph; useful for orienting before a search

In other words, cpg's MCP is a two-tier design: the three-layer structure for graph traversal + supplementary tools for descending into live data (source code / BQ / Firestore). The AI can do "search by meaning → traverse by structure → pull live data" entirely within one MCP server.
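For the traverse layer, a breadth-first walk with a direction, a depth cap, and a via filter can be sketched like this. It runs over an in-memory edge list, whereas cpg queries BigQuery; traceConnections and its parameter names are assumptions modeled on the tool description, and the last sample edge is invented to give the walk a second hop.

```typescript
type Direction = "forward" | "backward" | "both";

interface Edge {
  sourceId: string;
  targetId: string;
  edgeType: string;
  via?: string;
}

// Breadth-first traversal that returns every edge reached within maxDepth hops.
function traceConnections(
  edges: Edge[],
  startNode: string,
  direction: Direction,
  maxDepth: number,
  viaFilter?: string,
): Edge[] {
  const visited = new Set<string>([startNode]);
  const seenEdges = new Set<Edge>();
  const result: Edge[] = [];
  let frontier: string[] = [startNode];

  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const nodeId of frontier) {
      for (const edge of edges) {
        if (viaFilter !== undefined && edge.via !== viaFilter) continue;
        let neighbor: string | undefined;
        if (direction !== "backward" && edge.sourceId === nodeId) neighbor = edge.targetId;
        else if (direction !== "forward" && edge.targetId === nodeId) neighbor = edge.sourceId;
        if (neighbor === undefined) continue;
        if (!seenEdges.has(edge)) {
          seenEdges.add(edge);
          result.push(edge);
        }
        if (!visited.has(neighbor)) {
          visited.add(neighbor); // the visited set keeps cycles from looping forever
          next.push(neighbor);
        }
      }
    }
    frontier = next;
  }
  return result;
}

const sampleEdges: Edge[] = [
  { sourceId: "parseJSDocExports", targetId: "extractDeclarationsFromFile", edgeType: "calls" },
  { sourceId: "parseJSDocExports", targetId: "extractTagsFromNode", edgeType: "calls" },
  { sourceId: "parseJSDocExports", targetId: "filesystem", edgeType: "reads_from", via: "filePath" },
  { sourceId: "parseJSDocExports", targetId: "docs/product-graph/README.md", edgeType: "documented_by" },
  // Invented edge, only here to demonstrate a second hop.
  { sourceId: "extractTagsFromNode", targetId: "applyGraphTag", edgeType: "calls" },
];
```

The depth cap is what keeps a trace inside one context window: raising maxDepth widens coverage, at the price of the hop explosion described earlier.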

Runbook Pattern — Return Values Contain the Next Action

Every MCP response ends with a "related nodes (next action candidates)" block. For example, after a search returns:

3 nodes found:
- apps/generator/kpi/src/kpi-calculator.ts:calculateBugCount (Function)
- backlog_no_embedding.kpi_bug_rate_per_100pt (BigQueryTable)
- /kpi/bugs (ApiEndpoint)

## Related nodes (next action candidates)

### 🛠 Code (1)
- apps/generator/kpi/src/kpi-calculator.ts:calculateBugCount
  → `get_product_graph_node_detail("apps/generator/kpi/src/kpi-calculator.ts:calculateBugCount")`

### 🗄 DB tables (1)
- backlog_no_embedding.kpi_bug_rate_per_100pt
  → `trace_product_graph_connections(start_node: "backlog_no_embedding.kpi_bug_rate_per_100pt", direction: "backward")`

### 🌐 API (1)
- /kpi/bugs
  → `get_product_graph_node_detail("/kpi/bugs")`

Copy-pasteable tool calls are lined up by node type, showing exactly what to call next. The AI gets new options on every call, so it never has to figure out "what should I do now?"
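A simplified sketch of how such a block could be rendered from a result set. The tool names and the suggestions (detail for code and API nodes, a backward trace for tables) follow the sample output above; the section titles omit the icons, and renderRunbook, suggestNextCall, and SECTION_TITLES are hypothetical.

```typescript
interface ResultNode {
  id: string;
  nodeType: string;
}

// Picks the follow-up call for one node, mirroring the sample output:
// tables get a backward trace, everything else gets a detail fetch.
function suggestNextCall(node: ResultNode): string {
  if (node.nodeType === "BigQueryTable") {
    return 'trace_product_graph_connections(start_node: "' + node.id + '", direction: "backward")';
  }
  return 'get_product_graph_node_detail("' + node.id + '")';
}

const SECTION_TITLES: Record<string, string> = {
  Function: "Code",
  BigQueryTable: "DB tables",
  ApiEndpoint: "API",
};

function renderRunbook(nodes: ResultNode[]): string {
  // Group by section title, keeping first-seen order.
  const groups = new Map<string, ResultNode[]>();
  for (const node of nodes) {
    const title = SECTION_TITLES[node.nodeType] ?? node.nodeType;
    const members = groups.get(title) ?? [];
    members.push(node);
    groups.set(title, members);
  }
  const out: string[] = ["## Related nodes (next action candidates)"];
  groups.forEach((members, title) => {
    out.push("", "### " + title + " (" + members.length + ")");
    for (const node of members) {
      out.push("- " + node.id, "  → " + suggestNextCall(node));
    }
  });
  return out.join("\n");
}

const runbookText = renderRunbook([
  { id: "apps/generator/kpi/src/kpi-calculator.ts:calculateBugCount", nodeType: "Function" },
  { id: "backlog_no_embedding.kpi_bug_rate_per_100pt", nodeType: "BigQueryTable" },
  { id: "/kpi/bugs", nodeType: "ApiEndpoint" },
]);
```

The design choice worth noting is that the suggestions are computed from node types alone, so the block stays deterministic like the rest of the response.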

Here's the AI ↔ MCP loop in diagram form. The MCP bundles next action candidates into every search response; the AI picks one and makes the next call, repeating:

`usecase` Parameter — Switching the Runbook

Every tool accepts a usecase parameter where the AI declares what kind of investigation it's doing:

usecase	Strategy (summary of what cpg optimizes for)
`general`	Basic investigation with unknown entry point. Default.
`design`	Understanding existing feature structure. Read business / connections via `get_product_graph_node_detail`. Deep trace is unnecessary; Document nodes take priority.
`impact`	Trace upstream and downstream impact deeply. Hit `trace_product_graph_connections` with `direction=both` / `max_depth=5`. Code + DB + infra + schedules are all on the same graph, so one traversal covers a wide area.
`test-create`	Test design. Fetch detail to read parameters and connected DB / called functions.
`test-review`	Compare existing tests against implementation coverage. Cross-check branch structure of target Function / Method against test case count.
`code-review`	Check impact of changes and detect `@graph-business` violations. Trace impact → detail to check business / source.
`bug`	Deep trace from error origin. `direction=both` / `max_depth=5` for upstream callers + downstream data flow.

The same search_product_graph_nodes call with usecase: "code-review" returns next action candidates optimized for "verify the change's impact first." With usecase: "bug" it returns candidates optimized for "trace deep from error origin + fetch logs." The Runbook switches to match the declared intent.

This matters because having the AI declare "what kind of investigation I'm doing" yields different angles from the same graph. Auto Review internally fires with code-review; Alert-Fix fires with bug — the flywheel elements from Part 1 each run a different Runbook.
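One way to picture the switch is a lookup from the declared usecase to trace parameters. Only the impact and bug rows of the table state concrete values (direction=both, max_depth=5); the shallow default used for every other usecase below is my own placeholder, not cpg's actual setting, and traceParamsFor is a hypothetical name.

```typescript
type Usecase =
  | "general"
  | "design"
  | "impact"
  | "test-create"
  | "test-review"
  | "code-review"
  | "bug";

interface TraceParams {
  direction: "forward" | "backward" | "both";
  maxDepth: number;
}

// Stated in the table above for impact and bug.
const DEEP_TRACE: TraceParams = { direction: "both", maxDepth: 5 };

// Assumption for illustration: the table gives no numbers for the other usecases.
const SHALLOW_TRACE: TraceParams = { direction: "forward", maxDepth: 1 };

function traceParamsFor(usecase: Usecase): TraceParams {
  switch (usecase) {
    case "impact":
    case "bug":
      return DEEP_TRACE;
    default:
      return SHALLOW_TRACE;
  }
}
```

Usage: traceParamsFor("bug") yields a deep trace in both directions, while traceParamsFor("design") stays shallow because that usecase reads details and Document nodes rather than tracing deeply.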

CLAUDE.md Convention — Forcing AI to Always Hit cpg First

Throughout this post I've said "the AI uses cpg," but AI doesn't spontaneously choose cpg. Claude Code defaults to grep / glob / file read as its first instinct. To flip that, the root CLAUDE.md in cortex opens with:

Product Graph MCP (cortex-product-graph)

This is the single most important asset in this repository. cortex-product-graph MCP indexes all code, DB schemas, docs, and infra into a unified knowledge graph with business context. It knows everything about this repository.

Always query Product Graph MCP first before grep/glob/file reads. It returns richer, contextualized results.

If Product Graph MCP is unavailable (auth expired, server down) and you are NOT in autonomous/auto mode, stop all work immediately and ask the user to authenticate. Do not proceed with degraded grep-only investigation.

Two things matter here. First, the explicit ordering: cpg first, grep / glob / file reads only after. Second, when cpg is unavailable (outside autonomous mode), falling back to grep is explicitly forbidden. Without that second clause, the AI happily degrades to "cpg seems down, I'll just grep" and proceeds with stale context and wrong assumptions. With it, cpg unavailability is a hard stop, not a graceful degradation.

One clause in CLAUDE.md, and Claude Code's first move on any code investigation is pinned to cpg. Article writing, Auto Review, Alert-Fix — all follow the same convention, so the entry point is always unified.

A Live Example — Investigating cpg with cpg

Enough abstraction. Let me walk through a real cpg query: using cpg to investigate cpg's own builder core — the meta-example.

Step 1: Semantic search for "the code that extracts graph source data from code annotations"

No function name assumed. Just the intent in plain language:

search_product_graph_nodes(
  query: "code that extracts graph source data from annotations written in code",
  search_mode: "semantic",
  usecase: "design"
)

Top 5 results:

- apps/graph/product/src/parsers/jsdoc-parser.ts:applyGraphTag (Function)
- apps/graph/product/src/parsers/jsdoc-parser.ts:extractTagsFromNode (Function)
- packages/eslint-plugin-graph/src/utils/jsdoc-utils.ts:extractGraphTags (Function)
- apps/graph/product/src/parsers/jsdoc-parser.ts:parseJSDocExports (Function)
- packages/eslint-plugin-graph/src/utils/jsdoc-utils.ts:getGraphTagValue (Function)

The query contained neither "JSDoc" nor "@graph-*" nor "parser" — yet the intent found the right nodes via the @graph-business embedding. grep cannot do this.

Step 2: Trace downstream from that node (`usecase: "design"` prioritizes Documents)

trace_product_graph_connections(
  start_node: "apps/graph/product/src/parsers/jsdoc-parser.ts:parseJSDocExports",
  direction: "forward",
  usecase: "design"
)

Edges returned:

- parseJSDocExports --calls--> extractDeclarationsFromFile
- parseJSDocExports --calls--> extractTagsFromNode
- parseJSDocExports --reads_from[via:filePath]--> filesystem
- parseJSDocExports --documented_by--> docs/product-graph/README.md (Document)

The last one — documented_by — is the point: the edge from code to the Document node was auto-generated. Following it with read_file retrieves docs/product-graph/README.md — and with it, the background, design rationale, and tag specification for this implementation, all in one hop.

Step 3: The meta-structure — this article itself is written with cpg

This article was drafted by Claude Code, not by me; I provided direction and review. That Claude Code session had the cpg MCP connected, so every time I said "show a real example from cpg's own code" or "use a cpg-related infra example," Claude queried cpg to pull actual function names, JSDoc, Pulumi definitions, and docs structure, then embedded them in the text.

In other words: the generateEmbeddings JSDoc, the Pulumi productGraphNodesTable description, the graph-boundary-daily cron annotation, the auto-link to docs/product-graph/README.md — none of these came from my memory. Claude queried cpg and found the real artifacts. My role is only the review judgment: "this is right / this is wrong."

This is the pattern repeating across all of cortex. Humans set the direction; AI uses cpg to verify and generate implementations / text / reviews. Part 1's ③ Auto Review and ④ Alert-Fix run on the same structure. Article writing isn't a special case — as long as cpg exists, AI-driven work always takes this shape.

What Changed / Bridge to Part 3

That covers the inside of cpg. A closing summary of how it affects cortex as a whole:

1. I stopped running grep

Without knowing file names or symbol names, I can get the relevant code back by just describing what I want to do. The combination of 120+ apps and a team of one works because of this, more than anything else.

2. Auto Review produces context-grounded comments

The [Graph] / [Impact] / [Doc] / [Security] level comments Part 1's ③ Auto Review produces all stand on cpg. The substance is review carried out with the entire codebase as context — that's the real benefit of the cpg integration.

3. Alert-Fix can trace from error origin to root cause

Part 1's ④ Alert-Fix can hop from a Grafana alert → code → dependent tables → related docs in one graph traversal because cpg exists. It fires with usecase: "bug" and takes the shortest path from error to root cause.

4. The static-analysis code graph is working somewhere else

I said "we abandoned code inference" at the top, but that was specifically for cortex itself. For the external-facing production repositories (the core of the business), a different approach supplies context, and static analysis continues to run there. More on that in a separate post.

Most AI coding setups try to make the AI better at reading an unchanged repository. cpg takes the opposite approach: change the repository's information structure so AI has a first-class semantic map to read. That's the line between "another GraphRAG" and what cpg actually is.

In that sense, Product Graph is literally a knowledge graph of the AI, by the AI, for the AI: generated alongside AI-written code, maintained through AI review, and consumed by AI agents as their primary map of the product.

Coming up in Part 3: the full pipeline of automated PR review built on top of cpg — from GitHub webhook ingestion through AI review / automated fix / automated merge / parallel deploy. What happens when Auto Review fires with usecase: "code-review", how [Graph] Critical comments are generated, and the worktree mechanism that lets AI apply fixes and push back.

Minimal post (test fixture)

Ryosuke Tsuji — Sun, 17 May 2026 16:48:43 +0000

Minimal test fixture used by $slug.test.tsx. No headings, no tags — covers the
null branches of TOC rendering and tag-list rendering in routes/posts/$slug.tsx.

Slugs prefixed with _ are excluded from /posts listing (production publishing
surface) but remain reachable via direct getRenderedPost(slug, lang) (the
virtual:rendered-posts lookup that backs /posts/$slug) so test fixtures can
be SSR'd without polluting the index.

Building a Real AI Harness: Auto-Reviewed PRs, Self-Healing Ops, and Non-Engineer Contributors (Series Intro)

Ryosuke Tsuji — Tue, 12 May 2026 16:34:39 +0000

Hi, I'm Ryan, CTO at airCloset.

In my previous posts I've introduced the full picture of our 17 internal MCP servers, an MCP server that searches 991 internal tables in natural language, a custom Graph RAG for measuring initiative impact, and the Sandbox MCP that lets non-engineers publish AI-built apps safely.

All of those run on top of an internal AI development platform we call cortex. This post is the first in a series about cortex itself — the platform, the design choices, and the operational experience.

Series Index

#	Theme	Key scene	Article
1	Series intro: cortex's harness	PRs auto-merge / incidents self-heal before you notice	this post ← you are here
2	Product Graph (cpg)	Code, docs, DB, infra unified into one graph	cortex-product-graph
3	AI PR review	webhook → AI review → auto-fix → squash merge	coming
4	Alert-Fix + observability + auto-added guardrails	Alert → AI investigates → fix PR + new lint/type gate → auto redeploy + recurrence blocked	coming
5	Scaling the harness from cortex to toC services	Non-engineer contributions in practice + scaling cortex's harness to the whole product org	coming

Two Scenes, Up Front

Scene 1: PRs merge themselves

Monday morning. An engineer implements a feature locally, pushes a branch, opens a PR.

A few minutes later, the AI reviewer comes back with REQUEST_CHANGES. Multiple comments:
- "This data formatting duplicates formatRow() in the shared package. Please consolidate."
- "You changed an API response type, but the related docs (docs/api/...) still describe the old shape."
A separate AI agent spawns a worktree, applies the fixes, pushes a follow-up commit
Re-review comes back as APPROVE
Auto squash-merge
GitHub Actions detects only the changed stacks and deploys them to Cloud Run / Cloudflare Pages

No human touched any of this. The engineer refreshes the PR tab and notices it's already merged.

Scene 2: Incidents fix themselves before you notice

7 AM. A Grafana alert fires: "BQ pipeline failed 3 times in a row."

An AI receives the webhook, fetches the error logs from Loki via the Grafana MCP
Walks the Product Graph (implementation name: cortex-product-graph — a unified knowledge graph of the codebase, docs, DB schemas, and infrastructure definitions; covered later in this post and in Part 2) to trace the pipeline's code, dependent tables, and related docs, identifying the root cause
Opens a fix PR
AI reviewer APPROVE → auto squash-merge → automatic redeploy

By the time the engineer logs in at 9 AM, Slack already shows: "pipeline patched." The only incidents engineers personally handle are the ones AI genuinely can't crack.

What's behind both scenes is the dev environment described in the rest of this post.

Industry Context — "Harness Engineering"

Before I get to cortex, one paragraph of context. Over the past six months, the practice of building proper foundations for AI agents in production has crystallized into a recognized industry trend.

"Harness" itself isn't a new word. In AI specifically, it traces back to EleutherAI's lm-evaluation-harness (2020) — the LLM evaluation framework that put the term in active use. What changed in the past six months is its elevation into an engineering discipline for LLM agents in production:

Feb 2026: OpenAI published "Harness engineering: leveraging Codex in an agent-first world", describing how a small internal team led by Codex shipped 1 million lines in 5 months
A few days later, Mitchell Hashimoto (HashiCorp co-founder, Terraform creator) distilled it into the formula Agent = Model + Harness
April 2026: Martin Fowler (author of Refactoring, ThoughtWorks Chief Scientist) published "Harness engineering for coding agent users", establishing the Guides (proactive controls) / Sensors (reactive controls) framing
Same month: Anthropic and Cursor each published their own harness write-ups

The catchphrase that's gone viral: "2025 was the year of agents. 2026 is the year of harnesses."

The framing is: the model itself is rapidly commoditizing (the gap between Claude / GPT / Gemini is narrowing from the user side). Where you actually get differentiation is how you design the harness — the foundation that lets AI run in production.

cortex is most cleanly read as a real attempt to build that "harness" inside a real company. In this post I'll organize cortex using Fowler's Guides / Sensors framing.

From here, I'll show how the "harness beats model" thesis takes concrete shape on cortex.

Who Builds the Code

For the first few months, I built 100% of cortex by myself. The accurate framing isn't "without a harness, others can't safely PR" but rather "without a harness, no one — including me with extra hands — could ride this thing."

Even back then, between our Google Meet recording pipeline (Japanese), about half of the 17 MCP servers, and a long tail of unpublished features, roughly 50 loosely-coupled applications were already running. Each one had its purpose, background, and data flow documented carefully. But the volume was such that even with AI in the loop, you couldn't realistically have it read all the relevant docs and absorb the whole picture for any given change. The codebase had outgrown what a person — or an AI given pieces — could hold in their head at once.

Recently, with the harness in place, non-engineers (business-side managers, PMOs, etc.) have started shipping PRs to cortex too. As of writing, the cumulative commit ratio is ~91% me, ~9% other recent contributors.

If you imagine non-engineers opening PRs against a production repo, "can quality really hold?" is the obvious question. In cortex, the answer is yes, because AI review and automation own the quality gates:

PRs missing annotations, tests, or lint cleanliness get REQUEST_CHANGES from the AI reviewer
A separate AI agent applies the fixes
Until everything is satisfied, nothing merges

So whoever writes a PR — engineer or not — at the moment it merges, the same quality bar is met. The key point: it's not "you can write freely," it's "you can write inside rails that don't let you derail." The author's job stops at "communicating the intent precisely"; the harness owns code correctness.
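The gate described above is essentially a loop: review, fix, re-review, and merge only on approval. A minimal sketch of that control flow follows, assuming a synchronous reviewer and fixer for brevity. `runReviewLoop`, the `maxIterations` budget, and the escalation message are illustrative assumptions, not cortex's actual implementation (Part 3 covers the real pipeline).

```typescript
type ReviewVerdict = "APPROVE" | "REQUEST_CHANGES";

type ReviewResult = { verdict: ReviewVerdict; comments: string[] };

type LoopOutcome =
  | { merged: true; iterations: number }
  | { merged: false; iterations: number; reason: string };

// Runs review -> fix -> re-review until the reviewer approves or the budget runs out.
// `review` and `applyFixes` are injected stand-ins for the AI reviewer and the fixing agent.
function runReviewLoop(
  review: (iteration: number) => ReviewResult,
  applyFixes: (comments: string[]) => void,
  maxIterations: number,
): LoopOutcome {
  for (let iteration = 1; iteration <= maxIterations; iteration++) {
    const result = review(iteration);
    if (result.verdict === "APPROVE") {
      return { merged: true, iterations: iteration };
    }
    // Nothing merges while comments are outstanding; hand them to the fixer and loop.
    applyFixes(result.comments);
  }
  return {
    merged: false,
    iterations: maxIterations,
    reason: "iteration budget exhausted; escalate to a human",
  };
}
```

The point of the shape is that the author never appears inside the loop: the only exits are an approval or an explicit escalation.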

The shift is from "X could write that because they're X" to "X can write that because of cortex." That property only emerges once the harness is built — and it's the core of cortex's design.

What's Running

cortex consists of microservices, jobs, MCP servers, web frontends, Cloudflare Workers, and so on. As of writing, there are 123 apps. The features I've already covered in past posts are each composed of multiple apps — but even adding them up by feature, only about 10% of cortex has been written about. The remaining 90% hasn't appeared in a post yet. A few examples:

A unified product UX measurement web app — UX metrics, screen analysis, funnels, and error analysis in one place
A dev-org portal web app — KPIs (bug rate, etc.), per-member GitHub Activity, QA evaluation results, plus an AI chat that answers natural-language questions about KPIs via Agentic RAG
A family of Slack bots for operational support:
- A config bot that lets you manage job configurations (DBs, attendance SaaS, Google Drive, etc.) directly from Slack
- An accounting-assist bot that takes invoice OCR and drafts payment requests / expense filings in our accounting SaaS
- In-channel knowledge search, issue/request management, meeting creation; a BigQuery cross-table RAG bot; a Google Drive cross-corpus RAG bot
- A marketing bot that returns insights (trend, creative analysis) from BigQuery marketing data
An APM auto-analysis agent that runs daily on monitoring-SaaS APM data, detects performance issues, and opens tickets in our issue-tracking SaaS
An AI-bot auditor bot that runs E2E tests against the Slack bots above and detects spec drift

…and so on. Each will get its own dedicated post later in the series.

Scale at a glance:

	Count
apps (microservices, jobs, MCP servers, web, etc.)	123
packages (shared libraries)	66
MCP servers	19
Pulumi stacks	110
TypeScript (implementation)	~630K lines
Tests	~560K lines
Markdown documentation	~110K lines / 389 files
Duration	~5 months (intensive development: ~4 months)
Merged PRs	~790

The 4-Element Flywheel — cortex's Harness

What lets "~4 months of intensive dev, mostly solo" coexist with "non-engineers shipping into the same repo" is a harness design that delegates quality to AI and automation across every layer.

cortex's harness is structured as a flywheel of 4 elements, mapped to Fowler's Guides (proactive) / Sensors (reactive) split, that mutually reinforce one another.

① Product Graph (Guides — supplying the right context)

All of cortex — code, documentation, DB schemas, infrastructure definitions — is indexed in real time as a single unified graph. It's queryable via MCP through semantic search.

"Where is the code that calculates this KPI?" → "Which BQ tables does that code touch?" → "What are those tables' column definitions?" → "What docs are related?" — all of these can be answered from a single query traversal. That graph becomes the context source for everything the AI does.

This is the foundation that "structurally reduces how often the AI gets confused." Where grep tells you "where the string appears," the Product Graph tells you "what is connected, why, and how." Implementation details come in Part 2.

② Lint / Quality Gates (Guides — physically blocking deviations)

eslint-disable / oxlint-disable are forbidden anywhere in the repo. In hand-written code, occurrences of : any / as any / TODO / FIXME are 0 (excluding generated files and unavoidable external-library cases). Type checking (using tsgo — Microsoft's Go port of the TypeScript compiler, ~10× faster than tsc; we use it to keep CI time down) runs on the entire codebase in CI.

On top of that, test coverage is enforced at ≥90% for statements / branches / functions / lines. Lowering the threshold to pass is forbidden — you write tests instead.

With every escape hatch sealed, even when the AI writes wrong code, it doesn't merge. This is also what stabilizes AI review judgments downstream.
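As a rough illustration of what "sealing the escape hatches" means mechanically, the sketch below checks a coverage summary against the 90% floor and scans source text for the forbidden markers listed above. It is a deliberately crude, regex-based stand-in: a real gate would work on the AST through lint rules, and every name here (`coverageFailures`, `findEscapeHatches`, `COVERAGE_FLOOR`) is hypothetical rather than taken from cortex.

```typescript
type CoverageMetric = "statements" | "branches" | "functions" | "lines";
type CoverageSummary = Record<CoverageMetric, number>; // percentages, 0 to 100

const COVERAGE_FLOOR = 90; // the threshold stated in the text; never lowered to pass

// Returns the metrics that fall below the floor; an empty array means the gate passes.
function coverageFailures(summary: CoverageSummary): CoverageMetric[] {
  const metrics: CoverageMetric[] = ["statements", "branches", "functions", "lines"];
  return metrics.filter((metric) => summary[metric] < COVERAGE_FLOOR);
}

// Text patterns for the escape hatches named above. Regexes are an approximation:
// they also match inside string literals, which an AST-based rule would not.
const FORBIDDEN_PATTERNS: { label: string; pattern: RegExp }[] = [
  { label: "eslint-disable", pattern: /eslint-disable/ },
  { label: "oxlint-disable", pattern: /oxlint-disable/ },
  { label: ": any", pattern: /:\s*any\b/ },
  { label: "as any", pattern: /\bas\s+any\b/ },
  { label: "TODO/FIXME", pattern: /\b(?:TODO|FIXME)\b/ },
];

// Reports each forbidden marker with its 1-based line number.
function findEscapeHatches(source: string): { line: number; label: string }[] {
  const hits: { line: number; label: string }[] = [];
  source.split("\n").forEach((text, index) => {
    for (const { label, pattern } of FORBIDDEN_PATTERNS) {
      if (pattern.test(text)) hits.push({ line: index + 1, label });
    }
  });
  return hits;
}
```

A CI step would fail the build whenever either function returns a non-empty array, which is what makes the rule a gate rather than a guideline.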

③ Auto Review (Sensors — auto-fixing until the bar is met)

Scene 1 above is exactly this. The implementation-side note: AI review here isn't "lint with extra steps" — every comment is grounded in Product-Graph traversal of the actual impact. That's where it earns its keep. To give you a feel, comments that actually fire fall into categories like:

[Graph] Critical — missing annotation that breaks an edge in the graph
[Impact] Critical — a BQ MERGE statement referencing a column not present in the existing target table; would fail in production
[Doc] Critical — code change that left related docs stale
[Security] Minor — execSync doing string interpolation on an env var, opening a command injection vector

This isn't the surface-level commentary you might mentally file under "AI review." Comments here are produced with the entire codebase carried as context, which is what the Product Graph integration buys you.

The only PRs that actually need a human are the ones where AI review hits a hard case. Day-to-day PRs go from push to merge without anyone touching them.

④ Alert-Fix (Sensors — re-injecting production anomalies into the loop)

Scene 2 above is exactly this. Starting from a Grafana alert, the AI traces the root cause through Product Graph + Loki + git blame, opens a fix PR, and pushes it through ③ Auto Review until it's auto-merged. Re-injecting anomalies into the loop is the essence of Sensors. Details in a later post.

What Makes It a Flywheel

These 4 elements mutually reinforce one another:

① Product Graph exists, so ③ Auto Review can comment with real impact awareness
② Lint enforces the ground rules, so ③ Auto Review can assume "everything in the codebase meets the bar"
③ Auto Review exists, so new code lands in ① Product Graph with correct semantic annotations
④ Alert-Fix's incidents loop back through ③, maintaining the quality bar all the way back to ①

The harness's effectiveness scales with the size of the codebase, not against it.

Supporting Foundations

Three foundations make the 4 elements possible (covered in detail in Part 4):

Tests and coverage: ~630K lines of implementation, ~560K lines of tests (impl : test ≒ 1.13 : 1)
Documentation: ~110K lines / 389 files, written for both humans and AI, also ingested as Document nodes in the Product Graph
Observability: Frontend = Faro, backend = OTel, infrastructure and CI logs all consolidated in Grafana. The AI sees the same data humans see. Gemini API token usage and cost are tracked separately in Prometheus.

Technical Foundation

cortex is a full-TypeScript monorepo.

Layer	Stack
Applications (`apps/`)	TypeScript (Hono, TanStack Router, Vite, etc.)
Shared packages (`packages/`)	TypeScript
Infrastructure (`infra/`)	TypeScript (Pulumi)
Edge (`worker/`)	TypeScript (Cloudflare Workers)
Lint plugins	TypeScript
Doc scripts	TypeScript (tsx)

Having everything in one language is a much bigger win when viewed from the AI's side than from a human's. Specifically:

You can feed the AI ASTs and type definitions directly as context — no language boundary fragments the picture
Refactors don't cross language boundaries — one ESLint plugin can inspect and auto-fix apps/, packages/, and infra/ together
Edges don't break in the Product Graph — for example, a Cloud Run service definition (infra/, TS) connects in a single graph to the Hono route (apps/, TS) it actually invokes

When you ask the AI "what does this change affect?", the reason it can hop infra → apps → packages and answer in one round-trip is that all of this is one language.

Build is parallelized via Turborepo and pnpm workspaces. Deploys go through GitHub Actions, which detects only changed stacks and applies them in parallel via Pulumi.
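The article does not say how changed stacks are detected, so the following is only one plausible approach: map each stack to the path prefixes it owns and select the stacks touched by a PR's changed files. `StackMap`, `detectChangedStacks`, and the example stack names are assumptions made for illustration.

```typescript
// Hypothetical mapping from a stack name to the repo path prefixes it depends on.
type StackMap = Record<string, string[]>;

// Returns the stacks (sorted by name) that own at least one of the changed files.
function detectChangedStacks(changedFiles: string[], stacks: StackMap): string[] {
  return Object.keys(stacks)
    .filter((stack) =>
      (stacks[stack] ?? []).some((prefix) =>
        changedFiles.some((file) => file.startsWith(prefix)),
      ),
    )
    .sort();
}
```

With a map such as `{ api: ["apps/api/", "packages/shared/"], web: ["apps/web/", "packages/shared/"] }`, a change under `packages/shared/` selects both stacks, while a change under `apps/api/` selects only `api`. The selected stacks can then be deployed independently of each other.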

Numbers (snapshot at time of writing)

	Value
Duration	~5 months (intensive development: ~4 months)
Commits	~4,000
Merged PRs	~790
% of commits authored by me	~91%
apps	123
packages	66
MCP servers	19
Pulumi stacks	110
TypeScript (implementation)	~630K lines
TypeScript (tests)	~560K lines
Markdown documentation	~110K lines / 389 files
`as any` / TODO / unjustified lint-disable in hand-written code	0 (excluding generated files / unavoidable external-library cases)
Coverage gate	90% (statements / branches / functions / lines)

The PR-flow Switch That Multiplied Throughput

Up until April, I reviewed every change carefully on my own machine, with AI assistance, and then committed directly to main. The review bar was unchanged, but throughput was bottlenecked on my hands.

In April, switching to fine-grained, PR-based operation (auto review → auto fix → auto merge) dramatically changed the per-month merged-PR count:

Month	Merged PRs
2026-02	10
2026-03	23
2026-04	518
2026-05 (through the 10th)	235

A ~22× jump between March and April. Total commits actually went down (because committing directly to main was replaced by going through PRs), so this isn't "I wrote more code." This is "the manual review step got replaced by the harness, and the throughput ceiling moved." The 22× is exactly the moment a human reviewer was swapped for Auto Review — clean evidence of the flywheel property where the harness's effectiveness scales with codebase size.

What's Required for These Numbers to Hold

These numbers are not explained by "we use AI" alone. The prerequisites:

Full TypeScript monorepo — code, tests, infrastructure, scripts all under one static-analysis system
Composable Architecture — packages/ holds reusable parts; apps/ compose them. Direct imports between apps/ are forbidden — everything routes through packages/. This is what guarantees components don't interfere with each other.
Strict quality gates — lint / coverage / annotations are run "no lowering, no working around"
Unified graph — code, docs, DB, infrastructure on a single graph as the foundation that lets the AI act with context
Auto PR review / auto fix / auto merge / auto alert-fix — the harness that swaps the rate-limiting manual step for AI
Unified observability — humans and AI see the same data (OTel + Faro + Prometheus)

The design has to be in place first, and AI runs on top of it. That's what makes both volume and quality possible at the same time.

Composable Architecture in particular is what makes a team of one this productive. Because components don't interfere, multiple Claude Code sessions can run in parallel on different parts of the codebase. In practice, I've run up to ~10 sessions in parallel at peak, and that parallelism multiplies with the harness's effectiveness.
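The "no direct imports between apps/" rule is easy to state as a check. Below is a minimal sketch, assuming import edges have already been resolved to repo-relative paths; `ImportRecord`, `appNameOf`, and `crossAppImports` are hypothetical names, and the actual lint plugin used in cortex is not shown in this post.

```typescript
// One resolved import edge, with both ends as repo-relative paths (assumed input shape).
type ImportRecord = { importer: string; imported: string };

// Returns the app name for a path like "apps/billing/src/x.ts", or null for non-app paths.
function appNameOf(path: string): string | null {
  const match = /^apps\/([^/]+)\//.exec(path);
  return match ? (match[1] ?? null) : null;
}

// Flags imports where a file in one app reaches directly into a different app.
// Imports into packages/ and imports within the same app are allowed.
function crossAppImports(imports: ImportRecord[]): ImportRecord[] {
  return imports.filter(({ importer, imported }) => {
    const fromApp = appNameOf(importer);
    const toApp = appNameOf(imported);
    return fromApp !== null && toApp !== null && fromApp !== toApp;
  });
}
```

Because every app can only depend on packages/, two sessions editing two different apps cannot break each other through a hidden import.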

It's system design, not magic. Each piece will get its own deep-dive in this series.

Some Honest Caveats

If you've read this far, it might sound like everything runs perfectly on autopilot. It doesn't. Three things I want to be upfront about:

1. High code quality doesn't prevent bugs.

What the harness protects is "correctness of the code" — not "correctness of the spec." Even when implementation is clean, getting the spec interpretation wrong still ships bugs. AI review can catch "code contradicts the documented spec," but if the spec itself is wrong, the issue sails right through. That part is still a human responsibility.

2. The work is split deliberately.

New pipelines that connect to external APIs, and anything touching secure data, are handled by engineers. Non-engineers mostly work on modifications to features that already exist (peeking at our business-side members' PRs makes it concrete pretty quickly). "Non-engineers can develop too" means "the harness provides rails they can't derail from, so they can safely modify in maintenance mode" — not "anyone can build anything from scratch."

3. This level of automation works because it's an internal platform.

Yes, cortex's full-auto deploy works partly because Composable Architecture cleanly separates apps and infrastructure. But honestly, a big part of it is that this is an internal-only platform. If something breaks, only employees are affected, and we can roll back fast. The same approach can't be applied directly to consumer products or systems where downtime is immediately critical (warehouse management, for example). We've started work to close that gap on the consumer side too, but that's a separate post.

Series Roadmap

The series is planned as 5 parts.

Part 1: Series Intro (this post)
The big picture of what cortex is and why it works in "harness" form. The map to the rest of the series.

Part 2: Product Graph — code, docs, DB, infrastructure as one unified graph ★ recommended next
The implementation side: how the unified graph is built and maintained. What happens when you take the design principles from the Agentic Graph RAG MCP post and apply them to the entire cortex codebase.

Part 3: AI reviews, fixes, merges, and deploys PRs
GitHub webhook → AI review → on REQUEST_CHANGES, AI fixes via worktree → auto squash merge → changed-stack detection → parallel deploy: the full pipeline.

Part 4: Incidents self-heal, guardrails self-strengthen
Grafana alert → AI investigation (Loki + Product Graph + git blame) → fix PR + new lint/type gate → auto merge → automatic redeploy: the auto alert-fix system. Also covers the full OTel + Faro + Prometheus stack, Gemini cost tracking, and how the quality gates are designed to be "non-lowerable, non-bypassable, and self-growing."

Part 5: Scaling the harness from cortex to toC services
The first half covers how business members can already open PRs directly to cortex -- and where that breaks (additions to existing pipelines work; new pipelines and architectural changes still need humans in the loop). The second half is the roadmap and the thinking behind scaling cortex's harness across the whole product org (multiple services, multiple infra stacks, multiple teams).

Each post stands on its own, but Part 2 (Product Graph) is the foundation for the others, so the recommended reading order is Part 1 → Part 2 → any.

Cadence: Tuesdays or Thursdays, 8–10 AM JST.

Closing

Building cortex, what's struck me is that in an AI-era dev environment, "absorbing everything that comes after the writing" wins over "reducing the burden on the writer". Tests, lint, types, coverage, code review, incident response — instead of "these get in the way, let's reduce them," the choice that worked was "have the AI do all of them, without compromise." The counterintuitive result is that quality and dev speed both go up at the same time.

And it expands two things — how much one engineer can ship, and how much non-engineers can participate — well beyond what was possible before. That's the texture of the "harness" we've built on top of cortex.

In subsequent parts, I'll walk through the individual mechanisms that make this work.

→ Part 2: Product Graph — code, docs, DB, infrastructure as one unified graph

Graph RAG Isn't a One-Shot Anymore — The Case for Agentic Graph RAG MCPs

Ryosuke Tsuji — Thu, 07 May 2026 09:57:32 +0000

Hi, I'm Ryan, CTO at airCloset.

Over my last few posts, I've introduced internal MCP servers we've been building: DB Graph MCP, the full picture of our 17 internal MCP servers, Biz Graph, and Sandbox MCP.

DB Graph is built from ORM parsing. Biz Graph extracts initiatives from meeting slides and uses a hand-designed Week node structure. Sandbox MCP is an app deployment platform. The purposes and implementations are completely different — but as I was writing each piece, I noticed that the design ideas at the root are the same.

This post is about that root. Agentic Graph RAG — a design frame we keep coming back to whenever we build graphs across different domains.

If you've heard "Graph RAG" before — maybe Microsoft's open-source project — wait a moment. The same words mean different things in the era when retrieval was assumed to be a single shot versus the era when AI agents are everywhere. The optimal design changes completely. This post is about the latter — a new way to think about Graph RAG in a world where Claude Code, Codex, and friends are doing the orchestration.

What Is RAG, Really?

Quick refresher. Skip if this is familiar.

RAG (Retrieval Augmented Generation) is the umbrella term for any technique that retrieves related information from external data and mixes it into the prompt before the LLM generates an answer.

Why was this needed? In the early days of generative AI — late 2022 and through 2023 — we ran into three problems:

Tiny context windows: GPT-3.5 had 4K tokens, early GPT-4 had 8K. You couldn't fit your internal docs in there.
Stale model knowledge: The model didn't know anything past its training cutoff. It certainly didn't know your internal data.
Hallucination: It would confidently fabricate answers when it didn't know.

The RAG idea was: every time the user asks something, fetch the relevant chunks from external data and feed them in before generation.

Vector RAG — The First Practical Answer

The earliest RAG implementation that actually caught on was Vector RAG.

The recipe is simple:

Split documents into small chunks (say, 500 tokens each)
Embed each chunk with a model (e.g., 1536-dim vectors)
Store them in a vector DB (Pinecone, Weaviate, pgvector...)
Embed the user's question with the same model, retrieve the top-k closest by cosine similarity
Stuff those chunks into the prompt and call the LLM
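The retrieval step of that recipe fits in a few lines. This is a minimal in-memory sketch of cosine-similarity top-k, assuming the vectors have already been produced by an embedding model; a real deployment would delegate the search to a vector DB. `EmbeddedChunk`, `cosineSimilarity`, and `topK` are illustrative names.

```typescript
type EmbeddedChunk = { id: string; text: string; vector: number[] };

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same dimension");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Scores every chunk against the query vector and keeps the k most similar.
function topK(query: number[], chunks: EmbeddedChunk[], k: number): EmbeddedChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.vector) }))
    .sort((left, right) => right.score - left.score)
    .slice(0, k)
    .map(({ chunk }) => chunk);
}
```

Everything outside the returned k chunks is invisible to the LLM, which is where the top-k cliff discussed later comes from.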

For its time, this was a great invention. Because:

Search is fast: tens to hundreds of milliseconds
No training needed: feed it docs, it's instantly searchable
Domain-agnostic: works for legal documents, medical charts, internal wikis — the same machinery
Rides model improvements: better embedding models, better recall

And critically, agent technology was still immature. OpenAI's Function Calling shipped in June 2023, was unstable for a while, and running a meaningful agentic loop of multiple tool calls was both slow and expensive. So RAG was designed around the assumption: one retrieval has to fetch everything you need. Vector RAG was perfectly tuned for this constraint.

The Limits of Vector RAG

But anyone who runs Vector RAG in production discovers the same thing fast: it can't follow relationships.

Take a question like:

"How did last month's SNS ad campaign affect new member signups?"

Vector search returns chunks that are textually similar to the question. The campaign description might come up. But:

When was the campaign actually running?
What were the new-member numbers during that same period?
What happened with previous similar campaigns?

These aren't textual similarity — they're structural traversals across data. Embedding maps "spring SNS ads" and "spring promotion initiative" close together, but it cannot start from "ran from March 1 to March 31" and reach "new member counts in that same period". That's not a similarity problem; that's a join problem.
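To see why this is a join and not a similarity lookup, consider a toy version of the data. The sketch below starts from a campaign's date range and sums the signups that fall inside it; no embedding is involved. The `Campaign` and `DailySignups` shapes and the sample figures used with them are made up for illustration.

```typescript
type Campaign = { id: string; start: string; end: string }; // ISO dates (YYYY-MM-DD), inclusive
type DailySignups = { date: string; count: number }; // ISO date and new-member count

// Joins a campaign to daily signup rows on the date range and sums the counts.
// ISO YYYY-MM-DD strings sort chronologically, so plain string comparison is enough.
function signupsDuring(campaign: Campaign, daily: DailySignups[]): number {
  return daily
    .filter((row) => row.date >= campaign.start && row.date <= campaign.end)
    .reduce((sum, row) => sum + row.count, 0);
}
```

The filter condition is a structural relationship between two datasets. Vector search can surface the campaign description, but it has no way to express "rows whose date lies between this start and this end."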

On top of that:

Chunk boundaries kill context: related info gets split across chunks
Top-k cliff: critical info at rank 11 is invisible
Granularity mismatch: questions like "summarize the whole thing" can't be answered by collecting chunks

Vector RAG nailed "fetch text similar to the question in one step." It's weak at "follow data through structural relationships." That's the gap that Graph RAG was born to address.

Graph RAG — Search That Follows Relationships

The basic idea of Graph RAG: extract entities (people, organizations, concepts) and relationships (belongs-to, affects, references) from your documents, store them as a graph, and at query time traverse the graph to gather information across multiple hops.

This handles questions like our SNS-ads-and-new-members example — anything that requires multi-hop reasoning.
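A minimal sketch of multi-hop traversal, assuming the graph is held as a flat list of directed edges: a breadth-first walk from a start node that records how many hops each reachable node is away. `Triple`, `traverse`, and the node IDs in the usage note are hypothetical; production graph stores expose this through their own query languages.

```typescript
type Triple = { from: string; relation: string; to: string };

// Breadth-first traversal over directed edges, up to maxHops from the start node.
// Returns a map from each reached node to its hop distance (the start node is 0).
function traverse(edges: Triple[], start: string, maxHops: number): Map<string, number> {
  const hops = new Map<string, number>();
  hops.set(start, 0);
  let frontier: string[] = [start];
  for (let hop = 1; hop <= maxHops && frontier.length > 0; hop++) {
    const next: string[] = [];
    for (const node of frontier) {
      for (const edge of edges) {
        if (edge.from === node && !hops.has(edge.to)) {
          hops.set(edge.to, hop);
          next.push(edge.to);
        }
      }
    }
    frontier = next;
  }
  return hops;
}
```

With edges such as `campaign:spring-sns --ran_during--> period:march` and `period:march --has_metric--> metric:new-members`, a two-hop traversal from the campaign reaches the metric, which is exactly the path vector similarity could not follow.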

Classical Graph RAG — Built for the One-Shot Era

The most well-known implementation right now is Microsoft's GraphRAG, released in 2024. The papers are well-written and I have a lot of respect for the work. But the design philosophy is squarely from the one-shot retrieval era.

Roughly, Microsoft GraphRAG does this:

Entity extraction: feed the entire corpus through an LLM to extract entities and relationships
Community detection: find graph clusters (communities) using the Leiden algorithm
Hierarchical summarization: have the LLM summarize each community. Then summarize groups of communities into higher-level summaries
Query time: pick the relevant community for the user's question, dump its summary into the prompt, answer in a single shot

Why is the preprocessing this heavy? Because of the assumption underneath: "calling tools many times at query time isn't realistic". Function calling loops were slow, expensive, and unstable. So you preprocess the entire corpus with an LLM, build community summaries, and front-load the work to make query-time retrieval a single hop or two.

This wasn't a design failure — it was the rational answer for that era. LangChain's RetrievalQA, LlamaIndex's query engines — all of them were built on the same premise: "retrieval is single-shot, generation is one-turn."

What Classical Graph RAG Solved, and Didn't

What it solved:

Relationship-aware search (community summaries even cover "the big picture")
Multi-hop questions like "the relationship between Sam Altman, OpenAI, and Microsoft"

What it didn't solve cleanly:

Construction is expensive: extracting entities from a large corpus via LLM costs real money
Schema is at the LLM's mercy: the entities and relationships extracted are whatever the LLM thinks. This works fine for public-knowledge corpora (papers, news, etc.), but for domains that lean on internal tacit knowledge, the extracted units don't always match what's meaningful for the business
Updates are heavy: every new document means recomputing communities
Sometimes off-target: community summaries get over-abstracted, and the specific information you actually need falls out

Honest disclaimer: I haven't seriously run classical Graph RAG in production myself. By the time I started building graph-based MCPs in our company, Claude Code was already running on my laptop, and I started from a world where agents calling tools many times was the default. As a result, I never actually needed the heavy "compress the answer ahead of time" preprocessing of community summaries. If AI can re-fetch as many times as needed, the graph just has to hold the facts accurately.

The flip side: if I had been doing this in 2023, I likely would have ended up on the same path as community summaries. The problems classical Graph RAG was solving are real — the underlying assumptions just changed faster than the design.

Things Changed — The Agentic Era

From late 2024 through 2025, the landscape shifted:

Production-grade agents arrived: Claude Code, OpenAI Codex — agents that can run long tasks while orchestrating their own tool calls
MCP (Model Context Protocol) landed: tool descriptions became a standardized contract the model can read
Tool-use accuracy from Sonnet/Opus-class models: "pick the right tool from 20" became reliable
Long context windows + prompt caching: stacking many tool calls in a session is now economically reasonable
stop_reason: tool_use as a natural loop: the model itself decides "I have enough info" or "I need to look more"

When all of these line up, the assumption "we can't afford retrieval as a loop" no longer holds. Five tool calls per session, ten, twenty — that's now the norm.
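The loop behind stop_reason: tool_use fits in a few lines. What follows is a minimal sketch and not any vendor's SDK: ModelTurn, Model, runAgentLoop and the maxSteps budget are hypothetical names introduced here for illustration.

```typescript
// Hypothetical shapes: a model turn either requests a tool or ends the turn.
type ModelTurn =
  | { stopReason: "tool_use"; tool: string; input: string }
  | { stopReason: "end_turn"; text: string };

type Tool = (input: string) => string;
type Model = (transcript: string[]) => ModelTurn;

// Keep calling tools until the model says it has enough, or the step budget runs out.
function runAgentLoop(
  model: Model,
  tools: Record<string, Tool>,
  question: string,
  maxSteps = 20,
): { answer: string; toolCalls: number } {
  const transcript: string[] = [`user: ${question}`];
  for (let step = 0; step < maxSteps; step++) {
    const turn = model(transcript);
    if (turn.stopReason === "end_turn") {
      return { answer: turn.text, toolCalls: step };
    }
    const tool = tools[turn.tool];
    // Unknown tools are reported back as text so the model can correct itself.
    const result = tool ? tool(turn.input) : `error: unknown tool ${turn.tool}`;
    transcript.push(`tool_use: ${turn.tool}(${turn.input})`, `tool_result: ${result}`);
  }
  return { answer: "step budget exhausted", toolCalls: maxSteps };
}
```

The host code never decides how many retrievals to run. It only enforces a ceiling, and the model decides when to stop.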

The constraint Microsoft GraphRAG was designed against — "loops are expensive at query time" — has dissolved.

This isn't to say Microsoft GraphRAG is "outdated." It was the right answer for its constraints. The constraints changed, and the optimal answer changed with them.

Agentic Graph RAG — Deterministic Retrieval, AI-Driven Orchestration

Here's the thesis. In one line:

Each retrieval step is deterministic. Only the orchestration is AI.

For context: "Agentic Graph RAG" isn't a term I coined. Neo4j's NODES AI 2026 featured a session titled "Agentic GraphRAG," and O'Reilly is publishing Agentic GraphRAG by Anthony Alcaraz and Sam Julien in November 2026. The industry as a whole is pivoting from "one-shot Graph RAG" toward "agent-driven Graph RAG." This article is my attempt to put words around the design we'd been arriving at independently inside our company.

That said, when "Agentic GraphRAG" is used in public contexts, the dominant framing centers on agents automating the graph construction itself (Neo4j's talk above is in that lineage). What this article takes from that broader idea is specifically the query-side agentic pattern. We still hand-design the graphs because the domains we target (internal DB schemas, initiatives × KPIs, codebases) lean heavily on internal tacit knowledge — for now, hand-designing produces better results in practice. We aren't rejecting auto-construction in principle; we're applying the query-side concept to graphs we still build by hand.

Vector RAG had probabilistic retrieval. Embedding cosine is an approximation, and it sometimes misses. Hallucination starts at the retrieval layer.

Classical Graph RAG runs retrieval once at query time. Heavy preprocessing prepares "the answer itself" in advance, and at query time you just look it up.

Agentic Graph RAG sits between these two.

The graph is designed by humans. Our domains lean on internal tacit knowledge, so humans deciding "this is the granularity I want to slice the data with" produces better results.
Each tool call is deterministic. Pass an ID and you get the connected nodes and edges. There's no embedding wiggle.
The AI only judges which tool to call next, what ID to pass in, and when to stop.

The result: errors get localized. Retrieval itself is deterministic, so the only places to be wrong are "AI picked the wrong starting point" or "AI stopped too early." The data in the response is the truth.
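To make "pass an ID and you get the connected nodes and edges" concrete, here is a small sketch over an in-memory graph. The Graph shape and the getConnected name are assumptions for illustration; our real graphs live in a proper store.

```typescript
interface GraphNode { id: string; label: string }
interface GraphEdge { from: string; to: string; kind: string }
interface Graph { nodes: Map<string, GraphNode>; edges: GraphEdge[] }

const edgeKey = (e: GraphEdge): string => `${e.from}|${e.to}|${e.kind}`;

// Deterministic lookup: an unknown ID is an error, never a "closest match".
function getConnected(graph: Graph, id: string): { node: GraphNode; edges: GraphEdge[] } {
  const node = graph.nodes.get(id);
  if (!node) throw new Error(`unknown node id: ${id}`);
  const edges = graph.edges
    .filter((e) => e.from === id || e.to === id)
    // Stable ordering so repeated calls return identical responses.
    .sort((a, b) => (edgeKey(a) < edgeKey(b) ? -1 : edgeKey(a) > edgeKey(b) ? 1 : 0));
  return { node, edges };
}
```

There is no similarity score anywhere in this function. A wrong ID fails loudly instead of returning something plausible.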

Tool Return Values Become a Runbook

The most important design move in Agentic Graph RAG: the tool's return value tells the AI what to do next.

This is different from a regular API. Regular APIs answer the question they were asked. MCP tools are in conversation with an AI. The other side of the conversation needs not just an "answer" but candidates for the next move.

Concrete example.

When the AI calls DB Graph MCP's search_tables tool, it gets:

5 tables matched (vector similarity ranked):

warehouse.return_package_table (postgresql) (distance: 0.2557)
warehouse.receipt_record_table (postgresql) (distance: 0.2720)
inventory.receipt_confirmation_table (mysql) (distance: 0.2921)
warehouse.receipt_record_detail_table (postgresql) (distance: 0.2951)
app.return_status_change_history_table (mysql) (distance: 0.3170)

※ Schema and table names are anonymized — they map to internal system names.

Notice that the response itself contains the next tool's argument. The qualified name warehouse.receipt_record_table is exactly what get_table_detail(table_name: "warehouse.receipt_record_table") expects. If the AI decides "let me look at the details," it just copy-pastes.

The get_table_detail response is even more direct:

# warehouse.receipt_record_table
DB: POSTGRESQL / ORM: typeorm / Repo: warehouse-api

## Columns (9)
- id: int [PK, AI, NOT NULL]
- shipping_order_id: varchar [NOT NULL]
- status: enum [NOT NULL, default=IN_PROGRESS]
- ...

## References (2)
- shipping_order_id → warehouse.shipping_order_table.id (explicit)
- operator_id → warehouse.user_table.id (explicit)

## Enum / Status Definitions (2)
- Status: COMPLETE = received, IN_PROGRESS = in progress
- Type: RENTAL_RETURN = rental return, ...

This response implicitly tells the AI:

"The meaning of status is in the Enum definition" → don't guess, read it
"There are FK references" → if needed, you can follow them with trace_relationships
"There's no direct FK to the app schema" → you'll need a different path

In other words, the tool's response is a runbook for the AI. The AI reads it and assembles the next move on its own.
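A rough sketch of how such a response could be rendered from structured metadata follows. TableDetail and renderTableDetail are hypothetical names. The trailing "Next" section is an assumption added here to make the hint explicit; the response shown above carries it only implicitly.

```typescript
interface TableDetail {
  name: string; // qualified name, e.g. "warehouse.receipt_record_table"
  columns: string[];
  references: { column: string; target: string }[];
  enums: Record<string, Record<string, string>>;
}

function renderTableDetail(t: TableDetail): string {
  const out: string[] = [`# ${t.name}`, "", `## Columns (${t.columns.length})`];
  for (const c of t.columns) out.push(`- ${c}`);

  out.push("", `## References (${t.references.length})`);
  for (const r of t.references) out.push(`- ${r.column} → ${r.target}`);

  const enums = Object.entries(t.enums);
  out.push("", `## Enum / Status Definitions (${enums.length})`);
  for (const [name, values] of enums) {
    const pairs = Object.entries(values).map(([k, v]) => `${k} = ${v}`);
    out.push(`- ${name}: ${pairs.join(", ")}`);
  }

  // Assumed extra section: spell out the follow-up call with a copy-pasteable argument.
  if (t.references.length > 0) {
    out.push("", "## Next", `- trace_relationships(table_name: "${t.name}")`);
  }
  return out.join("\n");
}
```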

Now look at the response from sql_query_database:

**app** (staging) — 1 row

| id     | status   | warehouse_order_code |
|--------|----------|----------------------|
| 98765  | RETURNED | SO-2026-00012345     |

> **Table**: Manages the full lifecycle of delivery orders...

### Column descriptions
- **status**: Delivery status (1=awaiting shipment, 2=ready, 3=delivered, 4=returned, ...)
- **warehouse_order_code**: Link code to the warehouse-side shipping order

### Related tables
- → **app.member_table** (user_id → id)
- → **app.plan_master** (plan_id → id)
- ← **app.order_history_table** (delivery_id → id)

Column descriptions and related tables are auto-attached below the query result. This is composed dynamically from the graph data we cached in BQ. Reading that "warehouse_order_code links to the warehouse side," the AI immediately decides "next, look up the warehouse table by this code."

Nobody had to tell the AI "now look at warehouse." The response itself is the instruction.
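The auto-attach step can be sketched as a pure function that takes query rows plus cached table metadata. This is an illustration under assumptions: TableMeta, ResultRow and annotateResult are hypothetical names, and the real composition reads from the graph data cached in BQ.

```typescript
interface TableMeta {
  description: string;
  columns: Record<string, string>; // column name -> description
  related: { table: string; via: string }[];
}

type ResultRow = Record<string, string | number | null>;

function annotateResult(rows: ResultRow[], meta: TableMeta): string {
  const first = rows[0];
  const cols = first ? Object.keys(first) : [];
  const out: string[] = [`${rows.length} row(s)`, ""];
  if (cols.length > 0) {
    out.push(`| ${cols.join(" | ")} |`, `|${cols.map(() => "---").join("|")}|`);
    for (const row of rows) {
      out.push(`| ${cols.map((c) => String(row[c] ?? "NULL")).join(" | ")} |`);
    }
  }
  out.push("", `> Table: ${meta.description}`, "", "### Column descriptions");
  // Only describe columns that actually appear in the result, to keep the response small.
  for (const c of cols) {
    const desc = meta.columns[c];
    if (desc) out.push(`- ${c}: ${desc}`);
  }
  out.push("", "### Related tables");
  for (const r of meta.related) out.push(`- ${r.table} (${r.via})`);
  return out.join("\n");
}
```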

DB Graph in Action — A Production Investigation in 4 Steps

Here's the full flow (also shown in the DB Graph MCP article).

The scenario: a CS agent asks, "This member shows 'returned' in the app, but did the warehouse actually confirm receipt?"

Step 1: Find tables in natural language (vector-similarity entry-point search)

search_tables(query: "return processing confirmation", search_type: "semantic")
→ warehouse.receipt_record_table, warehouse.return_package_table, ...

Step 2: Look at the details (deterministic detail retrieval)

get_table_detail(table_name: "warehouse.receipt_record_table")
→ status=COMPLETE means "warehouse received it"
→ shipping_order_id connects to warehouse.shipping_order_table

Step 3: Find the path to the other schema (deterministic graph traversal)

trace_relationships(table_name: "warehouse.shipping_order_table", direction: "both")
→ from the app side, connection goes through an intermediate table
search_tables(query: "warehouse linkage")
→ app.warehouse_linkage_table (warehouse_order_code maps to warehouse.shipping_order_table.code)

Step 4: Verify against real data (deterministic query execution)

sql_query_database(database: "app", sql: "SELECT ... WHERE user_id=12345 AND status='RETURNED'")
→ warehouse_order_code = "SO-2026-00012345"

sql_query_database(database: "warehouse", sql: "SELECT ... WHERE code='SO-2026-00012345'")
→ receive_status = COMPLETE → confirmed by warehouse

The crucial part: the AI built this 4-step flow autonomously. The human only asked the original question. Each step's response carried "look here next" inside it, so the AI could keep composing the next call correctly.

And each step's retrieval is deterministic. The enum definitions for status in warehouse.receipt_record_table are facts pulled from the graph — not values the AI invented. warehouse_order_code = SO-2026-00012345 is real data — not an ID the AI fabricated.
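The "deterministic graph traversal" in Step 3 can be pictured as a breadth-first search over foreign-key edges. This is a sketch of the idea, not the trace_relationships implementation: FkEdge and findPath are hypothetical names, and edges are treated as undirected to mirror direction: "both".

```typescript
type FkEdge = { from: string; to: string };

// Breadth-first search over FK edges; returns the shortest table-to-table path or null.
function findPath(edges: FkEdge[], start: string, goal: string): string[] | null {
  const adj = new Map<string, string[]>();
  const link = (a: string, b: string): void => {
    const list = adj.get(a) ?? [];
    list.push(b);
    adj.set(a, list);
  };
  for (const e of edges) {
    link(e.from, e.to);
    link(e.to, e.from);
  }

  // prev doubles as the visited set; the start node has no predecessor.
  const prev = new Map<string, string | undefined>([[start, undefined]]);
  const queue: string[] = [start];
  while (queue.length > 0) {
    const cur = queue.shift() as string;
    if (cur === goal) {
      const path: string[] = [];
      let n: string | undefined = cur;
      while (n !== undefined) {
        path.unshift(n);
        n = prev.get(n);
      }
      return path;
    }
    for (const next of adj.get(cur) ?? []) {
      if (!prev.has(next)) {
        prev.set(next, cur);
        queue.push(next);
      }
    }
  }
  return null;
}
```

Given the same edge list, the same path comes back every time. When no FK path exists, the answer is null, which is the signal to go back to entry-point search, as the AI did in Step 3.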

This is a different experience from both Vector RAG and classical Graph RAG. Vector RAG is "return all the text in one shot," but hallucinations slip in. Classical Graph RAG is "return the community summary in one shot," but specifics get lost in summarization. Agentic Graph RAG is "fetch as many times as you need, but every fetch returns nothing but facts."

The Same Pattern, Across Many Graphs

The pattern we adopt (human-designed graph + deterministic retrieval tools + responses that double as AI runbooks) isn't limited to DB Graph and Biz Graph. We use it across many MCP servers internally.

Including the ones I mentioned by name in the 17 internal MCP servers post, the lineup looks like this:

Graph	What it covers
DB Graph	991 tables × 15 schemas across the company
Biz Graph	5,000+ initiatives × 4,000+ KPIs
Code Graph	Functions, APIs, events across all repos
Cortex Product Graph	Code + DB + docs + infra unified for the cortex repo
Service Product Graph	API → DB dependencies per service

The structures are all different. DB Graph from ORM parsing. Biz Graph from meeting-slide extraction plus hand-designed MetricDomain. Code Graph from static analysis. Product Graph from JSDoc annotations on top of everything else. Different sources, different assembly.

But the shape from the MCP-tool side is identical:

Entry-point search: vector or substring to find "around here" (the only place fuzziness is allowed)
Detail retrieval: pass an ID, get facts (deterministic)
Relationship traversal: jump from ID to ID along edges (deterministic)
Embed next-step hints in responses: related IDs, enum definitions, annotations, links

This 3+1 template is the universal Agentic Graph RAG shape. Different graph internally, identical surface. From the AI side, they all feel the same — Claude Code uses DB Graph and Code Graph and Product Graph with the same "search → drill down → traverse" rhythm.

Of the graphs above, only DB Graph and Biz Graph have dedicated deep-dive posts so far. Code Graph and the Product Graph family will get their own writeups; for this post, they're listed as fellow examples of the pattern.

A Designer's Checklist

For implementers: below are the six things I always keep top of mind when adapting Agentic Graph RAG to a new domain.

1. Choose the graph-construction method based on the domain

If the domain leans on internal tacit knowledge, humans deciding the nodes and edges produces better results. Sometimes you intentionally design a structure that doesn't exist naturally — Biz Graph's "Week node" and "MetricDomain" are examples. The design is what determines quality.

Conversely, when the domain is mostly public knowledge (papers, news, public docs), having agents automate construction is a strong option (the Neo4j talk lineage). This article assumes the former.

2. Make retrieval deterministic

The entry-point search may use vector similarity (to accept natural-language queries). After that, "get details by ID" and "follow relationships from this ID" must always return definite values via graph traversal. Using similarity here lets hallucination back into the retrieval layer.

3. Tool granularity: search → detail → traverse

Don't pile everything into one giant tool. Split into search-style entry points, detail lookups, and traversal/data tools. The AI understands the difference and uses them appropriately.

4. Tool descriptions are AI runbooks

Write tool descriptions as execution guides for the AI, not human documentation. "If you see this kind of response, call this tool next." "In this situation, format the argument like this." As I mentioned in the Sandbox MCP post, this directly determines how smart the agent appears.
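As an illustration, a runbook-style definition might look like the sketch below. The field names (name, description, inputSchema) follow the MCP tool definition. The description wording is made up for this example and is not our production text, and referencedTools is a hypothetical lint helper.

```typescript
const getTableDetail = {
  name: "get_table_detail",
  description: [
    "Return columns, FK references and enum definitions for one table.",
    "Pass the qualified name exactly as returned by search_tables (schema.table).",
    "If the response lists References, call trace_relationships with the same table_name to follow them.",
    "Read enum meanings from the Enum section; do not guess them from the value names.",
  ].join("\n"),
  inputSchema: {
    type: "object",
    properties: {
      table_name: { type: "string", description: "Qualified name, e.g. warehouse.receipt_record_table" },
    },
    required: ["table_name"],
  },
};

// Which of the registered tool names does a description point the agent to?
function referencedTools(description: string, toolNames: string[]): string[] {
  return toolNames.filter((name) => description.includes(name));
}
```

A check like referencedTools is cheap to run in CI: if a description names a follow-up tool that is no longer registered, the runbook is lying to the agent.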

5. Embed "next move candidates" in responses

Don't just return data. Return:

Related IDs: where to traverse next (FK targets, similar initiatives, parent commits)
Enums and definitions: so the AI can interpret values without guessing
Annotations and warnings: DEAD flags, deprecation marks, PII (personally identifiable information) redaction notes

At a granularity where the AI can read "this is what I should do next" out of the response.

6. Let the AI do the summarization

Don't pre-bake "community summaries" or similar on the server. The AI assembles facts case by case at the right granularity. Return facts. Let the AI interpret.

Limits and Caveats

Heads up. This approach has clear weak spots. If you're considering adopting it, read this section before you start designing.

Agentic Graph RAG is not a silver bullet. To be honest:

Quality depends entirely on graph design. If the schema doesn't carve up the domain correctly, no number of tool calls will reach what you want. And in tacit-knowledge-heavy domains, the call about which nodes/edges to include is one only someone deeply familiar with the domain can make.
If the agent picks the wrong entry, it falls into a deep hole. Miss at the first search_* and the rest of the graph traversal goes sideways. Entry-point quality matters.
Cost is tool-call-count × context length. With 10–20 tool calls per session, tokens add up quickly, because each call's result stays in the context for every later turn. Prompt caching and progress reporting via MCP help, but you have to keep an eye on it.
Hallucination doesn't disappear — it relocates. From the retrieval layer to "entry point selection" and "stop judgment." But it's much narrower territory, so debugging and evals get easier.

The first item is the one designers should worry about most. In tacit-knowledge domains specifically, graphs aren't found — they're designed. I wrote this in the Biz Graph post too, and for these domains I don't think it can be overstated.
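The cost bullet above can be made concrete with a back-of-the-envelope model. The assumptions are loud ones: every turn re-sends the full context, each tool result adds a fixed number of tokens, output tokens are ignored, and cachedRate is a hypothetical billing multiplier for cached prefix tokens (1 means no caching discount). estimateInputTokens is a name invented for this sketch.

```typescript
// Total input tokens billed across one session with `toolCalls` tool calls.
function estimateInputTokens(
  baseTokens: number,
  tokensPerResult: number,
  toolCalls: number,
  cachedRate = 1,
): number {
  let total = 0;
  let context = baseTokens;
  // One model turn before each tool call, plus the final answering turn.
  for (let turn = 0; turn <= toolCalls; turn++) {
    // After the first turn, everything except the newest tool result is a cache hit.
    const fresh = turn === 0 ? context : tokensPerResult;
    const cached = context - fresh;
    total += fresh + cached * cachedRate;
    context += tokensPerResult;
  }
  return total;
}
```

Without caching, the total is (n+1)·B + r·n(n+1)/2 for n calls, base context B and r tokens per result, so it grows quadratically in the number of calls. That is why trimming each tool response matters more than trimming the call count.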

Summary

The three eras of RAG, in one table:

Era	Representative	Retrieval	Orchestration
Early days	Vector RAG	Probabilistic (cosine)	None (one-shot)
Function-calling era	Classical Graph RAG	Pre-summarized	Light, mostly one-shot
Agent era	Agentic Graph RAG	Deterministic (graph traversal)	AI assembles in many steps

Vector RAG made "search and dump some context" work. Classical Graph RAG packaged "follow relationships" into a single-shot lookup. Agentic Graph RAG separates "tools that return only facts, accurately" from "AI agents that orchestrate them in multiple steps."

The graphs we've built internally — DB Graph, Biz Graph, Code Graph, Product Graph family — they're all from the same lineage. The contents and construction differ, but in our domains they all share the same shape: "give Claude Code a human-designed graph through deterministic tools." Which is why, from the AI side, they all feel the same.

If you're building AI-native internal infrastructure, give this perspective a try. Don't hand the AI an answer. Hand it a map. It walks much further than you think.

And the quality of that map comes down to how deeply you understand the domain — at least for the domains where the relevant knowledge sits as tacit understanding inside people's heads. In those domains, the best AI systems are still built by the people who know the problem space best. Domain expertise hasn't lost value in the AI era — it's gained it. That's been my strongest takeaway from two years of building graphs across our company.

Cutting Self-Built MCP Server Token Usage by 90% — The Parking Pattern

Ryosuke Tsuji — Fri, 01 May 2026 01:10:27 +0000

Hi, I'm Ryan, CTO at airCloset.

In my previous posts I introduced the full picture of our 17 internal MCP servers, an MCP server that lets you search 991 internal tables in natural language, a Graph RAG MCP for measuring initiative impact, and the Sandbox MCP that lets non-engineers publish AI-built apps safely.

This time I want to share something that came out of running those in production — a small trick we use to cut token consumption on self-built MCP servers.

The Annoyance: MCPs Eat More Tokens Than You'd Think

The first surprise when extending an AI agent with MCP is that token consumption is higher than expected.

An MCP tool call is, at the end of the day, a JSON-RPC message (sent over HTTP, in the case of remote servers). Both the arguments the AI sends and the result the tool returns land directly in the conversation context. If you implement things naively:

Sending whole files as arguments → thousands of lines of source code stick to the context
Returning all DB query rows → a multi-thousand-row × multi-column table sticks to the context

A single tool call can easily consume tens of thousands of tokens, putting the Claude Code session straight into compaction.

It's worse than just inefficiency: above a certain row count, the response simply fails to come back at all because it exceeds MCP's payload size limit.

When we were ramping up our internal MCP fleet, this little mismatch was reliably making the tool experience worse.

The Pattern: Park the Big Stuff Elsewhere, Pass Only a Key

The fix is embarrassingly simple:

Take the parts that tend to grow and move them off the MCP wire. Pass only a reference key (or URL) through MCP itself.

Both the request side and the response side benefit from the same idea.

Direction	What to remove	Where to park it
Request	Large files / source code	GitHub, Drive, or any object store
Response	Large list data / query results	Spreadsheet / GCS / BigQuery

Two examples from airCloset.

Example 1: Lighter Requests — Sandbox MCP × Self-Hosted Git Server

Last time I wrote about Sandbox MCP, the platform that lets non-engineers publish AI-built apps internally. The first iteration uploaded files entirely through MCP tool calls:

sandbox_write_file(app_name: "todo-app", path: "index.html", content: "<html>...")
sandbox_write_file(app_name: "todo-app", path: "app.js", content: "import ...")
sandbox_publish(app_name: "todo-app")

The moment apps got slightly bigger, this collapsed:

Constant chunking: hitting the payload size limit, the AI looped through "first half of file A → second half → first half of file B → ..."
Tokens going up in flames: full source code landed in the conversation context — a single deploy of a few-thousand-line app could burn tens of thousands of tokens
Retries made it worse: the AI would "verify after sending" by re-reading the same file with sandbox_read_file. Write → read → write loops

So we changed the contract: MCP only returns a URL; the actual content moves over git push.

# 1. MCP returns a git URL — no payload involved
sandbox_init_repo(app_name: "todo-app")
# → https://mcp-sandbox.example.com/git/sandbox/ryan/todo-app.git

# 2. AI runs git in the background — MCP isn't involved
git init && git add . && git commit -m "init"
git remote add sandbox <returned URL>
git push sandbox main

# 3. Only the deploy command goes through MCP
sandbox_publish(app_name: "todo-app")

git push gives us:

No file size limit
Differential transfer — second-time pushes are fast
Source code never lands in the MCP conversation context

From the AI's point of view, it's just "I got handed a git URL; I push to it." Fundamentally different in token economics.

By the way, we don't use GitHub Organizations here. Issuing GitHub seats for every employee wasn't worth the cost or operational overhead, and we already had a self-hosted Git Server on GCE for a different purpose, so we just added one repo (sandbox-apps). The "park" doesn't have to be something you build from scratch.

Example 2: Lighter Responses — DB Graph MCP × Spreadsheet

DB Graph MCP is the MCP that lets us search and query 991 internal tables in natural language.

The annoying-but-common case here is "give me everything"-style queries:

SELECT * FROM service_main.user WHERE created_at >= '2026-01-01'

When the result is several thousand to tens of thousands of rows, you get either:

A multi-million-token response that triggers immediate session compaction
An MCP error because the payload exceeds the size limit

Or both. The "right" AI behavior is to do LIMIT 100 and analyze a sample — but if the user actually wanted the full list as a CSV, that doesn't help them.

So we built an "export to spreadsheet, return only the URL" mode into DB Graph MCP. You can opt in explicitly, but the MCP also auto-falls back to this mode whenever the result exceeds a row-count threshold. Even if the AI forgets to add a LIMIT and the query is about to return 10,000 rows, the server decides "this is too big to return inline," exports to a spreadsheet, and hands back the URL.

// Conceptual call (the real shape is documented in the tool description)
sql_query_database({
  query: "SELECT * FROM ...",
  output: "spreadsheet"  // ← explicit export mode
})

// Without `output`, the server still auto-falls back over a threshold (e.g. 500 rows)
sql_query_database({
  query: "SELECT * FROM ..."
})
// → server detects row count → spreadsheet export + URL response

// Either way, the response shape is the same
{
  url: "https://docs.google.com/spreadsheets/d/{...}/edit",
  rows: 12483,
  columns: ["id", "email", "created_at", ...],
  exported_reason: "row_count_exceeded"  // set on auto-fallback
}

The response is just a URL plus metadata. The real data never enters the context. "Light if you're careful" becomes "light even when you're not" — and that's what makes it feel safe in day-to-day operation.
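The server-side decision can be sketched as below. This is a simplified illustration, not the DB Graph MCP code: buildResponse, the kind discriminator, the inline shape and the synchronous exportRows callback are all assumptions. The 500-row default mirrors the example threshold above.

```typescript
type SheetRow = Record<string, unknown>;

type QueryResponse =
  | { kind: "inline"; rows: SheetRow[] }
  | { kind: "exported"; url: string; rows: number; columns: string[]; exported_reason?: "row_count_exceeded" };

// `exportRows` stands in for whatever writes the spreadsheet and returns its URL.
function buildResponse(
  rows: SheetRow[],
  exportRows: (rows: SheetRow[]) => string,
  opts: { output?: "spreadsheet"; maxInlineRows?: number } = {},
): QueryResponse {
  const limit = opts.maxInlineRows ?? 500;
  const requested = opts.output === "spreadsheet";
  if (!requested && rows.length <= limit) return { kind: "inline", rows };

  const first = rows[0];
  const base = {
    kind: "exported" as const,
    url: exportRows(rows),
    rows: rows.length,
    columns: first ? Object.keys(first) : [],
  };
  // exported_reason is only set when the server fell back on its own.
  return requested ? base : { ...base, exported_reason: "row_count_exceeded" as const };
}
```

The important property is that the threshold check lives on the server. The agent cannot forget it.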

This pattern works because a surprisingly large fraction of real use cases are just "I want this data somewhere I can use it later" — not "let's analyze this in chat with AI." Things like:

Save it to a spreadsheet I can stare at later
Share it with another team
VLOOKUP it against another sheet

For those, MCP's job ends at "write the query, drop the result somewhere." That's enough.

If the user genuinely does want AI-side analysis, you do still need the data in context. The standard workflow becomes a two-step: LIMIT 100 for sample analysis, then output: spreadsheet for the full export once the conclusion is clear.

How Much Did It Save?

Every MCP we run logs every tool call. After rolling these patterns out, total token consumption across all tools dropped 70–90%.

Bonus: Google Workspace OAuth Pairs Beautifully With This

A note on choosing where to "park" data: if your MCP authenticates via Google Workspace OAuth, this whole design becomes much easier.

The reason is that you get two things from a single OAuth flow — two birds with one stone:

Authentication for MCP itself — figuring out who's using the tool
Authorization for Workspace apps — scoped access to Spreadsheet / Drive / Gmail / Calendar

Once the user has logged into the MCP, you don't have to ask for any additional permissions to write to the park location. Which means you can:

Use the operating user's own permissions
To save files to that user's My Drive
Without the MCP itself owning a write-anywhere service account

Files end up in the user's drive, not on a shared service account. "Accidentally world-readable" or "visible to people who shouldn't see it" stops being a realistic accident — it's structurally prevented.

You also dodge the operational cost of issuing a separate GCP service account, storing its key safely, and managing its IAM policy out of band. The safety property genuinely comes for free.

There's one catch though:

The AI agent has to be able to read the spreadsheet URL it got back.

Returning a URL alone doesn't help the AI access the underlying data. Stock tooling in Claude Code can't read a Spreadsheet directly, so you need a separate Workspace-operating MCP.

At airCloset we run a dedicated MCP that wraps the Google Workspace APIs (Drive / Sheets / Gmail / Calendar). Combined with the export pattern above, it gives us a clean flow: "drop results into a spreadsheet → call into the Workspace MCP later if the AI wants to actually read them."

DB Graph MCP → exports to Spreadsheet → returns URL
                                          ↓
              Workspace MCP ← invoked when the AI decides it needs to read the data

From the user's side, this naturally produces the rhythm of "dump it into a spreadsheet first, ask AI to analyze only when needed."

Wrap-Up

A few small tricks for keeping self-built MCP server token consumption under control:

Move the parts that tend to grow off the MCP wire
Park them somewhere — Git server, Spreadsheet, GCS — and only pass keys/URLs through MCP
Pick a park that pairs well with Google Workspace OAuth — you get safety almost for free
If you want the AI to read parked data later, run a Workspace-style MCP alongside

It's an unflashy design move, but the difference in MCP usability before and after is dramatic.

If you're running self-built MCP servers internally and feeling the token squeeze, give it a try.

Bridging 'I Want to Build' and 'I Want to Publish Safely' for Non-Engineers — Sandbox MCP

Ryosuke Tsuji — Mon, 27 Apr 2026 23:04:57 +0000

Hi, I'm Ryan, CTO at airCloset.

In my previous posts, I've introduced our internal MCP servers: an MCP server for natural-language search across all our databases, the full picture of our 17 internal MCP servers, and a custom Graph RAG that lets AI answer "Did that initiative actually work?".

This time I'm covering something a bit different: Sandbox MCP — a platform that lets non-engineer employees deploy apps they built with AI to a safe, internal-only URL with a single command.

The pitch is simple: "If Claude Code can build an app, why not publish it directly?" The hard part is making "directly" mean safely.

The Problem: Building Got Easy. Publishing Safely Did Not.

The arrival of Claude Code and other AI coding agents is reshaping how work happens inside our company.

"Building an app" used to be an engineer's job. You had to do requirements, design, frontend, backend, database, CI/CD, production deploy — all in one head.

Now PMs, designers, and customer-success folks are talking to Claude Code with "build me a screen that does X" and getting working mockups on the spot. Inside airCloset we're seeing more and more:

Mockups for new project proposals
Interactive reports that visualize research findings
KPI dashboards used only by a single team
Small tools for everyday operational improvements

These non-engineer outputs are growing fast. People are even saying "let's just run with this in production for a bit."

That's where the wall hits.

Easy to Build. Hard to Publish Safely.

Anyone can build something that runs locally now. Spin up python -m http.server 8000, view it on your Mac — five minutes max.

But the moment it becomes "I want my team to see this" or "I want others to actually use it," the difficulty curve goes vertical.

Where do you run it? Cloud means GCP/AWS accounts, IAM, billing.
What URL? Domain registration, DNS, SSL certificates, Cloudflare.
What about auth? If it touches confidential info, you need employees-only. OAuth implementation, domain restriction.
And the data? Is localStorage enough, or do you need a real DB? If a DB, who manages the password?
How do you deploy? Can you write a Dockerfile? Cloud Run config, env vars, service accounts, IAM.
What about security? What if the AI-written code has a vulnerability? An auth bypass?

You could "let the AI write all of it." But the result is left to the AI. Cloudflare misconfigured and exposed to the world. Auth bypassed. A service account with production database write access slipped into the code. The more code AI writes, the higher the risk of these accidents.

When a non-engineer says "I want to try building this," we need to clearly separate what the builder is responsible for from what the platform must guarantee by default.

There's also a quieter problem.

UI Inconsistency and Data Sprawl

When non-engineers build apps independently:

One person uses React, another Vue, another raw HTML
Buttons look and behave differently
Some store data in localStorage, some in Google Sheets, some in Firebase

After 10 or 20 such apps, internal tooling becomes chaos. Users wonder "wait, who built this one?" and "why does this button work differently?"

Even for internal tools, you need a baseline of consistency — both in design and in where data lives.

Sandbox MCP — Standing Between "Build" and "Publish"

That's why we built Sandbox MCP.

A non-engineer just says "build this" to Claude Code, and:

An app is generated using a unified UI Kit
They can verify it works locally
A single command deploys it to https://sbx-{nickname}--{app-name}.example.com/
Self-hosted OAuth on the Cloudflare Worker enforces internal-only access
Data is stored, isolated, in a dedicated Firestore database

All of this completes within a single chat session with the AI.

The builder is only responsible for functionality. Security, data isolation, domain & SSL, and authentication are all handled by the Sandbox MCP platform by default.

Scale

Resource	Details
MCP tools	10 (publish, status, schedule, list, delete, write_file, read_file, list_files, init_repo, unschedule)
Supported runtimes	Python (Flask + gunicorn), Node.js, static HTML/SPA, custom Dockerfile
URL	`sbx-{nickname}--{app-name}.example.com` (covered by Universal SSL, no ACM)
Authentication	Self-hosted OAuth on a Cloudflare Worker (Google Workspace)
Data	Firestore named DB `sandbox`, namespaced per nickname × app
Infrastructure	Self-hosted Git Server (GCE) + Cloud Run + Cloudflare Worker + KV
Deploy time	Typically 2–5 minutes (git push to public URL)

Let's walk through the internals.

What It Does — Web, API, DB, and Cron

Sandbox MCP supports four app shapes so it can cover almost any "I want to ship something internally" use case.

Type	Detected by	Use cases
Python	`.py` files present	Flask + gunicorn for APIs, analysis tools with a UI
Node.js	`package.json` present	Express APIs + UI; Bun also works
Static HTML/SPA	only `.html` files (no Python/Node)	nginx-served, React/Vue dist supported
Custom	includes a `Dockerfile`	Any runtime — Go, Rust, Bun, anything

Pick any of these and sandbox_publish deploys it with no extra config.
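The detection rules in the table can be sketched as one small function. The precedence is partly an assumption: the table doesn't say which rule wins when several match, so this sketch lets a Dockerfile win, then checks Python, Node.js and static HTML in that order. detectRuntime is a hypothetical name.

```typescript
type Runtime = "custom" | "python" | "node" | "static";

// Classify an app from its file names; returns null when nothing matches.
function detectRuntime(files: string[]): Runtime | null {
  const names = new Set(files);
  if (names.has("Dockerfile")) return "custom";
  if (files.some((f) => f.endsWith(".py"))) return "python";
  if (names.has("package.json")) return "node";
  if (files.some((f) => f.endsWith(".html"))) return "static";
  return null;
}
```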

There's also sandbox_schedule for scheduled batch apps via Cloud Scheduler. Things like "post a risk summary to Slack at 9 AM every morning" become one-line cron setups.

sandbox_schedule(
  app_name: "risk-alert",
  schedule: "0 9 * * *",
  path: "/api/cron",
  timezone: "Asia/Tokyo"
)

Cloud Scheduler now hits the app's /api/cron every morning at 9. No need to open the scheduler UI or translate cron syntax into IaC.

Frontend — Unified Design via sandbox-ui-kit

Even apps built by non-engineers should feel consistent as a tool family. That's the job of the sandbox-ui-kit repo.

It lives on mcp-sandbox.example.com/git and provides:

File	Contents
`sandbox-ui.css`	Design tokens + glass-morphism component styles (dark/light)
`sandbox-ui.js`	Theme switcher, modals, toasts, generic JS utilities
`sandbox-db.js`	SandboxDB client SDK (more below)
`index.html`	Storybook-style component catalog
`README.md`	Full API documentation

The key: it's designed for AI to read and use.

The sandbox_publish tool description literally says:

When building an app, first read README.md with read_file and use the UI Kit.

When Claude Code builds a new app, it read_files this README, learns which CSS/JS to load and which component names to use, then generates code accordingly. Instead of a human walking the AI through UI guidelines, we centralized the "how to use" in one place targeted at the AI.

The result: apps built by anyone (with AI) end up with consistent buttons, modals, and forms.

Backend — Auto-Generated Dockerfile + Cloud Run

"I don't want to write Docker." "I don't want to think about runtime configuration." Classic non-engineer requests.

Sandbox MCP inspects the source files and generates a Dockerfile automatically.

// apps/mcp/git-server/src/sandbox/tools.ts
if (hasPy) {
  dockerfile = generatePythonDockerfile(hasRequirements);
  // Auto-create requirements.txt if missing
  if (!hasRequirements) {
    await writeFile('requirements.txt', 'flask\ngunicorn\n');
  }
} else if (hasPackageJson) {
  dockerfile = generateNodeDockerfile(true);
} else if (hasHtml) {
  dockerfile = generateStaticDockerfile();
}

For example, a Python app gets:

FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENV PORT=8080
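# Shell form, so that $(...) is expanded when the container starts
# (the JSON-array exec form would pass the string literally)
CMD python -u "$(ls *.py | head -1)"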

If requirements.txt is missing, flask + gunicorn get added automatically. AI can write from flask import Flask and the dependencies will resolve — no missing-package surprises.

Deployment uses gcloud run deploy --source, with Cloud Build handling the image build. App authors can write a Dockerfile, but they don't have to. Without one they get the standard image; with one they can customize. That makes it friendly to both non-engineers and engineers.

Database — Transparent Fallback Between localStorage and Firestore

"I want to save data. I don't want to set up a database."

The SandboxDB SDK handles that. The same code uses localStorage locally and Firestore once deployed.

<script src="https://mcp-sandbox.example.com/api/db/sdk.js"></script>
<script type="module">
  const db = new SandboxDB({ token: googleOAuthAccessToken });

  // Save (storage location auto-detected from hostname)
  const { id } = await db.collection('items').add({ name: 'test' });

  // List
  const items = await db.collection('items').get();

  // Get / update / delete
  await db.collection('items').doc(id).update({ name: 'updated' });
  await db.collection('items').doc(id).delete();
</script>

The SDK internals:

this._isLocal = location.hostname === 'localhost'
              || location.hostname === '127.0.0.1';

async add(data) {
  if (this._db._isLocal) return this._localAdd(data);  // localStorage
  return this._req('', 'POST', data);                  // Firestore REST API
}

When running on localhost, it uses localStorage. The moment it's deployed under sbx-*.example.com, it switches to Firestore. No code changes required.
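
To make the fallback concrete, here is a minimal sketch of the local half, assuming a storage object with the same getItem/setItem shape as localStorage. KeyValueStore, pickBackend, and LocalCollection are hypothetical names; the real SDK internals beyond the excerpt above are not shown in this article.

```typescript
// Hypothetical sketch of hostname-based backend selection.
// KeyValueStore stands in for window.localStorage so the sketch runs anywhere.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

type Backend = "local" | "remote";

function pickBackend(hostname: string): Backend {
  return hostname === "localhost" || hostname === "127.0.0.1" ? "local" : "remote";
}

class LocalCollection {
  private store: KeyValueStore;
  private name: string;

  constructor(store: KeyValueStore, name: string) {
    this.store = store;
    this.name = name;
  }

  // Each collection is stored as one JSON array under its own key.
  private load(): Record<string, unknown>[] {
    return JSON.parse(this.store.getItem(this.name) ?? "[]");
  }

  add(data: Record<string, unknown>): { id: string } {
    const docs = this.load();
    // Assumption: sequential ids are good enough for local development.
    const id = `local-${docs.length + 1}`;
    docs.push({ ...data, id });
    this.store.setItem(this.name, JSON.stringify(docs));
    return { id };
  }

  get(): Record<string, unknown>[] {
    return this.load();
  }
}
```

In a browser, passing window.localStorage as the store would give the local behavior; the remote branch would issue HTTP requests instead.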

This dramatically improves the experience of building apps with AI:

Local: no network, no auth, all features work
Deployed: same code runs, data is properly persisted
Development data never leaks into systems outside Sandbox (it physically can't reach them)

Firestore Namespace Isolation

Once deployed, data paths are strictly isolated:

sandbox_data/{nickname}--{app}/{collection}/{docId}

nickname: user identifier resolved via OAuth
app: Sandbox app name
_createdAt / _updatedAt: auto-attached by the SDK

Data from different apps is physically unreachable from each other. Even apps built by the same person live in different paths.

The most important point: we use a dedicated named database called sandbox. It's a completely separate Firestore database from the (default) DB used by other internal systems. No matter how badly an app's code misbehaves, it can never touch data outside Sandbox.

Infrastructure — Wildcard DNS + Cloudflare Worker + Self-Hosted Git Server

Now for the infrastructure highlights.

How URLs Are Determined

The public URL takes the form:

https://sbx-{nickname}--{app-name}.example.com/

nickname is automatically pulled from the MCP OAuth session. When a user logs into Sandbox MCP via Google, the email is looked up in a Firestore users collection to resolve the nickname. Users never have to repeat "I am ryan" each time.

r.tsuji@air-closet.com → users[r.tsuji@air-closet.com].nickname → "ryan"
                                                       ↓
                                  sbx-ryan--todo-app.example.com

Note: The users collection is kept in sync from a separate internal pipeline (a daily batch that pulls from our HR system and Google Workspace directory). Sandbox MCP just reads from it — no need to maintain its own employee master.

The benefit: you can tell whose app it is just by reading the URL. When someone says "go look at ryan's todo-app," reading the URL aloud naturally communicates ownership.

Instant Publishing via Cloudflare Worker

Normally, publishing a new subdomain requires:

Adding A/CNAME DNS records
Issuing an SSL certificate (15–30 minute wait with ACM or Let's Encrypt)
Configuring a load balancer or DomainMapping

Sandbox MCP skips all of this with a Cloudflare Edge Router Worker.

DNS is fixed as *.example.com wildcard + Cloudflare proxy, with Universal SSL automatically covering every subdomain. The Cloudflare Worker receives all *.example.com/* traffic and routes by subdomain.

The logic is three-tier:

// apps/worker/edge-router/src/index.ts
export async function handleRequest(request, env) {
  const url = new URL(request.url);

  // ① sbx-* prefix → Sandbox routing
  const sandboxSub = extractSandboxSubdomain(url.hostname);
  if (sandboxSub !== null) {
    return handleSandboxRequest(request, url, sandboxSub, env);
  }

  // ② KV route:{subdomain} registered → Cloud Run proxy
  const subdomain = extractSubdomain(url.hostname);
  if (subdomain) {
    const proxyResponse = await handleCloudRunProxy(request, url, subdomain, env);
    if (proxyResponse) return proxyResponse;
  }

  // ③ Otherwise → fetch(request) passthrough
  return fetch(request);
}

When sandbox_publish finishes, all it does is write a route:{nickname}/{app} key into Cloudflare KV. That single write makes the new subdomain routable instantly.

await kvPut(`route:${nickname}/${appName}`, serviceUrl);

No DNS setup. No waiting for SSL issuance. No IaC deploy. Everything completes within the MCP tool execution.
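
The hostname parsing behind step ① is not shown above. A minimal sketch, assuming the sbx-{nickname}--{app-name} format and a fixed base domain, could look like this. parseSandboxHost, routeKey, and BASE_DOMAIN are hypothetical names, and splitting on the first "--" assumes nicknames never contain a double hyphen.

```typescript
// Hypothetical sketch: parse "sbx-{nickname}--{app}.example.com" into its parts.
const BASE_DOMAIN = "example.com"; // assumption for illustration

interface SandboxTarget {
  nickname: string;
  app: string;
}

function parseSandboxHost(hostname: string): SandboxTarget | null {
  const suffix = `.${BASE_DOMAIN}`;
  if (!hostname.endsWith(suffix)) return null;
  const label = hostname.slice(0, -suffix.length);
  // Only a single-level "sbx-" label counts as a Sandbox subdomain.
  if (!label.startsWith("sbx-") || label.includes(".")) return null;
  const rest = label.slice("sbx-".length);
  const sep = rest.indexOf("--");
  // Both sides of the separator must be non-empty.
  if (sep <= 0 || sep + 2 >= rest.length) return null;
  return { nickname: rest.slice(0, sep), app: rest.slice(sep + 2) };
}

// Builds the KV key format mentioned above: route:{nickname}/{app}.
function routeKey(target: SandboxTarget): string {
  return `route:${target.nickname}/${target.app}`;
}
```

Returning null for anything that is not a well-formed Sandbox hostname lets the caller fall through to the next routing tier.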

Self-Hosted Git Server for Larger Apps

This setup actually started out without git at all.

Since the primary users were going to be PMs and CS folks, we figured "git concepts are too high a bar — let's keep everything inside MCP tools." Write files via sandbox_write_file, deploy via sandbox_publish. That should be enough, we thought.

The approach hit two walls quickly.

Wall 1: Constant chunking

MCP tool calls travel over HTTP, with a payload size limit. React/Vue build bundles, SPAs with images, business tools with dozens of files — they don't fit in a single call. We added an append mode to sandbox_write_file for chunking, but every "first half of file A → second half of file A → first half of file B → ..." sequence triggered error recovery and retries. Deployments became flaky.

Wall 2: Massive token consumption

This was the real killer. When you tell the AI "deploy this app," it sends the entire source as MCP tool arguments. The file contents land in the conversation context, and a few-thousand-line app burns through tokens fast. A single deploy easily consumed tens of thousands of tokens, and Claude Code sessions hit compaction quickly.

Worse, the AI tends to "verify after sending" — re-reading the same file via sandbox_read_file. Write → read → write loops, with tokens going up in flames.

So we pivoted to using git push as well. With git push:

No file size limit
Differential transfer — second-time pushes are fast
Source code stays out of the MCP conversation context (no AI tokens consumed)

We never expected business-side employees to run git push by hand. But if Claude Code runs git commands in the background, it's not a barrier. The user just says "build this and publish it" — the AI runs git init && git push on its own when needed.

Why a Self-Hosted Git Server?

Once we adopted git push, the next question was: where do we host the repos? We considered using GitHub Organizations but ruled it out.

Issuing and managing GitHub accounts for every employee — including non-engineers — wasn't worth the cost or the operational overhead. Paying for a GitHub seat just to ship one app is overkill.

Fortunately, we already operated a self-hosted Git Server on GCE for a different purpose: hosting an internal "read-only Git MCP for code investigation." A VM with repositories cloned under /mnt/repos/.

We just added a Git Smart HTTP Protocol endpoint and one new repo (sandbox-apps) to it. The VM was already running, so the marginal cost was near zero. Authentication piggybacks on the existing Google OAuth setup. Repository management is just OS directory operations. Borrowing space on the existing internal Git Server was vastly simpler than spinning up new infrastructure.

Actual Usage Flow

# 1. Get the git URL from the MCP tool (nickname is automatic)
sandbox_init_repo(app_name: "my-app")
# → https://mcp-sandbox.example.com/git/sandbox/ryan/my-app.git

# 2. Local commit (the AI does this in the background)
cd ~/my-app/
git init && git add . && git commit -m "init"
git remote add sandbox <returned URL>

# 3. Push
git push sandbox main
# Username: oauth2accesstoken
# Password: $(gcloud auth print-access-token)

# 4. Deploy
sandbox_publish(app_name: "my-app", description: "...")

Auth uses a Google OAuth token as the Basic Auth password (same pattern as GCP Source Repos). Only @air-closet.com accounts pass. No GitHub account required — any employee can push.

The remote repo is configured with receive.denyCurrentBranch=updateInstead, so the working tree updates server-side on push. Cloud Run uses that directory as --source, so there's no extra step between push and publish.

For small apps (a few files, hundreds of lines each), sandbox_write_file still works fine. Switch between MCP-only and git push depending on app size.

Security — Four Independent Gates

That covered the "convenient to build" side. Now the "safe to publish" side.

As I noted at the start, exposing AI-generated code in front of users is risky. So Sandbox MCP layers four independent safety mechanisms that don't depend on the app's own implementation.

① Public-Facing Gate — Self-Hosted OAuth on the Cloudflare Worker

sbx-*.example.com sits behind a self-hosted OAuth gate built into the same Cloudflare Worker that handles routing. When someone visits, the Worker first checks the cortex_session cookie; if it's missing or invalid, it redirects to a Google Workspace SSO entry point (auth.example.com/__edge/auth/start). Without an @air-closet.com account, requests never reach Cloud Run.

This is independent of the app's implementation. Even if the AI didn't write a single line of auth code, the Worker stops the request first. "Accidentally public" is physically impossible.
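
The gate decision itself is small. The sketch below is an illustration under stated assumptions, not the real Worker code: the cookie name and start URL come from the text above, while SessionLookup, the GateDecision shape, and the return_to query parameter are hypothetical.

```typescript
// Hypothetical sketch of the gate decision made before proxying to Cloud Run.
// SessionLookup resolves a session id to an email, or null if invalid/expired.
type SessionLookup = (sessionId: string) => string | null;

type GateDecision =
  | { kind: "allow"; email: string }
  | { kind: "redirect"; location: string };

const AUTH_START = "https://auth.example.com/__edge/auth/start";

function readCookie(header: string | null, name: string): string | null {
  if (!header) return null;
  for (const part of header.split(";")) {
    const [key, ...value] = part.trim().split("=");
    if (key === name) return value.join("=");
  }
  return null;
}

function gate(cookieHeader: string | null, requestUrl: string, lookup: SessionLookup): GateDecision {
  const sessionId = readCookie(cookieHeader, "cortex_session");
  const email = sessionId ? lookup(sessionId) : null;
  if (email && email.endsWith("@air-closet.com")) {
    return { kind: "allow", email };
  }
  // Assumption: the original URL is passed along so the user returns after login.
  const returnTo = encodeURIComponent(requestUrl);
  return { kind: "redirect", location: `${AUTH_START}?return_to=${returnTo}` };
}
```

Because the decision runs before any request reaches the app, it holds even when the app contains no auth code at all.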

Why we migrated from ZeroTrust Access to self-hosted OAuth

The first iteration used Cloudflare ZeroTrust Access. You just configure the @air-closet.com domain restriction in the Cloudflare dashboard and you're done — no auth code at all. As a starting point it was ideal.

The catch: ZeroTrust's free tier caps at 50 users. As headcount grew and Sandbox MCP usage spread, we approached the cap, and switching to pay-as-you-go (~$7/user/month) wasn't trivially cheap. On top of that we wanted to share the same auth foundation with internal apps in production (KPI dashboards, inventory tools, etc.), so we decided to consolidate everything into a self-hosted OAuth with no user limit.

Conveniently, the Cloudflare Worker already in front of every *.example.com request — the routing layer Sandbox MCP relies on — was perfectly positioned for this. A small extension gave us:

auth.example.com/__edge/auth/start to kick off Google OAuth 2.0
auth.example.com/__edge/auth/callback to exchange tokens, persist the session in Upstash Redis, and issue a cortex_session cookie scoped to Domain=.example.com
Worker-level gating for sandbox + internal-app subdomains, injecting X-Cortex-User-Email and friends into the Cloud Run request when authenticated

All of this fits inside the existing Worker — no extra Cloud Run, no extra VM. Workers do have a CPU-time budget, but OAuth flows and cookie checks complete in single-digit milliseconds, so latency is indistinguishable from ZeroTrust.

Net result: the user cap is gone, anyone with @air-closet.com can use Sandbox out of the box, and the auth implementation is fully visible in our own codebase.

② Deploy Gate — MCP OAuth

Operations like sandbox_publish and sandbox_delete enforce Google OAuth on the MCP server side. Sandbox MCP implements RFC 8414 (/.well-known/oauth-authorization-server), so Claude Code runs the OAuth flow automatically on first connection.

The strongest guarantee is "you can't accidentally update or delete someone else's app."

When multiple people share a Sandbox MCP, an AI accident like "wait, I overwrote a coworker's app while updating mine" would be devastating. To prevent that, the AI doesn't get to decide whose app is being touched. The server injects nickname automatically from the OAuth session.

// Strip the `nickname` property from the MCP tool schema and have
// the server force-inject the logged-in user's nickname.
function injectNickname(tool: McpTool, userNickname?: string): McpTool {
  const { nickname: _, ...restProperties } = tool.schema.inputSchema.properties;
  return {
    schema: { ...tool.schema, inputSchema: { ...tool.schema.inputSchema, properties: restProperties } },
    execute: (args, ctx) => tool.execute({ ...args, nickname: userNickname }, ctx),
  };
}

From the AI's perspective, the nickname input doesn't exist. Even with a prompt injection like "delete ryan's app," there's no mechanism to do so. "You can only touch your own apps" is enforced at the API spec level.

On top of that, inputs are validated strictly against /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/, rejecting shell-injection and path-traversal patterns (.., /).
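
For illustration, the pattern above can be wrapped in a small validator. The function name isValidName is hypothetical; the regular expression is the one quoted in the text.

```typescript
// The pattern from the text: lowercase alphanumerics and hyphens, 1 to 63
// characters, never starting or ending with a hyphen.
const NAME_PATTERN = /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/;

// Hypothetical helper: true only for names safe to use in paths and hostnames.
function isValidName(value: string): boolean {
  return NAME_PATTERN.test(value);
}
```

Because dots, slashes, spaces, and shell metacharacters are all outside the allowed character set, inputs such as "../etc" or "app; rm -rf /" fail the check before they reach any file or shell operation.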

③ Data Gate — SandboxDB Namespace Isolation

As mentioned earlier, data lives at:

sandbox_data/{nickname}--{app}/...

Per request, the SandboxDB API resolves the path server-side:

Browser (OAuth): resolve email → users → nickname, take app from the Origin header
Backend (SA token): take nickname/app from the X-Sandbox-App header (required — missing returns 400)

The client cannot spoof the path.
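
A sketch of the backend branch, under loud assumptions: the article does not show the header's value format, so this example assumes X-Sandbox-App carries "{nickname}/{app}" and that header names arrive lowercased. resolveBackendNamespace and the Resolved type are hypothetical, and token verification is out of scope here.

```typescript
// Hypothetical sketch of the backend (service account) branch only.
// Assumption: the header value looks like "{nickname}/{app}".
type Resolved =
  | { ok: true; basePath: string }
  | { ok: false; status: number; error: string };

const SEGMENT = /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/;

function resolveBackendNamespace(headers: Record<string, string | undefined>): Resolved {
  const raw = headers["x-sandbox-app"];
  if (!raw) {
    return { ok: false, status: 400, error: "X-Sandbox-App header is required" };
  }
  const parts = raw.split("/");
  // Exactly two segments, each passing the same strict name pattern.
  if (parts.length !== 2 || !parts.every((p) => SEGMENT.test(p))) {
    return { ok: false, status: 400, error: "malformed X-Sandbox-App header" };
  }
  const [nickname, app] = parts;
  return { ok: true, basePath: `sandbox_data/${nickname}--${app}` };
}
```

The collection and document id from the request are then appended to basePath, so a caller never supplies the namespace portion of the path directly.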

We deliberately do not use the K-Service header (the Cloud Run-injected service name). That's a client-spoofable header, and another implementation that relied on it had a "read another app's data" vulnerability disclosed. Requiring X-Sandbox-App keeps the only valid route through an explicitly server-validated path.

The clincher: a dedicated named database for Sandbox. Instead of the (default) DB (which contains data from other systems), we use an independent Firestore database called sandbox, and the Cloud Run SA gets an IAM Condition that allows access only to the sandbox DB.

// From infra/mcp/git-server/index.ts
// IAM Condition on roles/datastore.user:
//   resource.name == "projects/.../databases/sandbox" ||
//   resource.name.startsWith("projects/.../databases/sandbox/")

No matter how badly the AI-written code goes wrong, it physically cannot reach data outside Sandbox.

④ Execution Gate — Cloud Run SA + IAM

All sandbox-* Cloud Run services run under a single shared SA (e.g. sandbox-run). The permissions on that SA are minimal.

roles/logging.logWriter (write its own logs)
roles/bigquery.jobUser + bigquery.dataViewer scoped to the sandbox_logs dataset only (its own access logs, nothing else)
roles/datastore.user (IAM Condition limiting to sandbox DB)

What it does not have:

Access to the (default) Firestore that holds data from other systems
Access to BigQuery datasets used by other internal systems
Direct access to Secret Manager
Permission to manage other Cloud Run services

In other words, even if a Sandbox app goes completely rogue, the blast radius is limited to sandbox_data and sandbox_logs. Nothing outside Sandbox is affected.

Logging — Apps Can Query Their Own Access Logs

Sandbox apps eventually want to look at logs too. "How many views did this page get?" "Who hit that error?"

We forward Cloud Run request logs to BigQuery via a Logging Sink:

// From infra/mcp/git-server/index.ts
const sandboxLogSink = new gcp.logging.ProjectSink('sandbox-logs-sink', {
  destination: `bigquery.googleapis.com/projects/${projectId}/datasets/sandbox_logs`,
  filter: [
    'resource.type="cloud_run_revision"',
    'resource.labels.service_name:"sandbox-"',
    'logName:"run.googleapis.com%2Frequests"',
  ].join(' AND '),
  bigqueryOptions: { usePartitionedTables: true },
});

The sandbox_logs dataset is locked down with project-owner-only ACLs (it contains PII like remoteIp and User-Agent), and the Sandbox SA gets a tightly scoped bigquery.dataViewer to it.

This lets apps query their own access logs from BigQuery. "Post last week's user count for this app to Slack" can be done entirely inside Sandbox.

Tool Design — Making AI Use Tools Correctly

Let me close with a note on tool definitions. I personally think this is what makes or breaks an MCP design.

Sandbox MCP exposes 10 tools:

Tool	Purpose
`sandbox_publish`	Start deploy (async)
`sandbox_deploy_status`	Check deploy status
`sandbox_init_repo`	Initialize git push repo
`sandbox_write_file`	Write file (overwrite/append)
`sandbox_list`	List apps
`sandbox_delete`	Delete app
`sandbox_schedule`	Configure Cloud Scheduler
`sandbox_unschedule`	Remove Cloud Scheduler
`sandbox_read_file`	Read source code
`sandbox_list_files`	List files

Whether the AI picks the right tool at the right moment is almost entirely determined by what's written in the tool description.

For example, the description for sandbox_publish covers not just functionality but also:

Supported app types and required files (Python / Node.js / static HTML / custom)
Startup command and PORT requirement per type
When to use write_file vs git push
How to use SandboxDB (with SDK code samples)
How to use the UI Kit (explicit instruction to fetch README.md via read_file)

With this in place, the AI can autonomously do:

User says "build me a tool that displays Slack emoji scores"
→ Reads sandbox_publish description and sees "first read the UI Kit README"
→ Calls read_file on sandbox-ui-kit/README.md
→ Generates HTML/CSS/JS following the guidelines
→ Sees the SandboxDB SDK usage in the description and integrates persistence
→ Calls sandbox_publish

— without asking the user a single follow-up question. Writing not just "what it does" but "what to do with it" into the tool definition is the secret to AI-friendly design.

If you write tool definitions tersely, the AI keeps coming back asking "what should I do next?" The description is less of a human-facing doc and more of an AI-facing runbook. That framing helps a lot.
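
To show what "description as runbook" can look like, here is an illustrative tool definition. The wording of the description, the ToolDefinition interface, and the input schema are assumptions for the sake of the example; the real sandbox_publish description is not reproduced in this article.

```typescript
// Hypothetical tool definition. The description text is illustrative only.
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, { type: string; description?: string }>;
    required: string[];
  };
}

const sandboxPublish: ToolDefinition = {
  name: "sandbox_publish",
  // Each line tells the AI what to do next, not just what the tool does.
  description: [
    "Start an async deploy of a Sandbox app. Poll sandbox_deploy_status for the result.",
    "When building an app, first read README.md with read_file and use the UI Kit.",
    "Supported app types: Python, Node.js, static HTML, or a custom Dockerfile.",
    "The app must listen on the port given by the PORT environment variable.",
    "Small apps: upload with sandbox_write_file. Larger apps: sandbox_init_repo, then git push.",
  ].join("\n"),
  inputSchema: {
    type: "object",
    properties: {
      app_name: { type: "string", description: "Lowercase app name" },
      description: { type: "string", description: "Short summary of the app" },
    },
    required: ["app_name"],
  },
};
```

Note that nickname is absent from the schema, consistent with the server-side injection described earlier.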

Wrap-Up

Sandbox MCP exists to answer two challenges of building internal tools in the AI era:

Building is now possible for anyone, thanks to AI
Publishing safely remains hard

To close that gap, we:

Standardized every layer on the platform side: frontend / backend / DB / infra / auth / domain / SSL
Embedded a runbook into tool descriptions so the AI naturally uses things correctly
Layered four access gates (Worker-level OAuth / MCP OAuth / namespace isolation / IAM) so safety doesn't depend on the implementation being correct

Building this, what struck me again is that the role of platforms in an AI-powered development era is shifting. Platforms used to optimize for "easy for humans." Now they also need to optimize for "used correctly by AI." Tool descriptions are AI-facing docs, and safety must be designed assuming AI will write incorrect code.

At the same time, by limiting what the builder is responsible for, we drastically lower the barrier to "let me just try something." That's the entry point that turns a non-engineer's "I want to build this" into actual operational improvements.

I hope this is useful for anyone designing internal platforms.

Still Measuring Initiative Impact Manually? How We Used Graph RAG + MCP to Make It Explorable

Ryosuke Tsuji — Mon, 20 Apr 2026 15:27:35 +0000

Hi, I'm Ryan, CTO at airCloset.

In my previous posts, I introduced an MCP server that lets you search all company databases in natural language and showed the full picture of our 17 internal MCP servers. This time, I'm diving deep into what I briefly mentioned as "Biz Graph."

This is the story of how we represented the relationship between business initiatives and KPIs as a graph structure, enabling AI to answer "Did that initiative actually work?"

Why Graph RAG?

To get more value from AI, what matters is not just feeding it data — it's conveying the relationships between data.

If your data volume is small enough, tools like NotebookLM can deliver great results. But you can't fit all your business data into a context window. Initiative reports, KPI spreadsheets, marketing weekly reports, logistics daily metrics — you simply cannot dump all of that into a prompt.

That's why I believe the best available option right now is Graph RAG: making the right data searchable at any time, along with its relationships. When AI is asked "What metrics are related to this initiative?", it can traverse the graph and extract only the information it needs — because that structure was built in advance.

But there's a catch.

Making Non-Graph Data Into a Graph

Many of you have heard of "knowledge graphs" and "GraphRAG." But when you actually try to build one, most people hit the same wall:

Business data doesn't naturally form a graph.

With our DB Graph project, things were different. Tables had foreign keys. ORMs had @JoinColumn and belongsTo. Relationships already existed in the data — we just had to parse and convert them.

But the relationship between "initiatives" and "KPIs" has none of that.

A meeting slide says "SNS ad campaign launched"
A spreadsheet records "This week's new members: 1,234"
There's no FK between these. No join key.

"The SNS campaign affected new member signups" — that relationship exists only in someone's head. It's nowhere in the spreadsheet.

This is what "business data doesn't form a graph" means. The relationships between entities aren't self-evident — you have to design the graph structure itself.

The Problem: "Did That Initiative Actually Work?"

Every week, our company reports initiative progress in all-hands meetings and group-level standups.

"We launched the spring SNS ad campaign"
"We improved the recommendation engine"
"We're raising our CS SLA achievement rate"

— Dozens of initiatives reported weekly. Hundreds per year. Over 5,000 total.

Meanwhile, a separate spreadsheet tracks 200+ metrics daily and weekly: member count, new signups, retention rate, satisfaction scores, acquisition CPA...

The problem: these two worlds are completely disconnected.

"How much did last month's SNS campaign contribute to new member acquisition?"

Answering this requires:

Confirm the initiative's execution period (which slide was that again?)
Find KPI data for that period (which sheet, which tab?)
Align timeframes and compare numbers (week-over-week? month-over-month? year-over-year?)
Check if other initiatives were running simultaneously (confounding factors?)

This manual analysis takes 30-60 minutes, and it has to be repeated every week for multiple initiatives. Realistically, most initiative effectiveness reviews end with "it probably worked, I think."

Biz Graph: The Big Picture

We built Biz Graph to solve this.

Scale

Note: The numbers below differ from actual values but convey the order of magnitude. In any case, this is far too much data to fit in an LLM's context window.

Resource	Count
Nodes	~10,000 (14 types)
Edges	~71,000 (22 types)
Initiatives	~5,000
KPI Metrics	~4,000 (members/signups/retention/satisfaction/UX/marketing/logistics)
Marketing Channels	~100 (SEM/LINE/email/CRM etc.)
Data Sources	9 tables/spreadsheets

Three Components

Biz Graph Transformer — Weekly graph rebuild from all data sources (Cloud Run Job, every Friday 22:00)
Biz Graph MCP Server — Graph search + time series analysis accessible from AI (Cloud Run)
Biz Data Loader — Daily auto-import of marketing/logistics data (Cloud Run Job, every morning 6:00)

The Core Design: The Week Node

Here's the heart of this article.

How do you connect "initiatives" and "metrics" in a graph? The obvious first thought is direct edges:

Initiative("SNS campaign") ──AFFECTS──→ Metric("new_members")

This design breaks down. Three reasons:

Edge explosion: 5,000 initiatives × 4,000 metrics = up to 20 million edges
Causal uncertainty: "SNS campaign affected new members" is a hypothesis, not a fact. Direct edges make it look like a confirmed relationship
Missing temporal info: There's no way to express when the impact occurred

Instead, we designed Week nodes as shared anchors for indirect connections.

Initiative("SNS campaign")     ──ACTIVE_DURING_WEEK──→  Week:2026-03-03
Metric("new_members")          ──HAS_DATA_AT──→         Week:2026-03-03
QualityMetric("avg_rating")    ──HAS_QUALITY_DATA_AT──→ Week:2026-03-03
MarketingChannel("SEM brand")  ──HAS_MARKETING_DATA_AT──→ Week:2026-03-03

Initiatives and metrics aren't directly connected — they're indirectly linked through the same week.
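
As an in-memory illustration of that indirect link (the real system queries BigQuery), the lookup is a two-hop traversal: collect the weeks an initiative was active, then collect every data edge that lands on one of those weeks. GraphEdge and coOccurringMetrics are hypothetical names; the edge field names follow the build code shown later in this article.

```typescript
// Hypothetical sketch: find metrics that share at least one Week node
// with a given initiative.
interface GraphEdge {
  edge_type: string;
  source_id: string;
  target_id: string;
}

// Matches HAS_DATA_AT, HAS_QUALITY_DATA_AT, HAS_UX_DATA_AT, and so on.
function isDataEdge(edgeType: string): boolean {
  return edgeType.startsWith("HAS_") && edgeType.endsWith("_AT");
}

function coOccurringMetrics(edges: GraphEdge[], initiativeId: string): Map<string, string[]> {
  // Hop 1: weeks during which the initiative was active.
  const activeWeeks = new Set(
    edges
      .filter((e) => e.edge_type === "ACTIVE_DURING_WEEK" && e.source_id === initiativeId)
      .map((e) => e.target_id),
  );
  // Hop 2: metrics with data in any of those weeks, keyed by metric id.
  const result = new Map<string, string[]>();
  for (const e of edges) {
    if (!isDataEdge(e.edge_type) || !activeWeeks.has(e.target_id)) continue;
    const weeks = result.get(e.source_id) ?? [];
    weeks.push(e.target_id);
    result.set(e.source_id, weeks);
  }
  return result;
}
```

The result is a list of co-occurrence candidates per metric, which is exactly what the design promises: shared weeks, with no claim about cause.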

Why This Works

1. Prevents edge explosion

Initiatives only connect to "weeks they were active." Metrics only connect to "weeks that have data." Instead of a cross-product, each connects independently to Week nodes — edge count grows linearly.
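
The difference is easy to see in arithmetic. In this sketch the initiative and metric counts are the article's rough figures, while the per-node week counts passed to weekAnchoredEdges are purely illustrative inputs, not measured values.

```typescript
// Worst case for direct edges: every initiative linked to every metric.
function directEdges(initiatives: number, metrics: number): number {
  return initiatives * metrics;
}

// Week-anchored design: each node links only to its own weeks, so the
// total is a sum of two linear terms instead of a product.
function weekAnchoredEdges(
  initiatives: number,
  avgActiveWeeks: number, // hypothetical average weeks an initiative is active
  metrics: number,
  avgDataWeeks: number, // hypothetical average weeks a metric has data
): number {
  return initiatives * avgActiveWeeks + metrics * avgDataWeeks;
}

const worstCase = directEdges(5000, 4000); // 20,000,000
const anchored = weekAnchoredEdges(5000, 4, 4000, 10); // 60,000 with these made-up averages
```

Doubling the number of initiatives doubles only the first term of the anchored count, whereas it doubles the entire cross-product in the direct design.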

2. Expresses co-occurrence, not causation

"Initiatives that were active the same week as metric fluctuations" — this isn't asserting causation, it's a structure for discovering causal candidates. It leaves room for human or AI judgment.

3. Edge types distinguish data sources

Same Week node, but HAS_DATA_AT (business KPIs), HAS_QUALITY_DATA_AT (service quality), HAS_UX_DATA_AT (UX metrics), HAS_MARKETING_DATA_AT (marketing), HAS_LOGI_DATA_AT (logistics) — "what kind of data" is embedded in the edge type itself.

4. Time series traversal is natural

Week nodes are connected by NEXT_WEEK edges. "How did metrics change in the 3 weeks before and after initiative start?" can be expressed as graph traversal.
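
As a small in-memory sketch of that traversal (the production queries run as SQL JOINs in BigQuery), walking NEXT_WEEK edges in both directions yields the window around a start week. WeekEdge and weeksAround are hypothetical names.

```typescript
// Hypothetical sketch: collect up to `radius` weeks before and after a
// start week by following NEXT_WEEK edges.
interface WeekEdge {
  edge_type: string;
  source_id: string;
  target_id: string;
}

function weeksAround(edges: WeekEdge[], startWeek: string, radius: number): string[] {
  const next = new Map<string, string>();
  const prev = new Map<string, string>();
  for (const e of edges) {
    if (e.edge_type !== "NEXT_WEEK") continue;
    next.set(e.source_id, e.target_id);
    prev.set(e.target_id, e.source_id);
  }
  // Follow one direction until the radius is used up or the chain ends.
  const walk = (links: Map<string, string>): string[] => {
    const visited: string[] = [];
    let cursor = startWeek;
    for (let i = 0; i < radius; i++) {
      const neighbor = links.get(cursor);
      if (neighbor === undefined) break;
      visited.push(neighbor);
      cursor = neighbor;
    }
    return visited;
  };
  // Earlier weeks are collected nearest-first, so reverse them for
  // chronological order.
  return [...walk(prev).reverse(), startWeek, ...walk(next)];
}
```

The window simply shortens at either end of the chain, so an initiative that starts in the first recorded week still gets its "after" weeks.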

MetricDomain: Bridging Worlds Without Join Keys

Week nodes tell us "what happened the same week," but not which metrics are relevant to a given initiative. There's no point looking at logistics data when analyzing an SNS ad campaign.

However, there's no join key between initiative categories ("Marketing (Advertising)") and metric groups ("New Acquisition"). The knowledge that "ad initiatives relate to new acquisition" is tacit — it exists only in people's heads.

MetricDomain (6 domains) turns this tacit knowledge into explicit structure.

Domain	Meaning	Connected metric types
acquisition	New acquisition	Marketing channels, new member count, registration CV
retention	Retention / churn prevention	Member count, churn rate, plan transitions
service_quality	Service quality	Satisfaction, ratings
operations	Operations	Selection, shipping, returns, logistics KPIs
ux	User experience	Sessions, funnels
revenue	Revenue / purchases	Purchase CV, upsell

These 6 domains aren't fixed — they can be freely added or split as the business grows and the organization evolves. Domain definitions are just mapping tables in code, so the cost of expansion is nearly zero.

Humans define two mappings: initiative categories to MetricDomains, and metric groups to MetricDomains. With both in place, the system can "automatically show acquisition-related metrics when viewing a marketing initiative."

Category("Marketing ads") ──CATEGORY_IN_DOMAIN──→ MetricDomain("acquisition")
                                                           ↑ IN_DOMAIN
                                                  MetricGroup("New Acquisition")
                                                  MarketingChannel("SEM brand")
                                                  UxMetric("registration_completed")

Result: Pass domain: "acquisition" to compare_metrics, and the initiative overlay automatically filters to acquisition-related initiatives only.

SIMILAR_TO: AI Answers "Have We Done Something Like This Before?"

Another unique design element: SIMILAR_TO edges.

Initiative text (title + description) is vectorized to 768 dimensions using Vertex AI's gemini-embedding-001, then BigQuery's VECTOR_SEARCH auto-detects similar pairs with cosine similarity >= 0.75.

SELECT
  query.id AS source_id,   -- the initiative being compared
  base.id AS similar_id,   -- a candidate similar node
  distance
FROM VECTOR_SEARCH(
  TABLE cortex.biz_graph_nodes,
  'embedding',
  (SELECT id, embedding FROM cortex.biz_graph_nodes WHERE node_type = 'Initiative'),
  top_k => 6,  -- 6 = the node itself + up to 5 neighbors
  distance_type => 'COSINE'
)
WHERE base.id != query.id AND distance <= 0.25  -- distance <= 0.25 = similarity >= 0.75

Currently ~13,000 SIMILAR_TO edges exist. Up to 5 similar initiatives are pre-computed for each one.
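
The relationship between the distance threshold in the query and the similarity threshold in the text is just distance = 1 - similarity. A small sketch of the math, using hypothetical function names and toy 2-dimensional vectors instead of real 768-dimensional embeddings:

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  if (normA === 0 || normB === 0) return 0; // zero vectors have no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Cosine distance is one minus cosine similarity.
function cosineDistance(a: number[], b: number[]): number {
  return 1 - cosineSimilarity(a, b);
}

// Mirrors the WHERE clause above: distance <= 0.25, i.e. similarity >= 0.75.
const MAX_DISTANCE = 0.25;

function qualifiesForSimilarTo(a: number[], b: number[]): boolean {
  return cosineDistance(a, b) <= MAX_DISTANCE;
}
```

Orthogonal vectors have distance 1 and never qualify, while vectors pointing in nearly the same direction have a distance close to 0.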

"Didn't we run a similar SNS campaign last summer? How did that one perform?" — traverse similar initiatives on the graph instantly, then compare KPI changes during weeks those initiatives were active.

Real Usage Examples

Here's how exploration works via MCP tools.

All tool execution examples below run through MCP from an AI coding agent. The response format matches the real system, but numbers are dummy values and content is simplified.

"Find marketing initiatives that drove acquisition"

search_initiatives({
  "query": "SNS advertising for new acquisition",
  "domain": "acquisition",
  "dateFrom": "2025-10-01",
  "dateTo": "2026-03-31",
  "limit": 5
})

Response (excerpt):

5 initiatives found (by vector similarity):

1. SNS Ad Spring Collection Campaign (2026-03-09)
   Category: Marketing (Advertising)
   Similarity: 892/1000

2. Instagram Reels Ad Test (2026-02-23)
   Category: Marketing (Advertising)
   Similarity: 845/1000
   ...

"Show me the impact of that initiative"

get_initiative_context({
  "initiative_id": "Initiative:2026-03-09:SNS Ad Spring Collection Campaign",
  "metric_window_days": 30
})

Response (excerpt):

## Initiative Context

Title: SNS Ad Spring Collection Campaign
Execution Period: 2026-03-01 to 2026-03-31
Category: Marketing (Advertising)
Target Domain: acquisition

## Similar Initiatives (SIMILAR_TO)
- Instagram Reels Ad Test (similarity: 0.82)
- 1-Month Free Trial Campaign (similarity: 0.78)

## KPI Changes During Initiative (30-day window)
| Metric | Pre-avg | Post-avg | Change |
|--------|---------|----------|--------|
| new_regular | 50 | 60 | +20.0% |
| new_lite | 30 | 35 | +16.7% |
| monthly | 1,000 | 1,050 | +5.0% |

## Service Quality Metrics
| Metric | Before | After | Change |
|--------|--------|-------|--------|
| avg_rating | 3.50 | 3.60 | +2.9% |

## UX Metrics
| Metric | Before | After | Change |
|--------|--------|-------|--------|
| total_sessions | 10,000 | 12,000 | +20.0% |
| registration_completed | 100 | 130 | +30.0% |

This is the power of the Week node design. Identify the weeks an initiative was active, then automatically pull all metrics (KPIs, quality, UX, marketing, logistics) from those same weeks.
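
The Change column in the tables above is a plain percentage change between the pre-window and post-window averages. A minimal sketch with hypothetical helper names:

```typescript
// Arithmetic mean of a window of weekly values.
function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Formats the change from a pre-window average to a post-window average
// in the same style as the tables above, e.g. "+20.0%".
function formatChange(pre: number, post: number): string {
  if (pre === 0) return "n/a"; // percentage change is undefined for a zero baseline
  const pct = ((post - pre) / pre) * 100;
  const sign = pct >= 0 ? "+" : "";
  return `${sign}${pct.toFixed(1)}%`;
}

// Rows from the dummy response above.
const newRegular = formatChange(50, 60); // "+20.0%"
const newLite = formatChange(30, 35); // "+16.7%"
```

A percentage change alone says nothing about noise or seasonality, which is why the response also returns similar initiatives and the week-by-week series for comparison.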

"Visualize new acquisition YoY with initiative overlay"

compare_metrics({
  "metrics": ["new_regular", "new_lite", "new_monthly"],
  "dateFrom": "2025-10-01",
  "dateTo": "2026-03-31",
  "granularity": "weekly",
  "overlay_initiatives": true,
  "domain": "acquisition"
})

The result is time series data with acquisition-domain initiatives overlaid on the same timeframe. KPI spikes can be lined up against "that initiative's timing" at a glance.

The Build Pipeline: 9 Phases

The graph is constructed in 9 phases:

Phase	Content	Output
1	Initiative nodes + Category/Business/Team	Initiative, Category, Business, Team
2	Daily KPIs (50 metrics)	Metric → MetricGroup (10 groups)
3	Business KPIs + Departments	Department → Metric (DEPT_TRACKS)
4	Week nodes (shared anchors)	HAS_DATA_AT + ACTIVE_DURING_WEEK + NEXT_WEEK
5	Service quality metrics (~50)	QualityMetric → Week
6	UX metrics (~40)	UxMetric → Week
7	Marketing channels (~100)	MarketingChannel → Week
8	MetricDomain (semantic bridge)	6 domains + IN_DOMAIN + TARGETS_DOMAIN
9	Logistics KPIs (~10 categories)	LogiMetric → Week

Phases 4 and 8 are the key design points. The other phases simply "turn data into nodes"; these two give structure to relationships that are not recorded anywhere in the source data.

Phase 4: Week Node Generation

// Convert initiative execution period to ISO weeks, generate ACTIVE_DURING_WEEK edges
for (const initiative of initiatives) {
  const weeks = getISOWeeksBetween(
    initiative.executionStartDate,
    initiative.executionEndDate
  );
  // Cap at 52 weeks (guard against long-running initiatives)
  for (const week of weeks.slice(0, 52)) {
    edges.push({
      edge_type: 'ACTIVE_DURING_WEEK',
      source_id: initiative.id,
      target_id: `Week:${week}`,
    });
  }
}

// Generate HAS_DATA_AT edges for weeks that have metric data
for (const metricWeek of metricWeeks) {
  edges.push({
    edge_type: 'HAS_DATA_AT',
    source_id: `Metric:${metricWeek.metric}`,
    target_id: `Week:${metricWeek.week}`,
  });
}

// NEXT_WEEK edges for time series traversal
const sortedWeeks = [...allWeeks].sort();
for (let i = 0; i < sortedWeeks.length - 1; i++) {
  edges.push({
    edge_type: 'NEXT_WEEK',
    source_id: `Week:${sortedWeeks[i]}`,
    target_id: `Week:${sortedWeeks[i + 1]}`,
  });
}
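
The getISOWeeksBetween helper is not shown in this article. One possible implementation, assuming each week is keyed by the ISO date (YYYY-MM-DD, UTC) of its Monday, is sketched below. That key convention and the name isoWeekStartsBetween are assumptions; the real helper may differ.

```typescript
// Hypothetical sketch: list the Monday of every ISO week that overlaps
// the period [start, end]. Dates are "YYYY-MM-DD" strings handled in UTC.
function toIsoDate(d: Date): string {
  return d.toISOString().slice(0, 10);
}

function mondayOf(date: Date): Date {
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  const daysSinceMonday = (d.getUTCDay() + 6) % 7; // getUTCDay: 0 = Sunday
  d.setUTCDate(d.getUTCDate() - daysSinceMonday);
  return d;
}

function isoWeekStartsBetween(start: string, end: string): string[] {
  const weeks: string[] = [];
  const cursor = mondayOf(new Date(`${start}T00:00:00Z`));
  const last = new Date(`${end}T00:00:00Z`);
  // Step one week at a time until the Monday passes the end date.
  while (cursor.getTime() <= last.getTime()) {
    weeks.push(toIsoDate(cursor));
    cursor.setUTCDate(cursor.getUTCDate() + 7);
  }
  return weeks;
}
```

Keys in this format sort chronologically under plain string comparison, which is what lets a default array sort order the weeks before the NEXT_WEEK edges are generated.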

Phase 8: MetricDomain Generation

// Category → Domain (semantic mapping defined by humans)
const CATEGORY_TO_DOMAINS: Record<string, string[]> = {
  'Marketing (Advertising)': ['acquisition'],
  'CRM / Retention': ['retention'],
  'Quality / Service Improvement': ['service_quality'],
  'Operations Improvement': ['operations'],
  'New Feature': ['ux', 'revenue'],
  // ...
};

// Initiative → TARGETS_DOMAIN (main business only — limited to where KPI data exists)
for (const initiative of initiatives) {
  if (initiative.business !== MAIN_BUSINESS) continue;
  const domains = CATEGORY_TO_DOMAINS[initiative.category] ?? [];
  for (const domain of domains) {
    edges.push({
      edge_type: 'TARGETS_DOMAIN',
      source_id: initiative.id,
      target_id: `MetricDomain:${domain}`,
    });
  }
}

Why Not a Dedicated Graph DB or OSS Libraries?

We implemented the graph using BigQuery alone, without Neo4j, Amazon Neptune, or OSS like Microsoft's GraphRAG.

Why not a dedicated graph DB?

Aspect	Dedicated Graph DB	BigQuery
Graph traversal	Fast (native)	Fast enough (~10,000 node scale)
Vector search	Requires separate service	VECTOR_SEARCH built-in
Time series analysis	Weak	Native (window functions)
Operating cost	Always-on instances	Serverless (pay per query)
Joining other data	ETL required	Same project, instant JOIN

For Biz Graph, "graph structure + time series analysis + vector search combined" matters more than "deep graph traversal." BigQuery handles all three in one engine.

Additionally, BigQuery has announced Graph capabilities — once GA, native graph queries on node/edge tables will be available. Currently we traverse with SQL JOINs, but we expect to migrate to faster, more intuitive queries in the future.

Why not OSS libraries / SaaS?

OSS like Microsoft GraphRAG and various Graph RAG SaaS products focus on automatically extracting entities and relationships from text documents. Great for research papers or news articles, but not for our use case.

The reason is simple: we need to design the graph structure itself.

The concept of Week nodes as "temporal anchors" doesn't exist in generic tools
MetricDomain "semantic bridging" reflects our specific business structure
The Initiative → Week → Metric indirect connection pattern won't emerge from LLM entity extraction

Generic tools "auto-generate graphs from text." What we needed was "design the graph schema ourselves and integrate heterogeneous data sources." Fundamentally different problems.

Internal query example (get_initiative_context):

-- Get weeks the initiative was active
WITH active_weeks AS (
  SELECT target_id AS week_id
  FROM cortex.biz_graph_edges
  WHERE source_id = @initiative_id
    AND edge_type = 'ACTIVE_DURING_WEEK'
),
-- Get metrics that have data in those same weeks
co_occurring_metrics AS (
  SELECT e.source_id AS metric_id, e.edge_type, w.week_id
  FROM cortex.biz_graph_edges e
  JOIN active_weeks w ON e.target_id = w.week_id
  WHERE e.edge_type IN (
    'HAS_DATA_AT', 'HAS_QUALITY_DATA_AT',
    'HAS_UX_DATA_AT', 'HAS_MARKETING_DATA_AT'
  )
)
SELECT * FROM co_occurring_metrics

Graph traversal and time series data retrieval complete in a single SQL query. With a dedicated graph DB, you'd need to pass traversal results to another service for time series queries — an extra hop.
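
To make the two-hop pattern concrete, here is the same traversal expressed over an in-memory edge list. This is only an illustrative sketch: the GraphEdge shape mirrors the edge objects built in the pipeline snippets, while coOccurringMetrics and the sample IDs are hypothetical names, and the production path remains the SQL query above.

```typescript
interface GraphEdge {
  edge_type: string;
  source_id: string;
  target_id: string;
}

// Edge types that attach metric data to a Week node (taken from the query above).
const DATA_EDGE_TYPES = new Set<string>([
  'HAS_DATA_AT',
  'HAS_QUALITY_DATA_AT',
  'HAS_UX_DATA_AT',
  'HAS_MARKETING_DATA_AT',
]);

// Hop 1: Initiative -> Week via ACTIVE_DURING_WEEK.
// Hop 2: Week <- Metric via any of the data edge types.
function coOccurringMetrics(
  edges: GraphEdge[],
  initiativeId: string,
): Array<{ metricId: string; weekId: string }> {
  const activeWeeks = new Set<string>();
  for (const e of edges) {
    if (e.source_id === initiativeId && e.edge_type === 'ACTIVE_DURING_WEEK') {
      activeWeeks.add(e.target_id);
    }
  }
  const result: Array<{ metricId: string; weekId: string }> = [];
  for (const e of edges) {
    if (DATA_EDGE_TYPES.has(e.edge_type) && activeWeeks.has(e.target_id)) {
      result.push({ metricId: e.source_id, weekId: e.target_id });
    }
  }
  return result;
}
```

There is never a direct Initiative-to-Metric edge; the Week node is the only thing the two sides share, which is why the result is a list of candidates rather than a causal claim.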

Initiative Data Ingestion: Auto-Extraction from Meeting Slides

Graph quality depends on source data quality. Initiative data comes from all-hands and group meeting slides.

Source	Format	Frequency
All-hands	pptx in Drive → Slides conversion → text extraction	Weekly
Group standups	Google Slides (cumulative, latest week appended)	Weekly

Text is extracted from meeting slides and structured by AI into the initiative table.

interface InitiativeRow {
  meetingDate: string;       // Meeting date
  source: string;            // Source (all-hands / group standup etc.)
  business: string;          // Business unit
  category: string;          // Marketing (Ads), New Feature, ...
  title: string;             // Initiative title
  description: string;       // Detailed description
  team: string;              // Executing team
  executionStartDate: string; // Execution start date
  executionEndDate: string;   // Execution end date
  metrics: string;           // JSON format numeric metrics
  status: string;            // planned / in_progress / retrospective
}

Critical: executionStartDate / executionEndDate. The meeting date (meetingDate) differs from when the initiative actually runs. "We started the SNS campaign last week," reported on 3/9, means executionStartDate is 3/1. This distinction is essential for accurate Week node connections.

Operating Cost

Resource	Cost
Vertex AI Embedding (weekly)	~$0.05/run
Claude Code (initiative extraction)	Within monthly plan
BQ storage	A few GB (negligible)
Cloud Run Jobs	Nearly free (1x weekly + 1x daily)
MCP Server	Nearly free (Cloud Run min-instances=0)

A few dollars per month to maintain a 10,000-node, 71,000-edge graph.

Comparison With Typical Knowledge Graphs

Let's take a step back and see how this design differs from conventional approaches.

Aspect	Typical Knowledge Graph	Biz Graph
Node design	Entities mapped directly to nodes	Deliberately designed temporal anchors ("Week")
Edge semantics	Relationships described as-is	Edge types encode data source classification
Intermediate nodes	Taxonomies for classification	MetricDomain as semantic bridge (structuralized tacit knowledge)
Graph construction	Relationships extracted from existing data	Deliberately designed graph from data with no inherent relationships
Use case	Primarily search and navigation	Goes further into causal candidate exploration for initiative impact
Similarity search	Text-based search	Pre-computed SIMILAR_TO edges via Embedding

In one sentence:

Our DB Graph "made existing relationships discoverable." Biz Graph "designed and created relationships that didn't exist."

The former is an analysis problem. The latter is a design problem — designing the graph structure from scratch and integrating heterogeneous data sources (meeting slides, spreadsheets, BQ tables) into a single explorable structure. That's the essence of Biz Graph.

Why Graph RAG Over Flat RAG

Let's revisit the "why Graph RAG?" question from the introduction.

For initiative effectiveness analysis, consider what happens with standard vector search (flat RAG). Ask "What was the SNS campaign's impact?" — flat RAG returns text chunks similar to the initiative description. You get info about the initiative itself.

But it won't return concurrent KPI changes. It won't return results from past similar initiatives. It won't return related domain metrics.

These are pieces of information connected "through the graph," not by "text similarity." You can only reach them by traversing the graph, starting with the Week nodes. This "need to follow relationships" use case is exactly where Graph RAG has a clear advantage over flat RAG.

Design Honesty: Not Asserting Causation

One thing I was conscious of in this design: not asserting causation.

Many BI tools and AI analyses want to declare "this initiative impacted this KPI." In reality, there is no such certainty: multiple initiatives may have been running simultaneously, the change could be seasonal, or it could reflect external market shifts.

Week node indirect connections simply "lay out what happened in the same period." Causal judgment is left to human or AI reasoning. I believe this is a statistically honest approach.

"A structure for discovering causal candidates" — not "a structure for asserting causation." This distinction matters.

Limitations: The Designer's Tacit Knowledge Is the Bottleneck

Let me be honest about the weaknesses of this approach.

MetricDomain mappings ("Marketing Advertising → acquisition domain") are hardcoded by humans. If this design is wrong, the entire graph's exploration results are skewed.

This is simultaneously the answer to "why build it yourself." Off-the-shelf graph tools can't reflect your business structure — which initiative categories relate to which metric groups. Structuralizing this tacit knowledge requires someone who knows the business.

Going forward, we're considering having AI propose these mappings with humans reviewing them. Full automation is hard, but an "AI suggests, humans approve" workflow could reduce the maintenance cost of domain knowledge.

Summary

Turning business data into a graph is more of a design challenge than a technical one.

There's no FK between "initiatives" and "KPIs." No join key. But by deliberately designing two structures — temporal axis (Week nodes) and semantic domains (MetricDomain) — it becomes an explorable graph.

Week nodes: Indirect connections via "same week" instead of direct initiative-metric edges. A structure for discovering causal candidates
MetricDomain: Semantic bridge between initiative categories and metric groups. Structuralized tacit knowledge
SIMILAR_TO: Pre-computed similar initiatives via AI Embedding. Instant answers to "have we done this before?"

As a result, AI can now autonomously explore the graph to answer questions like "Did that initiative work?", "Find initiatives that drove acquisition", and "Show metrics YoY with initiative overlay".

Graphs aren't something you "find" — they're something you design. Especially for business data.

How We Built an Automated Meeting Intelligence System with Google Meet, Slack, and RAG

Ryosuke Tsuji — Sat, 11 Apr 2026 09:11:59 +0000

Hi, I'm Ryan, CTO at airCloset — a fashion subscription service based in Japan.

In previous posts, I wrote about building a DB Graph MCP server that lets you query 991 database tables across 15 schemas with natural language, and a suite of 17 MCP servers that opened our internal operations to AI.

This time, it's not about MCP. It's about something more fundamental: turning meetings into a searchable knowledge base. This is the first system I wanted to build when I started thinking about digitizing our company's information assets.

We built a system that automatically shares Google Meet recordings and transcripts to Slack channels, and makes past meeting content searchable with natural language.

The Problem: Context Disappears the Moment a Meeting Ends

Face-to-face communication is fast and dense. A decision that takes 30 minutes over text can happen in 5 minutes in a meeting. That's the biggest advantage of meetings.

But the problem is that context starts disappearing the moment the meeting ends.

"What did we decide in that meeting again?"
"There's a recording but I don't have the energy to rewatch an hour-long video"
"Where did I write those meeting notes?"
"We keep having the same discussion over and over"

Building a habit of writing meeting notes is one solution, but honestly, getting everyone to consistently write good notes is hard. Even when they do, the nuance of the conversation is lost.

Meetings are a treasure trove of information, yet they're not being utilized. That's a huge waste.

What We Built

We built a system that automates four things:

One-click Meet creation from Google Calendar — A Chrome extension creates a Meet with recording, transcription, and notes all enabled by default
Automatic Slack notification when a meeting ends — Instant notification, followed by recording and transcript links minutes later
Automatic permission granting — Access is automatically given to Slack channel members, meeting participants, and Calendar invitees
RAG search over transcripts and screen shares — Ask a Slack Bot "What was the release date we discussed last week?" and get an answer

User Flow

Step 1: Create a Meeting (~10 seconds)

In Google Calendar's event editor, click the "AI Fassy Meet" button added by our Chrome extension.

The "AI Fassy Meet" button appears next to Google Meet's native video conferencing option

Select the Slack channel where notifications should be sent. Previously selected channels appear at the top, followed by your most active channels.

Channel search and selection dialog, sorted by selection history and activity

Click "Create Meet" and the Meet URL is automatically set on the Calendar event.

The Meet URL is set on the event with recording, transcription, and notes all enabled by default. The "Use Gemini to create meeting notes" shown on screen is Google Meet's native feature — our system additionally integrates Gemini 3 Flash for independent transcription and screen share analysis

Recording, transcription, and meeting notes are all ON by default. Users don't need to think about settings at all.

The channel dropdown shows previously selected channels first, then channels you're a member of, sorted by message activity. For recurring meetings, last week's channel is always one click away.

Step 2: Hold the Meeting

Just have your meeting normally. Recording and transcription run automatically in the background.

Step 3: Automatic Notification When the Meeting Ends

When the meeting ends, an instant notification appears in the designated Slack channel.

A few minutes later, a follow-up notification arrives in the thread with links to the recording and transcript. Channel members can view them immediately.

Step 4: Search Past Meetings with Natural Language

In the same thread, mention the Bot to ask about the meeting content.

Full thread flow: ①Meeting ended notification → ②Recording and transcript links → ③User asks "Give me a summary of this meeting" → ④Bot responds with a structured summary

The Bot searches past meeting transcripts, summarizes the relevant parts, and responds with source links. Screen-shared slides and code are also searchable.

Now let's dive into the technical implementation.

Architecture Overview

The system consists of four components:

Component	Role	Deployment
Chrome Extension + meet-calendar API	Meet creation UI + backend API	Chrome / Cloud Run
workspace-pipeline	Workspace Events API subscription management	Shared package
meet-pipeline	Core event processing: artifact storage, permissions, embedding generation	Cloud Run
Slack Bot	Meet creation + RAG search	Cloud Run

Shared domain logic (Space creation, Firestore operations, Drive access, caching) is extracted into a common package, reused by both the Chrome Extension API and the Slack Bot.

Tech Stack

Layer	Technology
Frontend	Chrome Extension (Manifest V3)
API	Cloud Run (Hono)
Event Processing	Cloud Pub/Sub → Cloud Run
Workspace Integration	Meet REST API, Drive API, Workspace Events API, Calendar API
AI/ML	Vertex AI Embeddings (gemini-embedding-001), Gemini 3 Flash
Data Stores	Firestore, BigQuery, Cloud Storage, Upstash Redis
Notifications	Slack Block Kit API
Infrastructure	Pulumi (TypeScript)

Deep Dive 1: Pre-Pooling Meet Spaces — LIFO Cache

Problem: Meet Creation Is Slow

Creating a new Google Meet Space via the API takes 1–2 seconds to respond. Making users wait that long after clicking a button is unacceptable UX.

Solution: Pre-Create and Pool

The idea is simple: pre-create Meet Spaces via API and return them instantly on request. Replenish in the background when consumed.

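import { EventEmitter } from 'node:events';

class MeetSpaceCache {
  private cachePool: CachedMeetSpace[] = [];
  private readonly emitter = new EventEmitter(); // Signals consumption to the replenisher
  private readonly targetSize = 3;
  private readonly maxSize = 5;
  private readonly ttlMs = 24 * 60 * 60 * 1000; // 24 hours

  // Assumption: each cached Space records its creation time as createdAt (epoch ms)
  private isExpired(space: CachedMeetSpace): boolean {
    return Date.now() - space.createdAt >= this.ttlMs;
  }

  getMeetSpaceFromCache(): CachedMeetSpace | undefined {
    // Drop expired entries, then pop the newest remaining one
    this.cachePool = this.cachePool.filter(s => !this.isExpired(s));
    const space = this.cachePool.pop(); // LIFO
    if (space) {
      this.emitter.emit('spaceConsumed'); // Trigger background replenishment
    }
    return space;
  }
}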

Why LIFO? By always returning the newest Space, we minimize the risk of serving one that is expired or about to expire. Older Spaces naturally expire and are filtered out at the start of the next getMeetSpaceFromCache() call, just before the pop().

Replenishment is event-driven via EventEmitter. When a Space is consumed, replenish() runs in the background after a 100ms delay. A mutex (isReplenishing flag) prevents concurrent API requests.

initializeMeetCache(createSpace) {
  this.emitter.on('spaceConsumed', () => {
    setTimeout(() => this.replenish(createSpace), 100);
  });
  // Build initial pool on startup
  this.replenish(createSpace);
}

This brings the latency of returning a Meet URL to under 100ms for most requests. The cache lives in a shared domain package, reused by both the Chrome Extension API and the Slack Bot.
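
The pooling idea can be sketched in isolation as a generic structure. The version below is an assumption-laden illustration, not the production cache: LifoPool, PoolEntry and the injectable now() clock are hypothetical names, and the EventEmitter wiring and 100ms delay are left out so the LIFO, TTL and mutex behavior is easier to see.

```typescript
interface PoolEntry<T> {
  value: T;
  createdAt: number; // epoch milliseconds
}

class LifoPool<T> {
  private entries: PoolEntry<T>[] = [];
  private isReplenishing = false; // simple mutex flag
  private readonly targetSize: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(targetSize: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.targetSize = targetSize;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  add(value: T): void {
    this.entries.push({ value, createdAt: this.now() });
  }

  // Drop expired entries, then return the newest remaining one (LIFO).
  take(): T | undefined {
    const t = this.now();
    this.entries = this.entries.filter(e => t - e.createdAt < this.ttlMs);
    const entry = this.entries.pop();
    return entry ? entry.value : undefined;
  }

  // Refill up to targetSize; a call made while another is running returns immediately.
  async replenish(create: () => Promise<T>): Promise<void> {
    if (this.isReplenishing) return;
    this.isReplenishing = true;
    try {
      while (this.entries.length < this.targetSize) {
        this.add(await create());
      }
    } finally {
      this.isReplenishing = false;
    }
  }
}
```

Injecting the clock keeps the expiry logic testable without waiting 24 hours, and the finally block releases the flag even when a create() call fails, so one failed API request does not block replenishment permanently.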

Deep Dive 2: Designing for Adoption — Chrome Extension

We Started with a Slack Command

The first thing we built was a /meet command in Slack. Mention the bot and it returns a Meet link. Technically, it worked perfectly.

But nobody used it.

Why? The meeting creation flow is "create a Calendar event → invite participants → set the Meet URL." The Slack command is outside this flow. Switching to Slack, typing a command, copying the URL, pasting it into Calendar — that's too much friction.

Meet Users Where They Already Are

The insight was that features must be placed on the user's existing path to get adopted.

Google Calendar's event editor is a place everyone passes through when scheduling a meeting. Put a button there and it's one click. That's why we built a Chrome Extension.

The Slack command still exists and some people use it. But adoption skyrocketed after shipping the Chrome Extension.

Optimizing Channel Selection

We also put effort into the channel selection UX. The dropdown order is determined by the following logic:

Tier 1: Personal Selection History (Redis ZSET)

// Store in Redis ZSET with score=timestamp
async saveChannelSelection(userId, channel) {
  // Remove duplicate of same channel
  await redis.zrem(key, existingMember);
  // Add with latest timestamp
  await redis.zadd(key, { score: Date.now(), member: JSON.stringify(channel) });
  // Cap at 50 entries
  await redis.zremrangebyrank(key, 0, -(MAX_RECENT + 1));
}

Previously selected channels appear at the top. For recurring meetings, last week's channel is always first. Using Redis ZSET with timestamps as scores gives O(log N) insertion and natural chronological ordering.
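
The rank arithmetic in zremrangebyrank is easy to misread: ranks are ordered by ascending score, so rank 0 is the oldest selection and rank -1 the newest. Removing ranks 0 through -(MAX_RECENT + 1) therefore keeps only the MAX_RECENT most recent entries. The in-memory sketch below mimics that behavior with a Map; RecentSelections is a hypothetical stand-in for the Redis ZSET, written only to illustrate the semantics.

```typescript
class RecentSelections {
  private readonly scores = new Map<string, number>(); // member -> score (timestamp)
  private readonly maxRecent: number;

  constructor(maxRecent: number) {
    this.maxRecent = maxRecent;
  }

  save(channelId: string, timestampMs: number): void {
    // Like ZADD: inserts a new member or updates the score of an existing one.
    this.scores.set(channelId, timestampMs);
    // Like ZREMRANGEBYRANK key 0 -(max + 1): drop the oldest entries beyond the cap.
    const ascending = Array.from(this.scores.entries()).sort((a, b) => a[1] - b[1]);
    const excess = Math.max(ascending.length - this.maxRecent, 0);
    for (const [oldId] of ascending.slice(0, excess)) {
      this.scores.delete(oldId);
    }
  }

  // Newest first, matching the order shown in the dropdown.
  list(): string[] {
    return Array.from(this.scores.entries())
      .sort((a, b) => b[1] - a[1])
      .map(([id]) => id);
  }
}
```

Keying by channel ID rather than by a serialized channel object means a re-selection simply updates the score, so no separate removal step is needed in this sketch.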

Tier 2: Channel Activity (Firestore sortPriority)

Channels without selection history are sorted by a pre-computed sortPriority (based on message volume) in Firestore. Frequently used channels rank higher.

Both sources are fetched in parallel, with Redis results taking priority in the merge, ensuring a useful list even on first load.
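
The merge step itself can be sketched as a small pure function. This is a simplified illustration under assumed names (SlackChannel, mergeChannelLists and the limit parameter are hypothetical), taking the two already-sorted lists as input.

```typescript
interface SlackChannel {
  id: string;
  name: string;
}

// Tier 1 (recent selections) comes first; Tier 2 (activity order) fills the rest.
// Channels already present from Tier 1 are skipped when they reappear in Tier 2.
function mergeChannelLists(
  recent: SlackChannel[],
  byActivity: SlackChannel[],
  limit: number = 50,
): SlackChannel[] {
  const seen = new Set<string>();
  const merged: SlackChannel[] = [];
  for (const channel of recent.concat(byActivity)) {
    if (merged.length >= limit) break;
    if (seen.has(channel.id)) continue;
    seen.add(channel.id);
    merged.push(channel);
  }
  return merged;
}
```

When the user has no selection history, recent is empty and the result is simply the activity-sorted list, which matches the "useful list even on first load" behavior described above.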

Deep Dive 3: Domain-Wide Delegation — Why a "Proxy Account" Is Needed

The File Ownership Problem

When you enable recording in Google Meet, the recording and transcript files are created in the organizer's personal Drive. This is a Google Workspace behavior that cannot be changed.

This is a major problem.

When files are scattered across different organizers' Drives, the system cannot uniformly access them. Copying recordings to GCS, loading transcripts into BQ, granting permissions to channel members — all these automated operations require reliable file access. If the organizer differs each time, you'd have to track which Drive the file is in and manage each person's OAuth tokens. This is operationally untenable.

Solution: Impersonation via a Shared Service Account

We use Domain-Wide Delegation (DWD) to have a service account act as a Workspace admin.

const auth = new google.auth.JWT({
  email: serviceAccountEmail,  // Service account
  key: privateKey,
  scopes: [
    'https://www.googleapis.com/auth/meetings.space.created',
    'https://www.googleapis.com/auth/drive',
  ],
  subject: workspaceAdminEmail,  // Act as this admin
});

Since APIs execute as the Workspace admin specified in subject, both Meet Space creation and Drive file ownership are consolidated under this shared account.

When creating a Space, we set recording and transcription to ON by default via artifactConfig:

body: JSON.stringify({
  config: {
    accessType: 'TRUSTED',
    entryPointAccess: 'ALL',
    artifactConfig: {
      recordingConfig: {
        autoRecordingGeneration: 'ON',  // Recording: ON by default
      },
      transcriptionConfig: {
        autoTranscriptionGeneration: 'ON',  // Transcription: ON by default
      },
    },
  },
}),

Users never "forget to turn on recording." Every Meet created through this system is guaranteed to be recorded and transcribed.

Benefits:

Files are always consolidated in the same account's Drive → uniform system access
No individual OAuth token management needed
Same credentials work regardless of who organizes the meeting
One-time setup in Workspace Admin Console, then it just works with the service account key

Workspace Admin privileges are required for the initial setup, but it's a one-time task.

Calendar Search via DWD

When notifying Slack on meeting end, we need the meeting title. But the Meet API doesn't provide it — the title only exists on the Calendar side.

DWD helps here too. We first search the organizer's Calendar, then iterate through participants' Calendars.

async function searchCalendarEventTitle(meetCode, creatorEmail, participants) {
  // 1. Search the organizer's calendar first
  const creatorEvent = await searchCalendar(creatorEmail, meetCode);
  if (creatorEvent) return creatorEvent.summary;

  // 2. Fall back to participants
  for (const participant of participants) {
    const event = await searchCalendar(participant.email, meetCode);
    if (event) return event.summary;
  }

  // 3. Fall back to Firestore cache
  return meetInfo.calendarTitle ?? null;
}

With DWD, you can search any user's Calendar by simply swapping the subject. No Calendar sharing settings needed.

Deep Dive 4: Workspace Events API — Real-Time Event-Driven Architecture

No Polling

"How do we detect when a Meet ends?" — this was the first challenge.

Polling the API for status checks lacks real-time responsiveness and increases API call volume.

Google Workspace Events API lets you receive Meet lifecycle events in real-time via Pub/Sub.

const subscription = await workspaceEvents.subscriptions.create({
  requestBody: {
    targetResource: `//meet.googleapis.com/${spaceName}`,
    eventTypes: [
      'google.workspace.meet.conference.v2.ended',        // Meeting ended
      'google.workspace.meet.recording.v2.fileGenerated',  // Recording ready
      'google.workspace.meet.transcript.v2.fileGenerated', // Transcript ready
    ],
    notificationEndpoint: {
      pubsubTopic: `projects/${projectId}/topics/meet-events`,
    },
    payloadOptions: { includeResource: true },
  },
});

We create a Subscription when the Meet Space is created, delivering three event types to the meet-events Pub/Sub topic.

Fighting the 7-Day Expiration

However, these Subscriptions have a 7-day maximum TTL (604,800 seconds). This is a Google API constraint that cannot be changed. Left unattended, subscriptions expire and events stop arriving.

This becomes a problem in cases like:

Recurring meetings — A weekly Monday standup reuses the same Meet Space. The subscription expires before next Monday
Future meetings — Creating a Meet in advance for next week's 1:1. If more than 7 days pass from creation, events won't arrive on the meeting day

In other words, without automatic subscription renewal, recurring and future meetings won't work.

Daily Batch Auto-Renewal

We run a daily batch via Cloud Scheduler at 5:00 AM JST, processing in two phases:

async function renewSubscriptions(): Promise<RenewalResult> {
  // Phase 1: Invalidate old Spaces (run before renewal)
  // → Processing invalidations first excludes them from Phase 2
  const spacesToInvalidate = await getMeetSpacesNeedingInvalidation(thirtyDaysAgo);
  for (const space of spacesToInvalidate) {
    await invalidateMeetSpace(space.spaceName);  // isValid = false
  }

  // Phase 2: Renew Subscriptions
  const spacesToRenew = await getMeetSpacesNeedingRenewal(sixDaysAgo);
  for (const space of spacesToRenew) {
    // Create new Subscription (old one auto-expires)
    const newSubscriptionName = await createMeetSubscription(
      space.spaceName, subscriptionConfig,
    );
    await updateMeetSpaceSubscription(space.spaceName, newSubscriptionName);
  }
}

Phase 1: Invalidation. Spaces whose meetingEndAt is more than 30 days ago are set to isValid: false. Once 30 days have passed since a meeting ended, no further recording or transcript events will arrive. Invalidation excludes these Spaces from Phase 2, reducing unnecessary API calls.

Phase 2: Renewal. Spaces whose subscribedAt is 6 or more days ago (one day before expiration) get a new Subscription. Old subscriptions auto-expire, so explicit deletion is unnecessary.

Subscription Lifecycle

Day 0: Meet created → Subscription created (TTL: 7 days)
Day 6: Daily batch → Subscription renewed (new TTL: 7 days)
Day 12: Daily batch → Subscription renewed (new TTL: 7 days)
  ...repeats...
Day 30+: Daily batch → isValid=false → renewal stops

With this mechanism, even if you create a Meet today for a meeting next month, the daily batch renews the subscription every 6 days, so events still arrive on the meeting day. Recurring meetings work the same way across multiple weeks with the same Meet Space.
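
The per-Space decision made by the daily batch can be summarized as a pure function. The sketch below is an illustration of the rules described above, not the actual batch code: SpaceRecord, planRenewal and the use of epoch-millisecond timestamps are assumptions.

```typescript
interface SpaceRecord {
  spaceName: string;
  subscribedAt: number; // epoch ms of the current subscription
  meetingEndAt?: number; // epoch ms; undefined if no meeting has ended yet
  isValid: boolean;
}

type RenewalAction = 'invalidate' | 'renew' | 'keep';

const DAY_MS = 24 * 60 * 60 * 1000;

function planRenewal(space: SpaceRecord, nowMs: number): RenewalAction {
  if (!space.isValid) return 'keep'; // already invalidated, nothing to do
  // Phase 1: the meeting ended more than 30 days ago, so stop renewing.
  if (space.meetingEndAt !== undefined && nowMs - space.meetingEndAt > 30 * DAY_MS) {
    return 'invalidate';
  }
  // Phase 2: the subscription is 6 or more days old, one day before the 7-day TTL.
  if (nowMs - space.subscribedAt >= 6 * DAY_MS) return 'renew';
  return 'keep';
}
```

Checking invalidation before renewal mirrors the phase order in the batch: a Space that qualifies for both is invalidated and never renewed.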

Deep Dive 5: Event Processing Pipeline

From meeting end to Slack notification to vector data generation for RAG search — everything starts from receiving a Pub/Sub message.

Event Router: Dispatching to Three Handlers

async function handleMeetEvent(pubsubMessage) {
  const eventType = pubsubMessage.attributes?.['ce-type'];
  const spaceName = normalizeSpaceName(pubsubMessage.attributes?.['ce-subject']);

  // Fetch space info from Firestore
  const meetInfo = await getMeetSpaceInfo(spaceName);

  switch (eventType) {
    case 'google.workspace.meet.conference.v2.ended':
      return handleMeetEnded(meetInfo, pubsubMessage);
    case 'google.workspace.meet.recording.v2.fileGenerated':
      return handleRecordingGenerated(meetInfo, pubsubMessage);
    case 'google.workspace.meet.transcript.v2.fileGenerated':
      return handleTranscriptGenerated(meetInfo, pubsubMessage);
  }
}

One caveat: the Pub/Sub event's targetResource may contain a conferenceRecordId instead of a spaceName. Google Meet creates a new conference record for each session in the same Space. In that case, we resolve conferenceRecordId → spaceName via the Meet API.
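
A rough sketch of how the first classification step might look is shown below. It assumes the event subject is a resource name of the form //meet.googleapis.com/spaces/... or //meet.googleapis.com/conferenceRecords/...; parseMeetResource is a hypothetical helper, and the actual conferenceRecord-to-space lookup via the Meet API is deliberately left out.

```typescript
type MeetResource =
  | { kind: 'space'; name: string }
  | { kind: 'conferenceRecord'; name: string }
  | { kind: 'unknown'; raw: string };

// Classify a subject string so the caller knows whether an extra
// conferenceRecord -> space lookup is required before reading Firestore.
function parseMeetResource(subject: string): MeetResource {
  const path = subject.replace(/^\/\/meet\.googleapis\.com\//, '');
  const match = /^(spaces|conferenceRecords)\/([^/]+)/.exec(path);
  if (!match) return { kind: 'unknown', raw: subject };
  const name = `${match[1]}/${match[2]}`;
  return match[1] === 'spaces'
    ? { kind: 'space', name }
    : { kind: 'conferenceRecord', name };
}
```

Returning a tagged union instead of a bare string forces the caller to handle the conferenceRecord case explicitly rather than passing an unresolved ID to the Firestore lookup.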

① handleMeetEnded — On Meeting End

Update Firestore status to ended
Fetch participant list from Meet API
Search Calendar API for the meeting title (DWD to search participants' calendars)
Save participant info to BQ (making "who attended" searchable via RAG)
Send "meeting ended" notification to Slack
Save notification ts (timestamp) to Firestore → subsequent notifications thread under it

② handleRecordingGenerated — On Recording Completion

The recording handler is the most complex:

Drive → GCS copy → Grant permissions → Update Firestore
                 → Gemini transcription (async)
                 → Screen share analysis (async)

Idempotency is critical. Pub/Sub guarantees at-least-once delivery, so duplicate messages are possible. We strictly maintain this order:

async function handleRecordingGenerated(meetInfo, message) {
  // Idempotency check: skip if already processed
  if (meetInfo.recordingReady && meetInfo.artifacts?.recording?.gcsUri) {
    return;
  }

  // 1. Get file info from Drive
  const fileInfo = await getFileInfo(driveFileId);

  // 2. Stream copy to GCS (with existence check)
  if (!(await gcsFileExists(gcsPath))) {
    await copyDriveFileToGCS(fileInfo.id, gcsPath);
  }

  // 3. Grant permissions to channel members ← BEFORE setting the flag
  await shareFileWithChannelMembers(fileInfo.id, meetInfo.channelId);

  // 4. Save artifact info to Firestore
  await updateMeetSpaceArtifact(spaceName, 'recording', { driveFileId, gcsUri });

  // 5. AI processing is async fire-and-forget
  processGeminiTranscription(gcsUri, meetInfo).catch(logError);
  processScreenShareAnalysis(gcsUri, meetInfo).catch(logError);

  // 6. Check if both are ready → send Slack notification if so
  await checkAndNotifyArtifacts(spaceName);
}

Why grant permissions before setting the flag? If the flag were set first and the permission grant then failed, the retry would exit at the idempotency check and permissions would never be granted. Drive permission granting is idempotent (HTTP 400 means permission already exists), so it's safe to execute multiple times.
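
The ordering argument can be demonstrated with a small synchronous simulation. Everything here is a simplified stand-in: RecordingState, handleRecordingOnce and the injected share callback are hypothetical, and the real handler is asynchronous and backed by Firestore and Drive.

```typescript
interface RecordingState {
  recordingReady: boolean;
  sharedWith: Set<string>;
}

// The share callback is injected so a transient failure can be simulated.
function handleRecordingOnce(
  state: RecordingState,
  members: string[],
  share: (member: string) => void,
): void {
  if (state.recordingReady) return; // idempotency check: duplicate delivery is a no-op

  for (const member of members) {
    share(member); // may throw; safe to repeat on retry
    state.sharedWith.add(member);
  }

  // The flag is set only after every side effect has succeeded.
  state.recordingReady = true;
}
```

If share() throws halfway through, recordingReady stays false, so the redelivered message runs the whole grant loop again. Had the flag been set before the loop, that retry would have returned at the first line and the remaining members would never have received access.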

③ handleTranscriptGenerated — On Transcript Completion

Structurally mirrors the recording handler. Extracts the Google Docs transcript as text, saves to GCS, then feeds into the embedding pipeline.

When Both Are Ready: Final Notification + Calendar Attachment

checkAndNotifyArtifacts() executes when both recording and transcript are Ready:

Send artifact notification to Slack
Attach recording and transcript files to the Calendar event
Grant permissions to Calendar invitees

The second step, attaching files to the Calendar event, is key. Normally, Google Meet automatically attaches files to the Calendar event when recording and transcription complete. In our system, DWD creates the Meet under a different account, so that auto-attachment doesn't work. We explicitly attach files via the Calendar API to preserve the same experience as default Meet.

async function attachFilesToCalendarEvent(event, artifacts) {
  const attachments = [];
  if (artifacts.recording) {
    attachments.push({ fileUrl: artifacts.recording.webViewLink, title: 'Recording' });
  }
  if (artifacts.transcript) {
    attachments.push({ fileUrl: artifacts.transcript.webViewLink, title: 'Transcript' });
  }

  // Deduplicate by fileUrl to be idempotent
  const existing = event.attachments ?? [];
  const newAttachments = attachments.filter(
    a => !existing.some(e => e.fileUrl === a.fileUrl)
  );

  await calendar.events.patch({
    calendarId: organizerEmail,
    eventId: event.id,
    requestBody: { attachments: [...existing, ...newAttachments] },
    supportsAttachments: true,
  });
}

This lets users access recordings and transcripts directly from the Calendar event detail view — whether they come via Slack or Calendar.

Deep Dive 6: Three-Layer Permission Model

"Who gets access?" is the most delicate design point. Too narrow and it's useless; too broad and it's a security risk.

Layer 1: Slack Channel Members

When each artifact is generated, all members of the linked Slack channel get Drive viewer access.

async function shareFileWithChannelMembers(fileId, channelId) {
  // Enumerate channel members via Slack API
  const members = await getChannelMembers(channelId);

  for (const member of members) {
    // Slack ID → Firestore → email
    const userInfo = await getUserInfo(member);
    if (!userInfo.email?.endsWith('@air-closet.com')) continue; // Domain filter

    const role = (member === organizerSlackId) ? 'writer' : 'reader';
    await shareFileWithUser(fileId, userInfo.email, role);
  }
}

Importantly, members who join the channel after the meeting can still get access. Permissions are granted from the member list as it stands when each artifact event is processed (including any Pub/Sub retry), so anyone who joined between the meeting and that moment receives access.

The organizer gets writer permissions, allowing them to manage the recording file (rename, change sharing settings, etc.).

Layer 2: Meeting Participants

On meeting end, participant info from the Meet API is saved to BQ. Participants may be guests not in the Slack channel, requiring a separate permission axis from Layer 1.

Layer 3: Calendar Invitees

When both artifacts are ready, permissions are also granted to Calendar event invitees.

async function attachToCalendarAndShareWithAttendees(meetInfo, artifacts) {
  const event = await getCalendarEventByMeetCode(meetInfo.meetingCode);
  if (!event) return;

  // Attach files to the Calendar event
  await attachFilesToCalendarEvent(event, artifacts);

  // Grant permissions to all invitees (organizer = writer, others = reader)
  const emails = (event.attendees ?? []).map(a => a.email); // attendees is absent on events with no invitees
  await shareFilesWithEmails(artifacts, emails, event.organizer.email);
}

People not in the Slack channel but on the Calendar invite (e.g., a manager who only wants to review meeting notes) also get access.

Security Guarantees

Common security rules apply across all three layers:

Domain filter: Only @air-closet.com email addresses are eligible. Prevents sharing with external users
Idempotent permission grants: HTTP 400 (permission already exists) is not treated as an error
Notification suppression: sendNotificationEmail: false prevents a flood of "X shared a file with you" emails
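The three rules above can be expressed as a single planning step. The following is a minimal sketch rather than the production code: `Member`, `Grant`, `planGrants`, and `isIgnorableGrantError` are hypothetical names, and it assumes Slack IDs have already been resolved to emails. Treating every HTTP 400 as "already exists" is a simplification of the idempotency rule.

```typescript
// Hypothetical types for illustration; the real pipeline resolves Slack IDs via Firestore.
interface Member { slackId: string; email?: string }
interface Grant {
  email: string;
  role: "writer" | "reader";
  sendNotificationEmail: false; // suppress "X shared a file with you" mail
}

const ALLOWED_DOMAIN = "@air-closet.com";

function planGrants(members: Member[], organizerSlackId: string): Grant[] {
  const seen = new Set<string>();
  const grants: Grant[] = [];
  for (const m of members) {
    const email = m.email?.toLowerCase();
    if (!email || !email.endsWith(ALLOWED_DOMAIN)) continue; // domain filter
    if (seen.has(email)) continue; // at most one grant per address
    seen.add(email);
    grants.push({
      email,
      role: m.slackId === organizerSlackId ? "writer" : "reader",
      sendNotificationEmail: false,
    });
  }
  return grants;
}

// Idempotency rule: an "already exists" response is not a failure.
// Simplified: real code would also inspect the error body, not just the status.
function isIgnorableGrantError(httpStatus: number): boolean {
  return httpStatus === 400;
}
```

Separating "who should get which role" from the actual Drive calls keeps the policy easy to unit-test without touching any API.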

Deep Dive 7: Embedding Generation & RAG Search Pipeline

This was the most exciting part to build.

Three Content Sources

Up to three types of text are extracted from each meeting and vectorized separately:

Content Type	Source	Purpose
`transcript`	Google Meet's native transcript (Google Docs)	Spoken word text
`gemini_transcript`	Gemini-generated transcript from the recording	Higher quality than native
`screen_share`	Gemini Vision-extracted screen share content	Slides, code, documents

Text Chunking: Bilingual Sentence Boundary Detection

function chunkText(text: string, chunkSize = 1000, overlap = 100): string[] {
  const chunks: string[] = [];
  let start = 0;

  while (start < text.length) {
    let end = Math.min(start + chunkSize, text.length);

    if (end < text.length) {
      // Find a sentence boundary to avoid cutting mid-sentence
      end = findSentenceBreak(text, end, start + 100);
    }

    chunks.push(text.slice(start, end));
    if (end >= text.length) break; // Last chunk emitted; stop instead of re-reading the tail

    // Overlap preserves context across chunks; always advance so the loop terminates
    start = Math.max(end - overlap, start + 1);
  }
  return chunks;
}

findSentenceBreak() searches backward from the chunk boundary for sentence-ending punctuation. It supports both Japanese (。, ！, ？) and English (., !, ?), with fallback to spaces and fullwidth spaces. A minimum of 100 characters per chunk is enforced.

Meeting transcripts frequently mix Japanese and English, making bilingual boundary detection essential.
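The article does not show `findSentenceBreak()` itself, so here is a minimal sketch of what such a helper could look like. The parameter names and the two-pass strategy are assumptions; only the punctuation sets, the whitespace fallback, and the `(text, end, minEnd)` call shape come from the description above.

```typescript
// Sentence-ending punctuation for Japanese and English, as listed in the text.
const SENTENCE_END = new Set(["。", "！", "？", ".", "!", "?"]);
// Fallback break points: ASCII space and fullwidth space (U+3000).
const BREAK_SPACES = new Set([" ", "\u3000"]);

/**
 * Returns the index at which to cut, searching backward from `end`
 * but never going at or below `minEnd` (the minimum chunk length).
 */
function findSentenceBreak(text: string, end: number, minEnd: number): number {
  // Pass 1: prefer the nearest sentence-ending punctuation before `end`.
  for (let i = end; i > minEnd; i--) {
    if (SENTENCE_END.has(text.charAt(i - 1))) return i; // cut just after the punctuation
  }
  // Pass 2: fall back to whitespace.
  for (let i = end; i > minEnd; i--) {
    if (BREAK_SPACES.has(text.charAt(i - 1))) return i;
  }
  // No boundary found in range: hard cut at the original position.
  return end;
}
```

Returning the index just after the punctuation keeps the terminator in the earlier chunk, so each chunk ends with a complete sentence.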

Screen Share Content Extraction with Gemini

Transcripts alone miss content shown via screen sharing — slides, code, documents. When you need to find "that thing on the slide," it's not searchable.

We use Gemini 3 Flash (gemini-3-flash-preview) multimodal input to extract screen share content directly from the recording video.

async function analyzeScreenShareFromVideo(gcsUri: string): Promise<string> {
  const result = await gemini.generateContent({
    model: GEMINI_MODEL,  // gemini-3-flash-preview
    contents: [{
      parts: [{
        fileData: { mimeType: 'video/mp4', fileUri: gcsUri },
        // Unlike transcription, video frames matter here — higher fps
        videoMetadata: { fps: 0.2 },
      }, {
        text: `Extract the content shown via screen sharing in this video.
               Transcribe any slide text, document content,
               or code that appears.`,
      }],
    }],
    generationConfig: { temperature: 0.2 },
  });
  return result.response.text();
}

The fps differentiation is key. For transcription, only audio matters, so fps: 0.1 (1 frame per 10 seconds) minimizes video tokens. For screen share analysis, visual content matters, so fps: 0.2 (1 frame per 5 seconds).
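To make the trade-off concrete, the frame counts can be worked out directly. This is only an arithmetic sketch: `sampledFrames` is a hypothetical helper, and the token cost per frame depends on the model and is not modeled here.

```typescript
// Number of video frames sampled for a recording of the given length.
function sampledFrames(durationSeconds: number, fps: number): number {
  return Math.round(durationSeconds * fps);
}

const oneHourSeconds = 60 * 60;
const transcriptionFrames = sampledFrames(oneHourSeconds, 0.1); // 1 frame per 10 s
const screenShareFrames = sampledFrames(oneHourSeconds, 0.2);   // 1 frame per 5 s

console.log({ transcriptionFrames, screenShareFrames });
// → { transcriptionFrames: 360, screenShareFrames: 720 }
```

For a one-hour meeting, screen share analysis therefore sends twice as many frames as transcription.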

For long meetings that hit the input token limit, an automatic fallback splits the video into 30-minute chunks:

async function transcribeFromVideo(gcsUri: string): Promise<string> {
  try {
    // Try processing the full video first
    return await callGemini(gcsUri);
  } catch (error) {
    if (isTokenLimitError(error)) {
      // Token limit hit → split into 30-minute chunks
      return await transcribeVideoInChunks(gcsUri, 30 * 60);
    }
    throw error;
  }
}
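How `transcribeVideoInChunks` picks its boundaries is not shown, so the following is a sketch under the assumption that it simply walks the recording in fixed windows. `TimeRange` and `splitIntoRanges` are hypothetical names. Each range could then be passed to the model as a start/end offset on the video part, if the API in use supports that.

```typescript
interface TimeRange {
  startSeconds: number;
  endSeconds: number;
}

// Splits [0, totalSeconds) into consecutive windows of at most chunkSeconds.
function splitIntoRanges(totalSeconds: number, chunkSeconds: number): TimeRange[] {
  if (chunkSeconds <= 0) throw new RangeError("chunkSeconds must be positive");
  const ranges: TimeRange[] = [];
  for (let start = 0; start < totalSeconds; start += chunkSeconds) {
    ranges.push({
      startSeconds: start,
      endSeconds: Math.min(start + chunkSeconds, totalSeconds), // last window may be shorter
    });
  }
  return ranges;
}
```

A 95-minute recording with 30-minute windows yields four ranges, the last one covering only the final 5 minutes.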

BigQuery Vector Search

Vector data is stored in per-channel BQ tables (meet_{channelId}). Splitting tables by channel enables filter-free Vector Search for within-channel queries. A separate aggregated table with channel_id clustering handles cross-channel search.

async function insertMeetChunks(chunks, meetInfo) {
  const channelTableId = `meet_${meetInfo.channelId}`;

  // Auto-create table if it doesn't exist (day-partitioned)
  await ensureMeetChannelTable(channelTableId);

  for (const chunk of chunks) {
    await insertRow(channelTableId, chunk);
  }
}

Access Control at Search Time

SELECT
  chunkText, meetingId, channelId,
  ML.DISTANCE(text_embedding, @query_embedding, 'COSINE') AS distance
FROM `meet_chunks`
WHERE channelId IN UNNEST(@accessible_channels)  -- Access control
ORDER BY distance
LIMIT 10

@accessible_channels is the list of Slack channel IDs the user is a member of. Meeting content from channels you're not in will never appear in results, even if it exists in BQ.

COSINE distance is converted to a 0–1 relevance score via 1 - distance / 2. Only chunks above the threshold are fed into Gemini's context to generate the answer.
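Cosine distance ranges from 0 (same direction) to 2 (opposite direction), which is why dividing by 2 maps it onto 0–1. A minimal sketch of the conversion and the threshold filter follows; `ScoredChunk`, `selectForContext`, and the threshold value used in the usage note are hypothetical, since the article does not state the actual cutoff.

```typescript
interface ScoredChunk {
  chunkText: string;
  distance: number; // cosine distance as returned by the vector search
}

// Maps cosine distance in [0, 2] to a relevance score in [0, 1].
function relevanceFromCosineDistance(distance: number): number {
  const clamped = Math.min(2, Math.max(0, distance)); // guard against floating-point drift
  return 1 - clamped / 2;
}

// Keeps only chunks whose relevance is above the threshold.
function selectForContext(chunks: ScoredChunk[], minRelevance: number): ScoredChunk[] {
  return chunks.filter((c) => relevanceFromCosineDistance(c.distance) > minRelevance);
}
```

With an illustrative threshold of 0.7, a chunk at distance 0.2 (relevance 0.9) is kept and a chunk at distance 0.8 (relevance 0.6) is dropped.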

Deep Dive 8: GCS Operations

Streaming Copy from Drive to GCS

Recording files can be hundreds of MBs. Loading everything into memory would exhaust Cloud Run's memory, so we stream downloads directly into uploads.

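async function copyDriveFileToGCS(driveFileId: string, gcsPath: string) {
  // token, bucket, and mimeType are resolved by the surrounding code (omitted here)

  // Stream download from Drive API
  const response = await fetch(
    `https://www.googleapis.com/drive/v3/files/${driveFileId}?alt=media`,
    { headers: { Authorization: `Bearer ${token}` } }
  );
  if (!response.ok || !response.body) {
    throw new Error(`Drive download failed: ${response.status}`);
  }

  // Stream upload to GCS JSON API (the object name is URL-encoded for the query string)
  const upload = await fetch(
    `https://storage.googleapis.com/upload/storage/v1/b/${bucket}/o?name=${encodeURIComponent(gcsPath)}&uploadType=media`,
    {
      method: 'POST',
      headers: { Authorization: `Bearer ${token}`, 'Content-Type': mimeType },
      body: response.body,  // Pass ReadableStream directly
      duplex: 'half',       // Required by Node.js fetch when the body is a stream
    } as RequestInit
  );
  if (!upload.ok) throw new Error(`GCS upload failed: ${upload.status}`);
}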

Note: We use the GCS JSON API directly instead of @google-cloud/storage's file.save() because the latter has a bug where multipart boundary strings get mixed into binary data during upload, corrupting recording files.

GCS File Structure

gs://bucket/
└── meet/
    └── {channelId}/
        └── {spaceId}/
            ├── recording.mp4              # Recording file
            ├── transcript_original.txt    # Google Docs transcript
            ├── gemini_transcript.txt      # Gemini transcript
            └── screen_share.txt           # Screen share analysis

The channelId → spaceId hierarchy makes per-channel data management and lifecycle policy application straightforward. GCS lifecycle auto-deletes after 90 days (originals remain on Drive).
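The layout above maps naturally onto a small path builder. This is an illustrative sketch: `ArtifactKind`, `meetObjectPath`, and `channelPrefix` are hypothetical names, while the file names are taken from the tree above.

```typescript
type ArtifactKind =
  | "recording"
  | "transcript_original"
  | "gemini_transcript"
  | "screen_share";

const ARTIFACT_FILE_NAMES: Record<ArtifactKind, string> = {
  recording: "recording.mp4",
  transcript_original: "transcript_original.txt",
  gemini_transcript: "gemini_transcript.txt",
  screen_share: "screen_share.txt",
};

// Object name (without the gs://bucket/ part) for one artifact of one meeting.
function meetObjectPath(channelId: string, spaceId: string, kind: ArtifactKind): string {
  return `meet/${channelId}/${spaceId}/${ARTIFACT_FILE_NAMES[kind]}`;
}

// Prefix covering everything that belongs to one channel,
// usable for listing or bulk operations on that channel's data.
function channelPrefix(channelId: string): string {
  return `meet/${channelId}/`;
}
```

Because every artifact of a channel shares one prefix, per-channel operations reduce to a single prefix match.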

Deep Dive 9: Slack Notification Design

Two-Phase Notification

To avoid making users wait, we split notifications into two phases:

Phase 1 (immediately after meeting end):

🎬 Meeting ended

"Weekly Standup" has ended.
We'll notify you when the recording and transcript are ready.

Created by: @tanaka

At this point, the recording and transcript are still processing. But users can confirm that the meeting was successfully recorded.

Phase 2 (after artifacts are ready — thread reply):

📹 Recording and transcript are ready!

🎥 Recording
   https://drive.google.com/file/d/xxx

📝 Transcript
   https://docs.google.com/document/d/xxx

ℹ️ Channel members have viewing access

Phase 2 is sent as a thread reply to Phase 1. The Phase 1 message's ts (timestamp) is saved to Firestore and used as the thread parent for Phase 2.
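The threading relies on Slack's `thread_ts` argument to `chat.postMessage`: the reply carries the parent message's `ts`. Below is a minimal sketch of the two payloads; `SlackMessage`, `buildPhase1`, and `buildPhase2` are hypothetical names, and the actual send call and Firestore read/write are left out.

```typescript
interface SlackMessage {
  channel: string;
  text: string;
  thread_ts?: string; // set only on replies; value is the parent message's ts
}

function buildPhase1(channel: string, title: string): SlackMessage {
  return {
    channel,
    text: `🎬 Meeting ended\n\n"${title}" has ended.\nWe'll notify you when the recording and transcript are ready.`,
  };
}

function buildPhase2(
  channel: string,
  parentTs: string, // ts returned when Phase 1 was posted, loaded from storage
  recordingUrl: string,
  transcriptUrl: string,
): SlackMessage {
  return {
    channel,
    thread_ts: parentTs,
    text: `📹 Recording and transcript are ready!\n\n🎥 Recording\n   ${recordingUrl}\n\n📝 Transcript\n   ${transcriptUrl}`,
  };
}
```

Keeping Phase 2 in the thread means the channel sees one top-level message per meeting, with the links attached to it.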

Observability: OpenTelemetry + Grafana + Prometheus

All processing in this system is instrumented with OpenTelemetry and aggregated in Grafana. Meet Space creation, Pub/Sub event processing, Drive→GCS copy, embedding generation, Slack notifications — latency and error rates for each step are visible on a single dashboard.

Through the Grafana MCP introduced in the previous article, these logs and metrics are also accessible via MCP. Investigations like "Show me error logs from yesterday's Meet pipeline" can be done directly from Claude Code.

For Gemini API costs, we track actual usage and costs via Prometheus. Token consumption for transcription and screen share analysis is visualized in real-time, so cost anomalies are caught immediately.

Beyond: Meeting Data as a Project Knowledge Base

The system described so far is about "sharing and searching meeting recordings and transcripts." But this data is already being leveraged in a broader context.

Project-Level Meeting Data Integration

At airCloset, Slack channels are created per project. The mapping between channels and projects is managed in Firestore, and through our Project Management MCP (described in the previous article), meeting data linked to a project is searchable via MCP.

For example, "Tell me what was discussed about this spec in Project X's past meetings" searches all meeting transcripts from that project's Slack channel and returns relevant excerpts.

Unified Search with Slack Messages

Beyond meeting transcripts, Slack messages themselves are also stored and vectorized in BigQuery using the same approach. The same MCP can search across both meeting content and Slack discussions.

This lets you trace what was decided in a meeting and how it was carried out in Slack afterward, or conversely, what was debated in Slack and which meeting made the final call. Being able to search across meetings and chat as one unified body of communication is remarkably powerful in practice.

Exploring Code Review Integration

We're currently exploring whether business context from meeting and Slack data could be used for specification checks during code reviews.

If we could automatically surface meeting decisions and Slack spec discussions related to code changes in a PR, and verify "Is this change consistent with the spec decided in the meeting on date X?" during review, we might be able to prevent bugs caused by misunderstood requirements. It's still in the conceptual stage, but the potential for meeting data utilization continues to expand.

Summary: Maximizing Meeting Value

Here's what this system achieves:

Problem	Solution
Effort of writing meeting notes	Auto-transcribed and auto-shared
Effort of rewatching recordings	Ask in natural language, get a summary
Effort of managing permissions	Auto-granted to channel members, participants, and invitees
Effort of creating Meets	One click from the Chrome extension
"What was that thing we discussed?"	Instantly found via RAG search
Screen-shared content not preserved	Auto-extracted by Gemini Vision

Technical highlights:

LIFO cache bringing Meet Space creation to under 100ms
Chrome Extension placing features on users' existing workflow, dramatically boosting adoption
Domain-Wide Delegation solving the file ownership problem
Workspace Events API + daily batch covering the 7-day TTL constraint
Idempotent event processing handling Pub/Sub's at-least-once delivery
Three-layer permission model ensuring access for all stakeholders
Per-channel table strategy enabling both scoped and cross-channel search
Gemini Vision fps differentiation optimizing transcription and screen share analysis costs

Meetings are a treasure trove of information. Letting that information sleep is a waste.

Google Workspace × GCP × Slack — maximizing the value of every meeting. I hope this helps anyone facing similar challenges.

References

We Built 17 MCP Servers to Let AI Run Our Internal Operations

Ryosuke Tsuji — Tue, 07 Apr 2026 16:22:59 +0000

Introduction

In a previous article, I introduced "DB Graph MCP" — a system that enables safe, cross-schema search and query execution across our entire database estate of 17 DBs and 994 tables.

/posts/db-graph-mcp

Thanks to the positive response, this time I'd like to introduce the rest of our MCP server fleet beyond DB Graph.

These were all built in roughly 3 months starting January 2026. We now have 17 MCP servers in production, covering databases, infrastructure, documentation, project management, observability, CI/CD, and even code editing and deployment by non-engineers — making virtually every aspect of our operations accessible to AI.

Overview

Here's the full lineup:

Category	Server	Description
Data	DB Graph	Company-wide DB dictionary + query execution (previous article)
Infrastructure	GCloud	GCP resources, read-only
	AWS	AWS resources, read-only
Docs & Knowledge	GWS	Full Google Workspace access
	Git Server	All Git repos, read-only
Graph	Code Graph	Codebase analysis (function → API → DB → event dependency tracking)
	Product Graph	Unified knowledge graph: code + DB + docs
	Biz Graph	Business initiative × KPI relationship graph
Observability	Grafana	Logs, metrics, and alert inspection
CI/CD	CircleCI	Pipeline execution, build logs, test results
Project Management	Project Management	BQ/Firestore/Sheets-integrated PM support
Domain-Specific	Stylist Insights	Stylist performance & KPI data
	UX Insights	UX analytics from BQ
	freee	Accounting API integration
Dev Platform	Workspace	ACL-gated monorepo editing & deployment
	Sandbox	App deployment for non-engineers

All servers are implemented in TypeScript, deployed to GCP via Pulumi, and authenticated with Google OAuth.

Design Philosophy

Why So Many Servers?

We could have built one monolithic MCP server, but we deliberately split them. Here's why:

Auth scope isolation — GWS needs Workspace API scopes; the DB query server doesn't. Minimizing scopes prevents privilege escalation.
Deploy independence — A Grafana server change doesn't affect DB queries. Blast radius stays small.
Per-user selection — Engineers add everything; marketing adds only GWS. Just put what you need in .mcp.json.

Shared Foundation

Every server shares common patterns:

Auth: A shared package implements Google OAuth 2.0 + PKCE with RFC 8414 auto-discovery. Just add the URL to .mcp.json and Claude Code handles the auth flow automatically. For business users, we simply register them as custom connectors in the Claude organization settings.

{
  "mcpServers": {
    "server-name": {
      "type": "http",
      "url": "https://mcp-xxx.your-domain.example/mcp"
    }
  }
}

That's it. No auth block needed. Same format for every server.

Session management: Upstash Redis as a shared session store across all servers. SSO cookies mean one login grants access to everything.

Tool usage logging: Every tool invocation is recorded in BigQuery. Who used what, when — fully auditable. We monitor usage rates, error rates, and usage patterns to drive improvements.

Infrastructure: GCloud / AWS

Have you ever wanted to let AI investigate your cloud environment? And simultaneously thought: "Is it safe to let it do that?"

In my case, I have admin-level privileges, which makes it even scarier. So I built MCP servers that are physically incapable of writing anything.

Two key design decisions:

OIDC / STS / Impersonate for secure auth — Zero persistent credentials
Per-account audit logging — Individual email addresses recorded in GCP Audit Log / CloudTrail

GCloud MCP

Claude Code → MCP Server → gcloud CLI subprocess → GCP APIs

Runs gcloud CLI on Cloud Run. The key point: writes are made impossible at the OAuth scope level.

OAuth scope: cloud-platform.read-only
GCP APIs check both scope and IAM — even admin users cannot write
GCP Audit Log records the user's email address
Account revocation on departure: just disable the Google Workspace account

# What you can do
"Show me the Cloud Run services in prod"
"Check the env vars for this service"
"List the Secret Manager secrets"

AWS MCP

Same philosophy, but AWS can't accept Google OAuth directly, so we use STS as a bridge.

Claude Code → MCP Server → GCP metadata → ID Token
                         → AWS STS AssumeRoleWithWebIdentity → temp credentials
                         → aws CLI subprocess → AWS APIs

Two layers of safety:

IAM Role with ReadOnlyAccess policy only
Temporary credentials with 1-hour expiry

Supports multiple AWS accounts via profile parameter. CloudTrail records assumed-role/mcp-aws-readonly/user@example.com.
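The CloudTrail entry above suggests the user's email is passed as the STS session name. A minimal sketch of the request parameters under that assumption follows; the field names are those of the STS `AssumeRoleWithWebIdentity` action, while `buildAssumeRoleParams`, `cloudTrailPrincipal`, and the role ARN in the usage note are hypothetical.

```typescript
interface AssumeRoleWithWebIdentityParams {
  RoleArn: string;
  RoleSessionName: string;
  WebIdentityToken: string;
  DurationSeconds: number;
}

function buildAssumeRoleParams(
  roleArn: string,
  userEmail: string,
  idToken: string,
): AssumeRoleWithWebIdentityParams {
  // STS limits session names to 2-64 characters from [\w+=,.@-].
  if (!/^[\w+=,.@-]{2,64}$/.test(userEmail)) {
    throw new Error(`email is not usable as a session name: ${userEmail}`);
  }
  return {
    RoleArn: roleArn,
    RoleSessionName: userEmail, // shows up in CloudTrail, giving per-user attribution
    WebIdentityToken: idToken,
    DurationSeconds: 3600, // 1-hour temporary credentials
  };
}

// How the assumed-role identity appears in CloudTrail.
function cloudTrailPrincipal(roleName: string, sessionName: string): string {
  return `assumed-role/${roleName}/${sessionName}`;
}
```

For example, `buildAssumeRoleParams("arn:aws:iam::123456789012:role/mcp-aws-readonly", "user@example.com", token)` produces a session that CloudTrail attributes to `assumed-role/mcp-aws-readonly/user@example.com`.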

Docs & Knowledge: GWS / Git Server

GWS (Google Workspace) MCP

Operate all Google Workspace services from Claude Code.

Claude Code → MCP Server → gws CLI subprocess → Google Workspace APIs

Runs gws CLI remotely, passing the user's OAuth access token directly. Each user accesses resources with their own permissions — you can see your Drive but not someone else's.

Since OAuth authentication and Google Workspace authorization happen simultaneously, the moment you connect to the MCP you have immediate access to your Workspace resources. No additional login or token setup required — the experience is seamless.

# What you can do
"Summarize the sales data in this spreadsheet"
"Extract meeting notes from last week's calendar"
"Summarize this document"

Git Server MCP

A read-only server for all company Git repositories.

The motivation: bypassing GitHub MCP rate limits. GitHub's official MCP server hits the GitHub API under the hood, and the rate limit kicks in surprisingly fast when AI is investigating a codebase.

Git Server MCP keeps main-branch clones of all repos on a GCE VM, operating via local git commands with zero rate limiting. Query as much as you want.

Tool	Description
`git_blame`	Last change commit per line
`git_log`	Commit history
`git_grep`	Cross-repo text search
`git_show`	Commit details
`git_diff`	Diff between commits
`read_file`	Read file contents
`list_files`	List directory contents
`search_repos`	Search repositories

No GitHub account needed — OAuth authentication is sufficient.

Observability: Grafana MCP

The official mcp/grafana Docker image deployed on Cloud Run, with an OAuth proxy in front.

Claude Code → OAuth Proxy → mcp-grafana → Grafana Cloud

Supports PromQL/LogQL queries, dashboard inspection, and alert rule review.

Importantly, Grafana dashboards and alert rules are also defined in the same repository, as Pulumi (TypeScript) code. This means:

Write application code
Define alert rules in the same repo
Alert fires in production
Claude Code reads logs via Grafana MCP
Fix the code in the same repo

The code → infra → observability → investigation → fix loop is completely closed.

CI/CD: CircleCI MCP

Integrates with CircleCI API v2. A shared CircleCI token sits behind Google SSO, so the whole team uses it without managing tokens.

Claude Code → OAuth Proxy → CircleCI MCP (sidecar) → CircleCI API v2

Cloud Run multi-container setup: the official @circleci/mcp-server-circleci runs as a sidecar, with our OAuth proxy in front.

# What you can do
"What's the status of the latest pipeline on main?"
"Show me the failure logs for this build"
"Find flaky tests"

Project Management MCP

A server for managing issues in Firestore and semantically searching Slack/Meet conversations.

Key capabilities:

Issue management: Create, update status, and list Issues in Firestore (with spreadsheet dual-write)
Context search: Vector search + Gemini summarization across Meet notes and Slack conversations
Project overview: View milestones, members, design docs, and test cases for your projects
Backlog integration: Retrieve ticket parent-child relationships via BQ

Domain-Specific

Stylist Insights / UX Insights MCP

Servers providing access to stylist performance/KPI data and UX analytics, respectively. Query interfaces over BQ aggregate tables.

freee MCP

An OAuth-authenticated proxy to the freee API for accounting data access.

Dev Platform: Workspace / Sandbox

This might be the most unique part.

Workspace MCP — Code Editing Without a GitHub Account

Provides ACL-gated file editing, commits, PR creation, and deployment for our internal monorepo.

No GitHub account required. Only a Google Workspace account (OAuth) is needed.

1. workspace_init          → Create worktree, initialize branch
2. workspace_write_file    → Edit code
3. workspace_diff          → Review changes
4. workspace_commit        → Commit
5. workspace_push          → Push to GitHub
6. workspace_deploy        → Deploy from feature branch (test)
7. Verify it works
8. workspace_create_pr     → Request review

Access control is managed in Firestore. Admins configure which stacks (directories) each user can edit and deploy.

{
  "allowedPaths": ["apps/web/xxx/", "apps/api/xxx/"],
  "allowedStacks": ["api-xxx", "pages-xxx"],
  "role": "developer"
}
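A check against this ACL document could look like the sketch below. It assumes prefix matching on `allowedPaths` and exact matching on `allowedStacks`; `WorkspaceAcl`, `canEditPath`, and `canDeployStack` are hypothetical names, and the real server may enforce more than this.

```typescript
interface WorkspaceAcl {
  allowedPaths: string[];
  allowedStacks: string[];
  role: string;
}

function canEditPath(acl: WorkspaceAcl, filePath: string): boolean {
  // Reject absolute paths and ".." segments so a prefix match cannot be escaped.
  if (filePath.startsWith("/")) return false;
  if (filePath.split("/").some((segment) => segment === "..")) return false;
  return acl.allowedPaths.some((prefix) => filePath.startsWith(prefix));
}

function canDeployStack(acl: WorkspaceAcl, stack: string): boolean {
  return acl.allowedStacks.includes(stack);
}

const exampleAcl: WorkspaceAcl = {
  allowedPaths: ["apps/web/xxx/", "apps/api/xxx/"],
  allowedStacks: ["api-xxx", "pages-xxx"],
  role: "developer",
};
```

Keeping the trailing slash in each `allowedPaths` entry matters: without it, `apps/web/xxx` would also match `apps/web/xxx-other/`.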

Non-engineers can safely edit and deploy only the stacks they're authorized for. In practice, a non-engineer team member is already using AI + Workspace MCP to improve a KPI dashboard that was built from scratch.

Sandbox MCP — App Deployment for Non-Engineers

Going even further: non-engineers can deploy their own apps for internal use.

1. sandbox_init_repo(app_name: "my-tool")    → Initialize repo
2. sandbox_write_file(...)                    → Write files
3. sandbox_publish(app_name: "my-tool")       → Deploy to Cloud Run
   → https://sbx-{nickname}--my-tool.example.com/

No gcloud, no Docker. Just tell Claude "I want a tool that does X" and it's published on an internal URL.

Deployed apps are protected by Cloudflare Access with Google Workspace authentication, so only internal members can access them. Even though they're on the public internet, access from outside the organization is impossible.

I cover this in more detail in a separate article.

Graph Servers: Code Graph / Product Graph / Biz Graph

A family of servers that analyze codebases and business logic as graph structures.

Server	Scope	Key Feature
DB Graph	Company-wide DBs (previous article)	Table dictionary + semantic search + live DB queries + PII anonymization
Code Graph	All source code (cross-repository)	Static analysis tracking function → API → DB → event dependencies across repos
Product Graph	Internal monorepo	Unified knowledge graph of code + DB + docs. Every node has business context
Biz Graph	Business initiatives & metrics	Initiative × metric relationship graph

Each has a different design philosophy and solves different problems. See the previous article for DB Graph; details on the others are coming in future posts.

Security Model

Here's the security approach shared across all servers.

Defense in Depth

Layer 1: Google Workspace OAuth + domain restriction
  → Organization domain only. External users cannot log in.

Layer 2: SSO + session management
  → Upstash Redis, 7-day TTL, sliding window

Layer 3: Per-server scope restrictions
  → GCloud: cloud-platform.read-only
  → AWS: ReadOnlyAccess policy
  → DB Graph: SELECT only + PII anonymization

Layer 4: Data-level protection
  → Automatic PII anonymization (40+ column patterns)
  → Confidential datasets controlled by BQ IAM
  → Production DBs via read replicas only

Layer 5: Audit logging
  → All tool invocations recorded in BQ
  → Individual email in GCP Audit Log / CloudTrail
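The "7-day TTL, sliding window" in Layer 2 means each successful access pushes the expiry out again, so only sessions idle for a full week lapse. The sketch below illustrates that behavior with an in-memory map standing in for Redis; `SlidingSessionStore` is a hypothetical class, not the shared auth package.

```typescript
const SESSION_TTL_MS = 7 * 24 * 60 * 60 * 1000; // 7 days

class SlidingSessionStore {
  private expiresAt = new Map<string, number>();

  create(sessionId: string, now: number): void {
    this.expiresAt.set(sessionId, now + SESSION_TTL_MS);
  }

  // Returns true if the session is still valid, and if so slides the expiry forward.
  touch(sessionId: string, now: number): boolean {
    const expiry = this.expiresAt.get(sessionId);
    if (expiry === undefined || expiry <= now) {
      this.expiresAt.delete(sessionId); // expired or unknown: force a new login
      return false;
    }
    this.expiresAt.set(sessionId, now + SESSION_TTL_MS);
    return true;
  }
}
```

In a Redis-backed store the same effect is typically achieved by resetting the key's expiry on every read.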

Automatic Revocation on Departure

Since every server depends on Google OAuth, disabling a Google Workspace account instantly revokes access to all MCP servers. No individual token revocation or account cleanup needed.

Takeaways

Lessons learned from building and operating our MCP server fleet:

1. Centralize authentication
Building OAuth as a shared package made adding new servers dramatically easier. Auth code per server is about 10 lines.

2. Start read-only
GCloud, AWS, and Git Server are all read-only. Allow reads first; add writes only when truly needed. This keeps security discussions simple.

3. Wrap existing tools
gcloud CLI, aws CLI, gws CLI, CircleCI MCP — put existing CLIs and MCP servers behind an OAuth proxy and the whole team can use them safely. No need to build from scratch.

4. Non-engineer access is the most exciting frontier
Workspace MCP and Sandbox MCP provide the foundation for non-engineers to edit code and deploy without a GitHub account. It's still early and the big wins are ahead, but this is where the most potential lies.

5. Keep everything in one repository
Application code, infrastructure (Pulumi), observability (Grafana alert rules), MCP servers — all in a single monorepo. This closes the loop: write code → deploy → monitor → find issues → fix.

In the DB Graph article, I described the problem of "how tables relate to each other existing only in specific people's heads." Looking at the full MCP server fleet, it's clear this isn't limited to databases.

Infrastructure state, code dependencies, document contents, project progress, user behavior logs — all of these were trapped in people's heads. Eliminating that is the essential role of our MCP server fleet.

Externalizing knowledge into a form that AI can access. That's the common theme across all our MCP servers.

Democratizing Internal Data — Building an MCP Server That Lets You Search 991 Tables in Natural Language

Ryosuke Tsuji — Wed, 25 Mar 2026 18:15:40 +0000

Hi, I'm Ryan, CTO at airCloset — Japan's leading fashion rental subscription service.

Today I want to share something I'm genuinely proud of: DB Graph and DB Graph MCP — a Model Context Protocol (MCP) server that lets anyone in our company search and query 15 schemas, 991 tables, 11 SQL databases, and 6 MongoDB instances using natural language through Claude Code.

You don't need to know a single table name. Ask "find tables related to returns" and it gives you the answer — across schemas, across database engines. And yes, it can query production data safely.

In this post, I'll walk through everything: what it does, how it works, the tool design, actual response formats, how we built the graph, how we operate it, and how we handle permissions and security.

The Problem: Nobody Knows All 991 Tables

airCloset has been running since 2015 — that's 10 years of accumulated database schema.

Resource	Count
SQL Databases	11 (MySQL 8 + PostgreSQL 3)
MongoDB Databases	6 (DocumentDB 5 + Atlas 1)
Schemas	15
Tables/Collections	991
ORMs	4 (TypeORM, Sequelize, Drizzle, Mongoose)
Repositories	28

Nobody in the company knows all of them. Not even close.

Here's a real scenario. Customer support asks: "This customer's app shows the return as completed, but has the warehouse actually confirmed receiving it?"

Think about what you need to investigate this.

The app-side return status lives in the aircloset schema's delivery order table. If the delivery status is "RETURNED", the app considers it done. Some people might know this much.

But the warehouse-side confirmation lives in the bridge schema. A receive record table's status being "COMPLETE" means the warehouse has physically processed the returned package.

The problem? These two live in completely separate databases. No foreign key connects them. To bridge the gap, there's an intermediate mapping table in aircloset that holds a warehouse order code (varchar) — which corresponds to a shipping order code in bridge. No FK, just a varchar match across schemas.

aircloset delivery order table (status = RETURNED)
  ↓ order_id
aircloset warehouse mapping table
  ↓ warehouse_order_code (varchar)
bridge shipping order table (matched by code — no FK!)
  ↓ shipping_order_id
bridge receive record table (status = COMPLETE = warehouse confirmed)

Table names are generalized for this article.

Four tables, two schemas, a foreign-key-less varchar join. How many people in the company know this path? You could count them on one hand. And if they're on vacation, the investigation stalls.

This is daily life in a 991-table × 15-schema world. It's not just "I don't know the table name." It's that the connections between schemas exist only in specific people's heads. That was the real problem.

DB Graph MCP — The Big Picture

This is what we built to solve it.

Four components:

DB Dictionary Graph Builder — A daily batch job that parses ORM definitions from 28 repositories and stores table/column/relationship info as a graph in BigQuery
DB Dictionary Review UI — A web app where humans verify AI-generated descriptions, mark deprecated columns, and add annotations. Review data survives daily rebuilds
DB Graph MCP Server — An MCP server (Cloud Run) that combines graph search with live DB querying
DB Account Pipeline — Fully automated DB access provisioning: application → approval → account creation → notification

Seeing It in Action

Let's solve the return investigation from above using DB Graph MCP.

Tool response examples below use generalized table/column names. The response format reflects actual output.

Step 1: Natural Language Table Search

Ask Claude Code: "Find tables related to return processing confirmation." Under the hood, search_tables runs a semantic search.

> search_tables(query: "return processing confirmation", search_type: "semantic")

5 tables found (by vector similarity):

bridge.return_packages (postgresql) (distance: 0.2557)
bridge.receive_records (postgresql) (distance: 0.2720)
cella.receive_confirmation_results (mysql) (distance: 0.2921)
bridge.receive_record_details (postgresql) (distance: 0.2951)
aircloset.return_status_change_histories (mysql) (distance: 0.3170)

A single search returns tables across three schemas (bridge, cella, aircloset). The table name "receive_records" doesn't contain the word "return" — but the AI-generated description includes "rental return processing" and "warehouse receiving", so it matches semantically.
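The `distance` values in the result are vector distances between the query embedding and each table's description embedding. As a rough illustration of that ranking step (not the actual implementation, which runs in BigQuery), here is an in-memory sketch; `TableDoc` and `rankTables` are hypothetical names, and the two-dimensional vectors in the usage note are toy values.

```typescript
// Cosine distance: 0 for identical direction, 1 for orthogonal, 2 for opposite.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) throw new Error("zero vector has no direction");
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface TableDoc {
  name: string;
  embedding: number[]; // embedding of the table's description
}

// Returns the `limit` tables closest to the query, nearest first.
function rankTables(query: number[], tables: TableDoc[], limit: number) {
  return tables
    .map((t) => ({ name: t.name, distance: cosineDistance(query, t.embedding) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, limit);
}
```

Because the comparison is made on description embeddings rather than table names, a table such as `receive_records` can rank highly for a "return" query even though its name never mentions returns.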

Step 2: Table Detail

The second hit in bridge looks promising. Let's get the details.

> get_table_detail(table_name: "bridge.receive_records")

# bridge.receive_records
DB: POSTGRESQL / ORM: typeorm / Repository: bridge-api

## Columns (9)
- id: int [PK, AI, NOT NULL]
- code: varchar [NOT NULL]
- shipping_order_id: varchar [NOT NULL]
- status: enum [NOT NULL, default=IN_PROGRESS]
- type: enum [NOT NULL]
- receive_datetime: varchar [NOT NULL]
- operated_by: varchar [NOT NULL]
- created_at / updated_at: datetime

## References (2)
- shipping_order_id → bridge.shipping_orders.id (explicit)
- operated_by → bridge.users.id (explicit)

## Referenced By (1)
- bridge.receive_record_details.record_id → id (explicit)

## Enum Definitions (2)
- Status: COMPLETE=Received, IN_PROGRESS=Processing
- Type: RENTAL_RETURN=Rental return, BUSINESS_RETURN=Business return,
        RENTAL_RETURN_LACK=Rental return (missing items), BUSINESS_RETURN_LACK=Business return (missing items)

status = COMPLETE means "the warehouse has finished receiving." Exactly what we needed. Plus type = RENTAL_RETURN distinguishes rental returns from business returns. Enum definitions with human-readable labels — visible at a glance.

Step 3: Discovering the Cross-Schema Path

Now the question: how do we connect the aircloset delivery order (app side) to the bridge receive record (warehouse side)? Let's use trace_relationships.

> trace_relationships(table_name: "bridge.shipping_orders", direction: "both", max_depth: 1)

# Relationship trace: bridge.shipping_orders
Nodes: 23, Edges: 22

## Relationships (excerpt)
- shipping_orders.shop_id → shops.id (explicit)
- shipping_orders.warehouse_id → warehouses.id (explicit)
- receive_records.shipping_order_id → shipping_orders.id (explicit)     ← warehouse confirmation!
- return_packages.shipping_order_id → shipping_orders.id (explicit)     ← return shipment
- shipping_packages.shipping_order_id → shipping_orders.id (explicit)   ← outbound shipment
- shipping_inspections.shipping_order_id → shipping_orders.id (explicit) ← inspection
...
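Conceptually, a trace with `direction: "both"` and a `max_depth` is a bounded breadth-first walk over the relationship edges. The sketch below shows that idea on an in-memory edge list; `Edge` and `traceRelationships` here are hypothetical stand-ins, not the MCP server's code.

```typescript
interface Edge {
  from: string; // table holding the referencing column
  to: string;   // referenced table
}

function traceRelationships(
  edges: Edge[],
  start: string,
  maxDepth: number,
): { nodes: string[]; edges: Edge[] } {
  const visited = new Set<string>([start]);
  const found = new Set<Edge>();
  let frontier = new Set<string>([start]);

  for (let depth = 0; depth < maxDepth && frontier.size > 0; depth++) {
    const next = new Set<string>();
    for (const edge of edges) {
      // direction "both": follow the edge whichever end is on the frontier
      if (!frontier.has(edge.from) && !frontier.has(edge.to)) continue;
      found.add(edge);
      for (const node of [edge.from, edge.to]) {
        if (!visited.has(node)) {
          visited.add(node);
          next.add(node);
        }
      }
    }
    frontier = next;
  }
  return { nodes: Array.from(visited), edges: Array.from(found) };
}
```

At depth 1 the result is a star around the start table, which is consistent with the output above reporting one more node than edges (23 nodes, 22 edges).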

Found the path from bridge.shipping_orders to receive_records. Next, we find the mapping table connecting aircloset and bridge.

> search_tables(query: "warehouse_mapping", search_type: "table", adjacent_depth: 1)

aircloset.warehouse_shipping_relations (mysql)

### Related Tables
  → aircloset.delivery_orders (order_id → id)

> get_table_detail(table_name: "aircloset.warehouse_shipping_relations")

## Columns (4)
- order_id: int [PK, NOT NULL]              ← aircloset delivery order ID
- warehouse_order_code: varchar [NOT NULL]   ← bridge shipping order code

Found it. order_id links to the aircloset side, warehouse_order_code links to the bridge side. No FK, but this varchar is the only key connecting two schemas.

Step 4: Querying Real Data

Now we build cross-schema queries. First, get the delivery order and warehouse code from aircloset.

> sql_query_database(database: "aircloset", sql: "SELECT ... WHERE user_id = 12345 AND status = 'RETURNED'")

**aircloset** (staging) — 1 row

| id     | status   | returned_date       | warehouse_order_code |
|--------|----------|---------------------|----------------------|
| 98765  | RETURNED | 2026-03-20 10:30:00 | SO-2026-00012345     |

> **Table**: Manages the full lifecycle of delivery orders — styling → shipping → return status tracking

### Column Descriptions
- **status**: Delivery status (1=Awaiting shipment, 2=Ready, 3=Delivered, 4=Returned, 5=Cancelled)
- **returned_date**: Date/time the warehouse received the customer's return
- **warehouse_order_code**: Mapping code to bridge shipping order

### Related Tables
- → **aircloset.users** (user_id → id): Customer profile...
- → **aircloset.plans** (plan_id → id): Subscription plan definitions...
- ← **aircloset.styling_feedbacks** (delivery_id → id): Customer feedback on styling...
- ← **aircloset.rental_items** (delivery_id → id): Items in this order...

Notice that column descriptions and related tables are automatically appended below the query result. This metadata is pulled from the graph data cached in Redis (cache-invalidated on graph updates). AI can read this enrichment to determine its next step — like "use the warehouse code to query bridge."

Now check the warehouse side:

> sql_query_database(database: "bridge", sql: "SELECT ... WHERE code = 'SO-2026-00012345'")

**bridge** (staging) — 1 row

| code             | status  | receive_status | type          | receive_datetime    |
|------------------|---------|---------------|---------------|---------------------|
| SO-2026-00012345 | SHIPPED | COMPLETE      | RENTAL_RETURN | 2026-03-21 14:22:00 |

> **Table**: Records warehouse receiving operations — arrival confirmation and inspection status

### Column Descriptions
- **status**: Shipping order status (ORDERED→ALLOCATED→PICKED→INSPECTED→SHIPPED→CANCELED)
- **receive_status**: Receive status (IN_PROGRESS=Processing, COMPLETE=Received)
- **type**: Receive type (RENTAL_RETURN=Rental return, BUSINESS_RETURN=Business return)

### Related Tables
- → **bridge.warehouses** (warehouse_id → id): Source warehouse...
- → **bridge.shops** (shop_id → id): Source shop...
- ← **bridge.receive_record_details** (record_id → id): Individual item details...
- ← **bridge.shipping_packages** (order_id → id): Outbound package info...

receive_status = COMPLETE — the warehouse has confirmed receipt. Both the app-side return status and the warehouse-side physical confirmation are verified.

This enrichment is the key to AI-powered investigation. Claude Code reads the column descriptions and related tables to autonomously decide "what to query next" and "how to interpret these values." No human guidance needed.

Beyond Operations: Cross-Service Analytics

This isn't limited to operational investigations. It works for business analytics too.

Try asking Claude Code:

How many customers used our spot rental service last week, what percentage of them are airCloset monthly subscribers, and how frequently do those subscribers use the main service?

Answering this requires crossing the spot rental order table (spot_rental schema) with the main service's member and usage tables (aircloset schema).

Claude Code uses DB Graph MCP to identify the relevant tables via search_tables, discover join keys via trace_relationships, and run queries against both databases to produce the aggregated result. Cross-service analytics from a single natural language question — that's the core value.

Without DB Graph MCP

Imagine doing these investigations without any tooling:

Return confirmation:

You need to know the delivery order table exists in aircloset
You need to know about the warehouse mapping table that bridges schemas
You need to know that a varchar warehouse code maps to bridge's shipping code
You need to know that bridge's receive record table is the warehouse confirmation
You need to know what enum values like COMPLETE and RENTAL_RETURN mean

Cross-service analytics:

You need to know the spot rental DB schema name and table structure
You need to know the join key to the main service's member table
You need connection credentials for both databases
You need to correctly interpret member statuses and usage counts

In both cases, the required knowledge spans multiple services and schemas. Probably fewer than five people hold all of it in their heads. With DB Graph MCP, anyone can get there through natural language search → table detail → relationship tracing → live queries.

Now let's dive into how this works.

Tool Design: 7 Tools in 3 Categories

Dictionary Tools (no DB credentials required)

Tool	Purpose
`search_tables`	Name search + vector similarity search across tables/columns
`get_table_detail`	Full table info: columns, FKs, enums, DEAD annotations
`trace_relationships`	BFS traversal of table relationships

Dictionary tools read pre-built graph data from BigQuery — no individual DB credentials needed. Anyone with a Google OAuth login can use them immediately, with no access request.
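To make the BFS behind trace_relationships concrete, here is a minimal sketch over an in-memory edge list. The function name `trace`, the `EDGES` sample, and the `"in"`/`"out"` direction values are assumptions made for illustration; the real tool reads nodes and edges from BigQuery and its internals are not shown in this article.

```python
from collections import deque

# Hypothetical edges: (referencing_table, referenced_table, confidence).
EDGES = [
    ("receive_records", "shipping_orders", "explicit"),
    ("return_packages", "shipping_orders", "explicit"),
    ("shipping_orders", "shops", "explicit"),
    ("shipping_orders", "warehouses", "explicit"),
    ("receive_record_details", "receive_records", "explicit"),
]


def trace(start, edges, direction="both", max_depth=1):
    """Breadth-first walk returning {table: depth} for every reachable table."""
    neighbours = {}
    for src, dst, _confidence in edges:
        if direction in ("out", "both"):
            neighbours.setdefault(src, set()).add(dst)
        if direction in ("in", "both"):
            neighbours.setdefault(dst, set()).add(src)

    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] >= max_depth:
            continue  # do not expand beyond the requested depth
        for nxt in sorted(neighbours.get(node, ())):
            if nxt not in depths:
                depths[nxt] = depths[node] + 1
                queue.append(nxt)
    return depths
```

With `max_depth=1` the walk stops at direct neighbours, which mirrors the `max_depth: 1` call in the walkthrough; raising it to 2 would also surface tables such as receive_record_details.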

Query Tools (DB credentials required)

Tool	Purpose
`list_databases`	List databases you have access to
`sql_query_database`	Execute SELECT queries against MySQL/PostgreSQL
`describe_database_table`	Get live schema from actual DB
`mongo_query_database`	Execute find/aggregate against DocumentDB/Atlas

Query tools use per-user credentials stored in Firestore. You only see databases you've been granted access to.

This separation is intentional. The dictionary is open to everyone; data access is permission-controlled. "Everyone should know what tables exist, but accessing the data requires authorization."

Why BigQuery? — Technology Choices

We use BigQuery as the graph store. "Shouldn't a graph live in a dedicated graph DB like Neo4j?" you might ask.

We chose BigQuery because one store handles graph + vector search + analytics:

VECTOR_SEARCH: Store 768-dimensional embeddings and run cosine similarity search natively. No separate vector DB needed
Graph traversal: Node + edge table design enables BFS traversal through simple recursive JOINs
JSON type: JSON_SET on a properties column lets us flexibly append review data without schema changes
Serverless: No instance management. Pay only for queries, not idle time
Vertex AI integration: Gemini 3 Flash for description generation and embedding models connect seamlessly within GCP
Google Workspace integration: OAuth uses Google Accounts directly. Domain restriction, nickname resolution, and permission management all flow through the same identity — no separate IdP needed

A dedicated graph DB like Neo4j has superior traversal performance, but at 991 tables, BigQuery is more than sufficient. The operational simplicity of "vector search, JSON, analytics, and graph all in one place" far outweighs the performance difference.

How Natural Language Search Works

How does "return processing confirmation" find a receive records table?

Step 1: Generate Table Descriptions

The DB Dictionary Graph Builder runs daily at 6:00 AM JST, generating AI descriptions for each table using Gemini 3 Flash:

Example: bridge.receive_records
→ "Records warehouse receiving operations. Tracks rental returns
   and business returns with completion/in-progress status.
   Links to shipping orders to trace which order a return belongs to."

Step 2: Generate Embeddings

Each description is converted to a 768-dimensional vector using Vertex AI's embedding model and stored in BigQuery.

Step 3: VECTOR_SEARCH

The user's query is also converted to a 768-dimensional vector, then matched via BigQuery's VECTOR_SEARCH using cosine distance:

SELECT base.qualifiedName, distance
FROM VECTOR_SEARCH(
  TABLE `project.db_graph_nodes`,
  'embedding',
  (SELECT @query_embedding AS embedding),
  top_k => 20,
  distance_type => 'COSINE'
)
WHERE base.nodeType = 'Table'
ORDER BY distance ASC

Even if "return" doesn't appear in the table name, the AI description's mention of "rental returns" places the table close to the query in vector space. That's the core of natural language search.
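For intuition about what the cosine-distance ranking does, the same idea can be sketched in plain Python. This is only an illustration: the toy 3-dimensional vectors and the function names are made up, while the real system uses 768-dimensional embeddings and BigQuery's VECTOR_SEARCH.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def nearest_tables(query_vec, table_vecs, top_k=20):
    """Rank tables by ascending distance to the query embedding."""
    scored = [(name, cosine_distance(query_vec, vec)) for name, vec in table_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1])[:top_k]


# Toy embeddings, invented for this example.
TABLE_VECS = {
    "bridge.receive_records": [0.9, 0.1, 0.0],
    "aircloset.plans": [0.0, 0.2, 0.9],
    "bridge.shops": [0.1, 0.9, 0.1],
}

if __name__ == "__main__":
    for name, distance in nearest_tables([1.0, 0.0, 0.0], TABLE_VECS, top_k=2):
        print(f"{name}: {distance:.3f}")
```

A query vector pointing in roughly the same direction as a table's description vector gets a distance near 0, regardless of whether the two texts share any literal words.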

Building the Graph

6-Phase Pipeline

The builder runs six phases daily:

(See the Builder section of the diagram)

① ORM Parsing — Parse 4 ORM types (TypeORM, Sequelize, Drizzle, Mongoose) across 28 repositories to extract table definitions.

② Live DB Validation — Query actual staging DBs via Lambda to compare code definitions against real schemas. Auto-exclude tables that exist in code but not in the database.

③ AI Description — Generate table/column descriptions with Gemini 3 Flash. Incremental detection regenerates only changed tables to minimize AI cost.

④ Graph Construction — Generate 4 node types (Schema/Table/Column/Enum) and 5 edge types (HAS_TABLE/HAS_COLUMN/REFERENCES/USES_ENUM/SAME_ENTITY).

⑤ Embedding Generation — Generate 768-dimensional vectors per table via Vertex AI.

⑥ BQ MERGE — Load into BigQuery using MERGE, preserving human-written descriptions and DEAD flags. Auto-generated data never overwrites manual annotations.

Relationship Confidence Levels

Foreign key detection has varying confidence:

Confidence	Detection Method	Reliability
`explicit`	Directly from ORM `@JoinColumn()` or `belongsTo()`	Certain
`inferred`	Naming convention: `xxx_id` → `xxx` table	High probability
`manual`	Added by human reviewers	Certain

This lets AI judge the reliability of suggested JOIN conditions before using them.
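The `inferred` level can be pictured with a small sketch of the naming-convention rule. Everything here is an assumption for illustration: the function name, the naive plural handling (`shop_id` to `shops`), and the shape of the `explicit` mapping are not taken from the actual builder.

```python
def infer_references(columns, known_tables, explicit=None):
    """Return (column, target_table, confidence) tuples for one table.

    explicit maps column names to targets already found in ORM definitions.
    """
    explicit = explicit or {}
    refs = []
    for col in columns:
        if col in explicit:
            refs.append((col, explicit[col], "explicit"))
            continue
        if not col.endswith("_id"):
            continue
        stem = col[: -len("_id")]
        # Assumption: try the stem itself, then a naive plural form.
        for candidate in (stem, stem + "s"):
            if candidate in known_tables:
                refs.append((col, candidate, "inferred"))
                break
    return refs
```

A column such as `external_id` with no matching table yields no edge at all, which is why an inferred edge is "high probability" rather than certain: the rule only fires when a plausible target table exists.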

SAME_ENTITY Edges

The same logical entity sometimes exists in both SQL and MongoDB — for example, a MySQL users table and a MongoDB user statistics collection both represent the same user. SAME_ENTITY edges express these cross-engine correspondences, enabling seamless cross-database discovery.

Human Review: AI Alone Isn't Enough

"Are AI-generated descriptions actually accurate?" Honestly — not always.

Gemini 3 Flash produces decent high-level descriptions, but it cannot supply 10 years of business context. Tacit knowledge such as "this column was migrated 3 years ago but never dropped from the schema" or "enum value 5 is actually never used" can't be filled in by AI alone.

That's why we built human review into the system from day one.

Review Web UI

We have a dedicated review web app for the DB Dictionary.

The schema list shows review progress bars. The table list supports filtering by "unchecked", "checked", and "has deprecated items."

The table detail screen displays columns with type badges, FK targets, and enum definitions — with inline editing for descriptions and deprecation flags.

Review UI: FK targets and enum definitions shown as badges. Descriptions can be edited inline.

Available review actions:

Action	Description
Edit table description	Supplement or rewrite the AI-generated description
Edit column description	Per-column annotations ("deprecated", "use XX instead", etc.)
Mark as DEAD	Deprecation flag + reason + empty percentage, at table or column level
Mark as Checked	Review completion flag — records who checked and when
Bulk DEAD marking	Mark up to 500 tables/columns as deprecated at once

DEAD Flags: Surfacing 10 Years of Tacit Knowledge

After 10 years, deprecated columns accumulate. A flag that once represented member type — migrated years ago, now NULL in every row — still sits in the schema.

When a reviewer marks a column as deprecated, the MCP table detail shows:

- old_member_flag: int [NOT NULL, default=0, DEAD] ⚠ Deprecated. Use membership_status instead
- cancel_date: datetime [DEAD] ⚠ All rows NULL
- legacy_import_id: varchar [DEAD] ⚠ Legacy CSV import field. No longer used

This matters because it prevents AI from writing code that references the wrong column. When Claude Code loads table details into context and sees a DEAD flag, it knows to avoid that column.

Change Detection and Diff Review

When the daily build detects changes in table structure or AI descriptions, they're recorded as "pending changes." Reviewers can view before/after diffs in the web UI and mark them as reviewed.

This ensures nothing slips through — if yesterday's build changed something, someone will see it.

Review Data Persistence

Review data is stored in Firestore and never overwritten by daily builds.

The daily build follows this sequence:

ORM parsing → graph construction — Re-extract table definitions from latest code
BQ MERGE — Merge while preserving human-written textForEmbedding and embedding
Re-apply Firestore reviews — Write humanDescription, isDead, deadNote, checkedAt back to BQ properties

Reviews survive unlimited daily rebuild cycles. Firestore is the source of truth; BQ is its reflection.
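The re-apply step can be sketched as a simple overlay. This is a hedged illustration, not the actual job: plain dicts stand in for BigQuery node properties and Firestore documents, and the function name is hypothetical. Only the four field names come from the sequence above.

```python
REVIEW_FIELDS = ("humanDescription", "isDead", "deadNote", "checkedAt")


def reapply_reviews(rebuilt_nodes, reviews):
    """Overlay stored review fields on freshly rebuilt nodes.

    rebuilt_nodes: {qualified_name: properties} produced by today's build.
    reviews: {qualified_name: review_fields}, standing in for Firestore.
    """
    merged = {}
    for name, props in rebuilt_nodes.items():
        out = dict(props)  # copy so the build output itself is not mutated
        review = reviews.get(name, {})
        for field in REVIEW_FIELDS:
            if field in review:
                out[field] = review[field]
        merged[name] = out
    return merged
```

Because the overlay runs after every rebuild and only ever reads from the review store, a rebuild can regenerate everything else freely without losing human annotations.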

Crossing the VPC Wall: Cross-Cloud Architecture

Now for the security design I'm most proud of.

Problem: The MCP server runs on Google Cloud (Cloud Run). The databases are inside AWS VPCs. Cloud Run can't directly reach VPC-internal RDS/DocumentDB instances.

Solution: A three-stage authentication chain — GCP OIDC → AWS STS → VPC Lambda — enables secure cross-cloud connectivity.

Authentication Flow

1. Cloud Run (GCP) → Get OIDC token from GCP metadata server
2. OIDC token → AWS STS AssumeRoleWithWebIdentity
3. STS → Return temporary AWS credentials (1-hour TTL)
4. Temporary credentials → Invoke VPC-internal Lambda
5. Lambda → Execute query against VPC-internal RDS/DocumentDB

Key points:

Zero static AWS credentials. Dynamically obtained from GCP service account.
Temporary credentials cached for 5 minutes. Avoids per-request STS overhead.
Lambda executes inside VPC. DB connections never leave the VPC.
Production queries use Read Replicas only. Never connects to the master.
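The 5-minute credential cache can be sketched as a small TTL wrapper. This is an assumption-laden illustration: `TtlCache` and its `fetch` callback are hypothetical names, and the callback stands in for the OIDC-to-STS exchange, which is not reproduced here.

```python
import time


class TtlCache:
    """Caches one value and refetches it once the TTL has elapsed."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch          # e.g. a function performing the STS exchange
        self._ttl = ttl_seconds      # 300 s matches the 5-minute cache described above
        self._clock = clock          # injectable so the behaviour can be tested
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```

Keeping the cache TTL (5 minutes) well below the credential TTL (1 hour) means a cached credential is never close to expiry when it is handed to a request.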

SQL Validation (Defense in Depth)

Query safety is enforced at two layers:

MCP layer (1st):

Allowed: SELECT, SHOW, DESCRIBE, DESC, EXPLAIN, WITH...SELECT
Blocked: INSERT, UPDATE, DELETE, DROP, CREATE, ALTER, TRUNCATE, multi-statement via semicolons

Lambda layer (2nd):
The same validation runs inside Lambda. Even if the MCP layer is somehow bypassed, Lambda blocks it.
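A validator in this spirit could look like the sketch below. It is a deliberately conservative illustration, not the production rule set: the function name is hypothetical, tolerating a single trailing semicolon is an assumption, and the keyword scan will also reject harmless queries such as SHOW CREATE TABLE or a string literal containing the word "delete". It does not parse SQL comments.

```python
import re

ALLOWED_FIRST_KEYWORDS = {"SELECT", "SHOW", "DESCRIBE", "DESC", "EXPLAIN", "WITH"}
BLOCKED_KEYWORDS = {"INSERT", "UPDATE", "DELETE", "DROP", "CREATE", "ALTER", "TRUNCATE"}


def is_read_only(sql):
    """Return True only for a single statement that looks read-only."""
    statement = sql.strip().rstrip(";").strip()
    if not statement or ";" in statement:
        return False  # empty input, or more than one statement
    # Whole identifiers only, so a column like updated_at is not mistaken for UPDATE.
    words = re.findall(r"\b[A-Za-z_]\w*", statement.upper())
    if not words or words[0] not in ALLOWED_FIRST_KEYWORDS:
        return False
    # Scan the whole statement: WITH ... DELETE and EXPLAIN ANALYZE DELETE must fail too.
    return not any(word in BLOCKED_KEYWORDS for word in words)
```

Running the same function in both the MCP layer and the Lambda is what makes the second layer useful: a request that reaches the Lambda by another path still hits an identical check.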

Protecting Production Data — PII Anonymization

Querying production data is powerful, but handling personally identifiable information (PII) requires the most care.

Automatic Anonymization Rules

For production + view permission queries, PII column values are automatically anonymized:

Column Pattern	Replacement
Email fields	`*@*.com`
Name fields	`***`
Phone fields	`*--**`
Postal code fields	`*-**`
Address fields	`***`
Password fields	`[REDACTED]`
Date of birth fields	`**--**`
Card number fields	`[REDACTED]`

Table-specific rules handle ambiguous columns. For example, a generic name column isn't PII globally, but users.name or orders.buyer_name clearly is. These are configured per-table.
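How global patterns and table-specific rules combine can be sketched as follows. The column-name patterns, the `[EMAIL]`, `[PHONE]` and `[NAME]` masks, and the function name are all hypothetical placeholders; only `[REDACTED]` for passwords and the users.name / orders.buyer_name examples come from the text above.

```python
import re

# Hypothetical global rules: column-name pattern -> replacement mask.
GLOBAL_RULES = [
    (re.compile(r"e?mail", re.IGNORECASE), "[EMAIL]"),
    (re.compile(r"phone", re.IGNORECASE), "[PHONE]"),
    (re.compile(r"password", re.IGNORECASE), "[REDACTED]"),
]
# Table-specific rules for columns that are PII only in certain tables.
TABLE_RULES = {
    ("users", "name"): "[NAME]",
    ("orders", "buyer_name"): "[NAME]",
}


def anonymize_row(table, row, environment, perm_level):
    """Mask PII values, but only for production queries with view permission."""
    if not (environment == "production" and perm_level == "view"):
        return dict(row)
    masked = {}
    for column, value in row.items():
        replacement = TABLE_RULES.get((table, column))
        if replacement is None:
            for pattern, mask in GLOBAL_RULES:
                if pattern.search(column):
                    replacement = mask
                    break
        # Leave NULLs alone so "is this field empty?" remains answerable.
        masked[column] = value if replacement is None or value is None else replacement
    return masked
```

Checking the table-specific rule first lets `users.name` be masked while a `name` column on a non-personal table, such as a plan name, passes through unchanged.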

Staging vs Production

Environment	PII Anonymization	Connection Target
Staging	None	Master DB
Production (view)	Auto-applied	Read Replica
Production (edit)	None	Read Replica

Staging uses test data, so no anonymization needed. Only production view queries get automatic PII protection.

Fully Automated Access Management — DB Account Pipeline

"Who do I talk to about getting database access?"

This question doesn't get asked anymore. The DB Account Pipeline automates everything.

Flow

User submits a workflow request — nickname, email, desired databases (multiple allowed)
Manager approves
Cloud Run Job processes automatically — reads approved requests, generates CREATE USER statements per DB, executes via Lambda
Credentials saved to Firestore + Secret Manager — passwords never stored in plaintext
Slack DM with connection info — includes bastion server guide

Zero Plaintext Passwords

Passwords are stored only in Secret Manager.

Firestore db_credentials:
  host: "xxx.rds.amazonaws.com"
  port: 3306
  username: "ryan_view_user"
  passwordSecretId: "db-cred-xxxxx"  ← Reference to Secret Manager only
  permLevel: "view"

When the MCP Server executes a query, it retrieves the password from Secret Manager via passwordSecretId and caches it in memory for 5 minutes. Cloud Run restarts clear the cache.

No plaintext password is persisted anywhere; the only plaintext copy is the short-lived in-memory cache. This was a deliberate design decision we're particularly proud of.

Operations

Daily Cron

A cron job fires at 6:00 AM JST daily, triggering a Cloud Run Job:

6:00 AM JST — Cron fires
├── ORM parsing (28 repos × 4 ORMs)
├── Live DB validation (11 staging DBs)
├── Gemini description generation (incremental only)
├── Graph construction + Embedding
├── BQ MERGE (preserving annotations)
└── Slack notification

Cost

Resource	Cost
Gemini 3 Flash (daily, incremental)	~$0.10-0.20/day
Vertex AI Embedding	~$0.01/day
Cloud Run Job	Near-free (once daily)
BQ Storage	A few GB
Lambda	Shared with DB Account Pipeline

Thanks to incremental detection, we maintain an AI-powered dictionary for 991 tables at under $10/month.

Incremental Detection

Regenerating all table descriptions daily would spike Gemini costs. So we introduced change detection:

1. Compare previous property hashes
2. Detect column structure changes (additions/removals/type changes)
3. Identify affected tables via enum dependency graph
→ Regenerate only changed tables

If a status enum changes, all tables using that enum are regenerated. No changes? Skip. This cuts AI costs by roughly 90%.
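The change-detection steps can be sketched like this. It is a minimal illustration under stated assumptions: the table-definition shape (`columns`, `enums`), the use of SHA-256 over canonical JSON, and both function names are hypothetical rather than taken from the builder.

```python
import hashlib
import json


def property_hash(table_def):
    """Stable hash of a table definition (columns, types, enum references)."""
    canonical = json.dumps(table_def, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def tables_to_regenerate(previous_hashes, current_defs, changed_enums=()):
    """Return the tables whose descriptions need regenerating."""
    changed_enums = set(changed_enums)
    changed = set()
    for name, table_def in current_defs.items():
        if previous_hashes.get(name) != property_hash(table_def):
            changed.add(name)  # new table, or its structure changed
        elif set(table_def.get("enums", ())) & changed_enums:
            changed.add(name)  # structure unchanged, but an enum it uses changed
    return changed
```

On a day with no schema changes the returned set is empty, so no description calls are made at all, which is where the cost saving comes from.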

Security Summary

Layer	Protection
OAuth	Google Account + corporate domain restriction
Credential Resolution	email → nickname → per-user DB credentials
Permission Filter	Per-user × database × environment × permission level
SQL Validation (MCP)	SELECT-only enforcement
SQL Validation (Lambda)	Same validation (defense in depth)
PII Anonymization	Production + view queries only
Production Connection	Read Replicas only
Passwords	Secret Manager only, 5-min TTL memory cache
Cross-Cloud Auth	GCP OIDC → AWS STS (zero static credentials)
Logging	Passwords and query results never logged

Takeaways

DB Graph MCP goes beyond solving the fundamental database problem of "you can't use what you don't know exists." It enables anyone to search real data without knowing SQL at all.

As a dictionary — Search 991 tables' structure, relationships, and enum definitions in natural language
As a query tool — Securely query staging and production data with automatic PII protection
As a knowledge base — DEAD flags and column annotations surface 10 years of tacit knowledge

The biggest lesson from building this: the real value of MCP is giving AI context. Table structure, relationships, enum definitions, column warnings — when these enter AI's context window, the SQL and code Claude Code writes become dramatically more accurate.

Making that happen required building the graph, securing cross-cloud access, automating permission management, and protecting PII — unglamorous but essential infrastructure, built with care.

I hope this helps anyone wrestling with internal database management at scale.
