<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Chetan Gupta</title>
    <description>The latest articles on Forem by Chetan Gupta (@chaets).</description>
    <link>https://forem.com/chaets</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F327999%2F06de0556-8b6a-4d4e-ab33-2081b573c84c.png</url>
      <title>Forem: Chetan Gupta</title>
      <link>https://forem.com/chaets</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/chaets"/>
    <language>en</language>
    <item>
      <title>Enabling SSH &amp; RDP on Ubuntu 24.04 VM in Proxmox (Complete Guide)</title>
      <dc:creator>Chetan Gupta</dc:creator>
      <pubDate>Sun, 22 Mar 2026 15:24:52 +0000</pubDate>
      <link>https://forem.com/chaets/enabling-ssh-rdp-on-ubuntu-2404-vm-in-proxmox-complete-guide-oll</link>
      <guid>https://forem.com/chaets/enabling-ssh-rdp-on-ubuntu-2404-vm-in-proxmox-complete-guide-oll</guid>
      <description>&lt;p&gt;Running Ubuntu inside &lt;strong&gt;Proxmox VE&lt;/strong&gt; is powerful for homelabs, but accessing it efficiently (SSH + Remote Desktop) is essential.&lt;/p&gt;

&lt;p&gt;This guide walks you through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Enabling SSH access&lt;/li&gt;
&lt;li&gt;✅ Enabling Remote Desktop (RDP)&lt;/li&gt;
&lt;li&gt;✅ Fixing common issues (like 0x204 error)&lt;/li&gt;
&lt;li&gt;✅ Understanding architecture with diagrams&lt;/li&gt;
&lt;li&gt;✅ Final checklist&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  🧠 Architecture Overview
&lt;/h1&gt;

&lt;h2&gt;
  
  
  🔷 Block Diagram: Access Flow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------------------+
|   Your Laptop/PC     |
| (SSH / RDP Client)   |
+----------+-----------+
           |
           |  Network (LAN / WiFi)
           |
+----------v-----------+
|     Proxmox Host     |
|  (Hypervisor Layer)  |
+----------+-----------+
           |
           | Virtual Network Bridge (vmbr0)
           |
+----------v-----------+
|   Ubuntu 24.04 VM    |
|----------------------|
| SSH Server (port 22) |
| RDP Server (3389)    |
| UFW Firewall         |
| QEMU Guest Agent     |
+----------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  ⚙️ Part 1: Enable SSH on Ubuntu VM
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Step 1: Access VM Console (Proxmox GUI)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Login to Proxmox&lt;/li&gt;
&lt;li&gt;Select VM → &lt;strong&gt;Console&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 2: Install OpenSSH Server
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;openssh-server &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 3: Verify SSH Service
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status ssh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If not running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="nt"&gt;--now&lt;/span&gt; ssh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 4: Configure Firewall (UFW)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow ssh
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw &lt;span class="nb"&gt;enable
sudo &lt;/span&gt;ufw status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;📌 If you see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;firewall not enabled (skipping reload)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ That’s normal; just enable it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: Get VM IP Address
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ip addr show
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 6: Connect from Your PC
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh username@&amp;lt;vm_ip&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  ⚠️ Proxmox Firewall Check
&lt;/h2&gt;

&lt;p&gt;If SSH doesn’t work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to &lt;strong&gt;VM → Firewall&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add rule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Direction: IN&lt;/li&gt;
&lt;li&gt;Port: 22&lt;/li&gt;
&lt;li&gt;Protocol: TCP&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
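&lt;p&gt;If you prefer the CLI over the GUI, the same rule can live in the VM's firewall file on the Proxmox host. The file format below is standard Proxmox; the VMID &lt;code&gt;100&lt;/code&gt; is just an example — substitute your own:&lt;/p&gt;

```ini
# /etc/pve/firewall/100.fw  (replace 100 with your VMID)
[RULES]
IN ACCEPT -p tcp -dport 22
```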




&lt;h1&gt;
  
  
  🖥️ Part 2: Enable Remote Desktop (RDP)
&lt;/h1&gt;

&lt;p&gt;You have &lt;strong&gt;two options&lt;/strong&gt;:&lt;/p&gt;




&lt;h1&gt;
  
  
  🔹 Option 1: GNOME Remote Desktop (Recommended)
&lt;/h1&gt;

&lt;p&gt;Best for Ubuntu 24.04 Desktop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Steps:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Go to:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Settings → System → Remote Desktop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Enable:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;✅ Remote Desktop&lt;/li&gt;
&lt;li&gt;❌ Disable "Remote Login" (important!)&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;Set:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Username &amp;amp; Password&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Open Firewall
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow 3389/tcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  🔹 Option 2: xRDP (Alternative)
&lt;/h1&gt;

&lt;p&gt;Use if GNOME RDP fails.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;xrdp &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="nt"&gt;--now&lt;/span&gt; xrdp
&lt;span class="nb"&gt;sudo &lt;/span&gt;adduser xrdp ssl-cert
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  ⚠️ Important Rule
&lt;/h2&gt;

&lt;p&gt;👉 Never run GNOME Remote Desktop and xRDP at the same time: both try to bind port 3389.&lt;/p&gt;

&lt;p&gt;Check port usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ss &lt;span class="nt"&gt;-tulpn&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; :3389
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  🔧 Part 3: Enable QEMU Guest Agent (VERY IMPORTANT)
&lt;/h1&gt;

&lt;p&gt;Installing the agent fixes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing IP address in the Proxmox dashboard&lt;/li&gt;
&lt;li&gt;Some RDP connectivity issues&lt;/li&gt;
&lt;li&gt;Proxmox host-to-guest communication (clean shutdowns, snapshots)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Install inside VM:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;qemu-guest-agent &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="nt"&gt;--now&lt;/span&gt; qemu-guest-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Enable in Proxmox:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;VM → &lt;strong&gt;Options&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Enable &lt;strong&gt;QEMU Guest Agent&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Do a &lt;strong&gt;full shutdown&lt;/strong&gt; (not a reboot), then start the VM again so the agent device is attached&lt;/li&gt;
&lt;/ul&gt;
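&lt;p&gt;You can then confirm the agent is talking to the host. These commands run on the &lt;strong&gt;Proxmox host&lt;/strong&gt; itself (VMID &lt;code&gt;100&lt;/code&gt; is an example):&lt;/p&gt;

```shell
# Should return without error if the agent is running inside the guest
qm agent 100 ping

# Lists the guest's interfaces and IPs, as reported by the agent
qm agent 100 network-get-interfaces
```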




&lt;h1&gt;
  
  
  🧪 Part 4: Verify Services
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Check SSH:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl status ssh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Check RDP:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ss &lt;span class="nt"&gt;-tulpn&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;3389
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gnome-remote-desktop OR xrdp listening
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  🛠️ Part 5: Fix Common Issues
&lt;/h1&gt;




&lt;h2&gt;
  
  
  ❌ Error: SSH not found
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh.service could not be found
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✔ Fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;openssh-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  ❌ Error: RDP 0x204
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Causes:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Firewall blocked&lt;/li&gt;
&lt;li&gt;Wrong service&lt;/li&gt;
&lt;li&gt;Wayland issue&lt;/li&gt;
&lt;li&gt;NLA mismatch&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  ✅ Fix 1: Disable Wayland
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nano /etc/gdm3/custom.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Uncomment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="py"&gt;WaylandEnable&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart gdm3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  ✅ Fix 2: Reconfigure GNOME RDP Credentials (User Mode)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;grdctl rdp &lt;span class="nb"&gt;enable
&lt;/span&gt;grdctl rdp set-credentials USERNAME PASSWORD
grdctl rdp disable-view-only
systemctl &lt;span class="nt"&gt;--user&lt;/span&gt; restart gnome-remote-desktop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  ✅ Fix 3: Client Settings
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Disable NLA&lt;/li&gt;
&lt;li&gt;Set Security Layer → RDP&lt;/li&gt;
&lt;li&gt;Allow insecure connection&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  ✅ Fix 4: Test Connectivity
&lt;/h3&gt;

&lt;p&gt;From your PC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nc &lt;span class="nt"&gt;-zv&lt;/span&gt; &amp;lt;vm_ip&amp;gt; 3389
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or (Windows):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;Test-NetConnection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;vm_ip&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Port&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;3389&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  ❌ Issue: Port Conflict
&lt;/h2&gt;

&lt;p&gt;If both are installed, remove one so a single service owns port 3389 (here we drop xRDP):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt remove xrdp &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  ❌ Issue: Proxmox Firewall Blocking
&lt;/h2&gt;

&lt;p&gt;Add rule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Port: 3389&lt;/li&gt;
&lt;li&gt;Protocol: TCP&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ❌ Issue: No GUI Session
&lt;/h2&gt;

&lt;p&gt;👉 For GNOME RDP:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A user must already be logged in to a desktop session on the VM console&lt;/li&gt;
&lt;/ul&gt;
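&lt;p&gt;One workaround is to enable automatic login so a desktop session always exists. The file is the standard GDM config; the username is an example — use your own:&lt;/p&gt;

```ini
# /etc/gdm3/custom.conf
[daemon]
AutomaticLoginEnable=true
AutomaticLogin=your_username
```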




&lt;h1&gt;
  
  
  🧾 Final Checklist
&lt;/h1&gt;

&lt;h2&gt;
  
  
  ✅ SSH Setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] OpenSSH installed&lt;/li&gt;
&lt;li&gt;[ ] SSH service running&lt;/li&gt;
&lt;li&gt;[ ] UFW allows port 22&lt;/li&gt;
&lt;li&gt;[ ] Proxmox firewall allows port 22&lt;/li&gt;
&lt;li&gt;[ ] SSH connection works&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ✅ RDP Setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] GNOME Remote Desktop OR xRDP installed&lt;/li&gt;
&lt;li&gt;[ ] Port 3389 open&lt;/li&gt;
&lt;li&gt;[ ] No service conflict&lt;/li&gt;
&lt;li&gt;[ ] Wayland disabled (if needed)&lt;/li&gt;
&lt;li&gt;[ ] Credentials configured&lt;/li&gt;
&lt;li&gt;[ ] RDP connection works&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ✅ Proxmox Integration
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] QEMU Guest Agent installed&lt;/li&gt;
&lt;li&gt;[ ] Enabled in Proxmox&lt;/li&gt;
&lt;li&gt;[ ] IP visible in dashboard&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ✅ Network Validation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] VM reachable via ping&lt;/li&gt;
&lt;li&gt;[ ] Ports 22 &amp;amp; 3389 reachable&lt;/li&gt;
&lt;li&gt;[ ] Same subnet / no AP isolation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Pro Tip: Fix RDP Error Code 0x204 (Ubuntu 24.04 on Proxmox)
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Error code 0x204&lt;/strong&gt; usually means your RDP client &lt;strong&gt;cannot establish a session with the Ubuntu VM&lt;/strong&gt;, either because the network path is blocked or because the handshake fails.&lt;br&gt;
In Proxmox setups, this is commonly caused by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔐 Certificate mismatch (GNOME RDP bug in 24.04)&lt;/li&gt;
&lt;li&gt;🔥 Proxmox-level firewall blocking port 3389&lt;/li&gt;
&lt;li&gt;💤 VM power/suspend issues&lt;/li&gt;
&lt;li&gt;🔒 Strict Network Level Authentication (NLA)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🔧 1. Fix GNOME Certificate Bug (Most Overlooked Fix)
&lt;/h2&gt;

&lt;p&gt;Ubuntu 24.04 has a known issue with RDP certificates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Steps:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Open &lt;strong&gt;Microsoft Remote Desktop&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Right-click your Ubuntu connection → &lt;strong&gt;Export&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Open the &lt;code&gt;.rdp&lt;/code&gt; file in a text editor&lt;/li&gt;
&lt;li&gt;Find:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   use redirection server name:i:0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Change to:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   use redirection server name:i:1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Save and &lt;strong&gt;re-import&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;✅ This bypasses certificate validation issues.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔥 2. Check Proxmox Firewall (Very Common Issue)
&lt;/h2&gt;

&lt;p&gt;Even if Ubuntu allows RDP, Proxmox may still block it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Steps:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Go to: &lt;strong&gt;VM → Firewall → Options&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Check if firewall is &lt;strong&gt;Enabled&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Add rule:&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Direction&lt;/td&gt;
&lt;td&gt;IN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Action&lt;/td&gt;
&lt;td&gt;ACCEPT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protocol&lt;/td&gt;
&lt;td&gt;TCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Destination&lt;/td&gt;
&lt;td&gt;3389&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  💤 3. Disable Ubuntu Power Saving
&lt;/h2&gt;

&lt;p&gt;RDP can fail if the VM “sleeps”.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Go to: &lt;strong&gt;Settings → Power&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Set:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Screen Blank → &lt;strong&gt;Never&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Automatic Suspend → &lt;strong&gt;Off&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔒 4. Relax Network Level Authentication (NLA)
&lt;/h2&gt;

&lt;p&gt;Strict authentication can break the RDP connection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix on Client (Windows/Mac):
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Open RDP settings → &lt;strong&gt;Advanced&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Set:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“If server authentication fails” →
👉 &lt;strong&gt;Connect and don’t warn me&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  🌐 5. Verify Network Reachability
&lt;/h2&gt;

&lt;p&gt;Make sure your machine can actually reach the VM:&lt;/p&gt;

&lt;h3&gt;
  
  
  Windows:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;Test-NetConnection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;vm_ip&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Port&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;3389&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Mac/Linux:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nc &lt;span class="nt"&gt;-zv&lt;/span&gt; &amp;lt;vm_ip&amp;gt; 3389
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  ⚠️ Bonus Insight
&lt;/h2&gt;

&lt;p&gt;👉 If you're connecting to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;192.168.x.x&lt;/strong&gt; → Local network (should work easily)&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;External IP → You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Port forwarding&lt;/li&gt;
&lt;li&gt;Router config&lt;/li&gt;
&lt;li&gt;Firewall rules&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  🧠 Quick Diagnosis Flow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RDP Error 0x204
      |
      v
Can you ping VM?
      |
   No ---&amp;gt; Network issue / AP isolation
      |
     Yes
      |
Is port 3389 open?
      |
   No ---&amp;gt; Firewall (Proxmox/UFW)
      |
     Yes
      |
Certificate / NLA / Wayland issue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;







&lt;h1&gt;
  
  
  🎯 Key Takeaways
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;SSH requires &lt;strong&gt;OpenSSH inside VM&lt;/strong&gt; (not Proxmox-level)&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;RDP issues in Ubuntu 24.04 are mostly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wayland&lt;/li&gt;
&lt;li&gt;Service conflicts&lt;/li&gt;
&lt;li&gt;Firewall rules&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;QEMU Guest Agent is &lt;strong&gt;critical for stability&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Always validate in this order:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Network → Firewall → Service → Client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>ubuntu</category>
      <category>virtualmachine</category>
      <category>proxmox</category>
      <category>ssh</category>
    </item>
    <item>
      <title>Part 3: Testing, Deploying, and Lessons Learned</title>
      <dc:creator>Chetan Gupta</dc:creator>
      <pubDate>Wed, 18 Mar 2026 02:38:58 +0000</pubDate>
      <link>https://forem.com/chaets/part-3-testing-deploying-and-lessons-learned-aa5</link>
      <guid>https://forem.com/chaets/part-3-testing-deploying-and-lessons-learned-aa5</guid>
      <description>&lt;p&gt;&lt;em&gt;The final part of a three-part series on building our first MCP server for healthcare interoperability.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Where We Left Off
&lt;/h2&gt;

&lt;p&gt;&lt;a href="//part-1-why-we-built-an-mcp-server-and-what-we-learned-before-writing-a-single-line-of-code.md"&gt;Part 1&lt;/a&gt; covered the &lt;em&gt;why&lt;/em&gt; — the problem space, the choice of MCP, and the architectural decisions. &lt;a href="//part-2-building-the-engine-tools-uris-and-the-art-of-indexing-fhir.md"&gt;Part 2&lt;/a&gt; covered the &lt;em&gt;how&lt;/em&gt; — the indexer, URI scheme, tool handlers, and transport layer. This final post covers the operational reality: how we test an MCP server, the developer workflow, deploying to real AI clients, and the honest retrospective on what worked and what we'd change.&lt;/p&gt;




&lt;h2&gt;
  
  
  Testing an MCP Server: It's Weirder Than You Think
&lt;/h2&gt;

&lt;p&gt;Testing a regular API is well-understood: spin up a server, send requests, assert on responses. Testing an MCP server adds a twist: &lt;strong&gt;your primary consumer is an AI, and you can't write assertions about AI behavior.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We developed a three-layer testing strategy:&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Unit Tests for Handlers
&lt;/h3&gt;

&lt;p&gt;Each handler is a pure function: it takes a Pydantic model and returns a dict. This makes unit testing straightforward.&lt;/p&gt;

&lt;p&gt;The trick is the database. Our handlers query SQLite, so we needed a test database. We chose &lt;strong&gt;temporary databases per test module&lt;/strong&gt; — each test file creates a fresh SQLite database in a temp directory, inserts known test data, and tears it down after.&lt;/p&gt;

&lt;p&gt;The pattern looks like this conceptually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────┐
│  Test Setup                                     │
│  1. Create temp SQLite file                     │
│  2. Create schema (same as production)          │
│  3. Insert known test data (Patient, etc.)      │
│  4. Rebuild FTS index                           │
│  5. Point FHIR_MCP_INDEX_PATH to temp file      │
│  6. Reload storage modules (pick up new path)   │
├─────────────────────────────────────────────────┤
│  Test Execution                                 │
│  - Import handler, create input model, call it  │
│  - Assert on returned metadata and payload      │
├─────────────────────────────────────────────────┤
│  Teardown                                       │
│  - Delete temp file                             │
│  - Restore environment                          │
└─────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A subtle issue we hit: &lt;strong&gt;module-level state.&lt;/strong&gt; Our SQLite store reads &lt;code&gt;DB_PATH&lt;/code&gt; from an environment variable &lt;em&gt;at module load time&lt;/em&gt;. In tests, we need to set the environment variable &lt;em&gt;before&lt;/em&gt; the module is imported, or reload the module after setting it. We solved this with &lt;code&gt;importlib.reload()&lt;/code&gt; — ugly but effective.&lt;/p&gt;

&lt;p&gt;If we were starting over, we'd inject the database path through the Settings object rather than reading environment variables at module scope. Lesson learned.&lt;/p&gt;
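&lt;p&gt;A sketch of what that injection would look like (names like &lt;code&gt;Settings&lt;/code&gt; and &lt;code&gt;open_store&lt;/code&gt; are illustrative, not our real API):&lt;/p&gt;

```python
import os
import sqlite3
from dataclasses import dataclass

# Anti-pattern (what we shipped): the path is frozen at import time,
# so tests must set the environment variable before the first import.
DB_PATH = os.environ.get("FHIR_MCP_INDEX_PATH", "index.db")

# What we'd do instead: resolve the path when a Settings object is
# built, so a test can construct one pointing at a temp database.
@dataclass(frozen=True)
class Settings:
    index_path: str

    @classmethod
    def from_env(cls) -> "Settings":
        return cls(index_path=os.environ.get("FHIR_MCP_INDEX_PATH", "index.db"))

def open_store(settings: Settings) -> sqlite3.Connection:
    return sqlite3.connect(settings.index_path)
```

&lt;p&gt;With this shape, a test builds &lt;code&gt;Settings(index_path=tmp_path)&lt;/code&gt; directly and never needs &lt;code&gt;importlib.reload()&lt;/code&gt;.&lt;/p&gt;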

&lt;p&gt;Here are the kinds of tests we found most valuable:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Happy path tests:&lt;/strong&gt; "Give me Patient from R4 → returns metadata with name='Patient'." These catch regressions in the handler logic or the SQL queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not-found tests:&lt;/strong&gt; "Give me NonExistentResource from R4 → returns empty dict, not an exception." These are critical because the AI will inevitably ask for things that don't exist, and the server must handle that gracefully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FTS tests:&lt;/strong&gt; "Search for 'Patient' → returns at least one result. Search for 'xyznonexistent' → returns empty list." These verify that the full-text search index is working and that our FTS queries are correct.&lt;/p&gt;
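&lt;p&gt;To make the three categories concrete, here is a self-contained sketch; the schema and handlers are simplified stand-ins for our real ones:&lt;/p&gt;

```python
import sqlite3

def make_test_db() -> sqlite3.Connection:
    """In-memory stand-in for the index: one table plus an FTS5 index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE resources (name TEXT, version TEXT)")
    conn.execute("CREATE VIRTUAL TABLE resources_fts USING fts5(name)")
    conn.execute("INSERT INTO resources VALUES ('Patient', 'R4')")
    conn.execute("INSERT INTO resources_fts (name) VALUES ('Patient')")
    return conn

def get_resource(conn, name, version):
    """Happy path returns metadata; unknown names return an empty dict."""
    row = conn.execute(
        "SELECT name, version FROM resources WHERE name = ? AND version = ?",
        (name, version),
    ).fetchone()
    return {"name": row[0], "version": row[1]} if row else {}

def search(conn, query):
    """Full-text search over the FTS5 index."""
    rows = conn.execute(
        "SELECT name FROM resources_fts WHERE resources_fts MATCH ?", (query,)
    ).fetchall()
    return [r[0] for r in rows]

conn = make_test_db()
assert get_resource(conn, "Patient", "R4") == {"name": "Patient", "version": "R4"}
assert get_resource(conn, "NonExistentResource", "R4") == {}  # graceful not-found
assert search(conn, "Patient") == ["Patient"]                 # FTS hit
assert search(conn, "xyznonexistent") == []                   # FTS miss
```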

&lt;h3&gt;
  
  
  Layer 2: URI Scheme Tests
&lt;/h3&gt;

&lt;p&gt;The URI parser and formatter are pure functions with no dependencies. Testing them is simple and satisfying:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Parse "fhir://R4/StructureDefinition/Patient"
  → { scheme: "fhir", version: "R4", name: "Patient" }  ✓

Parse "ig://hl7.fhir.us.core/StructureDefinition/us-core-patient"
  → { scheme: "ig", version: "hl7.fhir.us.core", name: "us-core-patient" }  ✓

Parse "not-a-valid-uri"
  → None  ✓

Format fhir_uri("R4", "Patient")
  → "fhir://R4/StructureDefinition/Patient"  ✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We tested the round-trip: format a URI, parse it, verify the components match. This caught a few edge cases with dots in IG names and hyphens in profile names.&lt;/p&gt;
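&lt;p&gt;A minimal standard-library version of the parser/formatter pair, simplified relative to our real implementation:&lt;/p&gt;

```python
from urllib.parse import urlparse

def parse_uri(uri: str):
    """Parse fhir:// and ig:// URIs; return None for anything else."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("fhir", "ig"):
        return None
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) != 2 or parts[0] != "StructureDefinition":
        return None
    return {"scheme": parsed.scheme, "version": parsed.netloc, "name": parts[1]}

def fhir_uri(version: str, name: str) -> str:
    return f"fhir://{version}/StructureDefinition/{name}"

# Round-trip: format a URI, parse it, verify the components survive.
assert parse_uri(fhir_uri("R4", "Patient")) == {
    "scheme": "fhir", "version": "R4", "name": "Patient",
}
assert parse_uri("not-a-valid-uri") is None
```

&lt;p&gt;Because &lt;code&gt;urlparse&lt;/code&gt; keeps the authority component intact, dotted IG names like &lt;code&gt;hl7.fhir.us.core&lt;/code&gt; survive the round-trip unchanged.&lt;/p&gt;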

&lt;h3&gt;
  
  
  Layer 3: Smoke Tests
&lt;/h3&gt;

&lt;p&gt;The smoke test script is our "does the whole thing work?" check. It:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Verifies the SQLite index file exists.&lt;/li&gt;
&lt;li&gt;Queries for a known resource (Patient) by exact match.&lt;/li&gt;
&lt;li&gt;Runs an FTS search and verifies results come back.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This runs against the &lt;em&gt;real&lt;/em&gt; index (not a test database) and is designed to catch "the build broke the index" or "the schema changed in a way that breaks queries."&lt;/p&gt;

&lt;p&gt;We run smoke tests as part of our local dev workflow — Tilt triggers them after building the index, and they fail-fast if anything is wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  What We Didn't Test (And Should Have)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Integration tests against the transport layer.&lt;/strong&gt; We tested handlers and storage independently but never tested the full flow: "send a JSON-RPC message on stdin → get a response on stdout." This meant that when we had the stdout buffering issue (mentioned in Part 2), we didn't catch it until manual testing with Claude Desktop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema evolution tests.&lt;/strong&gt; When we added PostgreSQL support, we had to ensure both backends returned the same shape of data. We should have written cross-backend tests from the start.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Developer Experience: Tilt, Docker, and the Inner Loop
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Tilt?
&lt;/h3&gt;

&lt;p&gt;If you haven't used &lt;a href="https://tilt.dev/" rel="noopener noreferrer"&gt;Tilt&lt;/a&gt;, it's a local development orchestrator. You define resources (build steps, services, health checks) in a &lt;code&gt;Tiltfile&lt;/code&gt;, and Tilt manages the lifecycle: watching for file changes, rebuilding what's needed, restarting services, and showing you a dashboard of what's running.&lt;/p&gt;

&lt;p&gt;For our project, Tilt orchestrates four steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────┐    ┌─────────────┐    ┌─────────────┐    ┌──────────────┐
│ uv sync  │───▶│   fetch     │───▶│   build     │───▶│  MCP server  │
│          │    │  packages   │    │   index     │    │  (HTTP mode) │
└──────────┘    └─────────────┘    └─────────────┘    └──────────────┘
  deps:            deps:              deps:              deps:
  pyproject.toml   fetch_packages.py  fixtures/          build-index
                   uv-sync            packages/          
                                      fetch-packages     readiness:
                                                         GET /health
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each step declares its dependencies. If you change &lt;code&gt;pyproject.toml&lt;/code&gt;, everything rebuilds. If you only change a handler file, only the server restarts. Tilt tracks file changes and does the minimum work needed.&lt;/p&gt;
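&lt;p&gt;In Tiltfile terms, that chain looks roughly like this. The &lt;code&gt;local_resource&lt;/code&gt; API is standard Tilt, but the resource names, port, and flags here are a sketch of our setup, not a drop-in config:&lt;/p&gt;

```python
local_resource("uv-sync", "uv sync", deps=["pyproject.toml"])

local_resource("fetch-packages", "python scripts/fetch_packages.py",
               deps=["scripts/fetch_packages.py"], resource_deps=["uv-sync"])

local_resource("build-index", "python scripts/build_index.py",
               deps=["fixtures/", "packages/"], resource_deps=["fetch-packages"])

local_resource("mcp-server",
               serve_cmd="python -m apps.mcp_server.main --http",
               resource_deps=["build-index"],
               readiness_probe=probe(
                   http_get=http_get_action(port=8000, path="/health")))
```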

&lt;p&gt;&lt;strong&gt;Why not just a shell script?&lt;/strong&gt; We had one initially:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv &lt;span class="nb"&gt;sync
&lt;/span&gt;python scripts/fetch_packages.py
python scripts/build_index.py
python &lt;span class="nt"&gt;-m&lt;/span&gt; apps.mcp_server.main &lt;span class="nt"&gt;--http&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem: when you change a handler, you have to Ctrl+C and rerun the whole thing. Tilt watches files and restarts only the server, keeping the index intact. It also gives you a dashboard showing the status of each step, and readiness probes that tell you when the server is actually ready (not just started).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Tilt Configuration
&lt;/h3&gt;

&lt;p&gt;Two key decisions in our Tilt setup:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dual-backend support.&lt;/strong&gt; The Tiltfile reads &lt;code&gt;FHIR_MCP_STORAGE_BACKEND&lt;/code&gt; from the environment and configures either SQLite or PostgreSQL accordingly. For PostgreSQL, it uses &lt;code&gt;docker-compose&lt;/code&gt; to spin up a Postgres container. For SQLite, everything is local files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Health checks on the HTTP server.&lt;/strong&gt; The MCP server in HTTP mode exposes &lt;code&gt;GET /health&lt;/code&gt; which returns &lt;code&gt;{"status": "ok"}&lt;/code&gt;. Tilt polls this endpoint to know when the server is ready. This prevents you from sending requests to a server that's still starting up.&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker: The Deployment Story
&lt;/h3&gt;

&lt;p&gt;Our Dockerfile follows a simple pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.13-slim&lt;/span&gt;
    → Install dependencies with uv
    → Copy source code
    → Run fetch + build index at build time
    → CMD: start the MCP server (stdio mode)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Building the index at image build time&lt;/strong&gt; is deliberate. The Docker image ships with a pre-built index, so the container starts instantly at runtime. The tradeoff is that the image is larger (includes the SQLite database), but startup is fast and there are no runtime initialization steps.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;docker-compose.yml&lt;/code&gt; mounts the data directory as a volume. This means you can rebuild the index on the host and have the container pick it up without rebuilding the image.&lt;/p&gt;

&lt;p&gt;A subtlety: the container runs with &lt;code&gt;stdin_open: true&lt;/code&gt; and &lt;code&gt;tty: true&lt;/code&gt;. This is necessary for stdio transport — Docker needs to keep stdin open so the MCP client can communicate with the server.&lt;/p&gt;
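&lt;p&gt;In &lt;code&gt;docker-compose.yml&lt;/code&gt; terms, the relevant service definition looks roughly like this (service name and paths are illustrative, not our exact file):&lt;/p&gt;

```yaml
services:
  fhir-mcp:
    build: .
    stdin_open: true   # keep stdin open so the MCP client can write JSON-RPC requests
    tty: true          # without these two flags, stdio transport stalls
    volumes:
      - ./data:/app/data   # host-built index is picked up without rebuilding the image
```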




&lt;h2&gt;
  
  
  Deploying to Real AI Clients
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Claude Desktop
&lt;/h3&gt;

&lt;p&gt;Claude Desktop supports MCP servers natively. Configuration is a JSON file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fhir-mcp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-m"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"apps.mcp_server.main"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"cwd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/path/to/fhir-mcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Desktop spawns the process, communicates over stdio, and presents the tools in its UI. The user can then ask questions like "What fields are in a FHIR R4 Patient resource?" and Claude will call &lt;code&gt;fhir.get_definition&lt;/code&gt; behind the scenes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Things we learned with Claude Desktop:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;cwd&lt;/code&gt; must be the project root (where &lt;code&gt;pyproject.toml&lt;/code&gt; lives), not the &lt;code&gt;apps/&lt;/code&gt; directory. Relative paths in settings (like &lt;code&gt;data/index/fhir_index.sqlite&lt;/code&gt;) resolve from &lt;code&gt;cwd&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If the server crashes, Claude Desktop may not show a clear error. Check stderr output to diagnose issues.&lt;/li&gt;
&lt;li&gt;Claude is remarkably good at choosing the right tool. With descriptive tool names and typed inputs, it correctly uses &lt;code&gt;fhir.search&lt;/code&gt; for exploration and &lt;code&gt;fhir.get_definition&lt;/code&gt; for exact lookups.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cursor
&lt;/h3&gt;

&lt;p&gt;Cursor's MCP configuration is nearly identical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fhir-mcp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-m"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"apps.mcp_server.main"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"cwd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/path/to/fhir-mcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Differences we noticed:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cursor tends to call tools in a coding context (while you're editing files), so the prompts and results are optimized for developer workflows.&lt;/li&gt;
&lt;li&gt;Response formatting matters more in Cursor because results appear inline with code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Takeaway on Client Support
&lt;/h3&gt;

&lt;p&gt;Because MCP standardizes the protocol, supporting multiple clients was trivial. We wrote zero client-specific code. The same server binary, the same tools, the same transport — just different JSON config files for each client.&lt;/p&gt;

&lt;p&gt;This was one of MCP's biggest wins for us. We didn't have to build a Claude plugin &lt;em&gt;and&lt;/em&gt; a Cursor extension &lt;em&gt;and&lt;/em&gt; a VS Code integration. We built one MCP server, and it works everywhere MCP is supported.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prompts: The Underappreciated Third Pillar
&lt;/h2&gt;

&lt;p&gt;MCP has three primitives: tools, resources, and prompts. We spent most of our effort on tools, some on resources (URI scheme), and almost none on prompts initially. That was a mistake.&lt;/p&gt;

&lt;p&gt;Our prompts are simple strings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="s2"&gt;"summarize_profile"&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Summarize a FHIR profile in plain language."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;"explain_constraint"&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Explain a constraint in a StructureDefinition."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;"migration_notes"&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Describe migration notes between FHIR versions."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These seem trivial, but they serve an important purpose: &lt;strong&gt;they tell the AI how to use the tools' output.&lt;/strong&gt; Without prompts, the AI might return raw JSON metadata to the user. With a prompt like "summarize this profile in plain language," the AI knows to translate the technical output into something human-readable.&lt;/p&gt;
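&lt;p&gt;In code, a prompt registry this simple is just a dict with a lookup function. A sketch (the strings match the table above; the function name is ours):&lt;/p&gt;

```python
# Prompt registry: maps prompt names to guidance the AI applies to tool output.
PROMPTS = {
    "summarize_profile": "Summarize a FHIR profile in plain language.",
    "explain_constraint": "Explain a constraint in a StructureDefinition.",
    "migration_notes": "Describe migration notes between FHIR versions.",
}

def get_prompt(name: str) -> str:
    """Look up a prompt by name; fail loudly on unknown names."""
    try:
        return PROMPTS[name]
    except KeyError:
        raise ValueError(f"unknown prompt: {name!r}") from None

print(get_prompt("summarize_profile"))
```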

&lt;p&gt;If we were starting over, we'd invest more in prompts. Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parameterized prompts&lt;/strong&gt; that include the tool name and expected output format.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chain prompts&lt;/strong&gt; that guide the AI through multi-step workflows: "First call &lt;code&gt;ig.list&lt;/code&gt; to see available IGs, then call &lt;code&gt;fhir.search&lt;/code&gt; to find the relevant profile, then call &lt;code&gt;fhir.get_definition&lt;/code&gt; to get the full definition, then summarize it."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain-specific prompts&lt;/strong&gt; for common healthcare developer questions: "Compare this resource between R4 and R5 and list breaking changes."&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Honest Retrospective: What Worked, What Didn't, What We'd Change
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Worked
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. The layered architecture.&lt;/strong&gt; Transport → Registry → Handlers → Packages → Storage. Every layer has one job. Adding PostgreSQL support was a one-layer change. Adding HTTP transport was a one-layer change. Adding a new tool is a two-file change (handler + registry).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Pydantic everywhere.&lt;/strong&gt; Input validation, settings, data models — Pydantic caught bugs early and served as living documentation. The type system paid for itself in the first week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. SQLite + FTS5 for local use.&lt;/strong&gt; Zero-config, fast, reliable. For a single-user local tool, SQLite is hard to beat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Explicit registries.&lt;/strong&gt; Being able to open one file and see every tool, resource, and prompt in the system is invaluable for onboarding and debugging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. The stub pattern.&lt;/strong&gt; Having &lt;code&gt;validate.instance&lt;/code&gt; as a stub from day one meant the interface contract was established early. When we eventually implement it, the tool name, input schema, and registry entry already exist.&lt;/p&gt;
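&lt;p&gt;A stub handler costs a few lines. A sketch of the pattern (the tool name matches the article; the &lt;code&gt;Tool&lt;/code&gt; shape and handler body are hypothetical):&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    """The same three attributes every handler file declares."""
    name: str
    description: str
    handler: Callable[[dict], dict]

def validate_instance(params: dict) -> dict:
    """Stub: the interface contract exists now; the implementation comes later."""
    return {
        "status": "not_implemented",
        "detail": "validate.instance is a stub; instance validation is planned.",
    }

VALIDATE_INSTANCE = Tool(
    name="validate.instance",
    description="Validate a FHIR instance against a profile (stub).",
    handler=validate_instance,
)

result = VALIDATE_INSTANCE.handler({"resource": {"resourceType": "Patient"}})
print(result["status"])
```

&lt;p&gt;Clients see a real tool with a real schema; the only thing missing is the behavior.&lt;/p&gt;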

&lt;h3&gt;
  
  
  What Didn't Work
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Module-level state.&lt;/strong&gt; Reading environment variables at module load time (e.g., &lt;code&gt;DB_PATH = os.environ.get(...)&lt;/code&gt;) made testing painful. We had to reload modules to pick up test configuration. Dependency injection through the Settings object would have been cleaner.&lt;/p&gt;
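&lt;p&gt;The fix is mechanical: read the environment once into a settings object and pass it down. A sketch with a plain dataclass standing in for our Pydantic Settings (the &lt;code&gt;FHIR_MCP_DB_PATH&lt;/code&gt; variable name is hypothetical; &lt;code&gt;FHIR_MCP_STORAGE_BACKEND&lt;/code&gt; is the real one):&lt;/p&gt;

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    """Configuration read once at startup and passed down explicitly."""
    db_path: str = "data/index/fhir_index.sqlite"
    storage_backend: str = "sqlite"

    @classmethod
    def from_env(cls) -> "Settings":
        return cls(
            db_path=os.environ.get("FHIR_MCP_DB_PATH", cls.db_path),
            storage_backend=os.environ.get("FHIR_MCP_STORAGE_BACKEND", cls.storage_backend),
        )

def open_storage(settings: Settings) -> str:
    # Handlers receive settings as an argument instead of reading os.environ
    # at import time, so tests can inject throwaway configuration directly.
    return f"{settings.storage_backend}://{settings.db_path}"

# In tests: no module reloading, just construct the settings you want.
test_settings = Settings(db_path=":memory:")
print(open_storage(test_settings))
```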

&lt;p&gt;&lt;strong&gt;2. The Tool class is boilerplate-heavy.&lt;/strong&gt; Every handler file defines the same Tool class with the same three attributes. We should have defined it once in a shared module. We resisted DRY initially because we valued independence between handlers, but the duplication became annoying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. No end-to-end transport tests.&lt;/strong&gt; We tested handlers and storage in isolation but never tested "JSON on stdin → JSON on stdout." The stdout buffering bug could have been caught by an automated test.&lt;/p&gt;
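&lt;p&gt;Such a test is cheap to write with &lt;code&gt;subprocess&lt;/code&gt;: spawn the server, write one JSON-RPC line to stdin, assert on stdout. A sketch using an inline stand-in server (a real test would spawn &lt;code&gt;apps.mcp_server.main&lt;/code&gt; instead):&lt;/p&gt;

```python
import json
import subprocess
import sys

# Hypothetical stand-in server: answers one JSON-RPC request over stdio.
SERVER = r'''
import json, sys
req = json.loads(sys.stdin.readline())
resp = {"jsonrpc": "2.0", "id": req["id"], "result": {"tools": ["fhir.search"]}}
sys.stdout.write(json.dumps(resp) + "\n")
sys.stdout.flush()  # the flush our stdout-buffering bug taught us to test for
'''

request = {"jsonrpc": "2.0", "id": 1, "method": "list_tools"}
proc = subprocess.run(
    [sys.executable, "-c", SERVER],
    input=json.dumps(request) + "\n",
    capture_output=True,
    text=True,
    timeout=30,
)
resp = json.loads(proc.stdout)
print(resp["result"]["tools"])
```

&lt;p&gt;The point of going through a real child process is that it exercises exactly the path that bit us: buffering, newline framing, and process startup.&lt;/p&gt;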

&lt;p&gt;&lt;strong&gt;4. Prompts were an afterthought.&lt;/strong&gt; We treated them as static strings rather than the powerful interaction guides they could be. They deserve the same rigor as tool definitions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. No client-facing schema export.&lt;/strong&gt; MCP clients can request the tool schemas (input models) to understand what each tool expects. We return tool &lt;em&gt;names&lt;/em&gt; in &lt;code&gt;list_tools&lt;/code&gt; but don't include the full JSON schema for each tool's input model. Adding this would make it easier for clients (and AIs) to understand the tool interface without documentation.&lt;/p&gt;
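&lt;p&gt;Concretely, &lt;code&gt;list_tools&lt;/code&gt; would return each tool's input schema alongside its name. With Pydantic models the schema comes from &lt;code&gt;model_json_schema()&lt;/code&gt;; here's a dependency-free sketch with a hand-written schema (tool and field names are illustrative):&lt;/p&gt;

```python
# Hypothetical tool registry: name plus the JSON Schema of its input model.
# In the real server, Pydantic's model_json_schema() would generate these.
TOOLS = {
    "fhir.get_definition": {
        "type": "object",
        "properties": {
            "name": {"type": "string", "description": "Resource name, e.g. Patient"},
            "fhir_version": {"type": "string", "enum": ["R4", "R5"]},
        },
        "required": ["name"],
    },
}

def list_tools() -> list:
    """Return names plus full input schemas, not just names."""
    return [
        {"name": name, "inputSchema": schema}
        for name, schema in sorted(TOOLS.items())
    ]

listing = list_tools()
print(listing[0]["name"])
```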

&lt;h3&gt;
  
  
  What We'd Change in v2
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Use a proper MCP SDK.&lt;/strong&gt; We built the transport layer by hand (reading JSON-RPC from stdin, writing responses). There are now Python MCP SDKs that handle the protocol details. We'd use one of those instead of rolling our own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Async handlers.&lt;/strong&gt; Our handlers are synchronous. For a local SQLite-based server, this is fine. But with PostgreSQL or potential network-based data sources, async would allow concurrent tool calls. The MCP protocol supports this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Streaming responses.&lt;/strong&gt; For large payloads (like a full StructureDefinition), streaming would be better than loading the entire JSON into memory and truncating. MCP supports progressive responses, and we should use them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Richer diff tool.&lt;/strong&gt; The &lt;code&gt;fhir.diff_versions&lt;/code&gt; tool currently only compares top-level metadata. A proper diff that compares element paths, cardinality changes, and type modifications would be dramatically more useful for migration work.&lt;/p&gt;
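&lt;p&gt;The element-level comparison we have in mind is not complicated. A sketch of the core of it (the function name is ours, and the sample data is illustrative, not a real R4/R5 diff):&lt;/p&gt;

```python
def diff_elements(old_sd: dict, new_sd: dict) -> dict:
    """Compare the differential elements of two StructureDefinitions by path."""
    old = {e["path"]: e for e in old_sd.get("differential", {}).get("element", [])}
    new = {e["path"]: e for e in new_sd.get("differential", {}).get("element", [])}
    shared = set(old).intersection(new)
    return {
        "added": sorted(set(new).difference(old)),
        "removed": sorted(set(old).difference(new)),
        "cardinality_changed": sorted(
            p for p in shared
            if (old[p].get("min"), old[p].get("max"))
            != (new[p].get("min"), new[p].get("max"))
        ),
    }

# Illustrative inputs only: two small differentials that disagree on one path each.
sd_a = {"differential": {"element": [
    {"path": "Patient", "min": 0, "max": "*"},
    {"path": "Patient.animal", "min": 0, "max": "1"},
]}}
sd_b = {"differential": {"element": [
    {"path": "Patient", "min": 0, "max": "*"},
    {"path": "Patient.link", "min": 0, "max": "*"},
]}}
print(diff_elements(sd_a, sd_b))
```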

&lt;p&gt;&lt;strong&gt;5. Package management in the server.&lt;/strong&gt; Currently, packages are fetched and indexed offline by running scripts. Ideally, the server (or a companion tool) could fetch FHIR packages from a registry, index them, and make them available — all through MCP tools that the AI could invoke.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bigger Picture: What Building an MCP Server Taught Us
&lt;/h2&gt;

&lt;h3&gt;
  
  
  MCP changes how you think about AI integration
&lt;/h3&gt;

&lt;p&gt;Before MCP, we thought about AI integration as "give the AI context and hope for the best." After building an MCP server, we think about it as "give the AI typed, validated tools and let it be an agent."&lt;/p&gt;

&lt;p&gt;The difference is profound. With context stuffing, you're limited by the context window and the AI's ability to find the needle in the haystack. With MCP tools, the AI can make targeted, efficient queries — just like a developer would.&lt;/p&gt;

&lt;h3&gt;
  
  
  Healthcare needs more MCP servers
&lt;/h3&gt;

&lt;p&gt;FHIR is just one specification. Healthcare interoperability involves CDA, HL7v2, SMART on FHIR, Bulk Data, DaVinci IGs, and dozens of other standards. Each of these could benefit from an MCP server that lets AI assistants look up specifications accurately instead of hallucinating.&lt;/p&gt;

&lt;h3&gt;
  
  
  The bar for building an MCP server is low
&lt;/h3&gt;

&lt;p&gt;Our first working version was built in a few days. The core is ~500 lines of Python across the transport, registry, and handlers. The indexer is ~100 lines. The rest is data and configuration.&lt;/p&gt;

&lt;p&gt;If you have a domain-specific data source that AI assistants get wrong, building an MCP server for it is probably easier than you think. The protocol is simple, the pattern is clear, and the payoff — AI that gives accurate, grounded answers about your domain — is immediate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick-Start Mental Model
&lt;/h2&gt;

&lt;p&gt;If you're thinking about building your own MCP server, here's the mental model we'd recommend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────┐
│                    YOUR MCP SERVER                              │
│                                                                 │
│  1. DATA LAYER                                                  │
│     What data do you have?                                      │
│     How will you store/index it?                                │
│     → SQLite for local, Postgres for shared                    │
│                                                                 │
│  2. TOOLS                                                       │
│     What operations does the AI need?                           │
│     → One tool per distinct operation                          │
│     → Pydantic model for every input                           │
│     → Return structured data, not prose                        │
│                                                                 │
│  3. RESOURCES                                                   │
│     What data should be directly addressable by URI?            │
│     → Design URIs that are human-readable and parseable        │
│                                                                 │
│  4. PROMPTS                                                     │
│     How should the AI present results to users?                 │
│     → Guide the AI's interpretation of tool output             │
│                                                                 │
│  5. TRANSPORT                                                   │
│     stdio for AI clients, HTTP for dev/testing                  │
│     → Keep this layer as thin as possible                      │
│                                                                 │
│  6. TEST                                                        │
│     Unit test handlers with mock data                           │
│     Smoke test the full pipeline                                │
│     → Test the transport layer end-to-end                      │
│                                                                 │
│  7. DEPLOY                                                      │
│     JSON config for each AI client                              │
│     Docker for production                                       │
│     Tilt for local dev                                          │
└─────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Building an MCP server was one of the most rewarding developer experience projects we've worked on. The feedback loop is immediate — you build a tool, restart the server, ask the AI a question, and watch it use your tool to give a better answer. It's like giving the AI a new superpower, one tool at a time.&lt;/p&gt;

&lt;p&gt;If you work in a domain with complex, versioned, structured data — healthcare, legal, finance, infrastructure — and you're tired of AI assistants getting the details wrong, consider building an MCP server. Start small. One tool. One data source. See what happens when the AI can actually look things up instead of guessing.&lt;/p&gt;

&lt;p&gt;You might be surprised how much better "AI-assisted" can be when the AI has access to ground truth.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 3 of a 3-part series.&lt;/em&gt;&lt;br&gt;
&lt;a href="https://dev.to/chaets/mcp-the-missing-layer-between-ai-and-your-application-fdj"&gt;Part 0: MCP — The Missing Layer Between AI and Your Application →&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://dev.to/chaets/part-1-why-we-built-an-mcp-server-and-what-we-learned-before-writing-a-single-line-of-code-4mao"&gt;Part 1: Why We Built an MCP Server — And What We Learned Before Writing a Single Line of Code&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://dev.to/chaets/part-2-building-the-engine-tools-uris-and-the-art-of-indexing-fhir-fi1"&gt;Part 2: Building the Engine — Tools, URIs, and the Art of Indexing FHIR&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://dev.to/chaets/part-3-testing-deploying-and-lessons-learned-aa5"&gt;Part 3: Testing, Deploying, and Lessons Learned&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’d like to connect, find me on &lt;a href="https://www.linkedin.com/in/chaets/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or drop me a message—I’d love to explore how I can help drive your data success!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>fhir</category>
      <category>interoperability</category>
    </item>
    <item>
      <title>Part 2: Building the Engine — Tools, URIs, and the Art of Indexing FHIR</title>
      <dc:creator>Chetan Gupta</dc:creator>
      <pubDate>Tue, 10 Mar 2026 02:08:43 +0000</pubDate>
      <link>https://forem.com/chaets/part-2-building-the-engine-tools-uris-and-the-art-of-indexing-fhir-fi1</link>
      <guid>https://forem.com/chaets/part-2-building-the-engine-tools-uris-and-the-art-of-indexing-fhir-fi1</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 2 of a three-part series on building our first MCP server for healthcare interoperability.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Where We Left Off
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/chaets/part-1-why-we-built-an-mcp-server-and-what-we-learned-before-writing-a-single-line-of-code-4mao"&gt;Part 1&lt;/a&gt;, we talked about &lt;em&gt;why&lt;/em&gt; we built an MCP server for FHIR and the architectural decisions we made before writing code. Now we're going to get into the &lt;em&gt;how&lt;/em&gt; — the implementation details, the patterns that emerged, and the places where the reality of FHIR made us rethink our approach.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 1: Turning Thousands of JSON Files Into a Searchable Index
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Data Problem
&lt;/h3&gt;

&lt;p&gt;FHIR packages are distributed as folders of JSON files. A single core FHIR package (say, &lt;code&gt;hl7.fhir.r4.core&lt;/code&gt;) contains thousands of files: one for each StructureDefinition, ValueSet, CodeSystem, SearchParameter, OperationDefinition, and so on.&lt;/p&gt;

&lt;p&gt;Each file looks something like this (simplified):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"resourceType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"StructureDefinition"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Patient"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://hl7.org/fhir/StructureDefinition/Patient"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Patient"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Patient Resource"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"active"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fhirVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4.0.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"kind"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"resource"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Patient"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Demographics and other administrative information about an individual receiving care."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"differential"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"element"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Patient"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Patient"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"min"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"max"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Patient.identifier"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Patient.identifier"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"min"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"max"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Patient.name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Patient.name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"min"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"max"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The challenge: we need to be able to (a) look up a specific resource by name and version, and (b) do full-text search across &lt;em&gt;all&lt;/em&gt; resources. Doing that by scanning the filesystem on every query would be far too slow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why SQLite + FTS5
&lt;/h3&gt;

&lt;p&gt;We chose SQLite with FTS5 (Full-Text Search 5) for the index. Here's the reasoning:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero infrastructure.&lt;/strong&gt; SQLite is a single file. No server process, no ports, no configuration. For a local-first tool, this is ideal — the entire database is just a file in &lt;code&gt;data/index/fhir_index.sqlite&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ships with Python.&lt;/strong&gt; The &lt;code&gt;sqlite3&lt;/code&gt; module is in Python's standard library. No pip install, no binary dependencies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FTS5 is surprisingly powerful.&lt;/strong&gt; SQLite's FTS5 extension supports ranked full-text search with a single SQL query. You create a virtual table that mirrors your main table, and then you can &lt;code&gt;MATCH&lt;/code&gt; against it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fhir_version&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fhir_resources_fts&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;fhir_resources_fts&lt;/span&gt; &lt;span class="k"&gt;MATCH&lt;/span&gt; &lt;span class="s1"&gt;'Patient'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you ranked results, and it's fast — milliseconds over thousands of resources.&lt;/p&gt;
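&lt;p&gt;The same flow works end to end from Python's stdlib. A self-contained sketch (it assumes an SQLite build compiled with FTS5, which CPython's bundled SQLite normally includes; the sample rows are illustrative):&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
# One virtual table mirroring the searchable columns of the main table.
con.execute("CREATE VIRTUAL TABLE fhir_resources_fts USING fts5(name, title, summary_text)")
con.executemany(
    "INSERT INTO fhir_resources_fts VALUES (?, ?, ?)",
    [
        ("Patient", "Patient Resource",
         "Demographics and other administrative information about an individual receiving care."),
        ("Observation", "Observation Resource",
         "Measurements and simple assertions made about a patient."),
    ],
)
# MATCH tokenizes the query; ORDER BY rank puts the best matches first.
names = [
    row[0]
    for row in con.execute(
        "SELECT name FROM fhir_resources_fts "
        "WHERE fhir_resources_fts MATCH ? ORDER BY rank",
        ("patient",),
    )
]
print(names)
```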

&lt;p&gt;&lt;strong&gt;Predictable performance.&lt;/strong&gt; SQLite's performance characteristics are well understood. For read-heavy workloads (which is all we do at runtime), it's excellent. No connection pooling, no query planning surprises.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Indexing Pipeline
&lt;/h3&gt;

&lt;p&gt;The indexer runs as a standalone script, separate from the server. It does three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Discover packages.&lt;/strong&gt; Walk &lt;code&gt;data/packages/&lt;/code&gt; and &lt;code&gt;data/fixtures/&lt;/code&gt;, find every directory with a &lt;code&gt;package.json&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extract metadata.&lt;/strong&gt; For each JSON file in a package, read the resource, extract the key fields (canonical URL, name, title, type, version, description), and normalize them into a consistent shape.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write to SQLite.&lt;/strong&gt; Insert every resource into the main table, then rebuild the FTS5 index.&lt;/li&gt;
&lt;/ol&gt;
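&lt;p&gt;The skeleton of those three steps fits in a page. A compressed sketch (the table layout and helper names are simplified from the real indexer):&lt;/p&gt;

```python
import json
import sqlite3
import tempfile
from pathlib import Path

def discover_packages(root: Path):
    """Step 1: yield every directory under root that contains a package.json."""
    for manifest in root.rglob("package.json"):
        yield manifest.parent

def extract_metadata(resource: dict, package_name: str) -> dict:
    """Step 2: flatten the handful of fields we index; ignore the rest."""
    return {
        "canonical_url": resource.get("url", ""),
        "name": resource.get("name", ""),
        "title": resource.get("title", ""),
        "type": resource.get("resourceType", ""),
        "package_name": package_name,
        "summary_text": resource.get("description", ""),
        "json_payload": json.dumps(resource),
    }

def build_index(root: Path, db_path: str = ":memory:") -> sqlite3.Connection:
    """Step 3: write every normalized row to SQLite."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE fhir_resources
                   (canonical_url TEXT, name TEXT, title TEXT, type TEXT,
                    package_name TEXT, summary_text TEXT, json_payload TEXT)""")
    for pkg_dir in discover_packages(root):
        pkg_name = json.loads((pkg_dir / "package.json").read_text()).get("name", pkg_dir.name)
        for f in pkg_dir.glob("*.json"):
            if f.name == "package.json":
                continue
            meta = extract_metadata(json.loads(f.read_text()), pkg_name)
            con.execute(
                "INSERT INTO fhir_resources VALUES "
                "(:canonical_url, :name, :title, :type, "
                ":package_name, :summary_text, :json_payload)",
                meta,
            )
    con.commit()
    return con

# Demo with a throwaway package directory.
with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "hl7.fhir.r4.core"
    pkg.mkdir()
    (pkg / "package.json").write_text(json.dumps({"name": "hl7.fhir.r4.core"}))
    (pkg / "StructureDefinition-Patient.json").write_text(json.dumps(
        {"resourceType": "StructureDefinition", "name": "Patient",
         "url": "http://hl7.org/fhir/StructureDefinition/Patient"}))
    con = build_index(Path(tmp))
    names = [r[0] for r in con.execute("SELECT name FROM fhir_resources")]
print(names)
```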

&lt;p&gt;Here's the key insight about the extraction step: &lt;strong&gt;we don't index everything.&lt;/strong&gt; A StructureDefinition can have hundreds of elements, extensions, constraints, slicing rules. We extract only the metadata needed for lookup and search:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;canonical_url  →  "http://hl7.org/fhir/StructureDefinition/Patient"
name           →  "Patient"
title          →  "Patient Resource"
type           →  "StructureDefinition"   (the resourceType)
fhir_version   →  "R4"
package_name   →  "hl7.fhir.r4.core"
package_version → "4.0.1"
resource_type  →  "Patient"               (the FHIR type, like Patient, Observation)
summary_text   →  "Demographics and other administrative information..."
json_payload   →  (the full JSON, stored for retrieval)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;json_payload&lt;/code&gt; is stored but &lt;em&gt;not&lt;/em&gt; included in search by default. It's there so we can return the full resource when requested, but we don't want FTS5 indexing the entire JSON blob — that would bloat the index and produce noisy search results.&lt;/p&gt;
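&lt;p&gt;FTS5 has a feature built for exactly this: &lt;code&gt;UNINDEXED&lt;/code&gt; columns, which are stored and retrievable but never tokenized. A small demonstration (table and column names are illustrative):&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
# json_payload travels with each row but is excluded from the full-text index.
con.execute(
    "CREATE VIRTUAL TABLE resources_fts "
    "USING fts5(name, summary_text, json_payload UNINDEXED)"
)
con.execute(
    "INSERT INTO resources_fts VALUES (?, ?, ?)",
    ("Patient", "Demographics for an individual receiving care.",
     '{"resourceType": "StructureDefinition", "id": "Patient"}'),
)
# A term that appears only inside the payload is invisible to MATCH...
miss = con.execute(
    "SELECT name FROM resources_fts WHERE resources_fts MATCH 'StructureDefinition'"
).fetchall()
# ...but the full payload still comes back once a row matches on indexed columns.
hit = con.execute(
    "SELECT json_payload FROM resources_fts WHERE resources_fts MATCH 'Patient'"
).fetchone()
print(miss, hit[0])
```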

&lt;h3&gt;
  
  
  Normalization: Why It Matters
&lt;/h3&gt;

&lt;p&gt;FHIR resources aren't consistent in their metadata fields. A StructureDefinition has &lt;code&gt;type&lt;/code&gt; and &lt;code&gt;kind&lt;/code&gt;. A ValueSet has neither. A CodeSystem has a &lt;code&gt;content&lt;/code&gt; field that's irrelevant to us. Different FHIR versions may organize fields slightly differently.&lt;/p&gt;

&lt;p&gt;We wrote normalization functions for each resource type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;normalize_structure_definition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;package_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;flat&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
&lt;span class="nf"&gt;normalize_value_set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;package_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;             &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;flat&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
&lt;span class="nf"&gt;normalize_code_system&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;package_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;           &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;flat&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each one extracts the same set of fields into the same shape, regardless of the resource type. This means the handlers don't need to know the difference between indexing a ValueSet and a StructureDefinition — they all look the same in the database.&lt;/p&gt;

&lt;p&gt;This was a lesson in &lt;strong&gt;write boring normalization code early, save debugging time later.&lt;/strong&gt; We initially skipped normalization and tried to query the raw JSON fields with SQLite JSON functions. It worked, but the queries were fragile, slow, and different for each resource type. Flat normalization was a much better investment.&lt;/p&gt;
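&lt;p&gt;A minimal sketch of what those normalizers might look like. The function names follow the article; the shared helper and the exact field set are illustrative, and the real functions extract more fields:&lt;/p&gt;

```python
# Hedged sketch of the per-type normalizers described above. The _base_fields
# helper and its keys are illustrative stand-ins for the real flat shape.
def _base_fields(resource, package_name, package_version):
    return {
        "canonical_url": resource.get("url"),
        "name": resource.get("name"),
        "title": resource.get("title") or resource.get("name"),
        "type": resource.get("resourceType"),
        "package_name": package_name,
        "package_version": package_version,
        "resource_type": None,
        "summary_text": resource.get("description") or "",
    }

def normalize_structure_definition(sd, package_name, package_version):
    flat = _base_fields(sd, package_name, package_version)
    flat["resource_type"] = sd.get("type")  # only StructureDefinitions carry this
    return flat

def normalize_value_set(vs, package_name, package_version):
    return _base_fields(vs, package_name, package_version)  # no "type"/"kind" field

def normalize_code_system(cs, package_name, package_version):
    return _base_fields(cs, package_name, package_version)  # "content" field ignored

flat = normalize_structure_definition(
    {"resourceType": "StructureDefinition",
     "url": "http://hl7.org/fhir/StructureDefinition/Patient",
     "name": "Patient", "type": "Patient"},
    "hl7.fhir.r4.core", "4.0.1",
)
```

&lt;p&gt;Because every normalizer returns the same keys, the handlers and SQL stay identical no matter which resource type produced a row.&lt;/p&gt;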

&lt;h3&gt;
  
  
  The "Rebuild the World" Pattern
&lt;/h3&gt;

&lt;p&gt;Our indexer always starts by deleting all existing data and re-indexing from scratch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DELETE FROM fhir_resources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ... re-index everything ...
&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT INTO fhir_resources_fts(fhir_resources_fts) VALUES(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rebuild&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is intentional. We're indexing static, versioned packages — not a stream of live data. The total data volume is small enough (seconds to minutes to index) that incremental updates aren't worth the complexity. "Delete everything and rebuild" is simple, correct, and fast enough.&lt;/p&gt;

&lt;p&gt;The FTS5 &lt;code&gt;'rebuild'&lt;/code&gt; command is important — it tells SQLite to reconstruct the full-text index from the content table. Without it, the FTS index would be stale after a bulk delete/insert.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 2: Designing the URI Scheme
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Custom URIs?
&lt;/h3&gt;

&lt;p&gt;MCP has a concept of &lt;strong&gt;resources&lt;/strong&gt; — read-only data items identified by URIs. The AI can "read" a resource by requesting its URI, similar to how a browser requests a URL.&lt;/p&gt;

&lt;p&gt;We needed URIs that were:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Human-readable&lt;/strong&gt; — a developer should be able to look at a URI and know what it refers to.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parseable&lt;/strong&gt; — the server needs to extract version, resource type, and name from the URI to do a lookup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unambiguous&lt;/strong&gt; — the same name can exist in different contexts (the Patient StructureDefinition in R4 vs R5, or in US Core vs base FHIR).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We designed three URI schemes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fhir://R4/StructureDefinition/Patient
 │     │         │               │
 │     │         │               └── Resource name
 │     │         └── Resource kind
 │     └── FHIR version
 └── Scheme (core FHIR)

ig://hl7.fhir.us.core/5.0.1/StructureDefinition/us-core-patient
 │        │              │            │                │
 │        │              │            │                └── Profile name
 │        │              │            └── Resource kind
 │        │              └── IG version
 │        └── IG package name
 └── Scheme (Implementation Guide)

uscore://5.0.1/StructureDefinition/us-core-patient
  │       │           │                  │
  │       │           │                  └── Profile name
  │       │           └── Resource kind
  │       └── US Core version
  └── Scheme (convenience shorthand for US Core)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;uscore://&lt;/code&gt; scheme is a convenience alias. US Core is by far the most commonly referenced IG in the US healthcare ecosystem, so it gets a shorthand.&lt;/p&gt;

&lt;h3&gt;
  
  
  Parsing and Formatting
&lt;/h3&gt;

&lt;p&gt;We built a small &lt;code&gt;uri_scheme&lt;/code&gt; package with two modules:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parsing&lt;/strong&gt; uses a regex to decompose a URI into its components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="s2"&gt;"fhir://R4/StructureDefinition/Patient"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;scheme:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fhir"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;version:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"R4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;name:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Patient"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Formatting&lt;/strong&gt; does the reverse — construct a URI from components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;format_fhir_uri&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;R4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Patient&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fhir://R4/StructureDefinition/Patient&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A design decision we made here: &lt;strong&gt;StructureDefinition is hardcoded in the URI path.&lt;/strong&gt; We debated making the resource type a variable, but in practice, 95%+ of the resources that AI assistants ask about are StructureDefinitions (or profiles, which are StructureDefinitions). ValueSets and CodeSystems are almost always accessed via search, not direct URI lookup. Hardcoding simplified the URI scheme and made the common case cleaner.&lt;/p&gt;

&lt;p&gt;If we ever need to support &lt;code&gt;fhir://R4/ValueSet/administrative-gender&lt;/code&gt;, we can extend the regex. But we haven't needed to yet, and premature generalization would have complicated the parser for no benefit.&lt;/p&gt;
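&lt;p&gt;A sketch of the two &lt;code&gt;uri_scheme&lt;/code&gt; helpers, assuming an illustrative regex (the real package's pattern and return shape may differ). Note how &lt;code&gt;StructureDefinition&lt;/code&gt; is hardcoded in the path, per the design decision above:&lt;/p&gt;

```python
import re

# Hedged sketch of the uri_scheme helpers; the regex and return shape here
# are illustrative. StructureDefinition is hardcoded in the path on purpose.
FHIR_URI_RE = re.compile(r"^fhir://(R4B?|R5)/StructureDefinition/([A-Za-z0-9-]+)$")

def parse_fhir_uri(uri):
    match = FHIR_URI_RE.match(uri)
    if match is None:
        raise ValueError(f"not a recognized fhir:// URI: {uri}")
    return {"scheme": "fhir", "version": match.group(1), "name": match.group(2)}

def format_fhir_uri(version, name):
    return f"fhir://{version}/StructureDefinition/{name}"

parsed = parse_fhir_uri("fhir://R4/StructureDefinition/Patient")
round_trip = format_fhir_uri(parsed["version"], parsed["name"])
```

&lt;p&gt;Parsing and formatting are exact inverses, so a URI can round-trip through the server without loss.&lt;/p&gt;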




&lt;h2&gt;
  
  
  Phase 3: Building the Tool Handlers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Handler Pattern
&lt;/h3&gt;

&lt;p&gt;Every tool in our server follows the exact same structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Define a Pydantic model for the input
2. Write a handler function that takes the model and returns a result
3. Wrap them in a Tool object
4. Register the Tool in the registry
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't accidental — we arrived at it after trying a few alternatives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 1: Functions with &lt;code&gt;**kwargs&lt;/code&gt;.&lt;/strong&gt; We tried defining handlers as functions that accept keyword arguments directly. The problem: no validation, no type checking, no way for MCP to communicate the expected schema to the AI. The AI would send inputs in unexpected shapes and we'd get runtime KeyErrors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 2: Decorated functions.&lt;/strong&gt; We tried a decorator approach where you'd annotate a function and metadata would be extracted automatically. Clever, but opaque. When something went wrong, the stack trace pointed to decorator internals, not our code. And new team members couldn't understand how tools were registered without understanding the decorator machinery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 3 (what we kept): Explicit Tool class.&lt;/strong&gt; A simple class with three attributes: &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;input_model&lt;/code&gt;, &lt;code&gt;handler&lt;/code&gt;. No magic. No metaclasses. The registration is a dictionary assignment. The cost is a few extra lines per tool. The benefit is total clarity.&lt;/p&gt;

&lt;p&gt;Here's the conceptual pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────┐
│  Handler File: fhir_search.py                        │
│                                                      │
│  1. Input Model (Pydantic)                           │
│     query: str                                       │
│     version: Optional[str]                           │
│     kind: Optional[str]                              │
│     top_n: int = 10                                  │
│                                                      │
│  2. Handler Function                                 │
│     Takes validated input → queries SQLite FTS        │
│     Returns list of matching resources               │
│                                                      │
│  3. Tool Object                                      │
│     name = "fhir.search"                             │
│     input_model = FhirSearchInput                    │
│     handler = fhir_search_handler                    │
└──────────────────────────────────────────────────────┘
          │
          ▼
┌──────────────────────────────────────────────────────┐
│  Registry: tools.py                                  │
│                                                      │
│  TOOL_REGISTRY = {                                   │
│      "fhir.search": fhir_search_tool,                │
│      "fhir.get_definition": fhir_get_definition_tool,│
│      ...                                             │
│  }                                                   │
└──────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
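&lt;p&gt;The pattern in the diagram can be sketched in a few lines. The real server validates inputs with Pydantic models; a dataclass stands in here to keep the sketch dependency-free, and the stub handler's return value is invented:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Stand-in for the real Pydantic input model (illustrative, dependency-free).
@dataclass
class FhirSearchInput:
    query: str
    version: Optional[str] = None
    kind: Optional[str] = None
    top_n: int = 10

# The explicit Tool class: three attributes, no magic, no metaclasses.
@dataclass
class Tool:
    name: str
    input_model: type
    handler: Callable[[Any], Any]

def fhir_search_handler(params):
    # The real handler queries the SQLite FTS index; stubbed for the sketch.
    return [{"name": "Patient", "matched": params.query}][: params.top_n]

fhir_search_tool = Tool("fhir.search", FhirSearchInput, fhir_search_handler)

# Registration is just a dictionary assignment.
TOOL_REGISTRY = {"fhir.search": fhir_search_tool}

def invoke(tool_name, raw_input):
    tool = TOOL_REGISTRY[tool_name]
    return tool.handler(tool.input_model(**raw_input))  # validation happens here

result = invoke("fhir.search", {"query": "allergy"})
```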



&lt;h3&gt;
  
  
  Tool-by-Tool: The Thinking Behind Each One
&lt;/h3&gt;

&lt;p&gt;Let's walk through each tool and the reasoning behind it.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;fhir.get_definition&lt;/code&gt; — The Surgical Lookup
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Given a FHIR version, resource kind, and name, returns the metadata (and optionally the full JSON) for that specific resource.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it exists:&lt;/strong&gt; This is the most fundamental operation. When an AI is discussing the Patient resource, it needs to be able to say "let me look that up" and get the authoritative definition. Not a search result. Not a "maybe." The exact definition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design choices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;include_json&lt;/code&gt; defaults to &lt;code&gt;false&lt;/code&gt;. Metadata (name, title, canonical URL, version, description) is usually enough for the AI to answer a question. The full JSON is huge and should only be retrieved when specifically needed.&lt;/li&gt;
&lt;li&gt;When &lt;code&gt;include_json&lt;/code&gt; is &lt;code&gt;true&lt;/code&gt;, the payload is &lt;strong&gt;truncated to 10,000 characters&lt;/strong&gt;. A full StructureDefinition can be 50KB+. Truncation keeps the response within reasonable context window limits while still providing useful structural information.&lt;/li&gt;
&lt;li&gt;Returns &lt;code&gt;(meta_dict, json_string)&lt;/code&gt; — separating metadata from the payload lets the AI decide what to use without parsing raw JSON.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;fhir.search&lt;/code&gt; — The Exploration Tool
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Full-text search across all indexed resources, with optional filters for version, kind, and IG.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it exists:&lt;/strong&gt; Sometimes the AI doesn't know the exact resource name. A user might ask "what FHIR resource handles allergies?" The AI needs to &lt;em&gt;search&lt;/em&gt;, not just look up. This tool lets it query the index the same way a human would search a specification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design choices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;top_n&lt;/code&gt; defaults to 10. Returning too many results wastes context. 10 is enough for the AI to find what it needs.&lt;/li&gt;
&lt;li&gt;Filters are all optional. You can search across everything (&lt;code&gt;query: "allergy"&lt;/code&gt;), or narrow it down (&lt;code&gt;query: "allergy", version: "R4", kind: "StructureDefinition"&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Results include metadata only, not full JSON. If the AI finds what it's looking for, it can follow up with &lt;code&gt;fhir.get_definition&lt;/code&gt; for the full payload.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;ig.list&lt;/code&gt; — The Discovery Tool
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Returns a list of all Implementation Guides that have been indexed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it exists:&lt;/strong&gt; Before the AI can query an IG, it needs to know what IGs are available. This tool answers the question "what IGs does this server know about?" It's the starting point for IG-related conversations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design choices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Takes no input. It's purely a discovery mechanism.&lt;/li&gt;
&lt;li&gt;Returns package name, version, and FHIR version for each IG.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;uscore.get_profile&lt;/code&gt; — The Shortcut
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Fetches a US Core profile by version and name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it exists:&lt;/strong&gt; US Core is &lt;em&gt;the&lt;/em&gt; most commonly referenced IG in US healthcare development. Having a dedicated tool for it (instead of making the AI use &lt;code&gt;fhir.get_definition&lt;/code&gt; with the right package name) reduces the number of parameters the AI needs to get right and makes the common case faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design choices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Separate from &lt;code&gt;fhir.get_definition&lt;/code&gt; even though it queries the same database. The semantic distinction matters to the AI — "get a US Core profile" is a different intent than "get a FHIR definition."&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;fhir.diff_versions&lt;/code&gt; — The Migration Helper
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Compares a StructureDefinition between two FHIR versions (e.g., R4 vs R5).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it exists:&lt;/strong&gt; One of the most common questions in FHIR development is "what changed between versions?" When migrating from R4 to R5, developers need to know which elements were added, removed, or renamed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design choices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Currently does a &lt;strong&gt;metadata-level diff only&lt;/strong&gt; — comparing the top-level fields. A full element-path diff (comparing every element in the differential/snapshot) is complex and was deferred.&lt;/li&gt;
&lt;li&gt;The tool exists with partial functionality rather than not existing at all. This is deliberate: the AI knows the capability exists and can provide partial answers ("the metadata changed in these ways, though a full element diff isn't available yet") rather than no answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;validate.instance&lt;/code&gt; — The Placeholder
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Nothing, currently. Returns a "not implemented" response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it exists as a stub:&lt;/strong&gt; We wanted the tool in the registry from day one, even though validation is hard. Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It signals intent. Other developers (and the AI itself) can see that validation is a planned capability.&lt;/li&gt;
&lt;li&gt;It establishes the input contract early. The Pydantic model defines what validation will eventually accept.&lt;/li&gt;
&lt;li&gt;It fails gracefully. If the AI tries to use it, it gets a clear "not implemented" message rather than a confusing error.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Phase 4: The Transport Layer — Less Is More
&lt;/h2&gt;

&lt;h3&gt;
  
  
  stdio: The Primary Transport
&lt;/h3&gt;

&lt;p&gt;MCP's standard transport is JSON-RPC over stdio. The client (Claude Desktop, Cursor, etc.) spawns the server as a child process, sends JSON on stdin, and reads JSON from stdout. stderr is reserved for logging.&lt;/p&gt;

&lt;p&gt;Our stdio transport is surprisingly simple. The core loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Read a line from stdin
2. Parse it as JSON
3. Route to the right handler based on the "method" field
4. Serialize the response as JSON
5. Write it to stdout + newline
6. Flush
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things we learned:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Always flush stdout.&lt;/strong&gt; If you don't explicitly flush after writing, the response may sit in a buffer and the client will hang waiting for it. This bit us during early testing — everything worked in manual testing (where stdout is line-buffered to a terminal) but hung in Claude Desktop (where stdout is fully buffered to a pipe).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Log to stderr, never stdout.&lt;/strong&gt; Stdout is the protocol channel. Any print statement that goes to stdout will be interpreted as a JSON-RPC message and break the protocol. We learned to use &lt;code&gt;print(..., file=sys.stderr)&lt;/code&gt; for all diagnostic output and configured Python's logging to write to stderr.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Catch and serialize all exceptions.&lt;/strong&gt; If the handler throws, the transport catches it and returns a structured error response. If the transport itself throws (e.g., malformed JSON), it still writes a valid JSON error to stdout. The client should never see a raw traceback on the protocol channel.&lt;/p&gt;
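&lt;p&gt;All three lessons fit in a loop of a dozen lines. This is an illustrative sketch, not the server's actual code; the &lt;code&gt;ping&lt;/code&gt; method and error shape are invented (real MCP traffic is JSON-RPC 2.0):&lt;/p&gt;

```python
import json
import sys

def handle_line(line):
    try:
        request = json.loads(line)
        method = request.get("method")
        if method == "ping":
            response = {"result": "pong"}
        else:
            response = {"error": {"message": f"unknown method: {method}"}}
    except Exception as exc:  # never let a raw traceback reach the protocol channel
        response = {"error": {"message": str(exc)}}
    return json.dumps(response)

def serve():
    for line in sys.stdin:                             # 1-2. read and parse
        sys.stdout.write(handle_line(line) + "\n")     # 4-5. serialize + newline
        sys.stdout.flush()                             # 6. flush, or the client hangs
        print("handled one request", file=sys.stderr)  # diagnostics: stderr only
```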

&lt;h3&gt;
  
  
  HTTP: The Development Convenience
&lt;/h3&gt;

&lt;p&gt;We added a simple HTTP transport for development and testing. It runs the same handlers but accepts requests via HTTP POST instead of stdin.&lt;/p&gt;

&lt;p&gt;Why? Because testing via stdin is painful. You have to pipe JSON into the process, read from stdout, and deal with buffering. With HTTP, you can use &lt;code&gt;curl&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"method": "invoke_tool", "params": {"name": "fhir.search", "input": {"query": "Patient"}}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The HTTP server also exposes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;GET /health&lt;/code&gt; — for readiness probes (important for Tilt, which we'll cover in Part 3)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /tools&lt;/code&gt; — quick way to see what tools are available&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We built this using Python's built-in &lt;code&gt;http.server&lt;/code&gt; module — no Flask, no FastAPI, no additional dependencies. For a dev-only transport, stdlib is enough.&lt;/p&gt;
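&lt;p&gt;A stdlib-only sketch of that dev transport, including the &lt;code&gt;/health&lt;/code&gt; and &lt;code&gt;/tools&lt;/code&gt; routes. The payload shapes and tool list are illustrative:&lt;/p&gt;

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Dev-only HTTP transport sketch, stdlib only. Routes follow the article;
# the response payloads and tool list here are illustrative.
TOOLS = ["fhir.search", "fhir.get_definition", "ig.list"]

class DevHandler(BaseHTTPRequestHandler):
    def _send_json(self, payload, status=200):
        body = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        if self.path == "/health":
            self._send_json({"status": "ok"})  # readiness probe
        elif self.path == "/tools":
            self._send_json({"tools": TOOLS})
        else:
            self._send_json({"error": "not found"}, status=404)

    def log_message(self, *args):  # silence default stderr chatter in the sketch
        pass

server = ThreadingHTTPServer(("127.0.0.1", 0), DevHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

with urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/health") as resp:
    health = json.loads(resp.read())
server.shutdown()
```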




&lt;h2&gt;
  
  
  The Glue: How Settings Hold It All Together
&lt;/h2&gt;

&lt;p&gt;Configuration flows through a single &lt;code&gt;Settings&lt;/code&gt; class built with Pydantic Settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;data_dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;         &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data"&lt;/span&gt;                              &lt;span class="s"&gt;(base data directory)&lt;/span&gt;
  &lt;span class="na"&gt;index_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;       &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/index/fhir_index.sqlite"&lt;/span&gt;      &lt;span class="s"&gt;(SQLite index)&lt;/span&gt;
  &lt;span class="na"&gt;packages_dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;     &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/packages"&lt;/span&gt;                     &lt;span class="s"&gt;(FHIR packages)&lt;/span&gt;
  &lt;span class="na"&gt;fixtures_dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;     &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/fixtures"&lt;/span&gt;                     &lt;span class="s"&gt;(demo data)&lt;/span&gt;
  &lt;span class="na"&gt;log_level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INFO"&lt;/span&gt;
  &lt;span class="na"&gt;storage_backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sqlite"&lt;/span&gt;                            &lt;span class="s"&gt;(or "postgres")&lt;/span&gt;
  &lt;span class="na"&gt;pg_host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;          &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost"&lt;/span&gt;                          &lt;span class="s"&gt;(PostgreSQL config)&lt;/span&gt;
  &lt;span class="na"&gt;pg_port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;          &lt;span class="m"&gt;5432&lt;/span&gt;
  &lt;span class="na"&gt;pg_database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fhir_mcp"&lt;/span&gt;
  &lt;span class="s"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything is configurable via environment variables with the &lt;code&gt;FHIR_MCP_&lt;/code&gt; prefix. So &lt;code&gt;FHIR_MCP_INDEX_PATH=/custom/path.sqlite&lt;/code&gt; overrides the default index path.&lt;/p&gt;

&lt;p&gt;Why Pydantic Settings instead of just &lt;code&gt;os.environ.get()&lt;/code&gt;? Because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Type coercion.&lt;/strong&gt; &lt;code&gt;pg_port&lt;/code&gt; is declared as &lt;code&gt;int&lt;/code&gt;, so the string from the environment is automatically converted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Defaults in one place.&lt;/strong&gt; You can read the Settings class and see every configuration option with its default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation at startup.&lt;/strong&gt; If you set &lt;code&gt;FHIR_MCP_PG_PORT=not_a_number&lt;/code&gt;, Pydantic catches it immediately rather than failing on first database connection.&lt;/li&gt;
&lt;/ul&gt;
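&lt;p&gt;To make the mechanics concrete, here is a stdlib approximation of what Pydantic Settings provides: typed defaults in one place, overridable via &lt;code&gt;FHIR_MCP_&lt;/code&gt;-prefixed environment variables, with eager type checking. The defaults mirror the fields above; this is not the project's actual implementation:&lt;/p&gt;

```python
import os

# Stdlib approximation of the Pydantic Settings behavior described above.
# Only a subset of fields is shown.
DEFAULTS = {
    "data_dir": "data",
    "index_path": "data/index/fhir_index.sqlite",
    "log_level": "INFO",
    "storage_backend": "sqlite",
    "pg_host": "localhost",
    "pg_port": 5432,
}

def load_settings(environ=os.environ):
    settings = {}
    for key, default in DEFAULTS.items():
        raw = environ.get(f"FHIR_MCP_{key.upper()}")
        if raw is None:
            settings[key] = default
        elif isinstance(default, int):
            settings[key] = int(raw)  # fails at startup, not at first DB connection
        else:
            settings[key] = raw
    return settings

settings = load_settings({"FHIR_MCP_PG_PORT": "5433"})
```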




&lt;h2&gt;
  
  
  What Surprised Us About Building Tools for AI
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Surprise 1: The AI prefers narrow tools over flexible ones
&lt;/h3&gt;

&lt;p&gt;We initially tried to build a single "query" tool that could do lookups, search, and filtering all in one. The AI struggled with it — too many optional parameters, too many modes. When we split it into focused tools (&lt;code&gt;get_definition&lt;/code&gt; for exact lookup, &lt;code&gt;search&lt;/code&gt; for exploration, &lt;code&gt;ig.list&lt;/code&gt; for discovery), the AI's tool selection accuracy improved dramatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson: Build many focused tools, not few flexible ones.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Surprise 2: Optional fields need good defaults
&lt;/h3&gt;

&lt;p&gt;When we had &lt;code&gt;top_n&lt;/code&gt; as a required field on the search tool, the AI would sometimes send &lt;code&gt;top_n: 100&lt;/code&gt; or &lt;code&gt;top_n: 1000&lt;/code&gt;. When we made it optional with a default of 10, the AI almost always omitted it (using the default) or sent a reasonable value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson: Defaults guide AI behavior. Choose them carefully.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Surprise 3: Error messages are consumed by the AI, not humans
&lt;/h3&gt;

&lt;p&gt;When a tool returns an error, the AI reads it and decides what to do next. We initially returned generic errors like &lt;code&gt;{"error": "Not found"}&lt;/code&gt;. The AI would then tell the user "the resource wasn't found" without any helpful context. When we improved errors to include specifics — &lt;code&gt;{"error": "StructureDefinition 'Patientt' not found in R4. Did you mean 'Patient'?"}&lt;/code&gt; — the AI became much better at self-correcting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson: Write error messages for your AI caller, not for a log file.&lt;/strong&gt;&lt;/p&gt;
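&lt;p&gt;One cheap way to produce that kind of self-correcting error is fuzzy matching against the names already in the index. A sketch using stdlib &lt;code&gt;difflib&lt;/code&gt; (the name list and message wording are illustrative):&lt;/p&gt;

```python
import difflib

# Illustrative slice of names from the index; the real list comes from SQLite.
KNOWN_NAMES = ["Patient", "Practitioner", "AllergyIntolerance", "Observation"]

def not_found_error(name, version):
    message = f"StructureDefinition '{name}' not found in {version}."
    close = difflib.get_close_matches(name, KNOWN_NAMES, n=1)
    if close:
        # Gives the AI caller something concrete to retry with.
        message += f" Did you mean '{close[0]}'?"
    return {"error": message}

err = not_found_error("Patientt", "R4")
```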

&lt;h3&gt;
  
  
  Surprise 4: The storage backend swap validated the architecture
&lt;/h3&gt;

&lt;p&gt;Halfway through development, we decided to add PostgreSQL as an alternative storage backend (for teams that wanted shared indexes or larger datasets). Because we'd built the storage layer as an interface — &lt;code&gt;get_definition_by_name()&lt;/code&gt;, &lt;code&gt;search_definitions()&lt;/code&gt;, &lt;code&gt;list_igs()&lt;/code&gt; — we could add a Postgres implementation without touching a single handler or transport file.&lt;/p&gt;

&lt;p&gt;The storage module uses a simple factory based on an environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;FHIR_MCP_STORAGE_BACKEND&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sqlite  →  uses sqlite_store
&lt;span class="nv"&gt;FHIR_MCP_STORAGE_BACKEND&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgres →  uses postgres_store
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL uses &lt;code&gt;tsvector&lt;/code&gt;/&lt;code&gt;tsquery&lt;/code&gt; for full-text search instead of FTS5. The query interface is the same. The handlers don't know or care which backend is active.&lt;/p&gt;
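&lt;p&gt;The factory itself is only a few lines. The store classes below are stand-ins that expose the same interface the handlers call; the real implementations wrap SQLite FTS5 and Postgres &lt;code&gt;tsvector&lt;/code&gt; queries:&lt;/p&gt;

```python
import os

# Illustrative stand-ins for the two storage backends.
class SqliteStore:
    def search_definitions(self, query):
        return f"sqlite fts5 search: {query}"

class PostgresStore:
    def search_definitions(self, query):
        return f"postgres tsvector search: {query}"

_BACKENDS = {"sqlite": SqliteStore, "postgres": PostgresStore}

def get_store(environ=os.environ):
    # Simple factory keyed on an environment variable, defaulting to SQLite.
    backend = environ.get("FHIR_MCP_STORAGE_BACKEND", "sqlite")
    if backend not in _BACKENDS:
        raise ValueError(f"unknown storage backend: {backend}")
    return _BACKENDS[backend]()

store = get_store({"FHIR_MCP_STORAGE_BACKEND": "postgres"})
```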

&lt;p&gt;&lt;strong&gt;Lesson: Layer your architecture. The decision to separate storage from handlers paid for itself within weeks.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Coming Up in Part 3
&lt;/h2&gt;

&lt;p&gt;In the final post, we'll cover the operational side: how we test an MCP server, the developer experience with Tilt and Docker, lessons learned about deploying to different clients (Claude Desktop vs Cursor), and what we'd do differently if we started over today.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 2 of a 3-part series.&lt;/em&gt;&lt;br&gt;
&lt;a href="https://dev.to/chaets/mcp-the-missing-layer-between-ai-and-your-application-fdj"&gt;Part 0: MCP — The Missing Layer Between AI and Your Application →&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://dev.to/chaets/part-1-why-we-built-an-mcp-server-and-what-we-learned-before-writing-a-single-line-of-code-4mao"&gt;Part 1: Why We Built an MCP Server — And What We Learned Before Writing a Single Line of Code&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://dev.to/chaets/part-2-building-the-engine-tools-uris-and-the-art-of-indexing-fhir-fi1"&gt;Part 2: Building the Engine — Tools, URIs, and the Art of Indexing FHIR&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://dev.to/chaets/part-3-testing-deploying-and-lessons-learned-aa5"&gt;Part 3: Testing, Deploying, and Lessons Learned -&amp;gt; coming soon&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you'd like to connect, find me on &lt;a href="https://www.linkedin.com/in/chaets/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or drop me a message. I'd love to explore how I can help drive your data success!&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>fhir</category>
      <category>healthtech</category>
    </item>
    <item>
      <title>Part 1: Why We Built an MCP Server — And What We Learned Before Writing a Single Line of Code</title>
      <dc:creator>Chetan Gupta</dc:creator>
      <pubDate>Mon, 02 Mar 2026 04:39:47 +0000</pubDate>
      <link>https://forem.com/chaets/part-1-why-we-built-an-mcp-server-and-what-we-learned-before-writing-a-single-line-of-code-4mao</link>
      <guid>https://forem.com/chaets/part-1-why-we-built-an-mcp-server-and-what-we-learned-before-writing-a-single-line-of-code-4mao</guid>
      <description>&lt;p&gt;&lt;em&gt;A three-part series on building our first Model Context Protocol server for healthcare interoperability.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem That Wouldn't Go Away
&lt;/h2&gt;

&lt;p&gt;If you've ever worked in healthcare tech, you know the feeling: someone asks an AI assistant — Claude, ChatGPT, Copilot, whatever — a question about FHIR (Fast Healthcare Interoperability Resources), and the answer is &lt;em&gt;close&lt;/em&gt; but dangerously wrong. Maybe it hallucinates a field that doesn't exist in R4. Maybe it confuses a US Core profile with a base resource. Maybe it confidently describes an element that was removed two versions ago.&lt;/p&gt;

&lt;p&gt;This isn't the AI's fault. FHIR is a vast, versioned specification. The core spec alone has hundreds of StructureDefinitions, ValueSets, and CodeSystems. Layer on Implementation Guides (IGs) like US Core, and you're dealing with thousands of artifacts across multiple versions (R4, R4B, R5). No language model has all of that committed to memory with version-level precision.&lt;/p&gt;

&lt;p&gt;We kept running into this problem on our team. We'd be deep in implementation work — mapping clinical data, validating resources, reviewing profiles — and every time we turned to an AI for help, we had to mentally fact-check every response against the actual specification. It was exhausting.&lt;/p&gt;

&lt;p&gt;So we asked ourselves: &lt;strong&gt;what if the AI could just &lt;em&gt;look it up&lt;/em&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not from a web search. Not from its training data. From the actual, versioned, canonical FHIR packages sitting right on our machine.&lt;/p&gt;

&lt;p&gt;That's how &lt;code&gt;fhir-mcp&lt;/code&gt; was born.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why MCP? (And Why Not Just an API?)
&lt;/h2&gt;

&lt;p&gt;Before we chose the Model Context Protocol, we considered the obvious alternatives:&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 1: Fine-tune a model on FHIR specs
&lt;/h3&gt;

&lt;p&gt;We dismissed this quickly. FHIR evolves. New IGs are published constantly. Fine-tuning is expensive, slow, and creates a snapshot in time. We needed something that could reflect the state of &lt;em&gt;your&lt;/em&gt; local packages — whatever you've got downloaded today.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2: RAG (Retrieval-Augmented Generation) pipeline
&lt;/h3&gt;

&lt;p&gt;This was tempting. Embed all the JSON, throw it in a vector store, retrieve context at query time. But we realized two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FHIR resources are highly structured JSON, not prose. Embedding-based search over deeply nested JSON objects loses the structural relationships that matter most.&lt;/li&gt;
&lt;li&gt;We didn't just want "related text chunks." We wanted the AI to be able to call specific, typed operations: "get me the Patient StructureDefinition from R4," "search across all indexed resources for 'blood pressure,'" "diff the Observation resource between R4 and R5."&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Option 3: Build a REST API and tell the user to paste results
&lt;/h3&gt;

&lt;p&gt;This works, but it breaks the flow. The whole point was to let the AI &lt;em&gt;autonomously&lt;/em&gt; look things up during a conversation — not to make the human be the middleware.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why MCP Won
&lt;/h3&gt;

&lt;p&gt;MCP is purpose-built for exactly this: giving AI models structured access to external data and tools. Instead of building a generic API and hoping the AI figures out how to use it, MCP lets you declare:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt;: Functions the AI can call with typed inputs. "Here's a function called &lt;code&gt;fhir.search&lt;/code&gt; that takes a query string and optional filters and returns matching FHIR resources."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources&lt;/strong&gt;: Read-only data the AI can access via URIs. "Here's &lt;code&gt;fhir://R4/StructureDefinition/Patient&lt;/code&gt; — read it to get the Patient definition."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompts&lt;/strong&gt;: Reusable prompt templates. "Here's a prompt called &lt;code&gt;summarize_profile&lt;/code&gt; that guides you to explain a FHIR profile in plain language."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI doesn't need to know &lt;em&gt;how&lt;/em&gt; we indexed the data, or where the SQLite database lives, or how the JSON was normalized. It just sees a clean interface of tools it can call.&lt;/p&gt;

&lt;p&gt;And critically: &lt;strong&gt;MCP is transport-agnostic&lt;/strong&gt;. The same server can talk to Claude Desktop over stdio, to Cursor over stdio, or to a web client over HTTP. We wouldn't have to rewrite anything when switching clients.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architectural Decisions We Made on Day One
&lt;/h2&gt;

&lt;p&gt;Before writing any code, we spent time on design decisions that would shape everything downstream. Here's what we chose and &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision 1: Local-First, Read-Only
&lt;/h3&gt;

&lt;p&gt;We made a hard rule: &lt;strong&gt;this server will never write data, and it will never call external APIs at runtime.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why? Because this is a healthcare context. We're indexing StructureDefinitions, not patient data — but even so, the principle matters. If you're building developer tooling in health tech, you want to be able to say "this thing runs entirely on your machine with zero network calls" without an asterisk.&lt;/p&gt;

&lt;p&gt;This also made the architecture simpler. No auth, no API keys, no rate limits, no network error handling in the hot path. The server boots, reads from a local SQLite database, and responds. That's it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision 2: Index First, Serve Second
&lt;/h3&gt;

&lt;p&gt;We realized early that "just read the JSON files at query time" wouldn't scale. A full FHIR R4 package has thousands of JSON files. Searching them by scanning the filesystem on every query would be unacceptably slow.&lt;/p&gt;

&lt;p&gt;So we split the system into two phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Index phase&lt;/strong&gt; (offline): Read every FHIR package, extract metadata from each resource, and store it in a SQLite database with FTS5 (full-text search). This runs once, before the server starts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serve phase&lt;/strong&gt; (runtime): The MCP server only talks to the SQLite database. Fast, predictable, no filesystem scanning.&lt;/li&gt;
&lt;/ol&gt;
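
&lt;p&gt;The two phases can be sketched with the standard library alone. Everything below is illustrative (the table layout, column names, and sample rows are not the project's actual schema), but it shows why serving from an FTS5 index is fast:&lt;/p&gt;

```python
import sqlite3

# Two-phase sketch: index once into SQLite FTS5, then serve queries
# from the index. Table layout, column names, and sample rows are
# illustrative, not the project's actual schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE resources USING fts5(name, kind, version, description)"
)

# Index phase (offline): insert metadata extracted from each package file.
rows = [
    ("Patient", "StructureDefinition", "R4", "Demographics and administrative info"),
    ("Observation", "StructureDefinition", "R4", "Measurements such as blood pressure"),
]
conn.executemany("INSERT INTO resources VALUES (?, ?, ?, ?)", rows)

# Serve phase (runtime): the server only ever runs queries like this one.
hits = conn.execute(
    "SELECT name FROM resources WHERE resources MATCH ? ORDER BY rank",
    ("blood pressure",),
).fetchall()
print(hits)  # [('Observation',)]
```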

&lt;p&gt;This was one of our best decisions. It meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The indexer could be ugly and slow — it only runs once.&lt;/li&gt;
&lt;li&gt;The server could be fast and simple — it only does SQL queries.&lt;/li&gt;
&lt;li&gt;We could later swap SQLite for PostgreSQL without touching the server code (and we did).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Decision 3: One Handler Per Tool, Pydantic for Everything
&lt;/h3&gt;

&lt;p&gt;We debated putting all tool logic in one big handler file. We're glad we didn't.&lt;/p&gt;

&lt;p&gt;Each MCP tool gets its own file. Each file defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;Pydantic model&lt;/strong&gt; for the tool's input&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;handler function&lt;/strong&gt; that takes the validated input and returns a result&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Tool object&lt;/strong&gt; that bundles the name, input model, and handler together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's why this pattern matters:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validation happens before logic.&lt;/strong&gt; If an AI sends garbage input, Pydantic catches it and returns a structured error. The handler never sees invalid data. This is crucial when your caller is an AI — they &lt;em&gt;will&lt;/em&gt; send unexpected inputs, and you need to fail cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Each tool is independently testable.&lt;/strong&gt; You can unit test the search handler without spinning up the transport layer. You can test the diff handler without having any other tools registered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adding a new tool is mechanical.&lt;/strong&gt; Create a file, define a Pydantic model, write the handler, register it in the tool registry. No touching the transport layer, no modifying the main server loop.&lt;/p&gt;

&lt;p&gt;Here's a simplified example of what one handler looks like conceptually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌───────────────────────────────────────┐
│  FhirSearchInput (Pydantic Model)     │
│  ├── query: str                       │
│  ├── version: Optional[str]           │
│  ├── kind: Optional[str]              │
│  └── top_n: int = 10                  │
├───────────────────────────────────────┤
│  fhir_search_handler(input) -&amp;gt; list   │
│  └── Calls into SQLite FTS5 search    │
├───────────────────────────────────────┤
│  Tool("fhir.search", model, handler)  │
│  └── Registered in TOOL_REGISTRY      │
└───────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
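
&lt;p&gt;As a runnable sketch of that same shape, here is the pattern with stdlib dataclasses standing in for Pydantic; the names and the stub handler body are illustrative:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable, Optional

# The same shape as the diagram above, with stdlib dataclasses standing
# in for Pydantic. Names and the stub handler body are illustrative.

@dataclass
class FhirSearchInput:
    query: str
    version: Optional[str] = None
    kind: Optional[str] = None
    top_n: int = 10

    def __post_init__(self):
        # Validation runs before any handler logic sees the input.
        if not self.query.strip():
            raise ValueError("query must be a non-empty string")

@dataclass
class Tool:
    name: str
    input_model: type
    handler: Callable

def fhir_search_handler(params: FhirSearchInput) -> list:
    # The real handler would call into the SQLite FTS5 search layer.
    return [{"name": "Observation", "version": params.version or "R4"}]

fhir_search_tool = Tool("fhir.search", FhirSearchInput, fhir_search_handler)

result = fhir_search_tool.handler(FhirSearchInput(query="blood pressure"))
print(result)
```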



&lt;h3&gt;
  
  
  Decision 4: Registry Pattern for Discovery
&lt;/h3&gt;

&lt;p&gt;MCP requires the server to respond to &lt;code&gt;list_tools&lt;/code&gt;, &lt;code&gt;list_resources&lt;/code&gt;, and &lt;code&gt;list_prompts&lt;/code&gt; requests. The client needs to know what's available before it can call anything.&lt;/p&gt;

&lt;p&gt;We used a simple dictionary registry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;TOOL_REGISTRY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fhir.get_definition&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;fhir_get_definition_tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fhir.search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;fhir_search_tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ig.list&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ig_list_tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is deliberately low-tech. No decorators, no metaclasses, no auto-discovery. Just a dictionary. When the transport layer receives &lt;code&gt;list_tools&lt;/code&gt;, it returns the keys. When it receives &lt;code&gt;invoke_tool&lt;/code&gt;, it looks up the tool by name and calls it.&lt;/p&gt;

&lt;p&gt;Why not something fancier? Because &lt;strong&gt;we wanted to see the full list of tools in one place&lt;/strong&gt;. When you're building an MCP server, the tool inventory is your API surface. Making it explicit and visible in a single file means any developer can open that one file and understand the entire capability set of the server.&lt;/p&gt;
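
&lt;p&gt;The whole dispatch path fits in a few lines. In this sketch the lambda handlers are stand-ins for the real tool objects:&lt;/p&gt;

```python
# Dispatch sketch: list_tools returns the registry keys, invoke_tool looks
# the tool up by name. The lambda handlers are stand-ins for real tools.
TOOL_REGISTRY = {
    "fhir.search": lambda params: {"result": ["Patient", "Observation"]},
    "ig.list": lambda params: {"result": ["hl7.fhir.us.core#5.0.1"]},
}

def list_tools():
    return sorted(TOOL_REGISTRY)

def invoke_tool(name, params):
    if name not in TOOL_REGISTRY:
        return {"error": f"unknown tool: {name}"}
    return TOOL_REGISTRY[name](params)

print(list_tools())  # ['fhir.search', 'ig.list']
print(invoke_tool("fhir.search", {}))  # {'result': ['Patient', 'Observation']}
```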

&lt;h3&gt;
  
  
  Decision 5: Transport as a Thin Layer
&lt;/h3&gt;

&lt;p&gt;The transport layer (stdio, HTTP) should do as little as possible. Its job is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read a JSON-RPC request from the wire (stdin or HTTP body).&lt;/li&gt;
&lt;li&gt;Route it to the right handler.&lt;/li&gt;
&lt;li&gt;Write the JSON-RPC response back.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All business logic lives in the handlers. All data access lives in the storage layer. The transport is just plumbing.&lt;/p&gt;

&lt;p&gt;This was validated when we added HTTP transport for development. The handler code didn't change at all. We just wrote a new way to receive requests and send responses. The HTTP server even reuses the same tool registry and the same routing logic.&lt;/p&gt;
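
&lt;p&gt;A minimal sketch of that plumbing, assuming line-delimited JSON-RPC framing (real MCP framing differs in details, and the method names here are illustrative):&lt;/p&gt;

```python
import json
import sys

# Thin-transport sketch: read a JSON-RPC message, route it, write the
# response back. Line-delimited framing and the method names are
# illustrative; real MCP framing differs in details.

def route(request, registry):
    method = request.get("method")
    if method == "list_tools":
        result = sorted(registry)
    elif method == "invoke_tool":
        p = request.get("params", {})
        result = registry[p["name"]](p.get("arguments", {}))
    else:
        return {"jsonrpc": "2.0", "id": request.get("id"),
                "error": {"code": -32601, "message": "method not found"}}
    return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}

def stdio_loop(registry):
    # stdio transport: one message per line. An HTTP transport would
    # reuse route() unchanged and only swap the read/write plumbing.
    for line in sys.stdin:
        response = route(json.loads(line), registry)
        sys.stdout.write(json.dumps(response) + "\n")
        sys.stdout.flush()

registry = {"fhir.search": lambda args: ["Patient"]}
print(route({"jsonrpc": "2.0", "id": 1, "method": "list_tools"}, registry))
```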

&lt;p&gt;The architecture looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────┐
│                  TRANSPORT LAYER                │
│  ┌───────────────┐    ┌──────────────────────┐  │
│  │  stdio (prod) │    │  HTTP (dev/testing)  │  │
│  └──────┬────────┘    └──────────┬───────────┘  │
│         │                        │              │
│         └───────────┬────────────┘              │
│                     ▼                           │
│           ┌─────────────────┐                   │
│           │  Request Router │                   │
│           └────────┬────────┘                   │
│                    │                            │
├────────────────────┼────────────────────────────┤
│              REGISTRY LAYER                     │
│  ┌──────────┬──────┴──────┬───────────┐         │
│  │  Tools   │  Resources  │  Prompts  │         │
│  └────┬─────┘             └───────────┘         │
│       │                                         │
├───────┼─────────────────────────────────────────┤
│       │          HANDLER LAYER                  │
│  ┌────┴─────────────────────────────────┐       │
│  │  fhir.get_definition                 │       │
│  │  fhir.search                         │       │
│  │  ig.list                             │       │
│  │  uscore.get_profile                  │       │
│  │  fhir.diff_versions                  │       │
│  │  validate.instance                   │       │
│  └────┬─────────────────────────────────┘       │
│       │                                         │
├───────┼─────────────────────────────────────────┤
│       │         PACKAGES LAYER                  │
│  ┌────┴──────────────────────────────────────┐  │
│  │  fhir_index (loaders, normalize, search,  │  │
│  │             storage)                      │  │
│  │  fhir_diff, fhir_validate, uri_scheme     │  │
│  │  shared (models, cache, schemas)          │  │
│  └────┬──────────────────────────────────────┘  │
│       │                                         │
│       ▼                                         │
│  ┌──────────────┐                               │
│  │  SQLite/PG   │                               │
│  │  (FTS index) │                               │
│  └──────────────┘                               │
└─────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Hardest Lesson: Designing for an AI Caller is Different
&lt;/h2&gt;

&lt;p&gt;Here's something that surprised us. When you build a traditional API, your caller is a human developer who reads documentation, understands your mental model, and crafts requests thoughtfully.&lt;/p&gt;

&lt;p&gt;When your caller is an AI, everything changes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool naming matters enormously.&lt;/strong&gt; We learned that names like &lt;code&gt;fhir.get_definition&lt;/code&gt; and &lt;code&gt;fhir.search&lt;/code&gt; aren't just organizational — they're what the AI uses to decide &lt;em&gt;which tool to call&lt;/em&gt;. A vague name like &lt;code&gt;lookup&lt;/code&gt; or &lt;code&gt;query&lt;/code&gt; would lead to the AI guessing wrong. Namespaced, descriptive names (&lt;code&gt;fhir.get_definition&lt;/code&gt;, &lt;code&gt;uscore.get_profile&lt;/code&gt;, &lt;code&gt;fhir.diff_versions&lt;/code&gt;) gave the AI clear signals about when to use each tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input schemas are the AI's documentation.&lt;/strong&gt; The Pydantic model for each tool isn't just for validation — it's what the AI reads to understand what inputs are expected. Field names, types, and defaults all serve as implicit documentation. We named fields like &lt;code&gt;version&lt;/code&gt;, &lt;code&gt;kind&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;top_n&lt;/code&gt; rather than abbreviations like &lt;code&gt;v&lt;/code&gt;, &lt;code&gt;k&lt;/code&gt;, &lt;code&gt;n&lt;/code&gt;, &lt;code&gt;limit&lt;/code&gt; because the AI interprets these names to understand their meaning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Return shape consistency matters.&lt;/strong&gt; Every tool returns a dict with predictable keys. The AI learns patterns quickly — if one tool returns &lt;code&gt;{"meta": {...}}&lt;/code&gt; and another returns &lt;code&gt;{"result": [...]}&lt;/code&gt;, it adapts. But inconsistency within a single tool across different call patterns (sometimes returning a list, sometimes a dict, sometimes a string) confuses it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Truncation is a feature, not a bug.&lt;/strong&gt; FHIR StructureDefinitions can be enormous — tens of thousands of characters of nested JSON. Sending the full thing back would blow the AI's context window. We learned to truncate payloads by default and only include the full JSON when explicitly requested (&lt;code&gt;include_json: true&lt;/code&gt;), and even then, cap it at a reasonable size.&lt;/p&gt;
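
&lt;p&gt;A sketch of that idea (the cap size and field names are illustrative):&lt;/p&gt;

```python
import json

# Truncation sketch: summarize by default, include raw JSON only on
# request, and cap it even then. MAX_CHARS and field names are illustrative.
MAX_CHARS = 2000

def render_resource(resource, include_json=False):
    summary = {"name": resource.get("name"), "kind": resource.get("resourceType")}
    if include_json:
        raw = json.dumps(resource)
        if len(raw) > MAX_CHARS:
            raw = raw[:MAX_CHARS] + "... [truncated]"
        summary["json"] = raw
    return summary

big = {"resourceType": "StructureDefinition", "name": "Patient",
       "snapshot": ["element"] * 5000}
print(len(render_resource(big, include_json=True)["json"]))  # 2015
```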




&lt;h2&gt;
  
  
  What We Didn't Build (And Why)
&lt;/h2&gt;

&lt;p&gt;Equally important to what we built is what we deliberately left out of v0.1:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No authentication.&lt;/strong&gt; This is a local-first, single-user tool. Auth would add complexity for zero benefit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No write operations.&lt;/strong&gt; The AI can look things up, not modify them. This was a safety and simplicity choice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No network calls at runtime.&lt;/strong&gt; Packages are fetched and indexed offline. The running server is fully air-gapped.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No custom FHIR SDK.&lt;/strong&gt; We considered using existing FHIR Python libraries but decided raw JSON + SQLite was simpler, faster, and gave us full control over what we indexed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No schema validation at the FHIR level.&lt;/strong&gt; We have a &lt;code&gt;validate.instance&lt;/code&gt; tool, but it's deliberately a stub. Proper FHIR validation is an enormous problem (profiles, extensions, invariants, terminology binding). We wanted the tool to exist in the interface — to signal future intent — without pretending we'd solved it.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Setting Up: The Toolchain Choices
&lt;/h2&gt;

&lt;p&gt;A few notes on tooling, because they shaped the developer experience:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python 3.13+ with &lt;code&gt;uv&lt;/code&gt;&lt;/strong&gt;: We chose Python because FHIR is a data-heavy domain and Python's ecosystem for data manipulation is unmatched. We used &lt;code&gt;uv&lt;/code&gt; for dependency management — it's fast, it respects &lt;code&gt;pyproject.toml&lt;/code&gt;, and it doesn't fight you. No &lt;code&gt;requirements.txt&lt;/code&gt; files, no virtualenv scripts. Just &lt;code&gt;uv sync&lt;/code&gt; and go.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pydantic v2&lt;/strong&gt;: For input validation and data modeling. Pydantic v2 is significantly faster than v1 and integrates cleanly with &lt;code&gt;pydantic-settings&lt;/code&gt; for environment-based configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SQLite with FTS5&lt;/strong&gt;: For the search index. SQLite is zero-config, ships with Python, and FTS5 gives us full-text search without standing up Elasticsearch. For a local-first tool, this is perfect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;orjson&lt;/code&gt;&lt;/strong&gt;: For JSON serialization/deserialization. FHIR resources are large JSON objects, and &lt;code&gt;orjson&lt;/code&gt; is measurably faster than the stdlib &lt;code&gt;json&lt;/code&gt; module. In a server that's mostly reading and writing JSON, this matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  Coming Up in Part 2
&lt;/h2&gt;

&lt;p&gt;In the next post, we'll get into the actual implementation: how we built the indexer, designed the URI scheme, implemented the tool handlers, and wired everything together through the transport layer. We'll share the specific patterns that worked (and the ones we had to throw away).&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 1 of a 3-part series.&lt;/em&gt;&lt;br&gt;
&lt;a href="https://dev.to/chaets/mcp-the-missing-layer-between-ai-and-your-application-fdj"&gt;Part 0: MCP — The Missing Layer Between AI and Your Application →&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://dev.to/chaets/part-1-why-we-built-an-mcp-server-and-what-we-learned-before-writing-a-single-line-of-code-4mao"&gt;Part 1: Why We Built an MCP Server — And What We Learned Before Writing a Single Line of Code&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://dev.to/chaets/part-2-building-the-engine-tools-uris-and-the-art-of-indexing-fhir-fi1"&gt;Part 2: Building the Engine — Tools, URIs, and the Art of Indexing FHIR&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://dev.to/chaets/part-3-testing-deploying-and-lessons-learned-aa5"&gt;Part 3: Testing, Deploying, and Lessons Learned -&amp;gt; coming soon&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’d like to connect, find me on &lt;a href="https://www.linkedin.com/in/chaets/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or drop me a message; I’d love to explore how I can help drive your data success!&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>interoperability</category>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>Part 0: MCP — The Missing Layer Between AI and Your Application</title>
      <dc:creator>Chetan Gupta</dc:creator>
      <pubDate>Sat, 21 Feb 2026 15:07:31 +0000</pubDate>
      <link>https://forem.com/chaets/mcp-the-missing-layer-between-ai-and-your-application-fdj</link>
      <guid>https://forem.com/chaets/mcp-the-missing-layer-between-ai-and-your-application-fdj</guid>
      <description>&lt;p&gt;&lt;em&gt;A prequel to my three-part series on building an MCP server. This post stands on its own — no code, no codebase required. Just the idea that changed how we think about AI integration.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  AI Has a Context Problem
&lt;/h2&gt;

&lt;p&gt;Let's start with an uncomfortable truth: the AI you're chatting with right now doesn't know your application.&lt;/p&gt;

&lt;p&gt;It doesn't know your database schema. It doesn't know which API version you're running in production. It doesn't know that your team renamed &lt;code&gt;user_id&lt;/code&gt; to &lt;code&gt;account_id&lt;/code&gt; six months ago, or that your FHIR implementation uses US Core 5.0.1, not 6.1.0, or that the &lt;code&gt;Observation&lt;/code&gt; resource in your system carries a custom extension for lab accession numbers.&lt;/p&gt;

&lt;p&gt;The AI knows &lt;em&gt;a lot about the world in general&lt;/em&gt;. But it knows &lt;em&gt;almost nothing about your world in particular&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;And this isn't a failure of AI. It's a failure of plumbing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Way We Integrate AI Today Is Backwards
&lt;/h2&gt;

&lt;p&gt;Think about how most teams add AI to their workflow today:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Copy some context from your app (a schema, a log snippet, an error message).&lt;/li&gt;
&lt;li&gt;Paste it into an AI chat window.&lt;/li&gt;
&lt;li&gt;Hope the AI interprets it correctly.&lt;/li&gt;
&lt;li&gt;Read the response and mentally cross-reference it against reality.&lt;/li&gt;
&lt;li&gt;Repeat.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the human-as-middleware pattern. &lt;strong&gt;You&lt;/strong&gt; are the integration layer between the AI and your application. You ferry data back and forth, translate context, and validate every response because the AI has no independent way to check its own answers.&lt;/p&gt;

&lt;p&gt;It works. Kind of. But it doesn't scale. And in domains where precision matters — healthcare, finance, infrastructure, compliance — "kind of works" is a liability.&lt;/p&gt;

&lt;p&gt;Consider what happens when a developer asks an AI assistant:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"What are the required fields in a FHIR R4 Patient resource?"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The AI might answer from its training data. Maybe it's right. Maybe it's describing R3 fields. Maybe it's mixing in elements from a US Core profile without saying so. Maybe it hallucinated a field that never existed. The developer has no way to tell without opening the specification themselves — which defeats the purpose of asking the AI in the first place.&lt;/p&gt;

&lt;p&gt;Now imagine the AI could do this instead:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Let me look that up."&lt;/em&gt;&lt;br&gt;
&lt;em&gt;(calls &lt;code&gt;fhir.get_definition&lt;/code&gt; with version=R4, kind=StructureDefinition, name=Patient)&lt;/em&gt;&lt;br&gt;
&lt;em&gt;"Here's the Patient resource from the R4 specification. The required elements are..."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Same question. But now the answer is grounded in the actual specification, not a statistical approximation of it. The AI didn't guess — it looked it up. Just like you would.&lt;/p&gt;

&lt;p&gt;That's what MCP enables.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is MCP, Actually?
&lt;/h2&gt;

&lt;p&gt;MCP stands for &lt;strong&gt;Model Context Protocol&lt;/strong&gt;. It's an open protocol — originally developed by Anthropic and now an open standard — that defines how AI models communicate with external tools, data sources, and services.&lt;/p&gt;

&lt;p&gt;But that description buries the lede. Here's what MCP actually is in practice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP is a contract between an AI and the systems it can interact with.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That contract has three parts:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Tools — "Here are functions you can call"
&lt;/h3&gt;

&lt;p&gt;A tool is a typed function that the AI can invoke. You define the name, the inputs (with types and descriptions), and what it returns. The AI sees this contract and decides when to call the tool during a conversation.&lt;/p&gt;

&lt;p&gt;Think of it like giving the AI an API client — but instead of REST endpoints with ambiguous documentation, each tool has a strict schema that the AI can reason about.&lt;/p&gt;

&lt;p&gt;Example of what a tool &lt;em&gt;means&lt;/em&gt; to the AI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"I have a tool called fhir.search.
 It takes a query string and optional filters.
 It returns a list of matching FHIR resources.
 I should use this when the user asks about FHIR resources
 and I'm not sure of the exact name or want to explore."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI isn't reading documentation to figure this out. The tool's name, its input field names, its types — all of that &lt;em&gt;is&lt;/em&gt; the documentation. The schema is the interface.&lt;/p&gt;
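
&lt;p&gt;Concretely, a tool declaration is just a name, a description, and a JSON Schema for the inputs. A hand-written example of the shape (field contents here are illustrative):&lt;/p&gt;

```python
import json

# What the AI actually sees when it lists tools: a name, a description,
# and a JSON Schema for the inputs. Field contents here are illustrative.
tool_declaration = {
    "name": "fhir.search",
    "description": "Full-text search over locally indexed FHIR resources.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "version": {"type": "string"},
            "top_n": {"type": "integer", "default": 10},
        },
        "required": ["query"],
    },
}
print(json.dumps(tool_declaration, indent=2))
```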

&lt;h3&gt;
  
  
  2. Resources — "Here is data you can read"
&lt;/h3&gt;

&lt;p&gt;Resources are read-only data items identified by URIs. Unlike tools (which are actions), resources are data you can look at. The AI can request a resource by URI and get back structured content.&lt;/p&gt;

&lt;p&gt;Think of resources as a filesystem the AI can browse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fhir://R4/StructureDefinition/Patient     → the Patient definition
fhir://R5/StructureDefinition/Observation → the Observation definition
uscore://5.0.1/StructureDefinition/us-core-patient → the US Core Patient profile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI doesn't need to know where these live on disk or how they're stored. It just requests a URI and gets data back.&lt;/p&gt;
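
&lt;p&gt;On the server side, resolving such a URI is ordinary parsing. A sketch using &lt;code&gt;urllib&lt;/code&gt; (the scheme layout mirrors the examples above; the helper name is made up):&lt;/p&gt;

```python
from urllib.parse import urlsplit

# Server-side sketch of resolving the URIs above. The scheme layout
# mirrors the examples; the helper name is made up.
def parse_fhir_uri(uri):
    parts = urlsplit(uri)
    _, kind, name = parts.path.split("/")  # e.g. "/StructureDefinition/Patient"
    return {"scheme": parts.scheme, "version": parts.netloc,
            "kind": kind, "name": name}

print(parse_fhir_uri("fhir://R4/StructureDefinition/Patient"))
# {'scheme': 'fhir', 'version': 'R4', 'kind': 'StructureDefinition', 'name': 'Patient'}
```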

&lt;h3&gt;
  
  
  3. Prompts — "Here's how to approach a task"
&lt;/h3&gt;

&lt;p&gt;Prompts are reusable templates that guide the AI on how to use tools and present results. They're the "playbook" that says: "When someone asks you to summarize a FHIR profile, here's the approach..."&lt;/p&gt;

&lt;p&gt;Prompts are the least understood part of MCP, but they're important. They bridge the gap between raw tool output (structured data) and what the human actually needs (an explanation, a comparison, a recommendation).&lt;/p&gt;




&lt;h2&gt;
  
  
  Why MCP Matters for Application Development
&lt;/h2&gt;

&lt;p&gt;Here's the argument I want to make: &lt;strong&gt;every non-trivial application should eventually expose an MCP interface.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not because it's trendy. Because the alternative — expecting AI to understand your application from general knowledge — will increasingly become a bottleneck.&lt;/p&gt;

&lt;p&gt;Let me make the case through five observations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Observation 1: AI is already in your team's workflow
&lt;/h3&gt;

&lt;p&gt;Whether you've officially "adopted AI" or not, your developers are using Claude, ChatGPT, Copilot, or Cursor every day. They're asking it about your codebase, your APIs, your domain. And the AI is answering from general knowledge — which means it's getting your specifics wrong a non-trivial percentage of the time.&lt;/p&gt;

&lt;p&gt;MCP lets you meet the AI where it already is. Instead of fighting the fact that developers use AI, you make the AI more useful by giving it access to your actual systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Observation 2: Context stuffing doesn't scale
&lt;/h3&gt;

&lt;p&gt;The common workaround for AI's lack of context is to paste relevant information into the prompt. "Here's my schema. Here's the error log. Here's the config file." This is context stuffing, and it has hard limits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context window limits.&lt;/strong&gt; Even with 200K token models, you can't paste your entire codebase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relevance filtering.&lt;/strong&gt; The human has to decide what's relevant &lt;em&gt;before&lt;/em&gt; asking the question, which assumes they already know the answer's shape.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Staleness.&lt;/strong&gt; The pasted context is a snapshot. If the schema changed yesterday and you pasted last week's version, the AI's answer is wrong.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MCP replaces context stuffing with &lt;strong&gt;context fetching&lt;/strong&gt;. The AI asks for what it needs, when it needs it, from the live source. No human in the loop. No stale snapshots.&lt;/p&gt;

&lt;h3&gt;
  
  
  Observation 3: Structured tools beat unstructured context
&lt;/h3&gt;

&lt;p&gt;There's a fundamental difference between giving an AI a blob of text and giving it a typed tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unstructured context:&lt;/strong&gt; "Here's a JSON file with 3,000 lines of FHIR StructureDefinitions. Somewhere in there is the information about the Patient resource."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured tool:&lt;/strong&gt; "Call &lt;code&gt;fhir.get_definition(version='R4', kind='StructureDefinition', name='Patient')&lt;/code&gt; and you'll get exactly the Patient definition with metadata."&lt;/p&gt;

&lt;p&gt;The unstructured approach makes the AI do the work of parsing, searching, and disambiguating. The structured approach makes the &lt;em&gt;server&lt;/em&gt; do that work — where it can use proper indexing, query optimization, and validation — and gives the AI a clean result.&lt;/p&gt;

&lt;p&gt;This is the same lesson the industry learned with databases decades ago. You don't give users a flat file and tell them to grep for what they need. You give them a query interface. MCP is the query interface for AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Observation 4: AI clients are converging on MCP
&lt;/h3&gt;

&lt;p&gt;Claude Desktop supports MCP natively. Cursor supports MCP. VS Code is adding MCP support. The ecosystem is converging on this protocol as the standard way for AI assistants to interact with external systems.&lt;/p&gt;

&lt;p&gt;This means building an MCP server isn't a bet on one AI provider. It's an investment that works across every MCP-compatible client. Write once, work everywhere — the same server handles Claude, Cursor, and whatever comes next.&lt;/p&gt;

&lt;h3&gt;
  
  
  Observation 5: The best time to build an MCP server is before you need one
&lt;/h3&gt;

&lt;p&gt;Here's a pattern we see:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A team starts using AI for development.&lt;/li&gt;
&lt;li&gt;AI gives wrong answers about the team's specific domain.&lt;/li&gt;
&lt;li&gt;The team compensates with manual context stuffing and mental fact-checking.&lt;/li&gt;
&lt;li&gt;Months pass. The workarounds become exhausting.&lt;/li&gt;
&lt;li&gt;Someone says "we should build a tool for this."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The teams that build the MCP server at step 2 save months of accumulated friction. The ones that wait until step 5 have to retrofit it while already being frustrated.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Motivation Behind My Project
&lt;/h2&gt;

&lt;p&gt;I work in healthcare interoperability. My domain is FHIR — the standard that governs how health data is structured and exchanged between systems. It's a specification that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has &lt;strong&gt;hundreds of resource types&lt;/strong&gt; (Patient, Observation, Condition, MedicationRequest, ...).&lt;/li&gt;
&lt;li&gt;Spans &lt;strong&gt;multiple versions&lt;/strong&gt; (R4, R4B, R5) with subtle but important differences between them.&lt;/li&gt;
&lt;li&gt;Is extended by &lt;strong&gt;Implementation Guides&lt;/strong&gt; (US Core, Da Vinci, mCODE, ...) that add constraints, profiles, and extensions.&lt;/li&gt;
&lt;li&gt;Is deeply &lt;strong&gt;structural&lt;/strong&gt; — a StructureDefinition has elements, types, cardinality constraints, slicing rules, invariants, and bindings to terminology.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly the kind of domain where AI confidently gives &lt;em&gt;almost-right&lt;/em&gt; answers. And in healthcare, almost-right is dangerous. A developer who implements a resource mapping based on a hallucinated field name creates a real interoperability bug — one that might not surface until clinical data flows through the wrong path.&lt;/p&gt;

&lt;p&gt;We needed the AI to stop guessing and start looking things up.&lt;/p&gt;

&lt;p&gt;But we also wanted something broader than a single-purpose tool. We wanted to validate an approach: &lt;strong&gt;can you take a complex, versioned, deeply structured specification and make it available to AI in a way that's fast, local, and useful?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The answer is yes. And the approach generalizes.&lt;/p&gt;




&lt;h2&gt;
  
  
  MCP Is Not Just for FHIR
&lt;/h2&gt;

&lt;p&gt;Everything we built for FHIR could be applied to any domain with these characteristics:&lt;/p&gt;

&lt;h3&gt;
  
  
  Complex, versioned specifications
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenAPI/Swagger specs&lt;/strong&gt;: An MCP server that lets AI look up your API endpoints, request/response schemas, and versioning — from the actual spec file, not from memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database schemas&lt;/strong&gt;: An MCP server that queries your database metadata (tables, columns, types, relationships, indexes) so the AI can write correct SQL without you pasting the schema.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure-as-Code&lt;/strong&gt;: An MCP server that reads your Terraform state, CloudFormation templates, or Kubernetes manifests so the AI understands your actual infrastructure, not a generic tutorial version.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Regulatory or compliance frameworks
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HIPAA, SOC2, GDPR&lt;/strong&gt;: An MCP server that lets AI look up specific regulatory requirements, controls, and your organization's compliance status.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clinical terminology&lt;/strong&gt;: SNOMED CT, LOINC, ICD-10 — enormous code systems that AI can't memorize but could search and retrieve through tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Internal knowledge
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Internal documentation&lt;/strong&gt;: An MCP server that indexes your team's runbooks, architecture decision records, and onboarding guides.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration management&lt;/strong&gt;: An MCP server that reads your application's feature flags, environment configs, and deployment status.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern is always the same:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌───────────────────────────────────────────────────┐
│              Your Domain Knowledge                │
│                                                   │
│  Specifications, schemas, configs, terminology,   │
│  documentation, compliance requirements, ...      │
└───────────────────┬───────────────────────────────┘
                    │
                    ▼
┌───────────────────────────────────────────────────┐
│              Indexer / Loader                     │
│                                                   │
│  Extract, normalize, store in a searchable index  │
└───────────────────┬───────────────────────────────┘
                    │
                    ▼
┌───────────────────────────────────────────────────┐
│              MCP Server                           │
│                                                   │
│  Tools: lookup, search, compare, validate         │
│  Resources: addressable items via URIs            │
│  Prompts: guidance on how to use the output       │
└───────────────────┬───────────────────────────────┘
                    │
                    ▼
┌───────────────────────────────────────────────────┐
│              AI Client                            │
│                                                   │
│  Claude Desktop, Cursor, VS Code, custom apps...  │
│  Calls tools, reads resources, follows prompts    │
└───────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What Changes When AI Can Look Things Up
&lt;/h2&gt;

&lt;p&gt;When we shipped the first working version of our FHIR MCP server and plugged it into Claude Desktop, something shifted in how we worked.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before MCP:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;"Claude, what elements are in FHIR R4 Patient?" → &lt;em&gt;Read response, open spec to verify, correct two errors, paste corrections back&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;"What's different about Observation between R4 and R5?" → &lt;em&gt;Claude gives a plausible but unverifiable answer. Spend 20 minutes diffing specs manually.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;"Does US Core require Patient.identifier?" → &lt;em&gt;Claude says yes confidently. Is it right? Open the IG, find the profile, check the cardinality. Claude was right this time, but you had to check.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  After MCP:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;"Claude, what elements are in FHIR R4 Patient?" → &lt;em&gt;Claude calls &lt;code&gt;fhir.get_definition&lt;/code&gt;, returns the actual definition, summarizes it. No need to verify — it's from the spec.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;"What's different about Observation between R4 and R5?" → &lt;em&gt;Claude calls &lt;code&gt;fhir.diff_versions&lt;/code&gt;, gets the actual differences, explains them.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;"Does US Core require Patient.identifier?" → &lt;em&gt;Claude calls &lt;code&gt;uscore.get_profile&lt;/code&gt;, reads the constraint, answers with the actual cardinality and must-support flag.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mental overhead disappeared. Not partially — &lt;em&gt;entirely&lt;/em&gt;. We stopped being the middleware between the AI and the specification. The AI handled it.&lt;/p&gt;

&lt;p&gt;And here's the subtle thing: &lt;strong&gt;we started asking better questions.&lt;/strong&gt; When you trust that the AI's answers are grounded, you ask more ambitious questions. You ask follow-ups. You explore edge cases. The conversation becomes collaborative instead of adversarial.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Counterarguments (And Why We Disagree)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  "Just use a bigger context window"
&lt;/h3&gt;

&lt;p&gt;Context windows are getting larger, and some people argue that you should just dump everything into the prompt. But this misses several points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bigger context ≠ better retrieval.&lt;/strong&gt; Studies consistently show that models struggle to find specific information in very long contexts ("lost in the middle" problem). A targeted tool call beats a 200K-token haystack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost scales with context.&lt;/strong&gt; Larger prompts cost more per request. A tool call that returns 500 tokens of targeted data is cheaper than pre-loading 50,000 tokens of "just in case" context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency scales with context.&lt;/strong&gt; Time-to-first-token increases with prompt length. Small, focused tool calls keep the conversation snappy.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  "Just use RAG"
&lt;/h3&gt;

&lt;p&gt;RAG is great for unstructured documents. But when your data is structured — schemas, specifications, typed resources — RAG's embedding-and-chunk approach loses structural relationships. You can't meaningfully embed a 40KB JSON StructureDefinition and expect cosine similarity to find "the cardinality of Patient.identifier.system."&lt;/p&gt;

&lt;p&gt;MCP tools can do targeted, structured queries. RAG can't. They're complementary, but for structured domains, MCP is the right tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  "We'll wait for AI to get better"
&lt;/h3&gt;

&lt;p&gt;AI will get better. Models will memorize more. But the long tail of domain-specific, versioned, organization-specific knowledge will always exceed what's in training data. Your database schema isn't in GPT-5's training set. Your FHIR IG published last month isn't either. MCP bridges this gap regardless of how smart the model gets.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Building an MCP server is too much work"
&lt;/h3&gt;

&lt;p&gt;Our first working version was ~500 lines of Python across the server, handlers, and transport. The indexer was ~100 lines. We used SQLite (ships with Python), Pydantic (one pip install), and JSON-RPC (a trivial protocol). No infrastructure. No cloud services. No frameworks.&lt;/p&gt;

&lt;p&gt;If you can build a CLI tool, you can build an MCP server. The protocol is simpler than REST.&lt;/p&gt;
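&lt;p&gt;To make the "simpler than REST" claim concrete, here is a minimal sketch of JSON-RPC request handling in Python. It is not the full MCP protocol (a real server reads newline-delimited requests from stdin and also implements the initialize handshake and tool discovery); it shows only the core request/response shape.&lt;/p&gt;

```python
import json

# Minimal JSON-RPC 2.0 request handling, the message shape MCP uses.
def handle(method, params):
    if method == "ping":
        return "pong"
    raise ValueError("unknown method: " + method)

def serve(lines):
    """Process a stream of JSON-RPC request lines, return response dicts."""
    responses = []
    for line in lines:
        req = json.loads(line)
        try:
            responses.append({"jsonrpc": "2.0", "id": req["id"],
                              "result": handle(req["method"], req.get("params", {}))})
        except ValueError as exc:
            responses.append({"jsonrpc": "2.0", "id": req["id"],
                              "error": {"code": -32601, "message": str(exc)}})
    return responses

print(serve(['{"jsonrpc": "2.0", "id": 1, "method": "ping"}'])[0]["result"])   # pong
```

&lt;p&gt;That loop, plus a dictionary of tool handlers, is most of a working server.&lt;/p&gt;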




&lt;h2&gt;
  
  
  How to Think About Your First MCP Server
&lt;/h2&gt;

&lt;p&gt;If you're considering building an MCP server, here's the decision framework we'd recommend:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Identify the "fact-checking tax"
&lt;/h3&gt;

&lt;p&gt;Where does your team spend time verifying AI outputs against ground truth? Every time someone copies a schema into a prompt, checks an API response against documentation, or says "let me verify that" after reading an AI answer — that's the tax. The bigger the tax, the stronger the case for MCP.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Identify the data source
&lt;/h3&gt;

&lt;p&gt;What's the ground truth? A specification? A database? An API? A set of configuration files? This is what your MCP server will index or query.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Identify the operations
&lt;/h3&gt;

&lt;p&gt;What does the AI need to do with that data? Usually it's some combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lookup&lt;/strong&gt;: Get a specific item by identifier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search&lt;/strong&gt;: Find items matching a query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compare&lt;/strong&gt;: Diff two versions or configurations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate&lt;/strong&gt;: Check if something conforms to a specification.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;List&lt;/strong&gt;: Enumerate available items.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these becomes an MCP tool.&lt;/p&gt;
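&lt;p&gt;Sketched as a tool registry, the five operations might look like this. The tool names and input schemas are hypothetical, intended only to show how each operation maps to a typed, discoverable tool.&lt;/p&gt;

```python
# Illustrative tool registry: each operation from the list above becomes a
# named tool with a typed input schema the AI client can inspect.
TOOLS = {
    "lookup": {"description": "Get a specific item by identifier.",
               "input": {"id": "string"}},
    "search": {"description": "Find items matching a query.",
               "input": {"query": "string", "limit": "integer"}},
    "compare": {"description": "Diff two versions of an item.",
                "input": {"id": "string", "from_version": "string",
                          "to_version": "string"}},
    "validate": {"description": "Check an item against the specification.",
                 "input": {"payload": "object"}},
    "list": {"description": "Enumerate available items.",
             "input": {"kind": "string"}},
}

def list_tools():
    """What an MCP client sees when it asks the server for its tools."""
    return sorted(TOOLS)

print(list_tools())
```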

&lt;h3&gt;
  
  
  Step 4: Start with one tool
&lt;/h3&gt;

&lt;p&gt;Don't build all six tools on day one. Build the lookup tool. Get it working in Claude Desktop or Cursor. Use it for a week. You'll immediately discover what the second tool should be.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Iterate based on what the AI gets wrong
&lt;/h3&gt;

&lt;p&gt;Watch how the AI uses your tools. When it calls the wrong tool, that's a signal that your tool names or schemas need clarification. When it sends bad inputs, that's a signal that your input model needs better field names or defaults. When it presents the output poorly, that's a signal that you need a prompt.&lt;/p&gt;

&lt;p&gt;MCP servers are living things. They improve through use.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where This Is Headed
&lt;/h2&gt;

&lt;p&gt;We believe MCP (or something like it) will become standard infrastructure for software teams. Not today, maybe not this year, but soon. The same way that APIs became standard for service-to-service communication, MCP will become standard for AI-to-application communication.&lt;/p&gt;

&lt;p&gt;The teams that build MCP servers early will have a head start. They'll have cleaner tool interfaces, better prompt patterns, and more experience with AI-as-caller design. They'll also have developers who trust their AI assistants because those assistants actually give correct, grounded answers.&lt;/p&gt;

&lt;p&gt;Our FHIR MCP server was a proof of concept. It works. It's useful. And it proved to us that the pattern generalizes. If your domain has complex, structured, versioned knowledge that AI gets wrong — and what domain doesn't? — building an MCP server is one of the highest-leverage investments you can make.&lt;/p&gt;

&lt;p&gt;Once you understand MCP deeply, integrating any new data source, application, or AI context becomes significantly easier.&lt;br&gt;
If you’d like to connect, reach out on &lt;a href="https://www.linkedin.com/in/chaets/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or drop me a message—I’d love to explore how I can help drive your data/AI success!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post is a prequel to our three-part implementation series:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://dev.to/chaets/part-1-why-we-built-an-mcp-server-and-what-we-learned-before-writing-a-single-line-of-code-4mao"&gt;Part 1: Why We Built an MCP Server — And What We Learned Before Writing a Single Line of Code&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://dev.to/chaets/part-2-building-the-engine-tools-uris-and-the-art-of-indexing-fhir-fi1"&gt;Part 2: Building the Engine — Tools, URIs, and the Art of Indexing FHIR&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://dev.to/chaets/part-3-testing-deploying-and-lessons-learned-aa5"&gt;Part 3: Testing, Deploying, and Lessons Learned -&amp;gt; coming soon&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;





</description>
      <category>mcp</category>
      <category>fhir</category>
      <category>interoperability</category>
      <category>ai</category>
    </item>
    <item>
      <title>Anthropic Claude Opus 4.6</title>
      <dc:creator>Chetan Gupta</dc:creator>
      <pubDate>Fri, 06 Feb 2026 03:06:12 +0000</pubDate>
      <link>https://forem.com/chaets/anthropic-claude-opus-46-4808</link>
      <guid>https://forem.com/chaets/anthropic-claude-opus-46-4808</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxm6h72qvretk2q3vq3u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxm6h72qvretk2q3vq3u.png" alt=" " width="800" height="797"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Anthropic has officially released Claude Opus 4.6 — and the benchmark numbers speak volumes. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Performance Highlights&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GDPval-AA Elo&lt;/strong&gt;: Opus 4.6 outperforms its predecessor (Opus 4.5) by ~190 Elo points and beats OpenAI’s GPT-5.2 by ~144 Elo points on economically valuable knowledge-work tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terminal-Bench 2.0 (agentic coding)&lt;/strong&gt;: Achieves a leading score of ~65.4%, placing it at the top of real-world coding and task-automation benchmarks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-context retention&lt;/strong&gt;: On an 8-needle, 1M-token variant of MRCR v2 (a needle-in-a-haystack benchmark), Opus 4.6 scores 76% vs. ~18.5% for Sonnet 4.5 — a massive uplift in long-context retrieval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BigLaw Bench (legal reasoning)&lt;/strong&gt;: Achieves 90.2%, including perfect scores on 40% of tasks and scores above 0.8 on 84%.&lt;/li&gt;
&lt;li&gt;Across internal evaluations, Opus 4.6 leads on deep multi-step reasoning, search, and agentic workflows compared with other frontier models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What this means:&lt;br&gt;
This isn’t just an incremental update — it’s a meaningful leap in real-world task performance for coding, reasoning, multi-agent planning, and large-context work. Whether you’re building AI agents, automating workflows, or tackling enterprise knowledge work, these numbers signal greater reliability and capability on complex tasks.&lt;/p&gt;

&lt;p&gt;Opus 4.6 now sets a new benchmark bar for frontier LLM performance — especially where depth, persistence, and real-world reasoning matter most.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>If you are in the influencer market, whether it’s #tech, #health, #realestate, etc., it doesn’t matter what industry it is; what matters is that you should have an opinion about everything!!! Everything means everything!!!</title>
      <dc:creator>Chetan Gupta</dc:creator>
      <pubDate>Sun, 01 Feb 2026 20:20:27 +0000</pubDate>
      <link>https://forem.com/chaets/if-you-are-in-the-influencer-market-whether-its-tech-health-realestate-etc-it-doesnt-5h8g</link>
      <guid>https://forem.com/chaets/if-you-are-in-the-influencer-market-whether-its-tech-health-realestate-etc-it-doesnt-5h8g</guid>
      <description></description>
    </item>
    <item>
      <title>How One Can Start Their Journey in Data Engineering</title>
      <dc:creator>Chetan Gupta</dc:creator>
      <pubDate>Sat, 10 Jan 2026 19:30:44 +0000</pubDate>
      <link>https://forem.com/chaets/how-one-start-their-journey-in-data-engineering-a77</link>
      <guid>https://forem.com/chaets/how-one-start-their-journey-in-data-engineering-a77</guid>
      <description>&lt;p&gt;Data Engineering is everywhere today. Behind every dashboard, AI model, recommendation system, or business report, there is a data engineer making sure data flows correctly.&lt;/p&gt;

&lt;p&gt;If you’re a &lt;strong&gt;complete newbie&lt;/strong&gt;, the biggest challenge isn’t learning—it’s &lt;strong&gt;knowing where to start&lt;/strong&gt;. The internet is full of roadmaps, tools, and opinions, and it’s easy to feel lost before you even begin.&lt;/p&gt;

&lt;p&gt;This blog gives you a &lt;strong&gt;clear, simple, step-by-step starting point&lt;/strong&gt; for your Data Engineering journey.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. First, Understand What Data Engineering Is (In Simple Words)
&lt;/h2&gt;

&lt;p&gt;Before learning anything technical, understand the role.&lt;/p&gt;

&lt;p&gt;A data engineer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collects data from different sources&lt;/li&gt;
&lt;li&gt;Stores it in an organized way&lt;/li&gt;
&lt;li&gt;Cleans and transforms raw data&lt;/li&gt;
&lt;li&gt;Makes data available for analysis and applications&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Think of data engineers as plumbers of data&lt;/strong&gt;—they build pipelines so data flows smoothly and reliably.
&lt;/h3&gt;

&lt;p&gt;You don’t need to be great at math or AI to start. You need curiosity and consistency.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Don’t Start with Tools — Start with Basics
&lt;/h2&gt;

&lt;p&gt;Many beginners make the mistake of jumping directly into tools like Spark, Kafka, or Airflow. This leads to confusion.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Learn Basic Computer &amp;amp; Data Concepts
&lt;/h3&gt;

&lt;p&gt;You should understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What files are (CSV, JSON)&lt;/li&gt;
&lt;li&gt;What databases are&lt;/li&gt;
&lt;li&gt;What rows and columns mean&lt;/li&gt;
&lt;li&gt;What “data” actually looks like&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This builds confidence before coding.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Learn SQL First (Your Best Friend)
&lt;/h2&gt;

&lt;p&gt;If you learn &lt;strong&gt;only one skill&lt;/strong&gt; to start data engineering, make it SQL.&lt;/p&gt;

&lt;p&gt;SQL helps you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read data&lt;/li&gt;
&lt;li&gt;Filter data&lt;/li&gt;
&lt;li&gt;Group and summarize data&lt;/li&gt;
&lt;li&gt;Join multiple tables&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Start with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SELECT&lt;/li&gt;
&lt;li&gt;WHERE&lt;/li&gt;
&lt;li&gt;ORDER BY&lt;/li&gt;
&lt;li&gt;GROUP BY&lt;/li&gt;
&lt;li&gt;JOIN&lt;/li&gt;
&lt;/ul&gt;
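&lt;p&gt;A quick way to practice all five keywords without installing anything is Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; module. The tables and data below are made up for illustration.&lt;/p&gt;

```python
import sqlite3

# Tiny practice database: every keyword from the list above in one query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER, city TEXT);
CREATE TABLE orders (customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Delhi'), (2, 'Mumbai');
INSERT INTO orders VALUES (1, 100.0), (1, 50.0), (2, 75.0);
""")

# SELECT + JOIN + WHERE + GROUP BY + ORDER BY, all together
rows = conn.execute("""
    SELECT c.city, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    WHERE o.amount BETWEEN 50 AND 200
    GROUP BY c.city
    ORDER BY total DESC
""").fetchall()
print(rows)   # [('Delhi', 150.0), ('Mumbai', 75.0)]
```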

&lt;h3&gt;
  
  
  You don’t need advanced SQL on day one. Simple queries are powerful.
&lt;/h3&gt;




&lt;h2&gt;
  
  
  4. Learn One Programming Language (Python is Best)
&lt;/h2&gt;

&lt;p&gt;You don’t need to be a hardcore programmer.&lt;/p&gt;

&lt;p&gt;With Python, focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Variables and loops&lt;/li&gt;
&lt;li&gt;Functions&lt;/li&gt;
&lt;li&gt;Reading and writing files&lt;/li&gt;
&lt;li&gt;Lists and dictionaries&lt;/li&gt;
&lt;li&gt;Basic error handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Python is used everywhere in data engineering, and it’s beginner-friendly.&lt;/p&gt;
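&lt;p&gt;Here is one tiny script that touches every item on that list; the file name and word-count task are just an example exercise.&lt;/p&gt;

```python
# One small script covering the basics above: variables, a loop,
# a function, file reading/writing, a dictionary, and error handling.

def count_words(path):
    """Read a text file and return a dict of word frequencies."""
    counts = {}                      # dictionary
    try:
        with open(path) as f:        # reading a file
            for line in f:           # loop
                for word in line.split():
                    counts[word] = counts.get(word, 0) + 1
    except FileNotFoundError:        # basic error handling
        print("No such file:", path)
    return counts

# writing a file, then reading it back
with open("sample.txt", "w") as f:
    f.write("data is fun and data is everywhere")

print(count_words("sample.txt"))
```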




&lt;h2&gt;
  
  
  5. Understand How Data Moves (Core Idea of Data Engineering)
&lt;/h2&gt;

&lt;p&gt;Once you know basic SQL and Python, learn &lt;strong&gt;how data flows&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Ask questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where does data come from?&lt;/li&gt;
&lt;li&gt;Where is it stored?&lt;/li&gt;
&lt;li&gt;How is it cleaned?&lt;/li&gt;
&lt;li&gt;Who uses it?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Learn these concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Batch data (run once a day)&lt;/li&gt;
&lt;li&gt;Real-time data (streams)&lt;/li&gt;
&lt;li&gt;ETL (Extract, Transform, Load)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don’t need advanced tools yet—just the idea.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Learn About Data Storage (At a High Level)
&lt;/h2&gt;

&lt;p&gt;Understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What a database is&lt;/li&gt;
&lt;li&gt;What a data warehouse is&lt;/li&gt;
&lt;li&gt;What cloud storage means&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don’t need to master cloud immediately—just know that modern data lives in the cloud.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Build Small, Simple Projects (Very Important)
&lt;/h2&gt;

&lt;p&gt;Learning without building causes fear and confusion.&lt;/p&gt;

&lt;h3&gt;
  
  
  Beginner Project Ideas:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Read a CSV file using Python&lt;/li&gt;
&lt;li&gt;Store data in a database&lt;/li&gt;
&lt;li&gt;Write SQL queries to analyze it&lt;/li&gt;
&lt;li&gt;Clean messy data&lt;/li&gt;
&lt;li&gt;Automate a simple script&lt;/li&gt;
&lt;/ul&gt;
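&lt;p&gt;The project ideas above fit together into one miniature pipeline. The sketch below (file names and data invented for illustration) reads a messy CSV, cleans it, loads it into SQLite, and runs a SQL query over it.&lt;/p&gt;

```python
import csv
import sqlite3

# A complete tiny pipeline: read a CSV, clean it, store it, query it.

# 1. Create a messy CSV to work with
with open("sales.csv", "w", newline="") as f:
    csv.writer(f).writerows([
        ["city", "amount"],
        ["Delhi", "100"],
        ["  Mumbai ", "75"],
        ["Delhi", ""],          # messy row: missing amount
    ])

# 2. Read and clean it
rows = []
with open("sales.csv", newline="") as f:
    for rec in csv.DictReader(f):
        if rec["amount"]:                       # drop rows with no amount
            rows.append((rec["city"].strip(), float(rec["amount"])))

# 3. Load into SQLite and analyze with SQL
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
totals = conn.execute(
    "SELECT city, SUM(amount) FROM sales GROUP BY city ORDER BY city"
).fetchall()
print(totals)   # [('Delhi', 100.0), ('Mumbai', 75.0)]
```

&lt;p&gt;Every stage here is something a real data engineer does daily, just at a much larger scale.&lt;/p&gt;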

&lt;h3&gt;
  
  
  Even tiny projects count. Progress &amp;gt; perfection.
&lt;/h3&gt;




&lt;h2&gt;
  
  
  8. Learn Git &amp;amp; Basic Engineering Habits
&lt;/h2&gt;

&lt;p&gt;Start thinking like an engineer early:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use Git to save your code&lt;/li&gt;
&lt;li&gt;Write small, clean scripts&lt;/li&gt;
&lt;li&gt;Add comments&lt;/li&gt;
&lt;li&gt;Handle errors properly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These habits matter more than tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Ignore the Tool Hype (For Now)
&lt;/h2&gt;

&lt;p&gt;As a newbie, &lt;strong&gt;you do NOT need&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spark&lt;/li&gt;
&lt;li&gt;Kafka&lt;/li&gt;
&lt;li&gt;Kubernetes&lt;/li&gt;
&lt;li&gt;Complex cloud architectures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those come later.&lt;/p&gt;

&lt;p&gt;Focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQL&lt;/li&gt;
&lt;li&gt;Python&lt;/li&gt;
&lt;li&gt;Data concepts&lt;/li&gt;
&lt;li&gt;Building confidence&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  10. Be Patient — Data Engineering Takes Time
&lt;/h2&gt;

&lt;p&gt;Data engineering is not learned in weeks. It’s built over months.&lt;/p&gt;

&lt;p&gt;You will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Feel confused&lt;/li&gt;
&lt;li&gt;Break things&lt;/li&gt;
&lt;li&gt;Forget syntax&lt;/li&gt;
&lt;li&gt;Rethink your path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s normal.&lt;/p&gt;

&lt;p&gt;Consistency beats intelligence in this field.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pro Tip: Start Interviewing Early (Even If You Feel “Not Ready”)
&lt;/h2&gt;

&lt;p&gt;One of the &lt;strong&gt;most underrated learning strategies&lt;/strong&gt; for beginners in Data Engineering is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Start interviewing for data engineering roles early — even before you think you’re ready.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is not about getting the job immediately.&lt;br&gt;
This is about &lt;strong&gt;gaining real-world experience of what the market wants&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why Interviewing Early Is Powerful
&lt;/h3&gt;

&lt;p&gt;When you interview, you learn things no course or roadmap can teach you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What companies &lt;em&gt;actually&lt;/em&gt; ask for&lt;/li&gt;
&lt;li&gt;Which skills matter most right now&lt;/li&gt;
&lt;li&gt;How deep your knowledge needs to be&lt;/li&gt;
&lt;li&gt;Where your gaps are&lt;/li&gt;
&lt;li&gt;How to explain your thinking clearly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each interview becomes &lt;strong&gt;market research&lt;/strong&gt; for your learning journey.&lt;/p&gt;




&lt;h3&gt;
  
  
  Interviews Show You the Real Trends in Data Engineering
&lt;/h3&gt;

&lt;p&gt;By giving interviews, you’ll quickly notice patterns like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQL is asked &lt;strong&gt;almost everywhere&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Python basics are expected, not advanced algorithms&lt;/li&gt;
&lt;li&gt;Questions focus on &lt;strong&gt;data pipelines&lt;/strong&gt;, not theory&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Scenario-based questions are very common:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;“How would you design a pipeline for this?”&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;“How would you handle late-arriving data?”&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;“How do you ensure data quality?”&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;This tells you &lt;strong&gt;what to prioritize&lt;/strong&gt; in your learning.&lt;/p&gt;




&lt;h3&gt;
  
  
  Interviews Are a Feedback Loop
&lt;/h3&gt;

&lt;p&gt;Think of interviews like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You interview&lt;/li&gt;
&lt;li&gt;You get stuck or rejected&lt;/li&gt;
&lt;li&gt;You note what you didn’t know&lt;/li&gt;
&lt;li&gt;You learn exactly that&lt;/li&gt;
&lt;li&gt;You interview again — stronger&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This loop is incredibly effective.&lt;/p&gt;

&lt;p&gt;Many successful data engineers failed &lt;strong&gt;multiple interviews&lt;/strong&gt; before landing their first role.&lt;/p&gt;




&lt;h3&gt;
  
  
  What Interviewers Look for in Entry-Level Data Engineers
&lt;/h3&gt;

&lt;p&gt;For beginners, interviewers usually care about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear understanding of data basics&lt;/li&gt;
&lt;li&gt;Strong SQL fundamentals&lt;/li&gt;
&lt;li&gt;Ability to explain your projects&lt;/li&gt;
&lt;li&gt;Logical thinking&lt;/li&gt;
&lt;li&gt;Willingness to learn&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They &lt;strong&gt;do not expect mastery&lt;/strong&gt; of every tool.&lt;/p&gt;




&lt;h3&gt;
  
  
  Don’t Wait for “Perfection”
&lt;/h3&gt;

&lt;p&gt;A common beginner mistake is thinking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I’ll start applying once I know everything.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That day never comes.&lt;/p&gt;

&lt;p&gt;Instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apply early&lt;/li&gt;
&lt;li&gt;Interview often&lt;/li&gt;
&lt;li&gt;Learn from rejection&lt;/li&gt;
&lt;li&gt;Improve intentionally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each interview adds &lt;strong&gt;experience&lt;/strong&gt;, confidence, and direction.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thought (Very Important)
&lt;/h2&gt;

&lt;p&gt;Learning data engineering in isolation is slow.&lt;br&gt;
Learning data engineering &lt;strong&gt;with market feedback&lt;/strong&gt; is fast.&lt;/p&gt;

&lt;p&gt;So while you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learn SQL&lt;/li&gt;
&lt;li&gt;Practice Python&lt;/li&gt;
&lt;li&gt;Build small projects&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Also start interviewing.&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;It will shape your skills, sharpen your thinking, and prepare you for the real world of data engineering.&lt;/p&gt;

&lt;p&gt;If you remember only one thing, remember this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Start small. Learn slowly. Build continuously.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Data Engineering rewards people who:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understand fundamentals&lt;/li&gt;
&lt;li&gt;Think logically&lt;/li&gt;
&lt;li&gt;Care about data quality&lt;/li&gt;
&lt;li&gt;Keep learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you stay consistent, even as a newbie, you &lt;em&gt;can&lt;/em&gt; grow into a strong data engineer. If you’d like to connect, reach out on &lt;a href="https://www.linkedin.com/in/chaets/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or drop me a message—I’d love to explore how I can help drive your data success!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>dataengineering</category>
      <category>sql</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Why Idempotence Is So Important in Data Engineering</title>
      <dc:creator>Chetan Gupta</dc:creator>
      <pubDate>Sun, 14 Dec 2025 00:10:26 +0000</pubDate>
      <link>https://forem.com/chaets/why-idempotency-is-so-important-in-data-engineering-24mj</link>
      <guid>https://forem.com/chaets/why-idempotency-is-so-important-in-data-engineering-24mj</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In data engineering, &lt;strong&gt;things fail all the time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Jobs crash halfway. Networks time out. Airflow retries tasks. Kafka replays messages. Backfills rerun months of data. And sometimes… someone just clicks “Run” again.&lt;/p&gt;

&lt;p&gt;In this messy, failure-prone world, &lt;strong&gt;idempotency&lt;/strong&gt; is what keeps your data correct, trustworthy, and sane.&lt;/p&gt;

&lt;p&gt;Let’s explore &lt;strong&gt;what idempotency is&lt;/strong&gt;, &lt;strong&gt;why it’s critical&lt;/strong&gt;, and &lt;strong&gt;how to design for it&lt;/strong&gt;, with practical do’s and don’ts.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Idempotency?
&lt;/h2&gt;

&lt;p&gt;A process is &lt;strong&gt;idempotent&lt;/strong&gt; if:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Running it once or running it multiple times produces &lt;strong&gt;the same final result&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Simple Example
&lt;/h3&gt;

&lt;p&gt;If a job processes data for &lt;code&gt;2025-01-01&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run it once → correct result&lt;/li&gt;
&lt;li&gt;Run it twice → same correct result&lt;/li&gt;
&lt;li&gt;Run it ten times → still the same result&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No duplicates. No inflation. No corruption.&lt;/p&gt;
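&lt;p&gt;The rerun-safety above can be sketched in a few lines of Python. This toy example (the table structures and &lt;code&gt;process_date&lt;/code&gt; job are purely illustrative) contrasts a blind append with a keyed, partition-style write:&lt;/p&gt;

```python
# Toy illustration: append-style writes vs. keyed (idempotent) writes.
append_table = []   # non-idempotent sink: blind appends
keyed_table = {}    # idempotent sink: one slot per partition date

def process_date(run_date):
    rows = [("order-1", 100), ("order-2", 250)]  # pretend this is the day's data
    append_table.extend(rows)      # every rerun adds duplicates
    keyed_table[run_date] = rows   # every rerun replaces the same partition

for _ in range(3):                 # simulate retries / reruns
    process_date("2025-01-01")

print(len(append_table))                 # 6 -- grows with every rerun
print(len(keyed_table["2025-01-01"]))    # 2 -- stable no matter how often we run
```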




&lt;h2&gt;
  
  
  Why Idempotency Matters in Data Engineering
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Failures Are Normal, Not Exceptional
&lt;/h3&gt;

&lt;p&gt;Modern data systems are distributed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spark jobs fail due to executor loss&lt;/li&gt;
&lt;li&gt;Airflow tasks retry automatically&lt;/li&gt;
&lt;li&gt;Cloud storage has eventual consistency&lt;/li&gt;
&lt;li&gt;APIs timeout mid-request&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without idempotency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A retry can &lt;strong&gt;double-count data&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Partial writes can corrupt tables&lt;/li&gt;
&lt;li&gt;“Fixing” failures creates new bugs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Idempotency turns retries from a &lt;strong&gt;risk&lt;/strong&gt; into a &lt;strong&gt;feature&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Schedulers and Orchestrators Rely on It
&lt;/h3&gt;

&lt;p&gt;Tools like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Airflow&lt;/li&gt;
&lt;li&gt;Dagster&lt;/li&gt;
&lt;li&gt;Prefect&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;assume&lt;/strong&gt; tasks can be retried safely.&lt;/p&gt;

&lt;p&gt;If your task is not idempotent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retries silently introduce data errors&lt;/li&gt;
&lt;li&gt;“Green DAGs” produce bad data&lt;/li&gt;
&lt;li&gt;Debugging becomes nearly impossible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Idempotency is the &lt;strong&gt;contract&lt;/strong&gt; between your code and your scheduler.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Backfills and Reprocessing Become Safe
&lt;/h3&gt;

&lt;p&gt;Backfills are unavoidable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logic changes&lt;/li&gt;
&lt;li&gt;Bug fixes&lt;/li&gt;
&lt;li&gt;Late-arriving data&lt;/li&gt;
&lt;li&gt;Schema evolution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With idempotent pipelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can rerun historical data confidently&lt;/li&gt;
&lt;li&gt;You don’t need manual cleanup&lt;/li&gt;
&lt;li&gt;You avoid “special backfill code paths”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without idempotency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every backfill is a high-risk operation&lt;/li&gt;
&lt;li&gt;Engineers fear touching old data&lt;/li&gt;
&lt;li&gt;Technical debt piles up fast&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  4. Exactly-Once Semantics Are Rare (and Expensive)
&lt;/h3&gt;

&lt;p&gt;In theory, we want &lt;strong&gt;exactly-once processing&lt;/strong&gt;.&lt;br&gt;
In practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distributed systems mostly provide &lt;strong&gt;at-least-once&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Exactly-once guarantees are complex and costly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Idempotency lets you &lt;strong&gt;embrace at-least-once delivery&lt;/strong&gt; safely.&lt;/p&gt;

&lt;p&gt;Instead of fighting the system, you design your logic to handle duplicates gracefully.&lt;/p&gt;


&lt;h3&gt;
  
  
  5. Data Trust Depends on It
&lt;/h3&gt;

&lt;p&gt;Nothing erodes trust faster than:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics that change every rerun&lt;/li&gt;
&lt;li&gt;Counts that slowly drift upward&lt;/li&gt;
&lt;li&gt;Dashboards that don’t match yesterday&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Idempotent pipelines ensure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deterministic outputs&lt;/li&gt;
&lt;li&gt;Reproducible results&lt;/li&gt;
&lt;li&gt;Confidence in downstream analytics&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Common Places Where Idempotency Breaks
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;INSERT INTO table VALUES (...)&lt;/code&gt; without constraints&lt;/li&gt;
&lt;li&gt;Appending files blindly to object storage&lt;/li&gt;
&lt;li&gt;Incremental loads without deduplication&lt;/li&gt;
&lt;li&gt;Updates without stable primary keys&lt;/li&gt;
&lt;li&gt;Side effects (emails, API calls) inside data jobs&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Design Patterns for Idempotency
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Partitioned Writes (Overwrite, Don’t Append)
&lt;/h3&gt;

&lt;p&gt;Instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;staging_sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prefer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="n"&gt;OVERWRITE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'2025-01-01'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;staging_sales&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'2025-01-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The partition is replaced, not duplicated&lt;/li&gt;
&lt;li&gt;Reruns are safe&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  2. Use Deterministic Keys
&lt;/h3&gt;

&lt;p&gt;Always have a &lt;strong&gt;stable primary key&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;order_id&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;user_id + event_time&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Hash of business attributes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deduplicate on read&lt;/li&gt;
&lt;li&gt;Merge on write&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;MERGE&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;staging_users&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
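&lt;p&gt;When no single natural key exists, the "hash of business attributes" option above can be built with the standard library. A minimal sketch of deterministic keys plus dedup-on-read; the field names are illustrative assumptions:&lt;/p&gt;

```python
import hashlib

def business_key(record):
    # Hash a fixed, ordered set of business attributes into a stable key.
    raw = "|".join([record["user_id"], record["event_time"], record["action"]])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def dedupe(records):
    # Dedup-on-read: keep the first record seen for each deterministic key.
    seen = {}
    for rec in records:
        seen.setdefault(business_key(rec), rec)
    return list(seen.values())

batch = [
    {"user_id": "u1", "event_time": "2025-01-01T00:00:00", "action": "login"},
    {"user_id": "u1", "event_time": "2025-01-01T00:00:00", "action": "login"},  # replayed duplicate
]
print(len(dedupe(batch)))  # 1
```

Because the key depends only on business attributes, it stays stable across retries and backfills.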






&lt;h3&gt;
  
  
  3. Make Transformations Pure
&lt;/h3&gt;

&lt;p&gt;A pure transformation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Depends only on inputs&lt;/li&gt;
&lt;li&gt;Produces the same output every time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Avoid:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;CURRENT_TIMESTAMP&lt;/code&gt; inside transforms&lt;/li&gt;
&lt;li&gt;Random UUID generation during processing&lt;/li&gt;
&lt;li&gt;External API calls during transformations&lt;/li&gt;
&lt;/ul&gt;
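&lt;p&gt;One common way to keep a transform pure is to pass the run date in as an explicit parameter instead of reading the clock. A minimal sketch, with hypothetical column names:&lt;/p&gt;

```python
import datetime

def impure_transform(row):
    # Non-deterministic: output changes on every run.
    return {**row, "loaded_at": datetime.datetime.now().isoformat()}

def pure_transform(row, run_date):
    # Deterministic: the run date is an explicit input, not wall-clock time.
    return {**row, "loaded_at": run_date}

row = {"order_id": 101}
a = pure_transform(row, "2025-01-01")
b = pure_transform(row, "2025-01-01")
print(a == b)  # True: same inputs, same output, rerun-safe
```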




&lt;h3&gt;
  
  
  4. Track Processing State Explicitly
&lt;/h3&gt;

&lt;p&gt;For streaming and incremental jobs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store offsets&lt;/li&gt;
&lt;li&gt;Store watermarks&lt;/li&gt;
&lt;li&gt;Store processed timestamps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But design them so:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reprocessing the same window does not change results&lt;/li&gt;
&lt;/ul&gt;
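&lt;p&gt;A toy sketch of explicit state: the watermark lives in its own store, and each window writes to a keyed slot, so replaying a window overwrites instead of appending. All names here are illustrative:&lt;/p&gt;

```python
# Incremental job with explicitly tracked state.
state = {"watermark": "2025-01-01"}
output = {}  # window start -> rows, so replays replace rather than duplicate

def run_window(window_start, rows):
    output[window_start] = rows  # idempotent write per window
    # Advance the watermark only forward (ISO dates sort lexicographically).
    state["watermark"] = max(state["watermark"], window_start)

run_window("2025-01-02", ["row-a", "row-b"])
run_window("2025-01-02", ["row-a", "row-b"])  # replayed: same final state

print(state["watermark"], len(output["2025-01-02"]))  # 2025-01-02 2
```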




&lt;h3&gt;
  
  
  5. Separate Side Effects from Data Processing
&lt;/h3&gt;

&lt;p&gt;Data writes should be idempotent.&lt;br&gt;
Side effects should be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Downstream&lt;/li&gt;
&lt;li&gt;Explicit&lt;/li&gt;
&lt;li&gt;Carefully controlled&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First write data safely&lt;/li&gt;
&lt;li&gt;Then trigger notifications based on final state&lt;/li&gt;
&lt;/ul&gt;
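&lt;p&gt;A minimal sketch of this ordering, with a hypothetical dedup key guarding the notification so a retried job cannot double-send:&lt;/p&gt;

```python
# Step 1 is an idempotent keyed write; step 2 is a side effect protected
# by a dedup key. All names are illustrative.
table = {}
sent_notifications = set()

def notify(dedup_key, message):
    if dedup_key in sent_notifications:
        return "skipped"
    sent_notifications.add(dedup_key)
    return "sent"

def job(run_date, rows):
    table[run_date] = rows  # step 1: safe, keyed data write
    return notify("daily-report-" + run_date, "report ready")  # step 2: side effect

print(job("2025-01-01", ["r1"]))  # sent
print(job("2025-01-01", ["r1"]))  # skipped: the retry does not re-notify
```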




&lt;h2&gt;
  
  
  Do’s and Don’ts of Idempotent Data Pipelines
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ✅ Do’s
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;✅ Design every job assuming it &lt;strong&gt;will be retried&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;✅ Use overwrite or merge instead of blind appends&lt;/li&gt;
&lt;li&gt;✅ Make jobs deterministic and repeatable&lt;/li&gt;
&lt;li&gt;✅ Use primary keys and deduplication logic&lt;/li&gt;
&lt;li&gt;✅ Make backfills a first-class use case&lt;/li&gt;
&lt;li&gt;✅ Log inputs, outputs, and checkpoints&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  ❌ Don’ts
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;❌ Assume “this job only runs once”&lt;/li&gt;
&lt;li&gt;❌ Append data without safeguards&lt;/li&gt;
&lt;li&gt;❌ Mix side effects with transformations&lt;/li&gt;
&lt;li&gt;❌ Depend on execution order for correctness&lt;/li&gt;
&lt;li&gt;❌ Use non-deterministic functions in core logic&lt;/li&gt;
&lt;li&gt;❌ Rely on humans to clean up duplicates&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  A Mental Model to Remember
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;If rerunning your pipeline scares you, it’s not idempotent.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A truly idempotent pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can be rerun anytime&lt;/li&gt;
&lt;li&gt;Produces the same result&lt;/li&gt;
&lt;li&gt;Turns failure recovery into a non-event&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Idempotency is not just a technical detail.&lt;br&gt;
It’s a &lt;strong&gt;design philosophy&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It makes systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More resilient&lt;/li&gt;
&lt;li&gt;Easier to operate&lt;/li&gt;
&lt;li&gt;Cheaper to maintain&lt;/li&gt;
&lt;li&gt;More trustworthy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In data engineering, where reprocessing is inevitable and failures are normal, &lt;strong&gt;idempotency is the difference between a fragile pipeline and a production-grade system&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Below is a &lt;strong&gt;practical, copy-pasteable checklist&lt;/strong&gt; teams can use during &lt;strong&gt;data pipeline design reviews, PR reviews, and post-incident audits&lt;/strong&gt;.&lt;br&gt;
It’s opinionated, short enough to be usable, but deep enough to catch real production issues.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bonus checklist: Idempotency Review Checklist for Data Pipelines
&lt;/h2&gt;

&lt;p&gt;Use this checklist to answer one core question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“If this pipeline runs twice, will the result still be correct?”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. Retry &amp;amp; Failure Safety
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; The pipeline must be safe under retries, partial failures, and restarts.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⬜ Can every task be retried without manual cleanup?&lt;/li&gt;
&lt;li&gt;⬜ What happens if the job fails halfway and reruns?&lt;/li&gt;
&lt;li&gt;⬜ Does the orchestrator (Airflow / Dagster / Prefect) retry tasks automatically?&lt;/li&gt;
&lt;li&gt;⬜ Are partial writes cleaned up or overwritten on retry?&lt;/li&gt;
&lt;li&gt;⬜ Is there a clear failure boundary (per partition, batch, or window)?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚩 &lt;strong&gt;Red flag:&lt;/strong&gt; “We never retry this job.”&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Input Determinism
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Same inputs → same outputs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⬜ Are inputs explicitly scoped (date, partition, offset, watermark)?&lt;/li&gt;
&lt;li&gt;⬜ Is the input source stable under reprocessing?&lt;/li&gt;
&lt;li&gt;⬜ Are late-arriving records handled deterministically?&lt;/li&gt;
&lt;li&gt;⬜ Is there protection against reading overlapping windows twice?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚩 &lt;strong&gt;Red flag:&lt;/strong&gt; Inputs depend on “now”, “latest”, or implicit state.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Output Write Strategy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Writing data should not create duplicates or drift.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⬜ Is the write strategy &lt;strong&gt;overwrite&lt;/strong&gt;, &lt;strong&gt;merge&lt;/strong&gt;, or &lt;strong&gt;upsert&lt;/strong&gt;?&lt;/li&gt;
&lt;li&gt;⬜ Are appends protected by deduplication or constraints?&lt;/li&gt;
&lt;li&gt;⬜ Is the output partitioned by a deterministic key (date, hour, batch_id)?&lt;/li&gt;
&lt;li&gt;⬜ Can a single partition be safely rewritten?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚩 &lt;strong&gt;Red flag:&lt;/strong&gt; Blind &lt;code&gt;INSERT INTO&lt;/code&gt; or file appends with no safeguards.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Primary Keys &amp;amp; Deduplication
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; The system knows how to identify “the same record”.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⬜ Does each dataset have a well-defined primary or natural key?&lt;/li&gt;
&lt;li&gt;⬜ Is deduplication logic explicit and documented?&lt;/li&gt;
&lt;li&gt;⬜ Are keys stable across retries and backfills?&lt;/li&gt;
&lt;li&gt;⬜ Is deduplication enforced at read time, write time, or both?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚩 &lt;strong&gt;Red flag:&lt;/strong&gt; “Duplicates shouldn’t happen.”&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Transformation Purity
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Transformations must be repeatable and predictable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⬜ Are transformations deterministic?&lt;/li&gt;
&lt;li&gt;⬜ Are &lt;code&gt;CURRENT_TIMESTAMP&lt;/code&gt;, random UUIDs, or non-deterministic functions avoided?&lt;/li&gt;
&lt;li&gt;⬜ Are external API calls excluded from core transformations?&lt;/li&gt;
&lt;li&gt;⬜ Is business logic independent of execution order?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚩 &lt;strong&gt;Red flag:&lt;/strong&gt; Output changes every time the job runs.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Incremental &amp;amp; Streaming Logic
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Incremental logic must tolerate reprocessing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⬜ Are offsets, checkpoints, or watermarks stored reliably?&lt;/li&gt;
&lt;li&gt;⬜ Is reprocessing the same range safe?&lt;/li&gt;
&lt;li&gt;⬜ Is “at-least-once” delivery handled correctly?&lt;/li&gt;
&lt;li&gt;⬜ Can the pipeline replay historical data without corruption?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚩 &lt;strong&gt;Red flag:&lt;/strong&gt; “We can’t replay this topic/table.”&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Backfill Readiness
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Backfills should be boring, not terrifying.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⬜ Can the pipeline be run for arbitrary historical ranges?&lt;/li&gt;
&lt;li&gt;⬜ Is backfill logic identical to regular logic?&lt;/li&gt;
&lt;li&gt;⬜ Does rerunning old partitions overwrite or merge cleanly?&lt;/li&gt;
&lt;li&gt;⬜ Are downstream consumers protected during backfills?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚩 &lt;strong&gt;Red flag:&lt;/strong&gt; Special scripts or manual SQL for backfills.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Side Effects &amp;amp; External Actions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Data processing should not cause unintended external effects.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⬜ Are emails, webhooks, or API calls isolated from core data logic?&lt;/li&gt;
&lt;li&gt;⬜ Are side effects triggered only after successful completion?&lt;/li&gt;
&lt;li&gt;⬜ Are side effects idempotent themselves (dedup keys, request IDs)?&lt;/li&gt;
&lt;li&gt;⬜ Is there protection against double notifications?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚩 &lt;strong&gt;Red flag:&lt;/strong&gt; Side effects inside transformation steps.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Observability &amp;amp; Validation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Idempotency issues should be detectable early.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⬜ Are row counts consistent across reruns?&lt;/li&gt;
&lt;li&gt;⬜ Are data quality checks rerun-safe?&lt;/li&gt;
&lt;li&gt;⬜ Are duplicates, nulls, and drift monitored?&lt;/li&gt;
&lt;li&gt;⬜ Is lineage clear for reruns and backfills?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚩 &lt;strong&gt;Red flag:&lt;/strong&gt; No way to tell if data changed unexpectedly.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. Human Factors &amp;amp; Documentation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Humans should not be part of correctness.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⬜ Is idempotency behavior documented?&lt;/li&gt;
&lt;li&gt;⬜ Can a new engineer safely rerun the pipeline?&lt;/li&gt;
&lt;li&gt;⬜ Are recovery steps automated, not manual?&lt;/li&gt;
&lt;li&gt;⬜ Is there a clear owner for data correctness?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚩 &lt;strong&gt;Red flag:&lt;/strong&gt; “Ask Alice before rerunning.”&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Gate Question (Must Answer Yes)
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;⬜ &lt;strong&gt;Can we safely rerun this pipeline right now in production?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the answer is &lt;strong&gt;no&lt;/strong&gt;, the pipeline is &lt;strong&gt;not idempotent&lt;/strong&gt; and needs redesign.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Teams Should Use This Checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;📌 &lt;strong&gt;Design reviews:&lt;/strong&gt; Before building pipelines&lt;/li&gt;
&lt;li&gt;🔍 &lt;strong&gt;PR reviews:&lt;/strong&gt; As a merge gate&lt;/li&gt;
&lt;li&gt;🚨 &lt;strong&gt;Post-incident reviews:&lt;/strong&gt; To prevent repeat failures&lt;/li&gt;
&lt;li&gt;🔁 &lt;strong&gt;Backfill planning:&lt;/strong&gt; Before rerunning historical data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’d like to connect, find me on &lt;a href="https://www.linkedin.com/in/chaets/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or drop me a message. I’d love to explore how I can help drive your data success!&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>dataengineering</category>
      <category>systemdesign</category>
      <category>etl</category>
    </item>
    <item>
      <title>REST API Calls for Data Engineers: A Practical Guide with Examples</title>
      <dc:creator>Chetan Gupta</dc:creator>
      <pubDate>Sun, 14 Dec 2025 00:02:48 +0000</pubDate>
      <link>https://forem.com/chaets/rest-api-calls-for-data-engineers-a-practical-guide-with-examples-5gd9</link>
      <guid>https://forem.com/chaets/rest-api-calls-for-data-engineers-a-practical-guide-with-examples-5gd9</guid>
      <description>&lt;h1&gt;
  
  
  REST API Calls for Data Engineers
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As a &lt;strong&gt;Data Engineer&lt;/strong&gt;, you rarely work only with databases. Modern data pipelines frequently ingest data from &lt;strong&gt;REST APIs&lt;/strong&gt;—whether it’s pulling data from SaaS tools (Salesforce, Jira, Google Analytics), internal microservices, or third-party providers.&lt;/p&gt;

&lt;p&gt;Understanding how REST APIs work and how to interact with them efficiently is a &lt;strong&gt;core data engineering skill&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This blog covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What REST APIs are (briefly, practically)&lt;/li&gt;
&lt;li&gt;Common REST methods from a data engineering perspective&lt;/li&gt;
&lt;li&gt;Authentication patterns&lt;/li&gt;
&lt;li&gt;Pagination, filtering, and rate limiting&lt;/li&gt;
&lt;li&gt;Real-world examples using Python&lt;/li&gt;
&lt;li&gt;Best practices for production data pipelines&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What is a REST API (Data Engineer Perspective)
&lt;/h2&gt;

&lt;p&gt;REST (Representational State Transfer) APIs allow systems to communicate over &lt;strong&gt;HTTP&lt;/strong&gt; using standard methods.&lt;/p&gt;

&lt;p&gt;From a data engineer’s standpoint:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;REST APIs are &lt;strong&gt;data sources&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;JSON is the most common &lt;strong&gt;data format&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;APIs are often &lt;strong&gt;incremental, paginated, and rate-limited&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;APIs feed &lt;strong&gt;data lakes, warehouses, or streaming systems&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Core REST HTTP Methods You’ll Use
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Usage for Data Engineers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GET&lt;/td&gt;
&lt;td&gt;Fetch data (most common)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;POST&lt;/td&gt;
&lt;td&gt;Submit parameters, create resources, complex queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PUT&lt;/td&gt;
&lt;td&gt;Update existing resources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DELETE&lt;/td&gt;
&lt;td&gt;Rarely used in pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In data engineering, &lt;strong&gt;GET&lt;/strong&gt; and &lt;strong&gt;POST&lt;/strong&gt; are used 90% of the time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Anatomy of a REST API Request
&lt;/h2&gt;

&lt;p&gt;A typical REST API call consists of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://api.example.com/v1/orders?start_date=2025-01-01&amp;amp;limit=100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Components:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Base URL&lt;/strong&gt;: &lt;code&gt;https://api.example.com&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Endpoint&lt;/strong&gt;: &lt;code&gt;/v1/orders&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query Parameters&lt;/strong&gt;: &lt;code&gt;start_date&lt;/code&gt;, &lt;code&gt;limit&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Headers&lt;/strong&gt;: Authentication, content type&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP Method&lt;/strong&gt;: GET / POST&lt;/li&gt;
&lt;/ul&gt;
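&lt;p&gt;These components map directly onto code. A standard-library sketch that assembles the example URL above (the endpoint and parameters are the illustrative ones from the example):&lt;/p&gt;

```python
from urllib.parse import urlencode

base_url = "https://api.example.com"
endpoint = "/v1/orders"
params = {"start_date": "2025-01-01", "limit": 100}

# urlencode joins and percent-escapes the query parameters for us.
url = base_url + endpoint + "?" + urlencode(params)
print(url)
```

In practice, libraries like `requests` accept a `params` dict and build this query string internally.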




&lt;h2&gt;
  
  
  Example 1: Simple GET Request (Fetching Data)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Use Case
&lt;/h3&gt;

&lt;p&gt;Fetch daily sales data from an external system.&lt;/p&gt;

&lt;h3&gt;
  
  
  API Request
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET https://api.company.com/v1/sales
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Python Example (requests library)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.company.com/v1/sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer YOUR_API_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Typical JSON Response
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sales"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"order_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;101&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;250.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"currency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"USD"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"order_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-01-10"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This JSON is later:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flattened&lt;/li&gt;
&lt;li&gt;Transformed&lt;/li&gt;
&lt;li&gt;Stored in a data lake or warehouse&lt;/li&gt;
&lt;/ul&gt;
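&lt;p&gt;For instance, the sample response above can be flattened into warehouse-ready rows with a simple comprehension:&lt;/p&gt;

```python
# Flatten the nested sales list into flat tuples, ready for loading.
response_json = {
    "sales": [
        {"order_id": 101, "amount": 250.50, "currency": "USD", "order_date": "2025-01-10"}
    ]
}

rows = [
    (s["order_id"], s["amount"], s["currency"], s["order_date"])
    for s in response_json["sales"]
]
print(rows[0])  # (101, 250.5, 'USD', '2025-01-10')
```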




&lt;h2&gt;
  
  
  Example 2: Query Parameters (Filtering Data)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Use Case
&lt;/h3&gt;

&lt;p&gt;Pull &lt;strong&gt;incremental data&lt;/strong&gt; to avoid reprocessing historical records.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /v1/sales?start_date=2025-01-01&amp;amp;end_date=2025-01-31
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Python Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-01-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-01-31&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sales_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✅ &lt;strong&gt;Best Practice:&lt;/strong&gt; Always design pipelines to be &lt;strong&gt;incremental&lt;/strong&gt;.&lt;/p&gt;
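&lt;p&gt;Incremental pulls usually go hand in hand with pagination. A self-contained sketch using a stand-in fetch function; the &lt;code&gt;page&lt;/code&gt; and &lt;code&gt;has_more&lt;/code&gt; fields are assumptions, since every API names these differently:&lt;/p&gt;

```python
# Stand-in for an API that serves data in pages of 3.
DATASET = list(range(7))

def fetch_page(page, page_size=3):
    start = page * page_size
    chunk = DATASET[start:start + page_size]
    # has_more is True while records remain past this page.
    return {"items": chunk, "has_more": bool(DATASET[start + page_size:])}

# The pagination loop a real pipeline would run against the API.
all_items = []
page = 0
while True:
    result = fetch_page(page)
    all_items.extend(result["items"])
    if not result["has_more"]:
        break
    page += 1

print(len(all_items))  # 7
```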




&lt;h2&gt;
  
  
  Example 3: POST Request (Complex Queries)
&lt;/h2&gt;

&lt;p&gt;Some APIs require &lt;strong&gt;POST&lt;/strong&gt; when filters are complex.&lt;/p&gt;

&lt;h3&gt;
  
  
  API Call
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /v1/sales/search
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Payload
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"region"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"US"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"EU"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"min_amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"date_range"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"from"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-01-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-01-31"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Python Example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;US&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;min_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_range&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-01-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-01-31&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Authentication Methods (Very Important)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. API Key Authentication
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;Authorization: ApiKey abc123
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Bearer Token (OAuth 2.0)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;Authorization: Bearer eyJhbGciOi...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Basic Auth (Less Secure)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;username&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🔐 &lt;strong&gt;Data Engineering Tip&lt;/strong&gt;&lt;br&gt;
Always store credentials in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Environment variables&lt;/li&gt;
&lt;li&gt;Secret managers (AWS Secrets Manager, Azure Key Vault)&lt;/li&gt;
&lt;/ul&gt;
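A hedged sketch of that tip: read the key from an environment variable and fail fast when it is missing. The variable name `API_KEY` is an assumption here; use whatever your deployment or secret manager defines.

```python
import os

def auth_headers(var_name: str = "API_KEY") -> dict:
    """Build the Authorization header from an environment variable.

    `var_name` is illustrative -- point it at whatever your secret
    manager injects. Failing fast beats sending unauthenticated calls.
    """
    key = os.environ.get(var_name)
    if key is None:
        raise RuntimeError(f"{var_name} is not set; configure it in your environment or secret manager")
    return {"Authorization": f"ApiKey {key}"}
```

Locally you would export the variable once (`export API_KEY=...`) rather than committing it to code.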


&lt;h2&gt;
  
  
  Example 4: Pagination (Very Common in APIs)
&lt;/h2&gt;

&lt;p&gt;Most APIs limit results per request.&lt;/p&gt;
&lt;h3&gt;
  
  
  API Response with Pagination
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"page"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"total_pages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Python Pagination Logic
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;all_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;all_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_pages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;✅ &lt;strong&gt;Always handle pagination&lt;/strong&gt;, or you’ll silently miss data.&lt;/p&gt;


&lt;h2&gt;
  
  
  Example 5: Handling Rate Limits
&lt;/h2&gt;

&lt;p&gt;APIs often limit requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;429 Too Many Requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Retry Logic Example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;📌 &lt;strong&gt;Production pipelines&lt;/strong&gt; should use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exponential backoff&lt;/li&gt;
&lt;li&gt;Retry limits&lt;/li&gt;
&lt;/ul&gt;
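A minimal sketch combining both ideas; `fetch` is any callable returning an object with a `status_code` attribute, for example `lambda: requests.get(url, headers=headers)`.

```python
import time

def get_with_backoff(fetch, max_retries: int = 5, base_delay: float = 1.0):
    """Retry `fetch()` on HTTP 429, doubling the wait each time.

    Delays are base_delay, 2*base_delay, 4*base_delay, ... and the
    loop gives up after `max_retries` attempts instead of retrying forever.
    """
    for attempt in range(max_retries):
        response = fetch()
        if response.status_code != 429:
            return response
        time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts")
```

If the API returns a `Retry-After` header, honoring it is usually better than a fixed schedule.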




&lt;h2&gt;
  
  
  Example 6: Error Handling (Critical for Pipelines)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API failed with status &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common HTTP Status Codes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;200&lt;/code&gt; – Success&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;400&lt;/code&gt; – Bad Request&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;401&lt;/code&gt; – Unauthorized&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;404&lt;/code&gt; – Not Found&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;500&lt;/code&gt; – Server Error&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  REST API Data Flow in a Data Pipeline
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;REST API
   ↓
Python / Spark Job
   ↓
Raw Zone (JSON)
   ↓
Transformation (Flattening, Cleaning)
   ↓
Data Warehouse (Snowflake / BigQuery / Redshift)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
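The flow above can be sketched with the standard library alone; the file path and field names are illustrative, and a real pipeline would land raw files in object storage rather than on local disk.

```python
import json
from pathlib import Path

def land_raw(payload: dict, path: str) -> None:
    # Raw zone: persist the API response untouched so failed
    # transformations can be replayed without re-calling the API.
    Path(path).write_text(json.dumps(payload))

def flatten(record: dict, parent: str = "") -> dict:
    # Transformation: turn nested JSON into flat, warehouse-friendly columns.
    flat = {}
    for key, value in record.items():
        name = f"{parent}_{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat
```

For example, `flatten({"id": 1, "customer": {"name": "A"}})` yields `{"id": 1, "customer_name": "A"}`, ready for a warehouse load.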






&lt;h2&gt;
  
  
  Best Practices for Data Engineers
&lt;/h2&gt;

&lt;p&gt;✔ Always design &lt;strong&gt;idempotent&lt;/strong&gt; pipelines&lt;br&gt;
✔ Log request/response metadata&lt;br&gt;
✔ Store raw API responses for reprocessing&lt;br&gt;
✔ Use incremental loads (timestamps, IDs)&lt;br&gt;
✔ Monitor failures and latency&lt;br&gt;
✔ Respect API rate limits&lt;/p&gt;
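One way to implement incremental loads is a stored watermark. This is a local-file sketch: in production the watermark would live in a database or orchestrator state store, and the `updated_after` request parameter mentioned below is an assumption to check against your API's docs.

```python
import json
from pathlib import Path

def read_watermark(path: str = "state.json") -> str:
    # Return the last successfully loaded timestamp, or an epoch default
    # so the very first run performs a full load.
    p = Path(path)
    if p.exists():
        return json.loads(p.read_text())["last_updated_at"]
    return "1970-01-01T00:00:00Z"

def write_watermark(ts: str, path: str = "state.json") -> None:
    # Persist only after the batch has landed successfully (idempotency).
    Path(path).write_text(json.dumps({"last_updated_at": ts}))
```

Each run then requests only new records, e.g. `params = {"updated_after": read_watermark()}`, and advances the watermark after a successful load.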




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;REST APIs are a &lt;strong&gt;primary data ingestion mechanism&lt;/strong&gt; for data engineers. Mastering REST calls—authentication, pagination, retries, and error handling—will make your pipelines &lt;strong&gt;reliable, scalable, and production-ready&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you understand REST APIs deeply, integrating any new data source becomes significantly easier.&lt;br&gt;
If you’d like to connect, find me on &lt;a href="https://www.linkedin.com/in/chaets/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or drop me a message—I’d love to explore how I can help drive your data success!&lt;/p&gt;

</description>
      <category>rest</category>
      <category>dataengineering</category>
      <category>json</category>
      <category>discuss</category>
    </item>
    <item>
      <title>From Policy to Code: How Leading Companies Operationalize Privacy</title>
      <dc:creator>Chetan Gupta</dc:creator>
      <pubDate>Wed, 26 Nov 2025 08:39:11 +0000</pubDate>
      <link>https://forem.com/chaets/from-policy-to-code-how-leading-companies-operationalize-privacy-13cj</link>
      <guid>https://forem.com/chaets/from-policy-to-code-how-leading-companies-operationalize-privacy-13cj</guid>
      <description>&lt;p&gt;Most companies still treat privacy as a policy problem.&lt;br&gt;
The best treat it as a systems problem.&lt;/p&gt;

&lt;p&gt;That difference — between writing rules and enforcing them — is what separates organizations that talk about responsible data use from those that actually achieve it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Weekly Translation Failure
&lt;/h2&gt;

&lt;p&gt;Every week, legal, product, and engineering teams sit down to align on privacy and responsible data use. And every week, they run into the same challenge:&lt;br&gt;
no shared language.&lt;/p&gt;

&lt;p&gt;It’s not a communication problem.&lt;br&gt;
It’s a translation problem.&lt;/p&gt;

&lt;p&gt;A privacy policy that reads cleanly in a spec document becomes a maze of implementation questions the moment it meets code:&lt;br&gt;
    • How are user preferences modeled across systems?&lt;br&gt;
    • What’s a valid state change when consent is updated?&lt;br&gt;
    • What’s the source of truth when systems conflict?&lt;br&gt;
    • How do we avoid race conditions in enforcement?&lt;/p&gt;

&lt;p&gt;Policy teams speak in rights, obligations, and business rules.&lt;br&gt;
Engineers work in schemas, state machines, and system design.&lt;br&gt;
Product teams sit in the middle, trying to reconcile both worlds — often without the infrastructure to make alignment possible.&lt;/p&gt;

&lt;p&gt;The result?&lt;br&gt;
Requirements that feel legally sound but defy implementation.&lt;br&gt;
Code that compiles but misses the spirit or scope of compliance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Missing Layer: A Shared Operational Foundation
&lt;/h2&gt;

&lt;p&gt;What’s missing isn’t collaboration — it’s a common operational foundation.&lt;br&gt;
A shared semantic layer that bridges policy intent and system behavior.&lt;/p&gt;

&lt;p&gt;This is why privacy must be treated as a systems problem.&lt;br&gt;
It can’t be solved in documents.&lt;br&gt;
It has to be enforced in code.&lt;/p&gt;

&lt;p&gt;That’s the core principle behind emerging privacy infrastructure — where legal definitions, business policies, and data models converge into a single executable framework. &lt;/p&gt;

&lt;p&gt;When obligations are expressed as code, they become:&lt;br&gt;
    • Reliable – enforced automatically, not manually interpreted.&lt;br&gt;
    • Scalable – applied consistently across systems and teams.&lt;br&gt;
    • Trustworthy – transparent, testable, and provable.&lt;/p&gt;
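As a toy illustration (the purpose names are invented for this example), an obligation expressed as code is a function the whole stack can call, not a paragraph open to interpretation:

```python
def may_process(user_consents: set, purpose: str) -> bool:
    # Obligation as code: processing is permitted only when the user has
    # granted consent for this exact purpose. Real systems would draw
    # purpose names from a vocabulary shared by legal and engineering.
    return purpose in user_consents

consents = {"analytics"}             # what the user actually agreed to
may_process(consents, "analytics")   # True: consent covers this purpose
may_process(consents, "marketing")   # False: blocked automatically
```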

&lt;h2&gt;
  
  
  When Policy Lives in Infrastructure
&lt;/h2&gt;

&lt;p&gt;When privacy is embedded directly in infrastructure, the dynamic between teams changes entirely:&lt;br&gt;
    • Legal can write once and enforce everywhere.&lt;br&gt;
    • Engineering ships faster with clarity and confidence.&lt;br&gt;
    • Product no longer has to choose between trust and velocity.&lt;/p&gt;

&lt;p&gt;That’s not just better governance — it’s a better growth model.&lt;/p&gt;

&lt;p&gt;Instead of being boxed in by complexity, teams gain the freedom to innovate safely with sensitive data — whether it’s for AI, analytics, personalization, or compliance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Privacy as a Competitive Advantage
&lt;/h2&gt;

&lt;p&gt;Enterprises that get this right stop playing defense with privacy.&lt;br&gt;
They build forward — turning trust into an operational advantage.&lt;/p&gt;

&lt;p&gt;Because when privacy becomes part of your stack, not just your policy binder, you don’t just comply.&lt;br&gt;
You scale responsibly.&lt;br&gt;
You innovate with confidence.&lt;br&gt;
And you turn privacy from a blocker into a feature of your growth model.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>privacy</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Templating Columns in dbt: A Helpful Path Forward</title>
      <dc:creator>Chetan Gupta</dc:creator>
      <pubDate>Sat, 18 Oct 2025 14:35:26 +0000</pubDate>
      <link>https://forem.com/chaets/templating-columns-in-dbt-helpful-path-to-forward-2g3l</link>
      <guid>https://forem.com/chaets/templating-columns-in-dbt-helpful-path-to-forward-2g3l</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Ship consistent, DRY models by generating column lists with Jinja, macros, and metadata.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Who this is for
&lt;/h2&gt;

&lt;p&gt;Analytics engineers and data engineers who want to avoid copy‑pasting column lists across models and instead &lt;strong&gt;generate them from templates&lt;/strong&gt; with dbt + Jinja.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you’ll build
&lt;/h2&gt;

&lt;p&gt;You’ll create a small dbt project that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Centralizes column definitions in &lt;strong&gt;macros&lt;/strong&gt; (with optional aliases/prefixes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generates SELECT lists&lt;/strong&gt; from YAML metadata and adapter‑discovered schemas&lt;/li&gt;
&lt;li&gt;Applies &lt;strong&gt;policy‑style transforms&lt;/strong&gt; (e.g., lowercase emails) consistently&lt;/li&gt;
&lt;li&gt;Supports &lt;strong&gt;environment‑specific&lt;/strong&gt; or &lt;strong&gt;source‑specific&lt;/strong&gt; column sets&lt;/li&gt;
&lt;li&gt;Automatically &lt;strong&gt;tests&lt;/strong&gt; and &lt;strong&gt;documents&lt;/strong&gt; the templated columns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can copy/paste each step. By the end, you’ll have a reusable pattern you can drop into any dbt project.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;dbt (Core or Cloud) installed&lt;/li&gt;
&lt;li&gt;A warehouse connection configured (&lt;code&gt;profiles.yml&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Basic awareness of Jinja syntax&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Project structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├─ dbt_project.yml
├─ models/
│  ├─ marts/
│  │  └─ customers.sql
│  ├─ staging/
│  │  └─ stg_customers.sql
│  └─ schema.yml
└─ macros/
   ├─ columns/
   │  ├─ select_common_columns.sql
   │  ├─ policy_columns.sql
   │  ├─ yaml_columns.sql
   │  └─ discover_columns.sql
   └─ utilities.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the &lt;code&gt;macros/columns&lt;/code&gt; folder; we’ll populate it as we go.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1 — The simplest win: variables → columns
&lt;/h2&gt;

&lt;p&gt;Put repeatable column names in &lt;code&gt;dbt_project.yml&lt;/code&gt; as vars.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;dbt_project.yml&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my_project&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1.0.0&lt;/span&gt;
&lt;span class="na"&gt;config-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;

&lt;span class="na"&gt;vars&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;common_columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;id&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;created_at&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;updated_at&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;models/staging/stg_customers.sql&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt;
  &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="s1"&gt;', '&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'common_columns'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;}},&lt;/span&gt;
  &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;last_name&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'app'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'customers'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt; lightweight reuse without custom logic.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2 — Macro‑driven SELECT lists (with prefixes/aliases)
&lt;/h2&gt;

&lt;p&gt;Centralize the column list in a macro so you can pass a table alias and keep your SQL tidy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;macros/columns/select_common_columns.sql&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;macro&lt;/span&gt; &lt;span class="n"&gt;select_common_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;alias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endmacro&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;models/marts/customers.sql&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_customers'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt;
  &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;select_common_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'b.'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}},&lt;/span&gt;
  &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_name&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it’s great:&lt;/strong&gt; consistent, alias‑safe, easy to extend.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3 — Policy columns: apply consistent transforms
&lt;/h2&gt;

&lt;p&gt;Wrap transforms (PII handling, normalisation) in a macro and reuse them everywhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;macros/columns/policy_columns.sql&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;macro&lt;/span&gt; &lt;span class="n"&gt;customer_policy_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;alias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;({{&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="n"&gt;last_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endmacro&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;models/marts/customers.sql&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt;
  &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;customer_policy_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'b.'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_customers'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pattern:&lt;/strong&gt; treat a column template like a “policy” you can apply repeatedly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4 — YAML‑driven column generation
&lt;/h2&gt;

&lt;p&gt;Keep column specs as metadata in &lt;code&gt;schema.yml&lt;/code&gt; and generate the SELECT list from it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;models/schema.yml&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stg_customers&lt;/span&gt;
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;id&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;email&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;not_null&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;meta&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;transform&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lower({col})"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;first_name&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;last_name&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;created_at&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;updated_at&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;macros/columns/yaml_columns.sql&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;#&lt;/span&gt;
  &lt;span class="n"&gt;Generate&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;list&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="n"&gt;YAML&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
  &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="k"&gt;Reads&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;defined&lt;/span&gt; &lt;span class="k"&gt;under&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="n"&gt;provided&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
  &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Supports&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;per&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;column&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="nv"&gt;`{col}`&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="n"&gt;replaced&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="k"&gt;column&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;macro&lt;/span&gt; &lt;span class="n"&gt;yaml_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;Fail&lt;/span&gt; &lt;span class="n"&gt;early&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="n"&gt;wrong&lt;/span&gt; &lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;raise_compiler_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'yaml_columns: model '&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="s1"&gt;' not found in graph'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endif&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;rendered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="k"&gt;transform&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'transform'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;none&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="k"&gt;transform&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;expr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'{col}'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="s1"&gt;' as '&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;expr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endif&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="n"&gt;rendered&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endfor&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;rendered&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;',&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;  '&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endmacro&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Usage&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt;
  &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;yaml_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'model.stg_customers'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'s.'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_customers'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;graph&lt;/code&gt; object is available at compile time and includes YAML column metadata.&lt;/li&gt;
&lt;li&gt;Use the fully qualified node name: &lt;code&gt;model.&amp;lt;project_name&amp;gt;.&amp;lt;model_name&amp;gt;&lt;/code&gt; for models and &lt;code&gt;source.&amp;lt;project_name&amp;gt;.&amp;lt;source_name&amp;gt;.&amp;lt;table_name&amp;gt;&lt;/code&gt; for sources. The &lt;code&gt;project_name&lt;/code&gt; context variable gives you the current project's name.&lt;/li&gt;
&lt;/ul&gt;
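
&lt;p&gt;For reference, with the &lt;code&gt;schema.yml&lt;/code&gt; above this usage compiles to roughly the following (&lt;code&gt;my_schema&lt;/code&gt; is a placeholder for your target schema; only &lt;code&gt;email&lt;/code&gt; has a transform, so only it gets an explicit alias):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select
  s.id,
  lower(s.email) as email,
  s.first_name,
  s.last_name,
  s.created_at,
  s.updated_at
from my_schema.stg_customers as s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;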




&lt;h2&gt;
  
  
  Step 5 — Discover columns dynamically from the warehouse
&lt;/h2&gt;

&lt;p&gt;Sometimes you need to reflect the actual schema at compile time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;macros/columns/discover_columns.sql&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;#&lt;/span&gt;
  &lt;span class="n"&gt;Gets&lt;/span&gt; &lt;span class="k"&gt;column&lt;/span&gt; &lt;span class="k"&gt;names&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;relation&lt;/span&gt; &lt;span class="k"&gt;at&lt;/span&gt; &lt;span class="n"&gt;compile&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;optionally&lt;/span&gt; &lt;span class="n"&gt;maps&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;filters&lt;/span&gt; &lt;span class="n"&gt;them&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;macro&lt;/span&gt; &lt;span class="n"&gt;discovered_columns_from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;include&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="n"&gt;exclude&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;adapter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_columns_in_relation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="k"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attribute&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'name'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;list&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;Apply&lt;/span&gt; &lt;span class="n"&gt;include&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="n"&gt;provided&lt;/span&gt; &lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="n"&gt;include&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;include&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="k"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;names&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'in'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;include&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;list&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endif&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="k"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;names&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;reject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'in'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exclude&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;list&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endif&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;names&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'lower'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'string'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'trim'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'regex_replace'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'^'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;',&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;  '&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endmacro&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Usage&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_customers'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt;
  &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;discovered_columns_from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exclude&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'_load_dt'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'c.'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt; quickly mirror upstream schema, or build pass‑through layers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Caveat:&lt;/strong&gt; &lt;code&gt;adapter.get_columns_in_relation&lt;/code&gt; runs at compile time; if the relation doesn’t exist yet (e.g., first run), create it once or fall back to a known list.&lt;/p&gt;
&lt;/blockquote&gt;
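
&lt;p&gt;One way to guard against a missing relation is to check for it before introspecting. A sketch (the fallback column list here is a placeholder — substitute the columns you actually expect):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{% set rel = ref('stg_customers') %}
{% set existing = adapter.get_relation(
      database=rel.database, schema=rel.schema, identifier=rel.identifier) %}

select
{% if existing is not none %}
  {{ discovered_columns_from(existing, alias='c.') }}
{% else %}
  {# relation not built yet: fall back to a known list #}
  c.id, c.email, c.created_at
{% endif %}
from {{ rel }} as c
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;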




&lt;h2&gt;
  
  
  Step 6 — Column mapping &amp;amp; renaming at scale
&lt;/h2&gt;

&lt;p&gt;Create a mapping to rename raw columns to canonical names, with optional transforms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;macros/utilities.sql&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;macro&lt;/span&gt; &lt;span class="n"&gt;render_mapping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mapping&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;mapping&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;list&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;dicts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"EMAIL_ADDRESS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"lower({col})"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;mapping&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;expr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'transform'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="s1"&gt;'{col}'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="k"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'{col}'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'from'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="s1"&gt;' as '&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'to'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endfor&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;',&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;  '&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endmacro&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;models/staging/stg_customers.sql&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;mapping&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;"from"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"CUSTOMER_ID"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;"from"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"EMAIL_ADDRESS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"transform"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"lower({col})"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;"from"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"FIRST_NAME"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"first_name"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;"from"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"LAST_NAME"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"last_name"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;"from"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"CREATED_TS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;"from"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"UPDATED_TS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"updated_at"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;select&lt;/span&gt;
  &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;render_mapping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mapping&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'r.'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'app'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'raw_customers'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Benefit:&lt;/strong&gt; the mapping doubles as an auditable record of how raw columns map to curated names.&lt;/p&gt;
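
&lt;p&gt;For reference, the model above compiles to roughly the following (the exact relation name in the &lt;code&gt;from&lt;/code&gt; clause depends on your source configuration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select
  r.CUSTOMER_ID as id,
  lower(r.EMAIL_ADDRESS) as email,
  r.FIRST_NAME as first_name,
  r.LAST_NAME as last_name,
  r.CREATED_TS as created_at,
  r.UPDATED_TS as updated_at
from raw_db.app.raw_customers as r
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;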




&lt;h2&gt;
  
  
  Step 7 — Environment / source specific templates
&lt;/h2&gt;

&lt;p&gt;Use vars or &lt;code&gt;target.name&lt;/code&gt; to toggle column sets across environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;macros/columns/select_common_columns.sql&lt;/strong&gt; (extended)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;macro&lt;/span&gt; &lt;span class="n"&gt;select_common_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;alias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'prod'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'staging'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="n"&gt;ingested_at&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endif&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endmacro&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; For source‑specific logic, branch on &lt;code&gt;this.name&lt;/code&gt; / &lt;code&gt;this.schema&lt;/code&gt; or pass a flag into the macro.&lt;/p&gt;
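&lt;p&gt;To sanity‑check the branching, here is a small Python stand‑in (not dbt itself; &lt;code&gt;target_name&lt;/code&gt; mimics &lt;code&gt;target.name&lt;/code&gt; and the column lists are illustrative) that mirrors what the macro renders per environment:&lt;/p&gt;

```python
# Python stand-in for the select_common_columns macro: prod/staging
# targets get the extra ingested_at column, other targets the base set.
# target_name mimics dbt's target.name; all names are illustrative.
def select_common_columns(target_name, alias=""):
    base = ["id", "created_at", "updated_at"]
    if target_name in ("prod", "staging"):
        base = base + ["ingested_at"]
    return ",\n".join(alias + col for col in base)

print(select_common_columns("prod", alias="b."))
# b.id,
# b.created_at,
# b.updated_at,
# b.ingested_at
print(select_common_columns("dev", alias="b."))
```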




&lt;h2&gt;
  
  
  Step 8 — Tests &amp;amp; docs that follow your templates
&lt;/h2&gt;

&lt;p&gt;When you template columns, also template the &lt;strong&gt;tests&lt;/strong&gt; and &lt;strong&gt;descriptions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;models/schema.yml&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;customers&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Curated customers with standardized email and timestamps.&lt;/span&gt;
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;id&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;unique&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;not_null&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;email&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Lower‑cased email address.&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;accepted_values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@'&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# example; replace with proper regex test via package&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;created_at&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;not_null&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Docs blocks&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="n"&gt;email_col&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;Email&lt;/span&gt; &lt;span class="n"&gt;stored&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lowercase&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;deduplication&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;matching&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;enddocs&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reference the doc block in YAML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;email&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;doc('email_col')&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; &lt;code&gt;dbt docs generate&lt;/code&gt; mirrors your templated columns with accurate docs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 9 — Guardrails, debugging, and CI
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Render‑only checks:&lt;/strong&gt; &lt;code&gt;dbt compile&lt;/code&gt; to verify generated SQL before running.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preview macro output:&lt;/strong&gt; use &lt;code&gt;{% do log(your_macro(), info=True) %}&lt;/code&gt; temporarily.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unit‑like tests:&lt;/strong&gt; for complex macros, add models that snapshot the macro output and assert against expected fixtures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI:&lt;/strong&gt; run &lt;code&gt;dbt deps &amp;amp;&amp;amp; dbt compile &amp;amp;&amp;amp; dbt run --select state:modified+ &amp;amp;&amp;amp; dbt test&lt;/code&gt; on PRs.&lt;/li&gt;
&lt;/ul&gt;
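&lt;p&gt;The "unit‑like tests" idea can be sketched in plain Python: render the template logic and compare it against a frozen fixture, the same way you would diff compiled SQL (the macro body here is a Python stand‑in, not dbt itself; names are illustrative):&lt;/p&gt;

```python
# Fixture-style check: render the column template and assert it matches
# a frozen expected string, mimicking a snapshot test on compiled SQL.
# The macro body is a Python stand-in; names are illustrative.
def user_policy_columns(alias=""):
    return ",\n".join([
        alias + "user_id",
        "lower(" + alias + "email) as email",
        alias + "created_at",
        alias + "updated_at",
    ])

EXPECTED = (
    "b.user_id,\n"
    "lower(b.email) as email,\n"
    "b.created_at,\n"
    "b.updated_at"
)
assert user_policy_columns(alias="b.") == EXPECTED
print("fixture check passed")
```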




&lt;h2&gt;
  
  
  Step 10 — Production proofing &amp;amp; performance
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Prefer static column lists for business‑critical marts; reserve runtime discovery for staging.&lt;/li&gt;
&lt;li&gt;Keep macros &lt;strong&gt;pure&lt;/strong&gt; (deterministic) where possible; avoid warehouse calls in hot paths.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;CTEs&lt;/strong&gt; to keep the final SELECT flat and easy to debug.&lt;/li&gt;
&lt;li&gt;Centralize transforms in a few macros to minimize surface area for change.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Reusable snippets (copy/paste)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A. Policy column template
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;macro&lt;/span&gt; &lt;span class="n"&gt;user_policy_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;alias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;({{&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endmacro&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  B. YAML‑driven generator
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;macro&lt;/span&gt; &lt;span class="n"&gt;yaml_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;raise_compiler_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'unknown model: '&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endif&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'transform'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;none&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;expr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="s1"&gt;'{col}'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="k"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'{col}'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="s1"&gt;' as '&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endfor&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;',&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;  '&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endmacro&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
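&lt;p&gt;The &lt;code&gt;{col}&lt;/code&gt; substitution is easy to verify in isolation. This Python stand‑in mirrors the generator's loop (the column metadata dict is illustrative; in dbt it comes from &lt;code&gt;node.columns&lt;/code&gt;):&lt;/p&gt;

```python
# Mirrors the yaml_columns macro: each column may carry a meta
# 'transform' template with a {col} placeholder; default is passthrough.
# The dict stands in for node.columns; the data is illustrative.
def yaml_columns(columns, alias=""):
    items = []
    for name, meta in columns.items():
        template = meta.get("transform") or "{col}"
        expr = template.replace("{col}", alias + name)
        items.append(expr + " as " + name)
    return ",\n  ".join(items)

cols = {
    "id": {},
    "email": {"transform": "lower({col})"},
    "created_at": {},
}
print(yaml_columns(cols, alias="b."))
# b.id as id,
#   lower(b.email) as email,
#   b.created_at as created_at
```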



&lt;h3&gt;
  
  
  C. Discovery‑based generator
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;macro&lt;/span&gt; &lt;span class="n"&gt;discovered_columns_from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;include&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="n"&gt;exclude&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;adapter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_columns_in_relation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="k"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attribute&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'name'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;list&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="n"&gt;include&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="k"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;names&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'in'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;include&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;list&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endif&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="k"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;names&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;reject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'in'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exclude&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;list&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endif&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;names&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'lower'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'string'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'trim'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'regex_replace'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'^'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;',&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;  '&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endmacro&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
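&lt;p&gt;The include/exclude/alias pipeline behaves like this Python stand‑in (the discovered names list replaces the &lt;code&gt;adapter.get_columns_in_relation&lt;/code&gt; call; the data is illustrative):&lt;/p&gt;

```python
# Mirrors discovered_columns_from: optionally filter the discovered
# column names, normalize them, then prefix each with an alias.
# The names list stands in for adapter.get_columns_in_relation.
def discovered_columns(names, include=None, exclude=None, alias=""):
    if include:
        names = [n for n in names if n in include]
    if exclude:
        names = [n for n in names if n not in exclude]
    return ",\n  ".join(alias + n.lower().strip() for n in names)

discovered = ["ID", "EMAIL", "SSN", "CREATED_AT"]
print(discovered_columns(discovered, exclude=["SSN"], alias="b."))
# b.id,
#   b.email,
#   b.created_at
```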






&lt;h2&gt;
  
  
  Putting it all together
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;models/marts/customers.sql&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_customers'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="k"&gt;final&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;user_policy_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'b.'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
  &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="k"&gt;final&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This compiles to a clean, consistent SELECT with all policy rules applied.&lt;/p&gt;




&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Macro not found:&lt;/strong&gt; ensure the file is in &lt;code&gt;macros/&lt;/code&gt; and the macro name matches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KeyError on &lt;code&gt;graph.nodes[...]&lt;/code&gt;:&lt;/strong&gt; use the fully qualified node name (&lt;code&gt;model.&amp;lt;project&amp;gt;.&amp;lt;model&amp;gt;&lt;/code&gt; or &lt;code&gt;source.&amp;lt;project&amp;gt;.&amp;lt;source&amp;gt;.&amp;lt;table&amp;gt;&lt;/code&gt;), and confirm the YAML exists.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relation not found (discovery):&lt;/strong&gt; run once to create the relation, or guard with a fallback list:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'possibly_missing'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;none&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;discovered_columns_from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endif&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Checklist for your PR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Columns are defined via macro(s) or YAML metadata&lt;/li&gt;
&lt;li&gt;[ ] Transforms are centralized (policy macros)&lt;/li&gt;
&lt;li&gt;[ ] Tests &amp;amp; docs reference the canonical column names&lt;/li&gt;
&lt;li&gt;[ ] &lt;code&gt;dbt compile&lt;/code&gt; output looks correct and readable&lt;/li&gt;
&lt;li&gt;[ ] CI runs &lt;code&gt;deps&lt;/code&gt;, &lt;code&gt;compile&lt;/code&gt;, &lt;code&gt;run&lt;/code&gt;, and &lt;code&gt;test&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Where to go next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Extend the YAML driver to support &lt;strong&gt;type casting&lt;/strong&gt;, &lt;strong&gt;default values&lt;/strong&gt;, or &lt;strong&gt;PII tags&lt;/strong&gt; via &lt;code&gt;meta&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Build &lt;strong&gt;package‑style macros&lt;/strong&gt; so multiple projects can share your policies.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;exposures&lt;/strong&gt; to tie templated columns to downstream assets.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy templating! 🧩&lt;/p&gt;

</description>
      <category>dbt</category>
      <category>datatransformation</category>
      <category>datamodelling</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
