I Built a 278,000-Page Website From Federal Data for $5/Month

Artem Akulov — Mon, 13 Apr 2026 13:01:23 +0000

The US government collects insane amounts of data about every ZIP code. Water violations, lead levels, radon zones, flood claims, wildfire risk, bridge conditions, air quality. All public. Also scattered across dozens of federal agencies in formats that make you want to close the browser.

I spent the last few months stitching it all together into one site. ZipCheckup. Type in a ZIP code, get a full home safety profile. 278,000 pages. 50+ data sources. Running on $5/month.

Let me walk you through the build. Including the parts where I broke everything.

What you actually get

One page per ZIP code. Water quality, natural hazards, infrastructure age, health data, remediation costs. Each section cites its federal sources with direct links. EPA, FEMA, CDC, Census, USGS.

There are also 22 tools (water filter finder, lead risk calculator, climate risk projector), 6 maps, rankings, and comparison pages. All generated from the same data pipeline.

No opinions, no sponsored content. Just data that was already public but practically impossible to find.

The stack (this is the fun part)

It runs on almost nothing:

Node.js + EJS templates. One render script. Reads JSON, spits out HTML. 875 pages per second.
Cloudflare R2 + Workers. Storage, routing, API, OG image generation. $5/month for the Workers paid plan. That's the entire hosting bill.
GitHub Actions on a self-hosted runner. 17 workflows, 12 blocking validators. If smoke tests fail, it auto-rollbacks.
21 cron jobs pulling fresh data from federal APIs.
A Telegram bot that sends me daily briefs and lets me rollback deploys from my phone.

No database. No React. No backend framework. Templates, JSON files, and a very aggressive CI pipeline.

The data pipeline (where 80% of the work lives)

Each source has its own special brand of pain.

EPA SDWIS gives you water violations but won't map cleanly to ZIP codes. FEMA NFIP has 2.7 million flood claims but uses a completely different geographic scheme. Census ACS uses ZCTAs (which look like ZIP codes but aren't). CDC blood lead data covers 36 states. Just 36. And Consumer Confidence Reports? Those are PDFs. About 1,600 of them. I parse them with a 3-tier pipeline: regex first, then pdf-parse + LLM, then OCR + LLM for the ones that really don't want to be parsed.

Everything gets normalized, crosswalked to ZIPs, merged into per-ZIP JSON files. Then derived metrics on top: a composite Home Safety Score (A through F), remediation cost estimates, an "equity trap" ratio (what it costs to fix your home vs. what it's worth).

50+ sources. 90+ scripts. 41,000 ZIP reports. A lot of plumbing. (Pun intended.)

How I actually built this

Honestly? I built most of it with Claude Code (Anthropic's AI coding tool). Not as a gimmick. As a core workflow.

I'd write up tasks before bed. "Parse these 200 CCR PDFs." "Generate county-level pages for all 3,100 US counties." "Build a water filter recommendation engine from contaminant data." Wake up, review the work, fix what needed fixing, push it.

For bigger decisions (should we add earthquake data? which monetization model?) I'd run what I call a "council." Multiple AI agents playing different expert roles (data scientist, growth strategist, revenue realist, devil's advocate) arguing it out in parallel. Sounds bizarre. Works surprisingly well when you're a solo dev with nobody to bounce ideas off of.

The unsexy truth: maybe 70% of the AI-generated code needed adjustments. Templates missing sections. Parsers handling 90% of edge cases and silently failing on the rest. The value wasn't "AI writes perfect code." It was "AI does the structural work at 3 AM so I can focus on architecture and data quality during the day."

The failures

Now for the good part.

The 186K deletion. One night I ran git add -A instead of adding specific files. Committed. Pushed. CI happily deployed a version with 186,000 pages missing. That's now a hardcoded rule in my dev guidelines: never git add -A on a repo with 278K generated files. Ever.

The JSON double-encoding bug. 41,000 ZIP pages rendering with garbled data. Escaped quotes in the HTML, broken characters everywhere. Data was getting JSON-encoded twice somewhere in the pipeline. Took forever to find because the output looked almost right if you squinted.

The Hugo wall. Started with Hugo. Worked great until ~60,000 pages, then just... choked. Build times went from minutes to "is it still running?" Migrated to a custom Node.js renderer. Hugo was doing 40+ minutes for a fraction of what Node.js now builds in 5 minutes.

The MacBook that gave up. Rendered the full site on my laptop for weeks. 278,000 pages × data lookups = out of memory, crash, start over. Eventually moved the build to a dedicated server (64GB RAM). The MacBook hasn't crashed since.

Four weeks in

Site went live March 15, 2026.

Week	Google Impressions	Clicks/day
1	24	0
2	7,322	9
3	13,959	17
4	23,850	17

7,686 pages ranking in Google already. That's 2.8% of the total. The other 97% is still waiting. No ads, no link building campaigns. Just structured data with proper schema markup.

Will it keep growing or is this just Google's honeymoon effect for new domains? Ask me in two months.

Open data

The full dataset is open under CC BY 4.0:

GitHub: artakulov/us-water-quality-data
HuggingFace & Kaggle: auto-synced weekly
API: api.zipcheckup.com/v1/ (6 endpoints, free)
npm: us-water-quality-data

If you want to build something with US environmental data, go for it. PRs welcome.

What's next

Revenue right now: $0. Not a single affiliate link on the site yet. Building first, monetizing second. Next up: more state-level data, better tool UX, and eventually some form of affiliate/referral revenue.

Honestly, I don't know if this turns into a real business or stays a very elaborate side project. But the data is there, the pages are ranking, and I'm learning a ton about federal APIs that I never expected to care about.

If you want to check your ZIP: zipcheckup.com

Happy to talk architecture, data pipeline, or any specific federal API in the comments.

Some sections of this article were edited with AI assistance for clarity.

What I learned processing 50+ federal data sources with Node.js

Artem Akulov — Sun, 05 Apr 2026 11:47:23 +0000

I've been working on a project that pulls environmental data from federal agencies — EPA, FEMA, USGS, CDC, Census, and about 45 others.

Some things I ran into that might save you time:

Federal APIs are wild

No two agencies use the same format. EPA gives you XML. USGS gives you tab-separated files from the 90s. FEMA has a decent REST API but paginated at 1,000 records. Census has its own query language.

I ended up writing a separate parser for each source. No universal adapter worked.

Streaming EJS at scale

I needed to render ~280K static HTML pages from templates. The first approach (one EJS render per file, sequential) took 14 hours. Switched to streaming writes with a worker pool — got it down to ~5 minutes. The bottleneck was always disk I/O, not rendering.

Cloudflare R2 is underrated for static sites

Hosting 280K HTML files on traditional hosting is painful. R2 + Workers turned out to be perfect — no file count limits, edge-cached globally, and the free tier covers a lot.

Rate limits will find you

Every federal API has different rate limits, and most don't document them well. EPA ECHO silently returns empty results after ~300 requests/minute. USGS returns 503s. I ended up building a generic retry queue with exponential backoff that handles all of them.

Would love to hear from others who've worked with government data APIs. What's the worst format you've had to parse?

Forem: Artem Akulov