Forem: Matthew Hefferon

The story behind our AI Dataset Generator

Matthew Hefferon — Wed, 16 Jul 2025 06:00:00 +0000

At Metabase, I often need fake data to demo new features. I found myself digging through Kaggle, feeling uninspired, and wasting a lot of time searching. So I built a little tool to help me generate datasets and decided to open source it.

It ended up hitting the front page of Hacker News, got 600+ stars on GitHub, received contributions from a YC-backed startup, and was picked up by the TLDR newsletter.

Why not Kaggle or ChatGPT

As mentioned above, I was feeling very uninspired by Kaggle datasets and kept turning to ChatGPT to generate fake data. I'd ask for something, get results back, visualize it, and spot issues: bar charts all the same height, growth trends going the wrong way, not enough variation, and so on. I found myself repeating that cycle and thought… maybe there's a better way.

What I actually did

Since I'd already been writing prompts and had some experience, I figured, why not turn that process into a simple tool? So I converted my prompt inputs into a few dropdowns:

Business type
Row count
Single or multi-table schema
Date range
Growth pattern
Variation and granularity

You hit "Preview Data" and get back a sample schema and 10 rows of data. If it looks good, you can export the full dataset as CSV or SQL, or launch Metabase to explore it.

How it works

Step 1: How schema generation works

When you hit "Preview Data," the app sends a prompt to your selected LLM provider (OpenAI, Anthropic, or Google) via LiteLLM. The prompt is tailored to the business type, and the model returns a JSON spec defining the tables, fields, relationships, and logic. Think of it as a blueprint for a believable dataset.
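To make the "blueprint" idea concrete, here is a minimal sketch of what such a spec could look like and one sanity check you might run on it. The spec shape, the field names, and the `broken_references` helper are all made up for illustration; the tool's real JSON format may look quite different.

```python
# Hypothetical spec shape, for illustration only. The real tool's JSON may differ.
EXAMPLE_SPEC = {
    "business_type": "SaaS",
    "tables": [
        {
            "name": "customers",
            "fields": [
                {"name": "customer_id", "type": "id"},
                {"name": "plan", "type": "choice", "values": ["free", "pro", "enterprise"]},
            ],
        },
        {
            "name": "subscriptions",
            "fields": [
                {"name": "subscription_id", "type": "id"},
                {"name": "customer_id", "type": "foreign_key",
                 "references": "customers.customer_id"},
                {"name": "started_at", "type": "date"},
            ],
        },
    ],
}


def broken_references(spec: dict) -> list:
    """Return every 'references' target that does not name a real table.field."""
    known = {
        f"{table['name']}.{field['name']}"
        for table in spec["tables"]
        for field in table["fields"]
    }
    return [
        field["references"]
        for table in spec["tables"]
        for field in table["fields"]
        if "references" in field and field["references"] not in known
    ]


print(broken_references(EXAMPLE_SPEC))
```

A check like this is cheap to run before any rows are generated, which matters when the spec comes from an LLM and cannot be trusted blindly.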

Originally, I was just generating the schema with ChatGPT. But after a few folks on Hacker News mentioned it'd be cool to switch models, we got an awesome PR that added LiteLLM support, so now you can swap between providers easily. Thanks for the contribution @manueltarouca!

Step 2: Rows are generated locally by the DataFactory

I originally had the LLM generate all the rows, but it was painfully slow, even for 100 rows. I tried splitting the job into batches, but that introduced new issues. For example, a user ID might be 001, 002, 003 in the first batch and something like u099, u100 in the second.

So I took a step back and had a deep discussion with Cursor. I needed something fast, more realistic, and cheaper to run. After some back and forth, I decided to build the DataFactory. It generates data locally using Faker.js and applies the schema + simulation rules from the LLM. It also enforces logic like:

Realistic SaaS churn and pricing plans
E-commerce subtotals, tax, and shipping that actually add up
Healthcare claims where payouts never exceed procedure costs
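
The real DataFactory is built on Faker.js, so the following is not its code. It is a small Python sketch of the same idea: derive dependent values from their parts instead of generating each column independently. The tax rate, shipping fee, price ranges, and function names are all assumptions made for the example.

```python
import random

TAX_RATE = 0.08        # assumed flat tax rate, illustration only
SHIPPING_CENTS = 500   # assumed flat shipping fee, in cents


def make_order(rng: random.Random) -> dict:
    """Build an order whose total is computed from its parts (integer cents)."""
    quantity = rng.randint(1, 5)
    unit_price_cents = rng.randint(500, 20_000)
    subtotal = quantity * unit_price_cents
    tax = round(subtotal * TAX_RATE)
    return {
        "quantity": quantity,
        "unit_price_cents": unit_price_cents,
        "subtotal_cents": subtotal,
        "tax_cents": tax,
        "shipping_cents": SHIPPING_CENTS,
        # Derived, never sampled, so the row always adds up.
        "total_cents": subtotal + tax + SHIPPING_CENTS,
    }


def make_claim(rng: random.Random) -> dict:
    """Build a healthcare claim whose payout is capped at the procedure cost."""
    procedure_cost = rng.randint(10_000, 500_000)
    coverage = rng.uniform(0.5, 1.0)  # assumed share covered by the insurer
    payout = min(procedure_cost, round(procedure_cost * coverage))
    return {"procedure_cost_cents": procedure_cost, "payout_cents": payout}


rng = random.Random(42)  # seeded so repeated runs produce the same rows
print(make_order(rng))
print(make_claim(rng))
```

Working in integer cents sidesteps floating-point rounding, so "subtotal + tax + shipping = total" holds exactly for every row.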

Step 3: Performance and cost

By splitting it into two phases, the tool stays fast and surprisingly cheap. Schema generation is the only part that hits the LLM, and I wanted to make sure it wouldn't lead me to bankruptcy. So I added token tracking and ran the numbers using a super advanced formula:

total_tokens × cost_per_token = ???

Turns out… not that bad. Most previews come in around $0.03-$0.05 with GPT-4o. After that, it's all free. No extra API calls, just pure, 100%, Grade A data.
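
For anyone who wants to run the numbers themselves, here is a slightly less "super advanced" sketch of the formula. Providers generally price input and output tokens differently, so it takes two rates. The token counts and per-million prices below are placeholders, not measurements from the tool; check your provider's current pricing.

```python
def estimate_cost(
    prompt_tokens: int,
    completion_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimate the dollar cost of one LLM call from its token usage."""
    total = (
        prompt_tokens * input_price_per_million
        + completion_tokens * output_price_per_million
    )
    return total / 1_000_000


# Placeholder numbers for illustration only.
cost = estimate_cost(
    prompt_tokens=4_000,
    completion_tokens=2_500,
    input_price_per_million=2.50,
    output_price_per_million=10.00,
)
print(f"${cost:.3f}")
```

With these made-up inputs the estimate comes out to 3.5 cents for a single preview.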

Try it yourself + contribute

It's still early, so it's not bulletproof. But if you need quick, realistic datasets, give it a try. Everything runs locally with Docker, and all you need is an API key from your favorite LLM provider to get started.

If you want to contribute, there's plenty of room to jump in:

Add new business types or tweak existing ones
Improve schema logic or simulation rules
Add your awesome feature here

The groundwork is already there. If you've got ideas, I'd love your help taking it further. Star it, fork it, or open a PR on GitHub.

Making Hacker News Jobs Searchable Using GPT + Metabase

Matthew Hefferon — Wed, 21 May 2025 16:33:00 +0000

The problem

Every month, Hacker News hosts a wildly popular thread: “Ask HN: Who is hiring?” It’s a treasure trove of job opportunities, but scrolling through endless walls of text to find your dream role feels like searching for a needle in a haystack.

The solution

I vibe coded a little project with Cursor that turns those Hacker News job posts into clean, searchable data using OpenAI for parsing, PostgreSQL for storage, and Metabase for visualizations. The Holy Trinity of data wrangling :)

What it does

Fetches “Ask HN: Who is hiring?” threads via the Hacker News API
Uses GPT to extract fields like company, role, location, salary, and contact
Stores it all in a PostgreSQL database
Spins up Metabase so you can search, filter, and explore

How it works

Fetch the thread
The script pulls a specific Hacker News thread using its ID (you can grab this from the URL of any “Ask HN: Who is hiring?” post).
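
As a rough sketch of this step (not the project's actual script), the public Hacker News API serves each item as JSON at `/v0/item/<id>.json`, and a thread's top-level comments are listed under `kids`. The function names here are hypothetical, and the `fetch` parameter exists only so the logic can be tried without network access.

```python
import json
from typing import Callable, Optional
from urllib.request import urlopen

HN_ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{item_id}.json"


def fetch_item(item_id: int) -> Optional[dict]:
    """Fetch one item (story or comment); the API returns null for unknown IDs."""
    with urlopen(HN_ITEM_URL.format(item_id=item_id), timeout=10) as resp:
        return json.load(resp)


def fetch_job_comments(
    thread_id: int,
    fetch: Callable[[int], Optional[dict]] = fetch_item,
) -> list:
    """Return the thread's top-level comments, skipping deleted or empty ones."""
    thread = fetch(thread_id) or {}
    comments = []
    for kid_id in thread.get("kids", []):
        item = fetch(kid_id)
        if not item or item.get("deleted") or item.get("dead") or not item.get("text"):
            continue
        comments.append(item)
    return comments
```

Calling `fetch_job_comments(<thread id>)` makes one request per comment, which is slow for a large thread; batching or concurrency would be a natural next step.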

Parse comments with GPT
I started with regex, but the unstructured formatting in each comment made it impossible to get clean, reliable data. So I switched to GPT. Each comment is passed into OpenAI using a structured prompt that extracts the relevant fields in JSON. Here’s the core of the prompt:

OPENAI_PROMPT = (
    "You are a structured data parser for Hacker News job posts. "
    "Extract the following fields as plain strings (no quotes, arrays, or brackets unless necessary):\n"
    "- company: the name of the hiring company\n"
    "- role: the job title or position name\n"
    "- location: city/state/country or 'Remote' if applicable\n"
    "- salary: salary range or note (e.g. '$120k–$150k', 'Competitive', etc.)\n"
    "- contact: email address or direct application link (cleaned, no obfuscation like [at] or [dot])\n"
    "- description: a cleaned-up version of the full job post, useful for search\n\n"
    "Requirements:\n"
    "- Output a flat JSON object using the keys above\n"
    "- If any field is missing or not available, use null (not 'n/a', 'none', or empty string)\n"
    "- Do not include any markdown, HTML, or formatting characters\n"
    "- Fix obfuscated emails (e.g. convert 'john [at] domain [dot] com' to 'john@domain.com')\n"
    "- Do not include any commentary or explanation, only output the JSON object.\n"
    "- No trailing commas in the JSON.\n\n"
    "Job post:\n"
    '"""{job_text}"""'
)
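
The prompt has a single `{job_text}` placeholder, so it can be filled with `OPENAI_PROMPT.format(job_text=comment_text)` before being sent to the model. What comes back still deserves some suspicion. Below is a minimal sketch of defensive parsing for the reply; `parse_job_reply` and `FIELDS` are hypothetical names, and the project may well handle this differently.

```python
import json

# The keys the prompt asks the model to return.
FIELDS = ("company", "role", "location", "salary", "contact", "description")


def parse_job_reply(raw: str) -> dict:
    """Turn a model reply into a dict with exactly the expected keys.

    Tolerates stray text around the JSON object, drops unexpected keys,
    and maps missing or blank values to None.
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(raw[start:end + 1])

    job = {}
    for key in FIELDS:
        value = data.get(key)
        if isinstance(value, str):
            value = value.strip() or None  # blank strings become None
        job[key] = value
    return job


reply = '{"company": "Acme", "role": "Data Engineer", "location": "Remote", "salary": null}'
print(parse_job_reply(reply))
```

Normalizing to a fixed set of keys means every row has the same shape when it reaches the database, even when the model skips a field.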

Store in PostgreSQL
Parsed results are saved in a jobs table:

CREATE SCHEMA IF NOT EXISTS hn;

CREATE TABLE hn.jobs (
  hn_comment_id bigint primary key,
  company text,
  role text,
  location text,
  salary text,
  contact text,
  description text,
  posted_at timestamp with time zone,
  created_at timestamp with time zone default now(),
  updated_at timestamp with time zone default now()
);
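
Since `hn_comment_id` is the primary key, re-running the script on the same thread would collide with existing rows. One way around that, sketched below, is an upsert with `ON CONFLICT`. This is an assumption about how you might write the insert, not the project's confirmed approach; the `%s` placeholders follow the psycopg style, and `job_params` is a hypothetical helper.

```python
from datetime import datetime, timezone

# Upsert keyed on the comment ID so re-running the script refreshes rows
# instead of failing on the primary key.
UPSERT_SQL = """
INSERT INTO hn.jobs
  (hn_comment_id, company, role, location, salary, contact, description, posted_at)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
ON CONFLICT (hn_comment_id) DO UPDATE SET
  company = EXCLUDED.company,
  role = EXCLUDED.role,
  location = EXCLUDED.location,
  salary = EXCLUDED.salary,
  contact = EXCLUDED.contact,
  description = EXCLUDED.description,
  posted_at = EXCLUDED.posted_at,
  updated_at = now();
"""


def job_params(comment: dict, job: dict) -> tuple:
    """Build the parameter tuple for UPSERT_SQL.

    `comment` is the raw Hacker News item (its `time` is Unix seconds);
    `job` is the dict of fields extracted by the model.
    """
    posted_at = datetime.fromtimestamp(comment["time"], tz=timezone.utc)
    return (
        comment["id"],
        job.get("company"),
        job.get("role"),
        job.get("location"),
        job.get("salary"),
        job.get("contact"),
        job.get("description"),
        posted_at,
    )

# With a psycopg cursor this would be used as:
#   cursor.execute(UPSERT_SQL, job_params(comment, job))
```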

Explore in Metabase

Metabase connects directly to the Postgres database and gives you a clean UI to explore the data.

I made a little table that lets you:

Filter by company, role, location, and salary
Link back to the original thread by its ID

Potential enhancements

Here are a few ideas to take this project to the next level:

Automate monthly updates: Use GitHub Actions to fetch and parse new “Ask HN: Who is hiring?” threads automatically each month.
Deploy a public dashboard: Share a live Metabase dashboard for anyone to explore job listings without needing to run the code.
Add notification alerts: Set up email or Slack notifications for new job postings that match specific criteria, like role or location.

Take it for a spin and contribute

Check out the full code and detailed setup instructions on GitHub. Whether you’re searching for your next role or simply curious, feel free to clone the repo and try it out. Better yet, consider contributing. If you have a feature idea, a bug fix, or any improvements, submit a pull request or open an issue.

Thanks for reading, and good luck with your job search!