<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Matthew Hefferon</title>
    <description>The latest articles on Forem by Matthew Hefferon (@matthewhefferon).</description>
    <link>https://forem.com/matthewhefferon</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2097724%2F27b5fdb6-edcd-4277-b5d3-b73ba666184b.jpg</url>
      <title>Forem: Matthew Hefferon</title>
      <link>https://forem.com/matthewhefferon</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/matthewhefferon"/>
    <language>en</language>
    <item>
      <title>The story behind our AI Dataset Generator</title>
      <dc:creator>Matthew Hefferon</dc:creator>
      <pubDate>Wed, 16 Jul 2025 06:00:00 +0000</pubDate>
      <link>https://forem.com/matthewhefferon/the-story-behind-our-ai-dataset-generator-53fi</link>
      <guid>https://forem.com/matthewhefferon/the-story-behind-our-ai-dataset-generator-53fi</guid>
      <description>&lt;p&gt;At Metabase, I often need fake data to demo new features. I kept digging through Kaggle, feeling uninspired and wasting a lot of time searching. So I built a little tool to generate datasets for me and decided to open source it.&lt;/p&gt;

&lt;p&gt;It ended up hitting the front page of Hacker News, earning 600+ stars on GitHub, receiving contributions from a YC-backed startup, and getting picked up by the TLDR newsletter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not Kaggle or ChatGPT?
&lt;/h2&gt;

&lt;p&gt;As mentioned above, Kaggle datasets left me uninspired, so I kept turning to ChatGPT to generate fake data. I'd ask for something, get results back, visualize them, and spot issues: bar charts all the same height, growth trends going the wrong way, not enough variation, and so on. I found myself repeating that cycle and thought… maybe there's a better way.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I actually did
&lt;/h2&gt;

&lt;p&gt;Since I'd already been writing prompts and had some experience, I figured, why not turn that process into a simple tool? So I converted my prompt inputs into a few dropdowns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business type&lt;/li&gt;
&lt;li&gt;Row count&lt;/li&gt;
&lt;li&gt;Single or multi-table schema&lt;/li&gt;
&lt;li&gt;Date range&lt;/li&gt;
&lt;li&gt;Growth pattern&lt;/li&gt;
&lt;li&gt;Variation and granularity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You hit "Preview Data" and get back a sample schema and 10 rows of data. If it looks good, you can export a full dataset as CSV, SQL, or launch Metabase to explore it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: How schema generation works
&lt;/h3&gt;

&lt;p&gt;When you hit "Preview Data," the app sends a prompt to your selected LLM provider (OpenAI, Anthropic, or Google) via LiteLLM. The prompt is tailored to the business type, and the model returns a JSON spec defining the tables, fields, relationships, and logic. Think of it as a blueprint for a believable dataset.&lt;/p&gt;
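
&lt;p&gt;To make the "blueprint" idea concrete, here's a rough Python sketch of what such a spec could look like, plus a sanity check that every relationship points at a field that exists. The key names (&lt;code&gt;tables&lt;/code&gt;, &lt;code&gt;fields&lt;/code&gt;, &lt;code&gt;relationships&lt;/code&gt;) and the helper &lt;code&gt;unknown_references&lt;/code&gt; are made up for illustration; the tool's real spec format may look quite different.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

# Hypothetical example of the kind of JSON spec an LLM might return.
# The key names are illustrative assumptions, not the tool's real format.
EXAMPLE_SPEC = json.loads("""
{
  "business_type": "SaaS",
  "tables": [
    {"name": "customers", "fields": ["customer_id", "plan", "signup_date"]},
    {"name": "invoices", "fields": ["invoice_id", "customer_id", "amount"]}
  ],
  "relationships": [
    {"from": "invoices.customer_id", "to": "customers.customer_id"}
  ]
}
""")


def unknown_references(spec):
    """Return relationship endpoints that name a table.field missing from the spec."""
    known = {
        f"{table['name']}.{field}"
        for table in spec["tables"]
        for field in table["fields"]
    }
    endpoints = [
        rel[key] for rel in spec["relationships"] for key in ("from", "to")
    ]
    return [ref for ref in endpoints if ref not in known]


# An empty list means every relationship resolves to a declared field.
print(unknown_references(EXAMPLE_SPEC))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A check like this is cheap to run on every LLM response before any rows are generated.&lt;/p&gt;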

&lt;p&gt;Originally, I was just generating the schema with ChatGPT. But after a few folks on Hacker News mentioned it'd be cool to switch models, we got an awesome &lt;a href="https://github.com/metabase/dataset-generator/pull/6" rel="noopener noreferrer"&gt;PR&lt;/a&gt; that added LiteLLM support, so now you can swap between providers easily. Thanks for the contribution &lt;a href="https://github.com/manueltarouca" rel="noopener noreferrer"&gt;@manueltarouca&lt;/a&gt;!&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Rows are generated locally by the DataFactory
&lt;/h3&gt;

&lt;p&gt;I originally had the LLM generate all the rows, but it was painfully slow, even for 100 rows. I tried splitting the job into batches, but that introduced new issues. For example, a user ID might be &lt;code&gt;001&lt;/code&gt;, &lt;code&gt;002&lt;/code&gt;, &lt;code&gt;003&lt;/code&gt; in the first batch and something like &lt;code&gt;u099&lt;/code&gt;, &lt;code&gt;u100&lt;/code&gt; in the second.&lt;/p&gt;

&lt;p&gt;So I took a step back and had a deep discussion with Cursor. I needed something faster, more realistic, and cheaper to run. After some back and forth, I decided to build the DataFactory. It generates data locally using &lt;a href="https://fakerjs.dev/" rel="noopener noreferrer"&gt;Faker.js&lt;/a&gt; and applies the schema and simulation rules from the LLM. It also enforces logic like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Realistic SaaS churn and pricing plans&lt;/li&gt;
&lt;li&gt;E-commerce subtotals, tax, and shipping that actually add up&lt;/li&gt;
&lt;li&gt;Healthcare claims where payouts never exceed procedure costs&lt;/li&gt;
&lt;/ul&gt;
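
&lt;p&gt;Here's a minimal sketch of the idea behind generating rows locally under rules like these. It is not the DataFactory's actual code: it uses Python's standard &lt;code&gt;random&lt;/code&gt; module in place of Faker.js, and the function name, tax rate, shipping fee, and ID format are all assumptions for illustration. It shows two things: totals that add up because they are computed rather than guessed, and one ID format for every row, which avoids the batch mismatch described earlier.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random


def make_orders(count, seed=42, tax_rate_percent=8, shipping_cents=500):
    """Generate order rows whose totals add up, using integer cents."""
    rng = random.Random(seed)  # seeded so repeated runs give the same rows
    rows = []
    for number in range(1, count + 1):
        subtotal = rng.randint(1_000, 20_000)  # $10.00 to $200.00
        tax = subtotal * tax_rate_percent // 100  # integer math, rounds down
        rows.append({
            # One ID format for every row, however many rows are requested.
            "order_id": f"ord_{number:05d}",
            "subtotal_cents": subtotal,
            "tax_cents": tax,
            "shipping_cents": shipping_cents,
            "total_cents": subtotal + tax + shipping_cents,
        })
    return rows


for row in make_orders(3):
    print(row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Working in integer cents sidesteps floating-point rounding, so the total always equals the sum of its parts.&lt;/p&gt;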

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vifmqudretjty9y24qg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vifmqudretjty9y24qg.png" alt="AI dataset generator screenshot" width="800" height="650"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Performance and cost
&lt;/h3&gt;

&lt;p&gt;By splitting it into two phases, the tool stays fast and surprisingly cheap. Schema generation is the only part that hits the LLM, and I wanted to make sure it wouldn't lead me to bankruptcy. So I added token tracking and ran the numbers using a super advanced formula:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;total_tokens × cost_per_token = ???&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Turns out… not that bad. Most previews come in around $0.03–$0.05 with GPT-4o. After that, it's all free: row generation runs locally with no extra API calls. Just pure, 100%, Grade A data.&lt;/p&gt;
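
&lt;p&gt;For the curious, the formula above fits in a few lines of Python. The token count and price below are hypothetical numbers picked to keep the arithmetic easy, not measured values or any provider's real rates.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def preview_cost_usd(total_tokens, usd_per_million_tokens):
    """total_tokens x cost_per_token, with the price quoted per million tokens."""
    cost_per_token = usd_per_million_tokens / 1_000_000
    return total_tokens * cost_per_token


# Hypothetical numbers for illustration only: 10,000 tokens at $4 per million.
print(f"${preview_cost_usd(10_000, 4.0):.2f}")  # prints $0.04
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Providers often price input and output tokens differently, so a more careful tracker would likely compute the two separately and add them up.&lt;/p&gt;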

&lt;h2&gt;
  
  
  Try it yourself + contribute
&lt;/h2&gt;

&lt;p&gt;It's still early, so it's not bulletproof. But if you need quick, realistic datasets, give it a try. Everything runs locally with Docker, and all you need is an API key from your favorite LLM provider to get started.&lt;/p&gt;

&lt;p&gt;If you want to contribute, there's plenty of room to jump in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add new business types or tweak existing ones&lt;/li&gt;
&lt;li&gt;Improve schema logic or simulation rules&lt;/li&gt;
&lt;li&gt;Add your awesome feature here&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The groundwork is already there. If you've got ideas, I'd love your help taking it further. &lt;a href="https://github.com/metabase/dataset-generator" rel="noopener noreferrer"&gt;Star it, fork it, or open a PR on GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>nextjs</category>
    </item>
    <item>
      <title>Making Hacker News Jobs Searchable Using GPT + Metabase</title>
      <dc:creator>Matthew Hefferon</dc:creator>
      <pubDate>Wed, 21 May 2025 16:33:00 +0000</pubDate>
      <link>https://forem.com/metabase/find-your-dream-gig-faster-on-whos-hiring-hacker-news-with-gpt-and-metabase-343</link>
      <guid>https://forem.com/metabase/find-your-dream-gig-faster-on-whos-hiring-hacker-news-with-gpt-and-metabase-343</guid>
      <description>&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Every month, Hacker News hosts a wildly popular thread: “Ask HN: Who is hiring?” It’s a treasure trove of job opportunities, but scrolling through endless walls of text to find your dream role feels like searching for a needle in a haystack.&lt;/p&gt;

&lt;h2&gt;
  
  
  The solution
&lt;/h2&gt;

&lt;p&gt;I vibe coded a little project with Cursor that turns those Hacker News job posts into clean, searchable data using OpenAI for parsing, PostgreSQL for storage, and Metabase for visualizations. The Holy Trinity of data wrangling :)&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Fetches “Ask HN: Who is hiring?” threads via the Hacker News API&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Uses GPT to extract fields like company, role, location, salary, and contact&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stores it all in a PostgreSQL database&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Spins up Metabase so you can search, filter, and explore&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fetch the thread&lt;/strong&gt;&lt;br&gt;
The script pulls a specific Hacker News thread using its ID (you can grab this from the URL of any “Ask HN: Who is hiring?” post).&lt;/p&gt;
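
&lt;p&gt;As a rough sketch of this step, the snippet below fetches a thread and its top-level comments from the public Hacker News API, where each item lists its child comment IDs under &lt;code&gt;kids&lt;/code&gt;. The function names are my own for illustration, and the project's actual script may be organized differently.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
from urllib.request import urlopen

HN_ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{item_id}.json"


def fetch_item(item_id):
    """Fetch one Hacker News item (story or comment) as a dict."""
    with urlopen(HN_ITEM_URL.format(item_id=item_id)) as response:
        return json.load(response)


def top_level_comments(thread_id, fetch=fetch_item):
    """Return the non-deleted top-level comments of a thread."""
    thread = fetch(thread_id)
    comments = []
    for comment_id in thread.get("kids", []):
        comment = fetch(comment_id)
        # Deleted items come back without text; the API may also return null.
        if comment and not comment.get("deleted") and comment.get("text"):
            comments.append(comment)
    return comments


# Usage (makes network calls; THREAD_ID is a placeholder for the ID in the URL):
#   comments = top_level_comments(THREAD_ID)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Passing &lt;code&gt;fetch&lt;/code&gt; as a parameter makes it easy to swap in a fake fetcher when testing without hitting the network.&lt;/p&gt;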

&lt;p&gt;&lt;strong&gt;Parse comments with GPT&lt;/strong&gt;&lt;br&gt;
I started with regex, but the unstructured formatting in each comment made it impossible to get clean, reliable data. So I switched to GPT. Each comment is sent to OpenAI with a structured prompt that asks for the relevant fields as JSON. Here’s the core of the prompt:
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OPENAI_PROMPT = (
    "You are a structured data parser for Hacker News job posts. "
    "Extract the following fields as plain strings (no quotes, arrays, or brackets unless necessary):\n"
    "- company: the name of the hiring company\n"
    "- role: the job title or position name\n"
    "- location: city/state/country or 'Remote' if applicable\n"
    "- salary: salary range or note (e.g. '$120k–$150k', 'Competitive', etc.)\n"
    "- contact: email address or direct application link (cleaned, no obfuscation like [at] or [dot])\n"
    "- description: a cleaned-up version of the full job post, useful for search\n\n"
    "Requirements:\n"
    "- Output a flat JSON object using the keys above\n"
    "- If any field is missing or not available, use null (not 'n/a', 'none', or empty string)\n"
    "- Do not include any markdown, HTML, or formatting characters\n"
    "- Fix obfuscated emails (e.g. convert 'john [at] domain [dot] com' to 'john@domain.com')\n"
    "- Do not include any commentary or explanation, only output the JSON object.\n"
    "- No trailing commas in the JSON.\n\n"
    "Job post:\n"
    '"""{job_text}"""'
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
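
&lt;p&gt;Even with a strict prompt, it's worth validating what comes back. Here's a minimal sketch of that step, assuming the model's reply arrives as a plain string. The helper name &lt;code&gt;parse_job_reply&lt;/code&gt; and the fallback behavior are assumptions for illustration, not necessarily what the project does.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

FIELDS = ("company", "role", "location", "salary", "contact", "description")


def parse_job_reply(reply_text):
    """Turn the model's JSON reply into a dict with exactly the expected keys."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        return None  # the caller can log or retry this comment
    if not isinstance(data, dict):
        return None
    # Missing keys and empty strings both become None, matching the prompt's rule.
    return {field: (data.get(field) or None) for field in FIELDS}


print(parse_job_reply('{"company": "Acme", "role": "Data Engineer", "salary": ""}'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Normalizing to a fixed set of keys means every parsed comment has the same shape, whatever the model chose to include.&lt;/p&gt;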



&lt;p&gt;&lt;strong&gt;Store in PostgreSQL&lt;/strong&gt;&lt;br&gt;
Parsed results are saved in the &lt;code&gt;hn.jobs&lt;/code&gt; table:
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE SCHEMA IF NOT EXISTS hn;

CREATE TABLE IF NOT EXISTS hn.jobs (
  hn_comment_id bigint primary key,
  company text,
  role text,
  location text,
  salary text,
  contact text,
  description text,
  posted_at timestamp with time zone,
  created_at timestamp with time zone default now(),
  updated_at timestamp with time zone default now()
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
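
&lt;p&gt;One way to write rows into this table is an upsert keyed on &lt;code&gt;hn_comment_id&lt;/code&gt;, so re-running the script on the same thread updates rows instead of failing on the primary key. The sketch below is an assumption about how that could look, not the project's actual code. It only builds the SQL and parameters, using the &lt;code&gt;%s&lt;/code&gt; placeholder style that psycopg expects.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UPSERT_JOB_SQL = """
INSERT INTO hn.jobs
  (hn_comment_id, company, role, location, salary, contact, description, posted_at)
VALUES (%s, %s, %s, %s, %s, %s, %s, to_timestamp(%s))
ON CONFLICT (hn_comment_id) DO UPDATE SET
  company = EXCLUDED.company,
  role = EXCLUDED.role,
  location = EXCLUDED.location,
  salary = EXCLUDED.salary,
  contact = EXCLUDED.contact,
  description = EXCLUDED.description,
  updated_at = now()
"""


def job_params(comment, job):
    """Build the parameter tuple for UPSERT_JOB_SQL.

    comment is a Hacker News item dict (its "time" field is a Unix timestamp);
    job is the parsed dict of fields extracted by the model.
    """
    return (
        comment["id"], job.get("company"), job.get("role"), job.get("location"),
        job.get("salary"), job.get("contact"), job.get("description"),
        comment["time"],
    )


# With a DB-API driver such as psycopg, this would be used roughly as:
#   cursor.execute(UPSERT_JOB_SQL, job_params(comment, job))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Passing values as parameters instead of formatting them into the SQL string keeps job text with quotes or odd characters from breaking the query.&lt;/p&gt;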



&lt;h2&gt;
  
  
  Explore in Metabase
&lt;/h2&gt;

&lt;p&gt;Metabase connects directly to the Postgres database and gives you a clean UI to explore the data.&lt;/p&gt;

&lt;p&gt;I made a little table that lets you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filter by company, role, location, and salary&lt;/li&gt;
&lt;li&gt;Link back to the original thread by its ID&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqp26y48vvub6nmhlx9vs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqp26y48vvub6nmhlx9vs.png" alt="Metabase Dashboard" width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Potential enhancements
&lt;/h2&gt;

&lt;p&gt;Here are a few ideas to take this project to the next level:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automate monthly updates:&lt;/strong&gt; Use GitHub Actions to fetch and parse new “Ask HN: Who is hiring?” threads automatically each month.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deploy a public dashboard:&lt;/strong&gt; Share a live Metabase dashboard for anyone to explore job listings without needing to run the code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add notification alerts:&lt;/strong&gt; Set up email or Slack notifications for new job postings that match specific criteria, like role or location.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Take it for a spin and contribute
&lt;/h2&gt;

&lt;p&gt;Check out the full code and detailed setup instructions on GitHub. Whether you’re searching for your next role or simply curious, feel free to clone the repo and try it out. Better yet, consider contributing. If you have a feature idea, a bug fix, or any improvements, submit a pull request or open an issue.&lt;/p&gt;

&lt;p&gt;Thanks for reading, and good luck with your job search!&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
