<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Radosław</title>
    <description>The latest articles on Forem by Radosław (@arondaron).</description>
    <link>https://forem.com/arondaron</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3891416%2F386e2f47-a8ad-4b89-8cda-06ba1082965c.png</url>
      <title>Forem: Radosław</title>
      <link>https://forem.com/arondaron</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/arondaron"/>
    <language>en</language>
    <item>
      <title>Dataset Generator v1.0.3-beta ships local LLM support — fine-tune your model without paying a cent for API</title>
      <dc:creator>Radosław</dc:creator>
      <pubDate>Sun, 03 May 2026 17:06:07 +0000</pubDate>
      <link>https://forem.com/arondaron/dataset-generator-v103-beta-ships-local-llm-support-fine-tune-your-model-without-paying-a-cent-3f0o</link>
      <guid>https://forem.com/arondaron/dataset-generator-v103-beta-ships-local-llm-support-fine-tune-your-model-without-paying-a-cent-3f0o</guid>
      <description>&lt;p&gt;A some time ago I shipped a desktop app to generate LLM fine-tuning datasets. It worked: my Qwen2.5-Coder-7B fine-tune jumped from &lt;strong&gt;55.5% → 72.3% on HumanEval&lt;/strong&gt;. Whole pipeline ran on OpenRouter — pick a model, click Generate, get JSONL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v1.0.3-beta&lt;/strong&gt; ships &lt;strong&gt;multi-provider LLM support&lt;/strong&gt; — Ollama, LM Studio, llama.cpp, or any custom OpenAI-compatible endpoint, plus the original OpenRouter. Mix and match: generate on your local Qwen3-14B, judge on a cheap cloud model. Or stay fully offline.&lt;/p&gt;

&lt;p&gt;Here's what shipped, what was harder than I expected, and what I learned along the way.&lt;/p&gt;

&lt;h2&gt;What's new in v1.0.3-beta&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;One-click local LLM&lt;/strong&gt;. Open Settings → Providers → "Auto-detect local". The app probes &lt;code&gt;localhost:11434&lt;/code&gt; (Ollama), &lt;code&gt;1234&lt;/code&gt; (LM Studio), &lt;code&gt;8080&lt;/code&gt; (llama.cpp). Anything that answers gets a one-click "Add" button. Onboarding for an offline-first user takes ~30 seconds.&lt;/p&gt;
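
&lt;p&gt;A minimal sketch of what that detection amounts to (my reconstruction, not the app's actual code): all three servers speak an OpenAI-compatible API, so probing &lt;code&gt;/v1/models&lt;/code&gt; on each well-known port is enough.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative auto-detect sketch; paths assume each server's
# OpenAI-compatible mode.
import httpx

KNOWN_LOCAL = {
    "Ollama":    "http://localhost:11434/v1",
    "LM Studio": "http://localhost:1234/v1",
    "llama.cpp": "http://localhost:8080/v1",
}

def detect_local_providers(timeout=0.5):
    """Return (name, base_url) for every local server that answers."""
    found = []
    for name, base_url in KNOWN_LOCAL.items():
        try:
            r = httpx.get(base_url + "/models", timeout=timeout)
            if r.status_code == 200:
                found.append((name, base_url))
        except httpx.HTTPError:
            pass  # nothing listening on that port
    return found
&lt;/code&gt;&lt;/pre&gt;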

&lt;p&gt;&lt;strong&gt;Mixed mode&lt;/strong&gt;. Each category can use its own provider. Generate on a local Qwen2.5-Coder:14B, judge on a cheap cloud model such as GPT-4o-mini. Or use different generators per category — say, a code-specialised local model for the algorithm category. The pipeline routes each call to the right backend automatically.&lt;/p&gt;
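
&lt;p&gt;The config shape is roughly this (field names are my guess, not the app's actual schema):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative mixed-mode config: each category and the judge pick their
# own provider + model, and the pipeline routes every call accordingly.
JOB_CONFIG = {
    "categories": {
        "algorithms": {"provider": "ollama", "model": "qwen2.5-coder:14b"},
    },
    "judge": {"provider": "openrouter", "model": "openai/gpt-4o-mini"},
}

def resolve_backend(role):
    """Pick the provider/model pair for a pipeline role."""
    if role == "judge":
        return JOB_CONFIG["judge"]
    return JOB_CONFIG["categories"][role]
&lt;/code&gt;&lt;/pre&gt;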

&lt;p&gt;&lt;strong&gt;Custom endpoints&lt;/strong&gt;. Any OpenAI-compatible URL works: vLLM, TGI, your buddy's self-hosted gateway. Paste base URL + optional bearer token, done.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8pyfz0r5u9dv3obeo64.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8pyfz0r5u9dv3obeo64.png" alt=" " width="658" height="861"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instant cancel for local jobs&lt;/strong&gt;. Cloud APIs answer in seconds — cooperative cancel between calls is fine. Local 14B can sit on a single chat completion for &lt;em&gt;minutes&lt;/em&gt;. v1.0.3-beta wires &lt;code&gt;asyncio.Task.cancel()&lt;/code&gt; straight into the in-flight HTTP request, so cancel feels instant (~1s) instead of "wait 8 minutes for the chat call to time out".&lt;/p&gt;
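
&lt;p&gt;Roughly the shape of that wiring, assuming &lt;code&gt;httpx&lt;/code&gt; (a sketch, not the app's actual code). Cancelling the task raises &lt;code&gt;CancelledError&lt;/code&gt; inside the awaited call, which tears the socket down instead of waiting out the model:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import asyncio
import httpx

async def chat_call(client, payload):
    # client is an httpx.AsyncClient with base_url set to the provider
    r = await client.post("/v1/chat/completions", json=payload, timeout=None)
    return r.json()

async def run_cancellable(client, payload, cancel_event):
    task = asyncio.create_task(chat_call(client, payload))
    waiter = asyncio.create_task(cancel_event.wait())
    done, _ = await asyncio.wait({task, waiter}, return_when=asyncio.FIRST_COMPLETED)
    if waiter in done:       # user hit Cancel mid-generation
        task.cancel()        # aborts the in-flight request almost immediately
        try:
            await task
        except asyncio.CancelledError:
            return None
    waiter.cancel()
    return task.result()
&lt;/code&gt;&lt;/pre&gt;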

&lt;p&gt;&lt;strong&gt;Auto-handling for reasoning models&lt;/strong&gt;. Qwen3, DeepSeek-R1, and friends emit &lt;code&gt;&amp;lt;think&amp;gt;...&amp;lt;/think&amp;gt;&lt;/code&gt; blocks that eat the whole token budget before any actual output. The pipeline detects "reasoning starvation" (empty content + finish=length + reasoning present) and auto-retries with a 4× budget. No manual fiddling.&lt;/p&gt;
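
&lt;p&gt;The check is three signals together. A sketch of the logic described above (my reconstruction, not the pipeline's actual code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def needs_reasoning_retry(choice):
    msg = choice.get("message", {})
    content = (msg.get("content") or "").strip()
    reasoning = msg.get("reasoning") or msg.get("reasoning_content")
    # empty content + truncated by length + reasoning present
    # = the think block ate the whole budget
    return not content and choice.get("finish_reason") == "length" and bool(reasoning)

def generate(call_llm, payload):
    resp = call_llm(payload)
    if needs_reasoning_retry(resp["choices"][0]):
        payload = {**payload, "max_tokens": payload["max_tokens"] * 4}
        resp = call_llm(payload)  # one retry with a 4x token budget
    return resp
&lt;/code&gt;&lt;/pre&gt;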

&lt;h2&gt;Why this matters&lt;/h2&gt;

&lt;p&gt;Three concrete user types this unlocks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Privacy-conscious&lt;/strong&gt; — corporate data, NDA'd code, anything you can't send to a third-party API. Now stays on your laptop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-conscious&lt;/strong&gt; — generating 5000 multi-turn examples on cloud GPT-4 is $$$. On a local Qwen3-14B it's electricity. Mixed mode (cheap local gen + cloud judge for quality control) is roughly 1/10th the bill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No-cloud-account&lt;/strong&gt; — regulations, no credit card, country without payment methods. The whole pipeline now runs without a single API call to anyone.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What was harder than expected&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Token accounting across providers.&lt;/strong&gt; OpenRouter cleanly breaks out &lt;code&gt;reasoning_tokens&lt;/code&gt; in the usage payload. Ollama doesn't — &lt;code&gt;completion_tokens&lt;/code&gt; is the full think+content figure. So when DeepSeek-R1 via Ollama generates 80 tokens of actual output after 800 tokens of &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt;, the bill says 880, the dataset preview says 80, and the budget check trips constantly.&lt;/p&gt;

&lt;p&gt;Fix: detect &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; blocks (Format A) or &lt;code&gt;message.reasoning&lt;/code&gt; field (Format B), strip the reasoning, recount the kept content with tiktoken, write the corrected number back into &lt;code&gt;usage.completion_tokens&lt;/code&gt;. tiktoken is an estimate (not the model's native tokeniser), but it's the only signal available when the provider doesn't surface a breakdown. Quality Report and per-example token counts now agree.&lt;/p&gt;
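
&lt;p&gt;A condensed sketch of that fix (the real code handles more response shapes and edge cases):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import re
import tiktoken

THINK_RE = re.compile(r"&amp;lt;think&amp;gt;.*?&amp;lt;/think&amp;gt;\s*", re.DOTALL)
ENC = tiktoken.get_encoding("cl100k_base")  # estimate, not the model's tokeniser

def normalise_usage(message, usage):
    content = message.get("content") or ""
    stripped = THINK_RE.sub("", content)  # Format A: inline think block
    # Format B: Ollama's message.reasoning / LM Studio's message.reasoning_content
    reasoning = message.get("reasoning") or message.get("reasoning_content")
    if stripped == content and not reasoning:
        return  # nothing to correct
    message["content"] = stripped
    usage["completion_tokens"] = len(ENC.encode(stripped))  # recount kept content
&lt;/code&gt;&lt;/pre&gt;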

&lt;p&gt;&lt;strong&gt;LM Studio uses &lt;em&gt;yet another&lt;/em&gt; field name.&lt;/strong&gt; Same idea as Ollama's &lt;code&gt;message.reasoning&lt;/code&gt;, but they call it &lt;code&gt;message.reasoning_content&lt;/code&gt;. Discovered this with a curl, double-checked with another model, sigh. Pipeline still works because LM Studio &lt;em&gt;does&lt;/em&gt; surface &lt;code&gt;reasoning_tokens&lt;/code&gt; in &lt;code&gt;completion_tokens_details&lt;/code&gt; (more like OpenRouter), so the subtract path catches it. But the per-provider response shape table grew another row.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capability-driven branching, not provider-kind switches.&lt;/strong&gt; First draft of the integration had &lt;code&gt;if provider.kind == "ollama"&lt;/code&gt; peppered through the pipeline. That doesn't scale — the next user wants TabbyAPI, the one after wants their custom corporate gateway. So I refactored to &lt;code&gt;ProviderCapabilities&lt;/code&gt; flags: &lt;code&gt;supports_provider_routing&lt;/code&gt;, &lt;code&gt;supports_reasoning&lt;/code&gt;, &lt;code&gt;requires_api_key&lt;/code&gt;, &lt;code&gt;has_pricing&lt;/code&gt;, &lt;code&gt;supports_embeddings&lt;/code&gt;. Adding a new backend is now one class + one registry entry. Zero changes to &lt;code&gt;job_runner.py&lt;/code&gt;.&lt;/p&gt;
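
&lt;p&gt;The flag names are from the actual refactor; the registry shape below is my sketch of how they fit together:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass(frozen=True)
class ProviderCapabilities:
    supports_provider_routing: bool = False
    supports_reasoning: bool = False
    requires_api_key: bool = True
    has_pricing: bool = True
    supports_embeddings: bool = False

REGISTRY = {
    "openrouter": ProviderCapabilities(supports_provider_routing=True,
                                       supports_reasoning=True,
                                       supports_embeddings=True),
    "ollama": ProviderCapabilities(requires_api_key=False,
                                   has_pricing=False,
                                   supports_reasoning=True),
}

def maybe_attach_routing(payload, caps):
    # The pipeline branches on flags, never on provider kind.
    if caps.supports_provider_routing:
        payload["provider"] = {"sort": "price"}  # OpenRouter-style routing hint
    return payload
&lt;/code&gt;&lt;/pre&gt;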

&lt;p&gt;&lt;strong&gt;Default reassignment UX.&lt;/strong&gt; User clicks "Disable" on the OpenRouter provider (which happens to be the default). Old behaviour: silent orphan state, the next job hits a "Provider 'openrouter-default' is disabled" 422 error, and the user has no idea why. New behaviour: the backend auto-promotes the next enabled provider to default, and the frontend shows a 4-second toast: "Default switched to Ollama (local)". An annoying bug to track down, easy to fix once seen.&lt;/p&gt;
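
&lt;p&gt;The promotion itself is only a few lines (an illustrative sketch, not the real router code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def disable_provider(providers, provider_id):
    """Disable a provider; return the id of the new default, if it changed."""
    target = next(p for p in providers if p["id"] == provider_id)
    target["enabled"] = False
    if not target["is_default"]:
        return None
    target["is_default"] = False
    for p in providers:
        if p["enabled"]:  # promote the first remaining enabled provider
            p["is_default"] = True
            return p["id"]  # frontend toasts "Default switched to ..."
    return None
&lt;/code&gt;&lt;/pre&gt;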

&lt;h2&gt;What I learned&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&amp;lt;14B local models aren't worth it for dataset generation.&lt;/strong&gt; I tested 7B and 9B variants for a week. The output is technically valid but constantly drifts off-topic, repeats patterns, or misunderstands category descriptions. Whatever you save in cloud tokens, you spend 5× over on rejected examples. &lt;strong&gt;14B is the floor; 32B is the comfortable middle.&lt;/strong&gt; If you have the VRAM, use it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The judge model still matters more than the generator.&lt;/strong&gt; Same lesson as the original post, doubled by local LLM availability. I tried using small local judges to "save the judge cost too". Some 8B judges rubber-stamp 95-100 across the board. Some 14B judges skip 70% of perfectly good examples because they don't understand the category. Spend cloud money on the judge — or use a 32B+ local judge if you have the hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mixed mode is the killer feature.&lt;/strong&gt; I expected "fully offline" to be the win. Turns out the workflow most people actually want is: cheap local model for the volume work (7000 examples), strong cloud model as judge (because rubber-stamp judges silently kill dataset quality). v1.0.3-beta makes this a one-line config — pick gen model from one provider, judge from another, ship it.&lt;/p&gt;

&lt;h2&gt;What didn't work&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Per-provider concurrency limits.&lt;/strong&gt; I prototyped this — you'd configure "Ollama: 1, OpenRouter: 10" so the global semaphore doesn't drown your local GPU. Turned out to be enterprise-flavoured complexity for ~zero real-world benefit (single-user, single-GPU setups are probably 99% of users). Cut from v1.0.3-beta, parked for someone with multi-GPU vLLM who actually needs it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provider badge in the model picker.&lt;/strong&gt; When two providers serve the same model name (&lt;code&gt;llama-3.1-8b&lt;/code&gt; on both Ollama and OpenRouter), the picker shows two identical-looking entries. I sketched a small badge UI to differentiate them, then realised typical setups don't have name collisions (you know which models you put where). Punted to a future polish pass.&lt;/p&gt;

&lt;h2&gt;Tech stack updates&lt;/h2&gt;

&lt;p&gt;Same foundations, new layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: Next.js 16 (static export) + Tailwind + base-ui — added &lt;code&gt;ProvidersSection&lt;/code&gt; for CRUD + auto-detect + per-row connection test&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;: FastAPI + SQLite (WAL) + Pydantic — added &lt;code&gt;app/services/llm/&lt;/code&gt; provider abstraction (LLMProvider ABC + ProviderCapabilities) and &lt;code&gt;app/routers/providers.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema migration&lt;/strong&gt;: &lt;code&gt;providers&lt;/code&gt; table added in v6 with a backfill of the legacy single OpenRouter key — your existing setup migrates silently on first launch (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tests&lt;/strong&gt;: 460 passing (up from 329 in the previous release) — full coverage for the four backends, registry resolution, auto-detect, mixed-mode jobs&lt;/li&gt;
&lt;/ul&gt;
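
&lt;p&gt;The v6 migration boils down to something like this (a sketch; the column layout is my guess, not the actual schema):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import sqlite3

def migrate_v6(conn, legacy_openrouter_key):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS providers (
            id TEXT PRIMARY KEY,
            kind TEXT NOT NULL,
            base_url TEXT,
            api_key TEXT,
            enabled INTEGER DEFAULT 1,
            is_default INTEGER DEFAULT 0
        )""")
    if legacy_openrouter_key:  # backfill the pre-v6 single-key setup
        conn.execute(
            "INSERT OR IGNORE INTO providers VALUES (?, ?, ?, ?, 1, 1)",
            ("openrouter-default", "openrouter",
             "https://openrouter.ai/api/v1", legacy_openrouter_key))
    conn.commit()
&lt;/code&gt;&lt;/pre&gt;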

&lt;p&gt;Same AGPL-3.0 license. Same one-binary distribution (Linux AppImage, Windows exe).&lt;/p&gt;

&lt;h2&gt;Resources&lt;/h2&gt;

&lt;p&gt;Everything is open source:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;App repo (v1.0.3-beta)&lt;/strong&gt;: &lt;a href="https://github.com/AronDaron/dataset-generator" rel="noopener noreferrer"&gt;github.com/AronDaron/dataset-generator&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Original release post (with HumanEval +16pp benchmark)&lt;/strong&gt;: &lt;a href="https://forem.com/arondaron/desktop-app-to-generate-llm-fine-tuning-datasets-got-16pp-on-humaneval-4ng3" rel="noopener noreferrer"&gt;forem.com/arondaron/desktop-app-to-generate-llm-fine-tuning-datasets-got-16pp-on-humaneval-4ng3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dataset (2,248 examples)&lt;/strong&gt;: &lt;a href="https://huggingface.co/datasets/AronDaron/OctoBench-2.2k" rel="noopener noreferrer"&gt;huggingface.co/datasets/AronDaron/OctoBench-2.2k&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuned model&lt;/strong&gt;: &lt;a href="https://huggingface.co/AronDaron/Qwen2.5-Coder-7B-Instruct-OctoBench-2.2k-Fine-tune" rel="noopener noreferrer"&gt;huggingface.co/AronDaron/Qwen2.5-Coder-7B-Instruct-OctoBench-2.2k-Fine-tune&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What's next&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;System tray version&lt;/strong&gt;. Long generation runs (5000+ examples on local hardware = hours) deserve a quieter UX than a permanent open window. Tray icon, "next job ready" notification, click to bring back the dashboard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedding provider picker&lt;/strong&gt;. Right now dedup works multi-provider on the backend, but the UI only exposes the OpenRouter embedding models. Adding a small dropdown so local users can run dedup on &lt;code&gt;nomic-embed-text&lt;/code&gt; via Ollama too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two new categories targeting LiveCodeBench and BigCodeBench&lt;/strong&gt;. The previous post explained why those benchmarks barely moved (format mismatch on LCB, too-generic library category for BCB). Both fixes are in progress — algorithmic drill with edge-case coverage for LCB, library-API-precise taxonomy for BCB.&lt;/p&gt;

&lt;p&gt;If you generate datasets locally — what model size are you using and what's your accept rate? Especially curious if anyone got real value out of &amp;lt;14B local models for dataset gen, because my tests said no but I'd love to be wrong.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Disclosure: I drafted this post with AI help — same way I built the app.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>vibecoding</category>
      <category>python</category>
    </item>
    <item>
      <title>Desktop app to generate LLM fine-tuning datasets — got +16pp on HumanEval</title>
      <dc:creator>Radosław</dc:creator>
      <pubDate>Wed, 29 Apr 2026 09:37:46 +0000</pubDate>
      <link>https://forem.com/arondaron/desktop-app-to-generate-llm-fine-tuning-datasets-got-16pp-on-humaneval-4ng3</link>
      <guid>https://forem.com/arondaron/desktop-app-to-generate-llm-fine-tuning-datasets-got-16pp-on-humaneval-4ng3</guid>
      <description>&lt;p&gt;I'm not a professional developer. I learned by doing — vibe-coding with AI assistance — and a few months ago I wanted to fine-tune Qwen2.5-Coder-7B on my own data. The problem: there's no good way to generate a quality dataset without writing custom scripts every time, and existing tools are either CLI-heavy or built for researchers, not curious tinkerers.&lt;/p&gt;

&lt;p&gt;So I built one. It actually worked: my fine-tuned model went from &lt;strong&gt;55.5% to 72.3% on HumanEval&lt;/strong&gt; (5 runs averaged, Q4_K_M GGUF via Ollama).&lt;/p&gt;

&lt;p&gt;Here's what I built, what I learned, and what didn't work in this fine-tuning experiment.&lt;/p&gt;

&lt;h2&gt;What it is&lt;/h2&gt;

&lt;p&gt;A no-code desktop app (Linux, Windows) that automates the full dataset generation pipeline — topic planning, multi-turn example generation, quality scoring via LLM Judge, deduplication, and HuggingFace Hub upload. Pick categories, set proportions, click Generate, get a ready-to-train JSONL.&lt;/p&gt;

&lt;p&gt;Under the hood it runs a three-stage engine: topics → outlines → examples. Instead of a naive "generate 100 examples" prompt, the app decomposes the job first, which kills the repetitive patterns you get from one-shot generation. The app itself runs entirely on your machine; the only network traffic is the model calls, which go through OpenRouter (~300 models, one key).&lt;/p&gt;
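
&lt;p&gt;In sketch form the decomposition looks like this (function names and the &lt;code&gt;llm&lt;/code&gt; callable are illustrative, not the app's API):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def generate_category(llm, category, n_examples):
    # llm is any "prompt in, text out" callable.
    # Stage 1: plan diverse topics instead of asking for examples directly.
    topics = llm("List diverse topics for category: " + category).splitlines()
    examples = []
    for topic in topics:
        # Stage 2: outline a multi-turn exercise for each topic.
        outline = llm("Outline a multi-turn exercise about: " + topic)
        # Stage 3: only now write the full example.
        examples.append(llm("Write the full example from this outline: " + outline))
        if len(examples) == n_examples:
            break
    return examples
&lt;/code&gt;&lt;/pre&gt;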

&lt;p&gt;PS: I know there are similar apps for generating fine-tuning data — but as always, I build the tools I want to use myself.&lt;/p&gt;

&lt;p&gt;A few features that made my life easier:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Per-category models&lt;/strong&gt; — different generators for different example types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-as-judge&lt;/strong&gt; — every example gets scored, low-quality ones rejected&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding deduplication&lt;/strong&gt; — cosine similarity removes near-duplicates before export&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HuggingFace upload&lt;/strong&gt; — push straight to the Hub when done&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality Report&lt;/strong&gt; — score histograms, token stats, per-category accept rates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resume on crash&lt;/strong&gt; — interrupted jobs restart from where they stopped (this saved me hours)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Resources&lt;/h2&gt;

&lt;p&gt;Everything is open source and reproducible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dataset (2,248 examples)&lt;/strong&gt;: &lt;a href="https://huggingface.co/datasets/AronDaron/OctoBench-2.2k" rel="noopener noreferrer"&gt;huggingface.co/datasets/AronDaron/OctoBench-2.2k&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuned model&lt;/strong&gt;: &lt;a href="https://huggingface.co/AronDaron/Qwen2.5-Coder-7B-Instruct-OctoBench-2.2k-Fine-tune" rel="noopener noreferrer"&gt;huggingface.co/AronDaron/Qwen2.5-Coder-7B-Instruct-OctoBench-2.2k-Fine-tune&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;App repo&lt;/strong&gt;: &lt;a href="https://github.com/AronDaron/dataset-generator" rel="noopener noreferrer"&gt;github.com/AronDaron/dataset-generator&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The result&lt;/h2&gt;

&lt;p&gt;I generated 2,248 examples across 8 categories targeting different code skills, then fine-tuned Qwen2.5-Coder-7B-Instruct (QLoRA via Unsloth, Q4_K_M GGUF served via Ollama).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Base&lt;/th&gt;
&lt;th&gt;Fine-tuned&lt;/th&gt;
&lt;th&gt;Δ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HumanEval (5 runs avg, n=164, t=0.2)&lt;/td&gt;
&lt;td&gt;55.5% (±2.1)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;72.3% (±2.0)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+16.8pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HumanEval+ (5 runs avg, n=164, t=0.2)&lt;/td&gt;
&lt;td&gt;49.0% (±1.9)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;65.1% (±1.6)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+16.1pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BigCodeBench full instruct (1 run, n=1140)&lt;/td&gt;
&lt;td&gt;39.3%&lt;/td&gt;
&lt;td&gt;39.7%&lt;/td&gt;
&lt;td&gt;+0.4pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiveCodeBench v6 (1 run, n=1055, t=0.0)&lt;/td&gt;
&lt;td&gt;29.0%&lt;/td&gt;
&lt;td&gt;26.9%&lt;/td&gt;
&lt;td&gt;-2.1pp&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0a45q9xpey9mvn16vk4y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0a45q9xpey9mvn16vk4y.png" alt=" " width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;HumanEval and HumanEval+ were the win. BigCodeBench barely moved and LiveCodeBench actually regressed. Both led to interesting lessons.&lt;/p&gt;

&lt;h2&gt;What surprised me&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;LCB regressed because of a format mismatch, not a knowledge gap.&lt;/strong&gt; I checked the fail cases — model output had correct logic but the wrong wrapper. My training data said "return only the function" while LCB tests need full programs with &lt;code&gt;input()&lt;/code&gt; / &lt;code&gt;print()&lt;/code&gt;. Format mismatches show up as "wrong answer" on benchmarks, but they're way easier to fix than actual missing knowledge.&lt;/p&gt;
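
&lt;p&gt;Concretely, the two shapes look like this (a toy illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# What my training data taught ("return only the function"):
def solve(a, b):
    return a + b

# What LiveCodeBench actually executes (full program, stdin/stdout):
a, b = map(int, input().split())
print(a + b)
&lt;/code&gt;&lt;/pre&gt;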

&lt;p&gt;&lt;strong&gt;Judge model matters more than generator model.&lt;/strong&gt; I tested several judges. Some flash-tier models rubber-stamped almost everything (scores 95-100 across the board), while smaller models skipped 70% of examples they didn't understand. Pick the wrong judge and your "quality dataset" is just noise with a fancy filter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concise prompts beat elaborate ones.&lt;/strong&gt; I started with detailed multi-paragraph category descriptions. Generation quality got &lt;em&gt;worse&lt;/em&gt;. Stripped them down to 2-3 sentences with a 4-6 item judge criteria — accept rate jumped, output got cleaner.&lt;/p&gt;

&lt;h2&gt;What didn't work&lt;/h2&gt;

&lt;p&gt;I tried to be clever with judge criteria. I added more and more filters trying to catch every edge case I noticed in pilot runs. Accept rate dropped from ~85% to 10%. The filters were technically correct, but the generator couldn't deliver against all of them. Lesson: it's better to accept some noise than to over-constrain and stall the whole pipeline.&lt;/p&gt;

&lt;p&gt;I also wasted time on BigCodeBench. My "Data Libraries" category was too generic — "any 2+ libs from this list" — and BCB tests precise library API usage with concrete kwargs. Result: +0.4pp. To actually move BCB, I'd need a category seeded from BCB's own taxonomy of ~139 libraries with specific signature drilling.&lt;/p&gt;

&lt;h2&gt;Tech stack&lt;/h2&gt;

&lt;p&gt;Nothing exotic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: Next.js 16 (static export) + Tailwind + shadcn/ui&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;: FastAPI + SQLite (WAL mode) + Pydantic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Desktop&lt;/strong&gt;: pywebview (WebKit2 on Linux, WebView2 on Windows)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Packaging&lt;/strong&gt;: PyInstaller — Linux AppImage works (~73 MB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM access&lt;/strong&gt;: OpenRouter (no vendor lock-in, switch models freely)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dedup&lt;/strong&gt;: OpenRouter embeddings + numpy cosine similarity (see the sketch after this list)&lt;/li&gt;
&lt;/ul&gt;
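
&lt;p&gt;The dedup step is conceptually a few lines of numpy (a sketch; the 0.92 threshold is illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def dedup(examples, embeddings, threshold=0.92):
    """Keep an example only if it isn't too close to anything already kept."""
    emb = np.asarray(embeddings, dtype=np.float32)
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalise rows
    kept, kept_idx = [], []
    for i, example in enumerate(examples):
        if kept_idx and np.max(emb[kept_idx] @ emb[i]) &amp;gt;= threshold:
            continue  # near-duplicate of something already kept
        kept.append(example)
        kept_idx.append(i)
    return kept
&lt;/code&gt;&lt;/pre&gt;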

&lt;p&gt;License is AGPL-3.0 — I picked it over MIT on purpose. If someone wraps this as SaaS, I want the changes to come back to the project.&lt;/p&gt;

&lt;h2&gt;What's next&lt;/h2&gt;

&lt;p&gt;Local LLM support (Ollama / LM Studio) so people can generate datasets without paying for API calls. After that, a system tray version for quieter long-running jobs.&lt;/p&gt;

&lt;p&gt;Already in progress: two new categories targeting LiveCodeBench (algorithmic drill with edge-case coverage) and BigCodeBench (API-precise library taxonomy). Goal is to lift the two benchmarks where this run fell flat.&lt;/p&gt;

&lt;p&gt;If you've fine-tuned a model on a synthetic dataset, I'd love to hear what worked for you — especially around judge model selection and category design. Drop a comment.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Disclosure: I drafted this post with AI help — same way I built the app.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vibecoding</category>
      <category>python</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
