<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Radosław</title>
    <description>The latest articles on Forem by Radosław (@arondaron).</description>
    <link>https://forem.com/arondaron</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3891416%2F386e2f47-a8ad-4b89-8cda-06ba1082965c.png</url>
      <title>Forem: Radosław</title>
      <link>https://forem.com/arondaron</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/arondaron"/>
    <language>en</language>
    <item>
      <title>Dataset Generator v1.0.3-beta ships local LLM support — fine-tune your model without paying a cent for API</title>
      <dc:creator>Radosław</dc:creator>
      <pubDate>Sun, 03 May 2026 17:06:07 +0000</pubDate>
      <link>https://forem.com/arondaron/dataset-generator-v103-beta-ships-local-llm-support-fine-tune-your-model-without-paying-a-cent-3f0o</link>
      <guid>https://forem.com/arondaron/dataset-generator-v103-beta-ships-local-llm-support-fine-tune-your-model-without-paying-a-cent-3f0o</guid>
      <description>&lt;p&gt;A some time ago I shipped a desktop app to generate LLM fine-tuning datasets. It worked: my Qwen2.5-Coder-7B fine-tune jumped from &lt;strong&gt;55.5% → 72.3% on HumanEval&lt;/strong&gt;. Whole pipeline ran on OpenRouter — pick a model, click Generate, get JSONL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v1.0.3-beta&lt;/strong&gt; ships &lt;strong&gt;multi-provider LLM support&lt;/strong&gt; — Ollama, LM Studio, llama.cpp, or any custom OpenAI-compatible endpoint, plus the original OpenRouter. Mix and match: generate on your local Qwen3-14B, judge on a cheap cloud model. Or stay fully offline.&lt;/p&gt;

&lt;p&gt;Here's what shipped, what was harder than I expected, and what I learned along the way.&lt;/p&gt;

&lt;h2&gt;What's new in v1.0.3-beta&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;One-click local LLM&lt;/strong&gt;. Open Settings → Providers → "Auto-detect local". The app probes &lt;code&gt;localhost:11434&lt;/code&gt; (Ollama), &lt;code&gt;1234&lt;/code&gt; (LM Studio), &lt;code&gt;8080&lt;/code&gt; (llama.cpp). Anything that answers gets a one-click "Add" button. Onboarding for an offline-first user takes ~30 seconds.&lt;/p&gt;
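
&lt;p&gt;A minimal sketch of what that detection amounts to (my reconstruction, not the app's actual code): all three servers speak an OpenAI-compatible API, so probing &lt;code&gt;/v1/models&lt;/code&gt; on each well-known port is enough.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative auto-detect sketch; paths assume each server's
# OpenAI-compatible mode.
import httpx

KNOWN_LOCAL = {
    "Ollama":    "http://localhost:11434/v1",
    "LM Studio": "http://localhost:1234/v1",
    "llama.cpp": "http://localhost:8080/v1",
}

def detect_local_providers(timeout=0.5):
    """Return (name, base_url) for every local server that answers."""
    found = []
    for name, base_url in KNOWN_LOCAL.items():
        try:
            r = httpx.get(base_url + "/models", timeout=timeout)
            if r.status_code == 200:
                found.append((name, base_url))
        except httpx.HTTPError:
            pass  # nothing listening on that port
    return found
&lt;/code&gt;&lt;/pre&gt;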

&lt;p&gt;&lt;strong&gt;Mixed mode&lt;/strong&gt;. Each category can use its own provider. Generate on a local Qwen2.5-Coder:14B, judge on a cheap cloud model such as GPT-4o-mini. Or use different generators per category — say, a code-specialised local model for the algorithm category. The pipeline routes each call to the right backend automatically.&lt;/p&gt;
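
&lt;p&gt;The config shape is roughly this (field names are my guess, not the app's actual schema):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative mixed-mode config: each category and the judge pick their
# own provider + model, and the pipeline routes every call accordingly.
JOB_CONFIG = {
    "categories": {
        "algorithms": {"provider": "ollama", "model": "qwen2.5-coder:14b"},
    },
    "judge": {"provider": "openrouter", "model": "openai/gpt-4o-mini"},
}

def resolve_backend(role):
    """Pick the provider/model pair for a pipeline role."""
    if role == "judge":
        return JOB_CONFIG["judge"]
    return JOB_CONFIG["categories"][role]
&lt;/code&gt;&lt;/pre&gt;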

&lt;p&gt;&lt;strong&gt;Custom endpoints&lt;/strong&gt;. Any OpenAI-compatible URL works: vLLM, TGI, your buddy's self-hosted gateway. Paste base URL + optional bearer token, done.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8pyfz0r5u9dv3obeo64.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8pyfz0r5u9dv3obeo64.png" alt=" " width="658" height="861"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instant cancel for local jobs&lt;/strong&gt;. Cloud APIs answer in seconds — cooperative cancel between calls is fine. Local 14B can sit on a single chat completion for &lt;em&gt;minutes&lt;/em&gt;. v1.0.3-beta wires &lt;code&gt;asyncio.Task.cancel()&lt;/code&gt; straight into the in-flight HTTP request, so cancel feels instant (~1s) instead of "wait 8 minutes for the chat call to time out".&lt;/p&gt;
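
&lt;p&gt;Roughly the shape of that wiring, assuming &lt;code&gt;httpx&lt;/code&gt; (a sketch, not the app's actual code). Cancelling the task raises &lt;code&gt;CancelledError&lt;/code&gt; inside the awaited call, which tears the socket down instead of waiting out the model:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import asyncio
import httpx

async def chat_call(client, payload):
    # client is an httpx.AsyncClient with base_url set to the provider
    r = await client.post("/v1/chat/completions", json=payload, timeout=None)
    return r.json()

async def run_cancellable(client, payload, cancel_event):
    task = asyncio.create_task(chat_call(client, payload))
    waiter = asyncio.create_task(cancel_event.wait())
    done, _ = await asyncio.wait({task, waiter}, return_when=asyncio.FIRST_COMPLETED)
    if waiter in done:       # user hit Cancel mid-generation
        task.cancel()        # aborts the in-flight request almost immediately
        try:
            await task
        except asyncio.CancelledError:
            return None
    waiter.cancel()
    return task.result()
&lt;/code&gt;&lt;/pre&gt;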

&lt;p&gt;&lt;strong&gt;Auto-handling for reasoning models&lt;/strong&gt;. Qwen3, DeepSeek-R1, and friends emit &lt;code&gt;&amp;lt;think&amp;gt;...&amp;lt;/think&amp;gt;&lt;/code&gt; blocks that eat the whole token budget before any actual output. The pipeline detects "reasoning starvation" (empty content + finish=length + reasoning present) and auto-retries with a 4× budget. No manual fiddling.&lt;/p&gt;
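
&lt;p&gt;The check is three signals together. A sketch of the logic described above (my reconstruction, not the pipeline's actual code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def needs_reasoning_retry(choice):
    msg = choice.get("message", {})
    content = (msg.get("content") or "").strip()
    reasoning = msg.get("reasoning") or msg.get("reasoning_content")
    # empty content + truncated by length + reasoning present
    # = the think block ate the whole budget
    return not content and choice.get("finish_reason") == "length" and bool(reasoning)

def generate(call_llm, payload):
    resp = call_llm(payload)
    if needs_reasoning_retry(resp["choices"][0]):
        payload = {**payload, "max_tokens": payload["max_tokens"] * 4}
        resp = call_llm(payload)  # one retry with a 4x token budget
    return resp
&lt;/code&gt;&lt;/pre&gt;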

&lt;h2&gt;Why this matters&lt;/h2&gt;

&lt;p&gt;Three concrete user types this unlocks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Privacy-conscious&lt;/strong&gt; — corporate data, NDA'd code, anything you can't send to a third-party API. Now stays on your laptop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-conscious&lt;/strong&gt; — generating 5000 multi-turn examples on cloud GPT-4 is $$$. On a local Qwen3-14B it's electricity. Mixed mode (cheap local gen + cloud judge for quality control) is roughly 1/10th the bill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No-cloud-account&lt;/strong&gt; — regulations, no credit card, country without payment methods. The whole pipeline now runs without a single API call to anyone.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What was harder than expected&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Token accounting across providers.&lt;/strong&gt; OpenRouter cleanly breaks out &lt;code&gt;reasoning_tokens&lt;/code&gt; in the usage payload. Ollama doesn't — &lt;code&gt;completion_tokens&lt;/code&gt; is the full think+content figure. So when DeepSeek-R1 via Ollama generates 80 tokens of actual output after 800 tokens of &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt;, the bill says 880, the dataset preview says 80, and the budget check trips constantly.&lt;/p&gt;

&lt;p&gt;Fix: detect &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; blocks (Format A) or &lt;code&gt;message.reasoning&lt;/code&gt; field (Format B), strip the reasoning, recount the kept content with tiktoken, write the corrected number back into &lt;code&gt;usage.completion_tokens&lt;/code&gt;. tiktoken is an estimate (not the model's native tokeniser), but it's the only signal available when the provider doesn't surface a breakdown. Quality Report and per-example token counts now agree.&lt;/p&gt;
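
&lt;p&gt;A condensed sketch of that fix (the real code handles more response shapes and edge cases):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import re
import tiktoken

THINK_RE = re.compile(r"&amp;lt;think&amp;gt;.*?&amp;lt;/think&amp;gt;\s*", re.DOTALL)
ENC = tiktoken.get_encoding("cl100k_base")  # estimate, not the model's tokeniser

def normalise_usage(message, usage):
    content = message.get("content") or ""
    stripped = THINK_RE.sub("", content)  # Format A: inline think block
    # Format B: Ollama's message.reasoning / LM Studio's message.reasoning_content
    reasoning = message.get("reasoning") or message.get("reasoning_content")
    if stripped == content and not reasoning:
        return  # nothing to correct
    message["content"] = stripped
    usage["completion_tokens"] = len(ENC.encode(stripped))  # recount kept content
&lt;/code&gt;&lt;/pre&gt;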

&lt;p&gt;&lt;strong&gt;LM Studio uses &lt;em&gt;yet another&lt;/em&gt; field name.&lt;/strong&gt; Same idea as Ollama's &lt;code&gt;message.reasoning&lt;/code&gt;, but they call it &lt;code&gt;message.reasoning_content&lt;/code&gt;. Discovered this with a curl, double-checked with another model, sigh. Pipeline still works because LM Studio &lt;em&gt;does&lt;/em&gt; surface &lt;code&gt;reasoning_tokens&lt;/code&gt; in &lt;code&gt;completion_tokens_details&lt;/code&gt; (more like OpenRouter), so the subtract path catches it. But the per-provider response shape table grew another row.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capability-driven branching, not provider-kind switches.&lt;/strong&gt; First draft of the integration had &lt;code&gt;if provider.kind == "ollama"&lt;/code&gt; peppered through the pipeline. That doesn't scale — the next user wants TabbyAPI, the one after wants their custom corporate gateway. So I refactored to &lt;code&gt;ProviderCapabilities&lt;/code&gt; flags: &lt;code&gt;supports_provider_routing&lt;/code&gt;, &lt;code&gt;supports_reasoning&lt;/code&gt;, &lt;code&gt;requires_api_key&lt;/code&gt;, &lt;code&gt;has_pricing&lt;/code&gt;, &lt;code&gt;supports_embeddings&lt;/code&gt;. Adding a new backend is now one class + one registry entry. Zero changes to &lt;code&gt;job_runner.py&lt;/code&gt;.&lt;/p&gt;
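
&lt;p&gt;The flag names are from the actual refactor; the registry shape below is my sketch of how they fit together:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass(frozen=True)
class ProviderCapabilities:
    supports_provider_routing: bool = False
    supports_reasoning: bool = False
    requires_api_key: bool = True
    has_pricing: bool = True
    supports_embeddings: bool = False

REGISTRY = {
    "openrouter": ProviderCapabilities(supports_provider_routing=True,
                                       supports_reasoning=True,
                                       supports_embeddings=True),
    "ollama": ProviderCapabilities(requires_api_key=False,
                                   has_pricing=False,
                                   supports_reasoning=True),
}

def maybe_attach_routing(payload, caps):
    # The pipeline branches on flags, never on provider kind.
    if caps.supports_provider_routing:
        payload["provider"] = {"sort": "price"}  # OpenRouter-style routing hint
    return payload
&lt;/code&gt;&lt;/pre&gt;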

&lt;p&gt;&lt;strong&gt;Default reassignment UX.&lt;/strong&gt; User clicks "Disable" on the OpenRouter provider (which happens to be the default). Old behaviour: silent orphan state, the next job hits a "Provider 'openrouter-default' is disabled" 422 error, and the user has no idea why. New behaviour: the backend auto-promotes the next enabled provider to default, and the frontend shows a 4-second toast: "Default switched to Ollama (local)". An annoying bug to track down, easy to fix once seen.&lt;/p&gt;
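
&lt;p&gt;The promotion itself is only a few lines (an illustrative sketch, not the real router code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def disable_provider(providers, provider_id):
    """Disable a provider; return the id of the new default, if it changed."""
    target = next(p for p in providers if p["id"] == provider_id)
    target["enabled"] = False
    if not target["is_default"]:
        return None
    target["is_default"] = False
    for p in providers:
        if p["enabled"]:  # promote the first remaining enabled provider
            p["is_default"] = True
            return p["id"]  # frontend toasts "Default switched to ..."
    return None
&lt;/code&gt;&lt;/pre&gt;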

&lt;h2&gt;What I learned&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&amp;lt;14B local models aren't worth it for dataset generation.&lt;/strong&gt; I tested 7B and 9B variants for a week. The output is technically valid but constantly drifts off-topic, repeats patterns, or misunderstands category descriptions. Whatever you save in cloud tokens, you spend 5× over on rejected examples. &lt;strong&gt;14B is the floor; 32B is the comfortable middle.&lt;/strong&gt; If you have the VRAM, use it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The judge model still matters more than the generator.&lt;/strong&gt; Same lesson as the original post, doubled by local LLM availability. I tried using small local judges to "save the judge cost too". Some 8B judges rubber-stamp 95-100 across the board. Some 14B judges skip 70% of perfectly good examples because they don't understand the category. Spend cloud money on the judge — or use a 32B+ local judge if you have the hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mixed mode is the killer feature.&lt;/strong&gt; I expected "fully offline" to be the win. Turns out the workflow most people actually want is: cheap local model for the volume work (7000 examples), strong cloud model as judge (because rubber-stamp judges silently kill dataset quality). v1.0.3-beta makes this a one-line config — pick gen model from one provider, judge from another, ship it.&lt;/p&gt;

&lt;h2&gt;What didn't work&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Per-provider concurrency limits.&lt;/strong&gt; I prototyped this — you'd configure "Ollama: 1, OpenRouter: 10" so the global semaphore doesn't drown your local GPU. Turned out to be enterprise-flavoured complexity for ~zero real-world benefit (single-user, single-GPU setups are probably 99% of users). Cut from v1.0.3-beta, parked for someone with multi-GPU vLLM who actually needs it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provider badge in the model picker.&lt;/strong&gt; When two providers serve the same model name (&lt;code&gt;llama-3.1-8b&lt;/code&gt; on both Ollama and OpenRouter), the picker shows two identical-looking entries. I sketched a small badge UI to differentiate them, then realised typical setups don't have name collisions (you know which models you put where). Punted to a future polish pass.&lt;/p&gt;

&lt;h2&gt;Tech stack updates&lt;/h2&gt;

&lt;p&gt;Same foundations, new layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: Next.js 16 (static export) + Tailwind + base-ui — added &lt;code&gt;ProvidersSection&lt;/code&gt; for CRUD + auto-detect + per-row connection test&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;: FastAPI + SQLite (WAL) + Pydantic — added &lt;code&gt;app/services/llm/&lt;/code&gt; provider abstraction (LLMProvider ABC + ProviderCapabilities) and &lt;code&gt;app/routers/providers.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema migration&lt;/strong&gt;: &lt;code&gt;providers&lt;/code&gt; table added in v6 with a backfill of the legacy single OpenRouter key — your existing setup migrates silently on first launch (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tests&lt;/strong&gt;: 460 passing (up from 329 in the previous release) — full coverage for the four backends, registry resolution, auto-detect, mixed-mode jobs&lt;/li&gt;
&lt;/ul&gt;
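
&lt;p&gt;The v6 migration boils down to something like this (a sketch; the column layout is my guess, not the actual schema):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import sqlite3

def migrate_v6(conn, legacy_openrouter_key):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS providers (
            id TEXT PRIMARY KEY,
            kind TEXT NOT NULL,
            base_url TEXT,
            api_key TEXT,
            enabled INTEGER DEFAULT 1,
            is_default INTEGER DEFAULT 0
        )""")
    if legacy_openrouter_key:  # backfill the pre-v6 single-key setup
        conn.execute(
            "INSERT OR IGNORE INTO providers VALUES (?, ?, ?, ?, 1, 1)",
            ("openrouter-default", "openrouter",
             "https://openrouter.ai/api/v1", legacy_openrouter_key))
    conn.commit()
&lt;/code&gt;&lt;/pre&gt;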

&lt;p&gt;Same AGPL-3.0 license. Same one-binary distribution (Linux AppImage, Windows exe).&lt;/p&gt;

&lt;h2&gt;Resources&lt;/h2&gt;

&lt;p&gt;Everything is open source:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;App repo (v1.0.3-beta)&lt;/strong&gt;: &lt;a href="https://github.com/AronDaron/dataset-generator" rel="noopener noreferrer"&gt;github.com/AronDaron/dataset-generator&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Original release post (with HumanEval +16pp benchmark)&lt;/strong&gt;: &lt;a href="https://forem.com/arondaron/desktop-app-to-generate-llm-fine-tuning-datasets-got-16pp-on-humaneval-4ng3" rel="noopener noreferrer"&gt;forem.com/arondaron/desktop-app-to-generate-llm-fine-tuning-datasets-got-16pp-on-humaneval-4ng3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dataset (2,248 examples)&lt;/strong&gt;: &lt;a href="https://huggingface.co/datasets/AronDaron/OctoBench-2.2k" rel="noopener noreferrer"&gt;huggingface.co/datasets/AronDaron/OctoBench-2.2k&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuned model&lt;/strong&gt;: &lt;a href="https://huggingface.co/AronDaron/Qwen2.5-Coder-7B-Instruct-OctoBench-2.2k-Fine-tune" rel="noopener noreferrer"&gt;huggingface.co/AronDaron/Qwen2.5-Coder-7B-Instruct-OctoBench-2.2k-Fine-tune&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What's next&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;System tray version&lt;/strong&gt;. Long generation runs (5000+ examples on local hardware = hours) deserve a quieter UX than a permanent open window. Tray icon, "next job ready" notification, click to bring back the dashboard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedding provider picker&lt;/strong&gt;. Right now dedup works multi-provider on the backend, but the UI only exposes the OpenRouter embedding models. Adding a small dropdown so local users can run dedup on &lt;code&gt;nomic-embed-text&lt;/code&gt; via Ollama too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two new categories targeting LiveCodeBench and BigCodeBench&lt;/strong&gt;. The previous post explained why those benchmarks barely moved (format mismatch on LCB, too-generic library category for BCB). Both fixes are in progress — algorithmic drill with edge-case coverage for LCB, library-API-precise taxonomy for BCB.&lt;/p&gt;

&lt;p&gt;If you generate datasets locally — what model size are you using and what's your accept rate? Especially curious if anyone got real value out of &amp;lt;14B local models for dataset gen, because my tests said no but I'd love to be wrong.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Disclosure: I drafted this post with AI help — same way I built the app.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>vibecoding</category>
      <category>python</category>
    </item>
    <item>
      <title>Desktop app to generate LLM fine-tuning datasets — got +16pp on HumanEval</title>
      <dc:creator>Radosław</dc:creator>
      <pubDate>Wed, 29 Apr 2026 09:37:46 +0000</pubDate>
      <link>https://forem.com/arondaron/desktop-app-to-generate-llm-fine-tuning-datasets-got-16pp-on-humaneval-4ng3</link>
      <guid>https://forem.com/arondaron/desktop-app-to-generate-llm-fine-tuning-datasets-got-16pp-on-humaneval-4ng3</guid>
      <description>&lt;p&gt;I'm not a professional developer. I learned by doing — vibe-coding with AI assistance — and a few months ago I wanted to fine-tune Qwen2.5-Coder-7B on my own data. The problem: there's no good way to generate a quality dataset without writing custom scripts every time, and existing tools are either CLI-heavy or built for researchers, not curious tinkerers.&lt;/p&gt;

&lt;p&gt;So I built one. It actually worked: my fine-tuned model went from &lt;strong&gt;55.5% to 72.3% on HumanEval&lt;/strong&gt; (5 runs averaged, Q4_K_M GGUF via Ollama).&lt;/p&gt;

&lt;p&gt;Here's what I built, what I learned, and what didn't work in this fine-tuning experiment.&lt;/p&gt;

&lt;h2&gt;What it is&lt;/h2&gt;

&lt;p&gt;A no-code desktop app (Linux, Windows) that automates the full dataset generation pipeline — topic planning, multi-turn example generation, quality scoring via LLM Judge, deduplication, and HuggingFace Hub upload. Pick categories, set proportions, click Generate, get a ready-to-train JSONL.&lt;/p&gt;

&lt;p&gt;Under the hood it runs a three-stage engine: topics → outlines → examples. Instead of a naive "generate 100 examples" prompt, the app decomposes the job first, which kills the repetitive patterns you get from one-shot generation. The app itself runs entirely on your machine; the only network traffic is the model calls, which go through OpenRouter (~300 models, one key).&lt;/p&gt;
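
&lt;p&gt;In sketch form the decomposition looks like this (function names and the &lt;code&gt;llm&lt;/code&gt; callable are illustrative, not the app's API):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def generate_category(llm, category, n_examples):
    # llm is any "prompt in, text out" callable.
    # Stage 1: plan diverse topics instead of asking for examples directly.
    topics = llm("List diverse topics for category: " + category).splitlines()
    examples = []
    for topic in topics:
        # Stage 2: outline a multi-turn exercise for each topic.
        outline = llm("Outline a multi-turn exercise about: " + topic)
        # Stage 3: only now write the full example.
        examples.append(llm("Write the full example from this outline: " + outline))
        if len(examples) == n_examples:
            break
    return examples
&lt;/code&gt;&lt;/pre&gt;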

&lt;p&gt;PS: I know there are similar apps for generating fine-tuning data — but as always, I build the tools I want to use myself.&lt;/p&gt;

&lt;p&gt;A few features that made my life easier:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Per-category models&lt;/strong&gt; — different generators for different example types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-as-judge&lt;/strong&gt; — every example gets scored, low-quality ones rejected&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding deduplication&lt;/strong&gt; — cosine similarity removes near-duplicates before export&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HuggingFace upload&lt;/strong&gt; — push straight to the Hub when done&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality Report&lt;/strong&gt; — score histograms, token stats, per-category accept rates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resume on crash&lt;/strong&gt; — interrupted jobs restart from where they stopped (this saved me hours)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Resources&lt;/h2&gt;

&lt;p&gt;Everything is open source and reproducible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dataset (2,248 examples)&lt;/strong&gt;: &lt;a href="https://huggingface.co/datasets/AronDaron/OctoBench-2.2k" rel="noopener noreferrer"&gt;huggingface.co/datasets/AronDaron/OctoBench-2.2k&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuned model&lt;/strong&gt;: &lt;a href="https://huggingface.co/AronDaron/Qwen2.5-Coder-7B-Instruct-OctoBench-2.2k-Fine-tune" rel="noopener noreferrer"&gt;huggingface.co/AronDaron/Qwen2.5-Coder-7B-Instruct-OctoBench-2.2k-Fine-tune&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;App repo&lt;/strong&gt;: &lt;a href="https://github.com/AronDaron/dataset-generator" rel="noopener noreferrer"&gt;github.com/AronDaron/dataset-generator&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The result&lt;/h2&gt;

&lt;p&gt;I generated 2,248 examples across 8 categories targeting different code skills, then fine-tuned Qwen2.5-Coder-7B-Instruct (QLoRA via Unsloth, Q4_K_M GGUF served via Ollama).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Base&lt;/th&gt;
&lt;th&gt;Fine-tuned&lt;/th&gt;
&lt;th&gt;Δ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HumanEval (5 runs avg, n=164, t=0.2)&lt;/td&gt;
&lt;td&gt;55.5% (±2.1)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;72.3% (±2.0)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+16.8pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HumanEval+ (5 runs avg, n=164, t=0.2)&lt;/td&gt;
&lt;td&gt;49.0% (±1.9)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;65.1% (±1.6)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+16.1pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BigCodeBench full instruct (1 run, n=1140)&lt;/td&gt;
&lt;td&gt;39.3%&lt;/td&gt;
&lt;td&gt;39.7%&lt;/td&gt;
&lt;td&gt;+0.4pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiveCodeBench v6 (1 run, n=1055, t=0.0)&lt;/td&gt;
&lt;td&gt;29.0%&lt;/td&gt;
&lt;td&gt;26.9%&lt;/td&gt;
&lt;td&gt;-2.1pp&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0a45q9xpey9mvn16vk4y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0a45q9xpey9mvn16vk4y.png" alt=" " width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;HumanEval and HumanEval+ were the win. BigCodeBench barely moved and LiveCodeBench actually regressed. Both led to interesting lessons.&lt;/p&gt;

&lt;h2&gt;What surprised me&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;LCB regressed because of a format mismatch, not a knowledge gap.&lt;/strong&gt; I checked the fail cases — model output had correct logic but the wrong wrapper. My training data said "return only the function" while LCB tests need full programs with &lt;code&gt;input()&lt;/code&gt; / &lt;code&gt;print()&lt;/code&gt;. Format mismatches show up as "wrong answer" on benchmarks, but they're way easier to fix than actual missing knowledge.&lt;/p&gt;
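
&lt;p&gt;Concretely, the two shapes look like this (a toy illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# What my training data taught ("return only the function"):
def solve(a, b):
    return a + b

# What LiveCodeBench actually executes (full program, stdin/stdout):
a, b = map(int, input().split())
print(a + b)
&lt;/code&gt;&lt;/pre&gt;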

&lt;p&gt;&lt;strong&gt;Judge model matters more than generator model.&lt;/strong&gt; I tested several judges. Some flash-tier models rubber-stamped almost everything (scores 95-100 across the board), while smaller models skipped 70% of examples they didn't understand. Pick the wrong judge and your "quality dataset" is just noise with a fancy filter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concise prompts beat elaborate ones.&lt;/strong&gt; I started with detailed multi-paragraph category descriptions. Generation quality got &lt;em&gt;worse&lt;/em&gt;. Stripped them down to 2-3 sentences with a 4-6 item judge criteria — accept rate jumped, output got cleaner.&lt;/p&gt;

&lt;h2&gt;What didn't work&lt;/h2&gt;

&lt;p&gt;I tried to be clever with judge criteria. I added more and more filters trying to catch every edge case I noticed in pilot runs. Accept rate dropped from ~85% to 10%. The filters were technically correct, but the generator couldn't deliver against all of them. Lesson: it's better to accept some noise than to over-constrain and stall the whole pipeline.&lt;/p&gt;

&lt;p&gt;I also wasted time on BigCodeBench. My "Data Libraries" category was too generic — "any 2+ libs from this list" — and BCB tests precise library API usage with concrete kwargs. Result: +0.4pp. To actually move BCB, I'd need a category seeded from BCB's own taxonomy of ~139 libraries with specific signature drilling.&lt;/p&gt;

&lt;h2&gt;Tech stack&lt;/h2&gt;

&lt;p&gt;Nothing exotic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: Next.js 16 (static export) + Tailwind + shadcn/ui&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;: FastAPI + SQLite (WAL mode) + Pydantic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Desktop&lt;/strong&gt;: pywebview (WebKit2 on Linux, WebView2 on Windows)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Packaging&lt;/strong&gt;: PyInstaller — Linux AppImage works (~73 MB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM access&lt;/strong&gt;: OpenRouter (no vendor lock-in, switch models freely)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dedup&lt;/strong&gt;: OpenRouter embeddings + numpy cosine similarity (see the sketch after this list)&lt;/li&gt;
&lt;/ul&gt;
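
&lt;p&gt;The dedup step is conceptually a few lines of numpy (a sketch; the 0.92 threshold is illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def dedup(examples, embeddings, threshold=0.92):
    """Keep an example only if it isn't too close to anything already kept."""
    emb = np.asarray(embeddings, dtype=np.float32)
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalise rows
    kept, kept_idx = [], []
    for i, example in enumerate(examples):
        if kept_idx and np.max(emb[kept_idx] @ emb[i]) &amp;gt;= threshold:
            continue  # near-duplicate of something already kept
        kept.append(example)
        kept_idx.append(i)
    return kept
&lt;/code&gt;&lt;/pre&gt;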

&lt;p&gt;License is AGPL-3.0 — I picked it over MIT on purpose. If someone wraps this as SaaS, I want the changes to come back to the project.&lt;/p&gt;

&lt;h2&gt;What's next&lt;/h2&gt;

&lt;p&gt;Local LLM support (Ollama / LM Studio) so people can generate datasets without paying for API calls. After that, a system tray version for quieter long-running jobs.&lt;/p&gt;

&lt;p&gt;Already in progress: two new categories targeting LiveCodeBench (algorithmic drill with edge-case coverage) and BigCodeBench (API-precise library taxonomy). Goal is to lift the two benchmarks where this run fell flat.&lt;/p&gt;

&lt;p&gt;If you've fine-tuned a model on a synthetic dataset, I'd love to hear what worked for you — especially around judge model selection and category design. Drop a comment.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Disclosure: I drafted this post with AI help — same way I built the app.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vibecoding</category>
      <category>python</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
