<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: qcrao</title>
    <description>The latest articles on Forem by qcrao (@qcrao).</description>
    <link>https://forem.com/qcrao</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3895196%2Fdc8d29f7-4dbf-4ec7-a679-d9d3fa85ba0b.jpg</url>
      <title>Forem: qcrao</title>
      <link>https://forem.com/qcrao</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/qcrao"/>
    <language>en</language>
    <item>
      <title>5 LoRA training pitfalls when you're trying to lock down a comic character</title>
      <dc:creator>qcrao</dc:creator>
      <pubDate>Thu, 07 May 2026 09:31:29 +0000</pubDate>
      <link>https://forem.com/qcrao/5-lora-training-pitfalls-when-youre-trying-to-lock-down-a-comic-character-43bl</link>
      <guid>https://forem.com/qcrao/5-lora-training-pitfalls-when-youre-trying-to-lock-down-a-comic-character-43bl</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;TLDR: Most "my LoRA works in test prompts but breaks the second I put it in a comic panel" problems are caused at training time, not at inference. Here are the five training-side mistakes that ate the most weekends for me.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;I've spent the last eight months building &lt;a href="https://www.comicory.com" rel="noopener noreferrer"&gt;Comicory&lt;/a&gt;, an AI comic generator where the entire pitch is "your character looks the same on page 1 and page 12." That sentence is easy to say. It is &lt;em&gt;grindingly&lt;/em&gt; hard to ship.&lt;/p&gt;

&lt;p&gt;Almost every fix I shipped in those eight months traced back to LoRA training, not the prompt or the sampler or the seed. This post is the list I wish someone had given me on day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pitfall 1: Your training set has too many "same shot" images
&lt;/h2&gt;

&lt;p&gt;The first character LoRA I trained had 32 images. 28 of them were 3/4-view portraits with neutral lighting, looking slightly off-camera. It was the dataset I had, scraped from concept-art-style references.&lt;/p&gt;

&lt;p&gt;The LoRA trained beautifully. Then I tried to use it in an actual comic panel — wide shot, side profile, character mid-action — and the output looked nothing like the reference. The model had memorized the &lt;em&gt;pose&lt;/em&gt;, not the character.&lt;/p&gt;

&lt;p&gt;Fix: aim for &lt;strong&gt;pose, framing, and lighting diversity&lt;/strong&gt; before you aim for image count. My current target for a character is roughly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;30% close-up faces (multiple angles)&lt;/li&gt;
&lt;li&gt;30% medium shots (waist-up, multiple angles)&lt;/li&gt;
&lt;li&gt;25% full-body shots&lt;/li&gt;
&lt;li&gt;15% "weird" shots — back of head, dramatic angle, partial occlusion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Quality of &lt;em&gt;coverage&lt;/em&gt; matters more than count. A 25-image set with this distribution beats a 70-image set of nothing-but-portraits, every single time.&lt;/p&gt;
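&lt;p&gt;A quick way to audit this before training is to tally shot types against the target mix. A minimal sketch, assuming you've tagged each image with one of four hypothetical labels (&lt;code&gt;closeup&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, &lt;code&gt;fullbody&lt;/code&gt;, &lt;code&gt;weird&lt;/code&gt;):&lt;/p&gt;

```python
from collections import Counter

# Target distribution from the list above
TARGET = {"closeup": 0.30, "medium": 0.30, "fullbody": 0.25, "weird": 0.15}

def coverage_report(shot_types):
    """Compare a dataset's shot-type mix to TARGET.

    shot_types: one label per training image.
    Returns {label: (actual_fraction, target_fraction, deficit_in_images)}.
    """
    n = len(shot_types)
    counts = Counter(shot_types)
    report = {}
    for label, target in TARGET.items():
        have = counts.get(label, 0)
        deficit = max(0, round(target * n) - have)
        report[label] = (round(have / n, 2), target, deficit)
    return report
```

&lt;p&gt;Running this on my original 32-image set would have flagged the full-body and "weird" buckets as nearly empty before I burned a weekend finding out the hard way.&lt;/p&gt;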

&lt;h2&gt;
  
  
  Pitfall 2: You captioned the character into the wallpaper
&lt;/h2&gt;

&lt;p&gt;This one is sneaky. In my early datasets, every caption looked like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ck_character standing in a forest, anime style, soft lighting, high detail
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model learned &lt;code&gt;ck_character&lt;/code&gt; as inseparable from "standing in a forest, soft lighting." When I prompted &lt;code&gt;ck_character on a spaceship bridge&lt;/code&gt;, the LoRA pulled in foliage and warm light because those concepts had been bound to the trigger token.&lt;/p&gt;

&lt;p&gt;Fix: &lt;strong&gt;caption away the things you want to vary&lt;/strong&gt;, and leave only what is invariant about the character. If your character is supposed to be usable in any setting, your caption should look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ck_character, red jacket, short black hair, freckles
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No setting, no lighting, no mood. Those are the variables you'll set at inference time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# What I do during caption preprocessing now: keep only the invariant
# character tags, so scene/style concepts never bind to the trigger token.
INVARIANT_TAGS = ["red_jacket", "short_black_hair", "freckles"]
# Examples of what the whitelist drops: scene, lighting, and style tags
STRIPPED_TAGS = ["forest", "soft_lighting", "high_detail", "outdoor", "indoor"]

def clean_caption(raw_tags, trigger="ck_character"):
    keep = [t for t in raw_tags if t in INVARIANT_TAGS]
    return ", ".join([trigger] + keep)  # no trailing comma when keep is empty
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This change alone gave me the single biggest jump in cross-scene consistency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pitfall 3: You trained at one resolution and then panel-rendered at another
&lt;/h2&gt;

&lt;p&gt;Stable Diffusion 1.5 LoRAs trained at 512×512 fall apart at 768×1152 panel aspect ratios. SDXL is more forgiving but not immune. The model has not seen the character at the panel aspect ratio you actually need.&lt;/p&gt;

&lt;p&gt;Fix: &lt;strong&gt;bucketed training across the aspect ratios you'll actually render at.&lt;/strong&gt; kohya-ss supports this out of the box. My current bucket config covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;512×768 (portrait panel)&lt;/li&gt;
&lt;li&gt;768×512 (landscape panel)&lt;/li&gt;
&lt;li&gt;768×768 (splash square)&lt;/li&gt;
&lt;li&gt;1024×1536 (full-page hero)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Image counts in each bucket should roughly match how often you'll render at that aspect. If 70% of your panels are landscape, 70% of your training images should be landscape — even if it means cropping the same source image into multiple buckets.&lt;/p&gt;
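&lt;p&gt;Turning that rule into numbers is mechanical. A sketch (the function name is mine):&lt;/p&gt;

```python
def bucket_targets(total_images, render_freq):
    """Allocate training images to aspect buckets in proportion to how
    often each aspect gets rendered.

    render_freq: {(width, height): fraction_of_panels_at_that_aspect}
    Returns {(width, height): image_count}, summing to total_images.
    """
    targets = {b: round(total_images * f) for b, f in render_freq.items()}
    # Rounding can drop or add an image; pin the total on the largest bucket
    drift = total_images - sum(targets.values())
    if drift:
        biggest = max(targets, key=targets.get)
        targets[biggest] += drift
    return targets
```

&lt;p&gt;For a 40-image set rendered 70% landscape, 20% portrait, 10% square, that puts 28 images in the landscape bucket, cropping source images as needed to get there.&lt;/p&gt;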

&lt;h2&gt;
  
  
  Pitfall 4: Your learning rate is fighting your dataset size
&lt;/h2&gt;

&lt;p&gt;There is no universal "good" LR. Tiny datasets (15-25 images) want a &lt;em&gt;lower&lt;/em&gt; LR and &lt;em&gt;more&lt;/em&gt; steps so the model doesn't overfit on the handful of examples. Bigger sets (60+) tolerate a higher LR and fewer epochs.&lt;/p&gt;

&lt;p&gt;What I use as a starting point now (kohya-ss, SDXL LoRA, rank 16):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset size&lt;/th&gt;
&lt;th&gt;unet_lr&lt;/th&gt;
&lt;th&gt;text_encoder_lr&lt;/th&gt;
&lt;th&gt;epochs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;15-25 images&lt;/td&gt;
&lt;td&gt;1e-4&lt;/td&gt;
&lt;td&gt;5e-5&lt;/td&gt;
&lt;td&gt;12-15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25-50 images&lt;/td&gt;
&lt;td&gt;2e-4&lt;/td&gt;
&lt;td&gt;1e-4&lt;/td&gt;
&lt;td&gt;8-10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50-100 images&lt;/td&gt;
&lt;td&gt;3e-4&lt;/td&gt;
&lt;td&gt;1e-4&lt;/td&gt;
&lt;td&gt;6-8&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are &lt;em&gt;starting points&lt;/em&gt;, not laws. But they will save you from the two failure modes I kept hitting: undertraining ("LoRA does nothing") and overcooking ("LoRA always renders the same expression").&lt;/p&gt;
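&lt;p&gt;If you train a lot of characters, the table is worth encoding as a lookup so every run starts from the same defaults. A sketch mirroring the table (the function name and the hard 15-100 bounds are my own convention):&lt;/p&gt;

```python
def starting_hparams(n_images):
    """Starting SDXL rank-16 LoRA hyperparameters, keyed on dataset size.
    Mirrors the table above; these are starting points, not laws."""
    if 15 > n_images or n_images > 100:
        raise ValueError("table only covers 15-100 images")
    if 25 >= n_images:
        return {"unet_lr": 1e-4, "text_encoder_lr": 5e-5, "epochs": (12, 15)}
    if 50 >= n_images:
        return {"unet_lr": 2e-4, "text_encoder_lr": 1e-4, "epochs": (8, 10)}
    return {"unet_lr": 3e-4, "text_encoder_lr": 1e-4, "epochs": (6, 8)}
```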

&lt;p&gt;Check loss curves. If validation loss bottoms out around epoch 4 and rises after, your LR is too high or you have too few images. If it's still falling at the last epoch, train longer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pitfall 5: You skipped regularization images and now the LoRA bleeds into everything
&lt;/h2&gt;

&lt;p&gt;You ship the LoRA. You prompt &lt;code&gt;a coffee shop, no characters, photorealistic&lt;/code&gt;. Your character shows up anyway, faintly haunting the espresso machine.&lt;/p&gt;

&lt;p&gt;This is the LoRA "leaking" into general concepts because it has no contrast set. The model has no examples of "what a person who is NOT this character looks like" during training, so the LoRA's identity bleeds into the base model's "person" concept.&lt;/p&gt;

&lt;p&gt;Fix: &lt;strong&gt;regularization images.&lt;/strong&gt; During training, alongside your character set, include a folder of generic "person" images (200-300, captioned simply as &lt;code&gt;person&lt;/code&gt;) generated by the base model itself. These tell the LoRA "this is what NOT-the-character looks like."&lt;/p&gt;

&lt;p&gt;In kohya-ss config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[[datasets]]&lt;/span&gt;
  &lt;span class="nn"&gt;[[datasets.subsets]]&lt;/span&gt;
    &lt;span class="py"&gt;image_dir&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/data/ck_character"&lt;/span&gt;
    &lt;span class="py"&gt;class_tokens&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ck_character"&lt;/span&gt;
    &lt;span class="py"&gt;num_repeats&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;

  &lt;span class="nn"&gt;[[datasets.subsets]]&lt;/span&gt;
    &lt;span class="py"&gt;image_dir&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/data/reg_person"&lt;/span&gt;
    &lt;span class="py"&gt;class_tokens&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"person"&lt;/span&gt;
    &lt;span class="py"&gt;num_repeats&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="py"&gt;is_reg&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The leaking effect drops to near-zero. Your background characters look like background characters again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Character consistency is, in practice, a checklist of these five training-time decisions plus a workflow that uses the resulting LoRA correctly. The inference side (ControlNet, IP-Adapter, reference-only) only matters once your LoRA is solid. If your LoRA is bad, no amount of inference scaffolding will save it.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://www.comicory.com" rel="noopener noreferrer"&gt;Comicory&lt;/a&gt; because I wanted a comic generator that didn't make me re-prompt the character on every panel. The five fixes above are the spine of how it works under the hood.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>stablediffusion</category>
      <category>sideprojects</category>
      <category>indie</category>
    </item>
    <item>
      <title>What I learned squeezing the YouTube Data API v3 quota for a side project</title>
      <dc:creator>qcrao</dc:creator>
      <pubDate>Thu, 07 May 2026 09:12:47 +0000</pubDate>
      <link>https://forem.com/qcrao/what-i-learned-squeezing-the-youtube-data-api-v3-quota-for-a-side-project-3304</link>
      <guid>https://forem.com/qcrao/what-i-learned-squeezing-the-youtube-data-api-v3-quota-for-a-side-project-3304</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;TLDR: The default 10,000 unit/day quota burns through in fewer than 30 naive user sessions. Three tricks pulled my per-user cost down 50× and let me ship TubeVocab on the free tier.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;When I started building TubeVocab — an ESL learning tool that turns any YouTube video into an interactive, clickable transcript for vocabulary learning — I assumed the YouTube Data API v3 would be the cheap, easy part. "It's Google. It scales. The free tier is generous." That kind of gut feeling.&lt;/p&gt;

&lt;p&gt;I was wrong. The free tier &lt;em&gt;is&lt;/em&gt; generous, but only if you understand how quota math actually works. Most public tutorials skip this. Here's what I learned the hard way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The quota arithmetic nobody puts in the quickstart
&lt;/h2&gt;

&lt;p&gt;Default daily quota: &lt;strong&gt;10,000 units&lt;/strong&gt;. Sounds like a lot.&lt;/p&gt;

&lt;p&gt;Then you start reading the &lt;a href="https://developers.google.com/youtube/v3/determine_quota_cost" rel="noopener noreferrer"&gt;cost table&lt;/a&gt; and realize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;search.list&lt;/code&gt; — &lt;strong&gt;100 units&lt;/strong&gt; per call. That's how you find a video by query.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;videos.list&lt;/code&gt; — &lt;strong&gt;1 unit&lt;/strong&gt; per call. That's how you fetch metadata once you have an ID.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;captions.list&lt;/code&gt; — &lt;strong&gt;50 units&lt;/strong&gt;. Lists the subtitle tracks available for a video.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;captions.download&lt;/code&gt; — &lt;strong&gt;200 units&lt;/strong&gt;. The actual subtitle data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your user-facing flow is "search a YouTube channel → pick a video → load subtitles → render the interactive player," you're looking at roughly &lt;code&gt;100 + 1 + 50 + 200 = 351 units&lt;/code&gt; per &lt;em&gt;single user session&lt;/em&gt;. The 10,000 free units evaporate in &lt;strong&gt;28 sessions/day&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That's not a side project. That's a 30-DAU launch and you're paying for quota expansion the next morning.&lt;/p&gt;
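&lt;p&gt;The arithmetic is worth scripting once so you can price any new flow before building it. A sketch using the costs above:&lt;/p&gt;

```python
# Quota costs per endpoint, from the cost table above
COSTS = {
    "search.list": 100,
    "videos.list": 1,
    "captions.list": 50,
    "captions.download": 200,
}
DAILY_QUOTA = 10_000

def session_cost(endpoints):
    """Total quota units for one user session hitting these endpoints."""
    return sum(COSTS[e] for e in endpoints)

naive = session_cost(["search.list", "videos.list", "captions.list", "captions.download"])
sessions_per_day = DAILY_QUOTA // naive  # 28 for the naive flow
```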

&lt;h2&gt;
  
  
  Three tricks that cut my per-user cost ~50×
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Don't use &lt;code&gt;search.list&lt;/code&gt; for known IDs
&lt;/h3&gt;

&lt;p&gt;This sounds obvious in hindsight, but it took me a week to see. If a user pastes a YouTube URL, &lt;strong&gt;the video ID is right there in the URL&lt;/strong&gt;. Parse it. Skip &lt;code&gt;search.list&lt;/code&gt; entirely.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Bad: 100 units per pasted URL&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;youtube&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;search&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;q&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;pastedUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;video&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;part&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;snippet&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Good: 0 units, regex the ID&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;pastedUrl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;(?:&lt;/span&gt;&lt;span class="sr"&gt;v=|youtu&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="sr"&gt;be&lt;/span&gt;&lt;span class="se"&gt;\/)([\w&lt;/span&gt;&lt;span class="sr"&gt;-&lt;/span&gt;&lt;span class="se"&gt;]{11})&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="p"&gt;)?.[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;youtube&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;videos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;part&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;snippet,contentDetails&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt; &lt;span class="c1"&gt;// 1 unit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This one change took the average pasted-URL flow from 351 units → 251 units.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Skip the official &lt;code&gt;captions.*&lt;/code&gt; endpoints entirely
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;captions.download&lt;/code&gt; endpoint costs 200 units per video AND requires OAuth (the user has to be the video owner). For non-owner subtitle access — i.e. the actual ESL use case — you need a different path.&lt;/p&gt;

&lt;p&gt;The trick: YouTube serves the auto-generated and uploader-provided subtitles through an undocumented but stable XML endpoint that doesn't count against your quota at all. You can get the timed transcript via &lt;code&gt;https://video.google.com/timedtext?lang=en&amp;amp;v=VIDEO_ID&lt;/code&gt;, parse the XML, and you're done. &lt;strong&gt;0 quota units.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;(Caveat: this endpoint is undocumented, so it can break. I have a fallback path that uses &lt;code&gt;youtube-transcript-api&lt;/code&gt; style scraping. The combined approach gets ~95% subtitle hit rate without touching the official caption quota.)&lt;/p&gt;

&lt;p&gt;After this, my "load subtitles" cost dropped from 250 → 1 unit per session.&lt;/p&gt;
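&lt;p&gt;Parsing the response is straightforward: the body is a flat list of &amp;lt;text&amp;gt; elements with &lt;code&gt;start&lt;/code&gt; and &lt;code&gt;dur&lt;/code&gt; attributes. A sketch (the endpoint is undocumented, so treat these element and attribute names as observed behavior, not a contract):&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

def parse_timedtext(xml_body):
    """Parse a timedtext XML body into (start_sec, dur_sec, text) cues."""
    root = ET.fromstring(xml_body)
    cues = []
    for node in root.iter("text"):
        cues.append((
            float(node.get("start", 0)),
            float(node.get("dur", 0)),
            (node.text or "").strip(),
        ))
    return cues
```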

&lt;h3&gt;
  
  
  3. Cache aggressively at the video-ID level
&lt;/h3&gt;

&lt;p&gt;Every time someone watches a video on TubeVocab, the metadata + subtitle + thumbnail set is &lt;em&gt;the same&lt;/em&gt; until the video itself changes. I run a per-video-ID cache (just SQLite; nothing fancier is needed) with no expiry. Subsequent views of the same video cost &lt;strong&gt;zero quota&lt;/strong&gt;, regardless of how many users watch it.&lt;/p&gt;

&lt;p&gt;Once I had ~500 popular videos cached, my marginal cost per session was effectively zero. The quota is now spent only on first-time-seen videos.&lt;/p&gt;
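&lt;p&gt;The cache itself is a few lines. A sketch of the idea, not TubeVocab's actual schema:&lt;/p&gt;

```python
import json
import sqlite3

def open_cache(path=":memory:"):
    """Per-video-ID cache with no expiry."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS video_cache "
        "(video_id TEXT PRIMARY KEY, payload TEXT)"
    )
    return db

def get_or_fetch(db, video_id, fetch):
    """Return the cached payload for video_id; call fetch() (which spends
    quota) only on a miss, then store the result forever."""
    row = db.execute(
        "SELECT payload FROM video_cache WHERE video_id = ?", (video_id,)
    ).fetchone()
    if row:
        return json.loads(row[0])
    data = fetch(video_id)
    db.execute(
        "INSERT INTO video_cache VALUES (?, ?)", (video_id, json.dumps(data))
    )
    db.commit()
    return data
```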

&lt;h2&gt;
  
  
  What actually shipped
&lt;/h2&gt;

&lt;p&gt;After these three optimizations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average new-video session: &lt;strong&gt;~2 units&lt;/strong&gt; (videos.list + occasional fallback)&lt;/li&gt;
&lt;li&gt;Average cached-video session: &lt;strong&gt;0 units&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Daily ceiling on the free tier: ~5,000 unique new videos/day before I'd need to start budgeting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's enough headroom for the foreseeable lifetime of a side project.&lt;/p&gt;

&lt;p&gt;If you're building anything in the YouTube + content-analysis space — vocabulary tools, accessibility, search, analytics — the playbook is roughly: &lt;strong&gt;assume &lt;code&gt;search.list&lt;/code&gt; is poison, route around &lt;code&gt;captions.*&lt;/code&gt;, and cache by video ID forever&lt;/strong&gt;. The free tier becomes more than generous once you stop fighting it.&lt;/p&gt;




&lt;p&gt;For context: I built &lt;a href="https://www.tubevocab.com" rel="noopener noreferrer"&gt;TubeVocab&lt;/a&gt; using exactly this stack — it's a click-to-flashcard ESL tool that turns any YouTube video into vocabulary practice. The quota math was the single most underestimated technical risk of the whole project. Hope this saves someone a week.&lt;/p&gt;

</description>
      <category>youtube</category>
      <category>api</category>
      <category>sideprojects</category>
      <category>indie</category>
    </item>
    <item>
      <title>The Engineering Challenge of Turning YouTube Into an ESL Corpus</title>
      <dc:creator>qcrao</dc:creator>
      <pubDate>Fri, 24 Apr 2026 03:34:47 +0000</pubDate>
      <link>https://forem.com/qcrao/the-engineering-challenge-of-turning-youtube-into-an-esl-corpus-5bgi</link>
      <guid>https://forem.com/qcrao/the-engineering-challenge-of-turning-youtube-into-an-esl-corpus-5bgi</guid>
      <description>&lt;p&gt;Language learning apps have spent a decade chasing the same pattern: curate a 2,000-word "high-frequency vocabulary" list, wrap it in spaced repetition, ship. Users grind, retention looks great in the app, and then they meet an actual English speaker and freeze, because &lt;strong&gt;recognizing a word on a flashcard is not the same skill as catching it in running speech&lt;/strong&gt;. The information is in their head but it is not wired to sound, pace, register, or context.&lt;/p&gt;

&lt;p&gt;The intuition behind context-based acquisition — learning words &lt;em&gt;in situ&lt;/em&gt;, inside real discourse — is old and well supported in second-language acquisition research. The problem has always been that the "real discourse" part is hard to deliver at scale. Textbook dialogues are not real. Classroom tapes are not real. Even podcasts are a curated subset.&lt;/p&gt;

&lt;p&gt;YouTube is real. It is also the single largest corpus of native-speaker content in every register you care about: casual vlogs, lectures, interviews, comedy, news, gameplay commentary, technical talks. For ESL specifically, the fact that speakers vary in accent, speed, and slang is a feature, not a bug.&lt;/p&gt;

&lt;p&gt;The engineering question is: what would it take to turn YouTube into a usable ESL corpus?&lt;/p&gt;

&lt;h2&gt;
  
  
  The interactivity problem
&lt;/h2&gt;

&lt;p&gt;Watching YouTube with auto-subtitles on is already useful for listening comprehension. The gap is that &lt;strong&gt;subtitles are read-only&lt;/strong&gt;. A learner hits an unfamiliar word, pauses the video, tabs to a dictionary, types the word, gets a translation, tabs back, loses their place. After three such interruptions in a 10-minute video most learners give up and either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stop pausing (and therefore stop learning from the unfamiliar words), or&lt;/li&gt;
&lt;li&gt;abandon the video entirely.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The right interaction is &lt;strong&gt;click-a-word → instant translation + pronunciation + example sentence → optionally save as flashcard&lt;/strong&gt;, all without leaving the player. That turns a 10-minute video into a vocab-building session instead of a comprehension test.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this is harder than it looks
&lt;/h2&gt;

&lt;p&gt;A few things get in the way:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Subtitle alignment.&lt;/strong&gt; YouTube auto-subs are word-timed for about 80% of videos; manual subs are sentence-timed. A click-a-word UI has to handle both gracefully, ideally highlighting the clicked word with &amp;lt;50ms latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tokenization across languages.&lt;/strong&gt; Clicking "running" should map to the lemma "run" for dictionary lookup. Clicking "auf" in a German phrase should resolve to the correct sense given context. Clicking "不好意思" in Chinese should resolve as a multi-character idiom, not char-by-char.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disambiguation.&lt;/strong&gt; "Bank" in a finance video is different from "bank" in a kayaking video. A naive dictionary lookup gives the most common sense; a better system checks surrounding context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personalization.&lt;/strong&gt; A B2 learner does not want to be interrupted every time "the" appears. The system needs to model what the learner already knows and surface only likely-unknown words — ideally inferred from past clicks, not a placement test.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flashcard hygiene.&lt;/strong&gt; Saving raw dictionary entries produces terrible flashcards. The good ones include the word in its &lt;em&gt;original sentence&lt;/em&gt;, the speaker, optionally a short audio clip. This turns retention from "definition recall" into "episodic recall," which is massively stronger.&lt;/li&gt;
&lt;/ol&gt;
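&lt;p&gt;To make the tokenization point concrete, here is a toy version of the lemma-lookup step, with an exception table checked before suffix rules. A real system would use spaCy or a proper morphological analyzer; this sketch only shows the shape of the problem:&lt;/p&gt;

```python
# Irregular forms must be looked up before any suffix stripping
EXCEPTIONS = {"ran": "run", "running": "run", "went": "go", "better": "good"}
# Naive suffix rules: (suffix_to_strip, replacement)
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def lemma(word):
    """Map a clicked surface form to a dictionary headword (toy version)."""
    w = word.lower()
    if w in EXCEPTIONS:
        return EXCEPTIONS[w]
    for suffix, repl in SUFFIX_RULES:
        # Guard so short words like "run" or "the" are left alone
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: len(w) - len(suffix)] + repl
    return w
```

&lt;p&gt;Even this toy version shows why per-language handling matters: none of these rules transfer to German sense disambiguation or Chinese multi-character idioms.&lt;/p&gt;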

&lt;h2&gt;
  
  
  What it looks like when it works
&lt;/h2&gt;

&lt;p&gt;I have been using &lt;a href="https://www.tubevocab.com" rel="noopener noreferrer"&gt;tubevocab.com&lt;/a&gt; for a month as a hosted implementation of the click-a-word-on-YouTube pattern. Drop in a video URL, watch with interactive subtitles, click a word to see the translation and an AI-generated example sentence, save it to a flashcard deck with the original sentence attached, let spaced repetition handle scheduling. The UI is available in 10 languages, which matters for learners whose L1 is not English.&lt;/p&gt;

&lt;p&gt;What I noticed over the month:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retention is visibly better&lt;/strong&gt; than flat Anki decks, because you remember the speaker and the scene along with the word.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Listening comprehension improves faster than raw vocab count&lt;/strong&gt;. You start catching phrases you would have missed before, including phrases you never actually &lt;em&gt;studied&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The cost of saving a card is near zero&lt;/strong&gt; — one click, inline — which is what makes the workflow stick. Anki's friction cost is why most learners quit it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Free tier covers the dictionary, the flashcards, and the spaced repetition, which is enough to evaluate whether the loop works for a given learner without committing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I am bringing this up
&lt;/h2&gt;

&lt;p&gt;From an engineering standpoint, "interactive learning layer on top of YouTube" is a genuinely interesting systems problem: you are doing real-time NLP on streaming caption data, building a personalized word-knowledge model, and rendering a low-latency overlay on a player you do not control. Most of the research attention in language-learning tech has gone to generative tutors and chatbots; the infrastructure for &lt;em&gt;exposure-driven&lt;/em&gt; acquisition is comparatively under-built.&lt;/p&gt;

&lt;p&gt;For ESL learners specifically, the payoff is pragmatic: the gap between "I studied 3,000 words" and "I can follow a normal conversation" closes a lot faster when the 3,000 words were learned from real speakers saying real things, with the original sentences attached when you hit review.&lt;/p&gt;

&lt;p&gt;Not a pitch for any particular tool — mostly an argument that the "click-a-word-on-real-native-content" pattern is underbuilt in this space, and the tools that get it right are worth the 10 minutes to evaluate.&lt;/p&gt;

</description>
      <category>learning</category>
      <category>productivity</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Why Character Consistency Is Hard in AI Comic Generation</title>
      <dc:creator>qcrao</dc:creator>
      <pubDate>Fri, 24 Apr 2026 03:31:47 +0000</pubDate>
      <link>https://forem.com/qcrao/why-character-consistency-is-hard-in-ai-comic-generation-36ld</link>
      <guid>https://forem.com/qcrao/why-character-consistency-is-hard-in-ai-comic-generation-36ld</guid>
      <description>&lt;p&gt;When you feed a story prompt into a generic image AI — say, "a detective with a red scarf walks into a neon-lit bar, then sits down at the counter, then pulls out a notebook" — you will usually get three images back where the detective has three different faces, two different scarves, and in one panel the scarf has become a tie. This is the &lt;strong&gt;character consistency problem&lt;/strong&gt;, and it is the single biggest reason why text-to-image tools are bad at comics.&lt;/p&gt;

&lt;p&gt;This post is a short walk through &lt;em&gt;why&lt;/em&gt; it happens, what the current workarounds look like, and where the FLUX.1-Kontext-based approach fits in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do characters drift?
&lt;/h2&gt;

&lt;p&gt;Every text-to-image inference is in effect a &lt;strong&gt;fresh sample from a very high-dimensional distribution&lt;/strong&gt;. The model has no state between generations. Prompt A and prompt B may both say "detective with red scarf," but the specific pixel arrangement that the sampler lands on is governed by the noise seed, the scheduler, and a thousand tiny decisions inside the U-Net. Two calls that share a prompt but not a seed will produce two different people who both roughly match the description.&lt;/p&gt;

&lt;p&gt;Put differently: the model does not have a &lt;em&gt;character&lt;/em&gt;. It has a &lt;em&gt;prompt&lt;/em&gt;. Every panel is a new roll of the dice against the same loose description.&lt;/p&gt;

&lt;p&gt;Classical diffusion workflows try to fix this with three tricks, none of which are great:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Seed locking.&lt;/strong&gt; Use the same random seed for every panel. Works only if the prompt is essentially unchanged — the moment you add "sitting down" or "pulling out a notebook," the composition changes and the seed lock stops helping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Textual inversion / DreamBooth.&lt;/strong&gt; Fine-tune an embedding (textual inversion) or the model itself (DreamBooth) on reference images of the character. Effective, but slow, expensive, and brittle: you have to train something new for every character in your comic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-image prompting.&lt;/strong&gt; Paste the previous panel into the prompt as a reference. Some models accept it; most do not; when they do, they often regress to the mean face after a few hops.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;What FLUX.1-Kontext adds&lt;/h2&gt;

&lt;p&gt;FLUX.1-Kontext is Black Forest Labs' image-to-image-conditioned variant of FLUX. The relevant design choice is that it treats the reference image not as "inspiration" (loose style transfer) but as &lt;strong&gt;hard conditioning&lt;/strong&gt; during the denoising process. You pass in a reference sheet — the character's face, outfit, key features — and the generation is pulled toward that reference, not just through the text prompt but through attention over the reference image itself.&lt;/p&gt;

&lt;p&gt;For comics this is almost exactly the right primitive. The workflow becomes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate a reference sheet for each character once (face, outfit, distinctive props).&lt;/li&gt;
&lt;li&gt;For every panel, pass the relevant character's sheet + the scene description.&lt;/li&gt;
&lt;li&gt;The model respects the sheet as a constraint, not a suggestion.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The same detective now has the same face, the same red scarf, and the scarf actually stays a scarf.&lt;/p&gt;
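&lt;p&gt;The three-step workflow above can be sketched as a loop. Everything here is illustrative: &lt;code&gt;make_reference_sheet&lt;/code&gt; and &lt;code&gt;render_panel&lt;/code&gt; are hypothetical stand-ins for the actual model calls, not a real API:&lt;/p&gt;

```python
import hashlib

def make_reference_sheet(character_desc):
    # Step 1 (one-time): in a real pipeline this would be a
    # FLUX.1-Kontext call producing a portrait-and-outfit sheet.
    # Stubbed here as a stable ID derived from the description.
    return hashlib.sha256(character_desc.encode()).hexdigest()[:8]

def render_panel(reference_sheet, scene):
    # Steps 2-3: hypothetical renderer. The sheet is passed as hard
    # conditioning alongside the scene text, so every panel is
    # anchored to the same sheet no matter how the scene changes.
    return {"character": reference_sheet, "scene": scene}

detective = make_reference_sheet("detective, red scarf, stubble")
panels = [
    render_panel(detective, scene)
    for scene in [
        "walks into a neon-lit bar",
        "sits down at the counter",
        "pulls out a notebook",
    ]
]

# Every panel carries the same character anchor.
assert len({p["character"] for p in panels}) == 1
```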

&lt;h2&gt;What breaks and what does not&lt;/h2&gt;

&lt;p&gt;In practice the approach works well for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontal and three-quarter faces.&lt;/strong&gt; The reference sheet is usually a clean portrait; panels that echo that framing stay on-model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distinctive clothing and props.&lt;/strong&gt; A red scarf, a specific hat, a tattoo — these get preserved reliably.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Short stories (6–12 panels).&lt;/strong&gt; Drift is minimal within a single story.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It still struggles with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extreme poses.&lt;/strong&gt; A character leaping mid-air from behind is a composition the reference sheet does not cover, so the model extrapolates and sometimes loses the face.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background characters.&lt;/strong&gt; Secondary characters without their own reference sheet still drift. You either sheet them too or accept drift.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-form continuity across chapters.&lt;/strong&gt; After 50+ panels the accumulated small variations become visible. Re-anchoring to the sheet every 10 panels helps.&lt;/li&gt;
&lt;/ul&gt;
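&lt;p&gt;The re-anchoring idea from the last point is easy to express in code. A minimal sketch, with &lt;code&gt;render_chapter&lt;/code&gt; as a hypothetical orchestration loop: condition most panels on the previous panel for scene continuity, but return to the original sheet every N panels so drift cannot accumulate:&lt;/p&gt;

```python
def render_chapter(reference_sheet, scenes, reanchor_every=10):
    # Hypothetical long-form loop: "anchor" is whatever image the
    # next generation is conditioned on.
    panels = []
    for i, scene in enumerate(scenes):
        if i % reanchor_every == 0:
            anchor = reference_sheet   # re-anchor to the sheet
        else:
            anchor = panels[-1]        # chain from the previous panel
        panels.append({"anchor": anchor, "scene": scene})
    return panels

panels = render_chapter("sheet-001", [f"scene {i}" for i in range(25)])

# Panels 0, 10, and 20 are conditioned directly on the sheet;
# everything in between chains panel-to-panel.
reanchored = [i for i, p in enumerate(panels) if p["anchor"] == "sheet-001"]
assert reanchored == [0, 10, 20]
```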

&lt;h2&gt;A practical note on tooling&lt;/h2&gt;

&lt;p&gt;You can run this stack yourself — the FLUX.1-Kontext weights are open — but assembling the pipeline (reference sheet generator, scene scripter, panel renderer, single-panel regenerator, style picker) is a fair amount of plumbing.&lt;/p&gt;

&lt;p&gt;I have been using &lt;a href="https://www.comicory.com" rel="noopener noreferrer"&gt;comicory.com&lt;/a&gt; as a hosted implementation of roughly this architecture. Drop in a story paragraph, the system handles the scripting and reference-sheet steps, and the multi-panel output keeps the same character recognizable. Eight art styles are available (manga, Western comic, watercolor, ink wash, etc.), and, critically, &lt;strong&gt;single-panel regeneration&lt;/strong&gt; is supported: if panel 4 drifts, you redo only that panel without rebuilding the rest of the story. The free tier is 30 images per month, which is enough to evaluate the workflow.&lt;/p&gt;

&lt;p&gt;Not a pitch; mostly flagging it because I spent a couple of weeks trying to glue the same pipeline together locally and it was a lot of YAML.&lt;/p&gt;

&lt;h2&gt;Closing thought&lt;/h2&gt;

&lt;p&gt;The character consistency problem is a nice example of how &lt;strong&gt;architectural fixes beat clever prompting&lt;/strong&gt;. For the first three years of diffusion-for-comics, the whole field was trying to solve consistency at the prompt level — longer prompts, locked seeds, character templates, multi-image prompting. None of it really worked. The real unlock was a model class that takes a reference image as first-class conditioning.&lt;/p&gt;

&lt;p&gt;When a generation problem resists prompt engineering for long enough, the answer is usually that the model architecture is wrong for the task, and someone will eventually ship a new one. FLUX.1-Kontext is that ship for multi-panel comics. I am curious what the equivalent "right architecture" looks like for the remaining hard cases — long-form continuity, multi-character scenes with physical interaction, and expressive pose variation.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
