Forem: 汪小春

One gpt-image-2 call, 9 hairstyle variants: prompt engineering for grid layouts

汪小春 — Sat, 16 May 2026 02:15:23 +0000

The first version of our hairstyle preview tool made 8 separate gpt-image-2 API calls — one per hairstyle. It worked. It was also $0.32 per preview, took 40 seconds, and the faces drifted between calls (each generation re-derived the face from the prompt + uploaded image).

This post is about how we cut that to a single API call producing a 9-grid (1 reference + 8 variants) — same face, lower cost, faster, and weirdly easier to prompt.

The 8-call problem

Naive architecture:

for hairstyle in ['crew cut', 'mid fade', ...]:
    img = gpt_image_2.generate(
        prompt=f"User's face with {hairstyle} hairstyle",
        reference=user_selfie,
    )
    grid.add(img)

Three problems compound:

Cost. 8 calls × $0.04 each = $0.32. We sell at $0.99 per test, so the margin is fine, but the API bill erodes it fast at scale.

Latency. 8 sequential calls = ~40s. Running them in parallel cuts that to ~5s in the best case, but rate limits and queue priority make parallelization unreliable. Users see a spinner.

Face drift. Each call independently interprets "user's face with X." The model re-imagines facial proportions slightly differently each time. Side-by-side, the 8 outputs don't look like the same person. UX killer for a "compare hairstyles on YOUR face" tool.

The single-call fix

We rewrote the prompt to request a 9-grid in one shot:

A 3x3 grid showing the same person with 9 different hairstyles.

Grid positions:
[1] reference: original photo, unchanged
[2] Crew Cut
[3] Mid Fade
[4] Wavy Side Part
[5] Caesar Cut
[6] Long Straight
[7] Quiff
[8] Surfer Waves
[9] Buzz Cut

Constraints:
- Same person in all 9 cells (consistent face, age, skin)
- Same lighting and angle across cells
- Only hair varies between cells
- Each cell separated by a thin white border
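
A prompt like the one above can be generated from a list of styles, which keeps the slot numbers and the UI labels in sync. This is a minimal sketch: `build_grid_prompt` and `STYLES` are illustrative names, not code from the tool, and the wording simply mirrors the prompt shown above.

```python
def build_grid_prompt(hairstyles, rows=3, cols=3):
    """Build a numbered-slot grid prompt: slot 1 is the reference photo."""
    cells = rows * cols
    if len(hairstyles) != cells - 1:
        raise ValueError(f"expected {cells - 1} hairstyles, got {len(hairstyles)}")

    lines = [
        f"A {rows}x{cols} grid showing the same person with {cells} different hairstyles.",
        "",
        "Grid positions:",
        "[1] reference: original photo, unchanged",
    ]
    # Numbering starts at 2 because slot 1 is reserved for the reference.
    lines += [f"[{i}] {style}" for i, style in enumerate(hairstyles, start=2)]
    lines += [
        "",
        "Constraints:",
        f"- Same person in all {cells} cells (consistent face, age, skin)",
        "- Same lighting and angle across cells",
        "- Only hair varies between cells",
        "- Each cell separated by a thin white border",
    ]
    return "\n".join(lines)


STYLES = ["Crew Cut", "Mid Fade", "Wavy Side Part", "Caesar Cut",
          "Long Straight", "Quiff", "Surfer Waves", "Buzz Cut"]
prompt = build_grid_prompt(STYLES)
```

Rejecting a list of the wrong length up front means a preset change can't silently leave a grid cell empty or push a style off the grid.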

Three benefits:

1 API call = $0.04, not $0.32. 8x cost reduction.

~6s vs ~40s. Single-call latency, no parallel-queue gambling.

Face consistency by construction. The model renders all 9 cells as one coherent image, so facial features stay consistent from cell to cell. No call-to-call drift.
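
The cost numbers are simple enough to check by hand, but here is the arithmetic as a small sketch. It uses only the two prices quoted in this post and ignores every other cost (hosting, payment fees), so treat the percentages as the API share of the sale price, not as a margin calculation.

```python
PRICE_PER_CALL = 0.04  # per-image API cost quoted in this post
PRICE_PER_TEST = 0.99  # what the user pays per test


def cost_share(calls):
    """Return (API cost, fraction of the sale price) for one preview."""
    cost = calls * PRICE_PER_CALL
    return cost, cost / PRICE_PER_TEST


for calls in (8, 1):
    cost, share = cost_share(calls)
    print(f"{calls} call(s): ${cost:.2f} per preview, {share:.1%} of the sale price")
```

With 8 calls the API takes roughly a third of each sale; with one call it is about 4%.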

Prompt-engineering challenges

It wasn't free. Three things we had to work out:

Layout discipline. Without explicit "3x3 grid" + "separate cells", gpt-image-2 would blend or overlap. The thin white border instruction was crucial.

Cell ordering. First attempt was "list hairstyles in row-major order" and we got random placement. Switching to "Grid positions: [N] hairstyle" with numbered slots gave deterministic placement (which we needed for the UI to label cells correctly).

Hairstyle distinctiveness. Some styles (Crew Cut vs Buzz Cut) look similar when each gets only 1/9th of the image. We had to swap in more visually distinct sets so user choices were meaningful.
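
Deterministic placement is what makes it possible to slice the grid and label cells in the UI. Below is a minimal sketch of computing the crop boxes, assuming equal-sized cells numbered in row-major order (slot 1 top-left) and a 1024x1024 output. The function name, the image size, and the `inset` parameter are assumptions for illustration.

```python
def cell_boxes(width, height, rows=3, cols=3, inset=0):
    """Return (left, top, right, bottom) crop boxes in row-major order.

    inset trims that many pixels from every side of a cell, for example
    to drop the thin white border requested in the prompt.
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols + inset
            top = r * height // rows + inset
            right = (c + 1) * width // cols - inset
            bottom = (r + 1) * height // rows - inset
            boxes.append((left, top, right, bottom))
    return boxes


# boxes[0] is the reference cell, boxes[1] the style in slot [2], and so on.
boxes = cell_boxes(1024, 1024, inset=4)
```

The tuples use the same (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so each box can be passed to it directly if you slice server-side.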

What we'd do differently

The 9-grid is locked at 8 variants. If the model could accept "show me 16 styles", we'd offer that. Current cap is real — gpt-image-2 maintains identity well at 9 cells, less reliably at 16+. (The model is doing more work in less canvas space per cell.)

Long-term: per-cell quality + identity preservation will improve as models scale. For now, 8 is the sweet spot.

Try it

If you want to see what 9-grid hairstyle previews look like in practice, AI Omoggle is the tool — single test from $0.99, no photos stored.

I'd love to hear from anyone doing similar single-call multi-variant prompts. The "compose in one image, slice in UI" pattern feels like it generalizes to other AI image use cases.

Why we hardcoded 8 niche presets instead of letting GPT generate slide layouts

汪小春 — Sat, 16 May 2026 01:49:39 +0000

Most AI slide tools let GPT decide everything — the layout, the typography hierarchy, the section structure. Each generation is a new design lottery. We shipped the opposite approach: 8 hardcoded niche presets that GPT can fill but not redesign.

This post is about why constraint won over creativity for our slide-generation pipeline.

The problem with letting an LLM design layouts

Early prototype: pure GPT layout generation. The model decides:

How many sections per slide
Title vs subtitle hierarchy
Bullet vs paragraph structure
Color emphasis
Asset placement

Result: each deck looked different from the next, and so did the slides within a deck. "Different" sounded good until users started telling us:

"Why does the second slide have a 5-point list and the third has a 3-bullet hierarchy?"
"The font on slide 7 is huge, but slide 8 is tiny."
"Looks AI-generated."

The third complaint was the killer. When variance is visible, users default to "this AI doesn't know what it's doing."

The fix: pre-pick layouts, let the LLM fill only the content

We hardcoded 8 vertical-specific presets:

Career: hierarchy of pain → frame → action sections
Finance: chart-heavy with bullet clarifications
Reading: book cover + chapter quotes + 3-takeaway template
Beauty: image-led with overlay captions
Health: stats-forward with citation footers
Culture: timeline-style with accent imagery
Travel: map + photo grid + itinerary breakdown
Knowledge: 3-column comparison + "key insight" callout

Each preset is a deterministic layout system. GPT picks the right one based on the input topic, then fills slot content. The structural variance disappears.
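
One way to make "GPT can fill but not redesign" concrete is to store each preset as a fixed, ordered list of slots and reject any model output that doesn't match it. This is a sketch under assumptions: the slot names are guesses based on the preset descriptions above, and `fill_preset` is a hypothetical helper, not the production code.

```python
# Hypothetical slot names for three of the eight presets.
PRESETS = {
    "career": ["pain", "frame", "action"],
    "reading": ["book_cover", "chapter_quotes", "takeaways"],
    "travel": ["map", "photo_grid", "itinerary"],
}


def fill_preset(niche, content):
    """Return (slot, text) pairs in the preset's fixed order.

    Raises ValueError if the model skipped a slot or invented a new one,
    so layout structure can never change from one generation to the next.
    """
    slots = PRESETS[niche]
    missing = [s for s in slots if s not in content]
    extra = [k for k in content if k not in slots]
    if missing or extra:
        raise ValueError(f"missing={missing}, extra={extra}")
    return [(slot, content[slot]) for slot in slots]


deck = fill_preset("career", {"action": "Ask for the raise",
                              "pain": "Stuck at the same level",
                              "frame": "Promotion is a negotiation"})
```

The order of the output comes from the preset, not from the model's response, which is the point: the LLM supplies text and nothing else.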

Why niche-first, not generic-first

We considered the obvious alternative: 5 universal templates ("clean", "minimal", "playful", and so on). It failed in user testing because:

"Clean" doesn't tell you what content goes where
Apply the same "minimal" template to a finance deck and a travel deck, and both look generic

Niche-specific templates encode domain assumptions:

A finance deck's first slide should be a chart
A reading deck's first slide should be a book cover
A travel deck's last slide should be a map

These assumptions come for free with the niche selection, so there is no need to teach the LLM what each genre expects.

What we lose

We lose:

Flexibility for niches we didn't anticipate (business pitch, scientific paper, etc.)
The ability to experiment with novel layouts mid-deck

For both, our answer is "we'll add presets when there's clear demand" rather than "let the LLM figure it out". The latter is what failed in v0.

What I'd do differently

Each preset has a single "voice" — the same layout system applied throughout the deck. In hindsight, voice should vary by slide position (cover vs body vs CTA) within a preset, not just by niche. We'd ship "preset families" with intra-deck variation rather than treating each preset as a single template.

Try it

If you want to see what 8-niche-preset architecture looks like in practice, AnySlide ships the v1 of this. Free to start (60 credits at signup, daily +10 reset, no credit card).

I'd love to hear from anyone who took the opposite bet (full LLM creativity) — did it pay off?

Why we run two scoring tracks (LLM + Mediapipe) for our AI face-rating tool

汪小春 — Sat, 16 May 2026 00:27:51 +0000

A user emailed us after testing our face-rating tool five times in a row with the same photo. They got scores of 6.2, 7.5, 6.8, 7.1, 5.9. That's a ±0.8 spread around a mean of 6.7, on supposedly the same input.

That email was the death of single-LLM scoring for us.

This is a short post about the architecture decision we ended up making — running two parallel scoring tracks and taking the geometric one as an anchor against LLM hallucination.

The variance problem

Subjective face scoring with an LLM is fundamentally non-deterministic. Each call samples a fresh answer from the model's output distribution. For a deterministic-feeling task like "rate this face 1-10," that variance is a UX killer. Users expect their face to have ONE score, not a probability distribution.
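
For reference, this is how the spread can be measured from repeated runs on one photo. The sketch uses the five scores from the user's report and defines "spread" as half the range (max minus min, divided by two); that definition is an assumption about how the ± figure was derived.

```python
scores = [6.2, 7.5, 6.8, 7.1, 5.9]  # five runs on the same photo

mean = sum(scores) / len(scores)
half_range = (max(scores) - min(scores)) / 2

print(f"mean={mean:.1f}, spread=±{half_range:.1f}")
```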

Common fixes that didn't work for us:

Lower temperature: helped, but even at temperature=0 the score still varied across calls, because hosted inference is not bit-for-bit reproducible and the model's internal representations differ slightly from run to run.
Self-consistency (5 calls + majority): 5x the API cost for a 30% variance reduction. Not enough.
Few-shot anchoring with calibration faces: helped on average score but not on individual variance.

The dual-track fix

What worked: stop using LLMs for the parts where geometry is decidable.

We added a parallel geometric track using Mediapipe Face Mesh:

Canthal tilt (corner-of-eye angle): measurable to ±2 degrees from face landmarks.
Jaw angle (mandibular angle from chin to ear): consistent across calls.
Symmetry (Hausdorff distance between left/right halves): pure arithmetic.

These three measures map to a 0-10 sub-score that's deterministic for a given input image. It doesn't capture taste, but it captures geometry.
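
To show why these measures are deterministic, here is a sketch of two of them computed from plain (x, y) landmark coordinates. It deliberately leaves out the Mediapipe call and the landmark indices, and it does not reproduce the mapping to a 0-10 sub-score, since neither is specified in this post. All function names are illustrative, and `math.dist` needs Python 3.8+.

```python
import math


def canthal_tilt_deg(inner, outer):
    """Angle of the inner-to-outer eye-corner line above horizontal, in degrees.

    Points are (x, y) in image coordinates, where y grows downward, so an
    outer corner that sits higher (smaller y) gives a positive tilt.
    """
    dx = abs(outer[0] - inner[0])  # abs() makes it work for either eye
    dy = inner[1] - outer[1]
    return math.degrees(math.atan2(dy, dx))


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(p, q):
        return max(min(math.dist(u, v) for v in q) for u in p)
    return max(directed(a, b), directed(b, a))


def asymmetry(left_pts, right_pts, midline_x):
    """Mirror left-side points across x = midline_x, then compare to the right side.

    Returns 0.0 for a perfectly mirror-symmetric set of landmarks.
    """
    mirrored = [(2 * midline_x - x, y) for x, y in left_pts]
    return hausdorff(mirrored, right_pts)


tilt = canthal_tilt_deg(inner=(100, 100), outer=(140, 60))
score = asymmetry([(40, 50), (30, 80)], [(60, 50), (70, 83)], midline_x=50)
```

Both functions are pure arithmetic on the landmark coordinates, so the same landmarks always give the same numbers. Any remaining variance has to come from landmark detection, not from scoring.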

The LLM track stays — but now it's responsible for the aesthetic-judgment layer: skin quality assessment, hairstyle compatibility, facial harmony perception. Things that genuinely require pattern recognition over training data, not measurement.

The combination

We don't take a plain 50/50 average. We weight the two tracks and add a disagreement check:

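final_score = 0.6 * geometric_score + 0.4 * llm_aesthetic_score

# If the two tracks disagree by more than 2 points, flag it and be conservative.
if abs(geometric_score - llm_aesthetic_score) > 2.0:
    flag_for_review(f"disagreement: G={geometric_score}, L={llm_aesthetic_score}")
    final_score = min(geometric_score, llm_aesthetic_score)  # use the lower score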

The 0.6/0.4 weighting was found empirically — geometric carries more weight because it's the deterministic anchor. The disagreement detection catches edge cases (e.g., the LLM rates someone high on "presence" but geometry is rough — usually a charisma photo we're not equipped to score correctly).
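
As a self-contained version of the rule above, the sketch below wraps it in one function that returns both the score and a review flag. It reads "use the lower score" as taking the minimum of the two track scores, which is an interpretation. The function name and return shape are also assumptions.

```python
def combine_scores(geometric, llm, w_geo=0.6, threshold=2.0):
    """Blend the two tracks; on large disagreement, fall back to the lower score.

    Returns (final_score, flagged). flagged is True when the tracks differ
    by more than threshold and the result should be queued for review.
    """
    if abs(geometric - llm) > threshold:
        return min(geometric, llm), True
    return w_geo * geometric + (1 - w_geo) * llm, False


agree = combine_scores(7.0, 6.0)      # within 2 points: weighted blend
disagree = combine_scores(5.0, 8.0)   # 3 points apart: conservative fallback
```

Returning the flag instead of calling a review hook inside the function keeps the scoring logic free of side effects and easy to test.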

Results

Variance per identical input: from ±0.8 (single LLM) to ±0.5 (dual-track). Not zero, but much closer to what users expect.

Bonus: the geometric scores let us give actionable feedback. "Canthal tilt -3°, consider an angled selfie" beats "your eyes look closed" from a black-box LLM.

What I'd do differently

The 0.6/0.4 weighting should be per-axis, not global. A high-resolution close-up of skin should shift weight toward LLM aesthetic perception. A poorly-lit small selfie should shift toward geometric (because LLM judgment on bad photos is mostly noise).

We're refactoring this now — per-axis dynamic weighting based on photo quality signals.

Try it

If you want to see what dual-track scoring feels like in practice, you can try AI Omoggle — single test from $0.99, no subscription, no photos stored.

I'd genuinely love to hear how other people have tackled the LLM-variance problem in subjective tasks.

Two engines for AI slide decks: HTML output vs gpt-image-2 (and how we solved CJK rendering)

汪小春 — Wed, 13 May 2026 08:04:52 +0000

A few months ago, a user emailed us with a screenshot. They'd generated a Chinese-language slide deck with our tool — and every Chinese character was either missing, replaced with a square, or warped into something that wasn't quite the right glyph.

The screenshot was bad. The fix was harder than it looked.

This post is about the architectural decision we ended up making: running two different rendering engines for the same product, and why neither one alone was enough.

The problem with AI slides + CJK

Most AI slide generators do this:

LLM writes the content (text + structure)
A template engine (HTML/CSS or PPTX) lays it out
Done

This works fine for English. The text is a string; the font is whatever the template specifies. The user sees what they expect.

CJK breaks step 2 in two ways:

Font fallback. When the template's font doesn't include Chinese / Japanese / Korean glyphs, browsers fall back to whatever's available. The result is typographically inconsistent — half your slide is in your designed font, half is in something Noto-ish that the browser found.

Image-based generation. If you skip the template and ask an AI image model to "make a slide with this Chinese text", you'll get the garbled-CJK problem most generative image tools have — the model produces something that looks like Chinese but isn't actually any specific character. (Try this in DALL·E or Midjourney with any non-Latin script. You'll see what I mean.)
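
Both failure modes only bite when a slide actually contains CJK text, so a cheap first step is detecting it. This is a minimal sketch: the function name and the choice of Unicode blocks are assumptions, and the list covers the most common blocks rather than every CJK code point (rarer ideograph extensions are omitted).

```python
CJK_RANGES = [
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
]


def contains_cjk(text):
    """Return True if any character falls in one of the listed CJK blocks."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in CJK_RANGES)


needs_care = contains_cjk("2026 季度报告")
```

A check like this could drive a warning in the HTML path ("this deck contains CJK text, the template font may fall back") without changing which engine the user picked.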

Two engines, two trade-offs

We ended up shipping both:

Engine 1: HTML path

The LLM produces a structured spec, we render it with a reveal.js / Slidev-style template. Output is an inline-editable web slide deck.

Pros: users can tweak content after generation (it's just HTML); fast; smaller file size for exports.
Cons: CJK looks acceptable but never great; visual variety is constrained by what the template supports.

Engine 2: gpt-image-2 path

OpenAI's gpt-image-2 (released April 2026) is the first image model where text rendering is genuinely usable for CJK. We compose a "slide-as-prompt" — layout description, content, style — and the model renders the entire slide as a single image.

Pros: typography is sharp and consistent; CJK characters render correctly; visual variety is essentially unlimited.
Cons: the user can't tweak content post-generation without re-rendering; ~5x slower than the HTML path; PPTX export has each slide as one image (not editable in PowerPoint).

The decision: ship both

We let the user pick. Default to HTML for fast iteration; switch to gpt-image-2 when CJK accuracy matters more than editability.

User flow:
  Article / link / PDF → LLM extracts structure
                         ↓
            ┌────────────┴────────────┐
   HTML path                      gpt-image-2 path
   (Slidev-style template)       (full-image render)
            ↓                            ↓
     Editable web slides         Image-per-page export

Why this isn't obviously the right architecture

Two engines means more code, more bugs, more decisions for the user. It also means our "What does the tool do?" elevator pitch has two halves — which is harder to sell than a single clean story.

But for CJK users, the HTML path alone wasn't acceptable, and dropping the HTML path entirely was a regression for everyone who wanted editable output. So: both.

What I'd do differently

In hindsight, we should have made the engine choice per-slide instead of per-deck. Some slides need editing (talking points, agenda); some need typography fidelity (a single Chinese headline on a chart). Forcing the user to pick one engine for the whole deck is the wrong granularity. We're fixing this now.
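
A rough sketch of what per-slide routing could look like. The `Slide` fields and the rule in `pick_engine` are hypothetical: they only encode the two needs named above (editability vs typography fidelity) and keep HTML as the default, as in the current per-deck flow.

```python
from dataclasses import dataclass


@dataclass
class Slide:
    title: str
    needs_editing: bool = False        # e.g. agenda, talking points
    typography_critical: bool = False  # e.g. a single Chinese headline on a chart


def pick_engine(slide):
    """Return "html" or "image" for one slide; editability wins on a conflict."""
    if slide.typography_critical and not slide.needs_editing:
        return "image"
    return "html"


deck = [
    Slide("Agenda", needs_editing=True),
    Slide("季度报告", typography_critical=True),
    Slide("Next steps"),
]
engines = [pick_engine(s) for s in deck]
```

Letting editability win when both flags are set is a design choice, not a given. The opposite rule is just as defensible for decks where the rendered look matters more than later edits.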

Try it

If you want to see what gpt-image-2 looks like as a slide engine — especially with CJK — you can sign up at AnySlide (60 free credits, no card). I'd genuinely love feedback on the engine switch UX; it's the part I'm least sure about.
ai, showdev, typography, i18n