<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: 汪小春</title>
    <description>The latest articles on Forem by 汪小春 (@xspring1982).</description>
    <link>https://forem.com/xspring1982</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3438219%2Fd386d39d-f7f6-47a0-ba8b-dbc44cbc7b0a.jpg</url>
      <title>Forem: 汪小春</title>
      <link>https://forem.com/xspring1982</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/xspring1982"/>
    <language>en</language>
    <item>
      <title>One gpt-image-2 call, 9 hairstyle variants: prompt engineering for grid layouts</title>
      <dc:creator>汪小春</dc:creator>
      <pubDate>Sat, 16 May 2026 02:15:23 +0000</pubDate>
      <link>https://forem.com/xspring1982/one-gpt-image-2-call-9-hairstyle-variants-prompt-engineering-for-grid-layouts-23ke</link>
      <guid>https://forem.com/xspring1982/one-gpt-image-2-call-9-hairstyle-variants-prompt-engineering-for-grid-layouts-23ke</guid>
      <description>&lt;p&gt;The first version of our hairstyle preview tool made 8 separate gpt-image-2 API calls — one per hairstyle. It worked. It was also $0.32 per preview, took 40 seconds, and the faces drifted between calls (each generation re-derived the face from the prompt + uploaded image).&lt;/p&gt;

&lt;p&gt;This post is about how we cut that to a single API call producing a 9-grid (1 reference + 8 variants) — same face, lower cost, faster, and weirdly easier to prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 8-call problem
&lt;/h2&gt;

&lt;p&gt;Naive architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;hairstyle&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crew cut&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mid fade&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...]:&lt;/span&gt;
    &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gpt_image_2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s face with &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;hairstyle&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; hairstyle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;reference&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_selfie&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three problems compound:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost.&lt;/strong&gt; 8 calls × $0.04 each = $0.32. We're selling at $0.99/test, so the margin is fine, but image costs eat into it fast at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency.&lt;/strong&gt; 8 sequential calls = ~40s. Running them in parallel cuts that to ~5s when it works, but rate limits and queue priority make parallelization unreliable. Users see a spinner.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Face drift.&lt;/strong&gt; Each call independently interprets "user's face with X." The model re-imagines facial proportions slightly differently each time. Side-by-side, the 8 outputs don't look like the same person. UX killer for a "compare hairstyles on YOUR face" tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  The single-call fix
&lt;/h2&gt;

&lt;p&gt;We rewrote the prompt to request a 9-grid in one shot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A 3x3 grid showing the same person with 9 different hairstyles.

Grid positions:
[1] reference: original photo, unchanged
[2] Crew Cut
[3] Mid Fade
[4] Wavy Side Part
[5] Caesar Cut
[6] Long Straight
[7] Quiff
[8] Surfer Waves
[9] Buzz Cut

Constraints:
- Same person in all 9 cells (consistent face, age, skin)
- Same lighting and angle across cells
- Only hair varies between cells
- Each cell separated by a thin white border
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three benefits:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1 API call = $0.04, not $0.32.&lt;/strong&gt; 8x cost reduction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;~6s vs ~40s.&lt;/strong&gt; Single-call latency, no parallel-queue gambling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Face consistency by construction.&lt;/strong&gt; The model treats all 9 cells as one coherent image, so facial features stay identical. No drift.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt-engineering challenges
&lt;/h2&gt;

&lt;p&gt;It wasn't free. Three things we had to work out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layout discipline.&lt;/strong&gt; Without an explicit "3x3 grid" and "separate cells" instruction, gpt-image-2 would blend or overlap the cells. The thin white border instruction was crucial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cell ordering.&lt;/strong&gt; First attempt was "list hairstyles in row-major order" and we got random placement. Switching to "Grid positions: [N] hairstyle" with numbered slots gave deterministic placement (which we needed for the UI to label cells correctly).&lt;/p&gt;
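
&lt;p&gt;As a rough illustration of the numbered-slot idea, the prompt can be assembled from a list of styles. This is a minimal sketch: &lt;code&gt;build_grid_prompt&lt;/code&gt; is a hypothetical helper name, and the wording simply mirrors the prompt shown earlier.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def build_grid_prompt(styles):
    """Build a numbered-slot prompt for a 3x3 grid: 1 reference + 8 styles."""
    if len(styles) != 8:
        raise ValueError("a 3x3 grid needs exactly 8 hairstyle variants")
    lines = [
        "A 3x3 grid showing the same person with 9 different hairstyles.",
        "",
        "Grid positions:",
        "[1] reference: original photo, unchanged",
    ]
    # Slots 2..9 are numbered explicitly so the UI can label cells by position.
    for slot, style in enumerate(styles, start=2):
        lines.append(f"[{slot}] {style}")
    lines += [
        "",
        "Constraints:",
        "- Same person in all 9 cells (consistent face, age, skin)",
        "- Same lighting and angle across cells",
        "- Only hair varies between cells",
        "- Each cell separated by a thin white border",
    ]
    return "\n".join(lines)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the slot numbers in code means the UI labels and the prompt are generated from the same list, so they cannot drift apart.&lt;/p&gt;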

&lt;p&gt;&lt;strong&gt;Hairstyle distinctiveness.&lt;/strong&gt; Some styles (Crew Cut vs Buzz Cut) look similar at 1/9th of an image. We had to swap in more visually-distinct sets so user choices were meaningful.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we'd do differently
&lt;/h2&gt;

&lt;p&gt;The 9-grid is locked at 8 variants. If the model could accept "show me 16 styles", we'd offer that. Current cap is real — gpt-image-2 maintains identity well at 9 cells, less reliably at 16+. (The model is doing more work in less canvas space per cell.)&lt;/p&gt;

&lt;p&gt;Long-term: per-cell quality + identity preservation will improve as models scale. For now, 8 is the sweet spot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;If you want to see what 9-grid hairstyle previews look like in practice, &lt;a href="https://aiomoggle.com" rel="noopener noreferrer"&gt;AI Omoggle&lt;/a&gt; is the tool — single test from $0.99, no photos stored.&lt;/p&gt;

&lt;p&gt;I'd love to hear from anyone doing similar single-call multi-variant prompts. The "compose in one image, slice in UI" pattern feels like it generalizes to other AI image use cases.&lt;/p&gt;
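
&lt;p&gt;For anyone trying the pattern, the slicing half can be sketched in a few lines. This assumes the model returns equal-sized cells in row-major order and ignores the thin white borders, which a real slicer would probably trim. &lt;code&gt;cell_boxes&lt;/code&gt; is a hypothetical helper, and the 1024x1024 size is only an example.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def cell_boxes(width, height, rows=3, cols=3):
    """Return (left, top, right, bottom) pixel boxes in row-major order."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            top = r * height // rows
            right = (c + 1) * width // cols
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


# Example: split a 1024x1024 grid image into 9 cells.
boxes = cell_boxes(1024, 1024)
reference_box, variant_boxes = boxes[0], boxes[1:]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each tuple is in the (left, upper, right, lower) form that Pillow's &lt;code&gt;Image.crop&lt;/code&gt; accepts, so the boxes can be passed to it directly.&lt;/p&gt;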

</description>
      <category>ai</category>
      <category>api</category>
      <category>performance</category>
      <category>python</category>
    </item>
    <item>
      <title>Why we hardcoded 8 niche presets instead of letting GPT generate slide layouts</title>
      <dc:creator>汪小春</dc:creator>
      <pubDate>Sat, 16 May 2026 01:49:39 +0000</pubDate>
      <link>https://forem.com/xspring1982/why-we-hardcoded-8-niche-presets-instead-of-letting-gpt-generate-slide-layouts-14b7</link>
      <guid>https://forem.com/xspring1982/why-we-hardcoded-8-niche-presets-instead-of-letting-gpt-generate-slide-layouts-14b7</guid>
      <description>&lt;p&gt;Most AI slide tools let GPT decide everything — the layout, the typography hierarchy, the section structure. Each generation is a new design lottery. We shipped the opposite approach: 8 hardcoded niche presets that GPT can fill but not redesign.&lt;/p&gt;

&lt;p&gt;This post is about why constraint won over creativity for our slide-generation pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with letting an LLM design layouts
&lt;/h2&gt;

&lt;p&gt;Early prototype: pure GPT layout generation. The model decides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many sections per slide&lt;/li&gt;
&lt;li&gt;Title vs subtitle hierarchy&lt;/li&gt;
&lt;li&gt;Bullet vs paragraph structure&lt;/li&gt;
&lt;li&gt;Color emphasis&lt;/li&gt;
&lt;li&gt;Asset placement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: each deck looked different from the next. "Different" sounds good until users started telling us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Why does the second slide have a 5-point list and the third has a 3-bullet hierarchy?"&lt;/li&gt;
&lt;li&gt;"The font on slide 7 is huge, but slide 8 is tiny."&lt;/li&gt;
&lt;li&gt;"Looks AI-generated."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The third complaint was the killer. When variance is visible, users default to "this AI doesn't know what it's doing."&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix: pre-pick layouts, let the LLM fill only the content
&lt;/h2&gt;

&lt;p&gt;We hardcoded 8 vertical-specific presets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Career&lt;/strong&gt;: hierarchy of pain → frame → action sections&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finance&lt;/strong&gt;: chart-heavy with bullet clarifications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reading&lt;/strong&gt;: book cover + chapter quotes + 3-takeaway template&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Beauty&lt;/strong&gt;: image-led with overlay captions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health&lt;/strong&gt;: stats-forward with citation footers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Culture&lt;/strong&gt;: timeline-style with accent imagery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Travel&lt;/strong&gt;: map + photo grid + itinerary breakdown&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge&lt;/strong&gt;: 3-column comparison + "key insight" callout&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each preset is a deterministic layout system. GPT picks the right one based on the input topic, then fills slot content. The structural variance disappears.&lt;/p&gt;
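
&lt;p&gt;One way to picture "fill but not redesign" is a preset that owns a fixed set of slots and rejects anything else the model returns. This is a minimal sketch, not our production code: the slot names and the &lt;code&gt;fill_preset&lt;/code&gt; helper are hypothetical, and only three of the eight presets are shown.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    name: str
    slots: tuple  # slot names the LLM is allowed to fill


# Hypothetical slot names; only the preset names come from the list above.
PRESETS = {
    "finance": Preset("finance", ("chart", "bullets")),
    "reading": Preset("reading", ("book_cover", "quotes", "takeaways")),
    "travel": Preset("travel", ("map", "photo_grid", "itinerary")),
}


def fill_preset(preset_name, llm_content):
    """Accept LLM content only if it fills exactly the preset's slots."""
    preset = PRESETS[preset_name]
    missing = [s for s in preset.slots if s not in llm_content]
    extra = [k for k in llm_content if k not in preset.slots]
    if missing or extra:
        raise ValueError(f"slot mismatch: missing={missing}, extra={extra}")
    # Return slots in the preset's own order, not the model's.
    return {slot: llm_content[slot] for slot in preset.slots}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the model can only supply values for known slots, a stray "font_size" or an extra section is rejected instead of leaking into the layout.&lt;/p&gt;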

&lt;h2&gt;
  
  
  Why niche-first, not generic-first
&lt;/h2&gt;

&lt;p&gt;We considered the obvious alternative: 5 universal templates ("clean", "minimal", "playful"). It failed in user testing because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Clean" doesn't tell you what content goes where&lt;/li&gt;
&lt;li&gt;The same "minimal" template makes a finance deck and a travel deck both look generic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Niche-specific templates encode domain assumptions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A finance deck's first slide should be a chart&lt;/li&gt;
&lt;li&gt;A reading deck's first slide should be a book cover&lt;/li&gt;
&lt;li&gt;A travel deck's last slide should be a map&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These assumptions ride for free with the niche selection — no need to teach the LLM what each genre expects.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we lose
&lt;/h2&gt;

&lt;p&gt;We lose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flexibility for niches we didn't anticipate (business pitch, scientific paper, etc.)&lt;/li&gt;
&lt;li&gt;The ability to experiment with novel layouts mid-deck&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For both, our answer is "we'll add presets when there's clear demand" rather than "let the LLM figure it out". The latter is what failed in v0.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;Each preset has a single "voice" — the same layout system applied throughout the deck. In hindsight, voice should vary by slide &lt;em&gt;position&lt;/em&gt; (cover vs body vs CTA) within a preset, not just by niche. We'd ship "preset families" with intra-deck variation rather than treating each preset as a single template.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;If you want to see what 8-niche-preset architecture looks like in practice, &lt;a href="https://anyslide.app" rel="noopener noreferrer"&gt;AnySlide&lt;/a&gt; ships the v1 of this. Free to start (60 credits at signup, daily +10 reset, no credit card).&lt;/p&gt;

&lt;p&gt;I'd love to hear from anyone who took the opposite bet (full LLM creativity) — did it pay off?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>product</category>
      <category>ux</category>
    </item>
    <item>
      <title>Why we run two scoring tracks (LLM + Mediapipe) for our AI face-rating tool</title>
      <dc:creator>汪小春</dc:creator>
      <pubDate>Sat, 16 May 2026 00:27:51 +0000</pubDate>
      <link>https://forem.com/xspring1982/why-we-run-two-scoring-tracks-llm-mediapipe-for-our-ai-face-rating-tool-4j0n</link>
      <guid>https://forem.com/xspring1982/why-we-run-two-scoring-tracks-llm-mediapipe-for-our-ai-face-rating-tool-4j0n</guid>
      <description>&lt;p&gt;A user tested our face-rating tool five times in a row with the same photo. They got scores of 6.2, 7.5, 6.8, 7.1, 5.9. That's a ±0.8 spread on supposedly the same input.&lt;/p&gt;

&lt;p&gt;Their email reporting it was the death of single-LLM scoring for us.&lt;/p&gt;

&lt;p&gt;This is a short post about the architecture decision we ended up making — running two parallel scoring tracks and taking the geometric one as an anchor against LLM hallucination.&lt;/p&gt;

&lt;h2&gt;
  
  
  The variance problem
&lt;/h2&gt;

&lt;p&gt;Subjective face scoring with an LLM is fundamentally non-deterministic. Each call samples a fresh output, so the same photo can come back with a different number. For a deterministic-feeling task like "rate this face 1-10," that variance is a UX killer. Users expect their face to have ONE score, not a probability distribution.&lt;/p&gt;
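
&lt;p&gt;For reference, the ±0.8 figure can be reproduced from the five scores if "spread" is read as half the range around the midpoint. That reading is an assumption; a standard deviation would give a smaller number.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;scores = [6.2, 7.5, 6.8, 7.1, 5.9]


def half_range(values):
    """Half the distance between the highest and lowest value."""
    return (max(values) - min(values)) / 2


def midpoint(values):
    """Value halfway between the highest and lowest value."""
    return (max(values) + min(values)) / 2


# Rounded to one decimal, this is the midpoint and the plus-or-minus spread.
print(round(midpoint(scores), 1), round(half_range(scores), 1))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;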

&lt;p&gt;Common fixes that didn't work for us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lower temperature&lt;/strong&gt;: helped at temperature=0, but the model still varied across calls because internal vector representations differ slightly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-consistency (5 calls + majority)&lt;/strong&gt;: 5x the API cost for a 30% variance reduction. Not enough.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Few-shot anchoring with calibration faces&lt;/strong&gt;: helped on average score but not on individual variance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The dual-track fix
&lt;/h2&gt;

&lt;p&gt;What worked: stop using LLMs for the parts where geometry is decidable.&lt;/p&gt;

&lt;p&gt;We added a parallel geometric track using Mediapipe Face Mesh:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Canthal tilt&lt;/strong&gt; (corner-of-eye angle): measurable to ±2 degrees from face landmarks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jaw angle&lt;/strong&gt; (mandibular angle from chin to ear): consistent across calls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Symmetry&lt;/strong&gt; (Hausdorff distance between left/right halves): pure arithmetic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These three measures map to a 0-10 sub-score that's &lt;em&gt;deterministic for a given input image&lt;/em&gt;. It doesn't capture taste, but it captures geometry.&lt;/p&gt;
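
&lt;p&gt;To show why the geometric track is repeatable, here is a minimal sketch of one measure, canthal tilt, computed from two eye-corner points. The coordinates are hypothetical pixel positions; how they are read out of Face Mesh and how the angle maps to a sub-score are not shown here.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math


def canthal_tilt_degrees(inner_corner, outer_corner):
    """Angle of the inner-to-outer eye-corner line above horizontal.

    Points are (x, y) in image coordinates, where y grows downward,
    so a positive result means the outer corner sits higher.
    """
    # abs() makes the same formula work for the left and the right eye.
    dx = abs(outer_corner[0] - inner_corner[0])
    dy = inner_corner[1] - outer_corner[1]
    return math.degrees(math.atan2(dy, dx))


# Hypothetical landmarks: the outer corner is 10 px higher, 40 px further out.
tilt = canthal_tilt_degrees((100, 100), (140, 90))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The function is plain trigonometry with no sampling, so the same landmarks always give the same angle.&lt;/p&gt;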

&lt;p&gt;The LLM track stays — but now it's responsible for the &lt;em&gt;aesthetic-judgment&lt;/em&gt; layer: skin quality assessment, hairstyle compatibility, facial harmony perception. Things that genuinely require pattern recognition over training data, not measurement.&lt;/p&gt;

&lt;h2&gt;
  
  
  The combination
&lt;/h2&gt;

&lt;p&gt;We don't take a plain average of the two. We weight them and check for disagreement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;final_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;geometric_score&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;llm_aesthetic_score&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;geometric&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;flag_for_review&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;disagreement: G=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;geometric&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, L=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;use_lower_score&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# be conservative
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 0.6/0.4 weighting was found empirically — geometric carries more weight because it's the deterministic anchor. The disagreement detection catches edge cases (e.g., the LLM rates someone high on "presence" but geometry is rough — usually a charisma photo we're not equipped to score correctly).&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Score spread per identical input: from ±0.8 (single LLM) to &lt;strong&gt;±0.5 (dual-track)&lt;/strong&gt;. Not zero, but much closer to what users expect.&lt;/p&gt;

&lt;p&gt;Bonus: the geometric scores let us give &lt;em&gt;actionable&lt;/em&gt; feedback. "Canthal tilt -3°, consider an angled selfie" beats "your eyes look closed" from a black-box LLM.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;The 0.6/0.4 weighting should be per-axis, not global. A high-resolution close-up of skin should shift weight toward LLM aesthetic perception. A poorly-lit small selfie should shift toward geometric (because LLM judgment on bad photos is mostly noise).&lt;/p&gt;

&lt;p&gt;We're refactoring this now — per-axis dynamic weighting based on photo quality signals.&lt;/p&gt;
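
&lt;p&gt;The shape of that refactor could look roughly like the sketch below. Everything in it is an assumption for illustration: &lt;code&gt;photo_quality&lt;/code&gt; is a hypothetical 0-to-1 signal, and the 0.8 and 0.4 endpoints are arbitrary values chosen so that a mid-quality photo reproduces the current 0.6/0.4 split.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def blend_scores(geometric_score, llm_aesthetic_score, photo_quality):
    """Blend the two tracks, trusting the LLM more on better photos.

    photo_quality is a hypothetical signal in [0, 1]; 0.5 reproduces
    the global 0.6/0.4 split described above.
    """
    # Clamp the signal so out-of-range inputs cannot produce odd weights.
    quality = min(1.0, max(0.0, photo_quality))
    # Illustrative endpoints: 0.8 geometric on bad photos, 0.4 on good ones.
    geometric_weight = 0.8 - 0.4 * quality
    llm_weight = 1.0 - geometric_weight
    return geometric_weight * geometric_score + llm_weight * llm_aesthetic_score
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With scores of 7.0 (geometric) and 5.0 (LLM), a poorly-lit selfie at quality 0 lands near 6.6, while a sharp close-up at quality 1 lands near 5.8.&lt;/p&gt;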

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;If you want to see what dual-track scoring feels like in practice, you can try &lt;a href="https://aiomoggle.com" rel="noopener noreferrer"&gt;AI Omoggle&lt;/a&gt; — single test from $0.99, no subscription, no photos stored.&lt;/p&gt;

&lt;p&gt;I'd genuinely love to hear how other people have tackled the LLM-variance problem in subjective tasks.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Two engines for AI slide decks: HTML output vs gpt-image-2 (and how we solved CJK rendering)</title>
      <dc:creator>汪小春</dc:creator>
      <pubDate>Wed, 13 May 2026 08:04:52 +0000</pubDate>
      <link>https://forem.com/xspring1982/two-engines-for-ai-slide-decks-html-output-vs-gpt-image-2-and-how-we-solved-cjk-rendering-2h85</link>
      <guid>https://forem.com/xspring1982/two-engines-for-ai-slide-decks-html-output-vs-gpt-image-2-and-how-we-solved-cjk-rendering-2h85</guid>
      <description>&lt;p&gt;A few months ago, a user emailed us with a screenshot. They'd generated a Chinese-language slide deck with our tool — and every Chinese character was either missing, replaced with a square, or warped into something that wasn't quite the right glyph.&lt;/p&gt;

&lt;p&gt;The screenshot was bad. The fix was harder than it looked.&lt;/p&gt;

&lt;p&gt;This post is about the architectural decision we ended up making: &lt;strong&gt;running two different rendering engines for the same product&lt;/strong&gt;, and why neither one alone was enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with AI slides + CJK
&lt;/h2&gt;

&lt;p&gt;Most AI slide generators do this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;LLM writes the content (text + structure)&lt;/li&gt;
&lt;li&gt;A template engine (HTML/CSS or PPTX) lays it out&lt;/li&gt;
&lt;li&gt;Done&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This works fine for English. The text is a string; the font is whatever the template specifies. The user sees what they expect.&lt;/p&gt;

&lt;p&gt;CJK breaks step 2 in two ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Font fallback.&lt;/strong&gt; When the template's font doesn't include Chinese / Japanese / Korean glyphs, browsers fall back to whatever's available. The result is typographically inconsistent — half your slide is in your designed font, half is in something Noto-ish that the browser found.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image-based generation.&lt;/strong&gt; If you skip the template and ask an AI image model to "make a slide with this Chinese text", you'll get the garbled-CJK problem most generative image tools have — the model produces something that looks like Chinese but isn't actually any specific character. (Try this in DALL·E or Midjourney with any non-Latin script. You'll see what I mean.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Two engines, two trade-offs
&lt;/h2&gt;

&lt;p&gt;We ended up shipping both:&lt;/p&gt;

&lt;h3&gt;
  
  
  Engine 1: HTML path
&lt;/h3&gt;

&lt;p&gt;The LLM produces a structured spec, we render it with a reveal.js / Slidev-style template. Output is an inline-editable web slide deck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; users can tweak content after generation (it's just HTML); fast; smaller file size for exports.&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; CJK looks acceptable but never great; visual variety is constrained by what the template supports.&lt;/p&gt;
&lt;h3&gt;
  
  
  Engine 2: gpt-image-2 path
&lt;/h3&gt;

&lt;p&gt;OpenAI's &lt;code&gt;gpt-image-2&lt;/code&gt; (released April 2026) is the first image model where text rendering is genuinely usable for CJK. We compose a "slide-as-prompt" — layout description, content, style — and the model renders the entire slide as a single image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; typography is sharp and consistent; CJK characters render correctly; visual variety is essentially unlimited.&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; the user can't tweak content post-generation without re-rendering; ~5x slower than the HTML path; PPTX export has each slide as one image (not editable in PowerPoint).&lt;/p&gt;
&lt;h2&gt;
  
  
  The decision: ship both
&lt;/h2&gt;

&lt;p&gt;We let the user pick. Default to HTML for fast iteration; switch to gpt-image-2 when CJK accuracy matters more than editability.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User flow:
  Article / link / PDF → LLM extracts structure
                         ↓
            ┌────────────┴────────────┐
   HTML path                      gpt-image-2 path
   (Slidev-style template)       (full-image render)
            ↓                            ↓
     Editable web slides         Image-per-page export
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why this isn't obviously the right architecture
&lt;/h2&gt;

&lt;p&gt;Two engines means more code, more bugs, more decisions for the user. It also means our "What does the tool do?" elevator pitch has two halves — which is harder to sell than a single clean story.&lt;/p&gt;

&lt;p&gt;But for CJK users, the HTML path alone wasn't acceptable, and dropping the HTML path entirely was a regression for everyone who wanted editable output. So: both.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;In hindsight, we should have made the engine choice &lt;strong&gt;per-slide&lt;/strong&gt; instead of per-deck. Some slides need editing (talking points, agenda); some need typography fidelity (a single Chinese headline on a chart). Forcing the user to pick one engine for the whole deck is the wrong granularity. We're fixing this now.&lt;/p&gt;
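
&lt;p&gt;A per-slide choice could start from a simple rule like the sketch below. The rule itself is an assumption (editability wins, then CJK fidelity), &lt;code&gt;suggest_engine&lt;/code&gt; is a hypothetical helper, and the character check covers only four common Unicode blocks.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Unicode blocks checked here: CJK Unified Ideographs, Hiragana,
# Katakana, and Hangul Syllables. Other CJK blocks are ignored.
CJK_RANGES = (
    range(0x4E00, 0xA000),
    range(0x3040, 0x30A0),
    range(0x30A0, 0x3100),
    range(0xAC00, 0xD7B0),
)


def contains_cjk(text):
    """True if any character falls in one of the blocks above."""
    return any(ord(ch) in block for ch in text for block in CJK_RANGES)


def suggest_engine(slide_text, needs_editing):
    """Hypothetical per-slide rule: editability wins, then CJK fidelity."""
    if needs_editing:
        return "html"
    return "gpt-image-2" if contains_cjk(slide_text) else "html"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An agenda slide would stay on the HTML path, while a chart with a single Chinese headline that nobody needs to edit would go to the image path.&lt;/p&gt;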

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;If you want to see what gpt-image-2 looks like as a slide engine, especially with CJK, you can sign up at &lt;a href="https://anyslide.app" rel="noopener noreferrer"&gt;AnySlide&lt;/a&gt; (60 free credits, no card). I'd genuinely love feedback on the engine switch UX; it's the part I'm least sure about.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>softwareengineering</category>
    </item>
  </channel>
</rss>
