Forem: 고광웅

Cross-posting to Three Platforms Forced Me to Rethink What 'The Same Post' Means

고광웅 — Thu, 23 Apr 2026 14:18:43 +0000

The Simple Mental Model That Failed

Last week I finished writing an essay for this newsletter. I wrote it once. I published it three times — Substack (the original), Dev.to (English cross-post), Tistory (Korean rewrite).

My mental model going in was: write once, redistribute. Source of truth on Substack, automation handles the rest. The posts on Dev.to and Tistory are "the same post," just on different platforms.

By the end of the week I had written five essays, run the full three-channel pipeline on each one, and learned something I probably should have anticipated: there is no such thing as "the same post." Each platform rendered the same starting content into a different object, and the pipeline only started working when I accepted that and built for each platform's native primitive.

This post is what I noticed.

What "Primitive" Means Here

Each platform has an atomic unit the reader is actually consuming. Get the unit right, the rest of the post works. Get it wrong, and the post technically publishes but lands flat.

For the three platforms I ran this week:

Substack's primitive is the sentence, delivered to an email inbox. A subscriber opens their inbox in the morning. Their posture is sit-and-read. They committed to hearing from this author in this voice. What matters: the opening hook that survives "do I open this email?", the rhythm of the prose, the feeling of being written to rather than at.
Dev.to's primitive is the snippet, discovered in a feed. A developer opens Dev.to during a work break and scrolls. Their posture is scan-first, save-for-later, read-if-it-looks-useful. What matters: the first-paragraph payoff, the code blocks that can be screenshotted, the tags that make the post findable a week later.
Tistory's primitive is the local context, arrived at via Naver search. A Korean reader searches for something like "AI agent persona" in Korean and clicks the third result. Their posture is compare-with-other-sources, skim-for-answer. What matters: Korean sentence rhythm that doesn't feel translated, reference points a Korean builder would recognize, canonical URL so the SEO credit routes correctly.

"The same post" collapsed those three primitives into one thing in my head. The pipeline I built assumed that's what they were. It wasn't.

Where the Pipeline Broke

The breaks weren't dramatic. They were a series of small misfires that, taken together, forced me to redesign each platform's adaptation step.

Substack's default avatar became my cover image, once.
When an already-published Substack post has no cover image set, the social-card og:image pulls the generic publication avatar — a small gray sphere that, scaled up to 1456×819, looks like an out-of-focus moon. I tried retroactively updating the cover via the API; the draft state updated, but the public page rendering didn't. Substack appears to snapshot the post's social card at first publish. The cover decision has to happen before publish. I added cover generation to the pre-publish pipeline after losing one post's cover to this.

Dev.to rejected my default Python User-Agent with "Forbidden Bots."
The Dev.to API works fine with a valid token — but only if the request includes a non-default User-Agent header. python urllib sends Python-urllib/X.Y by default, and Cloudflare in front of Dev.to returns 403 for that string. The fix is a one-line header addition, but it cost me an hour of debugging "why is my valid token being rejected?" when the problem was never the token.

Tistory's editor has two editors, and I was writing to the wrong one.
The Tistory write page renders a TinyMCE iframe as the main editor. A CodeMirror instance also exists in the DOM as a backup for HTML-mode users. I found CodeMirror first and wrote my injected content there. The UI saved the TinyMCE content, which was empty. I got eleven empty posts published as drafts before I noticed. Now the injection routine tries TinyMCE's activeEditor.setContent(html) first, and CodeMirror is a fallback for HTML-mode users.

Tistory's visibility radio was clicking but not selecting.
The publish modal has three visibility radio labels — public, public-protected, private — each written in Korean. My Playwright script searched for the Korean word for "public" and clicked the first match. The click registered in the event log. The saved post was still private. The issue was that my selector was finding the first instance of the word — which turned out to be a header label elsewhere on the page, not the radio option. Fixing it meant scoping the text search to the modal and preferring [role="radio"] / <label> elements with exact text match.

The Korean translation came out technically correct but translation-y.
The first-pass Korean rewriter produced grammatically fine Korean that a native reader would immediately clock as translated. Common patterns — awkward passive constructions, the same noun-phrase ending repeated across three consecutive sentences, English word order showing through in Korean syntax. I added a second-pass editor that takes the first pass as input and specifically targets these translation-y patterns. I also added a smell score — a cheap regex-based heuristic counting six known patterns — so I could measure whether the second pass was actually improving output. On one post the first pass scored 14, the second scored 3. On another post, the first pass was already clean (scored 3), and the editor responsibly left it alone. I take the second result as more important than the first: the pipeline knows when not to edit.

What Each Platform Actually Needed

Once I accepted "same content" was a category error, the adaptations started looking like three different products.

Substack adaptation

Title treated as email subject line, not blog headline. Punchy, specific, survives inbox clutter.
Cover image generated before first publish. Typographic, dark navy with accent color, consistent series aesthetic.
Paywall section markers — none yet, but placeholder structure so I can add them later without rewriting.
Markdown → ProseMirror node tree (Substack's body format isn't raw markdown; the API needs it serialized to their doc structure).

Dev.to adaptation

Tags normalized to lowercase alphanumeric (Dev.to strips anything else).
canonical_url pointed to the Substack post so search engines credit the original.
main_image sourced from the already-uploaded Substack CDN URL. Dev.to has no image upload API — their parser fetches external URLs — so reusing Substack's CDN saved a redundant upload step.
Filter in the image URL extractor skips the Substack subscribe-card avatar so Dev.to doesn't pick up a blurry placeholder as the hero image.

Tistory adaptation

Two-pass Korean rewrite, with the second pass measurable and skippable if the first pass is already clean.
canonical (as a blockquote link in the body) pointing back to Substack. Naver ignores canonical URLs for ranking, but Google still respects them, so the tag is more about cross-platform SEO hygiene than Naver ranking.
Netscape cookies merged from tistory.com and kakao.com (Tistory auth goes through Kakao).
Idempotency check: if blog_drafts.tistory_url is already set, skip re-publish unless --force. Eleven duplicate test posts taught me this one.
Playwright UI automation as the transport layer. Tistory's Open API shut down in February 2024 and isn't coming back.

The Platforms Where I Stopped

Medium was on my list. I wrote the publisher module, added the env token field, and then discovered the Integration Token feature now requires Medium Partner Program membership — which isn't available to all accounts. Shipping it behind a paywall wasn't worth it for one channel.

Naver Blog has an official OpenAPI, but the content format is constrained enough (limited HTML, external link penalties) that automating it would require another rewrite pass — a third-pass "Naver blog format" rewriter. That's on the list for later, not this week.

I note both of these because the multi-platform question isn't just "which platforms work?" It's "which platforms are worth the adaptation cost?" A platform with a non-trivial rewrite pass costs more than the same reach on a platform that already speaks my primitive.

What I Actually Learned

Same content is a category error. The words are the same input. The artifact produced by each platform is a different object. Treating them as one job — "publish the post everywhere" — hides the fact that each adaptation is non-trivial and each platform's primitive is different.

Pipelines should speak each platform's native language. A Tistory-shaped post is not a machine-translated Substack post. It's a different artifact with different idioms, different reader context, different SEO concerns. The pipeline that glues them has to make the translation at the platform's level, not the language's level.

Measure the adaptations, not just the publishes. I almost shipped a Korean rewrite that was technically fluent but read as translated. The only reason I caught it was the smell-score regex I added as a sanity check. The pipeline's quality gate has to be at the rendered output, not at the API status code.

APIs die. Plan for UI automation. Medium and Tistory both used to have Open APIs that worked. Neither does now. Playwright-based publishing is uglier than API-based, but it survives policy changes that break APIs. Anything publishing-adjacent should have a Playwright fallback path.

For Other Builder-Writers Considering Multi-Platform

Three things I'd do differently if I were starting this week rather than ending it:

1. Write down each platform's primitive before adapting for it.
What is the reader's posture? Where did they arrive from? What format does the platform natively render well? The answer to those three questions determines most of the adaptation work.

2. Build idempotency from the first post, not the twelfth.
I published eleven duplicates on Tistory before I added "if already published, skip" logic. Ten minutes of upfront design would have prevented ninety minutes of cleanup.

3. Treat each platform's automation as a separate product with its own failure modes.
The Dev.to 403, the Substack cover re-render quirk, the Tistory editor ambiguity — none of these share a root cause. They each required platform-specific debugging. Pretending the automation is one system creates the illusion of a single code path where there are actually three.

The Close

The instinct to cross-post from one source to many channels is correct. The hidden cost is the adaptation work that you don't see until you ship. Same words in, different objects out.

After a week of running this in anger, my conclusion is that cross-posting is really cross-rendering. The same source, rendered by different primitives, into different platforms' native formats. The pipeline that makes this pleasant to run respects each platform's primitive rather than forcing uniformity across them.

If your source content is generic enough that the rendering difference doesn't matter — short announcements, product launches, pull-quotes — the naive approach works. For anything essay-length, opinion-driven, or audience-differentiated, the primitive shift is real, and the pipeline has to know about it.

I Wrote Four Posts. Then I Let Them Decide My Roadmap. Here's Why I Stopped.

고광웅 — Wed, 22 Apr 2026 13:28:26 +0000

The Week That Felt Productive

Last week I published four posts on this newsletter. They were observation pieces about AI agent products — what current platforms get wrong about persona, about conversation-quality monitoring, about the relationship between AI and runtime, about optimizing for proxy metrics instead of real outcomes.

Each post stood on its own. They also connected — four angles on the same underlying claim: most AI agent products today ship simple primitives (static personas, per-call logs, workflow builders) when the more accurate primitives are runtime-level (situational steering, trajectory-level observability, content-aware design).

By the fifth day, I noticed I wasn't just writing. I was building a worldview.

Yesterday I opened a new document to plan product direction for the agent platform I run. Within twenty minutes I had drafted a framework with four improvement tracks, each mapped one-to-one onto those four posts. The plan felt right. It was logically consistent, evidence-backed, and matched my product's current architecture.

I asked the person I work with most closely — who sometimes functions as my critical second opinion — whether this was the right direction.

The answer: technically sound, strategically premature.

That stopped me. And it taught me something I want to write down before I forget it.

The Trap Has a Name

The pattern I walked into isn't new. It has a name. I'll call it content-to-product alignment trap.

It works like this:

You write essays about how the world should be.
The essays are logically coherent and emotionally satisfying to produce.
You build a worldview from the act of writing them.
You then try to build your product around that worldview.
You mistake the coherence of the worldview for evidence that the product is right.

The coherence is real. But coherence is not demand. Your essays are a theory of what's broken in the industry. Your product has to solve what's actually breaking for the users you're trying to reach. These can overlap. They can also diverge entirely.

Here's the part I didn't expect: the trap gets stronger when you write a series. A single essay is easier to hold at arm's length. Four essays in a week, all reinforcing the same thesis, feel like a position paper. The more consistent the series, the more the author mistakes its internal consistency for external validity.

I wrote posts that argued AI agent platforms should have situation-aware personas, trajectory-level observability, content-first design, and outcome-based evaluation. Each claim is defensible. But I have no user interview data showing my actual users are blocked on any of those four problems. I just wrote about them compellingly, and four posts later I was ready to bet a roadmap on them.

Why This Fails Quietly

The tricky part of this trap is that it doesn't feel like a mistake while you're in it.

When you build a product based on real user pain, there's friction. Users complain, things don't make sense, hypotheses get falsified. The dissonance is productive. You update.

When you build a product based on your own essays, there's no dissonance. You wrote the essays. You agree with yourself. Every decision confirms the worldview you already built. The feedback loop collapses into a loop of one.

And because the essays are public, there's a second reinforcing mechanism: public commitment. You've told readers the world is shaped a certain way. Pivoting your product away from that shape can feel like a reputational cost. So you don't pivot. You build.

The product that results might even be well-engineered. It will just be well-engineered for the wrong problem.

The Evidence I Wasn't Using

When I sat with the pushback, I made a small but important list.

Users of my platform don't write to me saying "my agent's conversation drifts structurally and we lack trajectory-level observability." They write to me saying things like:

it's too slow
this integration broke
the output missed the point
I can't figure out what the agent actually did

These aren't the same problems I wrote about this week. There's a relationship — trajectory observability would help diagnose "can't figure out what the agent did" — but "relationship" is not "direct match."

When I mapped my four-track plan against these actual complaints, I found that one of the four tracks directly addressed a user-facing pain, and the other three were internal quality infrastructure. The three would help my team operate better. They wouldn't be felt by users unless I explicitly surfaced them.

I had written a roadmap where 75% of the work was invisible to the users it was supposedly for.

Author Voice vs. Builder Voice

The fix isn't to stop writing about product observations. The fix is to recognize that the author and the builder are different roles, and they need to hear different things before they commit.

As an author, I'm allowed — encouraged — to write from observation. I can say "here's what I notice about the shape of this category" without being accountable to whether the observation solves anyone's concrete problem. Essays are for sense-making, hypothesis-forming, provocation. They don't need user research to be valuable.

As a builder, I'm not allowed to skip user research. I don't get to substitute my essays for it. A product decision grounded in "I wrote about this and it felt true" is a decision grounded in one person's sense-making. One person's sense-making is a terrible distribution to sample user need from.

The trap is that these two voices live in the same head. The essay I wrote yesterday becomes the premise for the product decision I make today. Unless I consciously split the two, they blur.

In my case the conscious split looks like this:

Essays go on Substack. They're observations. They commit me to thinking publicly, not to building accordingly.
Roadmap decisions go through a different filter: user interviews, retention data, competitive positioning, pricing experiments. Essays can be an input to that filter, but not the filter itself.

I had collapsed those two tracks last week without noticing.

The Counter-Example I Should Have Learned From Earlier

There's an irony here. One of the posts I wrote last week was about Claude Design — the product Anthropic launched, which I analyzed by reading source code of the skill bundles it produces. The deepest claim of that piece was: Claude Design is mostly clever runtime engineering with a model at the entry point. The runtime is the product.

What I missed, while writing enthusiastically about that claim, is how Claude Design actually got built. Anthropic didn't start by writing essays about "the AI design category needs a runtime layer." They built a product that users wanted (fast, good-looking design artifacts), and the runtime emerged as the architecture that made that product work well.

Runtime-as-product is a retrospective description of Claude Design. It is not a prescription for how to decide what to build.

If I had actually internalized that lesson, my own plan wouldn't have led with "let's build runtime infrastructure because it's the more accurate primitive." It would have led with "what user outcome is currently broken, and does a runtime change fix it?"

The worldview from my essay was seductive. The worldview from how the product I admired actually came to exist was more useful, and I walked past it.

What I Changed

After the conversation, I rewrote the plan.

The four improvement tracks stayed. What changed was how they're classified and sequenced.

One track — the one that directly changes what users feel (agent behavior shifts with user state) — became the center. It is the only track that gets to carry product positioning.

Three tracks — the observability layer, the artifact bundle format, the evaluation feedback loop — got reclassified as internal engineering investments. They might make my team's operations better. They don't show up in external messaging. No user cares about "we added trajectory-level scoring" as a product pitch.

I also added a positioning confirmation gate. Before any of this gets committed to as product positioning, the user-facing track has to be validated with actual users. If three out of five test users can point to specific moments where the agent felt like it was reading their state, positioning is confirmed. If they can't, the work was still useful internally, but the positioning claim doesn't get made.

That last part feels uncomfortable, which is how I know it's right. The honest product position is "we think this direction matters but we don't know yet if users will feel it." That's how the position should sit until the evidence shows up.

A Practical Test for Other Builder-Writers

If you're both publishing and building, here's the test I now run before letting a piece of writing inform a product decision:

1. Would this observation survive contact with actual user interviews?

If your essay says users need X, and you haven't heard a user articulate X, that's a hypothesis, not a decision.

2. Is the elegance of the essay doing argumentative work the evidence isn't?

Coherent writing is a skill. Coherence inside the essay is not coherence with reality. Test whether the essay's strength is its logic or its evidence.

3. If you had to remove the essay from consideration, would the product decision still look the same?

If no, the essay is the load-bearing reason for the decision. That's dangerous unless the essay is backed by something beyond the author's own sense-making.

4. What is your essay actually an output of — user research, or your own pattern-matching?

Both are valid writing inputs. Only one is a valid product input.

None of these tests disqualify writing from influencing product direction. They just force you to name which track the essay belongs on.

What Stays

I'm going to keep writing these essays. The observation work is valuable on its own. It shapes how I think, it clarifies what I can articulate, it creates accountability that makes the thinking sharper. A weekly essay is not a time waste, even if it doesn't steer the roadmap.

What I'm not going to do, anymore, is let the essay be the plan.

The series I wrote last week — four posts on agent product primitives — is still a real series about real observations. Those observations probably point at real gaps. I just don't know yet which of them are my users' gaps versus the industry's abstract gaps. That distinction matters more than I treated it.

If you're reading this as a builder-writer yourself, the thing I'd watch for is the fifth day of a week when the series feels especially coherent. That's the moment when the author in your head tries to hand the product direction to the builder in your head. The handoff feels productive. It's actually a step sideways into fiction.

Notice when it happens. Make the split conscious.

Response Quality Is Not Conversation Quality. A Paper Quantifies the Gap.

고광웅 — Tue, 21 Apr 2026 16:09:14 +0000

The Metric Most Agent Products Are Missing

Most AI evaluation work you see on agent products measures the same thing: was this response good? You get a score per output, you track it over time, you look for regressions. That's the pattern whether you're using LLM-as-judge, user thumbs-up/down, or hand-graded samples.

This is a reasonable thing to measure. It's also an incomplete thing to measure, in a way that matters more for multi-turn agents than the industry has quite caught up to.

Here's what we're missing: a conversation can be full of individually good responses and still be structurally broken. The agent contradicts itself across turns. It shifts topic in ways the user didn't cue. It answers the current message fine but has stopped tracking what was agreed three messages ago. Each response passes a quality bar. The conversation fails anyway.

A paper uploaded to arXiv last week tries to formalize this gap — and more interestingly, proposes a way to measure it in production without embeddings, judges, or access to model internals. I want to walk through what it shows, because I think the measurement framing is more important than the specific method.

The paper is Token Statistics Reveal Conversational Drift in Multi-turn LLM Interaction (Hafez, Nazeri, v2 April 17).

The Number That Made Me Read It Twice

Across 4,574 conversational turns spanning 34 conditions, three frontier teacher models and one student model, the authors report:

Their proposed signal aligns with structural consistency in 85% of conditions.
It aligns with semantic quality in only 44% of conditions.

Put those two numbers next to each other and they tell a story.

Response quality and conversational consistency are not the same thing. They can be measured to diverge. And the tools most teams use — LLM-as-judge on outputs, user feedback on individual responses — are measuring the 44% side of the gap, not the 85% side.

If your agent is deployed in anything that looks like an ongoing interaction — support, coaching, tutoring, sales, therapy-adjacent use cases, long-form research, gaming — the side you're not measuring is the side where the trust breaks.

What the Paper Actually Proposes

The authors define a metric they call Bipredictability (P). Their description of it, taken from the abstract: it "measures shared predictability across the context, response, next prompt loop relative to the turn total uncertainty."

In plain terms: across a conversation turn, there's information that the context predicts about the response, information that the response predicts about the next prompt, and information that the next prompt predicts about the context. How much those loops overlap, relative to how uncertain each turn is overall, is what they track.

The implementation is a lightweight auxiliary component they call the Information Digital Twin (IDT) — running alongside the agent and computing P from the token stream. No embeddings. No auxiliary evaluator model. No white-box model access.

That engineering profile matters. It means the signal can, in principle, sit in a production deployment at trivial cost. Most "measure what's happening in your LLM agent" proposals involve an LLM judge or a vector DB query per turn. This one is token frequency statistics.

I haven't built their system, and the abstract doesn't go deep on implementation details, so I'm hedging on whether the engineering will be as clean in practice as the abstract implies. But the design choice is pointing at something real: if you want a monitoring signal that can run continuously in production, it has to be cheap enough to not change your deployment economics. Bipredictability, at least as described, fits that constraint.

What Their IDT Caught

The detection result reported in the abstract: 100% sensitivity for contradictions, topic shifts, and non-sequiturs in their tested set.

Sensitivity claims at that level always deserve scrutiny — it means every tested failure was caught, not that every failure in every real deployment will be. The authors are testing against constructed conversations with known failures. Production distributions will be messier.

Still, even accepting the number at face value for the test conditions, three failure types are worth naming because they map directly to real user complaints:

Contradictions — agent says A in turn 3, says not-A in turn 12, and the user has been quietly losing faith since turn 7.
Topic shifts — agent pivots away from the user's thread without a cue. Feels "off" in a way users rarely articulate.
Non-sequiturs — response that's individually coherent but doesn't actually engage with what just happened.

If you've ever had a user say "I don't know, it just stopped feeling right" — they're usually describing one of these three. None of them are caught by "rate this response 1-5" dashboards.

Why This Is a Different Measurement Problem Than Response Quality

I think the piece builders most often miss is that response-level evaluation and conversation-level evaluation are structurally different problems.

Response quality is a pointwise judgment. You can sample, score, aggregate. LLM-as-judge does a decent job of this. It's the kind of evaluation that fits neatly into existing observability tooling — each output is a discrete event with a score attached.

Conversation-level consistency is a sequence problem. You can't score it by looking at any single turn. You need to look at relationships between turns. The measurement surface is the conversation trajectory, not individual messages.

The tools haven't caught up. Agent observability platforms like Langfuse, LangSmith, Helicone are doing better-than-ever work on per-call metrics — latency, cost, tool usage, response sampling. Very little in that category instruments conversation-level structural properties. Which is the level where multi-turn agents mostly fail.

The paper's contribution, from a tools-thinking perspective, is identifying that there's a cheap signal at this level if you know where to look.

Three Audit Questions For Your Multi-Turn Agents

If you're shipping any kind of multi-turn agent, three questions are worth sitting with:

1. Do you measure anything about the conversation as a whole, or only about individual turns?

Most teams I know answer "only individual turns" after thinking about it. The shape of current dashboards enforces this — each row is a request.

2. If a user tells you "the conversation got weird around message 15," can you go find what happened?

Most production agents don't retain full conversation state in a way that makes this analyzable after the fact. Or they retain it, but nothing about the trajectory is indexed or searchable.

3. Have you instrumented topic shifts, contradictions, or non-sequiturs in any form?

If you haven't, you're outsourcing detection of these failures to your users. They'll notice, but you won't — and by the time they tell you, attrition has happened.

These aren't theoretical failures. They're the most common complaint pattern in post-cancellation interviews I've seen for multi-turn AI products: "It worked fine at first but then kind of drifted."

What This Paper Leaves Open

A few things the abstract doesn't settle that I'd want to know before building on it:

Exactly how Bipredictability is computed. The verbal description is suggestive but not precise enough to reimplement from. Need to read the full paper.
Which frontier models were used. The abstract says "three frontier teacher models" without naming them. Worth checking whether the signal transfers across model families.
Whether code or data is public. The arXiv page doesn't list code or dataset resources. For a proposal that's essentially "add this runtime monitor to your system," reference implementation availability will determine how fast this gets adopted.
False positive behavior at production scale. 100% sensitivity on a curated test set is a different claim from "works reliably at scale without flooding you with false flags." The abstract doesn't report specificity in a form I can quote.

I'm flagging these not to dismiss the paper but because the distance between "this is the right idea" and "this is deployable" is where most interesting research lives, and it's worth staying honest about that distance.

The Builder Takeaway

The specific metric matters less to me than the framing. The framing is:

Response quality is a property of individual outputs. Conversational reliability is a property of the trajectory. If you only measure the first, you're blind to failure modes that happen at the second level — and those are the failure modes that drive user churn in multi-turn products.

Whatever the eventual best implementation turns out to be — Bipredictability, embeddings-based, something else — the thing worth internalizing is that there's a measurement gap here, and closing it probably requires rethinking what your agent observability stack is watching.

For me, the immediate action from reading this isn't "implement IDT." It's closer to: audit what dashboards my team and I are actually looking at, and note how many of them measure conversation-level properties at all. The answer for most of us is going to be close to zero. That's the gap worth working on before worrying about which specific metric to adopt.

The Throughline

I've been writing this week about AI agent primitives — persona that's actually runtime steering rather than a static string, design that's content-structure-first, and now measurement that's trajectory-level rather than pointwise.

There's a pattern connecting them. The abstractions we shipped first for AI agents were the ones that were easy to build: persona as string, design as template, evaluation as per-output score. In each case, the more accurate primitive is a little harder and a little more runtime-y: persona as steering, design as content-reading, evaluation as trajectory-watching.

That's not a coincidence. Early AI product design has been constrained by what was cheap and easy to instrument at the call site. What I'm watching the research space do, right now, is build the tooling that lets the harder and more accurate primitives become cheap and easy too. When they do, the products that got shipped on the easier abstractions will look more brittle than they currently do.

If you're building, the question worth asking isn't just "what do I ship now?" It's also "which of my current primitives is an early-days hack that I'll want to replace when better measurement lands?" For multi-turn agents, my guess is that evaluation is one of those — and this paper is a pointer toward where the replacement starts.

Your AI's Persona Is a String. A New Paper Argues It Should Be a Steering Vector.

고광웅 — Sun, 19 Apr 2026 14:08:10 +0000

The Mismatch Most Persona Products Live With

If you've built any kind of AI agent product in the last two years, you've probably shipped a "persona" feature. Usually it looks like this: a text field where the user (or the product) writes "You are a witty, slightly sarcastic assistant who loves climbing," and that string gets stitched into a system prompt. Done. Persona complete.

The thing is, nobody who has ever worked with real people thinks of personality that way. Actual humans don't have a single mode. The friendly coworker is different at 2am on a deadline. The patient teacher is different when a student is being deliberately obtuse. Situation changes behavior, and most of the time it changes it a lot.

A paper that went up on arXiv this week formalizes that mismatch and proposes something interesting about how to fix it. It's not the kind of paper that'll get quoted in keynote slides — there are no dramatic benchmarks in the abstract — but the conceptual move is, I think, more important than the specific method.

The paper is Beyond Static Personas: Situational Personality Steering for Large Language Models (Wei, Li, Wang, Deng, April 15). Short version: instead of treating personality as a string you define once, treat it as a runtime steering signal over the model's neurons — one that shifts with the situation.

What the Paper Actually Argues

The technical contribution is a framework the authors call IRIS — Identify, Retrieve, Steer. It's training-free and operates at the neuron level. Three parts:

Situational persona neuron identification — find the specific neurons whose activation patterns correspond to personality traits in context.
Situation-aware neuron retrieval — given a new situation, retrieve the relevant neuron set for the desired persona expression under that situation.
Similarity-weighted steering — apply a steering vector to those neurons at inference time, weighted by how similar the current situation is to the retrieved references.

What I find more interesting than the method is the empirical claim underneath it: the authors argue (and their analysis attempts to demonstrate) that situation-dependency and situation-behavior patterns already exist inside LLM personalities, at the neuron level. Personality isn't just an artifact of the system prompt — it's something the model has internalized structurally, and that structure is responsive to context.

If that holds up under replication, the implication is bigger than IRIS itself. It means the right abstraction for "persona" in an LLM might not be a description you write but a manifold you steer.

I'm hedging because the abstract doesn't give specific win margins and I haven't dug into the full paper. The method could under-perform cleaner approaches in practice. But the framing is worth thinking about regardless of whether IRIS turns out to be the method that wins.

Why This Is a Design Problem, Not Just a Method Problem

Here's the thing I keep coming back to. Most of the persona code I've written — and most of what I see shipped in agent products — treats persona as a compile-time primitive. You write it once, it goes into the system prompt, and from that point forward the agent's "character" is whatever that text produces in combination with whatever comes after.

What this paper is pointing at is that persona is arguably a runtime primitive. It's not a fixed definition. It's a behavior modulation that should respond to context — and the model already has the internal machinery to do that if you know where to apply the signal.

Those are two different things, and I don't think the industry has fully reckoned with the difference. We're selling "custom AI personas" while implementing static strings. The user-facing story is "you can make this agent sarcastic" but the implementation is a shim that barely survives contact with an adversarial user.

What Game Designers Have Been Saying For Decades

I spent a decade designing games before I started building AI agents. The paper's framing feels very familiar to me — it's arriving, through a different path, at something the game AI community has treated as common knowledge for a long time.

Static NPC personalities get old in a session. A guard who always says the same thing in the same tone at the same time regardless of what the player has been doing is immediately legible as a set piece, not a character. The guards players remember are the ones that modulated — the ones whose threat level shifted with how many times you'd returned to the same area, the ones whose dialogue tree branched based on tension state.

The vocabulary was different. We didn't say "steering vectors." We said mood systems, faction relationships, dynamic difficulty, dialog branching by tension. But the underlying insight is the same: behavior is a function of state × situation × character, not just character.

The novelty of a paper like IRIS, from a game designer's lens, isn't the idea. It's the discovery that the scaffolding for this kind of behavior is already latent in LLM weights and can be activated without retraining. That part is genuinely new.

Three Questions to Ask About Your Own Persona Implementation

If you ship a product where users can define or tune an AI's personality, it's worth auditing what you actually built against what you probably told users you built. Some specific questions:

1. What happens to your persona when the user asks something hostile?

Static-string personas tend to collapse under adversarial pressure. The "patient teacher" prompt starts talking like a base model the moment someone pushes hard. If your persona is a product promise, you need a mechanism beyond a string — otherwise the promise is broken the first time someone tests it.

2. Does your persona change register with conversation length?

Real teachers get firmer as a session drags on. Real assistants get more efficient as trust is established. If your agent sounds the same in message 1 and message 40, you've got a behavior rigidity that will eventually feel wrong to users.

3. What does your persona do when the topic shifts to something the "character" wouldn't know about?

This is the case where static personas fail most visibly. A persona designed around "warm emotional support" doesn't gracefully handle a user suddenly asking for tax advice. A situational model would know to shift register without dropping character. A string-based model can only either stay in character and refuse, or break character and help. Neither is right.

These aren't theoretical. They're the three places where persona products routinely fail in ways that erode user trust.

The Part That Matters for Builders

I don't think the takeaway from this paper is that everyone should rewrite their agents to do neuron-level steering next week. The infrastructure to do that at production scale doesn't really exist outside research labs yet.

The takeaway is more structural. The "persona" primitive most of us are using is probably a UI convenience over a more correct runtime mechanism. The more correct mechanism isn't accessible yet, but the mismatch is worth being honest about in how we design around persona features today.

Some implications I'm thinking through:

Treat persona as a layered system rather than a single string. Core traits at one level, situational modifiers at another, tone adjustments at a third. This is messier in the UX but closer to what's actually happening.
Build instrumentation for persona drift. How does your agent's tone change across a long conversation? Across different user emotional states? You probably don't measure this and should.
Be wary of "custom persona" as a feature promise. If your implementation is a text field and the model is doing the rest, you're selling something the mechanism can't reliably deliver. Setting user expectations honestly is better than overselling.

What This Paper Doesn't Settle

I want to name a few things the paper, as I understand it from the abstract, does not resolve:

The specific benchmarks (PersonalityBench and the authors' new SPBench) aren't standard in the field yet. Situational personality benchmarks are hard to construct well, and it's possible a different benchmark would tell a different story.
Training-free methods are appealing for deployment but sometimes undersell what you'd get from even a small amount of targeted fine-tuning. IRIS may be the right research contribution but not the right engineering choice for a given product.
Neuron-level steering is interpretability-adjacent territory, and that field has been notably humble about what its findings mean. Identifying "persona neurons" is a strong claim that deserves scrutiny before anyone builds on it as foundational.

I'm flagging these not to pick fights with the paper but because conceptual takeaways are more portable than methodological ones, and conflating them is how builders end up chasing implementations that don't actually help their products.

The Close

What I'm sitting with, after reading this paper alongside the last few days of working on agent products, is that a lot of the primitives we use are shaped by what was easy to build rather than what is actually the right model of the thing we're building.

Persona-as-string is easy. Persona-as-neural-steering-signal is hard. So we shipped the easy one. That's fair — you ship what works today. But it's worth occasionally asking whether the abstraction you shipped is actually the right abstraction, or just the one that was available.

For persona specifically, my current guess is that the right abstraction is situational and runtime, not descriptive and static. The paper arrives at that conclusion through empirical analysis of neuron activations. Game designers arrived there through twenty years of making NPCs that didn't suck. Different paths, convergent answer.

Whether IRIS is the specific mechanism that ends up winning is almost beside the point. What matters is the reframe: behavior is a function of situation, and persona is a steering problem, not a description problem.

If you're building in this space, it's worth checking which one your product actually implements.

Claude Design Looks Like AI Magic. Reading the Source, It's Four Engineering Patterns.

고광웅 — Sun, 19 Apr 2026 04:34:45 +0000

Before the Hype Settles

Anthropic shipped Claude Design on April 17, and most of the discussion has framed it as a Figma-challenging AI design tool. I used it differently. Instead of treating it as a design tool, I treated it as a specimen — generated a handful of skill bundles, exported the output, then spent more time reading the source than tweaking the design. I was more interested in how the product works than in what it can design.

What I found is more interesting than "AI can design now." The product appears to be mostly four discrete engineering patterns that happen to have a model at the entry point. The model doesn't feel like the magic — it's writing into a carefully structured runtime that almost any team could build.

This post walks through those four patterns, what they actually do, and what I'm taking into my own stack. Caveat up front: these are observations from a source read of a small number of bundles, not a rigorous evaluation of the product's full behavior. I'll flag assumptions as I go.

What I Inspected

A Claude Design skill bundle we looked at contained:

Wireframes.html — a 72KB single-file wireframe document with five navigation variations across three screens each, plus a live Tweaks engine.
IR Deck - Hi-fi.html and IR Deck - Wireframes.html — 1920×1080 slide decks wrapped in a custom Web Component.
deck-stage.js — a 621-line Web Component that provides the slide runtime.
colors_and_type.css — a 160-line design token sheet organized into seven categories.
SKILL.md — a 20-line skill manifest with frontmatter.
README.md — a 223-line brand and voice guide.
preview/ — twelve single-file "at-a-glance" cards, one per token category.
ui_kits/web/ — a React 19 UMD clickable prototype.

The total footprint is small. What's striking is how the pieces fit together — and how few moving parts there actually are.

Pattern 1 — Tweaks: `data-*` Attributes with CSS Variables

The Tweaks panel is what makes the output feel interactive: click "dusk" and the whole design shifts to a warm dark palette. Click "compact" and the layout tightens. No regeneration. No API round-trip.

The mechanism is mundane. Every theme, accent, layout, and density option is a :root CSS variable override keyed to a data-* attribute on the root element:

:root, [data-theme="paper"] { --ink:#1a1a1a; --paper:#f4efe6; --accent:#c53b1e; }
[data-theme="dusk"]         { --ink:#eae3d2; --paper:#2a2620; --accent:#e77c5f; }
[data-theme="midnight"]     { --ink:#f0ebde; --paper:#14110d; --accent:#ff8a6a; }
[data-accent="gold"]        { --accent:#d4a017; }
[data-layout="stack"] .flow { grid-template-columns: 1fr !important; }
[data-density="compact"] main { padding: 14px 18px; }

A click handler sets document.documentElement.dataset.theme = "dusk", persists to localStorage, and postMessages the host window so it can save the selection against the artifact. That's the entire switching layer.

Four axes, three-to-four options each, roughly 144 combinations available without regenerating anything. The design-system work is done at token definition time, not at runtime.

The takeaway I'm sitting with: what feels like "AI variant generation" in interactive design tools may be mostly static CSS token switching. The AI wrote the tokens once. The switching is attribute swapping.

Pattern 2 — `deck-stage.js`: A 621-Line Web Component That Replaces a Slide Tool

The decks in the output aren't Reveal.js or a bespoke React app. They're a custom Web Component, <deck-stage>, containing plain <section> children as slides.

What the component does:

Fits a design-size canvas (default 1920×1080) to whatever viewport it's rendered in, using transform: scale().
Handles keyboard navigation (arrows, PgUp/PgDn, Space, Home, End, 0–9, R).
Adds mobile tap zones (left third / right third).
Persists the current slide to localStorage, keyed by document path, so a refresh restores position.
Renders a floating overlay with previous/next/reset controls and a slide counter.
Injects a <style> tag into document.head with @page { size: 1920px 1080px; margin: 0; } so the browser's native "Print to PDF" produces one page per slide.
Emits a slidechange CustomEvent with bubbles: true, composed: true and a reason field (init/keyboard/click/tap/api) — listenable cleanly from outside the shadow DOM.
Reads a <script type="application/json"> block of speaker notes and posts them to the host window.
Honors a noscale attribute for PPTX export cases where the CSS transform is undesirable.

Two implementation details stood out:

The @page rule has to be injected into the outer document because shadow DOM ignores @page. So the component walks up and writes into document.head during connectedCallback. This is the kind of detail that gets no documentation credit but separates "works in print" from "falls apart at export time."

Slides are hidden, not unmounted, with visibility: hidden; opacity: 0. That preserves the state of videos, iframes, form inputs, and React subtrees across navigation. If you're building a slide system in React with conditional rendering, you're quietly discarding state every time the user hits the arrow key. A cheap fix with meaningful UX consequences.

Pattern 3 — `SKILL.md`: A Manifest Format, Not a System Prompt

The skill manifest is smaller than I expected. Three frontmatter fields:

---
name: <skill-kebab-case>
description: "Use this skill to generate well-branded interfaces and assets for <domain>. Contains <key files> for <style context>."
user-invocable: true
---

The body reads like a protocol, not a persona:

"Read README.md within this skill first."
"Then look at colors_and_type.css."
"If creating visual artifacts (slides, mocks): copy assets out, produce static HTML."
"If working on production code: treat <path>/frontend/app/ as canonical."
"If invoked with no other guidance: ask 3–5 questions about scope and audience, then act as an expert designer."
"Always flag: font substitutions, chart color choices, and any deviation from the documented color contract."

Three things about this format stood out:

Description is the routing signal. The orchestrator decides when to invoke the skill by reading the description alone. So the description has to encode domain, output type, and stylistic signal in one paragraph — different from how most agent frameworks define a role.

The body is a branched protocol. "If A, do X. If B, do Y." Not a soft persona, not a goal statement. Concrete execution paths keyed to invocation context.

"Always flag" is mandatory self-reporting at the manifest level. Fonts were substituted? Flag it. Deviated from the color contract? Flag it. It's an anti-hallucination pattern written into the skill definition rather than left to the model to remember.

I don't think the manifest format itself is novel — it's structurally close to how Claude Code's existing SKILL.md works. But its use as an agent interface for a consumer-facing design product is a concrete shape I haven't seen written down this cleanly before.

Pattern 4 — Output = Self-Contained HTML Bundle

The artifact isn't stored in a proprietary database. It's a folder:

IR/
├── assets/
│   └── colors_and_type.css
├── IR Deck - Hi-fi.html
└── deck-stage.js

The HTML references the CSS by relative path and the JS by relative path. Everything is physically co-located.

Zip it, upload to any static host, it works. No build step. No framework runtime. No server-side rendering required.

There's a small interesting trick: the same colors_and_type.css file appears as copies in multiple subfolders — one for the deck, one for the UI kit, one for the preview cards. The bundle is optimized for survival, not deduplication. If a user downloads just the deck folder, they don't lose styling.

More bytes, no broken links. For a consumer product where users will definitely cut-and-paste the wrong subset of files, that tradeoff probably earns itself back quickly.

Why This Shape Is Interesting

Going in, my mental model was roughly: "Claude Design is a big AI system that generates design output."

What I came away with is closer to: "It looks like a thin AI orchestration layer over a carefully engineered runtime and manifest format, and the interesting work is in the runtime."

The model writes HTML, CSS, and SKILL.md files into this system. The system is what makes the output interactive, exportable, and robust across environments. If Anthropic swapped the model tomorrow for a comparable one, my guess is the user experience would barely change — because the experience is mostly the runtime.

That reframes the build-vs-buy question for anyone working in this space. You may not need a design-specialized model to get most of the user-facing value. What seems to matter more is:

A token CSS file with tight, opinionated choices.
A data-* attribute theming layer.
A Web Component (or equivalent) that handles presentation concerns: scale, navigation, print.
A manifest format for the skills that generate into the runtime.
A bundle format that survives being zipped and sent around.

Build that scaffold, then any capable LLM plausibly becomes your design engine. If this read is right, the hard work isn't the AI — it's the runtime the AI writes into. I'm hedging because one bundle inspection isn't enough to generalize.

What I'm Taking

Four patterns I'm adapting into our own stack:

Four-axis Tweaks (theme / accent / layout / density). Roughly fifty lines of CSS and JavaScript for a meaningful UX upgrade. Low risk, high visibility.
@page dynamic injection for PDF export. Potentially removes the need for a separate PDF library in slide-style outputs.
SKILL.md manifest format for our agents. Three-field frontmatter, branched body, mandatory "Always flag" section. Structural improvement on how we currently define agent behavior.
Self-contained HTML bundles as the default artifact. No server dependency, zippable, survives cut-and-paste. Lowers the support surface dramatically for client-facing deliverables.

I'm leaving aside porting the slide Web Component for now, because we already have a working runtime and the license review cost isn't obviously worth the marginal gain. The patterns above are portable with a day or two of work each.

The Open Question

The specific shape I'm describing — four patterns around a model — isn't obviously unique to design. It's roughly the shape of a lot of AI-labeled products right now. The product does something useful and the model is the visible new thing, but most of what makes it work seems to be conventional engineering inside a well-thought-out structure.

If that holds, a lot of "AI products" are really platform products where the AI is the entry point rather than the engine. The scarce skill in that world isn't prompting. It's designing the runtime the prompts write into.

I'm not sure whether that's a durable observation or a snapshot of where we are in this early phase of AI product maturity. But it's the one I came away from this source read with, and it's shifted how I'm thinking about our own architecture.

If you're building in this space, reading the bundles your tools produce might teach you more than reading the marketing. That's the part I'd put the most confidence on.

Forem: 고광웅

Cross-posting to Three Platforms Forced Me to Rethink What 'The Same Post' Means

The Simple Mental Model That Failed

What "Primitive" Means Here

Where the Pipeline Broke

What Each Platform Actually Needed

The Platforms Where I Stopped

What I Actually Learned

For Other Builder-Writers Considering Multi-Platform

The Close

I Wrote Four Posts. Then I Let Them Decide My Roadmap. Here's Why I Stopped.

The Week That Felt Productive

The Trap Has a Name

Why This Fails Quietly

The Evidence I Wasn't Using

Author Voice vs. Builder Voice

The Counter-Example I Should Have Learned From Earlier

What I Changed

A Practical Test for Other Builder-Writers

What Stays

Response Quality Is Not Conversation Quality. A Paper Quantifies the Gap.

The Metric Most Agent Products Are Missing

The Number That Made Me Read It Twice

What the Paper Actually Proposes

What Their IDT Caught

Why This Is a Different Measurement Problem Than Response Quality

Three Audit Questions For Your Multi-Turn Agents

What This Paper Leaves Open

The Builder Takeaway

The Throughline

Your AI's Persona Is a String. A New Paper Argues It Should Be a Steering Vector.

The Mismatch Most Persona Products Live With

What the Paper Actually Argues

Why This Is a Design Problem, Not Just a Method Problem

What Game Designers Have Been Saying For Decades

Three Questions to Ask About Your Own Persona Implementation

The Part That Matters for Builders

What This Paper Doesn't Settle

The Close

Claude Design Looks Like AI Magic. Reading the Source, It's Four Engineering Patterns.

Before the Hype Settles

What I Inspected

Pattern 1 — Tweaks: data-* Attributes with CSS Variables

Pattern 2 — deck-stage.js: A 621-Line Web Component That Replaces a Slide Tool

Pattern 3 — SKILL.md: A Manifest Format, Not a System Prompt

Pattern 4 — Output = Self-Contained HTML Bundle

Why This Shape Is Interesting

What I'm Taking

The Open Question

Pattern 1 — Tweaks: `data-*` Attributes with CSS Variables

Pattern 2 — `deck-stage.js`: A 621-Line Web Component That Replaces a Slide Tool

Pattern 3 — `SKILL.md`: A Manifest Format, Not a System Prompt