<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Bongho Tae</title>
    <description>The latest articles on Forem by Bongho Tae (@xoqhdgh1002).</description>
    <link>https://forem.com/xoqhdgh1002</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3896559%2F3b7e6ff4-85a9-47b3-a452-08b8c7ea14d3.png</url>
      <title>Forem: Bongho Tae</title>
      <link>https://forem.com/xoqhdgh1002</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/xoqhdgh1002"/>
    <language>en</language>
    <item>
      <title>When Code Stopped Being a Vibe and Started Being a Job</title>
      <dc:creator>Bongho Tae</dc:creator>
      <pubDate>Fri, 01 May 2026 15:03:38 +0000</pubDate>
      <link>https://forem.com/xoqhdgh1002/when-code-stopped-being-a-vibe-and-started-being-a-job-b1l</link>
      <guid>https://forem.com/xoqhdgh1002/when-code-stopped-being-a-vibe-and-started-being-a-job-b1l</guid>
      <description>&lt;p&gt;There's a specific kind of late-night feeling familiar to anyone who has tried to build something with a chatbot. You describe what you want — "make me a little app that tracks how much water I drink" — and the model fires back fifty lines of code that look impressive, almost work, and silently break in three places. You paste the error back. You get another fifty lines. By 2 a.m. you're not building software anymore; you're playing a kind of haunted telephone game with a system that keeps confidently handing you broken tools and walking away.&lt;/p&gt;

&lt;p&gt;Inside the AI world, this experience earned a name: &lt;strong&gt;vibe coding&lt;/strong&gt;. You describe the vibe. The model conjures a snippet. You patch the snippet. Nobody, strictly speaking, is doing engineering. It's closer to commissioning a sketch from a street artist — fast, occasionally beautiful, and absolutely not a load-bearing structure.&lt;/p&gt;

&lt;p&gt;The team behind GLM-5, a new open-weights large language model from the Chinese research group Z.ai, says the era of vibe coding should be ending. Their pitch, captured in the paper's title — &lt;em&gt;From Vibe Coding to Agentic Engineering&lt;/em&gt; — is that the next leap isn't about producing better snippets. It's about producing a model that can act like an actual junior engineer: one who reads a ticket, plans, edits files across a codebase, runs the tests, fixes what broke, and keeps going for hours without losing the plot.&lt;/p&gt;

&lt;p&gt;That's a much harder claim than "we beat the benchmark." So it's worth slowing down and asking what they actually changed under the hood, and what each of those changes is really &lt;em&gt;trying&lt;/em&gt; to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Difference Between a Snippet and a Shift
&lt;/h2&gt;

&lt;p&gt;Picture the difference between two requests at a hardware store.&lt;/p&gt;

&lt;p&gt;The first: "I need a piece of wood about this long." A clerk hands you a board. You take it home, cut it slightly wrong, come back, get another. This is vibe coding. Each interaction is short. Each output is small. Every mistake costs you another trip.&lt;/p&gt;

&lt;p&gt;The second: "I need to build a deck out back. Here's a photo of the yard. Can you handle it?" The contractor walks the site, pulls permits, schedules concrete, orders lumber, supervises the crew, fixes the railing when it splinters, and hands you keys two weeks later. This is &lt;strong&gt;agentic engineering&lt;/strong&gt;: not a single output, but a sustained process of planning, acting, observing, and self-correcting toward a goal that takes hundreds of small decisions to reach.&lt;/p&gt;

&lt;p&gt;Most chat-style AI today, even the best, is essentially the lumber clerk. It hands you boards. The GLM-5 team's central wager is that an AI that can act as the contractor — that can hold a goal in mind across a long project — is a genuinely different category of tool, and it requires changes deeper than just making the model bigger.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Old Architecture Started to Buckle
&lt;/h2&gt;

&lt;p&gt;To understand why GLM-5 is built the way it is, it helps to understand what was breaking.&lt;/p&gt;

&lt;p&gt;Modern language models work, very roughly, by reading every word in their context window and figuring out how each word relates to every other word. This is called &lt;strong&gt;attention&lt;/strong&gt;, and the easiest way to picture it is a meeting room where every participant has to make eye contact with every other participant before anyone can speak. With ten people, that's manageable. With a thousand people — say, a thousand pages of source code — it becomes absurd. The room grinds to a halt under the sheer combinatorial weight of who-must-look-at-whom.&lt;/p&gt;

&lt;p&gt;This is the long-context problem. If you want a model to keep an entire codebase, an entire bug report, and an entire history of previous attempts in mind at once — which a real engineer does effortlessly — the standard attention mechanism gets ruinously expensive. Training such a model costs a fortune. Running it costs almost as much.&lt;/p&gt;

&lt;p&gt;GLM-5 attacks this with something the paper calls &lt;strong&gt;DSA&lt;/strong&gt;, a form of sparse attention. The intuition is simple: in a real meeting, you don't need everyone to lock eyes with everyone else. You need the right people to listen to the right people at the right moment. Sparse attention is like a meeting facilitator who quietly tells most participants to ignore most other participants, and only routes the genuinely relevant connections. The math gets cheaper. The room can hold more people. The model can keep more of the codebase in its head at once without melting the GPUs.&lt;/p&gt;
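&lt;p&gt;If you want to picture the selection idea in code, here is a minimal sketch. I'm using a simple top-k rule as a stand-in for whatever selection mechanism DSA actually uses, and this toy version still computes every score before throwing most of them away, so it shows the idea rather than the cost savings.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, keep=64):
    """Each query attends only to its `keep` most relevant keys (illustrative)."""
    # q, k, v: (seq_len, dim). Full attention would weigh every seq_len x seq_len pair.
    scores = q @ k.T / (q.shape[-1] ** 0.5)
    top = torch.topk(scores, k=min(keep, scores.shape[-1]), dim=-1)
    masked = torch.full_like(scores, float("-inf"))     # drop most connections...
    masked.scatter_(-1, top.indices, top.values)        # ...keep only the strongest ones
    weights = F.softmax(masked, dim=-1)                 # dropped pairs get zero weight
    return weights @ v
&lt;/code&gt;&lt;/pre&gt;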

&lt;p&gt;Alongside that, GLM-5 keeps a structural choice from its predecessor: a &lt;strong&gt;Mixture of Experts&lt;/strong&gt; layout. Instead of one enormous model that has to know everything, imagine a hospital. A patient walks in; a triage nurse looks at the symptoms and routes them to cardiology, or dermatology, or psychiatry. Only that specialist examines the patient. The other specialists don't burn energy on a case that isn't theirs. Mixture of Experts works the same way: when a question comes in, a router quietly picks a small handful of internal "specialists" to do the work, while the rest of the model rests. You get the breadth of a giant institution at roughly the cost of running one department.&lt;/p&gt;
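&lt;p&gt;The routing idea fits in a few lines. The expert count, expert size, and top-2 selection below are illustrative defaults, not GLM-5's actual configuration; the point is only that each token wakes up a couple of specialists rather than the whole model.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy Mixture-of-Experts layer: a router sends each token to 2 of 8 experts."""
    def __init__(self, dim=512, n_experts=8, active=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)               # the "triage nurse"
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.active = active

    def forward(self, x):                                     # x: (tokens, dim)
        gate = torch.softmax(self.router(x), dim=-1)
        top = torch.topk(gate, self.active, dim=-1)           # which experts, weighted how
        out = torch.zeros_like(x)
        for slot in range(self.active):
            for e, expert in enumerate(self.experts):
                chosen = top.indices[:, slot] == e            # tokens routed to expert e
                if chosen.any():
                    out[chosen] += top.values[chosen, slot].unsqueeze(-1) * expert(x[chosen])
        return out                                            # most experts never ran at all
&lt;/code&gt;&lt;/pre&gt;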

&lt;p&gt;Stack DSA on top of MoE and you get a model that is paradoxically large and cheap — wide enough to know a lot, but lean enough that the running costs don't strangle the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Teaching It to Finish a Job
&lt;/h2&gt;

&lt;p&gt;Architecture is only half the story. The other half is how you train a model to &lt;em&gt;behave&lt;/em&gt; like an engineer who finishes things, rather than a model that confidently hallucinates a half-finished function and shrugs.&lt;/p&gt;

&lt;p&gt;The standard tool for this kind of behavioral shaping is &lt;strong&gt;reinforcement learning&lt;/strong&gt;. The image to hold in your head is a child learning to ride a bike. You don't teach them by lecture; you let them try, and you cheer when they balance and steady the handlebars when they wobble. Over thousands of attempts, they develop something like a feel — not a rule, but a learned instinct.&lt;/p&gt;

&lt;p&gt;For an AI, "trying" means generating a candidate solution, and "cheering or steadying" means scoring whether the solution worked and nudging the model's internal weights in response. The trouble is that, as conventionally implemented, this is brutally slow. The model produces an attempt. The training system grades it. The model produces the next attempt. Grade. Attempt. Grade. It's like a single piano teacher who insists on watching every note their student plays, in real time, before allowing the next note. Nobody learns very fast that way, and the teacher is exhausted.&lt;/p&gt;

&lt;p&gt;GLM-5's team rebuilt this into what they call &lt;strong&gt;asynchronous reinforcement learning&lt;/strong&gt;, which is less mysterious than it sounds. Asynchronous, in everyday English, just means "not in lockstep." Imagine instead of one piano teacher with one student, you have a music school. Some rooms are full of students improvising. Other rooms are full of teachers reviewing recordings of yesterday's sessions and writing feedback. Students don't have to wait for feedback to keep practicing. Teachers don't have to wait for students to keep grading. The whole school flows in parallel.&lt;/p&gt;

&lt;p&gt;In machine learning terms: the part of the system that &lt;em&gt;generates&lt;/em&gt; attempts and the part that &lt;em&gt;learns&lt;/em&gt; from those attempts no longer have to take turns. They run side by side. The training keeps moving while new attempts are still being produced. The result is dramatically more efficient post-training, and the team says they had to build new algorithms specifically to keep this asynchronous setup from going off the rails — because in the music-school analogy, you really don't want students learning from feedback so old that their own playing has already evolved past the criticisms.&lt;/p&gt;
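&lt;p&gt;Stripped to its skeleton, the shape of that setup looks like the toy below: a producer that never waits and a consumer that never waits, sharing a queue. This is only a schematic of the idea, not Z.ai's training infrastructure, and the real difficulty lives in the staleness corrections the comment gestures at.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import queue, random, threading, time

rollouts = queue.Queue(maxsize=8)      # finished attempts waiting to be graded
policy_version = [0]                   # a crude stand-in for shared model weights

def generator():
    """Keeps producing attempts; never pauses to wait for feedback."""
    while True:
        rollouts.put({"made_with": policy_version[0], "reward": random.random()})
        time.sleep(0.01)               # stand-in for a long, multi-step agent episode

def learner():
    """Keeps updating from whatever attempts have arrived, in parallel."""
    while True:
        batch = [rollouts.get() for _ in range(4)]
        # Some attempts were generated by an older policy version than the one
        # being updated; correcting for that staleness is the hard part.
        stale = policy_version[0] - min(r["made_with"] for r in batch)
        policy_version[0] += 1         # "gradient step" placeholder (stale stays bounded)

threading.Thread(target=generator, daemon=True).start()
threading.Thread(target=learner, daemon=True).start()
time.sleep(0.3)
print("policy updated", policy_version[0], "times while generation never paused")
&lt;/code&gt;&lt;/pre&gt;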

&lt;p&gt;That point about needing new algorithms matters more than it sounds. Most reinforcement learning is good at short tasks: did you get this single answer right? But agentic engineering isn't a single answer. It's a chess game's worth of decisions: open this file, edit this function, run these tests, read the failure, change strategy, try again, ask for help, refactor. The reward — "did the project actually work?" — only arrives at the very end, after dozens or hundreds of moves. This is what researchers mean by &lt;strong&gt;long-horizon&lt;/strong&gt; tasks, and traditional RL handles them about as gracefully as trying to teach someone chess by only telling them whether they won the whole game and never saying anything about individual moves.&lt;/p&gt;

&lt;p&gt;The asynchronous agent algorithms in GLM-5 are designed, broadly, to give the model better feedback over those long stretches — to act, in our analogy, like a chess coach who can replay a 60-move game and point out which decisions twenty moves ago set up the loss. That capacity to learn from messy, drawn-out, multi-step interactions is what they argue actually moves the needle from "writes nice snippets" to "completes nontrivial software tasks."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz7xd8i44k7eaj2j36r8c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz7xd8i44k7eaj2j36r8c.png" alt="Training pipeline" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 5: Overall training pipeline of GLM-5.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How They Claim to Know It Worked
&lt;/h2&gt;

&lt;p&gt;Whenever an AI lab releases a model, the next ritual is the benchmark photo-op: a chart showing your model winning. GLM-5's team produces theirs, and the headline numbers are real. Across eight major tests of agentic skill, reasoning, and coding — ranging from a notoriously hard exam called Humanity's Last Exam to SWE-bench Verified, a kind of standardized engineering practical — GLM-5 lands roughly on par with Claude Opus 4.5 and GPT-5.2, the leading proprietary systems, and ahead of Gemini 3 Pro on average.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzt7tuz8i1v22gv2dzimf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzt7tuz8i1v22gv2dzimf.png" alt="Benchmarks across reasoning and coding" width="800" height="476"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1: Results of GLM-5, GLM-4.7, Claude Opus 4.5, Gemini 3 Pro, and GPT-5.2 (xhigh) across eight agentic, reasoning, and coding benchmarks.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The number that the open-source community will care about most is on a separate index. On the Artificial Analysis Intelligence Index v4.0 — a composite that blends ten different evaluations — GLM-5 scores 50. Its predecessor scored 42. More importantly, no openly released model had ever crossed the 50 line before. That's the difference between a model you have to rent from somebody else's API and one you can, in principle, run on your own hardware and inspect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9cc2imgfb9w50w4mpnc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9cc2imgfb9w50w4mpnc.png" alt="Open-weights leadership on the Intelligence Index" width="800" height="236"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2: Artificial Analysis Intelligence Index v4.0 incorporates ten evaluations spanning reasoning, knowledge, coding, and instruction following.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But static benchmarks have a known weakness. They are essentially the SAT for AIs: useful, but gameable, and increasingly suspected of leaking into training data. The more interesting test is one where humans, in the wild, judge the outputs head-to-head. That's what LMArena, run out of UC Berkeley, does — millions of real users posting real prompts, blindly comparing answers from competing models. On both the text and code leaderboards there, GLM-5 is the top open model, and roughly tied with the leading closed systems. This is a softer kind of evidence than benchmark scores, but in some ways more honest. It's the difference between a chef winning a competition with a tasting menu and that same chef getting consistently ranked highest by ten thousand random diners eating dinner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6u2rqafzb6u0jeumh1zj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6u2rqafzb6u0jeumh1zj.jpeg" alt="LMArena standings on text and code" width="800" height="683"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 3: On LMArena, GLM-5 is the #1 open model in both Text Arena and Code Arena.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The most interesting numbers, though, sit in the long-horizon tests. Vending-Bench 2, for instance, asks a model to run a small simulated business — making decisions over many turns, pricing, restocking, and adapting — for a sustained period. CC-Bench-V2 evaluates it on extended coding tasks that can't be solved in a single shot. These are the tests that come closest to the contractor-versus-clerk distinction. GLM-5's gains here, more than its score on any one-shot exam, are what the paper frames as the real evidence that something has shifted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fifw7p3j0gbluqfb7gucv.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fifw7p3j0gbluqfb7gucv.jpeg" alt="Long-horizon task performance" width="800" height="500"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 4: Results on long-horizon tasks. Left: Vending-Bench 2; Right: CC-Bench-V2.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Looks Like If It's Real
&lt;/h2&gt;

&lt;p&gt;Strip away the leaderboards and imagine the everyday consequences if a model like this works as advertised — not in a demo, but in someone's actual workflow.&lt;/p&gt;

&lt;p&gt;A solo developer who maintains a small open-source library files her own bug report on Sunday night. She doesn't open her editor. She forwards the report to an agent and goes to bed. By morning, the agent has reproduced the bug, traced it through three files she barely remembers writing, drafted a fix, run the test suite, noticed that the fix breaks an unrelated module, written a second patch, written tests for the regression it caught, and opened a pull request with a clear human-readable explanation. She reviews it over coffee. Most of the work she would have done on a Saturday is gone — not because the AI wrote slick code, but because it sustained attention across a small, real, multi-step task.&lt;/p&gt;

&lt;p&gt;Or imagine a hospital IT team that has spent years putting off the migration of an internal scheduling system from a deprecated language. The migration isn't hard, exactly. It's tedious and long and requires reading thousands of files and not making mistakes that quietly delete patient appointments. An agent can chip away at that for weeks under supervision, with each step inspectable, where a human would burn out by day four. The point isn't that AI replaces the engineers; it's that it absorbs the kind of grinding work that has historically forced human engineers to either skip it or quit.&lt;/p&gt;

&lt;p&gt;This is what "agentic engineering" actually means in a non-hype register: not magic, not autonomy in any deep sense, but a system that can keep at a defined job long enough to make a dent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I'd Stay Skeptical
&lt;/h2&gt;

&lt;p&gt;There are several things this paper does not, and probably cannot, fully prove.&lt;/p&gt;

&lt;p&gt;The first is the gap between benchmarks and the messy world. Even Vending-Bench-style long-horizon tests are simulations. Real software engineering involves codebases full of undocumented quirks, broken conventions, and tribal knowledge that lives only in a Slack channel from 2022. A model that can finish a clean ticket on a clean repo may still flounder on the ones that actually pile up in working teams. The paper is honest about this, framing its long-horizon results as movement in the right direction rather than arrival.&lt;/p&gt;

&lt;p&gt;The second is that "agentic" is, frankly, a word being asked to do an enormous amount of work right now. There's a meaningful difference between an agent that completes long tasks because it's been carefully coached on millions of examples, and an agent that is in any general sense reasoning about goals. The paper's framing leans toward the more impressive interpretation. The evidence supports the more modest one.&lt;/p&gt;

&lt;p&gt;The third — and this is the open-source community's standing skepticism — is that benchmark dominance is a moving target, and the last few years have shown that the gap between proprietary leaders and open challengers can shrink in months and reopen just as fast. GLM-5 sitting at the top today says less about a permanent shift than about a particular team executing well at a particular moment. What matters more, long-term, is whether the architectural and training ideas — sparse attention to manage long contexts, mixture-of-experts for cheap breadth, asynchronous reinforcement learning to teach long-horizon behavior — propagate. They probably will, because they're sensible, and because the cost pressures pushing toward them aren't going away.&lt;/p&gt;

&lt;p&gt;What's worth taking seriously, regardless of where the leaderboards land in six months, is the quiet repositioning of what an AI is for. The vibe coder asks for a snippet and gets one. The agentic engineer takes a task and brings it back finished. Whether or not GLM-5 is the model that proves that distinction matters, the distinction itself is likely the one the field will be organized around for the next several years. The clerks have been useful. The contractors, if they really are coming, are going to change what "asking for help" means.&lt;/p&gt;

&lt;p&gt;📄 &lt;a href="https://arxiv.org/abs/2602.15763" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2602.15763&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;tags: ai, llm, coding, agents&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🇰🇷 Korean version on Velog: &lt;a href="https://velog.io/@tkdnel1002/xtbp2x13" rel="noopener noreferrer"&gt;https://velog.io/@tkdnel1002/xtbp2x13&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>python</category>
    </item>
    <item>
      <title>The Machine That Reads, Watches, Listens — All at Once</title>
      <dc:creator>Bongho Tae</dc:creator>
      <pubDate>Thu, 30 Apr 2026 15:06:42 +0000</pubDate>
      <link>https://forem.com/xoqhdgh1002/the-machine-that-reads-watches-listens-all-at-once-2c67</link>
      <guid>https://forem.com/xoqhdgh1002/the-machine-that-reads-watches-listens-all-at-once-2c67</guid>
      <description>&lt;p&gt;Imagine you're trying to help a friend over the phone book a flight. They send you a screenshot of the airline's website, then a voice memo describing what they want, then a short video they recorded of the confusing pop-up that keeps blocking the booking button. You glance at the picture, listen to the memo, watch the clip, and tell them: &lt;em&gt;Click the small "X" in the upper right of the gray box, then scroll down past the seat selector.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You did something extraordinary in that moment, and you didn't notice. You combined four different streams of information — pixels, sound waves, motion over time, and the memory of the conversation up to that point — and you produced one coherent answer. You did it in maybe ten seconds.&lt;/p&gt;

&lt;p&gt;For most of the past decade, getting a computer to do this has been the kind of problem that quietly drives researchers to drink. Not because any single piece is impossible — there are systems that read text well, systems that recognize images, systems that transcribe audio. The hard part is the &lt;em&gt;all at once&lt;/em&gt;. Each of those streams arrives in a completely different format. Pixels are arranged in grids. Sound is a wiggling line over time. Text is a sequence of symbols. Video is pixels arranged in a grid that &lt;em&gt;also&lt;/em&gt; changes over time. Forcing one system to understand all of them, and to reason across them coherently, has felt a bit like asking a panel of monolingual translators to collaboratively write a poem, when none of them speak the same language.&lt;/p&gt;

&lt;p&gt;NVIDIA's new paper, &lt;em&gt;Nemotron 3 Nano Omni&lt;/em&gt;, is an attempt at that poem. It is one model — one set of digital "brain weights" — that natively accepts text, images, video, and now audio, and tries to reason across them as a single unified mind. The paper is, in a sense, a progress report from the front lines of a research direction that has slowly been changing what computers can do for ordinary people. It's worth understanding, because the limitation it solves is one we've all bumped into.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Old Way: Four Specialists in Four Rooms
&lt;/h2&gt;

&lt;p&gt;Before models like this one, the standard approach to a multimodal problem was assembly-line specialization. You'd have a model that "saw" the image and described it in words. Another model that transcribed the audio into text. Another that summarized the video. And then you'd hand all those translated descriptions to a fourth model — a language model — and ask it to make sense of the situation.&lt;/p&gt;

&lt;p&gt;The analogy is a courtroom where the judge is blindfolded, and four witnesses each describe what they saw, heard, or read. The judge can only work with their secondhand reports. If a witness misses something, or paraphrases away an important nuance, the judge has no way to go back to the original. &lt;em&gt;"He pointed angrily"&lt;/em&gt; doesn't capture whether the gesture was at someone in particular or just generally toward the door. The judge is stuck with the summary.&lt;/p&gt;

&lt;p&gt;What researchers wanted instead was a judge who could see, hear, and read the evidence directly — the same mind processing all of it.&lt;/p&gt;

&lt;p&gt;That's what &lt;em&gt;omni-modal&lt;/em&gt; models are. The "omni" part is the ambition: one mind, every channel.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Crowd of Specialists, but Only the Right One Speaks
&lt;/h2&gt;

&lt;p&gt;The first big architectural choice in Nemotron 3 Nano Omni is something called a &lt;strong&gt;Mixture-of-Experts&lt;/strong&gt;, or MoE, backbone. The label sounds technical, but the idea is intuitive once you put it in the right setting.&lt;/p&gt;

&lt;p&gt;Picture a small-town doctor's office where the family physician has to handle everything — broken bones, allergic reactions, mental health, pediatric concerns, geriatric medicine. They do their best, but they're stretched thin. Now picture a hospital with twenty specialists. When you walk in with a sore ankle, the cardiologist, the dermatologist, and the psychiatrist don't bother coming to your appointment. The receptionist routes you to the orthopedist, who handles the case efficiently because that's all they do all day. The hospital, as a whole, has more accumulated expertise than the family doctor — but you only consume the time of the relevant specialist.&lt;/p&gt;

&lt;p&gt;Mixture-of-Experts works exactly this way inside the model. The "30B-A3B" name in the paper is a coded way of saying: there are roughly 30 billion total parameters of expertise stored in the model, but for any given chunk of work, only about 3 billion get activated. There's a built-in router — the receptionist — that looks at each piece of incoming data and sends it to the experts most equipped to handle it. The result is a model that has the &lt;em&gt;knowledge&lt;/em&gt; of a much larger system but the &lt;em&gt;speed&lt;/em&gt; of a smaller one.&lt;/p&gt;

&lt;p&gt;This matters enormously for things like video, where the model is processing thousands of small pieces of information per second. If every piece woke up every neuron, the system would crawl. With MoE, only the relevant neurons fire.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Model Sees: Through Two Translators
&lt;/h2&gt;

&lt;p&gt;Before a model can reason about a photograph, the photograph has to be turned into something the model's mathematical machinery can chew on. The Nemotron paper uses two specialized "translators" sitting at the gates: a vision encoder called &lt;strong&gt;C-RADIOv4-H&lt;/strong&gt; and an audio encoder called &lt;strong&gt;Parakeet-TDT&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Think of these as people who have spent their entire careers learning how to describe one specific kind of thing. The vision encoder is an art critic who has stared at hundreds of millions of photographs over their training and can convert any new image into a kind of compact poetic shorthand — not English words, exactly, but a vector of numbers that captures the essence of what's in the picture. The audio encoder is the same idea but for sound: a music historian who has listened to millions of recordings and can compress any clip into a similar compact summary.&lt;/p&gt;

&lt;p&gt;The cleverness is that &lt;em&gt;both translators write their summaries in the same code&lt;/em&gt;. So when the main reasoning model receives the encoded image and the encoded audio, it doesn't see them as foreign objects. They've been pre-translated into the model's native tongue — the language of high-dimensional vectors. The reasoning brain in the middle can then think about the sound of someone saying "look at this" and the image they're pointing at as one continuous thought, instead of two separate problems.&lt;/p&gt;
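&lt;p&gt;Mechanically, "pre-translated into the same code" mostly means projecting each encoder's output into the width the language backbone expects and laying the results out as one sequence. The dimensions below are made up for illustration; they are not the real C-RADIOv4-H or Parakeet-TDT sizes.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn as nn

vision_tokens = torch.randn(1, 196, 1280)   # encoded image patches (illustrative sizes)
audio_tokens  = torch.randn(1, 300, 1024)   # encoded audio frames
text_tokens   = torch.randn(1, 40, 2048)    # embedded words of the prompt

# Each modality gets a small projection into the backbone's native width,
# so the reasoning model sees one continuous stream of look-alike vectors.
to_lm_vision = nn.Linear(1280, 2048)
to_lm_audio  = nn.Linear(1024, 2048)

sequence = torch.cat([to_lm_vision(vision_tokens),
                      to_lm_audio(audio_tokens),
                      text_tokens], dim=1)
print(sequence.shape)   # torch.Size([1, 536, 2048]) -- one stream, three senses
&lt;/code&gt;&lt;/pre&gt;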

&lt;h2&gt;
  
  
  Don't Cut the Painting Into Puzzle Pieces
&lt;/h2&gt;

&lt;p&gt;One of the more elegant changes in this version of the model is in how it handles images of varying sizes — particularly the extremely tall, narrow, or wide images we encounter constantly: a screenshot of a long article, a panoramic photo, a scan of a legal document.&lt;/p&gt;

&lt;p&gt;The previous standard approach was called &lt;strong&gt;tiling&lt;/strong&gt;. Imagine you have a massive painting and your eyes can only focus on a small square at a time. Tiling means cutting the painting into many small squares and inspecting each one separately. The problem is obvious: you lose the composition. The relationship between the figure in the lower left and the doorway in the upper right is gone, because you never saw them together. For a document, this might mean the model loses track of how a chart at the top of page three relates to the paragraph that introduces it.&lt;/p&gt;

&lt;p&gt;Nemotron 3 replaces tiling with &lt;strong&gt;dynamic resolution&lt;/strong&gt;, which is closer to how a human actually looks at things. Instead of slicing the image into uniform pieces, the model adapts its viewing strategy to the image's actual shape — looking at the whole thing while preserving the relationships between regions. It's the difference between studying a painting through a keyhole, square by square, and stepping back to take it in.&lt;/p&gt;

&lt;p&gt;The result, the paper claims, is significant gains on real-world document understanding — exactly the situations where awkward aspect ratios matter most.&lt;/p&gt;

&lt;h2&gt;
  
  
  Time, Compressed
&lt;/h2&gt;

&lt;p&gt;Video is uniquely difficult because it adds a fourth dimension: time. A one-minute video at thirty frames per second is 1,800 separate images. Naively, the model would have to process every one of them, even though most adjacent frames are nearly identical; the wasted effort is enormous.&lt;/p&gt;

&lt;p&gt;Nemotron 3 Nano Omni tackles this with something called &lt;strong&gt;Conv3D-based temporal compression&lt;/strong&gt;. The technical name doesn't help; the analogy does. Think of a film editor who has been told to summarize a long take. They watch the eight seconds of an actor walking down a hallway and they don't store all 240 frames — they store &lt;em&gt;the gist&lt;/em&gt;. The walk. The pace. The direction. They compress eight seconds of nearly redundant images into a single conceptual summary. A 3D convolution does the mathematical version of this: it slides over short windows of consecutive frames and merges them into a single representation that captures both the visual content and the motion.&lt;/p&gt;
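&lt;p&gt;In code, the compression step is a single layer. The kernel, stride, and channel sizes below are arbitrary choices for illustration, not the values the paper uses.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn as nn

frames = torch.randn(1, 3, 240, 224, 224)    # (batch, channels, time, height, width)

# Merge every 4 consecutive frames (and each 14x14 pixel patch) into one summary vector.
compress = nn.Conv3d(in_channels=3, out_channels=64,
                     kernel_size=(4, 14, 14), stride=(4, 14, 14))

summary = compress(frames)
print(summary.shape)   # torch.Size([1, 64, 60, 16, 16]) -- 240 frames became 60 summaries
&lt;/code&gt;&lt;/pre&gt;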

&lt;p&gt;The model also uses something called &lt;strong&gt;Efficient Video Sampling&lt;/strong&gt;, which is the more aggressive cousin of the same idea: it identifies sequences of frames where almost nothing is changing — a static shot of a speaker's face, say — and quietly skips over the redundancy. The paper reports that this halves the number of "tokens" (the model's smallest units of attention) it needs to process a video, which translates directly into faster, cheaper responses.&lt;/p&gt;
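&lt;p&gt;A crude way to picture the skipping is a change detector: keep a frame only when it differs enough from the last frame you kept. The real Efficient Video Sampling is more sophisticated than this threshold trick, so treat the sketch as the flavor, not the method.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

def sample_frames(frames, threshold=0.05):
    """Keep a frame only if it changed enough since the last kept frame (illustrative)."""
    kept = [frames[0]]
    for frame in frames[1:]:
        change = (frame - kept[-1]).abs().mean()
        if change.gt(threshold):       # nearly identical frames get skipped
            kept.append(frame)
    return torch.stack(kept)

video = torch.randn(1800, 3, 32, 32) * 0.01   # a mostly static shot (tiny frames)
video[900:] += 1.0                            # the scene changes halfway through
print(sample_frames(video).shape[0], "frames kept out of 1800")
&lt;/code&gt;&lt;/pre&gt;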

&lt;p&gt;The everyday analogy is reading a long email where someone has copy-pasted the same disclaimer three times. A practiced reader skips the repeats. The model is learning to do the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Working Memory You Can Actually Use
&lt;/h2&gt;

&lt;p&gt;Earlier language models had a working memory of around 4,000 to 8,000 tokens — enough for, say, a short essay. The previous version of this model could hold 128,000 tokens, roughly equivalent to a short novel. Nemotron 3 Nano Omni now holds &lt;strong&gt;256,000 tokens&lt;/strong&gt; — closer to a long novel, or a feature-length film with all of its dialogue, or hours of conversation, or a full software codebase.&lt;/p&gt;

&lt;p&gt;The reason this matters is subtle. Once a model can hold an entire long document in mind at once, the nature of what you can ask it changes. You're no longer asking it to summarize chunks; you're asking it to reason across the whole thing. &lt;em&gt;Where in this 200-page contract does the indemnification clause contradict the warranty section?&lt;/em&gt; That question requires holding both sections simultaneously, and remembering everything between them. With a small working memory, the model has to be fed pieces sequentially and forgets the start by the time it reaches the end.&lt;/p&gt;

&lt;p&gt;A useful comparison: trying to follow the plot of &lt;em&gt;The Wire&lt;/em&gt; if every five minutes someone wiped your memory of the previous five minutes. You could describe individual scenes beautifully. You could not, in any meaningful sense, talk about the show.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Curriculum: Teach Slowly, Don't Erase
&lt;/h2&gt;

&lt;p&gt;Training a single model to handle text, images, video, and audio without it getting confused — and without it suddenly becoming worse at what it used to do well — is a quietly enormous problem. There's a phenomenon researchers call &lt;strong&gt;catastrophic forgetting&lt;/strong&gt;, where teaching a neural network something new causes it to overwrite what it knew before. Imagine a polyglot who learns Italian and then realizes their Spanish has mysteriously evaporated. The brain has limited storage and the new lessons crowded out the old ones.&lt;/p&gt;

&lt;p&gt;The Nemotron team's answer is a &lt;strong&gt;multi-stage training recipe&lt;/strong&gt;. They don't dump every modality on the model at once. They start with a strong text-only foundation — language, reasoning, math — and then add modalities in stages, like a curriculum. First text. Then images. Then video. Then audio. Each new stage is introduced gently, with careful balancing, so the older skills aren't trampled.&lt;/p&gt;

&lt;p&gt;The closest human analogy is the way medical schools teach. You don't begin a first-year medical student in surgery. They learn anatomy, then physiology, then pathology, then clinical examination, and only after a long ramp do they pick up a scalpel. By the time they're operating, the foundational knowledge is consolidated and won't get knocked loose. The Nemotron training schedule is doing the same thing — protecting older capabilities while layering on new ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quantization, or: How to Shrink the Model Without Lobotomizing It
&lt;/h2&gt;

&lt;p&gt;The paper also releases versions of the model in three numerical formats: &lt;strong&gt;BF16&lt;/strong&gt;, &lt;strong&gt;FP8&lt;/strong&gt;, and &lt;strong&gt;FP4&lt;/strong&gt;. The numbers refer to how many bits are used to store each parameter — basically, how precisely the model's "knowledge" is recorded.&lt;/p&gt;

&lt;p&gt;The analogy here is image compression. A high-resolution photograph is sharp and detailed but enormous. Compress it to a smaller file and you keep most of the image, but very fine details get rounded off. Push compression too far and the image becomes blocky and unrecognizable.&lt;/p&gt;

&lt;p&gt;Quantization is the same trade-off applied to a model's brain. BF16 is the high-resolution version: precise, but heavy. FP4 is the aggressively compressed version: it fits onto cheaper hardware and runs faster, with a carefully managed loss of nuance. The paper's contribution is showing that, with the right techniques, the FP4 version retains nearly all the capability of the larger version. That matters because it means people without supercomputers can run the model on more modest hardware.&lt;/p&gt;
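&lt;p&gt;You can feel the trade-off in a few lines of PyTorch (a recent version, which includes a float8 dtype; FP4 has no built-in dtype, so it's left out here). This only measures rounding error on random weights, not the paper's calibrated quantization recipe.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

weights = torch.randn(4096, 4096)            # one layer's weights in full precision

bf16 = weights.to(torch.bfloat16).float()               # high-resolution copy
fp8  = weights.to(torch.float8_e4m3fn).float()          # aggressively compressed copy

for name, compressed, bytes_per_param in [("bf16", bf16, 2), ("fp8", fp8, 1)]:
    err = (weights - compressed).abs().mean()
    size = weights.numel() * bytes_per_param
    print(name, "mean rounding error:", float(err), "stored size:", size, "bytes")
&lt;/code&gt;&lt;/pre&gt;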

&lt;h2&gt;
  
  
  What Becomes Possible
&lt;/h2&gt;

&lt;p&gt;The paper's headline claim — three times higher single-stream throughput than a competitor model, nine times higher per-GPU throughput at a fixed responsiveness target — is the kind of number that doesn't quite land emotionally. So let me translate.&lt;/p&gt;

&lt;p&gt;Picture an accessibility tool for blind users: a phone app that processes the live camera feed, recognizes text on signs, identifies the voice of a friend approaching, and quietly narrates the scene in real time. The bottleneck right now isn't whether such a system can be built. It's whether it can run smoothly on a phone, in real time, without melting the battery. The kind of efficiency gains in this paper are what move that experience from a research demo to a product.&lt;/p&gt;

&lt;p&gt;Or picture a customer-support agent that watches you struggle with a screenshare, listens to your frustrated voice, reads the error log you've pasted, and walks you through a fix — all without handing you off to a human. The "agentic GUI use" benchmark in the paper, which includes navigating real software environments, is precisely about that capability.&lt;/p&gt;

&lt;p&gt;Or picture a model that watches an entire two-hour surgery video, listens to the surgeon's narration, and produces a teaching annotation: &lt;em&gt;here, at minute 47, the tissue is unusual; the surgeon makes an adjustment that would not be in the textbook&lt;/em&gt;. Long audio-video comprehension, which the paper highlights, is what makes this realistic rather than fantasy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Limits
&lt;/h2&gt;

&lt;p&gt;A few things should temper the excitement.&lt;/p&gt;

&lt;p&gt;The benchmarks the paper reports are, by design, snapshots. A model can score brilliantly on document-understanding benchmarks and still trip over a real document with a coffee stain or an unusual layout. Real-world performance always lags benchmark performance, sometimes by a lot.&lt;/p&gt;

&lt;p&gt;The paper claims leadership on certain comparisons but is also competing in a fast-moving field. By the time you read this, other research groups will have shipped their own models with their own clever tricks. The architectural choices here — Mixture-of-Experts, dynamic resolution, temporal compression — are widely seen as the right direction, but no one has converged yet on a single winning recipe.&lt;/p&gt;

&lt;p&gt;And while the paper's release of model weights and training data is generous, "open" in this corner of AI is always a partial gesture. The compute required to train models at this scale remains the province of a small number of labs. Researchers can fine-tune and study what NVIDIA has released; very few can replicate it from scratch.&lt;/p&gt;

&lt;p&gt;Still, the trajectory is clear. We are slowly building machines that can perceive the world the way we do — through every channel at once — and reason about what they perceive without first translating everything into one impoverished medium. The poem is, finally, starting to take shape.&lt;/p&gt;

&lt;p&gt;📄 &lt;a href="https://arxiv.org/abs/2604.24954" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2604.24954&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;tags: ai, multimodal, nvidia, nemotron&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🇰🇷 Korean version on Velog: &lt;a href="https://velog.io/@tkdnel1002/xdlpod51" rel="noopener noreferrer"&gt;https://velog.io/@tkdnel1002/xdlpod51&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>python</category>
    </item>
    <item>
      <title>Learning to See a Human Being</title>
      <dc:creator>Bongho Tae</dc:creator>
      <pubDate>Wed, 29 Apr 2026 15:06:31 +0000</pubDate>
      <link>https://forem.com/xoqhdgh1002/learning-to-see-a-human-being-5gnb</link>
      <guid>https://forem.com/xoqhdgh1002/learning-to-see-a-human-being-5gnb</guid>
      <description>&lt;h2&gt;
  
  
  A Glance and All That It Contains
&lt;/h2&gt;

&lt;p&gt;Imagine you're the costume designer for a major film, and the director has just handed you a single photograph from the 1940s — a black-and-white still of the lead actress. Your job is to recreate her exact look: not just the silhouette of the dress, but the precise way the fabric catches the light, the specific shade where her collarbone meets her neckline, the way individual strands of her hair fall across her shoulder. You pore over that photograph for hours, mentally answering dozens of separate questions: Where does her left arm end and the fabric begin? What angle does her wrist make? Is that texture wool or silk?&lt;/p&gt;

&lt;p&gt;Now imagine asking a machine to answer all of those questions simultaneously, for any photo of any person, in less than a second.&lt;/p&gt;

&lt;p&gt;This is the challenge at the heart of Sapiens2, a new system released by researchers at Meta. It belongs to a category of software called "human-centric vision models" — programs designed specifically to understand images of people, at a level of detail that borders on the forensic. But what makes it genuinely interesting is less what it does than how it was taught, and the insight about learning itself that makes the approach work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Kinds of Knowing, and Why Each Fails Alone
&lt;/h2&gt;

&lt;p&gt;Before you can appreciate what's clever here, you need to understand a tension that runs through most of modern AI research: the difference between knowing details and knowing meaning.&lt;/p&gt;

&lt;p&gt;Consider two ways you might study a language you don't speak. The first method: spend years doing crossword puzzles in that language. No translations, no dictionaries — just fill in missing letters, guided by structure, repetition, and pattern. Eventually, you'd develop a deep feel for how letters combine, which syllables cluster at word endings, what tends to follow a certain prefix. Your knowledge would be granular, intimate, almost tactile.&lt;/p&gt;

&lt;p&gt;The second method: spend years looking at photographs with labels written in that language. A dog, a tree, a celebration. You'd gradually learn what the words mean — the semantic content — but you might remain vague on fine distinctions, having never wrestled with the internal texture of the written form.&lt;/p&gt;

&lt;p&gt;Modern AI uses both methods, and each has a formal name. "Masked Image Modeling," abbreviated MIM (its best-known incarnation is the masked autoencoder, or MAE), is the crossword approach. The system is shown images with random patches blanked out — as though a photograph had 75 percent of its pixels replaced by gray squares — and asked to reconstruct what's missing. To do this well, it must develop extraordinarily precise intuitions about how visual details relate to each other: if the surrounding area shows a particular skin tone and hair texture, the missing patch probably contains something consistent with those clues.&lt;/p&gt;

&lt;p&gt;The other approach, "Contrastive Learning," is more like the labeled photograph method, but with a specific twist. The system is shown pairs of images and asked, in effect: are these two views of the same thing, or different things? If shown a person from two different angles, it should say "same." If shown two different people, "different." To succeed at this game, the system must develop higher-level concepts — identity, posture, context. It learns meaning rather than texture.&lt;/p&gt;

&lt;p&gt;The problem is that each method, practiced alone, develops a specific blind spot.&lt;/p&gt;

&lt;p&gt;The crossword-learner becomes expert at the fine grain of images but can struggle to make higher-level sense of them. It might fill in a missing patch of a hand with perfect accuracy while remaining confused about whether the hand is raised in greeting or threat. The concept-sorter, meanwhile, builds meaning at the expense of detail.&lt;/p&gt;

&lt;p&gt;The contrastive approach has a subtler hazard as well. To teach a system that two views of the same person are "the same," researchers typically show it deliberately distorted versions of the same image — colors shifted, portions cropped, contrast altered. The model learns to treat these distortions as irrelevant noise. But "learns to ignore" and "learns not to notice" are the same operation. Train a system to discount color variation, and it loses the ability to register that someone's jacket is a very particular shade of burgundy — which matters enormously if your application is photo-realistic avatar creation, where that shade is the entire point. The researchers call this hazard "representation drift": the model's learned sense of an image gradually drifts away from the actual visual evidence, like a portrait painter who has been told so many times that "lighting doesn't matter" that they stop seeing light at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Make the Two Approaches Keep Each Other Honest
&lt;/h2&gt;

&lt;p&gt;Sapiens2's core insight is to run both forms of learning simultaneously and let each one constrain the other.&lt;/p&gt;

&lt;p&gt;The reconstruction task — the crossword — keeps the model tethered to actual pixels, actual textures, the real visual evidence in the photograph. The contrastive task pulls it toward meaning, organizing its observations into concepts that persist across different views of the same thing. Running them together prevents either form of blindness from taking hold.&lt;/p&gt;
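&lt;p&gt;A toy version of "run both at once" is just two losses added together. The sketch below uses a plain contrastive loss for readability; Sapiens2's actual recipe is the self-distilled variant described a little further down, so treat the names and shapes here as assumptions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def combined_loss(encoder, decoder, view_a, view_b, mask):
    """view_a, view_b: two gentle augmentations of the same photos, (batch, 3, H, W).
    mask: boolean tensor shaped like view_a, True where pixels were blanked out."""
    feat_a = encoder(view_a)
    feat_b = encoder(view_b)

    # Crossword objective: rebuild the hidden pixels from what remains.
    recon = decoder(feat_a)                      # decoder maps features back to pixel space
    mim_loss = F.mse_loss(recon[mask], view_a[mask])

    # Concept objective: two views of the same person should land close together.
    za = F.normalize(feat_a.flatten(1), dim=-1)
    zb = F.normalize(feat_b.flatten(1), dim=-1)
    logits = za @ zb.T / 0.1                     # temperature 0.1, illustrative
    contrastive_loss = F.cross_entropy(logits, torch.arange(za.shape[0]))

    return mim_loss + contrastive_loss           # each term keeps the other honest
&lt;/code&gt;&lt;/pre&gt;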

&lt;p&gt;Crucially, the researchers avoided aggressive color distortions in their contrastive training. Rather than teaching the model to be indifferent to color by showing it wildly recolored versions of the same scene, they used more conservative transformations. The logic is almost ethical in its simplicity: don't train the model to ignore what you later need it to notice.&lt;/p&gt;

&lt;p&gt;One further ingredient is borrowed from recent advances in large language models: a "teacher-student" architecture in which the model essentially teaches itself through accumulated experience. Think of a student who, when encountering a new problem, can consult a running archive of everything they've understood so far — not just their original textbook, but the notes from every problem they've previously worked through. The student's current perceptions and their accumulated prior understanding are kept in productive tension, each sharpening the other. The technical term for this is "self-distilled contrastive objectives," which sounds forbidding, but the underlying logic is simply the productive friction between fresh perception and settled understanding.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Billion Human Photographs
&lt;/h2&gt;

&lt;p&gt;The other dimension of Sapiens2's advance is simpler to describe but staggering in scale. Before being specialized for any particular task, the system was trained on one billion images of people.&lt;/p&gt;

&lt;p&gt;One billion photographs. If you viewed them at one per second, without sleeping, it would take thirty-one years. The dataset spans ages from infancy to old age, every ethnicity and body type, every imaginable setting — weddings and construction sites, hospital beds and festival crowds — capturing the enormous variety of human appearance as it actually occurs in the world, not as it looks in controlled studio conditions.&lt;/p&gt;

&lt;p&gt;This matters because AI systems are only as general as the data they've seen. A model trained on studio-lit photographs of professional athletes would struggle with an image of an elderly woman gardening in late-afternoon shadow. By ingesting a billion photographs of people in genuine conditions, Sapiens2 builds the kind of rough familiarity that allows it to handle almost anything that walks in front of a camera — without being given any explicit rules about what a human being looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the System Actually Sees
&lt;/h2&gt;

&lt;p&gt;The outputs Sapiens2 can produce span a remarkable range, and each demands its own form of precision.&lt;/p&gt;

&lt;p&gt;"Pose estimation" — detecting body position — sounds modest until you learn that the system tracks 308 specific points simultaneously. Not just elbows and knees: each individual finger joint, the corner of each eye, the precise tilt of the nose. Resolving 308 distinct points accurately within a single photograph means making spatial distinctions of just a few pixels, repeatedly, without error.&lt;/p&gt;

&lt;p&gt;"Body-part segmentation" is different again: rather than marking specific points, it labels every single pixel in the image by what body part it depicts. Hair, lips, individual fingernails, earrings — each pixel receives a category. The performance numbers here are the paper's most dramatic improvement, with Sapiens2 roughly doubling the accuracy of all previous dedicated approaches.&lt;/p&gt;

&lt;p&gt;"Normal estimation" addresses something more abstract. For every point on a surface — every patch of skin, every fold of fabric — the model estimates the direction that surface is facing. Imagine pressing a tiny compass needle perpendicular to every point along a curved cheek: the needles point outward in slightly different directions as they trace the contour, rotating as you move across the bridge of the nose, swiveling differently around each nostril. Getting this right is essential for any application that places virtual objects convincingly into real scenes, because realistic lighting requires knowing exactly which way each surface is angled relative to the light source.&lt;/p&gt;

&lt;p&gt;"Pointmap estimation" goes further still. Instead of relative depth — a simple "this is in front of that" — it asks for absolute three-dimensional coordinates for every pixel. Where in actual space is this fingertip? This requires the model to implicitly reason about camera geometry, reverse-engineering how the camera was positioned and how far it was zoomed, from the image alone. Sapiens2 outperformed all existing methods at this task, including systems built specifically for geometry.&lt;/p&gt;

&lt;p&gt;"Albedo estimation" is the most philosophically interesting capability. Light interacts with surfaces in complex ways: the same red fabric looks vivid under sunlight and muddy under fluorescent tubes. Albedo is the intrinsic color of a surface — what it would look like if lighting were perfectly neutral, its true reflective identity. Estimating albedo from a photograph means separating "what color is this surface really?" from "what light was falling on it when the photo was taken?" This matters enormously for CGI and augmented reality: to insert a digital character convincingly into a real scene, you need to know not just how the scene is currently lit, but what the character's skin would genuinely look like standing there.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Resolution Problem, and a Structural Solution
&lt;/h2&gt;

&lt;p&gt;One of the paper's less-discussed contributions involves image resolution. Earlier systems worked at "1K" — roughly 1,024 pixels per side. Sapiens2 includes variants that operate at "4K," four times finer in each dimension, meaning sixteen times more pixels in total.&lt;/p&gt;

&lt;p&gt;This matters for non-obvious reasons. At 1K, a photograph of a human face devotes a few thousand pixels to the eyes. At 4K, it devotes tens of thousands — enough to resolve individual lashes, the precise curvature of a pupil boundary, fine surface vessels. For applications requiring faithful reconstruction — medical imaging, forensic analysis, detailed digital doubles — this resolution gap isn't aesthetic; it determines what information is physically present in the data.&lt;/p&gt;

&lt;p&gt;Processing 4K images, though, creates a computational challenge that scales much faster than the resolution increase itself. Modern vision AI works using "attention mechanisms," which function roughly like a very thorough cross-referencing system: every region of an image checks its relationship to every other region before making a prediction. For a 1K image divided into small patches, this is manageable. For 4K, the number of possible pairwise relationships becomes staggering — more than any current hardware can handle at once.&lt;/p&gt;

&lt;p&gt;The solution the researchers adopted, called "windowed attention," divides the image into smaller neighborhoods and processes attention within each window, then allows information to propagate gradually across the whole image. It is the difference between a stadium debate — everyone shouting at everyone simultaneously — and a structured town hall, where people first confer with their immediate neighbors, and delegates then carry summaries to the groups nearby. Local coherence is established first; global coherence emerges through structured exchange. The result is computationally tractable while still allowing the model to reason about large-scale spatial patterns.&lt;/p&gt;
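&lt;p&gt;The neighborhood trick is easy to see in shapes. The sketch below only does the "confer with your immediate neighbors" half; it assumes the token count divides evenly into windows, and a real model would interleave these local layers with some mechanism that lets information travel between windows.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def windowed_attention(x, window=16):
    """Attend only within local windows of the patch sequence (illustrative)."""
    tokens, dim = x.shape
    x = x.reshape(tokens // window, window, dim)        # split into neighborhoods
    scores = x @ x.transpose(1, 2) / (dim ** 0.5)       # each token vs. its neighbors only
    out = F.softmax(scores, dim=-1) @ x
    return out.reshape(tokens, dim)

patches = torch.randn(4096, 256)        # a 64 x 64 grid of image patches, flattened
print(windowed_attention(patches).shape)
# Full attention: one 4096 x 4096 score matrix. Windowed: 256 tiny 16 x 16 ones.
&lt;/code&gt;&lt;/pre&gt;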

&lt;h2&gt;
  
  
  What This Opens Up, and What Remains Uncertain
&lt;/h2&gt;

&lt;p&gt;The applications these capabilities suggest are not hard to picture. A system that simultaneously knows where every point on a body is in three-dimensional space, what every surface is made of, how light plays across it, and which pixels belong to which body part — that system could power the kind of detailed reconstruction that until recently required a motion-capture studio with dozens of calibrated cameras and weeks of manual cleanup.&lt;/p&gt;

&lt;p&gt;The implications stretch past entertainment. Accurate real-time body understanding could enable clinical gait monitoring that works through a smartphone camera, tracking a patient's recovery from a stroke with the precision currently available only in specialized rehabilitation facilities. It could enable virtual try-on for online retail that accounts for how a specific garment drapes over a specific body shape, rather than just pasting a flat image onto an avatar. It could drive training simulations in surgery where digital bodies respond to procedural touch with anatomically accurate surface geometry.&lt;/p&gt;

&lt;p&gt;Some honest caveats remain. The paper reports performance on carefully curated test sets, and benchmark success rarely translates perfectly to real-world robustness. The albedo and pointmap tasks are evaluated primarily on high-quality synthetic assets — photorealistic but not real photographs — which may not capture the full messiness of actual camera conditions. The paper mentions dataset diversity across ages and ethnicities, but "diverse" is a word that can mean many things, and it would be worth careful study to determine whether the system performs uniformly across demographic groups or whether certain populations remain underserved by a dataset that, however large, was still filtered by automated pipelines with their own built-in blind spots.&lt;/p&gt;

&lt;p&gt;These are not criticisms of the research; no paper could answer every question. They are reasons to watch subsequent real-world deployments with genuine attention rather than assumption.&lt;/p&gt;

&lt;p&gt;What Sapiens2 does establish, clearly and with substantial evidence, is that a single model can be trained to see human beings in something approaching the full complexity of their visual reality — not as blobs to be located, but as three-dimensional surfaces with specific texture, specific reflectance, specific form in space. The trick, it turns out, was teaching it two different ways to learn, and ensuring that neither way let the model forget what it was actually looking at.&lt;/p&gt;

&lt;p&gt;📄 &lt;a href="https://arxiv.org/abs/2604.21681" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2604.21681&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;tags: computervision, ai, deeplearning, imaging&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🇰🇷 Korean version on Velog: &lt;a href="https://velog.io/@tkdnel1002/hdr3puab" rel="noopener noreferrer"&gt;https://velog.io/@tkdnel1002/hdr3puab&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>python</category>
    </item>
    <item>
      <title>The Sous Chef Who Guesses in Batches</title>
      <dc:creator>Bongho Tae</dc:creator>
      <pubDate>Tue, 28 Apr 2026 15:04:45 +0000</pubDate>
      <link>https://forem.com/xoqhdgh1002/the-sous-chef-who-guesses-in-batches-33ph</link>
      <guid>https://forem.com/xoqhdgh1002/the-sous-chef-who-guesses-in-batches-33ph</guid>
      <description>&lt;h2&gt;
  
  
  When Waiting Becomes the Problem
&lt;/h2&gt;

&lt;p&gt;You are sitting in a restaurant, watching the kitchen through a pass-through window. The head chef — meticulous, authoritative — is assembling a dish. But the rule of this kitchen is strange: the chef cannot pick up the next ingredient until the previous one has been tasted, judged, and placed. Each move waits on the one before. The kitchen is gorgeous, the chef is talented, and the food will be exquisite — but you are going to be here a very long time.&lt;/p&gt;

&lt;p&gt;This is, more or less, the situation with modern AI language models. Programs like the ones that power ChatGPT and similar tools generate text the way that imaginary chef works: one word at a time, each new token produced only after the previous one has been fully committed. The model examines everything it has written so far, makes its best prediction for what comes next, outputs a single word or fragment, and then repeats the cycle — again and again, thousands of times, for a single response. The technical term for this process is &lt;em&gt;autoregressive decoding&lt;/em&gt;, but a simpler description is: painfully sequential.&lt;/p&gt;
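
&lt;p&gt;In code, that sequential kitchen looks roughly like the sketch below. The generate function and the toy model are hypothetical stand-ins, not any real library's API; the point is only that each pass through the loop buys exactly one new token.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def generate(model, prompt_tokens, max_new_tokens, eos_id):
    """One forward pass per new token: the sequential bottleneck in miniature."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = model(tokens)             # look at everything written so far...
        next_tok = int(np.argmax(scores))  # ...and commit to exactly one next token
        tokens.append(next_tok)
        if next_tok == eos_id:
            break
    return tokens

# Toy stand-in model whose prediction depends only on the last token; real
# models are vastly richer, but the one-token-per-pass loop is identical.
vocab_size = 100
toy_model = lambda toks: np.roll(np.eye(vocab_size)[toks[-1]], 1)
print(generate(toy_model, [5], max_new_tokens=8, eos_id=99))  # [5, 6, 7, ..., 13]
&lt;/code&gt;&lt;/pre&gt;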

&lt;p&gt;For short replies, the delay is tolerable. But AI systems are increasingly being asked to think through complex problems step by step — long chains of reasoning that might stretch across thousands of words before reaching a conclusion. The more sophisticated the task, the longer the model must cook, one cautious ingredient at a time. The powerful hardware inside these servers — the GPUs that can, in principle, perform millions of calculations simultaneously — sits mostly idle, waiting for the signal to take the next single step.&lt;/p&gt;

&lt;p&gt;A team of researchers at MIT and NVIDIA has proposed a clever solution to this bottleneck. Their system, called DFlash, doesn't replace the careful head chef. Instead, it introduces a very fast, very clever sous chef — and gives that sous chef an unusual tool: the ability to guess a whole batch of future ingredients all at once.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Sous Chef Paradigm, and Its Flaw
&lt;/h2&gt;

&lt;p&gt;The concept DFlash builds on already existed, and it goes by the name &lt;em&gt;speculative decoding&lt;/em&gt;. The idea is appealingly simple: rather than making the big, authoritative model do all the work sequentially, you hire a smaller, faster assistant to run ahead and draft several upcoming words at once. Then the big model — which can evaluate many options simultaneously, even if it generates them one at a time — looks over the draft and either accepts each word or rejects it, starting over from the first mistake.&lt;/p&gt;

&lt;p&gt;Think of it as a pair of proofreaders. The junior proofreader races through a page, penciling in their best guesses for the next paragraph. The senior proofreader can scan the whole penciled paragraph in a single pass, crossing out anything wrong and handing it back. If the junior's guesses are mostly correct, you've saved a great deal of time. If the guesses are terrible, you haven't helped much at all. The number of penciled-in words the senior accepts before hitting the first mistake — researchers call this the &lt;em&gt;acceptance length&lt;/em&gt; — determines almost everything about whether the strategy pays off.&lt;/p&gt;
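
&lt;p&gt;Here is a deliberately simplified Python sketch of the verification step. The speculative_step function and the target_model signature are hypothetical, and it shows a greedy-match variant in which a drafted word survives only if it is exactly what the senior model would have chosen; real systems use a probabilistic accept-or-reject rule that preserves the senior model's output distribution, but the control flow is the same.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def speculative_step(target_model, draft_tokens, context):
    """Verify a block of drafted tokens with one pass of the big model.

    target_model(context, draft_tokens) is assumed to return, for every draft
    position, the token the target itself would have chosen there. Greedy
    matching is shown for clarity; real systems use a probabilistic
    accept/reject rule that keeps the output distribution exactly the target's.
    """
    target_choices = target_model(context, draft_tokens)  # one parallel pass
    accepted = []
    for drafted, wanted in zip(draft_tokens, target_choices):
        if drafted != wanted:
            accepted.append(wanted)  # keep the senior's correction...
            break                    # ...and throw away everything after it
        accepted.append(drafted)
    return accepted

# The drafter guessed [10, 11, 99, 13]; the target wanted [10, 11, 12, 14],
# so the first two survive plus the correction 12.
print(speculative_step(lambda ctx, block: [10, 11, 12, 14], [10, 11, 99, 13], context=[1, 2, 3]))
&lt;/code&gt;&lt;/pre&gt;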

&lt;p&gt;Here is the problem that DFlash identifies: even in the best existing speculative decoding systems, the junior proofreader still works sequentially. They might be smaller and faster than the senior, but they still write one word, then the next, then the next, checking back after each step. The sous chef is still cooking one ingredient at a time — just more quickly. The serial bottleneck has been made smaller, but it hasn't been fundamentally broken.&lt;/p&gt;

&lt;p&gt;The speedup from the best existing method, a system called EAGLE-3, is impressive — roughly two to three times faster than the baseline. But the researchers behind DFlash suspected there was a way to do far better, if only the sous chef could be taught to guess an entire batch of upcoming words in a single burst, all at the same time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Filling In the Blanks, All at Once
&lt;/h2&gt;

&lt;p&gt;This is where &lt;em&gt;diffusion models&lt;/em&gt; enter the story. If you've encountered AI image generators — systems that conjure photographs of dragons or reimagined living rooms from text descriptions — you've seen diffusion models in action, even if you didn't know the name.&lt;/p&gt;

&lt;p&gt;A diffusion model is built on the idea of destroying a picture and learning to undo the destruction. Imagine taking a clear photograph and progressively adding static until nothing is left but indistinguishable noise. A diffusion model learns to run this process in reverse: starting from noise, it gradually clarifies the image through many rounds of refinement, each round removing a little more static, until something coherent emerges. Applied to text, the concept works similarly. Rather than generating words left to right, a diffusion language model starts with a sentence full of blank tiles — every word hidden, masked — and gradually fills them in over multiple rounds of refinement.&lt;/p&gt;

&lt;p&gt;The crucial difference from sequential generation is this: all the blank tiles are being worked on simultaneously. The model doesn't fill in the first word and then the second. It looks at all the blanks at once, makes its best simultaneous guess about all of them, and then refines the whole block together. It is less like writing a sentence word by word and more like solving a crossword puzzle — you work all the crossing answers at once, letting each one inform the others until the whole thing clicks into place.&lt;/p&gt;

&lt;p&gt;This parallel quality is exactly what DFlash wants to exploit. If you use a diffusion model as the junior proofreader, it can propose an entire block of upcoming words in a single operation, rather than cranking through them one at a time. The serial bottleneck of sequential drafting is shattered.&lt;/p&gt;
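
&lt;p&gt;The structural difference shows up clearly in code: instead of one call per token, the drafter makes a single call for the whole block. The draft_block function and draft_model below are hypothetical stand-ins, with a block of sixteen slots and a 50-word toy vocabulary chosen purely for illustration.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def draft_block(draft_model, context_tokens, block_size, mask_id):
    """Propose a whole block of future tokens in one forward pass.

    draft_model is a hypothetical masked predictor: given the context plus
    block_size mask tokens, it returns one row of vocabulary scores per masked
    slot. A sequential drafter would need block_size separate calls instead.
    """
    masked_block = [mask_id] * block_size
    scores = draft_model(context_tokens + masked_block)  # a single burst of compute
    return [int(t) for t in np.argmax(scores, axis=-1)]  # one guess per slot

# Toy drafter returning random scores for 16 slots over a 50-word vocabulary.
rng = np.random.default_rng(0)
toy_drafter = lambda toks: rng.standard_normal((16, 50))
print(draft_block(toy_drafter, [1, 2, 3], block_size=16, mask_id=0))
&lt;/code&gt;&lt;/pre&gt;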

&lt;p&gt;But there's a catch — the one that researchers have been wrestling with for years. Diffusion language models, as impressive as they are in theory, tend to produce lower-quality text than their sequential counterparts. The crossword-solver metaphor reveals why: when you work all the answers simultaneously without knowing any of them for certain, you make more mistakes than when you build carefully from a foundation of confirmed answers. Diffusion models often require many rounds of refinement to reach acceptable quality, which erodes their speed advantage. Use too few rounds and the text degrades; use too many and you've given back all the time you saved.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Head Chef's Secret Notes
&lt;/h2&gt;

&lt;p&gt;DFlash's central insight is about what to give the sous chef before they start guessing.&lt;/p&gt;

&lt;p&gt;In the existing speculative decoding systems, the draft model — the junior proofreader — is essentially flying blind. It sees the conversation so far, but it has no access to the deep internal understanding that the senior model has developed. It must predict upcoming words from scratch, without knowing what the senior model is "thinking." This is why diffusion-based drafters have historically failed: not only are they proposing multiple words at once, they're doing so without the contextual richness that makes accurate prediction possible.&lt;/p&gt;

&lt;p&gt;DFlash changes this by giving the diffusion drafter something precious: the big model's internal notes.&lt;/p&gt;

&lt;p&gt;During normal operation, when a large language model processes text, it builds up rich internal representations at each of its many layers — compressed summaries of everything the model has understood about the context so far. These representations contain far more information than the final word predictions they eventually produce. They encode long-range relationships, thematic coherence, and a kind of implicit forecast about where the text is headed. Think of them as the head chef's mental model of the entire dish — not just the next ingredient, but the flavor arc, the texture progression, the logical end point.&lt;/p&gt;

&lt;p&gt;DFlash extracts a selection of these internal representations from the big model and hands them directly to the small diffusion drafter. This is done through a technical mechanism called &lt;em&gt;KV injection&lt;/em&gt; — KV standing for key-value, terms from the architecture of modern AI systems. The analogy is precise: imagine not just telling the sous chef what dish is being made, but handing them the head chef's private recipe notebook, filled with shorthand observations about the current state of the dish, the diner's preferences, and what the next three moves should feel like. The sous chef, now richly informed, can make far better batch guesses.&lt;/p&gt;

&lt;p&gt;What makes DFlash's implementation clever is how persistently it applies this conditioning. In earlier systems that also borrowed target-model features, those features were fed only at the input stage — like handing the sous chef the recipe notes at the start of the shift and then taking them away. Over the course of many internal processing steps inside the draft model, that guidance fades. DFlash, by contrast, injects the context features directly into every layer of the draft model, keeping the head chef's wisdom present and active throughout the entire process of drafting. The sous chef doesn't just glance at the notes once; they consult them at every step.&lt;/p&gt;
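
&lt;p&gt;A very rough numpy sketch of that idea follows. The inject_context_kv function, the random projection matrices, and the 64-dimensional head size are placeholders for learned components, and the paper's actual fusion step is more involved; the only point illustrated is that the big model's features become extra key/value entries at every draft layer, not just at the input.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def inject_context_kv(target_features, num_draft_layers, d_head=64, seed=0):
    """Turn target-model hidden features into extra key/value entries for
    every layer of the small drafter. The projections are random placeholders
    for learned ones; only the per-layer injection pattern is the point."""
    rng = np.random.default_rng(seed)
    d_target = target_features.shape[1]
    caches = []
    for _ in range(num_draft_layers):
        w_k = rng.standard_normal((d_target, d_head))
        w_v = rng.standard_normal((d_target, d_head))
        caches.append({"k": target_features @ w_k, "v": target_features @ w_v})
    return caches  # each draft layer later attends over these alongside its own K/V

features = np.random.default_rng(1).standard_normal((128, 4096))  # 128 context tokens from the big model
caches = inject_context_kv(features, num_draft_layers=5)
print(len(caches), caches[0]["k"].shape)  # 5 layers, each with (128, 64) injected keys
&lt;/code&gt;&lt;/pre&gt;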

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28delaio1yz690ysmtj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28delaio1yz690ysmtj3.png" alt="DFlash speedup comparison" width="800" height="286"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1: Speedup of DFlash and EAGLE-3 over autoregressive decoding on Qwen3-8B with the Transformers backend. Overall, DFlash achieves more than 2.5× the speedup of EAGLE-3.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8431daioyxeo42341w1g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8431daioyxeo42341w1g.png" alt="DFlash inference design" width="800" height="360"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2: DFlash Inference Design. Hidden context features extracted from the target model are fused and injected into each draft layer's Key-Value cache to enable conditional speculation.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Numbers Are Striking
&lt;/h2&gt;

&lt;p&gt;The results of this architecture are, by the standards of what came before, surprising. DFlash achieves more than six times the speed of the baseline sequential decoding, and more than two and a half times the speed of EAGLE-3 — the previous state of the art — on the same models and tasks.&lt;/p&gt;

&lt;p&gt;What makes this especially noteworthy is the lossless guarantee. In speculative decoding, the senior model always has the final word. If the junior's draft contains an error, the senior catches it and the system rejects everything after the mistake, drafting again from that point. The final output is mathematically guaranteed to be identical to what the senior model would have produced on its own, working sequentially. There is no trade-off in quality — only a trade-off in how much time the junior's mistakes cost you. DFlash's drafts are accepted at high enough rates that the occasional rejection barely dents the speedup.&lt;/p&gt;

&lt;p&gt;The draft model itself is remarkably small: just five layers in most configurations, compared to the eighty or more layers of a typical large language model. The system generates blocks of sixteen draft tokens in a single parallel forward pass — one burst of computation instead of sixteen sequential ones. Because the draft model is tiny and because the per-step cost of diffusion can be minimized when you're not asking it to produce perfect text on its own (the verification step handles quality control), DFlash's drafting phase is fast enough that even a modest acceptance rate translates into large overall gains.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjwcg1b5vv7cn893jhbr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjwcg1b5vv7cn893jhbr.png" alt="Draft cost comparison across model sizes" width="390" height="185"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 3: Draft cost of 1, 3, 5-layer DFlash and 1-layer EAGLE-3.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Becomes Possible
&lt;/h2&gt;

&lt;p&gt;It is easy to wave at "faster AI" and not feel the weight of what that means in practice. Let me make it specific.&lt;/p&gt;

&lt;p&gt;Imagine a doctor using an AI assistant to review a patient's complete medical history — thousands of records, lab results, clinical notes — and generate a differential diagnosis before a consultation. Today, that task takes long enough that clinicians often skip it, relying instead on incomplete context. A six-fold speedup collapses that wait from something that feels impractical to something that fits inside a brief pause. The assistant becomes a tool you actually use, rather than a tool you consult only when you have time to spare.&lt;/p&gt;

&lt;p&gt;Or consider software engineers who now increasingly use AI to generate and review code. The most sophisticated code-generation tasks — those involving architectural reasoning, cross-file dependencies, detailed testing — currently take long enough that experienced programmers often do them manually rather than wait for AI assistance. Faster inference means the AI meets the engineer's pace, rather than the engineer accommodating the AI's latency.&lt;/p&gt;

&lt;p&gt;More broadly, the new generation of AI reasoning models — which explicitly think through problems step by step, sometimes generating thousands of words of internal deliberation before producing a final answer — is especially constrained by sequential decoding. These models are where the frontier of AI capability is moving. DFlash's gains matter most precisely there, where long inference chains have become the dominant cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Paper Doesn't Settle
&lt;/h2&gt;

&lt;p&gt;It would be dishonest to stop without naming the gaps.&lt;/p&gt;

&lt;p&gt;DFlash's results are measured on specific benchmarks — mathematics problems, code generation, conversational tasks. These are areas where the quality of outputs can be evaluated somewhat objectively, and where large language models tend to perform in a relatively predictable way. It is less clear how DFlash would perform on tasks where the acceptable output space is broader, and where subtle degradations in the acceptance rate might compound into meaningful quality differences over very long generations. The researchers claim losslessness, and the mathematical argument for it is sound, but the empirical tests cover a constrained set of conditions.&lt;/p&gt;

&lt;p&gt;There is also the question of what happens as models continue to grow. DFlash's draft model is conditioned on features extracted from the target model, which means training a new draft model for every new version of a large target. If the major AI labs release updated models frequently — as they have been — the maintenance cost of keeping draft models current may become non-trivial. The researchers acknowledge that their architecture points toward a new paradigm for diffusion models rather than solving every deployment challenge.&lt;/p&gt;

&lt;p&gt;Finally, there is a deeper question the paper gestures at but does not fully resolve. DFlash argues that diffusion models are most useful not as standalone generators but as specialized drafters inside a larger system. This is a genuinely interesting reframing — accepting that diffusion's weaknesses in end-to-end quality are structural, and routing around them by confining diffusion to a supporting role. Whether this represents an intellectual concession or a genuine insight about the best use of different architectures is something the field will work out over the next few years.&lt;/p&gt;

&lt;p&gt;What seems clear, even now, is that the kitchen metaphor that opened this piece is changing. The head chef remains indispensable — authoritative, careful, irreplaceable. But the sous chef no longer has to work one step at a time, waiting for permission with every move. The kitchen is getting faster, and the food will taste exactly the same.&lt;/p&gt;

&lt;p&gt;📄 &lt;a href="https://arxiv.org/abs/2602.06036" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2602.06036&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;tags: llminference, speculativedecoding, diffusionmodels, aiacceleration&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🇰🇷 Korean version on Velog: &lt;a href="https://velog.io/@tkdnel1002/aw6n2y28" rel="noopener noreferrer"&gt;https://velog.io/@tkdnel1002/aw6n2y28&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>python</category>
    </item>
    <item>
      <title>The Swiss Army Knife That Actually Works: How AI Learned to Think and Draw at the Same Time</title>
      <dc:creator>Bongho Tae</dc:creator>
      <pubDate>Mon, 27 Apr 2026 15:07:11 +0000</pubDate>
      <link>https://forem.com/xoqhdgh1002/the-swiss-army-knife-that-actually-works-how-ai-learned-to-think-and-draw-at-the-same-time-879</link>
      <guid>https://forem.com/xoqhdgh1002/the-swiss-army-knife-that-actually-works-how-ai-learned-to-think-and-draw-at-the-same-time-879</guid>
      <description>&lt;p&gt;Picture a talented friend who can do something most people cannot: hold a genuine conversation about a painting while simultaneously sketching it from memory, then explain their artistic choices in writing, then generate a variation in a different style — all in one unbroken flow of thought. No pausing to switch hats. No handing the problem to a different colleague. Just one mind, moving fluidly between seeing, understanding, reasoning, and creating.&lt;/p&gt;

&lt;p&gt;For years, artificial intelligence couldn't do this. The systems that were brilliant at understanding images were separate creatures from the systems that could generate them. And the systems that could generate text were fundamentally strangers to the ones that could generate pictures. We built specialists and called them state-of-the-art. A new paper from Inclusion AI suggests we may finally be moving past that era.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Specialist Trap
&lt;/h2&gt;

&lt;p&gt;To understand why building a unified AI brain has been so hard, consider how the field arrived at where it is today.&lt;/p&gt;

&lt;p&gt;When researchers wanted to build systems that could understand images — answering questions like "Is the dog happy?" or "What's in the background?" — they built what are called vision-language models. These work a bit like a translator with two desks: one desk for images, one for text, with a bridge between them. The model looks at an image, converts it into a kind of abstract summary, then reasons about that summary using language skills. It became excellent at this. Ask it what's on a table in a photo and it will describe every item with unnerving precision.&lt;/p&gt;

&lt;p&gt;But when researchers wanted to build systems that could &lt;em&gt;generate&lt;/em&gt; images — creating a picture from a text description — they took an entirely different road. They built diffusion models, which work through a process analogous to developing a photograph in a darkroom. Imagine a blank sheet of photographic paper coated in a fog of random chemical noise. The developer's job is to gradually coax a clear image out of that noise by applying the right chemistry in the right sequence. Generation-focused AI works the same way: it starts with pure randomness and, step by step, refines it into something coherent. These models became extraordinarily good at producing images, but they weren't built for conversation.&lt;/p&gt;

&lt;p&gt;The result was a landscape of powerful specialists who couldn't collaborate. Your image-understanding model couldn't create anything new. Your image-generating model couldn't reason about what it made. Asking one system to understand a photograph and then produce a variation of it was like asking a translator and a painter to work together when they speak different languages and have never met.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0kxsc9psext5weldcmkv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0kxsc9psext5weldcmkv.png" alt="System architecture diagram showing the unified LLaDA2.0-Uni framework" width="54" height="54"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A Common Alphabet
&lt;/h2&gt;

&lt;p&gt;The fundamental problem was that text and images were written in incompatible scripts. Text arrives as words — discrete, enumerable, easy to shuffle around and reason about. Images arrive as a continuous wash of pixel values: 16 million possible colors per pixel, no obvious boundaries, no clean units. Trying to process both in the same system was like trying to play chess and checkers on the same board with the same pieces.&lt;/p&gt;

&lt;p&gt;The solution the LLaDA2.0-Uni researchers found starts with a step that sounds simple but is actually the keystone of everything else: they translated images into the same kind of discrete alphabet that text already uses.&lt;/p&gt;

&lt;p&gt;Think of it this way. If you wanted to describe a piece of music to someone who only reads sheet music, you wouldn't play them the recording — you'd transcribe it into notes and rests on a staff. The transcription loses some nuance (the exact timbre of the violin, the subtle swell of dynamics), but it captures the essential structure in a form the reader can work with. The researchers built something similar for images, using a component called SigLIP-VQ: a particular image encoder (SigLIP) paired with a vector-quantization step that does the transcription.&lt;/p&gt;

&lt;p&gt;Vector quantization is the sheet music step. Imagine you have a vast library of small visual "stamps" — maybe 16,384 different ones, each representing a distinct visual pattern: a soft edge, a bright diagonal, a particular texture. When you feed an image into the tokenizer, it breaks the image into small patches (like cutting a photograph into a grid of tiny tiles) and asks, for each patch: which stamp in our library is closest to this? The answer — "stamp number 7,341" — is a discrete token. Do this for every patch and you've converted a continuous photograph into a sequence of numbers, just like text.&lt;/p&gt;
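
&lt;p&gt;The stamping step itself is simple enough to sketch in a few lines of Python. The quantize_patches function, the random toy codebook, and the tiny feature dimension are illustrative assumptions; in the real system the 16,384 entries are learned, but the nearest-stamp lookup is the same idea.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def quantize_patches(patch_features, codebook):
    """Map each image patch to the index of its nearest codebook 'stamp'.

    patch_features: (num_patches, d) continuous features from the image encoder.
    codebook:       (num_stamps, d), e.g. 16,384 learned visual patterns.
    Returns one discrete token id per patch, ready to sit next to word tokens.
    """
    # Squared distance between every patch and every stamp, then pick the closest.
    d2 = ((patch_features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.standard_normal((16384, 8))  # toy codebook; real entries are learned
patches = rng.standard_normal((64, 8))      # an 8x8 grid of patch features
print(quantize_patches(patches, codebook)[:5])  # five discrete ids, just like words
&lt;/code&gt;&lt;/pre&gt;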

&lt;p&gt;Now text and images speak the same language. A sentence like "a red barn at sunset" and a photograph of a red barn at sunset can both be represented as sequences of tokens. The same reasoning machinery can process either.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Crossword Puzzle Model
&lt;/h2&gt;

&lt;p&gt;Here is where the paper's central gamble becomes interesting, because the reasoning machinery they chose is not the dominant approach in the field.&lt;/p&gt;

&lt;p&gt;Most large language models today generate text the way a novelist types: one word at a time, left to right, never going back. The model commits to each word before seeing what comes next, which works remarkably well but has limitations — especially for tasks where you might want to revise your global plan as you go, or fill in a document non-sequentially.&lt;/p&gt;

&lt;p&gt;The LLaDA2.0-Uni system instead uses what researchers call a discrete diffusion model. The analogy here is a crossword puzzle.&lt;/p&gt;

&lt;p&gt;Imagine you're handed a crossword grid where every square has been filled in with random letters — pure noise. Your job is to fix it, guided by the clues. You don't start at 1-Across and work linearly. Instead, you scan the whole grid for places where you're most confident ("7-Down, three letters, 'feline'? That's CAT, obvious"), fill those in, then let those anchors guide the harder squares. You're refining the whole grid simultaneously, converging toward correctness from many directions at once. When you're mostly done, you revisit the remaining uncertain squares with fresh eyes, because now the surrounding letters constrain them.&lt;/p&gt;

&lt;p&gt;Discrete diffusion works the same way. The model starts with a sequence of masked tokens — imagine every word in a sentence replaced by a [?] — and iteratively fills them in, guided by the content it's already committed to. It can fill any position at any time, not just left to right. This means it can develop a global sense of what a response should look like before committing to individual words. For images, this is especially powerful: it can simultaneously work on the sky of an image and the ground, letting each inform the other.&lt;/p&gt;
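
&lt;p&gt;A compact sketch of that crossword loop looks like this. The diffusion_fill function, the random toy model, and the per-round commit budget are illustrative assumptions, since real unmasking schedules vary; the structure to notice is that every remaining blank is re-predicted each round and only the most confident ones are committed.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diffusion_fill(model, tokens, mask_id, rounds):
    """Crossword-style decoding: each round, commit the blanks the model is
    most sure about, then rethink the rest with those anchors in place.
    model(tokens) is a hypothetical masked predictor returning a
    (seq_len, vocab_size) score matrix."""
    tokens = np.array(tokens)
    per_round = int(np.ceil((tokens == mask_id).sum() / rounds))
    for _ in range(rounds):
        blanks = np.where(tokens == mask_id)[0]
        if blanks.size == 0:
            break
        probs = softmax(model(tokens))[blanks]         # predictions for every blank
        confidence = probs.max(axis=-1)
        guesses = probs.argmax(axis=-1)
        surest = np.argsort(confidence)[-per_round:]   # the most confident blanks
        tokens[blanks[surest]] = guesses[surest]       # commit them
    return tokens

rng = np.random.default_rng(0)
toy_model = lambda toks: rng.standard_normal((len(toks), 200))  # random scores, just to run the loop
MASK = -1
print(diffusion_fill(toy_model, [7, MASK, MASK, 42, MASK, MASK], mask_id=MASK, rounds=3))
&lt;/code&gt;&lt;/pre&gt;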

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsunx5u0jbaekv2otbo2i.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsunx5u0jbaekv2otbo2i.jpeg" alt="Training pipeline and data curation stages for LLaDA2.0-Uni" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Council of Specialists
&lt;/h2&gt;

&lt;p&gt;Running a model that processes both high-resolution images and complex language simultaneously is computationally expensive — the kind of expensive that makes the electricity bill of a small city seem modest. The researchers addressed this with an architectural choice called Mixture of Experts, or MoE.&lt;/p&gt;

&lt;p&gt;Think of a large hospital emergency department. When a patient arrives, a triage nurse assesses the situation and routes them: chest pain goes to cardiology, a broken bone to orthopedics, a rash to dermatology. Not every doctor sees every patient. Most doctors sit idle for any given case while the relevant specialist handles it.&lt;/p&gt;

&lt;p&gt;The MoE backbone works the same way. Inside the model, there are many specialized sub-networks — the "experts." When the model processes a given input, a routing mechanism decides which subset of experts should activate. An image-heavy input might activate different experts than a text-heavy one. The result is a model with the capacity of a very large system but the computational cost of a much smaller one, because only a fraction of the total machinery runs at any moment.&lt;/p&gt;
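
&lt;p&gt;The triage step is easy to sketch. The moe_layer function, the eight random toy experts, and the top-2 routing are illustrative choices, not the paper's configuration; the structural point is that only the selected experts run.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def moe_layer(x, experts, router_weights, top_k=2):
    """Route a token's features to its top-k experts and mix their outputs.

    experts is a list of callables (the specialist sub-networks) and
    router_weights is (num_experts, d). Only top_k experts actually run,
    which is the whole point of the design; top_k=2 is an illustrative choice.
    """
    gate_scores = router_weights @ x            # the triage nurse's assessment
    chosen = np.argsort(gate_scores)[-top_k:]   # the few most relevant specialists
    weights = np.exp(gate_scores[chosen])
    weights = weights / weights.sum()           # softmax over the chosen few
    return sum(w * experts[i](x) for w, i in zip(weights, chosen))

# Toy usage: eight random linear "experts" over 16-dim features.
rng = np.random.default_rng(0)
experts = [(lambda W: (lambda x: W @ x))(rng.standard_normal((16, 16))) for _ in range(8)]
router = rng.standard_normal((8, 16))
print(moe_layer(rng.standard_normal(16), experts, router).shape)  # (16,)
&lt;/code&gt;&lt;/pre&gt;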

&lt;p&gt;This is not a new idea in AI research, but combining it with a diffusion-style architecture for both text and image tokens simultaneously is precisely the kind of integration that makes this work notable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reconstructing the Canvas
&lt;/h2&gt;

&lt;p&gt;Even after all this machinery processes an image as tokens, you eventually need to convert those tokens back into actual pixels that a human can see. The gap between "a sequence of stamp numbers" and "a beautiful, coherent image" is where many unified systems stumble, producing outputs that look smeared or incoherent.&lt;/p&gt;

&lt;p&gt;The researchers added a dedicated diffusion decoder for this final step — essentially a specialized refinement engine that takes the abstract token sequence and reconstructs it into a high-fidelity image. Think of it as the difference between reading sheet music notation and actually hearing an orchestra perform it. The notation captures the structure; the performance fills in all the richness that makes it real.&lt;/p&gt;

&lt;p&gt;To make this fast enough to be useful, they used a technique called few-step distillation. Normally, the diffusion process requires dozens or hundreds of refinement steps — like developing a photograph through a long sequence of chemical baths. Distillation compresses this wisdom: a "teacher" model that takes a hundred careful steps trains a "student" model to achieve comparable results in just a few. The student learns not the teacher's process but the teacher's outcomes, skipping the intermediate labor.&lt;/p&gt;
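
&lt;p&gt;As a rough sketch of the training signal: the student's few-step output is pushed toward whatever the teacher produces with many steps. The distillation_loss function, the hypothetical decode functions, the step counts, and the plain mean-squared-error objective are all simplifying assumptions rather than the paper's exact recipe.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def distillation_loss(student_decode, teacher_decode, tokens, noise,
                      few_steps=4, many_steps=100):
    """Few-step distillation target: the student, running only a few
    refinement steps, should land on the image the teacher reaches with many.
    The decode functions and the MSE objective are simplifying assumptions."""
    target = teacher_decode(tokens, noise, many_steps)   # slow, high quality
    preview = student_decode(tokens, noise, few_steps)   # fast approximation
    return np.mean((preview - target) ** 2)

# Toy decoders standing in for the real diffusion decoder.
toy_teacher = lambda tokens, noise, steps: noise * 0.0 + 1.0
toy_student = lambda tokens, noise, steps: noise * 0.0 + 0.9
print(distillation_loss(toy_student, toy_teacher, tokens=None, noise=np.zeros((8, 8))))  # approx 0.01
&lt;/code&gt;&lt;/pre&gt;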

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fthx801h0vaa6vtwyvx0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fthx801h0vaa6vtwyvx0n.png" alt="Qualitative examples of image generation and editing from LLaDA2.0-Uni" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure: Qualitative examples of image generation and editing from LLaDA2.0-Uni.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Integrated Mind
&lt;/h2&gt;

&lt;p&gt;What all of this amounts to is a system that, for the first time in this configuration, can genuinely interleave text and images in its reasoning without handing off between different specialized models.&lt;/p&gt;

&lt;p&gt;Consider what this means concretely. Imagine asking the system: "Here's a painting. What mood does it evoke, and can you generate a photograph that captures the same feeling in a real-world setting?" A siloed system would have to pass the image to an understanding model, extract a description, pass that description to a generation model, and hope the handoff preserved what mattered. LLaDA2.0-Uni processes the original image and generates the new photograph within the same computational stream. The understanding and the creation are happening in the same mind, informed by each other.&lt;/p&gt;

&lt;p&gt;The paper calls this "interleaved generation and reasoning," and it's the feature that most distinguishes this architecture from its predecessors. The model can generate a paragraph of text, then generate an image that continues the narrative, then reason about both together — without the artificial seams that separate-model pipelines inevitably produce.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fky14rsbvsaisn0u3myid.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fky14rsbvsaisn0u3myid.jpeg" alt="Interleaved text and image generation examples" width="800" height="853"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changes in the Real World
&lt;/h2&gt;

&lt;p&gt;The most interesting applications of a system like this are not in the lab but in the workflows where the gap between understanding and generation currently costs time, fidelity, and money.&lt;/p&gt;

&lt;p&gt;Consider medical imaging. A radiologist today looks at a scan, forms a judgment, and dictates a report — two separate steps, often using separate tools. A system that can simultaneously examine a CT scan and draft a structured report, then modify the report and highlight the corresponding region of the scan when a colleague asks a follow-up question, collapses multiple handoffs into a single workflow. The bottleneck shrinks.&lt;/p&gt;

&lt;p&gt;Consider education. A teacher designing a history lesson might want to explain the significance of a photograph from 1945, generate a map showing the troop positions it references, create a timeline that incorporates both, and then produce a quiz that uses all three. Today, each of those steps requires a different tool and a manual bridge between them. A unified reasoning-and-generation system makes the bridges automatic.&lt;/p&gt;

&lt;p&gt;Or consider design iteration, where a product designer needs to produce a concept, explain its rationale to a client, modify it based on feedback, and document the changes — all in a single collaborative session. The ability to reason about what's on the canvas and alter it within the same cognitive loop changes the pace of that process entirely.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rodbbhejew7qyso9mdn.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rodbbhejew7qyso9mdn.jpeg" alt="Additional qualitative results showing text-image interleaving and editing capabilities" width="800" height="854"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Remains Unanswered
&lt;/h2&gt;

&lt;p&gt;It would be wrong to leave this without noting what the paper doesn't address, because the gap between benchmark performance and real-world deployment is always wider than a research paper can acknowledge.&lt;/p&gt;

&lt;p&gt;The authors report that their model "matches specialized VLMs in multimodal understanding while delivering strong performance in image generation." That hedge — "matches" rather than "surpasses," "strong" rather than "best" — is doing meaningful work. The specialized models that focus only on understanding images remain better at it. The specialized models that focus only on generating images remain better at that. What LLaDA2.0-Uni offers is not supremacy in any single domain but competence across all of them simultaneously. Whether that trade-off is worth making depends entirely on the use case.&lt;/p&gt;

&lt;p&gt;I'm also skeptical, from the abstract alone, about how the discrete tokenization of images holds up at the extremes. The sheet music analogy works well for capturing structure, but it loses expressiveness. A violin's exact timbre doesn't survive transcription. Similarly, the process of converting an image into a vocabulary of 16,384 stamps and then reconstructing it will introduce artifacts and losses, particularly for images with complex textures or fine detail. The paper claims "high-fidelity" reconstruction, but what "high fidelity" means at scale, across diverse real-world imagery, is a question that requires more than a benchmark table to answer.&lt;/p&gt;

&lt;p&gt;Finally, the computational reality is sobering. A Mixture of Experts architecture is cheaper to run than a naive model of the same theoretical capacity, but "cheaper" is relative. Running a system like this in a consumer product, at scale, remains a significant engineering challenge. The gap between "this works in a research paper" and "this works on your phone" is still large.&lt;/p&gt;

&lt;p&gt;None of this diminishes the intellectual accomplishment. Building a system that can read an image as a sequence of meaningful symbols, reason about those symbols and text symbols simultaneously using a diffusion process, and then reconstruct coherent images from the output — all within one integrated architecture — represents a genuine step toward the flexible, general-purpose AI systems the field has been working toward for years. The question is never whether a new approach is perfect. The question is whether it moves the frontier in a direction worth moving. This one does.&lt;/p&gt;

&lt;p&gt;📄 &lt;a href="https://arxiv.org/abs/2604.20796" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2604.20796&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;tags: multimodal, diffusion, imagegeneration, unifiedai&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🇰🇷 Korean version on Velog: &lt;a href="https://velog.io/@tkdnel1002/f4k7gl8o" rel="noopener noreferrer"&gt;https://velog.io/@tkdnel1002/f4k7gl8o&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>python</category>
    </item>
    <item>
      <title>Reading the Receipts: How Smarter Privacy Accounting Could Unlock More from Sensitive Data</title>
      <dc:creator>Bongho Tae</dc:creator>
      <pubDate>Mon, 27 Apr 2026 03:47:29 +0000</pubDate>
      <link>https://forem.com/xoqhdgh1002/reading-the-receipts-how-smarter-privacy-accounting-could-unlock-more-from-sensitive-data-6jh</link>
      <guid>https://forem.com/xoqhdgh1002/reading-the-receipts-how-smarter-privacy-accounting-could-unlock-more-from-sensitive-data-6jh</guid>
      <description>&lt;h2&gt;
  
  
  The Problem Hiding Inside Every Medical Study
&lt;/h2&gt;

&lt;p&gt;Picture a coalition of hospitals that wants to train an AI to detect early signs of heart disease. No individual hospital has enough patients to train the model alone, so they decide to collaborate. But there's a catch: they cannot simply share patient records. Privacy law forbids it. So instead, each hospital trains on its own data and shares only the model's learned parameters — not the raw records themselves.&lt;/p&gt;

&lt;p&gt;This arrangement sounds safe, but the parameters are not innocent. Through a technique called a &lt;em&gt;membership inference attack&lt;/em&gt;, a sophisticated adversary can sometimes probe a shared model and determine whether a specific person's records were used in training. Each round of parameter sharing is a small window through which a little information escapes. Run enough rounds, and the window grows into a door.&lt;/p&gt;

&lt;p&gt;Every engineer building this kind of system therefore works under a constraint: a privacy budget. Think of it as a jar of trust coins. Each training round costs some coins. When the jar is empty, you must stop — any further sharing would compromise the privacy guarantees you promised. The question the system designer has to answer before training begins is: &lt;em&gt;how many rounds can we afford?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The answer, it turns out, has historically been too pessimistic — sometimes by a wide margin. A paper by Sophie Taylor, Praneeth Vippathalla, and Justin Coon of the University of Oxford proposes a way to fix that, by changing not the rules of the game, but how carefully the score is kept.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Old Scorekeeping Was Leaving Points on the Table
&lt;/h2&gt;

&lt;p&gt;To understand the inefficiency, you need to understand what differential privacy actually guarantees. At its core, it is a mathematical promise: &lt;em&gt;the output of any query on a database will look almost identical whether or not any single person's record is included.&lt;/em&gt; The "almost" is controlled by a small number, typically called epsilon. A very small epsilon means very strong privacy — the outputs barely change regardless of whether your record is present. A large epsilon means the outputs might shift noticeably, giving an adversary more leverage.&lt;/p&gt;

&lt;p&gt;The clever mechanism that enforces this guarantee is noise. Before releasing an answer — say, "the average blood pressure of patients in this cohort" — the system deliberately adds a small dose of random static, like a radio signal faintly scrambled. The static is calibrated so that any single patient's record could plausibly have been there or not there; the noise blurs the difference.&lt;/p&gt;
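
&lt;p&gt;Concretely, a single noisy release looks like the sketch below, applied to the cholesterol-count example that appears later in this piece. The gaussian_release function is an illustrative helper, and the sigma formula is the standard textbook calibration for a Gaussian mechanism on one release.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def gaussian_release(true_answer, sensitivity, eps, delta, seed=None):
    """Release one query answer with calibrated Gaussian noise.

    sensitivity is how much the answer can change if a single person's record
    is added or removed; the sigma formula is the classic calibration for an
    (eps, delta) guarantee on a single release.
    """
    rng = np.random.default_rng(seed)
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return true_answer + rng.normal(0.0, sigma)

# "How many patients have elevated cholesterol?" True count 412; one patient
# can change the count by at most 1, so the sensitivity is 1.
print(gaussian_release(412, sensitivity=1, eps=0.3, delta=1e-5))
&lt;/code&gt;&lt;/pre&gt;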

&lt;p&gt;Now here is where the budget problem enters. Every time you add noise to an answer and release it, you spend some of your epsilon coins. The mathematical theory of &lt;em&gt;composition&lt;/em&gt; tells you how the costs accumulate over multiple queries. And existing composition theorems, for all their sophistication, share a common habit: they charge you for the &lt;em&gt;worst possible&lt;/em&gt; query of that type, not for the query that actually happened.&lt;/p&gt;

&lt;p&gt;Imagine a family deciding how to budget for a road trip. The parents look up the car's fuel consumption: &lt;em&gt;maximum 9 litres per 100 kilometres.&lt;/em&gt; They plan the entire trip assuming every kilometre will cost maximum fuel — and conclude they can drive only 300 kilometres before running out. But in practice, the highway stretches are far more efficient than the worst-case city traffic. If they had tracked the actual fuel gauge reading after each leg of the journey, they would have realized they could drive 450 kilometres.&lt;/p&gt;

&lt;p&gt;Existing privacy filters — the software tools that track privacy loss and decide when to stop — function like those overcautious parents. They know the &lt;em&gt;mechanism type&lt;/em&gt; being used (say, "Gaussian noise added to an answer"), and they charge each query the maximum that mechanism could ever cost. They never check the fuel gauge. They never read the receipts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq42l3q2t9d7a3eaelp9z.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq42l3q2t9d7a3eaelp9z.jpeg" alt="Adaptive data release system showing query rounds and privacy tracking" width="512" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1: Adaptive data privacy problem&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Insight: Charge for What Actually Happened
&lt;/h2&gt;

&lt;p&gt;The key idea of this paper is disarmingly simple to state, though technically treacherous to implement: &lt;em&gt;measure the privacy leakage that actually occurred, not the worst case that could have occurred.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When a query is answered and noise is added, the actual output is a specific number — not a range of possible numbers. That specific output lands somewhere in the distribution of possible outputs. If it lands near the middle of the distribution, the leakage for that query was small; an adversary learns relatively little from an unexceptional answer. If it lands in the extreme tail — a very unusual answer — the leakage was larger, because extreme answers are harder to fake with noise.&lt;/p&gt;

&lt;p&gt;Think of it this way. Suppose a medical study releases the query "how many patients have elevated cholesterol?" and the true answer is 412 patients, plus some random noise. If the released number is 415, that is an utterly unremarkable deviation — it could mean 412 patients, or 411, or 413. An adversary trying to determine whether a specific patient is in the dataset gains almost nothing from this boring answer. The privacy cost of that particular query output was tiny.&lt;/p&gt;

&lt;p&gt;But if the system's budget was pre-allocated by saying "this type of query could cost up to 0.3 epsilon coins," when the actual cost was closer to 0.03 coins, you have wasted the difference. The authors call tracking the actual coin-by-coin spending &lt;em&gt;realisation-level accounting&lt;/em&gt;, as opposed to the older &lt;em&gt;mechanism-level accounting&lt;/em&gt; that rounds every expense up to the catalogue price.&lt;/p&gt;

&lt;p&gt;This is not merely thrifty bookkeeping. The gap between what you were charged and what you actually spent accumulates across hundreds or thousands of queries. In federated learning, where model training might take thousands of rounds, that difference can translate into a dramatically longer training run — a more capable model — within exactly the same privacy guarantee you promised at the start.&lt;/p&gt;
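
&lt;p&gt;For a Gaussian release, the realised cost of one specific output can be written down directly: it is the log-ratio of how likely that output was under the real database versus a neighbouring one. The realised_loss function below is an illustrative sketch that fixes a single neighbouring database, which is a simplification; the paper handles the general case with considerably more care.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def realised_loss(released, true_answer, sensitivity, sigma):
    """Privacy loss actually incurred by one Gaussian release.

    It is the (absolute) log-ratio of how likely the released value was under
    the real database versus a neighbouring one whose true answer differs by
    sensitivity. Unremarkable outputs near the mean cost little; outputs far
    out in the tail cost more. One fixed neighbour is an assumption here.
    """
    neighbour = true_answer + sensitivity
    return abs((released - neighbour) ** 2 - (released - true_answer) ** 2) / (2.0 * sigma ** 2)

sigma = 10.0
print(realised_loss(415.0, 412.0, 1.0, sigma))  # a boring answer near 412: tiny cost
print(realised_loss(440.0, 412.0, 1.0, sigma))  # a tail answer: noticeably larger cost
&lt;/code&gt;&lt;/pre&gt;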

&lt;h2&gt;
  
  
  The Tricky Part: You Can't Just Read the Meter
&lt;/h2&gt;

&lt;p&gt;If realisation-level accounting is so natural, why hadn't it been built before? The answer lies in a subtle mathematical hazard that the paper devotes considerable effort to navigating.&lt;/p&gt;

&lt;p&gt;Here is the problem in miniature. Suppose you are playing a card game where you must stop when you've spent your budget. Normally, the stopping rule is independent of the cards themselves — you're just counting. But with realisation-level accounting, the stopping rule &lt;em&gt;depends on the specific cards you've seen&lt;/em&gt;. This creates a self-referential tangle: the decision to stop, or not, is itself a piece of information that might reveal something about the data.&lt;/p&gt;

&lt;p&gt;Mathematically, this is the problem of &lt;em&gt;stopping times&lt;/em&gt; — the point at which you decide to quit. When you condition on having seen all the outputs up to a certain round and then decide whether to continue, you are in a different probability universe than if you had committed your stopping rule in advance. Standard privacy proofs assume the stopping point is fixed ahead of time. The moment it becomes adaptive, the proofs can break.&lt;/p&gt;

&lt;p&gt;Picture a journalist who decides to stop investigating a story only when she finds compelling evidence. Her decision to stop is not random — it is correlated with what she found. If someone wants to know whether she stopped because of what her source said, her stopping time itself becomes a leak.&lt;/p&gt;

&lt;p&gt;The authors work around this with a careful mathematical construction. Rather than conditioning on the full output history — which would create the self-referential problem — they design the filter to track a running statistic that accounts for how surprising each output was, without directly conditioning on the stopping decision itself. The proof that this filter still guarantees differential privacy requires several pages of careful measure-theoretic argument, grappling with conditional distributions and martingale stopping theorems. But the upshot is clean: you &lt;em&gt;can&lt;/em&gt; use the actual leakage to decide when to stop, and the privacy guarantee still holds, exactly as promised.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Bouncer Who Actually Checks the Tab
&lt;/h2&gt;

&lt;p&gt;Think of the filter as a bouncer at a very strict club. The rule is: each patron can consume at most epsilon units of information over the evening. The old-style bouncer assigned every customer the maximum possible tab the moment they walked in, based on what a typical customer &lt;em&gt;might&lt;/em&gt; order. This meant many customers were turned away before they reached their actual limit.&lt;/p&gt;

&lt;p&gt;The new filter is a bouncer who actually watches what each customer orders and marks it on a running tab. When the tab hits the limit, the customer stops. Most evenings, most customers reach their real limit far later than the overcautious estimate would have predicted — so the club can stay open longer, and everyone gets more of the experience they came for, without the club violating its capacity rules.&lt;/p&gt;
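
&lt;p&gt;Put together, the running tab is a short loop, sketched below with the same Gaussian release and realised-cost calculation as above. The run_with_filter function is an illustration of the idea only; the paper's actual filter is constructed more carefully, precisely because of the stopping-time subtlety discussed earlier.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def run_with_filter(true_answers, eps_budget, sensitivity, sigma, seed=1):
    """The bouncer with a running tab: keep answering while the realised
    privacy spend stays under the budget, then refuse further queries.
    A sketch only; the paper's filter needs extra care around the stopping rule."""
    rng = np.random.default_rng(seed)
    tab, released = 0.0, []
    for answer in true_answers:
        y = answer + rng.normal(0.0, sigma)  # Gaussian release
        neighbour = answer + sensitivity
        cost = abs((y - neighbour) ** 2 - (y - answer) ** 2) / (2.0 * sigma ** 2)
        if tab + cost &gt;= eps_budget:       # this answer would bust the tab
            break                            # so it is never released
        tab += cost
        released.append(y)
    return released, tab

answers, spent = run_with_filter([412.0] * 500, eps_budget=1.0, sensitivity=1.0, sigma=10.0)
print(len(answers), round(spent, 3))  # typically more rounds than worst-case accounting would allow
&lt;/code&gt;&lt;/pre&gt;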

&lt;p&gt;The paper also addresses something practical that older approaches stumbled on. An alternative privacy formalism called Rényi Differential Privacy (RDP) — a variant that uses a specific mathematical measure of information distance between distributions — has been widely used for composition because it composes very cleanly across queries. But it behaves poorly for certain kinds of mechanisms, particularly ones whose noise distributions have heavy tails or unusual shapes. Some mechanisms simply don't fit the RDP framework neatly.&lt;/p&gt;

&lt;p&gt;The realisation-level filter in this paper sidesteps that problem entirely. Because it operates directly on the actual output — asking "how surprising was this specific answer?" rather than "how does this mechanism behave on average?" — it does not require the mechanism to fit any particular mathematical family. It works on well-behaved Gaussian mechanisms and badly-behaved ones alike.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Numbers Show
&lt;/h2&gt;

&lt;p&gt;The paper's numerical experiments compare the new filter against existing mechanism-level filters in a straightforward setup: a sequence of Gaussian queries on a database, where privacy budgets are identical across all methods. The comparison is measured not in privacy guarantees — all methods satisfy the same epsilon and delta — but in &lt;em&gt;stopping time&lt;/em&gt;: how many queries each method allows before calling a halt.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr9f5z4ivzhzmofe31pz3.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr9f5z4ivzhzmofe31pz3.jpeg" alt="Stopping time survival probability comparing mechanism-level and realisation-level filters" width="512" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 2: Stopping time survival P(T≥t) of mechanism-level privacy filters compared with our realisation-level privacy filter.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The realisation-level filter consistently permits more queries before stopping. The survival curve — showing the probability that the filter has not yet stopped at round t — stays elevated longer than the mechanism-level competitors. In practical terms, this means more training rounds in federated learning, more analysis steps in a statistical study, or more model iterations in a continuous learning pipeline, all without spending an extra epsilon coin.&lt;/p&gt;

&lt;p&gt;The gain is not marginal. In the scenarios tested, the realisation-level filter allows substantially more rounds before halting, particularly in early phases of a query sequence where actual leakages tend to be modest. The difference compounds: more rounds mean more learned signal, which in the medical imaging example might mean the difference between a model that screens for disease adequately and one that screens reliably.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Becomes Possible — and What Doesn't Yet
&lt;/h2&gt;

&lt;p&gt;Think about what this means for the hospital coalition training a heart disease model. Under mechanism-level accounting, engineers might calculate: "We can afford 200 training rounds before we exhaust our privacy budget." With realisation-level filtering, the same budget might sustain 320 or 400 rounds, depending on how the actual outputs happen to land during training. The model trained on 400 rounds will almost certainly outperform the one cut off at 200 — and the privacy promise to patients has not changed at all.&lt;/p&gt;

&lt;p&gt;Or consider a pharmaceutical company analyzing genomic data. Each query into the dataset costs privacy. With the old approach, researchers must submit their entire query plan before starting, pre-allocating budget to each step. With an adaptive realisation-level filter, they can run queries in response to what they find, stopping when the privacy budget runs dry, and trusting that the actual costs will often be lower than the catalogue price.&lt;/p&gt;

&lt;p&gt;The honest limits matter here, though. The paper proves that the filter works — meaning it delivers the privacy guarantee it promises — and it demonstrates numerically that it allows more queries on average. What it does not answer is how to choose the filter's parameters optimally for a given application. The filter requires a user-specified epsilon and delta, and the paper is agnostic about how to set them in a real system. In a regulatory context, that choice is anything but obvious.&lt;/p&gt;

&lt;p&gt;There is also a gap between proof and deployment. The mathematical machinery underlying the stopping-time proof is non-trivial, and translating it into a production-grade library requires careful implementation. A bug in the filter logic could undermine the privacy guarantee entirely — and unlike many software bugs, privacy bugs tend to be silent. They do not crash systems; they simply leak information that was supposed to stay hidden.&lt;/p&gt;

&lt;p&gt;Finally, the paper focuses on privacy filters as a framework but does not provide a full comparison against the most recent FFT-based composition methods, which have their own strengths for certain problem shapes. The landscape of privacy accounting tools is crowded and fast-moving, and situating a new technique precisely within that landscape is genuinely difficult.&lt;/p&gt;

&lt;p&gt;What the paper does accomplish is conceptually important: it shows that the gap between worst-case accounting and actual-case accounting is real, measurable, and exploitable without weakening the privacy guarantee. For years, privacy engineers have been paying full catalogue price for every query, even when the actual leakage was much smaller. This work is the first rigorous proof that you can read the receipts — and that reading them changes the bottom line.&lt;/p&gt;

&lt;p&gt;The jar of trust coins turns out to have been heavier than anyone thought.&lt;/p&gt;




&lt;p&gt;📄 &lt;a href="https://arxiv.org/abs/2604.08630" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2604.08630&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;tags: privacy, machinelearning, statistics, federatedlearning&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🇰🇷 Korean version on Velog: &lt;a href="https://velog.io/@tkdnel1002/m233p3fv" rel="noopener noreferrer"&gt;https://velog.io/@tkdnel1002/m233p3fv&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>python</category>
    </item>
    <item>
      <title>When the AI Learns to See and Think at the Same Time</title>
      <dc:creator>Bongho Tae</dc:creator>
      <pubDate>Sun, 26 Apr 2026 04:51:09 +0000</pubDate>
      <link>https://forem.com/xoqhdgh1002/when-the-ai-learns-to-see-and-think-at-the-same-time-235e</link>
      <guid>https://forem.com/xoqhdgh1002/when-the-ai-learns-to-see-and-think-at-the-same-time-235e</guid>
      <description>&lt;h2&gt;
  
  
  The Problem with Doing Everything in a Line
&lt;/h2&gt;

&lt;p&gt;Picture the last time you organized something genuinely complicated — a move across the country, a wedding, a conference. At some point, you probably realized that doing every task in sequence was killing you. You couldn't wait to finish booking the caterer before calling the venue, and you couldn't wait to confirm the venue before sending invitations. The entire operation required you to hold many threads simultaneously, farming out tasks to different people while you kept track of the whole picture.&lt;/p&gt;

&lt;p&gt;Now imagine that the person coordinating all of this could only use a telephone, and could only make one call at a time. That is, roughly, the state of most AI systems today when they face complex, real-world problems. They think in a line. They act in a line. And as tasks grow more intricate — research a topic, then design something, then write code, then verify the result — that single-file approach becomes not just slow but fundamentally inadequate.&lt;/p&gt;

&lt;p&gt;A new model from the Chinese AI lab Moonshot AI, called Kimi K2.5, takes direct aim at this constraint. It does so in two ways that, taken together, represent a meaningful shift in how AI systems are designed: it trains the model to genuinely understand both language and images as a single unified skill, rather than grafting vision onto a text-first brain as an afterthought. And it introduces something the researchers call Agent Swarm — a way of multiplying the AI into a small army of specialized workers that tackle sub-problems in parallel, then report back to a coordinating intelligence.&lt;/p&gt;

&lt;p&gt;Both ideas sound intuitive. But making them work in practice, and making them work &lt;em&gt;together&lt;/em&gt;, turned out to be genuinely hard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Seeing and Reading Have Always Fought Each Other
&lt;/h2&gt;

&lt;p&gt;Most powerful AI models today are, at their core, language machines. They were trained on enormous quantities of text — books, articles, code, conversations — and they learned the deep structure of human reasoning through words. Vision was added later, like fitting a seeing-eye dog with a translation earpiece: technically functional, but not the same as being born with both senses integrated.&lt;/p&gt;

&lt;p&gt;The problem with this approach is that the two skills pull against each other during training. Imagine trying to learn French and violin simultaneously, but on a rigid schedule: two hours of French, then two hours of violin, with no mixing allowed. You might get decent at both. But you'd never develop the fluid cross-modal thinking of a musician who hums a tune while writing its lyrics, each skill feeding the other in real time.&lt;/p&gt;

&lt;p&gt;The researchers behind K2.5 found something similar. When vision is added to a language model late in training — or when the two modalities are trained in separate phases — the model develops a kind of internal friction. Improving vision sometimes hurts language; improving language sometimes hurts vision. They conflict because they were never taught to speak to each other from the beginning.&lt;/p&gt;

&lt;p&gt;K2.5's answer was to insist on early integration. From the very first stages of pre-training — the massive, expensive phase where the model ingests hundreds of billions of words and images — text and vision tokens were mixed together in a constant ratio throughout. Think of it less like learning French and violin on a schedule, and more like growing up bilingual: the two languages don't just coexist in your brain, they shape each other's grammar, expand each other's vocabulary, and ultimately create a richer understanding of both than either would produce alone.&lt;/p&gt;
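
&lt;p&gt;To make "mixed together in a constant ratio" concrete, here is a minimal sketch of what such a data loader could look like. The ratio, the batch contents, and the stream names are illustrative assumptions, not numbers from the paper; the point is only that both modalities flow through the same loop from step one.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import random

def text_examples():
    """Stand-in for an endless stream of tokenized text documents."""
    while True:
        yield {"modality": "text", "tokens": [random.randrange(50_000) for _ in range(32)]}

def vision_examples():
    """Stand-in for an endless stream of image-plus-caption examples."""
    while True:
        yield {"modality": "vision", "tokens": [random.randrange(50_000) for _ in range(32)],
               "image_patches": 196}

def mixed_loader(text_weight=7, vision_weight=3, seed=0):
    """Interleave both streams at a constant ratio for the whole run,
    instead of training on text first and bolting vision on later."""
    rng = random.Random(seed)
    streams = {"text": text_examples(), "vision": vision_examples()}
    while True:
        pick = rng.choices(["text", "vision"], weights=[text_weight, vision_weight])[0]
        yield next(streams[pick])

loader = mixed_loader()
print([next(loader)["modality"] for _ in range(10)])
&lt;/code&gt;&lt;/pre&gt;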

&lt;h2&gt;
  
  
  The Surprising Power of Doing Almost Nothing
&lt;/h2&gt;

&lt;p&gt;Here is one of the counterintuitive findings buried in this paper, and it deserves a moment's attention.&lt;/p&gt;

&lt;p&gt;The conventional wisdom in AI training is that if you want a model to do something specific — say, interpret a chart, or follow a visual instruction, or use a tool when prompted by an image — you collect examples of those exact behaviors and train the model on them. You show it thousands of human-designed demonstrations. The model watches, imitates, and learns.&lt;/p&gt;

&lt;p&gt;The K2.5 team tried this. And it made things worse.&lt;/p&gt;

&lt;p&gt;They call what they actually found "zero-vision SFT," which sounds technical but encodes a beautifully strange insight. SFT stands for supervised fine-tuning — the phase of training where a model is shaped to follow instructions and behave helpfully, using human-labeled examples. "Zero-vision" means: during that phase, show the model &lt;em&gt;no&lt;/em&gt; visual examples at all. Just text.&lt;/p&gt;

&lt;p&gt;The result was that the model's visual reasoning capabilities activated anyway — and generalized better than when human demonstrations were provided.&lt;/p&gt;

&lt;p&gt;Why? The researchers' explanation is elegant. The pre-training phase had already established such deep connections between language and vision that the model had, in effect, already learned to think visually. Human-designed demonstrations of visual reasoning, it turns out, are a kind of straitjacket: they constrain the model to imitate specific patterns rather than applying its own already-rich visual understanding. By withholding those demonstrations, the team let the model draw on what it had already taught itself.&lt;/p&gt;

&lt;p&gt;The analogy that comes to mind is a writer who has read thousands of novels and deeply internalized the rhythms of prose. If you then give them a rigid template — "write your opening sentence this way, structure your paragraphs like this" — you may actually produce worse writing than if you'd simply told them the subject and let them work. The template interrupts a fluency they already possess.&lt;/p&gt;
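
&lt;p&gt;Written out as a schedule, the finding is simply that the middle stage goes dark on images. The dictionary below is a sketch of that three-stage recipe under invented names; the actual data mixtures and stage boundaries are the paper's, and are not reproduced here.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative three-stage recipe; stage names and data lists are assumptions,
# not configuration values reported in the paper.
training_recipe = {
    "1_pretraining": {
        "data": ["text corpora", "image-text pairs", "interleaved documents"],
        "note": "vision and text mixed at a constant ratio from the very start",
    },
    "2_supervised_fine_tuning": {
        "data": ["text-only instruction data"],
        "note": "zero-vision SFT: no visual demonstrations shown at this stage",
    },
    "3_reinforcement_learning": {
        "data": ["text tasks", "visual reasoning tasks with verifiable feedback"],
        "note": "visual skill is activated through practice here, not imitation earlier",
    },
}

for stage, config in training_recipe.items():
    print(stage, ":", ", ".join(config["data"]))
&lt;/code&gt;&lt;/pre&gt;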

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rjub0ozpb7ux8qtvn5e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rjub0ozpb7ux8qtvn5e.png" alt="Kimi K2.5 main benchmark results" width="800" height="440"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1: Kimi K2.5 main results, comparing performance across benchmark categories against leading proprietary and open-source models.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furp2ouktgpgyj6vqlrfv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furp2ouktgpgyj6vqlrfv.png" alt="Kimi K2.5 visual reasoning training curves" width="714" height="424"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2: Vision RL training curves on vision benchmarks, starting from minimal zero-vision SFT. As vision RL compute (FLOPs) scales up, performance continues to improve, showing that zero-vision activation generalizes effectively.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The curves in the figure above tell the story numerically: as the model was given more and more practice through reinforcement learning — a technique more like game-playing than imitation, where the model tries things and receives feedback on whether they worked — its visual understanding kept climbing. The message is that practice, not prescription, built the skill.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Training One Sense Sharpens the Other
&lt;/h2&gt;

&lt;p&gt;There is something even stranger in the results, and it directly contradicts an assumption that has quietly shaped AI development for years.&lt;/p&gt;

&lt;p&gt;When the team applied reinforcement learning to visual tasks — having the model practice interpreting images and graphs and receive feedback on whether it got things right — they found that the model's &lt;em&gt;language&lt;/em&gt; performance improved too. Not despite the visual training. Because of it.&lt;/p&gt;

&lt;p&gt;This is not obvious. It would be perfectly reasonable to assume that training on images uses up some finite capacity that was previously devoted to language, producing a tradeoff: more vision skill, less text skill. That is, roughly, what people assumed. The K2.5 results suggest the opposite: that genuine cross-modal integration creates a kind of cognitive leverage. Learning to reason carefully about what a chart is actually showing you makes you better at reasoning carefully about what a sentence is actually claiming.&lt;/p&gt;

&lt;p&gt;The analogy is cross-training in athletics. A marathon runner who adds strength training doesn't become a worse runner because the weights are "using up" running capacity. Done right, the strength work changes how the body moves, how forces transfer, how fatigue accumulates — and the runner comes back faster. The skills compound rather than compete.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Orchestra Problem
&lt;/h2&gt;

&lt;p&gt;With the model's visual and linguistic reasoning unified, the team turned to a different and arguably more fundamental problem: the architecture of how an AI tackles a hard task.&lt;/p&gt;

&lt;p&gt;Current AI systems, even sophisticated ones, operate sequentially. The model thinks step one, acts on step one, observes the result, thinks step two, acts on step two, and so on. This works. But it scales badly. If a genuinely complex task requires hundreds of steps — researching a topic across dozens of sources, then synthesizing the findings, then designing something based on those findings, then verifying the design — the time required grows linearly with the number of steps. You are waiting, always, for the model to finish its last thought before it can begin its next one.&lt;/p&gt;
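
&lt;p&gt;Reduced to a caricature, a sequential agent is a single loop: think about the next step, act on it, look at what happened, repeat. The toy sketch below (step names and timings are invented) exists only to show why total time grows in lockstep with the number of steps.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time

def run_sequentially(steps, seconds_per_step=0.05):
    """One step at a time: each must finish before the next can begin."""
    results = []
    for step in steps:
        time.sleep(seconds_per_step)   # stand-in for "think, call a tool, read the result"
        results.append(f"done: {step}")
    return results

steps = [f"examine source {i}" for i in range(20)]
start = time.perf_counter()
run_sequentially(steps)
elapsed = time.perf_counter() - start
print(f"20 steps, one at a time: {elapsed:.2f}s")   # roughly 20x the per-step cost
&lt;/code&gt;&lt;/pre&gt;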

&lt;p&gt;This is the telephone-one-call-at-a-time problem from the opening. And Agent Swarm is the solution.&lt;/p&gt;

&lt;p&gt;Think of how a large architectural firm tackles the design of a complex building. There is a lead architect who holds the overall vision and makes the decisions that require that vision. But there are also structural engineers, interior designers, environmental consultants, and cost estimators — each working on their own domain, in parallel, reporting back when their piece is complete. The lead architect doesn't wait for the structural drawings before commissioning the interior design study. The pieces develop concurrently and are integrated at the end.&lt;/p&gt;

&lt;p&gt;Agent Swarm works on the same principle. A coordinating AI — the orchestrator — receives a complex task and immediately analyzes it for parallelizability: which parts depend on other parts, and which parts can proceed simultaneously without waiting for anything else? It then spins up specialized sub-agents — an AI researcher, a fact-checker, a coder, a visual analyst — and dispatches them to work on their pieces at the same time. The sub-agents are not general intelligences; they are locked-down specialists, given specific tools and specific goals. The orchestrator alone is trained to adapt and coordinate.&lt;/p&gt;
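
&lt;p&gt;The sketch below captures the shape of that arrangement with ordinary Python threads: a coordinator splits a task into independent pieces, hands each to a narrow specialist, and merges the reports. Treat it as a structural cartoon rather than the real thing: in Agent Swarm the orchestrator is a trained model that decides the decomposition and the degree of parallelism on the fly, and the sub-agents call real tools rather than sleeping.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time
from concurrent.futures import ThreadPoolExecutor

def sub_agent(role, subtask, seconds=0.05):
    """A frozen specialist: fixed role, fixed tools, one narrow goal."""
    time.sleep(seconds)                # stand-in for searching, reading, coding...
    return f"[{role}] finished: {subtask}"

def orchestrator(task):
    """Decompose the task, dispatch independent pieces in parallel, synthesize the reports.
    (The decomposition is hard-coded here; in Agent Swarm the model produces it.)"""
    subtasks = [
        ("researcher",     f"gather sources on {task}"),
        ("fact_checker",   f"verify the key claims about {task}"),
        ("visual_analyst", f"read the charts related to {task}"),
        ("coder",          f"prototype the analysis for {task}"),
    ]
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        reports = list(pool.map(lambda pair: sub_agent(*pair), subtasks))
    return "synthesis of " + str(len(reports)) + " reports:\n" + "\n".join(reports)

start = time.perf_counter()
print(orchestrator("the quarterly energy report"))
print(f"wall time with four parallel workers: {time.perf_counter() - start:.2f}s")
&lt;/code&gt;&lt;/pre&gt;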

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvgdtn3dbpw7vuewb3wv3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvgdtn3dbpw7vuewb3wv3.png" alt="Agent Swarm architecture" width="800" height="433"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 3: An agent swarm has a trainable orchestrator that dynamically creates specialized frozen subagents and decomposes complex tasks into parallelizable subtasks for efficient distributed execution.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The result, according to the paper's measurements, is task completion time cut by up to a factor of 4.5 compared to doing the same work sequentially. On complex search and research tasks, Agent Swarm doesn't just speed things up — it also gets better answers, because the parallel workers cover more ground before the orchestrator synthesizes their findings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bqv2wkwb02h4j2is4zu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bqv2wkwb02h4j2is4zu.png" alt="Parallel agent reinforcement learning training" width="800" height="234"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 4: In the paper's parallel-agent reinforcement learning environment, training accuracy increases smoothly as training progresses. At the same time, the level of parallelism the model uses also gradually increases.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What is particularly interesting about Figure 4 is that the model learned when to multiply itself. As training proceeded and the model became better at solving hard problems, it spontaneously used more parallel agents. The more capable it became, the more it chose to delegate. A naïve reading might see this as the model becoming lazier; a more accurate reading is that it learned what experienced managers know — that the hardest problems are the ones most worth distributing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Numbers Actually Show
&lt;/h2&gt;

&lt;p&gt;The benchmark results are numerous and the comparisons carefully hedged, as they always are in papers that announce impressive performance. Kimi K2.5 is being compared against GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro — the frontier models from OpenAI, Anthropic, and Google respectively — and the picture is genuinely mixed, which is worth saying plainly.&lt;/p&gt;

&lt;p&gt;On agentic tasks — the tasks that require planning, using tools, browsing the web, and synthesizing information — K2.5 does well, particularly when Agent Swarm is engaged. On pure mathematical reasoning benchmarks like AIME and HMMT, it trails GPT-5.2 and Gemini 3 Pro somewhat. On knowledge recall tasks like SimpleQA, it trails Gemini significantly. It leads on several coding and web-browsing tasks, and performs strongly on visual understanding tests.&lt;/p&gt;

&lt;p&gt;The honest reading of these numbers is that K2.5 is a genuinely capable model with meaningful innovations, particularly in how it handles vision and how it organizes multi-step work. It is not uniformly ahead of the competition. What it offers that the others do not, as an open-source release, is the ability for researchers and developers to examine and build on its architecture — the Agent Swarm mechanism especially — without waiting for a proprietary API to expose those features.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Becomes Different
&lt;/h2&gt;

&lt;p&gt;Step back from the benchmarks for a moment and think about what these capabilities, combined, actually change.&lt;/p&gt;

&lt;p&gt;Consider a person trying to understand a dense medical report after a diagnosis. Currently, they might copy out the relevant sections and paste them into an AI chat window, painstakingly describing what the charts show. A system that genuinely integrates vision can look at the actual document — the actual graph of their bloodwork over time — and reason about it directly, not through a verbal description.&lt;/p&gt;

&lt;p&gt;Or consider a journalist trying to verify a complex claim that involves cross-referencing dozens of documents, each containing a mix of text, images, and data tables. A sequential AI, however smart, takes a long time because it must examine each source one by one. A parallel agent swarm can disperse across those sources simultaneously, fact-checking different claims in different documents at once, then bring the findings back to a central synthesizer.&lt;/p&gt;

&lt;p&gt;Or consider a small software team using an AI assistant to debug a complex system. The AI currently reasons through possibilities one at a time. A parallel architecture lets it pursue multiple diagnostic hypotheses simultaneously — testing one while continuing to reason about another — potentially compressing hours of investigation into minutes.&lt;/p&gt;

&lt;p&gt;These are not wild speculations. They are the natural extensions of what this paper demonstrates working in controlled conditions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Remains Uncertain
&lt;/h2&gt;

&lt;p&gt;There is a limit to how much one research paper can establish, and it is worth naming what this one does not answer.&lt;/p&gt;

&lt;p&gt;The Agent Swarm results are measured on benchmarks — structured tests with defined right answers. Real-world tasks are messier. They have ambiguous success criteria, contradictory sources, and edge cases that no benchmark designer anticipated. Whether parallel agent orchestration degrades gracefully when the sub-agents encounter genuinely unexpected situations — rather than simply being slower in the controlled case — is not yet clear.&lt;/p&gt;

&lt;p&gt;The "zero-vision SFT" finding is striking, but it is also a finding about a specific model at a specific scale with a specific pre-training recipe. Whether it generalizes — whether other labs could replicate the same counterintuitive benefit by withholding visual demonstrations — is an open question that requires independent verification.&lt;/p&gt;

&lt;p&gt;And the cross-modal enhancement claim — that training on vision improves language, and vice versa — is compelling in the aggregate benchmark numbers but harder to scrutinize mechanically. The paper shows that the numbers go up together; it does not fully show &lt;em&gt;why&lt;/em&gt;, in a way that would let someone predict when this benefit will appear and when it won't.&lt;/p&gt;

&lt;p&gt;None of this diminishes what the paper contributes. It presents a coherent, testable set of ideas about how to build AI systems that handle the full complexity of the world — text and images, sequential reasoning and parallel action — and it releases the trained model for others to examine and extend. In a field where many of the most significant advances stay locked inside proprietary systems, that openness is itself a contribution.&lt;/p&gt;

&lt;p&gt;The single-file telephone call, it turns out, was always an artificial constraint. What the architects of K2.5 have shown is that AI, given the right training, can learn to run a switchboard.&lt;/p&gt;

&lt;p&gt;📄 &lt;a href="https://arxiv.org/abs/2602.02276" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2602.02276&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;tags: artificialintelligence, multimodal, agenticsystems, machinelearning&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🇰🇷 Korean version on Velog: &lt;a href="https://velog.io/@tkdnel1002/eg28mz6h" rel="noopener noreferrer"&gt;https://velog.io/@tkdnel1002/eg28mz6h&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>artificialintelligen</category>
    </item>
    <item>
      <title>Time's Fingerprint: How AI Finally Learned to Read the Speed of the World</title>
      <dc:creator>Bongho Tae</dc:creator>
      <pubDate>Sat, 25 Apr 2026 05:24:59 +0000</pubDate>
      <link>https://forem.com/xoqhdgh1002/times-fingerprint-how-ai-finally-learned-to-read-the-speed-of-the-world-3l0k</link>
      <guid>https://forem.com/xoqhdgh1002/times-fingerprint-how-ai-finally-learned-to-read-the-speed-of-the-world-3l0k</guid>
      <description>&lt;h2&gt;
  
  
  The blur we never thought to ask about
&lt;/h2&gt;

&lt;p&gt;You have almost certainly watched a video that felt wrong before you could explain why. Maybe it was dashcam footage shared on social media — the traffic moving just a beat too briskly, the pedestrians crossing the street with a faint mechanical urgency, as though everyone had somewhere slightly too important to be. Or maybe it was the reverse: a sports clip slowed down to a crawl, the ball hanging in the air like something painted on silk, the crowd frozen mid-roar. Your brain registered something about time before your conscious mind caught up.&lt;/p&gt;

&lt;p&gt;That gut feeling — &lt;em&gt;this is moving at the wrong speed&lt;/em&gt; — is something humans do effortlessly and machines have, until very recently, struggled to do at all. A new paper from researchers at the University of Washington and Google changes that. They have taught a computer system not just to understand what is happening in a video, but to understand &lt;em&gt;when&lt;/em&gt; — to read the flow of time embedded in moving images the way a musician reads tempo from sheet music.&lt;/p&gt;

&lt;p&gt;The consequences turn out to be surprisingly far-reaching.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why computers went blind to speed
&lt;/h2&gt;

&lt;p&gt;Modern computer vision is remarkably capable. Given a video, existing systems can tell you that a dog is chasing a ball, that the man in the blue jacket is the same man who appeared three seconds earlier, that the faces in this clip belong to certain people. What these systems cannot reliably do is answer a simpler-sounding question: is this video playing at normal speed?&lt;/p&gt;

&lt;p&gt;The reason is subtler than it first appears. Think about what a video actually is: a sequence of still photographs shown so rapidly that the eye perceives motion. At 24 frames per second — the standard for film — you're seeing 24 photographs every second. At 240 frames per second — the speed of a high-end action camera — you're capturing ten times more moments. When that 240-frames-per-second footage is played back at 24 frames per second, you get the floating, dreamlike quality of slow motion. Every heartbeat of action is stretched into ten beats of screen time.&lt;/p&gt;

&lt;p&gt;Now, a machine looking at individual frames faces a chicken-and-egg problem: it sees a ball mid-flight, but how does it know whether that frame came from a 24fps normal-speed video or a 240fps slow-motion clip played back at one-tenth speed? The objects look identical. The scene looks identical. The motion, considered frame-by-frame, looks identical.&lt;/p&gt;

&lt;p&gt;This is why most computer vision research simply ignored the question. Speed was treated as a metadata problem — something you look up in the file's technical specifications, not something you read from the pixels themselves. But that assumption collapses the moment you're working with in-the-wild internet video, where metadata is unreliable, absent, or deliberately manipulated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Motion blur is time's fingerprint
&lt;/h2&gt;

&lt;p&gt;The breakthrough insight in this paper is that time actually does leave fingerprints on pixels — you just have to know where to look.&lt;/p&gt;

&lt;p&gt;Consider what happens to a photograph of a speeding motorcycle. If the shutter stays open even a fraction too long, the motorcycle doesn't appear as a crisp object. It smears. You see a streak, a ghost, a blur that traces the path of motion across the frame. This motion blur is not a flaw in the photograph. It is information. It is the camera's way of recording that something moved very fast during the brief window the shutter was open.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7x7rv5hwyopt2lxrfmbg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7x7rv5hwyopt2lxrfmbg.jpeg" alt="Motion blur on a fast-moving motorcycle" width="512" height="352"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Time Figure 2 Audio signal natu changes, its audio pitch shifts free cross-modal supervisi spectrogram (used only duri nearby frames. Higher playba&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The same logic applies to video. When a bicycle races down a mountain trail in real time, the background trees streak into horizontal smudges behind it. When that same footage is captured at high speed and played back slowly, each individual frame is sharper — there is less blur per frame, because the camera captured each moment during a much shorter window.&lt;/p&gt;
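
&lt;p&gt;The arithmetic behind that difference is small enough to write out. The frame rate caps how long any single frame's shutter can stay open, which is why high-frame-rate capture leaves so little blur in each frame; the numbers below are just the film and action-camera rates from earlier.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;capture_fps = 240      # high-speed action camera
playback_fps = 24      # standard film playback rate

# One real second of capture occupies this many seconds of screen time.
slowdown_factor = capture_fps / playback_fps
print(f"slow-motion factor: {slowdown_factor:.0f}x")          # 10x

# The shutter cannot stay open longer than one frame interval,
# so the per-frame exposure window shrinks as the frame rate rises.
max_exposure_ms_240 = 1000 / capture_fps
max_exposure_ms_24 = 1000 / playback_fps
print(f"max exposure per frame at 240 fps: {max_exposure_ms_240:.1f} ms")   # about 4.2 ms
print(f"max exposure per frame at  24 fps: {max_exposure_ms_24:.1f} ms")    # about 41.7 ms
&lt;/code&gt;&lt;/pre&gt;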

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbdzbcrijpu46euhbw5f.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbdzbcrijpu46euhbw5f.jpeg" alt="Cyclist on a mountain trail with motion blur in the background" width="512" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The researchers trained their model to read these cues the way a forensic analyst reads tire marks on asphalt — not just noticing that blur exists, but using its character, direction, and intensity to reconstruct what kind of motion produced it, and at what temporal scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1bc0gpd2nd01sdq3v1x.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1bc0gpd2nd01sdq3v1x.jpeg" alt="Mountain bike racer with strong motion blur showing speed" width="512" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A panning camera following a bird in flight, for instance, produces a very particular blur signature — the bird is sharp while the background dissolves into horizontal streaks, because the camera tracked the subject and let the world smear behind it. This kind of image is visually unmistakable as &lt;em&gt;fast&lt;/em&gt;, even if nothing in the semantic content — bird, sky, trees — carries that information directly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikj9dh98enxrtdohwczy.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikj9dh98enxrtdohwczy.jpeg" alt="Bird in flight photographed with panning motion blur" width="512" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The audio trick that changed everything
&lt;/h2&gt;

&lt;p&gt;Visual blur is one fingerprint of speed. But the paper's most elegant trick exploits a second one: sound.&lt;/p&gt;

&lt;p&gt;Here is something most people don't consciously think about: when you speed up a video, the audio pitch rises. Play a recording of a conversation at twice normal speed and everyone sounds like a cartoon character — voices become thin, reedy, almost helium-inflected. Slow it down to half speed and the same voices become impossibly low and thick, like a record player running out of battery.&lt;/p&gt;

&lt;p&gt;This happens for the same reason that a police siren sounds higher as it approaches you and lower as it recedes: the pitch of a sound is determined by the frequency of the sound waves reaching your ears, and that frequency changes when the source is moving (or, in this case, when time itself is compressed or expanded in playback).&lt;/p&gt;
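
&lt;p&gt;The effect is easy to reproduce numerically: playing samples back faster squeezes the waveform in time, which scales every frequency up by the same factor. The sine-wave toy below is an illustration of that relationship, not the paper's audio pipeline.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

sample_rate = 16_000                        # audio samples per second
t = np.arange(0, 1.0, 1 / sample_rate)      # one second of timestamps
tone = np.sin(2 * np.pi * 220 * t)          # a steady 220 Hz tone (the A below middle C)

def dominant_frequency(signal, rate):
    """Strongest frequency in the signal, read off an FFT magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1 / rate)
    return freqs[np.argmax(spectrum)]

# "Playing at 2x speed": keep every other sample but play them back
# at the original rate, so the waveform is compressed in time.
sped_up = tone[::2]

print(dominant_frequency(tone, sample_rate))      # about 220 Hz
print(dominant_frequency(sped_up, sample_rate))   # about 440 Hz: one octave higher
&lt;/code&gt;&lt;/pre&gt;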

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ktr1g5hwinwujkv2i8z.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ktr1g5hwinwujkv2i8z.jpeg" alt="Audio spectrogram showing frequency changes with playback speed" width="800" height="320"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2: As a video's playback speed changes, its audio pitch shifts, giving the model free cross-modal supervision; the spectrogram is used only during training. Higher playback speeds push the sound energy toward higher frequencies.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The researchers visualized this as a spectrogram — a map of which sound frequencies appear at which moments. In the image above, you can see the effect directly: the left side of the image, representing slower playback, shows sound energy clustered in lower frequencies, with the high-frequency regions dark and empty. On the right, where playback speed increases, the higher frequencies suddenly light up, the entire spectrum shifting upward like a musical key change written in light.&lt;/p&gt;

&lt;p&gt;This creates a profound opportunity. It means that the &lt;em&gt;same video&lt;/em&gt; carries two independent, corroborating signals about its own speed: the visual blur in the frames and the pitch signature in the audio. The model can compare these signals against each other, using each one to check and sharpen its reading of the other.&lt;/p&gt;

&lt;p&gt;This is what researchers call cross-modal supervision — using two different sensory channels as mutual teachers. Think of how a wine sommelier uses both smell and taste together to identify a vintage. Neither sense alone might be definitive, but the agreement between them, or the revealing discord, tells a richer story than either could alone. The model learns the relationship between visual speed cues and audio pitch cues by watching enormous amounts of ordinary video — without anyone labeling a single frame or telling the system what "slow motion" looks like.&lt;/p&gt;
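
&lt;p&gt;One way to express that mutual-teaching idea as an objective: a visual branch and an audio branch each estimate the playback speed, the loss penalizes disagreement between them, and, for clips the training pipeline deliberately retimed, it also penalizes each branch's distance from the known factor. The function below is a hedged sketch of that idea, not the specific loss used in the paper.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

def speed_consistency_loss(visual_pred, audio_pred, true_factor=None):
    """Toy objective: the two modalities should agree with each other, and
    (when we retimed the clip ourselves) with the known speed factor.
    Speeds are compared on a log scale so 2x and 0.5x are symmetric errors."""
    v = math.log(visual_pred)
    a = math.log(audio_pred)
    loss = (v - a) ** 2                      # cross-modal agreement term
    if true_factor is not None:              # self-generated label, when available
        target = math.log(true_factor)
        loss += (v - target) ** 2 + (a - target) ** 2
    return loss

print(speed_consistency_loss(visual_pred=2.1, audio_pred=1.9, true_factor=2.0))  # small
print(speed_consistency_loss(visual_pred=2.0, audio_pred=0.5))                   # large: modalities disagree
&lt;/code&gt;&lt;/pre&gt;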

&lt;h2&gt;
  
  
  Teaching a machine without a teacher
&lt;/h2&gt;

&lt;p&gt;This brings us to perhaps the most important methodological decision in the paper: everything described so far is learned without labels.&lt;/p&gt;

&lt;p&gt;In most machine learning, you need a human to annotate training data. Someone has to watch thousands of videos and write down: "this one is played at half speed," "this one is normal," "this one is sped up two times." This labeling process is expensive, slow, and bottlenecked by human attention. More fundamentally, it requires the person labeling to already know the answer — which is exactly what you're trying to teach the machine.&lt;/p&gt;

&lt;p&gt;The researchers sidestepped this entirely through a technique called self-supervised learning. Imagine teaching someone to recognize a forged signature without ever showing them examples of forgeries. Instead, you hand them a stack of authentic signatures and let them look for internal inconsistencies — places where the pen pressure, the angle, the rhythm of a stroke breaks with what the same hand produced moments earlier. They learn by noticing when something doesn't cohere, without anyone ever telling them what to look for.&lt;/p&gt;

&lt;p&gt;The model in this paper learns similarly. Researchers took ordinary internet videos and artificially sped some up, slowed others down, or mixed sections of different speeds. They then asked the model to detect these changes — not by consulting a label, but by noticing when the visual flow and audio pitch no longer fit together, or when the blur patterns across consecutive frames don't match the implied rhythm of motion. The "teacher" is the internal consistency of the video itself.&lt;/p&gt;
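
&lt;p&gt;Mechanically, the "teacher" can be manufactured from any ordinary clip: pick a speed factor at random, retime the frames (and audio) by that factor, and the factor itself becomes the training target, with no human labeling anywhere. The frame-index retiming below is a simplified sketch of that data-generation step, with made-up clip lengths and speed choices.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math
import random

def retimed_frame_indices(num_frames, speed):
    """Which original frames a player would show at the given speed: a factor of
    2.0 skips every other frame (sped up), 0.5 shows each frame twice (slowed down)."""
    shown = math.ceil(num_frames / speed)
    return [min(int(k * speed), num_frames - 1) for k in range(shown)]

def make_training_example(clip_frames, rng=None):
    """Self-supervised pair: (retimed clip, the speed factor used to retime it)."""
    rng = rng or random.Random()
    speed = rng.choice([0.25, 0.5, 1.0, 2.0, 4.0])
    retimed = [clip_frames[i] for i in retimed_frame_indices(len(clip_frames), speed)]
    return retimed, speed          # the label costs nothing: we chose it ourselves

clip = list(range(48))             # stand-in for 48 decoded frames
retimed, label = make_training_example(clip, rng=random.Random(7))
print(len(clip), "frames retimed to", len(retimed), "frames; target speed factor:", label)
&lt;/code&gt;&lt;/pre&gt;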

&lt;h2&gt;
  
  
  Building the world's largest slow-motion library
&lt;/h2&gt;

&lt;p&gt;Once you have a system that can reliably tell whether a video contains slow motion, you can use that system as a filter — a tireless, infinitely patient curator.&lt;/p&gt;

&lt;p&gt;The internet contains an enormous amount of slow-motion footage mixed in with billions of ordinary videos. The problem is finding it: there's no reliable, consistent way to locate it from metadata alone. People tag and title videos erratically. One creator calls the same footage "slo-mo," another calls it "60fps," another calls it nothing at all.&lt;/p&gt;

&lt;p&gt;The researchers turned their trained model loose on this haystack. By processing large collections of video and flagging clips where the model detected slow-motion signatures — the characteristic blur, the pitch-shifted audio, the visual density of temporal detail — they assembled the largest slow-motion dataset ever collected from naturally occurring sources.&lt;/p&gt;

&lt;p&gt;This matters because slow-motion footage is genuinely different from ordinary video in a way that matters for AI training. Think of ordinary video as a novel that describes a battle in broad strokes — armies clash, a hero falls, the tide turns. Slow-motion footage is like a frame-by-frame graphic novel of the same battle, where every sword stroke and expression is captured in full detail. For a machine learning to understand motion, physics, and causality, that detail is not decorative. It is the text.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the machine learns to control time
&lt;/h2&gt;

&lt;p&gt;The paper's most forward-looking section describes two things the researchers built using all this acquired understanding: a system that generates video at a specified speed, and a system that converts low-quality, blurry, low-frame-rate video into high-quality slow motion.&lt;/p&gt;

&lt;p&gt;The first — speed-conditioned video generation — is something like teaching an illustrator to draw differently depending on a mood instruction. Ask them to draw a waterfall as "frozen," and they'll use sharp lines, crystalline forms, stillness implied in every edge. Ask them to draw the same waterfall as "rushing," and the same elements become streaks, arcs, foam caught in mid-scatter. The instruction shapes every aesthetic decision, not just the subject matter. Here, instead of artistic mood, the instruction is temporal: generate this scene as though captured at half normal speed, or double normal speed. The model learns to make every visual choice — how sharp to render edges, how much to blur movement, how to distribute motion across frames — consistent with the specified temporal flow.&lt;/p&gt;

&lt;p&gt;The second — temporal super-resolution — is arguably the more practically remarkable achievement. Given a video that is blurry, low-frame-rate, and temporally thin (imagine footage from a security camera, or a clip compressed heavily for file size), the system reconstructs what the in-between moments probably looked like. This is not guessing randomly. It is inference constrained by everything the model has learned about how motion works, how blur distributes across a scene, and how things in the physical world actually move between recorded frames.&lt;/p&gt;

&lt;p&gt;Think of how a skilled art restorer approaches a damaged oil painting. Faced with sections where the paint has flaked away entirely, they don't fill in the gaps with random colors. They study the surrounding strokes, the artist's technique as visible in intact sections, the logic of the depicted scene — and from all of this, they reconstruct what almost certainly was there. The result is not certainty, but it is informed reconstruction, and for many purposes it is better than leaving the gap blank.&lt;/p&gt;

&lt;h2&gt;
  
  
  What becomes possible now
&lt;/h2&gt;

&lt;p&gt;These capabilities, combined, begin to shift what is possible in several concrete domains.&lt;/p&gt;

&lt;p&gt;Consider a surgeon training on video of a delicate procedure. Currently, the training footage may have been captured on standard medical cameras at rates that simply don't capture the full motion of the most critical moments — the tension and release of a suture, the exact angle of an incision. With temporal super-resolution, the same footage could be enriched with recovered in-between frames, giving trainees and instructors a more complete picture of technique.&lt;/p&gt;

&lt;p&gt;Or consider a forensic analyst asked whether a viral video of an incident has been manipulated — specifically, whether someone sped up footage to make a crowd look more menacing, or slowed it down to make an action look more deliberate than it was. These techniques give investigators a systematic way to test that question, looking for the inconsistencies between visual and audio speed signatures that arise when footage has been post-processed — the equivalent of finding anachronistic fiber in a supposedly antique cloth.&lt;/p&gt;

&lt;p&gt;For the film industry, speed-conditioned generation opens the possibility of creating cinematic slow motion in post-production, without the cost of high-speed cameras. What currently requires tens of thousands of dollars in equipment could, if these techniques mature, be applied as a computational process to footage captured with ordinary cameras.&lt;/p&gt;

&lt;p&gt;And at a deeper level, there is something philosophically significant about what this paper is pointing toward: the idea that time itself is a visual dimension that can be learned, not just assumed. Most AI systems that watch video treat it as a sequence of images. This paper treats it as a recording of temporal flow — and argues that how things unfold across time is as learnable, and as teachable, as what objects look like or where they are.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the paper doesn't answer
&lt;/h2&gt;

&lt;p&gt;There are honest gaps here worth noting. The audio-based speed detection, elegant as it is, is useless on silent video — a substantial fraction of internet content. The visual signals alone carry less certainty in certain kinds of footage: scenes with little motion, static shots, or carefully stabilized camera work where blur signatures are deliberately suppressed by stabilization software.&lt;/p&gt;

&lt;p&gt;More fundamentally, the temporal super-resolution system, like all such reconstruction methods, is making educated inferences about what it didn't see. In most applications, this is fine. But in forensic or legal contexts, a system that fills in moments it never observed is a system that can produce compelling artifacts — convincing reconstructions of things that may not have happened quite that way. The capability and the caution need to develop together.&lt;/p&gt;

&lt;p&gt;And the paper is still largely a proof-of-concept for some of the generation results. The generated videos, while compelling, show the artifacts and limitations familiar to anyone who has watched AI-generated video for more than a few seconds. The principle is demonstrated; the product-quality execution is still ahead.&lt;/p&gt;

&lt;p&gt;But the direction is clear, and the foundation is sound. Time has always moved through video. Now, finally, the machines are starting to notice.&lt;/p&gt;

&lt;p&gt;📄 &lt;a href="https://arxiv.org/abs/2604.21931v1" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2604.21931v1&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;tags: computervision, videogeneration, selfsupervisedlearning, temporalai&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🇰🇷 Korean version on Velog: &lt;a href="https://velog.io/@tkdnel1002/4jkzs29p" rel="noopener noreferrer"&gt;https://velog.io/@tkdnel1002/4jkzs29p&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>computervision</category>
      <category>videogeneration</category>
      <category>selfsupervisedlearni</category>
      <category>temporalai</category>
    </item>
  </channel>
</rss>
