<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Bongho Tae</title>
    <description>The latest articles on Forem by Bongho Tae (@xoqhdgh1002).</description>
    <link>https://forem.com/xoqhdgh1002</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3896559%2F3b7e6ff4-85a9-47b3-a452-08b8c7ea14d3.png</url>
      <title>Forem: Bongho Tae</title>
      <link>https://forem.com/xoqhdgh1002</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/xoqhdgh1002"/>
    <language>en</language>
    <item>
      <title>When the AI Learns to See and Think at the Same Time</title>
      <dc:creator>Bongho Tae</dc:creator>
      <pubDate>Sun, 26 Apr 2026 04:51:09 +0000</pubDate>
      <link>https://forem.com/xoqhdgh1002/when-the-ai-learns-to-see-and-think-at-the-same-time-235e</link>
      <guid>https://forem.com/xoqhdgh1002/when-the-ai-learns-to-see-and-think-at-the-same-time-235e</guid>
      <description>&lt;h2&gt;
  
  
  The Problem with Doing Everything in a Line
&lt;/h2&gt;

&lt;p&gt;Picture the last time you organized something genuinely complicated — a move across the country, a wedding, a conference. At some point, you probably realized that doing every task in sequence was killing you. You couldn't wait to finish booking the caterer before calling the venue, and you couldn't wait to confirm the venue before sending invitations. The entire operation required you to hold many threads simultaneously, farming out tasks to different people while you kept track of the whole picture.&lt;/p&gt;

&lt;p&gt;Now imagine that the person coordinating all of this could only use a telephone, and could only make one call at a time. That is, roughly, the state of most AI systems today when they face complex, real-world problems. They think in a line. They act in a line. And as tasks grow more intricate — research a topic, then design something, then write code, then verify the result — that single-file approach becomes not just slow but fundamentally inadequate.&lt;/p&gt;

&lt;p&gt;A new model from the Chinese AI lab Moonshot AI, called Kimi K2.5, takes direct aim at this constraint. It does so in two ways that, taken together, represent a meaningful shift in how AI systems are designed: it trains the model to genuinely understand both language and images as a single unified skill, rather than grafting vision onto a text-first brain as an afterthought. And it introduces something the researchers call Agent Swarm — a way of multiplying the AI into a small army of specialized workers that tackle sub-problems in parallel, then report back to a coordinating intelligence.&lt;/p&gt;

&lt;p&gt;Both ideas sound intuitive. But making them work in practice, and making them work &lt;em&gt;together&lt;/em&gt;, turned out to be genuinely hard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Seeing and Reading Have Always Fought Each Other
&lt;/h2&gt;

&lt;p&gt;Most powerful AI models today are, at their core, language machines. They were trained on enormous quantities of text — books, articles, code, conversations — and they learned the deep structure of human reasoning through words. Vision was added later, like fitting a seeing-eye dog with a translation earpiece: technically functional, but not the same as being born with both senses integrated.&lt;/p&gt;

&lt;p&gt;The problem with this approach is that the two skills pull against each other during training. Imagine trying to learn French and violin simultaneously, but on a rigid schedule: two hours of French, then two hours of violin, with no mixing allowed. You might get decent at both. But you'd never develop the fluid cross-modal thinking of a musician who hums a tune while writing its lyrics, each skill feeding the other in real time.&lt;/p&gt;

&lt;p&gt;The researchers behind K2.5 found something similar. When vision is added to a language model late in training — or when the two modalities are trained in separate phases — the model develops a kind of internal friction. Improving vision sometimes hurts language; improving language sometimes hurts vision. They conflict because they were never taught to speak to each other from the beginning.&lt;/p&gt;

&lt;p&gt;K2.5's answer was to insist on early integration. From the very first stages of pre-training — the massive, expensive phase where the model ingests hundreds of billions of words and images — text and vision tokens were mixed together in a constant ratio throughout. Think of it less like learning French and violin on a schedule, and more like growing up bilingual: the two languages don't just coexist in your brain, they shape each other's grammar, expand each other's vocabulary, and ultimately create a richer understanding of both than either would produce alone.&lt;/p&gt;
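
&lt;p&gt;To make "constant ratio" concrete, here is a minimal sketch of what such interleaving could look like in a data pipeline. The fraction, stream names, and batch structure are illustrative assumptions, not Moonshot AI's actual recipe.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import random

# Hypothetical data-mixing sketch: keep vision samples at a fixed fraction of
# every pre-training batch, from the first step to the last, instead of
# training on text first and bolting vision on afterwards.
VISION_FRACTION = 0.3  # assumed value, for illustration only

def mixed_batch(text_stream, vision_stream, batch_size=32):
    """Draw one batch whose text/vision composition never drifts."""
    n_vision = int(batch_size * VISION_FRACTION)
    n_text = batch_size - n_vision
    batch = [next(text_stream) for _ in range(n_text)]
    batch += [next(vision_stream) for _ in range(n_vision)]
    random.shuffle(batch)  # interleave so no long single-modality runs appear
    return batch
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The only point of the sketch is the invariant: the composition of each batch stays the same across the whole run, which is what "early integration" means in practice.&lt;/p&gt;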

&lt;h2&gt;
  
  
  The Surprising Power of Doing Almost Nothing
&lt;/h2&gt;

&lt;p&gt;Here is one of the counterintuitive findings buried in this paper, and it deserves a moment's attention.&lt;/p&gt;

&lt;p&gt;The conventional wisdom in AI training is that if you want a model to do something specific — say, interpret a chart, or follow a visual instruction, or use a tool when prompted by an image — you collect examples of those exact behaviors and train the model on them. You show it thousands of human-designed demonstrations. The model watches, imitates, and learns.&lt;/p&gt;

&lt;p&gt;The K2.5 team tried this. And it made things worse.&lt;/p&gt;

&lt;p&gt;They call what they actually found "zero-vision SFT," which sounds technical but encodes a beautifully strange insight. SFT stands for supervised fine-tuning — the phase of training where a model is shaped to follow instructions and behave helpfully, using human-labeled examples. "Zero-vision" means: during that phase, show the model &lt;em&gt;no&lt;/em&gt; visual examples at all. Just text.&lt;/p&gt;

&lt;p&gt;The result was that the model's visual reasoning capabilities activated anyway — and generalized better than when human demonstrations were provided.&lt;/p&gt;

&lt;p&gt;Why? The researchers' explanation is elegant. The pre-training phase had already established such deep connections between language and vision that the model had, in effect, already learned to think visually. Human-designed demonstrations of visual reasoning, it turns out, are a kind of straitjacket: they constrain the model to imitate specific patterns rather than applying its own already-rich visual understanding. By withholding those demonstrations, the team let the model draw on what it had already taught itself.&lt;/p&gt;

&lt;p&gt;The analogy that comes to mind is a writer who has read thousands of novels and deeply internalized the rhythms of prose. If you then give them a rigid template — "write your opening sentence this way, structure your paragraphs like this" — you may actually produce worse writing than if you'd simply told them the subject and let them work. The template interrupts a fluency they already possess.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rjub0ozpb7ux8qtvn5e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rjub0ozpb7ux8qtvn5e.png" alt="Kimi K2.5 main benchmark results" width="800" height="440"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1: Kimi K2.5 main results, comparing performance across benchmark categories against leading proprietary and open-source models.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furp2ouktgpgyj6vqlrfv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furp2ouktgpgyj6vqlrfv.png" alt="Kimi K2.5 visual reasoning training curves" width="714" height="424"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2: Vision RL training curves on vision benchmarks, starting from minimal zero-vision SFT. As vision RL FLOPs scale up, performance continues to improve, demonstrating that zero-vision activation generalizes effectively.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The curves in the figure above tell the story numerically: as the model was given more and more practice through reinforcement learning — a technique more like game-playing than imitation, where the model tries things and receives feedback on whether they worked — its visual understanding kept climbing. The message is that practice, not prescription, built the skill.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Training One Sense Sharpens the Other
&lt;/h2&gt;

&lt;p&gt;There is something even stranger in the results, and it directly contradicts an assumption that has quietly shaped AI development for years.&lt;/p&gt;

&lt;p&gt;When the team applied reinforcement learning to visual tasks — having the model practice interpreting images and graphs and receive feedback on whether it got things right — they found that the model's &lt;em&gt;language&lt;/em&gt; performance improved too. Not despite the visual training. Because of it.&lt;/p&gt;

&lt;p&gt;This is not obvious. It would be perfectly reasonable to assume that training on images uses up some finite capacity that was previously devoted to language, producing a tradeoff: more vision skill, less text skill. That is, roughly, what people assumed. The K2.5 results suggest the opposite: that genuine cross-modal integration creates a kind of cognitive leverage. Learning to reason carefully about what a chart is actually showing you makes you better at reasoning carefully about what a sentence is actually claiming.&lt;/p&gt;

&lt;p&gt;The analogy is cross-training in athletics. A marathon runner who adds strength training doesn't become a worse runner because the weights are "using up" running capacity. Done right, the strength work changes how the body moves, how forces transfer, how fatigue accumulates — and the runner comes back faster. The skills compound rather than compete.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Orchestra Problem
&lt;/h2&gt;

&lt;p&gt;With the model's visual and linguistic reasoning unified, the team turned to a different and arguably more fundamental problem: the architecture of how an AI tackles a hard task.&lt;/p&gt;

&lt;p&gt;Current AI systems, even sophisticated ones, operate sequentially. The model thinks step one, acts on step one, observes the result, thinks step two, acts on step two, and so on. This works. But it scales badly. If a genuinely complex task requires hundreds of steps — researching a topic across dozens of sources, then synthesizing the findings, then designing something based on those findings, then verifying the design — the time required grows linearly with the number of steps. You are waiting, always, for the model to finish its last thought before it can begin its next one.&lt;/p&gt;

&lt;p&gt;This is the telephone-one-call-at-a-time problem from the opening. And Agent Swarm is the solution.&lt;/p&gt;

&lt;p&gt;Think of how a large architectural firm tackles the design of a complex building. There is a lead architect who holds the overall vision and makes the decisions that require that vision. But there are also structural engineers, interior designers, environmental consultants, and cost estimators — each working on their own domain, in parallel, reporting back when their piece is complete. The lead architect doesn't wait for the structural drawings before commissioning the interior design study. The pieces develop concurrently and are integrated at the end.&lt;/p&gt;

&lt;p&gt;Agent Swarm works on the same principle. A coordinating AI — the orchestrator — receives a complex task and immediately analyzes it for parallelizability: which parts depend on other parts, and which parts can proceed simultaneously without waiting for anything else? It then spins up specialized sub-agents — an AI researcher, a fact-checker, a coder, a visual analyst — and dispatches them to work on their pieces at the same time. The sub-agents are not general intelligences; they are locked-down specialists, given specific tools and specific goals. The orchestrator alone is trained to adapt and coordinate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvgdtn3dbpw7vuewb3wv3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvgdtn3dbpw7vuewb3wv3.png" alt="Agent Swarm architecture" width="800" height="433"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 3: An agent swarm has a trainable orchestrator that dynamically creates specialized frozen subagents and decomposes complex tasks into parallelizable subtasks for efficient distributed execution.&lt;/em&gt;&lt;/p&gt;
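
&lt;p&gt;A rough way to picture the control flow in code (a sketch of the orchestration pattern, not the paper's implementation; the sub-agent roles and the task decomposition are invented for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import asyncio

# Sketch of the orchestrator pattern: split a task into independent subtasks,
# dispatch frozen specialist sub-agents concurrently, then synthesize.
async def run_subagent(role, subtask):
    """A locked-down specialist: fixed tools, fixed goal, no coordination duties."""
    await asyncio.sleep(0)  # stands in for real tool calls, browsing, coding, etc.
    return f"[{role}] result for: {subtask}"

async def orchestrate(task):
    # 1. Analyze the task for parallelizability (here, a trivial stand-in).
    subtasks = {
        "researcher": f"gather sources on {task}",
        "fact_checker": f"verify key claims about {task}",
        "coder": f"prototype code for {task}",
    }
    # 2. Dispatch all independent subtasks at once rather than one after another.
    results = await asyncio.gather(
        *(run_subagent(role, sub) for role, sub in subtasks.items())
    )
    # 3. Only the orchestrator integrates; sub-agents never coordinate with each other.
    return "\n".join(results)

print(asyncio.run(orchestrate("agentic AI evaluation")))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The property that matters is in step 2: wall-clock time is governed by the slowest sub-agent rather than by the sum of all of them, which is where the reported speedups come from.&lt;/p&gt;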

&lt;p&gt;The result, according to the paper's measurements, is task completion up to 4.5 times faster than doing the same work sequentially. On complex search and research tasks, Agent Swarm doesn't just speed things up — it also gets better answers, because the parallel workers cover more ground before the orchestrator synthesizes their findings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bqv2wkwb02h4j2is4zu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bqv2wkwb02h4j2is4zu.png" alt="Parallel agent reinforcement learning training" width="800" height="234"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 4: In the parallel-agent reinforcement learning environment, accuracy increases smoothly as training progresses, and the degree of parallelism the model employs gradually increases alongside it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What is particularly interesting about Figure 4 is that the model learned when to multiply itself. As training proceeded and the model became better at solving hard problems, it spontaneously used more parallel agents. The more capable it became, the more it chose to delegate. A naïve reading might see this as the model becoming lazier; a more accurate reading is that it learned what experienced managers know — that the hardest problems are the ones most worth distributing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Numbers Actually Show
&lt;/h2&gt;

&lt;p&gt;The benchmark results are numerous and the comparisons carefully hedged, as they always are in papers that announce impressive performance. Kimi K2.5 is being compared against GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro — the frontier models from OpenAI, Anthropic, and Google respectively — and the picture is genuinely mixed, which is worth saying plainly.&lt;/p&gt;

&lt;p&gt;On agentic tasks — the tasks that require planning, using tools, browsing the web, and synthesizing information — K2.5 does well, particularly when Agent Swarm is engaged. On pure mathematical reasoning benchmarks like AIME and HMMT, it trails GPT-5.2 and Gemini 3 Pro somewhat. On knowledge recall tasks like SimpleQA, it trails Gemini significantly. It leads on several coding and web-browsing tasks, and performs strongly on visual understanding tests.&lt;/p&gt;

&lt;p&gt;The honest reading of these numbers is that K2.5 is a genuinely capable model with meaningful innovations, particularly in how it handles vision and how it organizes multi-step work. It is not uniformly ahead of the competition. What it offers that the others do not, as an open-source release, is the ability for researchers and developers to examine and build on its architecture — the Agent Swarm mechanism especially — without waiting for a proprietary API to expose those features.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Becomes Different
&lt;/h2&gt;

&lt;p&gt;Step back from the benchmarks for a moment and think about what these capabilities, combined, actually change.&lt;/p&gt;

&lt;p&gt;Consider a person trying to understand a dense medical report after a diagnosis. Currently, they might copy out the relevant sections and paste them into an AI chat window, painstakingly describing what the charts show. A system that genuinely integrates vision can look at the actual document — the actual graph of their bloodwork over time — and reason about it directly, not through a verbal description.&lt;/p&gt;

&lt;p&gt;Or consider a journalist trying to verify a complex claim that involves cross-referencing dozens of documents, each containing a mix of text, images, and data tables. A sequential AI, however smart, takes a long time because it must examine each source one by one. A parallel agent swarm can disperse across those sources simultaneously, fact-checking different claims in different documents at once, then bring the findings back to a central synthesizer.&lt;/p&gt;

&lt;p&gt;Or consider a small software team using an AI assistant to debug a complex system. The AI currently reasons through possibilities one at a time. A parallel architecture lets it pursue multiple diagnostic hypotheses simultaneously — testing one while continuing to reason about another — potentially compressing hours of investigation into minutes.&lt;/p&gt;

&lt;p&gt;These are not wild speculations. They are the natural extensions of what this paper demonstrates working in controlled conditions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Remains Uncertain
&lt;/h2&gt;

&lt;p&gt;There is a limit to how much one research paper can establish, and it is worth naming what this one does not answer.&lt;/p&gt;

&lt;p&gt;The Agent Swarm results are measured on benchmarks — structured tests with defined right answers. Real-world tasks are messier. They have ambiguous success criteria, contradictory sources, and edge cases that no benchmark designer anticipated. Whether parallel agent orchestration degrades gracefully when the sub-agents encounter genuinely unexpected situations — rather than simply being slower in the controlled case — is not yet clear.&lt;/p&gt;

&lt;p&gt;The "zero-vision SFT" finding is striking, but it is also a finding about a specific model at a specific scale with a specific pre-training recipe. Whether it generalizes — whether other labs could replicate the same counterintuitive benefit by withholding visual demonstrations — is an open question that requires independent verification.&lt;/p&gt;

&lt;p&gt;And the cross-modal enhancement claim — that training on vision improves language, and vice versa — is compelling in the aggregate benchmark numbers but harder to scrutinize mechanically. The paper shows that the numbers go up together; it does not fully show &lt;em&gt;why&lt;/em&gt;, in a way that would let someone predict when this benefit will appear and when it won't.&lt;/p&gt;

&lt;p&gt;None of this diminishes what the paper contributes. It presents a coherent, testable set of ideas about how to build AI systems that handle the full complexity of the world — text and images, sequential reasoning and parallel action — and it releases the trained model for others to examine and extend. In a field where many of the most significant advances stay locked inside proprietary systems, that openness is itself a contribution.&lt;/p&gt;

&lt;p&gt;The single-file telephone call, it turns out, was always an artificial constraint. What the architects of K2.5 have shown is that AI, given the right training, can learn to run a switchboard.&lt;/p&gt;

&lt;p&gt;📄 &lt;a href="https://arxiv.org/abs/2602.02276" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2602.02276&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;tags: artificialintelligence, multimodal, agenticsystems, machinelearning&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🇰🇷 Korean version on Velog: &lt;a href="https://velog.io/@tkdnel1002/eg28mz6h" rel="noopener noreferrer"&gt;https://velog.io/@tkdnel1002/eg28mz6h&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>artificialintelligence</category>
    </item>
    <item>
      <title>Time's Fingerprint: How AI Finally Learned to Read the Speed of the World</title>
      <dc:creator>Bongho Tae</dc:creator>
      <pubDate>Sat, 25 Apr 2026 05:24:59 +0000</pubDate>
      <link>https://forem.com/xoqhdgh1002/times-fingerprint-how-ai-finally-learned-to-read-the-speed-of-the-world-3l0k</link>
      <guid>https://forem.com/xoqhdgh1002/times-fingerprint-how-ai-finally-learned-to-read-the-speed-of-the-world-3l0k</guid>
      <description>&lt;h2&gt;
  
  
  The blur we never thought to ask about
&lt;/h2&gt;

&lt;p&gt;You have almost certainly watched a video that felt wrong before you could explain why. Maybe it was dashcam footage shared on social media — the traffic moving just a beat too briskly, the pedestrians crossing the street with a faint mechanical urgency, as though everyone had somewhere slightly too important to be. Or maybe it was the reverse: a sports clip slowed down to a crawl, the ball hanging in the air like something painted on silk, the crowd frozen mid-roar. Your brain registered something about time before your conscious mind caught up.&lt;/p&gt;

&lt;p&gt;That gut feeling — &lt;em&gt;this is moving at the wrong speed&lt;/em&gt; — is something humans do effortlessly and machines have, until very recently, struggled to do at all. A new paper from researchers at the University of Washington and Google changes that. They have taught a computer system not just to understand what is happening in a video, but to understand &lt;em&gt;when&lt;/em&gt; — to read the flow of time embedded in moving images the way a musician reads tempo from sheet music.&lt;/p&gt;

&lt;p&gt;The consequences turn out to be surprisingly far-reaching.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why computers went blind to speed
&lt;/h2&gt;

&lt;p&gt;Modern computer vision is remarkably capable. Given a video, existing systems can tell you that a dog is chasing a ball, that the man in the blue jacket is the same man who appeared three seconds earlier, that the faces in this clip belong to certain people. What these systems cannot reliably do is answer a simpler-sounding question: is this video playing at normal speed?&lt;/p&gt;

&lt;p&gt;The reason is subtler than it first appears. Think about what a video actually is: a sequence of still photographs shown so rapidly that the eye perceives motion. At 24 frames per second — the standard for film — you're seeing 24 photographs every second. At 240 frames per second — the speed of a high-end action camera — you're capturing ten times more moments. When that 240-frames-per-second footage is played back at 24 frames per second, you get the floating, dreamlike quality of slow motion. Every heartbeat of action is stretched into ten beats of screen time.&lt;/p&gt;
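
&lt;p&gt;A quick back-of-the-envelope version of those numbers:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;capture_fps = 240   # high-speed capture
playback_fps = 24   # standard playback rate

slowdown = capture_fps / playback_fps
print(slowdown)     # 10.0: one real second stretches into ten seconds on screen

# Each captured frame can expose for at most 1/capture_fps seconds, so
# high-speed frames are also individually sharper (less motion blur per frame).
print(1 / playback_fps)   # ~0.042 s maximum exposure at 24 fps
print(1 / capture_fps)    # ~0.004 s maximum exposure at 240 fps
&lt;/code&gt;&lt;/pre&gt;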

&lt;p&gt;Now, a machine looking at individual frames faces a basic ambiguity: it sees a ball mid-flight, but how does it know whether that frame came from a 24fps normal-speed video or a 240fps slow-motion clip played back at one-tenth speed? The objects look identical. The scene looks identical. And to a system that reads only the semantic content of each frame, the motion looks identical too.&lt;/p&gt;

&lt;p&gt;This is why most computer vision research simply ignored the question. Speed was treated as a metadata problem — something you look up in the file's technical specifications, not something you read from the pixels themselves. But that assumption collapses the moment you're working with in-the-wild internet video, where metadata is unreliable, absent, or deliberately manipulated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Motion blur is time's fingerprint
&lt;/h2&gt;

&lt;p&gt;The breakthrough insight in this paper is that time actually does leave fingerprints on pixels — you just have to know where to look.&lt;/p&gt;

&lt;p&gt;Consider what happens to a photograph of a speeding motorcycle. If the shutter stays open even a fraction too long, the motorcycle doesn't appear as a crisp object. It smears. You see a streak, a ghost, a blur that traces the path of motion across the frame. This motion blur is not a flaw in the photograph. It is information. It is the camera's way of recording that something moved very fast during the brief window the shutter was open.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7x7rv5hwyopt2lxrfmbg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7x7rv5hwyopt2lxrfmbg.jpeg" alt="Motion blur on a fast-moving motorcycle" width="512" height="352"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Time Figure 2 Audio signal natu changes, its audio pitch shifts free cross-modal supervisi spectrogram (used only duri nearby frames. Higher playba&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The same logic applies to video. When a bicycle races down a mountain trail in real time, the background trees streak into horizontal smudges behind it. When that same footage is captured at high speed and played back slowly, each individual frame is sharper — there is less blur per frame, because the camera captured each moment during a much shorter window.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbdzbcrijpu46euhbw5f.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbdzbcrijpu46euhbw5f.jpeg" alt="Cyclist on a mountain trail with motion blur in the background" width="512" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The researchers trained their model to read these cues the way a forensic analyst reads tire marks on asphalt — not just noticing that blur exists, but using its character, direction, and intensity to reconstruct what kind of motion produced it, and at what temporal scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1bc0gpd2nd01sdq3v1x.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1bc0gpd2nd01sdq3v1x.jpeg" alt="Mountain bike racer with strong motion blur showing speed" width="512" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A panning camera following a bird in flight, for instance, produces a very particular blur signature — the bird is sharp while the background dissolves into horizontal streaks, because the camera tracked the subject and let the world smear behind it. This kind of image is visually unmistakable as &lt;em&gt;fast&lt;/em&gt;, even if nothing in the semantic content — bird, sky, trees — carries that information directly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikj9dh98enxrtdohwczy.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikj9dh98enxrtdohwczy.jpeg" alt="Bird in flight photographed with panning motion blur" width="512" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The audio trick that changed everything
&lt;/h2&gt;

&lt;p&gt;Visual blur is one fingerprint of speed. But the paper's most elegant trick exploits a second one: sound.&lt;/p&gt;

&lt;p&gt;Here is something most people don't consciously think about: when you speed up a video, the audio pitch rises. Play a recording of a conversation at twice normal speed and everyone sounds like a cartoon character — voices become thin, reedy, almost helium-inflected. Slow it down to half speed and the same voices become impossibly low and thick, like a record player winding down.&lt;/p&gt;

&lt;p&gt;This happens for the same reason that a police siren sounds higher as it approaches you and lower as it recedes: the pitch of a sound is determined by the frequency of the sound waves reaching your ears, and that frequency changes when the source is moving (or, in this case, when time itself is compressed or expanded in playback).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ktr1g5hwinwujkv2i8z.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ktr1g5hwinwujkv2i8z.jpeg" alt="Audio spectrogram showing frequency changes with playback speed" width="800" height="320"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2: Audio as a natural speed signal. When a video's playback speed changes, its audio pitch shifts, providing free cross-modal supervision via the spectrogram (used only during training). Higher playback speeds push energy toward higher frequencies.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The researchers visualized this as a spectrogram — a map of which sound frequencies appear at which moments. In the image above, you can see the effect directly: the left side of the image, representing slower playback, shows sound energy clustered in lower frequencies, with the high-frequency regions dark and empty. On the right, where playback speed increases, the higher frequencies suddenly light up, the entire spectrum shifting upward like a musical key change written in light.&lt;/p&gt;

&lt;p&gt;This creates a profound opportunity. It means that the &lt;em&gt;same video&lt;/em&gt; carries two independent, corroborating signals about its own speed: the visual blur in the frames and the pitch signature in the audio. The model can compare these signals against each other, using each one to check and sharpen its reading of the other.&lt;/p&gt;

&lt;p&gt;This is what researchers call cross-modal supervision — using two different sensory channels as mutual teachers. Think of how a wine sommelier uses both smell and taste together to identify a vintage. Neither sense alone might be definitive, but the agreement between them, or the revealing discord, tells a richer story than either could alone. The model learns the relationship between visual speed cues and audio pitch cues by watching enormous amounts of ordinary video — without anyone labeling a single frame or telling the system what "slow motion" looks like.&lt;/p&gt;
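
&lt;p&gt;The pitch relationship is easy to verify numerically. A minimal sketch with a synthetic tone (real audio behaves the same way, just messier):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

# Naively "speeding up" audio shifts its pitch up by the same factor.
sr = 16000                          # sample rate in Hz
t = np.arange(sr) / sr              # one second of sample times
tone = np.sin(2 * np.pi * 440 * t)  # a 440 Hz tone

def dominant_freq(signal, sample_rate):
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)
    return freqs[np.argmax(spectrum)]

# "2x speed": keep every other sample but play back at the original rate.
sped_up = tone[::2]

print(dominant_freq(tone, sr))     # ~440 Hz
print(dominant_freq(sped_up, sr))  # ~880 Hz, the pitch signature of fast playback
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A spectrogram is just this measurement repeated over short windows, which is why speeding a clip up visibly pushes its energy into the higher-frequency bands.&lt;/p&gt;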

&lt;h2&gt;
  
  
  Teaching a machine without a teacher
&lt;/h2&gt;

&lt;p&gt;This brings us to perhaps the most important methodological decision in the paper: everything described so far is learned without labels.&lt;/p&gt;

&lt;p&gt;In most machine learning, you need a human to annotate training data. Someone has to watch thousands of videos and write down: "this one is played at half speed," "this one is normal," "this one is sped up two times." This labeling process is expensive, slow, and bottlenecked by human attention. More fundamentally, it requires the person labeling to already know the answer — which is exactly what you're trying to teach the machine.&lt;/p&gt;

&lt;p&gt;The researchers sidestepped this entirely through a technique called self-supervised learning. Imagine teaching someone to recognize a forged signature without ever showing them examples of forgeries. Instead, you hand them a stack of authentic signatures and let them look for internal inconsistencies — places where the pen pressure, the angle, the rhythm of a stroke breaks with what the same hand produced moments earlier. They learn by noticing when something doesn't cohere, without anyone ever telling them what to look for.&lt;/p&gt;

&lt;p&gt;The model in this paper learns similarly. Researchers took ordinary internet videos and artificially sped some up, slowed others down, or mixed sections of different speeds. They then asked the model to detect these changes — not by consulting a label, but by noticing when the visual flow and audio pitch no longer fit together, or when the blur patterns across consecutive frames don't match the implied rhythm of motion. The "teacher" is the internal consistency of the video itself.&lt;/p&gt;
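
&lt;p&gt;In code, the "teacher-free" part of this looks roughly like the sketch below. It is a generic illustration of speed-perturbation pseudo-labels, not the paper's exact protocol; the factors and the crude slow-down by frame repetition are placeholders.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import random

SPEED_FACTORS = [0.25, 0.5, 1.0, 2.0, 4.0]  # assumed set, for illustration

def make_training_example(frames):
    """frames: a list of consecutive video frames at the original rate."""
    factor = random.choice(SPEED_FACTORS)
    if factor >= 1.0:
        # "Speed up": keep every k-th frame.
        perturbed = frames[::int(factor)]
    else:
        # "Slow down": repeat frames (a crude stand-in for real interpolation).
        repeat = int(round(1.0 / factor))
        perturbed = [frame for frame in frames for _ in range(repeat)]
    # The label costs nothing: it is the factor we just applied. The model sees
    # only the perturbed clip and must recover the factor from pixels and audio.
    return perturbed, factor
&lt;/code&gt;&lt;/pre&gt;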

&lt;h2&gt;
  
  
  Building the world's largest slow-motion library
&lt;/h2&gt;

&lt;p&gt;Once you have a system that can reliably tell whether a video contains slow motion, you can use that system as a filter — a tireless, infinitely patient curator.&lt;/p&gt;

&lt;p&gt;The internet contains an enormous amount of slow-motion footage mixed in with billions of ordinary videos. The problem is finding it: there is no reliable, consistent way to identify it from metadata alone. People tag and title videos erratically. One creator calls the same footage "slo-mo," another calls it "60fps," another calls it nothing at all.&lt;/p&gt;

&lt;p&gt;The researchers turned their trained model loose on this haystack. By processing large collections of video and flagging clips where the model detected slow-motion signatures — the characteristic blur, the pitch-shifted audio, the visual density of temporal detail — they assembled the largest slow-motion dataset ever collected from naturally occurring sources.&lt;/p&gt;
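
&lt;p&gt;Used as a curator, the trained detector reduces to a very small loop. A sketch, with the detector treated as a black box and both its name and the cutoff invented for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def curate_slow_motion(clips, slowmo_detector, cutoff=0.8):
    """Keep only clips whose detected slow-motion score clears the cutoff."""
    # slowmo_detector: the trained model described above, assumed here to
    # return a score between 0 and 1 for a clip. Placeholder name and cutoff.
    return [clip for clip in clips if slowmo_detector(clip) > cutoff]
&lt;/code&gt;&lt;/pre&gt;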

&lt;p&gt;This matters because slow-motion footage is genuinely different from ordinary video in a way that matters for AI training. Think of ordinary video as a novel that describes a battle in broad strokes — armies clash, a hero falls, the tide turns. Slow-motion footage is like a frame-by-frame graphic novel of the same battle, where every sword stroke and expression is captured in full detail. For a machine learning to understand motion, physics, and causality, that detail is not decorative. It is the text.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the machine learns to control time
&lt;/h2&gt;

&lt;p&gt;The paper's most forward-looking section describes two things the researchers built using all this acquired understanding: a system that generates video at a specified speed, and a system that converts low-quality, blurry, low-frame-rate video into high-quality slow motion.&lt;/p&gt;

&lt;p&gt;The first — speed-conditioned video generation — is something like teaching an illustrator to draw differently depending on a mood instruction. Ask them to draw a waterfall as "frozen," and they'll use sharp lines, crystalline forms, stillness implied in every edge. Ask them to draw the same waterfall as "rushing," and the same elements become streaks, arcs, foam caught in mid-scatter. The instruction shapes every aesthetic decision, not just the subject matter. Here, instead of artistic mood, the instruction is temporal: generate this scene as though captured at half normal speed, or double normal speed. The model learns to make every visual choice — how sharp to render edges, how much to blur movement, how to distribute motion across frames — consistent with the specified temporal flow.&lt;/p&gt;

&lt;p&gt;The second — temporal super-resolution — is arguably the more practically remarkable achievement. Given a video that is blurry, low-frame-rate, and temporally thin (imagine footage from a security camera, or a clip compressed heavily for file size), the system reconstructs what the in-between moments probably looked like. This is not guessing randomly. It is inference constrained by everything the model has learned about how motion works, how blur distributes across a scene, and how things in the physical world actually move between recorded frames.&lt;/p&gt;

&lt;p&gt;Think of how a skilled art restorer approaches a damaged oil painting. Faced with sections where the paint has flaked away entirely, they don't fill in the gaps with random colors. They study the surrounding strokes, the artist's technique as visible in intact sections, the logic of the depicted scene — and from all of this, they reconstruct what almost certainly was there. The result is not certainty, but it is informed reconstruction, and for many purposes it is better than leaving the gap blank.&lt;/p&gt;
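
&lt;p&gt;To see what is being claimed, it helps to contrast it with the most naive way of filling in missing frames. The sketch below is that naive baseline, not the paper's method:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def naive_inbetween(frame_a, frame_b, n_new=3):
    """Insert n_new synthetic frames between two real ones by cross-fading.

    frame_a and frame_b can be numpy image arrays; the blend is element-wise.
    """
    frames = []
    for i in range(1, n_new + 1):
        alpha = i / (n_new + 1)
        frames.append((1 - alpha) * frame_a + alpha * frame_b)
    return frames

# Cross-fading ghosts moving objects rather than moving them. Learned temporal
# super-resolution replaces the blend with motion-aware inference shaped by
# everything the model has absorbed about how real scenes move.
&lt;/code&gt;&lt;/pre&gt;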

&lt;h2&gt;
  
  
  What becomes possible now
&lt;/h2&gt;

&lt;p&gt;These capabilities, combined, begin to shift what is possible in several concrete domains.&lt;/p&gt;

&lt;p&gt;Consider a surgeon training on video of a delicate procedure. Currently, the training footage may have been recorded on standard medical cameras at frame rates that simply miss the full motion of the most critical moments — the tension and release of a suture, the exact angle of an incision. With temporal super-resolution, the same footage could be enriched with recovered in-between frames, giving trainees and instructors a more complete picture of technique.&lt;/p&gt;

&lt;p&gt;Or consider a forensic analyst asked whether a viral video of an incident has been manipulated — specifically, whether someone sped up footage to make a crowd look more menacing, or slowed it down to make an action look more deliberate than it was. These techniques give investigators a systematic way to test that question, looking for the inconsistencies between visual and audio speed signatures that arise when footage has been post-processed — the equivalent of finding an anachronistic fiber in a supposedly antique cloth.&lt;/p&gt;

&lt;p&gt;For the film industry, speed-conditioned generation opens the possibility of creating cinematic slow motion in post-production, without the cost of high-speed cameras. What currently requires tens of thousands of dollars in equipment could, if these techniques mature, be applied as a computational process to footage captured with ordinary cameras.&lt;/p&gt;

&lt;p&gt;And at a deeper level, there is something philosophically significant about what this paper is pointing toward: the idea that time itself is a visual dimension that can be learned, not just assumed. Most AI systems that watch video treat it as a sequence of images. This paper treats it as a recording of temporal flow — and argues that how things unfold across time is as learnable, and as teachable, as what objects look like or where they are.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the paper doesn't answer
&lt;/h2&gt;

&lt;p&gt;There are honest gaps here worth noting. The audio-based speed detection, elegant as it is, is useless on silent video — a substantial fraction of internet content. The visual signals alone are less reliable for some kinds of footage: scenes with little motion, static shots, or stabilized camera work where the software has deliberately smoothed away the blur signatures.&lt;/p&gt;

&lt;p&gt;More fundamentally, the temporal super-resolution system, like all such reconstruction methods, is making educated inferences about what it didn't see. In most applications, this is fine. But in forensic or legal contexts, a system that fills in moments it never observed is a system that can produce compelling artifacts — convincing reconstructions of things that may not have happened quite that way. The capability and the caution need to develop together.&lt;/p&gt;

&lt;p&gt;And the paper is still largely a proof-of-concept for some of the generation results. The generated videos, while compelling, show the artifacts and limitations familiar to anyone who has watched AI-generated video for more than a few seconds. The principle is demonstrated; the product-quality execution is still ahead.&lt;/p&gt;

&lt;p&gt;But the direction is clear, and the foundation is sound. Time has always moved through video. Now, finally, the machines are starting to notice.&lt;/p&gt;

&lt;p&gt;📄 &lt;a href="https://arxiv.org/abs/2604.21931v1" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2604.21931v1&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;tags: computervision, videogeneration, selfsupervisedlearning, temporalai&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🇰🇷 Korean version on Velog: &lt;a href="https://velog.io/@tkdnel1002/4jkzs29p" rel="noopener noreferrer"&gt;https://velog.io/@tkdnel1002/4jkzs29p&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>computervision</category>
      <category>videogeneration</category>
      <category>selfsupervisedlearning</category>
      <category>temporalai</category>
    </item>
  </channel>
</rss>
