<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Papers Mache</title>
    <description>The latest articles on Forem by Papers Mache (@olaughter).</description>
    <link>https://forem.com/olaughter</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3907566%2Fa47c580b-0e36-4706-887e-97e33498a037.png</url>
      <title>Forem: Papers Mache</title>
      <link>https://forem.com/olaughter</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/olaughter"/>
    <language>en</language>
    <item>
      <title>Diffusion models approach AR quality and improve inference speed</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Sun, 10 May 2026 05:00:00 +0000</pubDate>
      <link>https://forem.com/olaughter/diffusion-models-approach-ar-quality-and-improve-inference-speed-3kbe</link>
      <guid>https://forem.com/olaughter/diffusion-models-approach-ar-quality-and-improve-inference-speed-3kbe</guid>
      <description>&lt;p&gt;Diffusion language models have long promised parallel generation, yet their serving speed has lagged behind autoregressive decoders. Recent work shows that diffusion can now deliver three‑fold throughput gains over prior diffusion models, and LangFlow reports perplexities of 30.0 on LM1B and 24.6 on OpenWebText. The gap between parallelism and practical efficiency is finally narrowing.&lt;/p&gt;

&lt;p&gt;Earlier diffusion language models suffered from two intertwined problems. First, the lack of introspective consistency—unlike AR models that always condition on their own past tokens—produced a quality deficit noticeable on standard benchmarks. Second, inference pipelines were built on naïve sampling loops, so even when quality improved, latency remained higher than causal decoders. Autoregressive systems, by contrast, benefitted from decades of system‑level tuning such as causal masking and logit shifting, which implicitly enforce token‑level consistency.&lt;/p&gt;

&lt;p&gt;Introspective Diffusion Language Models (I‑DLM) close the consistency gap with a novel “introspective strided decoding” algorithm that verifies previously generated tokens while advancing new ones in the same forward pass. The authors report that “Beyond quality, I‑DLM is designed for the growing demand of large‑concurrency serving, delivering about 3× higher throughput than prior state‑of‑the‑art DLMs.” &lt;a href="https://arxiv.org/abs/2604.11035" rel="noopener noreferrer"&gt;[1]&lt;/a&gt; They also achieve “69.6 on AIME‑24 and 45.7 on LiveCodeBench‑v6, exceeding LLaDA‑2.1‑mini (16B) by more than 26 and 15 points, respectively.” &lt;a href="https://arxiv.org/abs/2604.11035" rel="noopener noreferrer"&gt;[1]&lt;/a&gt; Crucially, I‑DLM is claimed to be “the first DLM to match the quality of its same‑scale AR counterpart while outperforming prior DLMs in both model quality and practical serving efficiency across 15 benchmarks.” &lt;a href="https://arxiv.org/abs/2604.11035" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;LangFlow tackles the continuous‑time side of the problem. By linking embedding‑space diffusion to flow matching via a Bregman divergence and introducing an ODE‑based negative‑log‑likelihood bound, the model reaches “a PPL of 30.0 on LM1B and 24.6 on OpenWebText,” rivaling top discrete diffusion systems. &lt;a href="https://arxiv.org/abs/2604.11748" rel="noopener noreferrer"&gt;[2]&lt;/a&gt; Moreover, “It even exceeds autoregressive baselines in zero‑shot transfer on 4 out of 7 benchmarks.” &lt;a href="https://arxiv.org/abs/2604.11748" rel="noopener noreferrer"&gt;[2]&lt;/a&gt; These numbers place continuous diffusion on equal footing with the best AR language models, at least on the evaluated corpora.&lt;/p&gt;

&lt;p&gt;The papers acknowledge several open questions. I‑DLM’s throughput claims stem from a single‑H100 benchmark and a stationary‑batch scheduler; scaling to multi‑node or heterogeneous clusters remains untested. The quality comparison covers 15 curated benchmarks, but the behavior on truly massive, multilingual corpora is unknown. LangFlow’s ODE likelihood bound hinges on a learnable Gumbel‑based noise schedule, which may be sensitive to hyper‑parameter choices not explored in the released experiments. Its zero‑shot advantage appears on a modest set of seven tasks, leaving the generality of the improvement uncertain.&lt;/p&gt;

&lt;p&gt;For teams that need to serve thousands of concurrent requests, evaluating a diffusion backend is now a concrete option rather than a speculative future. You can benchmark I‑DLM’s stationary‑batch scheduler against your existing causal decoder on the same hardware to see whether the reported 3× throughput translates to cost savings. Likewise, swapping an AR checkpoint for a LangFlow checkpoint and measuring perplexity on your domain data will reveal if the continuous‑time approach holds up outside LM1B and OpenWebText. If the results align, diffusion models could become the default choice for high‑throughput, low‑latency LLM serving.&lt;/p&gt;
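
&lt;p&gt;If you want to run that perplexity check yourself, a minimal sketch is below: it computes corpus perplexity with Hugging Face Transformers over a list of your own documents. The checkpoint name is a placeholder, since this post does not cover how the LangFlow or I‑DLM weights are packaged or loaded.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal corpus-perplexity probe: swap in your own checkpoint and documents.
# "your-org/your-checkpoint" is a placeholder, not a released LangFlow/I-DLM model.
import math, torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def corpus_perplexity(model_name, texts, device="cuda"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            ids = tok(text, return_tensors="pt", truncation=True).input_ids.to(device)
            out = model(ids, labels=ids)          # loss is the mean NLL over shifted targets
            n = ids.numel() - 1                   # number of predicted tokens
            total_nll += out.loss.item() * n
            total_tokens += n
    return math.exp(total_nll / max(total_tokens, 1))

print(corpus_perplexity("your-org/your-checkpoint", ["Example domain document ..."]))
&lt;/code&gt;&lt;/pre&gt;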

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2604.11035" rel="noopener noreferrer"&gt;Introspective Diffusion Language Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2604.11748" rel="noopener noreferrer"&gt;LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
    <item>
      <title>Flux Attention halves inference cost on long contexts</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Sun, 10 May 2026 05:00:00 +0000</pubDate>
      <link>https://forem.com/olaughter/flux-attention-halves-inference-cost-on-long-contexts-51od</link>
      <guid>https://forem.com/olaughter/flux-attention-halves-inference-cost-on-long-contexts-51od</guid>
      <description>&lt;p&gt;Dynamic sparse routing now delivers two‑ to three‑fold speedups on long‑context inference while leaving reasoning quality virtually untouched. The trick is that each transformer layer decides on the fly whether to attend densely or sparsely, reducing the blanket‑over‑all quadratic cost associated with standard attention in large language models. The result is a practical, drop‑in acceleration that works on the chat‑style workloads that dominate production today.&lt;/p&gt;

&lt;p&gt;Standard self‑attention scales as &lt;em&gt;O(n²)&lt;/em&gt; with the token count, so extending context windows from 4 k to 32 k tokens quickly becomes prohibitive. Hybrid schemes that mix full attention (FA) and sparse attention (SA) have been proposed, but they usually fix the FA/SA ratio globally or at the head level, forcing a one‑size‑fits‑all allocation that either wastes compute or starves the model of needed context. Moreover, head‑level sparsity often creates load‑imbalance spikes that hurt autoregressive decoding on modern accelerators.&lt;/p&gt;

&lt;p&gt;Flux Attention sidesteps these constraints by introducing a lightweight Layer Router that plugs into a frozen pretrained model without retraining the backbone and, during inference, routes each layer to either FA or SA based on the current input. Because the decision happens at layer granularity, the memory access pattern stays contiguous, turning theoretical FLOP reductions into measurable wall‑clock gains. The authors report speed improvements of up to 2.8× during the prefill phase and 2.0× while decoding, all while preserving performance on long‑context and mathematical reasoning benchmarks. Training the router is exceptionally cheap: “Our parameter‑efficient training converges in just 12 hours on an 8‑GPU A800 node.” &lt;a href="https://arxiv.org/abs/2604.07394" rel="noopener noreferrer"&gt;[1]&lt;/a&gt; At inference time the routing cost is tiny: “our router incurs a negligible overhead, averaging only 0.20 ms per layer.” &lt;a href="https://arxiv.org/abs/2604.07394" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;&lt;/p&gt;
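
&lt;p&gt;To make the routing idea concrete, here is a minimal, hypothetical sketch of a layer‑level router in PyTorch: a tiny scorer looks at the layer input and picks dense or sparse attention for that layer. The &lt;code&gt;full_attention&lt;/code&gt; and &lt;code&gt;sparse_attention&lt;/code&gt; calls and the threshold are placeholders, not the released Flux Attention code.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical layer-level FA/SA routing sketch (not the released Flux Attention code).
import torch
import torch.nn as nn

class LayerRouter(nn.Module):
    """Scores a layer's input and decides: full attention (FA) or sparse attention (SA)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)    # one tiny projection per layer

    def forward(self, hidden_states):
        pooled = hidden_states.mean(dim=1)         # pool over the sequence
        return torch.sigmoid(self.scorer(pooled))  # routing probability, shape (batch, 1)

def routed_attention(layer, router, hidden_states, threshold=0.5):
    p_dense = router(hidden_states).mean().item()
    if p_dense > threshold:
        return layer.full_attention(hidden_states)    # exact, quadratic path
    return layer.sparse_attention(hidden_states)      # e.g. block or sliding-window kernel
&lt;/code&gt;&lt;/pre&gt;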

&lt;p&gt;The paper’s evaluation focuses on long‑context scenarios and math‑heavy tasks, leaving open how the method behaves on short‑prompt or multilingual benchmarks. The approach also assumes access to the original frozen checkpoint; models that have already been fine‑tuned or heavily customized might need additional adaptation steps. Finally, the reported speedups stem from A800 GPU measurements; different hardware architectures could exhibit a different balance between the cost of the router and the gains from sparsity.&lt;/p&gt;

&lt;p&gt;For teams that already serve chat‑style LLMs with extended windows, the take‑away is immediate: a layer‑wise router can be trained in a single half‑day and, as demonstrated by the authors, has been integrated into released checkpoints on Hugging Face and ModelScope. Before rolling it out, benchmark both prefill and decode latency on your target context lengths to confirm the 2–3× gains materialize in your stack. If the router’s 0.20 ms per‑layer penalty is acceptable, the resulting throughput boost can shave seconds off each interaction, turning long‑context reasoning from a niche capability into a production‑ready feature.&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2604.07394" rel="noopener noreferrer"&gt;Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
    <item>
      <title>Distillation that keeps confidence honest</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Sun, 10 May 2026 05:00:00 +0000</pubDate>
      <link>https://forem.com/olaughter/distillation-that-keeps-confidence-honest-4gbg</link>
      <guid>https://forem.com/olaughter/distillation-that-keeps-confidence-honest-4gbg</guid>
      <description>&lt;p&gt;On‑policy distillation has become the go‑to recipe for squeezing a large language model’s capabilities into a smaller student after training. The process, however, inherits a hidden bias: the student learns to mimic a teacher that has access to privileged context, and it consequently reports confidence scores that are far too optimistic. Recent work shows that this optimism can be tamed without giving up the accuracy gains that distillation promises.  &lt;/p&gt;

&lt;p&gt;Traditional OPD treats the teacher’s token‑level probabilities as a signal of both what to say and how sure to be. Because the teacher’s confidence is conditioned on information unavailable at deployment, the student ends up with a systematic “certainty illusion.” The paper formalizes this mismatch as a scaling law of miscalibration, arguing that privileged context collapses entropy and drives optimism — the same mechanism that makes the student’s logits sharper than they should be.&lt;/p&gt;

&lt;p&gt;CaOPD rewrites that recipe. First it runs the student on its own roll‑outs, measures the empirical confidence, and then replaces the teacher’s implicit confidence token with this student‑grounded estimate while keeping the teacher’s trajectory for capability cloning. As the authors put it, “We preserve the teacher’s high‑quality trajectory for capability cloning, but overwrite the implicit confidence token with the student’s actual confidence.” &lt;a href="https://arxiv.org/abs/2604.16830" rel="noopener noreferrer"&gt;[1]&lt;/a&gt; The resulting model “achieves Pareto‑optimal calibration while maintaining competitive capability, generalizing robustly under out‑of‑distribution and continual learning.” &lt;a href="https://arxiv.org/abs/2604.16830" rel="noopener noreferrer"&gt;[1]&lt;/a&gt; In concrete terms, the approach “collapses the massive OCG from +32.0% (SDFT) down to an exceptionally aligned -0.7%.” &lt;a href="https://arxiv.org/abs/2604.16830" rel="noopener noreferrer"&gt;[1]&lt;/a&gt; Thus the trade‑off curve shifts left: raw accuracy stays on par with standard distillation, while expected calibration error drops dramatically.  &lt;/p&gt;

&lt;p&gt;The study evaluates CaOPD on benchmark suites commonly used for assessing calibration, where confidence can be estimated from a single forward pass per token. Computing student roll‑outs for every training example adds overhead that may be prohibitive for very large corpora. While the reported experiments primarily involve classification‑style tasks, it remains an open question how the method scales to multi‑turn dialogue or open‑ended generation where confidence is less well defined. The reported robustness is demonstrated on the out‑of‑distribution splits used in the paper, but it is unclear how the method would perform under broader domain shifts.  &lt;/p&gt;

&lt;p&gt;For pipelines that gate downstream actions by model confidence—retrieval re‑ranking, recommendation thresholds, or safety filters—trustworthy probabilities are as valuable as raw scores. Swapping a vanilla OPD checkpoint for a CaOPD one can immediately shrink the overconfidence gap, reducing false‑positive alarms without sacrificing the hit‑rate of the underlying model. Before committing to a new student, benchmark both accuracy and calibration on a slice of your real query distribution; if the calibrated error drops while the top‑k precision holds, CaOPD offers a low‑risk upgrade path. In environments where every mis‑calibrated score can trigger costly remediation, treating confidence as a first‑class objective may soon become the default engineering habit.&lt;/p&gt;
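
&lt;p&gt;Benchmarking calibration is mostly bookkeeping: collect a confidence and a correctness flag per example, then bin them. The snippet below is a generic expected‑calibration‑error (ECE) implementation, not the paper's evaluation harness.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Generic expected calibration error (ECE): bin predictions by confidence and
# compare average confidence to empirical accuracy in each bin.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Equal-width bins; confidence 1.0 falls into the top bin.
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap               # weight the gap by bin frequency
    return ece

# Example: per-example confidence and whether the prediction was correct.
print(expected_calibration_error([0.9, 0.8, 0.55], [1, 1, 0]))
&lt;/code&gt;&lt;/pre&gt;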

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2604.16830" rel="noopener noreferrer"&gt;The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
    <item>
      <title>Adaptive reasoning reduces token usage up to 90% with minimal accuracy loss</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Sat, 09 May 2026 05:00:00 +0000</pubDate>
      <link>https://forem.com/olaughter/adaptive-reasoning-reduces-token-usage-up-to-90-with-minimal-accuracy-loss-1nm9</link>
      <guid>https://forem.com/olaughter/adaptive-reasoning-reduces-token-usage-up-to-90-with-minimal-accuracy-loss-1nm9</guid>
      <description>&lt;p&gt;Adaptive reasoning formats that let a model decide on the fly which reasoning steps are truly needed can slash the number of tokens processed by as much as ninety percent, yet leave the quality of the answer essentially untouched. The trick is to replace a monolithic chain of computation with a handful of lightweight alternatives that are chosen dynamically. When the extra logic for picking the right path adds only a few hundred milliseconds, the trade‑off becomes hard to refuse.&lt;/p&gt;

&lt;p&gt;Parallel reasoning has become the de‑facto way to boost Large Reasoning Models, but the cost of evaluating every possible path quickly dwarfs any gains in accuracy. Visual‑language systems suffer a similar symptom: they often “overthink,” generating long chains of internal dialogue even when a simple perception step would suffice. Prior work has mostly treated pruning as a post‑hoc filter or relied on static heuristics, leaving a gap for methods that can learn to drop unnecessary computation as part of the model’s forward pass.&lt;/p&gt;

&lt;p&gt;STOP introduces a differentiable token‑pruning head that learns, from the model’s own key‑value cache, which reasoning tokens can be discarded before they are even materialized. “For instance, on the AIME 24 benchmark (1.5B), STOP increases average accuracy from 30.10 % to 37.92 %—significantly exceeding Type II (32.50 %) and Type III (32.92 %)—while simultaneously reducing total token consumption by over 73 %.” &lt;a href="https://arxiv.org/abs/2604.16029" rel="noopener noreferrer"&gt;[1]&lt;/a&gt; The overhead of this head is almost invisible: “STOP (Type IV) minimizes overhead to a negligible 0.20 s (0.59 %).” &lt;a href="https://arxiv.org/abs/2604.16029" rel="noopener noreferrer"&gt;[1]&lt;/a&gt; AVR tackles the same problem from the format side, giving a model three explicit response styles – full reasoning, perception‑only, and direct answer – and training it with a policy‑gradient objective to pick the cheapest viable format. “Experiments on multiple vision‑language benchmarks show that AVR reduces token usage by 50–90 % while maintaining overall accuracy, especially in perception‑intensive tasks.” &lt;a href="https://arxiv.org/abs/2604.14568" rel="noopener noreferrer"&gt;[2]&lt;/a&gt; Across seven benchmarks the method “achieves 50–90 % token reduction … while matching or improving accuracy … and generalizes across different model scales and families.” &lt;a href="https://arxiv.org/abs/2604.14568" rel="noopener noreferrer"&gt;[2]&lt;/a&gt; In the most perception‑heavy settings the paper reports “over 80 % token reduction and a 2–4 % accuracy gain.” &lt;a href="https://arxiv.org/abs/2604.14568" rel="noopener noreferrer"&gt;[2]&lt;/a&gt; Together, the two techniques demonstrate that a model can stay on‑track while shedding the bulk of its internal chatter.&lt;/p&gt;

&lt;p&gt;Both works leave open questions about how far the savings extend beyond curated benchmarks. STOP’s token‑pruning classifier is trained on internal KV‑cache statistics, which may behave differently on other task families or on models that do not expose a comparable cache. AVR’s reinforcement‑learning step adds a training complexity that can be fragile when data are scarce or when the reward signal does not align cleanly with downstream latency budgets. Moreover, the reported token reductions assume the same input distribution as the test suites; a shift toward longer, more compositional queries could re‑activate the pruned paths and diminish the gains. Finally, the latency benefit of STOP is measured on a single‑GPU setup; on heterogeneous edge hardware the relative cost of the pruning head versus the main model could change.&lt;/p&gt;

&lt;p&gt;For teams shipping multimodal inference to edge devices, the practical takeaway is to prototype a lightweight pruning head before committing to a full model redesign. Because STOP’s classifier only inspects cached activations, it can be dropped into any transformer‑based LRM with a few lines of code, and the reported configuration already cuts token usage by more than seventy percent. When the application is visual‑language heavy, wrapping the model in AVR’s three‑format selector lets you benchmark the token distribution of real user queries and automatically steer the system toward perception‑only or direct‑answer paths whenever they suffice. In short, run an ablation on your own workload: measure token counts per query, enable STOP or AVR, and compare end‑to‑end latency. If the latency budget is met without a statistically significant dip in accuracy, the deployment is ready to scale to billions of in‑the‑wild interactions.&lt;/p&gt;
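
&lt;p&gt;That ablation needs little more than a timing loop. The harness below compares token counts and wall‑clock latency per query across two backends; the &lt;code&gt;baseline_generate&lt;/code&gt; and &lt;code&gt;adaptive_generate&lt;/code&gt; callables are stand‑ins for your own endpoints, and nothing here comes from the STOP or AVR code.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Tiny A/B harness: tokens and wall-clock latency per query across two backends.
# Both generate callables are placeholders for your own endpoints.
import time
from statistics import mean

def run_ablation(queries, baseline_generate, adaptive_generate):
    rows = []
    for q in queries:
        for name, fn in (("baseline", baseline_generate), ("adaptive", adaptive_generate)):
            start = time.perf_counter()
            _answer, n_tokens = fn(q)              # each backend returns (answer, tokens_used)
            rows.append({"system": name, "tokens": n_tokens,
                         "latency_s": time.perf_counter() - start})
    for name in ("baseline", "adaptive"):
        sub = [r for r in rows if r["system"] == name]
        print(name,
              "mean tokens:", round(mean(r["tokens"] for r in sub), 1),
              "mean latency (s):", round(mean(r["latency_s"] for r in sub), 3))
    return rows
&lt;/code&gt;&lt;/pre&gt;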

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2604.16029" rel="noopener noreferrer"&gt;Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2604.14568" rel="noopener noreferrer"&gt;Learning Adaptive Reasoning Paths for Efficient Visual Reasoning&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
    <item>
      <title>Hierarchical skill KB improves performance of weaker models</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Sat, 09 May 2026 05:00:00 +0000</pubDate>
      <link>https://forem.com/olaughter/hierarchical-skill-kb-improves-performance-of-weaker-models-39je</link>
      <guid>https://forem.com/olaughter/hierarchical-skill-kb-improves-performance-of-weaker-models-39je</guid>
      <description>&lt;p&gt;The dominant paradigm for teaching autonomous language‑model agents is to let each instance wander through its own training episodes, rediscovering the same sub‑tasks over and over. That redundancy inflates exploration budgets and leaves even modest models struggling on long‑horizon problems. A fully automated pipeline that extracts reusable, hierarchical behaviors from a collective pool of trajectories flips the script.&lt;/p&gt;

&lt;p&gt;Historically, agents have relied on flat replay buffers or hand‑crafted macro‑actions; neither approach captures the layered structure of real‑world plans. Without an explicit representation that separates strategy, function, and atomic operation, weaker backbones cannot efficiently retrieve the right piece of experience when a new request arrives. This limitation has kept them a step behind larger, compute‑heavy models.&lt;/p&gt;

&lt;p&gt;SkillX addresses the gap by distilling raw execution traces into a three‑tiered knowledge base—strategic plans, functional skills, and atomic skills—then iteratively refining each entry based on execution feedback and expanding coverage through exploratory generation. When this SkillKB is plugged into a baseline model such as Qwen3‑32B, “SkillX improves the base model’s performance. In particular, Qwen3-32B gains roughly around 10 points across multiple benchmarks” &lt;a href="https://arxiv.org/abs/2604.04804" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. The same study notes that the library “cuts redundant steps and context length,” confirming that hierarchical skill retrieval streamlines inference. Moreover, “Multi-Level Skills Design Outperform Other Forms of Experience Representation,” underscoring that the structured hierarchy itself is the driving factor behind the gains &lt;a href="https://arxiv.org/abs/2604.04804" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The evaluation is limited to a handful of long‑horizon, user‑interactive suites (AppWorld, BFCL‑v3, τ²‑Bench) and assumes a strong backbone (GLM‑4.6) to bootstrap the initial skill extraction. It remains unclear how the approach scales to domains with sparse demonstrations or to agents that already incorporate external memory modules. One open question is whether the same performance lift would appear when the skill library is built from heterogeneous logs rather than a single, high‑capacity teacher.&lt;/p&gt;

&lt;p&gt;If the hierarchy can be reproduced for your own workloads, a smaller model can inherit a sizable portion of the expertise typically locked behind larger parameters. Adding a lightweight retrieval layer that queries the SkillKB at inference time may shrink token budgets enough to run on edge hardware, while still delivering success rates that rival bigger counterparts. Before committing to a full model upgrade, consider constructing a pilot skill library from existing logs and measuring both task accuracy and context usage on a representative subset of queries.&lt;/p&gt;
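
&lt;p&gt;The retrieval layer itself can start very small. The sketch below is a hypothetical three‑tier lookup (strategic plan, then the functional and atomic skills it references) with a toy lexical scorer; it illustrates the shape of the idea and does not reproduce SkillX's knowledge‑base format.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical three-tier skill lookup (strategy, functional skills, atomic skills).
# Schema and scoring are illustrative only, not SkillX's KB format.
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    description: str
    children: list = field(default_factory=list)   # names of lower-tier skills

def overlap(query, text):
    # Toy lexical overlap; a real system would use embeddings or a reranker.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q.intersection(t)) / max(len(q), 1)

def retrieve(query, strategies, functional, atomic, top_k=1):
    ranked = sorted(strategies, key=lambda s: overlap(query, s.description), reverse=True)
    plans = ranked[:top_k]
    funcs = [functional[n] for p in plans for n in p.children if n in functional]
    atoms = [atomic[n] for f in funcs for n in f.children if n in atomic]
    return plans, funcs, atoms                     # plan first, then its supporting skills
&lt;/code&gt;&lt;/pre&gt;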

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2604.04804" rel="noopener noreferrer"&gt;SkillX: Automatically Constructing Skill Knowledge Bases for Agents&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
    <item>
      <title>Fast edit loops improve AI document workflow</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Sat, 09 May 2026 05:00:00 +0000</pubDate>
      <link>https://forem.com/olaughter/fast-edit-loops-improve-ai-document-workflow-3bed</link>
      <guid>https://forem.com/olaughter/fast-edit-loops-improve-ai-document-workflow-3bed</guid>
      <description>&lt;p&gt;The moment you hit “regenerate” and watch a 30‑second spinner eat your momentum, the allure of AI‑generated lecture notes evaporates. When the latency drops to a barely‑noticeable blink, the same tool becomes a collaborator instead of a bottleneck.&lt;/p&gt;

&lt;p&gt;Until now, pushing a generative model through a full HTML or LaTeX pipeline meant waiting minutes for the next preview. Classic zero‑shot HTML generators churn out static pages, while LaTeX OCR pipelines spit out raw code that often fails to compile. The result is a broken feedback loop that forces authors back to manual edits.&lt;/p&gt;

&lt;p&gt;MAIC‑UI tackles latency head‑on with a “generate‑verify‑optimize” loop that separates content alignment from visual polishing. By slicing edits into unified diffs and only re‑generating the changed fragment, the system delivers “Click‑to‑Locate editing with Unified Diff‑based incremental generation achieving sub‑10‑second iteration cycles” &lt;a href="https://arxiv.org/abs/2604.25806" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. That alone shaves minutes off the earlier workflow, where “full regeneration for modifications requires 200–600 seconds, disrupting creative flow” &lt;a href="https://arxiv.org/abs/2604.25806" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. In a controlled lab study, participants needed 4.9 editing rounds instead of 7.0, and in a three‑month deployment with high‑school students “the pilot class achieved 9.21‑point gains in STEM subjects compared to -2.32 points in control classes” &lt;a href="https://arxiv.org/abs/2604.25806" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;.&lt;/p&gt;
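
&lt;p&gt;The diff‑driven piece of that loop is something you can emulate with the standard library: compute a unified diff between the old and new versions of just the edited fragment and pass only that hunk downstream. The snippet below uses Python's difflib to illustrate the pattern and is not MAIC‑UI's engine.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative unified-diff workflow: regenerate only the edited fragment and
# express the change as a diff hunk (pattern only, not MAIC-UI's implementation).
import difflib

def fragment_diff(old_fragment, new_fragment, path="lecture.html"):
    diff = difflib.unified_diff(
        old_fragment.splitlines(keepends=True),
        new_fragment.splitlines(keepends=True),
        fromfile=path, tofile=path,
    )
    return "".join(diff)

old = "Section intro.\nThe derivative of x^2 is x.\n"
new = "Section intro.\nThe derivative of x^2 is 2x.\n"
print(fragment_diff(old, new))   # only the changed hunk travels to the next stage
&lt;/code&gt;&lt;/pre&gt;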

&lt;p&gt;TexOCR flips the OCR script by training a 2 B‑parameter model with reinforcement learning that rewards verifiable LaTeX unit tests. The benchmark suite evaluates not only transcription fidelity but also structural faithfulness and end‑to‑end compilability. Across 21 frontier models, existing systems stumble on section continuity, float placement, and reference integrity, while TexOCR’s RL‑augmented training delivers consistent gains on those very metrics.&lt;/p&gt;

&lt;p&gt;RaV‑IDP closes the loop with a reconstruction‑as‑validation stage. After each entity extraction, the pipeline rebuilds the region and scores its fidelity against the original crop. The resulting “fidelity scores achieve Spearman ρ = 0.800 with ground‑truth table quality (p = 2.0×10⁻¹¹²) and ρ = 0.877 on native PDFs” &lt;a href="https://arxiv.org/abs/2604.23644" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;, providing a statistically robust signal that a piece of output truly mirrors its source. When the score dips, a “GPT‑4.1 vision fallback” is triggered, recovering “38.1% of failed table extractions via the GPT‑4.1 fallback path” &lt;a href="https://arxiv.org/abs/2604.23644" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;. The authors also show that the gate‑only variant collapses to 0.1408 ANLS, confirming that the fallback is essential rather than optional.&lt;/p&gt;
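
&lt;p&gt;The gate itself reduces to a threshold plus an escalation path. Here is a hedged sketch of that control flow; the extractor, fidelity scorer, and fallback callables are placeholders, and RaV‑IDP's actual reconstruction‑based scoring is considerably more involved.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Gate-plus-fallback control flow (placeholders throughout; not RaV-IDP's code).
def extract_with_validation(region_image, fast_extract, fidelity_score, strong_extract,
                            threshold=0.8):
    result = fast_extract(region_image)              # cheap first pass
    score = fidelity_score(region_image, result)     # reconstruct and compare to the crop
    if score >= threshold:
        return result, score, "fast"
    # Low fidelity: escalate this region to the expensive vision model.
    return strong_extract(region_image), score, "fallback"
&lt;/code&gt;&lt;/pre&gt;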

&lt;p&gt;Together, these three systems demonstrate a concrete, fast edit loop: generate a fragment, verify its structural and compilation integrity, and, if needed, optimise it with a targeted fallback. The pipeline stays interactive because each stage works on incremental diffs rather than re‑processing the whole document, and verification is grounded in measurable fidelity rather than opaque confidence scores.&lt;/p&gt;

&lt;p&gt;The papers leave several questions open. MAIC‑UI’s incremental diff engine is tied to HTML‑based interactive courseware; extending it to pure LaTeX authoring would require a different diff representation. TexOCR’s 2 B model, while impressive, still demands substantial GPU resources, which may limit on‑device deployment. RaV‑IDP’s reliance on a proprietary GPT‑4.1 vision model introduces latency and cost considerations that could outweigh the benefits in high‑throughput pipelines. Moreover, all three evaluations focus on STEM material; it remains to be seen whether the same approach scales to humanities or multilingual corpora.&lt;/p&gt;

&lt;p&gt;If you are building an AI‑augmented authoring platform, the takeaway is pragmatic: replace monolithic regeneration with diff‑driven incremental generation, attach a compilation‑aware OCR model that learns from unit‑test rewards, and wrap every extraction in a reconstruction‑based fidelity gate that can summon a stronger model only when needed. Benchmark each stage on your real query distribution before committing to a full migration, and measure both compile success rates and the number of human edit cycles saved. A fast, verifiable edit loop could turn AI‑written technical drafts from a risky experiment into a reliable coworker.&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2604.25806" rel="noopener noreferrer"&gt;MAIC-UI: Making Interactive Courseware with Generative UI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2604.23644" rel="noopener noreferrer"&gt;RaV-IDP: A Reconstruction-as-Validation Framework for Faithful Intelligent Document Processing&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
    <item>
      <title>Physics‑based adaptation slashes edge LLM energy</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Fri, 08 May 2026 05:00:00 +0000</pubDate>
      <link>https://forem.com/olaughter/physics-based-adaptation-slashes-edge-llm-energy-484d</link>
      <guid>https://forem.com/olaughter/physics-based-adaptation-slashes-edge-llm-energy-484d</guid>
      <description>&lt;p&gt;The conventional view holds that edge‑LLM runtimes are limited by static, rule‑of‑thumb scaling of compute and memory, leaving most of the device’s power budget unused. QEIL v2 overturns that assumption by grounding its resource allocator in a physics‑derived energy model and steering the search with simulated‑annealing, delivering a dramatic cut in inference energy.&lt;/p&gt;

&lt;p&gt;Earlier work, such as QEIL v1, relied on fixed efficiency factors and greedy heuristics, which yielded modest speedups but still depended on hand‑tuned knobs that ignored the chip’s actual power‑flow dynamics. The new system replaces every static heuristic with runtime‑adaptable metrics that trace back to semiconductor physics—compute utilization from roofline analysis, memory pressure from allocation theory, and thermal yield from CMOS leakage—while a Pareto‑guided simulated‑annealing engine explores the joint space of energy, latency, and device utilisation &lt;a href="https://arxiv.org/abs/2602.06057" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The results are striking. QEIL v2 delivers “75.7% pass@k at 63.8W (IPW 0.9749), a 2.86 × improvement over standard inference” &lt;a href="https://arxiv.org/abs/2602.06057" rel="noopener noreferrer"&gt;[1]&lt;/a&gt; and, more dramatically, “Total energy drops 75.6% vs. standard with 38.3% latency reduction, zero thermal throttling, and 100% fault recovery across all benchmarks and model families” &lt;a href="https://arxiv.org/abs/2602.06057" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. In practice this means that, for the evaluated 4‑bit Llama‑3.1‑8B model, the system can substantially extend runtime on a handheld device while staying within thermal envelopes and preserving inference quality.&lt;/p&gt;

&lt;p&gt;The paper notes that the gains stem from workload‑adaptive device allocation on models with reduced memory‑bandwidth requirements, which hints at two open questions. First, the evaluation focuses on models up to 8 B parameters; it remains unclear how the physics‑based routing scales to larger transformers that stress both compute and bandwidth. Second, the metrics assume accurate roofline and leakage models for the target silicon; devices without such profiling infrastructure may not reap the full benefit. Extending the approach to heterogeneous clusters or to GPUs with dynamic voltage scaling would also test the robustness of the energy equation.&lt;/p&gt;

&lt;p&gt;For engineers building on‑device AI, the takeaway is concrete: replace static scaling rules with runtime measurements of compute utilisation, memory pressure, and thermal yield, then feed those signals into a multi‑objective optimizer such as simulated annealing. Before committing to a new quantisation scheme, benchmark the edge system with QEIL v2’s Pareto‑guided search and verify that energy drops and latency improvements hold on the actual workload distribution. A modest investment in physics‑aware profiling could translate into hours of extra battery life for every deployed LLM.&lt;/p&gt;
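
&lt;p&gt;As a rough illustration of that last step, here is a generic simulated‑annealing loop over a weighted energy‑and‑latency objective. The &lt;code&gt;measure&lt;/code&gt; and &lt;code&gt;neighbor&lt;/code&gt; callables are stand‑ins for your own profiling signals and configuration moves; none of this reproduces QEIL v2's allocator.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Generic simulated annealing over runtime knobs (illustrative, not QEIL v2's allocator).
import math, random

def anneal(initial_config, neighbor, measure, steps=500, t0=1.0, alpha=0.99):
    """measure(config) should return (energy_joules, latency_seconds) from real profiling."""
    def cost(cfg):
        energy, latency = measure(cfg)
        return 0.7 * energy + 0.3 * latency          # weights are arbitrary placeholders
    current = initial_config
    current_cost = cost(current)
    best, best_cost = current, current_cost
    temp = t0
    for _ in range(steps):
        candidate = neighbor(current)                # e.g. tweak batch size, clocks, placement
        c = cost(candidate)
        delta = current_cost - c                     # positive means the candidate is better
        prob = 1.0 if delta > 0 else math.exp(delta / max(temp, 1e-9))
        if prob >= random.random():
            current, current_cost = candidate, c
            if best_cost > current_cost:
                best, best_cost = current, current_cost
        temp *= alpha                                # cool the schedule
    return best, best_cost
&lt;/code&gt;&lt;/pre&gt;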

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2602.06057" rel="noopener noreferrer"&gt;QEIL v2: Heterogeneous Computing for Edge Intelligence via Roofline-Derived Pareto-Optimal Energy Modeling and Multi-Objective Orchestration&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
    <item>
      <title>Micro LM delivers large‑model quality on device</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Fri, 08 May 2026 05:00:00 +0000</pubDate>
      <link>https://forem.com/olaughter/micro-lm-delivers-large-model-quality-on-device-n11</link>
      <guid>https://forem.com/olaughter/micro-lm-delivers-large-model-quality-on-device-n11</guid>
      <description>&lt;p&gt;Edge assistants have been forced to choose between a responsive first word and a thoughtful complete answer. The round‑trip to a cloud model routinely adds several seconds, shattering the illusion of a conversational partner. A new study shows that a model an order of magnitude smaller can seed the answer locally, letting a cloud model finish without the user noticing the handoff.&lt;/p&gt;

&lt;p&gt;Before this work, on‑device language models were a poor fit for wearables: the paper reports that even the smallest 100 M‑parameter models strain smartwatch CPUs and exceed tight power budgets, while cloud APIs dominate latency budgets &lt;a href="https://arxiv.org/abs/2604.19642" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. Consequently, many systems rely on pure cloud inference despite its latency penalty, or on rule‑based generators that produce stilted replies.&lt;/p&gt;

&lt;p&gt;The authors propose Micro Language Models (μLMs): ultra‑compact models (8 M–30 M parameters) that instantly generate the first 4‑8 words of a contextually grounded response on‑device, while a cloud model completes it, masking the cloud latency &lt;a href="https://arxiv.org/abs/2604.19642" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. In practice, the 28 M‑parameter “Swen” checkpoint runs on an Orange Pi and reaches a time‑to‑first‑token of 45 ms, a first‑token decode of 3 ms, and outputs four words in 55 ms, which is near‑instantaneous for all practical purposes &lt;a href="https://arxiv.org/abs/2604.19642" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. Those four words are enough to anchor the continuation, and the downstream cloud model produces completions that match the quality of 70 M–256 M‑parameter systems evaluated on standard generation benchmarks.&lt;/p&gt;
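
&lt;p&gt;The handoff pattern is easy to prototype. In the sketch below a small local model emits the first few words and a cloud model continues from that prefix; both callables are placeholders for your on‑device runtime and whichever hosted API you use, and the paper's Swen serving stack is not reproduced here.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Prefix-then-continue handoff (illustrative; both backends are placeholders).
def respond(prompt, local_prefix_generate, cloud_continue, prefix_words=6):
    # 1) Tiny on-device model: emit the first few grounded words right away.
    prefix = local_prefix_generate(prompt, max_words=prefix_words)
    print(prefix, end=" ", flush=True)       # the user sees text almost immediately

    # 2) Cloud model: continue from exactly that prefix while the user reads it.
    continuation = cloud_continue(prompt=prompt, prefix=prefix)
    print(continuation, flush=True)
    return prefix + " " + continuation
&lt;/code&gt;&lt;/pre&gt;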

&lt;p&gt;The approach leaves several questions open. The local generator only produces a handful of tokens, so any mistake in that prefix forces the cloud model to either correct or repeat the error, and the paper relies on three handcrafted error‑correction strategies to smooth such failures. Evaluation is limited to a single embedded platform and a specific cloud continuator (GPT‑4o in the demo); it remains unclear how the handoff behaves on more heterogeneous hardware or with lower‑capability back‑ends. Moreover, the quality gap is measured against existing mid‑size models, but not against the very latest instruction‑tuned giants, so the trade‑off may shift as those models improve.&lt;/p&gt;

&lt;p&gt;For developers of wearable or AR assistants, the practical takeaway is to prototype a hybrid pipeline rather than committing to an all‑cloud or all‑edge architecture. The released 28 M checkpoint can be dropped onto a modest SBC, benchmarked for first‑token latency on the target device, and then paired with any cloud LLM that accepts a prefix prompt. If you are building a smartwatch assistant, you can try swapping the local prefix generator with the 28 M Swen model and see whether the user‑perceived latency drops below 100 ms, while still delivering the nuanced responses that only a large model can provide.&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2604.19642" rel="noopener noreferrer"&gt;Micro Language Models Enable Instant Responses&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
    <item>
      <title>Tiny weight edits improve LLM safety</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Fri, 08 May 2026 05:00:00 +0000</pubDate>
      <link>https://forem.com/olaughter/tiny-weight-edits-improve-llm-safety-79e</link>
      <guid>https://forem.com/olaughter/tiny-weight-edits-improve-llm-safety-79e</guid>
      <description>&lt;p&gt;Targeted tweaks to specific attention heads can slash jailbreak success rates by several‑fold (e.g., reducing from 42% to 8% in the reported experiments), yet a subset of attacks remains viable. The same principle applies when pruning an almost negligible fraction of parameters, erasing most harmful outputs while leaving overall competence intact.&lt;/p&gt;

&lt;p&gt;Before these interventions, most safety pipelines leaned on broad‑scale alignment—RLHF, instruction fine‑tuning, or post‑hoc classifiers—without a precise view of which internal pathways enabled a model to refuse or comply. Even state‑of‑the‑art LLMs routinely fell to simple linguistic tricks, such as flipping tense, that bypassed their refusal mechanisms.&lt;/p&gt;

&lt;p&gt;ASGuard first isolates the attention heads that drive the tense‑changing jailbreak, then learns a channel‑wise scaling vector that dampens their activations. The authors report that “Our ASGuard surgically patches the targeted vulnerability (attack success rate of tense jailbreaking reduced from 42% to 8%, GCG reduced 15% to 1%, and LogiBreak 30% to 13% in Llama) based on synergistic combination with activation scaling vector” &lt;a href="https://arxiv.org/abs/2509.25843" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. When evaluated on four models, the method yields an overall attack success of just 8 % while preserving utility metrics in the mid‑60s to low‑70s—“ASGuard (Ours) | 8 | 96.4 | 66.8 | 68.2 | 71.8 | 52.9” &lt;a href="https://arxiv.org/abs/2509.25843" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;.&lt;/p&gt;
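
&lt;p&gt;Mechanically, a channel‑wise scaling intervention is the kind of thing PyTorch hooks are made for. The sketch below dampens chosen heads by patching the input of one attention layer's output projection, where the per‑head structure is still intact; the layer path, head indices, and scale values are placeholders and do not come from ASGuard's artifacts.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative channel-wise dampening of specific attention heads via a pre-hook.
# Head indices, scale values, and the layer handle are placeholders, not ASGuard's.
def make_head_scaling_pre_hook(num_heads, head_dim, scales):
    """Pre-hook for the attention output projection: its input is the concatenated
    per-head context, so the head structure is still intact at that point.
    scales: dict mapping head index to a multiplier between 0 and 1."""
    def pre_hook(module, args):
        hidden = args[0]                             # shape (batch, seq, num_heads * head_dim)
        b, t, d = hidden.shape
        heads = hidden.view(b, t, num_heads, head_dim).clone()
        for h, s in scales.items():
            heads[:, :, h, :] *= s                   # dampen the targeted head
        return (heads.view(b, t, d),) + args[1:]
    return pre_hook

# Usage sketch (layer path, head indices, and scales are placeholders):
# o_proj = model.model.layers[12].self_attn.o_proj
# handle = o_proj.register_forward_pre_hook(make_head_scaling_pre_hook(32, 128, {5: 0.2, 9: 0.3}))
&lt;/code&gt;&lt;/pre&gt;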

&lt;p&gt;A complementary line of work shows that harmful content generation hinges on an extremely compact weight motif. The study finds that “harmful content generation depends on a remarkably compact subset of model parameters—approximately 0.0005% of total parameters—which can be surgically removed while leaving general model capabilities largely intact” &lt;a href="https://arxiv.org/abs/2604.09544" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;. Moreover, “These reductions are achieved at remarkably low sparsity levels—approximately 0.0005% of total model parameters—indicating that the mechanism underlying harmful generation is extremely compressed” &lt;a href="https://arxiv.org/abs/2604.09544" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;. Pruning this tiny slice dramatically curtails emergent misalignment without noticeable degradation of benign performance.&lt;/p&gt;

&lt;p&gt;Both papers acknowledge constraints. ASGuard is evaluated only on tense‑based jailbreaks and a limited set of LLM families; its scaling vectors are derived from circuit analysis that may not transfer to other architectures or prompt patterns. The pruning study reports that safety gains appear at very low sparsity, but it does not explore long‑term effects on downstream fine‑tuning or rare capabilities that might also reside in the excised weights. Together, the results suggest that while a minimal circuit motif can be edited to block many attacks, a fully robust guard likely requires layered defenses and continual verification.&lt;/p&gt;

&lt;p&gt;For practitioners, the takeaway is practical rather than theoretical. A lightweight activation‑scaling wrapper around identified heads can be dropped into an existing serving stack as a cheap safety shim, avoiding costly full‑model retraining. When building new models, consider a pruning pass that removes the sub‑0.001 % of parameters most correlated with harmful token logits—validate the edit on your own query distribution before promotion. In environments where latency or compute budget is tight, these tiny edits offer a tractable path to hardening models against the bulk of jailbreak attempts.&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2509.25843" rel="noopener noreferrer"&gt;ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2604.09544" rel="noopener noreferrer"&gt;Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
    <item>
      <title>Stateless scheduler doubles LLM training speed</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Thu, 07 May 2026 05:00:00 +0000</pubDate>
      <link>https://forem.com/olaughter/stateless-scheduler-doubles-llm-training-speed-nfp</link>
      <guid>https://forem.com/olaughter/stateless-scheduler-doubles-llm-training-speed-nfp</guid>
      <description>&lt;p&gt;Fine‑tuning a 10 B‑parameter model on a single RTX 4090 feels like watching paint dry—most of the GPU sits idle while a handful of layers chew through memory, and the whole job stalls at a crawl. The bottleneck isn’t the raw FLOPs; it’s the rigid coupling between model weights and the slots you allocate on the device.&lt;/p&gt;

&lt;p&gt;Pipeline parallelism was supposed to solve that, but conventional schedules bind each model stage to a fixed GPU. When a heavyweight head sits on one card, that card becomes the choke point and bubbles waste up to 30 % of the pipeline’s capacity &lt;a href="https://arxiv.org/abs/2604.27085" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. The cache that powers autoregressive generation suffers a similar fate: each layer hoards its own key‑value memory, ballooning the footprint and throttling batch size.&lt;/p&gt;

&lt;p&gt;RoundPipe breaks the binding entirely. “RoundPipe treats GPUs as a pool of stateless execution workers and dynamically dispatches computation stages across devices in a round‑robin manner, achieving a near‑zero‑bubble pipeline” &lt;a href="https://arxiv.org/abs/2604.27085" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. In an eight‑RTX 4090 server it delivered &lt;strong&gt;1.48–2.16 ×&lt;/strong&gt; the throughput of the strongest existing baselines when fine‑tuning models from 1.7 B to 32 B parameters &lt;a href="https://arxiv.org/abs/2604.27085" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. The paper also shows LoRA‑based fine‑tuning of a 235 B‑parameter model with a 31 K token context on a single server, proving the scheduler scales far beyond the modest setups most hobbyists use.&lt;/p&gt;

&lt;p&gt;The memory side of the equation gets a similar lift from stochastic KV routing. By training layers to attend either to their own cache or to a predecessor’s, the approach lets several depths share a single cache without losing information. KV‑cache memory shrinks accordingly: “at 8K tokens, it drops from 1170 MB (baseline) to 293 MB …, a 4 × reduction. Decode throughput improves consistently, from 34.0 tok/s (baseline) to 41.6 tok/s … (+22 %) at 8K context, due to skipping [K/V] projections on non‑leader layers” &lt;a href="https://arxiv.org/abs/2604.22782" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;. The authors confirm that the technique “reduces memory consumption, enabling longer contexts and larger batch sizes” &lt;a href="https://arxiv.org/abs/2604.22782" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The results are impressive, yet they leave open questions. The reported speedups come from an &lt;strong&gt;eight‑GPU&lt;/strong&gt; configuration; it remains unclear how much of the gain survives on a single‑card setup, which is what most independent developers run. The stochastic KV scheme is evaluated on standard benchmarks, but its impact on niche domains or on models that already employ aggressive quantisation has not been explored. Moreover, the round‑robin dispatch assumes roughly homogeneous devices—heterogeneous clusters might re‑introduce imbalance.&lt;/p&gt;

&lt;p&gt;If you already struggle to squeeze a 7 B model into 24 GB of VRAM, trying RoundPipe’s open‑source library could let you push the same hardware closer to its theoretical ceiling before you need to shard further. Pairing it with depth‑wise KV sharing may free enough memory to double batch size or stretch context windows without sacrificing latency. The safest path is to profile your specific workload: measure token‑per‑second with and without the scheduler, then repeat after enabling stochastic KV routing. The numbers will tell whether the stateless pipeline delivers its advertised double‑speed on your own rig, or whether you need to add a second GPU to reap the full benefit.&lt;/p&gt;
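
&lt;p&gt;Profiling that comes down to a few lines: time a fixed prompt set, count the generated tokens, and repeat per configuration. The &lt;code&gt;generate_fn&lt;/code&gt; callable below is a placeholder for whatever entry point your stack exposes.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Throughput probe: tokens per second for a given generation entry point.
# generate_fn(prompt) is a placeholder that should return the number of new tokens.
import time

def tokens_per_second(generate_fn, prompts):
    start, total_tokens = time.perf_counter(), 0
    for p in prompts:
        total_tokens += generate_fn(p)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed

# Run once per configuration (baseline, RoundPipe, RoundPipe plus KV sharing) and compare.
&lt;/code&gt;&lt;/pre&gt;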

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2604.27085" rel="noopener noreferrer"&gt;Efficient Training on Multiple Consumer GPUs with RoundPipe&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2604.22782" rel="noopener noreferrer"&gt;Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
    <item>
      <title>AI agent logs expose reproducibility gaps</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Thu, 07 May 2026 05:00:00 +0000</pubDate>
      <link>https://forem.com/olaughter/ai-agent-logs-expose-reproducibility-gaps-190c</link>
      <guid>https://forem.com/olaughter/ai-agent-logs-expose-reproducibility-gaps-190c</guid>
      <description>&lt;p&gt;Across dozens of repeated executions, the same autonomous agent can flip from success to failure by a noticeable margin. The swing is not uniform; it widens dramatically on web‑navigation , exposing a gap between headline scores and day‑to‑day reliability.&lt;/p&gt;

&lt;p&gt;Historically, progress reports have leaned on single‑run leaderboards: a model that solves a benchmark once is declared “state‑of‑the‑art.” Few works have logged the entire interaction history of developers or systematically replayed the same task under identical conditions.&lt;/p&gt;

&lt;p&gt;The SWE‑chat corpus of 6 000 real‑world coding sessions shows how fragile that assumption is. “Less than half (44.3%) of all agent‑produced code survives into user commits (Table 3)” &lt;a href="https://arxiv.org/abs/2604.20779" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. Moreover, “Overall, users push back after 39% of turns, regardless of coding mode” &lt;a href="https://arxiv.org/abs/2604.20779" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;, indicating frequent manual corrections and interruptions even when the agent is nominally competent.&lt;/p&gt;

&lt;p&gt;A complementary study of computer‑use agents confirms the phenomenon on a different front. The authors observe that “yet even when the task and model are unchanged, an agent that succeeds once may fail on a repeated execution of the same task” &lt;a href="https://arxiv.org/abs/2604.17849" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;. By replaying each OSWorld task three times, they compute Pass^k, McNemar, and Wilcoxon scores that reveal statistically significant regressions for certain models (e.g., Qwen) while others improve (OpenCUA, UI‑TARS‑1.5). Crucially, “We find that clarification leads to consistent improvements across models, with more tasks transitioning from not reliably solved to reliably solved than the reverse (Figure 3)” &lt;a href="https://arxiv.org/abs/2604.17849" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;, pointing to ambiguous specifications as a key instability source.&lt;/p&gt;

&lt;p&gt;Both papers acknowledge constraints that temper the universality of their numbers. SWE‑chat captures only open‑source developers who opt into logging, and its “vibe coding” vs. “human‑only” split may not reflect enterprise workflows. The reliability study limits its variance assessment to three runs per task and to the OSWorld sandbox; stochasticity in larger, longer‑running deployments could manifest differently. Moreover, the reported gains from clarification assume a human‑in‑the‑loop that can disambiguate prompts on the fly.&lt;/p&gt;

&lt;p&gt;For teams eyeing production‑grade agents, the takeaway is to treat stability as a first‑class metric, not an afterthought. Incorporate repeated‑run suites into CI pipelines, report Pass^k or Wilcoxon scores alongside accuracy, and automate clarification dialogs where task intent is vague. Benchmarks that reward a single peak score risk overlooking the very variance that will surface once the agent is handed a real user’s keyboard. Monitoring these signals early can prevent costly rollbacks when an “improved” model suddenly drops from reliable to flaky under unchanged conditions.&lt;/p&gt;
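
&lt;p&gt;One concrete way to fold this into CI is to replay each task several times and track the fraction of tasks that succeed on every run, which is the spirit of the Pass^k‑style numbers above. The harness below is a generic sketch, not the authors' evaluation code.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Repeated-run reliability: tasks solved on all k attempts vs. on at least one.
def reliability_report(results):
    """results: dict mapping task_id to a list of booleans, one per repeated run."""
    n = len(results)
    all_runs = sum(1 for runs in results.values() if all(runs))
    any_run = sum(1 for runs in results.values() if any(runs))
    return {"tasks": n,
            "solved_every_run": all_runs / n,     # reliability-style metric
            "solved_at_least_once": any_run / n}  # what single-run leaderboards reward

print(reliability_report({"t1": [True, True, True], "t2": [True, False, True]}))
&lt;/code&gt;&lt;/pre&gt;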

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2604.20779" rel="noopener noreferrer"&gt;SWE-chat: Coding Agent Interactions From Real Users in the Wild&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2604.17849" rel="noopener noreferrer"&gt;On the Reliability of Computer Use Agents&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
    <item>
      <title>VideoLLM runs live video QA at 2 FPS</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Thu, 07 May 2026 05:00:00 +0000</pubDate>
      <link>https://forem.com/olaughter/videollm-runs-live-video-qa-at-2-fps-4ifg</link>
      <guid>https://forem.com/olaughter/videollm-runs-live-video-qa-at-2-fps-4ifg</guid>
      <description>&lt;p&gt;Most video‑large language models still operate on pre‑recorded clips, pausing after each inference. The emerging expectation that a model can watch a live feed and answer questions instantly has remained out of reach—until a system demonstrated continuous processing on a streaming pipeline.&lt;/p&gt;

&lt;p&gt;Earlier streaming attempts treated the visual front‑end and the language back‑end as separate stages, often limiting interaction to caption‑style narration or relying on explicit triggers before a response. Those designs struggled with open‑ended question answering and with maintaining context over long horizons.&lt;/p&gt;

&lt;p&gt;AURA unifies a video encoder with an LLM and adds a sliding‑window history that reuses prefix key‑value caches, yielding bounded latency. In practice the framework “supports a real‑time demo system with ASR and TTS running at 2 FPS on two 80G accelerators” &lt;a href="https://arxiv.org/abs/2604.04184" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. The authors also note that it can “stream video continuously for 5 minutes at 2 FPS” &lt;a href="https://arxiv.org/abs/2604.04184" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. This shows not only that the throughput is achievable, but that it is sustained over extended periods, making open‑ended QA on live video feasible.&lt;/p&gt;
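
&lt;p&gt;The history‑management idea is easy to mock up as a data structure: keep only the most recent window of per‑frame state and let old frames fall off. The sketch below is purely illustrative; AURA's actual prefix KV‑cache reuse happens inside the attention kernels and is not reproduced here.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Data-structure illustration of a sliding window over per-frame state (not AURA's kernels).
from collections import deque

class StreamingHistory:
    def __init__(self, max_frames=600):           # e.g. 5 minutes at 2 FPS
        self.frames = deque(maxlen=max_frames)    # old frames fall off automatically

    def add_frame(self, frame_features, frame_cache):
        self.frames.append({"features": frame_features, "cache": frame_cache})

    def context(self):
        # The prompt prefix cache is shared; only the windowed frame caches vary per step.
        return [f["cache"] for f in self.frames]
&lt;/code&gt;&lt;/pre&gt;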

&lt;p&gt;The paper evaluates AURA on several streaming benchmarks, and the reported 2 FPS throughput may be insufficient for high‑frame‑rate domains such as fast‑moving sports or autonomous driving. Moreover, the reliance on two 80 GB GPUs makes the approach costly for many deployments, and the sliding‑window cache strategy could encounter memory pressure as the interaction length grows. One open question is how the system behaves when the visual encoder processes higher‑resolution streams or when multiple camera feeds are merged.&lt;/p&gt;

&lt;p&gt;For practitioners eyeing real‑time multimodal assistants, the result suggests a concrete baseline: benchmark dense video‑LLM pipelines against AURA’s 2 FPS latency on comparable hardware before committing to more exotic architectures. If you need sub‑second responses on a live feed, allocate at least two high‑memory GPUs and adopt the cache‑reuse pattern to keep latency predictable. Monitoring the trade‑off between frame rate and context length will be essential as you move from demo to production.&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2604.04184" rel="noopener noreferrer"&gt;AURA: Always-On Understanding and Real-Time Assistance via Video Streams&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
  </channel>
</rss>
