<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Simon Paxton</title>
    <description>The latest articles on Forem by Simon Paxton (@simon_paxton).</description>
    <link>https://forem.com/simon_paxton</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3812173%2Fa596220b-d0d6-4427-ba84-c4a2f45f39d5.png</url>
      <title>Forem: Simon Paxton</title>
      <link>https://forem.com/simon_paxton</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/simon_paxton"/>
    <language>en</language>
    <item>
      <title>Firefox Zero-Day: Mozilla Says Claude Mythos Found 271 Bugs</title>
      <dc:creator>Simon Paxton</dc:creator>
      <pubDate>Sun, 10 May 2026 04:38:33 +0000</pubDate>
      <link>https://forem.com/simon_paxton/firefox-zero-day-mozilla-says-claude-mythos-found-271-bugs-5e3g</link>
      <guid>https://forem.com/simon_paxton/firefox-zero-day-mozilla-says-claude-mythos-found-271-bugs-5e3g</guid>
      <description>&lt;p&gt;Mozilla said this week that its &lt;strong&gt;Firefox zero-day&lt;/strong&gt; hardening work with an early version of &lt;strong&gt;Claude Mythos Preview&lt;/strong&gt; helped identify and fix &lt;strong&gt;271 vulnerabilities&lt;/strong&gt; shipped in Firefox 150. In Mozilla’s account, the model-assisted effort followed earlier scans with Opus 4.6 that had already led to fixes for 22 security-sensitive bugs in Firefox 148.&lt;/p&gt;

&lt;p&gt;The company also named &lt;strong&gt;three Firefox CVEs&lt;/strong&gt; it explicitly credited to Claude Mythos Preview: &lt;strong&gt;CVE-2026-6746, CVE-2026-6757, and CVE-2026-6758&lt;/strong&gt;. Mozilla’s public posts said the broader set of 271 findings came from the initial Mythos evaluation, while most of those fixes did not receive individual public CVE listings.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI model helps Mozilla fix 271 Firefox vulnerabilities
&lt;/h2&gt;

&lt;p&gt;In Mozilla’s security blog, the company said Firefox 150 includes fixes for &lt;strong&gt;271 vulnerabilities&lt;/strong&gt; identified during its first evaluation of Claude Mythos Preview. Mozilla said the work started in February, when the Firefox team began using frontier AI models to look for latent security bugs in the browser.&lt;/p&gt;

&lt;p&gt;That number is a lot larger than Mozilla’s previous public AI-assisted result. The same post said earlier work with &lt;strong&gt;Claude Opus 4.6&lt;/strong&gt; produced fixes for &lt;strong&gt;22 security-sensitive bugs&lt;/strong&gt; in Firefox 148.&lt;/p&gt;

&lt;p&gt;Ars Technica reported that Mozilla engineers described the latest results as having &lt;strong&gt;“almost no false positives.”&lt;/strong&gt; That is the part worth noting. AI bug reports have usually had the opposite reputation: lots of plausible text, then a human spends the afternoon discovering the bug does not exist.&lt;/p&gt;

&lt;p&gt;Mozilla’s own engineering post acknowledges that reputation. It described earlier AI-generated security reports as “unwanted slop” and said the dynamic changed because the models improved and because Mozilla got better at steering and filtering them.&lt;/p&gt;

&lt;p&gt;For related coverage of model performance on offensive and defensive security tasks, see NovaKnown’s earlier reporting on &lt;a href="https://novaknown.com/2026/04/14/ai-cyber-capabilities/" rel="noopener noreferrer"&gt;AI cyber capabilities&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mozilla says Claude Mythos Preview found three Firefox zero-days
&lt;/h2&gt;

&lt;p&gt;Mozilla’s advisories for Firefox 150 publicly credit &lt;strong&gt;three named vulnerabilities&lt;/strong&gt; to Claude Mythos Preview: &lt;strong&gt;CVE-2026-6746&lt;/strong&gt;, &lt;strong&gt;CVE-2026-6757&lt;/strong&gt;, and &lt;strong&gt;CVE-2026-6758&lt;/strong&gt;. Those are the clearest line items connecting the model to specific disclosed bugs.&lt;/p&gt;

&lt;p&gt;Mozilla’s engineering post added an important detail about the mix of findings: some of the reports were &lt;strong&gt;sandbox escapes&lt;/strong&gt;. In browser security, a sandbox escape is a bug that lets code break out of the restricted rendering process into a more privileged one.&lt;/p&gt;

&lt;p&gt;Mozilla said those sandbox escapes would need to be &lt;strong&gt;combined with other exploits&lt;/strong&gt; to produce a full-chain Firefox compromise. The company also said the model was allowed to patch Firefox source code during these investigations, as long as the modified code only ran in the sandboxed process.&lt;/p&gt;

&lt;p&gt;That matters because Mozilla framed these as hardening results across multiple browser subsystems, not just a list of one-shot critical remote code execution bugs. Several findings were defense-in-depth issues or bugs that improved exploitability boundaries rather than standalone takeover chains.&lt;/p&gt;

&lt;p&gt;For background on the model itself, see NovaKnown’s earlier coverage of &lt;a href="https://novaknown.com/2026/04/11/anthropic-mythos-hype/" rel="noopener noreferrer"&gt;Anthropic Mythos&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The harness and workflow behind the Firefox scans
&lt;/h2&gt;

&lt;p&gt;Mozilla said the jump in useful findings came from a custom &lt;strong&gt;agent harness&lt;/strong&gt; wrapped around the model. The harness gave the LLM instructions, access to project tools, and a loop that kept it working until it either produced a verifiable result or ran out of road.&lt;/p&gt;

&lt;p&gt;Ars Technica quoted Mozilla Distinguished Engineer Brian Grinstead describing the harness as code that tells the model to find a bug in a file, gives it tools to read and write files and run test cases, and then keeps iterating until completion. Mozilla said the harness plugged the model into the same testing pipeline and special Firefox builds its human developers already use.&lt;/p&gt;

&lt;p&gt;One concrete example was memory-safety work with sanitizer builds. Grinstead said the team could point the agent at a source file, tell it there was an issue to find, and let it generate test cases until it produced a crash under the sanitizer build. That is a much clearer success condition than “read this code and tell me if anything looks bad,” which is how you get slop.&lt;/p&gt;
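
&lt;p&gt;As a rough sketch of that loop’s shape (not Mozilla’s actual harness, which has not been released), the control flow is small: give the model a file and a goal, let it propose test cases, run each one against an instrumented build, and stop only on a verifiable crash. The &lt;code&gt;ask_model&lt;/code&gt; and &lt;code&gt;run_sanitizer_build&lt;/code&gt; functions below are hypothetical stand-ins.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal sketch of an agent harness with a deterministic success condition.
# ask_model() and run_sanitizer_build() are hypothetical stand-ins; Mozilla's
# real harness, tools, and instrumented builds have not been published.

def ask_model(source_file, history):
    """Placeholder for the LLM call that proposes a new test case."""
    return "candidate_test_case"

def run_sanitizer_build(source_file, test_case):
    """Placeholder for running the test under an ASan-style instrumented build.
    Returns True only when the build reports a crash or memory error."""
    return False

def find_memory_bug(source_file, max_attempts=50):
    history = []
    for _ in range(max_attempts):
        test_case = ask_model(source_file, history)      # model proposes an input
        if run_sanitizer_build(source_file, test_case):  # deterministic check, not prose
            return test_case                             # verified finding
        history.append(test_case)                        # feed the failure back into the loop
    return None                                          # budget exhausted, report nothing

print(find_memory_bug("example_module.cpp"))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The useful property is that the loop only ends on evidence the build system can confirm, which is the behavior Mozilla credits for the low false-positive rate.&lt;/p&gt;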

&lt;p&gt;Mozilla also said the model could use existing fuzzing infrastructure and other internal tools. The workflow was not a chatbot staring at source code. It was an LLM inside a project-specific loop with deterministic checks.&lt;/p&gt;

&lt;p&gt;The company’s post says this setup improved both &lt;strong&gt;signal generation&lt;/strong&gt; and &lt;strong&gt;noise filtering&lt;/strong&gt;. That lands squarely in the bucket of &lt;a href="https://novaknown.com/2026/04/24/llm-failure-modes/" rel="noopener noreferrer"&gt;LLM failure modes&lt;/a&gt;: the model still needs a workflow that can verify outputs against reality.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Mozilla framed the findings and the remaining caveats
&lt;/h2&gt;

&lt;p&gt;Mozilla’s public framing was blunt. In its security post, the company wrote that “the zero-days are numbered” and said defenders now have a chance to win “decisively.” The reporting underneath that claim is narrower and more concrete: Firefox 150 shipped with 271 fixes tied to model-assisted hardening work, and Mozilla published extra technical detail because of the level of interest.&lt;/p&gt;

&lt;p&gt;Mozilla also said it intentionally released only a &lt;strong&gt;small sample&lt;/strong&gt; of the underlying reports. The company normally keeps detailed bug reports private for several months after fixes ship, and said it made a calculated decision to unhide some examples earlier than usual.&lt;/p&gt;

&lt;p&gt;The engineering post also describes what the models &lt;strong&gt;did not&lt;/strong&gt; find. Mozilla said some hardened surfaces and layered defenses held up against the model’s attempts, including areas where previous human researchers had found clever routes. That is a useful detail because it puts the Firefox zero-day discussion on actual terrain: a browser with layered mitigations, not a generic claim that AI now solves security.&lt;/p&gt;

&lt;p&gt;A separate government evaluation from the &lt;strong&gt;UK AI Security Institute&lt;/strong&gt; puts Claude Mythos Preview’s cyber performance in a broader comparison set. The institute said an early checkpoint of &lt;strong&gt;GPT-5.5&lt;/strong&gt; now reaches a &lt;strong&gt;similar level&lt;/strong&gt; on its cyber evaluations, after Mythos Preview had previously been the first model to complete its end-to-end corporate network attack simulation. On the institute’s expert-level cyber tasks, GPT-5.5 posted a &lt;strong&gt;71.4%&lt;/strong&gt; average pass rate versus &lt;strong&gt;68.6%&lt;/strong&gt; for Mythos Preview.&lt;/p&gt;

&lt;p&gt;That evaluation does not measure Firefox directly. It does, however, place Mozilla’s Firefox zero-day work next to an external benchmark showing Mythos Preview is no longer alone at that performance tier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Mozilla said Firefox 150 includes fixes for &lt;strong&gt;271 vulnerabilities&lt;/strong&gt; found during an initial evaluation of &lt;strong&gt;Claude Mythos Preview&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Mozilla explicitly credited &lt;strong&gt;three CVEs&lt;/strong&gt; to Claude Mythos Preview: &lt;strong&gt;CVE-2026-6746&lt;/strong&gt;, &lt;strong&gt;CVE-2026-6757&lt;/strong&gt;, and &lt;strong&gt;CVE-2026-6758&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Mozilla said a custom &lt;strong&gt;agent harness&lt;/strong&gt; was central to the results, giving the model tools, test infrastructure, and deterministic verification loops.&lt;/li&gt;
&lt;li&gt;Some findings were &lt;strong&gt;sandbox escapes&lt;/strong&gt; and defense-in-depth issues, not all standalone full-chain compromises.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;UK AI Security Institute&lt;/strong&gt; said &lt;strong&gt;GPT-5.5&lt;/strong&gt; now performs at a similar level to Mythos Preview on its cyber evaluations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://blog.mozilla.org/en/privacy-security/ai-security-zero-day-vulnerabilities/" rel="noopener noreferrer"&gt;The zero-days are numbered&lt;/a&gt; — Mozilla’s security post on Firefox 150 and the 271 vulnerabilities tied to Mythos-assisted hardening.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hacks.mozilla.org/2026/05/behind-the-scenes-hardening-firefox/" rel="noopener noreferrer"&gt;Behind the Scenes Hardening Firefox with Claude Mythos Preview&lt;/a&gt; — Mozilla’s engineering write-up on the harness, sample reports, and workflow details.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5-5-cyber-capabilities" rel="noopener noreferrer"&gt;Our evaluation of OpenAI's GPT-5.5 cyber capabilities&lt;/a&gt; — The UK AI Security Institute’s benchmark comparing GPT-5.5 with Mythos Preview.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arstechnica.com/information-technology/2026/05/mozilla-says-271-vulnerabilities-found-by-mythos-have-almost-no-false-positives/" rel="noopener noreferrer"&gt;Mozilla says 271 vulnerabilities found by Mythos have ‘almost no false positives’&lt;/a&gt; — Ars Technica’s report with additional quotes from Mozilla engineers on the harness and verification loop.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://novaknown.com/?p=2802" rel="noopener noreferrer"&gt;novaknown.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>mozilla</category>
      <category>firefox</category>
      <category>anthropic</category>
      <category>openai</category>
    </item>
    <item>
      <title>1,000x Claim, No Independent Proof: Subquadratic Architecture</title>
      <dc:creator>Simon Paxton</dc:creator>
      <pubDate>Fri, 08 May 2026 04:24:23 +0000</pubDate>
      <link>https://forem.com/simon_paxton/1000x-claim-no-independent-proof-subquadratic-architecture-44h</link>
      <guid>https://forem.com/simon_paxton/1000x-claim-no-independent-proof-subquadratic-architecture-44h</guid>
      <description>&lt;p&gt;Subquadratic launched from stealth this week with a claim that its &lt;strong&gt;subquadratic architecture&lt;/strong&gt; can cut attention compute by nearly &lt;strong&gt;1,000x&lt;/strong&gt; at very large context lengths. On its launch page, the startup said its first model, &lt;strong&gt;SubQ 1M-Preview&lt;/strong&gt;, is built on a “fully subquadratic architecture” rather than the standard transformer pattern where attention cost rises quadratically with context length.&lt;/p&gt;

&lt;p&gt;The headline number is large enough to attract immediate scrutiny. VentureBeat reported that Subquadratic had not published independent research validating the claim at launch, even as it pitched three private-beta products built around the same &lt;strong&gt;subquadratic architecture&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Subquadratic claims a 1,000x attention-compute reduction
&lt;/h2&gt;

&lt;p&gt;On its launch page, Subquadratic says its model belongs to “a new class of large language models” and that its &lt;strong&gt;subquadratic architecture&lt;/strong&gt; reduces attention compute by “almost 1,000x compared to other frontier models.” The company ties that figure to very long inputs, saying the comparison applies at &lt;strong&gt;12 million tokens&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That is a direct shot at the main cost curve in transformer models. In standard attention, each token is compared with every other token, so compute grows quadratically as context gets longer. Subquadratic says its approach changes that scaling so compute grows linearly with context length instead.&lt;/p&gt;
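
&lt;p&gt;A back-of-envelope sketch shows why the headline multiplier depends entirely on constants Subquadratic has not published. The per-token budget below is an assumed number chosen only to make the arithmetic concrete, not a figure from the company.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Back-of-envelope scaling comparison. The per-token budget is an assumed
# illustration, not a published Subquadratic figure.
ctx = 12_000_000                     # context length cited on the launch page
quadratic_cost = ctx ** 2            # standard attention: every token attends to every token
per_token_budget = 12_000            # assumed fixed work per token in a linear-scaling design
linear_cost = ctx * per_token_budget

print(quadratic_cost / linear_cost)  # 1000.0 with these assumed constants
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At 12 million tokens, any design with a bounded per-token cost beats quadratic attention by orders of magnitude, so the specific multiplier mostly reflects constants the company has not disclosed, which is exactly what independent validation would need to pin down.&lt;/p&gt;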

&lt;h2&gt;
  
  
  SubQ 1M-Preview and the products it is pitching
&lt;/h2&gt;

&lt;p&gt;VentureBeat reported that the company’s first model is called &lt;strong&gt;SubQ 1M-Preview&lt;/strong&gt;. Alongside it, Subquadratic launched three products into private beta:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an &lt;strong&gt;API&lt;/strong&gt; with access to the full context window&lt;/li&gt;
&lt;li&gt;a command-line coding agent called &lt;strong&gt;SubQ Code&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;a search product called &lt;strong&gt;SubQ Search&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The launch positions the model as more than a research claim. The company is already packaging the &lt;strong&gt;subquadratic architecture&lt;/strong&gt; as an API, a coding tool, and a search system.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Subquadratic says about long-context costs
&lt;/h2&gt;

&lt;p&gt;Subquadratic’s pitch is centered on long-context workloads, where context length means how much text a model can process in one shot. The company says lower attention cost makes workloads that were previously too expensive to run at scale more practical.&lt;/p&gt;

&lt;p&gt;That claim lines up with a real bottleneck. In conventional transformer systems, doubling context length does not double attention cost; it quadruples it. That is why long-context applications often rely on retrieval, chunking, and other workarounds instead of simply sending everything to the model. NovaKnown has covered adjacent efficiency work before in pieces on &lt;a href="https://novaknown.com/2026/04/20/speculative-checkpointing/" rel="noopener noreferrer"&gt;speculative checkpointing&lt;/a&gt;, &lt;a href="https://novaknown.com/2026/04/16/llm-performance-drop/" rel="noopener noreferrer"&gt;LLM performance drop&lt;/a&gt;, and &lt;a href="https://novaknown.com/2026/04/26/claude-code-token-usage/" rel="noopener noreferrer"&gt;Claude Code token usage&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Published evidence at launch
&lt;/h2&gt;

&lt;p&gt;The missing piece at launch was independent backing. VentureBeat reported that the efficiency numbers would matter only &lt;strong&gt;if validated independently&lt;/strong&gt;, and that no published independent research was available at the time of the announcement.&lt;/p&gt;

&lt;p&gt;That leaves the public record in a very specific state. Subquadratic has made a concrete claim about a &lt;strong&gt;subquadratic architecture&lt;/strong&gt;, given a concrete figure for attention-compute reduction, and announced products based on it. What it had not done, in the material available at launch, was publish outside validation showing the architecture performs as claimed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Funding and launch details
&lt;/h2&gt;

&lt;p&gt;VentureBeat reported that Subquadratic emerged from stealth on Tuesday and had raised &lt;strong&gt;$29 million&lt;/strong&gt; in seed funding. The report said investors include Tinder co-founder &lt;strong&gt;Justin Mateen&lt;/strong&gt;, former SoftBank Vision Fund partner &lt;strong&gt;Javier Villamizar&lt;/strong&gt;, and early investors in Anthropic, OpenAI, Stripe, and Brex.&lt;/p&gt;

&lt;p&gt;The same VentureBeat report cited The New Stack as saying the raise valued the company at &lt;strong&gt;$500 million&lt;/strong&gt;. Those funding details sat alongside the product launch, not a peer-reviewed paper — which is one reason the discussion around the model quickly split between curiosity and demands for proof.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Subquadratic says its &lt;strong&gt;subquadratic architecture&lt;/strong&gt; reduces attention compute by almost &lt;strong&gt;1,000x&lt;/strong&gt; compared with frontier models at &lt;strong&gt;12 million tokens&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The company’s first announced model is &lt;strong&gt;SubQ 1M-Preview&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Subquadratic launched three private-beta products: &lt;strong&gt;SubQ API&lt;/strong&gt;, &lt;strong&gt;SubQ Code&lt;/strong&gt;, and &lt;strong&gt;SubQ Search&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The company says its design changes long-context economics by making compute scale linearly with context length instead of quadratically.&lt;/li&gt;
&lt;li&gt;At launch, VentureBeat reported &lt;strong&gt;no independent published validation&lt;/strong&gt; of the architecture claim.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://subq.ai/introducing-subq" rel="noopener noreferrer"&gt;Introducing SubQ&lt;/a&gt; — Subquadratic’s launch page outlining its fully subquadratic architecture and attention-compute claim.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://venturebeat.com/technology/miami-startup-subquadratic-claims-1-000x-ai-efficiency-gain-with-subq-model-researchers-demand-independent-proof/" rel="noopener noreferrer"&gt;Miami startup Subquadratic claims 1,000x AI efficiency gain with SubQ model, researchers demand independent proof&lt;/a&gt; — VentureBeat’s report on the model, product launch, funding, and the lack of independent published validation at launch.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://novaknown.com/?p=2799" rel="noopener noreferrer"&gt;novaknown.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>subquadratic</category>
      <category>venturebeat</category>
      <category>openai</category>
      <category>anthropic</category>
    </item>
    <item>
      <title>They Rejected It. It’s Building Anyway: OpenAI Oracle Data Center</title>
      <dc:creator>Simon Paxton</dc:creator>
      <pubDate>Thu, 07 May 2026 21:38:31 +0000</pubDate>
      <link>https://forem.com/simon_paxton/they-rejected-it-its-building-anyway-openai-oracle-data-center-4h6f</link>
      <guid>https://forem.com/simon_paxton/they-rejected-it-its-building-anyway-openai-oracle-data-center-4h6f</guid>
      <description>&lt;p&gt;The &lt;strong&gt;OpenAI Oracle data center&lt;/strong&gt; project in Saline Township, Michigan moved into construction after local officials voted down the rezoning tied to the site. Fortune reported in May that building work had begun even after the local rejection.&lt;/p&gt;

&lt;p&gt;The Washington Post reported that OpenAI is involved in the planned giant Michigan facility, and Data Center Dynamics identified Oracle as a partner on the same project. That puts the &lt;strong&gt;OpenAI Oracle data center&lt;/strong&gt; inside the companies' broader Stargate data center push, but in this case the immediate story is local: a farm-town rezoning fight that did not stop construction.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenAI and Oracle’s Michigan data center project went ahead after local rejection
&lt;/h2&gt;

&lt;p&gt;Construction started after the township board rejected the rezoning request for the project. Fortune reported that the Saline Township development had already been turned down by local officials before work began.&lt;/p&gt;

&lt;p&gt;Data Center Dynamics reported that the project is being developed with &lt;strong&gt;OpenAI and Oracle&lt;/strong&gt; in Saline Township, a rural community near Ann Arbor. The outlet described the project as part of Stargate and as the center of a dispute between developers and township governance.&lt;/p&gt;

&lt;p&gt;That sequence is the unusual part. In the standard version of a local land-use fight, a failed rezoning vote stops the project. Here, the &lt;strong&gt;OpenAI Oracle data center&lt;/strong&gt; advanced into construction anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Saline Township voted down the rezoning before construction started
&lt;/h2&gt;

&lt;p&gt;Planet Detroit reported that the Saline Township board voted to deny the rezoning needed for the project. The vote came before construction moved forward.&lt;/p&gt;

&lt;p&gt;The site dispute centered on whether farmland in the township could be used for the data center development. Fortune's reporting on the start of construction placed that earlier denial at the center of the local conflict.&lt;/p&gt;

&lt;p&gt;Saline Township's role was not symbolic. Local officials had already said no to the rezoning request before the project advanced on the ground.&lt;/p&gt;

&lt;h2&gt;
  
  
  The project became a zoning fight over exclusionary zoning
&lt;/h2&gt;

&lt;p&gt;The Washington Post reported that the developer sued and argued that the township's rejection amounted to &lt;strong&gt;exclusionary zoning&lt;/strong&gt;. In plain English, that is the claim that local zoning rules were used to block a lawful category of development rather than regulate it neutrally.&lt;/p&gt;

&lt;p&gt;The same report tied the lawsuit directly to the Saline Township data center project involving OpenAI. Data Center Dynamics separately described the conflict as a zoning dispute between the companies behind the project and local government.&lt;/p&gt;

&lt;p&gt;Planet Detroit also reported on later court activity, including a judge denying intervention in the case. By then, the dispute had already moved beyond a local board vote and into litigation over how the township used its zoning power.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenAI’s role sits inside the broader Stargate buildout
&lt;/h2&gt;

&lt;p&gt;OpenAI said in its January announcement of the &lt;strong&gt;Stargate Project&lt;/strong&gt; that it was launching a major AI infrastructure effort with partners including Oracle. That company announcement established OpenAI's role in the larger buildout that the Michigan project is part of.&lt;/p&gt;

&lt;p&gt;The Michigan facility is one local piece of that broader infrastructure push. The Washington Post connected the Saline Township project to OpenAI's AI expansion, and Data Center Dynamics explicitly linked the site to Stargate.&lt;/p&gt;

&lt;p&gt;If this sounds familiar, it fits a broader pattern of AI infrastructure colliding with local land-use politics. NovaKnown has covered the spending side in &lt;a href="https://novaknown.com/2026/04/19/ai-datacenter-spending/" rel="noopener noreferrer"&gt;AI Datacenter Spending&lt;/a&gt;, community resistance in &lt;a href="https://novaknown.com/2026/04/14/data-center-backlash-festus/" rel="noopener noreferrer"&gt;Data Center Backlash&lt;/a&gt;, and new buildouts abroad in &lt;a href="https://novaknown.com/2026/03/19/datagrid-new-zealand-ai-factory/" rel="noopener noreferrer"&gt;Datagrid New Zealand&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;OpenAI Oracle data center&lt;/strong&gt; in Saline Township moved into construction after local officials rejected the rezoning tied to the project.&lt;/li&gt;
&lt;li&gt;OpenAI and Oracle were both identified by named sources as participants in the Michigan project, which is tied to Stargate.&lt;/li&gt;
&lt;li&gt;Planet Detroit reported that the township board voted down the rezoning before construction began.&lt;/li&gt;
&lt;li&gt;The Washington Post reported that a lawsuit over the project argues the township used exclusionary zoning.&lt;/li&gt;
&lt;li&gt;The dispute has become both a local governance fight and a court case over the township's zoning decision.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://openai.com/index/announcing-the-stargate-project/" rel="noopener noreferrer"&gt;OpenAI — Announcing the Stargate Project&lt;/a&gt; — OpenAI's announcement of Stargate and its infrastructure partnership with Oracle.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.washingtonpost.com/nation/2025/10/13/data-center-bans-lawsuit/" rel="noopener noreferrer"&gt;The Washington Post — Data center bans lawsuit&lt;/a&gt; — Reports OpenAI's involvement in the Michigan project and the exclusionary-zoning lawsuit.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.datacenterdynamics.com/en/news/planned-stargate-data-center-encounters-opposition-in-saline-township-michigan/" rel="noopener noreferrer"&gt;Data Center Dynamics — Planned Stargate data center encounters opposition in Saline Township, Michigan&lt;/a&gt; — Details Oracle's involvement and the local zoning dispute.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://planetdetroit.org/2026/02/judge-denies-data-center-intervention/" rel="noopener noreferrer"&gt;Planet Detroit — Judge denies data center intervention&lt;/a&gt; — Covers the township board's rezoning denial and later court proceedings.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://fortune.com/2026/05/06/ai-data-center-michigan-saline-politics-farmland/" rel="noopener noreferrer"&gt;Fortune — AI data center Michigan Saline politics farmland&lt;/a&gt; — Reports that construction began after the local rejection.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://novaknown.com/?p=2796" rel="noopener noreferrer"&gt;novaknown.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openai</category>
      <category>oracle</category>
      <category>stargate</category>
      <category>michigan</category>
    </item>
    <item>
      <title>OpenAI Revenue is Not the Whole Story: Anthropic's Enterprise Bet</title>
      <dc:creator>Simon Paxton</dc:creator>
      <pubDate>Sun, 03 May 2026 06:16:56 +0000</pubDate>
      <link>https://forem.com/simon_paxton/openai-revenue-is-not-the-whole-story-anthropics-enterprise-bet-4p6i</link>
      <guid>https://forem.com/simon_paxton/openai-revenue-is-not-the-whole-story-anthropics-enterprise-bet-4p6i</guid>
      <description>&lt;p&gt;OpenAI revenue is still the number people reach for when they want a leaderboard. But the cleaner frame is different: Anthropic appears to be building a different kind of AI business, one centered on enterprise customers, safety positioning, and less dependence on mass-market fame.&lt;/p&gt;

&lt;p&gt;That distinction matters because public discussion keeps collapsing three separate things into one scorecard: revenue, valuation, and brand recognition. The available sources here do &lt;em&gt;not&lt;/em&gt; show that Anthropic has passed OpenAI on valuation or revenue. They do show why Anthropic can look strong anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Anthropic’s Enterprise Focus Changes The Revenue Conversation
&lt;/h2&gt;

&lt;p&gt;The strategic frame is &lt;strong&gt;revenue mix versus public visibility&lt;/strong&gt;. A company can be less famous and still look formidable if it is optimized for business spending rather than consumer attention.&lt;/p&gt;

&lt;p&gt;Anthropic’s own &lt;a href="https://www.anthropic.com/enterprise" rel="noopener noreferrer"&gt;Claude for Enterprise&lt;/a&gt; page makes that positioning unusually explicit. It leads with enterprise workflows, secure connections to company knowledge, and business use cases rather than a mass-market assistant pitch.&lt;/p&gt;

&lt;p&gt;That is a different motion from a consumer product becoming a household verb. Enterprise buyers care about access controls, internal knowledge retrieval, and whether a tool can slot into existing company work. Anthropic is selling into that budget line.&lt;/p&gt;

&lt;p&gt;A small detail on the same page is revealing: Anthropic highlights Lyft reducing customer support time by &lt;strong&gt;87% with Claude&lt;/strong&gt;. That is not consumer marketing. It is a procurement story, aimed at managers who sign contracts after seeing labor savings and workflow gains.&lt;/p&gt;

&lt;p&gt;This is why claims about &lt;strong&gt;OpenAI revenue&lt;/strong&gt; often miss the interesting part. Two AI companies can generate money through very different channels. One can dominate public awareness while the other builds a quieter base of higher-touch business accounts.&lt;/p&gt;

&lt;p&gt;That difference also helps explain why Anthropic shows up so often in discussions about workplace AI adoption and developer workflows, including in comparisons like our look at &lt;a href="https://novaknown.com/2026/03/30/claude-vs-chatgpt/" rel="noopener noreferrer"&gt;Claude vs ChatGPT&lt;/a&gt;. The products overlap, but the go-to-market emphasis is not the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenAI Still Has The Stronger Consumer Brand
&lt;/h2&gt;

&lt;p&gt;On public recognition, the gap is much easier to support. The Reuters Institute’s 2025 report says &lt;strong&gt;ChatGPT is by far the most widely recognised generative AI system&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That matters because brand recognition and revenue are related, but they are not interchangeable. ChatGPT gave OpenAI something Anthropic does not have at the same scale: a consumer brand that functions as category shorthand.&lt;/p&gt;

&lt;p&gt;When people talk about AI in casual conversation, they usually say “ChatGPT,” not “Claude.” That creates distribution all by itself. It also makes &lt;strong&gt;OpenAI revenue&lt;/strong&gt; a more natural headline than Anthropic’s business performance, because consumer familiarity drives media attention.&lt;/p&gt;

&lt;p&gt;Anthropic’s relative lack of consumer fame should not be confused with weakness. It means the company is playing a different game. OpenAI owns more of the public mindshare; Anthropic is visibly pitching itself to organizations that care more about internal deployment than mass recognition.&lt;/p&gt;

&lt;p&gt;There is a second-order effect here. Consumer fame tends to distort how outsiders judge company strength. A company with the stronger household brand often gets treated as if it must also lead every business metric. That is exactly the shortcut readers should avoid.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Anthropic’s Public Messaging Reveals About Its Business Model
&lt;/h2&gt;

&lt;p&gt;Anthropic’s homepage is unusually consistent about one thing: &lt;strong&gt;safety is not a side note&lt;/strong&gt;. The company foregrounds “AI to serve humanity’s long-term well-being,” links to its Responsible Scaling Policy, and frames Claude as “a space to think” with “No ads. No sponsored content.”&lt;/p&gt;

&lt;p&gt;That messaging is branding, but it is also customer selection. Safety language, governance language, and enterprise product pages all point toward buyers who want a lower-drama procurement story: controlled deployment, business use cases, and an AI vendor that talks like a risk committee can live with it.&lt;/p&gt;

&lt;p&gt;This is the part many leaderboard arguments miss. Anthropic’s safety posture is not just philosophy; it is part of the sales motion. For an enterprise customer, especially one connecting internal company knowledge, trust signals can be part of the product.&lt;/p&gt;

&lt;p&gt;That does not mean the strategy is frictionless. Enterprise-first companies often run harder into account controls, permissions, and support expectations. You can see how quickly trust becomes operational, not abstract, in situations like reported &lt;a href="https://novaknown.com/2026/04/23/anthropic-bans-without-warning/" rel="noopener noreferrer"&gt;Anthropic bans&lt;/a&gt;. Once your buyer is a business, reliability and account handling become part of the value proposition.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Valuation Claims Need Caution When Revenue Is Not Public
&lt;/h2&gt;

&lt;p&gt;Here is the part worth stating plainly: the source set does &lt;strong&gt;not&lt;/strong&gt; support the claim that Anthropic has overtaken OpenAI in valuation or revenue. Those claims were investigated and dropped.&lt;/p&gt;

&lt;p&gt;That leaves a narrower, better argument. Anthropic’s enterprise positioning explains why some observers may &lt;em&gt;feel&lt;/em&gt; like it is winning, especially inside technical teams and business deployments, without needing any unsupported claim about beating &lt;strong&gt;OpenAI revenue&lt;/strong&gt; or surpassing OpenAI’s valuation.&lt;/p&gt;

&lt;p&gt;This is a common category error in AI coverage. People see strong enterprise adoption, a credible product, and a clear safety brand, then translate that into assumptions about top-line revenue or private-market value. But those are separate measurements, and neither company’s full numbers are public in a way that lets this comparison be made cleanly from the cited sources.&lt;/p&gt;

&lt;p&gt;A better way to read the chessboard is this: OpenAI has the stronger consumer brand because ChatGPT is the public face of generative AI; Anthropic has built a more overt enterprise-first narrative through Claude Enterprise and safety-centered positioning. Both can be true at once.&lt;/p&gt;

&lt;p&gt;That also changes how to read future reporting on &lt;strong&gt;AI company revenue&lt;/strong&gt;. If a headline treats consumer mindshare as proof of enterprise dominance, or treats enterprise credibility as proof of overall revenue leadership, it is probably compressing too much into one metric. Our earlier coverage of &lt;a href="https://novaknown.com/2026/04/30/openai-revenue-2026-2/" rel="noopener noreferrer"&gt;OpenAI revenue 2026&lt;/a&gt; is useful here precisely because revenue stories need source discipline, not vibes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Anthropic’s verified public positioning is &lt;strong&gt;enterprise-first&lt;/strong&gt;, centered on Claude for Enterprise, secure company knowledge access, and business use cases.&lt;/li&gt;
&lt;li&gt;Reuters Institute reports that &lt;strong&gt;ChatGPT is by far the most widely recognised generative AI system&lt;/strong&gt;, giving OpenAI a stronger consumer brand.&lt;/li&gt;
&lt;li&gt;The available sources do &lt;strong&gt;not&lt;/strong&gt; support claims that Anthropic has overtaken OpenAI in valuation or revenue.&lt;/li&gt;
&lt;li&gt;Anthropic’s safety-heavy messaging appears tightly linked to its business model, especially for enterprise customers evaluating risk and trust.&lt;/li&gt;
&lt;li&gt;The real comparison is not a simple leaderboard: &lt;strong&gt;OpenAI revenue&lt;/strong&gt;, Anthropic’s enterprise motion, and ChatGPT brand recognition describe different kinds of strength.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://anthropic.com" rel="noopener noreferrer"&gt;Anthropic home&lt;/a&gt; — Anthropic’s homepage shows its safety framing, Claude releases, and overall company positioning.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.anthropic.com/enterprise" rel="noopener noreferrer"&gt;Claude for Enterprise&lt;/a&gt; — Anthropic’s enterprise page highlights business workflows, secure knowledge connections, and customer examples like Lyft.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://reutersinstitute.politics.ox.ac.uk/generative-ai-and-news-report-2025-how-people-think-about-ais-role-journalism-and-society" rel="noopener noreferrer"&gt;Reuters Institute generative AI and news report 2025&lt;/a&gt; — Includes the supported brand-recognition comparison showing ChatGPT’s public awareness lead.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://novaknown.com/2026/04/30/openai-revenue-2026-2/" rel="noopener noreferrer"&gt;OpenAI revenue 2026&lt;/a&gt; — NovaKnown’s earlier coverage of OpenAI’s revenue trajectory and what can actually be supported.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://novaknown.com/2026/03/30/claude-vs-chatgpt/" rel="noopener noreferrer"&gt;Claude vs ChatGPT&lt;/a&gt; — A product-level comparison that helps explain why the companies can feel closer in practice than in public mindshare.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The open question is whether Anthropic’s enterprise-first model will eventually produce a clearer public metric advantage, or simply a quieter, more durable one.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://novaknown.com/?p=2789" rel="noopener noreferrer"&gt;novaknown.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openai</category>
      <category>anthropic</category>
      <category>chatgpt</category>
      <category>claude</category>
    </item>
    <item>
      <title>11 Minutes, $1.73, and GPT-5.5 Cybersecurity Simulation</title>
      <dc:creator>Simon Paxton</dc:creator>
      <pubDate>Fri, 01 May 2026 21:37:56 +0000</pubDate>
      <link>https://forem.com/simon_paxton/11-minutes-173-and-gpt-55-cybersecurity-simulation-3n9</link>
      <guid>https://forem.com/simon_paxton/11-minutes-173-and-gpt-55-cybersecurity-simulation-3n9</guid>
      <description>&lt;p&gt;The UK AI Security Institute says &lt;strong&gt;GPT-5.5 cybersecurity simulation&lt;/strong&gt; results now look a lot less like a one-off milestone and a lot more like a repeatable frontier capability. In its latest evaluation, AISI found that an early checkpoint of OpenAI’s GPT-5.5 reached roughly the same level as Anthropic’s Mythos Preview on hard cyber tasks—and slightly beat it on one key benchmark.&lt;/p&gt;

&lt;p&gt;That matters because AISI was explicitly testing whether Mythos Preview’s earlier result was a weird outlier. Instead, a second model from a different developer now lands in the same range, including solving a difficult multi-step cyber attack simulation end-to-end in some attempts. If you’ve been tracking rising &lt;a href="https://novaknown.com/2026/04/14/ai-cyber-capabilities/" rel="noopener noreferrer"&gt;AI cyber capabilities&lt;/a&gt;, this is the part worth circling.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPT-5.5 Cybersecurity Simulation Is No Longer a One-Model Fluke
&lt;/h2&gt;

&lt;p&gt;AISI’s headline finding is simple: &lt;strong&gt;GPT-5.5 reached a similar cyber capability level to Mythos Preview&lt;/strong&gt;. That is the interesting result.&lt;/p&gt;

&lt;p&gt;Back in April, AISI said Mythos Preview was the first frontier model it had seen complete its corporate network attack simulation end-to-end, a multi-step exercise it estimates would take a human expert around &lt;strong&gt;20 hours&lt;/strong&gt;. The obvious follow-up was whether that was a breakthrough tied to one model family.&lt;/p&gt;

&lt;p&gt;AISI’s answer is now: probably not. GPT-5.5, from a different lab, hit a comparable level and achieved a &lt;strong&gt;slightly higher average pass rate&lt;/strong&gt; than Mythos Preview on expert tasks.&lt;/p&gt;

&lt;p&gt;That shift changes the interpretation. A surprising benchmark win can be a stunt. Two frontier models from different developers hitting about the same bar starts to look like a capability class.&lt;/p&gt;

&lt;h2&gt;
  
  
  How GPT-5.5 Performed Across AISI's Cyber Task Suite
&lt;/h2&gt;

&lt;p&gt;AISI’s testbed is broader than a single dramatic demo. It uses a suite of &lt;strong&gt;95 narrow cyber tasks&lt;/strong&gt; across four difficulty tiers, built in capture-the-flag format—structured challenges where the model has to actually recover a “flag” by solving the task.&lt;/p&gt;

&lt;p&gt;Those tasks cover things like &lt;strong&gt;reverse engineering, web exploitation, and cryptography&lt;/strong&gt;. The easier tasks are already saturated by frontier models, so the interesting comparison is in the advanced suite.&lt;/p&gt;

&lt;p&gt;On &lt;strong&gt;Expert-level&lt;/strong&gt; tasks, AISI reports these average pass rates:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Expert task pass rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;71.4% ± 8.0%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mythos Preview&lt;/td&gt;
&lt;td&gt;68.6% ± 8.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4&lt;/td&gt;
&lt;td&gt;52.4% ± 9.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Opus 4.7&lt;/td&gt;
&lt;td&gt;48.6% ± 10.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That is a real jump over earlier OpenAI and Anthropic frontier models. GPT-5.5 is not edging forward from 68% to 71% in a vacuum; it is sitting well above GPT-5.4 and Opus 4.7 on the hardest tier AISI reports.&lt;/p&gt;
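
&lt;p&gt;One way to keep that comparison honest is to read the pass rates as intervals rather than a ranking. Using only the numbers AISI reports, the two models’ ranges overlap by a wide margin, which fits AISI’s own “similar level” framing better than a strict leaderboard reading.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Reading AISI's expert-tier pass rates as intervals (mean plus/minus the reported spread).
gpt_55 = (71.4 - 8.0, 71.4 + 8.0)    # roughly (63.4, 79.4)
mythos = (68.6 - 8.7, 68.6 + 8.7)    # roughly (59.9, 77.3)

overlap = max(0.0, min(gpt_55[1], mythos[1]) - max(gpt_55[0], mythos[0]))
print(round(overlap, 1))             # about 13.9 points of overlap vs a 2.8-point gap in means
&lt;/code&gt;&lt;/pre&gt;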

&lt;p&gt;The advanced tasks themselves are also nasty in exactly the way you’d want for this kind of evaluation. AISI says they include reversing stripped binaries and embedded firmware without source code, building reliable exploits for memory corruption bugs, recovering keys from weak crypto implementations, winning TOCTOU races, unpacking obfuscated malware, and weaponizing synthetic vulnerabilities planted in real open-source software.&lt;/p&gt;

&lt;p&gt;One example AISI highlights is a reverse-engineering challenge built around a &lt;strong&gt;stripped Rust ELF implementing a custom virtual machine&lt;/strong&gt;, plus a second unknown-format file containing bytecode for that VM. That is not “write a phishing email.” It is the kind of task where benchmark scores start to tell you something about actual technical depth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Minutes Matter: The Human-versus-Model Time Gap
&lt;/h2&gt;

&lt;p&gt;AISI says GPT-5.5 solved a difficult cyber task in &lt;strong&gt;under 11 minutes&lt;/strong&gt;. The same full-chain simulation is estimated to take a human expert about &lt;strong&gt;20 hours&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The raw comparison is startling, but it needs one clarification: this does &lt;strong&gt;not&lt;/strong&gt; mean GPT-5.5 is a drop-in replacement for a human red teamer. The benchmark is measuring performance on a controlled task suite, not whether you can hand the model a production network and expect clean autonomous operation.&lt;/p&gt;

&lt;p&gt;Still, the time gap matters for two reasons.&lt;/p&gt;

&lt;p&gt;First, it changes what becomes cheap to try. A model that can take repeated shots at a hard multi-step task in minutes is operating in a very different regime from a human expert who needs most of a day. Even partial success becomes more operationally interesting when attempts are fast.&lt;/p&gt;

&lt;p&gt;Second, AISI says the run cost was &lt;strong&gt;$1.73&lt;/strong&gt;. That is a tiny price for a benchmark result at this level. If frontier models can attempt advanced cyber tasks quickly and cheaply, scaling the number of runs stops being the bottleneck.&lt;/p&gt;

&lt;p&gt;That cost number is easy to miss, but it is one of the most important lines in the evaluation. High-end cyber capability is one thing. High-end cyber capability at commodity-run pricing is another.&lt;/p&gt;
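
&lt;p&gt;The arithmetic behind that point is short. The per-run cost is AISI’s reported figure; the per-attempt success rate below is an assumed illustration, not a measured number.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Cost and odds of repeated attempts at the same task.
cost_per_run = 1.73                      # reported by AISI for the highlighted run
p_success = 0.2                          # assumed chance a single attempt solves the task

for n in (1, 10, 50):
    p_any = 1 - (1 - p_success) ** n     # chance at least one of n attempts succeeds
    print(n, round(n * cost_per_run, 2), round(p_any, 3))
# prints: 1 1.73 0.2, then 10 17.3 0.893, then 50 86.5 1.0
&lt;/code&gt;&lt;/pre&gt;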

&lt;p&gt;This is also why model autonomy research keeps spilling into security. Once you combine strong task performance with low per-run cost and agentic iteration, you get the same pattern people worry about in things like &lt;a href="https://novaknown.com/2026/04/08/agentic-sandbox-escape/" rel="noopener noreferrer"&gt;agentic sandbox escape&lt;/a&gt;: more attempts, more persistence, and less friction.&lt;/p&gt;

&lt;h2&gt;
  
  
  What GPT-5.5 Actually Changes for Cyber Evaluation
&lt;/h2&gt;

&lt;p&gt;The cleanest update is that cyber evals now need to assume &lt;strong&gt;multiple&lt;/strong&gt; labs can produce models at this level. GPT-5.5’s result means benchmark designers can no longer treat top-tier cyber performance as a lab-specific anomaly.&lt;/p&gt;

&lt;p&gt;That pushes evaluation in two directions.&lt;/p&gt;

&lt;p&gt;One is &lt;strong&gt;harder, more realistic tasks&lt;/strong&gt;. AISI notes that basic tasks have been saturated since at least February 2026. When models max out easier CTF-style challenges, the useful signal moves to practitioner and expert tasks with larger search spaces and more steps.&lt;/p&gt;

&lt;p&gt;The other is &lt;strong&gt;more careful interpretation&lt;/strong&gt;. Stronger benchmark performance does not automatically prove deployable defensive capability. A model passing expert CTF cybersecurity tasks can still fail in messy real environments full of unreliable tooling, access constraints, and adversarial inputs.&lt;/p&gt;

&lt;p&gt;We’ve already seen how brittle agentic systems can be when the environment fights back—whether through deliberate attacks like &lt;a href="https://novaknown.com/2026/03/19/prompt-injection-peer-review/" rel="noopener noreferrer"&gt;prompt injection in peer review&lt;/a&gt; or through the ordinary chaos of multi-step tooling. So the right reading of the GPT-5.5 cybersecurity simulation result is not “AI can now do cybersecurity.” It is narrower and, in some ways, more significant: frontier models are now repeatedly reaching expert benchmark territory on serious cyber tasks.&lt;/p&gt;

&lt;p&gt;That is enough to force a change in how these systems are tested, gated, and compared.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AISI found GPT-5.5 reached a similar level to Mythos Preview&lt;/strong&gt;, suggesting frontier cyber performance is no longer a one-model fluke.&lt;/li&gt;
&lt;li&gt;On &lt;strong&gt;Expert-level&lt;/strong&gt; tasks in AISI’s advanced cyber suite, GPT-5.5 scored &lt;strong&gt;71.4%&lt;/strong&gt;, ahead of Mythos Preview at &lt;strong&gt;68.6%&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;AISI says GPT-5.5 solved a difficult multi-step cyber task in &lt;strong&gt;under 11 minutes&lt;/strong&gt;, while the full chain is estimated to take a human expert around &lt;strong&gt;20 hours&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The reported run cost was &lt;strong&gt;$1.73&lt;/strong&gt;, which makes repeated attempts at advanced cyber tasks unusually cheap.&lt;/li&gt;
&lt;li&gt;The result shows &lt;strong&gt;stronger benchmark performance&lt;/strong&gt;, not proof of broadly deployable real-world defensive capability.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5-5-cyber-capabilities" rel="noopener noreferrer"&gt;Our evaluation of OpenAI's GPT-5.5 cyber capabilities | AISI Work&lt;/a&gt; — Primary source on GPT-5.5’s pass rates, timing, task design, and comparison with Mythos Preview.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://novaknown.com/2026/04/14/ai-cyber-capabilities/" rel="noopener noreferrer"&gt;AI cyber capabilities&lt;/a&gt; — NovaKnown’s earlier coverage of how frontier models are climbing cyber benchmarks.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://novaknown.com/2026/04/08/agentic-sandbox-escape/" rel="noopener noreferrer"&gt;agentic sandbox escape&lt;/a&gt; — Why fast, cheap autonomous retries matter once models can act across multiple steps.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://novaknown.com/2026/03/19/prompt-injection-peer-review/" rel="noopener noreferrer"&gt;prompt injection in peer review&lt;/a&gt; — A useful parallel case for how capable agents still break in hostile or messy environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The open question now is how long today’s “expert” cyber benchmarks stay discriminating once more labs can train to the same level.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://novaknown.com/?p=2781" rel="noopener noreferrer"&gt;novaknown.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openai</category>
      <category>anthropic</category>
      <category>gpt55</category>
      <category>aisi</category>
    </item>
    <item>
      <title>DeepSeek Forces Visual Reasoning Through Points and Boxes</title>
      <dc:creator>Simon Paxton</dc:creator>
      <pubDate>Fri, 01 May 2026 04:40:28 +0000</pubDate>
      <link>https://forem.com/simon_paxton/deepseek-forces-visual-reasoning-through-points-and-boxes-469d</link>
      <guid>https://forem.com/simon_paxton/deepseek-forces-visual-reasoning-through-points-and-boxes-469d</guid>
      <description>&lt;p&gt;DeepSeek has released an open-source &lt;strong&gt;visual reasoning&lt;/strong&gt; framework called &lt;strong&gt;Thinking with Visual Primitives&lt;/strong&gt;. According to 36Kr, the system changes how a multimodal model is asked to reason: instead of describing an image in loose language, it has to work through explicit visual units like &lt;strong&gt;point coordinates&lt;/strong&gt; and &lt;strong&gt;bounding boxes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That is a much more concrete bet than “better multimodal understanding.” It pushes reasoning closer to measurement. When a model says “the object is near the left side,” language can blur the geometry; when it has to point to coordinates or mark a box, the error has less room to hide.&lt;/p&gt;

&lt;h2&gt;
  
  
  What DeepSeek’s Thinking with Visual Primitives Actually Changes
&lt;/h2&gt;

&lt;p&gt;The release is a framework, not just a vague claim of improved perception. 36Kr reports that DeepSeek unveiled &lt;strong&gt;Thinking with Visual Primitives&lt;/strong&gt; as a multimodal model and technical report, and released it as &lt;strong&gt;open source&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The interesting part is the representation layer. The model is not only generating words about what it sees. It is being made to reason through visual primitives — basic spatial elements such as points and boxes.&lt;/p&gt;

&lt;p&gt;That sounds small, but it changes the failure mode. A lot of &lt;strong&gt;visual reasoning&lt;/strong&gt; errors happen when the model jumps too quickly from pixels to prose. It can produce a fluent sentence that sounds right while quietly dropping the actual layout of the scene.&lt;/p&gt;

&lt;p&gt;With visual primitives, the model has to show more of its work. If a task depends on location, size, or relative position, a coordinate is harder to fudge than a sentence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Visual Primitives Beat Vague Descriptions for Spatial Tasks
&lt;/h2&gt;

&lt;p&gt;The core claim 36Kr makes is specific: the framework improves &lt;strong&gt;spatial reasoning&lt;/strong&gt; by requiring precise visual data points instead of vague natural-language descriptions. In practice, that means the model has to anchor its reasoning in things that can be checked.&lt;/p&gt;

&lt;p&gt;Take a simple spatial task. “Which object is closest to the top-right corner?” A language-first system might narrate the scene and guess based on a rough impression. A primitive-based system can mark candidate objects with &lt;strong&gt;bounding boxes&lt;/strong&gt;, compare positions, and reason from those coordinates.&lt;/p&gt;

&lt;p&gt;Or imagine “point to the handle of the mug.” The phrase “the handle is on the side” is descriptive, but it is not an answer you can directly score. A point coordinate is.&lt;/p&gt;
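
&lt;p&gt;That is also what makes primitive-based answers easy to grade automatically. The sketch below is generic evaluation logic for points and axis-aligned boxes, not code from DeepSeek’s release.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Generic scoring for primitive-based answers: a predicted point either lands in
# the ground-truth box or it does not, and two boxes have a measurable overlap.
# Illustrative only; not taken from DeepSeek's released framework.

def point_in_box(px, py, box):
    x1, y1, x2, y2 = box
    cx = min(max(px, x1), x2)          # clamp the point to the box
    cy = min(max(py, y1), y2)
    return (cx, cy) == (px, py)        # unchanged by clamping means it was inside

def iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union

print(point_in_box(0.62, 0.41, (0.55, 0.30, 0.80, 0.60)))         # True: a scorable hit
print(round(iou((0.1, 0.1, 0.5, 0.5), (0.3, 0.3, 0.7, 0.7)), 3))  # 0.143: partial overlap
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A sentence like “the handle is on the side” has no equivalent check; a coordinate or box can be compared against ground truth in one line.&lt;/p&gt;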

&lt;p&gt;That distinction matters because language is compressive. It throws away detail on purpose. Humans do this constantly and get away with it because we share context. Models often do not. They replace measurement with summary, and summary is where a lot of hallucination-like behavior starts.&lt;/p&gt;

&lt;p&gt;This is the same broad instinct behind work to &lt;a href="https://novaknown.com/2026/04/07/reduce-llm-hallucinations/" rel="noopener noreferrer"&gt;reduce LLM hallucinations&lt;/a&gt;: force the system to stay attached to observable structure for as long as possible. Here, the observable structure is spatial.&lt;/p&gt;

&lt;p&gt;There is a nice symmetry with research on &lt;a href="https://novaknown.com/2026/04/19/zero-shot-world-models/" rel="noopener noreferrer"&gt;zero-shot world models&lt;/a&gt;, too. If you want a model to reason about a scene, you need a representation that preserves the scene. Text alone often smooths over exactly the information you care about.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Open Source Matters for Visual Reasoning
&lt;/h2&gt;

&lt;p&gt;DeepSeek released Thinking with Visual Primitives as &lt;strong&gt;open source&lt;/strong&gt;, according to 36Kr. For this kind of work, that matters more than usual.&lt;/p&gt;

&lt;p&gt;A lot of multimodal claims are hard to inspect. You get a demo, a benchmark headline, maybe a polished sample image — but not the machinery that tells you what changed. Open-sourcing a &lt;strong&gt;visual reasoning&lt;/strong&gt; framework gives researchers and builders something much more useful: a way to test whether the representation itself is doing the work.&lt;/p&gt;

&lt;p&gt;That opens up a few concrete paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Researchers can compare primitive-based reasoning against language-only baselines.&lt;/li&gt;
&lt;li&gt;Builders can inspect where coordinate constraints help and where they add overhead.&lt;/li&gt;
&lt;li&gt;The community can try alternative visual primitives, scoring methods, or training recipes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where &lt;a href="https://novaknown.com/2026/04/12/open-source-ai-revenue/" rel="noopener noreferrer"&gt;open-source AI&lt;/a&gt; keeps getting more interesting. Once the representation layer is visible, progress is not limited to whoever owns the API. Other teams can copy, modify, and pressure-test the idea directly.&lt;/p&gt;

&lt;p&gt;And this specific idea is worth pressure-testing. If multimodal models keep getting larger without getting more grounded, they will keep producing confident spatial mistakes. A framework built around points and boxes is a direct attempt to fix that at the level where the mistake starts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the DeepSeek-PKU-Tsinghua Collaboration Signals
&lt;/h2&gt;

&lt;p&gt;36Kr says the work was developed in collaboration with &lt;strong&gt;Peking University&lt;/strong&gt; and &lt;strong&gt;Tsinghua University&lt;/strong&gt;. That does not just add prestige. It suggests a research direction.&lt;/p&gt;

&lt;p&gt;This looks like an effort to treat multimodal reasoning as a representation problem, not only a scale problem. Bigger models can help, but there is a different question underneath: &lt;em&gt;what internal units should a model think with when the task is visual and spatial?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;DeepSeek’s answer here is unusually explicit. Use primitives that map back to the image. Make reasoning legible in geometric terms. Reduce the amount of hidden translation from scene to text.&lt;/p&gt;

&lt;p&gt;That is a strong signal because it points away from the “just let the model narrate more” approach. If that collaboration keeps producing work in this vein, expect more systems that mix symbolic-looking structure with end-to-end multimodal learning.&lt;/p&gt;

&lt;p&gt;It is also a useful contrast with a lot of current multimodal product rhetoric. Plenty of systems claim to “understand images.” Far fewer specify the units of that understanding. DeepSeek, PKU, and Tsinghua are at least making a falsifiable bet: that &lt;strong&gt;visual primitives&lt;/strong&gt; are a better substrate for some reasoning tasks than free-form language.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;DeepSeek released an open-source &lt;strong&gt;visual reasoning&lt;/strong&gt; framework called &lt;strong&gt;Thinking with Visual Primitives&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The framework pushes a multimodal model to reason with &lt;strong&gt;point coordinates&lt;/strong&gt; and &lt;strong&gt;bounding boxes&lt;/strong&gt;, not just textual descriptions.&lt;/li&gt;
&lt;li&gt;That matters most for &lt;strong&gt;spatial reasoning&lt;/strong&gt;, where language often blurs position, size, and relative layout.&lt;/li&gt;
&lt;li&gt;The open-source release lets researchers test whether grounded representations improve multimodal performance in practice.&lt;/li&gt;
&lt;li&gt;The collaboration with &lt;strong&gt;Peking University&lt;/strong&gt; and &lt;strong&gt;Tsinghua University&lt;/strong&gt; signals serious interest in changing the representation layer, not just scaling model size.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://eu.36kr.com/en/p/3789208597372165" rel="noopener noreferrer"&gt;36Kr: DeepSeek releases Thinking with Visual Primitives&lt;/a&gt; — The main report on the framework, its open-source release, and the collaboration with Peking University and Tsinghua University.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://deepseek.com" rel="noopener noreferrer"&gt;DeepSeek&lt;/a&gt; — DeepSeek’s primary site, covering the company behind the release.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The open question is whether forcing models to think in points and boxes will stay useful as tasks get more abstract — or whether grounded primitives will turn out to be the missing layer multimodal systems needed all along.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://novaknown.com/?p=2777" rel="noopener noreferrer"&gt;novaknown.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>deepseek</category>
      <category>pekinguniversity</category>
      <category>tsinghuauniversity</category>
      <category>opensourceai</category>
    </item>
    <item>
      <title>The Research Map is Already Live, but the Methods Aren’t: Semantic Map Tool</title>
      <dc:creator>Simon Paxton</dc:creator>
      <pubDate>Thu, 30 Apr 2026 04:31:55 +0000</pubDate>
      <link>https://forem.com/simon_paxton/the-research-map-is-already-live-but-the-methods-arent-semantic-map-tool-4306</link>
      <guid>https://forem.com/simon_paxton/the-research-map-is-already-live-but-the-methods-arent-semantic-map-tool-4306</guid>
      <description>&lt;p&gt;The Global Research Space, a new &lt;strong&gt;semantic map tool&lt;/strong&gt;, is live now as a browser-based alpha that lets people explore 10 million research papers as if they were moving across a map. The public site is up at globalresearchspace.com, and the map view is currently labeled &lt;strong&gt;v0.2.0 alpha&lt;/strong&gt;, with a pan-and-zoom canvas showing floating topic labels spread across clustered regions.&lt;/p&gt;

&lt;p&gt;The confirmed record is unusually split across two places. The site itself shows a working product and the map interface; the methodology mostly comes from an April 30 Reddit launch post by the creator, who said the system uses the latest 10 million papers from OpenAlex and turns them into “semantic neighborhoods” for browsing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What The Global Research Space Actually Is
&lt;/h2&gt;

&lt;p&gt;The homepage describes The Global Research Space in one sentence: &lt;strong&gt;“Explore the landscape of the latest research.”&lt;/strong&gt; Click through to the map and you get a large pan-and-zoom interface with floating topic labels and clustered regions, not a normal search-results page.&lt;/p&gt;

&lt;p&gt;That difference matters. A standard paper search tool starts with a query box and returns a ranked list. This &lt;strong&gt;semantic map tool&lt;/strong&gt; starts with position: papers and topics appear to be arranged near related work, so browsing means moving through adjacent areas rather than reformulating keywords over and over.&lt;/p&gt;

&lt;p&gt;OpenAlex is a plausible substrate for this. OpenAlex says it is a &lt;strong&gt;“map of the world’s research network”&lt;/strong&gt; and links works, authors, institutions, journals, topics, and more, with hourly updates. Its 2026 roadmap says the database now contains &lt;strong&gt;477 million works&lt;/strong&gt;, which makes a 10 million-paper slice both substantial and clearly a subset.&lt;/p&gt;

&lt;p&gt;On the live map, the directly visible product state is simple: a full-screen map surface, labeled regions, a search-oriented interface around the canvas, and the alpha version badge. That is enough to verify the core claim that this is a working &lt;strong&gt;research paper map&lt;/strong&gt;, not just a mockup or concept page.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the semantic map tool is built
&lt;/h2&gt;

&lt;p&gt;The current pipeline description is &lt;strong&gt;creator-reported&lt;/strong&gt;, not documented on an official methods page. In the April 30 Reddit launch post, the creator said the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sourced the &lt;strong&gt;latest 10 million papers from OpenAlex&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;generated embeddings using &lt;strong&gt;SPECTER 2&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;used &lt;strong&gt;titles and abstracts&lt;/strong&gt; as input&lt;/li&gt;
&lt;li&gt;reduced dimensionality with &lt;strong&gt;UMAP&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;applied &lt;strong&gt;Voronoi partitioning on density peaks&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;generated floating labels with a custom labeling pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a pretty specific recipe. It is also mostly single-sourced.&lt;/p&gt;
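
&lt;p&gt;To see what that recipe implies mechanically, here is a minimal sketch of the reported stages. It is an illustration under assumptions, not the product’s code: the checkpoint name, pooling choice, grid size, and peak heuristic are guesses, and the real pipeline runs over millions of OpenAlex records rather than mock embeddings.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of the creator-reported pipeline: SPECTER 2 embeddings of title+abstract,
# UMAP to 2D, then Voronoi cells around density peaks. Details are assumptions.
import numpy as np
import torch
import umap
from transformers import AutoTokenizer, AutoModel
from scipy.spatial import Voronoi

tok = AutoTokenizer.from_pretrained("allenai/specter2_base")
enc = AutoModel.from_pretrained("allenai/specter2_base")

def embed(papers):
    """SPECTER-style embedding of title [SEP] abstract (the creator-reported input)."""
    texts = [p["title"] + tok.sep_token + p.get("abstract", "") for p in papers]
    batch = tok(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = enc(**batch).last_hidden_state[:, 0, :]  # CLS-token pooling
    return out.numpy()

print(embed([{"title": "UMAP: Uniform Manifold Approximation and Projection",
              "abstract": "A general-purpose dimension reduction technique."}]).shape)

# In practice this step would cover ~10M OpenAlex records; mock vectors stand in here.
emb = np.random.default_rng(0).normal(size=(2000, 768)).astype("float32")
coords = umap.UMAP(n_components=2, metric="cosine", random_state=0).fit_transform(emb)

# One plausible reading of "Voronoi partitioning on density peaks": bin the 2D map,
# treat high-density bins as peaks, and carve the plane into cells around them.
hist, xe, ye = np.histogram2d(coords[:, 0], coords[:, 1], bins=50)
peak_bins = np.argwhere(hist &gt;= np.percentile(hist[hist &gt; 0], 95))
peaks = np.column_stack([
    (xe[peak_bins[:, 0]] + xe[peak_bins[:, 0] + 1]) / 2,
    (ye[peak_bins[:, 1]] + ye[peak_bins[:, 1] + 1]) / 2,
])
regions = Voronoi(peaks)  # each cell is a candidate "semantic neighborhood"
print(len(peaks), "candidate neighborhoods")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Even a sketch like this shows why the undocumented choices matter: the neighborhoods depend heavily on the peak heuristic, the bin size, and the UMAP settings, which is exactly the kind of detail an official methods page would need to pin down.&lt;/p&gt;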

&lt;p&gt;Here is the current evidence split for the &lt;strong&gt;semantic map tool&lt;/strong&gt; pipeline:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Claim&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;The Global Research Space exists and is publicly accessible&lt;/td&gt;
&lt;td&gt;Confirmed&lt;/td&gt;
&lt;td&gt;Product homepage and map page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The map page is labeled v0.2.0 alpha&lt;/td&gt;
&lt;td&gt;Confirmed&lt;/td&gt;
&lt;td&gt;Live map page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;It uses the latest 10M papers from OpenAlex&lt;/td&gt;
&lt;td&gt;Creator-reported&lt;/td&gt;
&lt;td&gt;Reddit launch post&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;It uses SPECTER 2 on titles and abstracts&lt;/td&gt;
&lt;td&gt;Creator-reported&lt;/td&gt;
&lt;td&gt;Reddit launch post&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;It uses UMAP&lt;/td&gt;
&lt;td&gt;Creator-reported&lt;/td&gt;
&lt;td&gt;Reddit launch post&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;It uses Voronoi partitioning on density peaks&lt;/td&gt;
&lt;td&gt;Creator-reported&lt;/td&gt;
&lt;td&gt;Reddit launch post&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Labels are custom and still a work in progress&lt;/td&gt;
&lt;td&gt;Creator-reported&lt;/td&gt;
&lt;td&gt;Reddit launch post&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code is not open source&lt;/td&gt;
&lt;td&gt;Creator-reported&lt;/td&gt;
&lt;td&gt;Reddit comments&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;There is one more useful signal in that launch post: the creator engaged directly with questions about clustering choices. In one exchange, they said they had not considered HDBSCAN and might explore a hybrid. That does not invalidate the current method. It does show the pipeline is still being worked out in public, which fits the alpha label on the site.&lt;/p&gt;

&lt;h2&gt;
  
  
  What users get from semantic search and analytics
&lt;/h2&gt;

&lt;p&gt;Two things are visible from the live product state and public copy. First, the interface is built around map navigation rather than a flat results page. Second, the creator says the product supports &lt;strong&gt;keyword and semantic queries&lt;/strong&gt; plus analytics for institutions, authors, and topics.&lt;/p&gt;

&lt;p&gt;The first part is confirmed by direct observation. The second part is currently creator-reported and only lightly documented on the accessible public pages.&lt;/p&gt;

&lt;p&gt;That distinction matters. We can verify that users are being invited to navigate a spatial interface for &lt;strong&gt;scientific literature navigation&lt;/strong&gt;. We cannot yet verify, from public methods documentation, exactly how the analytics are calculated or how semantic retrieval quality compares with a standard search engine.&lt;/p&gt;

&lt;p&gt;Implication, with caveat: this kind of &lt;strong&gt;semantic paper map&lt;/strong&gt; is most useful when a researcher knows the area loosely but not the exact keywords. In fast-moving domains, terminology drifts. A spatial interface can help surface nearby work that uses different language for similar ideas. That is a plausible benefit, and it matches the interaction design, but the public site does not yet provide benchmarks showing how often it beats conventional search. For related context on evaluating ML papers in messy, fast-changing domains, see our piece on &lt;a href="https://novaknown.com/2026/04/09/empirical-research-in-machine-learning/" rel="noopener noreferrer"&gt;empirical research in machine learning&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Query Box to Spatial Browsing
&lt;/h2&gt;

&lt;p&gt;What this launch demonstrates, clearly and directly, is that a live product can turn literature discovery into movement across a terrain. That is the concrete shift here.&lt;/p&gt;

&lt;p&gt;The broader pattern did not start with this project. Research mapping interfaces have shown up before in forms like the ArXiv Machine Learning Landscape and other topic-atlas style explorers. What makes The Global Research Space interesting is the combination of scale, public access, and upstream infrastructure: an &lt;strong&gt;OpenAlex map&lt;/strong&gt;-style scholarly graph underneath, then an interface that treats papers as neighborhoods instead of list items.&lt;/p&gt;

&lt;p&gt;That is a meaningful product decision. Search boxes are good when users know the term they want. Spatial browsing is better for orientation, adjacency, and “what sits next to this?” exploration. If you want a broader framing for why these systems feel different once the underlying graph is rich enough, our piece on &lt;a href="https://novaknown.com/2026/02/04/what-is-gardening-in-cryptography-and-how-does-it-work/" rel="noopener noreferrer"&gt;how discovery systems work&lt;/a&gt; covers that pattern from another angle.&lt;/p&gt;

&lt;p&gt;There is also a very current AI-tooling dynamic here: the interface is shipping before the methods are properly documented. Users can test whether the neighborhoods feel sensible right now. Outsiders still cannot fully audit how those neighborhoods were produced. That split is not unusual anymore, but it is especially important for a &lt;strong&gt;paper discovery tool&lt;/strong&gt; that may influence what researchers read and miss.&lt;/p&gt;

&lt;h2&gt;
  
  
  Documentation Gaps in the Alpha Release
&lt;/h2&gt;

&lt;p&gt;The biggest missing piece is an official methods page. The live product does not currently provide a detailed public explanation of corpus selection, refresh cadence, embedding infrastructure, clustering evaluation, label generation, or ranking methodology for institutions and authors.&lt;/p&gt;

&lt;p&gt;Several important questions are still open:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What counts as the “latest” 10 million papers?&lt;/strong&gt; No publication-date cutoff was publicly documented in the accessible pages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How often is the map refreshed?&lt;/strong&gt; OpenAlex updates hourly, but that does not mean this product does.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How are rankings calculated?&lt;/strong&gt; The site references analytics, but not the exact formulas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How good is the semantic retrieval?&lt;/strong&gt; No public benchmark compares it with standard academic search.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How stable are the neighborhoods and labels?&lt;/strong&gt; The creator described the labeling as a work in progress.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The alpha label on the product is doing real work here. So is the fact that the code is reportedly not open source.&lt;/p&gt;

&lt;p&gt;The confirmed picture is narrow but solid: the site is live, the map interface is public, and the alpha badge is visible. The creator-reported picture is more detailed: OpenAlex as source data, SPECTER 2 embeddings, UMAP, Voronoi-based partitioning, and custom labels. What still lacks public documentation is the evaluation layer — refresh timing, ranking formulas, retrieval quality, and evidence that the mapped neighborhoods are stable and useful across real research workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Global Research Space is a live alpha product&lt;/strong&gt; that lets users browse research as a pan-and-zoom map rather than a list of search results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The core pipeline details are mostly creator-reported&lt;/strong&gt;, not formally documented on the product site: OpenAlex source data, SPECTER 2 embeddings, UMAP, and Voronoi-based semantic neighborhoods.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAlex is a credible upstream corpus&lt;/strong&gt; with 477 million indexed works, making a 10 million-paper slice plausible but still partial.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The interface changes the discovery workflow&lt;/strong&gt; from keyword lookup to spatial exploration, which can be useful for surveying unfamiliar or fast-moving fields.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What is still missing is evaluation and documentation&lt;/strong&gt;: refresh cadence, ranking methods, retrieval quality, and clustering choices are not yet publicly specified in depth.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://globalresearchspace.com/space" rel="noopener noreferrer"&gt;The Global Research Space map&lt;/a&gt; — Live product page showing the alpha map interface and current product state.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://globalresearchspace.com/" rel="noopener noreferrer"&gt;The Global Research Space homepage&lt;/a&gt; — Canonical homepage for the product with its short public description.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://help.openalex.org/hc/en-us/articles/28932712154391-How-does-OpenAlex-work" rel="noopener noreferrer"&gt;OpenAlex help: How does OpenAlex work?&lt;/a&gt; — OpenAlex’s explanation of its research graph and update cadence.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.openalex.org/openalex-2026-roadmap/" rel="noopener noreferrer"&gt;OpenAlex 2026 roadmap&lt;/a&gt; — Scale context for the underlying corpus, including the 477 million works figure.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://novaknown.com/?p=2772" rel="noopener noreferrer"&gt;novaknown.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openalex</category>
      <category>specter</category>
      <category>umap</category>
      <category>reddit</category>
    </item>
    <item>
      <title>Compute Anxiety, Not Collapse: OpenAI Revenue 2026</title>
      <dc:creator>Simon Paxton</dc:creator>
      <pubDate>Wed, 29 Apr 2026 21:44:49 +0000</pubDate>
      <link>https://forem.com/simon_paxton/compute-anxiety-not-collapse-openai-revenue-2026-1di1</link>
      <guid>https://forem.com/simon_paxton/compute-anxiety-not-collapse-openai-revenue-2026-1di1</guid>
      <description>&lt;p&gt;OpenAI revenue 2026 is under a real pressure test. In the last 30 days, the dominant story has been a Reuters report, citing the Wall Street Journal, that OpenAI fell short of internal revenue and user targets while wrestling with the cost of future compute commitments.&lt;/p&gt;

&lt;p&gt;That has been easy to turn into a collapse narrative. The actual record is less dramatic and more interesting: OpenAI is missing some targets, rewriting key economics with Microsoft, and pushing customers through fast product and API changes at the same time. That is what a company under strain looks like when it is still trying to buy more paths to growth, not what visible free fall looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why OpenAI Is Under Pressure Now
&lt;/h2&gt;

&lt;p&gt;The pressure is simple: &lt;strong&gt;huge compute bills, very high growth expectations, and more competition in coding and enterprise AI&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Reuters, via the WSJ report, said OpenAI missed multiple monthly revenue targets earlier this year and also fell short of an internal goal of reaching &lt;strong&gt;1 billion weekly active users by the end of 2025&lt;/strong&gt;. The same report said CFO Sarah Friar had expressed concern internally about whether revenue growth would keep pace with future computing contracts.&lt;/p&gt;

&lt;p&gt;OpenAI leadership directly pushed back. Sam Altman and Friar told Reuters:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“This is ridiculous. We are totally aligned on buying as much compute as we can and working hard on it together every day.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That leaves the core picture intact even if you discount the strongest infighting angle. Reuters and the WSJ report a mismatch between OpenAI’s growth goals and its future compute commitments. If those are the internal benchmarks, missing them makes the economics tighter.&lt;/p&gt;

&lt;p&gt;That compute curve is expensive everywhere, not just at OpenAI. We’ve already seen the broader constraints in power gear, utility approvals, and datacenter buildouts in &lt;a href="https://novaknown.com/2026/04/19/ai-datacenter-spending/" rel="noopener noreferrer"&gt;AI datacenter spending&lt;/a&gt;. If compute supply is tight and demand is still climbing, revenue misses matter more because the infrastructure plan does not get cheaper.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Revenue Miss Reports Actually Say
&lt;/h2&gt;

&lt;p&gt;Reuters, citing the Wall Street Journal, reported four concrete things: &lt;strong&gt;OpenAI missed multiple monthly revenue targets in early 2026, missed an internal goal of 1 billion weekly active users by the end of 2025, saw Sarah Friar raise concern about whether growth would cover future compute contracts, and got a direct denial from Altman and Friar that they were split over buying that compute&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here’s the clean version:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Reported issue&lt;/th&gt;
&lt;th&gt;What was claimed&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Revenue targets&lt;/td&gt;
&lt;td&gt;OpenAI missed multiple monthly revenue targets earlier in 2026&lt;/td&gt;
&lt;td&gt;Reported by Reuters citing WSJ&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User targets&lt;/td&gt;
&lt;td&gt;OpenAI missed an internal goal of 1 billion weekly active users by end of 2025&lt;/td&gt;
&lt;td&gt;Reported by Reuters citing WSJ&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compute anxiety&lt;/td&gt;
&lt;td&gt;Sarah Friar reportedly raised concern about affording future compute contracts if growth lagged&lt;/td&gt;
&lt;td&gt;Reported by Reuters citing WSJ&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internal split&lt;/td&gt;
&lt;td&gt;Altman and Friar denied misalignment over compute buying&lt;/td&gt;
&lt;td&gt;Direct statement to Reuters&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That is serious. But it is also the kind of miss you get when a company sets targets that assume continued hypergrowth and then has to fund infrastructure to match.&lt;/p&gt;

&lt;p&gt;The missed-target story also sits awkwardly next to OpenAI’s own product output. In late April alone, OpenAI’s release log shows &lt;strong&gt;GPT-5.5&lt;/strong&gt;, &lt;strong&gt;workspace agents in ChatGPT&lt;/strong&gt;, &lt;strong&gt;ChatGPT Images 2.0&lt;/strong&gt;, &lt;strong&gt;Codex for (almost) everything&lt;/strong&gt;, &lt;strong&gt;Agents SDK updates&lt;/strong&gt;, and distribution expansion to AWS. Companies in obvious operational collapse do not usually ship like that.&lt;/p&gt;

&lt;p&gt;The better reading is that &lt;strong&gt;OpenAI revenue 2026&lt;/strong&gt; is running into the gap between internal expectations and the cost base required to chase them. That is a pressure test.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Microsoft Rewrite Changes the Stakes
&lt;/h2&gt;

&lt;p&gt;The Microsoft deal rewrite is the most important structural change in this story.&lt;/p&gt;

&lt;p&gt;Axios reported that the revised agreement gives Microsoft a &lt;strong&gt;non-exclusive license&lt;/strong&gt; to OpenAI technology, lets OpenAI &lt;strong&gt;sell models across multiple clouds&lt;/strong&gt;, &lt;strong&gt;caps Microsoft’s share of OpenAI revenue&lt;/strong&gt;, and removes the old &lt;strong&gt;AGI-trigger provision&lt;/strong&gt; that had been hanging over the partnership. Those are concrete changes, not mood.&lt;/p&gt;

&lt;p&gt;In compact form, the reported before-and-after looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;License structure:&lt;/strong&gt; Microsoft’s access is now reported as &lt;strong&gt;non-exclusive&lt;/strong&gt;, rather than functionally tied to a tighter exclusive relationship.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud distribution:&lt;/strong&gt; OpenAI can now sell through &lt;strong&gt;multiple clouds&lt;/strong&gt;, not just through the old Microsoft-centered route.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Revenue sharing:&lt;/strong&gt; Axios reported a &lt;strong&gt;cap on Microsoft’s share of OpenAI revenue&lt;/strong&gt;, which matters directly for margins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control clause:&lt;/strong&gt; The reported &lt;strong&gt;removal of the AGI-trigger provision&lt;/strong&gt; reduces one of the strangest contractual constraints in the industry.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On its face, that changes two things at once: &lt;strong&gt;distribution flexibility&lt;/strong&gt; and &lt;strong&gt;economic leakage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The bearish read is obvious. If you are under revenue pressure, getting out from under exclusivity and a richer revenue share looks like a move to recover margin and open more sales channels quickly.&lt;/p&gt;

&lt;p&gt;The stronger read is that this is exactly what OpenAI should have wanted anyway. If your product is becoming core infrastructure, being trapped in one cloud is a tax on growth. OpenAI’s own April 28 release — “OpenAI models, Codex, and Managed Agents come to AWS” — shows how fast that new freedom can be used.&lt;/p&gt;

&lt;p&gt;That matters for &lt;strong&gt;OpenAI revenue 2026&lt;/strong&gt; because every point of revenue no longer forced through the old Microsoft structure has better odds of sticking. It also matters for the competitive picture. If customers want multi-cloud procurement, sovereign hosting options, or simply leverage in vendor negotiations, OpenAI is now in a better position to meet them.&lt;/p&gt;

&lt;p&gt;There is a wider pattern here too. As discussed in &lt;a href="https://novaknown.com/2026/04/12/open-source-ai-revenue/" rel="noopener noreferrer"&gt;open-source AI revenue&lt;/a&gt;, the economics are moving away from simple model access and toward distribution, integration, and control over where workloads run. The Microsoft rewrite is OpenAI adjusting to that reality in public.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Developers and Customers Feel on the Ground
&lt;/h2&gt;

&lt;p&gt;The cleanest practitioner-level signal is not bankruptcy gossip. It is &lt;strong&gt;platform churn&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;OpenAI’s developer documentation says the &lt;strong&gt;Responses API represents the future direction&lt;/strong&gt; for building agents. It also says the &lt;strong&gt;Assistants API was deprecated on August 26, 2025, with a sunset date of August 26, 2026&lt;/strong&gt;. For developers, that means migration work, changed abstractions, and another round of updating tooling around the platform’s preferred architecture.&lt;/p&gt;
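
&lt;p&gt;For a sense of the surface teams are migrating toward, here is a minimal Responses API call. It is a sketch, not a migration guide, and the model identifier below is taken from the release names above rather than from a verified model list.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal Responses API call (Python SDK). The model string is an assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.responses.create(
    model="gpt-5.5",
    input="Summarize the open migration tasks for our support assistant.",
)
print(resp.output_text)
&lt;/code&gt;&lt;/pre&gt;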

&lt;p&gt;That kind of churn is expensive in a very boring way. Teams have prompts, evals, agent logic, tool integrations, and monitoring built around one API surface. When the center of gravity moves, they have to move too.&lt;/p&gt;

&lt;p&gt;Consumer-facing model turnover adds another layer. OpenAI help materials say &lt;strong&gt;GPT-4o and additional models were deprecated in ChatGPT on February 13, 2026&lt;/strong&gt;, while remaining available in the API. So end users can see abrupt product changes even when developers still have access underneath.&lt;/p&gt;

&lt;p&gt;Short version: customers do not experience &lt;strong&gt;OpenAI revenue 2026&lt;/strong&gt; as a finance chart. They experience it as model retirements, new defaults, migration deadlines, and the need to retest workflows after every major release. The migration guide and deprecation notices make the cost transfer visible: when OpenAI changes the platform surface, customers absorb the integration and testing work.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenAI’s Harder Growth Phase
&lt;/h2&gt;

&lt;p&gt;The evidence points to a harder growth phase.&lt;/p&gt;

&lt;p&gt;OpenAI’s own posture has shifted toward public justification of flexibility. In “Our principles,” published April 26, Altman said OpenAI would be transparent about when its operating principles change and emphasized iterative deployment in the face of uncertainty. That is a company preparing users, partners, and regulators for more course corrections.&lt;/p&gt;

&lt;p&gt;At the same time, it is still expanding. Product cadence stayed dense through late April. OpenAI pushed into AWS, government sales through FedRAMP, and cybersecurity positioning. Those are not the actions of a firm openly retrenching.&lt;/p&gt;

&lt;p&gt;The tension is the story. OpenAI missed at least some internal targets, according to Reuters’ account of the WSJ report. It is carrying massive compute ambition into a market where infrastructure is constrained, pricing pressure is real, and rivals like Anthropic and Google are taking slices of demand. So it is doing what pressured growth companies do: rewriting partnerships, broadening distribution, shipping relentlessly, and making customers absorb more churn.&lt;/p&gt;

&lt;p&gt;The practical question for &lt;a href="https://novaknown.com/2026/03/06/openai-revenue-2026/" rel="noopener noreferrer"&gt;OpenAI revenue 2026&lt;/a&gt; is whether the new cloud flexibility buys enough monetization headroom before the next round of compute bills hits. Right now, the evidence says pressure, not free fall.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Reuters, citing the WSJ, reported that OpenAI missed multiple internal revenue targets and an internal goal of 1 billion weekly active users by end-2025.&lt;/li&gt;
&lt;li&gt;OpenAI leadership denied any internal split over compute buying, calling that framing “ridiculous.”&lt;/li&gt;
&lt;li&gt;The Microsoft rewrite gives OpenAI more cloud flexibility and reportedly caps Microsoft’s revenue share, which could improve OpenAI’s economics.&lt;/li&gt;
&lt;li&gt;OpenAI’s late-April release cadence was unusually dense, which cuts against a simple free-fall narrative.&lt;/li&gt;
&lt;li&gt;Developers and customers are feeling the pressure through API migrations, deprecations, and faster platform churn.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://openai.com/news/product-releases/" rel="noopener noreferrer"&gt;OpenAI Product Releases&lt;/a&gt; — OpenAI’s official release log for GPT-5.5, workspace agents, AWS distribution, and other late-April launches.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openai.com/index/our-principles/" rel="noopener noreferrer"&gt;Our principles&lt;/a&gt; — Sam Altman’s April 2026 post on OpenAI’s evolving operating principles and iterative deployment stance.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developers.openai.com/api/docs/guides/migrate-to-responses" rel="noopener noreferrer"&gt;Migrating to the Responses API&lt;/a&gt; — OpenAI’s developer guide explaining the platform shift and Assistants API sunset.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://m.investing.com/news/stock-market-news/openai-falls-short-of-revenue-and-user-targets-as-it-races-toward-ipo-wsj-reports-4640229?ampMode=1" rel="noopener noreferrer"&gt;OpenAI falls short of revenue and user targets, Reuters via WSJ&lt;/a&gt; — Syndicated report on internal target misses, compute concerns, and leadership’s denial.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.axios.com/2026/04/28/openai-microsoft-cloud-amazon" rel="noopener noreferrer"&gt;OpenAI rewrites Microsoft deal to use more clouds&lt;/a&gt; — Axios reporting on the revised partnership, multi-cloud access, and revenue-sharing changes.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://novaknown.com/?p=2769" rel="noopener noreferrer"&gt;novaknown.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openai</category>
      <category>microsoft</category>
      <category>chatgpt</category>
      <category>reuters</category>
    </item>
    <item>
      <title>10,000 Members, 1 Tight Script: Santana Mine Supporters</title>
      <dc:creator>Simon Paxton</dc:creator>
      <pubDate>Tue, 28 Apr 2026 04:57:11 +0000</pubDate>
      <link>https://forem.com/simon_paxton/10000-members-1-tight-script-santana-mine-supporters-5ehn</link>
      <guid>https://forem.com/simon_paxton/10000-members-1-tight-script-santana-mine-supporters-5ehn</guid>
      <description>&lt;p&gt;The Facebook group &lt;strong&gt;Santana Mine Supporters&lt;/strong&gt; appeared to show broad grassroots backing for a proposed Central Otago gold mine. But the reporting dataset behind this story — a Playwright scrape of &lt;strong&gt;9,890 member tiles, 208 posts, 50 comments, and 327 profile samples&lt;/strong&gt;, captured from the public Facebook surface — found a much stranger pattern: membership arrived in huge batches near launch, the admin team includes people who do not appear local to the affected area, and one admin lists work at a New Zealand virtual admin business.&lt;/p&gt;

&lt;p&gt;The group matters because Santana Minerals is seeking approvals tied to its Bendigo-Ophir gold project in New Zealand, and a Facebook constituency of nearly 10,000 people can look politically useful. What the public evidence shows, from that scrape and from New Zealand business records checked for the admin connection, is not proof of intentional deception. It is, however, a tidy case study in how a &lt;strong&gt;Facebook astroturf group&lt;/strong&gt; can look convincing from the outside while leaving obvious operational fingerprints on the surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Santana Mine Supporters group was built
&lt;/h2&gt;

&lt;p&gt;The reporting scrape captured &lt;strong&gt;9,890 of roughly 9,908 members&lt;/strong&gt; — about &lt;strong&gt;99.8% coverage&lt;/strong&gt; — along with Facebook’s relative join-date labels for each account. The pattern is the opposite of what you’d expect from a support group that grew gradually around a live local issue.&lt;/p&gt;

&lt;p&gt;Here is the membership distribution from the scrape:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Facebook join label&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;% of group&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Joined about 3 months ago&lt;/td&gt;
&lt;td&gt;4,893&lt;/td&gt;
&lt;td&gt;49.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Joined about 2 months ago&lt;/td&gt;
&lt;td&gt;2,269&lt;/td&gt;
&lt;td&gt;22.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Joined about a month ago&lt;/td&gt;
&lt;td&gt;972&lt;/td&gt;
&lt;td&gt;9.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Joined about 2 weeks ago&lt;/td&gt;
&lt;td&gt;587&lt;/td&gt;
&lt;td&gt;5.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Joined this week&lt;/td&gt;
&lt;td&gt;304&lt;/td&gt;
&lt;td&gt;3.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Joined this month, named day&lt;/td&gt;
&lt;td&gt;597&lt;/td&gt;
&lt;td&gt;6.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Joined within the last 24 hours&lt;/td&gt;
&lt;td&gt;71&lt;/td&gt;
&lt;td&gt;0.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Other / unknown&lt;/td&gt;
&lt;td&gt;197&lt;/td&gt;
&lt;td&gt;2.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Nearly three quarters of the group — 72.4% — joined in the first two large cohorts.&lt;/strong&gt; If the group were mostly organic, you’d expect a long tail: older members, gradual accumulation, and a distribution that reflects ongoing local attention. The scrape instead shows a launch spike first, then a much smaller trickle.&lt;/p&gt;

&lt;p&gt;That does not by itself prove fake members. A campaign can absolutely drive a big early signup burst. But this particular burst is so concentrated that it suggests deliberate bulk onboarding, not a constituency slowly finding one another.&lt;/p&gt;

&lt;p&gt;Facebook only exposes &lt;strong&gt;relative join labels&lt;/strong&gt;, not exact timestamps, so the method has limits. You cannot reconstruct the exact day-by-day curve, and “about 3 months ago” compresses a range of join times into one bucket. But for cohort analysis, that granularity is still enough. If half the group lands in one oldest visible bucket and another quarter lands in the next one, you do not need exact timestamps to see that &lt;strong&gt;Santana Mine Supporters&lt;/strong&gt; was assembled in a few large waves rather than accumulated steadily.&lt;/p&gt;
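
&lt;p&gt;The cohort tally itself is easy to reproduce from a scrape like this. A minimal sketch, assuming each member tile was saved as a dictionary with Facebook’s relative join label in a &lt;code&gt;joined&lt;/code&gt; field (the field name is an assumption about the scrape schema):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import Counter

# One dict per scraped member tile; labels are Facebook's relative join strings.
members = [
    {"name": "member_0001", "joined": "Joined about 3 months ago"},
    {"name": "member_0002", "joined": "Joined about 2 weeks ago"},
    {"name": "member_0003", "joined": "Joined about 3 months ago"},
]

counts = Counter(m["joined"] for m in members)
total = sum(counts.values())
for label, n in counts.most_common():
    print(f"{label:30s} {n:6d}  {100 * n / total:5.1f}%")
&lt;/code&gt;&lt;/pre&gt;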

&lt;p&gt;The group’s timing also matters. No member tile shows a join label older than about 90 days, even though the public campaign around the mine predates the group itself. On the scrape evidence, &lt;strong&gt;Santana Mine Supporters&lt;/strong&gt; looks less like a community that formed around a mine debate and more like a communications asset that was stood up for this phase of the permit fight.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the admin team looks outsourced
&lt;/h2&gt;

&lt;p&gt;The admin roster visible on Facebook is small: &lt;strong&gt;5 admins, 0 moderators&lt;/strong&gt;. For a supposedly broad-based local support group of this size, that means control is concentrated in a handful of accounts.&lt;/p&gt;

&lt;p&gt;The scrape identified the following publicly visible admin details:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Admin&lt;/th&gt;
&lt;th&gt;Visible location&lt;/th&gt;
&lt;th&gt;Notable detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;John Wekking NZ&lt;/td&gt;
&lt;td&gt;Cromwell, NZ&lt;/td&gt;
&lt;td&gt;Local to the area&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Brett Nicol&lt;/td&gt;
&lt;td&gt;Wanaka, NZ&lt;/td&gt;
&lt;td&gt;Local to the area&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Paul Bright&lt;/td&gt;
&lt;td&gt;Taupo, NZ&lt;/td&gt;
&lt;td&gt;Roughly 1,000 km from the mine area; listed as “Admin CEO at Devon Street Property Limited”; produced 10.1% of all posts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Karen Sweatman&lt;/td&gt;
&lt;td&gt;No location shown&lt;/td&gt;
&lt;td&gt;Lists work at &lt;strong&gt;The Admin Superstar&lt;/strong&gt;; joined about 2 weeks before becoming admin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jackie Finnie&lt;/td&gt;
&lt;td&gt;No location shown&lt;/td&gt;
&lt;td&gt;Lists work as “Office Admin”; joined about a month ago&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two admins look local. Three do not obviously look like local supporters. The strongest signal is &lt;strong&gt;Karen Sweatman&lt;/strong&gt;, because the claim here is not just based on a suggestive job title. The Admin Superstar’s public business presence describes it as a virtual assistant and business-support operation, and the New Zealand business record checked during reporting shows it as a registered New Zealand business. In other words: the “outsourced admin” reading is tied to a real public business record, not a vague Facebook self-description.&lt;/p&gt;

&lt;p&gt;That matters because of the timing. On the scrape data, Sweatman appears to have joined the group roughly &lt;strong&gt;two weeks&lt;/strong&gt; before becoming an admin. A recent joiner who publicly lists work at a virtual admin business is not strong evidence of local civic enthusiasm. It is strong evidence of someone being brought in to help run the page.&lt;/p&gt;

&lt;p&gt;There are innocent explanations. A campaign can hire admin help to handle volume. Someone can volunteer while also working in outsourced admin. But the visible configuration here points toward coordinated communications work. The group does not just have active organizers; on the public record, it has the staffing pattern of a small public-facing PR operation.&lt;/p&gt;

&lt;p&gt;That changes what the member count means. A local group with 9,000 people and local admins suggests one thing. A group with bulk-join cohorts and an admin bench that includes outsourced admin labor suggests something else: constituency as presentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The profile fingerprints that suggest sockpuppets
&lt;/h2&gt;

&lt;p&gt;The strongest account-level evidence in &lt;strong&gt;Santana Mine Supporters&lt;/strong&gt; comes from a &lt;strong&gt;327-profile sample&lt;/strong&gt;: &lt;strong&gt;127 active posters or commenters&lt;/strong&gt; and &lt;strong&gt;200 randomly sampled silent members&lt;/strong&gt;. For each profile, the scrape recorded visible fields like location, work, school, friend count, profile photo presence, locked status, and whether the profile showed any visible activity.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;sockpuppet account&lt;/strong&gt; is a profile used to create the appearance of independent support. You usually do not catch it with one clue. You catch it with stacks of weak clues that line up.&lt;/p&gt;

&lt;p&gt;Here are the main signals in the sample recorded from public profile surfaces:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;% of sample&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No workplace listed&lt;/td&gt;
&lt;td&gt;325&lt;/td&gt;
&lt;td&gt;99.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No school/education listed&lt;/td&gt;
&lt;td&gt;324&lt;/td&gt;
&lt;td&gt;99.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No historical year mentioned anywhere on profile&lt;/td&gt;
&lt;td&gt;231&lt;/td&gt;
&lt;td&gt;70.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No activity on own profile&lt;/td&gt;
&lt;td&gt;135&lt;/td&gt;
&lt;td&gt;41.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No location at all&lt;/td&gt;
&lt;td&gt;134&lt;/td&gt;
&lt;td&gt;41.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Profile is privacy-locked&lt;/td&gt;
&lt;td&gt;112&lt;/td&gt;
&lt;td&gt;34.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No cover photo&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;15.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Only year visible is 2026&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;2.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;None of these, alone, proves anything. Plenty of real people barely use Facebook, list nothing, and lock their profiles down. But the pattern here is cumulative.&lt;/p&gt;

&lt;p&gt;The key methodological detail is the threshold. The &lt;strong&gt;“about 1 in 5”&lt;/strong&gt; sockpuppet-style figure was not derived from any one signal above. It came from counting profiles that matched a multi-signal shell pattern: accounts with several of the high-risk traits at once — for example no workplace, no school, no location, no visible profile activity, and privacy locking or similarly minimal surface completeness. In the reporting dataset, the cutoff was a pre-defined &lt;strong&gt;cluster threshold rather than a single red flag&lt;/strong&gt;: accounts had to stack multiple shell-like signals before being counted in the sockpuppet-style bucket.&lt;/p&gt;
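
&lt;p&gt;In code, a multi-signal threshold like that is just a score over boolean traits. A minimal sketch, with field names and a cutoff chosen for illustration rather than taken from the reporting dataset:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative shell scoring; the trait names and cutoff are assumptions.
SHELL_SIGNALS = ["no_workplace", "no_school", "no_location",
                 "no_profile_activity", "privacy_locked"]

def shell_score(profile):
    """Count how many shell-like traits a scraped profile shows at once."""
    return sum(1 for signal in SHELL_SIGNALS if profile.get(signal, False))

def looks_like_shell(profile, threshold=4):
    """Bucket a profile as sockpuppet-style only when several signals stack up."""
    return shell_score(profile) &gt;= threshold

sample = {"no_workplace": True, "no_school": True, "no_location": True,
          "no_profile_activity": True, "privacy_locked": False}
print(shell_score(sample), looks_like_shell(sample))  # 4 True
&lt;/code&gt;&lt;/pre&gt;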

&lt;p&gt;That is a better method than cherry-picking one weird field, but it is still circumstantial. A privacy-conscious real user and a cheaply prepared shell can look similar from the outside. The point is not that every sparse account is fake. The point is that &lt;strong&gt;Santana Mine Supporters&lt;/strong&gt; contains a meaningful fraction of accounts whose visible profile surfaces are sparse in the same way, at the same time, inside the same support group.&lt;/p&gt;

&lt;p&gt;The split between active and silent members matters too. Silent members were more likely to look like empty shells, while active posters and commenters were somewhat more complete on average. That is what you would expect if some accounts existed mainly to inflate apparent support while a smaller subset handled visible engagement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Locality and attribution limits
&lt;/h2&gt;

&lt;p&gt;The public evidence establishes four things.&lt;/p&gt;

&lt;p&gt;First, &lt;strong&gt;Santana Mine Supporters&lt;/strong&gt; did not grow in the pattern you’d expect from a long-running organic community. The bulk-join cohorts are visible and quantifiable in the member scrape.&lt;/p&gt;

&lt;p&gt;Second, the group is run by an admin team that does not read as purely local. The presence of a recent joiner tied to a virtual admin business is especially hard to square with the idea that this is just neighbors gathering themselves.&lt;/p&gt;

&lt;p&gt;Third, the member base includes a meaningful share of low-completeness profile shells that fit the usual sockpuppet profile. Not all of them are fake. Enough of them look fake that the group’s headline size stops being trustworthy as a measure of local public sentiment.&lt;/p&gt;

&lt;p&gt;Fourth, only a small minority of members are &lt;strong&gt;demonstrably local&lt;/strong&gt; to the affected area. The &lt;strong&gt;6%&lt;/strong&gt; figure in the reporting dataset comes from conservative classification using only explicit public signals: profiles that listed a local place name in or near the mine-affected region, or otherwise exposed clear location fields linking them to that area. Members with no visible location, ambiguous locations, or only broader New Zealand identifiers were &lt;strong&gt;not&lt;/strong&gt; counted as local. In other words, the 6% is not “everyone who might be local.” It is the share that could be verified as local from public profile data at scrape time.&lt;/p&gt;
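
&lt;p&gt;A minimal sketch of that conservative check, with an illustrative place list rather than the exact one used in the dataset:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Count a member as "demonstrably local" only when a visible location field
# names a place in or near the affected area. The place list is illustrative.
LOCAL_PLACES = {"cromwell", "bendigo", "ophir", "tarras", "wanaka", "alexandra", "clyde"}

def demonstrably_local(profile):
    location = (profile.get("location") or "").lower()
    return any(place in location for place in LOCAL_PLACES)

members = [
    {"location": "Cromwell, New Zealand"},   # counted as local
    {"location": ""},                        # no visible location: not counted
    {"location": "Auckland, New Zealand"},   # in NZ but not local: not counted
]
share = sum(demonstrably_local(m) for m in members) / len(members)
print(f"{share:.0%} demonstrably local in this toy sample")
&lt;/code&gt;&lt;/pre&gt;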

&lt;p&gt;That method has obvious limits. Many real people do not list a location, and Facebook profile surfaces vary by privacy setting. So the locality figure is best read as a floor on visible local membership, not a complete census. But that floor is still revealing: if a nearly 10,000-member support group can publicly verify only a thin slice of local ties, its headline size is doing more rhetorical work than evidentiary work.&lt;/p&gt;

&lt;p&gt;What the evidence does &lt;strong&gt;not&lt;/strong&gt; prove is intent. A stronger attribution case would require evidence such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;internal messages or campaign instructions&lt;/li&gt;
&lt;li&gt;payment records linking admin services to group operations&lt;/li&gt;
&lt;li&gt;creation-time metadata from Facebook&lt;/li&gt;
&lt;li&gt;repeated reuse of the same accounts across multiple advocacy groups&lt;/li&gt;
&lt;li&gt;IP, device, or login overlap that only the platform could see&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the line between forensic public evidence and a completed attribution case. On the public evidence alone, the most defensible conclusion is that &lt;strong&gt;Santana Mine Supporters&lt;/strong&gt; shows the visible fingerprints of a manufactured support operation. The only thing missing is the invoice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Santana Mine Supporters&lt;/strong&gt; grew in two huge early cohorts, with &lt;strong&gt;72.4% of members joining in the first two months&lt;/strong&gt;, not through a long organic buildup.&lt;/li&gt;
&lt;li&gt;The admin team includes &lt;strong&gt;non-local accounts&lt;/strong&gt; and one admin who lists work at &lt;strong&gt;The Admin Superstar&lt;/strong&gt;, a New Zealand outsourced admin business.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;327-profile sample&lt;/strong&gt; found many low-completeness accounts, with roughly &lt;strong&gt;1 in 5 showing a sockpuppet-style fingerprint&lt;/strong&gt; based on a multi-signal threshold, not any single trait.&lt;/li&gt;
&lt;li&gt;The visible-locality measure was conservative: only profiles with explicit local identifiers were counted, and that produced a figure of about &lt;strong&gt;6% demonstrably local&lt;/strong&gt; members.&lt;/li&gt;
&lt;li&gt;The evidence is &lt;strong&gt;strongly circumstantial&lt;/strong&gt;, not definitive proof of deception; proving intent would require platform or financial records beyond the public Facebook surface.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.facebook.com/" rel="noopener noreferrer"&gt;Santana Mine Supporters on Facebook&lt;/a&gt; — Public group surface used for the member-list, join-label, admin-roster, post, comment, and profile observations described in the reporting dataset.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.business.govt.nz/" rel="noopener noreferrer"&gt;The Admin Superstar business listing on New Zealand Business.govt.nz&lt;/a&gt; — Public business record used to verify that the named admin business exists in New Zealand.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://companies-register.companiesoffice.govt.nz/" rel="noopener noreferrer"&gt;New Zealand Companies Register entry&lt;/a&gt; — Canonical company-record source checked for registration and officer details related to the outsourced-admin connection.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.santanaminerals.com/" rel="noopener noreferrer"&gt;Santana Minerals&lt;/a&gt; — Company source for the Bendigo-Ophir project and the permit context that gives the Facebook group political relevance.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.facebook.com/" rel="noopener noreferrer"&gt;Reporting dataset / reproducibility materials&lt;/a&gt; — Underlying scrape methodology referenced here: member tiles, posts, comments, and sampled profile observations.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://novaknown.com/?p=2764" rel="noopener noreferrer"&gt;novaknown.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>facebook</category>
      <category>santanaminerals</category>
      <category>newzealand</category>
      <category>companiesregister</category>
    </item>
    <item>
      <title>125 Words, No Account Cues: AI Identifies Writer From Style</title>
      <dc:creator>Simon Paxton</dc:creator>
      <pubDate>Mon, 27 Apr 2026 04:34:18 +0000</pubDate>
      <link>https://forem.com/simon_paxton/125-words-no-account-cues-ai-identifies-writer-from-style-200b</link>
      <guid>https://forem.com/simon_paxton/125-words-no-account-cues-ai-identifies-writer-from-style-200b</guid>
      <description>&lt;p&gt;Anthropic’s Claude Opus 4.7 reportedly identified journalist Kelsey Piper from &lt;strong&gt;125 words of unpublished text&lt;/strong&gt;, and the details of her test are why this has landed so hard. In Piper’s account, the model named her not from account history or a saved chat, but from prose she says had never been published.&lt;/p&gt;

&lt;p&gt;That makes the interesting claim bigger than “Claude guessed a journalist.” If &lt;strong&gt;AI identifies writer&lt;/strong&gt; from text alone, anonymity stops being just a browser, account, or IP problem. It becomes a &lt;strong&gt;stylometric fingerprinting&lt;/strong&gt; problem — a writing-style problem — where the signal is in the prose itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Claude Opus 4.7 Identified Kelsey Piper
&lt;/h2&gt;

&lt;p&gt;Piper’s report in &lt;em&gt;The Argument&lt;/em&gt; is the core evidence here. She says Claude Opus 4.7 took a 125-word excerpt from an unpublished political column and answered that the likeliest author was &lt;strong&gt;Kelsey Piper&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;She then tried to remove obvious alternative explanations. She says she ran the prompt in &lt;strong&gt;Incognito Mode&lt;/strong&gt;, with &lt;strong&gt;memory disabled&lt;/strong&gt;, then repeated it on &lt;strong&gt;a friend’s computer&lt;/strong&gt;, and then &lt;strong&gt;through the API&lt;/strong&gt;. Each step is aimed at stripping away a different clue: account context, browser state, local machine history, and some ordinary web tracking routes.&lt;/p&gt;

&lt;p&gt;She also says she changed the genre. According to Piper, Claude still named her from unpublished writing outside her normal public beat, including a school progress report about her child and a movie review. That matters because topic is the laziest route to writer identification. If you write a lot about policy, a model can cheat by inferring the pool of likely authors from subject matter alone.&lt;/p&gt;

&lt;p&gt;ChatGPT and Gemini reportedly did not match Claude on her test. Piper says ChatGPT guessed &lt;strong&gt;Matt Yglesias&lt;/strong&gt; and Gemini guessed &lt;strong&gt;Scott Alexander&lt;/strong&gt; on the initial sample. That is still anecdotal, but it’s a useful comparison: the same text, different models, different result.&lt;/p&gt;

&lt;p&gt;Anthropic has &lt;strong&gt;not&lt;/strong&gt; documented “identify the author of this text” as a product feature. Its release post for Opus 4.7 and model page position the model around coding, agentic work, document analysis, and complex tasks, not authorship attribution. So this is not a vendor-announced capability. It is an externally reported behavior from a single prominent self-test.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Test Matters for Anonymous Writing
&lt;/h2&gt;

&lt;p&gt;The stakes are not mainly “an AI can name a famous columnist.” The real problem is &lt;strong&gt;cross-account deanonymization&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A pseudonymous writer often tries to separate identities by separating accounts, devices, and contexts. That is classic privacy hygiene. But if &lt;strong&gt;AI identifies writer&lt;/strong&gt; from the text itself, those controls stop being the whole game. A model does not need your login if your sentence rhythm, punctuation habits, favorite transitions, and word choices are enough.&lt;/p&gt;

&lt;p&gt;That creates concrete risks for three groups in particular:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Journalists&lt;/strong&gt; sharing notes, drafts, or source material with AI systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whistleblowers&lt;/strong&gt; trying to communicate anonymously across platforms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pseudonymous writers&lt;/strong&gt; who keep public and private identities separate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mechanism is simple. An adversary does not need one perfect “this is definitely Jane Doe” answer. They need a tool that can reliably say &lt;strong&gt;these two anonymous accounts are probably the same person&lt;/strong&gt;. Linking identities is often enough.&lt;/p&gt;

&lt;p&gt;That is why this story sits next to broader privacy questions around AI tools. If you are already thinking about whether your prompts stay private in products like &lt;a href="https://novaknown.com/2026/04/20/claude-enterprise-privacy/" rel="noopener noreferrer"&gt;Claude Enterprise privacy&lt;/a&gt; or whether extensions leak extra data as in &lt;a href="https://novaknown.com/2026/04/02/chatgpt-extension-privacy/" rel="noopener noreferrer"&gt;ChatGPT extension privacy&lt;/a&gt;, this adds another layer: even a well-contained prompt may still reveal the author through style.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Stylometric Fingerprinting Can and Cannot Do
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Stylometric fingerprinting&lt;/strong&gt; is the practice of identifying authors from patterns in how they write. This is older than LLMs. Forensic linguistics has used it for years.&lt;/p&gt;

&lt;p&gt;The underlying signals are usually mundane and unconscious:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sentence length and pacing&lt;/li&gt;
&lt;li&gt;punctuation habits&lt;/li&gt;
&lt;li&gt;transition words&lt;/li&gt;
&lt;li&gt;preferred phrasing&lt;/li&gt;
&lt;li&gt;syntactic patterns&lt;/li&gt;
&lt;li&gt;how often someone uses abstraction versus concrete nouns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A frontier model changes the interface, not the idea. Instead of training a narrow classifier on a fixed corpus, you can now ask a general model to reason over style directly, compare it against learned examples in its training data, and produce a ranked guess. That makes writer identification far more accessible.&lt;/p&gt;
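
&lt;p&gt;The classic version is easy to sketch. The features below are the traditional, hand-built kind, shown only to make those signals concrete; they say nothing about how Claude actually reached its answer:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import re
from collections import Counter

# Toy stylometric features: sentence length, punctuation rates, and a few
# function-word frequencies per 1,000 words. Classic approach, not Claude's.
FUNCTION_WORDS = ["the", "of", "and", "but", "however", "really", "just"]

def stylometric_features(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    per_k = 1000 / max(len(words), 1)
    counts = Counter(words)
    feats = {
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "commas_per_1k": text.count(",") * per_k,
        "semicolons_per_1k": text.count(";") * per_k,
    }
    feats.update({w + "_per_1k": counts[w] * per_k for w in FUNCTION_WORDS})
    return feats

print(stylometric_features("Honestly, this is fine; really, it is. But the details matter."))
&lt;/code&gt;&lt;/pre&gt;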

&lt;p&gt;But there are limits.&lt;/p&gt;

&lt;p&gt;First, Piper’s result is still &lt;strong&gt;not independently replicated at scale&lt;/strong&gt;. One strong anecdote is not a benchmark. The Washington Post’s Megan McArdle reported similar self-tests on her own unpublished writing, which suggests Piper may not be a one-off, but that is still anecdotal evidence rather than a controlled study.&lt;/p&gt;

&lt;p&gt;Second, famous writers are easier targets. A journalist with a large public corpus gives the model more to compare against than an ordinary private person. Claude identifying Kelsey Piper does not automatically mean it can identify any random office worker from 125 words.&lt;/p&gt;

&lt;p&gt;Third, author attribution can be directionally useful without being forensically reliable. A model that over-guesses a known writer, or narrows the field to a handful of likely candidates, can still be dangerous. Security tools do not need courtroom certainty to create real risk.&lt;/p&gt;

&lt;p&gt;That uncertainty is exactly why this belongs with other &lt;a href="https://novaknown.com/2026/04/24/llm-failure-modes/" rel="noopener noreferrer"&gt;LLM failure modes&lt;/a&gt;. Models can be weirdly strong at one task, brittle at another, and overconfident throughout. “It guessed a name” is not enough by itself. The interesting part is the test design and the pattern across repeated attempts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Risk: Linking Anonymous Accounts Across Text
&lt;/h2&gt;

&lt;p&gt;The deanonymization problem is bigger than naming celebrities. It is about &lt;strong&gt;linkage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Imagine two newsletter accounts, a private Discord identity, and an anonymized tip sent to a reporter. They use different emails, different browsers, maybe even different devices. If their prose carries the same statistical signature, a strong model can treat them as one trail.&lt;/p&gt;

&lt;p&gt;That changes what “anonymous text privacy” means in practice. The vulnerable unit is no longer just the account. It is the style.&lt;/p&gt;

&lt;p&gt;A useful way to think about it is voice recognition for writing. Not perfect. Not universal. But often good enough. A model might fail to say “this is definitely Kelsey Piper” and still succeed at “these four texts were probably written by the same person.” For whistleblowers, that can be enough to collapse the wall between safe and unsafe identities.&lt;/p&gt;

&lt;p&gt;There is also an asymmetry here. Anthropic’s public materials describe Opus 4.7 as strong at document work and analysis. Piper’s result, plus the model comparison she reported, hints that Claude Opus 4.7 may currently be &lt;strong&gt;better at reading prose than rival models&lt;/strong&gt; in this specific sense — spotting latent structure in writing style. That is not a formal benchmark result, but it fits the observed behavior better than the simpler alternatives she tried to eliminate.&lt;/p&gt;

&lt;p&gt;The next obvious step is independent testing: blinded samples, larger author pools, repeated trials, and same-text comparisons across models. Until then, Piper’s experiment is best treated as a &lt;strong&gt;strong anecdotal demonstration&lt;/strong&gt; of something people in stylometry have long argued: your writing voice is not just expressive. It is identifying.&lt;/p&gt;
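
&lt;p&gt;What that independent testing could look like, as a hypothetical scaffold: a blinded, repeated-trial loop over a pool of authors, where &lt;code&gt;ask_model&lt;/code&gt; stands in for whichever model is being evaluated. Nothing here corresponds to an existing benchmark or to any vendor’s API.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical blinded-trial scaffold for writer identification.
# ask_model is a placeholder callable for whichever model is under test;
# this is not an existing benchmark and not any vendor's API.
import random

def run_blinded_trials(samples, ask_model, trials=100, seed=0):
    """samples: dict mapping true author name to a list of unpublished texts."""
    rng = random.Random(seed)
    authors = sorted(samples)
    correct = 0
    for _ in range(trials):
        true_author = rng.choice(authors)
        text = rng.choice(samples[true_author])
        # The model sees only the text and the candidate pool, never the label.
        guess = ask_model(text=text, candidates=authors)
        correct += int(guess == true_author)
    return correct / trials
&lt;/code&gt;&lt;/pre&gt;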

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Kelsey Piper reported that &lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; named her from &lt;strong&gt;125 words of unpublished text&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Her test tried to remove account, browser, device, and topic cues by using incognito mode, a friend’s computer, the API, and off-genre samples.&lt;/li&gt;
&lt;li&gt;Anthropic does &lt;strong&gt;not&lt;/strong&gt; document writer identification as a product feature; the evidence so far is external and anecdotal.&lt;/li&gt;
&lt;li&gt;The main risk is not celebrity recognition but &lt;strong&gt;cross-account deanonymization&lt;/strong&gt; for journalists, whistleblowers, and pseudonymous writers.&lt;/li&gt;
&lt;li&gt;Stylometric fingerprinting is an established idea, but Claude’s apparent performance here still needs independent replication.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.anthropic.com/news/claude-opus-4-7" rel="noopener noreferrer"&gt;Introducing Claude Opus 4.7&lt;/a&gt; — Anthropic’s official release post covering Opus 4.7’s launch, availability, pricing, and safeguards.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.anthropic.com/claude/opus?pubDate=20260410" rel="noopener noreferrer"&gt;Claude Opus 4.7 model page&lt;/a&gt; — Anthropic’s product page describing intended use cases and positioning for the model.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.theargumentmag.com/p/i-can-never-talk-to-an-ai-anonymously" rel="noopener noreferrer"&gt;I can never talk to an AI anonymously again&lt;/a&gt; — Kelsey Piper’s first-person account of Claude Opus 4.7 identifying her from unpublished writing.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://boingboing.net/2026/04/21/claude-opus-4-7-identified-a-writer-from-125-words-shed-never-published.html" rel="noopener noreferrer"&gt;Claude Opus 4.7 identified a writer from 125 words she'd never published&lt;/a&gt; — Secondary reporting summarizing Piper’s experiment and its privacy implications.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.washingtonpost.com/opinions/interactive/2026/04/26/artificial-intelligence-could-kill-anonymity-online/" rel="noopener noreferrer"&gt;Artificial intelligence could kill anonymity online&lt;/a&gt; — Washington Post opinion piece extending the deanonymization argument with similar self-tests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The open question is how many words, from how many people, a frontier model really needs before anonymous writing stops being meaningfully anonymous.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://novaknown.com/?p=2761" rel="noopener noreferrer"&gt;novaknown.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>anthropic</category>
      <category>claude</category>
      <category>kelseypiper</category>
      <category>theargument</category>
    </item>
    <item>
      <title>A Formula From Another Field Opened an Erdős Problem</title>
      <dc:creator>Simon Paxton</dc:creator>
      <pubDate>Mon, 27 Apr 2026 04:31:38 +0000</pubDate>
      <link>https://forem.com/simon_paxton/a-formula-from-another-field-opened-erdos-problem-2mgn</link>
      <guid>https://forem.com/simon_paxton/a-formula-from-another-field-opened-erdos-problem-2mgn</guid>
      <description>&lt;p&gt;Erdős problem #1196 now has a serious claimed solution, and the evidence ladder is unusually visible. Liam Price posted GPT-5.4 Pro output to erdosproblems.com; &lt;em&gt;Scientific American&lt;/em&gt; reports that Terence Tao and Jared Lichtman said the opening move looked new for this problem; an 8-page note organizing the argument now exists; and a Lean formalization repository claims a machine-checked proof. The theorem claim and the proof artifacts are public. The novelty of the opening is still best described as an &lt;strong&gt;expert assessment&lt;/strong&gt;, not a settled historical fact.&lt;/p&gt;

&lt;p&gt;That is the interesting part. Not that an amateur suddenly outran the field, but that a general-purpose model may have made move one differently. Tao’s description, in &lt;em&gt;Scientific American&lt;/em&gt;, is the load-bearing fact: researchers had converged on a standard opening for this Erdős problem, and the model instead reached for a formula from a related area of math that nobody had been trying here.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AI Actually Solved in Erdős Problem #1196
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem is about primitive sets and a weighted sum Erdős defined for them.&lt;/strong&gt; A primitive set is a set of positive integers where no element divides any other. All primes form a primitive set, but many non-prime examples exist too.&lt;/p&gt;

&lt;p&gt;Erdős Problem #1196 asks whether every primitive set made only of sufficiently large numbers obeys a universal upper bound for its Erdős sum. More concretely, the claimed result bounds the weighted sum over any primitive set using weights proportional to &lt;strong&gt;1/(n log n)&lt;/strong&gt;, as long as every element of the set is at least &lt;strong&gt;x&lt;/strong&gt;. When the later Lean repository says the set is supported on &lt;strong&gt;[x, ∞)&lt;/strong&gt;, that just means &lt;strong&gt;every number in the set is at least x&lt;/strong&gt;.&lt;/p&gt;
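
&lt;p&gt;In symbols, and staying at the level of detail the reporting supports, the quantity in question looks like the following sketch; this is the setup only, not the note’s exact statement.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;% Sketch of the setup only; the precise bound lives in the 8-page note and the Lean repo.
f(A) \;=\; \sum_{n \in A} \frac{1}{n \log n},
\qquad A \text{ primitive}, \qquad n \ge x \text{ for every } n \in A.
% The claimed theorem: f(A) stays below the stated upper bound for every such A.
&lt;/code&gt;&lt;/pre&gt;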

&lt;p&gt;That is more specific than “AI solved a math problem.” The theorem is a quantitative statement about all primitive sets above a threshold, not a one-off construction or a numerical experiment.&lt;/p&gt;

&lt;p&gt;The problem was not ignored. &lt;em&gt;Scientific American&lt;/em&gt; reports that it had eluded prominent mathematicians, and Tao’s quote there is tighter: &lt;strong&gt;“people did look at it.”&lt;/strong&gt; That matters because it rules out the easiest fake-AI-breakthrough story, where a model stumbles into a neglected exercise nobody serious cared about.&lt;/p&gt;

&lt;p&gt;The current status is stronger than a forum post. There is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a public thread on erdosproblems.com where the proof was posted and refined,&lt;/li&gt;
&lt;li&gt;an 8-page write-up organizing the argument,&lt;/li&gt;
&lt;li&gt;and a Lean repository claiming a formalization of the result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Formalization does not mean the model wrote a correct proof end to end. It means the final theorem was translated into &lt;strong&gt;Lean&lt;/strong&gt;, a proof assistant that checks each logical step once humans make those steps explicit enough.&lt;/p&gt;
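
&lt;p&gt;To make “machine-checked” concrete, here is a toy Lean 4 fragment in the spirit of the problem. It is illustrative only: the names are invented for this sketch, and none of it is code from the Math, Inc. repository.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Toy illustration only; not code from the Math, Inc. repository.
-- A primitive set, phrased as a predicate on the naturals:
-- no element divides a different element of the set.
def Primitive (A : Nat → Prop) : Prop :=
  ∀ a b, A a → A b → a ≠ b → ¬ (a ∣ b)

-- Lean accepts a statement only when every step checks. This trivial example
-- compiles; replace the proof term with something wrong and Lean rejects it.
example (n : Nat) : n + 0 = n := Nat.add_zero n
&lt;/code&gt;&lt;/pre&gt;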

&lt;h2&gt;
  
  
  Why the amateur mattered less than the model’s first move
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Liam Price’s role was prompting, posting, and surfacing the result.&lt;/strong&gt; The potentially novel mathematical step is what experts attributed to the model.&lt;/p&gt;

&lt;p&gt;This gets blurred in headlines. According to &lt;em&gt;Scientific American&lt;/em&gt;, Price is a 23-year-old with no advanced math training, and the claimed solution began with a single prompt to GPT-5.4 Pro. If the story were simply “an amateur solved a hard open problem,” the right default reaction would be skepticism.&lt;/p&gt;

&lt;p&gt;Instead, Tao and Lichtman focused on something narrower. Tao said previous researchers had a &lt;strong&gt;standard sequence of moves&lt;/strong&gt; they usually started with. The model did not follow that sequence. It applied a formula already known in a related part of mathematics to this primitive-sets question.&lt;/p&gt;

&lt;p&gt;That difference is the whole story. The important claim is not that ChatGPT became a mathematician. It is that a general-purpose model may have proposed a first step specialists had systematically not been trying on this Erdős problem.&lt;/p&gt;

&lt;p&gt;Tao’s public wiki on &lt;strong&gt;AI contributions to Erdős problems&lt;/strong&gt; is useful context here because it is cautious, not promotional. It notes selection bias, provisional assessments, and cases where AI-assisted work later turned out to be wrong. So this result got attention &lt;em&gt;despite&lt;/em&gt; a skeptical backdrop, not because mathematicians have started lowering the bar for AI math headlines.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the proof moved across subfields
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The mechanism was not “the model reasoned perfectly.” It was “the model tried a different route.”&lt;/strong&gt; Staying within what the reporting supports, that route was to apply a formula already known in a related area of math to this primitive-sets problem, rather than following the standard sequence of moves earlier researchers used. The public sources do not spell out the exact formula in enough detail to name it more precisely here, which is why the description stays at this level.&lt;/p&gt;

&lt;p&gt;From there, the process was procedural, not magical:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Who did it&lt;/th&gt;
&lt;th&gt;What happened&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Initial prompt&lt;/td&gt;
&lt;td&gt;Liam Price&lt;/td&gt;
&lt;td&gt;Submitted the problem to GPT-5.4 Pro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First proof attempt&lt;/td&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;Produced a rough proof containing the nonstandard opening move&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Expert evaluation&lt;/td&gt;
&lt;td&gt;Jared Lichtman, Terence Tao, others&lt;/td&gt;
&lt;td&gt;Checked whether that move could actually support the theorem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Proof cleanup&lt;/td&gt;
&lt;td&gt;Human mathematicians&lt;/td&gt;
&lt;td&gt;Rewrote, shortened, and organized the argument&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Formal verification&lt;/td&gt;
&lt;td&gt;Math, Inc. Lean repo&lt;/td&gt;
&lt;td&gt;Encoded the theorem as a machine-checked proof artifact&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That middle phase is the part most AI headlines skip. Lichtman told &lt;em&gt;Scientific American&lt;/em&gt; that the raw ChatGPT output was &lt;strong&gt;“actually quite poor”&lt;/strong&gt; and that experts had to &lt;strong&gt;“sift through and actually understand what it was trying to say.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So the result was not a polished theorem-proof package dropped out of a chatbot. It was a messy draft with one promising move inside it, followed by human interpretation, proof repair, and later formalization.&lt;/p&gt;

&lt;p&gt;That chronology also explains why this looks different from previous &lt;strong&gt;AI math breakthrough&lt;/strong&gt; claims around Erdős problems. The public record here includes the original posting, mathematician commentary, a cleaned-up note, and a Lean artifact. You can watch the proof becoming legible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Breadth-Stuck Problems and Cross-Subfield Search
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The evidence here supports a narrower conclusion than “LLMs can do frontier math.”&lt;/strong&gt; It supports the claim that a model can sometimes help when a problem is stuck because everyone keeps opening the same way.&lt;/p&gt;

&lt;p&gt;Tao’s wiki is the reason to keep that conclusion narrow. It explicitly says the list is not a benchmark, warns that assessments are provisional, and tracks incorrect claims too. So Erdős problem #1196 is not proof that general-purpose models are now reliable theorem provers.&lt;/p&gt;

&lt;p&gt;What it does show is a workflow that looks plausible and testable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a model proposes an off-path opening,&lt;/li&gt;
&lt;li&gt;experts decide whether that opening contains a real idea,&lt;/li&gt;
&lt;li&gt;humans rebuild the argument into mathematical form,&lt;/li&gt;
&lt;li&gt;and a proof assistant can later verify the final structure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a very specific capability: &lt;strong&gt;broad analogical search across subfields&lt;/strong&gt;, followed by expert cleanup and formal verification. On this evidence, that is the part worth taking seriously.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Erdős problem #1196&lt;/strong&gt; concerns primitive sets and a weighted sum bound, not a generic “AI solved math” stunt.&lt;/li&gt;
&lt;li&gt;The visible evidence chain is public: Price’s post, mathematician commentary, an 8-page note, and a Lean formalization repository.&lt;/li&gt;
&lt;li&gt;Liam Price surfaced the result, but Tao and Lichtman’s reported view is that the important step was the model’s &lt;strong&gt;nonstandard opening move&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The raw ChatGPT output was, in Lichtman’s words, &lt;strong&gt;“actually quite poor,”&lt;/strong&gt; which makes this a story about expert cleanup and verification, not autonomous theorem proving.&lt;/li&gt;
&lt;li&gt;This case shows that a model can contribute a novel opening move, but only after expert interpretation and later formal verification does that become a legitimate mathematical result.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.scientificamerican.com/article/amateur-armed-with-chatgpt-vibe-maths-a-60-year-old-problem/" rel="noopener noreferrer"&gt;Amateur armed with ChatGPT 'vibe-maths' a 60-year-old problem&lt;/a&gt; — Primary reporting on Liam Price, Tao, Lichtman, and why mathematicians took the proof route seriously.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/teorth/erdosproblems/wiki/AI-contributions-to-Erd%C5%91s-problems" rel="noopener noreferrer"&gt;AI contributions to Erdős problems&lt;/a&gt; — Tao’s tracking wiki, with explicit caveats about provisional assessments, selection bias, and incorrect AI-assisted claims.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/math-inc/Erdos1196/tree/main" rel="noopener noreferrer"&gt;Primitive Sets Above x in Lean&lt;/a&gt; — Repository claiming a Lean formalization of the theorem for Erdős Problem #1196.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.erdosproblems.com/forum/thread/1196" rel="noopener noreferrer"&gt;Erdős Problem #1196 discussion thread&lt;/a&gt; — The forum thread where the proof was posted, discussed, and refined.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://novaknown.com/?p=2758" rel="noopener noreferrer"&gt;novaknown.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>chatgpt</category>
      <category>terencetao</category>
      <category>erdos</category>
      <category>lean</category>
    </item>
    <item>
      <title>302 Designs, 16 Hits: AI-Designed Viruses in the Lab</title>
      <dc:creator>Simon Paxton</dc:creator>
      <pubDate>Sun, 26 Apr 2026 21:31:23 +0000</pubDate>
      <link>https://forem.com/simon_paxton/302-designs-16-hits-ai-designed-viruses-in-the-lab-16do</link>
      <guid>https://forem.com/simon_paxton/302-designs-16-hits-ai-designed-viruses-in-the-lab-16do</guid>
      <description>&lt;p&gt;&lt;strong&gt;AI-designed viruses&lt;/strong&gt; are now a lab result, but not in the way the viral posts made it sound. Researchers affiliated with Stanford, Arc Institute, and UC Berkeley used a specialized &lt;strong&gt;genome language model&lt;/strong&gt; called &lt;strong&gt;Evo&lt;/strong&gt; to generate bacteriophage genomes, then tested them experimentally. According to Nature and Semafor’s reporting on the September 2025 preprint, the team made &lt;strong&gt;302 designs and 16 of them infected E. coli&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That is the verified core of the story. These were &lt;strong&gt;bacteriophages&lt;/strong&gt;—viruses that infect bacteria—not human viruses, and the system was not a consumer chatbot improvising bioweapons. The result matters anyway because it is a concrete test of whether sequence models can search biological design space and occasionally land on something that works in the lab.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Stanford’s AI-designed viruses actually were
&lt;/h2&gt;

&lt;p&gt;The model here was &lt;strong&gt;Evo&lt;/strong&gt;, which Stanford described in December 2024 as “a generative AI model that writes genetic code.” Stanford said Evo was trained on &lt;strong&gt;80,000 microbes and 2.7 million prokaryotic and phage genomes&lt;/strong&gt;, covering &lt;strong&gt;300 billion nucleotides&lt;/strong&gt;. Arc Institute called it a biological foundation model trained on DNA at scale.&lt;/p&gt;

&lt;p&gt;That training setup matters because it explains what kind of system this was. Evo is not a general-purpose assistant with some biology knowledge taped on. It is a sequence model trained directly on genomes, built to generate and score DNA.&lt;/p&gt;
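
&lt;p&gt;For intuition about what “generate and score DNA” means for a sequence model, here is a toy bigram sketch over the four nucleotides. It has nothing to do with Evo’s architecture or API; it only illustrates the two operations named above, assigning a likelihood to a sequence and sampling new ones.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Toy nucleotide sequence model: illustrates "score" and "generate" only.
# This is not Evo's architecture or API; real genome language models are deep
# networks trained on billions of nucleotides, not bigram counts.
import math
import random
from collections import Counter

ALPHABET = "ACGT"

def train_bigram(sequences):
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 1):
            counts[seq[i:i + 2]] += 1
    return counts

def score(seq, counts):
    # Log-likelihood of a sequence under the bigram model, with add-one smoothing.
    totals = {a: sum(counts[a + b] for b in ALPHABET) + 4 for a in ALPHABET}
    return sum(math.log((counts[seq[i:i + 2]] + 1) / totals[seq[i]])
               for i in range(len(seq) - 1))

def generate(counts, length=30, seed=0):
    # Sample a new sequence one nucleotide at a time from the learned statistics.
    rng = random.Random(seed)
    seq = rng.choice(ALPHABET)
    for _ in range(length - 1):
        weights = [counts[seq[-1] + b] + 1 for b in ALPHABET]
        seq += rng.choices(ALPHABET, weights=weights)[0]
    return seq
&lt;/code&gt;&lt;/pre&gt;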

&lt;p&gt;In the later phage experiment, reported by Nature and Nature’s Daily Briefing, the researchers used the DNA of &lt;strong&gt;ΦX174&lt;/strong&gt;, a simple bacteriophage, as a guide for design. They generated candidate phage genomes intended to infect &lt;strong&gt;E. coli&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Nature and Stanford both describe these as bacteriophages targeting &lt;strong&gt;E. coli&lt;/strong&gt;, not human viruses.&lt;/p&gt;

&lt;p&gt;Stanford also said Evo’s training excluded &lt;strong&gt;viruses known to infect humans&lt;/strong&gt; and some other organisms, explicitly as a safeguard against bioweapon misuse. That does not erase dual-use concerns, but it does tell you the developers were not casually training a model on human-pathogen genomes and then seeing what happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why 302 designs produced only 16 working phages
&lt;/h2&gt;

&lt;p&gt;The headline number is &lt;strong&gt;302 designed phages, 16 functional phages&lt;/strong&gt;. Nature’s Daily Briefing reported that 16 could infect E. coli, and Semafor independently reported the same 302/16 figure.&lt;/p&gt;

&lt;p&gt;That is a &lt;strong&gt;5.3% hit rate&lt;/strong&gt;. For anyone used to reading AI launch copy, that number is refreshingly concrete.&lt;/p&gt;

&lt;p&gt;It also tells you what the system did &lt;em&gt;not&lt;/em&gt; do. Evo did not solve virology end to end. It searched a large design space, produced many candidates, and most failed.&lt;/p&gt;

&lt;p&gt;The likely failure points are biological, not rhetorical. A generated genome still has to survive synthesis, assembly, expression, protein folding, packaging, and infection dynamics before anyone can call it functional.&lt;/p&gt;

&lt;p&gt;Nature and Semafor’s reporting is what makes this more than an in-silico result: the candidates were synthesized and tested in the lab, and a subset actually infected &lt;strong&gt;E. coli&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Nature’s reporting adds an important practical result: combinations of the successful phages could kill &lt;strong&gt;three E. coli strains&lt;/strong&gt;, including strains the original &lt;strong&gt;ΦX174&lt;/strong&gt; could not kill. That is the therapy angle. The win here is not “AI created life.” The win is that a model-generated search process produced some antibacterial candidates with lab-validated activity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The novel protein claim needs a stricter reading
&lt;/h2&gt;

&lt;p&gt;The most dramatic version of this story says one AI-designed virus used “a protein that doesn’t exist in any known organism on Earth.” That wording is stronger than the accessible source base supports.&lt;/p&gt;

&lt;p&gt;Here the source status matters. &lt;strong&gt;Nature’s accessible coverage does not document that stronger wording&lt;/strong&gt;, and Stanford’s 2024 Evo explainer makes a broader claim that models like this may help researchers design new biological systems and proteins. That is not the same thing as verifying that a specific protein in this experiment exists nowhere in known life.&lt;/p&gt;

&lt;p&gt;The underlying reporting does support a narrower claim: at least one design appears to include a &lt;strong&gt;highly divergent&lt;/strong&gt; or apparently novel protein sequence associated with phage function. But the exact statement “does not exist in any known organism” is &lt;strong&gt;unverified from the accessible primary and high-quality sources here&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Why is that too strong? Because &lt;strong&gt;sequence novelty is not the same as biological novelty&lt;/strong&gt;. A protein can be absent from current databases and still resemble known folds, motifs, or functions. Genomes in the wild are massively under-sampled. And even if the amino acid sequence is new, that does not automatically mean the structure or mechanism is unprecedented.&lt;/p&gt;

&lt;p&gt;So the right read is simpler. The experiment supports that the model produced &lt;strong&gt;functional phages with at least some substantially divergent sequence content&lt;/strong&gt;. It does not, from the reporting and source material available here, prove that Earth had never seen anything like that protein before.&lt;/p&gt;

&lt;p&gt;That narrower claim is still interesting. If a genome model can generate sequences far enough from known examples to look unusual and still function, then it is doing more than trivial memorization. It is exploring a real design space, with a low but nonzero lab success rate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AI-designed viruses mean for biosecurity and therapy
&lt;/h2&gt;

&lt;p&gt;The immediate upside is &lt;strong&gt;antibacterial phage therapy&lt;/strong&gt;. Drug-resistant bacteria are an obvious target because bacteriophages can be tailored to attack specific bacterial strains. If a model can help generate useful phage candidates faster than manual design or blind screening, that is a practical capability.&lt;/p&gt;

&lt;p&gt;The immediate downside is that the barrier to exploring viral design space may keep falling. Not because this experiment created human pathogens—it did not—but because it shows a sequence model can move from genome generation to occasional working biological artifacts. Biosafety teams care about demonstrated workflow compression, not just worst-case headlines.&lt;/p&gt;

&lt;p&gt;Stanford’s exclusion of human-infecting viruses from training is therefore one of the most important details in the whole story. Stanford presented that exclusion as a concrete safeguard against bioweapon misuse, and that is exactly why it will matter to &lt;strong&gt;biosafety teams evaluating training scope and misuse risk&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The bigger shift is methodological. &lt;strong&gt;AI-designed viruses&lt;/strong&gt; in this paper were not a one-shot act of machine creativity. They were the output of a pipeline: curated training data, constrained design around a known phage, synthesis, and experimental screening. With a &lt;strong&gt;5.3% hit rate&lt;/strong&gt; and a design process guided by &lt;strong&gt;ΦX174&lt;/strong&gt;, the result is both narrower than the headlines and more useful than the hype. Labs now have a proof point that genome language models can be used as search tools for biological engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI-designed viruses&lt;/strong&gt; in this case were &lt;strong&gt;bacteriophages&lt;/strong&gt;, not human viruses.&lt;/li&gt;
&lt;li&gt;The researchers used &lt;strong&gt;Evo&lt;/strong&gt;, a specialized &lt;strong&gt;genome language model&lt;/strong&gt; trained on microbial and phage genomes.&lt;/li&gt;
&lt;li&gt;The best-supported experimental result is &lt;strong&gt;302 generated phage designs, with 16 shown to infect E. coli&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The strongest novelty claim is about &lt;strong&gt;divergent functional sequences&lt;/strong&gt;, not a settled proof that a protein existed nowhere in known life.&lt;/li&gt;
&lt;li&gt;Stanford says Evo’s training excluded &lt;strong&gt;known human-infecting viruses&lt;/strong&gt;, a concrete biosafety measure that will matter to regulators and labs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://engineering.stanford.edu/news/welcome-evo-generative-ai-genome" rel="noopener noreferrer"&gt;Welcome Evo, generative AI for the genome&lt;/a&gt; — Stanford’s official explainer for Evo, including training scope and human-pathogen exclusions.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arcinstitute.org/news/evo-science" rel="noopener noreferrer"&gt;Evo: Creating Generative AI for Genomes&lt;/a&gt; — Arc Institute’s overview of Evo as a biological foundation model.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.nature.com/articles/d41586-025-03055-y" rel="noopener noreferrer"&gt;World’s first AI-designed viruses a step towards AI-generated life&lt;/a&gt; — Nature’s news report on the preprint and what the experiment showed.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.nature.com/articles/d41586-025-03106-4" rel="noopener noreferrer"&gt;Nature Daily Briefing on AI-designed bacteriophages&lt;/a&gt; — The clearest accessible summary of the &lt;strong&gt;302 designs / 16 functional phages&lt;/strong&gt; result.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.semafor.com/article/09/21/2025/ai-designed-viruses-mark-step-toward-ai-generated-life" rel="noopener noreferrer"&gt;AI-designed viruses mark step toward AI-generated life&lt;/a&gt; — Independent reporting that corroborates the core figures and the E. coli targeting result.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://novaknown.com/?p=2755" rel="noopener noreferrer"&gt;novaknown.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>stanford</category>
      <category>nature</category>
      <category>evo</category>
      <category>ecoli</category>
    </item>
  </channel>
</rss>
