Forem: MxGuru

Local-First AI: Why Your Threat Intel Shouldn't Live on Someone Else's Server

MxGuru — Wed, 20 May 2026 22:38:00 +0000

Every time you send a query to a cloud AI API, you're sending data you don't control.

For most use cases, this is fine. For security teams, it's a compliance problem.

Your threat intelligence. Your vulnerability scan results. Your client's infrastructure details. Your red team findings. All sitting on someone else's server, governed by someone else's retention policy, subject to someone else's subpoena.

The Local-First Alternative

I built The Sovereign Hive to run entirely on local hardware:

114 local models via Ollama (including quantized models that run on consumer GPUs)
Zero-trust secrets vault with hardware key support (YubiKey/USB auth)
Full audit trail — every action, every tool call, every agent decision logged
SPIFFE workload identity for service-to-service authentication
BitLocker integration for encrypted-at-rest key storage

Your data never leaves your network. Not even for embeddings — the semantic intent classifier uses nomic-embed-text running locally via Ollama.

What You Lose

Honestly? Not much.

Latency: Local inference on a 3090 is 30-60 tok/s. Cloud APIs are ~80-100 tok/s. The difference rarely matters for agent workloads.
Model variety: Ollama supports hundreds of models. Anything on Hugging Face can be converted.
Scale: If you need 1000 concurrent users, you need a cloud. For a security team of 1-20? Local is more than enough.

What You Gain

Your data stays yours
No API bills (after the hardware investment)
No vendor lock-in
No rate limits
Runs during internet outages
Full reproducibility — same model, same weights, same results

If you handle sensitive data and you're still sending it to cloud APIs, it's worth asking: is the convenience worth the risk?

Repo is private during development — DM me for early access.

Building a Self-Healing Kill Switch for AI Infrastructure

MxGuru — Wed, 20 May 2026 20:15:00 +0000

AI platforms have a unique failure mode: they can bankrupt you.

A runaway inference loop. A cascading retry storm. An agent that decides to call GPT-4 in a tight loop. Traditional SRE practices catch crashes. They don't catch slow financial death.

The Extinction Protocol

I built a daemon called the Extinction Protocol Agent (EPA) that monitors:

Token burn rate — catch runaway inference before the bill spikes
Data integrity — detect corruption before it propagates through the knowledge graph
Cascade failures — one agent crash shouldn't take down the swarm
Turn ledger health — track conversation state integrity

Phase Escalation

The EPA doesn't just alert. It acts.

NORMAL -> QUARANTINE -> PRESERVATION -> RECOVERY -> LIFEBOAT

NORMAL: Everything's fine. Passive monitoring.

QUARANTINE: Anomaly detected. Isolate the affected subsystem. Block new requests to it. Keep everything else running.

PRESERVATION: Multiple anomalies. Start persisting critical state to durable storage. Reduce non-essential operations.

RECOVERY: System is degraded. Attempt automatic recovery — restart failed services, replay lost messages, rebuild corrupted state.

LIFEBOAT: Recovery failed. Save everything salvageable, shut down gracefully, and prepare for clean restart.

Why Not Just Use PagerDuty?

PagerDuty tells a human there's a problem. The EPA fixes the problem — or at least contains the blast radius — before a human even wakes up.

The key insight: AI infrastructure fails gradually, not suddenly. By the time a traditional alerting system pages someone, the damage is already done. The EPA intervenes at the first sign of drift.

Try It

The Sovereign Hive is open source. The EPA ships as one of 11 power-up modules in the Intelligence Bundle.

Repo is private during development — DM me for early access.

I Built a 127-Tool MCP Server From Scratch — Here's What I Learned

MxGuru — Wed, 20 May 2026 17:01:00 +0000

The Model Context Protocol (MCP) is how AI agents talk to tools. Claude Code, Cursor, Windsurf — they all use it. But most MCP servers have 5-10 tools.

I built one with 127.

Why?

I run a local AI operations platform called The Sovereign Hive. It coordinates multi-agent swarms, runs security scans, manages a knowledge graph, and serves as the backbone for everything I build. Every agent needs tools — and I got tired of wiring up 8 different MCP servers.

So I consolidated everything into one server, one port, one health endpoint.

The Tool Categories

Category	Count	Examples
File I/O	11	read, write, copy, move, delete, head, tail, wc
Search	6	grep, glob, find_symbol, find_references, search_replace
Git	10	status, diff, log, blame, commit, branch, stash, tag
Code Analysis	6	lint, complexity, dead_code, dependency_graph
Browser Automation	7	navigate, screenshot, click, fill, evaluate, snapshot
Docker	8	ps, logs, exec, images, inspect, run, stop, stats
Semantic Memory	7	store, search, relate, observe, get, list, delete
Monitoring	4	health_probe, logs_tail, service_status, uptime_check
HTTP/Web	5	fetch, request, dns_lookup, url_encode, curl_equivalent
Web Search	1	DuckDuckGo via ddgs (no API key)
System	7	system_info, process_list, env_vars, port_check, disk_usage
Data Parsing	7	json_query, csv, yaml, toml, ini, xml, json_format
Database	3	sqlite_query, sqlite_schema, sqlite_tables
Archive	5	zip/tar create, extract, list
Text/Transform	8	diff, regex, base64, hash, token_estimate, string_transform
Crypto	4	generate_secret, uuid, hmac, password_hash
Notebook	3	read, create, add_cell
Task/Todo	4	create, list, update, complete
Prompt Engineering	4	build, chain, message_format, library
Thinking/Reasoning	4	sequential_think, decision_matrix, assumption_check, pros_cons
API Testing	4	graphql_query, websocket_send, api_test, openapi_parse
Comms Hub	3	post, read, channels
Ollama	2	list models, generate

Architecture Decisions

Every tool is an async function with the same signature:

async def tool_name(args: dict) -> dict:

Input is always a dict. Output is always a dict. No exceptions in the signature — errors go in {"error": "..."}.

Every tool carries MCP metadata:

TOOL_META = {
    "name": "grep_recursive",
    "description": "Search for a regex pattern across files in a directory tree.",
    "inputSchema": { ... }  # JSON Schema
}

This means any MCP client can discover the tool, see its parameters, and call it — without reading the source code.

The registry supports both stdio and HTTP/SSE transport:

mcp_server.py — JSON-RPC over stdin/stdout (for Claude Code direct integration)
mcp_server_sse.py — FastAPI with /tools, /tools/call, /mcp, /sse, /health endpoints

No mandatory external dependencies. Every tool uses Python stdlib where possible. Browser tools need Playwright. Docker tools need Docker. But the other 112 tools work with zero pip installs beyond FastAPI/uvicorn.

The Semantic Memory System

This was the most interesting piece to build. It's a knowledge graph stored in SQLite with TF-IDF similarity search — no vector database, no embeddings model required.

await memory_store({"name": "project-x", "content": "FastAPI backend with Redis caching", "type": "project"})
await memory_relate({"from": "duayne", "relation": "builds", "to": "project-x"})
await memory_observe({"entity": "project-x", "content": "Deployed to production"})
results = await memory_search({"query": "FastAPI caching backend"})

Entities, relationships, and observations — all queryable. Agents can build up persistent knowledge across sessions without needing a GPU or external service.

What I'd Do Differently

Start with MCP metadata from day one. I retrofitted it onto 15 existing tools. Building it in from the start is much cleaner.
Group tools by file, not one-per-file. Related tools (like all git operations) belong together.
The DDG HTML scraper approach failed. DuckDuckGo now serves CAPTCHAs to scrapers. Use the ddgs library or pay for a search API.

Try It

The entire stack is open source: Repo is private during development — DM me for early access.

The Best Result This Week Was a Failed Prediction — Phase-3a Doesn't Transfer

MxGuru — Wed, 20 May 2026 16:35:16 +0000

Part 3 of the quantization series. Yesterday I tested whether Part 1's drift-inversion intervention generalizes beyond granite. I wrote down a falsifiable prediction before the result. The prediction failed in real time — Qwen-2.5-14B reverses the sign of the effect, distributed across 61% of windows, not noise. This post is why a clean failed prediction is a better outcome than three-for-three same-direction would have been, and what the n=3 transfer data actually says about whether the intervention generalizes. Spoiler: it doesn't. And that's the win.

Two Localizers, Both Wrong: Bounding a Quantization Cost That Wouldn't Close

MxGuru — Wed, 20 May 2026 14:45:46 +0000

Part 2 of the quantization series. Spent two days and $12 hunting for the right localizer after Part 1 showed the per-layer drift metric lies. Both candidates — token-level logit-divergence at wrong tokens, AWQ-clipping on the surfaced layers — came back empty. Honest finding: an 8B model on a 12GB card costs ~12.7% PPL on wikitext-2, the gap is diffuse and proportional, no clever subset-targeted fix closes it. One process habit (a no-op control reproducing the baseline to 4 decimals) caught a silent bug that would have shipped a wrong 'AWQ-clipping wins' claim.

When the Sensitivity Metric Lies: A Drift-Inversion Smoking Gun in Mixed-Precision LLM Quantization

MxGuru — Wed, 20 May 2026 11:32:35 +0000

The HSAQ pipeline (Hybrid Sensitivity-Aware Quantization) is supposed to do one thing well: spend bits where they hurt. Profile each Linear layer's output drift under 2/3/4-bit quantization on real calibration data, then let a greedy allocator distribute the bit budget so total drift is minimized under the VRAM ceiling.

That works. Until it doesn't.

This is the story of one experiment — Phase-3a, run 2026-05-19 on ibm-granite/granite-3.3-8b-instruct — that broke a quiet assumption underneath the whole approach. The drift metric mismeasures real PPL impact on outlier-heavy attention layers. Worse, it mismeasures it in the wrong direction: the harder you push the metric down, the more outliers can sometimes corrupt generation.

The setup

HSAQ's baseline on granite-3.3-8B at a 12 GB consumer VRAM budget produces a mixed assignment averaging ~3.3 bits per Linear across 281 quantized modules. Measured against bf16, this baseline lands at:

Metric	bf16 baseline	HSAQ baseline	Δ
Wikitext perplexity	8.756	10.013	+14.42%

A +14.42% PPL hit is rough. Target was <8% (a soft "you can still feel it but it's usable" line in our internal eval). The first thing you do when the budget is the constraint is examine the residue — which layers are at the bottom of the bit-ladder, and could a small structural rule move them up?

After baseline assignment, 16 of 281 Linears sit at 3-bit (the rest at 4):

7 × mlp.down_proj — FFN expansion projections (~59M params each, the allocator's favorite victims)
6 × self_attn.o_proj — attention output projections (the outlier-heavy ones)
2 × mlp.gate_proj (L0, L39)
1 × self_attn.q_proj (L34)

The Phase-3a intervention was simple: force all o_proj layers to a minimum of 4 bits, regardless of allocator preference. Six layers move 3 → 4. About 0.05 GB of weight budget gets reallocated. Re-run end to end.

The result

Metric	HSAQ baseline	HSAQ + o_proj floor	Δ
PPL above bf16	+14.42%	+13.80%	-0.62pp

A real improvement. Small — about 4% relative on the gap to bf16 — but real. And reproducible: the baseline run inside the same job matched yesterday's baseline to 4 decimal places (10.0133 → 10.0133), so the methodology is bulletproof. Cache invariance also confirmed: HSAQ's SQLite sensitivity cache produced identical drift values across both runs.

So far this is unremarkable. The "+0.62pp from a 0.05 GB nudge" finding alone would justify a paragraph in an internal log, nothing more.

Then we looked at the per-layer drift.

The inversion

When the floor forced these six o_proj layers from 3-bit to 4-bit, their measured per-layer drift went dramatically worse — not better:

Layer	Drift at 3-bit	Drift at 4-bit	Ratio
`model.layers.21.self_attn.o_proj`	2.70	8.44	3.1× worse
`model.layers.30.self_attn.o_proj`	1.26	6.51	5.2× worse
`model.layers.8.self_attn.o_proj`	1.39	3.44	2.5× worse

Three of six layers showed >2.5× drift inflation at the higher bit-width. And the overall PPL — the thing the drift metric is supposed to predict — got better anyway.

Let that land. The signal the allocator uses to decide which layers deserve more bits is telling us:

"Layer 21's o_proj is 3× more damaged at 4-bit than at 3-bit. Definitely don't promote it."

And the model is responding:

"Actually, the 4-bit version generates better text. Thanks."

This is not noise. It reproduced across 32-sample and 256-sample calibration sets. It is a systematic divergence between what HSAQ measures and what actually matters.

What's actually happening

HQQ's quantization is groupwise: it picks one scale and zero-point per group of 64 weights. The mechanism that makes HQQ fast and parameter-light is the same mechanism that breaks here.

"One scaling factor for 128 weights means one outlier crushes the other 127 to zero." — Gemini's description of HQQ group quantization (we run at group_size=64, but the principle is identical).

On outlier-heavy layers like o_proj (which carries the per-head attention output back into the residual stream) and down_proj (which projects the wide FFN intermediate back down), a small number of channels carry order-of-magnitude larger activations than the rest. At 3-bit, the quantization is so coarse that everything is approximate and the model has already absorbed the noise. At 4-bit, you get more precision per group, but the outlier still dominates its group's scale — so the 63 non-outlier weights in that group get more crushed relative to what they should be, not less.

The drift metric notices this. It measures normalized MSE between the bf16 layer output and the quantized layer output on captured calibration activations. The increased crushing of small weights inside outlier-dominated groups produces a larger MSE — that part is real and the metric is honest about it. But the model in practice is much more tolerant of "small weights got squashed" noise than of "outlier weight got rounded to a bin that doesn't represent its magnitude" noise. The drift metric weights these the same. Real PPL doesn't.

"HQQ is blind to data flowing through it." — same source. This is the whole conceptual gap that activation-aware methods (AWQ, GPTQ, imatrix) close.

What this means if you use a drift-based allocator

If you're running anything in the mixed-precision-by-sensitivity family — SqueezeLLM, OWQ, our HSAQ, anything that picks per-layer bit-widths from a calibration MSE signal — there is a category of layer where your signal is lying to you. Specifically: outlier-heavy attention output projections (o_proj) and FFN down projections (down_proj). These are the layers AWQ identified five years ago as needing per-channel scaling, and the reason is precisely the dynamic our drift metric is failing to model.

Two implications:

Treat the drift signal as approximate on o_proj and down_proj. A sensitivity floor is one cheap way to do this — force these layers to a known-better bit-width regardless of what calibration MSE says. That's what Phase-3a tested, and it worked, even though it cut against the allocator's recommendation.
Calibration-MSE is the wrong signal for outlier-heavy layers. The right signal is something like KL divergence on output logits, or PPL impact directly measured on a held-out validation set. Both are more expensive than HQQ-output MSE, but on the layers where MSE lies, the expense is justified.

We are not the first to notice this. AWQ's original paper makes the case in different language: "the importance of a weight is determined by the activation magnitude, not the weight magnitude." HQQ's design choice to be data-blind is the feature that makes it fast and the bug that makes it brittle. What this experiment adds is a clean reproduction on a current 8B model, with the exact mechanism visible: same calibration cache, same allocator, two runs differing only in the floor parameter, drift-vs-PPL anticorrelation jumping out at you.

What didn't work

For completeness — Phase-3a tested two structural levers, only one helped meaningfully.

o_proj sensitivity floor: +0.6pp PPL improvement. Useful, but small.
group_size=64 (vs the HQQ default of 128): already baked into HSAQ from day one (config.py:52: HQQ_OVERHEAD_FACTOR = 1.065 # 6.5% average (zeros 64 + scales 64 per group)). The hypothesis that tightening the group size would help was wrong about our starting point — we were already at the practical floor. Tightening further to gs=32 has diminishing returns and roughly doubles overhead.

The conclusion is sharper than the headline number: more HQQ tuning is not the lever. The bit budget is gone, the group size is at the practical floor, and the drift metric we're using to allocate the budget that remains is unreliable on the layers where allocation matters most.

What's next: AWQ on a 9-layer target list

A separate diagnostic — logit divergence comparison between the HSAQ-quantized model and bf16, run on 96 prompts the same day — produced a clean QUANTIZATION_BIAS_DOMINANT verdict: 63/96 divergences are confidently wrong (the model is sure of a wrong token), only 3/96 are high-entropy uncertainty. This is the signature of representation failure, not undertraining. It is what AWQ is designed to fix.

The diagnostic surfaced nine specific layers driving the divergence:

Layer	Drift score
`model.layers.28.self_attn.o_proj`	23.00
`model.layers.13.self_attn.o_proj`	14.53
`model.layers.15.mlp.down_proj`	6.36
`model.layers.28.mlp.down_proj`	6.28
`model.layers.25.mlp.down_proj`	6.21
`model.layers.20.mlp.down_proj`	5.41
`model.layers.14.self_attn.o_proj`	5.18
`model.layers.15.self_attn.o_proj`	5.15
`model.layers.17.self_attn.o_proj`	4.69

Pattern: mid-to-late transformer (L13–L28), attention output and MLP down projections. Textbook activation-outlier signature. The next post will report on an AWQ POC targeting exactly these nine layers — leaving the other 272 Linears under HSAQ as today, swapping only the outliers to AWQ. If the gap closes there, the recipe likely generalizes. If it doesn't, we have a different problem.

Calibrating prior claims

A previous LinkedIn pulse made the claim that this hybrid quantisation recipe holds across model families. That claim should be softened pending the AWQ run. The HSAQ allocator's behavior on o_proj and down_proj is consistent across architectures we've tested — but the fix (whether AWQ closes the gap to <8% PPL across architectures) is not yet validated. Phi-4 has a different attention layout (no separate o_proj); confirming transferability there requires running the same divergence diagnostic on a Phi-4 HSAQ quantization, which is queued.

Bottom line

If you're using calibration-MSE as your per-layer sensitivity signal, run a sanity check: pick your worst-PPL allocation and force-promote the o_proj and down_proj layers to 4-bit anyway. If PPL improves, your drift metric is lying to you in the same direction ours is. That's information you can use without changing your quantizer; it's information that says your quantizer needs to change.

This is part of an ongoing series on running 13–20B language models on 12 GB consumer GPUs. The pipeline is open work-in-progress at mxguru1/hsaq-tools on Hugging Face. Granite-3.3-8B was chosen as the headline target because community AWQ/GPTQ quants exist for ground truth, and because 8B parameters at mixed 3/4-bit fits comfortably on a 12 GB card with room for a LoRA adapter.

Update (2026-05-21) — model-specificity caveat

Follow-up transfer testing on the o_proj 3→4-bit floor intervention shows it is model-specific, not a generalizable recipe. On a clean, identical evaluation protocol (full wikitext-2 test set, non-overlapping 2048-token windows):

Model	Δ PPL from floor	Direction
granite-3.3-8B	+0.0840 (1.137%)	improvement
phi-4 (14B)	+0.0088 (0.127%)	small improvement
Qwen-2.5-14B	−0.0019 (0.031% worse)	mild regression

Phase-3a's observation — drift-MSE on outlier-heavy layers disagrees with downstream PPL — holds for granite as originally reported. The intervention of forcing o_proj layers from 3-bit to 4-bit transfers cleanly to phi-4 (small positive effect, 67.6% of windows helped), and reverses on Qwen-2.5-14B (61.2% of windows hurt). No clean predictor — count of underbitted layers, tier distribution, architecture, parameter scale — sorts the result.

Full writeup of the transfer testing, the dose-response hypothesis that died on the clean protocol, and the discipline checks that caught a wrong prediction in real time is in Part 2 and a forthcoming Part 3.

Two Local-Agent Philosophies: Where Hermes Earns Its Design, and Where the Tradeoffs Invert

MxGuru — Tue, 19 May 2026 08:23:18 +0000

This is a submission for the Hermes Agent Challenge

I've spent the last five months building an offline multi-tier agent swarm on a single workstation — an RTX 5070, a Ryzen 9 9950X3D, and a hard rule that nothing crosses the network boundary without explicit permission. When the Hermes Agent Challenge came up, I sat down to write a "why I'd use Hermes" piece. Halfway through, I realised I had to write a different post: why Hermes is the right choice for most people building local agents, and why a specific class of deployments has to make the opposite call.

This isn't a criticism of Hermes. Nous Research designed something good. What I want to lay out is where the design choices stop applying — not because they're wrong, but because the threat model changes.

What Hermes is good at

The repo and docs are clear about the thesis: Hermes is "the agent that grows with you." Built-in learning loop. Creates skills from experience. Searches its own past conversations. Builds a deepening model of who you are across sessions. Runs on a $5 VPS, a GPU cluster, or serverless infrastructure. Use any model — Nous Portal, OpenRouter, NVIDIA NIM, your own endpoint. Switch with hermes model, no code changes.

That's a coherent design. The whole framework leans into a specific bet: that an agent operating with you over time, accumulating context and skills, gets more useful than an agent that starts from zero each session. For most use cases I can think of — personal productivity, research workflows, automating the weird operational stuff no SaaS product handles properly — that bet is the right one.

If I were building a Hermes-style workflow for myself, I'd lean on:

The session memory and conversation search — the operational benefit of an agent that already knows what I was working on yesterday is significant
The skill-creation loop — instead of re-typing the same chain of tool calls, the agent persists the pattern
The model flexibility — being able to swap providers without rewriting code is genuinely useful when you're testing what works
The cheap-to-idle infrastructure pattern — you can leave it running and it costs nearly nothing when nothing's happening
The client itself is lightweight — and that matters more than it sounds. JetBrains PyCharm, Windsurf, and the other heavyweight AI-augmented IDEs are CPU-intensive in a way you feel on a dev workstation that's already running real workloads. The Hermes client gets out of the way. When my machine is busy doing actual work, I'm not also paying a tax for the agent to exist. That's not a marquee feature in the docs, but it's the kind of detail that shows up after a few weeks of real use.

For an individual builder, a white hat researcher poking at things on their own time, a small team automating their own ops — this is well-shaped. The learning loop earns its complexity by paying off across sessions. The "talk to it from Telegram while it works on a cloud VM" pattern is genuinely powerful for people whose workflow benefits from continuity.

This isn't faint praise. Hermes is doing a real thing well.

Where the design choices flip

The deployments I've been building for operate under a different constraint set. Specifically: the threat model assumes the agent itself is a potential vector. Not because it's malicious by design — because anything that can modify its own behaviour over time can be steered into modifying it the wrong way, given enough adversarial pressure on its inputs.

The thing Hermes treats as its strength — the agent grows, learns from experience, creates skills, persists memory — is the exact behaviour my architecture is built to prevent.

That's not a Hermes problem. It's a security posture that decided "the agent should not be able to surprise me" was worth the cost of throwing away the productivity gains of learning-over-time.

The architectural decisions that follow from that posture are:

Hardcoded permission gates over emergent capability. Every privileged operation routes through a gate that knows what tier the requesting agent runs at and what operations that tier can perform. No bypass flag. No "trusted" internal path. If a new capability is needed, it gets added to the gate explicitly, by a human, in code review.
Knowledge stays read-only for the agent. There's a local Knowledge Vault that holds threat intelligence and audit logs. Agents read from it constantly. They write to specific append-only paths under their tier's permission. They cannot modify what's already there. A learning-loop agent that "improves its skills" would be writing to the very place I'm protecting from writes.
Tier is immutable for the agent's lifetime. You can't escalate yourself mid-run. To do privileged work, you spawn a child agent at a higher tier, and that spawn is audited. The thing Hermes calls a feature — an agent that grows — my architecture treats as a control failure mode.
No cross-session continuity by default. Session memory is per-session unless explicitly persisted by a gated operation. The "agent that knows what you were doing yesterday" is, in a high-security context, "an attack surface that yesterday's adversary can still influence today."

These aren't claims that Hermes' design is wrong. They're claims about a different threat model where the tradeoffs invert.

The bridge

Here's the part that I think actually matters for anyone reading this and trying to decide which way to build:

For typical consumer use and most white hat / research workflows, the security posture I'm describing is overkill. It costs a lot of operational ergonomics, demands real architectural discipline, and the threats it's defending against don't apply to someone running an agent on their own laptop to automate their own life. Hermes' learning loop is a net win in that context. The productivity from continuity dwarfs the theoretical risk surface.

But there's a class of deployments where total control over what the agent can do, in what order, with what authorisation, becomes the actual product. Adversarial security research, local Blue Team analysis where compromise of the tooling is part of the threat model, environments where the agent has access to data that simply cannot be corrupted by any process — that's where the bridge crosses.

On the consumer side of the bridge, Hermes is well-designed and the learning loop is a feature.

On the other side, the same loop becomes a property the architecture is built to prevent.

This isn't Hermes being wrong. It's that any local-agent framework has to commit to a stance on whether the agent should be able to surprise its operator. Hermes commits one way. A high-security swarm commits the other. Both are coherent.

Why measurement matters more than philosophy

The reason I trust the architectural decision I made — rather than just believing in it — is that the same project produces measurable, reproducible artifacts at every step. The quantization pipeline that runs inside this architecture logs per-layer sensitivity profiles, applies bit-width assignment under explicit budget constraints, and emits manifests that I can diff between runs. Recent runs on an 8B-class model produced bit-identical allocations across runs with 4× the calibration data, which tells me the underlying measurements are stable, not noise.

That property — runs produce the same artifact when given the same inputs — is exactly the property a hardcoded gate enforces and exactly the property a learning-loop architecture would compromise over time. Not in a bad way. The learning loop is supposed to change its output as it learns. That's the design. But for the security domain I'm working in, "the system's behaviour drifts over time even with identical inputs" is a property I'm specifically preventing, not enabling.

If you're operating in a context where reproducibility matters more than ergonomics — where you need to be able to prove that today's behaviour matches yesterday's, that no agent has quietly upgraded itself, that the audit trail is the truth — that pushes you toward gates and away from learning loops. Not because gates are better. Because in that context, reproducibility is what "better" means.

The takeaway

If you're building a local agent for yourself and want capability that compounds over time: Hermes is well-designed for that and the framework gives you a lot for free.

If you're building infrastructure where the agent should never be able to do something the operator didn't sign off on, in advance, with audit: build the boring version. Hardcoded gates. Immutable tier. Read-only state for the agent. No emergent behaviour. Yes, you'll do more work. Yes, you'll lose some operational productivity. That's the price of the security property you're buying.

Both stances are defensible. The mistake is using one framework in the other's domain.

For the Hermes Agent Challenge specifically: this isn't a piece I could have written without spending real time inside both philosophies. The framework is doing good work for the people it's designed for. I'm not one of those people right now — but I might be, on a different project, in a different threat model. And the same is true in reverse: if you're a Hermes user reading this and thinking "that security posture sounds excessive for what I'm doing," you're probably right, for what you're doing.

Pick the framework that matches your threat model. Don't pick the one that matches your aesthetic preferences. That's the actual lesson.

Built and tested on RTX 5070, Ryzen 9 9950X3D, fully local. Architecture details and empirical results are publicly available; the specific threat model and implementation internals are not, for reasons that should be obvious given the topic.

The Hive Are Evolving!

MxGuru — Tue, 19 May 2026 07:32:24 +0000

98% adversarial defense rate. 200 rounds. One $700 GPU.
I just finished benchmarking my defender swarm against six attacker models — three frontier cloud LLMs and three locally-hosted open models on my mate's machine over a Cloudflare Tunnel.
The result that surprised me most: the frontier cloud models were the worst attackers in the pool. Their breach rate sat at 0% — too aligned to red-team coherently. The genuine threats were uncensored mid-weight open models running on commodity hardware. The same stack any motivated attacker can spin up for $20 of cloud compute.
My defender swarm — five specialists at 1.5B–8B parameters, all running on a single RTX 5070 12GB — hit 98% defense rate. The smallest model (3B) led detection at 100%. Architecture > size, every time.
This is one of the flagship capabilities of Sovereign Hive: a local-first AI ops platform I'm building. Australian, Queensland-based, 100% Indigenous-owned.
Full architecture breakdown in the article 👇
sovereignhive.com.au
Defence engineers, AI researchers, anyone working on autonomous on-device AI — what's your read on this gap between frontier-model alignment and real-world attacker behaviour?

AISecurity #InfoSec #AdversarialML #SovereignAI #IndigenousBusiness #EdgeAI

What Gemma 4 Actually Unlocks for a Local Security Swarm (And Why I Don't Use the Same Variant Everywhere)

MxGuru — Mon, 18 May 2026 22:00:00 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

I've been building an offline, multi-tier adversarial agent swarm on a single workstation — an RTX 5070 (12GB VRAM), a Ryzen 9 9950X3D, zero cloud calls, zero external dependencies, and zero vendor content restrictions. The swarm acts as an autonomous "Blue Team": it audits, scans, correlates threats, and, where appropriate, simulates the attacker side of an engagement against the assets it protects.

When the Gemma 4 family dropped, the question I had wasn't should I use it. A local-first, capable, open-license, multimodal model with a 128K context window is an automatic yes. The genuinely interesting question was: which variant goes where?

That's the question I think most "I tried the new model" posts skip past. The Gemma 4 lineup isn't just one model cut into three sizes. It's three distinct architectural answers to three different deployment problems. Picking the right one per role is where you find real leverage.

The Lineup, Architecturally

For anyone who hasn't pulled the spec sheet yet:

Gemma 4 E2B / E4B — Small effective-parameter models built for the edge: phones, browsers, ambient compute. Fast time-to-first-token, tiny VRAM footprint, and you can run many of them concurrently.
Gemma 4 26B MoE — Mixture-of-Experts. Total parameters are massive, but only a fraction activate per token. Designed for high throughput with strong reasoning on a per-task basis. It takes up space in memory, but it's computationally much cheaper to run than its parameter count suggests.
Gemma 4 31B Dense — Server-grade local. Every parameter fires on every token. Predictable inference cost and generally the strongest reasoning ceiling of the three, but carries the highest VRAM tax and latency floor.

All three share the same training lineage, the same 128K context window, and the same multimodal head. They differ entirely on activation patterns, footprint, and what kind of work they are built to absorb.

Casting Models by RBAC Tier

The swarm uses a 6-tier zero-trust Role-Based Access Control (RBAC) system. Tier 6 is the most privileged — supervisors that can spawn, terminate, and de-escalate other agents. Tier 5 is the least privileged — ambient scanners that watch logs, file changes, and network deltas. Every privileged action routes through a hardcoded PermissionGate that doesn't care what the model wants; if the tier doesn't permit it, the call dies.

This matters for model casting because higher tiers don't just need smarter agents — they need slower, more deliberate ones. A supervisor that fires off twenty execution plans a second is a massive liability. Conversely, an ambient scanner that thinks for three seconds before flagging a file change is useless.

So, the question per tier is: how much reasoning depth, how much latency tolerance, and how many instances do we need concurrently?

Where Each Variant Earns Its Slot

The E2B / E4B at the edges (Tiers 4–5). Ambient watchers, log diffing, simple anomaly flagging, and "is this string weird" classification. The work here is high-volume, mostly pattern-shaped, and low stakes per call. I need several of these running concurrently with zero VRAM drama. A small model that returns a token in tens of milliseconds and lets me run multiples in parallel easily beats a 31B Dense that locks the GPU for seconds. Edge Gemma 4 is built for exactly this shape of work.

The 26B MoE in the middle (Tiers 2–3). Triage, correlation, and threat synthesis ("you've got fifteen of these alerts — is an attack chain forming?"). The MoE architecture fits here for a specific reason: middle-tier work is bursty. You have quiet stretches followed by a sudden need to reason hard about a correlated set of events. MoE's sparse activation means we get 31B-class reasoning without the relentless compute tax of a dense model. The 128K context window pays for itself here too, allowing triage agents to ingest a long correlation window of events in a single shot.

The 31B Dense at the top (Tiers 5–6) — with caveats. Supervisors, planners, and adversarial scenario generation. Dense earns its slot here because top-tier reasoning needs to be predictable. When an MoE routes to a different expert mix on a similar query, you can occasionally get stochastic depth. For a supervisor agent deciding whether to spawn a sub-agent at a different privilege tier, I want mathematical uniformity more than peak throughput. Dense delivers that.

The Caveat: On a single-card 12GB 5070, a 31B Dense model is the heavyweight in the room. It cannot coexist concurrently with the MoE and a stack of edge models without aggressive quantization and careful orchestration. Mine gets gated through an HTTP inference queue — agents request inference, the gateway serializes the high-cost calls, and the small models keep running in their own lane. It's not glamorous infrastructure, but it's what makes the casting work.

What I Actively Avoid

Based on this architecture, here are a few patterns I actively avoid:

Don't use the 31B Dense everywhere just because it's the strongest. Latency at the bottom tier kills a swarm's situational awareness. You'll miss live events because your "ambient" watchers are blocked behind a heavy inference floor.

Don't put the MoE on supervisor duty. I like the model. I just don't want stochastic expert routing inside the agent that decides whether another agent gets disk-write permissions.

Don't put the E2B/E4B on triage. Edge models are great at answering "is this weird?", but weak at "what does it mean across these fifteen events?" Triage is the rung where context and parameter count win, not throughput.

The Takeaway

The Gemma 4 release is remarkable because the variants are legitimately different tools, not just three sizes of the same hammer. The MoE isn't "the 31B but smaller," and the E2B isn't "the E4B but worse." Each one is shaped for a specific class of work.

For a local-first, zero-cloud security swarm, the answer turned out to be all three at once, casting them by tier rather than picking a default. The model that wins on a benchmark is rarely the right model for every role inside a complex system.

That's the lesson I'd transfer out of this exercise: when a model family ships with real architectural variation, the lazy move is picking a favorite. The valuable move is asking which variant belongs in which slot — and building the orchestration to run them side by side.

Swarm-Consensus Defense Achieves 98.2% Against Cloud-LLM Adversarial Attacks

MxGuru — Sun, 17 May 2026 06:54:58 +0000

5-defender consensus swarm + autohealer hit 100% defense rate by round 400 after only 6 breaches in the first 100 (94%). Built on local Ollama, 3 cloud attackers, 13 attack categories. Smallest defender (llama3.2:3b, 4-bit) ran 500 rounds with zero misses.

Sovereign Hive v6.6 — 98% defense across 200 adversarial rounds on a single 5070

MxGuru — Sun, 17 May 2026 06:54:08 +0000

200 adversarial rounds. 4 breaches. 98% defense. Five 1.5–7B local models on a single RTX 5070 beat frontier cloud attackers (DeepSeek V3.2, Qwen 3.5 397B, Mistral Large 3). The Druid defender (llama3.2:3b) synthesised every new defence pattern. The big-model premium is shrinking.

99%% Defense Rate Across 500 Rounds: A Self-Healing Swarm on a $550 GPU

MxGuru — Sat, 16 May 2026 05:46:19 +0000

Executive Summary

Over five iterations and 676 total adversarial wargame rounds, we evolved a local AI swarm's defense rate from 73% to 99.0% — on a single RTX 5070 (12GB VRAM, $550). The final 500-round run produced just 5 breaches, with the last 300 rounds containing only a single breach. The swarm's auto-healing system instant-blocked 108 rounds (21.6%) without even engaging defenders.

All testing used cloud-scale attacker models (DeepSeek-V3.2 at 671B params, Qwen 3.5 at 397B, Gemma 4 at 31B) against local defenders ranging from 1.2B to 16B parameters. Zero cloud dependency. Zero API costs for defense.

The Five Iterations

Run	Config	Rounds	Defense	DeepSeek-V3.2 Breach Rate
v6.0	Nexus-tiny swarm	26	73%	78%
v6.1	+ soldier auditor (16B)	50	78%	45%
v6.2	+ Vanguard prompt injection	50	82%	50%
v6.3	+ Auto-healer	50	90%	20%
v6.4	+ 7B guardian + social specialist	500	99.0%	6.7%

v6.4: The 500-Round Run

Batch-by-Batch Performance

Batch	Defense	Breaches
R1-50	96.0%	2
R51-100	100%	0
R101-150	98.0%	1
R151-200	98.0%	1
R201-250	100%	0
R251-300	100%	0
R301-350	100%	0
R351-400	98.0%	1
R401-450	100%	0
R451-500	100%	0

Six of ten batches achieved perfect 100% defense. The final 100 rounds were undefeated.

The 6-Defender Swarm

Role	Model	Final Batch Detect Rate
auditor	DeepSeek-Coder-V2 16B	100%
trace_forensics	Qwen2.5-Coder 1.5B	100%
guardian	Qwen2.5 7B (upgraded)	94%
sentinel	nexus-vanguard 1.2B	86%
social_specialist	nexus-social 1.2B	80%
supply_chain	nexus-vanguard 1.2B	80%

Three Key Innovations

1. Vanguard Prompt Injection

We injected a "Defender Vanguard" system prompt into 1.2B parameter models, teaching them to think like red teamers when evaluating inputs. Same model weights, same VRAM footprint, different mindset. Authority escalation detection jumped from 44% to 90% overnight.

2. Auto-Healing Immune System

After every breach, the system:

Extracts attack patterns and adds them to a live blocklist
Creates antibodies — the full attack text is injected into all defenders' prompts as a "known threat"
Instant-blocks future rounds matching known patterns without engaging the swarm

Result: 108 of 500 rounds (21.6%) were blocked before defenders even fired. The blocklist grew to 7 patterns from just 5 breaches.

3. Threat Vaccine Agent

An autonomous agent that harvests new attack patterns from:

arXiv security papers (RSS feed)
MITRE ATLAS adversarial ML techniques
Gemini API for targeted analysis of weak categories

These vaccines are injected into the auto-healer proactively, hardening the swarm against attacks it hasn't seen yet.

What Remains Soft

Context poisoning accounted for 3 of 5 total breaches. The antibody system caught up by the final batches (zero context_poisoning breaches in the last 200 rounds), but it remains the most likely vector for a sophisticated attacker.

Hardware Reality

GPU: NVIDIA RTX 5070, 12GB VRAM, ~$550 retail
Peak utilization: ~50% (most time spent waiting on cloud attacker generation)
VRAM: 6 defenders loaded simultaneously (~10GB total)
Inference speed: 202.9 TPS on DeepSeek-Coder-V2 16B, 8ms TTFT

Methodology

Consensus mechanism: 6 defenders vote independently. Threshold: >=3/6 = defended.
Attacker models: Cloud-routed via Ollama — DeepSeek-V3.2 (671B), Qwen 3.5 (397B), Gemma 4 (31B)
Attack categories: authority_escalation, prompt_injection, social_engineering, logic_bomb, context_poisoning
All data, logs, and audit reports available on request

Conclusion

A $550 consumer GPU running six local models (largest: 16B parameters) achieved 99% defense against cloud-scale attackers up to 671B parameters. The key was not raw compute — it was architecture: multi-agent consensus, adversarial prompt engineering, and a self-healing immune system that converts every failure into permanent immunity.

Consumer hardware is not a limitation. It's a design constraint that forces better engineering.

Sovereign Hive is a local-first AI security platform. 100% Indigenous-owned. Built in Queensland, Australia. ABN 24 661 737 376.