<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Simon Morley</title>
    <description>The latest articles on Forem by Simon Morley (@simon_morley).</description>
    <link>https://forem.com/simon_morley</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3564652%2Ff1533649-1cc6-4966-8e4c-14c383cda1ef.png</url>
      <title>Forem: Simon Morley</title>
      <link>https://forem.com/simon_morley</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/simon_morley"/>
    <language>en</language>
    <item>
      <title>XDP: The Kernel-Level Powerhouse Behind Modern Network Defence</title>
      <dc:creator>Simon Morley</dc:creator>
      <pubDate>Wed, 12 Nov 2025 12:54:28 +0000</pubDate>
      <link>https://forem.com/simon_morley/xdp-the-kernel-level-powerhouse-behind-modern-network-defense-222n</link>
      <guid>https://forem.com/simon_morley/xdp-the-kernel-level-powerhouse-behind-modern-network-defense-222n</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Traditional packet processing in Linux has always had one problem: latency, just like your Nan.&lt;/p&gt;

&lt;p&gt;Packets climb an almost endless ladder through kernel subsystems before reaching user space, by which time your firewall has probably missed the critical window to act. Shame on you and your Nan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;eXpress Data Path (XDP)&lt;/strong&gt; changes that completely. It's a fast-path hook that runs &lt;em&gt;inside&lt;/em&gt; the kernel's network driver layer: before sockets, before Netfilter, before the kernel allocates a socket buffer (skb).&lt;/p&gt;

&lt;p&gt;This means you can inspect, modify, drop, or redirect packets &lt;em&gt;as they arrive on the NIC&lt;/em&gt;, with nanosecond-level performance.&lt;/p&gt;

&lt;p&gt;It's like knowing who's going to turn up at the pub before they've left the house.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Idea
&lt;/h2&gt;

&lt;p&gt;XDP extends the Linux kernel with programmable packet handling at the &lt;strong&gt;driver level&lt;/strong&gt;, using &lt;strong&gt;eBPF&lt;/strong&gt; (extended Berkeley Packet Filter) programs compiled into bytecode.&lt;/p&gt;

&lt;p&gt;Instead of pushing packets up the stack, XDP lets you attach logic that decides what happens next, directly in the NIC's receive path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Execution flow:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;NIC receives a packet.
&lt;/li&gt;
&lt;li&gt;XDP hook triggers before skb allocation.
&lt;/li&gt;
&lt;li&gt;eBPF program runs in the kernel VM.
&lt;/li&gt;
&lt;li&gt;Program returns one of several actions:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;XDP_PASS&lt;/code&gt;: let the packet continue to the normal stack
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;XDP_DROP&lt;/code&gt;: discard it immediately
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;XDP_TX&lt;/code&gt;: bounce it back out the same interface
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;XDP_REDIRECT&lt;/code&gt;: forward it to another interface, CPU, or AF_XDP socket
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;XDP_ABORTED&lt;/code&gt;: fail gracefully if something goes wrong
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it.&lt;/p&gt;
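&lt;p&gt;Those five return codes are the whole interface. Here's a toy model of the verdict step in plain userspace Rust. A real XDP program is a &lt;code&gt;no_std&lt;/code&gt; eBPF object checked by the kernel verifier; the blocked-port policy below is invented purely for illustration:&lt;/p&gt;

```rust
// Userspace model of the XDP verdict flow. Real programs run inside the
// kernel's eBPF VM; this only mirrors the decision logic.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum XdpAction {
    Pass,     // XDP_PASS: continue up the normal stack
    Drop,     // XDP_DROP: discard immediately
    Tx,       // XDP_TX: bounce back out the same interface
    Redirect, // XDP_REDIRECT: another interface, CPU, or AF_XDP socket
    Aborted,  // XDP_ABORTED: bail out on error
}

// Hypothetical policy: drop anything aimed at a blocked TCP port.
fn verdict(dst_port: u16, blocked_ports: &[u16]) -> XdpAction {
    if blocked_ports.contains(&dst_port) {
        XdpAction::Drop
    } else {
        XdpAction::Pass
    }
}

fn main() {
    let blocked = [23, 445];
    assert_eq!(verdict(22, &blocked), XdpAction::Pass);
    assert_eq!(verdict(445, &blocked), XdpAction::Drop);
    println!("ok");
}
```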

&lt;h2&gt;
  
  
  Why It Matters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Performance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;XDP can process &lt;strong&gt;millions of packets per second per core&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
The original XDP paper measured over 20 million packets per second per core on commodity hardware.&lt;/p&gt;

&lt;p&gt;That's like Mo Farah racing your Nan in an ultra marathon and finishing it 20 million times before she's even put her jeggings on. &lt;/p&gt;
&lt;h3&gt;
  
  
  2. &lt;strong&gt;Programmability&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Unlike fixed-function firewalls or DPDK pipelines, XDP programs are just eBPF bytecode.&lt;/p&gt;

&lt;p&gt;You can dynamically load and unload filters at runtime, without recompiling the kernel or restarting services.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. &lt;strong&gt;Security&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You can build kernel-resident security controls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DDoS mitigation:&lt;/strong&gt; drop floods at line rate
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port knocking or protocol filtering:&lt;/strong&gt; block unwanted ports before TCP handshake
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inline IDS signatures:&lt;/strong&gt; detect or throttle known attack patterns
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  4. &lt;strong&gt;Observability&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Because XDP operates before skb allocation, it's ideal for high-fidelity telemetry. &lt;/p&gt;

&lt;p&gt;You can capture packet metadata (MACs, IPs, ports, timestamps) and push structured events to user space with ring buffers — no packet copies, no pcap overhead.&lt;/p&gt;
&lt;h2&gt;
  
  
  XDP in the Wild
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Meta (Facebook)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Meta was one of the earliest large-scale adopters of XDP.&lt;br&gt;&lt;br&gt;
They use it in &lt;strong&gt;Katran&lt;/strong&gt;, their in-kernel load balancer, to handle tens of millions of connections per second while maintaining microsecond-level latency.&lt;/p&gt;

&lt;p&gt;XDP replaced parts of their older DPDK-based stack, cutting CPU load and enabling dynamic policy updates through eBPF maps.&lt;br&gt;&lt;br&gt;
The same foundation powers Cilium’s kernel datapath and underpins parts of Meta’s edge networking infrastructure.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Cloudflare&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Cloudflare also uses XDP to defend its global edge network against &lt;strong&gt;DDoS attacks&lt;/strong&gt;.  &lt;/p&gt;

&lt;p&gt;By placing mitigation logic directly inside the kernel, they can absorb massive floods — up to &lt;strong&gt;hundreds of millions of packets per second&lt;/strong&gt; — without userspace overhead.  &lt;/p&gt;

&lt;p&gt;Their engineers have written extensively about how XDP allows per-interface rate limiting, SYN flood filtering, and on-the-fly rules pushed from Go and Rust control planes.  &lt;/p&gt;

&lt;p&gt;It's effectively their last-line kernel shield before packets ever reach the proxy layer.&lt;/p&gt;

&lt;p&gt;Together, Meta and Cloudflare have proven that XDP holds up in hyperscale production workloads, not just lab benchmarks.&lt;/p&gt;
&lt;h2&gt;
  
  
  Typical Use Cases
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DDoS Protection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Drop or rate-limit SYN floods directly in driver&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;XDP_DROP&lt;/code&gt; TCP SYNs after threshold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Load Balancing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Redirect packets to backend queues or CPUs&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;XDP_REDIRECT&lt;/code&gt; to AF_XDP sockets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Firewalling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kernel-level ACLs&lt;/td&gt;
&lt;td&gt;Filter by IP, port, or protocol&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Telemetry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stream header data to user space&lt;/td&gt;
&lt;td&gt;XDP + perf ring buffer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inline Remediation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Block C2 connections before userspace&lt;/td&gt;
&lt;td&gt;Combine XDP + LSM hook&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  Writing an XDP Program (Example)
&lt;/h2&gt;

&lt;p&gt;I've been writing these in Rust recently (with the &lt;code&gt;aya&lt;/code&gt; crates), but you can do it in C, Go, etc. Rust is the best, IMHO.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// src/xdp.rs
use aya_bpf::{
    bindings::xdp_action,
    macros::{map, xdp},
    maps::HashMap,
    programs::XdpContext,
};

#[map(name = "SYN_COUNTER")]
static mut SYN_COUNTER: HashMap&amp;lt;u32, u64&amp;gt; = HashMap::&amp;lt;u32, u64&amp;gt;::with_max_entries(1024, 0);

#[xdp(name = "count_syns")]
pub fn count_syns(ctx: XdpContext) -&amp;gt; u32 {
    match try_count_syns(ctx) {
        Ok(ret) =&amp;gt; ret,
        Err(_) =&amp;gt; xdp_action::XDP_ABORTED,
    }
}

fn try_count_syns(ctx: XdpContext) -&amp;gt; Result&amp;lt;u32, ()&amp;gt; {
    let hdr = ctx.ip()?.ok_or(())?;
    if hdr.protocol != aya_bpf::bindings::IPPROTO_TCP as u8 {
        return Ok(xdp_action::XDP_PASS);
    }

    // Parse TCP header
    let tcp = ctx.transport::&amp;lt;aya_bpf::bindings::tcphdr&amp;gt;().ok_or(())?;
    let flags = unsafe { (*tcp).syn() as u8 };

    if flags == 1 {
        let key = hdr.protocol as u32;
        unsafe {
            let counter = SYN_COUNTER.get(&amp;amp;key).copied().unwrap_or(0);
            SYN_COUNTER.insert(&amp;amp;key, &amp;amp;(counter + 1), 0);
        }
    }

    Ok(xdp_action::XDP_PASS)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And a userspace loader:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// src/main.rs
use aya::{Bpf, programs::Xdp};
use std::{env, process};

fn main() -&amp;gt; Result&amp;lt;(), anyhow::Error&amp;gt; {
    let iface = env::args().nth(1).unwrap_or_else(|| {
        eprintln!("Usage: cargo run -- &amp;lt;iface&amp;gt;");
        process::exit(1);
    });

    let mut bpf = Bpf::load_file("target/bpfel-unknown-none/release/xdp-example")?;
    let program: &amp;amp;mut Xdp = bpf.program_mut("count_syns").unwrap().try_into()?;
    program.load()?;
    program.attach(&amp;amp;iface, aya::programs::XdpFlags::default())?;

    println!("XDP program attached to {}", iface);
    loop {
        std::thread::sleep(std::time::Duration::from_secs(60));
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, not much to get the basics going.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced Scenarios
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Dynamic Remediation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Combine XDP with userspace controllers.&lt;br&gt;&lt;br&gt;
For example, an agent monitors traffic patterns and pushes new eBPF maps into the kernel to block malicious IPs dynamically.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. &lt;strong&gt;Programmable Rate Limiting&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Use per-source counters in eBPF maps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Count packets per IP
&lt;/li&gt;
&lt;li&gt;Apply backoff or redirect decisions
&lt;/li&gt;
&lt;li&gt;Synchronize with userspace via shared maps&lt;/li&gt;
&lt;/ul&gt;
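&lt;p&gt;A minimal sketch of the counting logic in plain Rust. In a real XDP deployment this state lives in a shared eBPF map rather than a &lt;code&gt;std&lt;/code&gt; hash map, and the threshold here is arbitrary:&lt;/p&gt;

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

// Per-source packet counter: drop once a source exceeds the threshold.
// In a real XDP deployment this state lives in a shared eBPF map.
struct RateLimiter {
    counts: HashMap<Ipv4Addr, u64>,
    threshold: u64,
}

impl RateLimiter {
    fn new(threshold: u64) -> Self {
        Self { counts: HashMap::new(), threshold }
    }

    // Returns true if this packet should be dropped.
    fn should_drop(&mut self, src: Ipv4Addr) -> bool {
        let count = self.counts.entry(src).or_insert(0);
        *count += 1;
        *count > self.threshold
    }
}

fn main() {
    let mut limiter = RateLimiter::new(3);
    let src = Ipv4Addr::new(203, 0, 113, 7);
    let verdicts: Vec<bool> = (0..5).map(|_| limiter.should_drop(src)).collect();
    assert_eq!(verdicts, vec![false, false, false, true, true]);
    println!("ok");
}
```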
&lt;h3&gt;
  
  
  3. &lt;strong&gt;Hybrid Visibility&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Send metadata to userspace without full payloads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;bpf_perf_event_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BPF_F_CURRENT_CPU&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;pkt_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pkt_meta&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Challenges and Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hardware support&lt;/strong&gt; varies by NIC driver.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verifier constraints&lt;/strong&gt;: programs must be bounded and safe.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging&lt;/strong&gt; can be non-trivial — &lt;code&gt;bpftool prog tracelog&lt;/code&gt; helps, but it's still kernel space.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portability&lt;/strong&gt;: kernel versions differ in helper function availability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge&lt;/strong&gt;: you still gotta know what you're doing fu*king around in there, this ain't no lovable prompt party.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Still, XDP is maturing fast, and the eBPF ecosystem around it (bpftool, libbpf, Cilium, Katran) makes development significantly easier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;XDP represents the most radical shift in Linux networking since Netfilter. &lt;/p&gt;

&lt;p&gt;It lets you run programmable logic where it matters most — &lt;em&gt;at the point of ingress&lt;/em&gt; — turning your kernel into a programmable network processor.&lt;/p&gt;

&lt;p&gt;Whether you're building autonomous defenses, ultra-low-latency telemetry, or custom in-kernel routing, XDP gives you the foundation for it.&lt;/p&gt;

&lt;p&gt;The kernel is no longer a bottleneck: it's a battlefield, and XDP is the armour keeping your loved ones (your Nan etc.) safe.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Now?
&lt;/h2&gt;

&lt;p&gt;I'm working on XDP applications for a couple of projects that are going on right now. More news on this soon.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://xdp-project.net/" rel="noopener noreferrer"&gt;Linux XDP Project&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.cloudflare.com/l4drop-xdp-ebpf-based-ddos-mitigations/" rel="noopener noreferrer"&gt;Cloudflare: How We Use XDP for DDoS Protection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://engineering.fb.com/2018/05/22/open-source/open-sourcing-katran-a-scalable-network-load-balancer/" rel="noopener noreferrer"&gt;Meta’s Katran Load Balancer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.cilium.io/" rel="noopener noreferrer"&gt;Cilium’s XDP Integration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/libbpf/libbpf-bootstrap" rel="noopener noreferrer"&gt;libbpf-bootstrap Templates&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>xdp</category>
      <category>linux</category>
      <category>networking</category>
      <category>observability</category>
    </item>
    <item>
      <title>Why Multi-Validator Hosts Break Traditional Security Scanning</title>
      <dc:creator>Simon Morley</dc:creator>
      <pubDate>Wed, 05 Nov 2025 14:56:18 +0000</pubDate>
      <link>https://forem.com/simon_morley/why-multi-validator-hosts-break-traditional-security-scanning-3nbj</link>
      <guid>https://forem.com/simon_morley/why-multi-validator-hosts-break-traditional-security-scanning-3nbj</guid>
      <description>&lt;p&gt;Determining a host is running a Sui validator is easy.&lt;/p&gt;

&lt;p&gt;Step 1 - scan a couple of ports:&lt;/p&gt;

&lt;p&gt;Port 8080? Sui network endpoint.&lt;br&gt;&lt;br&gt;
Port 9184? Sui metrics.  &lt;/p&gt;

&lt;p&gt;Step 2 - Done. Next host.&lt;/p&gt;

&lt;p&gt;And this is fine, but how do we really know it's a Sui validator? Humour me: in this case we do know, because there's a public list of them.&lt;/p&gt;

&lt;p&gt;But it turns out these Sui validators also frequently have HTTP (80) open, which muddies the water. I don't know why; we're still working on that.&lt;/p&gt;

&lt;p&gt;How do we find an Ethereum node? Same idea, different ports.&lt;/p&gt;

&lt;p&gt;How do we find out if the host is running Sui and Ethereum? &lt;/p&gt;

&lt;p&gt;It gets really messy, really fast. False positives, false negatives. General confusion. The humans have to intervene.&lt;/p&gt;

&lt;p&gt;Traditional scanning starts with understanding. Once that's clear, the scanning can commence. &lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem Nobody Talks About (mostly because they're not interested, but if they are, they're not talking about it).
&lt;/h2&gt;

&lt;p&gt;Validator operators don't run one chain per host like some kind of theoretical best-practice diagram.&lt;/p&gt;

&lt;p&gt;They run multiple validators on the same infrastructure because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hardware costs money&lt;/li&gt;
&lt;li&gt;Operations complexity scales with host count&lt;/li&gt;
&lt;li&gt;A 32-core server running one validator is wasteful&lt;/li&gt;
&lt;li&gt;Most chains don't max out resources simultaneously&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So you get hosts running:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sui + Ethereum&lt;/li&gt;
&lt;li&gt;Solana + Cosmos&lt;/li&gt;
&lt;li&gt;Ethereum + Polygon + Arbitrum&lt;/li&gt;
&lt;li&gt;Some combination I've literally never seen before&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And now your nice clean rule-based scanner that looks for "Sui signatures" doesn't know what to do.&lt;/p&gt;

&lt;p&gt;I spent weeks trying to figure it out, ended up:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Asking the user what services are running - lil bit 2002.&lt;/li&gt;
&lt;li&gt;Using AI to figure it out!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The AI did a great job, sometimes. At a cost. I walked away for a bit and did something else.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Overlapping Port Problem
&lt;/h2&gt;

&lt;p&gt;It gets worse when chains use similar port ranges or standard services.&lt;/p&gt;

&lt;p&gt;Multiple validators might expose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics endpoints (Sui on 9184, standard Prometheus on 9090)&lt;/li&gt;
&lt;li&gt;JSON-RPC endpoints (Ethereum 8545, Solana 8899, Sui 8080)&lt;/li&gt;
&lt;li&gt;P2P networking (Ethereum 30303, Solana 8000-10000, Sui 8084)&lt;/li&gt;
&lt;li&gt;WebSocket connections (Solana 8900)&lt;/li&gt;
&lt;li&gt;Monitoring stacks (all using Grafana on 3000)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can't just say "port 9090 = Prometheus therefore monitoring only."&lt;/p&gt;

&lt;p&gt;Because what if that Prometheus instance is exposing metrics for &lt;strong&gt;three different validators&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;Now you need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identify which metrics belong to which chain&lt;/li&gt;
&lt;li&gt;Understand which validators are actually running&lt;/li&gt;
&lt;li&gt;Map CVEs to the correct services&lt;/li&gt;
&lt;li&gt;Determine risk posture across multiple chains&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Rule-based scanning doesn't scale to this.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Configuration Variance Problem
&lt;/h2&gt;

&lt;p&gt;Even if you nail down the ports, validator configurations vary wildly.&lt;/p&gt;

&lt;p&gt;Some operators:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run validators in Docker (different process visibility)&lt;/li&gt;
&lt;li&gt;Use non-standard ports (8545 becomes 18545 because reasons)&lt;/li&gt;
&lt;li&gt;Proxy everything through nginx (now all you see is nginx)&lt;/li&gt;
&lt;li&gt;Run custom monitoring stacks (Prometheus? Grafana? Both? Neither?)&lt;/li&gt;
&lt;li&gt;Use systemd service names that don't match upstream defaults&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So your rules for "detecting an Ethereum validator" need to account for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard Geth on 30303/8545&lt;/li&gt;
&lt;li&gt;Dockerized Geth on custom ports&lt;/li&gt;
&lt;li&gt;Proxied Geth behind nginx on 443&lt;/li&gt;
&lt;li&gt;Custom compiled Geth with a weird banner&lt;/li&gt;
&lt;li&gt;Besu instead of Geth (different client, same chain)&lt;/li&gt;
&lt;li&gt;Nethermind or Erigon (different again)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that's &lt;strong&gt;one chain&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Multiply this across Sui, Solana, Cosmos, Polygon, Avalanche...&lt;/p&gt;

&lt;p&gt;You see the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Manual Verification Works (But Doesn't Scale)
&lt;/h2&gt;

&lt;p&gt;The only reliable way to identify multi-chain hosts right now?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human verification.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This works. It's accurate.&lt;/p&gt;

&lt;p&gt;It's also &lt;strong&gt;slower than your nan&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you're scanning hundreds of validator hosts across multiple clients, you can't manually verify every configuration.&lt;/p&gt;

&lt;p&gt;And if the configuration changes (which it does), you have to verify again.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Insight That Changes Everything
&lt;/h2&gt;

&lt;p&gt;After manually verifying enough multi-chain hosts, I started to notice something:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They have a shape.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not a shape you can easily encode in rules.&lt;br&gt;&lt;br&gt;
But a shape you can &lt;strong&gt;recognise&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A Sui+ETH host "feels different" than a Sui-only host.&lt;br&gt;&lt;br&gt;
A Solana+Cosmos host has a different "fingerprint" than either chain alone.&lt;/p&gt;

&lt;p&gt;You can't write down the rules for why.&lt;br&gt;&lt;br&gt;
But you know it when you see it.&lt;/p&gt;

&lt;p&gt;Your brain is doing something that rule-based scanners can't:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Pattern matching across multiple dimensions simultaneously.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You're not looking for specific ports.&lt;br&gt;&lt;br&gt;
You're looking at the &lt;strong&gt;whole configuration&lt;/strong&gt; and recognizing similarity.&lt;/p&gt;

&lt;p&gt;Port 8080 + 8545 + 30303 + 9090 + 3000?&lt;br&gt;&lt;br&gt;
That's a Sui + Ethereum setup with monitoring.&lt;/p&gt;

&lt;p&gt;Port 8899 + 8900 + 8000-8020 + 9090?&lt;br&gt;&lt;br&gt;
That's Solana with standard monitoring.&lt;/p&gt;

&lt;p&gt;Port 8080 + 9184 + 8545 + 8551 + 30303 + 8899 + 9090 + 3000?&lt;br&gt;&lt;br&gt;
That's a three-chain monster that needs close attention.&lt;/p&gt;

&lt;p&gt;You're not consciously running through these rules.&lt;br&gt;&lt;br&gt;
You're just &lt;strong&gt;seeing the pattern&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Goodbye AI, or at least some of it.
&lt;/h2&gt;

&lt;p&gt;Instead of writing rules, we could:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Manually verify a multi-chain host once (or get the user to do this)!&lt;/li&gt;
&lt;li&gt;Store its "fingerprint" (ports, services, banners, everything)&lt;/li&gt;
&lt;li&gt;When we see a new host, search for similar fingerprints&lt;/li&gt;
&lt;li&gt;If it's similar to a verified host, inherit that classification&lt;/li&gt;
&lt;li&gt;If it's novel, verify manually and add to the training set&lt;/li&gt;
&lt;/ol&gt;
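&lt;p&gt;The similarity search in step 3 is just vector maths. A toy sketch with a hand-built fingerprint, one slot per interesting port; a real embedding has far more dimensions, but the comparison works the same way:&lt;/p&gt;

```rust
// Toy host "fingerprint": a fixed-dimension feature vector, one slot per
// interesting port. Embedding models produce the same kind of object,
// just with thousands of dimensions instead of six.
fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f64 = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb: f64 = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn main() {
    // Slots: [8080, 9184, 8545, 30303, 9090, 3000]
    let verified_sui_eth = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]; // human-verified host
    let candidate_a      = [1.0, 1.0, 1.0, 1.0, 0.0, 1.0]; // looks Sui+ETH
    let candidate_b      = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]; // monitoring only

    let sim_a = cosine_similarity(&verified_sui_eth, &candidate_a);
    let sim_b = cosine_similarity(&verified_sui_eth, &candidate_b);
    // Candidate A is closest to the verified host, so it inherits
    // that classification; candidate B stays "novel" for manual review.
    assert!(sim_a > sim_b);
    println!("ok");
}
```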

&lt;p&gt;OK, we're still using OpenAI's embeddings, but those are a lot cheaper than full GPT calls for every host.&lt;/p&gt;

&lt;p&gt;The AI called this &lt;strong&gt;scaled pattern matching with human-verified training data&lt;/strong&gt;!! Weheey.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Comes Next
&lt;/h2&gt;

&lt;p&gt;In Part 2, I'll show you how vector embeddings let you do exactly this: turn a server's full configuration into a numerical fingerprint, then search for "servers that look like this one."&lt;/p&gt;

&lt;p&gt;It's actually quite boring and if you know me, I love boring. Meanwhile, ChatGPT inserted this "it's just high-dimensional similarity search" which I found fun so I left it.&lt;/p&gt;

&lt;p&gt;Spoiler: Postgres + pgvector is good enough! Most of us don't need the MEGA VECTOR DBs.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is Part 1 of 3 on building better security scanning for multi-chain validator infrastructure. Part 2 covers vector embeddings as scaled pattern matching.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Building something for good over here.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vectordatabase</category>
      <category>web3</category>
      <category>cybersecurity</category>
    </item>
    <item>
      <title>How Google Mistook My Sui Node for a Bitcoin Farm (And Banned Me) (again)</title>
      <dc:creator>Simon Morley</dc:creator>
      <pubDate>Thu, 30 Oct 2025 12:14:28 +0000</pubDate>
      <link>https://forem.com/simon_morley/how-google-mistook-my-sui-node-for-a-bitcoin-farm-and-banned-me-again-3883</link>
      <guid>https://forem.com/simon_morley/how-google-mistook-my-sui-node-for-a-bitcoin-farm-and-banned-me-again-3883</guid>
      <description>&lt;p&gt;Google thought I was mining Bitcoin - mining it like it's 2018 baby. But no, wasn't doing that (again).&lt;/p&gt;

&lt;p&gt;I was running an L1 validator test node—you know, the exact kind of blockchain infrastructure that legitimate DeFi platforms actually need. The kind that requires computational resources because that's how distributed consensus works. The kind that works really well on cloud providers...&lt;/p&gt;

&lt;p&gt;But Google's threat detection AI apparently can't tell the difference between "crypto mining operation" and "blockchain validator infrastructure."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So they banned me.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not just for that, mind you. The ban was a trifecta:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Running security scans&lt;/strong&gt; (with explicit authorisation from targets)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operating honeypots&lt;/strong&gt; (literal security research)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Running that "Bitcoin miner"&lt;/strong&gt; (a Sui test node m8)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One day I'm building away, on GCP because that's where I &lt;em&gt;like&lt;/em&gt; to build things, the next day? Locked out. Account suspended. No more deploying. No more scanning infrastructure. No more anything.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv421duyduwk06m9ctdr8.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv421duyduwk06m9ctdr8.gif" alt="Where do I even go!"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Irony Wasn't Lost On Me
&lt;/h2&gt;

&lt;p&gt;Here I am, building a platform to help decentralised networks secure their infrastructure, and I get flagged as a threat &lt;em&gt;by another AI system&lt;/em&gt; that can't distinguish between malicious activity and legitimate development.&lt;/p&gt;

&lt;p&gt;Google: confused about blockchain workloads.&lt;br&gt;
Simon: trying to eliminate false positives in security scanning.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Unbanning Process (AKA: Purgatory)
&lt;/h2&gt;

&lt;p&gt;I deleted this section because it was boring. Eventually, they unbanned me. Weheey!&lt;/p&gt;

&lt;p&gt;I got back in—but with restrictions. Some things I could build on GCP. Other things? Not so much.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conditionally reinstated.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I was annoyed at first - just sat there looking at my computer. I realise now this was a blessing in disguise.&lt;/p&gt;

&lt;p&gt;It looked like all those hours in the data centres in the 2000s would finally pay off. Yeah, I am that old. Am I going back to bare metal!? Probably not but still, it's an option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmiqvwvc7d3zhhihv1jgl.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmiqvwvc7d3zhhihv1jgl.gif" alt="sbsq"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The Question I Should've Asked Earlier
&lt;/h2&gt;

&lt;p&gt;Sitting there, freshly unbanned and afraid to breathe wrong, I had a realisation. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Even before the ban, I wasn't being smart.&lt;/strong&gt; I'm there burning through AI API calls like they were free. Every scan would:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Discover ports&lt;/li&gt;
&lt;li&gt;Identify services
&lt;/li&gt;
&lt;li&gt;Extract banners and metadata&lt;/li&gt;
&lt;li&gt;Embed everything with OpenAI&lt;/li&gt;
&lt;li&gt;Analyse it with a GPT&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Port 22 open? Ask GPT-4 if it's SSH.&lt;br&gt;&lt;br&gt;
Port 443 responding? Better check with the AI if it's HTTPS.&lt;br&gt;&lt;br&gt;
Port 3000 with a Grafana banner? Let's spend $0.03 to confirm what we already know.&lt;/p&gt;

&lt;p&gt;I was asking a $200 billion company's large language model to tell me things &lt;strong&gt;I learned in 2002.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/Il-an3K9pjg"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;Thousands of dollars. Thousands of API calls. Thousands of Nvidia chips spinning up to answer questions like "is port 22 usually SSH?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Four hundred times.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Same ports. Same configurations. Same fucking validator setups across different hosts.&lt;/p&gt;

&lt;p&gt;And I kept asking. Every. Single. Time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Constraint That Changes Everything
&lt;/h2&gt;

&lt;p&gt;The GCP ban forced a question:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How do I operate smarter with fewer resources and less aggressive scanning?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But the real question—the one I'd been avoiding—was simpler:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why am I re-asking questions I already know the answer to?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your brain doesn't work like this. After twenty years of staring at security scans, you recognise patterns instantly. You see an open port configuration and you &lt;em&gt;know&lt;/em&gt;—not because you're thinking hard, but because you've seen it before. That's not intelligence. That's &lt;strong&gt;memory.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61xcxestm4tcc6kiqli0.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61xcxestm4tcc6kiqli0.gif" alt="MIND BLOWN"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Comes Next
&lt;/h2&gt;

&lt;p&gt;Getting banned taught me something valuable: &lt;strong&gt;constraints force innovation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I couldn't scan aggressively anymore. I couldn't just throw compute at every problem. I had to be &lt;em&gt;efficient&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So I stopped asking the AI everything.&lt;/p&gt;

&lt;p&gt;Instead, I built a system that remembers.&lt;/p&gt;
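&lt;p&gt;The core of that system is embarrassingly simple: look up the answer before paying for it. A sketch in plain Rust; the fingerprint format and the &lt;code&gt;classify_with_ai&lt;/code&gt; stand-in are hypothetical, just to show the shape of the idea:&lt;/p&gt;

```rust
use std::collections::HashMap;

// Check a local "memory" before paying for an AI call.
// classify_with_ai is a stand-in for the expensive API round-trip.
struct Classifier {
    memory: HashMap<String, String>, // fingerprint -> classification
    ai_calls: u32,
}

impl Classifier {
    fn classify(&mut self, fingerprint: &str) -> String {
        if let Some(known) = self.memory.get(fingerprint) {
            return known.clone(); // free: we've seen this shape before
        }
        self.ai_calls += 1; // expensive path
        let label = classify_with_ai(fingerprint);
        self.memory.insert(fingerprint.to_string(), label.clone());
        label
    }
}

// Hypothetical stand-in for the GPT call.
fn classify_with_ai(fingerprint: &str) -> String {
    match fingerprint {
        "22,8080,9184" => "sui-validator".to_string(),
        _ => "unknown".to_string(),
    }
}

fn main() {
    let mut c = Classifier { memory: HashMap::new(), ai_calls: 0 };
    for _ in 0..400 {
        c.classify("22,8080,9184"); // same validator setup, 400 hosts
    }
    assert_eq!(c.ai_calls, 1); // paid once, remembered 399 times
    println!("ok");
}
```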

&lt;p&gt;&lt;strong&gt;In Part 2&lt;/strong&gt;, I'll show you exactly how much money I was burning on redundant AI calls, why "asking the model" isn't the same as "being intelligent," and how a $200 Postgres instance became smarter than my entire AI pipeline.&lt;/p&gt;

&lt;p&gt;Spoiler: Vector databases aren't about replacing AI. They're about remembering what you already figured out.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Building AI-powered security infrastructure for decentralised networks. This is Part 1 of 3 on getting banned from GCP and what it taught me about building smarter systems.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>googlecloud</category>
      <category>security</category>
      <category>devops</category>
    </item>
    <item>
      <title>PGDN Sentinel — an OSS security toolkit for Sui validators, inside Discord</title>
      <dc:creator>Simon Morley</dc:creator>
      <pubDate>Wed, 29 Oct 2025 11:40:42 +0000</pubDate>
      <link>https://forem.com/simon_morley/pgdn-sentinel-an-oss-security-toolkit-for-sui-validators-inside-discord-31cg</link>
      <guid>https://forem.com/simon_morley/pgdn-sentinel-an-oss-security-toolkit-for-sui-validators-inside-discord-31cg</guid>
      <description>&lt;p&gt;When I published the &lt;em&gt;State of Sui&lt;/em&gt; report, the biggest surprise wasn't the 39.6 % of voting power exposed — it was actually how little people 'cared' about external hygiene. I thought I was doing a good thing for the network. Alas, Sui really didn't seem that fussed.&lt;/p&gt;

&lt;p&gt;The data was acknowledged, questioned, and dismissed as 'not a bug bounty'. But that was never the aim of the project.&lt;/p&gt;

&lt;p&gt;So I thought it would be cool to try to visualise this data without requiring Yet Another Dashboard Tool To Log In To (YADTTLIT).&lt;/p&gt;

&lt;p&gt;Instead, I built a wee Discord bot that lets validators have a look at their scores, and lets regular users, like me, check how secure a validator is.&lt;/p&gt;

&lt;p&gt;And here it is!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4qb3k8003yupmbudi93.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4qb3k8003yupmbudi93.gif" alt="PGDN Sentinel" width="480" height="1039"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a Discord bot?
&lt;/h2&gt;

&lt;p&gt;Most validator operators already live in Discord.&lt;br&gt;&lt;br&gt;
You're there for epoch coordination, validator channels, and announcements — so security should meet you there too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PGDN Sentinel&lt;/strong&gt; is a private, agentic security toolkit for Sui validators that runs entirely through Discord DMs.&lt;br&gt;&lt;br&gt;
No dashboards, no credentials, no installs.&lt;br&gt;&lt;br&gt;
Just slash commands. That's what she said.&lt;/p&gt;

&lt;p&gt;I've released the code as an open-source project that you can use, although I haven't worked out how to make the backend data public yet. For now, that lives in a private database.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Keep The Data Private, Simon?!
&lt;/h2&gt;

&lt;p&gt;As with most external analysis, I did uncover a large number of validators with actual issues: CVEs, misconfigurations, and so on. I figured that it probably wouldn't be the best idea to publish these.&lt;/p&gt;

&lt;p&gt;That said, I did create a 'validation' logic that would allow the 'validators' to prove ownership and then get a list of these. And I've been offering some free advice to them too. Because that's how I roll.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the architecture, I hear you asking?!
&lt;/h2&gt;

&lt;p&gt;I created two repos: the main 'bot', which subscribes to the Discord webhooks, and an API. The API is connected to the db, and I'm running it all in a Kubernetes cluster. In theory the bot can run anywhere, but I've locked the API's ingress down.&lt;/p&gt;

&lt;p&gt;It's all in Python. And Claude gave me a helping hand, as usual.&lt;/p&gt;
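&lt;p&gt;For flavour, here's roughly the kind of reply-building logic the bot runs when someone asks for a validator's score. The field names, score banding, and thresholds below are illustrative, not the actual PGDN schema:&lt;/p&gt;

```python
def score_reply(validator):
    """Build a Discord-friendly summary line from a scan-result dict.
    Fields and banding are illustrative, not the real PGDN schema."""
    name = validator["name"]
    score = validator["score"]  # 0-100, higher is better
    if score >= 80:
        band = "🟢 solid"
    elif score >= 50:
        band = "🟡 needs attention"
    else:
        band = "🔴 exposed"
    issues = validator.get("open_issues", 0)
    return f"{name}: {score}/100 ({band}), {issues} open finding(s)"
```

&lt;p&gt;Keeping this as a pure function makes it trivial to unit-test without a Discord connection; the slash-command handler just DMs whatever it returns.&lt;/p&gt;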

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;In &lt;em&gt;Simulated Attack&lt;/em&gt;, I modelled how an attacker could disable enough validators to cross the 33 % halt threshold. Sentinel exists to close that gap — to make external posture checks routine and effortless.&lt;/p&gt;

&lt;p&gt;You don't need a SOC team to know if your node is exposed.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;➡️ &lt;strong&gt;&lt;a href="https://pgdn.ai/pgdn-sentinel-discord" rel="noopener noreferrer"&gt;Add PGDN Sentinel in Discord&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Works in any server or direct DM.&lt;/p&gt;

&lt;p&gt;The code can be found below. It's MIT licensed, which means you're totally welcome to do what you want with it.&lt;/p&gt;

&lt;p&gt;API: &lt;a href="https://github.com/pgdn-oss/pgdn-api-discord" rel="noopener noreferrer"&gt;https://github.com/pgdn-oss/pgdn-api-discord&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Bot: &lt;a href="https://github.com/pgdn-oss/pgdn-discord" rel="noopener noreferrer"&gt;https://github.com/pgdn-oss/pgdn-discord&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Get in touch
&lt;/h2&gt;

&lt;p&gt;I'm a CTO with 20 years' experience who most recently managed to exit a crypto exchange. I would love to connect on Twitter - please do DM and follow me, I still have limited frens on there :) &lt;a href="https://x.com/simonpmorley" rel="noopener noreferrer"&gt;https://x.com/simonpmorley&lt;/a&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>sui</category>
      <category>discord</category>
      <category>web3</category>
    </item>
    <item>
      <title>It's true, the web3 world is as decentralised as your nan's underwear.</title>
      <dc:creator>Simon Morley</dc:creator>
      <pubDate>Wed, 22 Oct 2025 14:06:42 +0000</pubDate>
      <link>https://forem.com/simon_morley/its-true-the-web3-world-is-as-decentralised-as-your-nans-underwear-53cd</link>
      <guid>https://forem.com/simon_morley/its-true-the-web3-world-is-as-decentralised-as-your-nans-underwear-53cd</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/simon_morley" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3564652%2Ff1533649-1cc6-4966-8e4c-14c383cda1ef.png" alt="simon_morley"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/simon_morley/the-tesla-generator-paradox-and-why-web3-is-still-about-as-decentralised-as-your-nans-underwear-4g26" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;The Tesla Generator Paradox And Why Web3 Is Still About as Decentralised as Your Nan’s Underwear&lt;/h2&gt;
      &lt;h3&gt;Simon Morley ・ Oct 22&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#web3&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#decentralization&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#blockchain&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#architecture&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>web3</category>
      <category>decentralization</category>
      <category>blockchain</category>
      <category>architecture</category>
    </item>
    <item>
      <title>The Tesla Generator Paradox And Why Web3 Is Still About as Decentralised as Your Nan’s Underwear</title>
      <dc:creator>Simon Morley</dc:creator>
      <pubDate>Wed, 22 Oct 2025 10:34:48 +0000</pubDate>
      <link>https://forem.com/simon_morley/the-tesla-generator-paradox-and-why-web3-is-still-about-as-decentralised-as-your-nans-underwear-4g26</link>
      <guid>https://forem.com/simon_morley/the-tesla-generator-paradox-and-why-web3-is-still-about-as-decentralised-as-your-nans-underwear-4g26</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;“If your blockchain goes down because AWS goes down, you’re not decentralized.”&lt;br&gt;&lt;br&gt;
— Ben Schiller, &lt;em&gt;CoinDesk&lt;/em&gt;, Oct 2025&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  🧭 The Outage That Shook the Backbone
&lt;/h2&gt;

&lt;p&gt;Bla bla bla, we've all heard the news. Maybe your Alexa broke. Whatever. For context:&lt;/p&gt;

&lt;p&gt;On &lt;strong&gt;October 20–21, 2025&lt;/strong&gt;, AWS suffered a &lt;strong&gt;major outage&lt;/strong&gt; centred in the &lt;code&gt;us-east-1&lt;/code&gt; region.&lt;br&gt;&lt;br&gt;
Snapchat, Fortnite, Roblox, and even parts of Amazon itself went dark. Bla bla bla.&lt;/p&gt;

&lt;p&gt;For nearly fifteen hours, the internet’s most “redundant” infrastructure was anything but.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;If AWS sneezes, the internet catches a cold.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;(I don't know who said that FYI).&lt;/p&gt;

&lt;p&gt;This time the fallout went deeper — &lt;strong&gt;blockchains, RPC endpoints, and validator APIs&lt;/strong&gt; started failing too.&lt;/p&gt;

&lt;p&gt;The world’s “decentralized” systems suddenly felt very centralized.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔍 The Tesla Generator Paradox
&lt;/h2&gt;

&lt;p&gt;Web3 today is a bit like &lt;strong&gt;running a Tesla off a petrol generator&lt;/strong&gt;. Which apparently people do; someone on Twitter even fact-checked this image, so it must be true.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://x.com/brechtcastel/status/1570425739811459075" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vkgesbwxo6goxj9uboh.png" alt="Not a tesla fanboy" width="526" height="701"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everyone is banging on about decentralisation, but peel back the layers and you’ll find the same old fossil infrastructure humming underneath. I've been saying this for years, including during my tenure as CTO at a DeFi startup. So, I must be right.&lt;/p&gt;

&lt;h2&gt;
  
  
  📊 What the Data Says
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Because I haz got the datas, I did look at them datas!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you've been reading my thrilling articles about how Sui is probably going to collapse quite soon, you'll know what's coming. My analysis of 122 Sui validators and public nodes reveals moderate decentralisation on paper, but concentration in practice.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latitude.sh&lt;/strong&gt;: 18.0 %
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OVH&lt;/strong&gt;: 18.0 %
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Constant Company&lt;/strong&gt;: 7.4 %
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top 3 providers = 43 %&lt;/strong&gt; of all validators
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;US jurisdiction&lt;/strong&gt;: 31 %
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Western Europe&lt;/strong&gt; (UK + DE + FR + NL + IE): 41 %
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS + GCP + Azure combined&lt;/strong&gt;: 10.7 %&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;Herfindahl–Hirschman Index (HHI)&lt;/strong&gt; is 859 — “unconcentrated” by regulatory standards but that’s misleading when dozens of nodes share the same upstream providers, fibre routes, and power grids.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BTW - I learned myself something today - HHI - fancy new term for Simon!&lt;/strong&gt;&lt;/p&gt;
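&lt;p&gt;For anyone else who just learned the term: HHI is simply the sum of squared market-share percentages. A quick sketch (only the top-three shares come from the report; the long tail of smaller providers is what brings the full total to 859):&lt;/p&gt;

```python
def hhi(shares_pct):
    """Herfindahl-Hirschman Index: the sum of squared market shares,
    with shares expressed in percent. 10000 is a pure monopoly;
    regulators treat values under 1500 as 'unconcentrated'."""
    return sum(s * s for s in shares_pct)

# The top three providers from the report already contribute most of
# the concentration; the full provider list sums to 859.
print(round(hhi([18.0, 18.0, 7.4]), 2))  # 702.76
```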

&lt;p&gt;You can see all the data here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/pgdn-oss/pgdn-research/blob/main/reports/2025-10-sui-decentralisation.md" rel="noopener noreferrer"&gt;https://github.com/pgdn-oss/pgdn-research/blob/main/reports/2025-10-sui-decentralisation.md&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Oh, and here's something fun: Sui once told me that their infra runs in this super-private network in Switzerland, but actually they have their stuff in Google Cloud. Sweet.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"On paper it’s diverse. In practice, it’s the same house with many doors."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  🧩 Web3 Built on Web2 Foundations
&lt;/h2&gt;

&lt;p&gt;The irony is painful: even networks designed for fault tolerance often rely on a few hyperscalers for uptime. Hyperscalers! (ChatGPT put this word in here; I decided it sounded cool.)&lt;/p&gt;

&lt;p&gt;When &lt;strong&gt;AWS&lt;/strong&gt; stumbles, so do RPC providers like &lt;strong&gt;Infura&lt;/strong&gt;, &lt;strong&gt;Alchemy&lt;/strong&gt;, and &lt;strong&gt;QuickNode&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
When &lt;strong&gt;Cloudflare&lt;/strong&gt; misconfigures, half of Solana RPC endpoints vanish.&lt;br&gt;&lt;br&gt;
When &lt;strong&gt;OVH&lt;/strong&gt; catches fire (literally, in 2021), validators go dark.&lt;/p&gt;

&lt;p&gt;It’s decentralisation &lt;strong&gt;in code&lt;/strong&gt;, centralisation &lt;strong&gt;in practice&lt;/strong&gt;. Fact. I think it's the damn thought leaders again saying words.&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚙️ What Real Infrastructure Decentralisation Requires
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Jurisdictional diversity&lt;/strong&gt; — not just different countries, but different &lt;em&gt;regulators&lt;/em&gt; and &lt;em&gt;risk domains&lt;/em&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bare-metal or sovereign hosting&lt;/strong&gt; — own or colocate hardware. Don’t rent from hyperscalers. Actually, Latitude.sh claims to be bare metal, but the point is we need diversity!&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-provider topology&lt;/strong&gt; — AWS + OVH + local datacenters + community nodes.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transparent infrastructure maps&lt;/strong&gt; — disclose where nodes live and how they’re connected.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resilience testing&lt;/strong&gt; — simulate region loss, BGP leaks, and power faults.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentic monitoring&lt;/strong&gt; — track correlated risk across clouds, not just node count. I've put this in because this is what I am building, hint hint.&lt;/li&gt;
&lt;/ol&gt;
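&lt;p&gt;Point 6 is the measurable one: instead of counting nodes, group voting power by shared infrastructure and check how much goes offline together if a single provider fails. A minimal sketch, with an entirely illustrative stake map:&lt;/p&gt;

```python
from collections import defaultdict

# Illustrative data: (validator, hosting provider, share of voting power in %)
validators = [
    ("v1", "ovh", 12.0), ("v2", "ovh", 6.0),
    ("v3", "latitude", 18.0), ("v4", "aws", 4.0), ("v5", "home-dc", 2.0),
]

def correlated_exposure(vals):
    """Voting power that goes offline together if one provider fails."""
    by_provider = defaultdict(float)
    for _, provider, share in vals:
        by_provider[provider] += share
    return dict(by_provider)

# The largest single-provider failure domain, in % of voting power.
worst = max(correlated_exposure(validators).values())  # 18.0
```

&lt;p&gt;Two validators in different countries on the same provider are one failure domain, which is exactly what a raw node count hides.&lt;/p&gt;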

&lt;h2&gt;
  
  
  🧠 The Hard Truth
&lt;/h2&gt;

&lt;p&gt;We call it &lt;em&gt;Web3&lt;/em&gt;, but until our nodes can survive an AWS outage,&lt;br&gt;&lt;br&gt;
it’s still &lt;strong&gt;Web2 in disguise&lt;/strong&gt; — centralised scaffolding painted in decentralised colours.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Decentralisation isn’t about the number of validators.&lt;br&gt;&lt;br&gt;
It’s about how many can survive when the lights go out.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The good news?&lt;/p&gt;

&lt;p&gt;Tools like agentic scanners, peer diversity metrics, and open telemetry are starting to expose these weak points. The next step is acting on them — &lt;strong&gt;before&lt;/strong&gt; the next outage does it for us.&lt;/p&gt;

</description>
      <category>web3</category>
      <category>decentralization</category>
      <category>blockchain</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Simulated Attack: How a 33% consensus risk puts Sui one incident away from a network halt</title>
      <dc:creator>Simon Morley</dc:creator>
      <pubDate>Tue, 21 Oct 2025 10:28:55 +0000</pubDate>
      <link>https://forem.com/simon_morley/simulated-attack-how-a-33-consensus-risk-puts-sui-one-incident-away-from-a-network-halt-1jlf</link>
      <guid>https://forem.com/simon_morley/simulated-attack-how-a-33-consensus-risk-puts-sui-one-incident-away-from-a-network-halt-1jlf</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
In my external posture analysis of Sui validator infrastructure, I found ≈39.6% of voting power was externally vulnerable - above the 33% consensus halt threshold by ~6.6 percentage points (equivalent to &lt;strong&gt;621 voting power&lt;/strong&gt; in our dataset).&lt;/p&gt;

&lt;p&gt;This simulated attack models how an attacker could chain public signals and operational misconfigurations to disable enough validators to cross that threshold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result is a resilience warning: the network was, at scan time, within striking distance of a service-impacting halt.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Full details here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/pgdn-oss/sui-network-report-250819/blob/main/simulated_attack.md" rel="noopener noreferrer"&gt;https://github.com/pgdn-oss/sui-network-report-250819/blob/main/simulated_attack.md&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Ethics &amp;amp; scope
&lt;/h2&gt;

&lt;p&gt;This was a non-exploitative simulation using only publicly-observable data. I did &lt;strong&gt;not&lt;/strong&gt; access private systems, exfiltrate data, or run exploits. I have redacted IPs, hostnames, step-by-step exploit primitives and any reproduction commands that would enable misuse. Operators who find themselves in the report and need confidential help: open an issue on the repo or contact me privately via the repo's issue tracker. I follow coordinated disclosure best practices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the numbers matter
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;33% halt threshold:&lt;/strong&gt; Sui's consensus can be materially impacted if ≥33% of voting power goes offline or is disabled.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observed exposure (~39.6%):&lt;/strong&gt; My scans found roughly 39.6% of voting power had externally-observable vulnerabilities or misconfigurations that an attacker could plausibly target.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delta:&lt;/strong&gt; That is ~6.6 percentage points above the halt threshold — &lt;strong&gt;621 voting power&lt;/strong&gt; in our dataset. In plain terms: the network was within a single coordinated incident of crossing a critical resilience boundary.&lt;/li&gt;
&lt;/ul&gt;
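&lt;p&gt;The arithmetic above is simple, but it's worth making the mapping explicit (the two percentages are from the report's dataset):&lt;/p&gt;

```python
HALT_THRESHOLD_PCT = 33.0     # consensus liveness boundary
observed_exposure_pct = 39.6  # externally vulnerable voting power found by the scan

# How far above the halt threshold the exposed population sits.
margin_pp = round(observed_exposure_pct - HALT_THRESHOLD_PCT, 1)
print(margin_pp)  # 6.6

def halts(offline_pct):
    """An attacker doesn't need all 39.6%: any subset of the exposed
    voting power totalling 33% or more threatens liveness."""
    return offline_pct >= HALT_THRESHOLD_PCT
```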

&lt;p&gt;This isn't an abstract metric — it maps operational exposure to consensus risk. When you combine exposed validator surfaces at scale, you stop abstracting "nodes" and start measuring real systemic fragility.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the simulated attack actually shows
&lt;/h2&gt;

&lt;p&gt;The simulated attack is a modelling exercise — it demonstrates attacker decision-making rather than executing an exploit. Steps (sanitised):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reconnaissance:&lt;/strong&gt; collect public signals (metrics, HTTP banners, management port responses).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enrichment:&lt;/strong&gt; parse metric labels and banners to infer roles and topology (which nodes are validators, leaders, etc.).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prioritisation:&lt;/strong&gt; rank targets by attacker attractiveness — validators with exposed metrics + reachable management surfaces are high-value.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confirmatory enumeration:&lt;/strong&gt; light, non-destructive probes to validate co-residency and service fingerprints.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attack-path modelling:&lt;/strong&gt; chain the signals into a plausible escalation path that, if realized, could disable selected validators (e.g., by misconfigurations, exposed management APIs, or operational errors), potentially pushing cumulative offline voting power above the halt threshold.&lt;/li&gt;
&lt;/ol&gt;
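&lt;p&gt;The prioritisation step above can be sketched as a simple scoring pass over the recon results. The field names and weights here are illustrative, invented for the sketch — the report's actual ranking model is not published:&lt;/p&gt;

```python
def attractiveness(node):
    """Illustrative attacker-attractiveness score for a recon result."""
    score = 0
    if node.get("is_validator"):
        score += node.get("voting_power", 0)  # disabling validators moves the 33% needle
    if node.get("metrics_public"):
        score += 10   # role/topology leakage aids targeting
    if node.get("mgmt_api_reachable"):
        score += 25   # plausible path to actually disabling the node
    return score

nodes = [
    {"name": "v1", "is_validator": True, "voting_power": 40, "mgmt_api_reachable": True},
    {"name": "v2", "is_validator": True, "voting_power": 90, "metrics_public": True},
    {"name": "rpc1", "metrics_public": True},
]
ranked = sorted(nodes, key=attractiveness, reverse=True)  # v2, v1, rpc1
```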

&lt;p&gt;Key point: the simulation ties &lt;em&gt;what is observable from the outside&lt;/em&gt; to &lt;em&gt;what an attacker would prioritise&lt;/em&gt;. It’s the mapping from telemetry -&amp;gt; decisions -&amp;gt; systemic outcome.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concrete quantitative findings (sanitised)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total endpoints analysed:&lt;/strong&gt; ~122 Sui-related endpoints.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voting-power exposure observed:&lt;/strong&gt; ≈39.6% of total voting power showed externally-observable vulnerabilities by our conservative scanner and confidence policy.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consensus threshold context:&lt;/strong&gt; The 33% threshold is a critical operational boundary for consensus liveness; the observed exposure exceeded this by ~6.6 percentage points (621 voting power in dataset terms).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Common signal types driving exposure:&lt;/strong&gt; public metrics exposing role labels, management APIs reachable on common ports (container management fingerprints appeared repeatedly), and default HTTP/admin pages leaking product/type info.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(Exact tables, heatmaps, and per-validator rows are in the full report; I have redacted host-level identifiers from this article.)&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for Sui (and similar networks)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Resilience is operational, not just cryptographic.&lt;/strong&gt; Excellent protocol design doesn't prevent nodes from being misconfigured or deployed insecurely. If enough validators share similar deployment mistakes, the protocol's liveness assumptions are at risk.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decentralisation ≠ diversity of security posture.&lt;/strong&gt; A network of validators operated similarly, with shared misconfigurations, concentrates systemic risk.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The practical impact of a coordinated incident:&lt;/strong&gt; crossing the 33% threshold could cause temporary halts, delays in finality, staking reward disruption, and a loss of confidence among users and delegators. Even short outages can have outsized reputational costs for a young ecosystem.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What operators and the ecosystem should do now
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For validators (immediate):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Inventory externally-exposed services&lt;/strong&gt; for your validator and associated infra. If you can't list them, you're blind.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Close management APIs to the public&lt;/strong&gt; (bind to localhost or private networks; require VPN/mTLS jump hosts).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protect metrics&lt;/strong&gt; — use private scraping or authenticated gateways; remove internal hostnames and role labels from public metrics.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silence banners &amp;amp; versions&lt;/strong&gt; that leak product/version info.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run external posture checks&lt;/strong&gt; against your own endpoints and triage findings immediately.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;For the Sui ecosystem (coordination &amp;amp; incentives):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Require external-risk audits&lt;/strong&gt; as part of validator onboarding. Make passing an external posture check a first-class requirement.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incentivise ops maturity&lt;/strong&gt; — link staking, eligibility, or onboarding checks to evidence of secure deployment.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Support operator tooling&lt;/strong&gt; — provide vetted scanner tooling and an official remediation playbook.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Share anonymised telemetry&lt;/strong&gt; so the community can track progress and systemic risk without exposing individual operators.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Limitations &amp;amp; responsible framing
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The 39.6% figure is based on conservative heuristics and an externally-observable posture scan — operator verification can reduce false positives. Some "exposed" signals are port-only observations or default pages that do not necessarily imply 'compromiseability'.
&lt;/li&gt;
&lt;li&gt;This is not a claim that the network was attacked, only that the modeled conditions could — with additional operational error or a coordinated attack — cross the consensus threshold.
&lt;/li&gt;
&lt;li&gt;My goal is operational improvement: to turn a surprising statistic into urgent, practical action.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Reproducibility &amp;amp; where to find the data
&lt;/h2&gt;

&lt;p&gt;Full dataset, scripts, heatmaps and appendices are in the report: &lt;a href="https://github.com/pgdn-oss/sui-network-report-250819" rel="noopener noreferrer"&gt;https://github.com/pgdn-oss/sui-network-report-250819&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you operate validators or infrastructure that appear in the report and want private assistance, please open an issue on the repo and I will respond via coordinated disclosure.&lt;/p&gt;

&lt;p&gt;And there's a cool Discord bot called PGDN Sentinel that you can use too. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://pgdn.ai/pgdn-sentinel-discord" rel="noopener noreferrer"&gt;https://pgdn.ai/pgdn-sentinel-discord&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final word
&lt;/h2&gt;

&lt;p&gt;This isn't an alarmist headline. It's a measured warning based on data: if multiple operators expose similar surfaces, consensus-level fragility is not hypothetical — it's quantifiable and fixable. The immediate wins (close management APIs, protect metrics, automate posture checks) dramatically reduce the chance of a coordinated incident.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(I’m working on something new here — automating external risk discovery at scale. I’ll share details soon.)&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cybersecurity</category>
      <category>web3</category>
      <category>blockchain</category>
      <category>security</category>
    </item>
    <item>
      <title>The state of Sui: What external-facing risk looks like (and why top engineers miss it)</title>
      <dc:creator>Simon Morley</dc:creator>
      <pubDate>Mon, 20 Oct 2025 09:02:07 +0000</pubDate>
      <link>https://forem.com/simon_morley/the-state-of-sui-what-external-facing-risk-looks-like-and-why-top-engineers-miss-it-4m0k</link>
      <guid>https://forem.com/simon_morley/the-state-of-sui-what-external-facing-risk-looks-like-and-why-top-engineers-miss-it-4m0k</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;br&gt;
I analysed the externally observable posture of 122 Sui network endpoints. What I found isn't about whether the Sui team builds great software; it's about how even 'good' engineers can miss &lt;em&gt;external&lt;/em&gt; operational risk: exposed services, misconfigured infrastructure, and public metrics that leak sensitive operational data. This piece summarises my main findings, why they matter, and the practical steps operators can take today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I did this
&lt;/h2&gt;

&lt;p&gt;I wanted to show, with data, how external attack surface and operational misconfigurations can defeat even excellent engineering. The Sui protocol has strong engineering — my goal is educational: to help teams measure and close external exposure before an attacker finds it.&lt;/p&gt;

&lt;p&gt;The data was shared with the Sui security team in August 2025. &lt;/p&gt;

&lt;h2&gt;
  
  
  What I scanned and how
&lt;/h2&gt;

&lt;p&gt;Briefly (full methodology in the linked report):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I measured 122 Sui-related endpoints for externally reachable services (HTTP, RPC, Docker API, metrics endpoints, etc.).&lt;/li&gt;
&lt;li&gt;My approach focused on &lt;strong&gt;externally observable posture&lt;/strong&gt; — what an internet attacker can see and reach — not on private code or internal access.&lt;/li&gt;
&lt;li&gt;I applied conservative confidence thresholds for version/CVE mapping and logged only reproducible findings.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See the full methodology and raw data in my published findings. (link: &lt;a href="https://github.com/pgdn-oss/sui-network-report-250819" rel="noopener noreferrer"&gt;https://github.com/pgdn-oss/sui-network-report-250819&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  Topline findings
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;non-trivial percent&lt;/strong&gt; of observed endpoints exposed services that should never be public (for example, metrics endpoints reachable from the internet, and port 2375 — Docker remote API — observed in a surprising number of hosts). SSH all over the shop.&lt;/li&gt;
&lt;li&gt;Many public websites were default vendor landing pages or misconfigured web servers (these can leak service versions and admin consoles).&lt;/li&gt;
&lt;li&gt;Only a small fraction had WAFs present when an HTTP endpoint existed.&lt;/li&gt;
&lt;li&gt;Several hosts returned service banners or version strings that mapped to known CVEs (I used a conservative confidence policy; the “CVE-affected” label is an upper bound pending operator verification).&lt;/li&gt;
&lt;li&gt;The distribution of problems is not uniform — some operators were well locked down, others left obvious signals that an external attacker could use.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(Full counts, tables and heatmaps are available in the full report.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;External visibility is an attacker’s map.&lt;/strong&gt; Public metrics, misconfigured HTTP endpoints and exposed management APIs are high-value reconnaissance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated attacks scale.&lt;/strong&gt; An exposed metrics endpoint or Docker API is trivial for automated tooling to find and target at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineers think inside-out.&lt;/strong&gt; Teams often focus on consensus and cryptography (rightly), and under-invest in hardening the network/ops layer that faces the internet.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Concrete examples (anonymised)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Metrics endpoints reachable on the public internet that expose internal state and operational metrics.&lt;/li&gt;
&lt;li&gt;Docker remote API (2375/tcp) responding with service banners — a trivial path to container escape or remote code execution in the wrong hands.&lt;/li&gt;
&lt;li&gt;Default web server landing pages that leak version information or provide admin paths.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(Again — see the report for technical reproduction notes and timeline.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Remediation checklist (for operators &amp;amp; you)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Inventory your externally reachable endpoints.&lt;/strong&gt; If you can’t list them, you can’t secure them. Use internal scans + trusted external scans.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Close management interfaces to the public.&lt;/strong&gt; Docker APIs, admin consoles, metrics scrape endpoints — bind them to localhost / private networks only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Require auth and network controls.&lt;/strong&gt; Where management APIs must be reachable externally, place them behind a mutual-TLS gateway, VPN, or tightly-scoped firewall rules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Harden metrics endpoints.&lt;/strong&gt; Don’t expose Prometheus or similar scrapers to the public internet. Use an internal scraper or secure gateway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remove verbose banners &amp;amp; version strings.&lt;/strong&gt; Configure servers to not reveal build/versioning in HTTP headers or service banners.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor for drift.&lt;/strong&gt; Re-run external posture scans regularly and detect when previously-closed ports reappear.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Patch management.&lt;/strong&gt; Track service versions and patch known CVEs promptly — but assume some versions may still be exposed until verified.&lt;/li&gt;
&lt;/ol&gt;
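&lt;p&gt;For point 5, you can spot-check your own HTTP endpoints for version leakage with the standard library alone. The header list is a sample of common offenders, not an exhaustive set:&lt;/p&gt;

```python
import urllib.request

# Headers that commonly leak product/version information.
LEAKY = ("server", "x-powered-by", "x-aspnet-version")

def filter_leaky(headers):
    """Keep only the headers that commonly leak version info."""
    return {k: v for k, v in headers.items() if k.lower() in LEAKY}

def leaky_headers(url, timeout=5):
    """HEAD-request a URL you own and report version-leaking headers."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return filter_leaky(dict(resp.headers))
```

&lt;p&gt;If this ever returns something like &lt;code&gt;Server: nginx/1.18.0&lt;/code&gt;, you've just given an attacker a version string to map against known CVEs.&lt;/p&gt;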

&lt;h2&gt;
  
  
  Limitations &amp;amp; ethics
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;My scans are non-invasive and focused on public-facing services. I do not exploit vulnerabilities, nor do I publish private data.&lt;/li&gt;
&lt;li&gt;Some “port-only” observations require operator verification (e.g., distinguishing a ghost port from a genuine service).&lt;/li&gt;
&lt;li&gt;The CVE mappings are conservative upper-bound estimates that need operator confirmation for actionable triage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See the full methodology, opsec and reproducibility appendix for the exact scanner commands and the policy I used for CVE confidence. (link: &lt;a href="https://github.com/pgdn-oss/pgdn-cve" rel="noopener noreferrer"&gt;https://github.com/pgdn-oss/pgdn-cve&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  What I recommend to protocol teams and operators
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Fund or mandate periodic external posture reviews as part of release processes.&lt;/li&gt;
&lt;li&gt;Automate external smoke tests that confirm management APIs and metrics are not exposed.&lt;/li&gt;
&lt;li&gt;Make “no-management-exposed” a documented runbook for deployment.&lt;/li&gt;
&lt;li&gt;Share anonymised exposure telemetry so the community can learn and raise the bar.&lt;/li&gt;
&lt;/ul&gt;
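
&lt;p&gt;The smoke-test bullet above can be sketched in a few lines. This is a minimal illustration, not the tooling from the report; the host name and port list are placeholders you would swap for your own:&lt;/p&gt;

```python
import socket

# Hypothetical smoke test: verify that management/metrics ports are NOT
# reachable from outside. The port list is an illustrative placeholder.
MANAGEMENT_PORTS = [2375, 2376, 9184]

def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def exposed_ports(host, ports=MANAGEMENT_PORTS):
    """List management ports that answer from the scanning vantage point."""
    return [p for p in ports if is_reachable(host, p)]

# In CI, fail the deploy when the list is non-empty, e.g.:
# assert exposed_ports("validator.example.com") == []
```

&lt;p&gt;Run from an external vantage point on a schedule, this doubles as the drift monitor from the list above.&lt;/p&gt;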

&lt;h2&gt;
  
  
  Closing — why I published this
&lt;/h2&gt;

&lt;p&gt;This is about shared risk and learning. Great protocol engineering doesn’t immunise an operator against mistakes in deployment and ops. My hope: this write-up becomes a practical resource for teams and operators to make the mesh of Sui (and similar networks) safer for everyone.&lt;/p&gt;

&lt;p&gt;Full report (data, scripts, and appendices): &lt;a href="https://github.com/pgdn-oss/sui-network-report-250819" rel="noopener noreferrer"&gt;https://github.com/pgdn-oss/sui-network-report-250819&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’ve been building something new that takes this kind of analysis much further — automating external risk discovery at scale. More on that soon.&lt;/p&gt;

&lt;p&gt;Thanks for reading. Simon.&lt;/p&gt;

</description>
      <category>security</category>
      <category>blockchain</category>
      <category>sui</category>
      <category>cybersecurity</category>
    </item>
    <item>
      <title>Teaching Security Scanners to Remember - Using Vector Embeddings to Stop Chasing Ghost Ports</title>
      <dc:creator>Simon Morley</dc:creator>
      <pubDate>Tue, 14 Oct 2025 13:19:09 +0000</pubDate>
      <link>https://forem.com/simon_morley/teaching-security-scanners-to-remember-using-vector-embeddings-to-stop-chasing-ghost-ports-of7</link>
      <guid>https://forem.com/simon_morley/teaching-security-scanners-to-remember-using-vector-embeddings-to-stop-chasing-ghost-ports-of7</guid>
      <description>&lt;p&gt;I've scanned the same 118 blockchain validator nodes probably 200 times over the past year. And for most of that time, my scanner was an idiot with amnesia - treating scan #200 exactly like scan #1, learning nothing.&lt;/p&gt;

&lt;p&gt;Every single time, ports 2375 and 2376 showed up as "open." Every single time, my tools dutifully tested them for Docker APIs. Every single time, they found nothing. Ten seconds wasted per scan, multiplied by hundreds of scans, just... gone.&lt;/p&gt;

&lt;p&gt;Then I had a thought: What if my scanner could remember?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Ghost Port Problem
&lt;/h2&gt;

&lt;p&gt;Here's what kept happening across all 118+ nodes, spanning multiple cloud providers and geographies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ports 2375/2376 (standard Docker API ports) responded to TCP handshakes&lt;/li&gt;
&lt;li&gt;But curl hung. Netcat got EOF immediately. No banner, no service, nothing&lt;/li&gt;
&lt;li&gt;Identical TCP fingerprints every time: TTL≈63, window=65408&lt;/li&gt;
&lt;li&gt;These were otherwise hardened validator nodes with strict firewalls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional security scanners reported these as "open/tcpwrapped" or "unknown service." Which meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repeated Docker API testing (10+ seconds per port)&lt;/li&gt;
&lt;li&gt;Manual investigation on every scan&lt;/li&gt;
&lt;li&gt;False positives in my reports&lt;/li&gt;
&lt;li&gt;Wasted scanning budget when cloud providers flagged excessive probes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After the 50th identical scan, I was done. There had to be a better way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vector Embeddings: Not Just for Chatbots
&lt;/h2&gt;

&lt;p&gt;Vector embeddings are typically associated with NLP and RAG systems — turning text into high-dimensional vectors where semantically similar things cluster together. But the core concept is universal: represent complex data as points in space, then query "what's similar to this?"&lt;/p&gt;

&lt;p&gt;What if each network scan became a vector representing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Port combinations and states&lt;/li&gt;
&lt;li&gt;TCP-level behaviors (TTL, window size, response timing)&lt;/li&gt;
&lt;li&gt;Application-layer responses&lt;/li&gt;
&lt;li&gt;Infrastructure context (hosting provider, network profile)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then instead of treating every scan independently, I could query: &lt;strong&gt;"What have I learned from similar infrastructure before?"&lt;/strong&gt;&lt;/p&gt;
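
&lt;p&gt;As a rough illustration (the field names are mine, not the actual schema), flattening those four feature groups into embedding-ready text might look like:&lt;/p&gt;

```python
# Hypothetical sketch of turning one scan into embedding-ready text.
# Field names are illustrative, not the scanner's real schema.
def scan_to_text(scan):
    """Flatten a scan record into a stable, structured string that an
    off-the-shelf text embedding model can encode."""
    lines = [
        "ports: " + ",".join(str(p) for p in sorted(scan["ports"])),
        "tcp: ttl={ttl} window={window} rtt_ms={rtt_ms}".format(**scan["tcp"]),
        "app: " + (scan.get("banner") or "no banner"),
        "infra: asn={asn} provider={provider}".format(**scan["infra"]),
    ]
    return "\n".join(lines)

example = {
    "ports": [22, 80, 2375, 2376, 9000, 9184],
    "tcp": {"ttl": 63, "window": 65408, "rtt_ms": 12},
    "banner": None,
    "infra": {"asn": "AS20473", "provider": "Vultr"},
}
text = scan_to_text(example)
```

&lt;p&gt;Keeping the field order and formatting stable matters: the more deterministic the text, the more meaningful the distances between embeddings.&lt;/p&gt;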

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;I built a three-part system:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alan (AI Planner)&lt;/strong&gt;: LLM-based decision engine that receives scan context and historical patterns, then generates optimized probe sequences&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stan (Executor)&lt;/strong&gt;: Runs the actual scanning commands (nmap, masscan, protocol probes) and captures behavioral metadata&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vince (Vector Memory)&lt;/strong&gt;: PostgreSQL with pgvector extension storing 1536-dimensional embeddings with cosine similarity search&lt;/p&gt;

&lt;p&gt;The flow looks like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stan discovers open ports → [22, 80, 2375, 2376, 9000, 9184]&lt;/li&gt;
&lt;li&gt;Vector memory finds similar historical scans&lt;/li&gt;
&lt;li&gt;Alan gets enriched context with patterns&lt;/li&gt;
&lt;li&gt;Alan generates optimized probe plan based on what worked before&lt;/li&gt;
&lt;li&gt;Results stored with behavioral fingerprint&lt;/li&gt;
&lt;li&gt;Embedding generated and indexed for future queries&lt;/li&gt;
&lt;/ol&gt;
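
&lt;p&gt;A toy version of that six-step loop, with all three components stubbed out (the function names mirror the Alan/Stan/Vince split, but the interfaces are invented for illustration):&lt;/p&gt;

```python
# Stubbed end-to-end pass over the six-step flow above.
def stan_scan(host):
    # 1. Executor discovers open ports (hard-coded for the sketch).
    return {"host": host, "ports": [22, 80, 2375, 2376, 9000, 9184]}

def vince_similar(scan, memory):
    # 2. Vector memory returns historical scans with matching behavior
    #    (real version: cosine similarity over embeddings, not equality).
    return [m for m in memory if m["ports"] == scan["ports"]]

def alan_plan(scan, history):
    # 3-4. Planner drops probes that history marks as ghost ports.
    ghosts = {p for h in history for p in h.get("ghost_ports", [])}
    return [p for p in scan["ports"] if p not in ghosts]

memory = [{"ports": [22, 80, 2375, 2376, 9000, 9184],
           "ghost_ports": [2375, 2376]}]
scan = stan_scan("203.0.113.7")
plan = alan_plan(scan, vince_similar(scan, memory))
# 5-6. Results plus the behavioral fingerprint would then be embedded
#      and stored for the next query.
```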

&lt;h2&gt;
  
  
  Setting Up pgvector
&lt;/h2&gt;

&lt;p&gt;I chose pgvector because it's PostgreSQL-native, mature, and way more cost-effective than managed vector databases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;validator_scans&lt;/span&gt; 
  &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;validator_scans_embedding_idx&lt;/span&gt; 
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;validator_scans&lt;/span&gt; 
  &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;ivfflat&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector_cosine_ops&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similarity queries are simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ports&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
       &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;validator_scans&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'90 days'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;query_embedding&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For embeddings, I use OpenAI's text-embedding-ada-002 (1536 dimensions) because it's dirt cheap ($0.0001 per 1K tokens) and handles structured text well.&lt;/p&gt;
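
&lt;p&gt;For intuition, the 1 - (embedding &amp;lt;=&amp;gt; query_embedding) expression in the SQL above works because pgvector's cosine-distance operator returns one minus cosine similarity. A pure-Python check of that identity:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def cosine_distance(a, b):
    # pgvector's vector_cosine_ops metric: one minus cosine similarity
    return 1.0 - cosine_similarity(a, b)

a, b = [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]
similarity = 1.0 - cosine_distance(a, b)   # mirrors the SQL expression
```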

&lt;h2&gt;
  
  
  Beyond Simple Signatures
&lt;/h2&gt;

&lt;p&gt;Traditional fingerprinting is rule-based:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IF port == 2375 AND banner contains "Docker" 
  THEN service = Docker API
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Vector-based learning captures behavior:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Port 2375: SYN-ACK succeeds, TTL=63, window=65408, 
 no banner, immediate FIN on data send, 
 appears alongside ports 9000+9184 (Sui consensus/metrics),
 ASN indicates Vultr hosting"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similarity search returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"47 similar scans: 46 showed identical 'ghost port' behavior,
 1 had actual Docker (flagged as anomaly)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Conclusion: 98% confidence this is NOT Docker, likely cloud infrastructure artifact&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Watching It Learn
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scans 1-20 (Initial learning)&lt;/strong&gt;: System tests Docker APIs as expected, stores behavioral metadata showing timeouts and connection refusals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scans 21-50 (Pattern recognition)&lt;/strong&gt;: Vector similarity search starts clustering:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query: Scan with ports [22, 80, 2375, 2376, 9000, 9184]

Top matches:
- Scan #14: 96% similarity → 2375/2376 ghost ports
- Scan #8:  94% similarity → 2375/2376 ghost ports  
- Scan #19: 93% similarity → 2375/2376 ghost ports

Pattern confidence: 0.85 (17/20 matching scans)
Recommendation: Skip Docker testing on 2375/2376
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
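
&lt;p&gt;The confidence arithmetic in that output is easy to reproduce; the 0.80 skip threshold below is illustrative, not a tuned value:&lt;/p&gt;

```python
import operator

# 17 of the 20 sufficiently-similar scans showed ghost-port behavior.
matching = 20
ghost = 17
confidence = ghost / matching          # 0.85

SKIP_FLOOR = 0.80                      # illustrative threshold
recommendation = (
    "Skip Docker testing on 2375/2376"
    if operator.ge(confidence, SKIP_FLOOR)
    else "Probe Docker APIs"
)
```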



&lt;p&gt;&lt;strong&gt;Scans 51+ (Optimized)&lt;/strong&gt;: High confidence behavioral signatures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"similar_scan_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;47&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.96&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"2375_behavior"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ghost_port - skip Docker probes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"estimated_time_saved"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"45s per scan"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;After 200 scans of the same infrastructure:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time efficiency&lt;/strong&gt;: 58 seconds per scan → 20 seconds per scan (66% reduction)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Probe efficiency&lt;/strong&gt;: 7.2 probes per host → 3.8 probes per host (47% less network traffic)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;False positives&lt;/strong&gt;: 2.4 per scan → 0.3 per scan (87% reduction)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern recognition speed&lt;/strong&gt;: Confident patterns (&amp;gt;0.85 similarity) after just 18-25 similar scans&lt;/p&gt;

&lt;p&gt;But here's the coolest part: &lt;strong&gt;anomaly detection&lt;/strong&gt;. On scan #73, port 2375 actually responded with a Docker API. The system immediately flagged it: "Unusual behavior — historical data shows 0.02% Docker response rate." Turned out to be a misconfigured node that needed immediate attention.&lt;/p&gt;
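
&lt;p&gt;A sketch of that anomaly flag (the 1% rarity threshold and the counts are invented for illustration): a behavior is anomalous when its historical frequency on similar infrastructure is rare.&lt;/p&gt;

```python
import operator

RARITY_THRESHOLD = 0.01   # flag anything seen in under 1% of similar scans

def is_anomaly(behavior, history_counts, total_scans):
    """True when this behavior is historically rare for this cluster."""
    rate = history_counts.get(behavior, 0) / total_scans
    return operator.lt(rate, RARITY_THRESHOLD)

# Scan #73: port 2375 suddenly speaks Docker, which history says is rare.
counts = {"ghost_port": 199, "docker_api_response": 1}
flagged = is_anomaly("docker_api_response", counts, total_scans=200)
```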

&lt;h2&gt;
  
  
  Practical Considerations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Similarity thresholds matter&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Homogeneous infrastructure (like validators): 0.75-0.85&lt;/li&gt;
&lt;li&gt;Mixed environments: 0.65-0.75&lt;/li&gt;
&lt;li&gt;Pentesting diverse targets: 0.60-0.70&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cold start problem&lt;/strong&gt;: First 10-20 scans of new infrastructure provide no optimization. Mitigation: seed database with known patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Temporal drift&lt;/strong&gt;: Infrastructure changes over time. I time-weight similarity to prefer recent scans.&lt;/p&gt;
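
&lt;p&gt;One simple way to implement that time weighting is exponential decay; the 30-day half-life below is illustrative, not the exact production formula:&lt;/p&gt;

```python
import math

HALF_LIFE_DAYS = 30.0

def time_weighted(similarity, age_days):
    """Decay raw cosine similarity so older scans count for less."""
    decay = math.exp(-age_days * math.log(2) / HALF_LIFE_DAYS)
    return similarity * decay

fresh = time_weighted(0.90, 0)      # a recent scan keeps full weight
stale = time_weighted(0.90, 30)     # one half-life halves the weight
```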

&lt;p&gt;&lt;strong&gt;Embedding overhead&lt;/strong&gt;: Adds 50-100ms per scan. I generate embeddings asynchronously in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Traditional security scanners treat every scan as a fresh start. They're like someone with no short-term memory, asking the same questions over and over. This made sense 20 years ago when each network was unique.&lt;/p&gt;

&lt;p&gt;But modern security teams scan thousands of similar nodes repeatedly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Development environments that clone production&lt;/li&gt;
&lt;li&gt;Auto-scaling cloud infrastructure&lt;/li&gt;
&lt;li&gt;Container clusters with identical configurations&lt;/li&gt;
&lt;li&gt;Blockchain validator networks (my use case)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vector-based behavioral fingerprinting lets scanners accumulate institutional knowledge that compounds over time. They get smarter with every scan, building confidence about what's normal and what's anomalous.&lt;/p&gt;

&lt;p&gt;As cloud infrastructure grows more complex — with synthetic network responses, polymorphic services, and dynamic topologies — we need security tools that learn. Not just from signature databases, but from their own experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm exploring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-modal embeddings combining text with numeric TCP fingerprints&lt;/li&gt;
&lt;li&gt;Transfer learning: do patterns from Sui validators apply to Ethereum nodes?&lt;/li&gt;
&lt;li&gt;Hierarchical clustering to automatically build infrastructure taxonomies&lt;/li&gt;
&lt;li&gt;Tracking temporal pattern evolution to detect infrastructure migrations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The core insight stands: &lt;strong&gt;every scan is a training example&lt;/strong&gt;. Stop forgetting. Start remembering.&lt;/p&gt;




&lt;p&gt;I’m publishing the open source code here: &lt;a href="https://github.com/pgdn-oss" rel="noopener noreferrer"&gt;github.com/pgdn-oss&lt;/a&gt;. Built with PostgreSQL, pgvector, and OpenAI embeddings. Part of a new venture, coming soon.&lt;/p&gt;

</description>
      <category>cybersecurity</category>
      <category>vectordatabase</category>
      <category>security</category>
      <category>openai</category>
    </item>
  </channel>
</rss>
