<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Stelia Developers</title>
    <description>The latest articles on Forem by Stelia Developers (@steliadevs).</description>
    <link>https://forem.com/steliadevs</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3806265%2F1ff16a74-a81d-487e-aca1-8c47291b88d3.jpg</url>
      <title>Forem: Stelia Developers</title>
      <link>https://forem.com/steliadevs</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/steliadevs"/>
    <language>en</language>
    <item>
      <title>Why we chose Ceph as part of our storage-related solutions for production-scale AI</title>
      <dc:creator>Stelia Developers</dc:creator>
      <pubDate>Thu, 19 Mar 2026 13:23:01 +0000</pubDate>
      <link>https://forem.com/steliadevs/why-we-chose-ceph-as-part-of-our-storage-related-solutions-for-production-scale-ai-22c5</link>
      <guid>https://forem.com/steliadevs/why-we-chose-ceph-as-part-of-our-storage-related-solutions-for-production-scale-ai-22c5</guid>
      <description>&lt;p&gt;In the fast-paced world of DevOps and cloud infrastructure, there is a natural gravitation toward tools that offer instant gratification. We value the "Day 1" experience: the single binary download, the five-minute setup, and the immediate results. When a tool allows you to go from zero to a working prototype in the time it takes to drink a coffee, it gains adoption rapid-fire.&lt;/p&gt;

&lt;p&gt;However, when you are architecting modern AI-ready cloud infrastructure from the ground up, the laws of physics – and the definition of success – are fundamentally different. We aren't simply hosting static websites or lightweight user databases. We are building the high-throughput pipelines required to feed petabytes of training data into hungry H100/H200 GPU clusters. We are managing Retrieval-Augmented Generation (RAG) workflows where millisecond latency isn't just a metric; it’s the difference between a functional product and a failed user experience.&lt;/p&gt;

&lt;p&gt;In this high-stakes environment, the pressure to take infrastructure shortcuts is overwhelming. For years, the industry standard advice for object storage has been MinIO. If you ask a room full of startup technical leaders what to use for S3-compatible storage, their answer will be MinIO because it’s simple, fast, and works out of the box.&lt;/p&gt;

&lt;p&gt;And they are not wrong. MinIO is an impressive piece of engineering. It is incredibly fast and offers a developer experience that feels like magic on Day 1.&lt;/p&gt;

&lt;p&gt;But at &lt;a href="//stelia.ai"&gt;Stelia&lt;/a&gt;, we realised early on that we couldn't optimise for Day 1. We had to optimise for Day 1,000. We are building a fortress for organisations' models, not a playground for prototypes. When we examined the long-term trajectory of the storage landscape, we saw a divergence between the free code and the paid product that was becoming too wide to ignore.&lt;/p&gt;

&lt;p&gt;We faced a critical architectural choice: build our platform on technologies that offer ease of use but introduce significant supply chain risk, or choose the hard option and undertake the engineering rigour required to build on a true, community-governed foundation.&lt;/p&gt;

&lt;p&gt;We chose the hard option; we chose to invest in long-term durability. And as a result, we selected Ceph as one part of our storage-related solutions.&lt;/p&gt;

&lt;p&gt;Below, we outline why we made that decision, and why we believe it ensures organisations' data is safer, cheaper, and more performant with us in the long run.&lt;/p&gt;

&lt;h2&gt;The evolution of open source business models&lt;/h2&gt;

&lt;p&gt;To understand why we moved away from the "easy" option, it is important to look at the business context without cynicism. Infrastructure companies need to monetise, and the "Open Core" model is a standard path. However, the strategies companies use to achieve profitability have profound downstream effects on the users building upon their software.&lt;/p&gt;

&lt;p&gt;Over the last few years, we have witnessed a slow, calculated pivot in the object storage market. This wasn't an overnight change. It was a gradual evolution that has made it increasingly difficult for infrastructure providers to rely on certain open-source projects without incurring massive enterprise licensing costs or legal complexity.&lt;/p&gt;

&lt;h2&gt;The licensing complexity (AGPLv3)&lt;/h2&gt;

&lt;p&gt;The first sign of this shift occurred in 2021, when the licensing landscape for MinIO changed from the permissive Apache 2.0 license to the GNU AGPLv3.&lt;/p&gt;

&lt;p&gt;For the uninitiated, the distinction between these licenses is massive. Apache 2.0 is the ‘do what you want, just give us credit' license. It allows for broad innovation and integration without legal strings attached.&lt;/p&gt;

&lt;p&gt;AGPLv3, however, is designed to close the "SaaS loophole". It essentially states that if you modify the software and let users interact with it over a network (which is the definition of a cloud service), you must release your source code as well.&lt;/p&gt;

&lt;p&gt;For a hobbyist or a student, this distinction is irrelevant. But for a corporation building a proprietary AI platform, AGPLv3 must be assessed with caution. It introduces legal ambiguity. The question is: “Does linking our internal orchestration layers to the storage backend potentially require us to open-source our proprietary app?”&lt;/p&gt;

&lt;p&gt;The answer is "maybe." In the world of enterprise risk management, "maybe" is a stop sign. This licensing move forces many companies into a corner: purchase a commercial license to avoid the headache, or accept some compliance risks. We wanted a foundation where the legal ground wouldn't shift beneath our feet.&lt;/p&gt;

&lt;h2&gt;The feature gap&lt;/h2&gt;

&lt;p&gt;Beyond the license, we began to notice a growing feature delta – a widening gap between what is available in the GitHub repository and what is sold in the enterprise binary.&lt;/p&gt;

&lt;p&gt;The most visible casualty of this shift was the Web Management Console. In earlier iterations, the open-source version provided a robust user interface for managing buckets, users, identity policies, and lifecycle rules. It was a true single pane of glass for administrators.&lt;/p&gt;

&lt;p&gt;Over time, however, the community version of this console was stripped down. Critical administrative features – such as OpenID Connect (OIDC) and LDAP integration for identity management, tiering configurations, and deep observability metrics – were removed or hidden behind the enterprise paywall. Today, the open-source console functions primarily as a file browser.&lt;/p&gt;

&lt;p&gt;If you want the full administrative suite to manage a multi-petabyte cluster, you are now expected to pay for the enterprise product. For us, this signalled that the open-source version was no longer viewed as a standalone product, but rather as a demo for the paid tier.&lt;/p&gt;

&lt;h2&gt;Entering maintenance mode&lt;/h2&gt;

&lt;p&gt;Perhaps the most challenging development for DevOps teams has been the operational friction introduced recently. With the open-source edition effectively entering what many in the community call "maintenance mode," the project has ceased to be a living, breathing foundation for new infrastructure.&lt;/p&gt;

&lt;p&gt;Innovation has been bifurcated. Performance tuning, AI-specific optimisations, and advanced replication features are increasingly channelled exclusively into the commercial product. Even more disruptive was the change in how binaries and Docker images are distributed.&lt;/p&gt;

&lt;p&gt;In a modern, containerised world, the inability to easily pull a verified, stable, and compliant image from a standard registry is a major hurdle. It forces teams to compile from source or rely on unverified third-party builds, introducing security risks into the supply chain. You cannot build a platform today on software that is essentially frozen in time.&lt;/p&gt;

&lt;h2&gt;The alternative: Ceph – an open-source ecosystem&lt;/h2&gt;

&lt;p&gt;When we decided to look for a different path, we turned to Ceph.&lt;/p&gt;

&lt;p&gt;Ceph is an open-source ecosystem, not just a product. Often described as the ‘Linux of Storage’, Ceph is a distributed storage platform that delivers Object, Block, and File storage on top of a single, unified data plane.&lt;/p&gt;

&lt;p&gt;The primary differentiator for us wasn't only the code; it was the governance.&lt;/p&gt;

&lt;p&gt;MinIO is controlled by a single corporation.&lt;/p&gt;

&lt;p&gt;Ceph, by contrast, is governed by the Ceph Foundation under the umbrella of the Linux Foundation. Its board includes representatives from industry giants like Red Hat, IBM, Canonical, and scientific organisations like CERN. There is no single leader who can wake up tomorrow and decide to deprecate the open-source version. The code truly belongs to the community.&lt;/p&gt;

&lt;p&gt;This governance structure aligns perfectly with our philosophy. We wanted a storage layer that would be as open and reliable in ten years as it is today.&lt;/p&gt;

&lt;p&gt;In fact, CERN is the ultimate showcase for Ceph. They don't just sit on the board; they rely on Ceph to manage over 100 petabytes of storage that underpins the IT infrastructure for the &lt;a href="https://home.cern/science/accelerators/large-hadron-collider" rel="noopener noreferrer"&gt;Large Hadron Collider&lt;/a&gt;. It is the high-performance backbone for their OpenStack cloud used by thousands of physicists to analyse particle collision data. For those sceptical about manageability, CERN's engineering team regularly publishes "Ten-year retrospective" talks on YouTube. These videos detail how a small team manages this massive, mission-critical environment using the exact same open-source code we use.&lt;/p&gt;

&lt;h2&gt;Technical deep dive: architecture &amp;amp; data placement&lt;/h2&gt;

&lt;p&gt;Governance aside, the technical differences between Ceph and its competitors are profound. If you are a developer or an architect, it is important to understand why Ceph is historically considered harder to use, and why that complexity buys you scalability that other systems struggle to match.&lt;/p&gt;

&lt;p&gt;The core difference lies in how these systems answer a simple question: "Where do I put this file?"&lt;/p&gt;

&lt;h2&gt;The "pool" problem in rigid architectures&lt;/h2&gt;

&lt;p&gt;Many object storage systems use a hashing ring architecture combined with erasure coding. In an ideal world, this creates a ‘shared-nothing’ architecture where every node is identical. This is fantastic for speed in small, static setups.&lt;/p&gt;

&lt;p&gt;However, this rigidity creates a massive problem when it's time to scale. In many of these systems, you cannot simply add one hard drive to a cluster. You generally have to scale by adding ‘server pools.’&lt;/p&gt;

&lt;p&gt;Imagine you start with a cluster of 4 nodes, each with 4 drives (16 drives total). If you run out of space, you typically cannot just plug a new 20TB drive into an empty slot. To maintain the geometry of the erasure coding, you often have to add another symmetrical set of 16 drives. This step-function scaling is incredibly expensive.&lt;/p&gt;

&lt;p&gt;Furthermore, these systems often lack automatic rebalancing. If you add a new pool of drives, new data is written there, but the old data stays on the old, full drives. You end up with "hot" and "cold" spots in your cluster. Your total throughput is limited by the performance of the new pool, rather than the aggregate power of the whole cluster.&lt;/p&gt;

&lt;h2&gt;Ceph and the CRUSH approach&lt;/h2&gt;

&lt;p&gt;Ceph takes a radically different approach. It eliminates the need for a central lookup table or rigid server pools using an algorithm called CRUSH (Controlled Replication Under Scalable Hashing).&lt;/p&gt;

&lt;p&gt;In legacy storage systems, a central Metadata Server acts like a librarian.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Request:&lt;/strong&gt; "Where is &lt;code&gt;training_data_batch_1.json&lt;/code&gt;?"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Librarian:&lt;/strong&gt; Checks database... "It is on Drive 4, Sector 2."&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As clusters grow to petabyte scale, this ‘librarian’ becomes a bottleneck. If the database gets too big or the librarian gets overwhelmed, the entire cloud slows down.&lt;/p&gt;

&lt;p&gt;Ceph fires the librarian.&lt;/p&gt;

&lt;p&gt;Instead, Ceph distributes a "map" of the cluster to every client (your application).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Request:&lt;/strong&gt; "I want to write &lt;code&gt;training_data_batch_1.json&lt;/code&gt;."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Client:&lt;/strong&gt; Runs the CRUSH algorithm locally. "Mathematically, given the current state of the cluster, this file must go to OSD #4."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; The client talks directly to OSD #4.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because the clients calculate data placement themselves, there is no central gateway bottleneck. You can hammer a Ceph cluster with millions of IOPS, and because the clients are doing the maths, the cluster scales linearly.&lt;/p&gt;
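&lt;p&gt;The idea can be illustrated with a toy sketch in Python. This is not CRUSH itself – the real algorithm walks a hierarchy of weighted buckets and failure-domain rules – but it shows the core property: any client holding the same cluster map computes the same placement from the object name alone, with no lookup service on the data path. The OSD names and replica count here are illustrative.&lt;/p&gt;

```python
import hashlib

def place(object_name: str, osds: list[str], replicas: int = 3) -> list[str]:
    """Deterministic placement via highest-random-weight (rendezvous) hashing.

    Every client holding the same list of OSDs computes the same answer,
    so no central metadata server is consulted when reading or writing.
    """
    def weight(osd: str) -> int:
        digest = hashlib.sha256(f"{object_name}:{osd}".encode()).digest()
        return int.from_bytes(digest, "big")

    # The 'replicas' highest-scoring OSDs win the object.
    return sorted(osds, key=weight, reverse=True)[:replicas]

osds = [f"osd.{i}" for i in range(8)]
targets = place("training_data_batch_1.json", osds)
# Every client running this computes the identical set of target OSDs.
```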

&lt;h2&gt;Self-healing data&lt;/h2&gt;

&lt;p&gt;This architectural difference shines when hardware fails – which, at scale, happens inevitably.&lt;/p&gt;

&lt;p&gt;In Ceph, if we add a single new hard drive, the cluster detects it. The CRUSH map updates to reflect the new capacity. The cluster then automatically begins moving data from full drives to the new empty drive in the background. It balances itself like water finding its level.&lt;/p&gt;
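&lt;p&gt;A toy model (again using rendezvous hashing as a stand-in for CRUSH, with illustrative names) shows why adding one drive triggers only proportional data movement rather than a full reshuffle: an object relocates only if the new drive now scores highest for it.&lt;/p&gt;

```python
import hashlib

def primary_osd(obj: str, osds: list[str]) -> str:
    # Highest-random-weight hashing: the drive with the top score owns the object.
    return max(osds, key=lambda osd: hashlib.sha256(f"{obj}:{osd}".encode()).digest())

before = [f"osd.{i}" for i in range(8)]
after = before + ["osd.8"]  # plug one new drive into the cluster

objects = [f"object-{n}" for n in range(10_000)]
moved = sum(primary_osd(o, before) != primary_osd(o, after) for o in objects)
print(f"{moved / len(objects):.1%} of objects remapped")  # roughly 1 in 9
```

&lt;p&gt;Only the objects claimed by the new drive relocate – about one ninth of them in this sketch – which is why the cluster can rebalance in the background instead of requiring a step-function rebuild.&lt;/p&gt;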

&lt;p&gt;Conversely, if a drive dies, Ceph marks it as "down" and immediately begins reconstructing the missing data bits onto the remaining survivors using its internal redundancy. We can sleep through a drive failure and replace it during standard business hours, knowing the data has already healed itself.&lt;/p&gt;

&lt;h2&gt;The complexity myth and the Kubernetes solution&lt;/h2&gt;

&lt;p&gt;The strongest argument against Ceph has historically been: "But it's so hard to manage."&lt;/p&gt;

&lt;p&gt;Five years ago, we would have agreed. Managing a Ceph cluster used to require deep expertise in Linux internals, manual editing of text configuration files, and hand-calculating placement groups. It was a beast.&lt;/p&gt;

&lt;p&gt;But the landscape has changed dramatically with the rise of Kubernetes and Rook.&lt;/p&gt;

&lt;p&gt;Rook is a Cloud Native Computing Foundation (CNCF) project that acts as an "operator" for Ceph. It brings cloud-native automation to storage. Rook handles the dirty work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deployment:&lt;/strong&gt; It automates the rollout of the storage daemons.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Upgrades:&lt;/strong&gt; Want to upgrade Ceph? Change one line of YAML, and Rook handles the rolling restart, ensuring data safety the whole time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Expansion:&lt;/strong&gt; Plug in new drives, and Rook detects them, provisions the Object Storage Daemons (OSDs), and begins the rebalancing process.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rook has democratised Ceph. It brings the ‘Day 1’ experience of Ceph much closer to the simplicity of other tools, without sacrificing the Day 1,000 power and freedom.&lt;/p&gt;
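&lt;p&gt;As a rough sketch, a minimal Rook CephCluster manifest looks like the following. The field structure follows Rook's CephCluster custom resource; the image tag, monitor count, and device selection here are illustrative and would be tuned for a real deployment.&lt;/p&gt;

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v18   # bump this one line and Rook performs the rolling upgrade
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3                       # three monitors for quorum
  storage:
    useAllNodes: true              # newly added nodes and drives are picked up automatically
    useAllDevices: true
```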

&lt;h2&gt;The developer cheat sheet&lt;/h2&gt;

&lt;p&gt;For the engineers and architects evaluating their options, here is how the two stacks compare in the current landscape:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkl2ykthsrlv5y4hrkst.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkl2ykthsrlv5y4hrkst.jpg" alt="Developer cheat sheet table" width="800" height="1071"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Don't rent your foundation&lt;/h2&gt;

&lt;p&gt;Our decision to choose Ceph wasn't about finding the easiest path; it was about finding the most sustainable one.&lt;/p&gt;

&lt;p&gt;It was about moving away from platforms which historically demonstrated a willingness to remove features, change licenses, and freeze open-source code. Eventually, those costs trickle down to the customer – either in the form of higher prices to cover enterprise licensing fees or, worse, forced migrations when the free version becomes unmaintainable.&lt;/p&gt;

&lt;p&gt;We will not pass that supply chain risk on to our customers.&lt;/p&gt;

&lt;p&gt;We chose Ceph because it allows us to offer organisations a storage layer that is battle-tested, infinitely scalable, and free from the threat of vendor lock-in.&lt;/p&gt;

&lt;p&gt;Ultimately:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We handle the complexity:&lt;/strong&gt; Ceph is complex under the hood. We take on the burden of tuning CRUSH maps, managing deep scrubbing, and balancing placement groups so customers just get a fast, resilient S3 endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We control the costs:&lt;/strong&gt; Because we aren't paying a per-terabyte tax to a proprietary software vendor, we don't have to charge customers one either. That means better egress rates and lower storage costs for your models.&lt;/p&gt;

&lt;p&gt;In the AI gold rush, many vendors optimise for speed to market. We focus on building infrastructure that remains dependable, performant and resilient when systems reach production scale.&lt;/p&gt;

</description>
      <category>ceph</category>
      <category>cloudstorage</category>
      <category>architecture</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why understanding application behaviour is the prerequisite for scaling AI</title>
      <dc:creator>Stelia Developers</dc:creator>
      <pubDate>Tue, 10 Mar 2026 13:48:25 +0000</pubDate>
      <link>https://forem.com/steliadevs/why-understanding-application-behaviour-is-the-prerequisite-for-scaling-ai-4m05</link>
      <guid>https://forem.com/steliadevs/why-understanding-application-behaviour-is-the-prerequisite-for-scaling-ai-4m05</guid>
      <description>&lt;p&gt;As AI systems move from experimental pilots into production-critical enterprise applications, the question of how to scale them reliably is front of mind.&lt;/p&gt;

&lt;p&gt;Scaling AI and ML workloads has long been assumed to be a linear matter of adding more infrastructure – an approach proven with earlier generations of web applications and databases. We see this assumption baked into technical teams across the enterprise landscape: provisioning more GPUs as inference latency degrades, and accelerating infrastructure procurement conversations as soon as training jobs stall.&lt;/p&gt;

&lt;p&gt;But in reality, scaling AI applications for reliable and lasting performance doesn’t begin with the infrastructure. It begins with determining application behaviour and ensuring the solution you design supports the specific performance priorities required.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“You cannot scale what you do not understand. Understanding application behaviour dictates hosting and delivery success.” Dave Hughes, Stelia CTO&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Decisions around scaling typically begin with ‘how much?’ before answering ‘what kind?’. By flipping these conversations on their head, we consider how different workload types express distinct behavioural traits, and how architecting with these traits in mind enables production-scale delivery of enterprise applications.&lt;/p&gt;

&lt;h2&gt;Why behavioural traits must define requirements&lt;/h2&gt;

&lt;p&gt;Every application has distinct objectives and operational constraints that shape how it behaves under load. Understanding these behavioural traits is key to revealing which architectural requirements matter most for achieving performance at scale.&lt;/p&gt;

&lt;p&gt;For example, a multiplayer gaming server’s highest priority is supporting concurrent users, which in production translates to holding thousands of persistent connections with continuous bidirectional data flow. A Minecraft server with 100 players logged in for 19-hour sessions demands long-lived stateful connections where session state must survive server restarts and memory must remain stable over extended periods.&lt;/p&gt;

&lt;p&gt;Comparing this to an e-commerce platform where users add items to a cart, triggering short-lived HTTP requests, stateless interactions and variable, bursty traffic – the performance priorities change completely.&lt;/p&gt;

&lt;p&gt;Each application’s behavioural traits directly correspond to the unique architectural requirements that performance at scale demands. While a gaming server with these performance demands requires connection-aware load balancing and graceful connection draining, an e-commerce platform’s architectural challenge shifts entirely toward sudden traffic spikes that demand elastic compute provisioning and cache efficiency.&lt;/p&gt;

&lt;p&gt;In practice, no single definition of "an application" should exist within scaling discussions: application behaviour spans multiple patterns, each demanding different scaling strategies and driving entirely different architectural choices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The table below illustrates some of the considerations different applications require:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqldour2i6irbm0oruhz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqldour2i6irbm0oruhz.jpg" alt="A table showing application type, connection pattern, and primary challenges" width="800" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;The gap between theoretical scaling and enterprise reality&lt;/h2&gt;

&lt;p&gt;While beginning with application behaviour under load in mind is the ideal approach, the reality is that most enterprise applications evolve from prototypes designed organically for immediate functionality, without complete architectural foresight of the requirements expected at production scale.&lt;/p&gt;

&lt;p&gt;At Stelia, we are often approached by teams struggling to progress successful pilots born from incremental feature additions, where scale was dismissed as a future problem until it became an urgent imperative. By this point, retrofitting an application designed without foresight costs both resources and time, as architectural decisions that made sense at prototype scale must be undone to remove production-scale blockers.&lt;/p&gt;

&lt;p&gt;In the current market, understanding how an application actually behaves under load from the outset is both a technical and strategic priority. Organisations cannot afford to lose competitive advantage due to hidden scaling constraints that could have been addressed earlier. When behavioural constraints become visible early, modification can be targeted rather than speculative, enabling faster time to market and more reliable production performance.&lt;/p&gt;

&lt;h2&gt;How can enterprises change tack to enable effective scaling of AI workloads?&lt;/h2&gt;

&lt;p&gt;Closing the gap between a behaviour-first approach and the reality of moving enterprise pilots to production scale requires a fundamental restructuring of approach. This transformation begins with visibility, progresses through targeted modification, and concludes with infrastructure decisions that support the application’s actual behaviour rather than fighting against it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Identify behavioural constraints from the outset.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Understanding application behaviour must begin with instrumentation under realistic load conditions, with the goal of observing actual runtime characteristics: profiling to determine where time is actually spent, where memory grows, and how data moves through the system.&lt;/p&gt;

&lt;p&gt;These observations will reveal the constraints that will determine whether the application is able to scale, and where modifications may be required.&lt;/p&gt;
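&lt;p&gt;As a minimal sketch of what that instrumentation can look like (the handler and workload below are placeholders; a real harness would replay production-shaped traffic), the point is simply to capture latency percentiles and memory growth as observed facts before any scaling decision is made:&lt;/p&gt;

```python
import time
import tracemalloc
from statistics import quantiles

def profile(handler, requests):
    """Drive a handler with a stream of requests and report the
    behavioural traits that matter for scaling decisions."""
    tracemalloc.start()
    latencies = []
    for request in requests:
        start = time.perf_counter()
        handler(request)
        latencies.append(time.perf_counter() - start)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    cuts = quantiles(latencies, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98], "peak_bytes": peak}

# Placeholder workload: a real run would use recorded production requests.
stats = profile(lambda n: sum(range(n)), [10_000] * 200)
```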

&lt;p&gt;&lt;strong&gt;2. Modify the application to remove scaling blockers.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With constraints in full view, the changes required follow directly from the application’s behavioural profile, and these application-level changes can be made before infrastructure compensations are layered on to hide inefficiencies.&lt;/p&gt;

&lt;p&gt;Modifications made at this stage will create a dynamic whereby infrastructure supports well-behaved applications, not attempts to fix poorly architected ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Architect hosting aligned to true behaviour.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Only after understanding and modifying an application’s behaviour can infrastructure decisions be made effectively, as instance types, orchestration patterns, and data locality strategies all flow directly from understanding an application’s performance requirements under load.&lt;/p&gt;

&lt;p&gt;The behavioural traits identified at the outset translate into concrete architectural choices, and infrastructure is designed to support requirements rather than forcing the application to conform to whatever happens to be available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Set appropriate governance and security boundaries.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Inevitably, different behavioural patterns demand different governance and security approaches. Real-time inference serving sensitive data operates under entirely different compliance and security requirements than batch training on anonymised datasets.&lt;/p&gt;

&lt;p&gt;Data residency, access controls, and audit requirements must align with both the application’s behaviour and the sensitivity of the data it processes.&lt;/p&gt;

&lt;h2&gt;Why full-stack expertise is essential&lt;/h2&gt;

&lt;p&gt;Executing this approach successfully, however, requires fluency across the entire stack. Application development, infrastructure provisioning, and performance optimisation are typically treated as separate disciplines with separate teams. But effective scaling demands understanding how these layers interact in operational environments.&lt;/p&gt;

&lt;p&gt;Such fluency across the stack is rare. Most organisations have deep expertise in one layer but lack the cross-stack fluency needed to diagnose behavioural constraints, modify applications appropriately, and architect infrastructure that supports the resulting behaviour.&lt;/p&gt;

&lt;p&gt;This is not a criticism of existing teams; it reflects how technical specialisation has evolved. But it does create a capability gap that must be addressed, either through building internal expertise or partnering with those who possess this holistic systems understanding. The teams that scale AI workloads successfully in this next phase of AI impact will be those who treat operationalising AI at scale as a unified problem rather than a set of isolated challenges.&lt;/p&gt;

&lt;h2&gt;Reframing the scaling question&lt;/h2&gt;

&lt;p&gt;Scaling AI workloads effectively doesn’t come down to a question of infrastructure capacity but instead one of understanding. Understanding how the application behaves under load, what constraints that behaviour creates, and how to architect systems that support rather than fight that behaviour.&lt;/p&gt;

&lt;p&gt;The organisations moving successfully from pilot to production are those that begin with observation rather than procurement. They instrument to understand actual runtime characteristics, modify applications to address the constraints those characteristics reveal, and only then make infrastructure decisions based on how the modified application actually performs.&lt;/p&gt;

&lt;p&gt;This approach requires a shift in how scaling problems are framed, flipping the conversation from ‘how much infrastructure is required?’ to ‘what kind of application are we dealing with, and what does it need to operate effectively at scale?’. Answer these questions first, and the infrastructure decisions follow naturally.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>architecture</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
