<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Stelia Developers</title>
    <description>The latest articles on Forem by Stelia Developers (@steliadevs).</description>
    <link>https://forem.com/steliadevs</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3806265%2F1ff16a74-a81d-487e-aca1-8c47291b88d3.jpg</url>
      <title>Forem: Stelia Developers</title>
      <link>https://forem.com/steliadevs</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/steliadevs"/>
    <language>en</language>
    <item>
      <title>Why we chose Ceph as part of our storage-related solutions for production-scale AI</title>
      <dc:creator>Stelia Developers</dc:creator>
      <pubDate>Thu, 19 Mar 2026 13:23:01 +0000</pubDate>
      <link>https://forem.com/steliadevs/why-we-chose-ceph-as-part-of-our-storage-related-solutions-for-production-scale-ai-22c5</link>
      <guid>https://forem.com/steliadevs/why-we-chose-ceph-as-part-of-our-storage-related-solutions-for-production-scale-ai-22c5</guid>
      <description>&lt;p&gt;In the fast-paced world of DevOps and cloud infrastructure, there is a natural gravitation toward tools that offer instant gratification. We value the "Day 1" experience: the single binary download, the five-minute setup, and the immediate results. When a tool allows you to go from zero to a working prototype in the time it takes to drink a coffee, it gains adoption rapid-fire.&lt;/p&gt;

&lt;p&gt;However, when you are architecting modern AI-ready cloud infrastructure from the ground up, the laws of physics – and the definition of success – are fundamentally different. We aren't simply hosting static websites or lightweight user databases. We are building the high-throughput pipelines required to feed petabytes of training data into hungry H100/H200 GPU clusters. We are managing Retrieval-Augmented Generation (RAG) workflows where millisecond latency isn't just a metric; it’s the difference between a functional product and a failed user experience.&lt;/p&gt;

&lt;p&gt;In this high-stakes environment, the pressure to take infrastructure shortcuts is overwhelming. For years, the industry standard advice for object storage has been MinIO. If you ask a room full of startup technical leaders what to use for S3-compatible storage, their answer will be MinIO because it’s simple, fast, and works out of the box.&lt;/p&gt;

&lt;p&gt;And they are not wrong. MinIO is an impressive piece of engineering. It is incredibly fast and offers a developer experience that feels like magic on Day 1.&lt;/p&gt;

&lt;p&gt;But at &lt;a href="//stelia.ai"&gt;Stelia&lt;/a&gt;, we realised early on that we couldn't optimise for Day 1. We had to optimise for Day 1,000. We are building a fortress for organisations' models, not a playground for prototypes. When we examined the long-term trajectory of the storage landscape, we saw a divergence between the free code and the paid product that was becoming too wide to ignore.&lt;/p&gt;

&lt;p&gt;We faced a critical architectural choice: build our platform on technologies that offer ease of use but introduce significant supply chain risk, or choose the hard option and undertake the engineering rigour required to build on a true, community-governed foundation.&lt;/p&gt;

&lt;p&gt;We chose the hard option; we chose to invest in long-term durability. And as a result, we selected Ceph as one part of our storage-related solutions.&lt;/p&gt;

&lt;p&gt;Below, we outline why we made that decision, and why we believe it ensures organisations' data is safer, cheaper, and more performant with us in the long run.&lt;/p&gt;

&lt;h2&gt;The evolution of open source business models&lt;/h2&gt;

&lt;p&gt;To understand why we moved away from the "easy" option, it is important to look at the business context without cynicism. Infrastructure companies need to monetise, and the "Open Core" model is a standard path. However, the strategies companies use to achieve profitability have profound downstream effects on the users building upon their software.&lt;/p&gt;

&lt;p&gt;Over the last few years, we have witnessed a slow, calculated pivot in the object storage market. This wasn't an overnight change. It was a gradual evolution that has made it increasingly difficult for infrastructure providers to rely on certain open-source projects without incurring massive enterprise licensing costs or legal complexity.&lt;/p&gt;

&lt;h2&gt;The licensing complexity (AGPLv3)&lt;/h2&gt;

&lt;p&gt;The first sign of this shift occurred in 2021, when the licensing landscape for MinIO changed from the permissive Apache 2.0 license to the GNU AGPLv3.&lt;/p&gt;

&lt;p&gt;For the uninitiated, the distinction between these licenses is massive. Apache 2.0 is the ‘do what you want, just give us credit' license. It allows for broad innovation and integration without legal strings attached.&lt;/p&gt;

&lt;p&gt;AGPLv3, however, is designed to close the "SaaS loophole". It essentially states that if you modify the software and let users interact with it over a network (which is the definition of a cloud service), you must release your source code as well.&lt;/p&gt;

&lt;p&gt;For a hobbyist or a student, this distinction is irrelevant. But for a corporation building a proprietary AI platform, AGPLv3 must be assessed with caution. It introduces legal ambiguity. The question is: “Does linking our internal orchestration layers to the storage backend potentially require us to open-source our proprietary app?”&lt;/p&gt;

&lt;p&gt;The answer is "maybe." In the world of enterprise risk management, "maybe" is a stop sign. This licensing move forces many companies into a corner: purchase a commercial license to avoid the headache, or accept some compliance risks. We wanted a foundation where the legal ground wouldn't shift beneath our feet.&lt;/p&gt;

&lt;h2&gt;The feature gap&lt;/h2&gt;

&lt;p&gt;Beyond the license, we began to notice a growing feature delta – a widening gap between what is available in the GitHub repository and what is sold in the enterprise binary.&lt;/p&gt;

&lt;p&gt;The most visible casualty of this shift was the Web Management Console. In earlier iterations, the open-source version provided a robust user interface for managing buckets, users, identity policies, and lifecycle rules. It was a true single pane of glass for administrators.&lt;/p&gt;

&lt;p&gt;Over time, however, the community version of this console was stripped down. Critical administrative features – such as OpenID Connect (OIDC) and LDAP integration for identity management, tiering configurations, and deep observability metrics – were removed or hidden behind the enterprise paywall. Today, the open-source console functions primarily as a file browser.&lt;/p&gt;

&lt;p&gt;If you want the full administrative suite to manage a multi-petabyte cluster, you are now expected to pay for the enterprise product. For us, this signalled that the open-source version was no longer viewed as a standalone product, but rather as a demo for the paid tier.&lt;/p&gt;

&lt;h2&gt;Entering maintenance mode&lt;/h2&gt;

&lt;p&gt;Perhaps the most challenging development for DevOps teams has been the operational friction introduced recently. With the open-source edition effectively entering what many in the community call "maintenance mode," the project has ceased to be a living, breathing foundation for new infrastructure.&lt;/p&gt;

&lt;p&gt;Innovation has been bifurcated. Performance tuning, AI-specific optimisations, and advanced replication features are increasingly channelled exclusively into the commercial product. Even more disruptive was the change in how binaries and Docker images are distributed.&lt;/p&gt;

&lt;p&gt;In a modern, containerised world, the inability to easily pull a verified, stable, and compliant image from a standard registry is a major hurdle. It forces teams to compile from source or rely on unverified third-party builds, introducing security risks into the supply chain. You cannot build a platform today on software that is essentially frozen in time.&lt;/p&gt;

&lt;h2&gt;The alternative: Ceph – an open-source ecosystem&lt;/h2&gt;

&lt;p&gt;When we decided to look for a different path, we turned to Ceph.&lt;/p&gt;

&lt;p&gt;Ceph is an open-source ecosystem, not just a product. Often described as the ‘Linux of Storage’, Ceph is a distributed storage platform that delivers Object, Block, and File storage on top of a single, unified data plane.&lt;/p&gt;

&lt;p&gt;The primary differentiator for us wasn't only the code; it was the governance.&lt;/p&gt;

&lt;p&gt;MinIO is controlled by a single corporation.&lt;/p&gt;

&lt;p&gt;Ceph, by contrast, is governed by the Ceph Foundation under the umbrella of the Linux Foundation. Its board includes representatives from industry giants like Red Hat, IBM, Canonical, and scientific organisations like CERN. There is no single leader who can wake up tomorrow and decide to deprecate the open-source version. The code truly belongs to the community.&lt;/p&gt;

&lt;p&gt;This governance structure aligns perfectly with our philosophy. We wanted a storage layer that would be as open and reliable in ten years as it is today.&lt;/p&gt;

&lt;p&gt;In fact, CERN is the ultimate showcase for Ceph. They don't just sit on the board; they rely on Ceph to manage over 100 petabytes of storage that underpins the IT infrastructure for the &lt;a href="https://home.cern/science/accelerators/large-hadron-collider" rel="noopener noreferrer"&gt;Large Hadron Collider&lt;/a&gt;. It is the high-performance backbone for their OpenStack cloud used by thousands of physicists to analyse particle collision data. For those sceptical about manageability, CERN's engineering team regularly publishes "Ten-year retrospective" talks on YouTube. These videos detail how a small team manages this massive, mission-critical environment using the exact same open-source code we use.&lt;/p&gt;

&lt;h2&gt;Technical deep dive: architecture &amp;amp; data placement&lt;/h2&gt;

&lt;p&gt;Governance aside, the technical differences between Ceph and its competitors are profound. If you are a developer or an architect, it is important to understand why Ceph is historically considered harder to use, and why that complexity buys you scalability that other systems struggle to match.&lt;/p&gt;

&lt;p&gt;The core difference lies in how these systems answer a simple question: "Where do I put this file?"&lt;/p&gt;

&lt;h2&gt;The "pool" problem in rigid architectures&lt;/h2&gt;

&lt;p&gt;Many object storage systems use a hashing ring architecture combined with erasure coding. In an ideal world, this creates a ‘shared-nothing’ architecture where every node is identical. This is fantastic for speed in small, static setups.&lt;/p&gt;

&lt;p&gt;However, this rigidity creates a massive problem when it's time to scale. In many of these systems, you cannot simply add one hard drive to a cluster. You generally have to scale by adding ‘server pools.’&lt;/p&gt;

&lt;p&gt;Imagine you start with a cluster of 4 nodes, each with 4 drives (16 drives total). If you run out of space, you typically cannot just plug a new 20TB drive into an empty slot. To maintain the geometry of the erasure coding, you often have to add another symmetrical set of 16 drives. This step-function scaling is incredibly expensive.&lt;/p&gt;

&lt;p&gt;Furthermore, these systems often lack automatic rebalancing. If you add a new pool of drives, new data is written there, but the old data stays on the old, full drives. You end up with "hot" and "cold" spots in your cluster. Your total throughput is limited by the performance of the new pool, rather than the aggregate power of the whole cluster.&lt;/p&gt;

&lt;h2&gt;Ceph and the CRUSH approach&lt;/h2&gt;

&lt;p&gt;Ceph takes a radically different approach. It eliminates the need for a central lookup table or rigid server pools using an algorithm called CRUSH (Controlled Replication Under Scalable Hashing).&lt;/p&gt;

&lt;p&gt;In legacy storage systems, a central Metadata Server acts like a librarian.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Request:&lt;/strong&gt; "Where is &lt;code&gt;training_data_batch_1.json&lt;/code&gt;?"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Librarian:&lt;/strong&gt; Checks database... "It is on Drive 4, Sector 2."&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As clusters grow to petabyte scale, this ‘librarian’ becomes a bottleneck. If the database gets too big or the librarian gets overwhelmed, the entire cloud slows down.&lt;/p&gt;

&lt;p&gt;Ceph fires the librarian.&lt;/p&gt;

&lt;p&gt;Instead, Ceph distributes a "map" of the cluster to every client (your application).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Request:&lt;/strong&gt; "I want to write &lt;code&gt;training_data_batch_1.json&lt;/code&gt;."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Client:&lt;/strong&gt; Runs the CRUSH algorithm locally. "Mathematically, given the current state of the cluster, this file must go to OSD #4."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; The client talks directly to OSD #4.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because the clients calculate data placement themselves, there is no central gateway bottleneck. You can hammer a Ceph cluster with millions of IOPS, and because the clients are doing the maths, the cluster scales linearly.&lt;/p&gt;
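&lt;p&gt;The idea can be illustrated with a toy sketch in Python. This is not CRUSH itself – the real algorithm walks a hierarchy of weighted buckets and failure-domain rules – but it shows the core property: any client holding the same cluster map computes the same placement from the object name alone, with no lookup service on the data path. The OSD names and replica count here are illustrative.&lt;/p&gt;

```python
import hashlib

def place(object_name: str, osds: list[str], replicas: int = 3) -> list[str]:
    """Deterministic placement via highest-random-weight (rendezvous) hashing.

    Every client holding the same list of OSDs computes the same answer,
    so no central metadata server is consulted when reading or writing.
    """
    def weight(osd: str) -> int:
        digest = hashlib.sha256(f"{object_name}:{osd}".encode()).digest()
        return int.from_bytes(digest, "big")

    # The 'replicas' highest-scoring OSDs win the object.
    return sorted(osds, key=weight, reverse=True)[:replicas]

osds = [f"osd.{i}" for i in range(8)]
targets = place("training_data_batch_1.json", osds)
# Every client running this computes the identical set of target OSDs.
```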

&lt;h2&gt;Self-healing data&lt;/h2&gt;

&lt;p&gt;This architectural difference shines when hardware fails – which, at scale, happens inevitably.&lt;/p&gt;

&lt;p&gt;In Ceph, if we add a single new hard drive, the cluster detects it. The CRUSH map updates to reflect the new capacity. The cluster then automatically begins moving data from full drives to the new empty drive in the background. It balances itself like water finding its level.&lt;/p&gt;
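&lt;p&gt;A toy model (again using rendezvous hashing as a stand-in for CRUSH, with illustrative names) shows why adding one drive triggers only proportional data movement rather than a full reshuffle: an object relocates only if the new drive now scores highest for it.&lt;/p&gt;

```python
import hashlib

def primary_osd(obj: str, osds: list[str]) -> str:
    # Highest-random-weight hashing: the drive with the top score owns the object.
    return max(osds, key=lambda osd: hashlib.sha256(f"{obj}:{osd}".encode()).digest())

before = [f"osd.{i}" for i in range(8)]
after = before + ["osd.8"]  # plug one new drive into the cluster

objects = [f"object-{n}" for n in range(10_000)]
moved = sum(primary_osd(o, before) != primary_osd(o, after) for o in objects)
print(f"{moved / len(objects):.1%} of objects remapped")  # roughly 1 in 9
```

&lt;p&gt;Only the objects claimed by the new drive relocate – about one ninth of them in this sketch – which is why the cluster can rebalance in the background instead of requiring a step-function rebuild.&lt;/p&gt;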

&lt;p&gt;Conversely, if a drive dies, Ceph marks it as "down" and immediately begins reconstructing the missing data bits onto the remaining survivors using its internal redundancy. We can sleep through a drive failure and replace it during standard business hours, knowing the data has already healed itself.&lt;/p&gt;

&lt;h2&gt;The complexity myth and the Kubernetes solution&lt;/h2&gt;

&lt;p&gt;The strongest argument against Ceph has historically been: "But it's so hard to manage."&lt;/p&gt;

&lt;p&gt;Five years ago, we would have agreed. Managing a Ceph cluster used to require deep expertise in Linux internals, manual editing of text configuration files, and hand-calculating placement groups. It was a beast.&lt;/p&gt;

&lt;p&gt;But the landscape has changed dramatically with the rise of Kubernetes and Rook.&lt;/p&gt;

&lt;p&gt;Rook is a Cloud Native Computing Foundation (CNCF) project that acts as an "operator" for Ceph. It brings cloud-native automation to storage. Rook handles the dirty work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deployment:&lt;/strong&gt; It automates the rollout of the storage daemons.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Upgrades:&lt;/strong&gt; Want to upgrade Ceph? Change one line of YAML, and Rook handles the rolling restart, ensuring data safety the whole time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Expansion:&lt;/strong&gt; Plug in new drives, and Rook detects them, provisions the Object Storage Daemons (OSDs), and begins the rebalancing process.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rook has democratised Ceph. It brings the ‘Day 1’ experience of Ceph much closer to the simplicity of other tools, without sacrificing the Day 1,000 power and freedom.&lt;/p&gt;
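&lt;p&gt;As a rough sketch, a minimal Rook CephCluster manifest looks like the following. The field structure follows Rook's CephCluster custom resource; the image tag, monitor count, and device selection here are illustrative and would be tuned for a real deployment.&lt;/p&gt;

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v18   # bump this one line and Rook performs the rolling upgrade
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3                       # three monitors for quorum
  storage:
    useAllNodes: true              # newly added nodes and drives are picked up automatically
    useAllDevices: true
```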

&lt;h2&gt;The developer cheat sheet&lt;/h2&gt;

&lt;p&gt;For the engineers and architects evaluating their options, here is how the two stacks compare in the current landscape:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkl2ykthsrlv5y4hrkst.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkl2ykthsrlv5y4hrkst.jpg" alt="Developer cheat sheet table" width="800" height="1071"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Don't rent your foundation&lt;/h2&gt;

&lt;p&gt;Our decision to choose Ceph wasn't about finding the easiest path; it was about finding the most sustainable one.&lt;/p&gt;

&lt;p&gt;It was about moving away from platforms which historically demonstrated a willingness to remove features, change licenses, and freeze open-source code. Eventually, those costs trickle down to the customer – either in the form of higher prices to cover enterprise licensing fees or, worse, forced migrations when the free version becomes unmaintainable.&lt;/p&gt;

&lt;p&gt;We will not pass that supply chain risk on to our customers.&lt;/p&gt;

&lt;p&gt;We chose Ceph because it allows us to offer organisations a storage layer that is battle-tested, infinitely scalable, and free from the threat of vendor lock-in.&lt;/p&gt;

&lt;p&gt;Ultimately:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We handle the complexity:&lt;/strong&gt; Ceph is complex under the hood. We take on the burden of tuning CRUSH maps, managing deep scrubbing, and balancing placement groups so customers just get a fast, resilient S3 endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We control the costs:&lt;/strong&gt; Because we aren't paying a per-terabyte tax to a proprietary software vendor, we don't have to charge customers one either. That means better egress rates and lower storage costs for your models.&lt;/p&gt;

&lt;p&gt;In the AI gold rush, many vendors optimise for speed to market. We focus on building infrastructure that remains dependable, performant and resilient when systems reach production scale.&lt;/p&gt;

</description>
      <category>ceph</category>
      <category>cloudstorage</category>
      <category>architecture</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why understanding application behaviour is the prerequisite for scaling AI</title>
      <dc:creator>Stelia Developers</dc:creator>
      <pubDate>Tue, 10 Mar 2026 13:48:25 +0000</pubDate>
      <link>https://forem.com/steliadevs/why-understanding-application-behaviour-is-the-prerequisite-for-scaling-ai-4m05</link>
      <guid>https://forem.com/steliadevs/why-understanding-application-behaviour-is-the-prerequisite-for-scaling-ai-4m05</guid>
      <description>&lt;p&gt;As AI systems move from experimental pilots into production-critical enterprise applications, the question of how to scale them reliably is front of mind.&lt;/p&gt;

&lt;p&gt;Scaling AI and ML workloads has long been assumed to be a linear matter of adding more infrastructure – an approach proven with earlier generations of web applications and databases. We see this assumption baked into technical teams across the enterprise landscape: provisioning more GPUs as inference latency degrades, and accelerating infrastructure procurement conversations as soon as training jobs stall.&lt;/p&gt;

&lt;p&gt;But in reality, scaling AI applications for reliable and lasting performance doesn’t begin with the infrastructure. It begins with determining application behaviour and ensuring the solution you design supports the specific performance priorities required.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“You cannot scale what you do not understand. Understanding application behaviour dictates hosting and delivery success.” Dave Hughes, Stelia CTO&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Decisions around scaling typically begin with ‘how much?’ before answering ‘what kind?’. By flipping these conversations on their head, we consider how different workload types express distinct behavioural traits, and how architecting with these traits in mind enables production-scale delivery of enterprise applications.&lt;/p&gt;

&lt;h2&gt;Why behavioural traits must define requirements&lt;/h2&gt;

&lt;p&gt;Every application has distinct objectives and operational constraints that shape how it behaves under load. Understanding these behavioural traits is key to revealing which architectural requirements matter most for achieving performance at scale.&lt;/p&gt;

&lt;p&gt;For example, a multiplayer gaming server’s highest priority is supporting concurrent users, which in production translates to holding thousands of persistent connections with continuous bidirectional data flow. A Minecraft server with 100 players logged in for 19-hour sessions demands long-lived stateful connections where session state must survive server restarts and memory must remain stable over extended periods.&lt;/p&gt;

&lt;p&gt;Comparing this to an e-commerce platform where users add items to a cart, triggering short-lived HTTP requests, stateless interactions and variable, bursty traffic – the performance priorities change completely.&lt;/p&gt;

&lt;p&gt;Each application’s behavioural traits directly correspond to the unique architectural requirements that performance at scale demands. While a gaming server with these performance demands requires connection-aware load balancing and graceful connection draining, an e-commerce platform’s architectural challenge shifts entirely toward sudden traffic spikes that demand elastic compute provisioning and cache efficiency.&lt;/p&gt;

&lt;p&gt;In practice, no single definition of "an application" should exist within scaling discussions: application behaviour spans multiple patterns, each demanding different scaling strategies and driving entirely different architectural choices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The table below illustrates some of the considerations different applications require:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqldour2i6irbm0oruhz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqldour2i6irbm0oruhz.jpg" alt="A table showing application type, connection pattern, and primary challenges" width="800" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;The gap between theoretical scaling and enterprise reality&lt;/h2&gt;

&lt;p&gt;While beginning with application behaviour under load in mind is the ideal approach, the reality is that most enterprise applications evolve from prototypes designed organically for immediate functionality, without complete architectural foresight of the requirements expected at production scale.&lt;/p&gt;

&lt;p&gt;At Stelia, we are often approached by teams struggling to progress successful pilots born from incremental feature additions, where scale was dismissed as a future problem until it became an urgent imperative. By this point, retrofitting an application designed without foresight costs both resources and time, as architectural decisions that made sense at prototype scale must be undone to remove production-scale blockers.&lt;/p&gt;

&lt;p&gt;In the current market, understanding how an application actually behaves under load from the outset is both a technical and strategic priority. Organisations cannot afford to lose competitive advantage due to hidden scaling constraints that could have been addressed earlier. When behavioural constraints become visible early, modification can be targeted rather than speculative, enabling faster time to market and more reliable production performance.&lt;/p&gt;

&lt;h2&gt;How can enterprises change tack to enable effective scaling of AI workloads?&lt;/h2&gt;

&lt;p&gt;Closing the gap between a behaviour-first approach and the reality of moving enterprise pilots to production scale requires a fundamental restructuring of approach. This transformation begins with visibility, progresses through targeted modification, and concludes with infrastructure decisions that support the application’s actual behaviour rather than fighting against it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Identify behavioural constraints from the outset.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Understanding application behaviour must begin with instrumentation under realistic load conditions, with the goal of observing actual runtime characteristics: profiling to determine where time is actually spent, where memory grows, and how data moves through the system.&lt;/p&gt;

&lt;p&gt;These observations will reveal the constraints that will determine whether the application is able to scale, and where modifications may be required.&lt;/p&gt;
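&lt;p&gt;As a minimal sketch of what that instrumentation can look like (the handler and workload below are placeholders; a real harness would replay production-shaped traffic), the point is simply to capture latency percentiles and memory growth as observed facts before any scaling decision is made:&lt;/p&gt;

```python
import time
import tracemalloc
from statistics import quantiles

def profile(handler, requests):
    """Drive a handler with a stream of requests and report the
    behavioural traits that matter for scaling decisions."""
    tracemalloc.start()
    latencies = []
    for request in requests:
        start = time.perf_counter()
        handler(request)
        latencies.append(time.perf_counter() - start)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    cuts = quantiles(latencies, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98], "peak_bytes": peak}

# Placeholder workload: a real run would use recorded production requests.
stats = profile(lambda n: sum(range(n)), [10_000] * 200)
```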

&lt;p&gt;&lt;strong&gt;2. Modify the application to remove scaling blockers.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With constraints in full view, the changes required follow directly from the application’s behavioural profile, and these application-level changes can be made before infrastructure compensations are layered on to hide inefficiencies.&lt;/p&gt;

&lt;p&gt;Modifications made at this stage will create a dynamic whereby infrastructure supports well-behaved applications, not attempts to fix poorly architected ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Architect hosting aligned to true behaviour.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Only after understanding and modifying an application’s behaviour can infrastructure decisions be made effectively, as instance types, orchestration patterns, and data locality strategies all flow directly from understanding an application’s performance requirements under load.&lt;/p&gt;

&lt;p&gt;The behavioural traits identified at the outset translate into concrete architectural choices, and infrastructure is designed to support requirements rather than forcing the application to conform to whatever happens to be available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Set appropriate governance and security boundaries.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Inevitably, different behavioural patterns demand different governance and security approaches. Real-time inference serving sensitive data operates under entirely different compliance and security requirements than batch training on anonymised datasets.&lt;/p&gt;

&lt;p&gt;Data residency, access controls, and audit requirements must align with both the application’s behaviour and the sensitivity of the data it processes.&lt;/p&gt;

&lt;h2&gt;Why full-stack expertise is essential&lt;/h2&gt;

&lt;p&gt;Executing this approach successfully, however, requires fluency across the entire stack. Application development, infrastructure provisioning, and performance optimisation are typically treated as separate disciplines with separate teams. But effective scaling demands understanding how these layers interact in operational environments.&lt;/p&gt;

&lt;p&gt;Such fluency across the stack is rare. Most organisations have deep expertise in one layer but lack the cross-stack fluency needed to diagnose behavioural constraints, modify applications appropriately, and architect infrastructure that supports the resulting behaviour.&lt;/p&gt;

&lt;p&gt;This is not a criticism of existing teams; it reflects how technical specialisation has evolved. But it does create a capability gap that must be addressed, either through building internal expertise or partnering with those who possess this holistic systems understanding. The teams that scale AI workloads successfully in this next phase of AI impact will be those who treat operationalising AI at scale as a unified problem rather than a set of isolated challenges.&lt;/p&gt;

&lt;h2&gt;Reframing the scaling question&lt;/h2&gt;

&lt;p&gt;Scaling AI workloads effectively doesn’t come down to a question of infrastructure capacity but instead one of understanding. Understanding how the application behaves under load, what constraints that behaviour creates, and how to architect systems that support rather than fight that behaviour.&lt;/p&gt;

&lt;p&gt;The organisations moving successfully from pilot to production are those that begin with observation rather than procurement. They instrument to understand actual runtime characteristics, modify applications to address the constraints those characteristics reveal, and only then make infrastructure decisions based on how the modified application actually performs.&lt;/p&gt;

&lt;p&gt;This approach requires a shift in how scaling problems are framed, flipping the conversation from ‘how much infrastructure is required?’ to ‘what kind of application are we dealing with, and what does it need to operate effectively at scale?’. Answer these questions first, and the infrastructure decisions follow naturally.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>architecture</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
