<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Emin Mammadov</title>
    <description>The latest articles on Forem by Emin Mammadov (@iameminmammadov).</description>
    <link>https://forem.com/iameminmammadov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1424685%2F0dfe34b7-1db7-4fc2-9b49-ac2ea7712e92.jpeg</url>
      <title>Forem: Emin Mammadov</title>
      <link>https://forem.com/iameminmammadov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/iameminmammadov"/>
    <language>en</language>
    <item>
      <title>Setting up Ray on GKE: How I spent a week optimising Docker pulls</title>
      <dc:creator>Emin Mammadov</dc:creator>
      <pubDate>Sat, 09 May 2026 20:27:03 +0000</pubDate>
      <link>https://forem.com/iameminmammadov/setting-up-ray-on-gke-how-i-spent-a-week-optimising-docker-pulls-13l9</link>
      <guid>https://forem.com/iameminmammadov/setting-up-ray-on-gke-how-i-spent-a-week-optimising-docker-pulls-13l9</guid>
      <description>&lt;p&gt;I spent a week debugging slow Ray cluster starts on GKE. The fix was a region mismatch that is not very obvious from the docs. &lt;/p&gt;

&lt;p&gt;We've been running Ray on GKE (with Anyscale) for over a year on the AI Platform team at Geotab. As self-hosted LLM workloads grow, Ray is one of the tools that makes scaling them practical. Introducing Ray and making it a go-to platform for multiple teams has been a rewarding but challenging path. One issue I kept running into: slow Ray cluster spawn times. Here's where the time actually went, and what helped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. GKE node provisioning: 2-3 minutes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When Ray's autoscaler asks for a new node, GKE has to allocate a VM, boot the OS, register the kubelet, and join the cluster. GPU nodes add another 30-50 seconds for driver installation. We treated this as a baseline cost - there's no point optimizing anything else until the node exists. That may change a bit: GCP recently introduced GKE Active Buffer, which aims to minimize that time. I haven't tested it yet, but it's on the list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Image pull: 10+ minutes (and where I lost a week)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ray + ML container images are big. An LLM-flavored image easily hits 10-15 GB; even a classic CV image with PyTorch lands at 13 GB+. Pulling that fresh on every new node took 15-20 minutes.&lt;/p&gt;

&lt;p&gt;GKE Image Streaming is supposed to fix this by letting containers start before the full image has been pulled. However, even after enabling it, pulls still occasionally took 20+ minutes.&lt;/p&gt;

&lt;p&gt;What made this brutal to debug:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It didn't fail consistently. Users didn't always report it.&lt;/li&gt;
&lt;li&gt;Anyscale assigns new pod names on each restart, so by the time I went looking, the original pod was gone and pulling logs at the pod level was impossible.&lt;/li&gt;
&lt;li&gt;The log volume is high. Without precise timestamps, finding the relevant entries is painful.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The detail that isn't obvious from the docs: &lt;strong&gt;Image Streaming requires your Artifact Registry repo to be in the same region as your GKE nodes.&lt;/strong&gt; A cluster in &lt;code&gt;us-central1&lt;/code&gt; with a repo in the &lt;code&gt;us&lt;/code&gt; multi-region doesn't enable streaming; it silently falls back to a normal pull.&lt;/p&gt;
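&lt;p&gt;To rule out this mismatch, compare where your repos live against where your nodes run. A sketch with placeholder names; I believe these flags are current, but verify against your &lt;code&gt;gcloud&lt;/code&gt; version:&lt;/p&gt;

```shell
# List repos; the LOCATION column distinguishes the "us" multi-region
# from a specific region like "us-central1".
gcloud artifacts repositories list --project=PROJECT_ID

# Compare with the zones/regions the cluster's nodes run in.
gcloud container clusters describe CLUSTER_NAME \
   --location=CONTROL_PLANE_LOCATION \
   --format="value(locations)"
```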

&lt;p&gt;The very first step is to ensure that Image Streaming is actually enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud container clusters describe &amp;lt;cluster_name&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;control_plane_location&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--flatten&lt;/span&gt; &lt;span class="s2"&gt;"nodePoolDefaults.nodeConfigDefaults"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;gsfsConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
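&lt;p&gt;If the flag shows as disabled, Image Streaming can be turned on for an existing cluster. A sketch with placeholder names (double-check the flag against your &lt;code&gt;gcloud&lt;/code&gt; version):&lt;/p&gt;

```shell
# Enable Image Streaming cluster-wide; new node pools inherit it.
gcloud container clusters update CLUSTER_NAME \
   --location=CONTROL_PLANE_LOCATION \
   --enable-image-streaming
```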



&lt;p&gt;To verify it's actually engaging on a specific node, this Cloud Logging query was what cracked the case for me:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;resource.type="k8s_node"&lt;/span&gt;
&lt;span class="s"&gt;resource.labels.node_name="&amp;lt;name_of_node&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The logs showed Image Streaming was enabled but not engaging, which led me to the regional requirement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Disk speed on the nodes themselves&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once images do pull, they have to land on disk. We were using HDD-backed nodes. Switching to SSD cut Docker load time by ~30% and brought total spawn time from 15-20 minutes down to 5-6.&lt;/p&gt;

&lt;p&gt;Unglamorous, but worth checking. If you're on HDD, you're paying for it on every cold start.&lt;/p&gt;
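&lt;p&gt;For reference, disk type is a node pool setting. A hedged sketch of creating an SSD-backed pool (pool name, size, and locations are placeholders):&lt;/p&gt;

```shell
# pd-ssd instead of the default disk type; note that on GCE,
# disk size also affects sustained throughput.
gcloud container node-pools create ssd-pool \
   --cluster=CLUSTER_NAME \
   --location=CONTROL_PLANE_LOCATION \
   --disk-type=pd-ssd \
   --disk-size=200
```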

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your Ray cluster spawns feel slow on GKE, the diagnostic order I'd suggest:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Confirm Image Streaming is actually engaging (don't trust "enabled" - check logs).&lt;/li&gt;
&lt;li&gt;Verify your Artifact Registry region matches your cluster region.&lt;/li&gt;
&lt;li&gt;Check what disk type your node pool is using.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>llmops</category>
      <category>ray</category>
      <category>kubernetes</category>
      <category>gcp</category>
    </item>
    <item>
      <title>Distributed Model Serving Patterns</title>
      <dc:creator>Emin Mammadov</dc:creator>
      <pubDate>Sat, 05 Apr 2025 20:33:07 +0000</pubDate>
      <link>https://forem.com/iameminmammadov/distributed-model-serving-patterns-43ml</link>
      <guid>https://forem.com/iameminmammadov/distributed-model-serving-patterns-43ml</guid>
      <description>&lt;h2&gt;Intro&lt;/h2&gt;

&lt;p&gt;The goal of every company is to make money, and AI models are increasingly seen as an integral part of the business. As machine learning models move from experimentation to production, serving them becomes a challenge - and serving them at scale becomes an even larger one. A model with high accuracy isn't enough on its own; we need infrastructure that is robust, efficient, and scalable. In this article, I dive into the main model serving patterns. This should be useful for anyone building ML platform systems that need to operate reliably under a large number of users (and requests) or large volumes of data.&lt;/p&gt;

&lt;h2&gt;What is Model Serving?&lt;/h2&gt;

&lt;p&gt;Model serving is the process of loading a previously trained machine learning model in order to generate predictions or, more generally, to perform inference on new, unseen data.&lt;/p&gt;

&lt;h2&gt;Replicated Services Pattern&lt;/h2&gt;

&lt;p&gt;Imagine a very simple prediction server. You have a use case where users upload photos or videos, and a trained ML model automatically labels the people in them. In general, any such API should be stateless: each request is processed independently and treated as a completely new transaction, without knowing anything about the client. At a small scale, a single node could handle the predictions. However, a growing number of user requests will inevitably lead to delays, as those requests are processed in sequence. To solve this bottleneck, the &lt;strong&gt;Replicated Services Pattern&lt;/strong&gt; is used.&lt;/p&gt;

&lt;p&gt;The Replicated Services pattern is what is typically meant by Web Server Scalability 101. The core concept is adding multiple instances of the same model server; an instance here is a copy (or replica) of the original web server with a different address. Because the API is stateless, adding or removing equivalent servers (AKA horizontal scaling) lets inference scale seamlessly and ensures High Availability (HA). To route requests to the appropriate servers, &lt;strong&gt;Load Balancers&lt;/strong&gt; are deployed. &lt;/p&gt;

&lt;p&gt;In addition, the Replicated Services pattern helps reduce latency, since replicas can be placed closer to a client's geographic location.&lt;/p&gt;
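&lt;p&gt;To make the idea concrete, here is a minimal sketch of round-robin routing across replicas. The addresses are made up, and in practice the load balancer (not the client) owns this logic:&lt;/p&gt;

```python
from itertools import cycle

# Hypothetical replica addresses; a real load balancer discovers
# these from a service registry, not a hard-coded list.
replicas = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]
_next = cycle(replicas)

def route() -> str:
    """Stateless requests can go to any replica, so simply rotate."""
    return next(_next)

assignments = [route() for _ in range(6)]
print(assignments)
```

&lt;p&gt;Scaling out is just adding an address to the pool: because no request depends on server-side state, nothing else has to change.&lt;/p&gt;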

&lt;h2&gt;Sharded Services Pattern&lt;/h2&gt;

&lt;p&gt;In the previous pattern, the goal was to distribute many requests so that clients get responses quickly. However, inference in the ML domain commonly involves large datasets. The expectation of serving large amounts of data is one of the core differences between a regular web server and a web server designed for ML inference. Thus, it is common to rely on yet another serving pattern, the &lt;strong&gt;Sharded Services Pattern&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;In the Replicated Services pattern, every replica has fixed, identical computational resources. Regular web servers are not expected to perform computationally intensive tasks, but ML-specific ones are, so those fixed resources become a bottleneck. In the Sharded Services pattern, a large request is divided into smaller pieces, and each piece (or segment) is processed independently by a &lt;strong&gt;model server shard&lt;/strong&gt;, a partition of the larger model server. After each segment is processed, the results are merged into a final output. &lt;/p&gt;

&lt;p&gt;The Sharded Services pattern is useful not only for large datasets but also in cases where each shard is responsible for a specific task (a Natural Language Processing model on one shard and a Computer Vision model on another). Another use case is shards that account for certain data characteristics, such as geographic regions.&lt;/p&gt;

&lt;p&gt;One of the core concepts of this pattern is the &lt;strong&gt;sharding function&lt;/strong&gt;. A sharding function acts as an intelligent router that determines which shard a sub-request should go to. It is conceptually very similar to the &lt;strong&gt;hash functions&lt;/strong&gt; used in more traditional distributed applications. Important characteristics of a sharding function are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Uniform distribution: It is important to distribute load evenly to prevent "hot shards" that become overloaded while others are underutilized. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Minimal resharding impact: If the number of shards changes, the workload redistribution should be minimized. This is conceptually similar to consistent hashing algorithms like Ring Hash.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Context Awareness: For ML workloads specifically, the sharding function will need to understand model-specific characteristics and routing must be done based on input size, computational complexity, and/or data characteristics.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
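&lt;p&gt;A minimal sketch of such a sharding function, using a stable hash so that routing is deterministic across processes (the shard count and key format are illustrative):&lt;/p&gt;

```python
import hashlib

NUM_SHARDS = 4  # illustrative shard count

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Deterministically map a request key to a shard index.

    A stable hash is used instead of Python's built-in hash(),
    which is randomized per process and would break routing.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Property 1 (uniform distribution): spread many keys and count them.
counts = [0] * NUM_SHARDS
for i in range(10_000):
    counts[shard_for(f"request-{i}")] += 1
print(counts)
```

&lt;p&gt;Note that plain modulo hashing fails property 2: changing &lt;code&gt;num_shards&lt;/code&gt; remaps almost every key. Consistent hashing schemes such as Ring Hash fix that, at the cost of extra bookkeeping.&lt;/p&gt;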

&lt;p&gt;It is worth noting that, in this pattern, the load balancer employs more stateful algorithms that take some information about the client into account.&lt;/p&gt;

&lt;h2&gt;Event-Driven Processing Pattern&lt;/h2&gt;

&lt;p&gt;The Event-Driven Processing pattern complements the previous ones. In this model, the system operates on demand, allocating resources only when inference requests arrive rather than maintaining constantly active services. This approach leverages a shared resource pool from which compute capacity is dynamically borrowed based on current load, enabling efficient utilization across the entire infrastructure. A critical consideration in this architecture is implementing robust defenses against denial-of-service attacks, as both accidental (buggy clients) and malicious overuse can overwhelm the system. Protection mechanisms typically include rate limiting to control the pace of request processing, along with intelligent queuing systems that buffer excess requests and process them at a manageable rate without losing data.&lt;/p&gt;
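&lt;p&gt;As an illustration of the rate-limiting piece, here is a minimal token-bucket sketch. The rate and capacity are made up, and a production system would use a proven gateway or library rather than hand-rolled code:&lt;/p&gt;

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter (a sketch, not production code)."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should queue or reject, not drop silently

bucket = TokenBucket(rate=5, capacity=10)
results = [bucket.allow() for _ in range(15)]
```

&lt;p&gt;A quick burst drains the bucket; later requests are throttled until tokens refill, which is exactly the buffering behavior described above.&lt;/p&gt;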

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;I hope you enjoyed reading this post. In future ones, I will dive deeper into the rest of the machine learning infrastructure setup.&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>mlops</category>
      <category>machinelearning</category>
      <category>dataengineering</category>
    </item>
  </channel>
</rss>
