<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Lalit Somavarapha</title>
    <description>The latest articles on Forem by Lalit Somavarapha (@lalitlouis).</description>
    <link>https://forem.com/lalitlouis</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3701987%2F107766ca-a6de-4fed-be53-9715e8410465.jpg</url>
      <title>Forem: Lalit Somavarapha</title>
      <link>https://forem.com/lalitlouis</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/lalitlouis"/>
    <language>en</language>
    <item>
      <title>Optimizing GPU Workload Placement in Kubernetes with NVLink-Aware Scheduling</title>
      <dc:creator>Lalit Somavarapha</dc:creator>
      <pubDate>Tue, 27 Jan 2026 22:33:50 +0000</pubDate>
      <link>https://forem.com/lalitlouis/optimizing-gpu-workload-placement-in-kubernetes-with-nvlink-aware-scheduling-20n7</link>
      <guid>https://forem.com/lalitlouis/optimizing-gpu-workload-placement-in-kubernetes-with-nvlink-aware-scheduling-20n7</guid>
<description>&lt;h2&gt;The hidden performance tax&lt;/h2&gt;

&lt;p&gt;You bought GPUs with NVLink interconnects. You're probably not using them effectively.&lt;/p&gt;

&lt;p&gt;NVLink provides high-bandwidth, low-latency communication between GPUs—up to 900 GB/s on modern hardware compared to ~64 GB/s over PCIe. For distributed training workloads, this difference is massive. Gradient synchronization, tensor parallelism, and model sharding all depend on fast GPU-to-GPU communication.&lt;/p&gt;

&lt;p&gt;Here's the problem: Kubernetes doesn't know NVLink exists.&lt;/p&gt;

&lt;p&gt;The default scheduler sees GPUs as interchangeable resources. Request 4 GPUs, get any 4 GPUs. But on a node with 8 GPUs arranged in two NVLink domains of 4 GPUs each, placement matters enormously. Four GPUs within the same NVLink domain can communicate at full NVLink speed. Four GPUs split across domains fall back to slower PCIe interconnects.&lt;/p&gt;

&lt;p&gt;We measured up to 40% degradation in multi-GPU communication performance from suboptimal placement. That's a tax of up to 40% on every distributed training job that straddles domains, paid invisibly, every time.&lt;/p&gt;

&lt;h2&gt;NVLink topology primer&lt;/h2&gt;

&lt;p&gt;Modern GPU nodes organize GPUs into NVLink domains—groups of GPUs with direct high-speed interconnects. A typical configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Developer nodes (G1):&lt;/strong&gt; 2 GPUs, 1 NVLink domain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production nodes (G2):&lt;/strong&gt; 4 GPUs, 2 NVLink domains (2 GPUs each)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-performance nodes (G2-Expansion):&lt;/strong&gt; 8 GPUs, 2 NVLink domains (4 GPUs each)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Within a domain, GPUs communicate via NVLink. Across domains, they fall back to PCIe or NVSwitch fabric—significantly slower.&lt;/p&gt;

&lt;p&gt;The scheduling goal is simple: keep workloads within NVLink domain boundaries whenever possible.&lt;/p&gt;
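&lt;p&gt;That goal reduces to a one-line check. The sketch below is illustrative Python, not the plugin's actual code; the domain sizes mirror the node classes above.&lt;/p&gt;

```python
# Illustrative sketch: can a GPU request be satisfied inside one NVLink
# domain? `domain_free_gpus` lists the free GPUs per domain on a node.
def fits_single_domain(requested_gpus, domain_free_gpus):
    return any(free >= requested_gpus for free in domain_free_gpus)

# G2-Expansion node: two 4-GPU domains, one GPU already taken in domain 0.
free = [3, 4]
print(fits_single_domain(4, free))  # True: a 4-GPU job fits in domain 1
print(fits_single_domain(5, free))  # False: a 5-GPU job must span domains
```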

&lt;h2&gt;The solution: NVLink-aware scoring&lt;/h2&gt;

&lt;p&gt;We built a Kubernetes scheduler plugin that scores nodes based on NVLink topology awareness. It operates in the Score phase of the scheduling cycle, evaluating candidate nodes after filtering and before final selection.&lt;/p&gt;

&lt;p&gt;The plugin integrates with DCGM (Data Center GPU Manager) via Prometheus to track real-time GPU allocations, then applies a scoring algorithm that considers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Domain integrity&lt;/strong&gt; — Can the workload fit within a single NVLink domain?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node-workload matching&lt;/strong&gt; — Is this the right-sized node for this workload?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain completion&lt;/strong&gt; — Does this placement fill a partially-used domain?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource efficiency&lt;/strong&gt; — What's the node utilization after placement?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;Scoring algorithm&lt;/h2&gt;

&lt;p&gt;Each node starts with a base score of 50 points, then receives adjustments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Domain Integrity (primary signal)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the core optimization. Keeping GPUs within the same NVLink domain is the whole point.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Workload fits single domain: +50 points&lt;/li&gt;
&lt;li&gt;Exact domain fit (e.g., 2 GPUs on 2-GPU domain): +30 bonus&lt;/li&gt;
&lt;li&gt;Cross-domain placement: -20 per additional domain&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Node-Workload Matching&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Right-size workloads to nodes. Don't waste an 8-GPU node on a 1-GPU job.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small workloads (1–2 GPUs) on small NVLink nodes: +60 to +80 points&lt;/li&gt;
&lt;li&gt;Small workloads on oversized nodes: -40 points&lt;/li&gt;
&lt;li&gt;Large workloads (5+ GPUs) on large nodes: +80 to +100 points&lt;/li&gt;
&lt;li&gt;Large workloads on undersized nodes: -80 points&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Domain Completion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Bin-pack within domains before spilling to new ones.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Placement completes a partially-filled domain: +40 points&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Resource Efficiency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Favor high utilization to reduce fragmentation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High utilization (&amp;gt;70%) after placement: +20 to +30 points&lt;/li&gt;
&lt;li&gt;Low utilization (&amp;lt;30%): -20 points&lt;/li&gt;
&lt;/ul&gt;
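&lt;p&gt;Putting the four adjustments together, the scorer looks roughly like the Python sketch below. It fixes one point value within each stated range, and the names are illustrative rather than the plugin's actual API; it reproduces the 250 and 40 totals from the worked examples in the next section.&lt;/p&gt;

```python
# Illustrative scorer. Node state is a list of (total, allocated) pairs,
# one per NVLink domain; point values pick one choice from each range
# described above. Not the plugin's actual API.
def score_node(req, domains):
    free = [t - a for t, a in domains]
    node_total = sum(t for t, _ in domains)
    score = 50  # base

    # 1. Domain integrity: the core optimization.
    if any(f >= req for f in free):
        score += 50
        if any(t == req and f == req for (t, _), f in zip(domains, free)):
            score += 30  # exact domain fit, e.g. 2 GPUs on a 2-GPU domain
    else:
        spilled, remaining = 0, req
        for f in sorted(free, reverse=True):  # greedy spill across domains
            if remaining <= 0 or f <= 0:
                break
            remaining -= f
            spilled += 1
        score -= 20 * max(0, spilled - 1)

    # 2. Node-workload matching (no adjustment for mid-sized requests).
    if req <= 2:
        score += 80 if node_total <= 2 else -40
    elif req >= 5:
        score += 100 if node_total >= 8 else -80

    # 3. Domain completion: filling a partially used domain.
    if any(a > 0 and f == req for (_, a), f in zip(domains, free)):
        score += 40

    # 4. Resource efficiency after placement.
    util = (sum(a for _, a in domains) + req) / node_total
    if util > 0.7:
        score += 30
    elif util < 0.3:
        score -= 20
    return score

print(score_node(1, [(2, 1)]))          # 250: half-used 2-GPU node
print(score_node(1, [(4, 0), (4, 0)]))  # 40: empty 8-GPU node
```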

&lt;h2&gt;Real-time allocation tracking&lt;/h2&gt;

&lt;p&gt;Static scoring isn't enough. The plugin needs to know which GPUs are currently allocated to make intelligent placement decisions.&lt;/p&gt;

&lt;p&gt;We integrate with DCGM metrics exposed via Prometheus. This tells us which GPUs have active workloads and which pods own them. The plugin reconstructs domain state in real-time:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Domain 0: total=2, allocated=1, available=1
Domain 1: total=2, allocated=0, available=2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a new 1-GPU workload arrives, the plugin recognizes that placing it in Domain 0 completes that domain (+40 bonus), while Domain 1 would leave fragmentation.&lt;/p&gt;
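&lt;p&gt;That preference can be sketched as a small selection function (illustrative Python, not the plugin's code): among domains that fit the request, pick one the request exactly completes, otherwise the tightest remaining fit.&lt;/p&gt;

```python
# Illustrative sketch: choose a domain for `req` GPUs, preferring one the
# request exactly completes, then the tightest fit.
def pick_domain(req, domains):
    """domains: list of (total, allocated) pairs; returns a domain index."""
    free = [t - a for t, a in domains]
    fitting = [i for i, f in enumerate(free) if f >= req]
    if not fitting:
        return None  # no single domain can hold the request
    completing = [i for i in fitting if free[i] == req]
    if completing:
        return completing[0]  # fills the domain: least fragmentation
    return min(fitting, key=lambda i: free[i] - req)

state = [(2, 1), (2, 0)]  # Domain 0: 1 free; Domain 1: 2 free
print(pick_domain(1, state))  # 0: placing here completes Domain 0
```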

&lt;h2&gt;Example: Domain completion in action&lt;/h2&gt;

&lt;p&gt;Consider a 2-GPU node (single NVLink domain) with one GPU already allocated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incoming workload:&lt;/strong&gt; 1 GPU&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scoring breakdown:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Base:              50
Domain integrity:  +50 (single domain)
Node matching:     +80 (perfect fit)
Domain completion: +40 (completes the domain)
Efficiency:        +30 (100% utilization)
───────────────────────
Final:             250 → normalized: 92/100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compare to placing the same workload on an empty 8-GPU node:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Base:              50
Domain integrity:  +50 (single domain)
Node matching:     -40 (oversized node)
Domain completion:  0  (not completing anything)
Efficiency:        -20 (12.5% utilization)
───────────────────────
Final:             40 → normalized: 31/100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 2-GPU node wins decisively, preserving the 8-GPU node for workloads that actually need it.&lt;/p&gt;

&lt;h2&gt;What we learned&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Topology awareness compounds.&lt;/strong&gt; The 40% improvement isn't just about individual job performance—it's about cluster-wide efficiency. Better placement means less fragmentation, which means more workloads scheduled successfully, which means higher overall utilization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DCGM integration is essential.&lt;/strong&gt; Without real-time allocation data, the plugin would make decisions based on stale information. The Prometheus integration adds minimal overhead but provides critical visibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scoring weights need tuning.&lt;/strong&gt; Different clusters have different workload mixes. A cluster dominated by small jobs might want stronger penalties for oversized placement. We exposed the key parameters for operator customization.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Related links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/kubernetes-sigs/scheduler-plugins" rel="noopener noreferrer"&gt;Kubernetes Scheduler Plugins&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/NVIDIA/dcgm-exporter" rel="noopener noreferrer"&gt;NVIDIA DCGM Exporter&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>nvidia</category>
      <category>gpu</category>
      <category>scheduling</category>
    </item>
    <item>
      <title>Reclaiming Idle GPUs in Kubernetes: Why We Built a Custom Scheduler Plugin</title>
      <dc:creator>Lalit Somavarapha</dc:creator>
      <pubDate>Fri, 09 Jan 2026 08:38:51 +0000</pubDate>
      <link>https://forem.com/lalitlouis/reclaiming-idle-gpus-in-kubernetes-why-we-built-a-custom-scheduler-plugin-34k2</link>
      <guid>https://forem.com/lalitlouis/reclaiming-idle-gpus-in-kubernetes-why-we-built-a-custom-scheduler-plugin-34k2</guid>
<description>&lt;h2&gt;The Problem Nobody Talks About&lt;/h2&gt;

&lt;p&gt;GPUs are expensive. A single NVIDIA A100 can cost $10,000+, and in a Kubernetes cluster running AI workloads, you might have dozens of them. Here's the uncomfortable truth: most of the time, they're sitting idle. If you're struggling to reclaim idle GPUs in Kubernetes, you're not alone.&lt;/p&gt;

&lt;p&gt;A data scientist spins up a training job, requests 4 GPUs, runs for two hours, then leaves for lunch. The GPUs sit allocated but unused. Meanwhile, another team's job is queued, waiting for resources that technically exist but aren't available.&lt;/p&gt;

&lt;p&gt;Standard Kubernetes scheduling doesn't help here. It sees allocated resources as unavailable — period. It doesn't care whether those GPUs are actually being used.&lt;/p&gt;

&lt;h2&gt;What Kubernetes Gets Wrong About GPUs&lt;/h2&gt;

&lt;p&gt;Kubernetes was built for CPUs. Its scheduling model assumes resources are either allocated or free, with nothing in between. For CPUs, this mostly works — a pod using 10% of its requested CPU isn't blocking others in the same way.&lt;/p&gt;

&lt;p&gt;GPUs are different. They're discrete, expensive, and often requested in large quantities. A pod requesting 4 GPUs gets exactly 4 GPUs, even if it's only actively using them 20% of the time. This is the core challenge of GPU resource management in Kubernetes — the scheduler has no concept of actual utilization.&lt;/p&gt;

&lt;p&gt;The default Kubernetes preemption mechanism (&lt;code&gt;DefaultPreemption&lt;/code&gt;) can evict lower-priority pods to make room for higher-priority ones. But it only considers priority — not actual utilization. A pod sitting completely idle has the same protection as one running a critical training job, as long as their priorities match.&lt;/p&gt;

&lt;p&gt;We looked for existing solutions. NVIDIA's device plugin handles allocation but not reclamation. Cluster autoscaler can add nodes but won't reclaim idle resources on existing ones. Various GPU sharing approaches exist, but they don't address the fundamental scheduling problem.&lt;/p&gt;

&lt;h2&gt;The Core Idea: Utilization-Aware Preemption&lt;/h2&gt;

&lt;p&gt;We needed utilization-aware preemption that considers what GPUs are actually doing, not just what they've been allocated. The solution: a custom Kubernetes scheduler plugin for idle GPU reclaim that replaces the default preemption logic with something smarter.&lt;/p&gt;

&lt;p&gt;The plugin, which we called &lt;code&gt;ReclaimIdleResource&lt;/code&gt;, operates in the PostFilter phase of the scheduling cycle. This is where Kubernetes looks for preemption candidates when a pod can't be scheduled normally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftb4yb3p17imigrb1yw5w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftb4yb3p17imigrb1yw5w.png" alt="ReclaimIdleResource plugin" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the key insight: instead of just comparing priorities, we also query Prometheus for actual GPU utilization metrics from DCGM (NVIDIA's Data Center GPU Manager). A pod is only eligible for preemption if:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Its priority is below the preemptor's threshold&lt;/li&gt;
&lt;li&gt;It's been running long enough to establish a usage pattern&lt;/li&gt;
&lt;li&gt;Its actual GPU utilization is below a configured threshold&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This means an idle pod with priority 1000 can be preempted by a pod with priority 500, if the idle pod isn't actually using its GPUs.&lt;/p&gt;
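&lt;p&gt;As a predicate, the three conditions combine like this (illustrative Python; the parameter framing is a sketch, and in the real plugin the threshold and windows come from PriorityClass annotations):&lt;/p&gt;

```python
# Illustrative eligibility check: a victim can be preempted only when all
# three conditions hold.
def eligible_victim(preemptor_priority, min_preemptable_priority,
                    runtime_s, toleration_s,
                    avg_gpu_util_pct, idle_threshold_pct):
    return (preemptor_priority >= min_preemptable_priority  # 1. priority gate
            and runtime_s > toleration_s                    # 2. usage pattern established
            and avg_gpu_util_pct < idle_threshold_pct)      # 3. actually idle

# Idle priority-1000 pod whose class allows preemption by priority >= 500:
print(eligible_victim(500, 500, 7200, 3600, 2.0, 10.0))   # True
# Same pod, but busy (85% average GPU utilization over the window):
print(eligible_victim(500, 500, 7200, 3600, 85.0, 10.0))  # False
```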

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffiixlnpjis3pniz5ehbd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffiixlnpjis3pniz5ehbd.png" alt="System architecture" width="800" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;How It Works&lt;/h2&gt;

&lt;p&gt;The plugin hooks into the scheduler as a PostFilter extension:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;profiles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;schedulerName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default-scheduler&lt;/span&gt;
  &lt;span class="na"&gt;plugins&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;postFilter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ReclaimIdleResource&lt;/span&gt;
      &lt;span class="na"&gt;disabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DefaultPreemption&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a GPU-requesting pod can't be scheduled, the plugin:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Checks cooldown&lt;/strong&gt; — Has this pod recently triggered preemption? If so, wait. This prevents thrashing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scans potential victims&lt;/strong&gt; — Finds all lower-priority pods on candidate nodes that have GPUs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Evaluates each victim&lt;/strong&gt; — Parses its PriorityClass for reclaim policy annotations, checks if it's still in its "toleration period" (grace period after scheduling), queries Prometheus for average GPU utilization over the monitoring window, and compares utilization against the idle threshold.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Selects minimal victims&lt;/strong&gt; — Sorts eligible victims by GPU count (descending) and priority (ascending), then selects the minimum set needed to free enough GPUs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Validates the decision&lt;/strong&gt; — Runs filter plugins to confirm the preemptor will actually fit after preemption.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
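&lt;p&gt;Step 4 is the interesting one. A sketch of the selection order (illustrative Python, not the plugin's Go code):&lt;/p&gt;

```python
# Illustrative minimal-victim selection: sort by GPU count (descending),
# then priority (ascending), and take the smallest prefix that frees
# enough GPUs.
def select_victims(victims, gpus_needed):
    """victims: list of (pod_name, gpu_count, priority)."""
    ordered = sorted(victims, key=lambda v: (-v[1], v[2]))
    chosen, freed = [], 0
    for name, gpus, _priority in ordered:
        if freed >= gpus_needed:
            break
        chosen.append(name)
        freed += gpus
    return chosen if freed >= gpus_needed else None  # None: preemption can't help

victims = [("idle-a", 1, 100), ("idle-b", 4, 200), ("idle-c", 2, 50)]
print(select_victims(victims, 5))   # ['idle-b', 'idle-c']
print(select_victims(victims, 10))  # None: only 7 GPUs are reclaimable
```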

&lt;p&gt;The policy is defined per-PriorityClass through annotations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PriorityClass&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch-workload&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Pods can be preempted if idle (&amp;lt;10% GPU) for 1 hour, by priority ≥10000&lt;/span&gt;
    &lt;span class="na"&gt;reclaim-idle-resource.scheduling.x-k8s.io/minimum-preemptable-priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10000"&lt;/span&gt;
    &lt;span class="na"&gt;reclaim-idle-resource.scheduling.x-k8s.io/toleration-seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3600"&lt;/span&gt;
    &lt;span class="na"&gt;reclaim-idle-resource.scheduling.x-k8s.io/resource-idle-seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3600"&lt;/span&gt;
    &lt;span class="na"&gt;reclaim-idle-resource.scheduling.x-k8s.io/resource-idle-usage-threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10.0"&lt;/span&gt;
&lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This says: pods in this priority class are protected from preemption for one hour after scheduling, and can then be preempted if their GPU usage stays below 10% for an hour, but only by pods with priority 10000 or higher.&lt;/p&gt;
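&lt;p&gt;Reading those annotations back into a policy is straightforward. A parsing sketch (illustrative Python; the annotation keys match the manifest above, while the fallback defaults are assumptions):&lt;/p&gt;

```python
# Illustrative parser for the reclaim policy annotations. Keys match the
# PriorityClass manifest above; the fallback defaults are assumptions.
PREFIX = "reclaim-idle-resource.scheduling.x-k8s.io/"

def parse_policy(annotations):
    def get(key, default, cast):
        return cast(annotations.get(PREFIX + key, default))
    return {
        "min_preemptable_priority": get("minimum-preemptable-priority", "0", int),
        "toleration_seconds": get("toleration-seconds", "0", int),
        "idle_seconds": get("resource-idle-seconds", "0", int),
        "idle_usage_threshold_pct": get("resource-idle-usage-threshold", "100.0", float),
    }

ann = {
    PREFIX + "minimum-preemptable-priority": "10000",
    PREFIX + "toleration-seconds": "3600",
    PREFIX + "resource-idle-seconds": "3600",
    PREFIX + "resource-idle-usage-threshold": "10.0",
}
policy = parse_policy(ann)
print(policy["min_preemptable_priority"], policy["idle_usage_threshold_pct"])
```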

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0d7e6z8d4rpp5jzygd52.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0d7e6z8d4rpp5jzygd52.png" alt="Preemption Decision logic" width="800" height="572"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Key Design Decisions&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why PriorityClass annotations?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We considered a custom CRD, but PriorityClass already exists in the scheduling mental model. Teams already think about priority when designing workloads. Adding reclaim policy as annotations keeps the configuration close to where people expect it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why a monitoring window instead of instant utilization?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GPU workloads are bursty. A training job might spike to 100% utilization during forward/backward passes, then drop to near-zero during data loading. Instant measurements would give false positives. We use a configurable window (typically 30–60 minutes) to capture the true usage pattern.&lt;/p&gt;
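&lt;p&gt;Concretely, that windowed average is a single PromQL expression over dcgm-exporter's utilization metric. The label names below depend on how your exporter is configured, so treat them as assumptions:&lt;/p&gt;

```python
# Build the windowed-average utilization query for a pod's GPUs.
# DCGM_FI_DEV_GPU_UTIL is dcgm-exporter's GPU utilization metric; the
# pod/namespace label names vary by setup and are assumptions here.
def utilization_query(pod, namespace, window="30m"):
    selector = f'pod="{pod}",namespace="{namespace}"'
    return f"avg_over_time(DCGM_FI_DEV_GPU_UTIL{{{selector}}}[{window}])"

print(utilization_query("trainer-0", "ml-team"))
```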

&lt;p&gt;&lt;strong&gt;Why query Prometheus instead of using in-memory metrics?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The scheduler runs as a single replica. We needed utilization data that survives scheduler restarts and can be queried historically. DCGM exports to Prometheus naturally, and most GPU clusters already have this pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why a cooldown period?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without it, a preemptor pod could trigger preemption, fail to schedule for unrelated reasons, and immediately trigger another preemption attempt. The 30-second cooldown prevents rapid-fire preemption storms.&lt;/p&gt;
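&lt;p&gt;The cooldown itself needs nothing fancier than a per-preemptor timestamp map. A sketch (illustrative Python; the injectable clock just makes it testable):&lt;/p&gt;

```python
import time

# Illustrative per-preemptor cooldown: a pod that triggered preemption in
# the last `seconds` is refused another attempt, preventing thrash.
class Cooldown:
    def __init__(self, seconds=30.0, clock=time.monotonic):
        self.seconds = seconds
        self.clock = clock
        self.last_attempt = {}

    def allow(self, pod_key):
        now = self.clock()
        last = self.last_attempt.get(pod_key)
        if last is not None and now - last < self.seconds:
            return False  # still cooling down: skip this attempt
        self.last_attempt[pod_key] = now
        return True

cd = Cooldown(seconds=30.0)
print(cd.allow("ml-team/trainer-0"))  # True: first attempt proceeds
print(cd.allow("ml-team/trainer-0"))  # False: inside the 30s window
```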

&lt;h2&gt;What We Learned&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tuning matters more than we expected.&lt;/strong&gt; The idle threshold and monitoring window need to match your workload patterns. Too aggressive and you'll preempt jobs mid-training. Too conservative and you won't reclaim much.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability is essential.&lt;/strong&gt; We added extensive logging and Kubernetes events so operators can understand why preemption decisions were made. When someone's job gets preempted, they want to know why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MIG complicates everything.&lt;/strong&gt; NVIDIA's Multi-Instance GPU feature means a single physical GPU can be partitioned. We had to add partition-size compatibility checks to avoid preempting pods on nodes where the preemptor couldn't run anyway.&lt;/p&gt;

&lt;h2&gt;What's Next&lt;/h2&gt;

&lt;p&gt;This article covers the "what" and "why." In the next post, I'll walk through building a Kubernetes scheduler plugin from scratch in Go — the project structure, the interfaces you need to implement, and the gotchas we hit along the way.&lt;/p&gt;

&lt;p&gt;If you're running GPU workloads on Kubernetes and wrestling with utilization challenges, I'd love to hear how you're approaching it — drop a comment below!&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>gpu</category>
      <category>go</category>
    </item>
  </channel>
</rss>
