<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: 霓漠Nimbus</title>
    <description>The latest articles on Forem by 霓漠Nimbus (@nimbus_7e671c0df3b80bcf).</description>
    <link>https://forem.com/nimbus_7e671c0df3b80bcf</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2544288%2F568abb25-2b60-4d88-94d8-f07aae9eb22a.png</url>
      <title>Forem: 霓漠Nimbus</title>
      <link>https://forem.com/nimbus_7e671c0df3b80bcf</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nimbus_7e671c0df3b80bcf"/>
    <language>en</language>
    <item>
      <title>A Quick Take on K8s 1.34 GA DRA: 7 Questions You Probably Have</title>
      <dc:creator>霓漠Nimbus</dc:creator>
      <pubDate>Thu, 11 Sep 2025 13:39:19 +0000</pubDate>
      <link>https://forem.com/nimbus_7e671c0df3b80bcf/a-quick-take-on-k8s-134-ga-dra-7-questions-you-probably-have-nb</link>
      <guid>https://forem.com/nimbus_7e671c0df3b80bcf/a-quick-take-on-k8s-134-ga-dra-7-questions-you-probably-have-nb</guid>
      <description>&lt;h2&gt;
  
  
  The 7 questions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;What problem does DRA solve?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Does “dynamic” mean hot-plugging a GPU to a running Pod or in-place GPU memory resize?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What real-world use cases (and “fun” possibilities) does DRA enable?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How does DRA relate to the DevicePlugin? Can they coexist?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What’s the status of GPU virtualization under DRA? What about HAMi?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Which alpha/beta features around DRA are worth watching?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;When will this be production-ready at scale?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Before we dive in, here’s a mental model that helps a lot:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Know HAMi + know PV/PVC ≈ know DRA.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;More precisely: DRA borrows the &lt;em&gt;dynamic provisioning&lt;/em&gt; idea from PV/PVC and adds a &lt;strong&gt;structured, standardized abstraction&lt;/strong&gt; for device requests. The core insight is simple:&lt;/p&gt;

&lt;p&gt;Previously, the DevicePlugin didn’t surface enough structured information for the scheduler to make good decisions. DRA fixes that by richly describing devices and requests in a way the scheduler (and autoscaler) can reason about.&lt;/p&gt;

&lt;p&gt;In plain English: &lt;strong&gt;report more facts, and make the scheduler aware of them.&lt;/strong&gt; That’s DRA’s “structured parameters” in a nutshell.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you’re familiar with &lt;strong&gt;HAMi’s Node &amp;amp; Pod annotation–based mechanism for conveying device constraints to the scheduler&lt;/strong&gt;, DRA &lt;strong&gt;elevates the same idea into first-class, structured API objects&lt;/strong&gt; that the native scheduler and Cluster Autoscaler can reason about directly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  A bit of history (why structured parameters won)
&lt;/h2&gt;

&lt;p&gt;The earliest DRA design wasn’t structured. Vendors proposed opaque, driver-owned CRDs. The scheduler couldn’t see global availability or interpret those fields, so it had to orchestrate a multi-round “dance” with the vendor controller:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scheduler writes a &lt;strong&gt;candidate node list&lt;/strong&gt; into a temp object&lt;/li&gt;
&lt;li&gt;Driver controller removes unfit nodes&lt;/li&gt;
&lt;li&gt;Scheduler picks a node&lt;/li&gt;
&lt;li&gt;Driver tries to allocate&lt;/li&gt;
&lt;li&gt;Allocation status is written back&lt;/li&gt;
&lt;li&gt;Only then does the scheduler try to bind the Pod&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqipnkowe1445xbnwxpj4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqipnkowe1445xbnwxpj4.png" alt="Early unstructured DRA design — scheduler and driver had to do a multi-round " width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every step risked &lt;strong&gt;races, stale state, retries&lt;/strong&gt;—hot spots on the API server, pressure on drivers, and long-tail scheduling latency. Cluster Autoscaler (CA) also had poor predictive power because the scheduler itself didn’t understand the resource constraints.&lt;/p&gt;

&lt;p&gt;That approach was dropped in favor of &lt;strong&gt;structured parameters&lt;/strong&gt;, so &lt;strong&gt;scheduler and CA can reason directly&lt;/strong&gt; and participate in the decision upfront.&lt;/p&gt;




&lt;h2&gt;
  
  
  Now the Q&amp;amp;A
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1) What problem does DRA actually solve?
&lt;/h3&gt;

&lt;p&gt;It solves this: &lt;strong&gt;“DevicePlugin’s reported info isn’t enough, and if you report it elsewhere the scheduler can’t see it.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DRA introduces &lt;strong&gt;structured, declarative descriptions&lt;/strong&gt; of device needs and inventory so the &lt;strong&gt;native scheduler can decide intelligently&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2) Does “dynamic” mean hot-plugging GPUs into a running Pod, or in-place VRAM up/down?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Neither.&lt;/strong&gt; Here, &lt;em&gt;dynamic&lt;/em&gt; primarily means &lt;strong&gt;flexible, declarative device selection at scheduling time&lt;/strong&gt;, plus the ability for drivers to &lt;strong&gt;prepare/cleanup&lt;/strong&gt; around &lt;em&gt;bind&lt;/em&gt; and &lt;em&gt;unbind&lt;/em&gt;. Think of it as &lt;strong&gt;flexible resource allocation&lt;/strong&gt;, not live GPU hot-plugging or in-place VRAM resizing.&lt;/p&gt;

&lt;h3&gt;
  
  
  3) What new toys does DRA bring? Where does it shine?
&lt;/h3&gt;

&lt;p&gt;DRA adds four key concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DeviceClass&lt;/strong&gt; → think &lt;strong&gt;StorageClass&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ResourceClaim&lt;/strong&gt; → think &lt;strong&gt;PVC&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ResourceClaimTemplate&lt;/strong&gt; → think &lt;strong&gt;VolumeClaimTemplate&lt;/strong&gt; (flavor or “SKU” you’d expose on a platform)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ResourceSlice&lt;/strong&gt; → a &lt;strong&gt;richer, extensible inventory record&lt;/strong&gt;, i.e., a supercharged version of what DevicePlugin used to advertise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes &lt;strong&gt;inventory and SKU management&lt;/strong&gt; feel native. A lot of the real “fun” lands with features that are alpha/beta today (see below), but even at GA the &lt;strong&gt;information model&lt;/strong&gt; is the big unlock.&lt;/p&gt;
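&lt;p&gt;To make the PV/PVC analogy concrete, here is a minimal, illustrative sketch of a DeviceClass and a ResourceClaim (the driver name &lt;code&gt;gpu.example.com&lt;/code&gt; is a placeholder, not a real driver; field layout follows the &lt;code&gt;resource.k8s.io/v1&lt;/code&gt; API that went GA in 1.34):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: example-gpu            # analogous to a StorageClass
spec:
  selectors:
    - cel:
        expression: device.driver == "gpu.example.com"
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: single-gpu             # analogous to a PVC
spec:
  devices:
    requests:
      - name: gpu
        exactly:
          deviceClassName: example-gpu
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A Pod then references the claim via &lt;code&gt;spec.resourceClaims&lt;/code&gt; and each container’s &lt;code&gt;resources.claims&lt;/code&gt;, much as it would consume a PVC through a volume.&lt;/p&gt;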

&lt;h3&gt;
  
  
  4) What’s the relationship with DevicePlugin? Can they coexist?
&lt;/h3&gt;

&lt;p&gt;DRA is &lt;strong&gt;meant to replace&lt;/strong&gt; the legacy DevicePlugin path over time. To make migration smoother, there’s &lt;strong&gt;KEP-5004 (DRA Extended Resource Mapping)&lt;/strong&gt; which lets a DRA driver &lt;strong&gt;map devices to extended resources&lt;/strong&gt; (e.g., &lt;code&gt;nvidia.com/gpu&lt;/code&gt;) during a transition.&lt;/p&gt;

&lt;p&gt;Practically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can run &lt;strong&gt;both&lt;/strong&gt; in the &lt;strong&gt;same cluster&lt;/strong&gt; during migration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A single node cannot expose the &lt;em&gt;same-named&lt;/em&gt; extended resource from both&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;You can migrate apps and nodes &lt;strong&gt;gradually&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
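&lt;p&gt;During such a migration, existing workloads can keep requesting the classic extended resource name unchanged; for example (illustrative sketch, image tag is a placeholder):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: legacy-gpu-pod
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.0-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1   # satisfied by the DevicePlugin or, via KEP-5004, by a DRA driver
&lt;/code&gt;&lt;/pre&gt;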

&lt;h3&gt;
  
  
  5) What about GPU virtualization? And HAMi?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Template-style (MIG-like) partitioning&lt;/strong&gt;: see &lt;strong&gt;KEP-4815 – DRA Partitionable Devices&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible (capacity-style) sharing&lt;/strong&gt; like HAMi: the community is building on &lt;strong&gt;KEP-5075 – DRA Consumable Capacity&lt;/strong&gt; (think “share by capacity” such as VRAM or bandwidth).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni0mk8pzvgkcl3z2xel3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni0mk8pzvgkcl3z2xel3.png" alt="DRA extensions for GPUs — partitionable devices (MIG-like templates) and consumable capacity (HAMi-style flexible sharing)." width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;HAMi’s DRA driver (demo branch) lives here:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;https://github.com/Project-HAMi/k8s-dra-driver/tree/demo&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  6) What alpha/beta features look exciting?
&lt;/h3&gt;

&lt;p&gt;Already mentioned, but here’s the short list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;KEP-5004 – DRA Extended Resource Mapping&lt;/strong&gt;: smoother migration from DevicePlugin&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KEP-4815 – Partitionable Devices&lt;/strong&gt;: MIG-like templated splits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KEP-5075 – Consumable Capacity&lt;/strong&gt;: share by capacity (VRAM, bandwidth, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And more I’m watching:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;KEP-4816 – Prioritized Alternatives in Device Requests&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let a request specify &lt;strong&gt;ordered fallbacks&lt;/strong&gt;—prefer “A”, accept “B”, or even &lt;strong&gt;prioritize allocating “lower-end” first&lt;/strong&gt; to keep “higher-end” free.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;KEP-4680 – Resource Health in Pod Status&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Device health surfaces directly in &lt;strong&gt;PodStatus&lt;/strong&gt; for &lt;strong&gt;faster detection and response&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;KEP-5055 – Device Taints/Tolerations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Taint devices&lt;/strong&gt; (by drivers or operators), e.g., “nearing decommission” or “needs maintenance,” and control placement with tolerations.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  7) When will this be broadly production-ready?
&lt;/h3&gt;

&lt;p&gt;For &lt;strong&gt;wide, low-friction production use&lt;/strong&gt;, you typically want &lt;strong&gt;beta maturity + ecosystem drivers&lt;/strong&gt; to catch up. A &lt;strong&gt;rough&lt;/strong&gt; expectation: &lt;strong&gt;~ 8–16 months&lt;/strong&gt; for most shops, depending on vendors and your risk posture.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Virtualizing Any GPU on AWS with HAMi: Free Memory Isolation</title>
      <dc:creator>霓漠Nimbus</dc:creator>
      <pubDate>Thu, 11 Sep 2025 13:08:27 +0000</pubDate>
      <link>https://forem.com/nimbus_7e671c0df3b80bcf/virtualizing-any-gpu-on-aws-with-hami-free-memory-isolation-214g</link>
      <guid>https://forem.com/nimbus_7e671c0df3b80bcf/virtualizing-any-gpu-on-aws-with-hami-free-memory-isolation-214g</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: This guide spins up an AWS EKS cluster with two GPU node groups (T4 and A10G), installs HAMi automatically, and deploys three vLLM services that &lt;strong&gt;share&lt;/strong&gt; a single physical GPU per node using &lt;strong&gt;free memory isolation&lt;/strong&gt;. You’ll see GPU‑dimension binpack in action: multiple Pods co‑located on the &lt;strong&gt;same GPU&lt;/strong&gt; when limits allow.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why HAMi on AWS?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Project-HAMi/HAMi" rel="noopener noreferrer"&gt;HAMi brings GPU‑model‑agnostic virtualization to Kubernetes—spanning consumer‑grade to data‑center GPUs.&lt;/a&gt; On AWS, that means you can take common NVIDIA instances (e.g., &lt;strong&gt;g4dn.12xlarge&lt;/strong&gt; with T4s, &lt;strong&gt;g5.12xlarge&lt;/strong&gt; with A10Gs), and then &lt;strong&gt;slice GPU memory&lt;/strong&gt; to safely pack multiple Pods on a single card—no app changes required.&lt;/p&gt;

&lt;p&gt;In this demo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Two nodes&lt;/strong&gt;: one T4 node, one A10G node (each with &lt;strong&gt;4 GPUs&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HAMi&lt;/strong&gt; is installed via Helm as part of the Terraform apply.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM&lt;/strong&gt; workloads request fractions of GPU memory so two Pods can run on one GPU.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  One‑Click Setup
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Repo: github.com/dynamia-ai/hami-ecosystem-demo&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  0) Prereqs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Terraform or OpenTofu&lt;/li&gt;
&lt;li&gt;AWS CLI v2 (and &lt;code&gt;aws sts get-caller-identity&lt;/code&gt; succeeds)&lt;/li&gt;
&lt;li&gt;kubectl, jq&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  1) Provision AWS + Install HAMi
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dynamia-ai/hami-ecosystem-demo.git
&lt;span class="nb"&gt;cd &lt;/span&gt;infra/aws
terraform init
terraform apply &lt;span class="nt"&gt;-auto-approve&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When finished, configure kubectl using the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraform output &lt;span class="nt"&gt;-raw&lt;/span&gt; kubectl_config_command
&lt;span class="c"&gt;# Example:&lt;/span&gt;
&lt;span class="c"&gt;# aws eks update-kubeconfig --region us-west-2 --name hami-demo-aws&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2) Verify Cluster &amp;amp; HAMi
&lt;/h3&gt;

&lt;p&gt;Check that HAMi components are running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; hami

hami-device-plugin-mtkmg             2/2     Running   0          3h6m
hami-device-plugin-sg5wl             2/2     Running   0          3h6m
hami-scheduler-574cb577b9-p4xd9      2/2     Running   0          3h6m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;List registered GPUs per node (HAMi annotates nodes with inventory):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get nodes &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.hami\.io/node-nvidia-register}{"\n"}{end}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see four entries per node (T4 x4, A10G x4), with UUIDs and memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ip-10-0-38-240.us-west-2.compute.internal   GPU-f8e75627-86ed-f202-cf2b-6363fb18d516,10,15360,100,NVIDIA-Tesla T4,0,true,0,hami-core:GPU-7f2003cf-a542-71cf-121f-0e489699bbcf,10,15360,100,NVIDIA-Tesla T4,0,true,1,hami-core:GPU-90e2e938-7ac3-3b5e-e9d2-94b0bd279cf2,10,15360,100,NVIDIA-Tesla T4,0,true,2,hami-core:GPU-2facdfa8-853c-e117-ed59-f0f55a4d536f,10,15360,100,NVIDIA-Tesla T4,0,true,3,hami-core:
ip-10-0-53-156.us-west-2.compute.internal   GPU-bd5e2639-a535-7cba-f018-d41309048f4e,10,23028,100,NVIDIA-NVIDIA A10G,0,true,0,hami-core:GPU-06f444bc-af98-189a-09b1-d283556db9ef,10,23028,100,NVIDIA-NVIDIA A10G,0,true,1,hami-core:GPU-6385a85d-0ce2-34ea-040d-23c94299db3c,10,23028,100,NVIDIA-NVIDIA A10G,0,true,2,hami-core:GPU-d4acf062-3ba9-8454-2660-aae402f7a679,10,23028,100,NVIDIA-NVIDIA A10G,0,true,3,hami-core:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Deploy the Demo Workloads
&lt;/h2&gt;

&lt;p&gt;Apply the manifests (two A10G services, one T4 service):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; demo/workloads/a10g.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; demo/workloads/t4.yaml
kubectl get pods &lt;span class="nt"&gt;-o&lt;/span&gt; wide

NAME                                       READY   STATUS    RESTARTS   AGE    IP            NODE                                        NOMINATED NODE   READINESS GATES
vllm-a10g-mistral7b-awq-5f78b4c6b4-q84k7   1/1     Running   0          172m   10.0.50.145   ip-10-0-53-156.us-west-2.compute.internal   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;
vllm-a10g-qwen25-7b-awq-6d5b5d94b-nxrbj    1/1     Running   0          172m   10.0.49.180   ip-10-0-53-156.us-west-2.compute.internal   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;
vllm-t4-qwen25-1-5b-55f98dbcf4-mgw8d       1/1     Running   0          117m   10.0.44.2     ip-10-0-38-240.us-west-2.compute.internal   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;
vllm-t4-qwen25-1-5b-55f98dbcf4-rn5m4       1/1     Running   0          117m   10.0.37.202   ip-10-0-38-240.us-west-2.compute.internal   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What the two key annotations do
&lt;/h3&gt;

&lt;p&gt;In the Pod templates you’ll see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;nvidia.com/use-gputype&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A10G"&lt;/span&gt;   &lt;span class="c1"&gt;# or "T4" on the T4 demo&lt;/span&gt;
    &lt;span class="na"&gt;hami.io/gpu-scheduler-policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;binpack"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;nvidia.com/use-gputype&lt;/code&gt; restricts scheduling to the named GPU model (e.g., &lt;code&gt;A10G&lt;/code&gt;, &lt;code&gt;T4&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hami.io/gpu-scheduler-policy: binpack&lt;/code&gt; tells HAMi to &lt;strong&gt;co‑locate&lt;/strong&gt; Pods on the &lt;strong&gt;same physical GPU&lt;/strong&gt; when memory/core limits permit (GPU‑dimension binpack).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How the free memory isolation is requested
&lt;/h3&gt;

&lt;p&gt;Each container sets &lt;strong&gt;GPU memory limits&lt;/strong&gt; via HAMi resource names so multiple Pods can safely share one card:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On T4: &lt;code&gt;nvidia.com/gpumem: "7500"&lt;/code&gt; (MiB) with 2 replicas ⇒ both fit on a 16 GB T4.&lt;/li&gt;
&lt;li&gt;On A10G: &lt;code&gt;nvidia.com/gpumem-percentage: "45"&lt;/code&gt; for each Deployment ⇒ two Pods fit on a 24 GB A10G.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;HAMi enforces these limits inside the container, so Pods can’t exceed their assigned GPU memory.&lt;/p&gt;
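&lt;p&gt;The resource block in each container spec therefore looks roughly like this (T4 variant shown; per the article, the A10G Deployments use &lt;code&gt;nvidia.com/gpumem-percentage: "45"&lt;/code&gt; in place of the absolute MiB value):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;resources:
  limits:
    nvidia.com/gpu: "1"        # one (shared) physical GPU per Pod
    nvidia.com/gpumem: "7500"  # hard VRAM cap in MiB, enforced inside the container by HAMi
&lt;/code&gt;&lt;/pre&gt;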

&lt;h2&gt;
  
  
  Expected Results: GPU Binpack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;T4 deployment&lt;/strong&gt; (&lt;code&gt;vllm-t4-qwen25-1-5b&lt;/code&gt; with &lt;code&gt;replicas: 2&lt;/code&gt;): both replicas are scheduled to the &lt;strong&gt;same T4 GPU&lt;/strong&gt; on the T4 node.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A10G deployments&lt;/strong&gt; (&lt;code&gt;vllm-a10g-mistral7b-awq&lt;/code&gt; and &lt;code&gt;vllm-a10g-qwen25-7b-awq&lt;/code&gt;): both land on the &lt;strong&gt;same A10G GPU&lt;/strong&gt; on the A10G node (45% + 45% &amp;lt; 100%).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How to verify co‑location &amp;amp; memory caps
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;In‑pod verification (&lt;code&gt;nvidia-smi&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# A10G pair&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;p &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;kubectl get pods &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;vllm-a10g-mistral7b-awq &lt;span class="nt"&gt;-o&lt;/span&gt; name&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
           kubectl get pods &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;vllm-a10g-qwen25-7b-awq &lt;span class="nt"&gt;-o&lt;/span&gt; name&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"== &lt;/span&gt;&lt;span class="nv"&gt;$p&lt;/span&gt;&lt;span class="s2"&gt; =="&lt;/span&gt;
    &lt;span class="c"&gt;# Show the GPU UUID (co‑location check)&lt;/span&gt;
    kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;p&lt;/span&gt;&lt;span class="p"&gt;#pod/&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; nvidia-smi &lt;span class="nt"&gt;--query-gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;uuid &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;csv,noheader
    &lt;span class="c"&gt;# Show memory cap (total) and current usage inside the container view&lt;/span&gt;
    kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;p&lt;/span&gt;&lt;span class="p"&gt;#pod/&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; nvidia-smi &lt;span class="nt"&gt;--query-gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;name,memory.total,memory.used &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;csv,noheader
    &lt;span class="nb"&gt;echo
&lt;/span&gt;&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The two A10G Pods print the &lt;strong&gt;same GPU UUID&lt;/strong&gt; → confirms &lt;strong&gt;co‑location on the same physical A10G&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memory.total&lt;/code&gt; inside each container ≈ &lt;strong&gt;45% of A10G VRAM&lt;/strong&gt; (slightly less due to driver/overhead; e.g., ~&lt;strong&gt;10,3xx MiB&lt;/strong&gt;), and &lt;code&gt;memory.used&lt;/code&gt; stays below that cap.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;==&lt;/span&gt; pod/vllm-a10g-mistral7b-awq-5f78b4c6b4-q84k7 &lt;span class="o"&gt;==&lt;/span&gt;
GPU-d4acf062-3ba9-8454-2660-aae402f7a679
NVIDIA A10G, 10362 MiB, 7241 MiB

&lt;span class="o"&gt;==&lt;/span&gt; pod/vllm-a10g-qwen25-7b-awq-6d5b5d94b-nxrbj &lt;span class="o"&gt;==&lt;/span&gt;
GPU-d4acf062-3ba9-8454-2660-aae402f7a679
NVIDIA A10G, 10362 MiB, 7355 MiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;








&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# T4 pair (2 replicas of the same Deployment)&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;p &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;kubectl get pods &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;vllm-t4-qwen25-1-5b &lt;span class="nt"&gt;-o&lt;/span&gt; name&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"== &lt;/span&gt;&lt;span class="nv"&gt;$p&lt;/span&gt;&lt;span class="s2"&gt; =="&lt;/span&gt;
        kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;p&lt;/span&gt;&lt;span class="p"&gt;#pod/&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; nvidia-smi &lt;span class="nt"&gt;--query-gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;uuid &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;csv,noheader
        kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;p&lt;/span&gt;&lt;span class="p"&gt;#pod/&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; nvidia-smi &lt;span class="nt"&gt;--query-gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;name,memory.total,memory.used &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;csv,noheader
    &lt;span class="nb"&gt;echo
&lt;/span&gt;&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Both replicas print the &lt;strong&gt;same T4 GPU UUID&lt;/strong&gt; → confirms &lt;strong&gt;co‑location on the same T4&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memory.total&lt;/code&gt; = &lt;strong&gt;7500 MiB&lt;/strong&gt; (from &lt;code&gt;nvidia.com/gpumem: "7500"&lt;/code&gt;) and &lt;code&gt;memory.used&lt;/code&gt; stays under it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;==&lt;/span&gt; pod/vllm-t4-qwen25-1-5b-55f98dbcf4-mgw8d &lt;span class="o"&gt;==&lt;/span&gt;
GPU-f8e75627-86ed-f202-cf2b-6363fb18d516
Tesla T4, 7500 MiB, 5111 MiB

&lt;span class="o"&gt;==&lt;/span&gt; pod/vllm-t4-qwen25-1-5b-55f98dbcf4-rn5m4 &lt;span class="o"&gt;==&lt;/span&gt;
GPU-f8e75627-86ed-f202-cf2b-6363fb18d516
Tesla T4, 7500 MiB, 5045 MiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Quick Inference Checks
&lt;/h2&gt;

&lt;p&gt;Port‑forward each service locally and send a tiny request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T4 / Qwen2.5‑1.5B&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward svc/vllm-t4-qwen25-1-5b 8001:8000

curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://127.0.0.1:8001/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-binary&lt;/span&gt; @- &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;JSON&lt;/span&gt;&lt;span class="sh"&gt;' | jq -r '.choices[0].message.content'
{
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "temperature": 0.2,
  "messages": [
    {
      "role": "user",
      "content": "Summarize this email in 2 bullets and draft a one-sentence reply:&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="sh"&gt;n&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="sh"&gt;nSubject: Renewal quote &amp;amp; SSO&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="sh"&gt;n&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="sh"&gt;nHi team, we want a renewal quote, prefer monthly billing, and we need SSO by the end of the month. Can you confirm timeline?&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="sh"&gt;n&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="sh"&gt;n— Alex"
    }
  ]
}
&lt;/span&gt;&lt;span class="no"&gt;JSON
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Summary:
- Request for renewal quote with preference for monthly billing.
- Need Single Sign-On (SSO) by the end of the month.

Reply:
Thank you, Alex. I will ensure that both the renewal quote and SSO request are addressed promptly. We aim to have everything ready before the end of the month.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;A10G / Mistral‑7B‑AWQ&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward svc/vllm-a10g-mistral7b-awq 8002:8000

curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://127.0.0.1:8002/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-binary&lt;/span&gt; @- &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;JSON&lt;/span&gt;&lt;span class="sh"&gt;' | jq -r '.choices[0].message.content'
{
  "model": "solidrust/Mistral-7B-Instruct-v0.3-AWQ",
  "temperature": 0.3,
  "messages": [
    {
      "role": "user",
      "content": "Write a 3-sentence weekly update about improving GPU sharing on EKS with memory capping. Audience: non-technical executives."
    }
  ]
}
&lt;/span&gt;&lt;span class="no"&gt;JSON
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;In our ongoing efforts to optimize cloud resources, we're pleased to announce significant progress in enhancing GPU sharing on Amazon Elastic Kubernetes Service (EKS). By implementing memory capping, we're ensuring that each GPU-enabled pod on EKS is allocated a defined amount of memory, preventing overuse and improving overall system efficiency. This update will lead to reduced costs and improved performance for our GPU-intensive applications, ultimately boosting our competitive edge in the market.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;A10G / Qwen2.5‑7B‑AWQ&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward svc/vllm-a10g-qwen25-7b-awq 8003:8000

curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://127.0.0.1:8003/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-binary&lt;/span&gt; @- &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;JSON&lt;/span&gt;&lt;span class="sh"&gt;' | jq -r '.choices[0].message.content'
{
  "model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
  "temperature": 0.2,
  "messages": [
    {
      "role": "user",
      "content": "You are a customer support assistant for an e-commerce store.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;Task:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;1) Read the ticket.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;2) Return ONLY valid JSON with fields: intent, sentiment, order_id, item, eligibility, next_steps, customer_reply.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;3) Keep the reply friendly, concise, and action-oriented.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;Ticket:&lt;/span&gt;&lt;span class="se"&gt;\n\"&lt;/span&gt;&lt;span class="sh"&gt;Order #A1234 — Hi, I bought running shoes 26 days ago. They’re too small. Can I exchange for size 10? I need them before next weekend. Happy to pay the price difference if needed. — Jamie&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;"
    }
  ]
}
&lt;/span&gt;&lt;span class="no"&gt;JSON
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Request for exchange"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sentiment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Neutral"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"order_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"A1234"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"item"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Running shoes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"eligibility"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Eligible for exchange within 30 days"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"next_steps"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"We can exchange your shoes for size 10. Please ship back the current pair and we'll send the new ones."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"customer_reply"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Thank you! Can you please confirm the shipping details?"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
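&lt;p&gt;Since the prompt asks for &lt;em&gt;only valid JSON&lt;/em&gt;, it helps to check the response mechanically rather than by eye. A minimal sketch with &lt;code&gt;jq&lt;/code&gt;, assuming the reply has been saved to a hypothetical &lt;code&gt;reply.json&lt;/code&gt; (the field list comes from the prompt above):&lt;/p&gt;

```shell
# reply.json is a hypothetical file holding the model's JSON reply
# (e.g. redirect the `jq -r '.choices[0].message.content'` output above into it).
# Verify that every field the prompt requires is present in the reply.
required='["intent","sentiment","order_id","item","eligibility","next_steps","customer_reply"]'
jq -e --argjson req "$required" '($req - keys_unsorted) == []' reply.json \
  && echo "all required fields present"
```

&lt;p&gt;&lt;code&gt;jq -e&lt;/code&gt; sets a non-zero exit code when the check fails, so the same one-liner also works as a gate in a script or CI step.&lt;/p&gt;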



&lt;h2&gt;
  
  
  Clean Up
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;infra/aws
terraform destroy &lt;span class="nt"&gt;-auto-approve&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Coming next (mini-series)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Advanced scheduling&lt;/strong&gt;: &lt;strong&gt;GPU &amp;amp; Node&lt;/strong&gt; binpack/spread, anti‑affinity, &lt;strong&gt;NUMA‑aware&lt;/strong&gt; and &lt;strong&gt;NVLink‑aware&lt;/strong&gt; placement, UUID pinning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container‑level monitoring&lt;/strong&gt;: simple, reproducible checks for allocation &amp;amp; usage; shareable dashboards.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Under the hood&lt;/strong&gt;: HAMi scheduling flow &amp;amp; HAMi‑core memory/compute capping (concise deep dive).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DRA&lt;/strong&gt;: community feature under active development; we’ll cover &lt;strong&gt;support progress &amp;amp; plan&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ecosystem demos&lt;/strong&gt;: Kubeflow, vLLM Production Stack, Volcano, Xinference, JupyterHub. (&lt;em&gt;vLLM Production Stack, Volcano, and Xinference already have native integrations.&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>aws</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
