<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Yocheved k</title>
    <description>The latest articles on Forem by Yocheved k (@yocheved).</description>
    <link>https://forem.com/yocheved</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3651708%2F37d87c64-4e7d-4562-b0d8-8193e40f007d.png</url>
      <title>Forem: Yocheved k</title>
      <link>https://forem.com/yocheved</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/yocheved"/>
    <language>en</language>
    <item>
      <title>Deploying NVIDIA Dynamo &amp; LMCache for LLMs: Installation, Containers, and Integration</title>
      <dc:creator>Yocheved k</dc:creator>
      <pubDate>Thu, 11 Dec 2025 11:01:57 +0000</pubDate>
      <link>https://forem.com/yocheved/deploying-nvidia-dynamo-lmcache-for-llms-installation-containers-and-integration-32of</link>
      <guid>https://forem.com/yocheved/deploying-nvidia-dynamo-lmcache-for-llms-installation-containers-and-integration-32of</guid>
      <description>&lt;p&gt;As large language models continue to scale, they consistently exceed the memory and compute limits of any single GPU. Tensor parallelism addresses the capacity issue by splitting each layer's weights across multiple GPUs, and often across multiple servers, but it introduces a new challenge: &lt;strong&gt;how do we synchronize shards, route requests, and share KV-cache efficiently enough to behave like a single cohesive accelerator?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This orchestration gap is exactly what NVIDIA Dynamo is designed to solve.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is NVIDIA Dynamo?
&lt;/h2&gt;

&lt;p&gt;NVIDIA Dynamo is a distributed orchestration layer that enhances LLM inference by intelligently coordinating multi-GPU and multi-node workloads. It is inference-engine-agnostic and plugs seamlessly into frameworks such as TRT-LLM, vLLM, SGLang, and others.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faf8zrya3z3ubuh12vlk4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faf8zrya3z3ubuh12vlk4.png" alt=" " width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Dynamo introduces several LLM-specific capabilities that dramatically improve system performance:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Capabilities&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Disaggregated prefill &amp;amp; decode inference&lt;/strong&gt;&lt;br&gt;
Maximizes GPU utilization and enables fine-grained latency/throughput trade-offs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic GPU scheduling&lt;/strong&gt;&lt;br&gt;
Adapts resource allocation based on real-time workload demand.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-aware request routing&lt;/strong&gt;&lt;br&gt;
Eliminates redundant KV-cache recomputation for faster inference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accelerated data transfer (NIXL)&lt;/strong&gt;&lt;br&gt;
Reduces inter-GPU communication overhead and improves response times.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KV-cache offloading&lt;/strong&gt;&lt;br&gt;
Leverages multi-tier memory hierarchies (HBM, DRAM, SSD) for higher throughput at lower cost.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Altogether, Dynamo provides the distributed intelligence required to make large-scale LLM inference behave as though all hardware resources were a single unified accelerator.&lt;/p&gt;




&lt;h2&gt;
  
  
  Installation &amp;amp; Setup Guide
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Clone the Dynamo Repository&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;git clone --branch v0.4.1 --depth 1 https://github.com/ai-dynamo/dynamo.git&lt;br&gt;
cd dynamo&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Build the Docker Image&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;First, bring up the supporting services that Dynamo relies on (etcd and NATS) with Docker Compose:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker compose -f deploy/docker-compose.yml up -d&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
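
&lt;p&gt;To confirm the services came up, you can list what that Compose file is running (the exact service names depend on the file shipped with your Dynamo version):&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker compose -f deploy/docker-compose.yml ps&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;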



&lt;p&gt;Then build the framework-specific container image (vLLM in this example):&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;./container/build.sh --framework VLLM&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Create and Run the Container&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;./container/run.sh -it --framework VLLM [--mount-workspace]&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Or attach to an existing one:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker exec -it &amp;lt;container_name&amp;gt; bash&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  Running Dynamo on a Single Node
&lt;/h2&gt;

&lt;p&gt;Inside the container, launch Dynamo with a specified model:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;python -m dynamo.vllm --model &amp;lt;path_to_model&amp;gt;&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;If HBM capacity is limited, cap the maximum context length via:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--max-model-len &amp;lt;size&amp;gt;&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
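
&lt;p&gt;For example, a hypothetical invocation that caps the context window at 8192 tokens (the number is only illustrative; pick a value that fits your GPU memory):&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;python -m dynamo.vllm --model &amp;lt;path_to_model&amp;gt; --max-model-len 8192&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;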

&lt;p&gt;Then start the backend services with the aggregated launch script:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cd components/backends/vllm&lt;br&gt;
bash launch/agg.sh&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
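
&lt;p&gt;Once the services are running, you can send a quick smoke test. The sketch below assumes the frontend exposes Dynamo's default OpenAI-compatible endpoint on localhost port 8000 and that the model name in the request matches the model you served; adjust both if your deployment differs:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;curl http://localhost:8000/v1/chat/completions \&lt;br&gt;
-H "Content-Type: application/json" \&lt;br&gt;
-d '{"model": "&amp;lt;path_to_model&amp;gt;", "messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 32}'&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;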




&lt;h2&gt;
  
  
  Running Dynamo with LMCache Integration
&lt;/h2&gt;

&lt;p&gt;To enable LMCache and set the maximum amount of CPU memory used for KV-cache offload (LMCache interprets this value in GB):&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;LMCACHE_MAX_LOCAL_CPU_SIZE=500 \&lt;br&gt;
python -m dynamo.vllm --model &amp;lt;path_to_model&amp;gt;&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Launch the LMCache-enabled backend:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cd components/backends/vllm&lt;br&gt;
bash launch/agg_lmcache.sh&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
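
&lt;p&gt;A simple way to sanity-check the offload path is to send the same request twice and compare wall-clock latency: when the prompt's prefill KV-cache is reused from LMCache's CPU buffer, the second request should return faster, with the effect most pronounced for long prompts. The sketch below again assumes the default OpenAI-compatible endpoint on port 8000; substitute a genuinely long prompt to make the difference visible:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;REQ='{"model": "&amp;lt;path_to_model&amp;gt;", "messages": [{"role": "user", "content": "Explain KV-cache offloading in three sentences."}], "max_tokens": 32}'&lt;br&gt;
time curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d "$REQ" &amp;gt; /dev/null&lt;br&gt;
time curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d "$REQ" &amp;gt; /dev/null&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;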

</description>
      <category>devops</category>
      <category>performance</category>
      <category>llm</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
