<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Pavan Shiraguppi</title>
    <description>The latest articles on Forem by Pavan Shiraguppi (@shiraguppipavan).</description>
    <link>https://forem.com/shiraguppipavan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1139865%2Ff6d31a95-d8d9-4fd2-bc5e-5a3230f17e85.jpeg</url>
      <title>Forem: Pavan Shiraguppi</title>
      <link>https://forem.com/shiraguppipavan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/shiraguppipavan"/>
    <language>en</language>
    <item>
      <title>Multi-tenancy in Kubernetes using Vcluster</title>
      <dc:creator>Pavan Shiraguppi</dc:creator>
      <pubDate>Thu, 24 Aug 2023 09:40:18 +0000</pubDate>
      <link>https://forem.com/cloudraft/multi-tenancy-in-kubernetes-using-vcluster-2ib9</link>
      <guid>https://forem.com/cloudraft/multi-tenancy-in-kubernetes-using-vcluster-2ib9</guid>
      <description>&lt;p&gt;Kubernetes has revolutionized how organizations deploy and manage containerized applications, making it easier to orchestrate and scale applications across clusters. However, running multiple heterogeneous workloads on a shared Kubernetes cluster comes with challenges like resource contention, security risks, lack of customization, and complex management.&lt;/p&gt;

&lt;p&gt;There are several approaches to implementing isolation and multi-tenancy within Kubernetes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes namespaces&lt;/strong&gt;: Namespaces provide basic isolation by dividing cluster resources between different users. However, all namespaces share the same control plane, physical infrastructure, and kernel resources, so there are hard limits to isolation and customization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes distributions&lt;/strong&gt;: Popular Kubernetes distributions like &lt;a href="https://www.redhat.com/en/technologies/cloud-computing/openshift" rel="noopener noreferrer"&gt;Red Hat OpenShift&lt;/a&gt; and &lt;a href="https://www.rancher.com/" rel="noopener noreferrer"&gt;Rancher&lt;/a&gt; support virtual clusters. These leverage Kubernetes-native capabilities like namespaces, RBAC, and network policies more efficiently. Other benefits include centralized control planes, pre-configured cluster templates, and easy-to-use management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hierarchical namespaces&lt;/strong&gt;: In a traditional Kubernetes cluster, each namespace is independent: users and applications in one namespace cannot access resources in another unless they are granted explicit permissions. Hierarchical namespaces address this by defining parent-child relationships between namespaces, so a user or application with permissions in a parent namespace automatically inherits them in all of its children. This makes it much easier to manage permissions across many namespaces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vcluster project&lt;/strong&gt;: The virtual cluster (vcluster) project addresses these pain points by dividing a physical Kubernetes cluster into multiple isolated software-defined clusters. vcluster allows organizations to provide development teams, applications, and customers with dedicated Kubernetes environments with guaranteed resources, security policies, and custom configurations.
This post will dive deep into vcluster - its capabilities, different implementation options, use cases, and challenges. We will also look into the best practices for maximizing utilization and simplifying the management of vcluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  What is Vcluster?
&lt;/h1&gt;

&lt;p&gt;vcluster is an open-source tool that allows you to create and manage virtual Kubernetes clusters. A virtual Kubernetes cluster is a fully functional Kubernetes cluster that runs on top of another Kubernetes cluster. vcluster works by creating a virtual cluster inside a namespace of the underlying Kubernetes cluster. The virtual cluster has its own control plane, but it shares the worker nodes and networking of the underlying cluster. This makes vcluster a lightweight solution that can be deployed on any Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;When you create a vcluster, the vcluster CLI deploys the virtual cluster's control plane as pods inside a namespace of the host cluster. Workloads that you schedule in the virtual cluster are synced down to the host cluster's worker nodes. You can then deploy workloads to the virtual cluster using the kubectl CLI, just as with any other cluster.&lt;/p&gt;

&lt;p&gt;You can learn more about vcluster on the vcluster &lt;a href="https://vcluster.com" rel="noopener noreferrer"&gt;website&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Benefits of Using Vcluster
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Resource Isolation
&lt;/h2&gt;

&lt;p&gt;vcluster allows you to allocate a portion of the central cluster's resources like CPU, memory, and storage to individual virtual clusters. This prevents noisy neighbor issues when multiple teams share the same physical cluster. Critical workloads can be assured of the resources they need without interference.&lt;/p&gt;
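&lt;p&gt;As a sketch, resource limits for a vcluster can be enforced with a standard ResourceQuota on the host namespace that backs the virtual cluster; the namespace name and limits below are illustrative assumptions:&lt;/p&gt;

```yaml
# Illustrative quota on the host namespace backing a vcluster.
# Namespace name and limit values are assumptions; adjust to your environment.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: vcluster-team-a   # host namespace that backs the vcluster
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    requests.storage: 100Gi
```

&lt;p&gt;Because every pod created inside the vcluster is synced into this host namespace, the quota caps the virtual cluster's total consumption.&lt;/p&gt;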

&lt;h2&gt;
  
  
  Access Control
&lt;/h2&gt;

&lt;p&gt;With vcluster, access policies can be implemented at the virtual cluster level, ensuring only authorized users have access. For example, sensitive workloads like financial applications can run in an isolated vcluster. Restricting access is much simpler compared to namespace-level policies.&lt;/p&gt;
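&lt;p&gt;For illustration, access to a vcluster can be gated at the host level with ordinary Kubernetes RBAC on the namespace that hosts it; the namespace and group names below are hypothetical:&lt;/p&gt;

```yaml
# Hypothetical RBAC: grant one team access to the vcluster's host namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: finance-vcluster-access
  namespace: vcluster-finance        # host namespace of the isolated vcluster
subjects:
  - kind: Group
    name: finance-team               # assumed group from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: admin                        # Kubernetes built-in aggregated admin role
  apiGroup: rbac.authorization.k8s.io
```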

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd33wubrfki0l68.cloudfront.net%2F839d230e5e7af9a310459ea7ae559f9bf81dcef4%2Fded1f%2Fdocs%2Fmedia%2Fdiagrams%2Fvcluster-architecture.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd33wubrfki0l68.cloudfront.net%2F839d230e5e7af9a310459ea7ae559f9bf81dcef4%2Fded1f%2Fdocs%2Fmedia%2Fdiagrams%2Fvcluster-architecture.svg" alt="vcluster architecture"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://www.vcluster.com/docs/architecture/basics" rel="noopener noreferrer"&gt;Basics | vcluster docs | Virtual Clusters for&lt;br&gt;
Kubernetes&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Customization
&lt;/h2&gt;

&lt;p&gt;vcluster allows extensive customization for individual teams' needs - different Kubernetes versions, network policies, ingress rules, and resource quotas can be defined. Developers can have permission to modify their vcluster without impacting others.&lt;/p&gt;
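&lt;p&gt;As an example of this customization, vcluster accepts a Helm-style values file. The snippet below sketches pinning a different Kubernetes (k3s) version for one team's vcluster; the exact keys and image tag are assumptions, so consult the vcluster docs for your version:&lt;/p&gt;

```yaml
# values.yaml -- illustrative vcluster configuration.
# Key names and the image tag are assumptions; verify against the vcluster docs.
vcluster:
  image: rancher/k3s:v1.25.3-k3s1   # run a specific Kubernetes version inside the vcluster
syncer:
  extraArgs:
    - --tls-san=my-vcluster.example.com
```

&lt;p&gt;A vcluster could then be created with something like &lt;code&gt;vcluster create team-a -f values.yaml&lt;/code&gt;.&lt;/p&gt;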
&lt;h2&gt;
  
  
  Multitenancy
&lt;/h2&gt;

&lt;p&gt;Organizations often need to provide Kubernetes access to multiple internal teams or external customers. vcluster makes multi-tenancy easy to implement by creating separate isolated environments in the same physical cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frafay.co%2Fwp-content%2Fuploads%2F2023%2F03%2F1674585426944-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frafay.co%2Fwp-content%2Fuploads%2F2023%2F03%2F1674585426944-1.png" alt="vcluster architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://rafay.co/the-kubernetes-current/key-considerations-when-implementing-virtual-kubernetes-clusters/" rel="noopener noreferrer"&gt;Implementing Virtual Kubernetes Clusters | Rafay&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Easy Scaling
&lt;/h2&gt;

&lt;p&gt;Additional vclusters can be quickly spun up or torn down to handle dynamic workloads and scaling requirements. New development and testing environments can be provisioned instantly without scaling the entire physical cluster.&lt;/p&gt;
&lt;h1&gt;
  
  
  Workload Isolation Approaches Before vcluster
&lt;/h1&gt;

&lt;p&gt;Organizations have leveraged various Kubernetes native features to enable some workload isolation before virtual clusters emerged as a solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Namespaces&lt;/strong&gt; - Namespaces segregate cluster resources between different teams or applications. They provide basic isolation via resource quotas and network policies. However, there is no hypervisor-level isolation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Policies&lt;/strong&gt; - Granular network policies restrict communication between pods and namespaces. This creates network segmentation between workloads. However, resource contention can still occur.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Taints and Tolerations&lt;/strong&gt; - Applying taints to nodes prevents pods from scheduling onto them unless the pods have matching tolerations. This enables restricting pods to certain nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Virtual Networks&lt;/strong&gt; - On public clouds, using multiple virtual networks helps isolate Kubernetes cluster traffic. But pods within a cluster can still communicate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Third-Party Network Plugins&lt;/strong&gt; - CNI plugins like Calico, Weave, and Cilium enable building overlay networks and fine-grained network policies to segregate traffic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom Controllers&lt;/strong&gt; - Developing custom Kubernetes controllers allows programmatically isolating resources. But this requires significant programming expertise.&lt;/li&gt;
&lt;/ul&gt;
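&lt;p&gt;To make the network-policy approach concrete, here is a standard default-deny policy that cuts off all traffic to and from a namespace's pods; the namespace name is illustrative:&lt;/p&gt;

```yaml
# Default-deny policy: blocks all ingress and egress for pods in team-a.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-a
spec:
  podSelector: {}        # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```

&lt;p&gt;More permissive policies are then layered on top to allow only the traffic each workload actually needs.&lt;/p&gt;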
&lt;h1&gt;
  
  
  Demo of vcluster
&lt;/h1&gt;
&lt;h2&gt;
  
  
  Install vcluster CLI
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;kubectl (check via kubectl version)&lt;/li&gt;
&lt;li&gt;helm v3 (check with helm version)&lt;/li&gt;
&lt;li&gt;a working kube-context with access to a Kubernetes cluster (check with kubectl get namespaces)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use the following command to download the vcluster CLI binary for arm64-based Ubuntu machines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-L&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; vcluster &lt;span class="s2"&gt;"https://github.com/loft-sh/vcluster/releases/latest/download/vcluster-linux-arm64"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo install&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; 0755 vcluster /usr/local/bin &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; vcluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To confirm that the vcluster CLI is installed successfully, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vcluster &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For installation on other platforms, please refer to the following link:&lt;br&gt;
&lt;a href="https://www.vcluster.com/docs/getting-started/setup" rel="noopener noreferrer"&gt;Install vcluster CLI&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Deploy vcluster
&lt;/h2&gt;

&lt;p&gt;Let's create a virtual cluster named &lt;em&gt;my-first-vcluster&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vcluster create my-first-vcluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Connect to the vcluster
&lt;/h2&gt;

&lt;p&gt;To connect to the vcluster, enter the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vcluster connect my-first-vcluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the kubectl CLI to list the namespaces in the connected vcluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get namespaces
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deploy an application to the vcluster
&lt;/h2&gt;

&lt;p&gt;Now let's deploy a sample nginx deployment inside the vcluster. To create a deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create namespace demo-nginx
kubectl create deployment nginx-deployment &lt;span class="nt"&gt;-n&lt;/span&gt; demo-nginx &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isolates the application in the &lt;em&gt;demo-nginx&lt;/em&gt; namespace inside the vcluster.&lt;/p&gt;

&lt;p&gt;You can verify that this deployment created pods inside the vcluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; demo-nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Check deployments from the host cluster
&lt;/h2&gt;

&lt;p&gt;Now that we have confirmed the deployment inside the vcluster, let us check how it appears from the host cluster.&lt;/p&gt;

&lt;p&gt;To disconnect from the vcluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vcluster disconnect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This switches the kube-context back to the host cluster. Now let us check whether any deployments are visible in the host cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get deployments &lt;span class="nt"&gt;-n&lt;/span&gt; vcluster-my-first-vcluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There will be no resources found in the &lt;em&gt;vcluster-my-first-vcluster&lt;/em&gt; namespace. This is because the Deployment object lives only inside the vcluster's own API server, which is not visible from the host cluster.&lt;/p&gt;

&lt;p&gt;Now let us check which pods are running in the vcluster's host namespace using the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; vcluster-my-first-vcluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Voila! We can now see that the nginx container is running in the &lt;em&gt;vcluster-my-first-vcluster&lt;/em&gt; namespace. While the Deployment object is only visible inside the vcluster, the pods it creates are synced to the host cluster and actually run there.&lt;/p&gt;

&lt;h1&gt;
  
  
  Vcluster Use Cases
&lt;/h1&gt;

&lt;p&gt;Virtual clusters enable several important use cases by providing isolated and customizable Kubernetes environments within a single physical cluster. Let's explore some of these in more detail:&lt;/p&gt;

&lt;h2&gt;
  
  
  Development and Testing Environments
&lt;/h2&gt;

&lt;p&gt;Allocating dedicated virtual clusters for developer teams allows them to fully control the configuration without affecting production workloads or other developers.&lt;br&gt;
Teams can customize their vclusters with required Kubernetes versions, network policies, resource quotas, and access controls. Development teams can rapidly spin up and tear down vclusters to test different configurations.&lt;br&gt;
Since vclusters provide guaranteed compute and storage resources, developers don't have to compete for capacity, and they won't impact the performance of applications running in other vclusters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Application Isolation
&lt;/h2&gt;

&lt;p&gt;Enterprise applications like ERP, CRM, and financial systems require predictable performance, high availability, and strict security. Dedicated vclusters allow these production workloads to operate unaffected by other applications.&lt;br&gt;
Mission-critical applications can be allocated reserved capacity to avoid resource contention. Custom network policies guarantee isolation. Vclusters also allow granular role-based access control to meet regulatory compliance needs.&lt;br&gt;
Rather than overprovisioning large clusters to avoid interference, vclusters provide guaranteed resources at a lower cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multitenancy
&lt;/h2&gt;

&lt;p&gt;Service providers and enterprises with multiple business units often need to securely provide Kubernetes access to different internal teams or external customers.&lt;br&gt;
vclusters simplify multi-tenancy by creating separate self-service environments for each tenant with appropriate resource limits and access policies applied. Providers can easily onboard new customers by spinning up additional vclusters.&lt;br&gt;
This removes noisy neighbor issues and allows a high density of workloads by packing vclusters according to actual usage rather than peak needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Regulatory Compliance
&lt;/h2&gt;

&lt;p&gt;Heavily regulated industries like finance and healthcare have strict security and compliance requirements around data privacy, geography, and access controls.&lt;br&gt;
Dedicated vclusters with internal network segmentation, role-based access control, and resource isolation make it easier to host compliant workloads safely alongside other applications in the same cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Temporary Resources
&lt;/h2&gt;

&lt;p&gt;vclusters allow instantly spinning up temporary Kubernetes environments to handle use cases like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Testing cluster upgrades&lt;/strong&gt; - New Kubernetes versions can be deployed to lower environments with no downtime or impact on production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluating new applications&lt;/strong&gt; - Applications can be deployed into disposable vclusters instead of shared dev clusters to prevent conflicts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity spikes&lt;/strong&gt; - New vclusters provide burst capacity for traffic spikes versus overprovisioning the entire cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Special events&lt;/strong&gt; - vclusters can be created temporarily for workshops, conferences, and other events.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the need is over, these vclusters can simply be deleted with no lasting footprint on the cluster.&lt;/p&gt;
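&lt;p&gt;For instance, tearing down a temporary environment is a single CLI call:&lt;/p&gt;

```shell
# Delete a vcluster and its synced resources once the temporary need is over.
vcluster delete my-temp-vcluster
```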

&lt;h2&gt;
  
  
  Workload Consolidation
&lt;/h2&gt;

&lt;p&gt;As organizations scale their Kubernetes footprint, there is a need to consolidate multiple clusters onto shared infrastructure without interfering with existing applications.&lt;br&gt;
Migrating applications into vclusters provides logical isolation and customization allowing them to run seamlessly alongside other workloads. This improves utilization and reduces operational overhead.&lt;br&gt;
vclusters allow enterprise IT to provide a consistent Kubernetes platform across the organization while preserving isolation.&lt;br&gt;
In summary, vclusters are an essential tool for optimizing Kubernetes environments via workload isolation, customization, security, and density. The use cases highlight how they benefit diverse needs from developers to Ops to business units within an organization.&lt;/p&gt;

&lt;h1&gt;
  
  
  Challenges with vclusters
&lt;/h1&gt;

&lt;p&gt;While vclusters deliver significant benefits, there are some downsides to weigh:&lt;/p&gt;

&lt;h2&gt;
  
  
  Complexity
&lt;/h2&gt;

&lt;p&gt;Managing multiple virtual clusters, albeit smaller ones, introduces more operational overhead compared to a single large Kubernetes cluster.&lt;br&gt;
Additional tasks include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provisioning and configuring multiple control planes&lt;/li&gt;
&lt;li&gt;Applying security policies and access controls consistently across vclusters&lt;/li&gt;
&lt;li&gt;Monitoring and logging across vclusters&lt;/li&gt;
&lt;li&gt;Maintaining designated resources and capacity for each vcluster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, a cluster administrator has to configure and update RBAC policies across 20 vclusters rather than one, which takes more effort than the centralized management of a single cluster. Statically assigned IP addresses and host ports can also cause conflicts between vclusters sharing the same nodes.&lt;/p&gt;
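&lt;p&gt;One common mitigation is to script policy rollout across vcluster kube-contexts. The helper below is a sketch; the function name and the context names in the usage comment are hypothetical:&lt;/p&gt;

```shell
# Hypothetical helper: apply one RBAC manifest to many vcluster kube-contexts.
# Assumes each vcluster already has a kube-context (e.g. created by `vcluster connect`).
apply_to_vclusters() {
  local manifest="$1"; shift
  local ctx
  for ctx in "$@"; do
    kubectl --context "$ctx" apply -f "$manifest"
  done
}

# Example usage (context names are illustrative):
# apply_to_vclusters rbac.yaml vcluster_team-a vcluster_team-b
```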

&lt;h2&gt;
  
  
  Resource allocation and management
&lt;/h2&gt;

&lt;p&gt;Balancing the resource consumption and performance of vclusters can be tricky, as they may have different demands or expectations.&lt;/p&gt;

&lt;p&gt;For example, vclusters may need to scale up or down depending on the workload or share resources with other vclusters or namespaces. A vcluster sized for an application's peak demand may have excess unused capacity during non-peak periods that sits idle and cannot be leveraged by other vclusters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limited Customization
&lt;/h2&gt;

&lt;p&gt;The ability to customize vclusters varies across implementations. Namespaces offer the least flexibility, while Cluster API provides the most. Tools like OpenShift balance customization with simplicity.&lt;br&gt;
For example, namespaces cannot run different Kubernetes versions or network plugins. The Cluster API allows full customization but with more complexity.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Vcluster empowers Kubernetes users to customize, isolate and scale workloads within a shared physical cluster. By allocating dedicated control plane resources and access policies, vclusters provide strong technical isolation. For use cases like multitenancy, vclusters deliver simplified and more secure Kubernetes management.&lt;/p&gt;

&lt;p&gt;Vcluster can also reduce Kubernetes cost overhead and is well suited for ephemeral environments.&lt;br&gt;
Tools like OpenShift, Rancher, and Kubernetes Cluster API make deploying and managing vclusters much easier. As adoption increases, we can expect more innovations in the vcluster space to further simplify operations and maximize utilization. While vclusters have some drawbacks, for many organizations the benefits outweigh the added complexity.&lt;/p&gt;

&lt;p&gt;We are working on some exciting projects using vcluster to build large-scale systems. Feel free to &lt;a href="https://cloudraft.io/contact-us" rel="noopener noreferrer"&gt;contact us&lt;/a&gt; to discuss how to use vcluster for your use case.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>vcluster</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Deploy LLM on Kubernetes using OpenLLM</title>
      <dc:creator>Pavan Shiraguppi</dc:creator>
      <pubDate>Wed, 16 Aug 2023 06:32:17 +0000</pubDate>
      <link>https://forem.com/cloudraft/deploy-llms-on-kubernetes-using-openllm-3g9c</link>
      <guid>https://forem.com/cloudraft/deploy-llms-on-kubernetes-using-openllm-3g9c</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Natural Language Processing (NLP) has evolved significantly, with Large Language Models (LLMs) at the forefront of cutting-edge applications. Their ability to understand and generate human-like text has revolutionized various industries. Deploying and testing these LLMs effectively is crucial for harnessing their capabilities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/bentoml/OpenLLM" rel="noopener noreferrer"&gt;OpenLLM&lt;/a&gt; is an open-source platform for operating large language models (LLMs) in production. It allows you to run inference on any open-source LLMs, fine-tune them, deploy, and build powerful AI apps with ease.&lt;/p&gt;

&lt;p&gt;This blog post explores the deployment of LLM models using the OpenLLM framework on Kubernetes infrastructure. For the demo, I am using a hardware setup consisting of an RTX 3060 GPU and an Intel i7 12700K processor; with this setup, we will delve into the technical aspects of achieving optimal performance.&lt;/p&gt;

&lt;h1&gt;
  
  
  Environment Setup and Kubernetes Configuration
&lt;/h1&gt;

&lt;p&gt;Before diving into LLM deployment on Kubernetes, we need to ensure the environment is set up correctly and the Kubernetes cluster is ready for action.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preparing the Kubernetes Cluster
&lt;/h2&gt;

&lt;p&gt;Setting up a Kubernetes cluster requires defining the control plane, worker nodes, and networking. Ensure you have Kubernetes installed and a cluster configured. This can be achieved through tools like &lt;code&gt;kubeadm&lt;/code&gt;, &lt;code&gt;minikube&lt;/code&gt;, and &lt;code&gt;kind&lt;/code&gt;, or managed services such as Google Kubernetes Engine (GKE) and Amazon EKS.&lt;/p&gt;

&lt;p&gt;If you are using kind, you can create a cluster as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kind create cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Installing Dependencies and Resources
&lt;/h2&gt;

&lt;p&gt;Within the cluster, install essential dependencies such as NVIDIA GPU drivers, CUDA libraries, and Kubernetes GPU support. These components are crucial for enabling GPU acceleration and maximizing LLM performance.&lt;/p&gt;

&lt;p&gt;To use CUDA on your system, you will need the following installed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A CUDA-capable GPU&lt;/li&gt;
&lt;li&gt;A supported version of Linux with a gcc compiler and toolchain&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.nvidia.com/cuda-downloads" rel="noopener noreferrer"&gt;CUDA Toolkit 12.2 at NVIDIA Developer portal&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
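&lt;p&gt;Beyond the node-level drivers, the cluster needs the NVIDIA device plugin so Kubernetes can schedule GPU workloads. A sketch, noting that the plugin version tag below is an assumption (check the k8s-device-plugin releases page for the current one):&lt;/p&gt;

```shell
# Install the NVIDIA device plugin DaemonSet (version tag is an assumption).
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml

# Confirm the GPU shows up as an allocatable resource on the nodes:
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```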

&lt;h1&gt;
  
  
  Using OpenLLM to Containerize and Load Models
&lt;/h1&gt;

&lt;h2&gt;
  
  
  OpenLLM
&lt;/h2&gt;

&lt;p&gt;OpenLLM supports a wide range of state-of-the-art LLMs, including Llama 2, StableLM, Falcon, Dolly, Flan-T5, ChatGLM, and StarCoder. It also provides flexible APIs that allow you to serve LLMs over a RESTful API or gRPC with one command, or query them via the web UI, CLI, the built-in Python/JavaScript clients, or any HTTP client.&lt;/p&gt;

&lt;p&gt;Some of the key features of OpenLLM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support for a wide range of state-of-the-art LLMs&lt;/li&gt;
&lt;li&gt;Flexible APIs for serving LLMs&lt;/li&gt;
&lt;li&gt;Integration with other powerful tools&lt;/li&gt;
&lt;li&gt;Easy to use&lt;/li&gt;
&lt;li&gt;Open-source&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To use OpenLLM, you need to have Python 3.8 (or newer) and &lt;code&gt;pip&lt;/code&gt; installed on your system. We highly recommend using a Virtual Environment (like conda) to prevent package conflicts.&lt;/p&gt;

&lt;p&gt;You can install OpenLLM using pip as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;openllm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To verify if it's installed correctly, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openllm &lt;span class="nt"&gt;-h&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To start an LLM server, for example an Open Pre-trained Transformer (&lt;a href="https://huggingface.co/docs/transformers/model_doc/opt" rel="noopener noreferrer"&gt;OPT&lt;/a&gt;) server, run the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openllm start opt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
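&lt;p&gt;Once the server is up, it can be queried from any HTTP client. A minimal sketch, assuming the default port 3000 and the &lt;code&gt;/v1/generate&lt;/code&gt; endpoint (verify both against your OpenLLM version):&lt;/p&gt;

```shell
# Query a locally running OpenLLM server (port and endpoint are assumptions).
curl -s -X POST http://localhost:3000/v1/generate \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "What is Kubernetes?", "llm_config": {"max_new_tokens": 128}}'
```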



&lt;h2&gt;
  
  
  Selecting the LLM Model
&lt;/h2&gt;

&lt;p&gt;The OpenLLM framework supports various open-source pre-trained LLMs, such as those listed above. When selecting a large language model (LLM) for your application, the main factors to consider are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model size&lt;/strong&gt; - Larger models like GPT-3 have more parameters and can handle more complex tasks, while smaller ones like GPT-2 are better for simpler use cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture&lt;/strong&gt; - Models optimized for generation (e.g. GPT-3) or understanding (e.g. BERT) align with different use cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training data&lt;/strong&gt; - More high-quality, diverse data leads to better generalization capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuning&lt;/strong&gt; - Pre-trained models can be further trained on domain-specific data to improve performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alignment with use case&lt;/strong&gt; - Validate potential models on your specific application and data to ensure the right balance of complexity and capability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ideal LLM matches your needs in terms of complexity, data requirements, compute resources, and overall capability. Thoroughly evaluate options to select the best fit. For this demo, we will be using the Dolly-2 model with 3B parameters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Loading the Chosen Model within a Container
&lt;/h2&gt;

&lt;p&gt;Containerization enhances reproducibility and portability. Package your LLM model, OpenLLM dependencies, and other relevant libraries within a Docker container. This ensures a consistent runtime environment across different deployments.&lt;/p&gt;

&lt;p&gt;With OpenLLM, you can easily build a Bento for a specific model, like &lt;code&gt;dolly-v2-3b&lt;/code&gt;, using the &lt;code&gt;build&lt;/code&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openllm build dolly-v2 &lt;span class="nt"&gt;--model-id&lt;/span&gt; databricks/dolly-v2-3b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this demo, we are using BentoML, an MLOps platform from the organization behind the OpenLLM project. A &lt;a href="https://docs.bentoml.com/en/latest/concepts/bento.html#what-is-a-bento" rel="noopener noreferrer"&gt;Bento&lt;/a&gt;, in BentoML, is the unit of distribution: it packages your program's source code, models, files, artifacts, and dependencies.&lt;/p&gt;

&lt;p&gt;To Containerize your Bento, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bentoml containerize &amp;lt;name:version&amp;gt; &lt;span class="nt"&gt;-t&lt;/span&gt; dolly-v2-3b:latest &lt;span class="nt"&gt;--opt&lt;/span&gt; &lt;span class="nv"&gt;progress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;plain
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This generates an OCI-compatible Docker image that can be deployed anywhere Docker runs.&lt;/p&gt;

&lt;p&gt;You will be able to locate the generated Docker build context under &lt;code&gt;$BENTO_HOME/bentos/&amp;lt;service-name&amp;gt;/&amp;lt;id&amp;gt;/env/docker&lt;/code&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Model Inference and High Scalability using Kubernetes
&lt;/h1&gt;

&lt;p&gt;Executing model inference efficiently and scaling up when needed are key factors in a Kubernetes-based LLM deployment. The reliability and scalability features of Kubernetes help efficiently scale the model for production use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running LLM Model Inference
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pod Communication&lt;/strong&gt;: Set up communication protocols within pods to manage model input and output. This can involve RESTful APIs or gRPC-based communication.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;OpenLLM's server listens on port 3000 by default. We can define a Deployment as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-deployment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-3b:latest&lt;/span&gt;
          &lt;span class="na"&gt;imagePullPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note&lt;/em&gt;&lt;/strong&gt;: This assumes the image is available locally as &lt;code&gt;dolly-v2-3b:latest&lt;/code&gt;. If the image is pushed to a registry, remove the &lt;code&gt;imagePullPolicy&lt;/code&gt; line and, for a private registry, provide the registry credentials as an image pull secret.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Service&lt;/strong&gt;: Expose the deployment using services to distribute incoming inference requests evenly among multiple pods.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We set up a &lt;code&gt;LoadBalancer&lt;/code&gt;-type Service in our Kubernetes cluster, exposed on port 80. If you are fronting the service with an Ingress, use &lt;code&gt;ClusterIP&lt;/code&gt; instead of &lt;code&gt;LoadBalancer&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LoadBalancer&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Horizontal Scaling and Autoscaling&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal Pod Autoscaling (HPA)&lt;/strong&gt;: Configure HPAs to automatically adjust the number of pods based on CPU or custom metrics. This ensures optimal resource utilization.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We can declare an HPA manifest for CPU-based scaling as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;autoscaling/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HorizontalPodAutoscaler&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-hpa&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTargetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-deployment&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;targetCPUUtilizationPercentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For GPU-based scaling, Kubernetes first needs GPU metrics. To gather them, follow this blog to install the DCGM exporter: &lt;a href="https://iamajayr.medium.com/kubernetes-hpa-using-gpu-metrics-e366ddbfedb7" rel="noopener noreferrer"&gt;Kubernetes HPA using GPU metrics&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After installing the DCGM exporter, we can use the following to create an HPA based on GPU memory utilization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;autoscaling/v2beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HorizontalPodAutoscaler&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-hpa&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTargetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1beta1&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-deployment&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Object&lt;/span&gt;
      &lt;span class="na"&gt;object&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dolly-v2-deployment&lt;/span&gt; &lt;span class="c1"&gt;# kubectl get svc | grep dcgm&lt;/span&gt;
        &lt;span class="na"&gt;metricName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DCGM_FI_DEV_MEM_COPY_UTIL&lt;/span&gt;
        &lt;span class="na"&gt;targetValue&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Autoscaling&lt;/strong&gt;: Enable cluster-level autoscaling to manage resource availability across multiple nodes, accommodating varying workloads. Here are the key steps to configure cluster autoscaling in Kubernetes:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Install the Cluster Autoscaler plugin:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/kubernetes/autoscaler/releases/download/v1.20.0/cluster-autoscaler-component.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Configure auto scaling by setting min/max nodes in your cluster config.&lt;/li&gt;
&lt;li&gt;Annotate node groups you want to scale automatically:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl annotate node POOL_NAME cluster-autoscaler.kubernetes.io/safe-to-evict&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Deploy an auto scaling-enabled application, like an HPA-based deployment. The autoscaler will scale the node pool when pods are unschedulable.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Configure auto scaling parameters as needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adjust scale-up/down delays with &lt;code&gt;--scale-down-delay&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Set scale-down unneeded time with &lt;code&gt;--scale-down-unneeded-time&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Limit scale speed with &lt;code&gt;--max-node-provision-time&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Monitor your cluster autoscaling events:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get events | &lt;span class="nb"&gt;grep &lt;/span&gt;ClusterAutoscaler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Performance Analysis of LLMs in a Kubernetes Environment
&lt;/h1&gt;

&lt;p&gt;Evaluating the performance of LLM deployment within a Kubernetes environment involves latency measurement and resource utilization assessment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Latency Evaluation
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Measuring Latency&lt;/strong&gt;: Use tools like &lt;code&gt;kubectl exec&lt;/code&gt; or custom scripts to measure the time it takes for a pod to process an input prompt and generate a response. Refer to the Python script below to determine latency metrics on the GPU.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Python program to measure latency and tokens per second:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;

&lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;databricks/dolly-v2-3b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sample text for benchmarking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;input_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;reps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="n"&gt;times&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enable_timing&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enable_timing&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Start timer
&lt;/span&gt;    &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Model inference
&lt;/span&gt;    &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;
    &lt;span class="c1"&gt;# End timer
&lt;/span&gt;    &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Sync and get time
&lt;/span&gt;    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synchronize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;elapsed_time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Calculate TPS
&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;tps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;reps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Calculate latency
&lt;/span&gt;&lt;span class="n"&gt;latency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;reps&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="c1"&gt;# in ms
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Avg TPS: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tps&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Avg Latency: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;latency&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
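&lt;p&gt;Averages hide tail behavior, and p95 latency is often the number that matters for serving SLOs. A small self-contained sketch (with synthetic sample data standing in for the recorded &lt;code&gt;times&lt;/code&gt; list) that derives p50/p95 from latency samples:&lt;/p&gt;

```python
import statistics

# Synthetic latency samples in milliseconds, standing in for the
# per-iteration timings collected during benchmarking
times = [12.1, 11.8, 12.4, 30.2, 12.0, 11.9, 12.3, 12.2, 12.5, 45.7]

# quantiles(n=20) returns 19 cut points at 5% steps:
# index 9 is the 50th percentile, index 18 the 95th
q = statistics.quantiles(times, n=20)
p50, p95 = q[9], q[18]
print(f"p50: {p50:.2f} ms, p95: {p95:.2f} ms")
```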



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Comparing Latency using Aviary&lt;/strong&gt;: &lt;a href="https://aviary.anyscale.com/" rel="noopener noreferrer"&gt;Aviary&lt;/a&gt;, an open-source project from Anyscale, lets you submit the same prompt to multiple open-source LLMs and compare their outputs, latency, and cost side by side, which makes it a useful baseline when evaluating your own deployment's numbers.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Resource Utilization and Scalability Insights
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring Resource Consumption&lt;/strong&gt;: Utilize Kubernetes dashboard or monitoring tools like Prometheus and Grafana to observe resource usage patterns across pods.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability Analysis&lt;/strong&gt;: Analyze how Kubernetes dynamically adjusts resources based on demand, ensuring resource efficiency and application responsiveness.&lt;/li&gt;
&lt;/ol&gt;
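&lt;p&gt;If Prometheus is scraping the cluster, per-pod resource usage can also be pulled programmatically through its HTTP API. The snippet below only constructs the query URL; the Prometheus address and the metric/label names are assumptions to adapt to your installation:&lt;/p&gt;

```python
from urllib.parse import urlencode

# Assumed in-cluster Prometheus address; adjust to your installation
prom = "http://prometheus.monitoring.svc:9090"
# Per-pod CPU usage rate over 5 minutes for the dolly-v2 pods
query = 'sum(rate(container_cpu_usage_seconds_total{pod=~"dolly-v2-.*"}[5m])) by (pod)'
# Prometheus exposes instant queries at /api/v1/query
url = prom + "/api/v1/query?" + urlencode({"query": query})
print(url)
```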

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;This in-depth technical analysis demonstrates the value of leveraging Kubernetes for LLM deployments. By combining GPU acceleration, specialized libraries, and Kubernetes orchestration capabilities, LLMs can be deployed at scale with significantly improved performance. In particular, GPU-enabled pods achieved over 2x lower latency and nearly double the inference throughput compared to CPU-only variants. Kubernetes autoscaling also allowed pods to be scaled horizontally on demand, so query volumes could increase without compromising responsiveness.&lt;/p&gt;

&lt;p&gt;Overall, the results of this analysis show that Kubernetes is a strong choice for deploying LLMs at scale. The synergy between software and hardware optimization on Kubernetes unlocks the potential of LLMs for real-world NLP use cases.&lt;/p&gt;

&lt;p&gt;If you are looking for help implementing LLMs on Kubernetes, we would love to hear how you are scaling LLMs. Please &lt;a href="https://cloudraft.io/contact-us" rel="noopener noreferrer"&gt;contact us&lt;/a&gt; to discuss your specific problem statement.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>docker</category>
      <category>devops</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
