<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Matthew</title>
    <description>The latest articles on Forem by Matthew (@thegalah).</description>
    <link>https://forem.com/thegalah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1111817%2Fca32f60f-4077-4f29-a332-24e180b89e10.jpeg</url>
      <title>Forem: Matthew</title>
      <link>https://forem.com/thegalah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/thegalah"/>
    <language>en</language>
    <item>
      <title>My Experience Running HeadJobs: Generative AI at Home</title>
      <dc:creator>Matthew</dc:creator>
      <pubDate>Sat, 09 Sep 2023 19:12:12 +0000</pubDate>
      <link>https://forem.com/thegalah/my-experience-running-headjobs-generative-ai-at-home-10d4</link>
      <guid>https://forem.com/thegalah/my-experience-running-headjobs-generative-ai-at-home-10d4</guid>
<description>&lt;p&gt;This post was originally published &lt;a href="https://www.thegalah.com/my-experience-running-a-cheap-resilient-generative-ai-cluster-using-consumer-gpus/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;The Problem: The High Cost of Cloud Computing and Non-Existent Inventory&lt;/h2&gt;

&lt;p&gt;As the administrator of &lt;a href="https://www.headbot.ai/?s=dev2-resilient-gpu-cluster"&gt;Headbot&lt;/a&gt;, a generative AI project that creates personalised AI avatars, I found myself grappling with the exorbitant costs and limitations of cloud-based GPU services. Running GPU nodes on Google, Azure, and AWS could set me back between $1 and $2 per hour. These numbers quickly add up, especially when you're on a tight budget and can't afford data centre-level GPU nodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4lh3RqU9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/odtmghlhdy5ly0wvdd9a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4lh3RqU9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/odtmghlhdy5ly0wvdd9a.png" alt="NVIDIA Tesla A100 used GPU price September 2023 via Amazon" width="800" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cloud providers have also been consistently out of capacity for GPU quota since December 2022. It was clear that a different solution was needed.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iPGTM9vI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fk3r22mzdwfo6sfr67t4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iPGTM9vI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fk3r22mzdwfo6sfr67t4.png" alt="Cloud Providers have been out of capcity for 9 months!" width="800" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;The Concept of "HeadJobs": The Engine Behind Headbot&lt;/h2&gt;

&lt;p&gt;The core function lies in Headbot Jobs, or "HeadJobs" for short. These are specific computational tasks spun up to create personalised AI avatars. Once you upload 10+ portraits of yourself, a HeadJob kicks into gear. It's trained to capture your unique facial and body features, right down to your preferred clothing styles. Each HeadJob spends time learning what “you” look like, absorbing your facial expressions, hair, and even your fashion sense.&lt;/p&gt;
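&lt;p&gt;Headbot's actual code isn't public, so here is a purely hypothetical sketch of the idea above: a small job record that only becomes ready once the 10-portrait threshold mentioned earlier is met. The class and field names are my own illustrative assumptions, not the real API.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical illustration only; Headbot's real code is not public.
MIN_PORTRAITS = 10  # the "10+ portraits" threshold from the post


@dataclass
class HeadJob:
    user_id: str
    portraits: list  # paths to the uploaded portrait images

    def ready(self):
        """A HeadJob only kicks into gear once enough portraits exist."""
        return len(self.portraits) >= MIN_PORTRAITS
```

&lt;p&gt;In the real system a gate like this would sit in front of the training pipeline; here it simply guards job creation.&lt;/p&gt;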

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wgTvWani--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ufv8n5oz6jo2pm3vkdbv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wgTvWani--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ufv8n5oz6jo2pm3vkdbv.png" alt='A SwoleBama "Greek God" Job via Headbot' width="312" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;My Infrastructure: Consumer GPUs on Home Servers&lt;/h2&gt;

&lt;p&gt;My cost-effective yet powerful solution runs on my home server, utilising Kubernetes on RKE2. The setup includes three Dell R730xd nodes to run the web app and API, along with two RTX 3090 GPUs for the heavy lifting—i.e., running the actual generative AI jobs. These 3090s are a godsend; not only are they cost-efficient, but they also offer just the right performance to handle one HeadJob per node.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mWbSNmfO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cnmjstqcudapo11ku82v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mWbSNmfO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cnmjstqcudapo11ku82v.png" alt="Best value GPU in September 2023 for our purposes" width="630" height="284"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;The Spiky Nature of Workloads&lt;/h3&gt;

&lt;p&gt;Due to the high variability of incoming jobs, there are times I need to scramble for additional computational resources. This is where my buddy John comes in with his spare RTX 4090 node, which we utilise during peak demand.&lt;/p&gt;
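&lt;p&gt;The real cluster relies on Kubernetes for scheduling, but the burst idea above can be sketched in a few lines of Python: fill the always-on 3090 nodes first, and only enlist the borrowed 4090 while jobs are still queued. The node names and the queue-depth trigger are illustrative assumptions, not the actual setup.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class GpuNode:
    name: str
    busy: bool = False


@dataclass
class Scheduler:
    resident: list                             # always-on RTX 3090 nodes
    burst: list = field(default_factory=list)  # e.g. the spare RTX 4090

    def assign(self, queue_depth):
        """Fill resident nodes first; enlist burst capacity only
        while jobs are still waiting in the queue."""
        started = []
        for node in self.resident + self.burst:
            if queue_depth == 0:
                break
            if not node.busy:
                node.busy = True
                queue_depth -= 1
                started.append(node.name)
        return started
```

&lt;p&gt;With two resident 3090s and one borrowed 4090, a queue of three jobs fills all three nodes, while a queue of one never touches the burst node.&lt;/p&gt;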

&lt;h3&gt;The Technicalities: Software, Drivers, and More&lt;/h3&gt;

&lt;p&gt;Maintaining this setup isn't a walk in the park, even with all the cost advantages. One of the major challenges lies in keeping the GPU drivers up to date, for which I use Lambda Stack. The customer HeadJobs are executed using PyTorch, and each job is specifically designed to capture not just facial features but body features and clothing styles as well.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dPuuXt1g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/v1/__GHOST_URL__/content/images/2023/09/image.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dPuuXt1g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/v1/__GHOST_URL__/content/images/2023/09/image.png" alt="" width="" height=""&gt;&lt;/a&gt;The Lambda Stack Value Proposition&lt;/p&gt;

&lt;h3&gt;The Fun of It: A Problem for the Sake of Problem-Solving&lt;/h3&gt;

&lt;p&gt;Let's be clear: running Headbot this way isn't solving a world-crisis-level problem. It's essentially a problem concocted for the sheer joy of solving it—turning AI and machine learning into a form of high-tech artistry where predefined shapes and poses are painted with your personalised features. This isn’t going to be the next Silicon Valley Unicorn.&lt;/p&gt;

&lt;h2&gt;The Hidden Complexities: A Snapshot of the Challenges&lt;/h2&gt;

&lt;p&gt;Running this setup isn't plug-and-play. It requires intricate Kubernetes configurations, real-time monitoring for spiky workloads, and constant driver updates via Lambda Stack. Each component needs fine-tuning and ongoing attention, making it a far cry from a "set and forget" operation.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---f5c7d24--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b1vhyr1l48ym8zj36arh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---f5c7d24--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b1vhyr1l48ym8zj36arh.jpg" alt="Muscle Mogul Mark: Zuckerberg's Zany Zumba Zone via Headbot" width="512" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;The Verdict: Do It Yourself or Try Headbot&lt;/h2&gt;

&lt;p&gt;If you're not keen on dealing with the hassles and complexities of setting up your own generative AI project, you might want to check out Headbot. But if you're up for the challenge, running your own cluster can save you substantial amounts of money. In my case, the cost comes down to just 3 cents per hour—peanuts compared to what you'd shell out on cloud platforms.&lt;/p&gt;
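&lt;p&gt;As a back-of-the-envelope check using the figures quoted in this post (with the cloud rate assumed at the mid-point of the $1-2 per hour range mentioned earlier):&lt;/p&gt;

```python
# Rough comparison using the figures quoted in this post.
# CLOUD_RATE is an assumed mid-point of the "$1-2 per hour" range.
CLOUD_RATE = 1.50   # USD per GPU-hour on a cloud provider (assumption)
HOME_RATE = 0.03    # USD per hour, the "3 cents per hour" home figure


def monthly_cost(rate_per_hour, hours=730):
    """Approximate cost of running one GPU flat-out for a month."""
    return rate_per_hour * hours


print(round(monthly_cost(CLOUD_RATE), 2))  # 1095.0
print(round(monthly_cost(HOME_RATE), 2))   # 21.9
```

&lt;p&gt;That works out to roughly $1,095 versus $22 per GPU per month—over a thousand dollars saved per GPU, under these assumed rates.&lt;/p&gt;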

&lt;p&gt;So, are you up for the challenge, or would you rather take the easy route and create your personalised AI avatar on &lt;a href="https://www.headbot.ai/?s=thegalah-resilient-gpu-cluster-blog"&gt;Headbot&lt;/a&gt;? The choice is yours.&lt;/p&gt;

&lt;p&gt;Feel free to create your own AI avatars with Headbot, or if you're interested in the nitty-gritty, stay tuned for more detailed breakdowns of my setup.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cuda</category>
      <category>nvidia</category>
    </item>
    <item>
      <title>2023 On Prem Kubernetes Container Attached Storage Options</title>
      <dc:creator>Matthew</dc:creator>
      <pubDate>Sun, 02 Jul 2023 06:14:59 +0000</pubDate>
      <link>https://forem.com/thegalah/2023-on-prem-kubernetes-container-attached-storage-options-1o9l</link>
      <guid>https://forem.com/thegalah/2023-on-prem-kubernetes-container-attached-storage-options-1o9l</guid>
<description>&lt;h2&gt;Problem&lt;/h2&gt;

&lt;p&gt;As some of you may know, I’ve been running my on-premises home server for the last 6 years. Originally it ran on a single-node MicroK8s Kubernetes cluster. However, due to increased workloads I had to expand to a multi-node Kubernetes cluster. It became clear to me that my existing storage solution, microk8s-hostpath, would not be sufficient, since true production storage had the following requirements.&lt;/p&gt;

&lt;h2&gt;Storage requirements&lt;/h2&gt;

&lt;p&gt;In order of importance:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resilient to node failure&lt;/strong&gt;: Given the need to perform regular security and operating-system-level node upgrades, as well as the possibility of hardware failures, the data had to be resilient to the temporary or complete catastrophic failure of a machine.&lt;br&gt;
&lt;strong&gt;Diverse workload support&lt;/strong&gt;: I run a plethora of different applications, each with its own unique storage needs. The storage solution had to be versatile, providing support for block, file, and object storage.&lt;br&gt;
&lt;strong&gt;Ease of management&lt;/strong&gt;: To maintain a lean operation and minimise the risk of human error, I needed a solution that was easy to install, configure, and manage. Moreover, an ideal storage solution would offer automatic healing of damaged data nodes and provide a straightforward path to disaster recovery.&lt;/p&gt;

&lt;h2&gt;Evaluation&lt;/h2&gt;

&lt;h3&gt;OpenEBS Mayastor - &lt;a href="https://mayastor.gitbook.io/introduction/"&gt;https://mayastor.gitbook.io/introduction/&lt;/a&gt;&lt;/h3&gt;

&lt;h4&gt;Pros&lt;/h4&gt;

&lt;p&gt;Mayastor caught my eye with its high performance and replication features for data protection, which are critical in a production environment.&lt;/p&gt;

&lt;h4&gt;Cons&lt;/h4&gt;

&lt;p&gt;However, its limited support for diverse applications and excessive manual configuration needs were major drawbacks. Additionally, the solution's immaturity and specific huge_pages requirement, which caused compatibility issues with PostgreSQL, made it less appealing.&lt;/p&gt;

&lt;h3&gt;Longhorn - &lt;a href="https://longhorn.io/"&gt;https://longhorn.io/&lt;/a&gt;&lt;/h3&gt;

&lt;h4&gt;Pros&lt;/h4&gt;

&lt;p&gt;Longhorn stood out due to its ease of management and handy features such as volume snapshots, backup, and restore, which are key to maintaining data integrity.&lt;/p&gt;

&lt;h4&gt;Cons&lt;/h4&gt;

&lt;p&gt;Nevertheless, Longhorn's limitation of only supporting read-write operations on a single node at a time posed a significant challenge to my requirement of resilience to node failure.&lt;/p&gt;

&lt;h3&gt;Rook-Ceph - &lt;a href="https://rook.io/"&gt;https://rook.io/&lt;/a&gt;&lt;/h3&gt;

&lt;h4&gt;Pros&lt;/h4&gt;

&lt;p&gt;Rook-Ceph impressed me with its comprehensive support for diverse applications, scalability, and resilience. Its maturity and the backing of a robust community and wide adoption provided further confidence.&lt;/p&gt;

&lt;h4&gt;Cons&lt;/h4&gt;

&lt;p&gt;One hiccup I encountered with Rook-Ceph was its requirement for full drives for provisioning, which presented challenges with my partitioned drives.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;After careful evaluation, Rook-Ceph emerged as the optimal choice for my multi-node on-premises MicroK8s cluster.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>selfhost</category>
      <category>homelab</category>
      <category>homeserver</category>
    </item>
  </channel>
</rss>
