<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Matthew</title>
    <description>The latest articles on Forem by Matthew (@thegalah).</description>
    <link>https://forem.com/thegalah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1111817%2Fca32f60f-4077-4f29-a332-24e180b89e10.jpeg</url>
      <title>Forem: Matthew</title>
      <link>https://forem.com/thegalah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/thegalah"/>
    <language>en</language>
    <item>
      <title>My Experience Running HeadJobs: Generative AI at Home</title>
      <dc:creator>Matthew</dc:creator>
      <pubDate>Sat, 09 Sep 2023 19:12:12 +0000</pubDate>
      <link>https://forem.com/thegalah/my-experience-running-headjobs-generative-ai-at-home-10d4</link>
      <guid>https://forem.com/thegalah/my-experience-running-headjobs-generative-ai-at-home-10d4</guid>
<description>&lt;p&gt;This post was originally published &lt;a href="https://www.thegalah.com/my-experience-running-a-cheap-resilient-generative-ai-cluster-using-consumer-gpus/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;The Problem: The High Cost of Cloud Computing and Non-Existent Inventory&lt;/h2&gt;

&lt;p&gt;As the administrator of &lt;a href="https://www.headbot.ai/?s=dev2-resilient-gpu-cluster"&gt;Headbot&lt;/a&gt;, a generative AI project that creates personalised AI avatars, I found myself grappling with the exorbitant costs and limitations of cloud-based GPU services. Running GPU nodes on Google, Azure, and AWS could set me back between $1 and $2 per hour. These numbers quickly add up, especially when you're on a tight budget and can't afford data centre-level GPU nodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4lh3RqU9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/odtmghlhdy5ly0wvdd9a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4lh3RqU9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/odtmghlhdy5ly0wvdd9a.png" alt="NVIDIA Tesla A100 used GPU price September 2023 via Amazon" width="800" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cloud providers have also been consistently out of capacity for GPU quota since December 2022. It was clear that a different solution was needed.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iPGTM9vI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fk3r22mzdwfo6sfr67t4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iPGTM9vI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fk3r22mzdwfo6sfr67t4.png" alt="Cloud Providers have been out of capcity for 9 months!" width="800" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;The Concept of "HeadJobs": The Engine Behind Headbot&lt;/h2&gt;

&lt;p&gt;The core function lies in Headbot Jobs, or "HeadJobs" for short. These are specific computational tasks spun up to create personalised AI avatars. Once you upload 10+ portraits of yourself, a HeadJob kicks into gear. It's trained to capture your unique facial and body features, right down to your preferred clothing styles. Each HeadJob spends time learning what “you” look like, absorbing your facial expressions, hair, and even your fashion sense.&lt;/p&gt;
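&lt;p&gt;Headbot's actual code isn't public, so here is a purely hypothetical sketch of the idea above: a small job record that only becomes ready once the 10-portrait threshold mentioned earlier is met. The class and field names are my own illustrative assumptions, not the real API.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical illustration only; Headbot's real code is not public.
MIN_PORTRAITS = 10  # the "10+ portraits" threshold from the post


@dataclass
class HeadJob:
    user_id: str
    portraits: list  # paths to the uploaded portrait images

    def ready(self):
        """A HeadJob only kicks into gear once enough portraits exist."""
        return len(self.portraits) >= MIN_PORTRAITS
```

&lt;p&gt;In the real system a gate like this would sit in front of the training pipeline; here it simply guards job creation.&lt;/p&gt;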

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wgTvWani--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ufv8n5oz6jo2pm3vkdbv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wgTvWani--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ufv8n5oz6jo2pm3vkdbv.png" alt='A SwoleBama "Greek God" Job via Headbot' width="312" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;My Infrastructure: Consumer GPUs on Home Servers&lt;/h2&gt;

&lt;p&gt;My cost-effective yet powerful solution runs on my home server, utilising Kubernetes on RKE2. The setup includes three Dell R730xd nodes to run the web app and API, along with two RTX 3090 GPUs for the heavy lifting—i.e., running the actual generative AI jobs. These 3090s are a godsend; not only are they cost-efficient, but they also offer just the right performance to handle one HeadJob per node.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mWbSNmfO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cnmjstqcudapo11ku82v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mWbSNmfO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cnmjstqcudapo11ku82v.png" alt="Best value GPU in September 2023 for our purposes" width="630" height="284"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;The Spiky Nature of Workloads&lt;/h3&gt;

&lt;p&gt;Due to the high variability of incoming jobs, there are times I need to scramble for additional computational resources. This is where my buddy John comes in with his spare RTX 4090 node, which we utilise during peak demand.&lt;/p&gt;
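&lt;p&gt;The real cluster relies on Kubernetes for scheduling, but the burst idea above can be sketched in a few lines of Python: fill the always-on 3090 nodes first, and only enlist the borrowed 4090 while jobs are still queued. The node names and the queue-depth trigger are illustrative assumptions, not the actual setup.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class GpuNode:
    name: str
    busy: bool = False


@dataclass
class Scheduler:
    resident: list                             # always-on RTX 3090 nodes
    burst: list = field(default_factory=list)  # e.g. the spare RTX 4090

    def assign(self, queue_depth):
        """Fill resident nodes first; enlist burst capacity only
        while jobs are still waiting in the queue."""
        started = []
        for node in self.resident + self.burst:
            if queue_depth == 0:
                break
            if not node.busy:
                node.busy = True
                queue_depth -= 1
                started.append(node.name)
        return started
```

&lt;p&gt;With two resident 3090s and one borrowed 4090, a queue of three jobs fills all three nodes, while a queue of one never touches the burst node.&lt;/p&gt;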

&lt;h3&gt;The Technicalities: Software, Drivers, and More&lt;/h3&gt;

&lt;p&gt;Maintaining this setup isn't a walk in the park, even with all the cost advantages. One of the major challenges lies in keeping the GPU drivers up to date, for which I use Lambda Stack. The customer HeadJobs are executed using PyTorch, and each job is specifically designed to capture not just facial features but body features and clothing styles as well.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dPuuXt1g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/v1/__GHOST_URL__/content/images/2023/09/image.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dPuuXt1g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/v1/__GHOST_URL__/content/images/2023/09/image.png" alt="" width="" height=""&gt;&lt;/a&gt;The Lambda Stack Value Proposition&lt;/p&gt;

&lt;h3&gt;The Fun of It: A Problem for the Sake of Problem-Solving&lt;/h3&gt;

&lt;p&gt;Let's be clear: running Headbot this way isn't solving a world-crisis-level problem. It's essentially a problem concocted for the sheer joy of solving it—turning AI and machine learning into a form of high-tech artistry where predefined shapes and poses are painted with your personalised features. This isn’t going to be the next Silicon Valley Unicorn.&lt;/p&gt;

&lt;h2&gt;The Hidden Complexities: A Snapshot of the Challenges&lt;/h2&gt;

&lt;p&gt;Running this setup isn't plug-and-play. It requires intricate Kubernetes configurations, real-time monitoring for spiky workloads, and constant driver updates via Lambda Stack. Each component needs fine-tuning and ongoing attention, making it a far cry from a "set and forget" operation.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---f5c7d24--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b1vhyr1l48ym8zj36arh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---f5c7d24--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b1vhyr1l48ym8zj36arh.jpg" alt="Muscle Mogul Mark: Zuckerberg's Zany Zumba Zone via Headbot" width="512" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;The Verdict: Do It Yourself or Try Headbot&lt;/h2&gt;

&lt;p&gt;If you're not keen on dealing with the hassles and complexities of setting up your own generative AI project, you might want to check out Headbot. But if you're up for the challenge, running your own cluster can save you substantial amounts of money. In my case, the cost comes down to just 3 cents per hour—peanuts compared to what you'd shell out on cloud platforms.&lt;/p&gt;
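&lt;p&gt;As a back-of-the-envelope check using the figures quoted in this post (with the cloud rate assumed at the mid-point of the $1-2 per hour range mentioned earlier):&lt;/p&gt;

```python
# Rough comparison using the figures quoted in this post.
# CLOUD_RATE is an assumed mid-point of the "$1-2 per hour" range.
CLOUD_RATE = 1.50   # USD per GPU-hour on a cloud provider (assumption)
HOME_RATE = 0.03    # USD per hour, the "3 cents per hour" home figure


def monthly_cost(rate_per_hour, hours=730):
    """Approximate cost of running one GPU flat-out for a month."""
    return rate_per_hour * hours


print(round(monthly_cost(CLOUD_RATE), 2))  # 1095.0
print(round(monthly_cost(HOME_RATE), 2))   # 21.9
```

&lt;p&gt;That works out to roughly $1,095 versus $22 per GPU per month—over a thousand dollars saved per GPU, under these assumed rates.&lt;/p&gt;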

&lt;p&gt;So, are you up for the challenge, or would you rather take the easy route and create your personalised AI avatar on &lt;a href="https://www.headbot.ai/?s=thegalah-resilient-gpu-cluster-blog"&gt;Headbot&lt;/a&gt;? The choice is yours.&lt;/p&gt;

&lt;p&gt;Feel free to create your own AI avatars with Headbot, or if you're interested in the nitty-gritty, stay tuned for more detailed breakdowns of my setup.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cuda</category>
      <category>nvidia</category>
    </item>
    <item>
      <title>2023 On Prem Kubernetes Container Attached Storage Options</title>
      <dc:creator>Matthew</dc:creator>
      <pubDate>Sun, 02 Jul 2023 06:14:59 +0000</pubDate>
      <link>https://forem.com/thegalah/2023-on-prem-kubernetes-container-attached-storage-options-1o9l</link>
      <guid>https://forem.com/thegalah/2023-on-prem-kubernetes-container-attached-storage-options-1o9l</guid>
<description>&lt;h2&gt;Problem&lt;/h2&gt;

&lt;p&gt;As some of you may know, I’ve been running my on-premises home server for the last 6 years. Originally it ran on a single-node MicroK8s Kubernetes cluster. However, due to increased workloads I had to expand to a multi-node Kubernetes cluster. It became clear to me that my existing storage solution, microk8s-hostpath, would not be sufficient, since true production storage had the following requirements.&lt;/p&gt;

&lt;h2&gt;Storage requirements&lt;/h2&gt;

&lt;p&gt;In order of importance:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resilient to node failure&lt;/strong&gt;: Given the need to perform regular security and operating-system-level node upgrades, as well as the possibility of hardware failures, the data had to be resilient to the temporary or complete catastrophic failure of a machine.&lt;br&gt;
&lt;strong&gt;Diverse workload support&lt;/strong&gt;: I run a plethora of different applications, each with its own unique storage needs. The storage solution had to be versatile, providing support for block, file, and object storage.&lt;br&gt;
&lt;strong&gt;Ease of management&lt;/strong&gt;: To maintain a lean operation and minimise the risk of human error, I needed a solution that was easy to install, configure, and manage. Moreover, an ideal storage solution would offer automatic healing of damaged data nodes and provide a straightforward path to disaster recovery.&lt;/p&gt;

&lt;h2&gt;Evaluation&lt;/h2&gt;

&lt;h3&gt;OpenEBS Mayastor - &lt;a href="https://mayastor.gitbook.io/introduction/"&gt;https://mayastor.gitbook.io/introduction/&lt;/a&gt;&lt;/h3&gt;

&lt;h4&gt;Pros&lt;/h4&gt;

&lt;p&gt;Mayastor caught my eye with its high performance and replication features for data protection, which are critical in a production environment.&lt;/p&gt;

&lt;h4&gt;Cons&lt;/h4&gt;

&lt;p&gt;However, its limited support for diverse applications and excessive manual configuration needs were major drawbacks. Additionally, the solution's immaturity and specific huge_pages requirement, which caused compatibility issues with PostgreSQL, made it less appealing.&lt;/p&gt;

&lt;h3&gt;Longhorn - &lt;a href="https://longhorn.io/"&gt;https://longhorn.io/&lt;/a&gt;&lt;/h3&gt;

&lt;h4&gt;Pros&lt;/h4&gt;

&lt;p&gt;Longhorn stood out due to its ease of management and handy features such as volume snapshots, backup, and restore, which are key to maintaining data integrity.&lt;/p&gt;

&lt;h4&gt;Cons&lt;/h4&gt;

&lt;p&gt;Nevertheless, Longhorn's limitation of only supporting read-write operations on a single node at a time posed a significant challenge to my requirement of resilience to node failure.&lt;/p&gt;

&lt;h3&gt;Rook-Ceph - &lt;a href="https://rook.io/"&gt;https://rook.io/&lt;/a&gt;&lt;/h3&gt;

&lt;h4&gt;Pros&lt;/h4&gt;

&lt;p&gt;Rook-Ceph impressed me with its comprehensive support for diverse applications, scalability, and resilience. Its maturity and the backing of a robust community and wide adoption provided further confidence.&lt;/p&gt;

&lt;h4&gt;Cons&lt;/h4&gt;

&lt;p&gt;One hiccup I encountered with Rook-Ceph was its requirement for full drives for provisioning, which presented challenges with my partitioned drives.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;After careful evaluation, Rook-Ceph emerged as the optimal choice for my multi-node on-premises MicroK8s cluster.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>selfhost</category>
      <category>homelab</category>
      <category>homeserver</category>
    </item>
  </channel>
</rss>
