<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Carl Peterson</title>
    <description>The latest articles on Forem by Carl Peterson (@carl_petey).</description>
    <link>https://forem.com/carl_petey</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2672188%2F9181ca2b-2a44-4e22-aafb-247adc7b44c7.jpg</url>
      <title>Forem: Carl Peterson</title>
      <link>https://forem.com/carl_petey</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/carl_petey"/>
    <language>en</language>
    <item>
      <title>Why Every GPU will be Virtually Attached over a Network</title>
      <dc:creator>Carl Peterson</dc:creator>
      <pubDate>Wed, 08 Jan 2025 23:46:00 +0000</pubDate>
      <link>https://forem.com/carl_petey/why-every-gpu-will-be-virtually-attached-over-a-network-2dhp</link>
      <guid>https://forem.com/carl_petey/why-every-gpu-will-be-virtually-attached-over-a-network-2dhp</guid>
      <description>&lt;h2&gt;Introducing GPU virtualization&lt;/h2&gt;

&lt;p&gt;Virtualization is a concept in computer science for creating virtual representations of physical hardware. While virtualization is commonly associated with Virtual Machines (VMs), it extends to other domains, including GPUs. GPU virtualization is essential for efficient resource sharing in high-performance computing, AI, and machine learning. However, it’s often misunderstood, especially when applied to GPUs, where the term can have multiple meanings.&lt;/p&gt;

&lt;h2&gt;Existing types of GPU Virtualization&lt;/h2&gt;

&lt;p&gt;GPU virtualization currently exists in three main forms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Single-node GPU sharing&lt;/li&gt;
&lt;li&gt;Dedicated GPU passthrough&lt;/li&gt;
&lt;li&gt;Network-based GPU pooling (Thunder Compute’s approach)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first two operate within a single physical server and are widely used today. Thunder Compute is pioneering the third approach, which operates across multiple servers, or ‘nodes’.&lt;/p&gt;

&lt;h2&gt;Single-node GPU sharing (e.g., NVIDIA vGPU)&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4swdt5wyhgguxx7p56q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4swdt5wyhgguxx7p56q.png" alt=" " width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This approach divides a physical GPU into multiple virtual GPUs. Several virtual machines (VMs) can then use portions of the same GPU simultaneously, improving resource utilization in scenarios where no single VM needs the full power of a GPU.&lt;/p&gt;
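&lt;p&gt;As a rough illustration of the idea (not NVIDIA’s actual vGPU interface), carving one card’s memory into fixed-size profiles might look like the sketch below; the 24 GB figure and the names are assumptions for illustration:&lt;/p&gt;

```python
# Illustrative sketch only -- not NVIDIA's real vGPU API.
# One physical GPU's memory is split into equal virtual GPU profiles
# that several VMs can use at the same time.

PHYSICAL_GPU_MEMORY_GB = 24  # hypothetical 24 GB card

def make_vgpus(profile_gb):
    """Carve the physical GPU into equal-sized virtual GPUs."""
    count = PHYSICAL_GPU_MEMORY_GB // profile_gb
    return [{"id": f"vgpu{i}", "memory_gb": profile_gb} for i in range(count)]

vgpus = make_vgpus(profile_gb=6)
print(len(vgpus))  # four 6 GB virtual GPUs share one 24 GB card
```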

&lt;h2&gt;Dedicated GPU passthrough (e.g., Intel GVT-d)&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5njndlfalpqjem023plf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5njndlfalpqjem023plf.png" alt=" " width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This approach assigns an entire physical GPU to a single VM. While it doesn’t split the GPU, it is still considered virtualization because the VM accesses the GPU directly, providing near-native performance for applications that require the full power of a GPU.&lt;/p&gt;

&lt;p&gt;The third approach, network-based GPU pooling, is a newer concept that requires deeper explanation.&lt;/p&gt;

&lt;h2&gt;A new approach: Network-Based Virtualization&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhg6ycujoglx5appp3fda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhg6ycujoglx5appp3fda.png" alt=" " width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At its core, network-based GPU virtualization works by extending physical PCIe connections with virtual connections over a network.&lt;/p&gt;

&lt;p&gt;In practice, this means that any computer can access any GPU across a network. Traditionally, adding a GPU to a server requires physically connecting it to the motherboard. With network-based virtualization, a virtual GPU can be “plugged in” via software, behaving just like a physically connected GPU.&lt;/p&gt;

&lt;p&gt;This solution acts as a bridge between the application and the GPU. It replaces the standard GPU software interface (like NVIDIA CUDA) with a network-aware version. This allows applications to interact with GPUs on remote servers as if they were locally attached.&lt;/p&gt;

&lt;p&gt;The end result is that a computer without a physical GPU can behave exactly as if it has a GPU, without any hardware changes. This creates a flexible, distributed GPU resource pool that can be dynamically allocated and shared across the network.&lt;/p&gt;
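&lt;p&gt;A minimal sketch of this bridge (hypothetical, not Thunder Compute’s actual protocol): a client-side shim packages an API call, sends it over TCP, and a stub on the GPU server answers as if a local driver had handled it. The names &lt;code&gt;remote_gpu_call&lt;/code&gt; and &lt;code&gt;allocate&lt;/code&gt; are illustrative only.&lt;/p&gt;

```python
import json
import socket
import threading

def gpu_server(server_sock):
    """Stands in for a machine with a physical GPU attached."""
    conn, _ = server_sock.accept()
    request = json.loads(conn.recv(4096).decode())
    # A real system would dispatch this to the native driver (e.g. CUDA).
    response = {"call": request["call"], "status": "ok", "device": "GPU-0"}
    conn.sendall(json.dumps(response).encode())
    conn.close()

def remote_gpu_call(address, call, **args):
    """Client-side shim: looks like a local API call, runs remotely."""
    with socket.create_connection(address) as conn:
        conn.sendall(json.dumps({"call": call, "args": args}).encode())
        return json.loads(conn.recv(4096).decode())

# Demo over localhost: the "GPU" lives behind a TCP socket.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
threading.Thread(target=gpu_server, args=(listener,), daemon=True).start()

reply = remote_gpu_call(listener.getsockname(), "allocate", nbytes=1024)
print(reply["status"], reply["device"])  # ok GPU-0
```

&lt;p&gt;The point of the sketch is that the caller never touches PCIe; everything it knows about the “GPU” arrives over the network.&lt;/p&gt;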

&lt;h2&gt;Why Network-Distributed GPU Virtualization is a Game-Changer&lt;/h2&gt;

&lt;p&gt;Traditional GPU virtualization is limited by physical hardware constraints, typically supporting a maximum of 8 GPUs per server. Expanding GPU capacity requires vertical scaling, which involves upgrading individual servers. However, this method often leads to inefficient resource utilization as VMs tend to reserve entire GPUs.&lt;/p&gt;

&lt;p&gt;A network-distributed approach overcomes these limitations by enabling GPUs to be accessed across multiple servers (also called ‘nodes’) in a data center. This creates a data center-wide pool of GPU resources, rather than limiting each server to its own physically attached GPUs.&lt;/p&gt;

&lt;p&gt;This ability to expand GPU resources by adding more servers (known as horizontal scalability) allows for flexible, on-demand allocation of GPU power. It dramatically increases efficiency by ensuring GPUs are used to their full capacity across the entire data center.&lt;/p&gt;
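&lt;p&gt;The pooling idea can be sketched as a toy scheduler (illustrative only; the names and sizes are assumptions, not Thunder Compute’s implementation): GPUs from every node enter one shared pool, and instances attach and release them on demand.&lt;/p&gt;

```python
# Toy sketch of data-center-wide GPU pooling (illustrative only).
# Instead of each server being capped at its own 8 physical GPUs,
# a scheduler hands out GPUs from a shared, cross-node pool.

class GpuPool:
    def __init__(self, gpus_per_node, nodes):
        # Flat pool across all nodes: "node1:gpu3" names a remote GPU.
        self.free = [f"{node}:gpu{i}" for node in nodes
                     for i in range(gpus_per_node)]
        self.allocated = {}

    def attach(self, instance):
        """Network-attach the next free GPU, whichever node it lives on."""
        gpu = self.free.pop(0)
        self.allocated[instance] = gpu
        return gpu

    def release(self, instance):
        """Return the GPU to the shared pool instead of leaving it idle."""
        self.free.append(self.allocated.pop(instance))

pool = GpuPool(gpus_per_node=8, nodes=["node0", "node1", "node2"])
print(len(pool.free))       # 24 GPUs pooled across three servers
gpu = pool.attach("vm-42")  # the instance needs no local GPU at all
pool.release("vm-42")       # capacity is reclaimed the moment work ends
```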

&lt;h2&gt;Comparing Network-Based Virtualization to Similar Technologies&lt;/h2&gt;

&lt;p&gt;To conceptualize network-based GPU virtualization, it helps to look at some existing solutions for attaching GPUs and other hardware across networks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NVIDIA InfiniBand: This is a high-speed networking technology that allows for faster communication between servers in a data center. While it improves the connection speed for GPU systems spread across multiple servers, it doesn’t address the core issue of efficiently allocating GPU resources among different applications or users.&lt;/li&gt;
&lt;li&gt;Storage Area Networks (SANs): SANs pool storage devices across a network, allowing VMs to access only the storage they need without reserving excess capacity. Thunder Compute’s GPU virtualization operates on a similar principle, enabling precise GPU resource allocation with minimal idle time.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The Future of GPU Virtualization&lt;/h2&gt;

&lt;p&gt;As with other virtualization technologies, network-based GPU virtualization faces performance challenges but continues to improve. Early tests from Thunder Compute, the startup building this technology, showed AI inference tasks running 100 times slower than on attached hardware. Within a month, performance improved to ~2 times slower for most AI workloads.&lt;/p&gt;

&lt;p&gt;This rapid progress points to a future where network-virtualized GPUs will match the performance of physically attached GPUs. As the technology matures, applications will extend beyond data centers to slower networks, including connections between data centers and even home networks. We envision a future where developers can access vast GPU resources from their laptops over standard WiFi connections.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>ai</category>
      <category>devops</category>
      <category>gpu</category>
    </item>
    <item>
      <title>Virtualization in Cloud Computing: the Past, the Present, and the Future</title>
      <dc:creator>Carl Peterson</dc:creator>
      <pubDate>Tue, 07 Jan 2025 21:50:40 +0000</pubDate>
      <link>https://forem.com/carl_petey/virtualization-in-cloud-computing-the-past-the-present-and-the-future-9n1</link>
      <guid>https://forem.com/carl_petey/virtualization-in-cloud-computing-the-past-the-present-and-the-future-9n1</guid>
      <description>&lt;p&gt;The concept of virtualization originated with CPUs but has also been applied to storage, memory, and, most recently, GPUs. Generally, virtualization refers to software that creates an abstraction of computer hardware, enabling programming tasks to be executed without directly relying on a specific computer. The goal is for virtualized hardware to behave exactly like physical hardware, with the added benefit of increased flexibility; however, in practice, the drawback of virtualization is limited performance. In particular, the performance of virtualization systems is especially limited early in the lifecycle of a given hardware category. For instance, the earliest CPU virtualization systems were several orders of magnitude slower than their underlying hardware, but today their performance is nearly identical. As virtualization systems continue to improve, they increasingly replace direct reliance on physical hardware.&lt;/p&gt;

&lt;h2&gt;Why does virtualization matter?&lt;/h2&gt;

&lt;p&gt;Despite performance limitations, virtualized hardware is significantly more flexible than the underlying hardware on which it runs. In a virtualized system, computer programs are not constrained by physical hardware limitations, allowing them to run on any available capacity within a data center. This flexibility improves utilization of limited hardware, often by 5–10x. For a cloud platform, this means that with a software change, you can instantly serve 5–10x more customers without buying more costly hardware. In a CapEx-heavy data center, that translates to tens of millions of dollars in added profit. Scaled across every cloud platform, which includes some of the biggest businesses in the world, the potential impact is enormous.&lt;/p&gt;

&lt;h2&gt;What are the origins of virtualization?&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Virtualization began in the 1960s.&lt;/strong&gt;&lt;br&gt;
IBM created the first virtualization technology, CP-40, which reached production use in 1967. CP-40 allowed multiple users to share a single mainframe computer. At the time, a whole mainframe computer was prohibitively expensive for most users. Virtualization allowed up to 14 customers to share each computer, dramatically improving accessibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It took almost 30 years for virtualization to become mainstream.&lt;/strong&gt;&lt;br&gt;
Over the next 30 years, the declining cost of consumer x86 hardware reduced the need for virtualization. In the 1990s, however, VMware revived the concept. VMware noticed that data center hardware was utilized less than 15% of the time, and by developing virtualization technology, it could improve utilization to 80% or more, making the same hardware available to more users. Additionally, to reduce costs, data centers at the time typically used Intel’s low-end chips; VMware’s virtualization technology allowed them to share premium hardware across multiple users, making the experience for developers faster and more cost-effective. Although this advancement was exciting for data centers and users alike, virtualization initially meant increased program execution time. It took several years for the performance impact of CPU virtualization to become negligible, and today nearly every cloud platform uses virtualized CPUs (vCPUs) rather than physical CPUs.&lt;/p&gt;

&lt;h2&gt;The trend towards a fully virtual computer&lt;/h2&gt;

&lt;p&gt;After recognizing the benefits of CPU virtualization, companies began to explore virtualization beyond the CPU, which led to the concept of a fully virtualized data center. Storage was a critical next step in this evolution. Amazon significantly advanced virtual storage with its Elastic Block Store (EBS) offering on AWS. EBS provided users with scalable, on-demand block storage drawn from a shared pool of physical storage resources. Notably, this technology offers performance that closely rivals directly attached storage, making it a key component of virtualized environments.&lt;/p&gt;

&lt;h2&gt;RAM virtualization came next&lt;/h2&gt;

&lt;p&gt;The most challenging parts of a computer to virtualize have been those with the lowest latency requirements: RAM and GPUs. Several companies have experimented with RAM virtualization, achieving results in small-scale implementations like VMware’s vNUMA. However, these implementations often come with performance drawbacks, such as increased latency and reduced memory bandwidth, making them less suitable for high-performance applications.&lt;/p&gt;

&lt;h2&gt;GPU virtualization continues to evolve&lt;/h2&gt;

&lt;p&gt;Early research into GPU virtualization began with rCUDA in 2013. More recently, Thunder Compute has developed one of the first practical, publicly available GPU virtualization technologies. GPU virtualization faces many of the same performance challenges as early CPU virtualization. For example, Thunder Compute’s initial performance was nearly 100 times slower than that of a physical GPU. Like other virtualization technologies, however, performance has steadily improved: as of today, Thunder Compute runs approximately 2 times slower than a physical GPU, and the gap narrows by the day.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>serverless</category>
      <category>ai</category>
      <category>devops</category>
    </item>
    <item>
      <title>How Thunder Compute attaches GPUs over TCP</title>
      <dc:creator>Carl Peterson</dc:creator>
      <pubDate>Tue, 07 Jan 2025 21:34:44 +0000</pubDate>
      <link>https://forem.com/carl_petey/how-thunder-compute-works-gpu-over-tcp-24ip</link>
      <guid>https://forem.com/carl_petey/how-thunder-compute-works-gpu-over-tcp-24ip</guid>
      <description>&lt;p&gt;Thunder Compute uses network-attached GPUs instead of physically-attached GPUs. Behind the scenes, Thunder Compute tricks CPU-only instances into thinking that they have GPUs attached. These GPUs are network-attached over TCP. From your perspective, the resulting instances behave like they have GPUs without requiring that a GPU is physically connected.&lt;/p&gt;

&lt;p&gt;As a result, all instances on Thunder Compute are on-demand CPU-only instances, exactly like those you would find on AWS, GCP, or Azure. These instances do not have GPUs, yet they retain all of the functionality of the EC2 instances you would find on Amazon or Google Cloud. In fact, many of them are hosted on Amazon or Google Cloud.&lt;/p&gt;

&lt;p&gt;Here is a rough diagram of how we manage these connections between CPU-only instances and GPUs behind the scenes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa4ngdiedmn8ccaf2go7r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa4ngdiedmn8ccaf2go7r.png" alt=" " width="800" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A simple example demonstrates the distinction between our virtual GPU-over-TCP technology and a physical PCIe connection:&lt;/p&gt;

&lt;p&gt;Running &lt;code&gt;$ nvidia-smi&lt;/code&gt; on Thunder Compute behaves exactly as expected with a physical GPU, returning the attached GPU.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sfbj1k1myj4mc2ix7ps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sfbj1k1myj4mc2ix7ps.png" alt=" " width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Meanwhile, running &lt;code&gt;lspci&lt;/code&gt; shows no connected GPUs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1tqzb0kn6afhii4r9dh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1tqzb0kn6afhii4r9dh.png" alt=" " width="800" height="56"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To hammer home the point that there is no GPU, here is the full list of PCIe-connected devices on this Thunder Compute instance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdehdw7613kmsteu6snc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdehdw7613kmsteu6snc7.png" alt=" " width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hope we have convinced you that there is no GPU physically connected to the machine. Pretty cool, right? You can &lt;code&gt;pip install tnr&lt;/code&gt; and run &lt;code&gt;tnr start&lt;/code&gt; to try this same demo yourself.&lt;/p&gt;

&lt;p&gt;Now that you understand the distinction between a Thunder Compute instance and a GPU instance on EC2, it is worth explaining the limitations of this virtualized approach.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; TCP is slower than PCIe. While this may seem problematic, Thunder Compute is optimized to minimize the resulting performance impact. The real-world slowdown is often not noticeable and minimally affects common data science tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited Compatibility:&lt;/strong&gt; Eventually, our GPU-over-TCP technology will have the full functionality of physically attached cards, but today Thunder Compute lacks official support for some GPU libraries. If Thunder Compute does not work for your particular use case, please reach out; we can usually add support for a new library within a few days.&lt;/li&gt;
&lt;/ol&gt;
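&lt;p&gt;For intuition about the performance point above, here is a back-of-envelope comparison of raw link bandwidth, using approximate, commonly cited figures and ignoring protocol overhead:&lt;/p&gt;

```python
# Rough, commonly cited bandwidth figures -- intuition only.
pcie4_x16_gb_s = 32.0          # PCIe 4.0 x16: roughly 32 GB/s
network_100gbe_gb_s = 100 / 8  # 100 Gbps link: 12.5 GB/s

print(round(pcie4_x16_gb_s / network_100gbe_gb_s, 2))  # raw gap of about 2.56x
```

&lt;p&gt;The raw gap is small enough that, with batching and with transfers overlapped against compute, the network hop need not dominate end-to-end runtime.&lt;/p&gt;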

&lt;p&gt;The impact of these drawbacks will vary depending on your specific workload, and we continue to improve both over time. So far, our testing has shown data science workflows to be the most performant and stable. Thunder Compute is open to the public, so the easiest way to test compatibility with your workflow is to try it yourself at &lt;a href="https://thundercompute.com" rel="noopener noreferrer"&gt;thundercompute.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
