<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Maciej Strzelczyk</title>
    <description>The latest articles on Forem by Maciej Strzelczyk (@mstrzelczyk).</description>
    <link>https://forem.com/mstrzelczyk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1317874%2Fcd862afe-98bb-4e6e-8419-6e5406c9535b.png</url>
      <title>Forem: Maciej Strzelczyk</title>
      <link>https://forem.com/mstrzelczyk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mstrzelczyk"/>
    <language>en</language>
    <item>
      <title>TPU Mythbusting: vendor lock-in</title>
      <dc:creator>Maciej Strzelczyk</dc:creator>
      <pubDate>Mon, 20 Apr 2026 16:11:30 +0000</pubDate>
      <link>https://forem.com/googleai/tpu-mythbusting-vendor-lock-in-pbo</link>
      <guid>https://forem.com/googleai/tpu-mythbusting-vendor-lock-in-pbo</guid>
      <description>&lt;p&gt;&lt;a href="https://cloud.google.com/tpu" rel="noopener noreferrer"&gt;Tensor Processing Units&lt;/a&gt; are a technology developed and owned by Google. While you can find GPUs in every cloud provider offer, the TPUs are currently only available through &lt;a href="https://docs.cloud.google.com/tpu/docs/system-architecture-tpu-vm#versions" rel="noopener noreferrer"&gt;Google Cloud Platform&lt;/a&gt;. Situation when you invest in a technology or a service that is not available anywhere else is called vendor lock-in — it's something the sales people love, while customers try to avoid it. What does this look like for TPUs? Let's see.&lt;/p&gt;

&lt;h2&gt;
  
  
  Myth 5: TPUs are available only through Google Cloud Platform
&lt;/h2&gt;

&lt;p&gt;As of today (December 12th, 2025) it is still true that TPUs are available only through Google Cloud Platform. If you develop your application to work specifically with TPU technology, leveraging all of its strengths and accounting for all of its limitations, moving to a different provider would be a big challenge. Luckily, as you may remember from the &lt;a href="https://dev.to/googleai/tpu-mythbusting-the-general-perception-5585"&gt;first myth busting post&lt;/a&gt;, GPUs can do everything that TPUs do. They may not be as efficient for a given task, and scaling might be different or limited, but in many cases a move from TPUs to GPUs is possible and much easier than the other way around.&lt;/p&gt;

&lt;p&gt;Technically, when you decide to use TPUs, you are limited to GCP as your provider; that is true. However, leaving TPUs for GPUs is not an impossible task. Unless you rely heavily on the TPUs' exceptional scaling capabilities, a migration to GPUs and a different provider remains an option.&lt;/p&gt;

&lt;h2&gt;
  
  
  Myth 6: TPUs require unique software
&lt;/h2&gt;

&lt;p&gt;The first TPUs were developed together with the &lt;a href="https://www.tensorflow.org/" rel="noopener noreferrer"&gt;TensorFlow library&lt;/a&gt;. Back in 2018, when Google released the first TPUs to its customers, it was indeed the case that an application written for TPUs would not be compatible with other accelerators. Luckily, the software landscape has changed dramatically since then. Many abstraction layers have been added, and TPU support is now present in popular software solutions. One example is the &lt;a href="https://docs.jax.dev/en/latest/" rel="noopener noreferrer"&gt;JAX library&lt;/a&gt;, which supports TPUs, GPUs, and CPUs alike.&lt;/p&gt;
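&lt;p&gt;To illustrate that portability, here is a minimal, hedged sketch of device-agnostic JAX code. The &lt;code&gt;predict&lt;/code&gt; function is an illustrative toy, not taken from any of the libraries above; the point is that the same lines run unchanged on a TPU, a GPU, or a plain CPU, because JAX selects the best available backend automatically.&lt;/p&gt;

```python
# Minimal device-agnostic JAX sketch: the same code runs on TPU, GPU, or CPU.
# JAX picks the best available backend; no accelerator-specific branches needed.
import jax
import jax.numpy as jnp

# Report whatever backend JAX found (e.g. TpuDevice, GpuDevice, CpuDevice).
print("Running on:", jax.devices())

@jax.jit  # compiled via XLA for whichever backend was detected
def predict(weights, inputs):
    return jnp.tanh(inputs @ weights)

weights = jnp.ones((4, 2))
inputs = jnp.arange(8.0).reshape(2, 4)
print(predict(weights, inputs).shape)  # (2, 2)
```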

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnd08bbbi96h5vautpp6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnd08bbbi96h5vautpp6.jpeg" width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The situation is especially easy when it comes to inference. &lt;a href="https://docs.vllm.ai/en/latest/" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt; supports plenty of models on TPUs as well as on GPUs. Similarly, &lt;a href="https://maxtext.readthedocs.io/en/latest/index.html" rel="noopener noreferrer"&gt;MaxText&lt;/a&gt; can handle both accelerator types out of the box. If you're looking for a platform to run your models, it's a great idea to give TPUs a try, as jumping between accelerator platforms has never been easier.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;In the next post, I will dive into more technical aspects of TPUs and their supporting systems. After all, the efficiency of an AI system does not depend only on accelerator speed. Networking and storage are also very important, and while storage is pretty much the same for TPU systems as it is for GPU systems, networking is a lot more complicated. Stay tuned for the next article and keep an eye on the official &lt;a href="https://cloud.google.com/blog" rel="noopener noreferrer"&gt;Google Cloud blog&lt;/a&gt; and &lt;a href="https://www.youtube.com/@googlecloudtech" rel="noopener noreferrer"&gt;GCP YouTube channel&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>tpu</category>
      <category>googlecloud</category>
      <category>ai</category>
      <category>gcp</category>
    </item>
    <item>
      <title>TPU Mythbusting: cost and usage</title>
      <dc:creator>Maciej Strzelczyk</dc:creator>
      <pubDate>Thu, 16 Apr 2026 18:54:26 +0000</pubDate>
      <link>https://forem.com/googleai/tpu-mythbusting-cost-and-usage-50ch</link>
      <guid>https://forem.com/googleai/tpu-mythbusting-cost-and-usage-50ch</guid>
      <description>&lt;p&gt;TPUs are foundational to Google’s AI capabilities and can be equally transformative for your projects. However, keeping track of a niche technology like Tensor Processing Units amidst the rapid evolution of AI can be challenging. In this installment of TPU Mythbusting, I tackle two common misconceptions about their cost and usage. If you are new to TPUs, check out the &lt;a href="https://dev.to/googleai/tpu-mythbusting-the-general-perception-5585"&gt;previous post&lt;/a&gt; for an introduction to these application-specific integrated circuits (&lt;a href="https://en.wikipedia.org/wiki/Application-specific_integrated_circuit" rel="noopener noreferrer"&gt;ASIC&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Myth 3: You need to have lots of money to start using TPUs
&lt;/h2&gt;

&lt;p&gt;If you are curious about TPU performance, how to program applications that use them, or simply testing a concept, you don’t need a deep wallet or a large investment to get started. TPUs are available, in a limited capacity, for free on two popular platforms.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://colab.google/" rel="noopener noreferrer"&gt;Google Colab&lt;/a&gt; — You can configure your runtime to use a single v5e TPU. This environment is ideal for familiarizing yourself with the required libraries, application organization, and running basic benchmarks. While a single accelerator won’t tackle massive problems, it’s the perfect first step before moving to a paid solution.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/discussions/product-announcements/607202" rel="noopener noreferrer"&gt;Kaggle Notebooks&lt;/a&gt; — Kaggle provides access to an instance with 8 v5e chips, which is significantly more powerful than Colab and sufficient for running many mainstream LLMs. The primary restriction is the quota: 20 hours per month with a 9-hour daily limit, which cannot be increased.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With those free options, you can experiment with TPUs before making any investment in Google Cloud Platform!&lt;/p&gt;

&lt;p&gt;As a &lt;a href="https://edu.google.com/programs/credits/teaching/?modal_active=none" rel="noopener noreferrer"&gt;student&lt;/a&gt; and/or &lt;a href="https://edu.google.com/programs/credits/research/?modal_active=none" rel="noopener noreferrer"&gt;researcher&lt;/a&gt;, you may also apply for &lt;a href="https://cloud.google.com/edu/higher-education?utm_campaign=CDR_0x73f0e2c4_default_b464264269&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Google Cloud for Education&lt;/a&gt; GCP credits. This way, you can access the power of TPUs through Google Cloud Platform — without tight limitations enforced by Colab or Kaggle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Myth 4: You can use TPUs only through Compute Engine and GKE
&lt;/h2&gt;

&lt;p&gt;Using TPUs is getting friendlier over time. It’s no longer true that you can only use them through a manually managed Compute Engine instance or through Kubernetes Engine. Today, the main managed solution for using TPUs is Vertex AI, with three of its features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/docs/training/overview?utm_campaign=CDR_0x73f0e2c4_default_b464264269&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Vertex AI Training&lt;/a&gt;:&lt;/strong&gt; You can submit “Custom Training Jobs” that run on TPU workers. You simply select the TPU type (e.g., v5e, v4) in your job configuration. The service provisions the TPUs, runs your code, and shuts them down automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/docs/training/training-with-tpu-vm?utm_campaign=CDR_0x73f0e2c4_default_b464264269&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Vertex AI Pipelines&lt;/a&gt;:&lt;/strong&gt; You can define pipeline steps (components) that specifically request TPU accelerators. This is ideal for MLOps workflows where training is just one step in a larger process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/model-garden/deploy-and-inference-tutorial-tpu?utm_campaign=CDR_0x73f0e2c4_default_b464264269&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Vertex AI Prediction (Online Inference)&lt;/a&gt;:&lt;/strong&gt; You can deploy trained models to &lt;strong&gt;endpoints&lt;/strong&gt; backed by TPU nodes. This is one of the few ways to get “serverless-like” real-time inference on TPUs without managing a permanent VM, although you are billed for the node while the endpoint is active.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These managed solutions minimize expenditure by charging only for the resources consumed, unlike GCE or GKE where infrastructure can sit idle and generate unnecessary cost. Furthermore, Vertex AI simplifies operations management, substantially reducing the human-hours (and therefore cost) required to run and maintain your ML tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Coming next
&lt;/h2&gt;

&lt;p&gt;I’m not done with the myths surrounding TPUs. I still want to discuss vendor lock-in and the claim that developing for TPUs makes your application incompatible with other platforms. The times of incompatibility are gone, as software solutions abstract away the differences between the two platforms.&lt;/p&gt;

&lt;p&gt;To stay up to date with everything happening in the Google Cloud ecosystem, keep an eye on the official &lt;a href="https://cloud.google.com/blog?utm_campaign=CDR_0x73f0e2c4_default_b464264269&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Google Cloud&lt;/a&gt; blog and &lt;a href="https://www.youtube.com/@googlecloudtech" rel="noopener noreferrer"&gt;GCP YouTube channel&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>googlecloud</category>
      <category>kaggle</category>
      <category>tpu</category>
    </item>
    <item>
      <title>TPU Mythbusting: the general perception</title>
      <dc:creator>Maciej Strzelczyk</dc:creator>
      <pubDate>Thu, 16 Apr 2026 18:50:29 +0000</pubDate>
      <link>https://forem.com/googleai/tpu-mythbusting-the-general-perception-5585</link>
      <guid>https://forem.com/googleai/tpu-mythbusting-the-general-perception-5585</guid>
      <description>&lt;p&gt;The IT world has been deeply immersed in the AI revolution over the past two years. Terms like &lt;a href="https://cloud.google.com/generative-ai-studio?utm_campaign=CDR_0x73f0e2c4_default_b464231968&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;GenAI&lt;/a&gt;, accelerators, diffusion, and inference are now common, and the understanding that GPUs are valuable beyond video games is well-established. However, certain specialized topics within AI and ML, such as the TPU, remain less understood. What, after all, does thermoplastic polyurethane have to do with Artificial Intelligence? (Just kidding 😉) In the realm of AI and computing, TPU stands for &lt;a href="https://docs.cloud.google.com/tpu/docs/intro-to-tpu?utm_campaign=CDR_0x73f0e2c4_default_b464231968&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Tensor Processing Unit&lt;/a&gt;. This series of articles aims to address and clarify popular myths and misconceptions surrounding this highly specialized technology.&lt;/p&gt;

&lt;h2&gt;
  
  
  Myth 1: A TPU is just Google’s brand name for a GPU
&lt;/h2&gt;

&lt;p&gt;It is easy to understand where this misconception comes from. TPUs and GPUs are often referred to as the engines of Artificial Intelligence. So, if it walks like a duck and it quacks like a duck… it’s a duck, right? Not in this case. TPUs and GPUs serve a similar purpose in AI workloads; however, they are far from the same. GPUs are far more versatile in terms of what they can compute. After all, they are also used for processing graphics, rendering 3D models, and so on. Have you ever heard someone mention a TPU in this context? A simple Venn diagram showing the range of tasks each chip can handle helps here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftf3cezpn3gw8sl2rwxxt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftf3cezpn3gw8sl2rwxxt.png" width="502" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;small&gt;Different chip architectures and their range of use cases.&lt;/small&gt;&lt;/center&gt;

&lt;p&gt;It all comes down to the purpose of the different architectures in those chips.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Central Processing Unit (CPU)&lt;/strong&gt;: This is a &lt;em&gt;general-purpose processor&lt;/em&gt;, designed with a few powerful cores to handle a diverse range of tasks &lt;strong&gt;sequentially&lt;/strong&gt; and quickly, from running an operating system to a word processor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graphics Processing Unit (GPU)&lt;/strong&gt;: This is a &lt;em&gt;specialized processor&lt;/em&gt; originally designed for the &lt;strong&gt;highly parallel&lt;/strong&gt; task of rendering graphics. Researchers later discovered that this parallel architecture — thousands of simpler cores — was highly effective for the parallel mathematics of AI. The GPU was adapted or co-opted for AI, evolving into a GPGPU, a general-purpose parallel computer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tensor Processing Unit (TPU)&lt;/strong&gt;: This is an &lt;a href="https://en.wikipedia.org/wiki/Application-specific_integrated_circuit" rel="noopener noreferrer"&gt;ASIC&lt;/a&gt; (Application-Specific Integrated Circuit). It was not adapted from another purpose; it was &lt;em&gt;architected from the ground up&lt;/em&gt; for one specific application: accelerating neural network operations. Its silicon is dedicated only to the massive matrix and tensor operations fundamental to AI. It is, by design, an inflexible chip; it can’t run word processors or render graphics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architectural difference highlights why directly comparing GPU and TPU performance is often problematic. It’s challenging to compare devices not designed for identical tasks — perhaps less like comparing apples to oranges, and more like comparing apples to pears, each optimized for different purposes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Myth 2: TPUs are always cheaper (or always more expensive) than GPUs
&lt;/h2&gt;

&lt;p&gt;The comparison of TPU pricing versus GPU pricing is a popular point of confusion. Determining which offers superior cost-effectiveness — which one “gives you more bang for the buck” — is far from straightforward.&lt;/p&gt;

&lt;p&gt;While numerous claims suggest TPUs are significantly cheaper than various GPUs, these assertions invariably come with caveats: they often apply only to specific models, certain tasks, or particular configurations. The reality is, there’s no simple formula to determine how one TPU compares in cost-effectiveness to another accelerator.&lt;/p&gt;

&lt;p&gt;To find out the real performance of a TPU system, &lt;strong&gt;you will need to run experiments&lt;/strong&gt;. This also applies to GPU systems — the whole system depends on much more than just accelerator performance. That’s why it’s important to compare very specific scenarios, including the storage, networking, and the type of workload you want to run.&lt;/p&gt;
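&lt;p&gt;As a rough illustration of what “run experiments” means in practice, here is a tiny, hedged benchmarking sketch in plain Python. The &lt;code&gt;benchmark&lt;/code&gt; helper and the stand-in workload are illustrative only; a real comparison would run your actual training or inference step on the actual TPU and GPU systems, with their real storage and networking in place.&lt;/p&gt;

```python
# Toy benchmarking harness: not a real TPU-vs-GPU comparison, just the shape
# of one. The measurement pattern is the same regardless of accelerator:
# warm up, repeat, and take a robust statistic.
import time
import statistics

def benchmark(fn, warmup=2, repeats=5):
    for _ in range(warmup):            # warm-up runs absorb one-off costs
        fn()                           # (compilation, cache fills, ...)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)  # median is robust to outlier runs

# Stand-in workload; replace with your model's training step or inference call.
workload = lambda: sum(i * i for i in range(100_000))
print(f"median step time: {benchmark(workload):.4f}s")
```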

&lt;h2&gt;
  
  
  More to come
&lt;/h2&gt;

&lt;p&gt;These were the first two common myths about TPUs. I hope this explanation has provided some clarity, even if the answers aren’t always straightforward. In the next article, I will delve deeper into TPU costs, as the topic extends beyond a simple ‘it depends.’ To stay updated on the latest TPU news and other exciting announcements, be sure to follow the official &lt;a href="https://cloud.google.com/blog?utm_campaign=CDR_0x73f0e2c4_default_b464231968&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Google Cloud blog&lt;/a&gt; and the &lt;a href="https://www.youtube.com/@googlecloudtech" rel="noopener noreferrer"&gt;GCP YouTube channel&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>googlecloud</category>
      <category>kaggle</category>
      <category>tpu</category>
    </item>
    <item>
      <title>Cloud Run Jobs vs. Cloud Batch: Choosing Your Engine for Run-to-Completion Workloads</title>
      <dc:creator>Maciej Strzelczyk</dc:creator>
      <pubDate>Tue, 31 Mar 2026 12:19:29 +0000</pubDate>
      <link>https://forem.com/googleai/cloud-run-jobs-vs-cloud-batch-choosing-your-engine-for-run-to-completion-workloads-56eo</link>
      <guid>https://forem.com/googleai/cloud-run-jobs-vs-cloud-batch-choosing-your-engine-for-run-to-completion-workloads-56eo</guid>
      <description>&lt;p&gt;Google Cloud offers plenty of different products and services, some of which seem to be covering overlapping needs. There are multiple storage solutions (&lt;a href="https://cloud.google.com/storage?utm_campaign=CDR_0x73f0e2c4_default_b496192395&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Storage&lt;/a&gt;, &lt;a href="https://cloud.google.com/filestore?&amp;amp;utm_campaign=CDR_0x73f0e2c4_default_b496192395&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Filestore&lt;/a&gt;), database products (&lt;a href="https://cloud.google.com/sql?utm_campaign=CDR_0x73f0e2c4_default_b496192395&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud SQL&lt;/a&gt;, &lt;a href="https://cloud.google.com/spanner?utm_campaign=CDR_0x73f0e2c4_default_b496192395&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Spanner&lt;/a&gt;, &lt;a href="https://cloud.google.com/bigquery?utm_campaign=CDR_0x73f0e2c4_default_b496192395&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;BigQuery&lt;/a&gt;) or ways to run containerized applications (&lt;a href="https://cloud.google.com/run?utm_campaign=CDR_0x73f0e2c4_default_b496192395&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt; and &lt;a href="https://cloud.google.com/kubernetes-engine?utm_campaign=CDR_0x73f0e2c4_default_b496192395&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;GKE&lt;/a&gt;). The breadth of options to choose from can be overwhelming and lead to situations where it’s not obvious which way to go to achieve your goal.&lt;/p&gt;

&lt;p&gt;A similar situation applies to offline processing (aka batch processing): you have some data and want to run the same operation on each piece of it. For example: transcoding a big video collection, resizing an image gallery, or running inference against a prepared set of prompts. The recommended way to handle such situations is to use proper tools that automatically scale, handle errors, and guarantee that all data has been processed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/batch?utm_campaign=CDR_0x73f0e2c4_default_b496192395&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Batch&lt;/a&gt; and &lt;a href="https://docs.cloud.google.com/run/docs/create-jobs?utm_campaign=CDR_0x73f0e2c4_default_b496192395&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run Jobs&lt;/a&gt; are two of the options to consider when you want to handle an offline processing task. In this article, I’ll explain what those two products have in common and what are their main differences. We will finish with a couple of examples showing when to best use each of these products.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Similarities
&lt;/h2&gt;

&lt;p&gt;Cloud Batch and Cloud Run Jobs are fundamentally aligned in their purpose and share many core features, making them both excellent choices for asynchronous, run-to-completion tasks like data conversion, media processing, and offline processing. &lt;/p&gt;

&lt;p&gt;Both services allow you to run your code in standard &lt;a href="https://opencontainers.org/" rel="noopener noreferrer"&gt;Open Container Initiative (OCI)&lt;/a&gt; images, completely abstracting away the operational headache of managing permanent clusters. They share critical ecosystem features: both can be triggered for periodic execution using &lt;a href="https://docs.cloud.google.com/scheduler/docs/overview?utm_campaign=CDR_0x73f0e2c4_default_b496192395&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Scheduler&lt;/a&gt; and orchestrated into complex, multi-step data pipelines via &lt;a href="https://cloud.google.com/workflows?utm_campaign=CDR_0x73f0e2c4_default_b496192395&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Workflows&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Security is standardized, with both offering native integration with &lt;a href="https://cloud.google.com/security/products/secret-manager?utm_campaign=CDR_0x73f0e2c4_default_b496192395&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Secret Manager&lt;/a&gt; to keep credentials safe, and both fully supporting &lt;a href="https://docs.cloud.google.com/vpc-service-controls/docs/overview?utm_campaign=CDR_0x73f0e2c4_default_b496192395&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;VPC Service Controls (VPC-SC)&lt;/a&gt; to define security perimeters. &lt;/p&gt;

&lt;p&gt;Furthermore, the services are designed for workload portability through a compatible task indexing system; both inject environment variables like &lt;code&gt;CLOUD_RUN_TASK_INDEX&lt;/code&gt; and &lt;code&gt;BATCH_TASK_INDEX&lt;/code&gt; to partition data across parallel tasks. This engineering choice allows container images optimized for Cloud Run to be seamlessly migrated and executed on Cloud Batch. &lt;/p&gt;
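&lt;p&gt;As a sketch of how this portability plays out in practice, a worker entrypoint can read whichever index variable is present and process only its own slice of the input. The helper names and the file list below are illustrative, not part of either service’s API; only the two environment variable names come from the services themselves.&lt;/p&gt;

```python
# Sketch of a portable worker entrypoint: read the task index injected by
# either service (CLOUD_RUN_TASK_INDEX on Cloud Run Jobs, BATCH_TASK_INDEX on
# Cloud Batch) and process only this task's slice of the input.
import os

def task_index(default=0):
    # Whichever variable is set tells us which platform we are running on.
    for var in ("CLOUD_RUN_TASK_INDEX", "BATCH_TASK_INDEX"):
        value = os.environ.get(var)
        if value is not None:
            return int(value)
    return default  # fallback for local testing

def shard_for_task(items, index, task_count):
    # Simple striped partitioning: task i handles items i, i+N, i+2N, ...
    return items[index::task_count]

files = [f"video-{n:04d}.mp4" for n in range(10)]
mine = shard_for_task(files, task_index(), task_count=4)
print(f"task {task_index()} processes: {mine}")
```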

&lt;p&gt;Finally, both offer native support for mounting Google Cloud Storage buckets (using &lt;a href="https://docs.cloud.google.com/storage/docs/cloud-storage-fuse/overview?utm_campaign=CDR_0x73f0e2c4_default_b496192395&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Storage FUSE&lt;/a&gt;) and NFS network shares to efficiently handle large-scale data ingestion and output.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Differences
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Core Architectural Paradigms&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The fundamental choice between Cloud Run Jobs and Google Cloud Batch often comes down to the desired level of abstraction versus the required level of infrastructure control. Cloud Run Jobs represents the serverless ideal, prioritizing developer velocity and rapid scaling by entirely abstracting the underlying hardware platform. In contrast, Google Cloud Batch operates as a highly configurable orchestration layer sitting directly atop &lt;a href="https://cloud.google.com/products/compute?utm_campaign=CDR_0x73f0e2c4_default_b496192395&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Compute Engine&lt;/a&gt;, granting granular control over virtual machine (VM) shapes and deep hardware integrations.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;GPU Ecosystem and Support&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Cloud Run Jobs supports a curated, fully managed GPU experience optimized for inference and video transcoding, though it strictly enforces a limit of one GPU per instance and a 1-hour maximum timeout for GPU-based tasks. Google Cloud Batch unlocks the entire Compute Engine accelerator portfolio, allowing users to attach multiple GPUs (up to 8 per VM) and supporting multi-day training runs with advanced interconnects like &lt;a href="https://en.wikipedia.org/wiki/NVLink" rel="noopener noreferrer"&gt;NVLink&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Task Communication&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The architectural divergence between the two services is further highlighted by their approach to inter-task communication. Cloud Run Jobs operates on a "shared nothing" architecture, where parallel tasks are entirely isolated and possess no native mechanism to communicate with one another directly. This is in stark contrast to Google Cloud Batch, which is specifically engineered to support "tightly coupled" workloads, such as multi-physics simulations or complex weather forecasting. Batch facilitates high-performance communication by supporting &lt;a href="https://en.wikipedia.org/wiki/Message_Passing_Interface" rel="noopener noreferrer"&gt;Message Passing Interface (MPI)&lt;/a&gt; libraries and provisioning compute clusters with &lt;a href="https://docs.cloud.google.com/vpc/docs/rdma-network-profiles?utm_campaign=CDR_0x73f0e2c4_default_b496192395&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud RDMA (Remote Direct Memory Access)&lt;/a&gt; technology. This allows nodes to exchange state data with ultra-low latency and high bandwidth, making Batch the requisite choice for sophisticated &lt;a href="https://cloud.google.com/discover/what-is-high-performance-computing?utm_campaign=CDR_0x73f0e2c4_default_b496192395&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;high-performance computing (HPC)&lt;/a&gt; scenarios.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Financial Models and Billing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Cloud Run Jobs utilizes instance-based billing, measured in 100-millisecond increments with a generous recurring free tier for vCPU and memory. Google Cloud Batch has no base service fee; users are billed strictly for the underlying Compute Engine infrastructure consumed. Batch offers significant financial leverage through Spot VMs, providing big discounts for fault-tolerant workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Constraints, Limits, and Maximum Scalability&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The fundamental difference in architecture directly impacts the scale, concurrency, and duration of workloads each service can handle. Cloud Run Jobs is optimized for relatively bounded workloads, while Google Cloud Batch is engineered for massive, unbounded computational scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Execution and Task Limits&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A single &lt;strong&gt;Cloud Run job&lt;/strong&gt; is limited to a maximum of 10,000 independent tasks per execution. The maximum execution length for a standard CPU-based task is 168 hours (7 days), but any task utilizing a GPU is severely restricted to a 1-hour maximum timeout. Fault tolerance allows up to 10 retries per failed task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google Cloud Batch&lt;/strong&gt; is built for a significantly larger scale. A single job definition can encompass up to 100,000 tasks within a task group and supports executing up to 5,000 of these tasks in parallel. Execution duration is highly permissive; a Batch task can remain in the RUNNING state for up to 14 days by default. This extended timeout applies even to GPU-based tasks, making Batch mandatory for multi-day distributed training runs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Specification&lt;/th&gt;
&lt;th&gt;Cloud Run Jobs&lt;/th&gt;
&lt;th&gt;Google Cloud Batch&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Max Tasks Per Job&lt;/td&gt;
&lt;td&gt;10,000&lt;/td&gt;
&lt;td&gt;100,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max Parallel Tasks&lt;/td&gt;
&lt;td&gt;Regional Quota Dependent&lt;/td&gt;
&lt;td&gt;5,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max CPU Task Timeout&lt;/td&gt;
&lt;td&gt;168 Hours (7 Days)&lt;/td&gt;
&lt;td&gt;14 Days (Default limit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max GPU Task Timeout&lt;/td&gt;
&lt;td&gt;1 Hour&lt;/td&gt;
&lt;td&gt;14 Days (Default limit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max Retries Per Task&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Configurable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max Concurrent VMs&lt;/td&gt;
&lt;td&gt;N/A (Serverless)&lt;/td&gt;
&lt;td&gt;2,000 (single-zone) or 4,000 (multi-zone)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Use Case Examples
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Example 1: Administrative Automation and Nightly ETL&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Recommended Service:&lt;/strong&gt; Cloud Run Jobs&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Scenario:&lt;/em&gt; A SaaS platform must execute a nightly script to migrate localized data into a central BigQuery warehouse, generate daily PDF invoices for thousands of clients, and perform routine database schema migrations.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Justification:&lt;/em&gt; These tasks are typically I/O bound, complete within a few minutes or hours (well under the 168-hour limit), and do not require specialized CPU instruction sets. Cloud Run Jobs excels here because it requires zero infrastructure scaffolding; the team simply containerizes its scripts and schedules them via Cloud Scheduler.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Example 2: Massively Parallel Document and Media Processing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Recommended Service:&lt;/strong&gt; Cloud Run Jobs (with GPU if visual processing is required)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Scenario:&lt;/em&gt; A media or e-commerce company must process thousands of user-uploaded videos or images daily, requiring video transcoding via FFmpeg or lightweight AI inference (e.g., YOLO object detection).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Justification:&lt;/em&gt; This is an embarrassingly parallel problem: each file can be processed independently, with the task index determining which file each task handles. Cloud Run can spin up hundreds of L4-backed containers in seconds and scale to zero immediately upon completion.&lt;/p&gt;
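
&lt;p&gt;To make the fan-out concrete, here is a minimal shell sketch (with made-up file names) of how each task can use the &lt;code&gt;CLOUD_RUN_TASK_INDEX&lt;/code&gt; environment variable, which Cloud Run Jobs sets for every task, to pick its own file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Each task of the job runs this same script with a different index.
# CLOUD_RUN_TASK_INDEX is set by Cloud Run (0-based); default to 0
# so the sketch also runs locally.
TASK_INDEX="${CLOUD_RUN_TASK_INDEX:-0}"

# A made-up manifest; a real job would list the bucket once and share it.
MANIFEST="video-a.mp4
video-b.mp4
video-c.mp4"

# Pick the manifest line matching this task (sed lines are 1-based).
FILE="$(printf '%s\n' "$MANIFEST" | sed -n "$((TASK_INDEX + 1))p")"

echo "Task ${TASK_INDEX} will process ${FILE}"
# A real task would now run FFmpeg or an inference command on that file.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;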

&lt;h3&gt;
  
  
  &lt;strong&gt;Example 3: High-Performance Computing (HPC) and Multi-Physics Simulation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Recommended Service:&lt;/strong&gt; Google Cloud Batch&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Scenario:&lt;/em&gt; A climate research institute runs physics-based simulations for weather forecasting, or a pharmaceutical company performs massive simulations for drug discovery.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Justification:&lt;/em&gt; These are "tightly coupled" workloads where parallel processes must exchange state data. Batch is mandatory as it supports MPI configurations and Cloud RDMA for ultra-low latency inter-node communication.&lt;/p&gt;
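
&lt;p&gt;For illustration, a Batch job definition for such an MPI-style workload might look roughly like the following trimmed-down sketch (the machine type is only an example, and the field names should be verified against the Batch API reference):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "taskGroups": [
    {
      "taskCount": 4,
      "requireHostsFile": true,
      "permissiveSsh": true,
      "taskSpec": {
        "runnables": [
          {
            "script": {
              "text": "mpirun -hostfile $BATCH_HOSTS_FILE -np 4 ./simulation"
            }
          }
        ]
      }
    }
  ],
  "allocationPolicy": {
    "instances": [
      { "policy": { "machineType": "h3-standard-88" } }
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;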

&lt;h3&gt;
  
  
  &lt;strong&gt;Example 4: Distributed Machine Learning Training&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Recommended Service:&lt;/strong&gt; Google Cloud Batch&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Scenario:&lt;/em&gt; An AI laboratory pre-training a 70-billion parameter model or performing extensive fine-tuning across terabytes of data over several days.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Justification:&lt;/em&gt; Cloud Run Jobs is disqualified due to the 1-hour GPU timeout and 1-GPU-per-instance limit. Batch allows provisioning A3 or A4 machine series with up to 8 GPUs per VM interconnected via NVLink for multi-day training runs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fheu9czfozccesvsr4drr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fheu9czfozccesvsr4drr.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Happy Processing!
&lt;/h2&gt;

&lt;p&gt;I hope this article has helped you better understand the difference between Cloud Batch and Cloud Run Jobs - the two products designed for processing tasks to completion. Lightweight Cloud Run containers and heavy-duty Cloud Batch machines will definitely help you with all the computation tasks you may have. Try them out by &lt;a href="https://codelabs.developers.google.com/codelabs/cloud-starting-cloudrun-jobs?utm_campaign=CDR_0x73f0e2c4_default_b496192395&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;creating a Cloud Run Job (code lab)&lt;/a&gt; or by &lt;a href="https://docs.cloud.google.com/batch/docs/create-run-example-job?utm_campaign=CDR_0x73f0e2c4_default_b496192395&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;scheduling a Cloud Batch job&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;To stay up to date with all that's happening in the &lt;a href="https://cloud.google.com/?utm_campaign=CDR_0x73f0e2c4_default_b496192395&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Google Cloud&lt;/a&gt; world, keep an eye on the &lt;a href="https://cloud.google.com/blog/?utm_campaign=CDR_0x73f0e2c4_default_b496192395&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Google Cloud blog&lt;/a&gt; and the &lt;a href="https://www.youtube.com/googlecloudplatform" rel="noopener noreferrer"&gt;Google Cloud YouTube channel&lt;/a&gt; so you don't miss any updates!&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>gcp</category>
      <category>devops</category>
    </item>
    <item>
      <title>Inference on GKE Private Clusters</title>
      <dc:creator>Maciej Strzelczyk</dc:creator>
      <pubDate>Thu, 12 Mar 2026 12:52:00 +0000</pubDate>
      <link>https://forem.com/googlecloud/inference-on-gke-private-clusters-35i8</link>
      <guid>https://forem.com/googlecloud/inference-on-gke-private-clusters-35i8</guid>
      <description>&lt;h2&gt;
  
  
  Setting up an inference service without access to the Internet
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/tutorials/serve-with-gke-inference-gateway?utm_campaign=CDR_0x73f0e2c4_default_b491386531&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Deploying an inference service&lt;/a&gt; on your &lt;a href="https://cloud.google.com/kubernetes-engine?utm_campaign=CDR_0x73f0e2c4_default_b491386531&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;GKE&lt;/a&gt; cluster in 2026 is a fairly simple task. With a short Deployment definition making use of a &lt;a href="https://docs.vllm.ai/en/latest/" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt; image (&lt;a href="https://docs.cloud.google.com/tpu/docs/intro-to-tpu?utm_campaign=CDR_0x73f0e2c4_default_b491386531&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;TPU&lt;/a&gt; or &lt;a href="https://cloud.google.com/gpu?utm_campaign=CDR_0x73f0e2c4_default_b491386531&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;GPU&lt;/a&gt;) and a Service definition, you have the basic setup ready to go! vLLM grabs the model of your choosing from &lt;a href="https://huggingface.co/" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt; during its startup. It’s all nicely automated. However, this setup requires your GKE nodes to have access to the Internet. What should you do when there’s no Internet connection? I will discuss the options in this article, but first, let’s start with a short analysis of how and why you may want to have no Internet connection for your nodes.&lt;/p&gt;

&lt;h2&gt;
  
  
  GKE Private Nodes
&lt;/h2&gt;

&lt;p&gt;One situation where your vLLM pod might not be able to download a model from the Internet is when you decide to use a &lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/how-to/legacy/network-isolation?utm_campaign=CDR_0x73f0e2c4_default_b491386531&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;GKE Private Cluster&lt;/a&gt;. When you choose this option, the nodes in your cluster are assigned only a private IP from your VPC network. With only a private IP address, it’s impossible to reach them from outside of your network, but they also lose the default way to communicate with the outside world. This feature is great for increasing the security of your system, but it has obvious drawbacks, like this lack of outbound connectivity.&lt;/p&gt;

&lt;p&gt;One easy solution to the private nodes situation is to configure &lt;a href="https://docs.cloud.google.com/nat/docs/overview?utm_campaign=CDR_0x73f0e2c4_default_b491386531&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud NAT&lt;/a&gt; for the region your cluster is in. That will create a way for the nodes and pods running on them to access the Internet, while keeping them protected from any attempt to establish new connections from outside of the network. However, if you want your pods to stay unable to connect to the Internet, you’ll need another way to get the model for vLLM to run.&lt;/p&gt;
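
&lt;p&gt;If outbound access is acceptable in your case, Cloud NAT can be set up with a Cloud Router and a NAT configuration, roughly like this (network and region names are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Cloud NAT is attached to a Cloud Router in the cluster region.
gcloud compute routers create nat-router \
    --network my-vpc --region europe-west1

gcloud compute routers nats create nat-config \
    --router nat-router --region europe-west1 \
    --auto-allocate-nat-external-ips --nat-all-subnet-ip-ranges
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;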

&lt;h2&gt;
  
  
  Providing images to the pods
&lt;/h2&gt;

&lt;p&gt;Another problem you might encounter when choosing to use a Private Cluster without access to the Internet is that your nodes won’t have access to the default source of Docker images: &lt;a href="https://hub.docker.com/" rel="noopener noreferrer"&gt;Docker Hub&lt;/a&gt;. The simple &lt;code&gt;vllm/vllm-openai:latest&lt;/code&gt; image specification will not work. You will need to copy the images you want to use to the &lt;a href="https://docs.cloud.google.com/artifact-registry/docs/overview?utm_campaign=CDR_0x73f0e2c4_default_b491386531&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Artifact Registry&lt;/a&gt;; this way the GKE nodes will be able to download and run them. This also gives you additional control over your environment: you decide exactly which image versions to download and allow cluster users to use.&lt;/p&gt;
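
&lt;p&gt;Copying an image into Artifact Registry can be as simple as a pull, tag and push from any machine that still has Internet access (the repository name below is just an example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# One-time: create a Docker repository in Artifact Registry.
gcloud artifacts repositories create inference-images \
    --repository-format docker --location europe-west1

# Mirror the vLLM image from Docker Hub into your own registry.
# In practice, pin a specific version tag instead of "latest".
docker pull vllm/vllm-openai:latest
docker tag vllm/vllm-openai:latest \
    europe-west1-docker.pkg.dev/my-project/inference-images/vllm-openai:latest
docker push europe-west1-docker.pkg.dev/my-project/inference-images/vllm-openai:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;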

&lt;h2&gt;
  
  
  Providing the LLM
&lt;/h2&gt;

&lt;p&gt;vLLM can run a model stored in a local directory if you pass it as the &lt;code&gt;--model&lt;/code&gt; argument value. To make use of this ability in your private GKE cluster, you will have to somehow provide the model to the vLLM through a mounted directory. The easiest way to do this is through &lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver?utm_campaign=CDR_0x73f0e2c4_default_b491386531&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;GCS FUSE&lt;/a&gt;, which allows you to simply mount a &lt;a href="https://docs.cloud.google.com/storage/docs/buckets?utm_campaign=CDR_0x73f0e2c4_default_b491386531&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;GCS bucket&lt;/a&gt; as a folder in your Pod. You just need to remember that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The GKE Cluster must have the &lt;code&gt;GcsFuseCsiDriver&lt;/code&gt; add-on enabled.
&lt;/li&gt;
&lt;li&gt;You should use &lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/concepts/workload-identity?utm_campaign=CDR_0x73f0e2c4_default_b491386531&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Workload Identity&lt;/a&gt; and a dedicated &lt;a href="https://docs.cloud.google.com/iam/docs/service-account-overview?utm_campaign=CDR_0x73f0e2c4_default_b491386531&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;service account&lt;/a&gt; to allow the pod to access the bucket. The &lt;code&gt;roles/storage.objectViewer&lt;/code&gt; role should work just fine for read-only access.
&lt;/li&gt;
&lt;li&gt;It’s important to host the model in the same region as the nodes of your cluster to ensure the fastest transfers.&lt;/li&gt;
&lt;/ol&gt;
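
&lt;p&gt;Putting these requirements together, the storage-related parts of a vLLM Pod spec might look like this sketch (bucket, image, model and service account names are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: vllm-inference
  annotations:
    gke-gcsfuse/volumes: "true"  # enable the GCS FUSE sidecar for this Pod
spec:
  serviceAccountName: vllm-sa    # bound via Workload Identity to an IAM SA
  containers:
  - name: vllm
    image: us-docker.pkg.dev/my-project/inference-images/vllm-openai:latest
    args:
    - --model=/models/my-model   # local path instead of a Hugging Face model ID
    volumeMounts:
    - name: model-bucket
      mountPath: /models
      readOnly: true
  volumes:
  - name: model-bucket
    csi:
      driver: gcsfuse.csi.storage.gke.io
      readOnly: true
      volumeAttributes:
        bucketName: my-models-bucket
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;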

&lt;p&gt;Serving LLMs from a mounted directory speeds up the startup process of your inference service, as it doesn’t have to download the model each time a new pod is started.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Alternative to mounting GCS Bucket - persistent disks&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;An alternative to mounting a bucket is to use a zonal or regional &lt;a href="https://docs.cloud.google.com/compute/docs/disks/persistent-disks?utm_campaign=CDR_0x73f0e2c4_default_b491386531&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;persistent disk&lt;/a&gt; or &lt;a href="https://docs.cloud.google.com/compute/docs/disks/hyperdisks?utm_campaign=CDR_0x73f0e2c4_default_b491386531&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;hyperdisk&lt;/a&gt;. A single disk can be mounted by multiple pods at once when mounted in read-only mode. Creating a disk to store a model is a bit more time-consuming than using a GCS bucket, but might provide better performance (depending on the disk type) and be cheaper, as GCS and disk billing are structured differently.&lt;/p&gt;

&lt;p&gt;To create a disk storing a model, you will need a temporary &lt;a href="https://docs.cloud.google.com/compute/docs/instances?utm_campaign=CDR_0x73f0e2c4_default_b491386531&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Compute Instance&lt;/a&gt;, where you will mount, format and fill the disk with data (&lt;a href="https://huggingface.co/docs/huggingface_hub/guides/cli" rel="noopener noreferrer"&gt;&lt;code&gt;hf download&lt;/code&gt;&lt;/a&gt; works just fine for this). Once the disk is ready, the VM can be deleted and the disk attached to the vLLM pods.&lt;/p&gt;
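
&lt;p&gt;A rough outline of that one-time preparation (names, zones and sizes are placeholders) could be:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Create the disk and a throwaway VM, then attach the disk.
gcloud compute disks create model-disk --size 200GB --zone europe-west1-b
gcloud compute instances create disk-loader --zone europe-west1-b
gcloud compute instances attach-disk disk-loader \
    --disk model-disk --zone europe-west1-b

# On the VM (over SSH): format, mount and fill the disk.
#   sudo mkfs.ext4 /dev/sdb
#   sudo mkdir -p /mnt/model
#   sudo mount /dev/sdb /mnt/model
#   hf download google/gemma-3-27b-it --local-dir /mnt/model

# Once done, free the disk and remove the helper VM.
gcloud compute instances detach-disk disk-loader \
    --disk model-disk --zone europe-west1-b
gcloud compute instances delete disk-loader --zone europe-west1-b --quiet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;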

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Using GKE without Internet access can be a good practice, providing you with additional security and control. As you can see, the additional work required to get your inference service running in this case is not negligible, but it is also not a deal-breaker. It’s up to you to decide if it’s a configuration you would like to use in your setup. Using a GCS Bucket or persistent disk to store a model is also a very good idea to simply cut down on the startup time of your services, especially with larger models.&lt;/p&gt;

&lt;p&gt;The ecosystem of AI is changing at a rapid pace and it’s important to stay up to date with all the latest news. Follow the official &lt;a href="https://cloud.google.com/blog?utm_campaign=CDR_0x73f0e2c4_default_b491386531&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Google Cloud blog&lt;/a&gt;, &lt;a href="https://developers.googleblog.com/?utm_campaign=CDR_0x73f0e2c4_default_b491386531&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Google Developers blog&lt;/a&gt; and &lt;a href="https://www.youtube.com/googlecloudplatform" rel="noopener noreferrer"&gt;Google Cloud Tech YouTube channel&lt;/a&gt; to not miss any updates!&lt;/p&gt;

</description>
      <category>gke</category>
      <category>gcp</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>AI deployment: to host or not to host?</title>
      <dc:creator>Maciej Strzelczyk</dc:creator>
      <pubDate>Tue, 10 Mar 2026 23:28:46 +0000</pubDate>
      <link>https://forem.com/googlecloud/ai-deployment-to-host-or-not-to-host-4p2</link>
      <guid>https://forem.com/googlecloud/ai-deployment-to-host-or-not-to-host-4p2</guid>
<description>&lt;p&gt;So you’ve built your AI application prototype. You used your own local GPU to run the AI model, or just used the &lt;a href="https://aistudio.google.com/" rel="noopener noreferrer"&gt;free AI Studio tier&lt;/a&gt; to power your clever program. The app is ready, the world is ready, time to deploy your production instance! In the case of traditional, non-AI powered apps and services, the choice of deployment platform is based on personal preference: what you are familiar with, how much control over fine details you want, and so on. Cost is usually not the most important factor, as for a new service that is just starting to gain a userbase, the first usage bills won’t be that high anyway. The situation is different when it comes to running services that make use of AI. Here, you need to make two separate decisions. The first is how to deploy your application; this is the same as for a vanilla, non-AI app. The second is how you are going to provision the AI capabilities. This second decision will most likely be responsible for a big chunk of your bill, and it shouldn’t be made without proper consideration. In this article, I will try to help you make the right decision for your use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Serverless vs hosted inference service
&lt;/h2&gt;

&lt;p&gt;There are two ways of provisioning AI for a production-grade application: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Serverless&lt;/strong&gt; - where you pay for the tokens your application sends and receives. This is sometimes called Model as a Service (MaaS). In Google Cloud, this approach is available in &lt;a href="https://cloud.google.com/vertex-ai?utm_campaign=CDR_0x73f0e2c4_default_b485824284&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Vertex AI&lt;/a&gt; and &lt;a href="https://ai.google.dev/gemini-api/docs?utm_campaign=CDR_0x73f0e2c4_default_b485824284&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Google AI Studio (Gemini API)&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hosted&lt;/strong&gt; - where you pay for the time you use the infrastructure running an LLM. In Google Cloud, this model is available through multiple services like: &lt;a href="https://cloud.google.com/products/compute?utm_campaign=CDR_0x73f0e2c4_default_b485824284&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Compute&lt;/a&gt; (through certain &lt;a href="https://docs.cloud.google.com/compute/docs/accelerator-optimized-machines?utm_campaign=CDR_0x73f0e2c4_default_b485824284&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;machine types&lt;/a&gt;), &lt;a href="https://cloud.google.com/vertex-ai?utm_campaign=CDR_0x73f0e2c4_default_b485824284&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Vertex AI&lt;/a&gt;, &lt;a href="https://cloud.google.com/kubernetes-engine?utm_campaign=CDR_0x73f0e2c4_default_b485824284&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;GKE&lt;/a&gt; or &lt;a href="https://docs.cloud.google.com/run/docs/ai/overview?utm_campaign=CDR_0x73f0e2c4_default_b485824284&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Depending on your situation, you may not have an option to choose between the two, because only one would be possible. For example, if you have to use one of the &lt;a href="https://ai.google.dev/gemini-api/docs/gemini-3?utm_campaign=CDR_0x73f0e2c4_default_b485824284&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Gemini models&lt;/a&gt;, there’s no way to host it yourself and the MaaS (pay per token) approach is the only one available. Similarly, if you have to use a custom model that is not available as a service, you just have to go down the hosted path.&lt;/p&gt;

&lt;p&gt;In cases where you do have a choice between the two paths you need to understand how they will affect your budget.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Serverless (pay per token)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Paying only for the tokens your application uses is a fair and easy to understand setup. It works exactly like any other paid service on Google Cloud - you pay for what you use. &lt;/p&gt;

&lt;h4&gt;
  
  
  Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;It scales to zero when you don’t use the AI
&lt;/li&gt;
&lt;li&gt;You don’t have to worry about scaling
&lt;/li&gt;
&lt;li&gt;Configuration and maintenance are extremely simple&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Less predictable for your budget
&lt;/li&gt;
&lt;li&gt;You may hit a service quota, either when your application experiences rush-hour traffic or when you reach a total monthly usage limit
&lt;/li&gt;
&lt;li&gt;In case your application is hacked, your bill might skyrocket
&lt;/li&gt;
&lt;li&gt;Once your application gets popular, the bill will grow with your active userbase&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Hosted (pay per second)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Hosting an LLM on infrastructure that you pay for is extremely predictable cost-wise. As long as you know how long you are going to hold on to that GPU or &lt;a href="https://cloud.google.com/tpu?utm_campaign=CDR_0x73f0e2c4_default_b485824284&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;TPU&lt;/a&gt; accelerated instance, you know exactly how much you are going to pay.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Extremely predictable cost
&lt;/li&gt;
&lt;li&gt;Many ways to lower your bill: &lt;a href="https://docs.cloud.google.com/docs/cuds?utm_campaign=CDR_0x73f0e2c4_default_b485824284&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;CUDs&lt;/a&gt;, &lt;a href="https://cloud.google.com/solutions/spot-vms?utm_campaign=CDR_0x73f0e2c4_default_b485824284&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Spot Instances&lt;/a&gt;, choosing a cheaper zone or choosing the right &lt;a href="https://docs.cloud.google.com/compute/docs/accelerator-optimized-machines?utm_campaign=CDR_0x73f0e2c4_default_b485824284&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;instance and/or accelerator type&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;No quota on how many tokens your application consumes
&lt;/li&gt;
&lt;li&gt;Full control over hardware and software inference configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Big initial cost
&lt;/li&gt;
&lt;li&gt;Doesn’t scale as smoothly as serverless
&lt;/li&gt;
&lt;li&gt;Configuration and maintenance is more complicated&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Couple of considerations
&lt;/h2&gt;

&lt;p&gt;To help you out a bit further, here are some questions you should ask yourself before deciding on one of the deployment options.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much traffic do I expect?
&lt;/h3&gt;

&lt;p&gt;With low traffic, the choice is almost obvious - serverless is cheaper and easier. However, as your usage grows, the number of tokens consumed will add up to a considerable amount. In such a case, using a self-hosted solution might save you from unexpected bills at the end of the month.&lt;/p&gt;
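
&lt;p&gt;A back-of-the-envelope comparison can make this concrete. The sketch below uses made-up placeholder prices (not real Google Cloud rates) purely to show the shape of the calculation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Every number below is a made-up placeholder, not a real price.
TOKENS_PER_MONTH=500000000      # 500M tokens served per month
CENTS_PER_MILLION_TOKENS=30     # hypothetical pay-per-token rate
GPU_HOURS_PER_MONTH=730         # one always-on accelerated VM
CENTS_PER_GPU_HOUR=120          # hypothetical hourly rate

SERVERLESS_CENTS=$(( TOKENS_PER_MONTH / 1000000 * CENTS_PER_MILLION_TOKENS ))
HOSTED_CENTS=$(( GPU_HOURS_PER_MONTH * CENTS_PER_GPU_HOUR ))

echo "serverless: ${SERVERLESS_CENTS} cents, hosted: ${HOSTED_CENTS} cents"
if [ "$SERVERLESS_CENTS" -gt "$HOSTED_CENTS" ]; then
  echo "at this volume, hosting your own model is cheaper"
else
  echo "at this volume, pay-per-token is cheaper"
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With these placeholder numbers the serverless option wins; multiply the traffic tenfold and the conclusion flips, which is exactly the kind of reevaluation point to watch for.&lt;/p&gt;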

&lt;h3&gt;
  
  
  Am I legally bound to keep user data in certain region?
&lt;/h3&gt;

&lt;p&gt;In some cases, like with medical or financial data, you might be required by local regulations or your own contracts to ensure that your user data doesn’t leave a certain location or get sent to a service you don’t control. In such a situation, self-hosting an AI model may be the only possible option, no matter the cost effectiveness.&lt;/p&gt;

&lt;h3&gt;
  
  
  Am I likely to hit the hourly/monthly quota?
&lt;/h3&gt;

&lt;p&gt;All API services have some usage quotas, and that includes AI services. If you expect your application to reach these quotas, it’s a big hint that you should consider self-hosting your model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mixed-approach
&lt;/h2&gt;

&lt;p&gt;It is also worth noting that you don’t have to limit your architecture to using only one AI model with one deployment option. Imagine your application offers multiple AI-powered features - some of them might be simple enough for a small model to handle, while others require the full power of Gemini. It is perfectly fine to have, for example, &lt;a href="https://ai.google.dev/gemma/docs/core?utm_campaign=CDR_0x73f0e2c4_default_b485824284&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Gemma 3&lt;/a&gt; running on a VM handling the easier tasks, while you delegate the harder or bigger tasks to the Gemini API.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is not an irrevocable decision
&lt;/h2&gt;

&lt;p&gt;Even after careful consideration, the decision might still not be a simple one, especially if you’re starting with a new idea and simply don’t know how popular it’ll get. Luckily, with a well-architected application, it is not that difficult to prepare for changing the AI API endpoint. It’s reasonable to start with a serverless solution, where you will often make great use of the fact that no traffic = zero cost. Once your application takes off and the Vertex AI or AI Studio bill reaches levels comparable to running a self-hosted model, you should reevaluate your situation and perhaps switch to the more predictable approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Keep up!
&lt;/h2&gt;

&lt;p&gt;The ecosystem of AI is changing at a rapid pace and it’s important to stay up to date with all the latest news. Follow the official &lt;a href="https://cloud.google.com/blog?utm_campaign=CDR_0x73f0e2c4_default_b485824284&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Google Cloud blog&lt;/a&gt;, &lt;a href="https://developers.googleblog.com/?utm_campaign=CDR_0x73f0e2c4_default_b485824284&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Google Developers blog&lt;/a&gt; and &lt;a href="https://www.youtube.com/@googlecloudtech" rel="noopener noreferrer"&gt;Google Cloud Tech YouTube channel&lt;/a&gt; to not miss any updates!&lt;/p&gt;

&lt;p&gt;P.S. Did you know that Google Cloud now offers &lt;a href="https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/?utm_campaign=CDR_0x73f0e2c4_default_b485824284&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Developer Knowledge API and MCP server&lt;/a&gt; that can give your AI Agents access to always up-to-date knowledge straight from the official Google Cloud, Firebase and Android documentation?!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gcp</category>
      <category>vertexai</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>How to enable Secure Boot for your AI workloads</title>
      <dc:creator>Maciej Strzelczyk</dc:creator>
      <pubDate>Mon, 21 Jul 2025 14:43:09 +0000</pubDate>
      <link>https://forem.com/googlecloud/how-to-enable-secure-boot-for-your-ai-workloads-khm</link>
      <guid>https://forem.com/googlecloud/how-to-enable-secure-boot-for-your-ai-workloads-khm</guid>
      <description>&lt;p&gt;Written in cooperation with &lt;a href="https://www.linkedin.com/in/aroneidelman/" rel="noopener noreferrer"&gt;Aron Eidelman&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As organizations race to deploy powerful GPU-accelerated workloads, they might overlook a foundational step: ensuring the integrity of the system from the very moment it turns on. &lt;/p&gt;

&lt;p&gt;Threat actors, however, have not overlooked this. They increasingly target the boot process with sophisticated malware like bootkits, which seize control before any traditional security software can load and grant them the highest level of privilege to steal data or corrupt your most valuable AI models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; The most foundational security measure for any server is verifying its integrity the moment it powers on. This process, known as Secure Boot, is designed to stop deep-level malware that can hijack a system before its primary defenses are even awake.&lt;/p&gt;

&lt;p&gt;Secure Boot is part of Google Cloud’s &lt;a href="https://cloud.google.com/compute/shielded-vm/docs/shielded-vm?utm_campaign=CDR_0x73f0e2c4_default_b407730070&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Shielded VM&lt;/a&gt; offering, which allows you to verify the integrity of your Compute VM instances, including the VMs that handle your AI workloads. It’s the only major cloud offering of its kind that can track changes beyond initial boot out of the box and without requiring the use of separate tools or event-driven rules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The bottom line:&lt;/strong&gt; Organizations don't have to sacrifice security for performance. There is a clear, repeatable process to sign your own GPU drivers, allowing you to lock down your infrastructure's foundation without compromising your AI workloads. &lt;/p&gt;

&lt;p&gt;Google Cloud’s Secure Boot capability can be opted into at no additional charge, and now there’s a new, easier way to set it up for your GPU-accelerated machines.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Understanding the danger of bootkits&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;It’s important to secure your systems from boot-level threats. Bootkits target the boot process, the foundation of an operating system. By compromising the bootloader and other early-stage system components, a bootkit can gain kernel-level control before the operating system and its security measures load. Malware can then operate with the highest privileges, bypassing traditional security software.&lt;/p&gt;

&lt;p&gt;This technique falls under the Persistence and Defense Evasion tactics in the &lt;a href="https://attack.mitre.org/techniques/T1542/003/" rel="noopener noreferrer"&gt;MITRE ATT&amp;amp;CK framework&lt;/a&gt;. Bootkits are difficult to detect and remove due to their low-level operation. They hide by intercepting system calls and manipulating data, persisting across reboots, stealing data, installing malware, and disabling security features. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.cisa.gov/news-events/cybersecurity-advisories/aa20-336a" rel="noopener noreferrer"&gt;Bootkits and rootkits&lt;/a&gt; pose a persistent, embedded threat, and have been observed as part of current threat actor trends from &lt;a href="https://cloud.google.com/blog/topics/threat-intelligence/china-nexus-espionage-targets-juniper-routers?e=48754805&amp;amp;utm_campaign=CDR_0x73f0e2c4_default_b407730070&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Google Threat Intelligence Group&lt;/a&gt;, the &lt;a href="https://www.welivesecurity.com/2023/03/01/blacklotus-uefi-bootkit-myth-confirmed/" rel="noopener noreferrer"&gt;European Union Agency for Cybersecurity&lt;/a&gt; (ENISA), and the U.S. &lt;a href="https://www.cisa.gov/news-events/analysis-reports/ar25-087a" rel="noopener noreferrer"&gt;Cybersecurity and Infrastructure Security Agency&lt;/a&gt; (CISA). Google Cloud always works on improving the security of our solutions by strengthening our products and providing tools you can use yourself. In this article, we would like to demonstrate a new, easier way of setting up Secure Boot for your GPU-accelerated machines.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Limitations of Secure Boot with GPUs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/compute/shielded-vm/docs/shielded-vm?utm_campaign=CDR_0x73f0e2c4_default_b407730070&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Shielded VMs&lt;/a&gt; employ a &lt;a href="https://en.wikipedia.org/wiki/Trusted_Platform_Module" rel="noopener noreferrer"&gt;TPM&lt;/a&gt; 2.0-compliant &lt;a href="https://cloud.google.com/vmware-engine/docs/vmware-ecosystem/howto-vtpm?utm_campaign=CDR_0x73f0e2c4_default_b407730070&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;virtual Trusted Platform Module&lt;/a&gt; (vTPM) as their root of trust, protected by Google Cloud's virtualization and isolation powered by &lt;a href="https://cloud.google.com/docs/security/titan-hardware-chip?utm_campaign=CDR_0x73f0e2c4_default_b407730070&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Titan chips&lt;/a&gt;. While Secure Boot enforces signed software execution, &lt;a href="https://cloud.google.com/docs/security/boot-integrity#measured-boot-process?utm_campaign=CDR_0x73f0e2c4_default_b407730070&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Measured Boot&lt;/a&gt; logs boot component measurements to the vTPM for remote attestation and integrity verification. &lt;/p&gt;

&lt;p&gt;Limitations start when you want to use a kernel module that is not part of the official distribution of your operating system. That is especially problematic for AI workloads, which rely on GPUs whose drivers are usually not part of official distributions. If you want to manually install GPU drivers on a system with Secure Boot, the system will refuse to use them because they won’t be properly signed. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to use Secure Boot on GPU-accelerated machines&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There are two ways you can tell Google Cloud to trust your signature when it confirms the GPU driver validity with Secure Boot: with an automated script, or manually. &lt;/p&gt;

&lt;p&gt;The script that can help you prepare a Secure Boot compatible image is open-source and is available in our &lt;a href="https://github.com/GoogleCloudPlatform/compute-gpu-installation/tree/main/linux" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;. Here’s how you can use it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Download the newest version of the script:&lt;/span&gt;
curl &lt;span class="nt"&gt;-L&lt;/span&gt; https://storage.googleapis.com/compute-gpu-installation-us/installer/latest/cuda_installer.pyz &lt;span class="nt"&gt;--output&lt;/span&gt; cuda_installer.pyz

&lt;span class="c"&gt;# Make sure you are logged in with gcloud&lt;/span&gt;
gcloud auth login

&lt;span class="c"&gt;# Check available option for the build process&lt;/span&gt;
python3 cuda_installer.pyz build_image &lt;span class="nt"&gt;--help&lt;/span&gt;

&lt;span class="c"&gt;# Use the script to build an image based on Ubuntu 24.04&lt;/span&gt;
&lt;span class="nv"&gt;PROJECT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_project_name
&lt;span class="nv"&gt;ZONE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;zone_you_want_to_use
&lt;span class="nv"&gt;SECURE_BOOT_IMAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;name_of_the_final_image

python3 cuda_installer.pyz build_image &lt;span class="nt"&gt;--project&lt;/span&gt; &lt;span class="nv"&gt;$PROJECT&lt;/span&gt; &lt;span class="nt"&gt;--vm-zone&lt;/span&gt; &lt;span class="nv"&gt;$ZONE&lt;/span&gt; &lt;span class="nt"&gt;--base-image&lt;/span&gt; ubuntu-24 &lt;span class="nv"&gt;$SECURE_BOOT_IMAGE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script executes each of the five steps described below for you; expect it to take up to 30 minutes, most of which is spent on the driver installation. We’ve also detailed how to use the building script in &lt;a href="https://cloud.google.com/compute/docs/gpus/install-drivers-gpu#self-signing-automated?utm_campaign=CDR_0x73f0e2c4_default_b407730070&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;our documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To manually tell Google Cloud to trust your signature, follow these five steps (also available in &lt;a href="https://cloud.google.com/compute/docs/gpus/install-drivers-gpu#self-signing-manual?utm_campaign=CDR_0x73f0e2c4_default_b407730070&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;our documentation&lt;/a&gt;):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate your own certificate to be used for signing the driver.
&lt;/li&gt;
&lt;li&gt;Create a fresh VM with the OS of your choice (Secure Boot disabled, GPU not required).
&lt;/li&gt;
&lt;li&gt;Install and sign the GPU driver (and optionally CUDA toolkit).
&lt;/li&gt;
&lt;li&gt;Create a new Disk Image based on the machine with a self-signed driver, &lt;a href="https://cloud.google.com/compute/shielded-vm/docs/creating-shielded-images#adding-shielded-image?utm_campaign=CDR_0x73f0e2c4_default_b407730070&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;adding your certificate to the list of trusted certificates&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;The new image can now be used with Secure Boot enabled VMs.&lt;/li&gt;
&lt;/ol&gt;
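&lt;p&gt;As a rough sketch of the manual path (the key names, certificate subject, and source disk name here are illustrative, and the exact module-signing procedure depends on your driver installer), the certificate and image steps could look like this:&lt;/p&gt;

```shell
# 1. Generate a self-signed certificate for driver signing (names illustrative)
openssl req -new -x509 -newkey rsa:2048 -nodes -days 3650 \
    -keyout driver_key.priv -outform DER -out driver_cert.der \
    -subj "/CN=GPU driver signing/"

# Steps 2-3 happen on a fresh VM with Secure Boot disabled: install the GPU
# driver and sign its kernel modules with the key generated above.

# 4. Create a new image from that VM's disk, adding your certificate to the
#    image's trusted signature database (db)
gcloud compute images create $SECURE_BOOT_IMAGE \
    --source-disk=$SOURCE_DISK --source-disk-zone=$ZONE \
    --signature-database-file=driver_cert.der \
    --guest-os-features=UEFI_COMPATIBLE
```

&lt;p&gt;Here &lt;code&gt;$SOURCE_DISK&lt;/code&gt; stands for the disk of the VM you prepared in steps 2 and 3.&lt;/p&gt;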

&lt;p&gt;Whether you used the script or performed the task manually, you’ll want to verify that the process worked. &lt;/p&gt;

&lt;h2&gt;
  
  
&lt;strong&gt;Start a new GPU-accelerated VM using the created image&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To verify that everything worked, create a new VM from the new disk image with Secure Boot enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a new VM with T4 GPU to verify that everything works. Note that here ZONE needs to have T4 GPUs available.&lt;/span&gt;
&lt;span class="nv"&gt;TEST_INSTANCE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;name_of_the_test_instance

gcloud compute instances create &lt;span class="nv"&gt;$TEST_INSTANCE_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$ZONE&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--machine-type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;n1-standard-4 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--accelerator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1,type&lt;span class="o"&gt;=&lt;/span&gt;nvidia-tesla-t4 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--create-disk&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;auto-delete&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;yes&lt;/span&gt;,boot&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;yes&lt;/span&gt;,device-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$TEST_INSTANCE_NAME&lt;/span&gt;,image&lt;span class="o"&gt;=&lt;/span&gt;projects/&lt;span class="nv"&gt;$PROJECT&lt;/span&gt;/global/images/&lt;span class="nv"&gt;$SECURE_BOOT_IMAGE&lt;/span&gt;,mode&lt;span class="o"&gt;=&lt;/span&gt;rw,size&lt;span class="o"&gt;=&lt;/span&gt;100,type&lt;span class="o"&gt;=&lt;/span&gt;pd-balanced &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--shielded-secure-boot&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--shielded-vtpm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--shielded-integrity-monitoring&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--maintenance-policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;TERMINATE

&lt;span class="c"&gt;# gcloud compute ssh to run nvidia-smi and see the output&lt;/span&gt;
gcloud compute ssh &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT&lt;/span&gt; &lt;span class="nt"&gt;--zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$ZONE&lt;/span&gt; &lt;span class="nv"&gt;$TEST_INSTANCE_NAME&lt;/span&gt; &lt;span class="nt"&gt;--command&lt;/span&gt; &lt;span class="s2"&gt;"nvidia-smi"&lt;/span&gt;

&lt;span class="c"&gt;# If you decided to also install CUDA, you can verify it with the following command&lt;/span&gt;
gcloud compute ssh &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT&lt;/span&gt; &lt;span class="nt"&gt;--zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$ZONE&lt;/span&gt; &lt;span class="nv"&gt;$TEST_INSTANCE_NAME&lt;/span&gt; &lt;span class="nt"&gt;--command&lt;/span&gt; &lt;span class="s2"&gt;"python3 cuda_installer.pyz verify_cuda"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Clean up&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once you’ve verified that the new image works, there’s no need to keep the verification VM around. You can delete it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute instances delete &lt;span class="nt"&gt;--zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$ZONE&lt;/span&gt; &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT&lt;/span&gt; &lt;span class="nv"&gt;$TEST_INSTANCE_NAME&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Enabling Secure Boot&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now that you have built a Secure Boot compatible base image for your GPU-based workloads, remember to actually enable Secure Boot on your VM instances when you use those images! Secure Boot is disabled by default, so it needs to be explicitly enabled for Compute Engine instances.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;When creating new instances&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you create a new instance using Cloud Console, the checkbox to enable Secure Boot can be found in the Security tab of the creation page, under the Shielded VM section.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fepx2y5a8j7gta8cn5tqz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fepx2y5a8j7gta8cn5tqz.png" alt="Google Compute Instance creation interface with " width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the gcloud enthusiasts, there’s a &lt;code&gt;--shielded-secure-boot&lt;/code&gt; flag available for the &lt;a href="https://cloud.google.com/sdk/gcloud/reference/compute/instances/create#--shielded-secure-boot?utm_campaign=CDR_0x73f0e2c4_default_b407730070&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;gcloud compute instances create&lt;/a&gt; command.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Updating existing instances&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You can also enable Secure Boot for instances that already exist; however, make sure they are running a compatible system. If the driver installed on those machines is not signed with a properly configured key, it will not be loaded. To update the Secure Boot configuration of existing VMs, follow the stop, update and restart procedure described in this &lt;a href="https://cloud.google.com/compute/shielded-vm/docs/modifying-shielded-vm?utm_campaign=CDR_0x73f0e2c4_default_b407730070&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;documentation page&lt;/a&gt;.&lt;/p&gt;
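&lt;p&gt;With gcloud, that stop, update and restart cycle could look like this (&lt;code&gt;$INSTANCE_NAME&lt;/code&gt; is a placeholder for your existing instance):&lt;/p&gt;

```shell
# Shielded VM settings can only be changed while the instance is stopped
gcloud compute instances stop $INSTANCE_NAME --zone=$ZONE --project=$PROJECT

# Enable Secure Boot; the driver on the boot disk must already be signed
# with a key the image trusts, or it will fail to load on the next boot
gcloud compute instances update $INSTANCE_NAME --zone=$ZONE --project=$PROJECT \
    --shielded-secure-boot

# Start the instance again
gcloud compute instances start $INSTANCE_NAME --zone=$ZONE --project=$PROJECT
```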

&lt;h2&gt;
  
  
  &lt;strong&gt;Get started&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Make sure to visit our &lt;a href="https://cloud.google.com/compute/docs/gpus/install-drivers-gpu#self-signing-automated?utm_campaign=CDR_0x73f0e2c4_default_b407730070&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;documentation page&lt;/a&gt; to learn more about the process and follow our &lt;a href="https://github.com/GoogleCloudPlatform/compute-gpu-installation" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; to stay up to date with other GPU automation news.&lt;/p&gt;

</description>
      <category>security</category>
      <category>googlecloud</category>
      <category>nvidia</category>
      <category>gpu</category>
    </item>
    <item>
      <title>Understanding Google Cloud’s Dynamic Workload Scheduler</title>
      <dc:creator>Maciej Strzelczyk</dc:creator>
      <pubDate>Tue, 01 Jul 2025 11:53:04 +0000</pubDate>
      <link>https://forem.com/googlecloud/understanding-google-clouds-dynamic-workload-scheduler-5p</link>
      <guid>https://forem.com/googlecloud/understanding-google-clouds-dynamic-workload-scheduler-5p</guid>
      <description>&lt;p&gt;In the age of artificial intelligence and machine learning, there is a constant need for powerful hardware like &lt;a href="https://cloud.google.com/compute/docs/gpus?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;GPUs&lt;/a&gt; and &lt;a href="https://cloud.google.com/tpu?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;TPUs&lt;/a&gt;. Ideally, access to this hardware should be predictable and reliable. Resource availability shouldn’t be a blocker for your projects. If customers want to use a GPU, they should be provided with a GPU! After all, this is supposed to be one of the ideas behind cloud computing: to have resources available on demand. But with a limited supply of hardware, there is a need for a solution more sophisticated than simple “first come, first serve.”&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing DWS
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Dynamic Workload Scheduler (DWS)&lt;/strong&gt; is Google Cloud's innovative solution designed to optimize the allocation of high-demand, finite resources like GPUs and TPUs, ensuring that customer workloads can access the necessary hardware when needed. It directly addresses the supply and demand imbalance problem. On one hand, Google Cloud has customers asking for GPUs and TPUs to run their workloads. On the other hand, there’s a limited number of hardware resources that can be assigned to the customers. DWS is what balances customer demands against the finite resources of the cloud (which wants to &lt;em&gt;feel&lt;/em&gt; infinite).&lt;/p&gt;

&lt;p&gt;To the traditional options of on-demand provisioning, &lt;a href="https://cloud.google.com/solutions/spot-vms?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Spot instances&lt;/a&gt; and &lt;a href="https://cloud.google.com/compute/docs/instances/reservations-overview?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;reservations&lt;/a&gt;, DWS adds two simple yet powerful provisioning methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/dws?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Flex Start mode&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/compute/docs/instances/future-reservations-calendar-mode-overview?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Calendar mode&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, I’ll explain the benefits of each of these DWS methods and provide practical scenarios for when you might want to use them, helping you choose the best provisioning strategy for your specific workloads. Both methods are still in preview, so you can expect their availability and scope to improve once they enter general availability later this year.&lt;/p&gt;

&lt;p&gt;If you’d rather watch a video about Dynamic Workload Scheduler — I’ve got you covered:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/uWiO00RVQP4"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Calendar mode
&lt;/h2&gt;

&lt;p&gt;Let’s start with &lt;a href="https://cloud.google.com/compute/docs/instances/create-future-reservations-calendar-mode?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Calendar mode&lt;/a&gt;, which is a bit simpler to understand. DWS Calendar Mode allows you to create &lt;a href="https://cloud.google.com/compute/docs/instances/future-reservations-overview?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;future reservations&lt;/a&gt; for the hardware you know you will need in advance. Booking rooms in a hotel is a great analogy here. You specify the &lt;strong&gt;range of dates&lt;/strong&gt;, &lt;strong&gt;location&lt;/strong&gt;, &lt;strong&gt;type&lt;/strong&gt; and &lt;strong&gt;quantity&lt;/strong&gt; of the hardware you need and you submit your request. Like a hotel, the system checks resource availability. It then books the resources you want to reserve. Once your future reservation is approved, all you need to do is wait for the starting date. 
Google Cloud creates a reservation for you on the start date that you can then consume however you want (&lt;a href="https://cloud.google.com/compute/docs/instances/reservations-consume?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;GCE&lt;/a&gt;, &lt;a href="https://cloud.google.com/kubernetes-engine/docs/how-to/consuming-reservations?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;GKE&lt;/a&gt;, &lt;a href="https://cloud.google.com/vertex-ai/docs/training/use-reservations?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Vertex AI&lt;/a&gt;, &lt;a href="https://cloud.google.com/vertex-ai/docs/workbench/instances/reservations?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Vertex AI Workbench&lt;/a&gt; and &lt;a href="https://cloud.google.com/batch/docs/create-run-job-reservation?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Batch&lt;/a&gt; - they can all consume reservations).&lt;/p&gt;

&lt;p&gt;Once the reservation time runs out, the system will reclaim the resources, so they can be allocated to other customers. Just like in a hotel, you pay for the time you had your reservation, even if you didn’t use it 100% of the time.&lt;/p&gt;

&lt;p&gt;Here are some facts about the DWS Calendar Mode reservations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The reservation period has a fixed length of 1 to 90 days.
&lt;/li&gt;
&lt;li&gt;Currently, GPUs require a 4-day lead time before the reservation can start. TPU reservations can be submitted 24 hours in advance of the desired start time.
&lt;/li&gt;
&lt;li&gt;Once your request is accepted, you will have to pay for the full reservation period, even if not used.
&lt;/li&gt;
&lt;li&gt;Once the reservation period ends, the resources are reclaimed.
&lt;/li&gt;
&lt;li&gt;Reserved resources are &lt;a href="https://cloud.google.com/ai-hypercomputer/docs/terminology#dense-deployment?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;physically close to each other&lt;/a&gt; to minimize network latency.
&lt;/li&gt;
&lt;li&gt;Calendar Mode reservations can be &lt;a href="https://cloud.google.com/compute/docs/instances/reservations-shared?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;shared&lt;/a&gt; with other projects.
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/products/dws/pricing?e=48754805&amp;amp;utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;DWS has its own pricing&lt;/a&gt;, separate from other provisioning methods. (Usually cheaper than on-demand pricing).
&lt;/li&gt;
&lt;li&gt;No quota is consumed while using resources booked through Calendar Mode reservations.&lt;/li&gt;
&lt;/ul&gt;
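&lt;p&gt;As a sketch, submitting a Calendar Mode request through the future reservations interface might look like the command below; the machine type, dates and flag names are illustrative, so check the linked documentation for the current syntax:&lt;/p&gt;

```shell
# Request 8 GPU machines for a fixed two-week window (values illustrative)
gcloud beta compute future-reservations create my-training-block \
    --project=$PROJECT --zone=$ZONE \
    --machine-type=a3-highgpu-8g --total-count=8 \
    --start-time=2025-08-01T00:00:00Z --end-time=2025-08-15T00:00:00Z \
    --auto-delete-auto-created-reservations \
    --planning-status=SUBMITTED
```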

&lt;p&gt;So, what are the best scenarios for Calendar mode? If you…&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Know how many resources you need
&lt;/li&gt;
&lt;li&gt;Know how long you need them for
&lt;/li&gt;
&lt;li&gt;Know when you want to start and finish your project&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…then DWS Calendar Mode is the solution for you. Whether it’s an ML training job, an HPC simulation or an expected spike in inference requests (isn’t Black Friday great?), Calendar Mode has you covered.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;So what’s the difference between regular future reservations and Calendar Mode?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You might have seen that in Google Cloud, there are also &lt;a href="https://cloud.google.com/compute/docs/instances/future-reservations-overview?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;future reservations&lt;/a&gt; that are not related to DWS Calendar Mode. You can think of Calendar Mode reservations as a subset of the more generic future reservations. Every Calendar Mode reservation is a Future Reservation, but for a Future Reservation to be a Calendar Mode reservation, it needs to be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configured to auto-delete the reservation on expiry, even if it’s not consumed.
&lt;/li&gt;
&lt;li&gt;No longer than 90 days.
&lt;/li&gt;
&lt;li&gt;Limited to certain types of resources (see &lt;a href="https://cloud.google.com/compute/docs/instances/create-future-reservations-calendar-mode?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; for up to date list)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additionally, Calendar Mode comes with a handy assistant that helps you find available capacity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5lp241c2om9izr80kd6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5lp241c2om9izr80kd6.png" alt="Calendar Mode Assistant"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Flex start mode
&lt;/h2&gt;

&lt;p&gt;With Calendar mode being so great, what more could you need? Well, you don’t always have a schedule you need to keep. Sometimes you want your job finished as soon as possible. At other times, you don’t know how long it will take to complete the work. This is where Flex Start mode comes in. If Calendar mode works like a hotel, Flex Start mode is more like a restaurant.&lt;/p&gt;

&lt;p&gt;How does it work? You tell DWS that you need hardware, let’s say &lt;strong&gt;10x A4 machines&lt;/strong&gt;, to run a job that will take &lt;strong&gt;at most 6 days&lt;/strong&gt;. With that knowledge, DWS goes out to the Cloud to get you your 10 A4 machines. After some time (this is where the “flex” part comes from - it’s a flexible process) the system has the 10 A4 machines you need and provides them to you all at once. This “all-or-nothing” approach ensures you receive the full requested capacity simultaneously, so you don’t have to worry about paying for 7 idle machines while you wait for the remaining 3. Once the machines are delivered, they are yours until the specified time runs out, or until you’re done with your task. If you release the resources early, you pay only for the time you actually used them. Since there is no provisioning notification, make sure your workloads can start automatically when the machines are created.&lt;/p&gt;

&lt;p&gt;While Calendar mode was similar to booking rooms in a hotel, Flex Start is more akin to waiting for your order in a restaurant. You wait until your “order” is served and eat until you’re done, or the restaurant closes. If you change your mind before the order is fulfilled, you can cancel your request without any consequences. &lt;/p&gt;

&lt;p&gt;To summarize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flex Start mode requests hardware for specified periods of time from 1 minute to 7 days.
&lt;/li&gt;
&lt;li&gt;Requests are fulfilled as soon as possible (shorter requests tend to be fulfilled more quickly).
&lt;/li&gt;
&lt;li&gt;You can cancel your request at any time; you only pay for what you used.
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/products/dws/pricing?e=48754805#how-dws-pricing-works&amp;amp;utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;DWS Flex Start pricing&lt;/a&gt; offers discounts compared to on-demand provisioning.
&lt;/li&gt;
&lt;li&gt;Once the time limit of your request is reached, the resources are reclaimed.
&lt;/li&gt;
&lt;li&gt;Resources acquired through Flex Start mode consume the &lt;a href="https://cloud.google.com/compute/resource-usage#preemptible-quotas?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;preemptible quota&lt;/a&gt;, which is usually a lot higher than on-demand quota.
&lt;/li&gt;
&lt;li&gt;Works only for &lt;a href="https://cloud.google.com/compute/docs/accelerator-optimized-machines?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Accelerator-optimized machine series&lt;/a&gt; and &lt;a href="https://cloud.google.com/compute/docs/gpus/create-gpu-vm-general-purpose?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;N1 virtual machine (VM) instances with GPUs attached&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;You can't stop, suspend, or recreate the instances you create through Flex Start mode.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Flex start mode works best if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You have a short (&amp;lt; 7 days) need for resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You want your job started as soon as possible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You don’t know how long your task will take, and appreciate the flexibility to release resources early and only pay for actual usage.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How to use it?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Flex Start mode works a bit differently in every supported product.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For &lt;strong&gt;Compute Engine&lt;/strong&gt;, it comes in the form of an all-or-nothing Managed Instance Group &lt;a href="https://cloud.google.com/compute/docs/instance-groups/create-resize-requests-mig?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;resize request&lt;/a&gt; with the maximum run duration specified.
&lt;/li&gt;
&lt;li&gt;For &lt;strong&gt;Google Kubernetes Engine (GKE)&lt;/strong&gt;, it’s specified for a &lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/dws?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;workload or through scheduling tool&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;For &lt;strong&gt;Cloud&lt;/strong&gt; &lt;strong&gt;Batch&lt;/strong&gt;, it’s available for jobs running on &lt;a href="https://cloud.google.com/batch/docs/create-run-job-gpus#select-provisioning-method?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;specific machine types&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;For &lt;strong&gt;Vertex AI&lt;/strong&gt;, specify FLEX_START as your scheduling strategy.&lt;/li&gt;
&lt;/ul&gt;
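&lt;p&gt;For the Compute Engine path, a Flex Start request takes the shape of a MIG resize request; a sketch (group and request names are illustrative, and the command may require the beta gcloud track):&lt;/p&gt;

```shell
# Ask DWS for 10 more VMs in the managed instance group, to run for at most
# 6 days. The request is fulfilled all-or-nothing once capacity is found.
gcloud beta compute instance-groups managed resize-requests create my-training-mig \
    --resize-request=my-flex-request \
    --resize-by=10 \
    --requested-run-duration=6d \
    --zone=$ZONE --project=$PROJECT
```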

&lt;h2&gt;
  
  
  Happy computing!
&lt;/h2&gt;

&lt;p&gt;When it comes to getting your hands on high-demand hardware for your advanced workloads, Google Cloud's Dynamic Workload Scheduler has you covered. With its Calendar and Flex Start modes, you get powerful and flexible solutions that truly fit your needs. By digging into these new provisioning methods, you can count on predictable, reliable, and efficient access to essential resources like GPUs and TPUs. This means your AI, ML, and HPC projects will run smoother than ever. &lt;a href="https://console.cloud.google.com/compute/futureReservations/add?utm_campaign=CDR_0x73f0e2c4_default_b423037559&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Try booking some powerful machines for your next project now&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>googlecloud</category>
      <category>tpu</category>
      <category>gpu</category>
    </item>
    <item>
      <title>Developing in the (Google) Cloud</title>
      <dc:creator>Maciej Strzelczyk</dc:creator>
      <pubDate>Thu, 26 Jun 2025 13:46:24 +0000</pubDate>
      <link>https://forem.com/googlecloud/developing-in-the-google-cloud-57c6</link>
      <guid>https://forem.com/googlecloud/developing-in-the-google-cloud-57c6</guid>
      <description>&lt;p&gt;As I entered the office today, it was clear that physical desktop computers are becoming a rarity. Most desks were equipped only with monitors, reflecting a significant shift in how many organizations, including Google, are approaching employee workstations. Historically, developers might have received both a desktop and a laptop. However, the trend is now towards providing only high-tier laptops, with heavy workloads and software development tasks offloaded to virtual workstations hosted in the cloud. This approach offers enhanced control over assets, improved security, and streamlined management for organizations.&lt;/p&gt;

&lt;p&gt;This cloud-centric approach offers substantial benefits for organizations aiming to equip their employees with powerful development environments without the complexities of procuring and maintaining physical desktops. Beyond the immediate advantage of remote work flexibility, where employees can be fully productive with just a laptop and a stable internet connection, cloud-based workstations offer significant scalability. They allow organizations to rapidly provision and de-provision resources as needed, ensuring developers always have access to the optimal computing power, including high-end GPU-accelerated environments that traditional laptops simply cannot provide for demanding industry needs.&lt;/p&gt;

&lt;p&gt;There are two ways your organization can leverage this model using Google Cloud Platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Google Compute Engine
&lt;/h2&gt;

&lt;p&gt;Google Compute Engine (GCE) provides an Infrastructure as a Service (IaaS) approach to creating virtual workstations through highly configurable virtual machines. This solution offers unparalleled flexibility, granting you complete control over virtually every aspect of your development environment. You can choose your preferred operating system, machine type (including CPU, memory, and specialized hardware), storage solutions, and install any software or tools required. This level of customization makes GCE an excellent choice for a variety of use cases, including:&lt;/p&gt;

&lt;h3&gt;
  
  
  Heavy graphics
&lt;/h3&gt;

&lt;p&gt;Once you create a virtual machine equipped with a powerful GPU, you can work with demanding graphical applications. Designing complicated systems and models, programming games or rendering videos - all this heavy lifting can happen in the datacenter, while your computer only has to handle the decoding of the remote desktop stream. To fully leverage the remote desktop experience of those setups, you need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pick a GPU that supports &lt;a href="https://cloud.google.com/compute/docs/gpus#gpu-virtual-workstations?utm_campaign=CDR_0x73f0e2c4_default_b427179257&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;NVIDIA RTX Virtual Workstations (vWS) for graphics workloads&lt;/a&gt;. That means L4, T4, P4 or P100 accelerators. A new &lt;a href="https://cloud.google.com/blog/products/compute/introducing-g4-vm-with-nvidia-rtx-pro-6000?utm_campaign=CDR_0x73f0e2c4_default_b427179257&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;G4 machine type&lt;/a&gt; hosting NVIDIA RTX PRO 6000 Blackwell cards should be available by the end of 2025.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/compute/docs/gpus/install-grid-drivers?utm_campaign=CDR_0x73f0e2c4_default_b427179257&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Install RTX-compatible GPU drivers&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Select the remote desktop software you want to use to access the machines. There are many available options like &lt;a href="https://anyware.hp.com/" rel="noopener noreferrer"&gt;HP Anyware&lt;/a&gt;, &lt;a href="https://parsec.app/" rel="noopener noreferrer"&gt;Parsec&lt;/a&gt; or &lt;a href="https://moonlight-stream.org/" rel="noopener noreferrer"&gt;Moonlight&lt;/a&gt; to name a few.
&lt;/li&gt;
&lt;li&gt;Ensure the Internet connection on your client side is fast and reliable.&lt;/li&gt;
&lt;/ul&gt;
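As a sketch of the first step, a VM with a T4-based RTX Virtual Workstation can be created with a gcloud command along these lines (the instance name, zone, machine type and disk size are placeholders; see the linked docs for the image and driver options that fit your setup):

```shell
# Create a VM with an NVIDIA T4 vWS accelerator (names and zone are examples).
# GPU-attached VMs cannot live-migrate, so they require
# --maintenance-policy=TERMINATE.
gcloud compute instances create graphics-workstation \
  --zone=us-central1-a \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4-vws,count=1 \
  --maintenance-policy=TERMINATE \
  --boot-disk-size=200GB
```

After the instance is up, the RTX-compatible drivers from step two still need to be installed before the remote desktop software can use the GPU.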

&lt;h3&gt;
  
  
  Computation intensive (like AI)
&lt;/h3&gt;

&lt;p&gt;Google Cloud offers powerful GPUs that let your team tackle a wide range of AI challenges. Since no high-quality graphical interface is needed, access to machines in this category can even be limited to an SSH tunnel. A developer can run their favourite IDE on their laptop while executing the code remotely in the cloud. Depending on the GPU you pick, the pricing of such workstations will vary greatly. The good news is that, with proper configuration, a single machine can easily be shared between multiple developers.&lt;/p&gt;

&lt;h3&gt;
  
  
  General development
&lt;/h3&gt;

&lt;p&gt;Developers who don’t need GPU-powered machines to do their jobs can still benefit from a powerful remote environment. More RAM, more CPU cores and more storage are easy to obtain, exceeding what even the best laptops can provide.&lt;/p&gt;

&lt;h3&gt;
  
  
  Considerations
&lt;/h3&gt;

&lt;p&gt;When working with GCE VMs, it is crucial to pay special attention to both the security and cost optimization of these machines. Failing to properly configure these aspects can lead to vulnerabilities or unnecessary expenses. Here are some key considerations (this list is &lt;strong&gt;not exhaustive&lt;/strong&gt;):&lt;/p&gt;

&lt;h4&gt;
  
  
  Security Best Practices
&lt;/h4&gt;

&lt;p&gt;1) &lt;strong&gt;Service Accounts&lt;/strong&gt;: Avoid using the &lt;a href="https://cloud.google.com/compute/docs/access/service-accounts#compute_engine_service_account?utm_campaign=CDR_0x73f0e2c4_default_b427179257&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;default compute Service Account&lt;/a&gt;, which comes with an overly permissive Editor role. Instead, create new service accounts with the principle of least privilege, assigning only the minimal required permissions for your workloads. For individual users, consider creating dedicated service accounts.&lt;br&gt;&lt;br&gt;
2) &lt;strong&gt;Network Access&lt;/strong&gt;: Consider disabling external IPs for your VMs. For internet access, configure &lt;a href="https://cloud.google.com/nat/docs/overview?utm_campaign=CDR_0x73f0e2c4_default_b427179257&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud NAT&lt;/a&gt;. For secure remote access, leverage &lt;a href="https://cloud.google.com/network-connectivity/docs/vpn/concepts/overview?utm_campaign=CDR_0x73f0e2c4_default_b427179257&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud VPN&lt;/a&gt; or &lt;a href="https://cloud.google.com/security/products/iap?utm_campaign=CDR_0x73f0e2c4_default_b427179257&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Identity-Aware Proxy (IAP)&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
3) &lt;strong&gt;Firewall Policies&lt;/strong&gt;: Implement stringent &lt;a href="https://cloud.google.com/firewall/docs/firewall-policies-overview?utm_campaign=CDR_0x73f0e2c4_default_b427179257&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;firewall policies&lt;/a&gt; to control inbound and outbound traffic, ensuring only necessary ports and protocols are open.&lt;/p&gt;
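As a minimal sketch of points 1 and 2, the commands below create a dedicated service account and then connect to a VM that has no external IP by tunneling SSH through Identity-Aware Proxy (the account ID, VM name and zone are placeholders; IAM roles and IAP firewall rules still need to be granted separately):

```shell
# Create a dedicated, least-privilege service account for the workstation.
gcloud iam service-accounts create dev-workstation-sa \
  --display-name="Dev workstation service account"

# SSH into a VM without an external IP by tunneling through IAP.
gcloud compute ssh my-workstation \
  --zone=us-central1-a \
  --tunnel-through-iap
```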

&lt;h4&gt;
  
  
  Cost Optimization Strategies
&lt;/h4&gt;

&lt;p&gt;1) &lt;strong&gt;Commitment-based Discounts&lt;/strong&gt;: Take advantage of &lt;a href="https://cloud.google.com/docs/cuds?utm_campaign=CDR_0x73f0e2c4_default_b427179257&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Committed Use Discounts (CUDs)&lt;/a&gt; for predictable workloads, which can substantially reduce costs over long-term commitments.&lt;br&gt;&lt;br&gt;
2) &lt;strong&gt;Automated Scheduling&lt;/strong&gt;: Implement &lt;a href="https://cloud.google.com/compute/docs/instances/schedule-instance-start-stop?utm_campaign=CDR_0x73f0e2c4_default_b427179257&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;VM instance scheduling&lt;/a&gt; to automatically stop workstations during off-hours (e.g., overnight or weekends), minimizing resource consumption when not in use.&lt;/p&gt;
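To get a feel for how much instance scheduling can save, here is a quick back-of-the-envelope calculation; the schedule (10 hours a day, weekdays only) is just an example:

```shell
# Hours in a full week vs. an example 10h/day, weekday-only schedule.
TOTAL_HOURS=$((24 * 7))       # 168 hours per week if the VM never stops
SCHEDULED_HOURS=$((10 * 5))   # 50 hours on the example schedule
SAVED_HOURS=$((TOTAL_HOURS - SCHEDULED_HOURS))
SAVED_PCT=$((100 * SAVED_HOURS / TOTAL_HOURS))
echo "Auto-stop saves ${SAVED_HOURS} of ${TOTAL_HOURS} weekly hours (~${SAVED_PCT}%)"
```

Roughly 70% of the weekly on-time disappears, which translates directly into compute charges you no longer pay (persistent disks still bill while the VM is stopped).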

&lt;h2&gt;
  
  
  Google Cloud Workstations
&lt;/h2&gt;

&lt;p&gt;If all your team needs is the computation power of cloud instances and not a full graphical connection, then &lt;a href="https://cloud.google.com/workstations?utm_campaign=CDR_0x73f0e2c4_default_b427179257&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Workstations&lt;/a&gt; might be just for you (&lt;a href="https://www.youtube.com/watch?v=E1cblFqb8nk" rel="noopener noreferrer"&gt;video explainer&lt;/a&gt;). It’s a managed solution that lets you create virtual workstations your team can connect to and use for development. Those instances can be based on many different &lt;a href="https://cloud.google.com/workstations/docs/available-machine-types?utm_campaign=CDR_0x73f0e2c4_default_b427179257&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;machine types&lt;/a&gt;, including &lt;a href="https://cloud.google.com/workstations/docs/available-gpus?utm_campaign=CDR_0x73f0e2c4_default_b427179257&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;GPU-accelerated ones&lt;/a&gt;. You can use them through Code OSS (the open-source base of Visual Studio Code), multiple JetBrains IDEs via JetBrains Gateway, or Posit Workbench (with RStudio Pro).&lt;/p&gt;

&lt;p&gt;Workstations allow you to &lt;a href="https://cloud.google.com/workstations/docs/customize-container-images?utm_campaign=CDR_0x73f0e2c4_default_b427179257&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;customize the developer environments&lt;/a&gt;, so that each new instance comes with all the necessary tools preinstalled. Users can be allowed to create and destroy their own environments, while you retain the control over the allowed configurations of those environments.&lt;/p&gt;

&lt;p&gt;Despite a higher hourly rate than “raw” Compute Engine instances, managed Workstations may turn out to be cheaper in practice, as they allow you to &lt;a href="https://cloud.google.com/workstations/docs/create-configuration#define_machine_settings?utm_campaign=CDR_0x73f0e2c4_default_b427179257&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;configure&lt;/a&gt; auto-sleep and auto-shutdown settings, so resources are not wasted when the workstations sit idle.&lt;/p&gt;
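This auto-sleep behaviour is set on the workstation configuration. A hedged sketch with gcloud; the config name, cluster, region and timeout values are examples, and the flag names assume the current gcloud workstations command surface:

```shell
# Example configuration that suspends idle sessions after 2 hours and
# force-stops any session after 12 hours (values are placeholders).
gcloud workstations configs create dev-config \
  --cluster=my-cluster \
  --region=us-central1 \
  --idle-timeout=7200s \
  --running-timeout=43200s
```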

&lt;p&gt;Cloud Workstations offer a wide variety of &lt;a href="https://cloud.google.com/workstations/docs/customize-development-environment?utm_campaign=CDR_0x73f0e2c4_default_b427179257&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;customization options&lt;/a&gt; and &lt;a href="https://cloud.google.com/workstations/docs/set-up-security-best-practices?utm_campaign=CDR_0x73f0e2c4_default_b427179257&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;security configurations&lt;/a&gt;. While not as flexible as simple Virtual Machines, the Workstations might be more attractive due to easier management, strict control and out-of-the-box compatibility with popular coding solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  In summary
&lt;/h2&gt;

&lt;p&gt;Google Cloud offers virtual workstation solutions for all kinds of developer needs. Here’s a short summary table, highlighting various applications of GCE and Workstations.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;
&lt;a href="https://cloud.google.com/products/compute?utm_campaign=CDR_0x73f0e2c4_default_b427179257&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Compute Engine&lt;/a&gt; (unmanaged VMs)&lt;/th&gt;
&lt;th&gt;&lt;a href="https://cloud.google.com/workstations/docs/overview?utm_campaign=CDR_0x73f0e2c4_default_b427179257&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Workstations&lt;/a&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Graphical-heavy work&lt;/strong&gt;&lt;br&gt;Designing&lt;br&gt;Gaming&lt;br&gt;Game development&lt;br&gt;Video editing&lt;/td&gt;
&lt;td&gt;GPU-accelerated VMs offer great performance when paired with proper virtual workspace software.&lt;/td&gt;
&lt;td&gt;N/A - Cloud Workstations don’t support this kind of work.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;AI and HPC workloads&lt;/strong&gt;&lt;br&gt;AI training&lt;br&gt;AI inference&lt;br&gt;GPU-powered simulations&lt;/td&gt;
&lt;td&gt;GPU-accelerated VMs can make use of every GPU-type available in Google Cloud. Sharing a big VM between multiple developers is a valid approach.&lt;/td&gt;
&lt;td&gt;Cloud Workstations support GPU-accelerated machine types, allowing developers to work on software that requires GPU-acceleration.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;General workloads&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;While regular VMs can host a workstation for these kinds of applications, it might not be worth the management effort.&lt;/td&gt;
&lt;td&gt;Cloud Workstations work great as a platform for developers who need a remote cloud-based environment to work on their projects. With the majority of management hassle taken care of, you are free to just work on your project.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Embrace the future of development by exploring the powerful virtual workstation solutions offered by Google Cloud. While Compute Engine provides unbridled flexibility, Cloud Workstations offer streamlined efficiency. Unlock enhanced productivity and simplified management for your team, and discover the perfect environment for your needs.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>remote</category>
      <category>gcp</category>
    </item>
  </channel>
</rss>
