<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Daniel Gwerzman</title>
    <description>The latest articles on Forem by Daniel Gwerzman (@kulaone).</description>
    <link>https://forem.com/kulaone</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3700947%2F9cbf0498-ff26-48a2-b5f6-bf03ce467f94.webp</url>
      <title>Forem: Daniel Gwerzman</title>
      <link>https://forem.com/kulaone</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kulaone"/>
    <language>en</language>
    <item>
      <title>Deploy Gemma 4 on Cloud Run: Pay Only When You Actually Use It</title>
      <dc:creator>Daniel Gwerzman</dc:creator>
      <pubDate>Sat, 04 Apr 2026 10:42:37 +0000</pubDate>
      <link>https://forem.com/gde/deploy-gemma-4-on-cloud-run-pay-only-when-you-actually-use-it-9ln</link>
      <guid>https://forem.com/gde/deploy-gemma-4-on-cloud-run-pay-only-when-you-actually-use-it-9ln</guid>
      <description>&lt;p&gt;Last year, Google flew me to Paris for the announcement of &lt;a href="https://blog.google/technology/developers/gemma-3/?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Gemma 3&lt;/a&gt;. It was an exciting event. The demos were impressive. But what really mattered happened later, back at my desk, when I ran my own tests and found out the demos weren't lying.&lt;/p&gt;

&lt;p&gt;Gemma 3 was the first open model that closed the gap on the big commercial ones. It didn't beat Gemini. But it reached the level Gemini was at a year earlier. For an open model you could run on your own infrastructure, that was a meaningful leap. I started integrating it into my own pipelines. Specific tasks, small steps, places where the answer doesn't need a frontier model to get it right.&lt;/p&gt;

&lt;p&gt;Then I made a mistake.&lt;/p&gt;

&lt;p&gt;I deployed Gemma 3 on &lt;a href="https://cloud.google.com/vertex-ai/docs/start/introduction-unified-platform?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Vertex AI Model Garden&lt;/a&gt; over a weekend for testing. Left it running. Didn't turn it off. Came back to a bill that made me rethink my relationship with cloud infrastructure. I made a video about it on &lt;a href="https://youtu.be/v_EWVdNPvpA" rel="noopener noreferrer"&gt;my YouTube channel&lt;/a&gt; so others wouldn't repeat the same mistake.&lt;/p&gt;

&lt;p&gt;This article is the redemption.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Gemma 4&lt;/a&gt; just launched. It's a bigger jump than Gemma 3 was. And this time, I'm deploying it on &lt;a href="https://cloud.google.com/run?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt;, which scales to zero when you're not using it. Forget to turn it off. I dare you. You won't pay a cent.&lt;/p&gt;

&lt;p&gt;This article is in two parts. The first covers what Gemma 4 is, why running your own model changes what you can build, how the deployment stack works, and the performance data from my own tests.&lt;/p&gt;

&lt;p&gt;The second part is the step-by-step deployment guide. Prerequisites, VPC setup, model upload, deploy commands for all four model sizes, and cleanup.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: Understanding Gemma 4 on Cloud Run
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Changed in Gemma 4
&lt;/h3&gt;

&lt;p&gt;Gemma 4 ships as four distinct models, not one. Two small ones and two large ones, each with a different tradeoff.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Parameters&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;Context Window&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;E2B&lt;/td&gt;
&lt;td&gt;2.3B effective&lt;/td&gt;
&lt;td&gt;Dense&lt;/td&gt;
&lt;td&gt;128k tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E4B&lt;/td&gt;
&lt;td&gt;4.5B effective&lt;/td&gt;
&lt;td&gt;Dense&lt;/td&gt;
&lt;td&gt;128k tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;26B A4B&lt;/td&gt;
&lt;td&gt;26B total, 4B active&lt;/td&gt;
&lt;td&gt;MoE&lt;/td&gt;
&lt;td&gt;256k tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;31B&lt;/td&gt;
&lt;td&gt;31B&lt;/td&gt;
&lt;td&gt;Dense&lt;/td&gt;
&lt;td&gt;256k tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 26B model deserves a closer look. It uses &lt;strong&gt;Mixture of Experts&lt;/strong&gt; (MoE) architecture: a design where the model has 26 billion parameters on disk but only activates 4 billion of them per token during inference. Think of it like a large team of specialists where only the relevant experts are called in for each task, rather than everyone working on every problem. The result: capability that approaches a 26B model, at the compute cost of a 4B one. This matters enormously at inference time, as you'll see in the numbers below.&lt;/p&gt;

&lt;p&gt;Beyond size, Gemma 4 adds &lt;strong&gt;multimodal input&lt;/strong&gt;. Images, audio, video: all supported as inputs, with text output. The small models (E2B, E4B) can process video with audio. The larger ones handle images with extended context.&lt;/p&gt;

&lt;p&gt;But the two improvements that matter most for anyone building agentic pipelines are &lt;strong&gt;reasoning&lt;/strong&gt; and &lt;strong&gt;function calling&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Reasoning means the model works through a problem step by step before producing an answer, rather than jumping straight to a response. Complex tasks that previously required a frontier model can now be handled by a reasoning-capable Gemma 4 at a fraction of the cost. Function calling has also been significantly improved: the model reliably returns structured tool calls, which is what makes it composable inside an agent that orchestrates multiple steps.&lt;/p&gt;
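&lt;p&gt;To make that concrete, here's a sketch of a function-calling request against the OpenAI-compatible endpoint vLLM exposes. The service URL, model name, and tool definition are all placeholders, and the exact schema depends on your vLLM version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical request: SERVICE_URL is your deployed Cloud Run URL.
curl "${SERVICE_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;A model with reliable function calling answers with a structured &lt;code&gt;tool_calls&lt;/code&gt; entry naming &lt;code&gt;get_weather&lt;/code&gt; and its arguments, rather than free text. That structure is what an orchestrator can act on.&lt;/p&gt;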

&lt;p&gt;Together, these two capabilities change where Gemma fits in a production system. My own pipelines split work into layers: small, focused steps that don't need deep reasoning, and orchestration steps that evaluate results and decide what happens next. Gemma 3 could handle the simple steps. Gemma 4 can handle more of the middle layer too, the tasks that previously needed a bigger, more expensive model to get right. Every step moved from a frontier API to a self-hosted Gemma is tokens you stop paying for.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Running Your Own Model Changes What You Can Build
&lt;/h3&gt;

&lt;p&gt;First, let's be precise about what "open" means here. Gemma 4 is not open source. The training code, the training data, and the full recipe that produced the model are not public. What Google releases are the &lt;strong&gt;weights&lt;/strong&gt;, the trained parameters of the model itself, under the &lt;a href="https://www.apache.org/licenses/LICENSE-2.0" rel="noopener noreferrer"&gt;Apache 2.0 license&lt;/a&gt;. You can download them, run them, modify them, and build commercial products on top of them. But you can't reproduce the training process.&lt;/p&gt;

&lt;p&gt;That distinction matters less than people think for most use cases, and more than people think for one specific one: &lt;strong&gt;fine-tuning&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Because you have the weights, you can train on top of them. Take the 4B model, run it through your own domain-specific dataset, and produce a version that understands your terminology, follows your output format, and performs better on your specific tasks than the general-purpose model does. This is the path from "capable open model" to "model that knows your business." The data you use for fine-tuning never leaves your environment, and the result belongs to you.&lt;/p&gt;

&lt;p&gt;There's a category of problem that big cloud-hosted models can't solve for certain customers. Not because the models aren't capable, but because the data can't leave the building.&lt;/p&gt;

&lt;p&gt;Healthcare providers, financial institutions, legal firms. Organizations with serious data privacy obligations can't pipe sensitive information through external APIs. For them, powerful AI has always meant exposing data to someone else's infrastructure. That's a non-starter in many regulated environments.&lt;/p&gt;

&lt;p&gt;A self-hosted Gemma on your own Cloud Run service changes the equation. The model runs in your project, your VPC, your infrastructure. The data never leaves.&lt;/p&gt;

&lt;p&gt;But the most compelling example isn't in an office. It's in a field.&lt;/p&gt;

&lt;p&gt;Imagine a drone with cameras flying over farmland, using computer vision and reasoning to detect crop diseases, identify irrigation problems, or spot pest damage. That drone can't wait for a round-trip API call to a cloud endpoint. It might not even have a reliable internet connection in a remote countryside location. The decision needs to happen on the device, or close to it. And it needs to happen fast.&lt;/p&gt;

&lt;p&gt;Gemma 4's multimodal capability, combined with its small model sizes, makes that kind of on-device or edge deployment practical. A 2B or 4B model can run on hardware that a drone or industrial sensor could realistically carry or connect to. The reasoning capability means it can do more than classify. It can think through what it's seeing.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Gemma 4 Gets Deployed: The Stack
&lt;/h3&gt;

&lt;p&gt;When Gemma 3 came out, the typical deployment used &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;, a tool designed for running models locally. It's simple, works well for small models, and you can get something running quickly. Ollama bakes the model weights directly into the container image. The container starts, the model is already there, and you're serving in seconds. For a 2B or 7B model this is fine.&lt;/p&gt;

&lt;p&gt;Gemma 4's larger models break that pattern. A 31B model doesn't fit comfortably in a container image. You can't bake 65GB of weights into something you expect to deploy and scale quickly. Ollama also doesn't expose the production controls you need at scale: request batching, KV-cache sharing across concurrent requests, quantization that doesn't sacrifice accuracy. It's a great local tool. It's not designed for what we're building here.&lt;/p&gt;

&lt;p&gt;Gemma 4's official path uses &lt;a href="https://docs.vllm.ai/en/latest/?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt; instead. vLLM is an inference engine built specifically for serving LLMs in production. It handles multiple concurrent requests efficiently by batching them together, shares the KV-cache across requests to reduce memory pressure, and supports fp8 quantization, which is what lets the 31B model fit inside the &lt;a href="https://cloud.google.com/run/docs/configuring/services/gpu?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;NVIDIA RTX Pro 6000 Blackwell&lt;/a&gt;'s 96GB of VRAM without meaningful quality loss. No cluster to manage, no node pool to configure. One flag in your deploy command.&lt;/p&gt;
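&lt;p&gt;The quantization piece really is a single flag on the vLLM side. A sketch of the serve command the container would run (the model id here is a placeholder; check the model card for the real one):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# fp8 quantization is what fits the 31B model into 96GB of VRAM.
vllm serve google/gemma-4-31b-it \
  --quantization fp8 \
  --port 8080  # Cloud Run sends traffic to the port your container listens on
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;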

&lt;p&gt;For the 26B and 31B models, there's a third piece: the &lt;a href="https://github.com/run-ai/runai-model-streamer" rel="noopener noreferrer"&gt;Run:ai Model Streamer&lt;/a&gt;. Rather than waiting for the entire model to load before serving the first request, the streamer fetches weights in parallel from &lt;a href="https://cloud.google.com/storage/docs?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Google Cloud Storage&lt;/a&gt; while vLLM initializes. The model starts accepting requests before it's fully loaded. This is what makes large model cold starts on Cloud Run feasible rather than painful.&lt;/p&gt;
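&lt;p&gt;In vLLM terms, the streamer is a load-format option pointed at a GCS path. A sketch, assuming a vLLM build with Run:ai streamer support (the bucket path is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Stream weights from GCS in parallel while the engine initializes.
vllm serve gs://your-bucket/gemma-4-26b \
  --load-format runai_streamer
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;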

&lt;h3&gt;
  
  
  The Model Loading Decision: HuggingFace vs GCS
&lt;/h3&gt;

&lt;p&gt;Before you deploy, there's one choice that affects everything else: where does the model come from at startup?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HuggingFace&lt;/strong&gt; is the simple option. The container downloads the model weights on every cold start. No storage cost, no upfront setup. The tradeoff: you're downloading over the public internet each time, and that download time dominates your cold start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google Cloud Storage&lt;/strong&gt; is the production option. You upload the weights once, and the container streams them from GCS on each cold start via the Run:ai Model Streamer. More setup upfront. But the streaming happens over Google's internal network, and the speed difference is significant.&lt;/p&gt;
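&lt;p&gt;The one-time upload can be sketched in two commands. The model id and bucket name are placeholders (the real Gemma 4 ids are on the HuggingFace model pages):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Download the weights once (the token is needed for gated models),
# then copy them into your bucket.
huggingface-cli download google/gemma-4-4b-it \
  --local-dir ./gemma-4-4b-it --token "${HF_TOKEN}"
gcloud storage cp -r ./gemma-4-4b-it "gs://your-bucket/gemma-4-4b-it"
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;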

&lt;p&gt;Here's the part that surprised me when I tested it: GCS without proper VPC configuration is actually &lt;em&gt;slower&lt;/em&gt; than HuggingFace for small models. The Run:ai streamer's advantage only materializes when traffic stays on Google's internal network. When it goes out to the public internet, the overhead eliminates the benefit.&lt;/p&gt;

&lt;p&gt;The fix is &lt;a href="https://cloud.google.com/vpc/docs/private-google-access?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Private Google Access&lt;/a&gt; on your VPC subnet. One command. It opens a route from your Cloud Run container to Google APIs (including GCS) without touching the public internet. The official documentation doesn't highlight this, and it's the single most important detail in this entire guide.&lt;/p&gt;
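&lt;p&gt;For an existing subnet, that one command looks like this (substitute your own subnet and region):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute networks subnets update your-subnet \
  --region=us-central1 \
  --enable-private-ip-google-access
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;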

&lt;h3&gt;
  
  
  The Numbers
&lt;/h3&gt;

&lt;p&gt;I deployed all four model sizes from HuggingFace and from GCS, with and without VPC, and measured cold start time (first request after scale-to-zero), time to first token, and warm response time. Same prompt every time: "What is the moon?"&lt;/p&gt;

&lt;p&gt;A note on methodology: these are single measurements, not averages from a full benchmark suite. LLMs are nondeterministic by nature, and infrastructure performance varies with load, region capacity, and network conditions. The numbers below are directionally correct, not scientifically precise. Use them to understand the relative tradeoffs between deployment options. Before committing to one approach in production, run your own tests with your own workload.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Cold Start&lt;/th&gt;
&lt;th&gt;Warm Response&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2B&lt;/td&gt;
&lt;td&gt;HuggingFace&lt;/td&gt;
&lt;td&gt;311s&lt;/td&gt;
&lt;td&gt;1.75s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2B&lt;/td&gt;
&lt;td&gt;GCS (no VPC)&lt;/td&gt;
&lt;td&gt;334s&lt;/td&gt;
&lt;td&gt;1.81s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2B&lt;/td&gt;
&lt;td&gt;GCS + VPC&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;245s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.81s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4B&lt;/td&gt;
&lt;td&gt;HuggingFace&lt;/td&gt;
&lt;td&gt;452s&lt;/td&gt;
&lt;td&gt;2.46s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4B&lt;/td&gt;
&lt;td&gt;GCS (no VPC)&lt;/td&gt;
&lt;td&gt;433s&lt;/td&gt;
&lt;td&gt;2.47s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4B&lt;/td&gt;
&lt;td&gt;GCS + VPC&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;246s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.47s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;26B&lt;/td&gt;
&lt;td&gt;GCS + VPC&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;191s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.61s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;31B&lt;/td&gt;
&lt;td&gt;GCS + VPC&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;251s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5.90s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few things to unpack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GCS without VPC is slower than HuggingFace for small models.&lt;/strong&gt; The streamer adds overhead that only pays off when the network path is fast. Over the public internet, HuggingFace wins for small files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VPC changes everything.&lt;/strong&gt; With Private Google Access, the 4B cold start drops from 433 seconds to 246 seconds. That's a 43% reduction just from routing traffic differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 26B model cold starts faster than the 2B from HuggingFace.&lt;/strong&gt; Read that again. A 26 billion parameter model, streamed over Google's internal network, is ready to serve in 191 seconds. The 2B downloading from HuggingFace takes 311 seconds. Network path and streaming architecture matter more than model size on disk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MoE vs. dense matters at inference time, not just startup.&lt;/strong&gt; The 26B warm response is 1.61 seconds. The 31B is 5.90 seconds. The 31B is a dense model: every one of its 31 billion parameters participates in every token. The 26B only activates 4 billion at a time. That's why the 26B responds nearly four times faster despite being nominally "larger." For latency-sensitive applications with a 256k context requirement, the 26B A4B is the more interesting choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scale-to-Zero and What It Means for Cost
&lt;/h3&gt;

&lt;p&gt;Cloud Run scales running instances based on traffic. When there are no requests, it scales to zero. No instances, no GPU allocated, no cost. The moment a request arrives, a new instance starts, loads the model, and serves it.&lt;/p&gt;

&lt;p&gt;This is fundamentally different from Vertex AI Model Garden, where a deployed endpoint keeps a running instance alive. Walk away for the weekend, come back to a bill.&lt;/p&gt;

&lt;p&gt;With Cloud Run's scale-to-zero, the worst case is the cold start delay. And as the numbers show, even a 26B model is ready in about three minutes. For development and testing, that tradeoff is straightforward.&lt;/p&gt;

&lt;p&gt;To verify an instance has scaled to zero: open the Cloud Run console, click your service, go to the &lt;strong&gt;Metrics&lt;/strong&gt; tab, and look at &lt;strong&gt;Instance count&lt;/strong&gt;. Zero means you're not being charged. Or just wait 5 minutes after your last request and send a new one. If it takes 200+ seconds instead of 2, the instance scaled down.&lt;/p&gt;
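&lt;p&gt;You can run that same check from the terminal, assuming &lt;code&gt;SERVICE_URL&lt;/code&gt; holds your service's URL (vLLM's lightweight &lt;code&gt;/v1/models&lt;/code&gt; endpoint is handy for this):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Seconds means a warm instance; minutes means you just triggered a cold start.
time curl -s "${SERVICE_URL}/v1/models" &gt; /dev/null
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;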

&lt;p&gt;Scale-to-zero is the right choice for development and testing. But it's not the only option. Cloud Run also lets you set a &lt;strong&gt;minimum number of instances&lt;/strong&gt; to keep alive at all times. For production serving consistent traffic, you'd configure at least one warm instance to eliminate cold starts entirely. That changes the cost model: you're paying for idle time again. But it's a deliberate tradeoff, not an accident.&lt;/p&gt;
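&lt;p&gt;Switching between the two modes is one flag on an existing service (the service name here is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Keep one instance warm: no cold starts, but you pay for idle time.
gcloud run services update gemma-service \
  --region us-central1 \
  --min-instances 1

# Back to scale-to-zero for testing.
gcloud run services update gemma-service \
  --region us-central1 \
  --min-instances 0
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;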

&lt;p&gt;The guide in this article is optimized for testing: minimal cost, maximum flexibility, scale-to-zero on everything. When you're ready to move to production and need to think about minimum instances, concurrency tuning, and traffic management, my post &lt;a href="https://dev.to/gde/this-is-cloud-run-configuration-2gi2"&gt;&lt;em&gt;"This is Cloud Run: Configuration"&lt;/em&gt;&lt;/a&gt; covers those options.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 2: The Deployment Guide
&lt;/h2&gt;

&lt;p&gt;I ran all of this myself before writing a word of this guide. Deployed every model size, hit every error, debugged every failure. The instructions below are what actually worked.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You'll Need
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A Google Cloud project with billing enabled&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/shell/docs?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Shell&lt;/a&gt; (recommended) or the &lt;a href="https://cloud.google.com/sdk/docs/install?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;&lt;code&gt;gcloud&lt;/code&gt; CLI&lt;/a&gt; installed locally. This guide assumes you're using Cloud Shell.&lt;/li&gt;
&lt;li&gt;Access to the Gemma 4 models on &lt;a href="https://huggingface.co/google" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt; (requires accepting the license for each model)&lt;/li&gt;
&lt;li&gt;A &lt;a href="https://huggingface.co/settings/tokens" rel="noopener noreferrer"&gt;HuggingFace access token&lt;/a&gt; with read access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I ran everything below from &lt;a href="https://cloud.google.com/shell/docs?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Shell&lt;/a&gt;, Google Cloud's browser-based terminal, which comes pre-authenticated with your account and with &lt;code&gt;gcloud&lt;/code&gt; already installed. No local setup, no version mismatches. Open it from the Cloud Console by clicking the terminal icon in the top right.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Set Your Environment Variables
&lt;/h3&gt;

&lt;p&gt;Set these once at the start of your Cloud Shell session. Every command in this guide uses them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-project-id"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"us-central1"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GCS_BUCKET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-model-cache"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;VPC_NETWORK&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gemma-vpc"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;VPC_SUBNET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gemma-subnet"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HF_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-huggingface-token"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PROJECT_NUMBER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud projects describe &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'value(projectNumber)'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note 1:&lt;/strong&gt; The RTX Pro 6000 is available in us-central1 and europe-west4.&lt;br&gt;
&lt;strong&gt;Note 2:&lt;/strong&gt; Cloud Shell sessions don't persist environment variables across reconnects. If you close and reopen Cloud Shell, run this block again before continuing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 2: Enable the Required APIs
&lt;/h3&gt;

&lt;p&gt;On a new project, you'll need all of these:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud services &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    run.googleapis.com &lt;span class="se"&gt;\&lt;/span&gt;
    compute.googleapis.com &lt;span class="se"&gt;\&lt;/span&gt;
    storage.googleapis.com &lt;span class="se"&gt;\&lt;/span&gt;
    artifactregistry.googleapis.com &lt;span class="se"&gt;\&lt;/span&gt;
    iam.googleapis.com &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--project&lt;/span&gt; &lt;span class="nv"&gt;$GOOGLE_CLOUD_PROJECT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;compute.googleapis.com&lt;/code&gt; is required for VPC and GPU resources. &lt;code&gt;iam.googleapis.com&lt;/code&gt; is needed to grant permissions to the Cloud Run service agent. The others cover model storage and Cloud Run itself.&lt;/p&gt;
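&lt;p&gt;To confirm everything is on before moving to the next step, you can list the enabled services:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud services list --enabled \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --filter="name:run.googleapis.com"
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;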

&lt;h3&gt;
  
  
  Step 3: Check GPU Quota (optional)
&lt;/h3&gt;

&lt;p&gt;Cloud Run GPU access requires quota approval. You can check your current allocation:&lt;/p&gt;

&lt;p&gt;Go to &lt;strong&gt;IAM &amp;amp; Admin &amp;gt; Quotas &amp;amp; System Limits&lt;/strong&gt; in the &lt;a href="https://console.cloud.google.com/iam-admin/quotas?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Google Cloud Console&lt;/a&gt;, filter by &lt;code&gt;NVIDIA_RTX_PRO_6000&lt;/code&gt; and &lt;code&gt;region:us-central1&lt;/code&gt;. If the limit is 0 or the quota doesn't appear, you need to request it.&lt;/p&gt;

&lt;p&gt;Request GPU quota through the &lt;a href="https://cloud.google.com/run/quotas?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Run quotas page&lt;/a&gt;. Approval is not instant; allow a few days.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Create the VPC
&lt;/h3&gt;

&lt;p&gt;As explained in Part 1, Private Google Access on the subnet is the critical step. Without it, the container can't reach GCS at all. It's included directly in the &lt;code&gt;subnets create&lt;/code&gt; command below, so no separate update is needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute networks create &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_NETWORK&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subnet-mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;custom &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--bgp-routing-mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;regional &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

gcloud compute networks subnets create &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_SUBNET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_NETWORK&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--range&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10.0.0.0/24 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-private-ip-google-access&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Grant the Cloud Run service agent permission to use the subnet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects add-iam-policy-binding &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:service-&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_NUMBER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;@serverless-robot-prod.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/compute.networkUser"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; In production, instead of using the default compute service account, create a dedicated one with least-privilege access to the GCS bucket.&lt;/p&gt;
&lt;/blockquote&gt;
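&lt;p&gt;A minimal sketch of that production setup, for once the bucket exists (the service-account name &lt;code&gt;gemma-runner&lt;/code&gt; is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Create a dedicated service account with read-only access to the bucket.
gcloud iam service-accounts create gemma-runner \
  --project "${GOOGLE_CLOUD_PROJECT}"

gcloud storage buckets add-iam-policy-binding "gs://${GCS_BUCKET}" \
  --member "serviceAccount:gemma-runner@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com" \
  --role "roles/storage.objectViewer"

# Later, pass it to the deploy command with:
#   --service-account gemma-runner@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;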

&lt;h3&gt;
  
  
  Step 5: Create the GCS Bucket and Upload Models
&lt;/h3&gt;

&lt;p&gt;Create a single-region bucket in the same region as your Cloud Run service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud storage buckets create &lt;span class="s2"&gt;"gs://&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--uniform-bucket-level-access&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

gcloud storage buckets add-iam-policy-binding &lt;span class="s2"&gt;"gs://&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_NUMBER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-compute@developer.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/storage.objectViewer"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;On disk space:&lt;/strong&gt; The 2B (~5GB) and 4B (~9GB) models fit in Cloud Shell. The 26B (~50GB) and 31B (~65GB) don't. Cloud Shell has about 5GB of disk. For the large models, spin up a temporary GCE VM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute instances create gemma-uploader &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-a"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--machine-type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;n2-standard-8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--boot-disk-size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;300GB &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--boot-disk-type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pd-ssd &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image-family&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;debian-12 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image-project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;debian-cloud &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;default &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subnet&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;default &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--scopes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;storage-full
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;If you get a subnet error&lt;/strong&gt;, replace &lt;code&gt;--network=default --subnet=default&lt;/code&gt; with &lt;code&gt;--network="${VPC_NETWORK}" --subnet="${VPC_SUBNET}"&lt;/code&gt; in the command above. From Cloud Shell, the default network usually resolves this automatically.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;
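Whichever machine does the download, a quick pre-flight check that `/tmp` has room for the weights saves a failed run; a minimal sketch (the 130GB threshold is an assumption covering the two large models plus headroom, not a figure from this guide):

```shell
# Rough pre-flight: is there enough room in /tmp for the large models?
# ~50GB (26B) + ~65GB (31B) plus headroom; adjust to the models you pull.
NEEDED_GB=130
AVAIL_KB=$(df -Pk /tmp | awk 'NR==2 {print $4}')   # -P: POSIX format, no line wrap
AVAIL_GB=$((AVAIL_KB / 1024 / 1024))
if [ "$AVAIL_GB" -lt "$NEEDED_GB" ]; then
  echo "Only ${AVAIL_GB}GB free in /tmp; use the uploader VM."
else
  echo "OK: ${AVAIL_GB}GB free in /tmp."
fi
```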

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute ssh gemma-uploader &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-a"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_NETWORK&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subnet&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_SUBNET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside the VM, install dependencies and upload the models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HF_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-huggingface-token"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GCS_BUCKET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-bucket-name"&lt;/span&gt;

&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; python3-pip &lt;span class="nt"&gt;--fix-missing&lt;/span&gt;
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install &lt;/span&gt;huggingface_hub hf_transfer &lt;span class="nt"&gt;--break-system-packages&lt;/span&gt;

&lt;span class="c"&gt;# Stop immediately if any step fails&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;

&lt;span class="c"&gt;# Download, upload, and clean up each model one at a time&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;MODEL &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"google/gemma-4-E2B-it:gemma-4-E2B-it"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
             &lt;span class="s2"&gt;"google/gemma-4-E4B-it:gemma-4-E4B-it"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
             &lt;span class="s2"&gt;"google/gemma-4-26B-A4B-it:gemma-4-26B-A4B-it"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
             &lt;span class="s2"&gt;"google/gemma-4-31B-it:gemma-4-31B-it"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;REPO&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MODEL&lt;/span&gt;&lt;span class="p"&gt;%%&lt;/span&gt;:&lt;span class="p"&gt;*&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nv"&gt;DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MODEL&lt;/span&gt;&lt;span class="p"&gt;##*&lt;/span&gt;:&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HF_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/tmp/hf_cache_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"
from huggingface_hub import snapshot_download
import os
snapshot_download(repo_id='&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REPO&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;', local_dir='/tmp/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;', token=os.environ['HF_TOKEN'])
"&lt;/span&gt;
  gcloud storage &lt;span class="nb"&gt;cp&lt;/span&gt; /tmp/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/&lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="s2"&gt;"gs://&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/models/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/"&lt;/span&gt; &lt;span class="nt"&gt;--recursive&lt;/span&gt;
  &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /tmp/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="s2"&gt;"/tmp/hf_cache_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
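The `%%:*` and `##*:` expansions in the loop above split each `repo:dir` pair at the colon, keeping the Hugging Face repo id and the GCS directory name in one string. The parsing in isolation:

```shell
# Each loop entry packs "huggingface-repo:local-dir" into one string,
# then bash parameter expansion splits it at the colon.
MODEL="google/gemma-4-E2B-it:gemma-4-E2B-it"
REPO="${MODEL%%:*}"   # drop the longest suffix starting at ':' -> repo id
DIR="${MODEL##*:}"    # drop the longest prefix ending at ':'  -> local dir
echo "repo=${REPO} dir=${DIR}"
```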



&lt;p&gt;Exit the VM and delete it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;exit

&lt;/span&gt;gcloud compute instances delete gemma-uploader &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-a"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--quiet&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 6: Deploy
&lt;/h3&gt;

&lt;p&gt;Every deployment uses the same prebuilt container image from Google:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One important detail covered in Part 1: the container doesn't read the model or vLLM flags from environment variables. They must be passed via &lt;code&gt;--command="vllm"&lt;/code&gt; and &lt;code&gt;--args&lt;/code&gt;. Without &lt;code&gt;--command="vllm"&lt;/code&gt;, the startup script fails immediately.&lt;/p&gt;
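gcloud treats the `--args` value as a comma-separated list, so the long comma string in each deploy command below becomes the argv handed to the container's `vllm` entrypoint; that is also why none of the flag values in it contain a comma (gcloud does offer alternate list delimiters if you ever need one). A local sketch of that split, with an illustrative bucket path:

```shell
# gcloud splits the --args string on commas into the container's argv;
# reproduce the split locally to see what `vllm` actually receives.
ARGS="serve,gs://my-bucket/models/gemma-4-E2B-it,--dtype=bfloat16,--port=8080"
IFS=',' read -r -a ARGV <<< "$ARGS"
printf '%s\n' "${ARGV[@]}"   # one argument per line; the container runs: vllm serve ...
```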

&lt;p&gt;&lt;strong&gt;2B model:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GCS_MODEL_PATH_2B&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gs://&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/models/gemma-4-E2B-it"&lt;/span&gt;

gcloud beta run deploy gemma4-2b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--execution-environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gen2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--allow-unauthenticated&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;20 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;80Gi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu-type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia-rtx-pro-6000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-gpu-zonal-redundancy&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-cpu-throttling&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-instances&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;64 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;600 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_NETWORK&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subnet&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_SUBNET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc-egress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;all-traffic &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"vllm"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serve,&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_MODEL_PATH_2B&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;,--enable-chunked-prefill,--enable-prefix-caching,--generation-config=auto,--enable-auto-tool-choice,--tool-call-parser=gemma4,--reasoning-parser=gemma4,--dtype=bfloat16,--max-num-seqs=64,--gpu-memory-utilization=0.95,--load-format=runai_streamer,--tensor-parallel-size=1,--port=8080,--host=0.0.0.0"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--startup-probe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"tcpSocket.port=8080,initialDelaySeconds=60,failureThreshold=5,timeoutSeconds=60,periodSeconds=60"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4B model:&lt;/strong&gt; same command; replace &lt;code&gt;GCS_MODEL_PATH_2B&lt;/code&gt; with &lt;code&gt;GCS_MODEL_PATH_4B&lt;/code&gt; and the service name with &lt;code&gt;gemma4-4b&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GCS_MODEL_PATH_4B&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gs://&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/models/gemma-4-E4B-it"&lt;/span&gt;

gcloud beta run deploy gemma4-4b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--execution-environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gen2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--allow-unauthenticated&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;20 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;80Gi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu-type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia-rtx-pro-6000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-gpu-zonal-redundancy&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-cpu-throttling&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-instances&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;64 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;600 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_NETWORK&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subnet&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_SUBNET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc-egress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;all-traffic &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"vllm"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serve,&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_MODEL_PATH_4B&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;,--enable-chunked-prefill,--enable-prefix-caching,--generation-config=auto,--enable-auto-tool-choice,--tool-call-parser=gemma4,--reasoning-parser=gemma4,--dtype=bfloat16,--max-num-seqs=64,--gpu-memory-utilization=0.95,--load-format=runai_streamer,--tensor-parallel-size=1,--port=8080,--host=0.0.0.0"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--startup-probe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"tcpSocket.port=8080,initialDelaySeconds=60,failureThreshold=5,timeoutSeconds=60,periodSeconds=60"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;26B model:&lt;/strong&gt; adds fp8 quantization for both weights and KV cache, caps context with &lt;code&gt;--max-model-len=32767&lt;/code&gt;, and drops concurrency to 8.&lt;br&gt;
The startup probe below allows 6 minutes total (1 minute initial delay + 5 checks × 1 minute each), based on my measured load time. If your deployment times out, increase &lt;code&gt;failureThreshold&lt;/code&gt; — each unit adds one more minute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GCS_MODEL_PATH_26B&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gs://&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/models/gemma-4-26B-A4B-it"&lt;/span&gt;

gcloud beta run deploy gemma4-26b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--execution-environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gen2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--allow-unauthenticated&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;20 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;80Gi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu-type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia-rtx-pro-6000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-gpu-zonal-redundancy&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-cpu-throttling&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-instances&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;600 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_NETWORK&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subnet&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_SUBNET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc-egress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;all-traffic &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"vllm"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serve,&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_MODEL_PATH_26B&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;,--enable-chunked-prefill,--enable-prefix-caching,--generation-config=auto,--enable-auto-tool-choice,--tool-call-parser=gemma4,--reasoning-parser=gemma4,--dtype=bfloat16,--quantization=fp8,--kv-cache-dtype=fp8,--max-num-seqs=8,--gpu-memory-utilization=0.95,--load-format=runai_streamer,--tensor-parallel-size=1,--port=8080,--host=0.0.0.0,--max-model-len=32767"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--startup-probe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"tcpSocket.port=8080,initialDelaySeconds=60,failureThreshold=5,timeoutSeconds=60,periodSeconds=60"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
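The "each unit adds one more minute" rule is just the probe arithmetic: the worst-case startup budget is initialDelaySeconds plus failureThreshold times periodSeconds. With the 26B values:

```shell
# Worst-case time Cloud Run waits for port 8080 to open before failing the
# revision, using the 26B startup-probe settings above.
INITIAL_DELAY=60     # initialDelaySeconds
FAILURE_THRESHOLD=5  # failed checks allowed
PERIOD=60            # periodSeconds between checks
TOTAL=$((INITIAL_DELAY + FAILURE_THRESHOLD * PERIOD))
echo "startup budget: ${TOTAL}s ($((TOTAL / 60)) minutes)"
```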



&lt;p&gt;&lt;strong&gt;31B model:&lt;/strong&gt; same as the 26B with &lt;code&gt;failureThreshold=6&lt;/code&gt;, allowing 7 minutes total based on my measured load time. Increase &lt;code&gt;failureThreshold&lt;/code&gt; if needed; the same rule applies: each unit adds one more minute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GCS_MODEL_PATH_31B&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gs://&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/models/gemma-4-31B-it"&lt;/span&gt;

gcloud beta run deploy gemma4-31b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--execution-environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gen2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--allow-unauthenticated&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;20 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;80Gi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu-type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia-rtx-pro-6000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-gpu-zonal-redundancy&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-cpu-throttling&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-instances&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;600 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_NETWORK&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subnet&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_SUBNET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc-egress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;all-traffic &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"vllm"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serve,&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_MODEL_PATH_31B&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;,--enable-chunked-prefill,--enable-prefix-caching,--generation-config=auto,--enable-auto-tool-choice,--tool-call-parser=gemma4,--reasoning-parser=gemma4,--dtype=bfloat16,--quantization=fp8,--kv-cache-dtype=fp8,--max-num-seqs=8,--gpu-memory-utilization=0.95,--load-format=runai_streamer,--tensor-parallel-size=1,--port=8080,--host=0.0.0.0,--max-model-len=32767"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--startup-probe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"tcpSocket.port=8080,initialDelaySeconds=60,failureThreshold=6,timeoutSeconds=60,periodSeconds=60"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Get the service URLs after deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;SERVICE_URL_2B&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud run services describe gemma4-2b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"value(status.url)"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;SERVICE_URL_4B&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud run services describe gemma4-4b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"value(status.url)"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;SERVICE_URL_26B&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud run services describe gemma4-26b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"value(status.url)"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;SERVICE_URL_31B&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud run services describe gemma4-31b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"value(status.url)"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"2B:  &lt;/span&gt;&lt;span class="nv"&gt;$SERVICE_URL_2B&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"4B:  &lt;/span&gt;&lt;span class="nv"&gt;$SERVICE_URL_4B&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"26B: &lt;/span&gt;&lt;span class="nv"&gt;$SERVICE_URL_26B&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"31B: &lt;/span&gt;&lt;span class="nv"&gt;$SERVICE_URL_31B&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Test It
&lt;/h3&gt;

&lt;p&gt;The service exposes an &lt;a href="https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html" rel="noopener noreferrer"&gt;OpenAI-compatible API&lt;/a&gt;. Any client that speaks the OpenAI protocol works against it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The first request will be slow.&lt;/strong&gt; If the instance has scaled to zero, Cloud Run needs to start a new one and load the model before responding. For the 2B and 4B models expect around 4 minutes. For the 26B and 31B, up to 5 minutes. Don't cancel the request — it will come back. Every request after that will be fast.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SERVICE_URL_2B&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/v1/chat/completions"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "google/gemma-4-E2B-it",
    "messages": [{"role": "user", "content": "What is the moon?"}],
    "max_tokens": 200
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model name in the request must match the HuggingFace repo ID passed in &lt;code&gt;--args&lt;/code&gt; (or a custom &lt;code&gt;--served-model-name&lt;/code&gt; if you set one).&lt;/p&gt;
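&lt;p&gt;A quick way to confirm the served model name (and that the service is responding) is vLLM's model-listing endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl "${SERVICE_URL_2B}/v1/models"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;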

&lt;h3&gt;
  
  
  Production Hardening
&lt;/h3&gt;

&lt;p&gt;Everything above uses &lt;code&gt;--allow-unauthenticated&lt;/code&gt; and the default compute service account. That's fine for testing. Before real users or real data:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authentication.&lt;/strong&gt; Replace &lt;code&gt;--allow-unauthenticated&lt;/code&gt; with &lt;code&gt;--no-allow-unauthenticated&lt;/code&gt;. Cloud Run supports &lt;a href="https://cloud.google.com/run/docs/authenticating/service-to-service?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;OIDC tokens&lt;/a&gt; for service-to-service calls and &lt;a href="https://cloud.google.com/iap/docs?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;IAP&lt;/a&gt; for user-facing access.&lt;/p&gt;
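&lt;p&gt;With unauthenticated access off, a caller that holds the &lt;code&gt;roles/run.invoker&lt;/code&gt; role can still reach the service by attaching an identity token. From a developer machine, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -X POST "${SERVICE_URL_2B}/v1/chat/completions" \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-E2B-it",
    "messages": [{"role": "user", "content": "What is the moon?"}],
    "max_tokens": 200
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;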

&lt;p&gt;&lt;strong&gt;Dedicated Service Account.&lt;/strong&gt; Create one with only &lt;code&gt;roles/storage.objectViewer&lt;/code&gt; on the model bucket. The default compute service account has broader permissions than necessary.&lt;/p&gt;
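&lt;p&gt;A minimal sketch (the account name &lt;code&gt;gemma-runner&lt;/code&gt; is just an example): create the account, grant it read access to the model bucket, then pass it to &lt;code&gt;gcloud run deploy&lt;/code&gt; with &lt;code&gt;--service-account&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud iam service-accounts create gemma-runner \
  --project="${GOOGLE_CLOUD_PROJECT}"

# Read-only access to the model weights, nothing else
gcloud storage buckets add-iam-policy-binding "gs://${GCS_BUCKET}" \
  --member="serviceAccount:gemma-runner@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;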

&lt;p&gt;&lt;strong&gt;Private endpoint.&lt;/strong&gt; For sensitive workloads, remove the public URL and access the service only from within your VPC via &lt;a href="https://cloud.google.com/run/docs/securing/private-services?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Run private networking&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cleaning Up
&lt;/h3&gt;

&lt;p&gt;To remove everything after testing is done, run these commands in order:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delete the Cloud Run services:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;SERVICE &lt;span class="k"&gt;in &lt;/span&gt;gemma4-2b gemma4-4b gemma4-26b gemma4-31b&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;gcloud run services delete &lt;span class="nv"&gt;$SERVICE&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--quiet&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Delete the GCS bucket and all model weights:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud storage &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s2"&gt;"gs://&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Delete the VPC subnet and network:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute networks subnets delete &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_SUBNET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--quiet&lt;/span&gt;

gcloud compute networks delete &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_NETWORK&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--quiet&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;If subnet deletion fails with an error about IP addresses still in use:&lt;/strong&gt; Cloud Run holds onto internal IP addresses for a period after services are deleted. There is no way to force-release them. Give it a few hours and try again.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Remove the IAM binding:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects remove-iam-policy-binding &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:service-&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_NUMBER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;@serverless-robot-prod.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/compute.networkUser"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;I started this piece talking about a Paris visit and an unexpected cloud bill. But the real reason I spent time getting Gemma 4 running on Cloud Run isn't just the cost.&lt;/p&gt;

&lt;p&gt;It's the access.&lt;/p&gt;

&lt;p&gt;When an LLM runs in your own infrastructure, things become possible that weren't possible before. Regulated data that couldn't touch a third-party API can now be processed by a capable model. Privacy becomes a feature of the architecture, not a compromise against capability.&lt;/p&gt;

&lt;p&gt;But there's a more practical argument too. Commercial frontier models are expensive per token, and they come with regional rate limits that cap how much you can do. When your production pipeline hits a rate limit, everything behind it slows down. When you hit your own model, there's no rate limit. You control the capacity. You control the cost. You decide when to scale.&lt;/p&gt;

&lt;p&gt;Gemma 4 is the first open model where that tradeoff genuinely makes sense across the full range of AI tasks: text, vision, reasoning, function calling. Not every step in your pipeline needs a frontier model. The steps that don't (and with Gemma 4's reasoning capability, that's more steps than before) can run on infrastructure you own, at a cost you control, without a rate limit in sight.&lt;/p&gt;

&lt;p&gt;The drone flying over the farmer's fields, making decisions on its own, is not a hypothetical. The only thing that was missing was a model good enough to run on hardware that fits in a backpack.&lt;/p&gt;

&lt;p&gt;Now there is one. And you know how to deploy it. Enjoy discovering it.&lt;/p&gt;

</description>
      <category>gemma</category>
      <category>cloudrun</category>
      <category>cloud</category>
      <category>ai</category>
    </item>
    <item>
      <title>This is Cloud Run: Configuration</title>
      <dc:creator>Daniel Gwerzman</dc:creator>
      <pubDate>Fri, 27 Mar 2026 10:23:13 +0000</pubDate>
      <link>https://forem.com/gde/this-is-cloud-run-configuration-2gi2</link>
      <guid>https://forem.com/gde/this-is-cloud-run-configuration-2gi2</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 3 of the "This is Cloud Run" series. In &lt;a href="https://dev.to/gde/this-is-cloud-run-a-decision-guide-for-developers-428m"&gt;Part 1&lt;/a&gt;, we covered what Cloud Run is and when to choose it. In &lt;a href="https://dev.to/gde/this-is-cloud-run-nine-ways-to-deploy-and-when-to-use-each-506b"&gt;Part 2&lt;/a&gt;, we walked through the deployment options and revision management. Now let's tune it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Cloud Run's defaults are good. We covered that in Part 1. But every workload has its own needs, and Cloud Run gives you the knobs to tune for them. This article covers the settings you'll reach for most often.&lt;/p&gt;

&lt;h2&gt;
  
  
  CPU and Memory
&lt;/h2&gt;

&lt;p&gt;Every Cloud Run instance gets a share of CPU and memory. The defaults (1 vCPU, 512 MiB) are reasonable for a lightweight API, but you'll want to adjust them as you understand your workload's needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CPU&lt;/strong&gt; ranges from 0.08 vCPU (less than a tenth of a core) to 8 vCPUs. &lt;strong&gt;Memory&lt;/strong&gt; ranges from 128 MiB to 32 GiB. The two are linked: higher CPU allocations require minimum memory thresholds, and some memory configurations require minimum CPU.&lt;/p&gt;

&lt;p&gt;But the more important decision is the &lt;a href="https://cloud.google.com/run/docs/configuring/services/cpu?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;&lt;strong&gt;CPU allocation mode&lt;/strong&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Request-only (default).&lt;/strong&gt; CPU is only allocated while your instance is actively processing a request. Between requests, CPU is throttled to near-zero. You pay only for the time spent handling requests. This is the serverless model, and it's the right choice for most HTTP APIs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Always-on.&lt;/strong&gt; CPU is always available, even between requests. This costs more, but it's required for workloads that do work outside of request handling: WebSocket connections that maintain state, background threads that process queues, or services that need to keep in-memory caches warm.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; my-image &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cpu&lt;/span&gt; 2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory&lt;/span&gt; 1Gi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-cpu-throttling&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--no-cpu-throttling&lt;/code&gt; flag enables always-on CPU. Without it (or with &lt;code&gt;--cpu-throttling&lt;/code&gt;), you get the default request-only mode.&lt;/p&gt;

&lt;p&gt;The pricing difference is significant. With request-only allocation, you pay per vCPU-second and GiB-second only while handling requests. With always-on, you pay for the entire lifecycle of the instance. For a service that handles bursty HTTP traffic with idle periods between, request-only can be dramatically cheaper. For a service that runs background tasks or maintains WebSocket connections, always-on is the only option that works correctly.&lt;/p&gt;
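&lt;p&gt;A back-of-envelope comparison makes the gap concrete. The per-vCPU-second rates below are placeholders, not current prices (check the Cloud Run pricing page), but the shape of the math holds for a 1 vCPU service that's busy 5% of a 30-day month:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Illustrative only: placeholder rates, not real Cloud Run prices
SECONDS_PER_MONTH=$((30 * 24 * 3600))   # 2,592,000 seconds
BUSY_FRACTION="0.05"                    # busy 5% of the time
RATE_REQUEST="0.000024"                 # request-only, per vCPU-second (placeholder)
RATE_ALWAYS="0.000018"                  # always-on, per vCPU-second (placeholder)

# Request-only pays for busy seconds; always-on pays for every second
COST_REQUEST=$(awk "BEGIN {printf \"%.2f\", $SECONDS_PER_MONTH * $BUSY_FRACTION * $RATE_REQUEST}")
COST_ALWAYS=$(awk "BEGIN {printf \"%.2f\", $SECONDS_PER_MONTH * $RATE_ALWAYS}")
echo "request-only: \$${COST_REQUEST}/mo   always-on: \$${COST_ALWAYS}/mo"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;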

&lt;h2&gt;
  
  
  Health Checks
&lt;/h2&gt;

&lt;p&gt;Cloud Run won't send traffic to an instance until it's ready. By default, it uses a &lt;a href="https://cloud.google.com/run/docs/configuring/healthchecks" rel="noopener noreferrer"&gt;&lt;strong&gt;TCP startup probe&lt;/strong&gt;&lt;/a&gt;: it waits for your container to listen on the expected port, then considers it ready.&lt;/p&gt;

&lt;p&gt;For most services, that's enough. But if your application needs time to load data, warm caches, or establish database connections after the port is open, you'll want a custom &lt;a href="https://cloud.google.com/run/docs/configuring/healthchecks?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;&lt;strong&gt;HTTP startup probe&lt;/strong&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;startupProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/healthz&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
  &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells Cloud Run to &lt;code&gt;GET /healthz&lt;/code&gt; every 2 seconds. If it fails 15 times, the instance is marked unhealthy and restarted. Only when the probe succeeds does the instance start receiving traffic. This prevents the 502 errors that happen when a load balancer sends requests to an instance that's technically listening but not yet ready to serve.&lt;/p&gt;

&lt;p&gt;Cloud Run also supports &lt;a href="https://cloud.google.com/run/docs/configuring/healthchecks?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;&lt;strong&gt;liveness probes&lt;/strong&gt;&lt;/a&gt; that run continuously after startup. If a liveness probe fails, Cloud Run restarts the instance. Useful for detecting stuck processes, deadlocks, or memory leaks that don't crash the container but make it unresponsive.&lt;/p&gt;

&lt;p&gt;For &lt;a href="https://grpc.io/" rel="noopener noreferrer"&gt;gRPC&lt;/a&gt; services, Cloud Run supports gRPC health checking probes following the &lt;a href="https://github.com/grpc/grpc/blob/master/doc/health-checking.md" rel="noopener noreferrer"&gt;gRPC health checking protocol&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Request Timeout
&lt;/h2&gt;

&lt;p&gt;Every Cloud Run request has a &lt;a href="https://cloud.google.com/run/docs/configuring/request-timeout" rel="noopener noreferrer"&gt;timeout&lt;/a&gt;. The default is &lt;strong&gt;300 seconds (5 minutes)&lt;/strong&gt;. The maximum is &lt;strong&gt;3600 seconds (60 minutes)&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; my-image &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--timeout&lt;/span&gt; 600 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your service processes large file uploads, generates reports, or runs long-running computations, you'll want to increase this. But keep in mind: the timeout applies per-request. If a single request takes longer than the timeout, Cloud Run terminates it. WebSocket connections are also subject to this timeout, which is why Part 1 mentioned the ~60-minute connection limit.&lt;/p&gt;

&lt;p&gt;A common pattern for long-running work: accept the request, kick off the processing asynchronously (via &lt;a href="https://cloud.google.com/tasks?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Tasks&lt;/a&gt; or &lt;a href="https://cloud.google.com/pubsub?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Pub/Sub&lt;/a&gt;), and return a 202 immediately. The client polls for status or receives a callback when the work is done. This keeps your request timeout short and your service responsive.&lt;/p&gt;
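&lt;p&gt;Enqueueing the work can be a single Cloud Tasks call; the queue name and worker URL here are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud tasks create-http-task \
  --queue=my-work-queue \
  --url="https://my-worker-abc123-uc.a.run.app/process" \
  --method=POST
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;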

&lt;p&gt;If you find yourself regularly hitting the 60-minute maximum, that's a signal your workload might be better suited to &lt;a href="https://cloud.google.com/run/docs/create-jobs" rel="noopener noreferrer"&gt;Cloud Run jobs&lt;/a&gt; (for batch processing) or a different platform entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling: Instances and Concurrency
&lt;/h2&gt;

&lt;p&gt;Cloud Run's autoscaler manages three related settings:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/min-instances?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Minimum instances&lt;/a&gt;&lt;/strong&gt; controls how many instances stay warm when there's no traffic. The default is &lt;code&gt;0&lt;/code&gt; (scale-to-zero). Setting it to &lt;code&gt;1&lt;/code&gt; or higher eliminates cold starts but means you're paying for idle instances. It's the classic serverless trade-off: latency vs. cost. For latency-sensitive production services, &lt;code&gt;1&lt;/code&gt; is often the right number. For dev environments, &lt;code&gt;0&lt;/code&gt; keeps your bill at zero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/max-instances?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Maximum instances&lt;/a&gt;&lt;/strong&gt; caps how far Cloud Run can scale up. The default is &lt;code&gt;100&lt;/code&gt;. This protects you from runaway scaling (and a surprising bill) during unexpected traffic spikes. But set this thoughtfully: if your service talks to a database with a 20-connection pool, 100 instances all trying to connect will overwhelm it. Match your max instances to your backend's capacity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/concurrency?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Concurrency&lt;/a&gt;&lt;/strong&gt; controls how many requests a single instance handles simultaneously. The default is &lt;code&gt;80&lt;/code&gt;. This is one of Cloud Run's key advantages over the old Cloud Functions 1st gen model, which processed one request per instance. With concurrency at 80, a single instance can serve 80 simultaneous requests before Cloud Run spins up another instance.&lt;/p&gt;

&lt;p&gt;Lower the concurrency for CPU-heavy workloads where each request needs dedicated processing power. Raise it (up to 1000) for lightweight I/O-bound handlers that spend most of their time waiting on network calls. Setting concurrency to &lt;code&gt;1&lt;/code&gt; mimics the one-request-per-instance model if your code isn't thread-safe.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; my-image &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--min-instances&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-instances&lt;/span&gt; 10 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--concurrency&lt;/span&gt; 80 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And remember &lt;a href="https://cloud.google.com/run/docs/configuring/services/cpu?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Startup CPU Boost&lt;/a&gt; from Part 1: Cloud Run temporarily doubles CPU during instance initialization to get instances ready faster. Combined with minimum instances, this makes cold starts a non-issue for most workloads.&lt;/p&gt;
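&lt;p&gt;Both settings go on the same deploy; &lt;code&gt;--cpu-boost&lt;/code&gt; turns the boost on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service \
  --image my-image \
  --min-instances 1 \
  --cpu-boost \
  --region us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;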

&lt;h2&gt;
  
  
  Environment Variables and Secrets
&lt;/h2&gt;

&lt;p&gt;Cloud Run supports two mechanisms for passing configuration to your containers, and it's important to use the right one for the job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/environment-variables?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Environment variables&lt;/a&gt;&lt;/strong&gt; are for non-sensitive configuration: feature flags, API endpoints, logging levels, database hostnames. Set them at deploy time with &lt;code&gt;--set-env-vars&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; my-image &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-env-vars&lt;/span&gt; &lt;span class="s2"&gt;"DB_HOST=10.0.0.1,LOG_LEVEL=info,ENV=production"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This follows the &lt;a href="https://12factor.net/config" rel="noopener noreferrer"&gt;12-Factor App&lt;/a&gt; methodology: configuration lives in the environment, not in the code.&lt;/p&gt;
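&lt;p&gt;Because configuration lives in the environment, changing it doesn't mean rebuilding the image. Updating a variable simply rolls out a new revision:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services update my-service \
  --update-env-vars LOG_LEVEL=debug \
  --region us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;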

&lt;p&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/secrets?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Secrets&lt;/a&gt;&lt;/strong&gt; are for sensitive credentials: API keys, database passwords, TLS certificates, OAuth client secrets. These should &lt;em&gt;never&lt;/em&gt; be plain environment variables. Plain env vars are visible in the Cloud Run Console, show up in debug logs, and can leak into error reports. Instead, store them in &lt;a href="https://cloud.google.com/secret-manager?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Secret Manager&lt;/a&gt; and reference them at deploy time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; my-image &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-secrets&lt;/span&gt; &lt;span class="s2"&gt;"API_KEY=my-api-key:latest"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-secrets&lt;/span&gt; &lt;span class="s2"&gt;"/secrets/tls.key=tls-private-key:latest"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Secrets can be mounted as environment variables or as files. The first example above mounts the secret as an environment variable called &lt;code&gt;API_KEY&lt;/code&gt;. The second mounts it as a file at &lt;code&gt;/secrets/tls.key&lt;/code&gt;. Secrets are versioned, access-controlled via IAM, and audit-logged. If a secret is compromised, you rotate it in Secret Manager and redeploy. No code changes.&lt;/p&gt;
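&lt;p&gt;The secret has to exist in Secret Manager before a deploy can reference it. A sketch, with placeholder names and a placeholder runtime service account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Create the secret (name and value are placeholders)
printf '%s' 'example-value' | gcloud secrets create my-api-key --data-file=-

# Grant the service's runtime service account access to it
gcloud secrets add-iam-policy-binding my-api-key \
  --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
  --role="roles/secretmanager.secretAccessor"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;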

&lt;h2&gt;
  
  
  Volume Mounts
&lt;/h2&gt;

&lt;p&gt;Cloud Run instances are ephemeral, but sometimes you need temporary storage or access to shared files. Cloud Run supports three types of &lt;a href="https://cloud.google.com/run/docs/configuring/services/volumes?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;volume mounts&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In-memory volumes&lt;/strong&gt; are &lt;code&gt;tmpfs&lt;/code&gt;-style mounts backed by your instance's RAM. They're fast but volatile (gone when the instance terminates) and count against your memory limit. Useful for temporary file processing, like downloading a file, transforming it, and uploading the result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; my-image &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--add-volume&lt;/span&gt; &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;scratch,type&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="nt"&gt;-memory&lt;/span&gt;,size-limit&lt;span class="o"&gt;=&lt;/span&gt;256Mi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--add-volume-mount&lt;/span&gt; &lt;span class="nv"&gt;volume&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;scratch,mount-path&lt;span class="o"&gt;=&lt;/span&gt;/tmp/work &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/cloud-storage-volume-mounts" rel="noopener noreferrer"&gt;Cloud Storage FUSE&lt;/a&gt;&lt;/strong&gt; mounts a &lt;a href="https://cloud.google.com/storage?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Storage&lt;/a&gt; bucket as a local filesystem. Your code reads and writes files normally, and &lt;a href="https://cloud.google.com/storage/docs/cloud-storage-fuse/overview?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;GCS FUSE&lt;/a&gt; translates those operations into Cloud Storage API calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; my-image &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--add-volume&lt;/span&gt; &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;models,type&lt;span class="o"&gt;=&lt;/span&gt;cloud-storage,bucket&lt;span class="o"&gt;=&lt;/span&gt;my-ml-models &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--add-volume-mount&lt;/span&gt; &lt;span class="nv"&gt;volume&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;models,mount-path&lt;span class="o"&gt;=&lt;/span&gt;/mnt/models &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The catch: it's eventually consistent. No file locking, last write wins. Good for reading shared assets (ML models, configuration files) or writing artifacts (logs, exports). Not good for concurrent writes to the same file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/nfs-volume-mounts" rel="noopener noreferrer"&gt;NFS via Filestore&lt;/a&gt;&lt;/strong&gt; gives you a fully &lt;a href="https://en.wikipedia.org/wiki/POSIX" rel="noopener noreferrer"&gt;POSIX&lt;/a&gt;-compliant network filesystem with proper file locking. Lower latency than GCS FUSE for random reads. Requires &lt;a href="https://cloud.google.com/run/docs/configuring/vpc-direct-vpc?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;VPC connectivity&lt;/a&gt; since &lt;a href="https://cloud.google.com/filestore?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Filestore&lt;/a&gt; instances live on your VPC. Best for workloads that need shared read/write access with file-level consistency.&lt;/p&gt;

&lt;p&gt;For most Cloud Run services, you won't need any of these. But when you do (image processing pipelines, ML model serving, shared configuration across instances), they save you from building workarounds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Network Configuration
&lt;/h2&gt;

&lt;p&gt;Cloud Run's networking defaults are simple: your service is public, and it connects to the internet for outbound traffic. But when you need more control, there are three areas to configure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingress
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/run/docs/securing/ingress?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Ingress settings&lt;/a&gt; control who can reach your service:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;All (default).&lt;/strong&gt; Accepts traffic from anywhere on the internet. Fine for public APIs and web apps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal.&lt;/strong&gt; Only accepts traffic from within your &lt;a href="https://cloud.google.com/vpc?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;VPC&lt;/a&gt; or from other Google Cloud services (like &lt;a href="https://cloud.google.com/pubsub?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Pub/Sub&lt;/a&gt;, &lt;a href="https://cloud.google.com/scheduler?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Scheduler&lt;/a&gt;, or &lt;a href="https://cloud.google.com/tasks?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Tasks&lt;/a&gt;). The service is invisible to the public internet. Use this for backend services that should never be called directly by external clients.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal + Cloud Load Balancing.&lt;/strong&gt; Same as internal, but also accepts traffic through a &lt;a href="https://cloud.google.com/load-balancing/docs/https?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;global external Application Load Balancer&lt;/a&gt;. This is the path to custom domains, CDN caching with &lt;a href="https://cloud.google.com/cdn?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud CDN&lt;/a&gt;, and WAF protection with &lt;a href="https://cloud.google.com/armor?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Armor&lt;/a&gt;. You'll see this load balancer pattern come up again in the Custom Domains and Cloud Armor sections below.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; my-image &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ingress&lt;/span&gt; internal &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Egress and VPC Connectivity
&lt;/h3&gt;

&lt;p&gt;By default, your Cloud Run instances connect to the internet directly. But if your service needs to reach private resources (a &lt;a href="https://cloud.google.com/sql?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud SQL&lt;/a&gt; database, a &lt;a href="https://cloud.google.com/memorystore?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Memorystore&lt;/a&gt; Redis instance, an internal API), it needs VPC access.&lt;/p&gt;

&lt;p&gt;Two options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://cloud.google.com/vpc/docs/configure-serverless-vpc-access?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Serverless VPC Access connectors&lt;/a&gt;.&lt;/strong&gt; The original approach. You create a connector resource that bridges Cloud Run and your VPC. Works, but adds a network hop and has throughput limits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/vpc-direct-vpc?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Direct VPC egress&lt;/a&gt;.&lt;/strong&gt; The newer approach. Cloud Run instances are placed directly on your VPC subnet. No connector needed, no extra hop, no throughput bottleneck. This is the recommended path for new deployments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're starting fresh, go with Direct VPC egress. If you have existing services using connectors, they'll keep working, but consider migrating when convenient.&lt;/p&gt;
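&lt;p&gt;A minimal sketch of Direct VPC egress (the network and subnet names are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service \
  --image my-image \
  --network my-vpc \
  --subnet my-subnet \
  --vpc-egress private-ranges-only \
  --region us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;--vpc-egress&lt;/code&gt; controls how much outbound traffic routes through the VPC: &lt;code&gt;private-ranges-only&lt;/code&gt; (the default) or &lt;code&gt;all-traffic&lt;/code&gt;.&lt;/p&gt;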

&lt;h3&gt;
  
  
  Custom Domains
&lt;/h3&gt;

&lt;p&gt;Every Cloud Run service gets a &lt;code&gt;*.run.app&lt;/code&gt; URL with automatic HTTPS. But for production, you'll want your own domain. Two paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/mapping-custom-domains?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Run domain mapping&lt;/a&gt;.&lt;/strong&gt; The simpler option. Map a domain directly to your Cloud Run service. SSL certificates are provisioned and renewed automatically. Works for straightforward setups where you just need &lt;code&gt;api.example.com&lt;/code&gt; pointing to your service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://cloud.google.com/load-balancing/docs/https?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Global external Application Load Balancer&lt;/a&gt;.&lt;/strong&gt; The more capable option. Gives you CDN caching, Cloud Armor WAF, multi-region routing, and URL-based routing to different services. More setup, but it unlocks features that domain mapping alone can't provide.&lt;/li&gt;
&lt;/ul&gt;
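&lt;p&gt;The mapping itself is one command, sketched here with placeholder names (the domain must be verified first, and depending on your gcloud version the command may live under the beta component):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run domain-mappings create \
  --service my-service \
  --domain api.example.com \
  --region us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;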

&lt;h2&gt;
  
  
  Security Configuration
&lt;/h2&gt;

&lt;p&gt;Cloud Run's security defaults are strong (covered in Part 1). But for production services, you'll want to customize a few settings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Service Accounts
&lt;/h3&gt;

&lt;p&gt;Every Cloud Run service runs as a &lt;a href="https://cloud.google.com/run/docs/configuring/service-accounts?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;service account&lt;/a&gt;, which determines what Google Cloud resources it can access. By default, Cloud Run uses the project's default compute service account, which typically has broad permissions.&lt;/p&gt;

&lt;p&gt;For production, create a &lt;strong&gt;dedicated service account per service&lt;/strong&gt; with only the permissions it needs. If your service reads from Cloud Storage and writes to Pub/Sub, its service account should have &lt;code&gt;roles/storage.objectViewer&lt;/code&gt; and &lt;code&gt;roles/pubsub.publisher&lt;/code&gt;. Nothing more. This is the principle of least privilege, and it limits the blast radius if a service is compromised.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; my-image &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service-account&lt;/span&gt; my-sa@my-project.iam.gserviceaccount.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
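&lt;p&gt;Creating that dedicated account and granting only the two roles from the example above might look like this (the project and account names are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud iam service-accounts create my-sa

gcloud projects add-iam-policy-binding my-project \
  --member "serviceAccount:my-sa@my-project.iam.gserviceaccount.com" \
  --role roles/storage.objectViewer

gcloud projects add-iam-policy-binding my-project \
  --member "serviceAccount:my-sa@my-project.iam.gserviceaccount.com" \
  --role roles/pubsub.publisher
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Tighter still: grant the roles on the specific bucket and topic rather than at the project level.&lt;/p&gt;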



&lt;h3&gt;
  
  
  IAM Authentication
&lt;/h3&gt;

&lt;p&gt;By default, Cloud Run requires authentication. Every request must include a valid identity token, and the caller must have the &lt;code&gt;roles/run.invoker&lt;/code&gt; role on the service. This is the right default for service-to-service communication.&lt;/p&gt;
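&lt;p&gt;You can see the default in action from your terminal; this sketch uses a placeholder service URL:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  https://my-service-abc123-uc.a.run.app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For service-to-service calls, the calling service fetches the same kind of identity token from the metadata server instead of the gcloud CLI.&lt;/p&gt;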

&lt;p&gt;For public-facing services (APIs, webhooks, web apps), you explicitly opt out by granting the &lt;code&gt;roles/run.invoker&lt;/code&gt; role to &lt;code&gt;allUsers&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services add-iam-policy-binding my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"allUsers"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/run.invoker"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But even with unauthenticated access enabled, you can implement your own authentication layer in your application code. Cloud Run handles transport security (HTTPS) and platform-level identity (the IAM invoker check). Your app handles application-level identity: user logins, API keys, JWT validation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Binary Authorization
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/binary-authorization?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Binary Authorization&lt;/a&gt; enforces deploy-time policies: only container images that have been signed by your CI/CD pipeline can be deployed. This prevents someone from deploying an untested image directly to production, even if they have the IAM permissions to do so.&lt;/p&gt;

&lt;p&gt;It's a layer of governance that makes sense for organizations with compliance requirements or strict change management processes.&lt;/p&gt;
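&lt;p&gt;Once a policy exists, enforcement is a deploy-time flag. A sketch, where &lt;code&gt;default&lt;/code&gt; refers to the project's default Binary Authorization policy:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service \
  --image my-image \
  --binary-authorization default \
  --region us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;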

&lt;h3&gt;
  
  
  Cloud Armor
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/armor?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Armor&lt;/a&gt; is Google Cloud's WAF (Web Application Firewall). It sits in front of your Cloud Run service and can enforce:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IP allowlists and denylists&lt;/li&gt;
&lt;li&gt;Geographic restrictions&lt;/li&gt;
&lt;li&gt;Rate limiting per client&lt;/li&gt;
&lt;li&gt;Pre-configured WAF rules (SQL injection, XSS, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cloud Armor requires a &lt;a href="https://cloud.google.com/load-balancing/docs/https?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;global external Application Load Balancer&lt;/a&gt; in front of your Cloud Run service. If you're using the default &lt;code&gt;*.run.app&lt;/code&gt; URL without a load balancer, Cloud Armor isn't available. But if your service is public-facing and handles sensitive data, the load balancer + Cloud Armor combination is worth the extra setup.&lt;/p&gt;
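&lt;p&gt;As a hedged sketch, a per-IP rate-limiting policy is created and then attached to the load balancer's backend service (all names and thresholds here are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute security-policies create my-policy

gcloud compute security-policies rules create 1000 \
  --security-policy my-policy \
  --src-ip-ranges "*" \
  --action throttle \
  --rate-limit-threshold-count 100 \
  --rate-limit-threshold-interval-sec 60 \
  --conform-action allow \
  --exceed-action deny-429 \
  --enforce-on-key IP

gcloud compute backend-services update my-backend \
  --security-policy my-policy \
  --global
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;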

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Cloud Run gives you enough configuration knobs to tune for real workloads. But you don't need to touch all of them at once.&lt;/p&gt;

&lt;p&gt;The pattern I recommend: start with the defaults. Deploy your service, see how it behaves under real traffic, then adjust. Bump the memory if you're hitting limits. Lower the concurrency if requests are CPU-heavy. Add a health check if your startup is slow. Set up a dedicated service account before going to production. Every change takes effect on the next deployment, with zero downtime. Nothing is permanent.&lt;/p&gt;

&lt;p&gt;If Part 1 was about &lt;em&gt;whether&lt;/em&gt; Cloud Run belongs in your architecture, and Part 2 was about &lt;em&gt;getting your code onto it&lt;/em&gt;, this article is about &lt;em&gt;making it work well for your specific needs&lt;/em&gt;. Start simple. Add complexity when your workload demands it, not before.&lt;/p&gt;

&lt;p&gt;If you're just joining the series, &lt;a href="https://dev.to/gde/this-is-cloud-run-a-decision-guide-for-developers-428m"&gt;Part 1&lt;/a&gt; is the place to start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs" rel="noopener noreferrer"&gt;Cloud Run documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/pricing" rel="noopener noreferrer"&gt;Cloud Run pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/cpu" rel="noopener noreferrer"&gt;Cloud Run CPU configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/healthchecks" rel="noopener noreferrer"&gt;Cloud Run health checks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://grpc.io/" rel="noopener noreferrer"&gt;gRPC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/grpc/grpc/blob/master/doc/health-checking.md" rel="noopener noreferrer"&gt;gRPC health checking protocol&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/request-timeout" rel="noopener noreferrer"&gt;Cloud Run request timeout&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/tasks" rel="noopener noreferrer"&gt;Cloud Tasks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/pubsub" rel="noopener noreferrer"&gt;Pub/Sub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/create-jobs" rel="noopener noreferrer"&gt;Cloud Run jobs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/min-instances" rel="noopener noreferrer"&gt;Cloud Run minimum instances&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/max-instances" rel="noopener noreferrer"&gt;Cloud Run maximum instances&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/concurrency" rel="noopener noreferrer"&gt;Cloud Run concurrency&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/cpu" rel="noopener noreferrer"&gt;Startup CPU Boost&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/environment-variables" rel="noopener noreferrer"&gt;Cloud Run environment variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/secrets" rel="noopener noreferrer"&gt;Cloud Run secrets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/secret-manager" rel="noopener noreferrer"&gt;Secret Manager&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://12factor.net/config" rel="noopener noreferrer"&gt;12-Factor App: Config&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/cloud-storage-volume-mounts" rel="noopener noreferrer"&gt;Cloud Storage volume mounts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/storage" rel="noopener noreferrer"&gt;Cloud Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/storage/docs/cloud-storage-fuse/overview" rel="noopener noreferrer"&gt;GCS FUSE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/nfs-volume-mounts" rel="noopener noreferrer"&gt;NFS volume mounts (Filestore)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/filestore" rel="noopener noreferrer"&gt;Filestore&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/securing/ingress" rel="noopener noreferrer"&gt;Cloud Run ingress settings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/vpc" rel="noopener noreferrer"&gt;VPC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/scheduler" rel="noopener noreferrer"&gt;Cloud Scheduler&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/cdn" rel="noopener noreferrer"&gt;Cloud CDN&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/armor" rel="noopener noreferrer"&gt;Cloud Armor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/load-balancing/docs/https" rel="noopener noreferrer"&gt;Global external Application Load Balancer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/vpc-direct-vpc" rel="noopener noreferrer"&gt;Direct VPC egress&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/vpc/docs/configure-serverless-vpc-access" rel="noopener noreferrer"&gt;Serverless VPC Access&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/sql" rel="noopener noreferrer"&gt;Cloud SQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/memorystore" rel="noopener noreferrer"&gt;Memorystore&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/mapping-custom-domains" rel="noopener noreferrer"&gt;Cloud Run domain mapping&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/service-accounts" rel="noopener noreferrer"&gt;Cloud Run service accounts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/binary-authorization" rel="noopener noreferrer"&gt;Binary Authorization&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>gcp</category>
      <category>cloudrun</category>
      <category>serverless</category>
    </item>
    <item>
      <title>This is Cloud Run: Nine Ways to Deploy (and When to Use Each)</title>
      <dc:creator>Daniel Gwerzman</dc:creator>
      <pubDate>Fri, 20 Mar 2026 07:04:45 +0000</pubDate>
      <link>https://forem.com/gde/this-is-cloud-run-nine-ways-to-deploy-and-when-to-use-each-506b</link>
      <guid>https://forem.com/gde/this-is-cloud-run-nine-ways-to-deploy-and-when-to-use-each-506b</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 2 of the "This is Cloud Run" series. In &lt;a href="https://dev.to/gde/this-is-cloud-run-a-decision-guide-for-developers-428m"&gt;Part 1&lt;/a&gt;, we covered what Cloud Run is, what's behind the curtain, what you get for free, Cloud Run functions, the platform's boundaries, and the migration path to Kubernetes. Now let's get practical.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In Part 1, the question was "should I use Cloud Run?" Here, the question is &lt;strong&gt;"how do I get my code onto it?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn't a step-by-step tutorial. This article is the &lt;em&gt;why&lt;/em&gt; behind each option, so you can make informed choices instead of copying commands.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment Options
&lt;/h2&gt;

&lt;p&gt;One of Cloud Run's underrated strengths is how many ways you can get your code onto the platform. Between the CLI, the Console, YAML, Terraform, Cloud Build, GitHub Actions, Cloud Deploy, client libraries, and more, there's no shortage of options. I'm not going to cover all of them. Instead, I'll focus on the ones I find most useful across the projects I work on, from quick prototypes to production deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  From Source Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the "I just want this running" command. You point &lt;code&gt;gcloud&lt;/code&gt; at your source code directory, and Cloud Run handles everything else. But "everything else" hides a multi-step pipeline that's worth understanding:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Upload.&lt;/strong&gt; &lt;code&gt;gcloud&lt;/code&gt; zips your source directory and uploads it to &lt;a href="https://cloud.google.com/build" rel="noopener noreferrer"&gt;Cloud Build&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detect.&lt;/strong&gt; Cloud Build runs &lt;a href="https://cloud.google.com/docs/buildpacks/overview" rel="noopener noreferrer"&gt;Google Cloud Buildpacks&lt;/a&gt;, which inspect your code to determine the language and framework. Found a &lt;code&gt;requirements.txt&lt;/code&gt;? Python. Found &lt;code&gt;gunicorn&lt;/code&gt; in the dependencies? That's your server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build.&lt;/strong&gt; Buildpacks create a container image: secure base image, installed dependencies, configured entry point. If a &lt;code&gt;Dockerfile&lt;/code&gt; is present in your directory, Cloud Build uses that instead of buildpacks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Push.&lt;/strong&gt; The built image is pushed to &lt;a href="https://cloud.google.com/artifact-registry" rel="noopener noreferrer"&gt;Artifact Registry&lt;/a&gt;, Google Cloud's container registry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy.&lt;/strong&gt; Cloud Run pulls the image and deploys it as a new &lt;a href="https://cloud.google.com/run/docs/managing/revisions" rel="noopener noreferrer"&gt;revision&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All from one command. You don't need to know Docker, you don't need to understand container registries, and you don't even need a &lt;code&gt;Dockerfile&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The trade-off is control. You can't customize the base image or run multi-stage builds. If you don't need any of that, source deploy works perfectly fine in production. Many of my services run this way. But if you need precise control over what's in the container, the next option gives you that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Quick iterations during development, prototyping, developers new to containers, and the "I just want this running" moments.&lt;/p&gt;

&lt;h3&gt;
  
  
  From a Pre-Built Container Image
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; us-docker.pkg.dev/my-project/repo/my-image:v1.2.3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the production path. You build your Docker image however you like (locally, in CI, in &lt;a href="https://cloud.google.com/build" rel="noopener noreferrer"&gt;Cloud Build&lt;/a&gt;), push it to &lt;a href="https://cloud.google.com/artifact-registry" rel="noopener noreferrer"&gt;Artifact Registry&lt;/a&gt;, and deploy by image URL. Notice the flag: &lt;code&gt;--image&lt;/code&gt; instead of &lt;code&gt;--source&lt;/code&gt;. No build step happens on Google's side. Cloud Run pulls the image and starts running it, which makes the deployment itself much faster.&lt;/p&gt;

&lt;p&gt;The key advantage is full control over the build process. Multi-stage builds to keep images small. Custom base images tuned for your runtime. Build-time secrets for private package registries. Whatever your &lt;code&gt;Dockerfile&lt;/code&gt; needs. If you care about minimal images, pinned base image versions, and a small attack surface, this is where you get that.&lt;/p&gt;
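&lt;p&gt;The build-and-push step itself can happen anywhere Docker runs; with Cloud Build it collapses to one command (the image path is the same illustrative one as above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud builds submit --tag us-docker.pkg.dev/my-project/repo/my-image:v1.2.3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;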

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Production deployments, teams with existing build pipelines, workloads that need custom build steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud Run Functions
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-function &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--function&lt;/span&gt; myEntrypoint &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--base-image&lt;/span&gt; python312 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We covered &lt;a href="https://cloud.google.com/run/docs/writing-a-function" rel="noopener noreferrer"&gt;Cloud Run functions&lt;/a&gt; in depth in Part 1, so here we'll focus on the deployment mechanics. Two flags distinguish a function deployment from a service deployment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--function myEntrypoint&lt;/code&gt;&lt;/strong&gt; selects which function in your source code to use as the HTTP entry point. Your source can define multiple functions; each deployment serves one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--base-image python312&lt;/code&gt;&lt;/strong&gt; selects a &lt;a href="https://cloud.google.com/run/docs/deploy-functions" rel="noopener noreferrer"&gt;managed base image&lt;/a&gt; for the runtime. Google manages these base images, including security patches, so you don't maintain a &lt;code&gt;Dockerfile&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Supported runtimes include Node.js, Python, Go, Java, .NET, Ruby, and PHP. You can also deploy Cloud Run functions from the &lt;a href="https://console.cloud.google.com/run" rel="noopener noreferrer"&gt;Console UI&lt;/a&gt; with an inline code editor, which is handy for quick experiments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Single-purpose endpoints, webhooks, event handlers, and the LLM API proxy pattern I described in Part 1.&lt;/p&gt;

&lt;h3&gt;
  
  
  Console UI
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://console.cloud.google.com/run" rel="noopener noreferrer"&gt;Google Cloud Console&lt;/a&gt; provides a point-and-click interface for deploying Cloud Run services. You select a container image from Artifact Registry, configure settings through a guided form (CPU, memory, scaling, environment variables, networking), and deploy without touching the command line.&lt;/p&gt;

&lt;p&gt;One thing the Console is surprisingly good for: &lt;strong&gt;discovery.&lt;/strong&gt; Before you memorize CLI flags, clicking through the Console form shows you every configuration option Cloud Run offers. I've used it more than once to discover a setting I didn't know existed, then replicated it in my deployment scripts.&lt;/p&gt;

&lt;p&gt;But the Console has an obvious limitation: it's not scriptable. Every deployment is manual, which means it's not repeatable, not version-controllable, and impossible to code review. You can't &lt;code&gt;git diff&lt;/code&gt; a click.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Exploration, one-off deployments, reviewing and tweaking configurations visually, and learning what options exist before writing automation.&lt;/p&gt;

&lt;h3&gt;
  
  
  gcloud CLI with Full Configuration
&lt;/h3&gt;

&lt;p&gt;You've already seen &lt;code&gt;gcloud run deploy&lt;/code&gt; with minimal flags. But the same command is also a full-featured deployment tool with &lt;a href="https://cloud.google.com/sdk/gcloud/reference/run/deploy" rel="noopener noreferrer"&gt;dozens of configuration options&lt;/a&gt;. Every Cloud Run configuration is a flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; us-docker.pkg.dev/my-project/repo/my-image:v1.2.3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory&lt;/span&gt; 512Mi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cpu&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--min-instances&lt;/span&gt; 0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-instances&lt;/span&gt; 10 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--concurrency&lt;/span&gt; 80 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-env-vars&lt;/span&gt; &lt;span class="s2"&gt;"DB_HOST=10.0.0.1,ENV=production"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-secrets&lt;/span&gt; &lt;span class="s2"&gt;"API_KEY=my-secret:latest"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service-account&lt;/span&gt; my-sa@my-project.iam.gserviceaccount.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--revision-suffix&lt;/span&gt; v1-2-3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Memory? A flag. Secrets from &lt;a href="https://cloud.google.com/secret-manager" rel="noopener noreferrer"&gt;Secret Manager&lt;/a&gt;? A flag. &lt;a href="https://cloud.google.com/run/docs/configuring/vpc-direct-vpc" rel="noopener noreferrer"&gt;VPC connectivity&lt;/a&gt;? A flag. This makes &lt;code&gt;gcloud&lt;/code&gt; commands scriptable, repeatable, and easy to drop into shell scripts or CI/CD pipelines.&lt;/p&gt;

&lt;p&gt;One thing that keeps it practical: &lt;strong&gt;configuration is sticky.&lt;/strong&gt; Once you set a flag, it stays on the service until you explicitly change it. So your first deployment might be the long one with &lt;code&gt;--memory&lt;/code&gt;, &lt;code&gt;--cpu&lt;/code&gt;, &lt;code&gt;--max-instances&lt;/code&gt;, and everything else. But subsequent deployments can go back to the simple &lt;code&gt;gcloud run deploy my-service --image my-image:v2 --region us-central1&lt;/code&gt; and all your previous settings carry over. You only specify what you want to change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Scripted deployments, shell scripts, CI/CD integration, and when you need precise, repeatable control over every setting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Declarative YAML
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services replace service.yaml &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CLI flags are imperative: "change these things." YAML is declarative: "this is what I want." You define your entire service configuration in a file, and Cloud Run makes reality match. If something drifted (someone tweaked a setting in the Console), the YAML corrects it. If nothing changed, nothing happens.&lt;/p&gt;

&lt;p&gt;The YAML follows the &lt;a href="https://cloud.google.com/run/docs/reference/rest/v1/namespaces.services" rel="noopener noreferrer"&gt;Knative Serving API v1 schema&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.knative.dev/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cloud.googleapis.com/location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-central1&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;run.googleapis.com/ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;all&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;autoscaling.knative.dev/minScale&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0'&lt;/span&gt;
        &lt;span class="na"&gt;autoscaling.knative.dev/maxScale&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;10'&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containerConcurrency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-docker.pkg.dev/my-project/repo/my-image:v1.2.3&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
          &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512Mi&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
          &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ENV&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the same model as &lt;a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt; and &lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt;: you describe &lt;em&gt;what you want&lt;/em&gt;, not &lt;em&gt;how to get there&lt;/em&gt;. And because the schema is Knative-compatible, these YAML files are portable between Cloud Run and self-hosted &lt;a href="https://knative.dev/" rel="noopener noreferrer"&gt;Knative&lt;/a&gt; on Kubernetes.&lt;/p&gt;

&lt;p&gt;A practical tip: you can export an existing service's configuration with &lt;code&gt;gcloud run services describe SERVICE --format export &amp;gt; service.yaml&lt;/code&gt;, modify it, and reapply. This is a great way to bring a service that was originally deployed via the Console or CLI into version control.&lt;/p&gt;
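&lt;p&gt;Putting that together, a minimal round trip looks like this. The service name and region are placeholders; &lt;code&gt;gcloud run services replace&lt;/code&gt; is the command that applies a YAML file to Cloud Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Export the live configuration to a file
gcloud run services describe my-service \
  --region us-central1 \
  --format export &gt; service.yaml

# Edit service.yaml, then make Cloud Run match it
gcloud run services replace service.yaml \
  --region us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;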

&lt;p&gt;One important nuance: IAM policies (who can invoke your service) are managed separately from the service definition. The YAML defines the service configuration; &lt;code&gt;gcloud run services add-iam-policy-binding&lt;/code&gt; controls access. This separation is deliberate and useful: access control is often owned by a different team, and keeping it out of the YAML means reapplying a service file can never accidentally change who has access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; GitOps workflows, infrastructure-as-code, teams that want configuration in version control, and teams migrating from Kubernetes or Knative.&lt;/p&gt;

&lt;h3&gt;
  
  
  Continuous Deployment from Git
&lt;/h3&gt;

&lt;p&gt;Connect a &lt;a href="https://github.com" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, &lt;a href="https://gitlab.com" rel="noopener noreferrer"&gt;GitLab&lt;/a&gt;, or &lt;a href="https://bitbucket.org" rel="noopener noreferrer"&gt;Bitbucket&lt;/a&gt; repository to Cloud Run through &lt;a href="https://cloud.google.com/build/docs/automating-builds/create-manage-triggers" rel="noopener noreferrer"&gt;Cloud Build triggers&lt;/a&gt;. Every push to a specified branch automatically builds your container and deploys a new revision. You set it up once and forget it.&lt;/p&gt;

&lt;p&gt;You can configure this &lt;a href="https://cloud.google.com/run/docs/continuous-deployment-with-cloud-build" rel="noopener noreferrer"&gt;from the Cloud Run Console UI&lt;/a&gt; under "Set up continuous deployment," or manually by creating Cloud Build triggers. Either way, the result is the same: push to &lt;code&gt;main&lt;/code&gt;, wait a couple of minutes, and your changes are live. Under the hood, it uses the same buildpack pipeline as &lt;code&gt;--source&lt;/code&gt; deploys: your code is auto-detected, built into an image, pushed to Artifact Registry, and deployed as a new revision.&lt;/p&gt;

&lt;p&gt;The difference between this and the next two options (Cloud Build and GitHub Actions) is simplicity. Continuous deployment from Git is a pre-built pipeline. You don't write build steps or workflow files. The trade-off is flexibility: you can't run tests before deploying, can't deploy to staging first, and can't customize the build beyond what Cloud Build's auto-detection provides. If you need any of those things, keep reading.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that want automated deployment on every push without building or maintaining custom CI/CD.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud Build
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/build" rel="noopener noreferrer"&gt;Cloud Build&lt;/a&gt; is Google Cloud's serverless CI/CD platform. Where the previous option gives you a pre-built pipeline, Cloud Build gives you the building blocks to assemble your own.&lt;/p&gt;

&lt;p&gt;You define your pipeline in a &lt;code&gt;cloudbuild.yaml&lt;/code&gt; file at the root of your repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gcr.io/cloud-builders/docker'&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;build'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-t'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-docker.pkg.dev/$PROJECT_ID/repo/my-image:$COMMIT_SHA'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gcr.io/cloud-builders/docker'&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;push'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-docker.pkg.dev/$PROJECT_ID/repo/my-image:$COMMIT_SHA'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gcr.io/google.com/cloudsdktool/cloud-sdk'&lt;/span&gt;
    &lt;span class="na"&gt;entrypoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcloud&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;run'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deploy'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my-service'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
           &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--image'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-docker.pkg.dev/$PROJECT_ID/repo/my-image:$COMMIT_SHA'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
           &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--region'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-central1'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each &lt;code&gt;step&lt;/code&gt; runs in its own container. The first step builds the Docker image, tagging it with the commit SHA for traceability. The second pushes it to Artifact Registry. The third deploys it to Cloud Run. You can chain as many steps as you need: run tests, lint code, scan for vulnerabilities, deploy to staging, run integration tests against staging, then deploy to production. The &lt;code&gt;$PROJECT_ID&lt;/code&gt; and &lt;code&gt;$COMMIT_SHA&lt;/code&gt; are &lt;a href="https://cloud.google.com/build/docs/configuring-builds/substitute-variable-values" rel="noopener noreferrer"&gt;built-in substitution variables&lt;/a&gt; that Cloud Build populates automatically.&lt;/p&gt;
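&lt;p&gt;As a sketch of what "test before deploy" looks like, you could prepend a test step. This assumes a Node.js project; swap the image and commands for your stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;steps:
  # Install dependencies and run the test suite first.
  # If this step fails, the build stops and nothing is deployed.
  - name: 'node:20'
    entrypoint: bash
    args: ['-c', 'npm ci &amp;&amp; npm test']
  # ...followed by the build, push, and deploy steps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;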

&lt;p&gt;Trigger this pipeline &lt;a href="https://cloud.google.com/build/docs/automating-builds/create-manage-triggers" rel="noopener noreferrer"&gt;on every push to a branch&lt;/a&gt;, on pull requests, or on-demand with &lt;code&gt;gcloud builds submit&lt;/code&gt;. That flexibility is the point: Cloud Build is the pipeline, and you decide what goes in it.&lt;/p&gt;
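&lt;p&gt;For example, wiring the pipeline to a GitHub repository and running it once by hand might look like this. The trigger name, repository owner, and repository name are placeholders, and the repository must already be connected to Cloud Build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Run the pipeline on every push to main
gcloud builds triggers create github \
  --name deploy-on-push \
  --repo-owner my-org \
  --repo-name my-repo \
  --branch-pattern '^main$' \
  --build-config cloudbuild.yaml

# Or run the same pipeline once, on demand
gcloud builds submit --config cloudbuild.yaml .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;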

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Complex build pipelines, multi-service deployments, teams that need test-before-deploy workflows, and teams already invested in Google Cloud's CI/CD ecosystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  GitHub Actions
&lt;/h3&gt;

&lt;p&gt;If your code lives on GitHub and your CI/CD already runs on &lt;a href="https://github.com/features/actions" rel="noopener noreferrer"&gt;GitHub Actions&lt;/a&gt;, Google provides an official action for deploying to Cloud Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/auth@v2&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;workload_identity_provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;projects/123/locations/global/workloadIdentityPools/my-pool/providers/my-provider'&lt;/span&gt;
    &lt;span class="na"&gt;service_account&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deployer@my-project.iam.gserviceaccount.com'&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/deploy-cloudrun@v2&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-docker.pkg.dev/my-project/repo/my-image:${{ github.sha }}&lt;/span&gt;
    &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-central1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Authentication is handled via &lt;a href="https://cloud.google.com/iam/docs/workload-identity-federation" rel="noopener noreferrer"&gt;Workload Identity Federation&lt;/a&gt;, which lets GitHub Actions authenticate to Google Cloud without service account keys. Instead of storing a JSON key file as a GitHub secret (a secret that never expires and can be copied anywhere), Workload Identity Federation uses short-lived tokens granted through an identity mapping. No keys to store, no keys to rotate, no keys to accidentally leak in a log.&lt;/p&gt;
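&lt;p&gt;One detail that's easy to miss: the workflow needs permission to mint the OIDC token that Workload Identity Federation exchanges. A minimal workflow skeleton around the steps above might look like this (names are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;name: deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write  # required for Workload Identity Federation
    steps:
      - uses: actions/checkout@v4
      # ...the auth and deploy-cloudrun steps shown above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;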

&lt;p&gt;The &lt;a href="https://github.com/google-github-actions/deploy-cloudrun" rel="noopener noreferrer"&gt;&lt;code&gt;google-github-actions/deploy-cloudrun&lt;/code&gt;&lt;/a&gt; action supports both image-based and source-based deployments, and the resulting service URL is available as a workflow output for downstream steps (useful for posting preview links on pull requests).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams with GitHub-centric workflows who want deployment integrated into their existing CI/CD.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choosing the Right Deployment Method
&lt;/h3&gt;

&lt;p&gt;Here's a quick reference for the methods we covered:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Deploy Speed&lt;/th&gt;
&lt;th&gt;Build Control&lt;/th&gt;
&lt;th&gt;Repeatable&lt;/th&gt;
&lt;th&gt;Best Scenario&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;From Source&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Yes (CLI)&lt;/td&gt;
&lt;td&gt;Prototyping, quick iterations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Image&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Yes (CI/CD)&lt;/td&gt;
&lt;td&gt;Production, custom builds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud Run Functions&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Yes (CLI)&lt;/td&gt;
&lt;td&gt;Single-purpose endpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Console UI&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Exploration, learning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gcloud CLI (full)&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Yes (scripts)&lt;/td&gt;
&lt;td&gt;Scripted deploys, CI/CD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Declarative YAML&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Yes (GitOps)&lt;/td&gt;
&lt;td&gt;Infrastructure-as-code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Git Continuous Deploy&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Yes (auto)&lt;/td&gt;
&lt;td&gt;Simple auto-deploy on push&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud Build&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Yes (pipeline)&lt;/td&gt;
&lt;td&gt;Complex CI/CD pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Actions&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Yes (pipeline)&lt;/td&gt;
&lt;td&gt;GitHub-centric teams&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In practice, most teams follow a natural progression. You start with &lt;strong&gt;source deploy&lt;/strong&gt; for prototyping: one command, instant feedback. As the project matures, you move to &lt;strong&gt;pre-built images&lt;/strong&gt; for reproducibility and control. When the team grows, you add &lt;strong&gt;CI/CD&lt;/strong&gt; (Cloud Build, GitHub Actions, or continuous deployment from Git) so deployments happen automatically and consistently.&lt;/p&gt;

&lt;p&gt;You don't need to pick one and stick with it. Use source deploy for your dev environment and image-based deploys for production. Use the Console to explore, then codify what you learned in YAML. The deployment method is a tool, not a commitment.&lt;/p&gt;

&lt;p&gt;These aren't the only options. Cloud Run also supports deployment via &lt;a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/cloud_run_v2_service" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt;, &lt;a href="https://cloud.google.com/deploy" rel="noopener noreferrer"&gt;Cloud Deploy&lt;/a&gt; for managed continuous delivery pipelines, &lt;a href="https://cloud.google.com/code" rel="noopener noreferrer"&gt;Cloud Code&lt;/a&gt; for IDE integration, client libraries, and the REST API directly. The &lt;a href="https://cloud.google.com/run/docs/deploying" rel="noopener noreferrer"&gt;Cloud Run deployment documentation&lt;/a&gt; covers the full list.&lt;/p&gt;

&lt;h2&gt;
  
  
  Revisions and Traffic Management
&lt;/h2&gt;

&lt;p&gt;Every time you deploy to Cloud Run, it creates a new &lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/managing/revisions" rel="noopener noreferrer"&gt;revision&lt;/a&gt;&lt;/strong&gt;: an immutable snapshot of your service's configuration and container image. Think of revisions as Cloud Run's built-in version control. Your service might accumulate dozens of revisions over its lifetime, each representing a specific deployment. Old revisions stick around and can serve traffic again at any time. Nothing is deleted unless you explicitly remove it.&lt;/p&gt;

&lt;p&gt;By default, Cloud Run auto-generates revision names like &lt;code&gt;my-service-00001-abc&lt;/code&gt;. That works, but it's not helpful when you're staring at a list of revisions trying to figure out which one introduced a bug. You can set meaningful names with the &lt;code&gt;--revision-suffix&lt;/code&gt; flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; us-docker.pkg.dev/my-project/repo/my-image:v1.2.3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--revision-suffix&lt;/span&gt; v1-2-3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the revision is named &lt;code&gt;my-service-v1-2-3&lt;/code&gt;. In a CI/CD pipeline, you might use the Git commit SHA: &lt;code&gt;--revision-suffix=$(git rev-parse --short HEAD)&lt;/code&gt;. When something goes wrong, you can immediately tell which commit is running.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traffic Splitting
&lt;/h3&gt;

&lt;p&gt;But the real power of revisions is &lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/rollouts-rollbacks-traffic-migration" rel="noopener noreferrer"&gt;traffic splitting&lt;/a&gt;&lt;/strong&gt;. Because revisions are immutable and stick around, you can split traffic between them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services update-traffic my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--to-revisions&lt;/span&gt; my-service-v1-2-3&lt;span class="o"&gt;=&lt;/span&gt;95,my-service-v1-3-0&lt;span class="o"&gt;=&lt;/span&gt;5 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sends 95% of traffic to the old revision and 5% to the new one. Watch the metrics in &lt;a href="https://cloud.google.com/monitoring" rel="noopener noreferrer"&gt;Cloud Monitoring&lt;/a&gt;. If the new revision's error rate and latency look good, shift more traffic. If something's off, one command puts you back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services update-traffic my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--to-revisions&lt;/span&gt; my-service-v1-2-3&lt;span class="o"&gt;=&lt;/span&gt;100 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instant rollback. No redeployment, no rebuild, no downtime. The old revision is still there, already running, ready to take 100% of traffic again. This is what immutable revisions buy you.&lt;/p&gt;

&lt;p&gt;While the CLI works well for scripted rollouts, traffic splitting is one of those things that's often easier to do from the &lt;a href="https://console.cloud.google.com/run" rel="noopener noreferrer"&gt;Cloud Run Console&lt;/a&gt;. You can see all your revisions, drag sliders to adjust percentages, and watch the changes take effect.&lt;/p&gt;

&lt;p&gt;You can also use traffic splitting for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A/B testing&lt;/strong&gt; across different versions of your service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gradual rollouts&lt;/strong&gt; where you shift traffic incrementally (5% → 25% → 50% → 100%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blue/green deployments&lt;/strong&gt; by deploying the new revision with 0% traffic, testing it via a revision tag (see below), then flipping traffic all at once&lt;/li&gt;
&lt;/ul&gt;
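&lt;p&gt;A blue/green rollout with these pieces is just two commands: deploy the new revision with no production traffic but a tagged URL, verify it, then flip. The image tag and names are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Deploy the new revision; it gets a tagged URL but 0% of traffic
gcloud run deploy my-service \
  --image us-docker.pkg.dev/my-project/repo/my-image:v1.3.0 \
  --no-traffic \
  --tag green \
  --region us-central1

# After testing the green URL, send all traffic to the new revision
gcloud run services update-traffic my-service \
  --to-latest \
  --region us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;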

&lt;h3&gt;
  
  
  Revision Tags
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/run/docs/rollouts-rollbacks-traffic-migration#tags" rel="noopener noreferrer"&gt;Revision tags&lt;/a&gt; give individual revisions their own stable URLs without routing any production traffic to them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services update-traffic my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-tags&lt;/span&gt; &lt;span class="nv"&gt;staging&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-service-v1-3-0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a URL like &lt;code&gt;https://staging---my-service-abc123-uc.a.run.app&lt;/code&gt; that points directly to that revision. Your QA team can test the new version at that URL while production traffic continues hitting the current revision untouched. When you're satisfied, shift traffic. No separate staging environment needed.&lt;/p&gt;

&lt;p&gt;You can have multiple tags active at once: &lt;code&gt;staging&lt;/code&gt;, &lt;code&gt;canary&lt;/code&gt;, &lt;code&gt;pr-42&lt;/code&gt;. Each gets its own URL. This is particularly useful in CI/CD pipelines where you want to run automated tests against a deployed revision before routing real users to it.&lt;/p&gt;
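&lt;p&gt;Tags linger until you remove them, so it's worth cleaning up ones you no longer need, for example after a pull request is merged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services update-traffic my-service \
  --remove-tags pr-42 \
  --region us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;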

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Cloud Run gives you many paths to deployment, each designed for a different stage of your project. You don't need to use all of them.&lt;/p&gt;

&lt;p&gt;The pattern I see most often: teams start with &lt;code&gt;gcloud run deploy --source .&lt;/code&gt; and the default configuration. That gets them running in minutes. As the project matures, they move to pre-built images for reproducibility, add CI/CD for automation, and use revisions and traffic splitting for safe rollouts. Every change takes effect on the next deployment, with zero downtime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 3 is coming soon.&lt;/strong&gt; We'll dive into the configuration options that let you tune Cloud Run for your specific workload: CPU, memory, scaling, networking, secrets, and security. Follow so you don't miss it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs" rel="noopener noreferrer"&gt;Cloud Run documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/pricing" rel="noopener noreferrer"&gt;Cloud Run pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/deploying" rel="noopener noreferrer"&gt;Cloud Run deployment documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/sdk/gcloud/reference/run/deploy" rel="noopener noreferrer"&gt;&lt;code&gt;gcloud run deploy&lt;/code&gt; reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/build" rel="noopener noreferrer"&gt;Cloud Build&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/docs/buildpacks/overview" rel="noopener noreferrer"&gt;Google Cloud Buildpacks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/artifact-registry" rel="noopener noreferrer"&gt;Artifact Registry&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/writing-a-function" rel="noopener noreferrer"&gt;Cloud Run functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/deploy-functions" rel="noopener noreferrer"&gt;Cloud Run deploy functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/managing/revisions" rel="noopener noreferrer"&gt;Cloud Run revisions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/rollouts-rollbacks-traffic-migration" rel="noopener noreferrer"&gt;Cloud Run traffic splitting (rollouts and rollbacks)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/monitoring" rel="noopener noreferrer"&gt;Cloud Monitoring&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/reference/rest/v1/namespaces.services" rel="noopener noreferrer"&gt;Knative Serving API v1 schema&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/build/docs/automating-builds/create-manage-triggers" rel="noopener noreferrer"&gt;Cloud Build triggers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/continuous-deployment-with-cloud-build" rel="noopener noreferrer"&gt;Continuous deployment from Git&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/google-github-actions/deploy-cloudrun" rel="noopener noreferrer"&gt;&lt;code&gt;google-github-actions/deploy-cloudrun&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/iam/docs/workload-identity-federation" rel="noopener noreferrer"&gt;Workload Identity Federation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/deploy" rel="noopener noreferrer"&gt;Cloud Deploy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/code" rel="noopener noreferrer"&gt;Cloud Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/cloud_run_v2_service" rel="noopener noreferrer"&gt;Terraform Cloud Run resource&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/secret-manager" rel="noopener noreferrer"&gt;Secret Manager&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/vpc-direct-vpc" rel="noopener noreferrer"&gt;Direct VPC egress&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://knative.dev/" rel="noopener noreferrer"&gt;Knative&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cloudrun</category>
      <category>gcp</category>
      <category>serverless</category>
      <category>devops</category>
    </item>
    <item>
      <title>This is Cloud Run: A Decision Guide for Developers</title>
      <dc:creator>Daniel Gwerzman</dc:creator>
      <pubDate>Sat, 14 Mar 2026 08:54:11 +0000</pubDate>
      <link>https://forem.com/gde/this-is-cloud-run-a-decision-guide-for-developers-428m</link>
      <guid>https://forem.com/gde/this-is-cloud-run-a-decision-guide-for-developers-428m</guid>
      <description>&lt;p&gt;I like to throw spaghetti at the wall and see if it sticks.&lt;/p&gt;

&lt;p&gt;Some of my best projects started exactly that way. An idea on a Saturday morning, a container deployed by lunch, a URL shared with a friend by dinner. No infrastructure planning, no provisioning tickets, no three-day detour through VPC configurations before writing a single line of business logic. Just code, deploy, done.&lt;/p&gt;

&lt;p&gt;Every single one of them ran on &lt;a href="https://cloud.google.com/run?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For me, Cloud Run is my go-to, whether it's a weekend experiment that might go nowhere or a production solution for a client. More than once, a quick demo I built on Cloud Run ended up maturing into the actual production system, running on the exact same setup.&lt;/p&gt;

&lt;p&gt;A recent example: every time I build something with AI, even a quick vibe-coded prototype, I need a server-side component to keep my LLM API calls secure and my credential keys away from public eyes. Cloud Run is perfect for this. In minutes I have a secure backend with HTTPS, and I didn't have to think about infrastructure at all. The prototype works, the keys are safe, and if the project grows into something real, the backend is already production-ready.&lt;/p&gt;

&lt;p&gt;The idea that you need days of infrastructure preparation before you can test something with real users has always felt backwards to me. I believe in getting something live as fast as possible, putting it in front of people, and &lt;em&gt;then&lt;/em&gt; deciding if it deserves more investment.&lt;/p&gt;

&lt;p&gt;But this article isn't a love letter. I want to give you the understanding to make a real architectural decision: &lt;strong&gt;when is Cloud Run the right choice, and when isn't it?&lt;/strong&gt; We'll look at what it actually is under the hood, what you get for free, where its boundaries are, and when you should consider moving to Kubernetes. By the end, you'll know whether Cloud Run belongs in your next project, or whether you should reach for something else entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Cloud Run?
&lt;/h2&gt;

&lt;p&gt;Cloud Run is a fully managed serverless platform on &lt;a href="https://cloud.google.com/?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Google Cloud&lt;/a&gt; that runs containers. You give it code, it gives you a URL. No clusters to provision, no nodes to manage, no load balancers to configure. You bring the code; Google handles everything else.&lt;/p&gt;

&lt;p&gt;But what makes Cloud Run different is its core promise: &lt;strong&gt;the same configuration that runs your proof of concept can carry you to production.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think about that for a moment. Most cloud services force you into one of two buckets: either you use a "quick and dirty" option for prototyping that you'll have to throw away later, or you invest days of infrastructure setup upfront for a production-grade environment. Cloud Run refuses that trade-off. Your weekend project and your production workload run on the same platform, with the same security, the same scaling, the same deployment model.&lt;/p&gt;

&lt;p&gt;And when nobody is using your service? It scales to zero. You pay nothing. That means you can spin up ten experimental services, let them sit idle for a month, and your bill is exactly zero.&lt;/p&gt;

&lt;p&gt;So where does Cloud Run sit in the serverless landscape? It's not a virtual machine: you don't manage an OS. It's not a Kubernetes cluster: you don't manage nodes or pods. It's not a function: you're not limited to a single entry point with a 15-minute timeout. It's a &lt;strong&gt;container-as-a-service&lt;/strong&gt;: you provide something that can run in a container, and the platform handles everything else (placement, scaling, networking, TLS). If you already know containers, bring your image and Cloud Run runs it. If you don't? Just point it at your source code and Cloud Run will package and deploy it for you. And if you don't even want to manage an HTTP framework, &lt;a href="https://cloud.google.com/run/docs/writing-a-function?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Run functions&lt;/a&gt; let you write functions and deploy them individually, with Cloud Run wrapping each one in a server for you. Either way, you end up with a running service.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Behind the Curtain?
&lt;/h2&gt;

&lt;p&gt;You don't need to understand any of this to use Cloud Run. But knowing what's underneath explains &lt;em&gt;why&lt;/em&gt; the defaults are production-grade and why you can trust them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Borg: Google's Internal Engine
&lt;/h3&gt;

&lt;p&gt;Cloud Run doesn't run on some separate, less-proven infrastructure. It runs directly on &lt;a href="https://research.google/pubs/large-scale-cluster-management-at-google-with-borg/" rel="noopener noreferrer"&gt;Borg&lt;/a&gt;, Google's internal cluster management system. The same system that powers Gmail, YouTube, Google Search, and virtually every other Google service. Borg has been in production for over a decade, deploying &lt;strong&gt;billions of containers per week&lt;/strong&gt; across clusters of tens of thousands of machines.&lt;/p&gt;

&lt;p&gt;If Borg sounds familiar, it should. It was the direct predecessor to &lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt;. Many of the same engineers and architectural concepts carried over. But while Kubernetes is the open-source version built for the rest of us, Borg is the battle-hardened original that still runs Google internally.&lt;/p&gt;

&lt;p&gt;What does this mean for your containers? It means they inherit the same scheduling, failover, and resource management that Google trusts for its own products. It means your service benefits from Google's &lt;a href="https://cloud.google.com/security/beyondprod?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;BeyondProd&lt;/a&gt; zero-trust security framework, where trust depends on code provenance and service identity, not network location. It means &lt;a href="https://cloud.google.com/docs/security/binary-authorization-for-borg?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Binary Authorization for Borg&lt;/a&gt; verifies that only reviewed, properly built code is deployed to the infrastructure.&lt;/p&gt;

&lt;p&gt;In short: your containers run on the same infrastructure as Gmail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Knative: The API Layer
&lt;/h3&gt;

&lt;p&gt;Cloud Run's API is based on &lt;a href="https://knative.dev/docs/serving/" rel="noopener noreferrer"&gt;Knative Serving&lt;/a&gt;, an open-source project originally started by Google for running serverless workloads on Kubernetes. But Cloud Run is not "managed Knative". It reimplements the Knative Serving API on top of Borg, with no Kubernetes underneath.&lt;/p&gt;

&lt;p&gt;The practical takeaway: if you define your service using a Knative YAML manifest, that definition is portable between Cloud Run and self-hosted Knative on Kubernetes. And because there's no Kubernetes under the hood, you don't pay the complexity tax of managing a cluster.&lt;/p&gt;
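&lt;p&gt;For a sense of what that portability looks like, here is a minimal manifest (the service name and image path are illustrative) that both Cloud Run and self-hosted Knative accept:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-api
spec:
  template:
    spec:
      containers:
      - image: us-docker.pkg.dev/my-project/repo/my-api:v1
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;On Cloud Run you apply it with &lt;code&gt;gcloud run services replace service.yaml&lt;/code&gt;; on a Knative cluster, with &lt;code&gt;kubectl apply&lt;/code&gt;.&lt;/p&gt;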

&lt;h3&gt;
  
  
  gVisor and Container Sandboxing
&lt;/h3&gt;

&lt;p&gt;Every Cloud Run instance is sandboxed with &lt;strong&gt;two layers of isolation&lt;/strong&gt;: not just Linux namespaces and cgroups like standard containers, but hardware-backed virtualization on top of that.&lt;/p&gt;

&lt;p&gt;Cloud Run offers two execution environments:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gen1 (gVisor-based):&lt;/strong&gt; &lt;a href="https://gvisor.dev/" rel="noopener noreferrer"&gt;gVisor&lt;/a&gt; is an open-source container sandbox developed by Google. It acts as a user-space kernel, a process written in Go that intercepts your container's system calls and reimplements them, so the host kernel is never directly exposed. This gives you a smaller attack surface and faster cold starts, but some software that relies on unusual system calls may be incompatible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gen2 (Linux microVM-based):&lt;/strong&gt; Instead of gVisor, Gen2 runs your container inside a lightweight virtual machine with a full Linux kernel. You get complete system call compatibility and better sustained CPU and network performance, but slightly longer cold starts.&lt;/p&gt;

&lt;p&gt;Both environments use the same two-layer approach: a hardware-backed &lt;strong&gt;virtual machine monitor (VMM)&lt;/strong&gt; boundary between instances, plus a software kernel layer (gVisor's user-space kernel or the microVM's guest kernel). Even if someone found a way to escape the container sandbox, they'd still face the hardware virtualization boundary.&lt;/p&gt;

&lt;p&gt;You choose per service. Pricing is identical. Most developers never need to think about it. If you don't specify an execution environment, Cloud Run selects one automatically based on the features your service uses.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Production-Ready Default
&lt;/h3&gt;

&lt;p&gt;You can happily use Cloud Run without knowing any of this. But when someone asks "is Cloud Run production-ready?", the answer isn't "probably." It's hardware-backed isolation between every instance, zero-trust security, and battle-tested scheduling. It comes hardened.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Get Out of the Box
&lt;/h2&gt;

&lt;p&gt;Here's what a single deploy command gives you, before you touch a single config file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;gcloud run deploy my-api &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1 &lt;span class="nt"&gt;--allow-unauthenticated&lt;/span&gt;

Building using Dockerfile and deploying container to Cloud Run service &lt;span class="o"&gt;[&lt;/span&gt;my-api] &lt;span class="k"&gt;in &lt;/span&gt;project &lt;span class="o"&gt;[&lt;/span&gt;my-project] region &lt;span class="o"&gt;[&lt;/span&gt;us-central1]
✓ Building and deploying... Done.
  ✓ Uploading sources...
  ✓ Building Container...
  ✓ Creating Revision...
  ✓ Routing traffic...
Done.
Service &lt;span class="o"&gt;[&lt;/span&gt;my-api] revision &lt;span class="o"&gt;[&lt;/span&gt;my-api-00001-abc] has been deployed and is serving 100 percent of traffic.
Service URL: https://my-api-abc123-uc.a.run.app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That URL is live, load-balanced, auto-scaling, and secured with a managed TLS certificate. Let's break down what's included.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security (Zero Config)
&lt;/h3&gt;

&lt;p&gt;Every Cloud Run service automatically gets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;HTTPS with managed TLS certificates.&lt;/strong&gt; Every &lt;code&gt;*.run.app&lt;/code&gt; URL is served over HTTPS with auto-provisioned, auto-renewed certificates. There is no option to serve plain HTTP on the public endpoint. You cannot accidentally deploy an insecure service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DDoS protection.&lt;/strong&gt; The &lt;a href="https://cloud.google.com/docs/security/infrastructure/design#google-front-end-service" rel="noopener noreferrer"&gt;Google Front End (GFE)&lt;/a&gt; sits in front of every Cloud Run service, applying the same DDoS protections that guard Google's own services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hardware-backed container isolation.&lt;/strong&gt; As we covered in the "behind the curtain" section, every instance is sandboxed behind a VMM boundary. This isn't namespace isolation, it's virtualization-level separation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Encryption everywhere.&lt;/strong&gt; Data encrypted at rest using Google-managed keys. All traffic between Google Cloud services encrypted in transit. This is the default and always on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/iam?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;IAM&lt;/a&gt;-based access control.&lt;/strong&gt; Every service integrates with Google Cloud IAM. By default, services require authentication. You explicitly opt in to public access with &lt;code&gt;--allow-unauthenticated&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scaling (Zero Config)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scale-to-zero by default.&lt;/strong&gt; No traffic? No instances. No cost. This is the default behavior, you don't configure it, you don't enable it. It just works.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Autoscaling up to 100 instances by default.&lt;/strong&gt; Cloud Run automatically evaluates two key signals: request concurrency (targeting 60% of your configured max) and CPU utilization (targeting 60%). It scales up and down based on real demand. The default cap of 100 instances can be raised. To put that in perspective: if your service handles one request at a time and you get a sudden spike of 80 concurrent users, Cloud Run spins up roughly 80 instances to absorb the load, then scales back down as traffic drops.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Idle instance retention.&lt;/strong&gt; After the last request, instances may be kept idle for up to 15 minutes before being terminated. This absorbs traffic bursts without cold starts. It's a small detail that makes a big difference in practice.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/cpu?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Startup CPU Boost&lt;/a&gt;.&lt;/strong&gt; When an instance starts up, Cloud Run temporarily doubles (or more) its CPU allocation to speed up initialization. A service configured for 2 vCPU gets boosted to 4 vCPU during startup and for 10 seconds after. Google reported up to 50% faster startup times for Java/Spring applications when this feature is enabled.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
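&lt;p&gt;As a rough mental model of the concurrency signal (a sketch only, not the actual algorithm, which also weighs CPU utilization and rate-limits its decisions):&lt;/p&gt;

```python
import math

def desired_instances(concurrent_requests, max_concurrency=80, target=0.6):
    # Sketch of the concurrency-based scaling signal: keep each instance
    # at roughly 60% of its configured max concurrency.
    if concurrent_requests == 0:
        return 0  # scale to zero when there is no traffic
    return math.ceil(concurrent_requests / (max_concurrency * target))

print(desired_instances(200))  # 200 concurrent requests, default concurrency 80 -> 5
print(desired_instances(0))    # idle -> 0
```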

&lt;h3&gt;
  
  
  Observability (Zero Config)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automatic logging.&lt;/strong&gt; Everything your container writes to &lt;code&gt;stdout&lt;/code&gt; and &lt;code&gt;stderr&lt;/code&gt; is automatically captured in &lt;a href="https://cloud.google.com/logging?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Logging&lt;/a&gt;. No agent to install, no sidecar to configure. Write structured JSON logs and they're automatically parsed into searchable fields. For example, a JSON line like &lt;code&gt;{"severity":"ERROR", "message":"connection refused", "sessionId":"abc-123", "userId":"user-42"}&lt;/code&gt; becomes a fully filterable log entry in the Cloud Logging console. That means you can add any custom fields you want to your JSON payload (session IDs, user IDs, request traces, feature flags) and later filter your logs by those exact fields. Debugging a problem for a specific user? Filter by &lt;code&gt;jsonPayload.userId="user-42"&lt;/code&gt; and you get every log entry for that user across all instances.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Built-in metrics.&lt;/strong&gt; &lt;a href="https://cloud.google.com/monitoring?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Monitoring&lt;/a&gt; automatically tracks request count, latency distribution, CPU utilization, memory utilization, and instance count. These show up in the Cloud Run console with no setup.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Audit logs always on.&lt;/strong&gt; &lt;a href="https://cloud.google.com/logging/docs/audit?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Admin Activity audit logs&lt;/a&gt; record who deployed what, when, and with what configuration. These are always enabled and cannot be turned off.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
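&lt;p&gt;Emitting such a structured entry needs nothing beyond the standard library, one JSON object per line on &lt;code&gt;stdout&lt;/code&gt; (a minimal sketch; &lt;code&gt;severity&lt;/code&gt; is the key Cloud Logging treats specially):&lt;/p&gt;

```python
import json

def log(severity, message, **fields):
    # One JSON object per line on stdout. Cloud Logging maps the
    # "severity" key to the entry's severity; the remaining keys land
    # under jsonPayload and become filterable fields.
    line = json.dumps({"severity": severity, "message": message, **fields})
    print(line, flush=True)
    return line

log("ERROR", "connection refused", sessionId="abc-123", userId="user-42")
```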

&lt;h3&gt;
  
  
  Infrastructure (Zero Config)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Built-in load balancing.&lt;/strong&gt; Requests are distributed across instances automatically. No load balancer to provision or configure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Zero-downtime deployments.&lt;/strong&gt; Every deployment creates a new immutable &lt;a href="https://cloud.google.com/run/docs/managing/revisions?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;revision&lt;/a&gt;. Traffic switches to the new revision only after it passes its startup probe. Old instances keep serving in-flight requests. No deployment strategy to configure. It just happens. And because revisions are immutable and stick around, you can &lt;a href="https://cloud.google.com/run/docs/rollouts-rollbacks-traffic-migration?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;split traffic&lt;/a&gt; between them. Send 5% of traffic to the new revision while 95% stays on the current one, monitor the metrics, and gradually shift. Canary deployments without a deployment tool.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automatic health checks.&lt;/strong&gt; Cloud Run configures a TCP startup probe by default: it waits for your container to listen on the expected port before sending traffic. Your service doesn't receive requests until it's actually ready.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OS patching and runtime maintenance.&lt;/strong&gt; You never patch the underlying OS, kernel, or runtime. Google handles the entire infrastructure stack beneath your container.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
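&lt;p&gt;The traffic split described above is one CLI call (the revision names here are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services update-traffic my-api \
  --region us-central1 \
  --to-revisions my-api-00002-def=5,my-api-00001-abc=95
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;When the canary looks healthy, &lt;code&gt;--to-latest&lt;/code&gt; shifts all traffic to the newest revision.&lt;/p&gt;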

&lt;h2&gt;
  
  
  Cloud Run Functions: The Simpler Path
&lt;/h2&gt;

&lt;p&gt;Everything above applies to Cloud Run &lt;em&gt;services&lt;/em&gt;, where you bring a container (or source code) that runs an HTTP server. But what if you don't want to deal with an HTTP framework at all?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/run/docs/writing-a-function?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Run functions&lt;/a&gt; let you skip all of that. You write your functions, point a deployment at one of them, and Cloud Run wraps it in an HTTP server automatically. Your source code can define as many functions as you like. Each deployment serves one entry point, specified by the &lt;code&gt;--function&lt;/code&gt; flag. Same codebase, multiple deployments, each with its own URL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;functions_framework&lt;/span&gt;

&lt;span class="nd"&gt;@functions_framework.http&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hello&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;World&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy hello-func &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--function&lt;/span&gt; hello &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--base-image&lt;/span&gt; python312 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No &lt;code&gt;Flask&lt;/code&gt;, no &lt;code&gt;FastAPI&lt;/code&gt;, no &lt;code&gt;Dockerfile&lt;/code&gt;. Cloud Run builds the container, injects the HTTP server, and deploys it. You get the same scaling, the same security, the same zero-config observability that a full Cloud Run service gets.&lt;/p&gt;

&lt;p&gt;If this sounds like &lt;a href="https://cloud.google.com/functions?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Functions&lt;/a&gt;, here's the history. Cloud Functions 1st gen ran on older, separate infrastructure with strict limits: 9-minute timeouts, one request per instance, no concurrency. Cloud Functions 2nd gen (GA in 2022) was already built on top of Cloud Run under the hood, which unlocked 60-minute timeouts and multi-request concurrency. In 2024, Google made it official and rebranded 2nd gen as &lt;strong&gt;Cloud Run functions&lt;/strong&gt;, consolidating everything under the Cloud Run name. So this isn't a new product. It's the recognition that the infrastructure was already unified. If your functions outgrow the one-entry-point-per-deployment model and you need routing, middleware, or multiple endpoints behind a single URL, you swap it for a full service on the same platform. No migration, no new infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use functions vs. services:&lt;/strong&gt; Cloud Run functions shine for single-purpose endpoints: webhooks, event handlers, lightweight APIs, scheduled tasks. A good example from my own workflow: when I build AI-powered front-end apps, I never call the LLM API directly from the client. That would mean shipping my API keys to the browser. Instead, I deploy a Cloud Run function that sits between my front end and the LLM provider. The function validates the user's authorization, makes the LLM call with my credentials server-side, and returns the response. My keys never leave the server. It takes minutes to set up, and it's exactly the kind of single-purpose endpoint where a function is the right fit. The moment you need multiple routes, middleware, or background processing within the same service, a full Cloud Run service with your own HTTP framework gives you that control. It's not a matter of which is "better." It's about matching the model to the job.&lt;/p&gt;

&lt;p&gt;Supported runtimes include Node.js, Python, Go, Java, .NET, Ruby, and PHP.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud Run Jobs: Run to Completion
&lt;/h2&gt;

&lt;p&gt;Cloud Run services and functions are request-driven: they wait for traffic and respond to it. But not every workload fits that model. What about a nightly database export, a batch of image transformations, or a data pipeline that processes a million rows and then exits?&lt;/p&gt;

&lt;p&gt;That's what &lt;a href="https://cloud.google.com/run/docs/create-jobs?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Run jobs&lt;/a&gt; are for. Instead of listening for requests, a job runs your container to completion and stops. No HTTP endpoint, no scaling based on traffic. You tell it what to do, it does it, and it's done.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run &lt;span class="nb"&gt;jobs &lt;/span&gt;create my-etl-job &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; us-docker.pkg.dev/my-project/repo/etl:v1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tasks&lt;/span&gt; 100 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--task-timeout&lt;/span&gt; 30m &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-retries&lt;/span&gt; 3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run &lt;span class="nb"&gt;jobs &lt;/span&gt;execute my-etl-job &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first command creates the job. The second runs it. You can also run jobs &lt;a href="https://cloud.google.com/run/docs/execute/jobs-on-schedule?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;on a schedule&lt;/a&gt; using &lt;a href="https://cloud.google.com/scheduler?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Scheduler&lt;/a&gt;, or trigger them from workflows and event-driven pipelines.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;--tasks&lt;/code&gt; flag is where it gets interesting. A job can run up to &lt;strong&gt;10,000 parallel tasks&lt;/strong&gt;, each receiving a &lt;code&gt;CLOUD_RUN_TASK_INDEX&lt;/code&gt; environment variable (0 through 9,999) so it knows which chunk of work to handle. Need to process a million images? Create a job with 1,000 tasks, each processing 1,000 images. Cloud Run runs them in parallel, retries any that fail (up to &lt;code&gt;--max-retries&lt;/code&gt;), and reports the result.&lt;/p&gt;
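&lt;p&gt;Inside each task, the sharding logic is a few lines (a sketch; Cloud Run jobs also set &lt;code&gt;CLOUD_RUN_TASK_COUNT&lt;/code&gt; alongside the index):&lt;/p&gt;

```python
import os

def shard(items, index=None, count=None):
    # Cloud Run jobs inject CLOUD_RUN_TASK_INDEX and CLOUD_RUN_TASK_COUNT
    # into every task's environment; each task keeps every count-th item
    # starting at its own index.
    if index is None:
        index = int(os.environ.get("CLOUD_RUN_TASK_INDEX", 0))
    if count is None:
        count = int(os.environ.get("CLOUD_RUN_TASK_COUNT", 1))
    return [item for i, item in enumerate(items) if i % count == index]

images = ["img-%d.png" % i for i in range(1000)]
print(len(shard(images, index=3, count=100)))  # task 3 of a 100-task job gets 10 images
```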

&lt;p&gt;Task timeouts go up to &lt;strong&gt;168 hours (7 days)&lt;/strong&gt;, or 1 hour with GPU, compared to the 60-minute request timeout on services. This makes jobs the natural fit for workloads that take hours to complete.&lt;/p&gt;

&lt;p&gt;Jobs get the same infrastructure benefits as services: the same Borg scheduling, the same container isolation, the same scaling. The difference is the execution model. Services are long-lived and request-driven. Jobs are ephemeral and task-driven. Both are first-class Cloud Run workload types.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Cloud Run Is NOT the Right Choice
&lt;/h2&gt;

&lt;p&gt;Every platform has boundaries. Cloud Run's have narrowed significantly over the past two years (sidecars, GPU support, volume mounts, and worker pools have all landed), but real limits remain. Knowing them in advance saves you from the painful realization six months into a project that you're fighting the platform instead of building on it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Statelessness by Design
&lt;/h3&gt;

&lt;p&gt;Cloud Run instances are ephemeral. They can be created and destroyed at any moment. If your architecture requires any of the following, you need to understand the trade-offs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Local disk persistence beyond the instance lifecycle.&lt;/strong&gt; The local filesystem is ephemeral. When the instance is gone, so is everything on disk. That said, Cloud Run now supports mounting &lt;a href="https://cloud.google.com/run/docs/configuring/services/cloud-storage-volume-mounts?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Storage buckets via FUSE&lt;/a&gt; and &lt;a href="https://cloud.google.com/run/docs/configuring/services/nfs-volume-mounts?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;NFS file shares via Filestore&lt;/a&gt;, giving you read/write access to persistent shared storage. Cloud Storage mounts are eventually consistent (no file locking, last write wins), while Filestore gives you full POSIX semantics. Neither is local disk, but for many use cases they close the gap.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;In-memory caching shared across instances.&lt;/strong&gt; There are no sticky sessions by default (though &lt;a href="https://cloud.google.com/run/docs/configuring/session-affinity?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;session affinity&lt;/a&gt; is available on a best-effort basis). Each request might hit a different instance. If you need shared state, you need an external store like &lt;a href="https://redis.io/" rel="noopener noreferrer"&gt;Redis&lt;/a&gt; or &lt;a href="https://cloud.google.com/memorystore?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Memorystore&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;WebSocket connections that must survive beyond ~60 minutes.&lt;/strong&gt; Cloud Run supports WebSockets, and combined with session affinity this works well for real-time applications. But connections are limited to approximately 60 minutes (the maximum &lt;a href="https://cloud.google.com/run/docs/configuring/request-timeout?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;request timeout&lt;/a&gt;). If you need connections that live for hours or days, you need dedicated infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Long-running background workers without HTTP triggers.&lt;/strong&gt; Cloud Run services are request-driven. But this boundary is softening: &lt;a href="https://cloud.google.com/run/docs/deploy-worker-pools?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Run worker pools&lt;/a&gt; (currently in preview) are designed for pull-based workloads like Kafka consumers and Pub/Sub subscribers, with no public HTTP endpoint required and up to 40% lower pricing than standard services.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
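&lt;p&gt;The volume mounts mentioned above are declared at deploy time. For example, mounting a Cloud Storage bucket (the service and bucket names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-api \
  --source . \
  --region us-central1 \
  --add-volume name=assets,type=cloud-storage,bucket=my-assets-bucket \
  --add-volume-mount volume=assets,mount-path=/mnt/assets
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;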

&lt;p&gt;Teams that need truly stateful workloads (ML model serving with warm caches that must survive across deploys, game servers with persistent connections beyond 60 minutes) find &lt;a href="https://cloud.google.com/kubernetes-engine?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;GKE's&lt;/a&gt; persistent volumes and StatefulSets a more honest fit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Container Support: Better Than Before, Not Kubernetes
&lt;/h3&gt;

&lt;p&gt;Cloud Run now supports &lt;a href="https://cloud.google.com/run/docs/configuring/services/containers?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;multi-container instances (sidecars)&lt;/a&gt;. You can run up to 10 containers per instance sharing the same network namespace and in-memory volumes. This enables patterns like running &lt;a href="https://cloud.google.com/run/docs/internet-proxy-nginx-sidecar?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Nginx as a reverse proxy&lt;/a&gt;, &lt;a href="https://cloud.google.com/run/docs/tutorials/custom-metrics-opentelemetry-sidecar?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;OpenTelemetry collectors&lt;/a&gt; for custom metrics, Envoy for traffic management, or Prometheus for metric export.&lt;/p&gt;

&lt;p&gt;But it's not full Kubernetes pod topology. The key differences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Only one container receives inbound HTTP traffic&lt;/strong&gt; (the "ingress container"). Sidecars can't independently serve external requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No init containers.&lt;/strong&gt; You can control startup ordering (sidecar starts before ingress container), but unlike Kubernetes init containers, sidecars keep running. They don't run to completion before the main container starts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Maximum 10 containers per instance.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most sidecar patterns (proxies, observability agents, log processors), Cloud Run's implementation is sufficient. For complex pod topologies with init containers and multiple ingress points, GKE remains the answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Networking Depth
&lt;/h3&gt;

&lt;p&gt;Cloud Run's networking has improved with &lt;a href="https://cloud.google.com/run/docs/configuring/vpc-direct-vpc" rel="noopener noreferrer"&gt;Direct VPC egress&lt;/a&gt; (placing instances directly on your VPC without a connector), but teams still hit walls with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Service mesh requirements.&lt;/strong&gt; &lt;a href="https://istio.io/" rel="noopener noreferrer"&gt;Istio&lt;/a&gt; and &lt;a href="https://cloud.google.com/anthos/service-mesh" rel="noopener noreferrer"&gt;Anthos Service Mesh&lt;/a&gt; are native in GKE. You can run an Envoy sidecar on Cloud Run, but a full service mesh with mTLS, traffic policies, and observability across services is a different story.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pod-to-pod direct communication&lt;/strong&gt; without going through load balancers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom network policies&lt;/strong&gt; for zero-trust internal segmentation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-cluster routing and traffic mirroring.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your architecture involves sophisticated network topologies or strict internal traffic control, GKE gives you the knobs. Cloud Run gives you simplicity at the cost of that control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cold Start Economics
&lt;/h3&gt;

&lt;p&gt;Cloud Run's &lt;a href="https://cloud.google.com/run/docs/configuring/min-instances" rel="noopener noreferrer"&gt;minimum instances&lt;/a&gt; feature mitigates cold starts, and &lt;a href="https://cloud.google.com/run/docs/configuring/services/cpu" rel="noopener noreferrer"&gt;Startup CPU Boost&lt;/a&gt; temporarily doubles CPU during initialization to get instances ready faster. For many workloads, these two features together make cold starts a non-issue.&lt;/p&gt;
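&lt;p&gt;Both mitigations are single flags on an existing service (the service name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services update my-api \
  --region us-central1 \
  --min-instances 1 \
  --cpu-boost
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;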

&lt;p&gt;But if your latency requirements are strict and you end up keeping instances always-on, you've lost the serverless cost model. You're now paying for always-on compute. And once you're paying for always-on instances anyway, the economic argument shifts toward GKE, where you have more control over resource packing, node utilization, and cost optimization across multiple services sharing the same cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Workload Heterogeneity
&lt;/h3&gt;

&lt;p&gt;Cloud Run primarily targets HTTP/gRPC workloads, though it keeps expanding. &lt;a href="https://cloud.google.com/run/docs/create-jobs" rel="noopener noreferrer"&gt;Cloud Run jobs&lt;/a&gt; handle batch processing, and &lt;a href="https://cloud.google.com/run/docs/configuring/services/gpu" rel="noopener noreferrer"&gt;GPU support&lt;/a&gt; makes AI/ML inference possible with scale-to-zero economics. The NVIDIA L4 (24 GB VRAM) is generally available, and the NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM) is available in preview.&lt;/p&gt;

&lt;p&gt;But the moment a team needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Daemonsets&lt;/strong&gt; for node-level operations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Priority classes and preemption policies&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple GPUs per instance&lt;/strong&gt; (Cloud Run supports only one)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large models beyond single-GPU capacity&lt;/strong&gt; (the L4's 24 GB VRAM limits you to ~9B parameters, though the RTX PRO 6000 with 96 GB VRAM expands this significantly in preview)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...GKE becomes the natural fit. Cloud Run is opinionated about what it runs. That opinion keeps getting broader, but it has limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Migration Path: Cloud Run to Kubernetes
&lt;/h2&gt;

&lt;p&gt;Here's the good news: if you start on Cloud Run and later need to move to Kubernetes, the migration path is straightforward, at least for the container itself.&lt;/p&gt;

&lt;p&gt;If you deployed with a Docker image, &lt;strong&gt;that same image runs on GKE without modification&lt;/strong&gt;. Your container doesn't know or care whether it's running on Cloud Run or Kubernetes. It listens on a port, responds to HTTP requests, and that's it.&lt;/p&gt;

&lt;p&gt;If you deployed from source code (using Cloud Run's buildpack-based deployment), converting to a Docker image is trivial. You're adding a &lt;code&gt;Dockerfile&lt;/code&gt; to a project that already works. The application code, the dependencies, the runtime behavior, none of that changes. You're just making the packaging step explicit instead of letting buildpacks handle it.&lt;/p&gt;
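&lt;p&gt;For example, a buildpack-deployed Python service might gain a &lt;code&gt;Dockerfile&lt;/code&gt; like the one below. This is a sketch, not the one true layout: the &lt;code&gt;requirements.txt&lt;/code&gt;, &lt;code&gt;gunicorn&lt;/code&gt;, and &lt;code&gt;main:app&lt;/code&gt; names are assumptions about the project:&lt;/p&gt;

```dockerfile
# Sketch of an explicit Dockerfile for a service that buildpacks
# previously packaged automatically. All names are illustrative.
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Cloud Run injects the listening port via $PORT; defaulting to 8080
# lets the same image also run unmodified on GKE.
CMD ["sh", "-c", "exec gunicorn --bind :${PORT:-8080} main:app"]
```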

&lt;p&gt;But here's the honest part: &lt;strong&gt;the container isn't the hard part of the migration. The redesign is.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You're moving to Kubernetes &lt;em&gt;because&lt;/em&gt; you need something Cloud Run doesn't offer: complex pod topologies, full service mesh, multi-GPU inference, or unlimited connection lifetimes. That means you're not just moving a container; you're evolving your architecture. What was an in-memory cache on Cloud Run becomes a Redis cluster backed by persistent volumes on GKE. What was a single ingress container with an Envoy sidecar becomes a pod with init containers, network policies, and custom scheduling rules. The container image stays the same; everything around it changes.&lt;/p&gt;
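&lt;p&gt;To make that concrete, here is a minimal sketch of the same container expressed as a GKE Deployment. The names and image path are placeholders, and a real migration would layer on the init containers, network policies, and scheduling rules described above:&lt;/p&gt;

```yaml
# Minimal illustrative Deployment: the image reference is the same
# one Cloud Run was serving; everything around it is new.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: app
          image: us-docker.pkg.dev/my-project/my-repo/my-service:latest
          ports:
            - containerPort: 8080
```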

&lt;p&gt;The container is portable. The architecture might not be. And that's fine. It's the right trade-off. Cloud Run lets you start fast, validate your idea with real users, and build confidence in the solution. When you hit the boundaries we discussed above, you graduate to Kubernetes with a proven container and a clear understanding of what you actually need.&lt;/p&gt;

&lt;p&gt;Cloud Run isn't a dead end. It's a deliberate starting point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So here's the decision framework:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with Cloud Run&lt;/strong&gt; when you're building a containerized service and you want to move fast without worrying about infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stay on Cloud Run&lt;/strong&gt; as long as your workload fits its model: stateless, request-driven, with scaling needs that the platform handles naturally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graduate to GKE&lt;/strong&gt; when you hit the boundaries. You'll know when you do, because you'll be fighting the platform instead of building on it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Your container is the unit of portability. Whether it ends up on Cloud Run, GKE, or another platform entirely, the work you put into building and packaging it is never wasted. That's not unique to Cloud Run. It's the power of containers in general. But Cloud Run is the fastest way I've found to prove that a container works in production, without the upfront investment that usually requires.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/kulaone/this-is-cloud-run-nine-ways-to-deploy-and-when-to-use-each-506b"&gt;Part 2&lt;/a&gt;&lt;/strong&gt; gets hands-on: the different ways to deploy to Cloud Run (there are more than you'd expect). &lt;strong&gt;Part 3 is coming soon&lt;/strong&gt;: we'll dive into the configuration options that let you tune CPU, memory, scaling, networking, and security for your specific needs. Follow so you don't miss it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs" rel="noopener noreferrer"&gt;Cloud Run documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/pricing" rel="noopener noreferrer"&gt;Cloud Run pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://research.google/pubs/large-scale-cluster-management-at-google-with-borg/" rel="noopener noreferrer"&gt;Large-scale cluster management at Google with Borg (research paper)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/security/beyondprod" rel="noopener noreferrer"&gt;BeyondProd: A new approach to cloud-native security&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/docs/security/binary-authorization-for-borg" rel="noopener noreferrer"&gt;Binary Authorization for Borg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://knative.dev/docs/serving/" rel="noopener noreferrer"&gt;Knative Serving&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gvisor.dev/" rel="noopener noreferrer"&gt;gVisor: Container sandbox runtime&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/writing-a-function" rel="noopener noreferrer"&gt;Cloud Run functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/functions" rel="noopener noreferrer"&gt;Cloud Functions (predecessor)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/functions/docs/functions-framework" rel="noopener noreferrer"&gt;Functions Framework&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/execution-environments" rel="noopener noreferrer"&gt;Cloud Run execution environments&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/containers" rel="noopener noreferrer"&gt;Cloud Run multi-container (sidecar) support&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/blog/products/serverless/cloud-run-now-supports-multi-container-deployments" rel="noopener noreferrer"&gt;Cloud Run now supports multi-container deployments (blog post)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/gpu" rel="noopener noreferrer"&gt;Cloud Run GPU support&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/deploy-worker-pools" rel="noopener noreferrer"&gt;Cloud Run worker pools&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/cloud-storage-volume-mounts" rel="noopener noreferrer"&gt;Cloud Storage volume mounts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/nfs-volume-mounts" rel="noopener noreferrer"&gt;NFS volume mounts (Filestore)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/cpu" rel="noopener noreferrer"&gt;Startup CPU Boost&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/session-affinity" rel="noopener noreferrer"&gt;Session affinity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/vpc-direct-vpc" rel="noopener noreferrer"&gt;Direct VPC egress&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/docs/security/infrastructure/design#google-front-end-service" rel="noopener noreferrer"&gt;Google Front End service&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/about-instance-autoscaling" rel="noopener noreferrer"&gt;Cloud Run instance autoscaling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/min-instances" rel="noopener noreferrer"&gt;Cloud Run minimum instances&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/managing/revisions" rel="noopener noreferrer"&gt;Cloud Run revisions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/rollouts-rollbacks-traffic-migration" rel="noopener noreferrer"&gt;Cloud Run traffic splitting (rollouts and rollbacks)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/logging" rel="noopener noreferrer"&gt;Cloud Logging&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/monitoring" rel="noopener noreferrer"&gt;Cloud Monitoring&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/iam" rel="noopener noreferrer"&gt;Cloud IAM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/create-jobs" rel="noopener noreferrer"&gt;Cloud Run jobs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/execute/jobs-on-schedule" rel="noopener noreferrer"&gt;Execute jobs on a schedule&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/scheduler" rel="noopener noreferrer"&gt;Cloud Scheduler&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/memorystore" rel="noopener noreferrer"&gt;Memorystore (managed Redis)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/kubernetes-engine" rel="noopener noreferrer"&gt;Google Kubernetes Engine (GKE)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cloudrun</category>
      <category>gcp</category>
      <category>serverless</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Automate Your GitHub Workflow with Gemini CLI</title>
      <dc:creator>Daniel Gwerzman</dc:creator>
      <pubDate>Sun, 11 Jan 2026 10:58:08 +0000</pubDate>
      <link>https://forem.com/gde/automate-your-github-workflow-with-gemini-cli-4p4e</link>
      <guid>https://forem.com/gde/automate-your-github-workflow-with-gemini-cli-4p4e</guid>
      <description>&lt;h3&gt;
  
  
  Automate Your GitHub Workflow: Meet Your New AI Coding Partner
&lt;/h3&gt;

&lt;p&gt;Google dropped something that caught my attention back at Cloud Next 25 Tokyo: a new AI coding teammate that lives directly in your GitHub repository. It’s called the Gemini CLI GitHub Action, and it can automatically triage issues, fix bugs, and even review your pull requests.&lt;/p&gt;

&lt;p&gt;The best part? The Google team built this tool for themselves to handle the flood of requests on their own Gemini CLI repository, and now they’re sharing it with the rest of us. I spent some time testing it on my personal Flutter project, and I want to walk you through exactly how to set it up and what you can expect.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Exactly Does This Tool Do?
&lt;/h3&gt;

&lt;p&gt;Before we dive into setup, let’s clarify what we’re working with. The Gemini CLI GitHub Action gives you three main capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Issue Triage&lt;/strong&gt;: It reads new issues and automatically labels them (bug, feature, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Fixes&lt;/strong&gt;: It can analyze issues and write code to solve them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pull Request Reviews&lt;/strong&gt;: It provides automated code reviews with suggestions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it as having an AI developer on your team that works 24/7, never gets tired, and can handle the routine tasks that eat up your time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://youtu.be/CDeVrgTBl6E" rel="noopener noreferrer"&gt;Full review on YouTube&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites: What You Need to Know
&lt;/h3&gt;

&lt;p&gt;This guide assumes you have basic familiarity with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub repositories (creating repos, issues, and pull requests)&lt;/li&gt;
&lt;li&gt;Command line basics&lt;/li&gt;
&lt;li&gt;Having a project you can test with&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re new to GitHub, spend some time with their &lt;a href="https://docs.github.com/en/get-started" rel="noopener noreferrer"&gt;getting started guide&lt;/a&gt; first. You’ll need to be comfortable creating issues and managing repositories.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step-by-Step Setup Guide
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Step 1: Install and Update Gemini CLI
&lt;/h3&gt;

&lt;p&gt;First, make sure you have the Gemini CLI installed and updated to the latest version. If you don’t have it yet, check the &lt;a href="https://github.com/google-gemini/gemini-cli" rel="noopener noreferrer"&gt;official installation guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Once installed, verify you’re on the latest version — this is crucial for the GitHub integration to work properly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Set Up the GitHub Action
&lt;/h3&gt;

&lt;p&gt;Navigate to your project directory and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gemini setup github
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command takes just a few seconds and adds the necessary GitHub Action files to your repository. You’ll see new files in the &lt;code&gt;.github/workflows&lt;/code&gt; directory that define how the AI teammate will respond to different events.&lt;/p&gt;
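&lt;p&gt;To give a feel for their shape, here is a heavily simplified sketch of what one of those workflow files looks like. This is illustrative only, not the generated file; the real workflows are more elaborate and invoke the Gemini CLI directly where the placeholder step sits:&lt;/p&gt;

```yaml
# Illustrative sketch only -- inspect the files the setup command
# actually generated in .github/workflows for the real definitions.
name: gemini-issue-triage
on:
  issues:
    types: [opened, reopened]
  issue_comment:
    types: [created]
jobs:
  triage:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # The generated workflow invokes the Gemini CLI here, reading the
      # API key from the repository secret you configure in Step 4.
      - name: Triage with Gemini
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        run: echo "placeholder for the generated Gemini CLI step"
```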

&lt;h3&gt;
  
  
  Step 3: Commit and Push
&lt;/h3&gt;

&lt;p&gt;Don’t forget this crucial step! Commit all the new files and push them to GitHub. The actions won’t work until they’re actually in your repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git add .
git commit -m "Add Gemini CLI GitHub Actions"
git push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Add Your Gemini API Key
&lt;/h3&gt;

&lt;p&gt;Here’s the step that’s not immediately obvious from the documentation but is absolutely essential:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to your GitHub repository&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Settings&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Security&lt;/strong&gt; → &lt;strong&gt;Secrets and Variables&lt;/strong&gt; → &lt;strong&gt;Actions&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;New repository secret&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Name it &lt;code&gt;GEMINI_API_KEY&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;For the value, you’ll need to get your API key from Google AI Studio&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To get your API key:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open &lt;a href="http://ai.dev" rel="noopener noreferrer"&gt;ai.dev&lt;/a&gt; in a new tab&lt;/li&gt;
&lt;li&gt;This takes you to Google AI Studio&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Get API key&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;Create new API key&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Copy the generated string and paste it as your secret value&lt;/li&gt;
&lt;/ul&gt;
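&lt;p&gt;If you prefer the terminal, the same secret can be added with the GitHub CLI. The repository slug below is a placeholder:&lt;/p&gt;

```shell
# Prompts for the secret value; paste the key from Google AI Studio.
gh secret set GEMINI_API_KEY --repo your-user/your-repo
```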

&lt;h3&gt;
  
  
  Testing Your New AI Teammate
&lt;/h3&gt;

&lt;p&gt;Now for the fun part! Let’s see what this thing can actually do.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing Issue Triage
&lt;/h3&gt;

&lt;p&gt;I started by creating a new issue in my Flutter shopping list project. Here’s what I wrote:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“When a user hit enter on edit mode of an item, the system will end the edit mode and update as usual. and open a new empty item under the edited item that is focus and ready to be edit.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Admittedly, this is a pretty vague description; normally I’d provide much more detail when working with AI tools. But I wanted to see how well it handled unclear requirements.&lt;/p&gt;

&lt;p&gt;To trigger the triage, I added this comment to the issue:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@gemini-cli triage this issue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Within a few minutes, the action ran and automatically labeled my issue as a “bug.” Looking at the action logs, I could see exactly how it made this decision — it analyzed the issue description and determined this was describing missing functionality rather than a new feature request.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing Code Fixes
&lt;/h3&gt;

&lt;p&gt;Next came the real test. I commented:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@gemini-cli fix this issue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where things got interesting. The AI spent several minutes analyzing my code, creating a detailed plan with checkboxes, and then implementing the changes. Unlike my usual coding workflow where I see every change being made, this felt completely autonomous. I just watched it work and waited for the results.&lt;/p&gt;

&lt;p&gt;The process looked like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read and analyze the issue&lt;/li&gt;
&lt;li&gt;Examine the existing codebase&lt;/li&gt;
&lt;li&gt;Create a detailed implementation plan&lt;/li&gt;
&lt;li&gt;Execute the plan step by step&lt;/li&gt;
&lt;li&gt;Create a new branch with the changes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2kg80ombvgpgjgjh7sy.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2kg80ombvgpgjgjh7sy.webp" alt="Github screenshot" width="800" height="516"&gt;&lt;/a&gt;&lt;br&gt;
When it finished, I had a new branch called something like “fix-enter-key-functionality” with actual working code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing Pull Request Review
&lt;/h3&gt;

&lt;p&gt;The final piece was testing the automated code review. I manually created a pull request from the branch the AI had created (it couldn’t create the PR automatically for some reason).&lt;/p&gt;

&lt;p&gt;Almost immediately, another action triggered: the pull request review. After about three minutes, I had a comprehensive code review with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A summary of the changes&lt;/li&gt;
&lt;li&gt;Specific feedback on potential issues&lt;/li&gt;
&lt;li&gt;Security and performance considerations&lt;/li&gt;
&lt;li&gt;An overall assessment (in my case, it was marked as low-to-medium risk)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What Actually Worked (And What Didn’t)
&lt;/h3&gt;

&lt;p&gt;Let’s be honest about the results. The good news: the code actually worked! When I tested my Flutter app, pressing enter while editing an item did indeed update the item and create a new one. The functionality was implemented correctly despite my vague description.&lt;/p&gt;

&lt;p&gt;However, there were some limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The AI had trouble with Flutter/Dart compared to what I’d expect with JavaScript projects (there’s simply more training data available for web technologies)&lt;/li&gt;
&lt;li&gt;It took longer to debug and correct Flutter-specific issues&lt;/li&gt;
&lt;li&gt;The autonomous nature felt strange compared to tools like Cursor where I can approve changes line by line&lt;/li&gt;
&lt;li&gt;It made some unwanted changes to my configuration files&lt;/li&gt;
&lt;li&gt;The pull request creation failed, requiring manual intervention&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pro Tips for Better Results
&lt;/h3&gt;

&lt;p&gt;Based on my testing, here are some recommendations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write Better Issue Descriptions&lt;/strong&gt;: Even though my vague description worked, you’ll get much better results with detailed requirements, acceptance criteria, and context about your project. And always break the work into small pieces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Different AI Tools for Different Tasks&lt;/strong&gt;: I recommend using different AI tools for writing code versus reviewing it. For example, if Claude writes the code, use Gemini or another tool for the review. This provides a fresh perspective and catches issues the original tool might miss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Review the Prompts&lt;/strong&gt;: One of the best features is that all the prompts are visible and customizable. Check out the &lt;code&gt;.github/workflows&lt;/code&gt; files to see exactly what instructions the AI is following, and modify them to match your team's standards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start Small&lt;/strong&gt;: Test this on smaller, non-critical repositories first. Get comfortable with how it works before deploying it on your main projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Bottom Line: Should You Try It?
&lt;/h3&gt;

&lt;p&gt;Absolutely. Even with its limitations, this tool represents a significant step forward in AI-assisted development. It’s particularly valuable for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Solo developers who want help with routine tasks&lt;/li&gt;
&lt;li&gt;Open source maintainers dealing with lots of issues and PRs&lt;/li&gt;
&lt;li&gt;Teams looking to standardize their review process&lt;/li&gt;
&lt;li&gt;Anyone curious about autonomous AI coding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The setup is straightforward, it’s free to use (you just pay for Gemini API usage), and the prompts are completely customizable. Even if you don’t use it as-is, studying the prompts and workflow structure provides excellent insights into effective AI automation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ready to Get Started?
&lt;/h3&gt;

&lt;p&gt;Here’s your action plan:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick a test repository (not your main project!)&lt;/li&gt;
&lt;li&gt;Follow the setup steps above&lt;/li&gt;
&lt;li&gt;Create a simple issue to test triage functionality&lt;/li&gt;
&lt;li&gt;Try the fix command on something small&lt;/li&gt;
&lt;li&gt;Experiment with customizing the prompts to match your workflow&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Remember, this is still early-stage technology. Approach it with curiosity rather than expecting perfection, and you’ll likely find some genuinely useful automation for your development workflow.&lt;/p&gt;

&lt;p&gt;The future of coding is increasingly collaborative between humans and AI. Tools like this give us a glimpse of what that partnership might look like — and honestly, it’s pretty exciting.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Have you tried the Gemini CLI GitHub Action? I’d love to hear about your experience and any creative ways you’ve customized it for your projects.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gemini</category>
      <category>git</category>
    </item>
  </channel>
</rss>
