<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Daniel Gwerzman</title>
    <description>The latest articles on Forem by Daniel Gwerzman (@kulaone).</description>
    <link>https://forem.com/kulaone</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3700947%2F9cbf0498-ff26-48a2-b5f6-bf03ce467f94.webp</url>
      <title>Forem: Daniel Gwerzman</title>
      <link>https://forem.com/kulaone</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kulaone"/>
    <language>en</language>
    <item>
      <title>Deploy Gemma 4 on Cloud Run: Pay Only When You Actually Use It</title>
      <dc:creator>Daniel Gwerzman</dc:creator>
      <pubDate>Sat, 04 Apr 2026 10:42:37 +0000</pubDate>
      <link>https://forem.com/gde/deploy-gemma-4-on-cloud-run-pay-only-when-you-actually-use-it-9ln</link>
      <guid>https://forem.com/gde/deploy-gemma-4-on-cloud-run-pay-only-when-you-actually-use-it-9ln</guid>
      <description>&lt;p&gt;Last year, Google flew me to Paris for the announcement of &lt;a href="https://blog.google/technology/developers/gemma-3/?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Gemma 3&lt;/a&gt;. It was an exciting event. The demos were impressive. But what really mattered happened later, back at my desk, when I ran my own tests and found out the demos weren't lying.&lt;/p&gt;

&lt;p&gt;Gemma 3 was the first open model that closed the gap on the big commercial ones. It didn't beat Gemini. But it reached the level Gemini was at a year earlier. For an open model you could run on your own infrastructure, that was a meaningful leap. I started integrating it into my own pipelines. Specific tasks, small steps, places where the answer doesn't need a frontier model to get it right.&lt;/p&gt;

&lt;p&gt;Then I made a mistake.&lt;/p&gt;

&lt;p&gt;I deployed Gemma 3 on &lt;a href="https://cloud.google.com/vertex-ai/docs/start/introduction-unified-platform?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Vertex AI Model Garden&lt;/a&gt; over a weekend for testing. Left it running. Didn't turn it off. Came back to a bill that made me rethink my relationship with cloud infrastructure. I made a video about it on &lt;a href="https://youtu.be/v_EWVdNPvpA" rel="noopener noreferrer"&gt;my YouTube channel&lt;/a&gt; so others wouldn't repeat the same mistake.&lt;/p&gt;

&lt;p&gt;This article is the redemption.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Gemma 4&lt;/a&gt; just launched. It's a bigger jump than Gemma 3 was. And this time, I'm deploying it on &lt;a href="https://cloud.google.com/run?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt;, which scales to zero when you're not using it. Forget to turn it off. I dare you. You won't pay a cent.&lt;/p&gt;

&lt;p&gt;This article is in two parts. The first covers what Gemma 4 is, why running your own model changes what you can build, how the deployment stack works, and the performance data from my own tests.&lt;/p&gt;

&lt;p&gt;The second part is the step-by-step deployment guide. Prerequisites, VPC setup, model upload, deploy commands for all four model sizes, and cleanup.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: Understanding Gemma 4 on Cloud Run
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Changed in Gemma 4
&lt;/h3&gt;

&lt;p&gt;Gemma 4 ships as four distinct models, not one. Two small ones and two large ones, each with a different tradeoff.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Parameters&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;Context Window&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;E2B&lt;/td&gt;
&lt;td&gt;2.3B effective&lt;/td&gt;
&lt;td&gt;Dense&lt;/td&gt;
&lt;td&gt;128k tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E4B&lt;/td&gt;
&lt;td&gt;4.5B effective&lt;/td&gt;
&lt;td&gt;Dense&lt;/td&gt;
&lt;td&gt;128k tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;26B A4B&lt;/td&gt;
&lt;td&gt;26B total, 4B active&lt;/td&gt;
&lt;td&gt;MoE&lt;/td&gt;
&lt;td&gt;256k tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;31B&lt;/td&gt;
&lt;td&gt;31B&lt;/td&gt;
&lt;td&gt;Dense&lt;/td&gt;
&lt;td&gt;256k tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 26B model deserves a closer look. It uses &lt;strong&gt;Mixture of Experts&lt;/strong&gt; (MoE) architecture: a design where the model has 26 billion parameters on disk but only activates 4 billion of them per token during inference. Think of it like a large team of specialists where only the relevant experts are called in for each task, rather than everyone working on every problem. The result: capability that approaches a 26B model, at the compute cost of a 4B one. This matters enormously at inference time, as you'll see in the numbers below.&lt;/p&gt;

&lt;p&gt;Beyond size, Gemma 4 adds &lt;strong&gt;multimodal input&lt;/strong&gt;. Images, audio, video: all supported as inputs, with text output. The small models (E2B, E4B) can process video with audio. The larger ones handle images with extended context.&lt;/p&gt;

&lt;p&gt;But the two improvements that matter most for anyone building agentic pipelines are &lt;strong&gt;reasoning&lt;/strong&gt; and &lt;strong&gt;function calling&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Reasoning means the model works through a problem step by step before producing an answer, rather than jumping straight to a response. Complex tasks that previously required a frontier model can now be handled by a reasoning-capable Gemma 4 at a fraction of the cost. Function calling has also been significantly improved: the model reliably returns structured tool calls, which is what makes it composable inside an agent that orchestrates multiple steps.&lt;/p&gt;
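&lt;p&gt;To make that concrete, here's a sketch of a function-calling request against the OpenAI-compatible endpoint vLLM exposes. The service URL, model name, and tool definition are all placeholders, and the exact schema depends on your vLLM version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical request: SERVICE_URL is your deployed Cloud Run URL.
curl "${SERVICE_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;A model with reliable function calling answers with a structured &lt;code&gt;tool_calls&lt;/code&gt; entry naming &lt;code&gt;get_weather&lt;/code&gt; and its arguments, rather than free text. That structure is what an orchestrator can act on.&lt;/p&gt;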

&lt;p&gt;Together, these two capabilities change where Gemma fits in a production system. My own pipelines split work into layers: small, focused steps that don't need deep reasoning, and orchestration steps that evaluate results and decide what happens next. Gemma 3 could handle the simple steps. Gemma 4 can handle more of the middle layer too, the tasks that previously needed a bigger, more expensive model to get right. Every step moved from a frontier API to a self-hosted Gemma is tokens you stop paying for.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Running Your Own Model Changes What You Can Build
&lt;/h3&gt;

&lt;p&gt;First, let's be precise about what "open" means here. Gemma 4 is not open source. The training code, the training data, and the full recipe that produced the model are not public. What Google releases are the &lt;strong&gt;weights&lt;/strong&gt;, the trained parameters of the model itself, under the &lt;a href="https://www.apache.org/licenses/LICENSE-2.0" rel="noopener noreferrer"&gt;Apache 2.0 license&lt;/a&gt;. You can download them, run them, modify them, and build commercial products on top of them. But you can't reproduce the training process.&lt;/p&gt;

&lt;p&gt;That distinction matters less than people think for most use cases, and more than people think for one specific one: &lt;strong&gt;fine-tuning&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Because you have the weights, you can train on top of them. Take the 4B model, run it through your own domain-specific dataset, and produce a version that understands your terminology, follows your output format, and performs better on your specific tasks than the general-purpose model does. This is the path from "capable open model" to "model that knows your business." The data you use for fine-tuning never leaves your environment, and the result belongs to you.&lt;/p&gt;

&lt;p&gt;There's a category of problem that big cloud-hosted models can't solve for certain customers. Not because the models aren't capable, but because the data can't leave the building.&lt;/p&gt;

&lt;p&gt;Healthcare providers, financial institutions, legal firms. Organizations with serious data privacy obligations can't pipe sensitive information through external APIs. For them, powerful AI has always meant exposing data to someone else's infrastructure. That's a non-starter in many regulated environments.&lt;/p&gt;

&lt;p&gt;A self-hosted Gemma on your own Cloud Run service changes the equation. The model runs in your project, your VPC, your infrastructure. The data never leaves.&lt;/p&gt;

&lt;p&gt;But the most compelling example isn't in an office. It's in a field.&lt;/p&gt;

&lt;p&gt;Imagine a drone with cameras flying over farmland, using computer vision and reasoning to detect crop diseases, identify irrigation problems, or spot pest damage. That drone can't wait for a round-trip API call to a cloud endpoint. It might not even have a reliable internet connection in a remote countryside location. The decision needs to happen on the device, or close to it. And it needs to happen fast.&lt;/p&gt;

&lt;p&gt;Gemma 4's multimodal capability, combined with its small model sizes, makes that kind of on-device or edge deployment practical. A 2B or 4B model can run on hardware that a drone or industrial sensor could realistically carry or connect to. The reasoning capability means it can do more than classify. It can think through what it's seeing.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Gemma 4 Gets Deployed: The Stack
&lt;/h3&gt;

&lt;p&gt;When Gemma 3 came out, the typical deployment used &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;, a tool designed for running models locally. It's simple, works well for small models, and you can get something running quickly. Ollama bakes the model weights directly into the container image. The container starts, the model is already there, and you're serving in seconds. For a 2B or 7B model this is fine.&lt;/p&gt;

&lt;p&gt;Gemma 4's larger models break that pattern. A 31B model doesn't fit comfortably in a container image. You can't bake 65GB of weights into something you expect to deploy and scale quickly. Ollama also doesn't expose the production controls you need at scale: request batching, KV-cache sharing across concurrent requests, quantization that doesn't sacrifice accuracy. It's a great local tool. It's not designed for what we're building here.&lt;/p&gt;

&lt;p&gt;Gemma 4's official path uses &lt;a href="https://docs.vllm.ai/en/latest/?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt; instead. vLLM is an inference engine built specifically for serving LLMs in production. It handles multiple concurrent requests efficiently by batching them together, shares the KV-cache across requests to reduce memory pressure, and supports fp8 quantization, which is what lets the 31B model fit inside the &lt;a href="https://cloud.google.com/run/docs/configuring/services/gpu?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;NVIDIA RTX Pro 6000 Blackwell&lt;/a&gt;'s 96GB of VRAM without meaningful quality loss. No cluster to manage, no node pool to configure. One flag in your deploy command.&lt;/p&gt;
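&lt;p&gt;The quantization piece really is a single flag on the vLLM side. A sketch of the serve command the container would run (the model id here is a placeholder; check the model card for the real one):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# fp8 quantization is what fits the 31B model into 96GB of VRAM.
vllm serve google/gemma-4-31b-it \
  --quantization fp8 \
  --port 8080  # Cloud Run sends traffic to the port your container listens on
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;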

&lt;p&gt;For the 26B and 31B models, there's a third piece: the &lt;a href="https://github.com/run-ai/runai-model-streamer" rel="noopener noreferrer"&gt;Run:ai Model Streamer&lt;/a&gt;. Rather than waiting for the entire model to load before serving the first request, the streamer fetches weights in parallel from &lt;a href="https://cloud.google.com/storage/docs?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Google Cloud Storage&lt;/a&gt; while vLLM initializes. The model starts accepting requests before it's fully loaded. This is what makes large model cold starts on Cloud Run feasible rather than painful.&lt;/p&gt;
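&lt;p&gt;In vLLM terms, the streamer is a load-format option pointed at a GCS path. A sketch, assuming a vLLM build with Run:ai streamer support (the bucket path is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Stream weights from GCS in parallel while the engine initializes.
vllm serve gs://your-bucket/gemma-4-26b \
  --load-format runai_streamer
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;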

&lt;h3&gt;
  
  
  The Model Loading Decision: HuggingFace vs GCS
&lt;/h3&gt;

&lt;p&gt;Before you deploy, there's one choice that affects everything else: where does the model come from at startup?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HuggingFace&lt;/strong&gt; is the simple option. The container downloads the model weights on every cold start. No storage cost, no upfront setup. The tradeoff: you're downloading over the public internet each time, and that download time dominates your cold start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google Cloud Storage&lt;/strong&gt; is the production option. You upload the weights once, and the container streams them from GCS on each cold start via the Run:ai Model Streamer. More setup upfront. But the streaming happens over Google's internal network, and the speed difference is significant.&lt;/p&gt;
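&lt;p&gt;The one-time upload can be sketched in two commands. The model id and bucket name are placeholders (the real Gemma 4 ids are on the HuggingFace model pages):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Download the weights once (the token is needed for gated models),
# then copy them into your bucket.
huggingface-cli download google/gemma-4-4b-it \
  --local-dir ./gemma-4-4b-it --token "${HF_TOKEN}"
gcloud storage cp -r ./gemma-4-4b-it "gs://your-bucket/gemma-4-4b-it"
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;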

&lt;p&gt;Here's the part that surprised me when I tested it: GCS without proper VPC configuration is actually &lt;em&gt;slower&lt;/em&gt; than HuggingFace for small models. The Run:ai streamer's advantage only materializes when traffic stays on Google's internal network. When it goes out to the public internet, the overhead eliminates the benefit.&lt;/p&gt;

&lt;p&gt;The fix is &lt;a href="https://cloud.google.com/vpc/docs/private-google-access?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Private Google Access&lt;/a&gt; on your VPC subnet. One command. It opens a route from your Cloud Run container to Google APIs (including GCS) without touching the public internet. The official documentation doesn't highlight this, and it's the single most important detail in this entire guide.&lt;/p&gt;
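&lt;p&gt;For an existing subnet, that one command looks like this (substitute your own subnet and region):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute networks subnets update your-subnet \
  --region=us-central1 \
  --enable-private-ip-google-access
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;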

&lt;h3&gt;
  
  
  The Numbers
&lt;/h3&gt;

&lt;p&gt;I deployed all four model sizes from HuggingFace and from GCS, with and without VPC, and measured cold start time (first request after scale-to-zero), time to first token, and warm response time. Same prompt every time: "What is the moon?"&lt;/p&gt;

&lt;p&gt;A note on methodology: these are single measurements, not averages from a full benchmark suite. LLMs are nondeterministic by nature, and infrastructure performance varies with load, region capacity, and network conditions. The numbers below are directionally correct, not scientifically precise. Use them to understand the relative tradeoffs between deployment options. Before committing to one approach in production, run your own tests with your own workload.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Cold Start&lt;/th&gt;
&lt;th&gt;Warm Response&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2B&lt;/td&gt;
&lt;td&gt;HuggingFace&lt;/td&gt;
&lt;td&gt;311s&lt;/td&gt;
&lt;td&gt;1.75s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2B&lt;/td&gt;
&lt;td&gt;GCS (no VPC)&lt;/td&gt;
&lt;td&gt;334s&lt;/td&gt;
&lt;td&gt;1.81s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2B&lt;/td&gt;
&lt;td&gt;GCS + VPC&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;245s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.81s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4B&lt;/td&gt;
&lt;td&gt;HuggingFace&lt;/td&gt;
&lt;td&gt;452s&lt;/td&gt;
&lt;td&gt;2.46s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4B&lt;/td&gt;
&lt;td&gt;GCS (no VPC)&lt;/td&gt;
&lt;td&gt;433s&lt;/td&gt;
&lt;td&gt;2.47s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4B&lt;/td&gt;
&lt;td&gt;GCS + VPC&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;246s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.47s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;26B&lt;/td&gt;
&lt;td&gt;GCS + VPC&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;191s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.61s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;31B&lt;/td&gt;
&lt;td&gt;GCS + VPC&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;251s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5.90s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few things to unpack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GCS without VPC is slower than HuggingFace for small models.&lt;/strong&gt; The streamer adds overhead that only pays off when the network path is fast. Over the public internet, HuggingFace wins for small files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VPC changes everything.&lt;/strong&gt; With Private Google Access, the 4B cold start drops from 433 seconds to 246 seconds. That's a 43% reduction just from routing traffic differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 26B model cold starts faster than the 2B from HuggingFace.&lt;/strong&gt; Read that again. A 26 billion parameter model, streamed over Google's internal network, is ready to serve in 191 seconds. The 2B downloading from HuggingFace takes 311 seconds. Network path and streaming architecture matter more than model size on disk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MoE vs. dense matters at inference time, not just startup.&lt;/strong&gt; The 26B warm response is 1.61 seconds. The 31B is 5.90 seconds. The 31B is a dense model: every one of its 31 billion parameters participates in every token. The 26B only activates 4 billion at a time. That's why the 26B responds nearly four times faster despite being nominally "larger." For latency-sensitive applications with a 256k context requirement, the 26B A4B is the more interesting choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scale-to-Zero and What It Means for Cost
&lt;/h3&gt;

&lt;p&gt;Cloud Run scales running instances based on traffic. When there are no requests, it scales to zero. No instances, no GPU allocated, no cost. The moment a request arrives, a new instance starts, loads the model, and serves it.&lt;/p&gt;

&lt;p&gt;This is fundamentally different from Vertex AI Model Garden, where a deployed endpoint keeps a running instance alive. Walk away for the weekend, come back to a bill.&lt;/p&gt;

&lt;p&gt;With Cloud Run's scale-to-zero, the worst case is the cold start delay. And as the numbers show, even a 26B model is ready in about three minutes. For development and testing, that tradeoff is straightforward.&lt;/p&gt;

&lt;p&gt;To verify an instance has scaled to zero: open the Cloud Run console, click your service, go to the &lt;strong&gt;Metrics&lt;/strong&gt; tab, and look at &lt;strong&gt;Instance count&lt;/strong&gt;. Zero means you're not being charged. Or just wait 5 minutes after your last request and send a new one. If it takes 200+ seconds instead of 2, the instance scaled down.&lt;/p&gt;
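&lt;p&gt;You can run that same check from the terminal, assuming &lt;code&gt;SERVICE_URL&lt;/code&gt; holds your service's URL (vLLM's lightweight &lt;code&gt;/v1/models&lt;/code&gt; endpoint is handy for this):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Seconds means a warm instance; minutes means you just triggered a cold start.
time curl -s "${SERVICE_URL}/v1/models" &gt; /dev/null
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;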

&lt;p&gt;Scale-to-zero is the right choice for development and testing. But it's not the only option. Cloud Run also lets you set a &lt;strong&gt;minimum number of instances&lt;/strong&gt; to keep alive at all times. For production serving consistent traffic, you'd configure at least one warm instance to eliminate cold starts entirely. That changes the cost model: you're paying for idle time again. But it's a deliberate tradeoff, not an accident.&lt;/p&gt;
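&lt;p&gt;Switching between the two modes is one flag on an existing service (the service name here is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Keep one instance warm: no cold starts, but you pay for idle time.
gcloud run services update gemma-service \
  --region us-central1 \
  --min-instances 1

# Back to scale-to-zero for testing.
gcloud run services update gemma-service \
  --region us-central1 \
  --min-instances 0
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;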

&lt;p&gt;The guide in this article is optimized for testing: minimal cost, maximum flexibility, scale-to-zero on everything. When you're ready to move to production and need to think about minimum instances, concurrency tuning, and traffic management, my post &lt;a href="https://dev.to/gde/this-is-cloud-run-configuration-2gi2"&gt;&lt;em&gt;"This is Cloud Run: Configuration"&lt;/em&gt;&lt;/a&gt; covers those options.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 2: The Deployment Guide
&lt;/h2&gt;

&lt;p&gt;I ran all of this myself before writing a word of this guide. Deployed every model size, hit every error, debugged every failure. The instructions below are what actually worked.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You'll Need
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A Google Cloud project with billing enabled&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/shell/docs?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Shell&lt;/a&gt; (recommended) or the &lt;a href="https://cloud.google.com/sdk/docs/install?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;&lt;code&gt;gcloud&lt;/code&gt; CLI&lt;/a&gt; installed locally. This guide assumes you're using Cloud Shell.&lt;/li&gt;
&lt;li&gt;Access to the Gemma 4 models on &lt;a href="https://huggingface.co/google" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt; (requires accepting the license for each model)&lt;/li&gt;
&lt;li&gt;A &lt;a href="https://huggingface.co/settings/tokens" rel="noopener noreferrer"&gt;HuggingFace access token&lt;/a&gt; with read access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I ran everything below from &lt;a href="https://cloud.google.com/shell/docs?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Shell&lt;/a&gt;, Google Cloud's browser-based terminal, which comes pre-authenticated with your account and with &lt;code&gt;gcloud&lt;/code&gt; already installed. No local setup, no version mismatches. Open it from the Cloud Console by clicking the terminal icon in the top right.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Set Your Environment Variables
&lt;/h3&gt;

&lt;p&gt;Set these once at the start of your Cloud Shell session. Every command in this guide uses them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-project-id"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"us-central1"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GCS_BUCKET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-model-cache"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;VPC_NETWORK&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gemma-vpc"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;VPC_SUBNET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gemma-subnet"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HF_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-huggingface-token"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PROJECT_NUMBER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud projects describe &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'value(projectNumber)'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note 1:&lt;/strong&gt; The RTX Pro 6000 is available in us-central1 and europe-west4.&lt;br&gt;
&lt;strong&gt;Note 2:&lt;/strong&gt; Cloud Shell sessions don't persist environment variables across reconnects. If you close and reopen Cloud Shell, run this block again before continuing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 2: Enable the Required APIs
&lt;/h3&gt;

&lt;p&gt;On a new project, you'll need all of these:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud services &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    run.googleapis.com &lt;span class="se"&gt;\&lt;/span&gt;
    compute.googleapis.com &lt;span class="se"&gt;\&lt;/span&gt;
    storage.googleapis.com &lt;span class="se"&gt;\&lt;/span&gt;
    artifactregistry.googleapis.com &lt;span class="se"&gt;\&lt;/span&gt;
    iam.googleapis.com &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--project&lt;/span&gt; &lt;span class="nv"&gt;$GOOGLE_CLOUD_PROJECT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;compute.googleapis.com&lt;/code&gt; is required for VPC and GPU resources. &lt;code&gt;iam.googleapis.com&lt;/code&gt; is needed to grant permissions to the Cloud Run service agent. The others cover model storage and Cloud Run itself.&lt;/p&gt;
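&lt;p&gt;To confirm everything is on before moving to the next step, you can list the enabled services:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud services list --enabled \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --filter="name:run.googleapis.com"
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;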

&lt;h3&gt;
  
  
  Step 3: Check GPU Quota (optional)
&lt;/h3&gt;

&lt;p&gt;Cloud Run GPU access requires quota approval. You can check your current allocation:&lt;/p&gt;

&lt;p&gt;Go to &lt;strong&gt;IAM &amp;amp; Admin &amp;gt; Quotas &amp;amp; System Limits&lt;/strong&gt; in the &lt;a href="https://console.cloud.google.com/iam-admin/quotas?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Google Cloud Console&lt;/a&gt;, filter by &lt;code&gt;NVIDIA_RTX_PRO_6000&lt;/code&gt; and &lt;code&gt;region:us-central1&lt;/code&gt;. If the limit is 0 or the quota doesn't appear, you need to request it.&lt;/p&gt;

&lt;p&gt;Request GPU quota through the &lt;a href="https://cloud.google.com/run/quotas?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Run quotas page&lt;/a&gt;. Approval is not instant; allow a few days.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Create the VPC
&lt;/h3&gt;

&lt;p&gt;As explained in Part 1, Private Google Access on the subnet is the critical step. Without it, the container can't reach GCS at all. It's included directly in the &lt;code&gt;subnets create&lt;/code&gt; command below, so no separate update is needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute networks create &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_NETWORK&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subnet-mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;custom &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--bgp-routing-mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;regional &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

gcloud compute networks subnets create &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_SUBNET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_NETWORK&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--range&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10.0.0.0/24 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-private-ip-google-access&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Grant the Cloud Run service agent permission to use the subnet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects add-iam-policy-binding &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:service-&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_NUMBER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;@serverless-robot-prod.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/compute.networkUser"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; In production, instead of using the default compute service account, create a dedicated one with least-privilege access to the GCS bucket.&lt;/p&gt;
&lt;/blockquote&gt;
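&lt;p&gt;A minimal sketch of that production setup, for once the bucket exists (the service-account name &lt;code&gt;gemma-runner&lt;/code&gt; is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Create a dedicated service account with read-only access to the bucket.
gcloud iam service-accounts create gemma-runner \
  --project "${GOOGLE_CLOUD_PROJECT}"

gcloud storage buckets add-iam-policy-binding "gs://${GCS_BUCKET}" \
  --member "serviceAccount:gemma-runner@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com" \
  --role "roles/storage.objectViewer"

# Later, pass it to the deploy command with:
#   --service-account gemma-runner@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;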

&lt;h3&gt;
  
  
  Step 5: Create the GCS Bucket and Upload Models
&lt;/h3&gt;

&lt;p&gt;Create a single-region bucket in the same region as your Cloud Run service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud storage buckets create &lt;span class="s2"&gt;"gs://&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--uniform-bucket-level-access&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

gcloud storage buckets add-iam-policy-binding &lt;span class="s2"&gt;"gs://&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_NUMBER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-compute@developer.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/storage.objectViewer"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;On disk space:&lt;/strong&gt; The 2B (~5GB) and 4B (~9GB) models fit in Cloud Shell. The 26B (~50GB) and 31B (~65GB) don't. Cloud Shell has about 5GB of disk. For the large models, spin up a temporary GCE VM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute instances create gemma-uploader &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-a"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--machine-type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;n2-standard-8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--boot-disk-size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;300GB &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--boot-disk-type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pd-ssd &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image-family&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;debian-12 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image-project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;debian-cloud &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;default &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subnet&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;default &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--scopes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;storage-full
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;If you get a subnet error&lt;/strong&gt;, replace &lt;code&gt;--network=default --subnet=default&lt;/code&gt; with &lt;code&gt;--network="${VPC_NETWORK}" --subnet="${VPC_SUBNET}"&lt;/code&gt; in the command above. From Cloud Shell, the default network usually resolves this automatically.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;
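Whichever machine does the download, a quick pre-flight check that `/tmp` has room for the weights saves a failed run; a minimal sketch (the 130GB threshold is an assumption covering the two large models plus headroom, not a figure from this guide):

```shell
# Rough pre-flight: is there enough room in /tmp for the large models?
# ~50GB (26B) + ~65GB (31B) plus headroom; adjust to the models you pull.
NEEDED_GB=130
AVAIL_KB=$(df -Pk /tmp | awk 'NR==2 {print $4}')   # -P: POSIX format, no line wrap
AVAIL_GB=$((AVAIL_KB / 1024 / 1024))
if [ "$AVAIL_GB" -lt "$NEEDED_GB" ]; then
  echo "Only ${AVAIL_GB}GB free in /tmp; use the uploader VM."
else
  echo "OK: ${AVAIL_GB}GB free in /tmp."
fi
```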

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute ssh gemma-uploader &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-a"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_NETWORK&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subnet&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_SUBNET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside the VM, install dependencies and upload the models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HF_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-huggingface-token"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GCS_BUCKET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-bucket-name"&lt;/span&gt;

&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; python3-pip &lt;span class="nt"&gt;--fix-missing&lt;/span&gt;
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install &lt;/span&gt;huggingface_hub hf_transfer &lt;span class="nt"&gt;--break-system-packages&lt;/span&gt;

&lt;span class="c"&gt;# Stop immediately if any step fails&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;

&lt;span class="c"&gt;# Download, upload, and clean up each model one at a time&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;MODEL &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"google/gemma-4-E2B-it:gemma-4-E2B-it"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
             &lt;span class="s2"&gt;"google/gemma-4-E4B-it:gemma-4-E4B-it"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
             &lt;span class="s2"&gt;"google/gemma-4-26B-A4B-it:gemma-4-26B-A4B-it"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
             &lt;span class="s2"&gt;"google/gemma-4-31B-it:gemma-4-31B-it"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;REPO&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MODEL&lt;/span&gt;&lt;span class="p"&gt;%%&lt;/span&gt;:&lt;span class="p"&gt;*&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nv"&gt;DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MODEL&lt;/span&gt;&lt;span class="p"&gt;##*&lt;/span&gt;:&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HF_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/tmp/hf_cache_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"
from huggingface_hub import snapshot_download
import os
snapshot_download(repo_id='&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REPO&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;', local_dir='/tmp/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;', token=os.environ['HF_TOKEN'])
"&lt;/span&gt;
  gcloud storage &lt;span class="nb"&gt;cp&lt;/span&gt; /tmp/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/&lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="s2"&gt;"gs://&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/models/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/"&lt;/span&gt; &lt;span class="nt"&gt;--recursive&lt;/span&gt;
  &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /tmp/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="s2"&gt;"/tmp/hf_cache_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
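The `%%:*` and `##*:` expansions in the loop above split each `repo:dir` pair at the colon, keeping the Hugging Face repo id and the GCS directory name in one string. The parsing in isolation:

```shell
# Each loop entry packs "huggingface-repo:local-dir" into one string,
# then bash parameter expansion splits it at the colon.
MODEL="google/gemma-4-E2B-it:gemma-4-E2B-it"
REPO="${MODEL%%:*}"   # drop the longest suffix starting at ':' -> repo id
DIR="${MODEL##*:}"    # drop the longest prefix ending at ':'  -> local dir
echo "repo=${REPO} dir=${DIR}"
```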



&lt;p&gt;Exit the VM and delete it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;exit

&lt;/span&gt;gcloud compute instances delete gemma-uploader &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-a"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--quiet&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 6: Deploy
&lt;/h3&gt;

&lt;p&gt;Every deployment uses the same prebuilt container image from Google:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One important detail covered in Part 1: the container doesn't read the model or vLLM flags from environment variables. They must be passed via &lt;code&gt;--command="vllm"&lt;/code&gt; and &lt;code&gt;--args&lt;/code&gt;. Without &lt;code&gt;--command="vllm"&lt;/code&gt;, the startup script fails immediately.&lt;/p&gt;
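gcloud treats the `--args` value as a comma-separated list, so the long comma string in each deploy command below becomes the argv handed to the container's `vllm` entrypoint; that is also why none of the flag values in it contain a comma (gcloud does offer alternate list delimiters if you ever need one). A local sketch of that split, with an illustrative bucket path:

```shell
# gcloud splits the --args string on commas into the container's argv;
# reproduce the split locally to see what `vllm` actually receives.
ARGS="serve,gs://my-bucket/models/gemma-4-E2B-it,--dtype=bfloat16,--port=8080"
IFS=',' read -r -a ARGV <<< "$ARGS"
printf '%s\n' "${ARGV[@]}"   # one argument per line; the container runs: vllm serve ...
```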

&lt;p&gt;&lt;strong&gt;2B model:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GCS_MODEL_PATH_2B&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gs://&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/models/gemma-4-E2B-it"&lt;/span&gt;

gcloud beta run deploy gemma4-2b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--execution-environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gen2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--allow-unauthenticated&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;20 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;80Gi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu-type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia-rtx-pro-6000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-gpu-zonal-redundancy&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-cpu-throttling&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-instances&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;64 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;600 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_NETWORK&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subnet&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_SUBNET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc-egress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;all-traffic &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"vllm"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serve,&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_MODEL_PATH_2B&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;,--enable-chunked-prefill,--enable-prefix-caching,--generation-config=auto,--enable-auto-tool-choice,--tool-call-parser=gemma4,--reasoning-parser=gemma4,--dtype=bfloat16,--max-num-seqs=64,--gpu-memory-utilization=0.95,--load-format=runai_streamer,--tensor-parallel-size=1,--port=8080,--host=0.0.0.0"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--startup-probe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"tcpSocket.port=8080,initialDelaySeconds=60,failureThreshold=5,timeoutSeconds=60,periodSeconds=60"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4B model:&lt;/strong&gt; same command; replace &lt;code&gt;GCS_MODEL_PATH_2B&lt;/code&gt; with &lt;code&gt;GCS_MODEL_PATH_4B&lt;/code&gt; and the service name with &lt;code&gt;gemma4-4b&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GCS_MODEL_PATH_4B&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gs://&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/models/gemma-4-E4B-it"&lt;/span&gt;

gcloud beta run deploy gemma4-4b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--execution-environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gen2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--allow-unauthenticated&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;20 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;80Gi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu-type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia-rtx-pro-6000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-gpu-zonal-redundancy&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-cpu-throttling&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-instances&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;64 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;600 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_NETWORK&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subnet&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_SUBNET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc-egress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;all-traffic &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"vllm"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serve,&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_MODEL_PATH_4B&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;,--enable-chunked-prefill,--enable-prefix-caching,--generation-config=auto,--enable-auto-tool-choice,--tool-call-parser=gemma4,--reasoning-parser=gemma4,--dtype=bfloat16,--max-num-seqs=64,--gpu-memory-utilization=0.95,--load-format=runai_streamer,--tensor-parallel-size=1,--port=8080,--host=0.0.0.0"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--startup-probe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"tcpSocket.port=8080,initialDelaySeconds=60,failureThreshold=5,timeoutSeconds=60,periodSeconds=60"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;26B model:&lt;/strong&gt; adds fp8 quantization for both weights and KV cache, caps context with &lt;code&gt;--max-model-len=32767&lt;/code&gt;, and drops concurrency to 8.&lt;br&gt;
The startup probe below allows 6 minutes total (1 minute initial delay + 5 checks × 1 minute each), based on my measured load time. If your deployment times out, increase &lt;code&gt;failureThreshold&lt;/code&gt; — each unit adds one more minute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GCS_MODEL_PATH_26B&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gs://&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/models/gemma-4-26B-A4B-it"&lt;/span&gt;

gcloud beta run deploy gemma4-26b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--execution-environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gen2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--allow-unauthenticated&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;20 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;80Gi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu-type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia-rtx-pro-6000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-gpu-zonal-redundancy&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-cpu-throttling&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-instances&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;600 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_NETWORK&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subnet&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_SUBNET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc-egress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;all-traffic &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"vllm"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serve,&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_MODEL_PATH_26B&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;,--enable-chunked-prefill,--enable-prefix-caching,--generation-config=auto,--enable-auto-tool-choice,--tool-call-parser=gemma4,--reasoning-parser=gemma4,--dtype=bfloat16,--quantization=fp8,--kv-cache-dtype=fp8,--max-num-seqs=8,--gpu-memory-utilization=0.95,--load-format=runai_streamer,--tensor-parallel-size=1,--port=8080,--host=0.0.0.0,--max-model-len=32767"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--startup-probe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"tcpSocket.port=8080,initialDelaySeconds=60,failureThreshold=5,timeoutSeconds=60,periodSeconds=60"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
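The "each unit adds one more minute" rule is just the probe arithmetic: the worst-case startup budget is initialDelaySeconds plus failureThreshold times periodSeconds. With the 26B values:

```shell
# Worst-case time Cloud Run waits for port 8080 to open before failing the
# revision, using the 26B startup-probe settings above.
INITIAL_DELAY=60     # initialDelaySeconds
FAILURE_THRESHOLD=5  # failed checks allowed
PERIOD=60            # periodSeconds between checks
TOTAL=$((INITIAL_DELAY + FAILURE_THRESHOLD * PERIOD))
echo "startup budget: ${TOTAL}s ($((TOTAL / 60)) minutes)"
```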



&lt;p&gt;&lt;strong&gt;31B model:&lt;/strong&gt; same as the 26B with &lt;code&gt;failureThreshold=6&lt;/code&gt;, allowing 7 minutes total based on my measured load time. Increase &lt;code&gt;failureThreshold&lt;/code&gt; if needed; the same rule applies: each unit adds one more minute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GCS_MODEL_PATH_31B&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gs://&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/models/gemma-4-31B-it"&lt;/span&gt;

gcloud beta run deploy gemma4-31b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--execution-environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gen2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--allow-unauthenticated&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;20 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;80Gi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu-type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia-rtx-pro-6000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-gpu-zonal-redundancy&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-cpu-throttling&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-instances&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;600 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_NETWORK&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subnet&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_SUBNET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc-egress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;all-traffic &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"vllm"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serve,&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_MODEL_PATH_31B&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;,--enable-chunked-prefill,--enable-prefix-caching,--generation-config=auto,--enable-auto-tool-choice,--tool-call-parser=gemma4,--reasoning-parser=gemma4,--dtype=bfloat16,--quantization=fp8,--kv-cache-dtype=fp8,--max-num-seqs=8,--gpu-memory-utilization=0.95,--load-format=runai_streamer,--tensor-parallel-size=1,--port=8080,--host=0.0.0.0,--max-model-len=32767"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--startup-probe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"tcpSocket.port=8080,initialDelaySeconds=60,failureThreshold=6,timeoutSeconds=60,periodSeconds=60"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Get the service URLs after deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;SERVICE_URL_2B&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud run services describe gemma4-2b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"value(status.url)"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;SERVICE_URL_4B&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud run services describe gemma4-4b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"value(status.url)"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;SERVICE_URL_26B&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud run services describe gemma4-26b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"value(status.url)"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;SERVICE_URL_31B&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud run services describe gemma4-31b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"value(status.url)"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"2B:  &lt;/span&gt;&lt;span class="nv"&gt;$SERVICE_URL_2B&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"4B:  &lt;/span&gt;&lt;span class="nv"&gt;$SERVICE_URL_4B&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"26B: &lt;/span&gt;&lt;span class="nv"&gt;$SERVICE_URL_26B&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"31B: &lt;/span&gt;&lt;span class="nv"&gt;$SERVICE_URL_31B&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Test It
&lt;/h3&gt;

&lt;p&gt;The service exposes an &lt;a href="https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html" rel="noopener noreferrer"&gt;OpenAI-compatible API&lt;/a&gt;. Any client that speaks the OpenAI protocol works against it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The first request will be slow.&lt;/strong&gt; If the instance has scaled to zero, Cloud Run needs to start a new one and load the model before responding. For the 2B and 4B models expect around 4 minutes. For the 26B and 31B, up to 5 minutes. Don't cancel the request — it will come back. Every request after that will be fast.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SERVICE_URL_2B&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/v1/chat/completions"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "google/gemma-4-E2B-it",
    "messages": [{"role": "user", "content": "What is the moon?"}],
    "max_tokens": 200
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model name in the request must match the HuggingFace repo ID passed in &lt;code&gt;--args&lt;/code&gt; (or a custom &lt;code&gt;--served-model-name&lt;/code&gt; if you set one).&lt;/p&gt;
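&lt;p&gt;A quick way to confirm the served model name (and that the service is responding) is vLLM's model-listing endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl "${SERVICE_URL_2B}/v1/models"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;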

&lt;h3&gt;
  
  
  Production Hardening
&lt;/h3&gt;

&lt;p&gt;Everything above uses &lt;code&gt;--allow-unauthenticated&lt;/code&gt; and the default compute service account. That's fine for testing. Before real users or real data:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authentication.&lt;/strong&gt; Replace &lt;code&gt;--allow-unauthenticated&lt;/code&gt; with &lt;code&gt;--no-allow-unauthenticated&lt;/code&gt;. Cloud Run supports &lt;a href="https://cloud.google.com/run/docs/authenticating/service-to-service?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;OIDC tokens&lt;/a&gt; for service-to-service calls and &lt;a href="https://cloud.google.com/iap/docs?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;IAP&lt;/a&gt; for user-facing access.&lt;/p&gt;
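&lt;p&gt;With unauthenticated access off, a caller that holds the &lt;code&gt;roles/run.invoker&lt;/code&gt; role can still reach the service by attaching an identity token. From a developer machine, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -X POST "${SERVICE_URL_2B}/v1/chat/completions" \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-E2B-it",
    "messages": [{"role": "user", "content": "What is the moon?"}],
    "max_tokens": 200
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;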

&lt;p&gt;&lt;strong&gt;Dedicated Service Account.&lt;/strong&gt; Create one with only &lt;code&gt;roles/storage.objectViewer&lt;/code&gt; on the model bucket. The default compute service account has broader permissions than necessary.&lt;/p&gt;
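&lt;p&gt;A minimal sketch (the account name &lt;code&gt;gemma-runner&lt;/code&gt; is just an example): create the account, grant it read access to the model bucket, then pass it to &lt;code&gt;gcloud run deploy&lt;/code&gt; with &lt;code&gt;--service-account&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud iam service-accounts create gemma-runner \
  --project="${GOOGLE_CLOUD_PROJECT}"

# Read-only access to the model weights, nothing else
gcloud storage buckets add-iam-policy-binding "gs://${GCS_BUCKET}" \
  --member="serviceAccount:gemma-runner@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;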

&lt;p&gt;&lt;strong&gt;Private endpoint.&lt;/strong&gt; For sensitive workloads, remove the public URL and access the service only from within your VPC via &lt;a href="https://cloud.google.com/run/docs/securing/private-services?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Run private networking&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cleaning Up
&lt;/h3&gt;

&lt;p&gt;To remove everything after testing is done, run these commands in order:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delete the Cloud Run services:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;SERVICE &lt;span class="k"&gt;in &lt;/span&gt;gemma4-2b gemma4-4b gemma4-26b gemma4-31b&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;gcloud run services delete &lt;span class="nv"&gt;$SERVICE&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--quiet&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Delete the GCS bucket and all model weights:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud storage &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s2"&gt;"gs://&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCS_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Delete the VPC subnet and network:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute networks subnets delete &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_SUBNET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--quiet&lt;/span&gt;

gcloud compute networks delete &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VPC_NETWORK&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--quiet&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;If subnet deletion fails with an error about IP addresses still in use:&lt;/strong&gt; Cloud Run holds onto internal IP addresses for a period after services are deleted. There is no way to force-release them. Give it a few hours and try again.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Remove the IAM binding:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects remove-iam-policy-binding &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:service-&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_NUMBER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;@serverless-robot-prod.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/compute.networkUser"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;I started this piece talking about a Paris visit and an unexpected cloud bill. But the real reason I spent time getting Gemma 4 running on Cloud Run isn't just the cost.&lt;/p&gt;

&lt;p&gt;It's the access.&lt;/p&gt;

&lt;p&gt;When an LLM runs in your own infrastructure, things become possible that weren't possible before. Regulated data that couldn't touch a third-party API can now be processed by a capable model. Privacy becomes a feature of the architecture, not a compromise against capability.&lt;/p&gt;

&lt;p&gt;But there's a more practical argument too. Commercial frontier models are expensive per token, and they come with regional rate limits that cap how much you can do. When your production pipeline hits a rate limit, everything behind it slows down. When you hit your own model, there's no rate limit. You control the capacity. You control the cost. You decide when to scale.&lt;/p&gt;

&lt;p&gt;Gemma 4 is the first open model where that tradeoff genuinely makes sense across the full range of AI tasks: text, vision, reasoning, function calling. Not every step in your pipeline needs a frontier model. The steps that don't (and with Gemma 4's reasoning capability, that's more steps than before) can run on infrastructure you own, at a cost you control, without a rate limit in sight.&lt;/p&gt;

&lt;p&gt;The drone flying over the farmer's fields, making decisions on its own, is not a hypothetical. The only thing that was missing was a model good enough to run on hardware that fits in a backpack.&lt;/p&gt;

&lt;p&gt;Now there is one. And you know how to deploy it. Enjoy discovering it.&lt;/p&gt;

</description>
      <category>gemma</category>
      <category>cloudrun</category>
      <category>cloud</category>
      <category>ai</category>
    </item>
    <item>
      <title>This is Cloud Run: Configuration</title>
      <dc:creator>Daniel Gwerzman</dc:creator>
      <pubDate>Fri, 27 Mar 2026 10:23:13 +0000</pubDate>
      <link>https://forem.com/gde/this-is-cloud-run-configuration-2gi2</link>
      <guid>https://forem.com/gde/this-is-cloud-run-configuration-2gi2</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 3 of the "This is Cloud Run" series. In &lt;a href="https://dev.to/gde/this-is-cloud-run-a-decision-guide-for-developers-428m"&gt;Part 1&lt;/a&gt;, we covered what Cloud Run is and when to choose it. In &lt;a href="https://dev.to/gde/this-is-cloud-run-nine-ways-to-deploy-and-when-to-use-each-506b"&gt;Part 2&lt;/a&gt;, we walked through the deployment options and revision management. Now let's tune it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Cloud Run's defaults are good. We covered that in Part 1. But every workload has its own needs, and Cloud Run gives you the knobs to tune for them. This article covers the settings you'll reach for most often.&lt;/p&gt;

&lt;h2&gt;
  
  
  CPU and Memory
&lt;/h2&gt;

&lt;p&gt;Every Cloud Run instance gets a share of CPU and memory. The defaults (1 vCPU, 512 MiB) are reasonable for a lightweight API, but you'll want to adjust them as you understand your workload's needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CPU&lt;/strong&gt; ranges from 0.08 vCPU (less than a tenth of a core) to 8 vCPUs. &lt;strong&gt;Memory&lt;/strong&gt; ranges from 128 MiB to 32 GiB. The two are linked: higher CPU allocations require minimum memory thresholds, and some memory configurations require minimum CPU.&lt;/p&gt;

&lt;p&gt;But the more important decision is the &lt;a href="https://cloud.google.com/run/docs/configuring/services/cpu?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;&lt;strong&gt;CPU allocation mode&lt;/strong&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Request-only (default).&lt;/strong&gt; CPU is only allocated while your instance is actively processing a request. Between requests, CPU is throttled to near-zero. You pay only for the time spent handling requests. This is the serverless model, and it's the right choice for most HTTP APIs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Always-on.&lt;/strong&gt; CPU is always available, even between requests. This costs more, but it's required for workloads that do work outside of request handling: WebSocket connections that maintain state, background threads that process queues, or services that need to keep in-memory caches warm.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; my-image &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cpu&lt;/span&gt; 2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory&lt;/span&gt; 1Gi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-cpu-throttling&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--no-cpu-throttling&lt;/code&gt; flag enables always-on CPU. Without it (or with &lt;code&gt;--cpu-throttling&lt;/code&gt;), you get the default request-only mode.&lt;/p&gt;

&lt;p&gt;The pricing difference is significant. With request-only allocation, you pay per vCPU-second and GiB-second only while handling requests. With always-on, you pay for the entire lifecycle of the instance. For a service that handles bursty HTTP traffic with idle periods between, request-only can be dramatically cheaper. For a service that runs background tasks or maintains WebSocket connections, always-on is the only option that works correctly.&lt;/p&gt;
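&lt;p&gt;A back-of-envelope comparison makes the gap concrete. The per-vCPU-second rates below are placeholders, not current prices (check the Cloud Run pricing page), but the shape of the math holds for a 1 vCPU service that's busy 5% of a 30-day month:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Illustrative only: placeholder rates, not real Cloud Run prices
SECONDS_PER_MONTH=$((30 * 24 * 3600))   # 2,592,000 seconds
BUSY_FRACTION="0.05"                    # busy 5% of the time
RATE_REQUEST="0.000024"                 # request-only, per vCPU-second (placeholder)
RATE_ALWAYS="0.000018"                  # always-on, per vCPU-second (placeholder)

# Request-only pays for busy seconds; always-on pays for every second
COST_REQUEST=$(awk "BEGIN {printf \"%.2f\", $SECONDS_PER_MONTH * $BUSY_FRACTION * $RATE_REQUEST}")
COST_ALWAYS=$(awk "BEGIN {printf \"%.2f\", $SECONDS_PER_MONTH * $RATE_ALWAYS}")
echo "request-only: \$${COST_REQUEST}/mo   always-on: \$${COST_ALWAYS}/mo"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;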

&lt;h2&gt;
  
  
  Health Checks
&lt;/h2&gt;

&lt;p&gt;Cloud Run won't send traffic to an instance until it's ready. By default, it uses a &lt;a href="https://cloud.google.com/run/docs/configuring/healthchecks" rel="noopener noreferrer"&gt;&lt;strong&gt;TCP startup probe&lt;/strong&gt;&lt;/a&gt;: it waits for your container to listen on the expected port, then considers it ready.&lt;/p&gt;

&lt;p&gt;For most services, that's enough. But if your application needs time to load data, warm caches, or establish database connections after the port is open, you'll want a custom &lt;a href="https://cloud.google.com/run/docs/configuring/healthchecks?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;&lt;strong&gt;HTTP startup probe&lt;/strong&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;startupProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/healthz&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
  &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells Cloud Run to &lt;code&gt;GET /healthz&lt;/code&gt; every 2 seconds. If it fails 15 times, the instance is marked unhealthy and restarted. Only when the probe succeeds does the instance start receiving traffic. This prevents the 502 errors that happen when a load balancer sends requests to an instance that's technically listening but not yet ready to serve.&lt;/p&gt;

&lt;p&gt;Cloud Run also supports &lt;a href="https://cloud.google.com/run/docs/configuring/healthchecks?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;&lt;strong&gt;liveness probes&lt;/strong&gt;&lt;/a&gt; that run continuously after startup. If a liveness probe fails, Cloud Run restarts the instance. Useful for detecting stuck processes, deadlocks, or memory leaks that don't crash the container but make it unresponsive.&lt;/p&gt;

&lt;p&gt;For &lt;a href="https://grpc.io/" rel="noopener noreferrer"&gt;gRPC&lt;/a&gt; services, Cloud Run supports gRPC health checking probes following the &lt;a href="https://github.com/grpc/grpc/blob/master/doc/health-checking.md" rel="noopener noreferrer"&gt;gRPC health checking protocol&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Request Timeout
&lt;/h2&gt;

&lt;p&gt;Every Cloud Run request has a &lt;a href="https://cloud.google.com/run/docs/configuring/request-timeout" rel="noopener noreferrer"&gt;timeout&lt;/a&gt;. The default is &lt;strong&gt;300 seconds (5 minutes)&lt;/strong&gt;. The maximum is &lt;strong&gt;3600 seconds (60 minutes)&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; my-image &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--timeout&lt;/span&gt; 600 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your service processes large file uploads, generates reports, or runs long-running computations, you'll want to increase this. But keep in mind: the timeout applies per-request. If a single request takes longer than the timeout, Cloud Run terminates it. WebSocket connections are also subject to this timeout, which is why Part 1 mentioned the ~60-minute connection limit.&lt;/p&gt;

&lt;p&gt;A common pattern for long-running work: accept the request, kick off the processing asynchronously (via &lt;a href="https://cloud.google.com/tasks?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Tasks&lt;/a&gt; or &lt;a href="https://cloud.google.com/pubsub?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Pub/Sub&lt;/a&gt;), and return a 202 immediately. The client polls for status or receives a callback when the work is done. This keeps your request timeout short and your service responsive.&lt;/p&gt;
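&lt;p&gt;Enqueueing the work can be a single Cloud Tasks call; the queue name and worker URL here are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud tasks create-http-task \
  --queue=my-work-queue \
  --url="https://my-worker-abc123-uc.a.run.app/process" \
  --method=POST
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;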

&lt;p&gt;If you find yourself regularly hitting the 60-minute maximum, that's a signal your workload might be better suited to &lt;a href="https://cloud.google.com/run/docs/create-jobs" rel="noopener noreferrer"&gt;Cloud Run jobs&lt;/a&gt; (for batch processing) or a different platform entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling: Instances and Concurrency
&lt;/h2&gt;

&lt;p&gt;Cloud Run's autoscaler manages three related settings:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/min-instances?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Minimum instances&lt;/a&gt;&lt;/strong&gt; controls how many instances stay warm when there's no traffic. The default is &lt;code&gt;0&lt;/code&gt; (scale-to-zero). Setting it to &lt;code&gt;1&lt;/code&gt; or higher eliminates cold starts but means you're paying for idle instances. It's the classic serverless trade-off: latency vs. cost. For latency-sensitive production services, &lt;code&gt;1&lt;/code&gt; is often the right number. For dev environments, &lt;code&gt;0&lt;/code&gt; keeps your bill at zero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/max-instances?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Maximum instances&lt;/a&gt;&lt;/strong&gt; caps how far Cloud Run can scale up. The default is &lt;code&gt;100&lt;/code&gt;. This protects you from runaway scaling (and a surprising bill) during unexpected traffic spikes. But set this thoughtfully: if your service talks to a database with a 20-connection pool, 100 instances all trying to connect will overwhelm it. Match your max instances to your backend's capacity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/concurrency?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Concurrency&lt;/a&gt;&lt;/strong&gt; controls how many requests a single instance handles simultaneously. The default is &lt;code&gt;80&lt;/code&gt;. This is one of Cloud Run's key advantages over the old Cloud Functions 1st gen model, which processed one request per instance. With concurrency at 80, a single instance can serve 80 simultaneous requests before Cloud Run spins up another instance.&lt;/p&gt;

&lt;p&gt;Lower the concurrency for CPU-heavy workloads where each request needs dedicated processing power. Raise it (up to 1000) for lightweight I/O-bound handlers that spend most of their time waiting on network calls. Setting concurrency to &lt;code&gt;1&lt;/code&gt; mimics the one-request-per-instance model if your code isn't thread-safe.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; my-image &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--min-instances&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-instances&lt;/span&gt; 10 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--concurrency&lt;/span&gt; 80 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And remember &lt;a href="https://cloud.google.com/run/docs/configuring/services/cpu?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Startup CPU Boost&lt;/a&gt; from Part 1: Cloud Run temporarily doubles CPU during instance initialization to get instances ready faster. Combined with minimum instances, this makes cold starts a non-issue for most workloads.&lt;/p&gt;
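&lt;p&gt;Both settings go on the same deploy; &lt;code&gt;--cpu-boost&lt;/code&gt; turns the boost on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service \
  --image my-image \
  --min-instances 1 \
  --cpu-boost \
  --region us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;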

&lt;h2&gt;
  
  
  Environment Variables and Secrets
&lt;/h2&gt;

&lt;p&gt;Cloud Run supports two mechanisms for passing configuration to your containers, and it's important to use the right one for the job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/environment-variables?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Environment variables&lt;/a&gt;&lt;/strong&gt; are for non-sensitive configuration: feature flags, API endpoints, logging levels, database hostnames. Set them at deploy time with &lt;code&gt;--set-env-vars&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; my-image &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-env-vars&lt;/span&gt; &lt;span class="s2"&gt;"DB_HOST=10.0.0.1,LOG_LEVEL=info,ENV=production"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This follows the &lt;a href="https://12factor.net/config" rel="noopener noreferrer"&gt;12-Factor App&lt;/a&gt; methodology: configuration lives in the environment, not in the code.&lt;/p&gt;
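&lt;p&gt;Because configuration lives in the environment, changing it doesn't mean rebuilding the image. Updating a variable simply rolls out a new revision:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services update my-service \
  --update-env-vars LOG_LEVEL=debug \
  --region us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;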

&lt;p&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/secrets?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Secrets&lt;/a&gt;&lt;/strong&gt; are for sensitive credentials: API keys, database passwords, TLS certificates, OAuth client secrets. These should &lt;em&gt;never&lt;/em&gt; be plain environment variables. Plain env vars are visible in the Cloud Run Console, show up in debug logs, and can leak into error reports. Instead, store them in &lt;a href="https://cloud.google.com/secret-manager?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Secret Manager&lt;/a&gt; and reference them at deploy time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; my-image &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-secrets&lt;/span&gt; &lt;span class="s2"&gt;"API_KEY=my-api-key:latest"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-secrets&lt;/span&gt; &lt;span class="s2"&gt;"/secrets/tls.key=tls-private-key:latest"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Secrets can be mounted as environment variables or as files. The first example above mounts the secret as an environment variable called &lt;code&gt;API_KEY&lt;/code&gt;. The second mounts it as a file at &lt;code&gt;/secrets/tls.key&lt;/code&gt;. Secrets are versioned, access-controlled via IAM, and audit-logged. If a secret is compromised, you rotate it in Secret Manager and redeploy. No code changes.&lt;/p&gt;
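&lt;p&gt;The secret has to exist in Secret Manager before a deploy can reference it. A sketch, with placeholder names and a placeholder runtime service account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Create the secret (name and value are placeholders)
printf '%s' 'example-value' | gcloud secrets create my-api-key --data-file=-

# Grant the service's runtime service account access to it
gcloud secrets add-iam-policy-binding my-api-key \
  --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
  --role="roles/secretmanager.secretAccessor"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;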

&lt;h2&gt;
  
  
  Volume Mounts
&lt;/h2&gt;

&lt;p&gt;Cloud Run instances are ephemeral, but sometimes you need temporary storage or access to shared files. Cloud Run supports three types of &lt;a href="https://cloud.google.com/run/docs/configuring/services/volumes?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;volume mounts&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In-memory volumes&lt;/strong&gt; are &lt;code&gt;tmpfs&lt;/code&gt;-style mounts backed by your instance's RAM. They're fast but volatile (gone when the instance terminates) and count against your memory limit. Useful for temporary file processing, like downloading a file, transforming it, and uploading the result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; my-image &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--add-volume&lt;/span&gt; &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;scratch,type&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="nt"&gt;-memory&lt;/span&gt;,size-limit&lt;span class="o"&gt;=&lt;/span&gt;256Mi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--add-volume-mount&lt;/span&gt; &lt;span class="nv"&gt;volume&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;scratch,mount-path&lt;span class="o"&gt;=&lt;/span&gt;/tmp/work &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/cloud-storage-volume-mounts" rel="noopener noreferrer"&gt;Cloud Storage FUSE&lt;/a&gt;&lt;/strong&gt; mounts a &lt;a href="https://cloud.google.com/storage?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Storage&lt;/a&gt; bucket as a local filesystem. Your code reads and writes files normally, and &lt;a href="https://cloud.google.com/storage/docs/cloud-storage-fuse/overview?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;GCS FUSE&lt;/a&gt; translates those operations into Cloud Storage API calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; my-image &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--add-volume&lt;/span&gt; &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;models,type&lt;span class="o"&gt;=&lt;/span&gt;cloud-storage,bucket&lt;span class="o"&gt;=&lt;/span&gt;my-ml-models &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--add-volume-mount&lt;/span&gt; &lt;span class="nv"&gt;volume&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;models,mount-path&lt;span class="o"&gt;=&lt;/span&gt;/mnt/models &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The catch: it's eventually consistent. No file locking, last write wins. Good for reading shared assets (ML models, configuration files) or writing artifacts (logs, exports). Not good for concurrent writes to the same file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/nfs-volume-mounts" rel="noopener noreferrer"&gt;NFS via Filestore&lt;/a&gt;&lt;/strong&gt; gives you a fully &lt;a href="https://en.wikipedia.org/wiki/POSIX" rel="noopener noreferrer"&gt;POSIX&lt;/a&gt;-compliant network filesystem with proper file locking. Lower latency than GCS FUSE for random reads. Requires &lt;a href="https://cloud.google.com/run/docs/configuring/vpc-direct-vpc?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;VPC connectivity&lt;/a&gt; since &lt;a href="https://cloud.google.com/filestore?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Filestore&lt;/a&gt; instances live on your VPC. Best for workloads that need shared read/write access with file-level consistency.&lt;/p&gt;

&lt;p&gt;For most Cloud Run services, you won't need any of these. But when you do (image processing pipelines, ML model serving, shared configuration across instances), they save you from building workarounds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Network Configuration
&lt;/h2&gt;

&lt;p&gt;Cloud Run's networking defaults are simple: your service is public, and it connects to the internet for outbound traffic. But when you need more control, there are three areas to configure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingress
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/run/docs/securing/ingress?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Ingress settings&lt;/a&gt; control who can reach your service:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;All (default).&lt;/strong&gt; Accepts traffic from anywhere on the internet. Fine for public APIs and web apps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal.&lt;/strong&gt; Only accepts traffic from within your &lt;a href="https://cloud.google.com/vpc?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;VPC&lt;/a&gt; or from other Google Cloud services (like &lt;a href="https://cloud.google.com/pubsub?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Pub/Sub&lt;/a&gt;, &lt;a href="https://cloud.google.com/scheduler?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Scheduler&lt;/a&gt;, or &lt;a href="https://cloud.google.com/tasks?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Tasks&lt;/a&gt;). The service is invisible to the public internet. Use this for backend services that should never be called directly by external clients.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal + Cloud Load Balancing.&lt;/strong&gt; Same as internal, but also accepts traffic through a &lt;a href="https://cloud.google.com/load-balancing/docs/https?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;global external Application Load Balancer&lt;/a&gt;. This is the path to custom domains, CDN caching with &lt;a href="https://cloud.google.com/cdn?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud CDN&lt;/a&gt;, and WAF protection with &lt;a href="https://cloud.google.com/armor?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Armor&lt;/a&gt;. You'll see this load balancer pattern come up again in the Custom Domains and Cloud Armor sections below.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; my-image &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ingress&lt;/span&gt; internal &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Egress and VPC Connectivity
&lt;/h3&gt;

&lt;p&gt;By default, your Cloud Run instances connect to the internet directly. But if your service needs to reach private resources (a &lt;a href="https://cloud.google.com/sql?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud SQL&lt;/a&gt; database, a &lt;a href="https://cloud.google.com/memorystore?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Memorystore&lt;/a&gt; Redis instance, an internal API), it needs VPC access.&lt;/p&gt;

&lt;p&gt;Two options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://cloud.google.com/vpc/docs/configure-serverless-vpc-access?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Serverless VPC Access connectors&lt;/a&gt;.&lt;/strong&gt; The original approach. You create a connector resource that bridges Cloud Run and your VPC. Works, but adds a network hop and has throughput limits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/vpc-direct-vpc?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Direct VPC egress&lt;/a&gt;.&lt;/strong&gt; The newer approach. Cloud Run instances are placed directly on your VPC subnet. No connector needed, no extra hop, no throughput bottleneck. This is the recommended path for new deployments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're starting fresh, go with Direct VPC egress. If you have existing services using connectors, they'll keep working, but consider migrating when convenient.&lt;/p&gt;
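&lt;p&gt;A minimal sketch of Direct VPC egress (the network and subnet names are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service \
  --image my-image \
  --network my-vpc \
  --subnet my-subnet \
  --vpc-egress private-ranges-only \
  --region us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;--vpc-egress&lt;/code&gt; controls how much outbound traffic routes through the VPC: &lt;code&gt;private-ranges-only&lt;/code&gt; (the default) or &lt;code&gt;all-traffic&lt;/code&gt;.&lt;/p&gt;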

&lt;h3&gt;
  
  
  Custom Domains
&lt;/h3&gt;

&lt;p&gt;Every Cloud Run service gets a &lt;code&gt;*.run.app&lt;/code&gt; URL with automatic HTTPS. But for production, you'll want your own domain. Two paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/mapping-custom-domains?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Run domain mapping&lt;/a&gt;.&lt;/strong&gt; The simpler option. Map a domain directly to your Cloud Run service. SSL certificates are provisioned and renewed automatically. Works for straightforward setups where you just need &lt;code&gt;api.example.com&lt;/code&gt; pointing to your service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://cloud.google.com/load-balancing/docs/https?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Global external Application Load Balancer&lt;/a&gt;.&lt;/strong&gt; The more capable option. Gives you CDN caching, Cloud Armor WAF, multi-region routing, and URL-based routing to different services. More setup, but it unlocks features that domain mapping alone can't provide.&lt;/li&gt;
&lt;/ul&gt;
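&lt;p&gt;The mapping itself is one command, sketched here with placeholder names (the domain must be verified first, and depending on your gcloud version the command may live under the beta component):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run domain-mappings create \
  --service my-service \
  --domain api.example.com \
  --region us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;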

&lt;h2&gt;
  
  
  Security Configuration
&lt;/h2&gt;

&lt;p&gt;Cloud Run's security defaults are strong (covered in Part 1). But for production services, you'll want to customize a few settings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Service Accounts
&lt;/h3&gt;

&lt;p&gt;Every Cloud Run service runs as a &lt;a href="https://cloud.google.com/run/docs/configuring/service-accounts?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;service account&lt;/a&gt;, which determines what Google Cloud resources it can access. By default, Cloud Run uses the project's default compute service account, which typically has broad permissions.&lt;/p&gt;

&lt;p&gt;For production, create a &lt;strong&gt;dedicated service account per service&lt;/strong&gt; with only the permissions it needs. If your service reads from Cloud Storage and writes to Pub/Sub, its service account should have &lt;code&gt;roles/storage.objectViewer&lt;/code&gt; and &lt;code&gt;roles/pubsub.publisher&lt;/code&gt;. Nothing more. This is the principle of least privilege, and it limits the blast radius if a service is compromised.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; my-image &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service-account&lt;/span&gt; my-sa@my-project.iam.gserviceaccount.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
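&lt;p&gt;Creating that dedicated account and granting only the two roles from the example above might look like this (the project and account names are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud iam service-accounts create my-sa

gcloud projects add-iam-policy-binding my-project \
  --member "serviceAccount:my-sa@my-project.iam.gserviceaccount.com" \
  --role roles/storage.objectViewer

gcloud projects add-iam-policy-binding my-project \
  --member "serviceAccount:my-sa@my-project.iam.gserviceaccount.com" \
  --role roles/pubsub.publisher
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Tighter still: grant the roles on the specific bucket and topic rather than at the project level.&lt;/p&gt;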



&lt;h3&gt;
  
  
  IAM Authentication
&lt;/h3&gt;

&lt;p&gt;By default, Cloud Run requires authentication. Every request must include a valid identity token, and the caller must have the &lt;code&gt;roles/run.invoker&lt;/code&gt; role on the service. This is the right default for service-to-service communication.&lt;/p&gt;
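&lt;p&gt;You can see the default in action from your terminal; this sketch uses a placeholder service URL:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  https://my-service-abc123-uc.a.run.app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For service-to-service calls, the calling service fetches the same kind of identity token from the metadata server instead of the gcloud CLI.&lt;/p&gt;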

&lt;p&gt;For public-facing services (APIs, webhooks, web apps), you explicitly opt out by granting the &lt;code&gt;roles/run.invoker&lt;/code&gt; role to &lt;code&gt;allUsers&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services add-iam-policy-binding my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"allUsers"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/run.invoker"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But even with unauthenticated access enabled, you can implement your own authentication layer in your application code. Cloud Run handles transport security (HTTPS) and platform-level identity (the IAM invoker check). Your app handles application-level identity: user logins, API keys, JWT validation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Binary Authorization
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/binary-authorization?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Binary Authorization&lt;/a&gt; enforces deploy-time policies: only container images that have been signed by your CI/CD pipeline can be deployed. This prevents someone from deploying an untested image directly to production, even if they have the IAM permissions to do so.&lt;/p&gt;

&lt;p&gt;It's a layer of governance that makes sense for organizations with compliance requirements or strict change management processes.&lt;/p&gt;
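&lt;p&gt;Once a policy exists, enforcement is a deploy-time flag. A sketch, where &lt;code&gt;default&lt;/code&gt; refers to the project's default Binary Authorization policy:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service \
  --image my-image \
  --binary-authorization default \
  --region us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;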

&lt;h3&gt;
  
  
  Cloud Armor
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/armor?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Armor&lt;/a&gt; is Google Cloud's WAF (Web Application Firewall). It sits in front of your Cloud Run service and can enforce:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IP allowlists and denylists&lt;/li&gt;
&lt;li&gt;Geographic restrictions&lt;/li&gt;
&lt;li&gt;Rate limiting per client&lt;/li&gt;
&lt;li&gt;Pre-configured WAF rules (SQL injection, XSS, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cloud Armor requires a &lt;a href="https://cloud.google.com/load-balancing/docs/https?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;global external Application Load Balancer&lt;/a&gt; in front of your Cloud Run service. If you're using the default &lt;code&gt;*.run.app&lt;/code&gt; URL without a load balancer, Cloud Armor isn't available. But if your service is public-facing and handles sensitive data, the load balancer + Cloud Armor combination is worth the extra setup.&lt;/p&gt;
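&lt;p&gt;As a hedged sketch, a per-IP rate-limiting policy is created and then attached to the load balancer's backend service (all names and thresholds here are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute security-policies create my-policy

gcloud compute security-policies rules create 1000 \
  --security-policy my-policy \
  --src-ip-ranges "*" \
  --action throttle \
  --rate-limit-threshold-count 100 \
  --rate-limit-threshold-interval-sec 60 \
  --conform-action allow \
  --exceed-action deny-429 \
  --enforce-on-key IP

gcloud compute backend-services update my-backend \
  --security-policy my-policy \
  --global
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;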

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Cloud Run gives you enough configuration knobs to tune for real workloads. But you don't need to touch all of them at once.&lt;/p&gt;

&lt;p&gt;The pattern I recommend: start with the defaults. Deploy your service, see how it behaves under real traffic, then adjust. Bump the memory if you're hitting limits. Lower the concurrency if requests are CPU-heavy. Add a health check if your startup is slow. Set up a dedicated service account before going to production. Every change takes effect on the next deployment, with zero downtime. Nothing is permanent.&lt;/p&gt;

&lt;p&gt;If Part 1 was about &lt;em&gt;whether&lt;/em&gt; Cloud Run belongs in your architecture, and Part 2 was about &lt;em&gt;getting your code onto it&lt;/em&gt;, this article is about &lt;em&gt;making it work well for your specific needs&lt;/em&gt;. Start simple. Add complexity when your workload demands it, not before.&lt;/p&gt;

&lt;p&gt;If you're just joining the series, &lt;a href="https://dev.to/gde/this-is-cloud-run-a-decision-guide-for-developers-428m"&gt;Part 1&lt;/a&gt; is the place to start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs" rel="noopener noreferrer"&gt;Cloud Run documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/pricing" rel="noopener noreferrer"&gt;Cloud Run pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/cpu" rel="noopener noreferrer"&gt;Cloud Run CPU configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/healthchecks" rel="noopener noreferrer"&gt;Cloud Run health checks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://grpc.io/" rel="noopener noreferrer"&gt;gRPC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/grpc/grpc/blob/master/doc/health-checking.md" rel="noopener noreferrer"&gt;gRPC health checking protocol&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/request-timeout" rel="noopener noreferrer"&gt;Cloud Run request timeout&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/tasks" rel="noopener noreferrer"&gt;Cloud Tasks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/pubsub" rel="noopener noreferrer"&gt;Pub/Sub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/create-jobs" rel="noopener noreferrer"&gt;Cloud Run jobs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/min-instances" rel="noopener noreferrer"&gt;Cloud Run minimum instances&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/max-instances" rel="noopener noreferrer"&gt;Cloud Run maximum instances&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/concurrency" rel="noopener noreferrer"&gt;Cloud Run concurrency&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/cpu" rel="noopener noreferrer"&gt;Startup CPU Boost&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/environment-variables" rel="noopener noreferrer"&gt;Cloud Run environment variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/secrets" rel="noopener noreferrer"&gt;Cloud Run secrets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/secret-manager" rel="noopener noreferrer"&gt;Secret Manager&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://12factor.net/config" rel="noopener noreferrer"&gt;12-Factor App: Config&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/cloud-storage-volume-mounts" rel="noopener noreferrer"&gt;Cloud Storage volume mounts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/storage" rel="noopener noreferrer"&gt;Cloud Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/storage/docs/cloud-storage-fuse/overview" rel="noopener noreferrer"&gt;GCS FUSE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/nfs-volume-mounts" rel="noopener noreferrer"&gt;NFS volume mounts (Filestore)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/filestore" rel="noopener noreferrer"&gt;Filestore&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/securing/ingress" rel="noopener noreferrer"&gt;Cloud Run ingress settings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/vpc" rel="noopener noreferrer"&gt;VPC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/scheduler" rel="noopener noreferrer"&gt;Cloud Scheduler&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/cdn" rel="noopener noreferrer"&gt;Cloud CDN&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/armor" rel="noopener noreferrer"&gt;Cloud Armor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/load-balancing/docs/https" rel="noopener noreferrer"&gt;Global external Application Load Balancer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/vpc-direct-vpc" rel="noopener noreferrer"&gt;Direct VPC egress&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/vpc/docs/configure-serverless-vpc-access" rel="noopener noreferrer"&gt;Serverless VPC Access&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/sql" rel="noopener noreferrer"&gt;Cloud SQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/memorystore" rel="noopener noreferrer"&gt;Memorystore&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/mapping-custom-domains" rel="noopener noreferrer"&gt;Cloud Run domain mapping&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/service-accounts" rel="noopener noreferrer"&gt;Cloud Run service accounts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/binary-authorization" rel="noopener noreferrer"&gt;Binary Authorization&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>gcp</category>
      <category>cloudrun</category>
      <category>serverless</category>
    </item>
    <item>
      <title>This is Cloud Run: Nine Ways to Deploy (and When to Use Each)</title>
      <dc:creator>Daniel Gwerzman</dc:creator>
      <pubDate>Fri, 20 Mar 2026 07:04:45 +0000</pubDate>
      <link>https://forem.com/gde/this-is-cloud-run-nine-ways-to-deploy-and-when-to-use-each-506b</link>
      <guid>https://forem.com/gde/this-is-cloud-run-nine-ways-to-deploy-and-when-to-use-each-506b</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 2 of the "This is Cloud Run" series. In &lt;a href="https://dev.to/gde/this-is-cloud-run-a-decision-guide-for-developers-428m"&gt;Part 1&lt;/a&gt;, we covered what Cloud Run is, what's behind the curtain, what you get for free, Cloud Run functions, the platform's boundaries, and the migration path to Kubernetes. Now let's get practical.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In Part 1, the question was "should I use Cloud Run?" Here, the question is &lt;strong&gt;"how do I get my code onto it?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn't a step-by-step tutorial. This article is the &lt;em&gt;why&lt;/em&gt; behind each option, so you can make informed choices instead of copying commands.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment Options
&lt;/h2&gt;

&lt;p&gt;One of Cloud Run's underrated strengths is how many ways you can get your code onto the platform. Between the CLI, the Console, YAML, Terraform, Cloud Build, GitHub Actions, Cloud Deploy, client libraries, and more, there's no shortage of options. I'm not going to cover all of them. Instead, I'll focus on the ones I find most useful across the projects I work on, from quick prototypes to production deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  From Source Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the "I just want this running" command. You point &lt;code&gt;gcloud&lt;/code&gt; at your source code directory, and Cloud Run handles everything else. But "everything else" hides a multi-step pipeline that's worth understanding:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Upload.&lt;/strong&gt; &lt;code&gt;gcloud&lt;/code&gt; zips your source directory and uploads it to &lt;a href="https://cloud.google.com/build" rel="noopener noreferrer"&gt;Cloud Build&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detect.&lt;/strong&gt; Cloud Build runs &lt;a href="https://cloud.google.com/docs/buildpacks/overview" rel="noopener noreferrer"&gt;Google Cloud Buildpacks&lt;/a&gt;, which inspect your code to determine the language and framework. Found a &lt;code&gt;requirements.txt&lt;/code&gt;? Python. Found &lt;code&gt;gunicorn&lt;/code&gt; in the dependencies? That's your server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build.&lt;/strong&gt; Buildpacks create a container image: secure base image, installed dependencies, configured entry point. If a &lt;code&gt;Dockerfile&lt;/code&gt; is present in your directory, Cloud Build uses that instead of buildpacks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Push.&lt;/strong&gt; The built image is pushed to &lt;a href="https://cloud.google.com/artifact-registry" rel="noopener noreferrer"&gt;Artifact Registry&lt;/a&gt;, Google Cloud's container registry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy.&lt;/strong&gt; Cloud Run pulls the image and deploys it as a new &lt;a href="https://cloud.google.com/run/docs/managing/revisions" rel="noopener noreferrer"&gt;revision&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All from one command. You don't need to know Docker, you don't need to understand container registries, and you don't even need a &lt;code&gt;Dockerfile&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The trade-off is control. You can't customize the base image or run multi-stage builds. If you don't need any of that, source deploy works perfectly fine in production. Many of my services run this way. But if you need precise control over what's in the container, the next option gives you that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Quick iterations during development, prototyping, developers new to containers, and the "I just want this running" moments.&lt;/p&gt;

&lt;h3&gt;
  
  
  From a Pre-Built Container Image
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; us-docker.pkg.dev/my-project/repo/my-image:v1.2.3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the production path. You build your Docker image however you like (locally, in CI, in &lt;a href="https://cloud.google.com/build" rel="noopener noreferrer"&gt;Cloud Build&lt;/a&gt;), push it to &lt;a href="https://cloud.google.com/artifact-registry" rel="noopener noreferrer"&gt;Artifact Registry&lt;/a&gt;, and deploy by image URL. Notice the flag: &lt;code&gt;--image&lt;/code&gt; instead of &lt;code&gt;--source&lt;/code&gt;. No build step happens on Google's side. Cloud Run pulls the image and starts running it, which makes the deployment itself much faster.&lt;/p&gt;

&lt;p&gt;The key advantage is full control over the build process. Multi-stage builds to keep images small. Custom base images tuned for your runtime. Build-time secrets for private package registries. Whatever your &lt;code&gt;Dockerfile&lt;/code&gt; needs. If you care about minimal images, pinned base image versions, and a small attack surface, this is where you get that.&lt;/p&gt;
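&lt;p&gt;The build-and-push step itself can happen anywhere Docker runs; with Cloud Build it collapses to one command (the image path is the same illustrative one as above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud builds submit --tag us-docker.pkg.dev/my-project/repo/my-image:v1.2.3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;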

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Production deployments, teams with existing build pipelines, workloads that need custom build steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud Run Functions
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-function &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--function&lt;/span&gt; myEntrypoint &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--base-image&lt;/span&gt; python312 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We covered &lt;a href="https://cloud.google.com/run/docs/writing-a-function" rel="noopener noreferrer"&gt;Cloud Run functions&lt;/a&gt; in depth in Part 1, so here we'll focus on the deployment mechanics. Two flags distinguish a function deployment from a service deployment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--function myEntrypoint&lt;/code&gt;&lt;/strong&gt; selects which function in your source code to use as the HTTP entry point. Your source can define multiple functions; each deployment serves one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--base-image python312&lt;/code&gt;&lt;/strong&gt; selects a &lt;a href="https://cloud.google.com/run/docs/deploy-functions" rel="noopener noreferrer"&gt;managed base image&lt;/a&gt; for the runtime. Google manages these base images, including security patches, so you don't maintain a &lt;code&gt;Dockerfile&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Supported runtimes include Node.js, Python, Go, Java, .NET, Ruby, and PHP. You can also deploy Cloud Run functions from the &lt;a href="https://console.cloud.google.com/run" rel="noopener noreferrer"&gt;Console UI&lt;/a&gt; with an inline code editor, which is handy for quick experiments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Single-purpose endpoints, webhooks, event handlers, and the LLM API proxy pattern I described in Part 1.&lt;/p&gt;

&lt;h3&gt;
  
  
  Console UI
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://console.cloud.google.com/run" rel="noopener noreferrer"&gt;Google Cloud Console&lt;/a&gt; provides a point-and-click interface for deploying Cloud Run services. You select a container image from Artifact Registry, configure settings through a guided form (CPU, memory, scaling, environment variables, networking), and deploy without touching the command line.&lt;/p&gt;

&lt;p&gt;One thing the Console is surprisingly good for: &lt;strong&gt;discovery.&lt;/strong&gt; Before you memorize CLI flags, clicking through the Console form shows you every configuration option Cloud Run offers. I've used it more than once to discover a setting I didn't know existed, then replicated it in my deployment scripts.&lt;/p&gt;

&lt;p&gt;But the Console has an obvious limitation: it's not scriptable. Every deployment is manual, which means it's not repeatable, not version-controllable, and impossible to code review. You can't &lt;code&gt;git diff&lt;/code&gt; a click.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Exploration, one-off deployments, reviewing and tweaking configurations visually, and learning what options exist before writing automation.&lt;/p&gt;

&lt;h3&gt;
  
  
  gcloud CLI with Full Configuration
&lt;/h3&gt;

&lt;p&gt;You've already seen &lt;code&gt;gcloud run deploy&lt;/code&gt; with minimal flags. But the same command is also a full-featured deployment tool with &lt;a href="https://cloud.google.com/sdk/gcloud/reference/run/deploy" rel="noopener noreferrer"&gt;dozens of configuration options&lt;/a&gt;. Every Cloud Run configuration is a flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; us-docker.pkg.dev/my-project/repo/my-image:v1.2.3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory&lt;/span&gt; 512Mi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cpu&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--min-instances&lt;/span&gt; 0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-instances&lt;/span&gt; 10 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--concurrency&lt;/span&gt; 80 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-env-vars&lt;/span&gt; &lt;span class="s2"&gt;"DB_HOST=10.0.0.1,ENV=production"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-secrets&lt;/span&gt; &lt;span class="s2"&gt;"API_KEY=my-secret:latest"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service-account&lt;/span&gt; my-sa@my-project.iam.gserviceaccount.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--revision-suffix&lt;/span&gt; v1-2-3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Memory? A flag. Secrets from &lt;a href="https://cloud.google.com/secret-manager" rel="noopener noreferrer"&gt;Secret Manager&lt;/a&gt;? A flag. &lt;a href="https://cloud.google.com/run/docs/configuring/vpc-direct-vpc" rel="noopener noreferrer"&gt;VPC connectivity&lt;/a&gt;? A flag. This makes &lt;code&gt;gcloud&lt;/code&gt; commands scriptable, repeatable, and easy to drop into shell scripts or CI/CD pipelines.&lt;/p&gt;

&lt;p&gt;One thing that keeps it practical: &lt;strong&gt;configuration is sticky.&lt;/strong&gt; Once you set a flag, it stays on the service until you explicitly change it. So your first deployment might be the long one with &lt;code&gt;--memory&lt;/code&gt;, &lt;code&gt;--cpu&lt;/code&gt;, &lt;code&gt;--max-instances&lt;/code&gt;, and everything else. But subsequent deployments can go back to the simple &lt;code&gt;gcloud run deploy my-service --image my-image:v2 --region us-central1&lt;/code&gt; and all your previous settings carry over. You only specify what you want to change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Scripted deployments, shell scripts, CI/CD integration, and when you need precise, repeatable control over every setting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Declarative YAML
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services replace service.yaml &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CLI flags are imperative: "change these things." YAML is declarative: "this is what I want." You define your entire service configuration in a file, and Cloud Run makes reality match. If something drifted (someone tweaked a setting in the Console), the YAML corrects it. If nothing changed, nothing happens.&lt;/p&gt;

&lt;p&gt;The YAML follows the &lt;a href="https://cloud.google.com/run/docs/reference/rest/v1/namespaces.services" rel="noopener noreferrer"&gt;Knative Serving API v1 schema&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.knative.dev/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cloud.googleapis.com/location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-central1&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;run.googleapis.com/ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;all&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;autoscaling.knative.dev/minScale&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0'&lt;/span&gt;
        &lt;span class="na"&gt;autoscaling.knative.dev/maxScale&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;10'&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containerConcurrency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-docker.pkg.dev/my-project/repo/my-image:v1.2.3&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
          &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512Mi&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
          &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ENV&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the same model as &lt;a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt; and &lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt;: you describe &lt;em&gt;what you want&lt;/em&gt;, not &lt;em&gt;how to get there&lt;/em&gt;. And because the schema is Knative-compatible, these YAML files are portable between Cloud Run and self-hosted &lt;a href="https://knative.dev/" rel="noopener noreferrer"&gt;Knative&lt;/a&gt; on Kubernetes.&lt;/p&gt;

&lt;p&gt;A practical tip: you can export an existing service's configuration with &lt;code&gt;gcloud run services describe SERVICE --format export &amp;gt; service.yaml&lt;/code&gt;, modify it, and reapply. This is a great way to bring a service that was originally deployed via the Console or CLI into version control.&lt;/p&gt;
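&lt;p&gt;Putting that together, a minimal round trip looks like this. The service name and region are placeholders; &lt;code&gt;gcloud run services replace&lt;/code&gt; is the command that applies a YAML file to Cloud Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Export the live configuration to a file
gcloud run services describe my-service \
  --region us-central1 \
  --format export &gt; service.yaml

# Edit service.yaml, then make Cloud Run match it
gcloud run services replace service.yaml \
  --region us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;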

&lt;p&gt;One important nuance: IAM policies (who can invoke your service) are managed separately from the service definition. The YAML defines the service configuration; &lt;code&gt;gcloud run services add-iam-policy-binding&lt;/code&gt; controls access. This separation is deliberate and useful: access control is often owned by a different team, and keeping it out of the YAML means reapplying a service file can never accidentally change who has access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; GitOps workflows, infrastructure-as-code, teams that want configuration in version control, and teams migrating from Kubernetes or Knative.&lt;/p&gt;

&lt;h3&gt;
  
  
  Continuous Deployment from Git
&lt;/h3&gt;

&lt;p&gt;Connect a &lt;a href="https://github.com" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, &lt;a href="https://gitlab.com" rel="noopener noreferrer"&gt;GitLab&lt;/a&gt;, or &lt;a href="https://bitbucket.org" rel="noopener noreferrer"&gt;Bitbucket&lt;/a&gt; repository to Cloud Run through &lt;a href="https://cloud.google.com/build/docs/automating-builds/create-manage-triggers" rel="noopener noreferrer"&gt;Cloud Build triggers&lt;/a&gt;. Every push to a specified branch automatically builds your container and deploys a new revision. You set it up once and forget it.&lt;/p&gt;

&lt;p&gt;You can configure this &lt;a href="https://cloud.google.com/run/docs/continuous-deployment-with-cloud-build" rel="noopener noreferrer"&gt;from the Cloud Run Console UI&lt;/a&gt; under "Set up continuous deployment," or manually by creating Cloud Build triggers. Either way, the result is the same: push to &lt;code&gt;main&lt;/code&gt;, wait a couple of minutes, and your changes are live. Under the hood, it uses the same buildpack pipeline as &lt;code&gt;--source&lt;/code&gt; deploys: your code is auto-detected, built into an image, pushed to Artifact Registry, and deployed as a new revision.&lt;/p&gt;

&lt;p&gt;The difference between this and the next two options (Cloud Build and GitHub Actions) is simplicity. Continuous deployment from Git is a pre-built pipeline. You don't write build steps or workflow files. The trade-off is flexibility: you can't run tests before deploying, can't deploy to staging first, and can't customize the build beyond what Cloud Build's auto-detection provides. If you need any of those things, keep reading.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that want automated deployment on every push without building or maintaining custom CI/CD.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud Build
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/build" rel="noopener noreferrer"&gt;Cloud Build&lt;/a&gt; is Google Cloud's serverless CI/CD platform. Where the previous option gives you a pre-built pipeline, Cloud Build gives you the building blocks to assemble your own.&lt;/p&gt;

&lt;p&gt;You define your pipeline in a &lt;code&gt;cloudbuild.yaml&lt;/code&gt; file at the root of your repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gcr.io/cloud-builders/docker'&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;build'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-t'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-docker.pkg.dev/$PROJECT_ID/repo/my-image:$COMMIT_SHA'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gcr.io/cloud-builders/docker'&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;push'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-docker.pkg.dev/$PROJECT_ID/repo/my-image:$COMMIT_SHA'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gcr.io/google.com/cloudsdktool/cloud-sdk'&lt;/span&gt;
    &lt;span class="na"&gt;entrypoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcloud&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;run'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deploy'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my-service'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
           &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--image'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-docker.pkg.dev/$PROJECT_ID/repo/my-image:$COMMIT_SHA'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
           &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--region'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-central1'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each &lt;code&gt;step&lt;/code&gt; runs in its own container. The first step builds the Docker image, tagging it with the commit SHA for traceability. The second pushes it to Artifact Registry. The third deploys it to Cloud Run. You can chain as many steps as you need: run tests, lint code, scan for vulnerabilities, deploy to staging, run integration tests against staging, then deploy to production. The &lt;code&gt;$PROJECT_ID&lt;/code&gt; and &lt;code&gt;$COMMIT_SHA&lt;/code&gt; are &lt;a href="https://cloud.google.com/build/docs/configuring-builds/substitute-variable-values" rel="noopener noreferrer"&gt;built-in substitution variables&lt;/a&gt; that Cloud Build populates automatically.&lt;/p&gt;
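&lt;p&gt;As a sketch of what "test before deploy" looks like, you could prepend a test step. This assumes a Node.js project; swap the image and commands for your stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;steps:
  # Install dependencies and run the test suite first.
  # If this step fails, the build stops and nothing is deployed.
  - name: 'node:20'
    entrypoint: bash
    args: ['-c', 'npm ci &amp;&amp; npm test']
  # ...followed by the build, push, and deploy steps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;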

&lt;p&gt;Trigger this pipeline &lt;a href="https://cloud.google.com/build/docs/automating-builds/create-manage-triggers" rel="noopener noreferrer"&gt;on every push to a branch&lt;/a&gt;, on pull requests, or on-demand with &lt;code&gt;gcloud builds submit&lt;/code&gt;. That flexibility is the point: Cloud Build is the pipeline, and you decide what goes in it.&lt;/p&gt;
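&lt;p&gt;For example, wiring the pipeline to a GitHub repository and running it once by hand might look like this. The trigger name, repository owner, and repository name are placeholders, and the repository must already be connected to Cloud Build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Run the pipeline on every push to main
gcloud builds triggers create github \
  --name deploy-on-push \
  --repo-owner my-org \
  --repo-name my-repo \
  --branch-pattern '^main$' \
  --build-config cloudbuild.yaml

# Or run the same pipeline once, on demand
gcloud builds submit --config cloudbuild.yaml .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;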

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Complex build pipelines, multi-service deployments, teams that need test-before-deploy workflows, and teams already invested in Google Cloud's CI/CD ecosystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  GitHub Actions
&lt;/h3&gt;

&lt;p&gt;If your code lives on GitHub and your CI/CD already runs on &lt;a href="https://github.com/features/actions" rel="noopener noreferrer"&gt;GitHub Actions&lt;/a&gt;, Google provides an official action for deploying to Cloud Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/auth@v2&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;workload_identity_provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;projects/123/locations/global/workloadIdentityPools/my-pool/providers/my-provider'&lt;/span&gt;
    &lt;span class="na"&gt;service_account&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deployer@my-project.iam.gserviceaccount.com'&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/deploy-cloudrun@v2&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-docker.pkg.dev/my-project/repo/my-image:${{ github.sha }}&lt;/span&gt;
    &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-central1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Authentication is handled via &lt;a href="https://cloud.google.com/iam/docs/workload-identity-federation" rel="noopener noreferrer"&gt;Workload Identity Federation&lt;/a&gt;, which lets GitHub Actions authenticate to Google Cloud without service account keys. Instead of storing a JSON key file as a GitHub secret (a secret that never expires and can be copied anywhere), Workload Identity Federation uses short-lived tokens granted through an identity mapping. No keys to store, no keys to rotate, no keys to accidentally leak in a log.&lt;/p&gt;
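&lt;p&gt;One detail that's easy to miss: the workflow needs permission to mint the OIDC token that Workload Identity Federation exchanges. A minimal workflow skeleton around the steps above might look like this (names are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;name: deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write  # required for Workload Identity Federation
    steps:
      - uses: actions/checkout@v4
      # ...the auth and deploy-cloudrun steps shown above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;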

&lt;p&gt;The &lt;a href="https://github.com/google-github-actions/deploy-cloudrun" rel="noopener noreferrer"&gt;&lt;code&gt;google-github-actions/deploy-cloudrun&lt;/code&gt;&lt;/a&gt; action supports both image-based and source-based deployments, and the resulting service URL is available as a workflow output for downstream steps (useful for posting preview links on pull requests).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams with GitHub-centric workflows who want deployment integrated into their existing CI/CD.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choosing the Right Deployment Method
&lt;/h3&gt;

&lt;p&gt;Here's a quick reference for the methods we covered:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Deploy Speed&lt;/th&gt;
&lt;th&gt;Build Control&lt;/th&gt;
&lt;th&gt;Repeatable&lt;/th&gt;
&lt;th&gt;Best Scenario&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;From Source&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Yes (CLI)&lt;/td&gt;
&lt;td&gt;Prototyping, quick iterations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Image&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Yes (CI/CD)&lt;/td&gt;
&lt;td&gt;Production, custom builds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud Run Functions&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Yes (CLI)&lt;/td&gt;
&lt;td&gt;Single-purpose endpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Console UI&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Exploration, learning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gcloud CLI (full)&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Yes (scripts)&lt;/td&gt;
&lt;td&gt;Scripted deploys, CI/CD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Declarative YAML&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Yes (GitOps)&lt;/td&gt;
&lt;td&gt;Infrastructure-as-code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Git Continuous Deploy&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Yes (auto)&lt;/td&gt;
&lt;td&gt;Simple auto-deploy on push&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud Build&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Yes (pipeline)&lt;/td&gt;
&lt;td&gt;Complex CI/CD pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Actions&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Yes (pipeline)&lt;/td&gt;
&lt;td&gt;GitHub-centric teams&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In practice, most teams follow a natural progression. You start with &lt;strong&gt;source deploy&lt;/strong&gt; for prototyping: one command, instant feedback. As the project matures, you move to &lt;strong&gt;pre-built images&lt;/strong&gt; for reproducibility and control. When the team grows, you add &lt;strong&gt;CI/CD&lt;/strong&gt; (Cloud Build, GitHub Actions, or continuous deployment from Git) so deployments happen automatically and consistently.&lt;/p&gt;

&lt;p&gt;You don't need to pick one and stick with it. Use source deploy for your dev environment and image-based deploys for production. Use the Console to explore, then codify what you learned in YAML. The deployment method is a tool, not a commitment.&lt;/p&gt;

&lt;p&gt;These aren't the only options. Cloud Run also supports deployment via &lt;a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/cloud_run_v2_service" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt;, &lt;a href="https://cloud.google.com/deploy" rel="noopener noreferrer"&gt;Cloud Deploy&lt;/a&gt; for managed continuous delivery pipelines, &lt;a href="https://cloud.google.com/code" rel="noopener noreferrer"&gt;Cloud Code&lt;/a&gt; for IDE integration, client libraries, and the REST API directly. The &lt;a href="https://cloud.google.com/run/docs/deploying" rel="noopener noreferrer"&gt;Cloud Run deployment documentation&lt;/a&gt; covers the full list.&lt;/p&gt;

&lt;h2&gt;
  
  
  Revisions and Traffic Management
&lt;/h2&gt;

&lt;p&gt;Every time you deploy to Cloud Run, it creates a new &lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/managing/revisions" rel="noopener noreferrer"&gt;revision&lt;/a&gt;&lt;/strong&gt;: an immutable snapshot of your service's configuration and container image. Think of revisions as Cloud Run's built-in version control. Your service might accumulate dozens of revisions over its lifetime, each representing a specific deployment. Old revisions stick around and can serve traffic again at any time. Nothing is deleted unless you explicitly remove it.&lt;/p&gt;

&lt;p&gt;By default, Cloud Run auto-generates revision names like &lt;code&gt;my-service-00001-abc&lt;/code&gt;. That works, but it's not helpful when you're staring at a list of revisions trying to figure out which one introduced a bug. You can set meaningful names with the &lt;code&gt;--revision-suffix&lt;/code&gt; flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; us-docker.pkg.dev/my-project/repo/my-image:v1.2.3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--revision-suffix&lt;/span&gt; v1-2-3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the revision is named &lt;code&gt;my-service-v1-2-3&lt;/code&gt;. In a CI/CD pipeline, you might use the Git commit SHA: &lt;code&gt;--revision-suffix=$(git rev-parse --short HEAD)&lt;/code&gt;. When something goes wrong, you can immediately tell which commit is running.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traffic Splitting
&lt;/h3&gt;

&lt;p&gt;But the real power of revisions is &lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/rollouts-rollbacks-traffic-migration" rel="noopener noreferrer"&gt;traffic splitting&lt;/a&gt;&lt;/strong&gt;. Because revisions are immutable and stick around, you can split traffic between them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services update-traffic my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--to-revisions&lt;/span&gt; my-service-v1-2-3&lt;span class="o"&gt;=&lt;/span&gt;95,my-service-v1-3-0&lt;span class="o"&gt;=&lt;/span&gt;5 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sends 95% of traffic to the old revision and 5% to the new one. Watch the metrics in &lt;a href="https://cloud.google.com/monitoring" rel="noopener noreferrer"&gt;Cloud Monitoring&lt;/a&gt;. If the new revision's error rate and latency look good, shift more traffic. If something's off, one command puts you back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services update-traffic my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--to-revisions&lt;/span&gt; my-service-v1-2-3&lt;span class="o"&gt;=&lt;/span&gt;100 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instant rollback. No redeployment, no rebuild, no downtime. The old revision is still there, already running, ready to take 100% of traffic again. This is what immutable revisions buy you.&lt;/p&gt;

&lt;p&gt;While the CLI works well for scripted rollouts, traffic splitting is one of those things that's often easier to do from the &lt;a href="https://console.cloud.google.com/run" rel="noopener noreferrer"&gt;Cloud Run Console&lt;/a&gt;. You can see all your revisions, drag sliders to adjust percentages, and watch the changes take effect.&lt;/p&gt;

&lt;p&gt;You can also use traffic splitting for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A/B testing&lt;/strong&gt; across different versions of your service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gradual rollouts&lt;/strong&gt; where you shift traffic incrementally (5% → 25% → 50% → 100%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blue/green deployments&lt;/strong&gt; by deploying the new revision with 0% traffic, testing it via a revision tag (see below), then flipping traffic all at once&lt;/li&gt;
&lt;/ul&gt;
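&lt;p&gt;A blue/green rollout with these pieces is just two commands: deploy the new revision with no production traffic but a tagged URL, verify it, then flip. The image tag and names are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Deploy the new revision; it gets a tagged URL but 0% of traffic
gcloud run deploy my-service \
  --image us-docker.pkg.dev/my-project/repo/my-image:v1.3.0 \
  --no-traffic \
  --tag green \
  --region us-central1

# After testing the green URL, send all traffic to the new revision
gcloud run services update-traffic my-service \
  --to-latest \
  --region us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;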

&lt;h3&gt;
  
  
  Revision Tags
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/run/docs/rollouts-rollbacks-traffic-migration#tags" rel="noopener noreferrer"&gt;Revision tags&lt;/a&gt; give individual revisions their own stable URLs without routing any production traffic to them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services update-traffic my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-tags&lt;/span&gt; &lt;span class="nv"&gt;staging&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-service-v1-3-0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a URL like &lt;code&gt;https://staging---my-service-abc123-uc.a.run.app&lt;/code&gt; that points directly to that revision. Your QA team can test the new version at that URL while production traffic continues hitting the current revision untouched. When you're satisfied, shift traffic. No separate staging environment needed.&lt;/p&gt;

&lt;p&gt;You can have multiple tags active at once: &lt;code&gt;staging&lt;/code&gt;, &lt;code&gt;canary&lt;/code&gt;, &lt;code&gt;pr-42&lt;/code&gt;. Each gets its own URL. This is particularly useful in CI/CD pipelines where you want to run automated tests against a deployed revision before routing real users to it.&lt;/p&gt;
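&lt;p&gt;Tags linger until you remove them, so it's worth cleaning up ones you no longer need, for example after a pull request is merged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services update-traffic my-service \
  --remove-tags pr-42 \
  --region us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;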

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Cloud Run gives you many paths to deployment, each designed for a different stage of your project. You don't need to use all of them.&lt;/p&gt;

&lt;p&gt;The pattern I see most often: teams start with &lt;code&gt;gcloud run deploy --source .&lt;/code&gt; and the default configuration. That gets them running in minutes. As the project matures, they move to pre-built images for reproducibility, add CI/CD for automation, and use revisions and traffic splitting for safe rollouts. Every change takes effect on the next deployment, with zero downtime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 3 is coming soon.&lt;/strong&gt; We'll dive into the configuration options that let you tune Cloud Run for your specific workload: CPU, memory, scaling, networking, secrets, and security. Follow so you don't miss it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs" rel="noopener noreferrer"&gt;Cloud Run documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/pricing" rel="noopener noreferrer"&gt;Cloud Run pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/deploying" rel="noopener noreferrer"&gt;Cloud Run deployment documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/sdk/gcloud/reference/run/deploy" rel="noopener noreferrer"&gt;&lt;code&gt;gcloud run deploy&lt;/code&gt; reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/build" rel="noopener noreferrer"&gt;Cloud Build&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/docs/buildpacks/overview" rel="noopener noreferrer"&gt;Google Cloud Buildpacks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/artifact-registry" rel="noopener noreferrer"&gt;Artifact Registry&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/writing-a-function" rel="noopener noreferrer"&gt;Cloud Run functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/deploy-functions" rel="noopener noreferrer"&gt;Cloud Run deploy functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/managing/revisions" rel="noopener noreferrer"&gt;Cloud Run revisions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/rollouts-rollbacks-traffic-migration" rel="noopener noreferrer"&gt;Cloud Run traffic splitting (rollouts and rollbacks)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/monitoring" rel="noopener noreferrer"&gt;Cloud Monitoring&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/reference/rest/v1/namespaces.services" rel="noopener noreferrer"&gt;Knative Serving API v1 schema&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/build/docs/automating-builds/create-manage-triggers" rel="noopener noreferrer"&gt;Cloud Build triggers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/continuous-deployment-with-cloud-build" rel="noopener noreferrer"&gt;Continuous deployment from Git&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/google-github-actions/deploy-cloudrun" rel="noopener noreferrer"&gt;&lt;code&gt;google-github-actions/deploy-cloudrun&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/iam/docs/workload-identity-federation" rel="noopener noreferrer"&gt;Workload Identity Federation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/deploy" rel="noopener noreferrer"&gt;Cloud Deploy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/code" rel="noopener noreferrer"&gt;Cloud Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/cloud_run_v2_service" rel="noopener noreferrer"&gt;Terraform Cloud Run resource&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/secret-manager" rel="noopener noreferrer"&gt;Secret Manager&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/vpc-direct-vpc" rel="noopener noreferrer"&gt;Direct VPC egress&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://knative.dev/" rel="noopener noreferrer"&gt;Knative&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cloudrun</category>
      <category>gcp</category>
      <category>serverless</category>
      <category>devops</category>
    </item>
    <item>
      <title>This is Cloud Run: A Decision Guide for Developers</title>
      <dc:creator>Daniel Gwerzman</dc:creator>
      <pubDate>Sat, 14 Mar 2026 08:54:11 +0000</pubDate>
      <link>https://forem.com/gde/this-is-cloud-run-a-decision-guide-for-developers-428m</link>
      <guid>https://forem.com/gde/this-is-cloud-run-a-decision-guide-for-developers-428m</guid>
      <description>&lt;p&gt;I like to throw spaghetti at the wall and see if it sticks.&lt;/p&gt;

&lt;p&gt;Some of my best projects started exactly that way. An idea on a Saturday morning, a container deployed by lunch, a URL shared with a friend by dinner. No infrastructure planning, no provisioning tickets, no three-day detour through VPC configurations before writing a single line of business logic. Just code, deploy, done.&lt;/p&gt;

&lt;p&gt;Every single one of them ran on &lt;a href="https://cloud.google.com/run?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For me, Cloud Run is my go-to, whether it's a weekend experiment that might go nowhere or a production solution for a client. More than once, a quick demo I built on Cloud Run ended up maturing into the actual production system, running on the exact same setup.&lt;/p&gt;

&lt;p&gt;A recent example: every time I build something with AI, even a quick vibe-coded prototype, I need a server-side component to keep my LLM API calls secure and my credential keys away from public eyes. Cloud Run is perfect for this. In minutes I have a secure backend with HTTPS, and I didn't have to think about infrastructure at all. The prototype works, the keys are safe, and if the project grows into something real, the backend is already production-ready.&lt;/p&gt;

&lt;p&gt;The idea that you need days of infrastructure preparation before you can test something with real users has always felt backwards to me. I believe in getting something live as fast as possible, putting it in front of people, and &lt;em&gt;then&lt;/em&gt; deciding if it deserves more investment.&lt;/p&gt;

&lt;p&gt;But this article isn't a love letter. I want to give you the understanding to make a real architectural decision: &lt;strong&gt;when is Cloud Run the right choice, and when isn't it?&lt;/strong&gt; We'll look at what it actually is under the hood, what you get for free, where its boundaries are, and when you should consider moving to Kubernetes. By the end, you'll know whether Cloud Run belongs in your next project, or whether you should reach for something else entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Cloud Run?
&lt;/h2&gt;

&lt;p&gt;Cloud Run is a fully managed serverless platform on &lt;a href="https://cloud.google.com/?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Google Cloud&lt;/a&gt; that runs containers. You give it code, it gives you a URL. No clusters to provision, no nodes to manage, no load balancers to configure. You bring the code; Google handles everything else.&lt;/p&gt;

&lt;p&gt;But what makes Cloud Run different is its core promise: &lt;strong&gt;the same configuration that runs your proof of concept can carry you to production.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think about that for a moment. Most cloud services force you into one of two buckets: either you use a "quick and dirty" option for prototyping that you'll have to throw away later, or you invest days of infrastructure setup upfront for a production-grade environment. Cloud Run refuses that trade-off. Your weekend project and your production workload run on the same platform, with the same security, the same scaling, the same deployment model.&lt;/p&gt;

&lt;p&gt;And when nobody is using your service? It scales to zero. You pay nothing. That means you can spin up ten experimental services, let them sit idle for a month, and your bill is exactly zero.&lt;/p&gt;

&lt;p&gt;So where does Cloud Run sit in the serverless landscape? It's not a virtual machine: you don't manage an OS. It's not a Kubernetes cluster: you don't manage nodes or pods. It's not a function: you're not limited to a single entry point with a 15-minute timeout. It's a &lt;strong&gt;container-as-a-service&lt;/strong&gt;: you provide something that can run in a container, and the platform handles everything else (placement, scaling, networking, TLS). If you already know containers, bring your image and Cloud Run runs it. If you don't? Just point it at your source code and Cloud Run will package and deploy it for you. And if you don't even want to manage an HTTP framework, &lt;a href="https://cloud.google.com/run/docs/writing-a-function?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Run functions&lt;/a&gt; let you write functions and deploy them individually, with Cloud Run wrapping each one in a server for you. Either way, you end up with a running service.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Behind the Curtain?
&lt;/h2&gt;

&lt;p&gt;You don't need to understand any of this to use Cloud Run. But knowing what's underneath explains &lt;em&gt;why&lt;/em&gt; the defaults are production-grade and why you can trust them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Borg: Google's Internal Engine
&lt;/h3&gt;

&lt;p&gt;Cloud Run doesn't run on some separate, less-proven infrastructure. It runs directly on &lt;a href="https://research.google/pubs/large-scale-cluster-management-at-google-with-borg/" rel="noopener noreferrer"&gt;Borg&lt;/a&gt;, Google's internal cluster management system. The same system that powers Gmail, YouTube, Google Search, and virtually every other Google service. Borg has been in production for over a decade, deploying &lt;strong&gt;billions of containers per week&lt;/strong&gt; across clusters of tens of thousands of machines.&lt;/p&gt;

&lt;p&gt;If Borg sounds familiar, it should. It was the direct predecessor to &lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt;. Many of the same engineers and architectural concepts carried over. But while Kubernetes is the open-source version built for the rest of us, Borg is the battle-hardened original that still runs Google internally.&lt;/p&gt;

&lt;p&gt;What does this mean for your containers? It means they inherit the same scheduling, failover, and resource management that Google trusts for its own products. It means your service benefits from Google's &lt;a href="https://cloud.google.com/security/beyondprod?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;BeyondProd&lt;/a&gt; zero-trust security framework, where trust depends on code provenance and service identity, not network location. It means &lt;a href="https://cloud.google.com/docs/security/binary-authorization-for-borg?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Binary Authorization for Borg&lt;/a&gt; verifies that only reviewed, properly built code is deployed to the infrastructure.&lt;/p&gt;

&lt;p&gt;In short: your containers run on the same infrastructure as Gmail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Knative: The API Layer
&lt;/h3&gt;

&lt;p&gt;Cloud Run's API is based on &lt;a href="https://knative.dev/docs/serving/" rel="noopener noreferrer"&gt;Knative Serving&lt;/a&gt;, an open-source project originally started by Google for running serverless workloads on Kubernetes. But Cloud Run is not "managed Knative". It reimplements the Knative Serving API on top of Borg, with no Kubernetes underneath.&lt;/p&gt;

&lt;p&gt;The practical takeaway: if you define your service using a Knative YAML manifest, that definition is portable between Cloud Run and self-hosted Knative on Kubernetes. And because there's no Kubernetes under the hood, you don't pay the complexity tax of managing a cluster.&lt;/p&gt;
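&lt;p&gt;For a sense of what that portability looks like, here is a minimal manifest (the service name and image path are illustrative) that both Cloud Run and self-hosted Knative accept:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-api
spec:
  template:
    spec:
      containers:
      - image: us-docker.pkg.dev/my-project/repo/my-api:v1
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;On Cloud Run you apply it with &lt;code&gt;gcloud run services replace service.yaml&lt;/code&gt;; on a Knative cluster, with &lt;code&gt;kubectl apply&lt;/code&gt;.&lt;/p&gt;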

&lt;h3&gt;
  
  
  gVisor and Container Sandboxing
&lt;/h3&gt;

&lt;p&gt;Every Cloud Run instance is sandboxed with &lt;strong&gt;two layers of isolation&lt;/strong&gt;: not just Linux namespaces and cgroups like standard containers, but hardware-backed virtualization on top of that.&lt;/p&gt;

&lt;p&gt;Cloud Run offers two execution environments:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gen1 (gVisor-based):&lt;/strong&gt; &lt;a href="https://gvisor.dev/" rel="noopener noreferrer"&gt;gVisor&lt;/a&gt; is an open-source container sandbox developed by Google. It acts as a user-space kernel, a process written in Go that intercepts your container's system calls and reimplements them, so the host kernel is never directly exposed. This gives you a smaller attack surface and faster cold starts, but some software that relies on unusual system calls may be incompatible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gen2 (Linux microVM-based):&lt;/strong&gt; Instead of gVisor, Gen2 runs your container inside a lightweight virtual machine with a full Linux kernel. You get complete system call compatibility and better sustained CPU and network performance, but slightly longer cold starts.&lt;/p&gt;

&lt;p&gt;Both environments use the same two-layer approach: a hardware-backed &lt;strong&gt;virtual machine monitor (VMM)&lt;/strong&gt; boundary between instances, plus a software kernel layer (gVisor's user-space kernel or the microVM's guest kernel). Even if someone found a way to escape the container sandbox, they'd still face the hardware virtualization boundary.&lt;/p&gt;

&lt;p&gt;You choose per service. Pricing is identical. Most developers never need to think about it. If you don't specify an execution environment, Cloud Run selects one automatically based on the features your service uses.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Production-Ready Default
&lt;/h3&gt;

&lt;p&gt;You can happily use Cloud Run without knowing any of this. But when someone asks "is Cloud Run production-ready?", the answer isn't "probably." It's hardware-backed isolation between every instance, zero-trust security, and battle-tested scheduling. It comes hardened.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Get Out of the Box
&lt;/h2&gt;

&lt;p&gt;Here's what a single deploy command gives you, before you touch a single config file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;gcloud run deploy my-api &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1 &lt;span class="nt"&gt;--allow-unauthenticated&lt;/span&gt;

Building using Dockerfile and deploying container to Cloud Run service &lt;span class="o"&gt;[&lt;/span&gt;my-api] &lt;span class="k"&gt;in &lt;/span&gt;project &lt;span class="o"&gt;[&lt;/span&gt;my-project] region &lt;span class="o"&gt;[&lt;/span&gt;us-central1]
✓ Building and deploying... Done.
  ✓ Uploading sources...
  ✓ Building Container...
  ✓ Creating Revision...
  ✓ Routing traffic...
Done.
Service &lt;span class="o"&gt;[&lt;/span&gt;my-api] revision &lt;span class="o"&gt;[&lt;/span&gt;my-api-00001-abc] has been deployed and is serving 100 percent of traffic.
Service URL: https://my-api-abc123-uc.a.run.app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That URL is live, load-balanced, auto-scaling, and secured with a managed TLS certificate. Let's break down what's included.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security (Zero Config)
&lt;/h3&gt;

&lt;p&gt;Every Cloud Run service automatically gets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;HTTPS with managed TLS certificates.&lt;/strong&gt; Every &lt;code&gt;*.run.app&lt;/code&gt; URL is served over HTTPS with auto-provisioned, auto-renewed certificates. There is no option to serve plain HTTP on the public endpoint. You cannot accidentally deploy an insecure service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DDoS protection.&lt;/strong&gt; The &lt;a href="https://cloud.google.com/docs/security/infrastructure/design#google-front-end-service" rel="noopener noreferrer"&gt;Google Front End (GFE)&lt;/a&gt; sits in front of every Cloud Run service, applying the same DDoS protections that guard Google's own services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hardware-backed container isolation.&lt;/strong&gt; As we covered in the "behind the curtain" section, every instance is sandboxed behind a VMM boundary. This isn't namespace isolation, it's virtualization-level separation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Encryption everywhere.&lt;/strong&gt; Data encrypted at rest using Google-managed keys. All traffic between Google Cloud services encrypted in transit. This is the default and always on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/iam?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;IAM&lt;/a&gt;-based access control.&lt;/strong&gt; Every service integrates with Google Cloud IAM. By default, services require authentication. You explicitly opt in to public access with &lt;code&gt;--allow-unauthenticated&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scaling (Zero Config)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scale-to-zero by default.&lt;/strong&gt; No traffic? No instances. No cost. This is the default behavior, you don't configure it, you don't enable it. It just works.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Autoscaling up to 100 instances by default.&lt;/strong&gt; Cloud Run automatically evaluates two key signals: request concurrency (targeting 60% of your configured max) and CPU utilization (targeting 60%). It scales up and down based on real demand. The default cap of 100 instances can be raised. To put that in perspective: if your service handles one request at a time and you get a sudden spike of 80 concurrent users, Cloud Run spins up roughly 80 instances to absorb the load, then scales back down as traffic drops.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Idle instance retention.&lt;/strong&gt; After the last request, instances may be kept idle for up to 15 minutes before being terminated. This absorbs traffic bursts without cold starts. It's a small detail that makes a big difference in practice.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/cpu?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Startup CPU Boost&lt;/a&gt;.&lt;/strong&gt; When an instance starts up, Cloud Run temporarily doubles (or more) its CPU allocation to speed up initialization. A service configured for 2 vCPU gets boosted to 4 vCPU during startup and for 10 seconds after. Google reported up to 50% faster startup times for Java/Spring applications when this feature is enabled.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
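&lt;p&gt;As a rough mental model of the concurrency signal (a sketch only, not the actual algorithm, which also weighs CPU utilization and rate-limits its decisions):&lt;/p&gt;

```python
import math

def desired_instances(concurrent_requests, max_concurrency=80, target=0.6):
    # Sketch of the concurrency-based scaling signal: keep each instance
    # at roughly 60% of its configured max concurrency.
    if concurrent_requests == 0:
        return 0  # scale to zero when there is no traffic
    return math.ceil(concurrent_requests / (max_concurrency * target))

print(desired_instances(200))  # 200 concurrent requests, default concurrency 80 -> 5
print(desired_instances(0))    # idle -> 0
```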

&lt;h3&gt;
  
  
  Observability (Zero Config)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automatic logging.&lt;/strong&gt; Everything your container writes to &lt;code&gt;stdout&lt;/code&gt; and &lt;code&gt;stderr&lt;/code&gt; is automatically captured in &lt;a href="https://cloud.google.com/logging?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Logging&lt;/a&gt;. No agent to install, no sidecar to configure. Write structured JSON logs and they're automatically parsed into searchable fields. For example, a JSON line like &lt;code&gt;{"severity":"ERROR", "message":"connection refused", "sessionId":"abc-123", "userId":"user-42"}&lt;/code&gt; becomes a fully filterable log entry in the Cloud Logging console. That means you can add any custom fields you want to your JSON payload (session IDs, user IDs, request traces, feature flags) and later filter your logs by those exact fields. Debugging a problem for a specific user? Filter by &lt;code&gt;jsonPayload.userId="user-42"&lt;/code&gt; and you get every log entry for that user across all instances.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Built-in metrics.&lt;/strong&gt; &lt;a href="https://cloud.google.com/monitoring?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Monitoring&lt;/a&gt; automatically tracks request count, latency distribution, CPU utilization, memory utilization, and instance count. These show up in the Cloud Run console with no setup.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Audit logs always on.&lt;/strong&gt; &lt;a href="https://cloud.google.com/logging/docs/audit?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Admin Activity audit logs&lt;/a&gt; record who deployed what, when, and with what configuration. These are always enabled and cannot be turned off.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
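&lt;p&gt;Emitting such a structured entry needs nothing beyond the standard library, one JSON object per line on &lt;code&gt;stdout&lt;/code&gt; (a minimal sketch; &lt;code&gt;severity&lt;/code&gt; is the key Cloud Logging treats specially):&lt;/p&gt;

```python
import json

def log(severity, message, **fields):
    # One JSON object per line on stdout. Cloud Logging maps the
    # "severity" key to the entry's severity; the remaining keys land
    # under jsonPayload and become filterable fields.
    line = json.dumps({"severity": severity, "message": message, **fields})
    print(line, flush=True)
    return line

log("ERROR", "connection refused", sessionId="abc-123", userId="user-42")
```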

&lt;h3&gt;
  
  
  Infrastructure (Zero Config)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Built-in load balancing.&lt;/strong&gt; Requests are distributed across instances automatically. No load balancer to provision or configure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Zero-downtime deployments.&lt;/strong&gt; Every deployment creates a new immutable &lt;a href="https://cloud.google.com/run/docs/managing/revisions?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;revision&lt;/a&gt;. Traffic switches to the new revision only after it passes its startup probe. Old instances keep serving in-flight requests. No deployment strategy to configure. It just happens. And because revisions are immutable and stick around, you can &lt;a href="https://cloud.google.com/run/docs/rollouts-rollbacks-traffic-migration?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;split traffic&lt;/a&gt; between them. Send 5% of traffic to the new revision while 95% stays on the current one, monitor the metrics, and gradually shift. Canary deployments without a deployment tool.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automatic health checks.&lt;/strong&gt; Cloud Run configures a TCP startup probe by default: it waits for your container to listen on the expected port before sending traffic. Your service doesn't receive requests until it's actually ready.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OS patching and runtime maintenance.&lt;/strong&gt; You never patch the underlying OS, kernel, or runtime. Google handles the entire infrastructure stack beneath your container.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
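&lt;p&gt;The traffic split described above is one CLI call (the revision names here are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services update-traffic my-api \
  --region us-central1 \
  --to-revisions my-api-00002-def=5,my-api-00001-abc=95
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;When the canary looks healthy, &lt;code&gt;--to-latest&lt;/code&gt; shifts all traffic to the newest revision.&lt;/p&gt;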

&lt;h2&gt;
  
  
  Cloud Run Functions: The Simpler Path
&lt;/h2&gt;

&lt;p&gt;Everything above applies to Cloud Run &lt;em&gt;services&lt;/em&gt;, where you bring a container (or source code) that runs an HTTP server. But what if you don't want to deal with an HTTP framework at all?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/run/docs/writing-a-function?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Run functions&lt;/a&gt; let you skip all of that. You write your functions, point a deployment at one of them, and Cloud Run wraps it in an HTTP server automatically. Your source code can define as many functions as you like. Each deployment serves one entry point, specified by the &lt;code&gt;--function&lt;/code&gt; flag. Same codebase, multiple deployments, each with its own URL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;functions_framework&lt;/span&gt;

&lt;span class="nd"&gt;@functions_framework.http&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hello&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;World&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy hello-func &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--function&lt;/span&gt; hello &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--base-image&lt;/span&gt; python312 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No &lt;code&gt;Flask&lt;/code&gt;, no &lt;code&gt;FastAPI&lt;/code&gt;, no &lt;code&gt;Dockerfile&lt;/code&gt;. Cloud Run builds the container, injects the HTTP server, and deploys it. You get the same scaling, the same security, the same zero-config observability that a full Cloud Run service gets.&lt;/p&gt;

&lt;p&gt;If this sounds like &lt;a href="https://cloud.google.com/functions?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Functions&lt;/a&gt;, here's the history. Cloud Functions 1st gen ran on older, separate infrastructure with strict limits: 9-minute timeouts, one request per instance, no concurrency. Cloud Functions 2nd gen (GA in 2022) was already built on top of Cloud Run under the hood, which unlocked 60-minute timeouts and multi-request concurrency. In 2024, Google made it official and rebranded 2nd gen as &lt;strong&gt;Cloud Run functions&lt;/strong&gt;, consolidating everything under the Cloud Run name. So this isn't a new product. It's the recognition that the infrastructure was already unified. If your functions outgrow the one-entry-point-per-deployment model and you need routing, middleware, or multiple endpoints behind a single URL, you swap it for a full service on the same platform. No migration, no new infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use functions vs. services:&lt;/strong&gt; Cloud Run functions shine for single-purpose endpoints: webhooks, event handlers, lightweight APIs, scheduled tasks. A good example from my own workflow: when I build AI-powered front-end apps, I never call the LLM API directly from the client. That would mean shipping my API keys to the browser. Instead, I deploy a Cloud Run function that sits between my front end and the LLM provider. The function validates the user's authorization, makes the LLM call with my credentials server-side, and returns the response. My keys never leave the server. It takes minutes to set up, and it's exactly the kind of single-purpose endpoint where a function is the right fit. The moment you need multiple routes, middleware, or background processing within the same service, a full Cloud Run service with your own HTTP framework gives you that control. It's not a matter of which is "better." It's about matching the model to the job.&lt;/p&gt;

&lt;p&gt;Supported runtimes include Node.js, Python, Go, Java, .NET, Ruby, and PHP.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud Run Jobs: Run to Completion
&lt;/h2&gt;

&lt;p&gt;Cloud Run services and functions are request-driven: they wait for traffic and respond to it. But not every workload fits that model. What about a nightly database export, a batch of image transformations, or a data pipeline that processes a million rows and then exits?&lt;/p&gt;

&lt;p&gt;That's what &lt;a href="https://cloud.google.com/run/docs/create-jobs?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Run jobs&lt;/a&gt; are for. Instead of listening for requests, a job runs your container to completion and stops. No HTTP endpoint, no scaling based on traffic. You tell it what to do, it does it, and it's done.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run &lt;span class="nb"&gt;jobs &lt;/span&gt;create my-etl-job &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; us-docker.pkg.dev/my-project/repo/etl:v1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tasks&lt;/span&gt; 100 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--task-timeout&lt;/span&gt; 30m &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-retries&lt;/span&gt; 3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run &lt;span class="nb"&gt;jobs &lt;/span&gt;execute my-etl-job &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first command creates the job. The second runs it. You can also run jobs &lt;a href="https://cloud.google.com/run/docs/execute/jobs-on-schedule?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;on a schedule&lt;/a&gt; using &lt;a href="https://cloud.google.com/scheduler?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Scheduler&lt;/a&gt;, or trigger them from workflows and event-driven pipelines.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;--tasks&lt;/code&gt; flag is where it gets interesting. A job can run up to &lt;strong&gt;10,000 parallel tasks&lt;/strong&gt;, each receiving a &lt;code&gt;CLOUD_RUN_TASK_INDEX&lt;/code&gt; environment variable (0 through 9,999) so it knows which chunk of work to handle. Need to process a million images? Create a job with 1,000 tasks, each processing 1,000 images. Cloud Run runs them in parallel, retries any that fail (up to &lt;code&gt;--max-retries&lt;/code&gt;), and reports the result.&lt;/p&gt;
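&lt;p&gt;Inside each task, the sharding logic is a few lines (a sketch; Cloud Run jobs also set &lt;code&gt;CLOUD_RUN_TASK_COUNT&lt;/code&gt; alongside the index):&lt;/p&gt;

```python
import os

def shard(items, index=None, count=None):
    # Cloud Run jobs inject CLOUD_RUN_TASK_INDEX and CLOUD_RUN_TASK_COUNT
    # into every task's environment; each task keeps every count-th item
    # starting at its own index.
    if index is None:
        index = int(os.environ.get("CLOUD_RUN_TASK_INDEX", 0))
    if count is None:
        count = int(os.environ.get("CLOUD_RUN_TASK_COUNT", 1))
    return [item for i, item in enumerate(items) if i % count == index]

images = ["img-%d.png" % i for i in range(1000)]
print(len(shard(images, index=3, count=100)))  # task 3 of a 100-task job gets 10 images
```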

&lt;p&gt;Task timeouts go up to &lt;strong&gt;168 hours (7 days)&lt;/strong&gt;, or 1 hour with GPU, compared to the 60-minute request timeout on services. This makes jobs the natural fit for workloads that take hours to complete.&lt;/p&gt;

&lt;p&gt;Jobs get the same infrastructure benefits as services: the same Borg scheduling, the same container isolation, the same scaling. The difference is the execution model. Services are long-lived and request-driven. Jobs are ephemeral and task-driven. Both are first-class Cloud Run workload types.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Cloud Run Is NOT the Right Choice
&lt;/h2&gt;

&lt;p&gt;Every platform has boundaries. Cloud Run's have narrowed significantly over the past two years (sidecars, GPU support, volume mounts, and worker pools have all landed), but real limits remain. Knowing them in advance saves you from the painful realization six months into a project that you're fighting the platform instead of building on it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Statelessness by Design
&lt;/h3&gt;

&lt;p&gt;Cloud Run instances are ephemeral. They can be created and destroyed at any moment. If your architecture requires any of the following, you need to understand the trade-offs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Local disk persistence beyond the instance lifecycle.&lt;/strong&gt; The local filesystem is ephemeral. When the instance is gone, so is everything on disk. That said, Cloud Run now supports mounting &lt;a href="https://cloud.google.com/run/docs/configuring/services/cloud-storage-volume-mounts?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Storage buckets via FUSE&lt;/a&gt; and &lt;a href="https://cloud.google.com/run/docs/configuring/services/nfs-volume-mounts?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;NFS file shares via Filestore&lt;/a&gt;, giving you read/write access to persistent shared storage. Cloud Storage mounts are eventually consistent (no file locking, last write wins), while Filestore gives you full POSIX semantics. Neither is local disk, but for many use cases they close the gap.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;In-memory caching shared across instances.&lt;/strong&gt; There are no sticky sessions by default (though &lt;a href="https://cloud.google.com/run/docs/configuring/session-affinity?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;session affinity&lt;/a&gt; is available on a best-effort basis). Each request might hit a different instance. If you need shared state, you need an external store like &lt;a href="https://redis.io/" rel="noopener noreferrer"&gt;Redis&lt;/a&gt; or &lt;a href="https://cloud.google.com/memorystore?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Memorystore&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;WebSocket connections that must survive beyond ~60 minutes.&lt;/strong&gt; Cloud Run supports WebSockets, and combined with session affinity this works well for real-time applications. But connections are limited to approximately 60 minutes (the maximum &lt;a href="https://cloud.google.com/run/docs/configuring/request-timeout?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;request timeout&lt;/a&gt;). If you need connections that live for hours or days, you need dedicated infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Long-running background workers without HTTP triggers.&lt;/strong&gt; Cloud Run services are request-driven. But this boundary is softening: &lt;a href="https://cloud.google.com/run/docs/deploy-worker-pools?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Cloud Run worker pools&lt;/a&gt; (currently in preview) are designed for pull-based workloads like Kafka consumers and Pub/Sub subscribers, with no public HTTP endpoint required and up to 40% lower pricing than standard services.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
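&lt;p&gt;The volume mounts mentioned above are declared at deploy time. For example, mounting a Cloud Storage bucket (the service and bucket names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy my-api \
  --source . \
  --region us-central1 \
  --add-volume name=assets,type=cloud-storage,bucket=my-assets-bucket \
  --add-volume-mount volume=assets,mount-path=/mnt/assets
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;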

&lt;p&gt;Teams that need truly stateful workloads (ML model serving with warm caches that must survive across deploys, game servers with persistent connections beyond 60 minutes) find &lt;a href="https://cloud.google.com/kubernetes-engine?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;GKE's&lt;/a&gt; persistent volumes and StatefulSets a more honest fit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Container Support: Better Than Before, Not Kubernetes
&lt;/h3&gt;

&lt;p&gt;Cloud Run now supports &lt;a href="https://cloud.google.com/run/docs/configuring/services/containers?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;multi-container instances (sidecars)&lt;/a&gt;. You can run up to 10 containers per instance sharing the same network namespace and in-memory volumes. This enables patterns like running &lt;a href="https://cloud.google.com/run/docs/internet-proxy-nginx-sidecar?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;Nginx as a reverse proxy&lt;/a&gt;, &lt;a href="https://cloud.google.com/run/docs/tutorials/custom-metrics-opentelemetry-sidecar?utm_campaign=deveco_gdemembers&amp;amp;utm_source=deveco" rel="noopener noreferrer"&gt;OpenTelemetry collectors&lt;/a&gt; for custom metrics, Envoy for traffic management, or Prometheus for metric export.&lt;/p&gt;

&lt;p&gt;But it's not full Kubernetes pod topology. The key differences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Only one container receives inbound HTTP traffic&lt;/strong&gt; (the "ingress container"). Sidecars can't independently serve external requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No init containers.&lt;/strong&gt; You can control startup ordering (sidecar starts before ingress container), but unlike Kubernetes init containers, sidecars keep running. They don't run to completion before the main container starts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Maximum 10 containers per instance.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most sidecar patterns (proxies, observability agents, log processors), Cloud Run's implementation is sufficient. For complex pod topologies with init containers and multiple ingress points, GKE remains the answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Networking Depth
&lt;/h3&gt;

&lt;p&gt;Cloud Run's networking has improved with &lt;a href="https://cloud.google.com/run/docs/configuring/vpc-direct-vpc" rel="noopener noreferrer"&gt;Direct VPC egress&lt;/a&gt; (placing instances directly on your VPC without a connector), but teams still hit walls with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Service mesh requirements.&lt;/strong&gt; &lt;a href="https://istio.io/" rel="noopener noreferrer"&gt;Istio&lt;/a&gt; and &lt;a href="https://cloud.google.com/anthos/service-mesh" rel="noopener noreferrer"&gt;Anthos Service Mesh&lt;/a&gt; are native in GKE. You can run an Envoy sidecar on Cloud Run, but a full service mesh with mTLS, traffic policies, and observability across services is a different story.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pod-to-pod direct communication&lt;/strong&gt; without going through load balancers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom network policies&lt;/strong&gt; for zero-trust internal segmentation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-cluster routing and traffic mirroring.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your architecture involves sophisticated network topologies or strict internal traffic control, GKE gives you the knobs. Cloud Run gives you simplicity at the cost of that control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cold Start Economics
&lt;/h3&gt;

&lt;p&gt;Cloud Run's &lt;a href="https://cloud.google.com/run/docs/configuring/min-instances" rel="noopener noreferrer"&gt;minimum instances&lt;/a&gt; feature mitigates cold starts, and &lt;a href="https://cloud.google.com/run/docs/configuring/services/cpu" rel="noopener noreferrer"&gt;Startup CPU Boost&lt;/a&gt; temporarily doubles CPU during initialization to get instances ready faster. For many workloads, these two features together make cold starts a non-issue.&lt;/p&gt;
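&lt;p&gt;Both mitigations are single flags on an existing service (the service name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services update my-api \
  --region us-central1 \
  --min-instances 1 \
  --cpu-boost
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;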

&lt;p&gt;But if your latency requirements are strict and you end up keeping instances always-on, you've lost the serverless cost model. You're now paying for always-on compute. And once you're paying for always-on instances anyway, the economic argument shifts toward GKE, where you have more control over resource packing, node utilization, and cost optimization across multiple services sharing the same cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Workload Heterogeneity
&lt;/h3&gt;

&lt;p&gt;Cloud Run primarily targets HTTP/gRPC workloads, though it keeps expanding. &lt;a href="https://cloud.google.com/run/docs/create-jobs" rel="noopener noreferrer"&gt;Cloud Run jobs&lt;/a&gt; handle batch processing, and &lt;a href="https://cloud.google.com/run/docs/configuring/services/gpu" rel="noopener noreferrer"&gt;GPU support&lt;/a&gt; makes AI/ML inference possible with scale-to-zero economics. The NVIDIA L4 (24 GB VRAM) is generally available, and the NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM) is available in preview.&lt;/p&gt;

&lt;p&gt;But the moment a team needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Daemonsets&lt;/strong&gt; for node-level operations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Priority classes and preemption policies&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple GPUs per instance&lt;/strong&gt; (Cloud Run supports only one)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large models beyond single-GPU capacity&lt;/strong&gt; (the L4's 24 GB VRAM limits you to ~9B parameters, though the RTX PRO 6000 with 96 GB VRAM expands this significantly in preview)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...GKE becomes the natural fit. Cloud Run is opinionated about what it runs. That opinion keeps getting broader, but it has limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Migration Path: Cloud Run to Kubernetes
&lt;/h2&gt;

&lt;p&gt;Here's the good news: if you start on Cloud Run and later need to move to Kubernetes, the migration path is straightforward, at least for the container itself.&lt;/p&gt;

&lt;p&gt;If you deployed with a Docker image, &lt;strong&gt;that same image runs on GKE without modification&lt;/strong&gt;. Your container doesn't know or care whether it's running on Cloud Run or Kubernetes. It listens on a port, responds to HTTP requests, and that's it.&lt;/p&gt;

&lt;p&gt;If you deployed from source code (using Cloud Run's buildpack-based deployment), converting to a Docker image is trivial. You're adding a &lt;code&gt;Dockerfile&lt;/code&gt; to a project that already works. The application code, the dependencies, the runtime behavior, none of that changes. You're just making the packaging step explicit instead of letting buildpacks handle it.&lt;/p&gt;
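&lt;p&gt;For example, a buildpack-deployed Python service might gain a &lt;code&gt;Dockerfile&lt;/code&gt; like the one below. This is a sketch, not the one true layout: the &lt;code&gt;requirements.txt&lt;/code&gt;, &lt;code&gt;gunicorn&lt;/code&gt;, and &lt;code&gt;main:app&lt;/code&gt; names are assumptions about the project:&lt;/p&gt;

```dockerfile
# Sketch of an explicit Dockerfile for a service that buildpacks
# previously packaged automatically. All names are illustrative.
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Cloud Run injects the listening port via $PORT; defaulting to 8080
# lets the same image also run unmodified on GKE.
CMD ["sh", "-c", "exec gunicorn --bind :${PORT:-8080} main:app"]
```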

&lt;p&gt;But here's the honest part: &lt;strong&gt;the container isn't the hard part of the migration. The redesign is.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You're moving to Kubernetes &lt;em&gt;because&lt;/em&gt; you need something Cloud Run doesn't offer: complex pod topologies, full service mesh, multi-GPU inference, or unlimited connection lifetimes. That means you're not just moving a container; you're evolving your architecture. What was an in-memory cache on Cloud Run becomes a Redis cluster backed by persistent volumes on GKE. What was a single ingress container with an Envoy sidecar becomes a pod with init containers, network policies, and custom scheduling rules. The container image stays the same; everything around it changes.&lt;/p&gt;
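&lt;p&gt;To make that concrete, here is a minimal sketch of the same container expressed as a GKE Deployment. The names and image path are placeholders, and a real migration would layer on the init containers, network policies, and scheduling rules described above:&lt;/p&gt;

```yaml
# Minimal illustrative Deployment: the image reference is the same
# one Cloud Run was serving; everything around it is new.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: app
          image: us-docker.pkg.dev/my-project/my-repo/my-service:latest
          ports:
            - containerPort: 8080
```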

&lt;p&gt;The container is portable. The architecture might not be. And that's fine. It's the right trade-off. Cloud Run lets you start fast, validate your idea with real users, and build confidence in the solution. When you hit the boundaries we discussed above, you graduate to Kubernetes with a proven container and a clear understanding of what you actually need.&lt;/p&gt;

&lt;p&gt;Cloud Run isn't a dead end. It's a deliberate starting point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So here's the decision framework:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with Cloud Run&lt;/strong&gt; when you're building a containerized service and you want to move fast without worrying about infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stay on Cloud Run&lt;/strong&gt; as long as your workload fits its model: stateless, request-driven, with scaling needs that the platform handles naturally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graduate to GKE&lt;/strong&gt; when you hit the boundaries. You'll know when you do, because you'll be fighting the platform instead of building on it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Your container is the unit of portability. Whether it ends up on Cloud Run, GKE, or another platform entirely, the work you put into building and packaging it is never wasted. That's not unique to Cloud Run. It's the power of containers in general. But Cloud Run is the fastest way I've found to prove that a container works in production, without the upfront investment that usually requires.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/kulaone/this-is-cloud-run-nine-ways-to-deploy-and-when-to-use-each-506b"&gt;Part 2&lt;/a&gt;&lt;/strong&gt; gets hands-on: the different ways to deploy to Cloud Run (there are more than you'd expect). &lt;strong&gt;Part 3 is coming soon&lt;/strong&gt;: we'll dive into the configuration options that let you tune CPU, memory, scaling, networking, and security for your specific needs. Follow so you don't miss it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs" rel="noopener noreferrer"&gt;Cloud Run documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/pricing" rel="noopener noreferrer"&gt;Cloud Run pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://research.google/pubs/large-scale-cluster-management-at-google-with-borg/" rel="noopener noreferrer"&gt;Large-scale cluster management at Google with Borg (research paper)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/security/beyondprod" rel="noopener noreferrer"&gt;BeyondProd: A new approach to cloud-native security&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/docs/security/binary-authorization-for-borg" rel="noopener noreferrer"&gt;Binary Authorization for Borg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://knative.dev/docs/serving/" rel="noopener noreferrer"&gt;Knative Serving&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gvisor.dev/" rel="noopener noreferrer"&gt;gVisor: Container sandbox runtime&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/writing-a-function" rel="noopener noreferrer"&gt;Cloud Run functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/functions" rel="noopener noreferrer"&gt;Cloud Functions (predecessor)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/functions/docs/functions-framework" rel="noopener noreferrer"&gt;Functions Framework&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/execution-environments" rel="noopener noreferrer"&gt;Cloud Run execution environments&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/containers" rel="noopener noreferrer"&gt;Cloud Run multi-container (sidecar) support&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/blog/products/serverless/cloud-run-now-supports-multi-container-deployments" rel="noopener noreferrer"&gt;Cloud Run now supports multi-container deployments (blog post)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/gpu" rel="noopener noreferrer"&gt;Cloud Run GPU support&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/deploy-worker-pools" rel="noopener noreferrer"&gt;Cloud Run worker pools&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/cloud-storage-volume-mounts" rel="noopener noreferrer"&gt;Cloud Storage volume mounts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/nfs-volume-mounts" rel="noopener noreferrer"&gt;NFS volume mounts (Filestore)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/services/cpu" rel="noopener noreferrer"&gt;Startup CPU Boost&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/session-affinity" rel="noopener noreferrer"&gt;Session affinity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/vpc-direct-vpc" rel="noopener noreferrer"&gt;Direct VPC egress&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/docs/security/infrastructure/design#google-front-end-service" rel="noopener noreferrer"&gt;Google Front End service&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/about-instance-autoscaling" rel="noopener noreferrer"&gt;Cloud Run instance autoscaling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/configuring/min-instances" rel="noopener noreferrer"&gt;Cloud Run minimum instances&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/managing/revisions" rel="noopener noreferrer"&gt;Cloud Run revisions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/rollouts-rollbacks-traffic-migration" rel="noopener noreferrer"&gt;Cloud Run traffic splitting (rollouts and rollbacks)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/logging" rel="noopener noreferrer"&gt;Cloud Logging&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/monitoring" rel="noopener noreferrer"&gt;Cloud Monitoring&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/iam" rel="noopener noreferrer"&gt;Cloud IAM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/create-jobs" rel="noopener noreferrer"&gt;Cloud Run jobs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/run/docs/execute/jobs-on-schedule" rel="noopener noreferrer"&gt;Execute jobs on a schedule&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/scheduler" rel="noopener noreferrer"&gt;Cloud Scheduler&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/memorystore" rel="noopener noreferrer"&gt;Memorystore (managed Redis)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/kubernetes-engine" rel="noopener noreferrer"&gt;Google Kubernetes Engine (GKE)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cloudrun</category>
      <category>gcp</category>
      <category>serverless</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Automate Your GitHub Workflow with Gemini CLI</title>
      <dc:creator>Daniel Gwerzman</dc:creator>
      <pubDate>Sun, 11 Jan 2026 10:58:08 +0000</pubDate>
      <link>https://forem.com/gde/automate-your-github-workflow-with-gemini-cli-4p4e</link>
      <guid>https://forem.com/gde/automate-your-github-workflow-with-gemini-cli-4p4e</guid>
      <description>&lt;h3&gt;
  
  
  Automate Your GitHub Workflow: Meet Your New AI Coding Partner
&lt;/h3&gt;

&lt;p&gt;Google dropped something that caught my attention back at Cloud Next 25 Tokyo: a new AI coding teammate that lives directly in your GitHub repository. It’s called the Gemini CLI GitHub Action, and it can automatically triage issues, fix bugs, and even review your pull requests.&lt;/p&gt;

&lt;p&gt;The best part? The Google team built this tool for themselves to handle the flood of requests on their own Gemini CLI repository, and now they’re sharing it with the rest of us. I spent some time testing it on my personal Flutter project, and I want to walk you through exactly how to set it up and what you can expect.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Exactly Does This Tool Do?
&lt;/h3&gt;

&lt;p&gt;Before we dive into setup, let’s clarify what we’re working with. The Gemini CLI GitHub Action gives you three main capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Issue Triage&lt;/strong&gt;: It reads new issues and automatically labels them (bug, feature, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Fixes&lt;/strong&gt;: It can analyze issues and write code to solve them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pull Request Reviews&lt;/strong&gt;: It provides automated code reviews with suggestions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it as having an AI developer on your team that works 24/7, never gets tired, and can handle the routine tasks that eat up your time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://youtu.be/CDeVrgTBl6E" rel="noopener noreferrer"&gt;Full review on YouTube&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites: What You Need to Know
&lt;/h3&gt;

&lt;p&gt;This guide assumes you have basic familiarity with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub repositories (creating repos, issues, and pull requests)&lt;/li&gt;
&lt;li&gt;Command line basics&lt;/li&gt;
&lt;li&gt;Having a project you can test with&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re new to GitHub, spend some time with their &lt;a href="https://docs.github.com/en/get-started" rel="noopener noreferrer"&gt;getting started guide&lt;/a&gt; first. You’ll need to be comfortable creating issues and managing repositories.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step-by-Step Setup Guide
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Step 1: Install and Update Gemini CLI
&lt;/h3&gt;

&lt;p&gt;First, make sure you have the Gemini CLI installed and updated to the latest version. If you don’t have it yet, check the &lt;a href="https://github.com/google-gemini/gemini-cli" rel="noopener noreferrer"&gt;official installation guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Once installed, verify you’re on the latest version — this is crucial for the GitHub integration to work properly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Set Up the GitHub Action
&lt;/h3&gt;

&lt;p&gt;Navigate to your project directory and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gemini setup github
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command takes just a few seconds and adds the necessary GitHub Action files to your repository. You’ll see new files in the &lt;code&gt;.github/workflows&lt;/code&gt; directory that define how the AI teammate will respond to different events.&lt;/p&gt;
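&lt;p&gt;To give a feel for their shape, here is a heavily simplified sketch of what one of those workflow files looks like. This is illustrative only, not the generated file; the real workflows are more elaborate and invoke the Gemini CLI directly where the placeholder step sits:&lt;/p&gt;

```yaml
# Illustrative sketch only -- inspect the files the setup command
# actually generated in .github/workflows for the real definitions.
name: gemini-issue-triage
on:
  issues:
    types: [opened, reopened]
  issue_comment:
    types: [created]
jobs:
  triage:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # The generated workflow invokes the Gemini CLI here, reading the
      # API key from the repository secret you configure in Step 4.
      - name: Triage with Gemini
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        run: echo "placeholder for the generated Gemini CLI step"
```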

&lt;h3&gt;
  
  
  Step 3: Commit and Push
&lt;/h3&gt;

&lt;p&gt;Don’t forget this crucial step! Commit all the new files and push them to GitHub. The actions won’t work until they’re actually in your repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git add .
git commit -m "Add Gemini CLI GitHub Actions"
git push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Add Your Gemini API Key
&lt;/h3&gt;

&lt;p&gt;Here’s the step that’s not immediately obvious from the documentation but is absolutely essential:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to your GitHub repository&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Settings&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Security&lt;/strong&gt; → &lt;strong&gt;Secrets and Variables&lt;/strong&gt; → &lt;strong&gt;Actions&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;New repository secret&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Name it &lt;code&gt;GEMINI_API_KEY&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;For the value, you’ll need to get your API key from Google AI Studio&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To get your API key:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open &lt;a href="http://ai.dev" rel="noopener noreferrer"&gt;ai.dev&lt;/a&gt; in a new tab&lt;/li&gt;
&lt;li&gt;This takes you to Google AI Studio&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Get API key&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;Create new API key&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Copy the generated string and paste it as your secret value&lt;/li&gt;
&lt;/ul&gt;
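&lt;p&gt;If you prefer the terminal, the same secret can be added with the GitHub CLI. The repository slug below is a placeholder:&lt;/p&gt;

```shell
# Prompts for the secret value; paste the key from Google AI Studio.
gh secret set GEMINI_API_KEY --repo your-user/your-repo
```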

&lt;h3&gt;
  
  
  Testing Your New AI Teammate
&lt;/h3&gt;

&lt;p&gt;Now for the fun part! Let’s see what this thing can actually do.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing Issue Triage
&lt;/h3&gt;

&lt;p&gt;I started by creating a new issue in my Flutter shopping list project. Here’s what I wrote:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“When a user hit enter on edit mode of an item, the system will end the edit mode and update as usual. and open a new empty item under the edited item that is focus and ready to be edit.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Admittedly, this is a pretty vague description; normally I’d provide much more detail when working with AI tools. But I wanted to see how well it handled unclear requirements.&lt;/p&gt;

&lt;p&gt;To trigger the triage, I added this comment to the issue:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@gemini-cli triage this issue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Within a few minutes, the action ran and automatically labeled my issue as a “bug.” Looking at the action logs, I could see exactly how it made this decision — it analyzed the issue description and determined this was describing missing functionality rather than a new feature request.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing Code Fixes
&lt;/h3&gt;

&lt;p&gt;Next came the real test. I commented:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@gemini-cli fix this issue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where things got interesting. The AI spent several minutes analyzing my code, creating a detailed plan with checkboxes, and then implementing the changes. Unlike my usual coding workflow where I see every change being made, this felt completely autonomous. I just watched it work and waited for the results.&lt;/p&gt;

&lt;p&gt;The process looked like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read and analyze the issue&lt;/li&gt;
&lt;li&gt;Examine the existing codebase&lt;/li&gt;
&lt;li&gt;Create a detailed implementation plan&lt;/li&gt;
&lt;li&gt;Execute the plan step by step&lt;/li&gt;
&lt;li&gt;Create a new branch with the changes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2kg80ombvgpgjgjh7sy.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2kg80ombvgpgjgjh7sy.webp" alt="Github screenshot" width="800" height="516"&gt;&lt;/a&gt;&lt;br&gt;
When it finished, I had a new branch called something like “fix-enter-key-functionality” with actual working code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing Pull Request Review
&lt;/h3&gt;

&lt;p&gt;The final piece was testing the automated code review. I manually created a pull request from the branch the AI had created (it couldn’t create the PR automatically for some reason).&lt;/p&gt;

&lt;p&gt;Almost immediately, another action triggered: the pull request review. After about three minutes, I had a comprehensive code review with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A summary of the changes&lt;/li&gt;
&lt;li&gt;Specific feedback on potential issues&lt;/li&gt;
&lt;li&gt;Security and performance considerations&lt;/li&gt;
&lt;li&gt;An overall assessment (in my case, it was marked as low-to-medium risk)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What Actually Worked (And What Didn’t)
&lt;/h3&gt;

&lt;p&gt;Let’s be honest about the results. The good news: the code actually worked! When I tested my Flutter app, pressing enter while editing an item did indeed update the item and create a new one. The functionality was implemented correctly despite my vague description.&lt;/p&gt;

&lt;p&gt;However, there were some limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The AI had trouble with Flutter/Dart compared to what I’d expect with JavaScript projects (there’s simply more training data available for web technologies)&lt;/li&gt;
&lt;li&gt;It took longer to debug and correct Flutter-specific issues&lt;/li&gt;
&lt;li&gt;The autonomous nature felt strange compared to tools like Cursor where I can approve changes line by line&lt;/li&gt;
&lt;li&gt;It made some unwanted changes to my configuration files&lt;/li&gt;
&lt;li&gt;The pull request creation failed, requiring manual intervention&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pro Tips for Better Results
&lt;/h3&gt;

&lt;p&gt;Based on my testing, here are some recommendations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write Better Issue Descriptions&lt;/strong&gt;: Even though my vague description worked, you’ll get much better results with detailed requirements, acceptance criteria, and context about your project. And always break the work into small pieces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Different AI Tools for Different Tasks&lt;/strong&gt;: I recommend using different AI tools for writing code versus reviewing it. For example, if Claude writes the code, use Gemini or another tool for the review. This provides a fresh perspective and catches issues the original tool might miss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Review the Prompts&lt;/strong&gt;: One of the best features is that all the prompts are visible and customizable. Check out the &lt;code&gt;.github/workflows&lt;/code&gt; files to see exactly what instructions the AI is following, and modify them to match your team's standards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start Small&lt;/strong&gt;: Test this on smaller, non-critical repositories first. Get comfortable with how it works before deploying it on your main projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Bottom Line: Should You Try It?
&lt;/h3&gt;

&lt;p&gt;Absolutely. Even with its limitations, this tool represents a significant step forward in AI-assisted development. It’s particularly valuable for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Solo developers who want help with routine tasks&lt;/li&gt;
&lt;li&gt;Open source maintainers dealing with lots of issues and PRs&lt;/li&gt;
&lt;li&gt;Teams looking to standardize their review process&lt;/li&gt;
&lt;li&gt;Anyone curious about autonomous AI coding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The setup is straightforward, it’s free to use (you just pay for Gemini API usage), and the prompts are completely customizable. Even if you don’t use it as-is, studying the prompts and workflow structure provides excellent insights into effective AI automation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ready to Get Started?
&lt;/h3&gt;

&lt;p&gt;Here’s your action plan:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick a test repository (not your main project!)&lt;/li&gt;
&lt;li&gt;Follow the setup steps above&lt;/li&gt;
&lt;li&gt;Create a simple issue to test triage functionality&lt;/li&gt;
&lt;li&gt;Try the fix command on something small&lt;/li&gt;
&lt;li&gt;Experiment with customizing the prompts to match your workflow&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Remember, this is still early-stage technology. Approach it with curiosity rather than expecting perfection, and you’ll likely find some genuinely useful automation for your development workflow.&lt;/p&gt;

&lt;p&gt;The future of coding is increasingly collaborative between humans and AI. Tools like this give us a glimpse of what that partnership might look like — and honestly, it’s pretty exciting.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Have you tried the Gemini CLI GitHub Action? I’d love to hear about your experience and any creative ways you’ve customized it for your projects.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gemini</category>
      <category>git</category>
    </item>
  </channel>
</rss>
