Forem: Shir Meir Lador

Agent Factory Recap: How Gemma 4 Taught Itself Physics

Shir Meir Lador — Thu, 14 May 2026 14:10:49 +0000

In this episode of The Agent Factory, Vlad Kolesnikov and I sat down with Omar Sanseviero from the Developer Experience team at Google DeepMind. We explored the groundbreaking release of Gemma 4: a new family of open models designed to bring high-level intelligence and agentic capabilities directly to consumer hardware and mobile devices. Since the launch last month, Gemma 4 had over 50 million downloads!

This post guides you through the key ideas from our conversation. Use it to quickly recap topics or dive deeper into specific segments with links and timestamps.

Gemma 4 - What is it?

Gemma 4 is the latest generation of open models from Google DeepMind, built on the same foundational research as Gemini 3. The family is designed to deliver exceptional "intelligence per parameter" across a range of deployment scenarios, from mobile phones to powerful workstations. The Gemma 4 model family now spans three distinct architectures:

Small Sizes (E2B & E4B): Optimized for ultra-mobile, edge, and browser deployment (such as Pixel or Chrome).
Dense (31B): A powerful 31-billion parameter model that provides server-grade performance for local execution on consumer GPUs.
Mixture-of-Experts (26B MoE): A highly efficient architecture designed for high-throughput tasks and advanced reasoning.

With the shift to an Apache 2 license, these models provide developers and startups with the flexibility to build, modify, and commercialize applications while maintaining full control over their infrastructure.

Omar Sanseviero on how Gemma 4 changes the landscape for agent developers

Timestamp: 1:40

Omar highlighted that Gemma 4 brings "very high intelligence per parameter," making it possible to run agentic workflows entirely offline. We saw examples of multiple Gemma instances running locally to generate SVGs (1:53) and an Android-based agent picking specific skills, like playing the piano, to complete tasks (2:45). As Omar noted, "This means that you can run very powerful things with very little hardware overhead...even in the phone that you have in your pocket."

The Factory Floor

Building a Local Food Tour Agent

Timestamp: 5:29

We showcased a food tour agent powered by Gemma 4 using the Agent Development Kit (ADK) and a Google Maps MCP server. We demonstrated how a local model can handle complex, multi-step reasoning tasks.

The agent identified the best ramen spots in Seattle under a $30 budget.
It verified that the locations were within walking distance of each other.
It processed search results to provide specific tips on what to order and what to avoid.

Autonomous Python Code Execution

Timestamp: 8:03

In this demo, we pushed Gemma 4's coding capabilities to the limit by asking it to express itself through animation. Using a sandbox execution environment, the model performed the following:

Wrote Python code using the Matplotlib library.
Attempted to build a physics engine to simulate a bouncing ball.
Self-corrected when the initial execution environment lacked certain CPU features, finding an alternative path to successfully generate the animation.
Demonstrated a deep understanding of real-world physics and gravity through code.

The Shift to Apache 2 Licensing

Timestamp: 4:05

A major theme of the conversation was the community-driven decision to move Gemma 4 to an Apache 2 license. This change provides developers and startups with maximum flexibility to build, modify, and commercialize applications. Omar emphasized that this was a direct response to developer feedback, aiming to unlock a new wave of innovation in the open models ecosystem.

Developer Q&A

Architectural Decisions and Mixture of Experts (MoE)

Timestamp: 17:23

Omar explained the technical shifts that make Gemma 4 so efficient. For the first time, the Gemma family includes a Mixture of Experts (MoE) architecture, which optimizes for extremely low latency in production. Additionally, the smaller E2B and E4B models utilize per-layer embeddings to remain "cheap" to run on GPUs. For vision tasks, the model now supports variable aspect ratios, allowing it to understand images of various sizes more accurately than previous fixed-resolution versions.

Comparing Gemma to Gemini

Timestamp: 19:51

When asked how Gemma stacks up against its larger sibling, Gemini, Omar clarified that they serve different purposes. While Gemini excels at massive-scale tasks and deep "world knowledge" due to its size, Gemma is the "best open model that can run on a single consumer GPU." It is specifically optimized for instruction following, coding, and agentic use cases where local deployment or fine-tuning is required.

Fine-Tuning for Specialized Industries

Timestamp: 21:10

The conversation touched on the importance of "Sovereign AI" and privacy. Because Gemma is an open model, developers in regulated industries, like healthcare or finance, can fine-tune the model on their private data and deploy it within their own air-gapped infrastructure. This gives developers full control over their data and the model's specialized expertise.

Conclusion

Gemma 4 marks a turning point for agentic development, proving that you don't always need a massive cloud cluster to build something smart. Whether it's running a physics simulation on a laptop or a travel guide on a phone, the barrier to entry for high-performance AI has never been lower. We are entering an era where the "conductor" of the AI orchestra can be any developer with a single GPU and a great idea.

Your turn to build

Now that you've seen what Gemma 4 can do, it's time to start building. Check out the resources in our show notes, the food tour agent, the coding agent, explore the ADK support, and try running Gemma 4 on your local machine or on Cloud Run. We can't wait to see what agents you create!

Watch more of The Agent Factory → Reinforcement learning & fine-tuning on TP...

Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech

Connect with us

Shir Meir Lador → LinkedIn, X
Vlad Kolesnikov → LinkedIn, X
Omar Sanseviero → LinkedIn, X

Deploying a Multi-Agent System with Terraform and Cloud Run

Shir Meir Lador — Thu, 07 May 2026 21:04:12 +0000

In support of our mission to accelerate the developer journey on Google Cloud, we built Dev Signal: a multi-agent system designed to transform raw community signals into reliable technical guidance by automating the path from discovery to expert creation.

In the first three parts of this series, we laid the essential groundwork by establishing its core capabilities and local verification process:

In part 1, we standardize the agent's capabilities through the Model Context Protocol (MCP), connecting it to Reddit for trend discovery and Google Cloud Docs for technical grounding. In part 2, we built a multi-agent architecture and integrated the Vertex AI memory bank to allow the system to learn and persist user preferences across different conversations. In part 3, we verified the full end-to-end lifecycle locally using a dedicated test runner to ensure that research, content creation, and cloud-based memory retrieval were perfectly synchronized.

If you'd like to dive straight into the code, you can clone the repository here.

Deployment to Cloud Run and the Path to Production

To help you transition from this local prototype to a production service, this final part focuses on building the production backbone of your agent using the foundational deployment patterns provided by the Agent Starter Pack. We will implement the essential structural components required for monitoring, data integrity, and long-term state management in the cloud. You will learn to implement the application server and helper utilities needed for a production-ready deployment before provisioning secure, reproducible infrastructure with Terraform.

While the Dockerfile packages your agent's code and its specialized dependencies, such as Node.js for the Reddit MCP tool, Terraform is used to build the platform it lives on. Terraform automates the creation of your Artifact Registry, least-privilege service accounts, and Secret Manager integrations to ensure your API keys remain protected.

By the end of this part, you will have a standardized application framework deployed on Google Cloud Run and a roadmap for graduating your prototype through continuous evaluation, CI/CD and advanced observability.

Production Utilities and Server: Building the System's Body

In this section, you implement the structural components required for monitoring and long-term state management in the cloud.

The Application Server: Initializing the FastAPI server and establishing a vital connection to the Vertex AI memory bank.
Implementing Telemetry: Enabling 'Agent Traces' for visibility into internal reasoning.

The Application Server

The fast_api_app.py file serves as the vital entry point for your agent, transforming the core logic into a production FastAPI server that acts as the "body" of your system. When deploying to Cloud Run, this server is essential because it provides the necessary web interface to listen for incoming HTTP requests and dispatch them to the agent for processing. Beyond basic serving, its most critical role is establishing a connection to the Vertex AI memory bank by defining a MEMORY_URI, which allows the ADK framework to persist and retrieve user preferences across different production sessions. Additionally, the application server initializes production-grade telemetry for real-time monitoring.

Go back to the dev_signal_agent folder.

cd ..

Paste the following code in dev_signal_agent/fast_api_app.py:

import os
from fastapi import FastAPI
from google.adk.cli.fast_api import get_fast_api_app
from google.cloud import logging as cloud_logging
from vertexai import agent_engines
from dev_signal_agent.app_utils.env import init_environment

# --- Initialization & Secure Secret Retrieval ---
# We now unpack the SECRETS dictionary returned by our updated env.py
PROJECT_ID, MODEL_LOC, SERVICE_LOC, SECRETS = init_environment()
logger = cloud_logging.Client().logger(__name__)

# Access sensitive credentials from the SECRETS dictionary
# These keys stay in memory and are NOT injected into os.environ
REDDIT_CLIENT_ID = SECRETS.get("REDDIT_CLIENT_ID")
REDDIT_CLIENT_SECRET = SECRETS.get("REDDIT_CLIENT_SECRET")
REDDIT_USER_AGENT = SECRETS.get("REDDIT_USER_AGENT")
DK_API_KEY = SECRETS.get("DK_API_KEY")

# --- Configuration & Sessions ---
AGENT_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
# Non-sensitive configuration uses environment variables
BUCKET = os.environ.get("AI_ASSETS_BUCKET")
USE_IN_MEMORY = os.environ.get("USE_IN_MEMORY_SESSION", "").lower() in ("true", "1")

# --- MEMORY BANK CONNECTION ---
def _get_memory_bank_uri():
    if USE_IN_MEMORY: return None, None
    # We use 'dev_signal_agent' as the display name for the Vertex AI memory bank
    name = os.environ.get("AGENT_ENGINE_MEMORY_BANK_NAME", "dev_signal_agent")
    existing = list(agent_engines.list(filter=f"display_name={name}"))
    ae = existing[0] if existing else agent_engines.create(display_name=name)
    uri = f"agentengine://{ae.resource_name}"
    print(f"DEBUG: Connecting to Memory Bank: {uri} (display_name={name})")
    return uri, uri

SESSION_URI, MEMORY_URI = _get_memory_bank_uri()

# --- Initialize FastAPI with ADK ---
app: FastAPI = get_fast_api_app(
    agents_dir=AGENT_DIR,
    web=True,
    artifact_service_uri=f"gs://{BUCKET}" if BUCKET else None,
    allow_origins=os.getenv("ALLOW_ORIGINS", "").split(",") if os.getenv("ALLOW_ORIGINS") else None,
    session_service_uri=SESSION_URI,
    memory_service_uri=MEMORY_URI, # <--- Connects the Memory Bank
    otel_to_cloud=True, # <--- Enables production telemetry
)

if __name__ == "__main__":
    import uvicorn
    # Standard Cloud Run port is 8080
    uvicorn.run(app, host="0.0.0.0", port=8080)

Implementing Telemetry

In a production environment, visibility into your agent's reasoning is critical. We leverage the built-in observability features of the Google ADK by setting the otel_to_cloud=True flag in our application server. This single parameter handles the majority of the instrumentation automatically, exporting "Agent Traces" directly to the Google Cloud Console. These traces provide a "visual waterfall" of the agent's operation, including individual agent thought processes, LLM invocations, and MCP tool calls.

Monitoring vs. Targeted Evaluation

It is essential to understand that production tracing is subject to sampling to balance performance and cost. Because Cloud Run captures only a subset of requests, not every individual user interaction will be visible.

System Traces (Monitoring): Used to analyze behavior "at large," such as identifying latency bottlenecks or system timeouts.
Reasoning Traces (Evaluation): High-quality evaluation mandates targeted trace capture. This means calling the agent specifically for a test case where you know you will evaluate that particular request in full detail.

Viewing the Trace

To see your traces, navigate to the Trace Explorer in the Google Cloud Console and filter for your service (e.g., dev-signal). Clicking a specific Trace ID opens a Gantt chart that allows you to distinguish between cognitive reasoning failures (wrong decisions) and physical system issues (timeouts).

For advanced configurations, refer to the following documentation:

Infrastructure as Code: Provisioning Secure Cloud Resources

We utilize the infrastructure-as-code patterns provided by the Agent Starter Pack's security-first design. The starter pack builds the professional platform required to automate the creation of least-privilege service accounts and robust secret management in seconds.

Using Terraform ensures that your entire Google Cloud environment - from IAM roles to Secret Manager versions - is defined in reproducible, secure code. We break our infrastructure into the following logical blocks:

Resources & Variables: Define the specific project, region, and sensitive API secrets used by the agent.
Core Infrastructure: Enable essential APIs and provision a private Artifact Registry to host your agent's container images.
Identity & Access Management (IAM): Configure specialized Service Accounts that strictly follow the Principle of Least Privilege to ensure your system remains secure.
Secret Management: Securely ingest API credentials into Google Secret Manager for protected runtime access.
Cloud Run Configuration: Define the container environment, resource limits, and automated secret injection for the final deployment.

To begin provisioning, return to the root folder of your project (dev-signal) and create the necessary deployment directories:

cd ..
mkdir deployment
cd deployment
mkdir terraform
cd terraform

Terraform Resources and Variables

The variables.tf file defines the configurable parameters for your deployment, allowing you to customize the infrastructure without altering the underlying logic. It includes variables for the project_id, the deployment region (defaulting to us-central1), and the service_name for your Cloud Run instance. Furthermore, it defines a secrets map used to securely ingest sensitive API credentials—such as Reddit and Developer Knowledge keys—into Google Secret Manager for runtime access. This modular approach ensures your production environment remains reproducible, secure, and adaptable across different projects.

Paste the following code into deployment/terraform/variables.tf:

variable "project_id" {
  description = "The Google Cloud Project ID"
  type        = string
}
variable "region" {
  description = "The Google Cloud region to deploy to"
  type        = string
  default     = "us-central1"
}
variable "service_name" {
  description = "The name of the Cloud Run service"
  type        = string
  default     = "dev-signal"
}
variable "secrets" {
  description = "A map of secret names and their values (e.g., REDDIT_CLIENT_ID, DK_API_KEY)"
  type        = map(string)
  default     = {}
}
variable "ai_assets_bucket" {
  description = "The GCS bucket for storing AI assets"
  type        = string
}

Core Infrastructure Logic

We define our infrastructure in logical blocks. Here is what each part does:

1. Enable APIs: Ensures the project has the necessary services active (Cloud Run, Vertex AI, etc.). We use disable_on_destroy = false to prevent accidental data loss if the Terraform is destroyed.

Paste the following code into deployment/terraform/main.tf:

resource "google_project_service" "services" {
  project = var.project_id
  for_each = toset([
    "run.googleapis.com",
    "artifactregistry.googleapis.com",
    "cloudbuild.googleapis.com",
    "aiplatform.googleapis.com",
    "secretmanager.googleapis.com",
    "logging.googleapis.com"
  ])
  service            = each.key
  disable_on_destroy = false
}

2. Artifact Registry: Creates a private Docker registry to store our agent's container images.

resource "google_artifact_registry_repository" "repo" {
  location      = var.region
  project       = var.project_id
  repository_id = "dev-signal-repo"
  description   = "Docker repository for Dev Signal Agent"
  format        = "DOCKER"
  depends_on    = [google_project_service.services]
}

3. Service Account & IAM: Adhering to the Principle of Least Privilege - This is a critical security step. In accordance with the Principle of Least Privilege, we avoid using the default compute service account and instead provision a dedicated user-managed service account (dev-signal-sa). By designating this as the Cloud Run service identity, we can grant it only the minimum necessary permissions—specifically roles/aiplatform.user, roles/logging.logWriter, and roles/storage.objectAdmin. This granular access control ensures that the agent has the exact permissions required to interact with Vertex AI and Cloud Storage without over-granting access to other sensitive cloud resources, significantly reducing the potential impact of a compromised account. Learn more best practices for using service accounts securely.

resource "google_service_account" "agent_sa" {
  project      = var.project_id
  account_id   = "${var.service_name}-sa"
  display_name = "Dev Signal Agent Service Account"
}

4. Secret Management: This handles your API keys securely. It creates secrets in Google Secret Manager and gives the agent's Service Account permission to access them at runtime.

resource "google_secret_manager_secret" "agent_secrets" {
  project  = var.project_id
  for_each = toset(keys(var.secrets))
  secret_id = each.key
  replication {
    auto {}
  }
  depends_on = [google_project_service.services]
}
resource "google_secret_manager_secret_version" "agent_secrets_version" {
  for_each    = toset(keys(var.secrets))
  secret      = google_secret_manager_secret.agent_secrets[each.key].id
  secret_data = var.secrets[each.key]
}
resource "google_secret_manager_secret_iam_member" "secret_accessor" {
  project  = var.project_id
  for_each = toset(keys(var.secrets))
  secret_id = google_secret_manager_secret.agent_secrets[each.key].id
  role      = "roles/secretmanager.secretAccessor"
  member    = "serviceAccount:${google_service_account.agent_sa.email}"
}

5. Cloud Run Configuration:

Security Best Practice: To satisfy production security standards, our main.tf grants the Service Account the secretmanager.secretAccessor role. Our Python application then uses the Secret Manager SDK to pull these credentials directly into local memory at runtime, ensuring they never touch the container's environment configuration

# 6. Cloud Run Service Deployment
resource "google_cloud_run_v2_service" "default" {
  project  = var.project_id
  name     = var.service_name
  location = var.region
  ingress  = "INGRESS_TRAFFIC_ALL"

  template {
    service_account = google_service_account.agent_sa.email

    containers {
      image = "us-docker.pkg.dev/cloudrun/container/hello" # Placeholder until first build

      env {
        name  = "GOOGLE_CLOUD_PROJECT"
        value = var.project_id
      }
      env {
        name  = "GOOGLE_CLOUD_LOCATION"
        value = "global"
      }
      env {
        name  = "GOOGLE_GENAI_USE_VERTEXAI"
        value = "True"
      }
      env {
        name  = "AI_ASSETS_BUCKET"
        value = var.ai_assets_bucket
      }

      resources {
        limits = {
          cpu    = "1"
          memory = "2Gi"
        }
      }
    }
  }

  traffic {
    type    = "TRAFFIC_TARGET_ALLOCATION_TYPE_LATEST"
    percent = 100
  }

Provision the Infrastructure

Before we can deploy our code, we need to provision the Google Cloud infrastructure we just defined.

Initialize Terraform: This downloads the necessary provider plugins. Run this in deployment/terraform folder:

terraform init

Create a Variables File:

Paste this code in deployment/terraform/terraform.tfvars and update it with your project details and secrets.

project_id       = "your-project-id"
region           = "us-central1"
service_name     = "dev-signal"
ai_assets_bucket = "your-bucket-name"
secrets = {
  REDDIT_CLIENT_ID     = "your_client_id"
  REDDIT_CLIENT_SECRET = "your_client_secret"
  REDDIT_USER_AGENT    = "your_user_agent"
  DK_API_KEY           = "your_dk_api_key"
}

Plan configuration: This allows you to review the changes before they are applied. Run this in the deployment/terraform folder:

terraform plan -out=plan.tfplan

Apply Configuration: Once you have reviewed the plan and confirmed it does what you want, run:

terraform apply plan.tfplan

Deployment: Containerization and the Cloud Build Pipeline

In this final stage of the build process, we package our agent's "body" and "brain" into a portable, production-ready container. This ensures that every component - from our Python logic to the Node.js environment required for the Reddit MCP tool - is bundled together with its exact dependencies.

We utilize a Dockerfile to define this environment and a Makefile to orchestrate the deployment pipeline. When you trigger the deployment, Google Cloud Build takes your local source code, builds the container image according to the Dockerfile, and stores it in the private Artifact Registry created earlier by Terraform. Finally, the pipeline automatically updates your Cloud Run service to serve traffic using this fresh image, completing the journey from local code to a live, secure cloud workload.

Paste this code in dev-signal/Dockerfile:

FROM python:3.12-slim

# Install Node.js and npm for MCP tools (like reddit-mcp)
RUN apt-get update && apt-get install -y \
    curl \
    && curl -fsSL https://deb.nodesource.com/setup_20.x | bash - \
    && apt-get install -y nodejs \
    && npm install -g reddit-mcp \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir uv==0.8.13

WORKDIR /code

COPY ./pyproject.toml ./README.md ./uv.lock* ./
COPY ./dev_signal_agent ./dev_signal_agent

RUN uv sync --frozen

EXPOSE 8080

CMD ["uv", "run", "uvicorn", "dev_signal_agent.fast_api_app:app", "--host", "0.0.0.0", "--port", "8080"]

The Makefile automates the build and deploys.

Paste this code in dev-signal/Makefile:

PROJECT_ID ?= $(shell gcloud config get-value project)
REGION     ?= us-central1
IMAGE_REPO ?= dev-signal-repo
IMAGE := $(REGION)-docker.pkg.dev/$(PROJECT_ID)/$(IMAGE_REPO)/agent:latest

# Deploy via Cloud Build & Container
docker-deploy:
    @echo "? Building and deploying to $(PROJECT_ID) via Cloud Build..."
    gcloud builds submit --tag $(IMAGE) --project $(PROJECT_ID) .
    gcloud run services update dev-signal \
        --image $(IMAGE) \
        --region $(REGION) \
        --project $(PROJECT_ID) \
        --labels dev-tutorial=dev-signal-agent

Deploy Application

Now that our infrastructure is ready, we can build and deploy the application code.

Run the following command from the root of your project:

make docker-deploy

What happens when you run this?

Build: Google Cloud Build takes your local code and the Dockerfile, builds a container image, and stores it in the Artifact Registry.
Deploy: It updates the Cloud Run service defined in Terraform to use this new image.

When the deployment completes, you should get a message like this:

Service [dev-signal] revision [dev-signal...] has been deployed and is serving 100 percent of traffic.

Service URL: https://dev-signal-...-.us-central1.run.app

Verification: Accessing and Testing Your Deployed Agent

Since production services are private by default, this section covers how to grant permissions and access the agent securely.

Managing IAM Permissions: Granting the necessary run.invoker role to authorized users.

Secure Access via Cloud Run Proxy: Using the gcloud proxy to interact with your live service.

Granting User Permissions

Before you can invoke the service, you must grant your Google account the roles/run.invoker role for this specific service. Run the following command:

gcloud run services add-iam-policy-binding dev-signal \
  --member="user:$(gcloud config get-value account)" \
  --role="roles/run.invoker" \
  --region=us-central1 \
  --project=$(gcloud config get-value project)

Launch the Proxy

Now, access your private service securely via the proxy:

gcloud run services proxy dev-signal \
  --region us-central1 \
  --project $(gcloud config get-value project)

Visit http://localhost:8080 to chat with your deployed agent! See a possible test scenario in part 3 of the series.

Summary

Congratulations! You have successfully built Dev Signal.

What we covered:

Tooling (MCP): You connected your agent to Reddit, Google Docs, and a Local Image Generator using the Model Context Protocol.
Architecture: You implemented a Root Orchestrator managing specialized agents (Scanner, Expert, Drafter).
Memory: You integrated Vertex AI memory bank to give your agent long-term persistence across sessions.
Production: You deployed the entire stack to Google Cloud Run using Terraform for secure, reproducible infrastructure.

You now have a solid foundation for building sophisticated, stateful AI applications on Google Cloud.

Local Testing of a Multi-Agent System with Memory

Shir Meir Lador — Thu, 07 May 2026 21:03:03 +0000

In part 1 and part 2 of this series, we established the essential groundwork by standardizing the core capabilities through the Model Context Protocol (MCP) and constructing a multi-agent architecture integrated with the Vertex AI memory bank to provide long-term intelligence and persistence. Now, we'll explore how to test your multi-agent system locally!

If you'd like to dive straight into the code and explore it at your own pace, you can clone the repository here.

Testing the Agent Locally

Before transitioning your agentic system to Google Cloud Run, it is essential to ensure that its specialized components work seamlessly together on your workstation. This testing phase allows you to validate trend discovery, technical grounding, and creative drafting within a local feedback loop, saving time and resources during the development process.

In this section, you will configure your local secrets, implement environment-aware utilities, and use a dedicated test runner to verify that Dev Signal can correctly retrieve user preferences from the Vertex AI memory bank on the cloud. This local verification ensures that your agent's "brain" and "hands" are properly synchronized before moving to deployment.

Environment Setup

Create a .env file in your project root. These variables are used for local development and will be replaced by Terraform/Secret Manager in production.

Paste this code in dev-signal/.env and update with your own details.

Note: GOOGLE_CLOUD_LOCATION is set as global because that is where gemini-3-flash-preview is supported. We will use GOOGLE_CLOUD_LOCATION for the model location.

# Google Cloud Configuration
GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_LOCATION=global
GOOGLE_CLOUD_REGION=us-central1
GOOGLE_GENAI_USE_VERTEXAI=True
AI_ASSETS_BUCKET=your_bucket_name

# Reddit API Credentials
REDDIT_CLIENT_ID=your_client_id
REDDIT_CLIENT_SECRET=your_client_secret
REDDIT_USER_AGENT=my-agent/0.1

# Developer Knowledge API Key
DK_API_KEY=your_api_key

Helper Utilities

Create a new directory for your application utils:

cd dev_signal_agent
mkdir app_utils
cd app_utils

Environment Configuration

This module standardizes how the agent discovers the active Google Cloud Project and Region, ensuring a seamless transition between development environments. Using load_dotenv(), the script first checks for local configurations before falling back to google.auth.default() or environment variables to retrieve the Project ID. This automated approach ensures your agent is properly authenticated and grounded in the correct cloud context without requiring manual configuration changes.

Beyond basic project discovery, the script provides a robust Secret Management layer. It attempts to resolve sensitive credentials, such as Reddit API keys, first from the local environment (for rapid development) and then dynamically from the Google Cloud Secret Manager API for production security. By returning these as a dictionary rather than injecting them into environment variables, the module maintains a clean security posture.

The script further calibrates the environment by distinguishing between global and regional requirements for different AI services. It specifically assigns the "global" location for models to access cutting-edge preview features while designating a regional location, such as us-central1, for infrastructure like the Vertex AI Agent Engine.

Paste this code in dev_signal_agent/app_utils/env.py:

import os
import google.auth
import vertexai
from google.cloud import secretmanager
from dotenv import load_dotenv

def _fetch_secrets(project_id: str):
    """Fetch secrets from Secret Manager and return them as a dictionary."""
    secrets_to_fetch = ["REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT", "DK_API_KEY"]
    fetched_secrets = {}

    # First, check local environment (for local development via .env)
    for s in secrets_to_fetch:
        val = os.getenv(s)
        if val:
            fetched_secrets[s] = val

    # If keys are missing (common in production), fetch from Secret Manager API
    if len(fetched_secrets) < len(secrets_to_fetch):
        client = secretmanager.SecretManagerServiceClient()
        for secret_id in secrets_to_fetch:
            if secret_id not in fetched_secrets:
                name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
                try:
                    response = client.access_secret_version(request={"name": name})
                    fetched_secrets[secret_id] = response.payload.data.decode("UTF-8")
                except Exception as e:
                    print(f"Warning: Could not fetch {secret_id} from Secret Manager: {e}")
    return fetched_secrets

def init_environment():
    """Consolidated environment discovery."""
    load_dotenv()
    try:
        _, project_id = google.auth.default()
    except Exception:
        project_id = os.getenv("GOOGLE_CLOUD_PROJECT")

    model_location = os.getenv("GOOGLE_CLOUD_LOCATION", "global")
    service_location = os.getenv("GOOGLE_CLOUD_REGION", "us-central1")

    secrets = {}
    if project_id:
        vertexai.init(project=project_id, location=service_location)
        secrets = _fetch_secrets(project_id)

    return project_id, model_location, service_location, secrets

Local Testing Script

The Google ADK comes with a built-in Web UI that is excellent for visualizing agent logic and tool composition.

You can launch it by running in the project root:

uv run adk web

However, the default Web UI will not test the long-term memory integration described in this tutorial because it is not pre-connected to a Vertex AI memory session. By default, the generic UI often relies on in-memory services that do not persist data across sessions. Therefore, we use the dedicated test_local.py script to explicitly initialize the VertexAiMemoryBankService. This ensures that even in a local environment, your agent is communicating with the real cloud-based memory bank to validate preference persistence.

The test_local.py script:

Connects to the real Vertex AI Agent Engine in the cloud for memory storage.
Uses an in-memory session service for local chat history (so you can wipe it easily).
Runs a chat loop where you can talk to your agent.

Go back to the root folder dev-signal:

cd ../..

Paste this code in dev-signal/test_local.py:

import asyncio
import os
import google.auth
import vertexai
import uuid
from dotenv import load_dotenv
from google.adk.runners import Runner
from google.adk.memory.vertex_ai_memory_bank_service import VertexAiMemoryBankService
from google.adk.sessions import InMemorySessionService
from vertexai import agent_engines
from google.genai import types
from dev_signal_agent.agent import root_agent

# Load environment variables
load_dotenv()

async def main():
    # 1. Setup Configuration
    project_id = os.getenv("GOOGLE_CLOUD_PROJECT")
    # Agent Engine (Memory) MUST use a regional endpoint
    resource_location = "us-central1"
    agent_name = "dev-signal"

    print(f"--- Initializing Vertex AI in {resource_location} ---")
    vertexai.init(project=project_id, location=resource_location)

    # 2. Find the Agent Engine Resource for Memory
    existing_agents = list(agent_engines.list(filter=f"display_name={agent_name}"))
    if existing_agents:
        agent_engine = existing_agents[0]
        agent_engine_id = agent_engine.resource_name.split("/")[-1]
        print(f"✅ Using persistent Memory Bank from Agent: {agent_engine_id}")
    else:
        print(f"❌ Error: Agent Engine '{agent_name}' not found. Please deploy with Terraform first.")
        return

    # 3. Initialize Services
    session_service = InMemorySessionService()
    memory_service = VertexAiMemoryBankService(
        project=project_id,
        location=resource_location,
        agent_engine_id=agent_engine_id
    )

    # 4. Create a Runner
    runner = Runner(
        agent=root_agent,
        app_name="dev-signal",
        session_service=session_service,
        memory_service=memory_service
    )

    # 5. Run a Test Loop
    user_id = "local-tester"
    print("\n--- TEST SCENARIO ---")
    print("1. Start a session, tell the agent your preference (e.g., 'write in rhymes').")
    print("2. Type 'new' to start a FRESH session (local state wiped).")
    print("3. Ask for a blog post. The agent should retrieve your preference from the CLOUD memory.")

    current_session_id = f"session-{str(uuid.uuid4())[:8]}"
    await session_service.create_session(
        app_name="dev-signal",
        user_id=user_id,
        session_id=current_session_id
    )
    print(f"\n--- Chat Session (ID: {current_session_id}) ---")

    while True:
        user_input = input("\nYou: ")
        if user_input.lower() in ["exit", "quit"]:
            break

        if user_input.lower() == "new":
            current_session_id = f"session-{str(uuid.uuid4())[:8]}"
            await session_service.create_session(
                app_name="dev-signal",
                user_id=user_id,
                session_id=current_session_id
            )
            print(f"\n--- Fresh Session Started (ID: {current_session_id}) ---")
            print("(Local history is empty, retrieval must come from Memory Bank)")
            continue

        print("Agent is thinking...")
        async for event in runner.run_async(
            user_id=user_id,
            session_id=current_session_id,
            new_message=types.Content(parts=[types.Part(text=user_input)])
        ):
            if event.content and event.content.parts:
                for part in event.content.parts:
                    if part.text:
                        print(f"Agent: {part.text}")
            if event.get_function_calls():
                for fc in event.get_function_calls():
                    print(f"🛠️ Tool Call: {fc.name}")

if __name__ == "__main__":
    asyncio.run(main())

Running the Test

First, ensure you have your Application Default Credentials set up:

gcloud auth application-default login

Then run the script:

uv run test_local.py

Test Scenario

This scenario validates the full end-to-end lifecycle of the agent: from discovery and research to multimodal content creation and long-term memory retrieval.

Phase 1: Teaching & Multimodal Creation (Session 1)

Goal: Establish technical context and set a specific stylistic preference.

Discovery

Ask the agent to find trending Cloud Run topics.

Input: "Find high-engagement questions about AI agents on Cloud Run from the last 21 days."

Research

Instruct the agent to perform a deep dive on a specific result.

Input: "Use the GCP Expert to research topic #1."

Personalization

Request a blog post and explicitly set your style preference.

Input: "Draft a blog post based on this research. From now on, I want all my technical blogs written in the style of a 90s Rap Song."

Image Generation

Ask the agent to generate an image that demonstrates the main ideas in the blog using the Nano Banana Pro tool. The image will be saved to your bucket in Google Cloud and you should get the path to see it, which will look like: https://storage.mtls.cloud.google.com/...

Phase 2: Long-Term Memory Recall (Session 2)

Goal: Verify the agent recalls preferences across a completely fresh session.

Type new in the console to wipe local session history and start a fresh state.
Retrieval: Inquire about your stored preferences to test the Vertex AI memory bank.

Input: "What are my current topics of interest and what is my preferred blogging style?"

Verification: Confirm the agent successfully retrieves your "AI Agents on Cloud Run" interest and "Rap" style from the cloud.

Final Test: Ask for a new blog on a different topic (e.g., "GKE Autopilot") and ensure it is automatically written as a rap song without being prompted.

Summary

In this part of our series we focused on verifying the agent's functionality in a local environment before proceeding to cloud deployment. By configuring local secrets and utilizing environment-aware utilities, we used a dedicated test runner to confirm that the core reasoning and tool logic are properly integrated. We successfully validated the full lifecycle: from Reddit discovery to expert content creation, confirming that the agent correctly retrieves preferences from the cloud-based Vertex AI memory bank even in completely fresh sessions.

Ready to run the test scenario yourself? Clone the repository and try the test_local.py script to see 'Dev Signal' retrieve your preferences from the Vertex AI memory bank in real-time. For a deeper dive into the underlying mechanics of memory orchestration, check out this quickstart guide.

In the final part of this series, we will transition our prototype into a production service on Google Cloud Run using Terraform for secure infrastructure, and explore the roadmap to production excellence through continuous evaluation and security.

Special thanks to Remigiusz Samborski for the helpful review and feedback on this article.

For more content like this, follow me on LinkedIn and X.

Architect A Personalized Multi-Agent System with Long-Term Memory

Shir Meir Lador — Thu, 07 May 2026 21:01:06 +0000

In support of our mission to accelerate the developer journey on Google Cloud, we built Dev Signal — a multi-agent system designed to transform raw community signals into reliable technical guidance by automating the path from discovery to expert creation.

In the first part of this series for the Dev Signal, we laid the essential groundwork for this system by establishing a project environment and equipping core capabilities through the Model Context Protocol (MCP). We standardized our external integrations, connecting to Reddit for trend discovery, Google Cloud Docs for technical grounding, and building a custom Nano Banana Pro MCP server for multimodal image generation. If you missed Part 1 or want to explore the code directly, you can find the complete project implementation in our GitHub repository.

Now, in Part 2, we focus on building the multi-agent architecture and integrating the Vertex AI memory bank to personalize these capabilities. We will implement a Root Orchestrator that manages three specialist agents: the Reddit Scanner, GCP Expert, and Blog Drafter, to provide a seamless flow from trend discovery to expert content creation. We will also integrate a long-term memory layer that enables the agent to learn from your feedback and persist your stylistic preferences across different conversations. This ensures that Dev Signal doesn't just process data, but actually learns to match your professional voice over time.

Infrastructure and Model Setup

First, we initialize the environment and the shared Gemini model.

Paste this code in dev_signal_agent/agent.py:

from google.adk.agents import Agent
from google.adk.apps import App
from google.adk.models import Gemini
from google.adk.tools import google_search, AgentTool, load_memory_tool, preload_memory_tool
from google.adk.tools.tool_context import ToolContext
from google.genai import types
from dev_signal_agent.app_utils.env import init_environment
from dev_signal_agent.tools.mcp_config import (
    get_reddit_mcp_toolset,
    get_dk_mcp_toolset,
    get_nano_banana_mcp_toolset
)

PROJECT_ID, MODEL_LOC, SERVICE_LOC, SECRETS = init_environment()

shared_model = Gemini(
    model="gemini-3-flash-preview",
    vertexai=True,
    project=PROJECT_ID,
    location=MODEL_LOC,
    retry_options=types.HttpRetryOptions(attempts=3),
)

Memory Ingestion Logic

We want Dev Signal to do more than just follow instructions — we want it to learn from you. By capturing your preferences, such as specific technical interests on Reddit or a preferred blogging style, the agent can personalize its output for future use. To achieve this, we use the Vertex AI memory bank to persist session history across different conversations.

Long-term Memory

We automate this through the save_session_to_memory_callback function. This callback is configured to run automatically after every turn, ensuring that session details are captured and stored in the memory bank without manual intervention.

How Managed Memory Works:

Ingestion: The save_session_to_memory_callback sends the conversation data to Vertex AI.
Embedding: Vertex AI converts the text into numerical vectors (embeddings) that capture the semantic meaning of your preferences.
Storage: These vectors are stored in a managed index, enabling the agent to perform semantic searches and retrieve relevant history in future sessions.
Retrieval: The agent recalls this history using built-in ADK tools. The PreloadMemoryTool proactively brings in context at the start of an interaction, while the LoadMemoryTool allows the agent to fetch specific memories on an as-needed basis.

Paste this code in dev_signal_agent/agent.py:

async def save_session_to_memory_callback(*args, **kwargs) -> None:
    """
    Defensive callback to persist session history to the Vertex AI memory bank.
    """
    ctx = kwargs.get("callback_context") or (args[0] if args else None)
    # Check connection to Memory Service
    if ctx and hasattr(ctx, "_invocation_context") and ctx._invocation_context.memory_service:
        # Save the session!
        await ctx._invocation_context.memory_service.add_session_to_memory(
            ctx._invocation_context.session
        )

Short-term Memory

The add_info_to_state function serves as the agent's short-term working memory, allowing the gcp_expert to reliably hand off its detailed findings to the blog_drafter within the same session. This working memory and the conversation transcript are managed by the Vertex AI Session Service to ensure that active context survives server restarts or transient failures.

The boundary between session-based state and long-term persistence — It is important to note that while this service provides stability during an active interaction, this short-term memory does not persist between different sessions. Starting a fresh session ID effectively resets this working state, ensuring a clean slate for new tasks. Cross-session continuity, where the agent remembers your stylistic preferences or past feedback, is handled by the Vertex AI Memory Bank.

Paste this code in dev_signal_agent/agent.py

def add_info_to_state(tool_context: ToolContext, key: str, data: str) -> dict:
    tool_context.state[key] = data
    return {"status": "success", "message": f"Saved '{key}' to state."}

Specialist 1: Reddit Scanner (Discovery)

The Reddit Scanner is our "Trend Spotter," it identifies high-engagement questions from the last 21 days (3 weeks) to ensure that all research findings remain both timely and relevant.

Memory Usage: It leverages load_memory to retrieve your past areas of interest and preferred topics from the Vertex AI memory bank. If relevant history exists, the agent prioritizes those specific topics in its search to provide a personalized discovery experience.

Beyond simple retrieval, each sub-agent actively updates its memories by listening for new preferences and explicitly acknowledging them during the chat. This process captures relevant information in the session history, where an automated callback then persists it to the long-term Vertex AI memory bank for future use.

This memory management is supported by two distinct retrieval patterns within the Google Agent Development Kit (ADK). The first is the PreloadMemoryTool, which proactively brings in historical context at the beginning of every interaction to ensure the agent is fully briefed before addressing the current request. The second is the LoadMemoryTool, which the agent uses on an as-needed basis, calling upon it only when it decides that deeper past knowledge would be beneficial for the current step in the workflow.

Paste this code in dev_signal_agent/agent.py

# Singleton toolsets
reddit_mcp = get_reddit_mcp_toolset(
    client_id=SECRETS.get("REDDIT_CLIENT_ID", ""),
    client_secret=SECRETS.get("REDDIT_CLIENT_SECRET", ""),
    user_agent=SECRETS.get("REDDIT_USER_AGENT", "")
)

reddit_scanner = Agent(
    name="reddit_scanner",
    model=shared_model,
    instruction="""
You are a Reddit research specialist. Your goal is to identify high-engagement questions
from the last 3 weeks on specific topics of interest, such as AI/agents on Cloud Run.

Follow these steps:
1. **MEMORY CHECK**: Use `load_memory` to retrieve the user's **past areas of interest** and **preferred topics**. Calibrate your search to align with these interests.
2. Use the Reddit MCP tools to search for relevant subreddits and posts.
3. Filter results for posts created within the last 21 days (3 weeks).
4. Analyze "high-engagement" based on upvote counts and the number of comments.
5. Recommend the most important and relevant questions for a technical audience.
6. **CRITICAL**: For each recommended question, provide a direct link to the original thread and a concise summary of the discussion.
7. **CAPTURE PREFERENCES**: Actively listen for user preferences, interests, or project details. Explicitly acknowledge them to ensure they are captured in the session history for future personalization.
""",
    tools=[reddit_mcp, load_memory_tool.LoadMemoryTool()],
    after_agent_callback=save_session_to_memory_callback,
)

Specialist 2: GCP Expert (Grounding)

The GCP Expert is our "Technical Authority". It triangulates facts by synthesizing official documentation from the Google Cloud Developer Knowledge MCP Server, community sentiment from Reddit, and broader context from Google Search.

Paste this code in dev_signal_agent/agent.py

dk_mcp = get_dk_mcp_toolset(api_key=SECRETS.get("DK_API_KEY", ""))

search_agent = Agent(
    name="search_agent",
    model=shared_model,
    instruction="Execute Google Searches and return raw, structured results (Title, Link, Snippet).",
    tools=[google_search],
)

gcp_expert = Agent(
    name="gcp_expert",
    model=shared_model,
    instruction="""
You are a Google Cloud Platform (GCP) documentation expert.
Your goal is to provide accurate, detailed, and cited answers to technical questions by synthesizing official documentation with community insights.

For EVERY technical question, you MUST perform a comprehensive research sweep using ALL available tools:
1. **Official Docs (Grounding)**: Use DeveloperKnowledge MCP (`search_documents`) to find the definitive technical facts.
2. **Social Media Research (Reddit)**: Use the Reddit MCP to research the question on social media. This allows you to find real-world user discussions, common pain points, or alternative solutions that might not be in official documentation.
3. **Broader Context (Web/Social)**: Use the `search_agent` tool to find recent technical blogs, social media discussions, or tutorials.

Synthesize your answer:
- Start with the official answer based on GCP docs.
- Add "Social Media Insights" or "Common Issues" sections derived from Reddit and Web Search findings.
- **CRITICAL**: After providing your answer, you MUST use the `add_info_to_state` tool to save your full technical response under the key: `technical_research_findings`.
- Cite your sources specifically at the end of your response, providing **direct links** (URLs) to the official documentation, blog posts, and Reddit threads used.
- **CAPTURE PREFERENCES**: Actively listen for user preferences, interests, or project details. Explicitly acknowledge them to ensure they are captured in the session history for future personalization.
""",
    tools=[dk_mcp, AgentTool(search_agent), reddit_mcp, add_info_to_state],
    after_agent_callback=save_session_to_memory_callback,
)

Specialist 3: Blog Drafter (Creativity)

The Blog Drafter is our Content Creator. It drafts the blog based on the expert's findings and offers to generate visuals.

Memory Usage: It checks load_memory for the user's preferred writing style (e.g. "Witty", "Rap") stored in the Vertex AI memory bank.

Paste this code in dev_signal_agent/agent.py

nano_mcp = get_nano_banana_mcp_toolset()

blog_drafter = Agent(
    name="blog_drafter",
    model=shared_model,
    instruction="""
You are a professional technical blogger specializing in Google Cloud Platform.
Your goal is to draft high-quality blog posts based on technical research provided by the GDE expert and reliable documentation.

You have access to the research findings from the gcp_expert_agent here:
{{ technical_research_findings }}

Follow these steps:
1. **MEMORY CHECK**: Use `load_memory` to retrieve past blog posts, **areas of interest**, and user feedback on writing style. Adopt the user's preferred style and depth.
2. **REVIEW & GROUND**: Review the technical research findings provided above. **CRITICAL**: Use the `dk_mcp` (Developer Knowledge) tool to verify key facts, technical limitations, and API details. Ensure every claim in your blog is grounded in official documentation.
3. Draft a blog post that is engaging, accurate, and helpful for a technical audience.
4. Include code snippets or architectural diagrams if relevant.
5. Provide a "Resources" section with links to the official documentation used.
6. Ensure the tone is professional yet accessible, while adhering to any style preferences found in memory.
7. **VISUALS**: After presenting the drafted blog post, explicitly ask the user: "Would you like me to generate an infographic-style header image to illustrate these key points?" If they agree, use the `generate_image` tool (Nano Banana).
8. **CAPTURE PREFERENCES**: Actively listen for user preferences, interests, or project details. Explicitly acknowledge them to ensure they are captured in the session history for future personalization.
""",
    tools=[dk_mcp, load_memory_tool.LoadMemoryTool(), nano_mcp],
    after_agent_callback=save_session_to_memory_callback,
)

The Root Orchestrator

The root agent serves as the system's strategist, managing a team of specialist agents and orchestrating their actions based on the specific goals provided by the user. At the start of a conversation, the orchestrator retrieves memory to establish context by checking for the user's past areas of interest, preferred topics, or previous projects.

Paste this code in dev_signal_agent/agent.py

root_agent = Agent(
    name="root_orchestrator",
    model=shared_model,
    instruction="""
You are a technical content strategist. You manage three specialists:
1. reddit_scanner: Finds trending questions and high-engagement topics on Reddit.
2. gcp_expert: Provides technical answers based on official GCP documentation.
3. blog_drafter: Writes professional blog posts based on technical research.

Your responsibilities:
- **MEMORY CHECK**: At the start of a conversation, use `load_memory` to check if the user has specific **areas of interest**, preferred topics, or past projects. Tailor your suggestions accordingly.
- **CAPTURE PREFERENCES**: Actively listen for user preferences, interests, or project details. Explicitly acknowledge them to ensure they are captured in the session history for future personalization.
- If the user wants to find trending topics or questions from Reddit, delegate to reddit_scanner.
- If the user has a technical question or wants to research a specific theme, delegate to gcp_expert.
- **CRITICAL**: After the gcp_expert provides an answer, you MUST ask the user:
  "Would you like me to draft a technical blog post based on this answer?"
- If the user agrees or asks to write a blog, delegate to blog_drafter.
- Be proactive in helping the user navigate from discovery (Reddit) to research (Docs) to content creation (Blog).
""",
    tools=[load_memory_tool.LoadMemoryTool(), preload_memory_tool.PreloadMemoryTool()],
    after_agent_callback=save_session_to_memory_callback,
    sub_agents=[reddit_scanner, gcp_expert, blog_drafter]
)

app = App(root_agent=root_agent, name="dev_signal_agent")

Summary

In this part of our series, we built multi-agent architecture and implemented a robust, dual-layered memory system. We established a Root Orchestrator, managing three specialist agents: a Reddit Scanner for trend discovery, a GCP Expert for technical grounding, and a Blog Drafter for creative content creation.

By utilizing short-term state to pass information reliably between specialists and integrating the Vertex AI memory bank for long-term persistence, we've enabled the agent to learn from your feedback and remember specific writing styles across different conversations.

In Part 3, we will show you how to test the agent locally to verify these components on your workstation, before transitioning to a full production deployment on Google Cloud Run in Part 4. Can't wait for part 3? The full implementation is already available for you to explore on GitHub.

To learn more about the underlying technology, explore the Vertex AI Memory Bank overview or dive into the official ADK Documentation to see how to orchestrate complex multi-agent workflows.

Special thanks to Remigiusz Samborski for the helpful review and feedback on this article.

For more content like this, follow me on LinkedIn and X.

Building Capabilities for a Multi-Agent System with Google ADK, MCP, and Cloud Run

Shir Meir Lador — Thu, 07 May 2026 21:00:10 +0000

My team's mission is to accelerate the developer journey from writing code to running secure AI workloads on Google Cloud. To help developers succeed, we focus on identifying their most pressing questions and building demos that provide straightforward, easy-to-implement solutions.

Recently, I was struck with inspiration when the new Developer Knowledge MCP server was released. It led me to build Dev Signal—a multi-agent system designed with Google Agent Development Kit (ADK)—to identify technical questions from Reddit, research them using official documentation, and draft detailed technical blogs. Dev Signal also provides custom visuals using Nano Banana Pro. I even integrated a long-term memory layer so the agent remembers my specific preferences and blogging style.

By connecting my coding assistant, Gemini CLI, to the developer knowledge MCP server, I built and deployed this entire system to Google Cloud Run in just two days.

Whether you want to learn how to architect a complex multi-agent system with long term memory, leverage local and remote MCP servers for tool standardization, or write detailed Terraform scripts for secure Cloud Run deployment, I'll show you how!

If you'd rather dive straight into the code and explore it at your own pace, you can clone the repository here.

What you'll learn

In this four-part blog series, I'll walk you through the step-by-step process of how I brought this project to life. Each blog post captures the journey of building and deploying Dev Signal:

Part 1: Tools for building agent capabilities – You'll begin by setting up your project environment and equipping your agent with tools using the Model Context Protocol (MCP). You'll learn how to connect to Reddit for trend discovery, Google Cloud docs for technical grounding, and a custom Nano Banana Pro tool for image generation.
Part 2: The Multi-Agent Architecture with long term memory – You'll build the "brain" of the system by implementing a root orchestrator and a team of specialized agents. You'll also integrate the Vertex AI memory bank, enabling the agent to learn and persist your preferences across sessions.
Part 3: Testing the agent Locally – Before moving to the cloud, you'll synchronize the agent's components and verify its performance on your workstation. You'll use a dedicated test runner to simulate the full lifecycle of discovery, research, and multimodal creation, with a special focus on validating long-term memory persistence by connecting your local agent directly to the cloud-based Vertex AI memory bank.
Part 4: Deployment to Cloud Run and the Path to Production – Finally, you'll deploy your service on Google Cloud Run using Terraform for reproducible infrastructure. You'll also discuss the next steps required for a high quality secure production system.

Getting started with Dev Signal

Dev Signal is an intelligent monitoring agent designed to filter noise and create value. Dev Signal operates in the following ways:

Discovery: Scouts Reddit for high-engagement technical questions.
Grounding: Researches answers using official Google Cloud documentation to ensure accuracy.
Creation: Drafts professional technical blog posts based on its findings.
Multimodal Generation: Generates custom infographic headers for those posts.
Long-Term Memory: Uses Vertex AI memory bank to remember your feedback across different sessions.

Prerequisites

Before you begin, verify the following is installed:

Python 3.12+
uv (Python package manager): curl -LsSf https://astral.sh/uv/install.sh | sh
Google Cloud SDK (gcloud CLI) installed and authenticated.
Terraform (for infrastructure as code).
Node.js & npm (required for the Reddit MCP tool).

You will also need:

A Google Cloud Project with billing enabled.
APIs Enabled: Vertex AI, Cloud Run, Secret Manager, Artifact Registry.
Reddit API Credentials (Client ID, Secret) - You can get these from the Reddit Developer Portal.
Developer Knowledge API Key (for Google Cloud docs search) - Instructions on how to get it are here.

Project Setup

The Dev Signal system was built by first running the Agent Starter Pack, following the automated architect workflow described in the Agent Factory episode by Remigiusz Samborski and Vlad Kolesnikov. This foundation provided the project's modular directory structure, which is used to separate concerns between Agent Logic, Server Code, Utilities, and Tools.

The starter pack acts as a powerful starting point because it automates the creation of professional infrastructure, CI/CD pipelines, and observability tools in seconds. This allows you to focus entirely on the agent's unique intelligence while ensuring the underlying platform remains secure and scalable. By building on top of this generated boilerplate with AI assistance from Gemini CLI and Antigravity, the development process is highly accelerated.

The agent starter pack high level architecture:

1. Initialize the Project

Create a new directory for your project and initialize it. We'll use uv, which is an extremely fast Python package manager.

uv init dev-signal

2. Folder Structure

Our project will follow this structure. We will populate these files step-by-step.

dev-signal/
├── dev_signal_agent/
│   ├── __init__.py
│   ├── agent.py                # Agent logic & orchestration
│   ├── fast_api_app.py         # Application server & memory connection
│   ├── app_utils/              # Env Config
│   │   └── env.py
│   └── tools/                  # External capabilities
│       ├── __init__.py
│       ├── mcp_config.py       # Tool configuration (Reddit, Docs)
│       └── nano_banana_mcp/    # Custom local image generation tool
│           ├── __init__.py
│           ├── main.py
│           ├── nano_banana_pro.py
│           ├── media_models.py
│           ├── storage_utils.py
│           └── requirements.txt
├── deployment/
│   └── terraform/              # Infrastructure as Code
├── .env                        # Local secrets (API keys)
├── Makefile                    # Shortcuts for building/deploying
├── Dockerfile                  # Container definition
└── pyproject.toml              # Dependencies

3. Define Dependencies

Update your pyproject.toml with the necessary dependencies. We use google-adk for the agent framework and google-genai for the model interaction.

[project]
name = "dev-signal"
version = "0.1.0"
description = "A multi-agent system for monitoring and content creation."
readme = "README.md"
requires-python = ">=3.12, <3.14"
dependencies = [
    "google-adk>=0.1.0",
    "google-genai>=1.0.0",
    "mcp>=1.0.0",
    "python-dotenv>=1.0.0",
    "fastapi>=0.110.0",
    "uvicorn>=0.29.0",
    "google-cloud-logging>=3.0.0",
    "google-cloud-aiplatform>=1.38.0",
    "fastmcp>=2.13.0",
    "google-cloud-storage>=3.6.0",
    "google-auth>=2.0.0",
    "google-cloud-secret-manager>=2.26.0",
]

Run uv sync to install everything.

Create a new directory for the agent code.

mkdir dev_signal_agent
cd dev_signal_agent

Building the agent capabilities: MCP tools

Our agent needs to interact with the outside world. We use the Model Context Protocol (MCP) to standardize this. The Model Context Protocol (MCP) is a universal standard for connecting AI agents to external data and tools. Instead of writing custom API wrappers, we use standard MCP servers. This allows us to connect to APIs (Reddit), Knowledge Bases (Google Cloud Docs), and even local scripts (Image Generation using Nano Banana Pro) using a common interface. Create a new directory for the agent tools.

mkdir tools
cd tools

Tools Configuration

We'll define our toolsets in dev_signal_agent/tools/mcp_config.py.

This file defines the connection parameters for our three main tools.

Reddit: Connected via a local stdio subprocess.
Developer Knowledge: Connected via a remote HTTP endpoint.
Nano Banana: Connected via a local stdio subprocess (our custom Python script).

Reddit Search (Discovery Tool)

The Reddit MCP server acts as a bridge to the Reddit API, allowing your agent to discover trending posts and analyze engagement without you having to write complex API wrappers. To ensure portability, the code uses a "find or fetch" strategy: it first checks for a local installation and, if missing, automatically uses npx to download and run the server on demand.

Instead of a network connection, the agent launches the server as a local subprocess and communicates via standard input and output (stdio). Within the Google ADK, the McpToolset class acts as a universal wrapper that standardizes these connections, enabling your agent to interact with various tools, from community resources to custom scripts like the Nano Banana image generator, using a common interface. By securely passing API credentials through environment variables, the system ensures these "plug-and-play" modules function as a seamless bridge between the AI and external platforms.

Paste this code in dev_signal_agent/tools/mcp_config.py:

import os
import shutil
from mcp import StdioServerParameters
from google.adk.tools import McpToolset
from google.adk.tools.mcp_tool import StreamableHTTPConnectionParams, StdioConnectionParams

def get_reddit_mcp_toolset(client_id: str = "", client_secret: str = "", user_agent: str = ""):
    """
    Connects to the Reddit MCP server.
    This server runs as a local subprocess (stdio) and proxies requests to the Reddit API.
    """
    # Check if 'reddit-mcp' is installed globally, otherwise use npx to run it
    cmd = "reddit-mcp" if shutil.which("reddit-mcp") else "npx"
    args = [] if shutil.which("reddit-mcp") else ["-y", "--quiet", "reddit-mcp"]

    # Inject secrets into the environment of the subprocess only
    env = {
        **os.environ,
        "DOTENV_CONFIG_SILENT": "true",
        "LANG": "en_US.UTF-8"
    }
    if client_id: env["REDDIT_CLIENT_ID"] = client_id
    if client_secret: env["REDDIT_CLIENT_SECRET"] = client_secret
    if user_agent: env["REDDIT_USER_AGENT"] = user_agent

    return McpToolset(
        connection_params=StdioConnectionParams(
            server_params=StdioServerParameters(
                command=cmd,
                args=args,
                env=env  # Pass injected secrets directly to the subprocess
            ),
            timeout=120.0
        )
    )

Google Cloud Docs (Knowledge Tool)

The Developer Knowledge MCP server provides grounding for your agent by allowing it to search the entire corpus of official Google Cloud documentation. Unlike the local Reddit server, this is a managed service hosted by Google and accessed as a remote endpoint over the internet. It exposes specialized tools like google_developer_documentation_search for semantic queries and google_developer_documentation_fetch to retrieve full markdown content, ensuring that every technical claim the agent makes is supported by definitive, up-to-date facts.

Note: You can also connect your coding assistant tools such as Gemini CLI or Antigravity to the developer knowledge MCP server to empower them with handy up to date Google Cloud documentation. I used it when writing this blog!

To connect, the agent uses the McpToolset class with StreamableHTTPConnectionParams, pointing to a web URL instead of launching a local process. It securely authenticates using a DK_API_KEY (create your api key) passed in the request headers, allowing the agent to perform a "comprehensive research sweep" across official docs, community sentiment, and broader web context through a single standardized interface.

Paste this code in dev_signal_agent/tools/mcp_config.py:

def get_dk_mcp_toolset(api_key: str = ""):
    """
    Connects to Developer Knowledge (Google Cloud Docs).
    This is a remote MCP server accessed via HTTP.
    """
    headers = {}
    if api_key:
        headers["X-Goog-Api-Key"] = api_key
    else:
        # Fallback to os.environ for local testing if not passed via API
        headers["X-Goog-Api-Key"] = os.getenv("DK_API_KEY", "")

    return McpToolset(
        connection_params=StreamableHTTPConnectionParams(
            url="https://developerknowledge.googleapis.com/mcp",
            headers=headers
        )
    )

The Image Generator (Nano Banana MCP)

While we've used external MCP servers for Reddit and documentation, we can also build our own custom MCP server to wrap specific Python logic. In this case, we are creating an image generation tool powered by Gemini 3 Pro Image (also known as Nano Banana Pro). This demonstrates that any Python function can be standardized into a tool that any agent can understand.

How the image generation works:

FastMCP: We use the fastmcp library to drastically simplify server creation, allowing us to register Python functions as tools with just a few lines of code.
Gemini Integration: The server uses the Google GenAI SDK to call the gemini-3-pro-image-preview model, which converts the agent's descriptive prompts into raw image bytes.
GCS Upload & Hosting: Because agent interfaces typically require a URL to display images, the server automatically uploads the generated bytes to Google Cloud Storage (GCS) and returns a public link.

To connect this local tool, we use StdioConnectionParams because the server runs as a local subprocess communicating via standard input and output. This transport method directly matches the transport="stdio" configuration we will define in our server entrypoint, ensuring a seamless connection for your custom local scripts.

The following code defines the MCP connection in dev_signal_agent/tools/mcp_config.py. We use uv run to ensure the server starts in an isolated environment with all its dependencies correctly installed.

Paste this code in dev_signal_agent/tools/mcp_config.py:

def get_nano_banana_mcp_toolset():
    """
    Connects to our local 'Nano Banana' image generator.
    This demonstrates how to wrap a local Python script as an MCP tool.
    """
    path = os.path.join("dev_signal_agent", "tools", "nano_banana_mcp", "main.py")
    bucket = os.getenv("AI_ASSETS_BUCKET")

    return McpToolset(
        connection_params=StdioConnectionParams(
            server_params=StdioServerParameters(
                command="uv",
                args=["run", path],
                env={**os.environ, "AI_ASSETS_BUCKET": bucket}
            ),
            timeout=600.0  # Image generation can take time
        )
    )

Implementing the Nano Banana Pro Server Logic

Now, we will implement the actual logic for this server. This implementation is based on the Agent Factory demo code by Remigiusz Samborski. While Remi's original code provides instructions for deploying the MCP server to Cloud Run, we will run it here as a local subprocess for faster development and testing.

To get started, create the directory for our new server:

mkdir -p dev_signal_agent/tools/nano_banana_mcp
cd dev_signal_agent/tools/nano_banana_mcp

The Server Entrypoint (`main.py`)

This file acts as the "brain" that initializes and starts the MCP server.

FastMCP Initialization: We use the FastMCP library to create a server named "MediaGenerators" and register our generate_image function as a tool.
Safe Logging: The _initialize_console_logging function is critical. It forces all logs to sys.stderr. This is because the MCP "stdio" transport uses sys.stdout for communication between the agent and the tool; standard logs sent to stdout would corrupt that protocol.
Execution: The mcp.run(transport="stdio") line starts the server as a local subprocess, allowing it to listen for requests from your agent via standard input.

Paste this code in dev_signal_agent/tools/nano_banana_mcp/main.py:

import logging
import os
import sys
from fastmcp import FastMCP
from dotenv import load_dotenv
from nano_banana_pro import generate_image

def _initialize_console_logging(min_level: int = logging.INFO):
    # Ensure logs go to STDERR so they don't break the MCP stdio protocol
    handler = logging.StreamHandler(sys.stderr)
    logging.basicConfig(level=min_level, handlers=[handler], force=True)

tools = [generate_image]
mcp = FastMCP(name="MediaGenerators", tools=tools)

if __name__ == "__main__":
    load_dotenv()
    _initialize_console_logging()
    mcp.run(transport="stdio")

The Generation Logic (`nano_banana_pro.py`)

This is where the actual image generation happens using Gemini.

GenAI Client: We initialize the genai.Client() to interact with Google's generative models.
Model Selection: It specifically targets the gemini-3-pro-image-preview model. We set the response_modalities to "IMAGE" to tell the model we want pixels, not just text.
Robustness: The code includes a MAX_RETRIES loop (set to 5) to handle any transient generation errors, ensuring the agent has multiple attempts to get a valid image.
Byte Processing: Once the model generates the image, it arrives as raw inline data. We extract these bytes and call our helper to move them to the cloud.
URI Conversion: Finally, it replaces the internal gs:// path with a browser-accessible https:// URL so the user can actually see the image.

Paste this code in dev_signal_agent/tools/nano_banana_mcp/nano_banana_pro.py:

import logging
from typing import Literal, Optional
from google import genai
from google.genai import types
from media_models import MediaAsset
from storage_utils import upload_data_to_gcs

AUTHORIZED_URI = "https://storage.mtls.cloud.google.com/"
MAX_RETRIES = 5

async def generate_image(
    prompt: str,
    aspect_ratio: Literal["16:9", "9:16"] = "16:9",
) -> MediaAsset:
    """Generates an image using Gemini 3 Image model."""
    genai_client = genai.Client()
    content = types.Content(parts=[types.Part.from_text(text=prompt)], role="user")
    logging.info(f"Starting image generation for prompt: {prompt[:50]}...")

    asset = MediaAsset(uri="")
    for _ in range(MAX_RETRIES):
        response = genai_client.models.generate_content(
            model="gemini-3-pro-image-preview",
            contents=[content],
            config=types.GenerateContentConfig(
                response_modalities=["IMAGE"],
                image_config=types.ImageConfig(aspect_ratio=aspect_ratio)
            )
        )
        if response and response.parts:
            for part in response.parts:
                if part.inline_data and part.inline_data.data:
                    # Upload the raw bytes to GCS
                    gcs_uri = await upload_data_to_gcs(
                        "mcp-tools",
                        part.inline_data.data,
                        part.inline_data.mime_type
                    )
                    asset = MediaAsset(uri=gcs_uri)
                    break
        if asset.uri: break

    if not asset.uri:
        asset.error = "No image was generated."
    else:
        # Convert gs:// URI to an HTTP accessible URL if needed
        asset.uri = asset.uri.replace('gs://', AUTHORIZED_URI)
    logging.info(f"Image URL: {asset.uri}")
    return asset

GCS Upload Helper (`storage_utils.py`)

Since agents need a web link to display images, this utility handles the hosting on Google Cloud Storage (GCS).

Dynamic Bucket Selection: It looks for a bucket name in your environment variables, falling back from AI_ASSETS_BUCKET to LOGS_BUCKET_NAME to ensure it always has a place to save data.
Unique Filenames: We use an MD5 hash of the raw image data to create a unique filename. This prevents filename collisions and acts as a simple way to avoid duplicate uploads of the same image.
Cloud Upload: The blob.upload_from_string method pushes the raw image bytes directly to your GCS bucket.

Paste this code in dev_signal_agent/tools/nano_banana_mcp/storage_utils.py:

import hashlib
import mimetypes
import os
from google.cloud.storage import Client, Blob
from dotenv import load_dotenv

load_dotenv()

storage_client = Client()
ai_bucket_name = os.environ.get("AI_ASSETS_BUCKET") or os.environ.get("LOGS_BUCKET_NAME")
ai_bucket = storage_client.bucket(ai_bucket_name)

async def upload_data_to_gcs(agent_id: str, data: bytes, mime_type: str) -> str:
    file_name = hashlib.md5(data).hexdigest()
    ext = mimetypes.guess_extension(mime_type) or ""
    blob_name = f"assets/{agent_id}/{file_name}{ext}"
    blob = Blob(bucket=ai_bucket, name=blob_name)
    blob.upload_from_string(data, content_type=mime_type, client=storage_client)
    return f"gs://{ai_bucket_name}/{blob_name}"

Data Model (`media_models.py`)

This file ensures that our data follows a strict structure (Schema).

Structured Output: By using a Pydantic BaseModel, we guarantee that the tool always returns a consistent JSON object containing a uri (the link) and an optional error message. This makes it much easier for the AI agent to understand and process the tool's result.

Paste this code in dev_signal_agent/tools/nano_banana_mcp/media_models.py:

from typing import Optional
from pydantic import BaseModel

class MediaAsset(BaseModel):
    uri: str
    error: Optional[str] = None

Tool Dependencies (`requirements.txt`)

While we use uv to run our code, a requirements.txt file remains essential because it defines the specific dependencies uv needs to install for the Nano Banana server to function. This provides the necessary "ingredients" to set up the isolated environment before the server starts.

This file lists the three core libraries required for this tool:

google-cloud-storage: Used for hosting the generated images on the cloud.
google-genai: Provides the logic for the Gemini 3 Pro image generation.
fastmcp: The framework that turns our Python script into a standardized MCP tool.

Paste this code in dev_signal_agent/tools/nano_banana_mcp/requirements.txt:

google-cloud-storage==3.6.*
google-genai==1.52.*
fastmcp==2.13.*

Summary

In this first part of our series, we focused on establishing the agent's core capabilities by standardizing its external integrations through the Model Context Protocol (MCP). We initialized the project using uv for high-speed dependency management and successfully configured three critical toolsets: Reddit for trend discovery, Google Cloud Docs for technical grounding, and a custom "Nano Banana" MCP server for multimodal image generation. By utilizing the Google ADK's McpToolset, we've abstracted away complex API logic into simple, plug-and-play modules, ensuring that our tools share a common interface that decouples integration from intelligence.

For a deeper look into our technical foundation, you can explore the Developer Knowledge MCP server to learn more about knowledge grounding or visit the Google ADK GitHub repository to explore the framework's core capabilities.

With our toolset fully configured and ready for action, we can now move to Part 2, where we will build the multi-agent architecture and integrate the Vertex AI memory bank to orchestrate these capabilities. You can also jump ahead to Part 3, where we will show you how to test the agent locally to verify these components on your workstation. If you’d like to dive ahead, you can explore the complete code for the entire series in our GitHub repository.

Special thanks to Remigiusz Samborski for the helpful review and feedback on this article.

For more content like this, follow Shir on LinkedIn and X.

Fine-Tuning Gemma 4 with Cloud Run Jobs: Serverless GPUs (NVIDIA RTX 6000 Pro) for pet breed classification 🐈🐕

Shir Meir Lador — Tue, 28 Apr 2026 19:54:21 +0000

Google has just announced the release of Gemma 4! This new generation of open models brings significant advancements, particularly in reasoning capabilities and architectural efficiency.

Bridging Reasoning and Precision with Gemma 4

In my previous blog, I demonstrated how to fine-tune Gemma 3 27B on Cloud Run Jobs using NVIDIA RTX PRO 6000 Blackwell Edition GPUs for pet breed classification. With the release of Gemma 4, I couldn't wait to update my pipeline and see how the new model performs.

In this follow-up post, I'll explain what makes Gemma 4 different, the benefits it brings, and exactly what file modifications and workarounds are needed to successfully fine-tune it using PEFT (LoRA) on Cloud Run. We'll cover everything from memory requirements and dynamic label masking to prompt structures for reasoning models. Whether you read the previous post or are new to this pipeline, this guide will provide a complete, working solution for Gemma 4.

If you'd rather dive straight into the code and explore it at your own pace, you can clone the repository here.

What's New in Gemma 4?

Gemma 4 introduces groundbreaking improvements over Gemma 3, making it Google's most intelligent open model family to date:

Apache 2.0 License: Gemma 4 is released under a commercially permissive Apache 2.0 license, providing full developer flexibility.
Highly Competitive Benchmarks: The 31B model ranks as the #3 open model on the Arena AI text leaderboard, while the 26B MoE model ranks #6, outcompeting models 20x their size!
Advanced Reasoning & Agents: Purpose-built for multi-step planning and deep logic. It features native support for function-calling, structured JSON output, and native system instructions.
Multimodal & Long Context: Natively processes images, video, and even audio (in edge models). It supports up to a 256K context window for larger models.
Versatile Architectures: Includes a 26B Mixture of Experts (MoE) model that only activates 3.8B parameters during inference for fast response times.

Because of these changes, simply dropping Gemma 4 into a Gemma 3 fine-tuning script won't work out of the box. Here is a breakdown of what needed to change in the codebase to make it work.

GPU Memory and Parameter Capacity

With the availability of NVIDIA RTX PRO 6000 GPUs on Cloud Run, we now have access to 96GB of VRAM. This is a game-changer for hosting and fine-tuning large models.

According to the formula discussed in my blog post on Decoding high-bandwidth memory: Total HBM ≈ (Model Size) + (Optimizer States) + (Gradients) + (Activations)

When using LoRA (Low-Rank Adaptation), we freeze the base model weights and only train a small subset of parameters. This means the memory-hungry gradients and optimizer states are negligible for the base model. For Gemma 4 31B loaded in 16-bit precision (bfloat16), the base model size is roughly 31 billion parameters × 2 bytes/parameter ≈ 62 GB. While this 62GB model fits comfortably within the 96GB of VRAM available on the RTX 6000 Pro, we can do even better!

By applying 4-bit quantization (QLoRA) via the bitsandbytes library, we dramatically shrink this base memory footprint to roughly 18–20GB. This leaves an enormous amount of VRAM overhead exclusively dedicated to the high-memory activations required by multi-modal processing and long-context training batches, unlocking unparalleled serverless efficiency!

Key Code Changes for Gemma 4 Migration

If you are updating your own script or starting fresh, these are the critical adjustments made to the pipeline:

1. Multimodal Input Ordering & Integrated Instructions

While Gemma 4 supports interleaved inputs and a native system role, we recommend providing the image data before the text as a stable convention and merging instructions into the user prompt for this pipeline. We found this 'single-turn' structure more effective for maintaining instruction-following precision and simplifying our custom masking logic.

In the code below, the {"type": "image"} entry acts as a placeholder that signals the processor to inject special image tokens into the chat template. The actual image tensors are then passed separately during the data collation step to ensure the multimodal architecture is adapted correctly.

full_user_content = f"{prompt}\n\nIdentify the breed of the animal in this image."
messages = [
  {
    "role": "user",
    "content": [
      {"type": "image"},  # Image must come first!
      {"type": "text", "text": full_user_content},
    ],
  },
  {
    "role": "assistant",
    "content": [{"type": "text", "text": example["caption"]}]
  }
]

2. Loading the Correct Multimodal Architecture

Gemma 4 natively processes images, video, and even audio (in the E2B and E4B models), which changes how the model must be loaded. To correctly handle these diverse inputs, we explicitly use the AutoModelForMultimodalLM class. While AutoModelForImageTextToText remains a valid option for purely image-based tasks, the multimodal class is the more precise choice for the Gemma 4 architecture, ensuring it is ready to handle video and audio data natively.

from transformers import AutoModelForMultimodalLM
model = AutoModelForMultimodalLM.from_pretrained(model_id, **model_kwargs)

3. Label Masking for Multimodal Data

In Gemma 3, we could hardcode specific token IDs to find where the assistant's response started to mask the prompt. For Gemma 4, we initially tried tokenizing the text prompt separately to find its length, but hit a major snag.

Gemma 4 is highly efficient with media: each image gets a dynamic number of soft tokens exactly fitted to its content. While these image soft tokens are highly stable and pre-computable (their count does not change whether the image is alone or accompanied by text), standard tokenizers can still introduce slight boundary quirks when concatenating text and control tokens after these media tokens. If you tokenize the prompt in isolation, the length might be slightly off compared to the fully assembled chat template, tanking the model's accuracy.

To achieve the highest precision, we implemented a bulletproof backward-search collator. Instead of trying to calculate the prompt length, we search the full _input_ids_ array for the exact tokens of our breed name label. Once found, we step backwards to locate the <|turn> control token that marks the start of the assistant's response, and mask everything before it. This mathematically guarantees the model is trained exactly on the required template structure and the label, without any masking misalignment.

4. Bypassing Custom Layers & Unlocking the Vision Tower

This was the most critical breakthrough! The official Hugging Face implementation for Gemma 4 uses a custom neural network wrapper called Gemma4ClippableLinear for its projection layers. This custom class wraps a standard nn.Linear layer but adds specific logic to clip minimum and maximum activations (input_min, output_max, etc.) to stabilize training.

When we tried to apply standard LoRA by targeting specific layer names like q_proj or v_proj, we hit two major issues:

Activation Clipping Bypass: Standard PEFT/LoRA doesn't natively recognize Gemma4ClippableLinear. If forced to attach to the inner .linear weights, it bypasses the parent wrapper entirely. Without that crucial activation clipping during the forward pass, the model's activations become unstable, and the training loss explodes.
Frozen Vision Tower: Even if we fixed the text backbone, standard text-focused LoRA configurations often miss the vision tower's projection layers, leaving the model's "eyes" frozen during training.

The solution is to use the macro target_modules="all-linear". This tells the PEFT library to recursively scan the entire model tree. It safely identifies and wraps nested linear layers without breaking the outer Gemma4ClippableLinear clipping logic. Crucially, it also ensures that every linear layer across both the language model and the vision tower is adapted to your data, without sacrificing architectural stability.

5. Results

By combining the multimodal architecture, bulletproof masking, and full-tower LoRA, we achieved a nice improvement in the model accuracy.

Note that Gemma 4 baseline performance (89% accuracy) was significantly higher than Gemma 3 Baseline performance (67% accuracy) so in this case the accuracy improvement is more modest, but still significant.

Intermediate Results (700 Samples, ~50 minutes Run)

Even with a small subset of 700 training images, we saw a nice boost over the baseline in less than one hour:

Results on 700 training samples and 200 evaluation samples

Final Results (Full Dataset, ~4.25 Hours Run)

Running the full Oxford-IIIT Pet dataset (~4,000 training images and 3,669 evaluation images) yielded our peak performance (STOA for this dataset is 94% accuracy):

Results on 4000 training samples and 3669 evaluation samples

In this run, we utilized a more aggressive LoRA configuration than typical text-only runs: a Rank 64 / Alpha 64 setup with a 5e-5 learning rate. This gave the model enough "surface area" to refine its visual features for the specific nuances of the pet dataset.

6. Managing VRAM with QLoRA & Gradient Checkpointing

While 96GB of VRAM on the RTX 6000 Pro is massive, training a 31B parameter model with LoRA still pushes the boundaries of a single GPU. To ensure absolute stability and prevent Out-Of-Memory (OOM) errors during the backward pass, our script implements a two-pronged optimization strategy:

QLoRA (4-bit Quantization): Utilizing BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4") to drastically reduce the model's footprint when loaded on CUDA.
Gradient Checkpointing: Specifically enabled for the 31B model, this trades a slight increase in compute time for a significant reduction in VRAM usage by recalculating activations instead of storing them all in memory.

The Complete Fine-Tuning Workflow on Cloud Run

Before you begin the fine-tuning process, ensure you have the following software and environment configurations in place.

Prerequisites

Google Cloud Project with billing enabled and APIs active (Cloud Run, Artifact Registry, Cloud Build, Secret Manager).
NVIDIA RTX PRO 6000 availability in your region (e.g., europe-west4).
Hugging Face Token: A valid token with access to the Gemma 4 model weights.

Step 0: Set Environment Variables

Set the following environment variables to align with the steps below:

export PROJECT_ID=[YOUR_PROJECT_ID]
export REGION=europe-west4
export HF_TOKEN=[YOUR_HF_TOKEN]
export SERVICE_ACCOUNT="finetune-gemma-job-sa"
export BUCKET_NAME=$PROJECT_ID-gemma4-finetuning-eu
export AR_REPO=gemma4-finetuning-repo
export SECRET_ID=HF_TOKEN
export IMAGE_NAME=gemma4-finetune
export JOB_NAME=gemma4-finetuning-job

Step 1: Get the Code

Whether you're running locally or on the cloud, you'll need the code. Clone the repository and navigate to the project directory:

git clone https://github.com/GoogleCloudPlatform/devrel-demos
cd devrel-demos/ai-ml/finetune_gemma/

Step 2: Test Locally Before Cloud Deployment

Before spinning up massive GPUs in the cloud, it is always a best practice to verify your pipeline locally using a smaller model variant (like the 2B IT model) on a subset of the data.

To run a local CPU test, first activate your virtual environment:

source .venv/bin/activate

Then, execute the script with a very small dataset to ensure the pipeline completes successfully:

python3 finetune_and_evaluate.py \
  --model-id google/gemma-4-e2b-it \
  --device cpu \
  --train-size 20 \
  --eval-size 20 \
  --gradient-accumulation-steps 4 \
  --num-epochs 1

Once you verify that the training pipeline completes successfully, you are ready to scale up to Cloud Run!

Step 3: Stage the Model in GCS

To save startup time and avoid repetitive downloads from the internet during training, stage the model weights (e.g., google/gemma-4-31b-it) in a GCS bucket located in the same region as your Cloud Run job. We provide a utility script within the repository to perform this transfer directly:

# Navigate to the utility directory
cd hf-to-gcs
# Execute the transfer script
python3 hf_to_gcs.py \
  --model-id google/gemma-4-31b-it \
  --bucket $BUCKET_NAME \
  --hf-token $HF_TOKEN

This script ensures that the weights are stored in your project's bucket, enabling high-speed access via volume mounts when the Cloud Run job executes.

Step 4: Build the Container

Use Cloud Build to package your script and dependencies into a container image compatible with CUDA 12.8:

gcloud builds submit --tag $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME:latest .

[!TIP] You can track the real-time progress of your build in the Cloud Build console.

Step 5: Create and Execute the Cloud Run Job

Create the job with GPU support and volume mounts for the GCS bucket holding the model:

gcloud beta run jobs create gemma4-finetuning-job \
  --region $REGION \
  --image gcr.io/$PROJECT_ID/gemma4-finetune \
  --gpu 1 \
  --gpu-type nvidia-rtx-pro-6000 \
  --cpu 30.0 \
  --memory 120Gi \
  --labels dev-tutorial=finetune-gemma \
  --add-volume name=model-volume,type=cloud-storage,bucket=$BUCKET_NAME \
  --add-volume-mount volume=model-volume,mount-path=/mnt/gcs \
  --args="--model-id","/mnt/gcs/google/gemma-4-31b-it/","--output-dir","/mnt/gcs/gemma4-finetuned","--train-size","700","--eval-size","200","--merge"

Then execute it:

gcloud beta run jobs execute gemma4-finetuning-job --region $REGION --async

Conclusion

Migrating to Gemma 4 requires handling its new architecture and response formats, but the effort pays off with its superior reasoning and adherence to instructions. By leveraging Cloud Run Jobs and Serverless Blackwell GPUs, you can train these massive models efficiently without managing servers.

To get started with inference, explore this codelab: Run inference of Gemma 4 model on Cloud Run with RTX 6000 Pro GPU with vLLM.

To learn more about production serving, refer to the Cloud Run Gemma 4 documentation.

Happy fine-tuning! 🎉

Special thanks to Ryan Mullins, Juyeong Ji and Gus Martins from the Gemma 4 team for the helpful review and feedback on this blog.

Fine-Tuning Gemma 3 with Cloud Run Jobs: Serverless GPUs (NVIDIA RTX 6000 Pro) for pet breed classification 🐈🐕

Shir Meir Lador — Thu, 09 Apr 2026 13:07:00 +0000

Architectural worklow: fine tuning Gemma 3 27B on Cloud Run Jobs

Recently, I was inspired by a major new release on Google Cloud: the availability of NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs on Cloud Run Jobs. This launch is important because it unlocks the ability to tackle fine-tuning workloads for open models with the simplicity of a serverless batch job. To put this new hardware to the test in a fun way, I fine tuned a multi-modal model to identify a pet’s breed from a photo using The Oxford-IIIT Pet Dataset. This model could be used for a “Smart pet care” — an AI application that identifies a pet’s breed from a photo and provides tailored health and nutrition advice.

Image taken from The Oxford-IIIT Pet Dataset and showcase the images of cats and dogs and their corresponding breed — the classification label

Why Fine-Tuning?

In a recent Agent Factory episode, we discussed that while foundational models are a powerful ‘one-size-fits-all’ starting point, they essentially remain generalists. You should consider fine-tuning when you have a problem that requires high specialization that a generalist model might not excel in on its own, or when you need more control and cost-efficiency over your own hosting.

For this pet-care use case, distinguishing between 37 different breeds isn’t just about ‘knowledge’, it’s about taking that foundational reasoning and adding a specific capability based on a unique dataset. As we explored in the episode and as mentioned in this Nvidia paper, this kind of specialization is what allows smaller, focused models to become sufficiently powerful and economical for production agentic systems. Fine-tuning acts as the necessary bridge, transforming a broad reasoner into a high-precision classification expert.

Bridging Reasoning and Precision

For this project, I chose the multimodal breadth of Gemma 3 27B. While specialized vision models often provide superior accuracy for narrow identification tasks, I wanted to use a model capable of both identifying breeds and reasoning about the specific health and dietary needs associated with them. By leveraging the power of the new Blackwell GPUs, I was able to fine-tune this model to bridge the performance gap, all while keeping the setup reproducible, cost-effective, and entirely container-native.

From Batch to Production: Economically Efficient Hosting

The true ‘deploy and forget’ magic happens after the weights are saved. With high-performance inference now supported on Cloud Run, you can host your fine-tuned Gemma 3 27B model on the same NVIDIA RTX PRO 6000 Blackwell GPU without managing any underlying infrastructure. This setup delivers a highly economical production environment: Cloud Run automatically scales your GPU instances to zero when they aren’t in use, ensuring you only pay for the exact minutes your model is active.

In this guide, I’m excited to show you how this new hardware release transforms complex fine-tuning into a scalable, serverless experience without the need to manage complex clusters or maintain idle instances.

Simplifying 27B Fine-Tuning on Cloud Run

Fine-tuning an open model can seem like a daunting task that requires complex orchestration, from provisioning high-capacity VMs and manually installing CUDA drivers to managing tedious data transfers and scaling down manually to control costs. Cloud Run Jobs elegantly solves this by allowing you to package your training logic as a container, now backed by the fully managed environment of NVIDIA RTX PRO 6000 Blackwell GPUs and their 96GB of VRAM.

This setup delivers on-demand availability without the need for reservations, rapid 5-second startup times with drivers pre-installed, and automatic scale-to-zero efficiency that ensures you only pay for the minutes your model is training. By leveraging built-in GCS volume mounting for high-speed access to model weights, we can now move past infrastructure hurdles and focus on the core task: fine-tuning Gemma 3 27B to achieve high-precision results for Pet Breed Classification on the Oxford-IIIT Pet Dataset.

If you’d like to dive straight into the code, you can clone the repository here.

Prerequisites

Before you begin the fine-tuning process, ensure you have the following software and environment configurations in place.

Python 3.12+
uv (Python package manager): will be used to manage our local Python environment and speed up our Docker builds. Use curl to download the script and execute it with sh:

curl -LsSf https://astral.sh/uv/install.sh | sh

Google Cloud SDK (gcloud CLI) installed and authenticated.
A Google Cloud Project with billing enabled.
APIs Enabled Ensure the following APIs are active in your project: Cloud Run Admin API, Artifact Registry API, Cloud Build API, Secret Manager API, Compute Engine API (for GPU provisioning)
Hugging Face Token: A valid token with access to the Gemma 3 27B-IT model weights.

Access to gated models: Gemma 3 27B-IT is a gated model, which means you must explicitly accept the terms of use before you can download or fine-tune the weights.

Accept the License: Visit the Gemma 3 27B-IT model page on Hugging Face and click the “Agree and access repository” button.
Generate a Token: Once access is granted, ensure your Hugging Face Token has “read” permissions (or “write” if you plan to push your fine-tuned model back to the Hub) to authenticate your training job.

Step 1 — Setting the stage: Your environment

Step 1.1 — Prepare your Google Cloud environment

Set environment variables.

[!IMPORTANT] Regional Alignment is Critical: To use Cloud Storage volume mounting, your GCS bucket must be in the same region as your Cloud Run job. We recommend using europe-west4 (Netherlands) as it supports the RTX PRO 6000 Blackwell GPU and ensures zero-latency access to your model weights.

export PROJECT_ID=YOUR_PROJECT_ID
export REGION=europe-west4
export HF_TOKEN=YOUR_HF_TOKEN
export SERVICE_ACCOUNT="finetune-gemma-job-sa"
export BUCKET_NAME=$PROJECT_ID-gemma3-finetuning-eu
export AR_REPO=gemma3-finetuning-repo
export SECRET_ID=HF_TOKEN
export IMAGE_NAME=gemma3-finetune
export JOB_NAME=gemma3-finetuning-job

Step 1.2 — Get the code

Whether you’re running locally or on the cloud, you’ll need the code. After you open Cloud Shell or install your local Google Cloud CLI, you need to clone the repository. The finetune_gemma repository contains the finetune_and_evaluate.py script, a Dockerfile, and the requirements.txt file to your machine.

git clone https://github.com/GoogleCloudPlatform/devrel-demos
cd devrel-demos/ai-ml/finetune_gemma/

gcloud auth login

Set your Project:

gcloud config set project $PROJECT_ID

Create the service account and grant storage permissions:

gcloud iam service-accounts create $SERVICE_ACCOUNT \
  --display-name="Service Account for Gemma 3 fine-tuning"

gcloud storage buckets create gs://$BUCKET_NAME --location=$REGION

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_NAME \
  --member=serviceAccount:$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com \
  --role=roles/storage.objectAdmin

Create an Artifact Registry repository and store your HF Token in Secret Manager:

gcloud artifacts repositories create $AR_REPO \
    --repository-format=docker \
    --location=$REGION \
    --description="Gemma 3 finetuning repository"

# Create the secret (ignore error if it already exists)
gcloud secrets create $SECRET_ID --replication-policy="automatic" || true

printf $HF_TOKEN | gcloud secrets versions add $SECRET_ID --data-file=-

gcloud secrets add-iam-policy-binding $SECRET_ID \
  --member serviceAccount:$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com \
  --role='roles/secretmanager.secretAccessor'

Step 2 — Staging the Model with cr-infer (Recommended)

To avoid downloading the model every time the job runs, we’ll stage the Gemma 3 27B weights in Google Cloud Storage. We’ll use cr-infer, which allows you to run model transfers directly via uvx without needing a local installation.

Before running the transfer, you must set up your Application Default Credentials. This is required for running scripts locally. In this case it allows the cr-infer tool to use your local identity to write the weights to your GCS bucket.

gcloud auth application-default login

Download Gemma 3 27B to GCS: Now, execute the transfer using uvx. This clones the model into gs://$BUCKET_NAME/google/gemma-3–27b-it/, allowing our Cloud Run job to mount the weights as a local volume and save gigabytes of container startup time

uvx — from git+https://github.com/oded996/cr-infer.git cr-infer model download \- source huggingface \
 - model-id google/gemma-3–27b-it \
 - bucket $BUCKET_NAME \
 - token $HF_TOKEN

Step 3 — Build and push the container image

Our Dockerfile leverages uv for fast dependency installation.

Option A: Use Google Cloud Build (Recommended — No local Docker needed)

This is the easiest way to build your image directly in the cloud and push it to Artifact Registry. (The build typically takes 10–15 minutes as it downloads large ML dependencies like PyTorch).

gcloud builds submit — tag $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME:latest .

[!TIP] You can track the real-time progress of your build in the Cloud Build console.

Option B: Build locally with Docker

If you have Docker Desktop installed locally:

Install uv locally (if you haven’t already):

curl -LsSf https://astral.sh/uv/install.sh | sh

Build the image:

docker build -t $IMAGE_NAME .

Push to AR:

docker tag $IMAGE_NAME $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME
docker push $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME

Step 3.1 — Test locally (Optional)

I like to start with a quick local test run to validate the setup. It serves as a sanity check for your environment and scripts before moving the workload to Cloud Run. For this test, we use parameters optimized for speed and a smaller model, google/gemma-3–4b-it, to ensure the model correctly learns the task format:

python3 finetune_and_evaluate.py \
- model-id google/gemma-3–4b-it \
 - train-size 20 \
 - eval-size 20 \
 - gradient-accumulation-steps 2 \
 - learning-rate 2e-4 \
 - batch-size 1 \
 - num-epochs 3

On my Apple M4 Pro, running this on the CPU took about 20–30 minutes. If you want to see early signs of progress locally, you can increase the sample size — I found that a one-hour run on my Mac with 50 training and testing samples already yielded a 4% improvement in accuracy and a 3% boost in F1-score.

Results from a local run on my Mac with 50 train and 50 test samples

Inside the Fine-Tuning Script: How it Works

The finetune_and_evaluate.py script is designed to be a complete, self-contained pipeline, handling everything from data preparation to hardware-aware optimization and evaluation. Here is a look at the core logic that makes this possible:

1. Memory-Efficient Model Loading

To fit a 27B parameter model into the 96GB VRAM of the Blackwell GPU, the script uses 4-bit quantization via the bitsandbytes library. By setting low_cpu_mem_usage=True, it also ensures the model is loaded efficiently without exhausting the system RAM.

2. Vision-Language LoRA Configuration

Instead of updating all 27 billion parameters, we use LoRA (Low-Rank Adaptation). We target all the primary projection layers in the transformer blocks, allowing the model to adapt its internal representations to the visual nuances of the pet breeds while keeping the total trainable parameter count extremely low. More details on efficient GPU memory usage can be found in this blog.

3. The Custom Data Collator

This is a crucial part for fine-tuning vision-language models (VLMs). Because VLMs process a mix of image and text tokens, the data_collator ensures that the model only learns from the breed label (the model’s response). The turn marker is a structural boundary that signals the exact point where the user stops speaking and the model’s response begins. The script ensures the model learns only from the breed label by searching for the model’s turn marker in the token sequence and masking out the user’s prompt and image tokens, so they don’t contribute to the training loss.

4. Breed Extraction

Generative models often add conversational filler (e.g., “The animal in this image is a Samoyed”). Our evaluation logic includes a robust extraction heuristic that sorts class names by length. This ensures that if the model mentions “English Cocker Spaniel,” it correctly identifies the full breed rather than just matching “Cocker Spaniel”.

5. Automated GCS Archiving

Once the training completes and the final evaluation is calculated, the script doesn’t just stop. It bundles the fine-tuned LoRA adapters with the original model processor and automatically uploads the entire directory to your Google Cloud Storage bucket. This ensures your model is immediately ready for deployment or serving.

Step 4 — Create and execute the Cloud Run job

Now, we harness the power of the NVIDIA RTX PRO 6000 Blackwell GPU. Our container is built with CUDA 12.8 for full Blackwell/PyTorch 2.7 compatibility and uses an ENTRYPOINT configuration, allowing you to pass script arguments directly via the — args flag.

[!TIP] If the job already exists, use gcloud beta run jobs update instead of create.

gcloud beta run jobs create $JOB_NAME \
 - region $REGION \
 - image $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME:latest \
 - set-env-vars BUCKET_NAME=$BUCKET_NAME \
 - set-secrets HF_TOKEN=$SECRET_ID:latest \
 - no-gpu-zonal-redundancy \
 - cpu 20.0 \
 - memory 80Gi \
 - task-timeout 60m \
 - gpu 1 \
 - gpu-type nvidia-rtx-pro-6000 \
 - service-account $SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com \
 - add-volume name=model-volume,type=cloud-storage,bucket=$BUCKET_NAME \
 - add-volume-mount volume=model-volume,mount-path=/mnt/gcs \
 - network=default \
 - subnet=default \
 - vpc-egress=private-ranges-only \
 - args=" - model-id","/mnt/gcs/google/gemma-3–27b-it/"," - output-dir","/tmp/gemma3-finetuned"," - gcs-output-path","gs://$BUCKET_NAME/gemma3-finetuned"," - train-size","800"," - eval-size","200"," - learning-rate","5e-5"

Note on Execution Limits: Tasks using GPUs on Cloud Run Jobs currently have a maximum execution time of 60 minutes. To ensure this training job completes within the standard public limit, we have set the — num_epochs to 3 and restricted the — train-size to 800 samples. If your specific fine-tuning workload requires more time, you can sample your training dataset into segments that fit in under 60 minutes (like 800 samples in our case) and process them as a sequence of independent tasks while using checkpointing for the model training.

Understanding the Deployment Flags

To ensure a stable and production-ready environment, we use several specialized flags:

— gpu-type nvidia-rtx-pro-6000: Targets the NVIDIA RTX PRO 6000 Blackwell GPU. With 96GB of GPU memory (VRAM), 1.6 TB/s bandwidth, and support for FP4/FP6 precision, it provides the ample overhead and high-speed throughput needed for multimodal fine-tuning.
— memory 80Gi: We allocate high system RAM (scalable up to 176GB) to handle the low_cpu_mem_usage model loading and our memory-efficient streaming data generator.
— cpu 20.0: Cloud Run Jobs allows scaling up to 44 vCPUs per instance, ensuring that preprocessing and data loading never become a bottleneck for the GPU.
— add-volume & — add-volume-mount: This mounts your GCS bucket as a local directory at /mnt/gcs. Note: This requires the bucket and the job to be in the same region (europe-west4). It allows the script to read the base model weights at data-center speeds without copying them into the container’s writable layer.
— network & — subnet: Configures Direct VPC Egress, allowing the job to communicate securely with other resources in your VPC. To make sure this works you need to enable “Private Google Access”.
— vpc-egress=all-traffic: Ensures all outgoing traffic, including requests to Hugging Face, is routed through your VPC for enhanced security and monitoring.

[!TIP] If you skipped Step 2 and didn’t stage the model in your GCS bucket, you must change the — model-id in the — args to google/gemma-3–27b-it. This tells the script to download the weights directly from Hugging Face at runtime, though this will be significantly slower than using the GCS mount

Execute the job:

gcloud beta run jobs execute $JOB_NAME — region $REGION — async

Step 5 — Check Results and Evaluate Performance

Once your job finishes, you can jump into the Google Cloud Console to inspect the detailed logs. You’ll find your newly fine-tuned model waiting for you in your Cloud Storage bucket at gs://$BUCKET_NAME/gemma3-finetuned.

To rigorously quantify how well Gemma 3 learned to identify these breeds, we used Accuracy and Macro F1 Score as our primary metrics. While accuracy gives us a clear overall percentage, the F1 score ensures the model is accurate across all 37 breeds, not just the most common ones.

In my testing, I saw a clear progression as we scaled our data and compute:

Results with different sample size

79% Accuracy, 77% F1-score (1.1h run): Trained on 1,000 samples and evaluated against 200 test samples, this was a significant jump from the zero-shot baseline of 66%.
93% Accuracy, 91% F1-score (2.3h run): By scaling up to 2,500 training samples (and 1,500 test samples), the model reached nearly state-of-the-art performance.
94% Accuracy & 91.5% F1 (3.3h run): With a larger run on 3,600 training samples (evaluated against 3,500 test samples), the model effectively hit the state-of-the-art benchmark for this dataset.

Performance summary report for 3600 train samples and 3500 test sample — reached state of the art with 94% accuracy!

It is important to note that the standard public limit for GPU jobs is currently 60 minutes. As mentioned in step 4, sampling and checkpointing can help overcome this limitation.

These results prove that fine-tuning is the necessary bridge for generalist models, by leveraging serverless Blackwell GPUs, we’ve transformed a massive reasoner into a high-precision expert ready for production

Next Steps: Serving your fine-tuned model on Cloud Run

Now that you’ve fine-tuned Gemma 3, the next challenge is serving it efficiently for production-grade inference.

The true “deploy and forget” magic happens when you transition your saved weights into a serving environment. By hosting your fine-tuned model on Cloud Run with serverless Blackwell GPUs, you get a highly economical production environment where your GPU instances automatically scale to zero when they aren’t in use. This setup eliminates the operational toil of cluster management and manual maintenance, allowing you to serve massive models with no reservations, you only pay for the exact minutes your model is active.

To get started with inference, explore this codelab: Run inference using a Gemma model on Cloud Run with RTX 6000 Pro GPU.

To learn more about production serving, refer to the official guide on Running Gemma 3 on Cloud Run. The documentation provides a comprehensive roadmap for building a robust inference service, including:

Optimized Deployment: Instructions for serving Gemma models using GPU accelerators and loading model weights via high-speed Cloud Storage volume mounts.
Secure Interaction: Guidance on using IAM authentication to securely call your deployed service with the Google Gen AI SDK.
Performance Configuration: Best practices for setting concurrency to achieve optimal request latency and high GPU utilization

Special thanks to Sara Ford and Oded Shahar from the Cloud Run team for the helpful review and feedback on this article.

Agent Factory Recap: Supercharging Agents on GKE with Agent Sandbox and Pod Snapshots

Shir Meir Lador — Tue, 07 Apr 2026 13:04:00 +0000

In the latest episode of the Agent Factory, Mofi Rahman and I had the pleasure of hosting, Brandon Royal, the PM working on agentic workloads on GKE. We dove deep into the critical questions around the nuances of choosing the right agent runtime, the power of GKE for agents, and the essential security measures needed for intelligent agents to run code.

This post guides you through the key ideas from our conversation. Use it to quickly recap topics or dive deeper into specific segments with links and timestamps.

Why GKE for Agents?

Timestamp: 01:49

We kicked off our discussion by tackling a fundamental question: why choose GKE as your agent runtime when serverless options like Cloud Run or fully managed solutions like Agent Engine exist?

Brandon explained that the decision often boils down to control versus convenience. While serverless options are perfectly adequate for basic agents, the flexibility and governance capabilities of Kubernetes and GKE become indispensable in high-scale scenarios involving hundreds or thousands of agents. GKE truly shines when you need granular control over your agent deployments.

ADK on GKE

Timestamp: 06:58

We've discussed the Agent Development Kit (ADK) in previous episodes, and Mofi highlighted to us how seamlessly it integrates with GKE and even showed a demo with the agent he built. ADK provides the framework for building the agent's logic, traces, and tools, while GKE provides the robust hosting environment. You can containerize your ADK agent, push it to Google Artifact Registry, and deploy it to GKE in minutes, transforming a local prototype into a globally accessible service.

The Sandbox problem

Timestamp: 15:20

As agents become more sophisticated and capable of writing and executing code, a critical security concern emerges: the risk of untrusted, LLM-generated code. Brandon emphasized that while code execution is vital for high-performance agents and deterministic behavior, it also introduces significant risks in multi-tenant systems. This led us to the concept of a "sandbox."

What is a Sandbox?

Timestamp: 19:18

For those less familiar with security engineering, Brandon clarified that a sandbox provides kernel and network isolation. Mofi further elaborated, explaining that agents often need to execute scripts (e.g., Python for data analysis). Without a sandbox, a hallucinating or prompt-injected model could potentially delete databases or steal secrets if allowed to run code directly on the main server. A sandbox creates a safe, isolated environment where such code can run without harming other systems.

Agent Sandbox on GKE Demo

Timestamp: 20:25

So, how do we build this "high fence" on Kubernetes? Brandon introduced the Agent Sandbox on Kubernetes, which leverages technologies like gVisor, an application kernel sandbox. When an agent needs to execute code, GKE dynamically provisions a completely isolated pod. This pod operates with its own kernel, network, and file system, effectively trapping any malicious code within the gVisor bubble.

Mofi walked us through a compelling demo of the Agent Sandbox in action.We observed an ADK agent being given a task requiring code execution. As the agent initiated code execution, GKE dynamically provisioned a new pod, visibly labeled as "sandbox-executor," demonstrating the real-time isolation. Brandon highlighted that this pod is configured with strict network policies, further enhancing security.

The Future: Pod Snapshots

Timestamp: 29:39

While the Agent Sandbox offers incredible security, the latency of spinning up a new pod for every task is a concern. Mofi demoed the game-changing solution: Pod Snapshots. This technology allows us to save their state of running sandboxes and then near-instantly restore them when an agent needs them. Brandon noted that this reduces startup times from minutes to seconds, revolutionizing real-time agentic workflows on GKE.

Conclusion

It's incredible to see how GKE isn't just hosting agents; it's actively protecting them and making them faster.

Your turn to build

Ready to put these concepts into practice? Dive into the full episode to see the demos in action and explore how GKE can supercharge your agentic workloads.

Learn how to deploy an ADK agent to Google Kubernetes Engine and how to get your run agent to run code safely using the GKE agent Sandbox.

Connect with us

Shir Meir Lador → LinkedIn, X
Mofi Rahman → LinkedIn
Brandon Royal → LinkedIn

Agent Factory Recap: Reinforcement Learning and Fine-Tuning on TPUs

Shir Meir Lador — Tue, 31 Mar 2026 18:56:42 +0000

In our agent factory holiday special, Don McCasland and I were joined by Kyle Meggs, Senior Product Manager on the TPU Training Team at Google, to dive deep into the world of model fine tuning. We focused specifically on reinforcement learning (RL), and how Google's own infrastructure of TPUs are designed to power these massive workloads at scale.

This post guides you through the key ideas from our conversation. Use it to quickly recap topics or dive deeper into specific segments with links and timestamps.

When to Consider Fine-Tuning

Timestamp: 3:13

We started with a fundamental question: with foundational models like Gemini becoming so powerful out of the box, and customization through the prompt can often be good enough, when should you consider fine-tuning?

Fine tuning your own model is relevant when you need high specialization for unique datasets where a generalist model might not excel (such as in the medical domain), or when you have strict privacy restrictions that require hosting your own models trained on your data.

The Model Lifecycle: Pre-training and Post-training (SFT and RL)

Timestamp: 3:52

Kyle used a great analogy inspired by Andrej Karpathy to break down the stages of training. He described pre-training as "knowledge acquisition," similar to reading a chemistry textbook to learn how things work. Post-training is further split into Supervised Fine-Tuning (SFT), which is analogous to reading already-solved practice problems within the textbook chapter, and Reinforcement Learning (RL), which is like solving new practice problems without help and then checking your answers in the back of the book to measure yourself against an optimal approach and correct answers.

Why Reinforcement Learning (RL) is Essential

Timestamp: 5:50

We explored why RL is currently so important for building modern LLMs. Kyle explained that unlike SFT, which is about imitation, RL is about grading actions to drive "alignment." It’s crucial for teaching a model safety (penalizing what not to do), enabling the model to use tools like search and interact with the physical world through trial and error, and for performing verifiable tasks like math or coding by rewarding the entire chain of thought that leads to a correct answer.

The Agent Industry Pulse: Why 2025 is the year of RL

Timestamp: 8:33

In this segment, we looked at the rapidly evolving landscape of RL. Kyle noted that it is fair to call 2025 the "year of RL," highlighting the massive increase in investment and launches across the industry:

January: DeepSeek-R1 launched, making a huge splash with open-source GRPO.
Summer: xAI launched Grok 4, reportedly running a 200k GPU cluster for RL at "pre-training scale."
October: A slew of new tooling launches across Google, Meta, and TML.
November: Gemini 3 launched as a premier thinking model.
Recent: Google launched MaxText 2.0 for fine-tuning on TPUs.

The Hurdles of Implementing RL

Timestamp: 10:46

Following the industry trends, we discussed why RL is so difficult to implement. Kyle explained that RL combines the complexities of both training and inference into a single process. He outlined three primary challenges: managing infrastructure at the right balance and scale to avoid bottlenecks; choosing the right code, models, algorithms (like GRPO vs. DPO), and data; and finally, the difficulty of integrating disparate components for training, inference, orchestration, and weight synchronization.

To provide a solution across these dimensions of complexity, Google offers MaxText, a vertically integrated solution to help you perform RL in a highly scalable and performant fashion. MaxText provides highly optimized models, the latest post-training algorithms, high performance inference via LLM, and powerful scalability/flexibility via Pathways.

In contrast to DIY approaches where users assemble their own stack of disparate components from many different providers, Google’s approach offers a single integrated stack of co-designed components, from silicon to software to solutions.

The Factory Floor

The Factory Floor is our segment for getting hands-on. Here, we moved from high-level concepts to practical code with a live demo.

Why TPUs Shine for RL

Timestamp: 12:52

Before diving into the demo, Kyle explained why TPUs are uniquely suited for complex AI workloads like RL. Unlike other hardware, TPUs were designed system-first. A TPU Pod can connect up to 9,216 chips over low-latency interconnects, allowing for massive scale without relying on standard data center networks. This is a huge advantage for overcoming RL bottlenecks like weight synchronization. Furthermore, because they are purpose-built for AI, they offer superior price-performance and thermal efficiency.

Demo: Reinforcement Learning (GRPO) with TPU

Timestamp: 15:53

Don led a hands-on demonstration showing what RL looks like in action using Google's infrastructure. The demo showcased:

Using MaxText 2.0 as an integrated solution for the workload.
Leveraging models from MaxText and algorithms from Tunix.
Handling inference using vLLM.
Utilizing Pathways for orchestration and scaling to run GRPO (Group Relative Policy Optimization).

Conclusion

This holiday special was a great deep dive into the cutting edge of model fine tuning. While foundational models are getting better every day, the future of highly specialized, capable agents relies on mastering post-training techniques like RL, and having the right vertically integrated infrastructure, like TPUs, to run them efficiently.

Your turn to build

We hope this episode gave you valuable tools and perspectives to think about fine-tuning your own specialized agents. Be sure to check out the resources below to explore MaxText 2.0 and start experimenting with TPUs for your workloads. We'll see you next year for a revamped season of The Agent Factory!

Resources

Post-Training Docs https://maxtext.readthedocs.io/en/latest/tutorials/post_training_index.html

Google Cloud TPU (Ironwood) Documentation: https://docs.cloud.google.com/tpu/docs/tpu7x
Google Cloud open source code:
- MaxText - https://github.com/AI-Hypercomputer/maxtext
- GPU recipes - https://github.com/AI-Hypercomputer/gpu-recipes
- TPU recipes - https://github.com/AI-Hypercomputer/tpu-recipes
Andrej Karpathy - Chemistry Analogy: Deep Dive into LLMs like ChatGPT
Paper: "Small Language Models are the Future of Agentic AI" (Nvidia): https://arxiv.org/abs/2506.02153
Fine tuning blog: https://cloud.google.com/blog/topics/developers-practitioners/a-step-by-step-guide-to-fine-tuning-medgemma-for-breast-tumor-classification?e=48754805

Connect with us

Shir Meir Lador → https://www.linkedin.com/in/shirmeirlador/, X
Don McCasland → https://www.linkedin.com/in/donald-mccasland/
Kyle Meggs → https://www.linkedin.com/in/kyle-meggs/

My First Experience Creating Antigravity Skills

Shir Meir Lador — Fri, 20 Mar 2026 15:23:02 +0000

Experimenting with Agent skills for the first time, feeling empowered!

Last week, I was at an event where we taught developers how to build MCP servers, agents, and deploy open models to Google Cloud Run. After the session, one of the developers shared something that really stuck with me: he was already using our content to create specialized Skills to share with his entire team.

I got inspired and decided it was time to dive into Agent Skills. During my last project, the dev-signal agent, I had a lot of learnings about how to bring agents and AI applications to production in a robust and scalable manner. I thought, this is a great opportunity to give my favorite coding agent, Google’s Antigravity (Google’s “agent-first” IDE), those skills so that going forward, it will just do it for me!

In this post, I’ll walk through how I built the 13 production skills in this repository and the patterns behind them.

What are Agent Skills?

As Romin Irani explains in “Getting Started with Google Antigravity Skills”, skills represent a shift from monolithic context loading to Progressive Disclosure.

Agents get “overwhelmed” when providing them too many tools all at once (a phenomenon known as “Tool Bloat”), to solve for that, Skills allow the agent to “load” specialist knowledge only when needed. When you ask an agent to “evaluate a shadow revision,” it will figure out it will need to leverage the Shadow Deployer skill as context for this operation.

Workspace vs. Global Scope

In Antigravity, you can manage these skills in two distinct ways depending on how you want to use them:

Workspace Scope: Located in .agent/skills/ within your project root. These are specific to your project and can be committed to GitHub so your entire team can benefit from the same production patterns.
Global Scope: Located in ~/.gemini/antigravity/skills/. These are your personal utilities that stay with you across every project you work on.

How I built the skills

Following the principles in Daniela Petruzalek’s “Building Agent Skills with skill-creator”, I took a “methodology-first” approach. I used the existing dev-signal blog series I’ve been working on and the codebase itself as core context, asking Antigravity to identify and codify the unique skills needed to build a production agent on Google Cloud.

For some of the more specialized areas, I provided additional context with patterns I’d like to follow, such as the agent evaluation codelab and blog and the agent security codelab, both written by my awesome team.

These 13 skills provide Antigravity (or any developer using them) the crucial toolkit of a Google Cloud Production Engineer. I’m currently finalizing a detailed, step-by-step walkthrough of the dev-signal agent which will be published on the Google Cloud Blog very soon! (follow me for future updates)

In the meantime, you don’t have to wait — the full repository and the skills are available for you to explore and leverage in your own projects today.

Here is the full inventory of the skills:

🏗️ Production Agent

adk-memory-bank-initializer: Long-term state logic with Vertex AI Memory Bank.
agent-containerizer: Mixed-runtime Dockerfiles (Python + Node.js).
cloud-run-agent-architect: Least-privilege Terraform for Cloud Run.
gcp-production-secret-handler: In-memory secret fetching pattern (Secret Manager).
mcp-connector-generator: Standardized MCP connection logic.

📊 Evaluation

gcp-agent-eval-engine-runner: Parallel inference and reasoning trace capture.
gcp-agent-eval-metric-configurator: Setup for Grounding and Tool Use rubrics.
gcp-agent-golden-dataset-builder: Tools for building datasets with reference trajectories.
gcp-agent-shadow-deployer: “Dark Canary” deployment scripts with revision tagging.
gcp-agent-tool-trajectory-evaluator: Custom Python metrics for Precision and Recall.

🛡️ Security

gcp-agent-model-armor-shield: Intelligent firewall (Prompt Injection, RAI, Malicious URL filters).
gcp-agent-safety-gatekeeper: Python integration pattern (safety_util.py) for sanitizing user inputs.
gcp-agent-sdp-template-factory: Terraform for Sensitive Data Protection (PII/Secret redaction).

By codifying these patterns to production skills, Antigravity can now leverage these automatically in my day to day development. I hope you find these as helpful as I do!

Pro tip - self improving skills!

Because these skills were AI-generated, they might not work perfectly for your specific environment on the first try. But that’s actually the best part of working with an agentic IDE. If a skill doesn’t work well for you, don’t just manually fix the code, let the coding agent figure it out. Once it finds the solution, you can ask it to update the corresponding SKILL.md with the learned workflow. This will capture the corrected workflow for the future, ensuring the agent doesn’t repeat the mistake while saving you tokens and time on the next run. Think of these as living documents that actively improve as you build.

Ready to get started? Clone the repository and add these skills to your Workspace or Global Scope to start building your own production-ready agents. Learn more about Agent skills.

Follow me on LinkedIn and X for updates on my next blogs and videos.

How I Turned an Ugly Spreadsheet into an AI Assisted App with Antigravity

Shir Meir Lador — Wed, 18 Feb 2026 17:39:12 +0000

I have a confession to make.

Up until now, I wasn’t that much into “vibe coding.” I used AI all the time for Python coding, but I never really built a whole app from scratch in a language I knew nothing about.

That changed today. I encountered a really annoying problem: I had to review a massive amount of talk submissions for a conference. We’re talking about a massive spreadsheet. Staring at those tiny cells was literally making my eyes hurt.

My initial thought was, “Hey, let’s create a really sharp UI for the submission review.” But then I thought, why stop there? Why not let AI provide me valuable inputs from social media to help me with the review itself?

So, I decided to build TalkScout. And since I wanted to test drive Google Antigravity (Google’s new AI-powered coding agent), I figured this was the perfect opportunity.

Talkscout Dashboard (synthetic data)

Here is how I went from a painful CSV to a fully deployed Cloud Run app-without writing a single line of React code myself.

Step 1: The “Meta-Prompt” (Asking Gemini to Talk to Antigravity)

I didn’t start by coding; I started by chatting. I used meta-prompting to get started.

So, what is meta-prompting, you may ask? It’s actually when you go to Gemini 3 and ask it to write the prompt for the coding agent.

I explained my problem to Gemini 3 in simple words. Gemini 3 acted as my architect. It turned my “brain dump” requirements into a technical spec, defining the component structure and data model. I didn’t have to guess the right words, I just pasted that polished spec into Antigravity.

Step 2: Ditching the Spreadsheet for a Dashboard

With that prompt, Antigravity built the app of my dreams. It allowed me to:

Upload the CSV with all the conference talks.
Get a dashboard showing the status of each talk.
See a beautiful, high-contrast UI to review abstracts and demo plans without squinting at cells.

TalkScout submission review page with high contrast UI

The “Vibe” Fix: It wasn’t all smooth sailing — I actually hit a nasty React hydration error. This can take hours to debug, especially if you’re not a frontend developer… But I simply provided the error message to Antigravity and the coding agent pinpointed the mismatch in the DOM and fixed it in minutes.

Step 3: Integrating Grounded Intelligence

I didn’t just want a UI; I wanted to overcome my own bias. How do I know if a niche topic is actually hot?

I added a button to get an AI Assessment. But I didn’t want hallucinations. I used Google Search Grounding so the AI could search through Reddit, X (Twitter), and LinkedIn for real-world developer signals. That provided me inputs based on the current developer audience mindshare.

TalkScout submission review page with AI social media analysis

Step 4: Calibrating the “Strict” Reviewer

Initially, the AI was way too nice. It was giving high scores to anything with trendy keywords.

I used what’s called few-shot prompting to calibrate it. I gave examples of my scores vs. its scores and introduced what I call the “Marketing Fluff Penalty”.

If a submission reads like a documentation/marketing page? Points docked.
If the submission was way too short? We capped the score at a hard 2.
If it includes war stories and actual learnings — increase rating. After a few examples, it became more calibrated to my taste.

Step 5: The Pivot to Batch Mode

I realized it was taking me too long to ask the AI to evaluate each talk individually while I reviewed it.

So, I asked Antigravity to refactor the backend for Batch Mode. Now, TalkScout processes the entire submission pool in the background. By the time I grab a coffee, the “AI Draft” column is full of insights, allowing me to focus only on the final decisions.

Step 6: Sharing the Goodness (Deploy to Cloud Run)

TalkScout was working great for me, but I thought, “It would be great to share this with the other reviewers.”

This is where Antigravity really showed off. I simply asked it to deploy the app. It automatically recognized my Google Cloud Project ID, handled the containerization, generated the exact deployment commands, and deployed it to Cloud Run.

One simple ask, and minutes later, I had a URL to share with the team.

It Was Pretty Fun!

It was pretty fun to actually solve a real problem I had using Antigravity and vibe coding. I built a tool that handles ingestion, provides a distraction-free rating interface, and provides valuable inputs for my reviews.

I would love to hear from you all - have you recently solved a problem using vibe coding?

If you haven’t already - try playing around with Antigravity and easily deploy your apps to Cloud Run.

Decoding high-bandwidth memory: A practical guide to GPU memory for fine-tuning AI models

Shir Meir Lador — Thu, 15 Jan 2026 15:27:00 +0000

We've all been there. You've meticulously prepared your dataset and written your training script. You hit run, and your excitement builds, only to be crushed by the infamous error: CUDA out of memory.

This is one of the most common roadblocks in AI development. Your GPU's High Bandwidth Memory (HBM), is the high-speed memory that holds everything that's needed for computation, and running out of it is a hard stop. But how do you know how much you need?

To build a clear foundation, we'll start by breaking down the HBM consumers on a single GPU and we'll present key strategies to reduce HBM consumption on a single GPU. Later, we'll explore advanced multi-GPU strategies like data and model parallelism that can help relieve memory pressure and scale your training in the cloud.

Understanding HBM: What's using all the memory?

When you fine-tune a model, your HBM is primarily consumed by three things:

Model Weights: This is the most straightforward. It's the storage space required for the model's parameters—the "brain" that it uses to make predictions. A 7-billion parameter model loaded in 16-bit precision will take up roughly 14 GB before you even process a single piece of data.
Optimizer States and Gradients: This is the overhead that's required for learning. To update the model's weights, the training process needs to calculate gradients (the direction of learning) and the optimizer (like the popular AdamW) needs to store its own data to guide the training. In full fine-tuning, this can be the largest consumer of HBM.
Activations and Batch Data: This is the most dynamic part. When your data (images, text, etc.) flows through the model's layers, the intermediate calculations, or activations, are stored in HBM. The memory needed here is directly proportional to your batch size. A larger batch size means more activations are stored simultaneously, which leads to faster training but much higher memory usage.

Note: These calculations are theoretical minimums. Real-world frameworks add up to 30% overhead due to temporary buffers, kernel launches, and memory fragmentation.

Although it's impossible to get a perfect number without experimentation, you can estimate your HBM needs with this general formula:

Total HBM ≈ (Model Size) + (Optimizer States) + (Gradients) + (Activations)

Further reading: See this excellent JAX e-book that covers these topics in great detail and even has some "try it out yourself" test questions.

Example: Why full fine-tuning is so demanding

To see why running out of memory is such a common problem, let's walk through a real-world example that I recently worked on: fine-tuning the medgemma-4b-it model, which has 4 billion parameters. Our script loads it in bfloat16 precision (2 bytes per parameter).

First, let's calculate the static HBM footprint. This is the memory that's required just to load the model and prepare it for training, before you've even processed a single piece of data.

1. Model Size: The memory that's needed to simply hold the model on the GPU.

4 billion parameters × 2 bytes/parameter = 8 GB

2. Gradients and Optimizer States: The overhead for training every parameter with the AdamW optimizer.

Gradients: 4 billion parameters × 2 bytes/parameter = 8 GB Optimizer States (AdamW): 2 × 4 billion parameters × 2 bytes/parameter = 16 GB

Note: While AdamW is a popular optimizer, other optimizers, such as Adafactor and Lion, have different memory footprints.

Adding these together gives us our baseline HBM cost for a full fine-tuning attempt:

8 GB (Model) + 8 GB (Gradients) + 16 GB (Optimizer) = 32 GB

This 32 GB is the baseline just to start the training process. On top of this, the GPU needs additional memory for activations, which is a dynamic cost that grows with your batch size and input data size. This is why full fine-tuning of large models is so demanding and often reserved for the most powerful hardware.

Key strategies to reduce HBM consumption

The HBM requirement for a full fine-tune can seem impossibly high. But several powerful techniques can reduce memory consumption, making it feasible to train large models on consumer-grade or entry-level professional GPUs.

Parameter-Efficient Fine-Tuning (PEFT) with LoRA

Instead of training all the billions of parameters in a model, Parameter-Efficient Fine-Tuning (PEFT) methods focus on training only a small subset of parameters. The most popular of these is LoRA (Low-Rank Adaptation).

LoRA works by freezing the original model's weights and injecting a tiny number of new, trainable adapter layers into the model architecture. This means the memory-hungry gradients and optimizer states are only needed for these few million new parameters, not the full 4 billion.

The math behind LoRA's memory savings

LoRA doesn't remove the base model from your GPU. The full 8 GB of the original model's weights are still loaded and taking up HBM. They're just frozen, which means that the GPU isn't training them. All of the memory savings come from the fact that you no longer need to store the huge gradients and optimizer states for that massive, frozen part of the model.

Let's recalculate the static HBM footprint with LoRA, assuming it adds 20 million trainable parameters:

1. Model Size (unchanged): The base model is still loaded.

4 billion parameters × 2 bytes/parameter = 8 GB

2. LoRA Gradients & Optimizer States: We now only need overhead for the tiny set of new parameters.

Gradients: 20 million parameters × 2 bytes/parameter = 40 MB Optimizer States: 2 × 20 million parameters × 2 bytes/parameter = 80 MB

The new static HBM footprint is now:

8 GB (Model) + 40 MB (Gradients) + 80 MB (Optimizer) ≈ 8.12 GB

The training overhead has shrunk from 24 GB to just 120 MB. Your new baseline memory requirement is now just over 8 GB. This lower baseline memory requirement leaves much more room for the dynamic memory that's needed for activations, which lets you use a reasonable batch size on a common 16 GB or 24 GB GPU without running out of memory.

Model quantization

Besides training fewer parameters, we can also shrink the ones that we have by using quantization, which involves reducing the numerical precision of the model's weights. The standard precision for modern training is bfloat16 because it offers the dynamic range of float32 with half the memory footprint. But we can reduce HBM usage further by converting weights to lower-precision integer formats like int8 or int4.

Using lower-precision integer formats has a significant impact on HBM when compared to the standard bfloat16 baseline:

bfloat16 (standard): The baseline size (e.g., a 7B model requires ~14 GB).
8-bit precision: Halves the model size (e.g., 14 GB becomes ~7 GB).
4-bit precision: Reduces the model size by a factor of 4 (e.g., 14 GB becomes ~3.5 GB).

The reduction in size lets you fit much larger models into memory with minimal degradation in performance.

A word of warning from experience:
When I started experimenting in this area, my first attempt to load the model using the common float16 data type failed spectacularly. The model's outputs were NaN, and a quick check revealed that every internal value had collapsed into NaN (Not a Number) .

The culprit was a classic numerical overflow. The float16 data type has a tiny numerical range and it can't represent any number larger than 65,504. During training, intermediate values can easily exceed this limit, causing an overflow that creates a NaN. The fix was a simple one-line change to bfloat16, which has a massive numerical range that prevents these overflows and keeps training stable. For fine-tuning large models, always prefer bfloat16 for stability.

Combining LoRA and Quantization: These techniques work best together. Quantized LoRA (QLoRA) is a method that stores the massive base model in a highly efficient 4-bit format (specifically NF4 or NormalFloat 4), while adding small, trainable LoRA adapters in bfloat16. During the training process, the 4-bit weights are dequantized to bfloat16 for computation. Dequantizing in process lets you fine-tune very large models on a single GPU with the memory savings of 4-bit storage and the mathematical stability of 16-bit training.

FlashAttention: An algorithmic speed boost

Finally, FlashAttention is a foundational algorithmic optimization that significantly reduces HBM usage and speeds up training on both single and multi-GPU setups. The attention mechanism in transformers is a primary memory bottleneck because it requires storing a large, intermediate attention matrix. FlashAttention cleverly reorders the computation to avoid storing this full matrix in memory, leading to substantial memory savings and faster execution.

Best of all, enabling FlashAttention is often as simple as a one-line change. In the MedGemma fine-tuning script, this was done by setting the value attn_implementation="sdpa", which can automatically use more efficient backends like FlashAttention if the hardware supports it.

Scaling beyond a single GPU: Advanced strategies

Techniques like LoRA and quantization are useful for lowering HBM needs on a single GPU. But to train truly massive models or to really speed up the process, you'll eventually need to scale out to multiple GPUs. Here are some of the key strategies that can be used to distribute the load and overcome memory limitations.

Data parallelism

Data parallelism is the most common and intuitive approach to scaling. In a Distributed Data Parallel (DDP) setup, the entire model is replicated on each GPU. The key is that the global batch of training data is split, with each GPU processing its own mini-batch concurrently. After each forward and backward pass, the gradients from each GPU are averaged together to ensure that all of the model replicas learn from the entire dataset and they stay in sync. This method is excellent for speeding up training but it doesn't reduce the HBM that's required to hold the model itself, because every GPU needs a full copy.

Model parallelism

When a model is too large to fit into the memory of a single GPU, you must use model parallelism. Instead of replicating the model, this strategy splits the model across multiple GPUs. There are two primary ways to do this:

Tensor parallelism: This method splits a single large operation (like a massive weight matrix in a transformer layer) across several GPUs. Each GPU computes its part of the operation, and the results are combined.
Pipeline parallelism: This technique places different layers of the model onto different GPUs in a sequence. The data flows through the first set of layers on GPU 1, then the output is passed to GPU 2 for the next set of layers, and so on, like an assembly line.

These strategies are more complex to implement than data parallelism, but they're essential for models that are simply too big for one device.

Fully Sharded Data Parallelism (FSDP)

FSDP is a powerful and efficient hybrid strategy that combines the ideas of data parallelism and model parallelism. Unlike standard data parallelism where each GPU holds a full copy of the model, optimizer states, and gradients, FSDP shards (or splits) all of these components across the GPUs. Each GPU only materializes the full parameters for the specific layer that it's computing at that moment, dramatically reducing the peak HBM usage per device. FSDP makes it possible to train enormous models on a cluster of smaller GPUs.

By combining these hardware and software strategies, you can scale your fine-tuning jobs from a single GPU to a powerful, distributed cluster capable of handling even the most demanding AI models.

HBM sizing guide

HBM	Use case and explanation
16 GB	Sufficient for basic inference or fine-tuning with techniques like LoRA using a very small batch size (e.g., 1-2). Expect slower training times at this level.
24 GB	The recommended starting point for a good experience with 4-7 B parameter models. This capacity allows for a more effective batch size (e.g., 8-16) when using LoRA, providing a great balance of training speed and cost.
40+ GB	Necessary for maximizing training speed with large batch sizes or for working with larger models (in the 20+ B parameter range) now or in the future.

Encountering the CUDA out of memory error provides an important lesson in the trade-offs between model size, training techniques, and batch size. By understanding what consumes your HBM, you can make smarter decisions and keep your projects running smoothly.

I hope that this guide has helped clarify the CUDA out of memory error and that it's given you the tools to keep your projects running smoothly. When you're ready to take the next step, Google Cloud has the tools to accelerate your AI development.

Explore GPU configurations for your Cloud Run services and best practices for running Cloud Run jobs with GPU.
For maximum control: Spin up a Compute Engine instance with the latest NVIDIA H100 or A100 Tensor Core GPUs and take full control of your environment.
Looking to optimize your model hosting infrastructure? Take a look at The Ultimate Guide to xPU Inference Configuration.
For a deeper dive into scaling your model, check out How to Scale Your Model.
New to Google Cloud? Get started with the $300 free credit to find the perfect solution for your next project.

Special thanks to Jason Monden and Sayce Falk from the AI compute team for their helpful review and feedback on this post.

Forem: Shir Meir Lador

Agent Factory Recap: How Gemma 4 Taught Itself Physics

Gemma 4 - What is it?

Omar Sanseviero on how Gemma 4 changes the landscape for agent developers

The Factory Floor

Building a Local Food Tour Agent

Autonomous Python Code Execution

The Shift to Apache 2 Licensing

Developer Q&A

Architectural Decisions and Mixture of Experts (MoE)

Comparing Gemma to Gemini

Fine-Tuning for Specialized Industries

Conclusion

Your turn to build

Connect with us

Deploying a Multi-Agent System with Terraform and Cloud Run

Deployment to Cloud Run and the Path to Production

Production Utilities and Server: Building the System's Body

The Application Server

Implementing Telemetry

Monitoring vs. Targeted Evaluation

Viewing the Trace

Infrastructure as Code: Provisioning Secure Cloud Resources

Terraform Resources and Variables

Core Infrastructure Logic

Provision the Infrastructure

Deployment: Containerization and the Cloud Build Pipeline

Deploy Application

Verification: Accessing and Testing Your Deployed Agent

Granting User Permissions

Launch the Proxy

Summary

Local Testing of a Multi-Agent System with Memory

Testing the Agent Locally

Environment Setup

Helper Utilities

Environment Configuration

Local Testing Script

Running the Test

Test Scenario

Phase 1: Teaching & Multimodal Creation (Session 1)

Discovery

Research

Personalization

Image Generation

Phase 2: Long-Term Memory Recall (Session 2)

Summary

Architect A Personalized Multi-Agent System with Long-Term Memory

Infrastructure and Model Setup

Memory Ingestion Logic

Long-term Memory

How Managed Memory Works:

Short-term Memory

Specialist 1: Reddit Scanner (Discovery)

Specialist 2: GCP Expert (Grounding)

Specialist 3: Blog Drafter (Creativity)

The Root Orchestrator

Summary

Building Capabilities for a Multi-Agent System with Google ADK, MCP, and Cloud Run

What you'll learn

Getting started with Dev Signal

Prerequisites

Project Setup

1. Initialize the Project

2. Folder Structure

3. Define Dependencies

Building the agent capabilities: MCP tools

Tools Configuration

Reddit Search (Discovery Tool)

Google Cloud Docs (Knowledge Tool)

The Image Generator (Nano Banana MCP)

Implementing the Nano Banana Pro Server Logic

The Server Entrypoint (main.py)

The Generation Logic (nano_banana_pro.py)

GCS Upload Helper (storage_utils.py)

Data Model (media_models.py)

Tool Dependencies (requirements.txt)

Summary

Fine-Tuning Gemma 4 with Cloud Run Jobs: Serverless GPUs (NVIDIA RTX 6000 Pro) for pet breed classification 🐈🐕

Bridging Reasoning and Precision with Gemma 4

The Server Entrypoint (`main.py`)

The Generation Logic (`nano_banana_pro.py`)

GCS Upload Helper (`storage_utils.py`)

Data Model (`media_models.py`)

Tool Dependencies (`requirements.txt`)