<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Evan Lin</title>
    <description>The latest articles on Forem by Evan Lin (@evanlin).</description>
    <link>https://forem.com/evanlin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F409957%2Fc150d4a7-cb20-469d-a230-bac27232c577.jpeg</url>
      <title>Forem: Evan Lin</title>
      <link>https://forem.com/evanlin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/evanlin"/>
    <language>en</language>
    <item>
      <title>Gemini API File Search: Enhanced Multimodal Capabilities with Embedding 2, Including Open-Source LINE Bot Implementation</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Tue, 12 May 2026 04:17:48 +0000</pubDate>
      <link>https://forem.com/gde/gemini-api-file-search-enhanced-multimodal-capabilities-with-embedding-2-including-open-source-g72</link>
      <guid>https://forem.com/gde/gemini-api-file-search-enhanced-multimodal-capabilities-with-embedding-2-including-open-source-g72</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faeqghkoo1xi76898n5zl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faeqghkoo1xi76898n5zl.png" alt="image-20260511221639333" width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Image source: &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/expanded-gemini-api-file-search-multimodal-rag/" rel="noopener noreferrer"&gt;Google Blog - Gemini API File Search is now multimodal: build efficient, verifiable RAG&lt;/a&gt;)&lt;/p&gt;

&lt;h1&gt;
  
  
  Recap: RAG No Longer Means Assembling Legos
&lt;/h1&gt;

&lt;p&gt;In the past few years, whenever developers thought about RAG (Retrieval-Augmented Generation), the component list that came to mind probably looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A chunker (LangChain? Write it yourself?)&lt;/li&gt;
&lt;li&gt;An embedding model (OpenAI text-embedding-3? Cohere? BGE?)&lt;/li&gt;
&lt;li&gt;A vector database (ChromaDB, FAISS, pgvector, Pinecone… choosing one is a battle in itself)&lt;/li&gt;
&lt;li&gt;A retrieval + rerank process&lt;/li&gt;
&lt;li&gt;And then the LLM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not to mention that multimodal RAG adds another layer: How do you embed images? Do you need to OCR first? Do you need two separate stores, one for text and one for images? How do you score mixed text-and-image search? These questions alone can eat up a sprint.&lt;/p&gt;

&lt;p&gt;Recently, Google released &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/expanded-gemini-api-file-search-multimodal-rag/" rel="noopener noreferrer"&gt;Expanded Gemini API File Search for multimodal RAG&lt;/a&gt; on the developer blog, turning the long pipeline above into "&lt;strong&gt;calling a managed API&lt;/strong&gt;", with &lt;strong&gt;images natively supported&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This article will do two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Explain the new features clearly, including what &lt;strong&gt;Gemini Embedding 2&lt;/strong&gt; is doing behind the scenes.&lt;/li&gt;
&lt;li&gt; Use an &lt;strong&gt;open-source&lt;/strong&gt; LINE Bot (&lt;a href="https://github.com/kkdai/linebot-multimodal-rag" rel="noopener noreferrer"&gt;&lt;code&gt;kkdai/linebot-multimodal-rag&lt;/code&gt;&lt;/a&gt;) as a live demonstration to see how the new features are combined in actual production code — and share the two typical pitfalls I encountered during debugging to help everyone avoid them.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Three Major Highlights of the New Features
&lt;/h2&gt;

&lt;p&gt;According to the official blog, the core of this upgrade is three things:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Native Multimodal File Search
&lt;/h3&gt;

&lt;p&gt;In the past, File Search was pure text retrieval, and images could only be indexed by OCRing them into text.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“File Search now processes images and text together. Powered by the Gemini Embedding 2 model, the tool understands native image data.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now you can &lt;strong&gt;directly put images into the File Search Store&lt;/strong&gt;, and index them together with text. The engine behind it is &lt;strong&gt;Gemini Embedding 2&lt;/strong&gt; — text, images, videos, audio, and documents &lt;strong&gt;share the same vector space&lt;/strong&gt;, so you can "find text with images", "find images with text", or "find images with images" without having to align the spaces yourself.&lt;/p&gt;
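&lt;p&gt;Since every modality lands in the same vector space, cross-modal retrieval reduces to plain nearest-neighbor scoring. A minimal sketch (toy vectors, nothing model-specific) of the cosine similarity this kind of search effectively relies on:&lt;/p&gt;

```python
import math

def cosine(a, b):
    # One shared embedding space means cross-modal search is just
    # nearest-neighbor scoring, regardless of which modality produced a or b.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy stand-ins for a text embedding and an image embedding.
text_vec = [0.2, 0.9, 0.1]
image_vec = [0.25, 0.85, 0.05]
print(round(cosine(text_vec, image_vec), 3))  # → 0.996
```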

&lt;p&gt;For product builders, this means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Mixed text and image search is no longer a research topic&lt;/strong&gt;, it's an API call.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;No need to maintain two stores&lt;/strong&gt; (one for text chunks and one for CLIP-style image embeddings).&lt;/li&gt;
&lt;li&gt;  Scientific charts, UI screenshots, reports, photo albums... these &lt;strong&gt;things that used to lose most of their meaning after OCR&lt;/strong&gt; can now retain the original visual information for retrieval.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Custom Metadata and Server-side Filtering
&lt;/h3&gt;

&lt;p&gt;Each file you put into the store can now be tagged with key-value labels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"string_value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"U1234abcd..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"department"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"string_value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Legal"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"string_value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Final"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the &lt;a href="https://google.aip.dev/160" rel="noopener noreferrer"&gt;google.aip.dev/160&lt;/a&gt; filter syntax (same format as most GCP list APIs) when querying:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;metadata_filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'department="Legal" AND status="Final"'&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Filtering happens &lt;strong&gt;first, on Google's side&lt;/strong&gt;, rather than retrieving a pile of results and discarding most of them. With the noise reduced, &lt;strong&gt;both speed and accuracy improve&lt;/strong&gt;, which is a lifesaver for multi-tenant SaaS: one store with metadata filters can separate tenants, with no need to maintain N isolated stores.&lt;/p&gt;

&lt;p&gt;My LINE Bot uses this directly to do &lt;strong&gt;per-user data isolation&lt;/strong&gt;: each time a file is uploaded, it's tagged with the LINE &lt;code&gt;user_id&lt;/code&gt;, and when querying, a filter is applied, so user A will never see user B's data in the Q&amp;amp;A.&lt;/p&gt;
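&lt;p&gt;A minimal sketch of that isolation pattern; the helper name and the quote check are my own additions, but the filter string follows the AIP-160 syntax shown above:&lt;/p&gt;

```python
def user_filter(user_id: str) -> str:
    # Build an AIP-160 metadata filter scoping retrieval to one LINE user.
    # Hypothetical sanity check, since user_id lands inside a quoted literal.
    if '"' in user_id:
        raise ValueError("user_id must not contain double quotes")
    return f'user_id="{user_id}"'

# Hypothetical upload config: the same key is attached at indexing time,
# so the filter at query time can only ever match this user's files.
upload_config = {
    "display_name": "uploaded-photo.jpg",
    "custom_metadata": [{"key": "user_id", "string_value": "U1234abcd"}],
}
print(user_filter("U1234abcd"))  # → user_id="U1234abcd"
```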

&lt;h3&gt;
  
  
  3. Page-level Citations
&lt;/h3&gt;

&lt;p&gt;Each cited snippet in the response will now include the &lt;strong&gt;page number&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“captures the page number for every piece of indexed information.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is critical for enterprise customers. "AI says Y is mentioned on page X of the contract" vs. "AI says Y is mentioned in the contract": the former can be accepted directly by legal or audit teams, while the latter requires someone to leaf through the document manually. Page numbers close the last gap in making LLM answers traceable to their sources.&lt;/p&gt;
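&lt;p&gt;A small, hypothetical formatter for surfacing those page numbers to users; the field names here are stand-ins, so map them from whatever your SDK version's grounding chunks actually expose:&lt;/p&gt;

```python
def format_citation(chunk: dict) -> str:
    # Field names are illustrative stand-ins for the citation data
    # carried in a response's grounding metadata.
    page = chunk.get("page_number")
    title = chunk.get("title", "unknown source")
    if page is None:
        return title
    return f"{title}, p. {page}"

print(format_citation({"title": "contract.pdf", "page_number": 5}))  # → contract.pdf, p. 5
```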




&lt;h2&gt;
  
  
  The Multimodal Engine: Gemini Embedding 2
&lt;/h2&gt;

&lt;p&gt;The core of the new feature is the &lt;a href="https://deepmind.google/models/gemini/embedding/" rel="noopener noreferrer"&gt;Gemini Embedding 2&lt;/a&gt; model. Here are its specifications, for your model-selection decisions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6qi7ndky7i4xyvit5vs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6qi7ndky7i4xyvit5vs.png" alt="image-20260511221801984" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Specification&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Supported Input&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Text, images, videos, audio, documents&lt;/strong&gt; (same embedding space)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input token limit&lt;/td&gt;
&lt;td&gt;8,192 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output dimensions&lt;/td&gt;
&lt;td&gt;128–3,072 (using Matryoshka Representation Learning, smaller dimensions maintain similar accuracy)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multilingual support&lt;/td&gt;
&lt;td&gt;100+ languages&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Several key benchmarks (recall@1):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Text-to-Image Search&lt;/strong&gt;: TextCaps &lt;strong&gt;89.6&lt;/strong&gt; / Docci &lt;strong&gt;93.4&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Image-to-Text Search&lt;/strong&gt;: TextCaps &lt;strong&gt;97.4&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Multilingual (MTEB)&lt;/strong&gt;: mean &lt;strong&gt;69.9&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Video-Text Matching&lt;/strong&gt;: Vatex ndcg@10 &lt;strong&gt;68.8&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Speech-Text Retrieval&lt;/strong&gt;: MSEB mrr@10 &lt;strong&gt;73.9&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Several key observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Matryoshka is not a buzzword&lt;/strong&gt;: you can store vectors at 3,072 dimensions, then truncate to 768 at retrieval time for faster scoring with similar quality. Storage and scoring costs can be optimized in stages.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Cross-modal scores are very real&lt;/strong&gt;: 97.4% recall@1 (image→text) means that if you have an image and want to find the corresponding descriptive text, you'll find it almost immediately. This can be directly implemented for use cases like "take a picture of a product label and find the corresponding page of the user manual".&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;100+ languages&lt;/strong&gt;: This is a very real difference for the Taiwan/Japan/Korea/Southeast Asia markets.&lt;/li&gt;
&lt;/ul&gt;
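&lt;p&gt;A minimal sketch of the Matryoshka trick: keep only the leading dimensions and renormalize before cosine scoring. The vector here is a toy stand-in for a real 3,072-dimension embedding:&lt;/p&gt;

```python
import math

def truncate_embedding(vec, dims):
    # Matryoshka-trained embeddings keep most of their quality when you
    # keep only the leading dimensions; renormalize for cosine scoring.
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5, 0.1, 0.1]  # toy stand-in for a 3,072-dim vector
short = truncate_embedding(full, 4)
print(len(short), round(sum(x * x for x in short), 6))  # → 4 1.0
```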




&lt;h2&gt;
  
  
  What Developers Really Care About: Price and Access Cost
&lt;/h2&gt;

&lt;p&gt;From the official tutorial article &lt;a href="https://dev.to/googleai/multimodal-rag-with-the-gemini-api-file-search-tool-a-developer-guide-5878"&gt;Multimodal RAG with the Gemini API File Search tool: a developer guide&lt;/a&gt;, there are two lines that cost-sensitive developers should highlight:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Fully managed, with no vector database overhead.”&lt;/p&gt;

&lt;p&gt;“Storage and query-time embeddings are free. You only pay for indexing and tokens.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In plain English:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;You don't pay for the vector database&lt;/strong&gt;, nor do you pay for the monthly salary of the people maintaining it.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Storage is free&lt;/strong&gt;, and &lt;strong&gt;embedding calculations at query time are also free&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  You only have two things to pay for: &lt;strong&gt;the embedding fee for the initial indexing&lt;/strong&gt; and &lt;strong&gt;the LLM tokens consumed when generating the answer&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a friendly cost curve for personal side projects and early-stage startups: you don't have to decide on day one whether you can afford a vector DB's baseline cost.&lt;/p&gt;




&lt;h2&gt;
  
  
  Standard Workflow: A Complete RAG in 4 SDK Calls
&lt;/h2&gt;

&lt;p&gt;Distilled from the dev.to guide, here is the minimum viable workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Create a store (specify the multimodal embedding model)
&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_search_stores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;display_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-multimodal-rag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;models/gemini-embedding-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Upload files + custom metadata
&lt;/span&gt;&lt;span class="n"&gt;operation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_search_stores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_to_file_search_store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;file_search_store_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report-q1.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;display_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Q1 Report&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;custom_metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Finance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Upload is a long-running operation, needs to poll:
# operation = client.operations.get(operation)
&lt;/span&gt;
&lt;span class="c1"&gt;# 3. Feed file_search as a tool to generate_content
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3-flash-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What was the revenue growth rate in the first quarter of last year?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_search&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FileSearch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;file_search_store_names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;metadata_filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;department=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Finance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; AND year=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;))],&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 4. Get citations (including page numbers)
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;citation&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;grounding_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grounding_chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;citation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;web&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;citation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;web&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# or the corresponding file/page fields
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To return image citations to the user, you can also call &lt;code&gt;client.file_search_stores.download_media()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;No exaggeration: &lt;strong&gt;the entire multimodal RAG fits in under 30 lines of code&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Demo Case: Putting These New Features into a LINE Bot
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fax8dlyjqm7ty2z00fv10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fax8dlyjqm7ty2z00fv10.png" alt="image-20260511221916359" width="800" height="1734"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzg2z445gd33i5ianvxd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzg2z445gd33i5ianvxd.png" alt="image-20260511221851736" width="800" height="1734"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SDK examples alone feel abstract, so I built a working LINE Bot, open-sourced at &lt;a href="https://github.com/kkdai/linebot-multimodal-rag" rel="noopener noreferrer"&gt;&lt;code&gt;kkdai/linebot-multimodal-rag&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Users drop &lt;strong&gt;PDFs / images / text files&lt;/strong&gt; into the LINE chat box → Bot indexes into the File Search Store.&lt;/li&gt;
&lt;li&gt;  Users type questions → Gemini finds answers from the data &lt;strong&gt;uploaded by the user themselves&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  Users drop an image and ask a question → The same can be done for image-to-text retrieval.&lt;/li&gt;
&lt;li&gt;  Deployment target: GCP Cloud Run + Cloud Build automatic deployment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture is straightforward (key components):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LINE Webhook&lt;/td&gt;
&lt;td&gt;FastAPI receives message events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GCS&lt;/td&gt;
&lt;td&gt;Persists original files (&lt;code&gt;uploads/{user_id}/{message_id}.{ext}&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini File Search Store&lt;/td&gt;
&lt;td&gt;The only index layer (managed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom metadata &lt;code&gt;user_id&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Multi-tenant isolation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FastAPI BackgroundTasks&lt;/td&gt;
&lt;td&gt;Avoid the LINE reply token 30-second limit&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
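&lt;p&gt;The last row is worth a sketch: the webhook must acknowledge LINE quickly, so the slow indexing work runs off the request path. This stdlib-only sketch shows the same pattern the repo implements with FastAPI's BackgroundTasks (names and timings here are illustrative):&lt;/p&gt;

```python
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)
indexed = []

def index_file(message_id):
    # Stand-in for the slow path: fetch content from LINE, save to GCS,
    # upload to the File Search Store, and poll the long-running operation.
    time.sleep(0.05)
    indexed.append(message_id)

def handle_webhook(message_id):
    # Acknowledge immediately; the indexing runs off the request path,
    # the same idea FastAPI's BackgroundTasks gives you in a handler.
    executor.submit(index_file, message_id)
    return "OK"

print(handle_webhook("msg-001"))  # → OK
executor.shutdown(wait=True)      # demo only; a server keeps the pool alive
print(indexed)                    # → ['msg-001']
```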

&lt;p&gt;Comparing to the three major new features mentioned earlier:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Multimodal&lt;/strong&gt;: Users drop images, drop PDFs, all go into the same store, and all consume the same pipeline during search.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Custom metadata&lt;/strong&gt;: Files for each LINE user are tagged with &lt;code&gt;user_id&lt;/code&gt;, filtered during queries, achieving server-side forced isolation.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Page-level citations&lt;/strong&gt;: to later display "the answer comes from XX.pdf page 5" in LINE messages, just read &lt;code&gt;grounding_metadata&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The entire repo is about 600 lines of Python, and it delivers a "&lt;strong&gt;private multimodal knowledge-base chat bot of your own&lt;/strong&gt;".&lt;/p&gt;




&lt;h2&gt;
  
  
  Deployment in Practice: Commit → Auto-Deploy
&lt;/h2&gt;

&lt;p&gt;It's not enough for the open-source example to merely run; to demo it at a workshop, it needs to reach the level of "change code, push to GitHub, deploy automatically". This time I asked &lt;a href="https://docs.anthropic.com/en/docs/claude-code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; to be my co-pilot for wiring up CI/CD.&lt;/p&gt;

&lt;p&gt;I gave it a single sentence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Help me create a Cloud Build connection to GitHub, and trigger a build to deploy to Cloud Run after committing to main."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Claude Code first scanned &lt;code&gt;cloudbuild.yaml&lt;/code&gt;, the existing Cloud Run settings, Secret Manager, and Artifact Registry, listed the current problems, and then &lt;strong&gt;stopped to ask me the key decisions&lt;/strong&gt;: keep the existing service name or change the YAML? Does GitHub need authorization? After I answered, it created the missing resources in one go:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build Artifact Registry repo&lt;/span&gt;
gcloud artifacts repositories create linebot &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--repository-format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;docker &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;asia-east1

&lt;span class="c"&gt;# Secret migration: move from the current service to Secret Manager (via stdin, don't leave shell history)&lt;/span&gt;
gcloud run services describe linebot-gemini-file-search &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;asia-east1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'value(...)'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | gcloud secrets create LINE_CHANNEL_SECRET &lt;span class="nt"&gt;--data-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;-

&lt;span class="c"&gt;# Give Cloud Build / Compute SA the roles needed for deployment&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;role &lt;span class="k"&gt;in &lt;/span&gt;run.admin iam.serviceAccountUser artifactregistry.writer &lt;span class="se"&gt;\&lt;/span&gt;
            secretmanager.secretAccessor storage.objectAdmin logging.logWriter&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;gcloud projects add-iam-policy-binding your-cool-project-id &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:660825558664-compute@developer.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/&lt;/span&gt;&lt;span class="nv"&gt;$role&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--condition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;None
&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Build trigger&lt;/span&gt;
gcloud builds triggers create github &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;linebot-multimodal-rag-main &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--repo-owner&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;kkdai &lt;span class="nt"&gt;--repo-name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;linebot-multimodal-rag &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--branch-pattern&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"^main$"&lt;/span&gt; &lt;span class="nt"&gt;--build-config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cloudbuild.yaml

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The only thing that couldn't be automated was &lt;strong&gt;GitHub OAuth authorization&lt;/strong&gt;: Claude Code admitted outright that "this step can only be done by clicking in the Console" and provided the URL with step-by-step instructions. After a one-minute click-through, the trigger ran successfully.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pitfall Log: Two Traps Tied Directly to the New Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pitfall 1: An Outdated Hardcoded Model ID
&lt;/h3&gt;

&lt;p&gt;Both &lt;code&gt;cloudbuild.yaml&lt;/code&gt; and the code defaulted to &lt;code&gt;gemini-3.1-flash&lt;/code&gt;, but checking the &lt;a href="https://ai.google.dev/gemini-api/docs/models" rel="noopener noreferrer"&gt;Gemini API's current model ID list&lt;/a&gt; shows no such model exists. The correct ID for Gemini 3 Flash is &lt;code&gt;gemini-3-flash-preview&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this happened&lt;/strong&gt;: multimodal RAG is a very new feature; documents, tutorials, and examples are still appearing in large numbers, and the naming has shifted slightly along the way. It's easy for an early version of the repo to end up with an ID that looks plausible but doesn't actually exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Replace the ID throughout the repo with &lt;code&gt;gemini-3-flash-preview&lt;/code&gt;, and confirm the embedding model is &lt;code&gt;models/gemini-embedding-2&lt;/code&gt; (it was already correct, so no trap there). After pushing, Cloud Build triggered automatically, and a new revision was live within three minutes.&lt;/p&gt;
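&lt;p&gt;A cheap way to dodge this class of trap is to verify any hardcoded model ID against the live model list before deploying. A minimal sketch: the check itself is plain Python, and the commented lines assume the google-genai SDK's &lt;code&gt;client.models.list()&lt;/code&gt; for fetching the real list (treat that call as an assumption and confirm it against the SDK docs):&lt;/p&gt;

```python
# Sketch: guard against shipping a model ID that doesn't exist.
# Fetching the live list is assumed to use the google-genai SDK, e.g.:
#   from google import genai
#   available = [m.name for m in genai.Client().models.list()]
# The check below is plain Python and works on any list of model names.

def is_known_model(model_id: str, available: list) -> bool:
    """True if model_id appears in the model list (API names are
    usually prefixed with 'models/')."""
    return model_id in available or f"models/{model_id}" in available

# Illustrative list only; fetch the real one as sketched above.
available = ["models/gemini-3-flash-preview", "models/gemini-embedding-2"]
print(is_known_model("gemini-3.1-flash", available))       # False (the bad ID)
print(is_known_model("gemini-3-flash-preview", available)) # True
```

&lt;p&gt;Running this check in CI (or at service startup) turns a silent 404 at request time into a loud failure at deploy time.&lt;/p&gt;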

&lt;h3&gt;
  
  
  Pitfall 2: Mysterious "Upload has already been terminated"
&lt;/h3&gt;

&lt;p&gt;This trap sits squarely on the &lt;strong&gt;image upload&lt;/strong&gt; path newly supported by File Search Store — it's also the one most worth sharing, because it shows how euphemistic the error messages of a new API can be.&lt;/p&gt;

&lt;p&gt;I sent a JPG from LINE to the Bot, tapped "store in database", and got:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ Failed to store: 400 Bad Request. {'message': 'Upload has already been terminated.', 'status': 'Bad Request'}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The message gave no clue to the cause, and Cloud Logging showed only the same error with no stack trace. After searching the &lt;a href="https://discuss.ai.google.dev/" rel="noopener noreferrer"&gt;Google AI Developers Forum&lt;/a&gt;, I found similar reports for several file types (.md / .xlsx / large CSVs).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real culprit&lt;/strong&gt; is hidden in this seemingly innocent code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# app/gemini_service.py (before modification)
&lt;/span&gt;&lt;span class="n"&gt;suffix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mimetypes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;guess_extension&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.bin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tempfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;NamedTemporaryFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delete&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tmp_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before Python 3.13, &lt;code&gt;mimetypes.guess_extension("image/jpeg")&lt;/code&gt; &lt;strong&gt;returns &lt;code&gt;.jpe&lt;/code&gt;, not &lt;code&gt;.jpg&lt;/code&gt;&lt;/strong&gt;, because &lt;code&gt;.jpe&lt;/code&gt; sorts before &lt;code&gt;.jpg&lt;/code&gt; in the standard library's MIME table, a quirk that has existed for nearly twenty years.&lt;/p&gt;
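&lt;p&gt;The quirk is easy to reproduce in a few lines; what you see depends on the interpreter version:&lt;/p&gt;

```python
import mimetypes
import sys

# On Python < 3.13 this typically prints ".jpe"; on 3.13+ the stdlib
# switched to preferring the common extension, so it prints ".jpg".
suffix = mimetypes.guess_extension("image/jpeg")
print(sys.version_info[:2], suffix)
```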

&lt;p&gt;Gemini File Search Store doesn't recognize the &lt;code&gt;.jpe&lt;/code&gt; extension, but the API's "Upload has already been terminated" message is easy to misread — at first I suspected the upload had exceeded the size limit, been throttled by concurrency, or hit a race inside the SDK.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Take the file extension directly from &lt;code&gt;display_name&lt;/code&gt; (the handlers already set it correctly, e.g. &lt;code&gt;image_&amp;lt;id&amp;gt;.jpg&lt;/code&gt;), and fall back to an explicit MIME lookup table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# app/gemini_service.py (after modification)
&lt;/span&gt;&lt;span class="n"&gt;_MIME_TO_EXT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/webp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.webp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# ...
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;suffix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rsplit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;suffix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_MIME_TO_EXT&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;mimetypes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;guess_extension&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.bin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[BG Store] uploading display_name=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="s"&gt; mime=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
      &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;size=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tmp_suffix=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also, log &lt;code&gt;traceback.format_exc()&lt;/code&gt; in the &lt;code&gt;except&lt;/code&gt; block, so that the next time something goes wrong, Cloud Logging has the full stack.&lt;/p&gt;
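&lt;p&gt;A minimal sketch of that logging pattern (the wrapper and its names are illustrative, not the repo's actual code):&lt;/p&gt;

```python
import traceback

def log_failures(fn, *args, **kwargs):
    """Run a callable; on failure, print the full traceback so
    Cloud Logging captures more than the API's one-line message."""
    try:
        return fn(*args, **kwargs)
    except Exception:
        print(f"[BG Store] {fn.__name__} failed:\n{traceback.format_exc()}")
        raise

def flaky_upload():
    # Stand-in for the real upload call.
    raise RuntimeError("Upload has already been terminated.")

try:
    log_failures(flaky_upload)
except RuntimeError:
    pass  # the full stack was already printed above
```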

&lt;p&gt;&lt;strong&gt;The takeaway from this story&lt;/strong&gt;: when you run a new modality on a newly GA'd API, be sure to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;First confirm on the client side that the filename / extension you generate is the format the API expects&lt;/strong&gt;; don't trust the &lt;code&gt;mimetypes&lt;/code&gt; standard library to guess for you.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Write the stack trace into the log&lt;/strong&gt;, otherwise you'll be stuck sifting through vague forum threads that suggest things like "just try a different file".&lt;/li&gt;
&lt;li&gt; Compare the extensions you generate against the &lt;a href="https://ai.google.dev/gemini-api/docs/file-search" rel="noopener noreferrer"&gt;official list of formats supported by Gemini File Search&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
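&lt;p&gt;Takeaway 3 can be turned into a small pre-upload guard. The allow-list below is illustrative; pull the real set from the official supported-format list:&lt;/p&gt;

```python
# Illustrative pre-upload guard: refuse extensions the store won't accept.
# SUPPORTED_SUFFIXES is a placeholder; the authoritative list lives in the
# Gemini File Search documentation.
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp", ".pdf", ".txt", ".md"}

def check_suffix(suffix: str) -> str:
    """Normalize an extension and reject anything outside the allow-list."""
    s = suffix.lower()
    if s not in SUPPORTED_SUFFIXES:
        raise ValueError(f"extension {suffix!r} is not in the allow-list")
    return s

print(check_suffix(".JPG"))   # .jpg
try:
    check_suffix(".jpe")      # the very extension behind Pitfall 2
except ValueError as err:
    print(err)
```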




&lt;h2&gt;
  
  
  Summary: The Entry Fee for Multimodal RAG Has Never Been Lower
&lt;/h2&gt;

&lt;p&gt;This Gemini API File Search upgrade compresses a feature line that used to take three months to ship into &lt;strong&gt;dozens of lines of code plus a managed API&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Native multimodal support&lt;/strong&gt;: Text, images, videos, audio, and documents share the same embedding space, goodbye to the OCR transition layer.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Custom metadata + server-side filters&lt;/strong&gt;: Multi-tenant SaaS no longer has to agonize over how many stores to shard into.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Page-level citations&lt;/strong&gt;: Enterprise compliance scenarios finally have native grounding.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Budget-friendly&lt;/strong&gt;: Storage and query embeddings are both free; you pay only for indexing + LLM tokens.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Cross-modal scores of Embedding 2&lt;/strong&gt;: 97.4% recall@1 is not a demo number; it's a level that can directly support a product.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want a production-shaped end-to-end example, the &lt;a href="https://github.com/kkdai/linebot-multimodal-rag" rel="noopener noreferrer"&gt;&lt;code&gt;kkdai/linebot-multimodal-rag&lt;/code&gt;&lt;/a&gt; repo welcomes PRs, and you're also welcome to fork it into a RAG application for your own domain — a Notion knowledge base, an employee-handbook Q&amp;amp;A bot, a photo album manager, a research-paper index... you're limited only by your imagination.&lt;/p&gt;

&lt;p&gt;If you want to get started, the recommended reading order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Google official blog: &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/expanded-gemini-api-file-search-multimodal-rag/" rel="noopener noreferrer"&gt;Expanded Gemini API File Search for multimodal RAG&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; Gemini Embedding 2 specification page: &lt;a href="https://deepmind.google/models/gemini/embedding/" rel="noopener noreferrer"&gt;deepmind.google/models/gemini/embedding&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; Developer implementation guide: &lt;a href="https://dev.to/googleai/multimodal-rag-with-the-gemini-api-file-search-tool-a-developer-guide-5878"&gt;Multimodal RAG with the Gemini API File Search tool: a developer guide&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; My open-source example: &lt;a href="https://github.com/kkdai/linebot-multimodal-rag" rel="noopener noreferrer"&gt;github.com/kkdai/linebot-multimodal-rag&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Everyone is welcome to try out this powerful new multimodal RAG support!&lt;/p&gt;

</description>
      <category>api</category>
      <category>gemini</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>[GCP Practice][BwAI] AI-Powered Development: Quickly Deploy a LINE Bot Cloud Backup Tool with Gemini CLI</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Thu, 07 May 2026 04:36:40 +0000</pubDate>
      <link>https://forem.com/gde/gcp-practicebwai-ai-powered-development-quickly-deploy-a-line-bot-cloud-backup-tool-with-4ghi</link>
      <guid>https://forem.com/gde/gcp-practicebwai-ai-powered-development-quickly-deploy-a-line-bot-cloud-backup-tool-with-4ghi</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqpdr5gzv1yj95xvae4ss.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqpdr5gzv1yj95xvae4ss.png" alt="Preview Program 2026-05-05 12.38.54" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;p&gt;In the upcoming &lt;strong&gt;Build With AI 2026&lt;/strong&gt; workshop, we're bringing a very practical project: the &lt;strong&gt;LINE Bot File Backup Robot&lt;/strong&gt;. It lets you upload images and files from your LINE chatroom straight to Google Drive, automatically organizing them into monthly folders.&lt;/p&gt;

&lt;p&gt;Traditionally, putting a project like this, which includes OAuth authorization, a Firestore database, and Cloud Run container deployment, on the cloud would often leave beginners struggling with lengthy &lt;code&gt;gcloud&lt;/code&gt; commands.&lt;/p&gt;

&lt;p&gt;But this time is different, because we have a secret weapon: &lt;strong&gt;Gemini CLI&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This article will document how we used AI as a DevOps engineer, completing the entire complex deployment process by "talking," and of course, including the various real pitfalls we encountered along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  Preparation: Summoning the AI Assistant
&lt;/h2&gt;

&lt;p&gt;Before we start, besides the basic &lt;code&gt;gcloud&lt;/code&gt; installation and login, you only need to install &lt;a href="https://github.com/google/gemini-cli" rel="noopener noreferrer"&gt;Gemini CLI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Prepare the following "confidential parameters" (all values in this article are mocked):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PROJECT_ID&lt;/strong&gt;: &lt;code&gt;your-cool-project-id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LINE Channel Secret&lt;/strong&gt;: &lt;code&gt;YOUR_LINE_SECRET_XXXX&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LINE Access Token&lt;/strong&gt;: &lt;code&gt;YOUR_LINE_TOKEN_XXXX&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After entering the project folder, I only said one sentence to Gemini CLI:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Help me deploy to Cloud Run using gcloud, and stop and ask me if you need any information. Refer to the repo…"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Next, it's time to witness miracles (and fix bugs).&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Deployment Process: AI Leading the Way
&lt;/h2&gt;

&lt;p&gt;Gemini CLI intelligently analyzed the &lt;code&gt;Dockerfile&lt;/code&gt; and &lt;code&gt;main.go&lt;/code&gt;, then immediately laid out a battle plan.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Environment Detection and API Enablement
&lt;/h3&gt;

&lt;p&gt;The AI first confirmed my current project settings in &lt;code&gt;gcloud&lt;/code&gt; and then enabled the necessary services in one go:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud services &lt;span class="nb"&gt;enable &lt;/span&gt;firestore.googleapis.com &lt;span class="se"&gt;\&lt;/span&gt;
  cloudbuild.googleapis.com &lt;span class="se"&gt;\&lt;/span&gt;
  run.googleapis.com &lt;span class="se"&gt;\&lt;/span&gt;
  artifactregistry.googleapis.com

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Creating a Firestore Database (Encountering the First Pitfall)
&lt;/h3&gt;

&lt;p&gt;Our Bot needs to record the OAuth state token (an anti-forgery marker), so Firestore is needed. The AI tried to execute the command, but we immediately hit an error. &lt;em&gt;(See the pitfall record below for details)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After correction, the correct command is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud firestore databases create &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;asia-east1 &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;firestore-native

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Deploying Cloud Run First, Filling in the Blanks Later
&lt;/h3&gt;

&lt;p&gt;This is a classic "chicken or the egg" problem: Google OAuth needs to know your Cloud Run URL (Redirect URI), but your Cloud Run deployment needs to fill in the OAuth Client ID and Secret.&lt;/p&gt;

&lt;p&gt;Gemini CLI's strategy is great: &lt;strong&gt;Deploy with placeholders first!&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy linebot-backup-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; asia-east1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-env-vars&lt;/span&gt; &lt;span class="s2"&gt;"GOOGLE_CLOUD_PROJECT=your-cool-project-id,ChannelSecret=YOUR_LINE_SECRET_XXXX,ChannelAccessToken=YOUR_LINE_TOKEN_XXXX,GOOGLE_CLIENT_ID=PENDING,GOOGLE_CLIENT_SECRET=PENDING,GOOGLE_REDIRECT_URL=PENDING"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--allow-unauthenticated&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--quiet&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After a successful deployment, we got a shiny new URL: &lt;code&gt;https://linebot-backup-service-xxxxx.a.run.app&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Completing Google OAuth Settings and Environment Variable Updates
&lt;/h3&gt;

&lt;p&gt;With the URL in hand, I could go to "APIs &amp;amp; Services" in the Google Cloud Console to complete the settings:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Create an &lt;strong&gt;OAuth consent screen&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt; Create credentials for a &lt;strong&gt;Web application&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt; Fill in the "Authorized redirect URI" with the URL we just got, plus &lt;code&gt;/oauth/callback&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After getting the real ID and Secret, I directly pasted the information to Gemini CLI, and it automatically updated the service for me:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services update linebot-backup-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; asia-east1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--update-env-vars&lt;/span&gt; &lt;span class="s2"&gt;"GOOGLE_REDIRECT_URL=https://[YOUR_URL]/oauth/callback,GOOGLE_CLIENT_ID=real-client-id.apps.googleusercontent.com,GOOGLE_CLIENT_SECRET=real-secret-xxxx"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done! Finally, just go to the LINE Developers Console and fill in the Webhook.&lt;/p&gt;




&lt;h2&gt;
  
  
  Blood-and-Tears Pitfall Log from the Deployment Process
&lt;/h2&gt;

&lt;p&gt;It looks smooth, but in reality the AI and I hit a few walls together. That, too, is the most authentic part of working with CLI tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pitfall 1: Forgetting to Bind a Credit Card, the 390001 Error
&lt;/h3&gt;

&lt;p&gt;When executing the first &lt;code&gt;gcloud run deploy&lt;/code&gt;, the terminal spewed red text everywhere:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;FAILED_PRECONDITION: Billing account for project is not found...&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Reason&lt;/strong&gt;: Cloud Run and Cloud Build require the project to have billing enabled. This was a brand-new test project, and I had forgotten to bind a billing account. &lt;strong&gt;Solution&lt;/strong&gt;: The AI immediately checked the project status for me (&lt;code&gt;gcloud beta billing projects describe&lt;/code&gt;) and asked whether I wanted to switch to a project with billing enabled or fix this one. I obediently went to the Console to bind my credit card, and the deployment continued.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pitfall 2: The Evolution of Command Parameter Syntax
&lt;/h3&gt;

&lt;p&gt;When creating the Firestore database, the AI initially suggested &lt;code&gt;--type=native-mode&lt;/code&gt; or &lt;code&gt;--type=native&lt;/code&gt;, but gcloud didn't appreciate it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;ERROR: argument --type: Invalid choice: 'native-mode'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Reason&lt;/strong&gt;: &lt;code&gt;gcloud&lt;/code&gt; CLI parameters change across versions. &lt;strong&gt;Solution&lt;/strong&gt;: Read the gcloud error message carefully; the currently accepted values are &lt;code&gt;firestore-native&lt;/code&gt; and &lt;code&gt;datastore-mode&lt;/code&gt;. After switching to &lt;code&gt;--type=firestore-native&lt;/code&gt;, it passed smoothly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pitfall 3: The Invisible "Drive API"
&lt;/h3&gt;

&lt;p&gt;Once everything was deployed, we hit a permission error while testing "upload to Google Drive". &lt;strong&gt;Reason&lt;/strong&gt;: This Bot uploads files to Drive, yet when we enabled the APIs in step one, we forgot the protagonist: the &lt;strong&gt;Google Drive API&lt;/strong&gt;! Without it, even a successful OAuth authorization leaves the program blocked. &lt;strong&gt;Solution&lt;/strong&gt;: I typed nothing but a cryptic &lt;code&gt;"3."&lt;/code&gt; (meaning the third checkpoint) into the terminal, and the AI immediately understood and delivered the finishing blow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud services &lt;span class="nb"&gt;enable &lt;/span&gt;drive.googleapis.com

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Through Gemini CLI, the originally tedious and error-prone infrastructure construction work has become a "two-person pair programming" session.&lt;/p&gt;

&lt;p&gt;AI can remember lengthy gcloud parameters for you, sort out the deployment logic (deploy with PENDING values first, then update), and quickly adjust its strategy based on error messages when something fails.&lt;/p&gt;

&lt;p&gt;This is the core spirit that &lt;strong&gt;Build With AI 2026&lt;/strong&gt; wants to convey: let AI handle the tedious DevOps chores, so that developers can focus more energy on innovation in core business logic.&lt;/p&gt;

&lt;p&gt;If you are still manually typing long and ugly gcloud commands, I strongly recommend you install Gemini CLI and give it a try!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gemini</category>
      <category>googlecloud</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>GCP Hands-on: Deploying OpenAB - Building a Gemini ACP Bridge for Telegram on GCE</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Sat, 02 May 2026 12:01:51 +0000</pubDate>
      <link>https://forem.com/gde/gcp-hands-on-deploying-openab-building-a-gemini-acp-bridge-for-telegram-on-gce-1bd</link>
      <guid>https://forem.com/gde/gcp-hands-on-deploying-openab-building-a-gemini-acp-bridge-for-telegram-on-gce-1bd</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fez62lmp04cnsnbxlwngj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fez62lmp04cnsnbxlwngj.png" alt="image-20260502171732526" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;p&gt;Recently, in order to enable AI coding assistants (such as Claude Code or Gemini CLI) to be used directly on chat platforms, I started researching &lt;strong&gt;&lt;a href="https://openabdev.github.io/openab/" rel="noopener noreferrer"&gt;OpenAB&lt;/a&gt;&lt;/strong&gt;. This is a powerful bridge that can connect Slack, Discord, or Telegram to CLI tools that comply with the &lt;strong&gt;ACP (Agent Client Protocol)&lt;/strong&gt; standard.&lt;/p&gt;

&lt;p&gt;This article documents the complete practical process of deploying &lt;a href="https://openabdev.github.io/openab/" rel="noopener noreferrer"&gt;OpenAB&lt;/a&gt; on Google Cloud, specifically how to bypass authentication restrictions, handle Telegram's HTTPS requirements, and resolve path and permission issues in containerized deployments.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;OpenAB Reference Documentation&lt;/strong&gt;: &lt;a href="https://openabdev.github.io/openab/" rel="noopener noreferrer"&gt;https://openabdev.github.io/openab/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;OpenAB Repo&lt;/strong&gt;: &lt;a href="https://github.com/openabdev/openab" rel="noopener noreferrer"&gt;https://github.com/openabdev/openab&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Deployment Decision: Why GCE instead of Cloud Run?
&lt;/h2&gt;

&lt;p&gt;Although Cloud Run is usually my first choice, for OpenAB, &lt;strong&gt;Google Compute Engine (GCE)&lt;/strong&gt; is the better solution, for two reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Stateful sessions:&lt;/strong&gt; OpenAB starts a child process (such as Gemini CLI) for each conversation thread. These processes must stay resident to preserve conversation context, and Cloud Run's autoscaling would kill them, interrupting conversations.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Authentication persistence&lt;/strong&gt;: The AI CLI's tokens are stored on local disk; GCE combined with a Persistent Disk ensures the login state survives restarts.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Practical Steps: Step-by-Step Deployment Process
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Writing an Automated Startup Script
&lt;/h3&gt;

&lt;p&gt;To standardize the deployment, we wrote a &lt;code&gt;setup-openab.sh&lt;/code&gt;. Its core task is to install Docker, create persistent directories, and dynamically generate &lt;code&gt;config.toml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The most critical part is the &lt;strong&gt;custom Docker image&lt;/strong&gt;. Since the official OpenAB image does not necessarily include every AI tool, we install Node.js and &lt;code&gt;@google/gemini-cli&lt;/code&gt; on the fly in the Dockerfile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; ghcr.io/openabdev/openab:latest&lt;/span&gt;
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; root&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; curl &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://deb.nodesource.com/setup_20.x | bash - &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nodejs &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @google/gemini-cli
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; 1000&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Using gcloud to Create a GCE Instance
&lt;/h3&gt;

&lt;p&gt;We chose the &lt;code&gt;e2-medium&lt;/code&gt; specification and passed sensitive information (such as Bot Token) through Metadata to avoid hardcoding it in the script.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute instances create openab-server &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-project-id &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;asia-east1-b &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--machine-type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;e2-medium &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--image-family&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;debian-11 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--image-project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;debian-cloud &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--metadata-from-file&lt;/span&gt; startup-script&lt;span class="o"&gt;=&lt;/span&gt;setup-openab.sh &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;tg_bot_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;YOUR_BOT_TOKEN

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Configuring the Gemini API Key
&lt;/h3&gt;

&lt;p&gt;Unlike Kiro, which requires interactive login, &lt;code&gt;gemini-cli&lt;/code&gt; can directly read environment variables. We inject the API Key into OpenAB's &lt;code&gt;config.toml&lt;/code&gt; to make it run automatically in the background:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[agent]&lt;/span&gt;
&lt;span class="py"&gt;command&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"gemini"&lt;/span&gt;
&lt;span class="py"&gt;args&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"--acp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="py"&gt;GEMINI_API_KEY&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"AIzaSy..."&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Using Cloudflare Tunnel to Solve HTTPS Requirements
&lt;/h3&gt;

&lt;p&gt;Telegram Webhook strictly requires &lt;strong&gt;HTTPS&lt;/strong&gt;. Instead of setting up a complex Nginx + SSL, I chose to use &lt;strong&gt;Cloudflare Quick Tunnel&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Run on VM: &lt;code&gt;cloudflared tunnel --url http://localhost:8080&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; Get a randomly generated HTTPS URL.&lt;/li&gt;
&lt;li&gt; Register Webhook: &lt;code&gt;curl "https://api.telegram.org/bot&amp;lt;TOKEN&amp;gt;/setWebhook?url=&amp;lt;CF_URL&amp;gt;/webhook/telegram&amp;amp;secret_token=&amp;lt;SECRET&amp;gt;"&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Blood and Tears in the Migration Process: Technical Summary
&lt;/h2&gt;

&lt;p&gt;During deployment we had to debug several times. Here are the three biggest pitfalls:&lt;/p&gt;

&lt;h3&gt;
  
  
  Pitfall 1: Confusion of Image Sources
&lt;/h3&gt;

&lt;p&gt;At first I tried to pull &lt;code&gt;openabdev/openab&lt;/code&gt; from Docker Hub, but it kept failing. It turned out the project's current stable image is hosted on &lt;strong&gt;GitHub Container Registry (GHCR)&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Solution&lt;/strong&gt;: You must use &lt;code&gt;ghcr.io/openabdev/openab:latest&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pitfall 2: Hardcoded Configuration Path
&lt;/h3&gt;

&lt;p&gt;OpenAB's Dockerfile expects the configuration file at &lt;code&gt;/etc/openab/config.toml&lt;/code&gt;. I initially mounted it at &lt;code&gt;/app/config.toml&lt;/code&gt;, which made the container crash with an error immediately after startup.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Solution&lt;/strong&gt;: Correct the Docker Volume mount path to &lt;code&gt;/etc/openab/config.toml&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
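&lt;p&gt;If you drive the container with Docker Compose instead of plain &lt;code&gt;docker run&lt;/code&gt;, the corrected mount might look like this sketch. The service layout is an assumption; only the image name and target path come from the article:&lt;/p&gt;

```yaml
services:
  openab:
    image: ghcr.io/openabdev/openab:latest
    volumes:
      # Must land on the path the Dockerfile expects, not /app/config.toml
      - ./config.toml:/etc/openab/config.toml:ro
```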

&lt;h3&gt;
  
  
  Pitfall 3: Security Secret Token Verification Failed
&lt;/h3&gt;

&lt;p&gt;Even with a correct URL, Telegram messages were still rejected by the Gateway; the log showed &lt;code&gt;invalid or missing secret_token&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Reason&lt;/strong&gt;: &lt;code&gt;openab-gateway&lt;/code&gt; generates an internal secret token to reject unauthorized requests.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Solution&lt;/strong&gt;: Extract that token from the Gateway container and pass it as the &lt;code&gt;secret_token&lt;/code&gt; parameter when calling &lt;code&gt;setWebhook&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
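&lt;p&gt;Putting this together, the webhook registration can be scripted roughly as follows. Where the Gateway stores its token is an assumption here (check your container's logs or documentation); the fallback value is purely illustrative:&lt;/p&gt;

```shell
# Hypothetical location for the generated token -- verify where your
# gateway actually keeps it. Falls back to an example value here.
SECRET=$(docker exec openab-gateway cat /etc/openab/secret_token 2>/dev/null || echo "EXAMPLE_SECRET")
TOKEN="YOUR_BOT_TOKEN"
CF_URL="https://example.trycloudflare.com"
WEBHOOK_CALL="https://api.telegram.org/bot${TOKEN}/setWebhook?url=${CF_URL}/webhook/telegram&secret_token=${SECRET}"
echo "$WEBHOOK_CALL"   # pass this URL to curl once the values are real
```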




&lt;h2&gt;
  
  
  Summary: The Perfect AI Bridging Solution
&lt;/h2&gt;

&lt;p&gt;Through this architecture, I built a fully self-hosted, secure, and efficient AI assistant on GCP. It relies on no expensive subscription; it uses Gemini's API capabilities directly, with Telegram as the interaction surface.&lt;/p&gt;

&lt;p&gt;If you also want a dedicated ACP bridge in the cloud, the GCE + Docker + Cloudflare Tunnel combination is a well-balanced, stable choice.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gemini</category>
      <category>googlecloud</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>GCP in Action: Building a Persistent AI Assistant with GCE, Hermes Agent, and Telegram</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Sat, 02 May 2026 12:01:42 +0000</pubDate>
      <link>https://forem.com/gde/gcp-in-action-building-a-persistent-ai-assistant-with-gce-hermes-agent-and-telegram-1mlg</link>
      <guid>https://forem.com/gde/gcp-in-action-building-a-persistent-ai-assistant-with-gce-hermes-agent-and-telegram-1mlg</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4z6ajwagaqahnm2rv2k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4z6ajwagaqahnm2rv2k.png" alt="image-20260502161538962" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;p&gt;After solving the LINE Bot's Vertex AI migration, I started thinking: Could there be an AI assistant that is "more proactive" and "has long-term memory"? At this time, I set my sights on &lt;a href="https://github.com/nousresearch/hermes-agent" rel="noopener noreferrer"&gt;NousResearch's open-source &lt;strong&gt;Hermes Agent&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Unlike a typical Chatbot, Hermes is designed as an "operating system that breathes". It can execute Shell commands, write Python scripts, manage long-term memory, and even stay in touch with you via different Gateways (Telegram, Discord) at any time.&lt;/p&gt;

&lt;p&gt;To make it available 24/7, I chose to deploy it on &lt;strong&gt;Google Compute Engine (GCE)&lt;/strong&gt;. This article will document the deployment process from scratch, as well as the pitfalls I encountered when configuring the latest &lt;strong&gt;Gemini 2.5 Flash&lt;/strong&gt; model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Environment Parameter Preparation
&lt;/h2&gt;

&lt;p&gt;Before you start, please make sure you have these necessary parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PROJECT_ID&lt;/strong&gt;: &lt;code&gt;YOUR_PROJECT_ID&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LOCATION&lt;/strong&gt;: &lt;code&gt;global&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GOOGLE_API_KEY&lt;/strong&gt;: &lt;code&gt;YOUR_GOOGLE_API_KEY&lt;/code&gt; (Obtained from Google AI Studio)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 1: Create a GCE Instance
&lt;/h2&gt;

&lt;p&gt;Hermes Agent needs a reasonable amount of compute to handle tool use; the &lt;code&gt;e2-medium&lt;/code&gt; machine type is recommended.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute instances create hermes-agent-vm &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;YOUR_PROJECT_ID &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-central1-a &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--machine-type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;e2-medium &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--image-family&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ubuntu-2204-lts &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--image-project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ubuntu-os-cloud &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--boot-disk-size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;30GB &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;startup-script&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'#!/bin/bash
        apt-get update
        apt-get install -y git curl python3-pip python3-venv nodejs npm
    '&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 2: Install Hermes Agent
&lt;/h2&gt;

&lt;p&gt;After SSHing into the VM, use the official one-click installation script directly.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Enter the VM&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute ssh hermes-agent-vm &lt;span class="nt"&gt;--zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-central1-a

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Execute the installation&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
&lt;span class="nb"&gt;source&lt;/span&gt; ~/.bashrc

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 3: Configure Gemini 2.5 Flash (SOP Practice)
&lt;/h2&gt;

&lt;p&gt;This is where you are most likely to step on a landmine in the whole exercise: Hermes may default to model identifiers that are outdated or no longer exist.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Create a configuration file&lt;/strong&gt;: In &lt;code&gt;~/.hermes/config.yaml&lt;/code&gt;, specify &lt;strong&gt;Gemini 2.5 Flash&lt;/strong&gt; exactly, and &lt;strong&gt;do not include the &lt;code&gt;google/&lt;/code&gt; prefix&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set the API Key&lt;/strong&gt;: Write the key and permission settings in &lt;code&gt;~/.hermes/.env&lt;/code&gt;:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
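&lt;p&gt;The article leaves the two files implicit; here is a minimal sketch of what they might contain. The YAML key names are assumptions (check the Hermes documentation for the real schema), and the &lt;code&gt;.env&lt;/code&gt; value is a placeholder; the important part is the bare model identifier with no &lt;code&gt;google/&lt;/code&gt; prefix:&lt;/p&gt;

```shell
# Key names below are assumptions, not confirmed Hermes schema --
# the point is the bare model id with no "google/" prefix.
mkdir -p ~/.hermes
cat > ~/.hermes/config.yaml <<'EOF'
model: gemini-2.5-flash
auxiliary:
  model: gemini-2.5-flash
EOF
cat > ~/.hermes/.env <<'EOF'
GOOGLE_API_KEY=YOUR_GOOGLE_API_KEY
EOF
```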




&lt;h2&gt;
  
  
  Step 4: Connect to Telegram and Background Persistence
&lt;/h2&gt;

&lt;p&gt;To prevent the Agent from disappearing after the SSH connection is lost, we use &lt;strong&gt;Systemd&lt;/strong&gt; to manage it.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Create a Systemd service&lt;/strong&gt; (&lt;code&gt;/etc/systemd/system/hermes.service&lt;/code&gt;):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight systemd"&gt;&lt;code&gt;&lt;span class="k"&gt;[Unit]&lt;/span&gt;
&lt;span class="nt"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;Hermes Agent Gateway
&lt;span class="nt"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;network.target

&lt;span class="k"&gt;[Service]&lt;/span&gt;
&lt;span class="nt"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;simple
&lt;span class="nt"&gt;User&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;root
&lt;span class="nt"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;HOME=/root
&lt;span class="nt"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;PYTHONUNBUFFERED=1
&lt;span class="nt"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;/usr/local/lib/hermes-agent/venv/bin/hermes gateway run
&lt;span class="nt"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;always
&lt;span class="nt"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;10

&lt;span class="k"&gt;[Install]&lt;/span&gt;
&lt;span class="nt"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;multi-user.target

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Start the service&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;hermes
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart hermes

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Blood and Tears in the Migration Process: Why Isn't My Agent Responding?
&lt;/h2&gt;

&lt;p&gt;Even with the correct configuration, I still encountered the dilemma of "the Agent reads messages but doesn't reply". After checking the logs (&lt;code&gt;journalctl -u hermes&lt;/code&gt;), I found several deep pitfalls:&lt;/p&gt;

&lt;h3&gt;
  
  
  Pitfall 1: The 404 Ghost of Gemini 3.0
&lt;/h3&gt;

&lt;p&gt;Chasing the newest version, I configured &lt;code&gt;gemini-3-flash-preview&lt;/code&gt;, and the logs promptly spewed a pile of &lt;strong&gt;404 Model Not Found&lt;/strong&gt; errors. &lt;strong&gt;Reason&lt;/strong&gt;: Hermes' internal &lt;code&gt;auxiliary_client.py&lt;/code&gt; hardcodes &lt;code&gt;gemini-3-flash-preview&lt;/code&gt; as the default in several places, and when those auxiliary functions (such as title generation) fail, they break the reply logic of the entire Gateway. &lt;strong&gt;Solution&lt;/strong&gt;: Explicitly set all &lt;code&gt;auxiliary&lt;/code&gt; models to &lt;code&gt;gemini-2.5-flash&lt;/code&gt; in &lt;code&gt;config.yaml&lt;/code&gt;, or patch the source directly with &lt;code&gt;sed&lt;/code&gt;.&lt;/p&gt;
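&lt;p&gt;The &lt;code&gt;sed&lt;/code&gt; patch mentioned above can be sketched like this. It is demonstrated on a scratch copy so the substitution is easy to verify; the real file lives inside the Hermes install directory:&lt;/p&gt;

```shell
# Demonstrated on a scratch copy; run the same sed one-liner against the
# real auxiliary_client.py inside your Hermes installation.
cat > /tmp/auxiliary_client.py <<'EOF'
DEFAULT_AUX_MODEL = "gemini-3-flash-preview"
EOF
sed -i 's/gemini-3-flash-preview/gemini-2.5-flash/g' /tmp/auxiliary_client.py
cat /tmp/auxiliary_client.py
```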

&lt;h3&gt;
  
  
  Pitfall 2: Prefix Confusion of Model Identifiers
&lt;/h3&gt;

&lt;p&gt;Across SDKs, some expect &lt;code&gt;google/gemini-2.5-flash&lt;/code&gt; and others plain &lt;code&gt;gemini-2.5-flash&lt;/code&gt;. &lt;strong&gt;Experience&lt;/strong&gt;: with Hermes' Gemini provider, &lt;strong&gt;using the short name &lt;code&gt;gemini-2.5-flash&lt;/code&gt; directly is the safest&lt;/strong&gt;; adding &lt;code&gt;google/&lt;/code&gt; causes API routing errors instead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pitfall 3: Conflict between Systemd and "Processes Already Running"
&lt;/h3&gt;

&lt;p&gt;If you run &lt;code&gt;hermes gateway&lt;/code&gt; manually and then start the service, the system reports &lt;code&gt;Gateway already running (PID xxxx)&lt;/code&gt;. &lt;strong&gt;Solution&lt;/strong&gt;: add an &lt;code&gt;ExecStartPre=-/usr/bin/pkill -9 -f hermes&lt;/code&gt; line before &lt;code&gt;ExecStart&lt;/code&gt; to guarantee a clean environment on every start. Note the leading &lt;code&gt;-&lt;/code&gt;: systemd does not run Exec lines through a shell, so the shell idiom &lt;code&gt;|| true&lt;/code&gt; would not work here; the &lt;code&gt;-&lt;/code&gt; prefix is systemd's way of ignoring a non-zero exit (pkill exits non-zero when nothing matches).&lt;/p&gt;
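&lt;p&gt;In unit-file form, the guard looks like this. Systemd does not run Exec lines through a shell, so instead of &lt;code&gt;|| true&lt;/code&gt; the leading &lt;code&gt;-&lt;/code&gt; tells systemd to ignore pkill's exit status when no process matches:&lt;/p&gt;

```ini
[Service]
# Kill any manually started gateway before systemd launches its own copy;
# the leading "-" makes systemd ignore pkill's non-zero exit.
ExecStartPre=-/usr/bin/pkill -9 -f hermes
ExecStart=/usr/local/lib/hermes-agent/venv/bin/hermes gateway run
```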




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Now, my dedicated Hermes Agent is running stably on GCE and is available via Telegram at any time. It can not only help me find information, but also run some simple computing scripts for me directly on the cloud VM.&lt;/p&gt;

&lt;p&gt;This deployment taught me: &lt;strong&gt;In the face of rapidly updating models, the official documentation (or MCP tool query) is the only truth&lt;/strong&gt;. Don't blindly pursue the latest version number; ensuring that the identifier matches the current API environment is the key to stable operation.&lt;/p&gt;

&lt;p&gt;If you also want a 24-hour AI digital double, follow this SOP and set up a machine of your own!&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>googlecloud</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Gemini 3.1: Native TTS for Easier, More Powerful Summary Reading</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Sat, 02 May 2026 10:08:42 +0000</pubDate>
      <link>https://forem.com/gde/gemini-31-native-tts-for-easier-more-powerful-summary-reading-2ep9</link>
      <guid>https://forem.com/gde/gemini-31-native-tts-for-easier-more-powerful-summary-reading-2ep9</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwuiw99cubzhnf5ex409y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwuiw99cubzhnf5ex409y.png" alt="Finder 2026-04-16 21.43.57" width="800" height="1739"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;p&gt;In the previous hands-on post, we used Gemini 3.1 Flash Live for speech recognition and, through a workaround built on the Gemini 2.5 Live API, barely achieved a text-to-speech (TTS) function.&lt;/p&gt;

&lt;p&gt;But in April 2026, Google officially released &lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-tts/" rel="noopener noreferrer"&gt;&lt;strong&gt;Gemini 3.1 Flash TTS&lt;/strong&gt;&lt;/a&gt;. This is a native model specifically designed for audio output, no longer requiring a Live WebSocket, and can directly output high-quality audio through the standard &lt;code&gt;generate_content&lt;/code&gt; process.&lt;/p&gt;

&lt;p&gt;As a developer, you naturally want to adopt the more elegant, native solution right away. This article shares how I upgraded the LINE Bot's spoken-summary feature to Gemini 3.1 Native TTS, and the "async pit" I fell into along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  Technical Upgrade: From Live API to Native TTS
&lt;/h2&gt;

&lt;p&gt;The previous read-aloud feature was emulated on top of the Gemini 2.5 Live API. It worked, but it had several shortcomings:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;High complexity&lt;/strong&gt;: Requires managing the WebSocket connection lifecycle.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Model limitations&lt;/strong&gt;: Must use a specific &lt;code&gt;native-audio&lt;/code&gt; model, and primarily supports &lt;code&gt;us-central1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Fixed return format&lt;/strong&gt;: The sampling rate is usually fixed at 16kHz.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The emergence of &lt;strong&gt;Gemini 3.1 Flash TTS&lt;/strong&gt; changed all this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Model name&lt;/strong&gt;: &lt;code&gt;gemini-3.1-flash-tts-preview&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Consistent interface&lt;/strong&gt;: Uses the familiar &lt;code&gt;generate_content_stream&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Dynamic parameters&lt;/strong&gt;: Supports detecting the sampling rate from the returned MIME type (usually raised to 24kHz, for better sound quality).&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Core Code Evolution (tools/tts_tool.py)
&lt;/h2&gt;

&lt;p&gt;The new implementation has become more concise, with the focus on the &lt;code&gt;response_modalities=["audio"]&lt;/code&gt; setting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;text_to_speech&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;GOOGLE_AI_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;http_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v1beta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;contents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="c1"&gt;# Add localization instructions to make the tone more natural
&lt;/span&gt;                &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please use Traditional Chinese with Taiwanese vocabulary, and read the following summary in a friendly and natural tone. ## Transcript:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;response_modalities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;speech_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SpeechConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;voice_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;VoiceConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;prebuilt_voice_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PrebuiltVoiceConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;voice_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Zephyr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;pcm_chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;sample_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;24000&lt;/span&gt; &lt;span class="c1"&gt;# Default value
&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# ⚠️ This is the big pit that almost made me stay up all night fixing it
&lt;/span&gt;        &lt;span class="n"&gt;response_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3.1-flash-tts-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response_stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="n"&gt;pcm_chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="c1"&gt;# Get the sampling rate dynamically from the MIME type (e.g. audio/L16;rate=24000)
&lt;/span&gt;                        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                            &lt;span class="n"&gt;sample_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TTS Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt;

    &lt;span class="n"&gt;pcm_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pcm_chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;duration_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pcm_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Subsequently, it is also converted to m4a via ffmpeg and sent to LINE...
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
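&lt;p&gt;The snippet above calls a &lt;code&gt;parse_rate&lt;/code&gt; helper that is not shown. Here is a minimal sketch of what it and the duration math might look like, assuming the &lt;code&gt;audio/L16;rate=24000&lt;/code&gt; MIME shape from the comment; the regex and the 24 kHz fallback are my assumptions:&lt;/p&gt;

```python
import re

def parse_rate(mime_type: str, default: int = 24000) -> int:
    """Extract the sample rate from a MIME type like 'audio/L16;rate=24000'.

    Falls back to 24 kHz (an assumption) when no rate parameter is present.
    """
    match = re.search(r"rate=(\d+)", mime_type)
    return int(match.group(1)) if match else default

def pcm_duration_ms(pcm_bytes: bytes, sample_rate: int) -> int:
    """Duration of 16-bit mono PCM audio: 2 bytes per sample."""
    return int(len(pcm_bytes) / (sample_rate * 2) * 1000)
```

&lt;p&gt;For example, one second of 16-bit mono PCM at 24 kHz is 48,000 bytes, which the helper reports as 1000 ms.&lt;/p&gt;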






&lt;h2&gt;
  
  
  The Pitfall: The Missing &lt;code&gt;await&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;This upgrade encountered a very subtle &lt;code&gt;TypeError&lt;/code&gt;, which kept popping up after remote deployment:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;TypeError: 'async for' requires an object with __aiter__ method, got coroutine&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  ❌ Incorrect Writing
&lt;/h3&gt;

&lt;p&gt;Following the examples, I intuitively assumed I could run &lt;code&gt;async for&lt;/code&gt; over the method call directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This is wrong!
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content_stream&lt;/span&gt;&lt;span class="p"&gt;(...):&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ✅ Correct Solution
&lt;/h3&gt;

&lt;p&gt;In the asynchronous variant of the Google GenAI Python SDK, &lt;code&gt;generate_content_stream&lt;/code&gt; is itself an &lt;code&gt;async&lt;/code&gt; function that &lt;strong&gt;returns&lt;/strong&gt; an async iterator. You must first &lt;code&gt;await&lt;/code&gt; the call to obtain that iterator, and only then run &lt;code&gt;async for&lt;/code&gt; over it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Correct approach: two steps
&lt;/span&gt;&lt;span class="n"&gt;response_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content_stream&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response_stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This detail does not arise in synchronous code or in some older SDKs, but when consuming the asynchronous stream of 3.1 Flash TTS, it is the difference between the code running and failing outright.&lt;/p&gt;
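&lt;p&gt;The same two-step rule can be demonstrated with a plain &lt;code&gt;asyncio&lt;/code&gt; stand-in, no Gemini SDK required (the function names below are illustrative, not part of the SDK): an &lt;code&gt;async def&lt;/code&gt; that returns an async generator must itself be awaited before you can iterate.&lt;/p&gt;

```python
import asyncio

async def fake_generate_content_stream():
    """Stand-in for the SDK call: an async function that RETURNS an async iterator."""
    async def chunks():
        for text in ("Hello, ", "world!"):
            yield text
    return chunks()

async def main() -> str:
    # Wrong: `async for chunk in fake_generate_content_stream():` raises
    # TypeError, because the un-awaited coroutine is not an async iterator.
    stream = await fake_generate_content_stream()  # step 1: await the coroutine
    collected = []
    async for chunk in stream:                     # step 2: iterate the result
        collected.append(chunk)
    return "".join(collected)

print(asyncio.run(main()))  # prints: Hello, world!
```

&lt;p&gt;Skipping step 1 fails immediately with a &lt;code&gt;TypeError&lt;/code&gt; along the lines of "'async for' requires an object with __aiter__ method", which is exactly what the wrong version above produces.&lt;/p&gt;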




&lt;h2&gt;
  
  
  Localization Adjustment: Making the Bot Speak "Taiwanese"
&lt;/h2&gt;

&lt;p&gt;Although the summary itself is already in Traditional Chinese, the TTS model sometimes reads it with a non-native accent or unnatural word choices. We solved this problem through Prompt Engineering:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Please use &lt;strong&gt;Taiwanese vocabulary&lt;/strong&gt; in Traditional Chinese, and read it in a &lt;strong&gt;friendly and natural&lt;/strong&gt; tone..."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After adding this instruction, Gemini's audio output comes much closer to Taiwanese speech habits in intonation and sentence breaks, which makes the read-aloud summaries feel far more approachable.&lt;/p&gt;
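&lt;p&gt;In our bot, this instruction is simply prepended to the summary text handed to the TTS call. A minimal sketch, where the constant and function names are ours and not part of any SDK:&lt;/p&gt;

```python
# Hypothetical helper: prepend the localization instruction to the summary
# text before it is sent to the TTS model.
TTS_STYLE_PROMPT = (
    "Please use Taiwanese vocabulary in Traditional Chinese, "
    "and read it in a friendly and natural tone:\n\n"
)

def build_tts_input(summary: str) -> str:
    """Combine the style instruction with the summary to be read aloud."""
    return TTS_STYLE_PROMPT + summary.strip()
```

&lt;p&gt;The combined string is then passed as the text content of the TTS generation request.&lt;/p&gt;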




&lt;h2&gt;
  
  
  Summary: Changes Brought by Native TTS
&lt;/h2&gt;

&lt;p&gt;After migrating from Live API to Native TTS:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;More stable connections&lt;/strong&gt;: no more long-lived WebSocket to maintain.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Better sound quality&lt;/strong&gt;: native support for a 24kHz sampling rate.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Easier maintenance&lt;/strong&gt;: roughly 30% less code, with more direct logic.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This experience also reminded me that even with a seemingly mature SDK, you should carefully check the return value types of its &lt;code&gt;async&lt;/code&gt; APIs.&lt;/p&gt;

&lt;p&gt;If you also want your LINE Bot to speak, Gemini 3.1 Flash TTS is definitely the best choice at the moment.&lt;/p&gt;

&lt;p&gt;The complete code has been updated to &lt;a href="https://github.com/kkdai/linebot-helper-python" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, see you next time!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>gemini</category>
      <category>google</category>
    </item>
    <item>
      <title>GCP in Action: Migrating a LINE Bot from AI Studio to Vertex AI to Solve 429 Errors</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Sat, 02 May 2026 10:08:31 +0000</pubDate>
      <link>https://forem.com/gde/gcp-in-action-migrating-a-line-bot-from-ai-studio-to-vertex-ai-to-solve-429-errors-47jo</link>
      <guid>https://forem.com/gde/gcp-in-action-migrating-a-line-bot-from-ai-studio-to-vertex-ai-to-solve-429-errors-47jo</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4oswk3rx7cz9bcvye4uy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4oswk3rx7cz9bcvye4uy.png" alt="image-20260421011411264" width="713" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;p&gt;Recently, the LINE business card assistant bot (&lt;code&gt;linebot-namecard-python&lt;/code&gt;) deployed on Google Cloud Run suddenly went down. Checking the logs with &lt;code&gt;gcloud logging read&lt;/code&gt; turned up this merciless error:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;google.api_core.exceptions.ResourceExhausted: 429 Your billing account has exceeded its monthly spending cap.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It turned out we had been using the API Key from Google AI Studio (the &lt;code&gt;google.generativeai&lt;/code&gt; package) for rapid development, and had silently maxed out the monthly free quota.&lt;/p&gt;

&lt;p&gt;As a developer who needs to launch a service, it's time to "level up" the architecture and migrate the model calls to the enterprise-grade &lt;strong&gt;Google Cloud Vertex AI&lt;/strong&gt;, directly using GCP's IAM permissions and billing system. This article will share the migration process and the various pitfalls encountered along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  Technical Upgrade: From AI Studio to Vertex AI
&lt;/h2&gt;

&lt;p&gt;To migrate a project from the Google AI Studio SDK to Vertex AI, there are three main steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Replace the dependency package&lt;/strong&gt;: In &lt;code&gt;requirements.txt&lt;/code&gt;, remove the old &lt;code&gt;google-generativeai&lt;/code&gt; (imported as &lt;code&gt;google.generativeai&lt;/code&gt;) and replace it with &lt;code&gt;google-cloud-aiplatform&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Update environment variable settings&lt;/strong&gt;: In &lt;code&gt;config.py&lt;/code&gt;, we no longer need &lt;code&gt;GEMINI_API_KEY&lt;/code&gt;, but instead use GCP's &lt;code&gt;PROJECT_ID&lt;/code&gt; and &lt;code&gt;LOCATION&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;LOCATION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LOCATION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;global&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Default to global
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt; &lt;strong&gt;Core code rewriting (gemini_utils.py)&lt;/strong&gt;: Although the Vertex AI SDK interface is similar, its handling of multimodal data (such as images) is slightly stricter. We need to convert a &lt;code&gt;PIL.Image&lt;/code&gt; into the &lt;code&gt;vertexai.generative_models.Part&lt;/code&gt; format.&lt;/li&gt;
&lt;/ol&gt;
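&lt;p&gt;A minimal sketch of that conversion, assuming the &lt;code&gt;google-cloud-aiplatform&lt;/code&gt; SDK is installed; the helper names are ours. A &lt;code&gt;PIL.Image&lt;/code&gt; is first serialized to JPEG bytes in memory, then wrapped with &lt;code&gt;Part.from_data&lt;/code&gt;:&lt;/p&gt;

```python
import io
import mimetypes

def guess_mime(path: str) -> str:
    """Best-effort MIME type from a file name (defaults to JPEG)."""
    return mimetypes.guess_type(path)[0] or "image/jpeg"

def pil_image_to_part(image):
    """Serialize a PIL.Image to JPEG bytes and wrap it as a Vertex AI Part.

    The SDK import is kept inside the function so the pure helper above
    can be used without google-cloud-aiplatform installed.
    """
    from vertexai.generative_models import Part

    buf = io.BytesIO()
    image.save(buf, format="JPEG")
    return Part.from_data(data=buf.getvalue(), mime_type="image/jpeg")
```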




&lt;h2&gt;
  
  
  Pitfall 1: Residual Old SDK Causing Cloud Run Startup Failure
&lt;/h2&gt;

&lt;p&gt;I cheerfully updated the environment variables with &lt;code&gt;gcloud run services update&lt;/code&gt;, only for the Cloud Run deployment to fail: the container couldn't even start.&lt;/p&gt;

&lt;p&gt;After checking the logs, I found:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;ModuleNotFoundError: No module named 'google.generativeai'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Reason&lt;/strong&gt;: Although &lt;code&gt;gemini_utils.py&lt;/code&gt; has been rewritten, the main program &lt;code&gt;app/main.py&lt;/code&gt; still contains &lt;code&gt;import google.generativeai as genai&lt;/code&gt; and the initialization code &lt;code&gt;genai.configure(api_key=...)&lt;/code&gt;. Since the package has been removed from &lt;code&gt;requirements.txt&lt;/code&gt;, the container will naturally fail to find the module and crash during startup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Globally grep the project, completely remove all references to the old SDK, and then repackage the Docker image using Cloud Build and push it again.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pitfall 2: Vertex AI Model Name and Region Restrictions (404 Not Found)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxectrt95bp3j5f92txsp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxectrt95bp3j5f92txsp.png" alt="Google Chrome 2026-04-21 01.12.46" width="800" height="705"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the code cleaned up, the container started successfully, but when I sent a business card image on LINE, the bot threw a 500 error. Reviewing the logs again, this time it was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;google.api_core.exceptions.NotFound: 404 Publisher Model ... gemini-1.5-flash was not found or your project does not have access to it.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the biggest pit I encountered this time! In Google AI Studio, you can casually use the alias &lt;code&gt;gemini-1.5-flash&lt;/code&gt;; &lt;strong&gt;but in certain regions of Vertex AI (such as &lt;code&gt;asia-east1&lt;/code&gt; Taiwan), you must specify the exact version number&lt;/strong&gt;, such as &lt;code&gt;gemini-1.5-flash-002&lt;/code&gt;, otherwise the API will directly tell you that the model cannot be found.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advanced Challenge: I want to try Gemini 3.0 Flash Preview!
&lt;/h3&gt;

&lt;p&gt;To solve this problem, I had an idea. Since I'm going to change it, why not upgrade directly to the latest &lt;code&gt;gemini-3-flash-preview&lt;/code&gt;!&lt;/p&gt;

&lt;p&gt;As a result, I wrote a test script and found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  ❌ &lt;code&gt;asia-east1&lt;/code&gt; (Taiwan): 404 Not Found&lt;/li&gt;
&lt;li&gt;  ❌ &lt;code&gt;us-central1&lt;/code&gt; (Central US): 404 Not Found&lt;/li&gt;
&lt;li&gt;  ✅ &lt;strong&gt;&lt;code&gt;global&lt;/code&gt; (Global): SUCCESS!&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's right, currently this preview model on Vertex AI &lt;strong&gt;is only available in the &lt;code&gt;global&lt;/code&gt; region&lt;/strong&gt;.&lt;/p&gt;
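&lt;p&gt;The test script essentially probes each region in order until one succeeds. A hedged sketch of that loop: &lt;code&gt;try_generate&lt;/code&gt; stands in for a real request (for example, &lt;code&gt;vertexai.init&lt;/code&gt; for the location followed by a &lt;code&gt;generate_content&lt;/code&gt; call) and is injected so the selection logic stays self-contained.&lt;/p&gt;

```python
# Regions tried in the article, in order of preference.
CANDIDATE_LOCATIONS = ["asia-east1", "us-central1", "global"]

def first_available_location(try_generate, locations=CANDIDATE_LOCATIONS):
    """Return the first location where a test request succeeds, else None."""
    for location in locations:
        try:
            try_generate(location)  # perform one real request for this region
            return location
        except Exception as exc:  # e.g. google.api_core.exceptions.NotFound
            print(f"{location}: {exc}")
    return None
```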

&lt;p&gt;&lt;strong&gt;Final Solution&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Change the default region in &lt;code&gt;config.py&lt;/code&gt; to &lt;code&gt;global&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; Call &lt;code&gt;vertexai.init(project="line-vertex", location="global")&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; The Cloud Run environment variable &lt;code&gt;--update-env-vars="LOCATION=global"&lt;/code&gt; must also be aligned.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Summary: Changes Brought by Vertex AI
&lt;/h2&gt;

&lt;p&gt;After some effort, the business card bot finally came back to life, now running the latest Gemini 3 Flash model. The migration from AI Studio to Vertex AI brought several significant benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;No more quota anxiety&lt;/strong&gt;: no longer limited by AI Studio's free quota or spending cap; usage is billed directly through GCP, which suits a production environment.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Better security&lt;/strong&gt;: the plaintext API Key is gone from the environment variables, replaced by GCP's Application Default Credentials (IAM) for authentication, making the architecture more secure.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Stability&lt;/strong&gt;: Enterprise-grade SLA guarantee.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This experience also reminded me that when using Vertex AI on GCP, &lt;strong&gt;you must first check the official documentation to confirm the correspondence between "Region" and "Model Name"&lt;/strong&gt; to avoid being overwhelmed by 404 errors after deployment.&lt;/p&gt;

&lt;p&gt;If you also have a project that is about to move from AI Studio to Vertex AI, I hope this pitfall record can help you avoid some detours!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>googlecloud</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>[Gemini] Building a LINE E-commerce Chatbot That Can "Tell Stories from Images"</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Sun, 29 Mar 2026 02:08:30 +0000</pubDate>
      <link>https://forem.com/evanlin/gemini-building-a-line-e-commerce-chatbot-that-can-tell-stories-from-images-41i0</link>
      <guid>https://forem.com/evanlin/gemini-building-a-line-e-commerce-chatbot-that-can-tell-stories-from-images-41i0</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuc7ulj3k2ehr5j0fwdch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuc7ulj3k2ehr5j0fwdch.png" alt="image-20260225234804185" width="800" height="860"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxox886v85qv8909apuo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxox886v85qv8909apuo.png" alt="image-20260225234701217" width="800" height="858"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reference articles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ai.google.dev/gemini-api/docs/function-calling?hl=zh-tw#multimodal" rel="noopener noreferrer"&gt;Gemini API - Function Calling with Multimodal&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kkdai/linebot-gemini-multimodel-funcal" rel="noopener noreferrer"&gt;GitHub: linebot-gemini-multimodel-funcal&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/function-calling#mm-fr" rel="noopener noreferrer"&gt;Vertex AI - Multimodal Function Response&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Complete code &lt;a href="https://github.com/kkdai/linebot-gemini-multimodel-funcal" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;p&gt;I believe many people have used the combination of LINE Bot + Function Calling. When a user asks "What clothes did I buy last month?", the Bot calls the database query function, retrieves the order data, and then Gemini answers based on that JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traditional process designed by developers:

User: "Help me see the jacket I bought before"
Bot: [Call get_order_history()]
Function returns: {"product_name": "Brown pilot jacket", "order_date": "2026-01-15", ...}
Gemini: "You bought a brown pilot jacket on January 15th for NT$1,890."

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The answer is completely correct, but it always feels like something is missing—the user is talking about "that jacket," and Gemini is just restating the text in the JSON, with no way to "confirm" what the jacket looks like. If there happen to be three jackets in the database, the AI can't even determine which one is the one the user remembers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI can read text, but it can't see pictures&lt;/strong&gt;—this limitation has always been a blind spot in the traditional Function Calling architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvkluxi9r5zkhj1vys100.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvkluxi9r5zkhj1vys100.png" alt="Google Chrome 2026-02-26 10.34.51" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3gpn1tkbj80ifh65vsr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3gpn1tkbj80ifh65vsr.png" alt="Google Chrome 2026-02-26 10.34.58" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This problem was truly solved only after Gemini introduced &lt;strong&gt;Multimodal Function Response&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Multimodal Function Response?
&lt;/h2&gt;

&lt;p&gt;The traditional Function Calling process is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[User message] → Gemini → [function_call] → [Execute function] → [Return JSON] → Gemini → [Text answer]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Multimodal Function Response&lt;/strong&gt; changes that middle step. The function can not only return JSON, but also include images (JPEG/PNG/WebP) or documents (PDF) in the same response:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fge58c6ayrjas18sl2qjz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fge58c6ayrjas18sl2qjz.png" alt="Google Chrome 2026-02-25 23.04.28" width="800" height="439"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[User message] → Gemini → [function_call] → [Execute function] → [Return JSON + image bytes] → Gemini → [Text answer that has seen the image]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When Gemini generates the next round of answers, it can "see" both the structured data and the image returned by the function, thereby generating richer and more accurate responses.&lt;/p&gt;
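&lt;p&gt;Concretely, the function response gains a &lt;code&gt;parts&lt;/code&gt; array alongside the usual JSON payload. The sketch below hand-rolls that structure as a plain dict mirroring the SDK's &lt;code&gt;FunctionResponse&lt;/code&gt; / &lt;code&gt;FunctionResponsePart&lt;/code&gt; / &lt;code&gt;FunctionResponseBlob&lt;/code&gt; types; treat the exact field names as an approximation and prefer the typed SDK objects in real code:&lt;/p&gt;

```python
import base64

def multimodal_function_response(name: str, data: dict, image_bytes: bytes,
                                 mime_type: str = "image/jpeg") -> dict:
    """Approximate shape of a function response carrying an inline image.

    Mirrors the SDK's FunctionResponse with a FunctionResponsePart whose
    inline_data is a FunctionResponseBlob; the bytes are base64-encoded
    here purely so the dict stays JSON-serializable.
    """
    return {
        "function_response": {
            "name": name,
            "response": data,
            "parts": [{
                "inline_data": {
                    "mime_type": mime_type,
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                },
            }],
        },
    }
```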

&lt;p&gt;The officially supported media formats at present:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Supported formats&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Image&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;image/jpeg&lt;/code&gt;, &lt;code&gt;image/png&lt;/code&gt;, &lt;code&gt;image/webp&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Document&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;application/pdf&lt;/code&gt;, &lt;code&gt;text/plain&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The application scenarios for this feature are very broad: e-commerce customer service (identifying product images), medical consultation (analyzing PDF inspection reports), design review (giving suggestions based on screenshots)... almost all scenarios that require "functions to return visual data for AI analysis" are applicable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Goal
&lt;/h2&gt;

&lt;p&gt;This time, I used Multimodal Function Response to create a &lt;strong&gt;LINE e-commerce customer service robot&lt;/strong&gt;, demonstrating the following scenario:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;User: "Help me see the jacket I bought before"&lt;br&gt;
Bot (traditional): "You bought a brown pilot jacket."&lt;br&gt;
Bot (Multimodal): "From the photo, you can see that this is a brown pilot jacket, made of lightweight nylon, with metal zipper pockets on the sides. This is your January 15th order ORD-2026-0115, for a total of NT$1,890, and it has been delivered." + &lt;strong&gt;Product photo&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The difference is obvious: Gemini really "saw" the jacket, rather than just restating the text in the database.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why not use Google ADK?
&lt;/h3&gt;

&lt;p&gt;Originally, this repo used Google ADK (Agent Development Kit) to manage the Agent. The &lt;code&gt;Runner&lt;/code&gt; and &lt;code&gt;Agent&lt;/code&gt; of ADK encapsulated the entire process of Function Calling, which was very convenient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But Multimodal Function Response requires manually including image bytes in the &lt;code&gt;parts&lt;/code&gt; of the function response, and ADK fully encapsulates this layer, leaving no way to intervene.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So this time, I directly used &lt;code&gt;google.genai.Client&lt;/code&gt; to implement the iterative cycle of function calls myself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Old architecture (ADK)
&lt;/span&gt;&lt;span class="n"&gt;runner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;root_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_async&lt;/span&gt;&lt;span class="p"&gt;(...):&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="c1"&gt;# ADK handles all function calls for you, but you can't control the response content
&lt;/span&gt;
&lt;span class="c1"&gt;# New architecture (directly use google.genai)
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ECOMMERCE_TOOLS&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Handle function calls yourself, include images yourself
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Overall architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LINE User
    │
    ▼ POST /
FastAPI Webhook Handler
    │
    ▼
EcommerceAgent.process_message(text, line_user_id)
    │
    ├─ ① Call Gemini (with conversation history)
    │
    ├─ ② Gemini decides to call a tool → function_call
    │
    ├─ ③ _execute_tool()
    │ ├─ Execute query function (search_products / get_order_history / get_product_details)
    │ └─ Read real product photos in the img/ directory (Unsplash JPEG)
    │
    ├─ ④ Construct Multimodal Function Response
    │ └─ FunctionResponsePart(inline_data=FunctionResponseBlob(data=image_bytes))
    │
    ├─ ⑤ Call Gemini again (Gemini sees the image + data)
    │
    └─ ⑥ Return (ai_text, image_bytes)
    │
    ▼
LINE Reply:
  TextSendMessage(text=ai_text)
  ImageSendMessage(url=BOT_HOST_URL/images/{uuid}) ← FastAPI /images endpoint provides

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How to get product images?
&lt;/h3&gt;

&lt;p&gt;This demo uses real &lt;strong&gt;Unsplash clothing photos&lt;/strong&gt;. Each of the five products corresponds to an actual photo of the item, stored in the &lt;code&gt;img/&lt;/code&gt; directory. The reading logic is very simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_product_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Read the product image and return JPEG bytes.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each product in &lt;code&gt;PRODUCTS_DB&lt;/code&gt; has an &lt;code&gt;image_path&lt;/code&gt; field pointing to the corresponding image file:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Product ID&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Image&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P001&lt;/td&gt;
&lt;td&gt;Brown pilot jacket&lt;/td&gt;
&lt;td&gt;tobias-tullius-…-unsplash.jpg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P002&lt;/td&gt;
&lt;td&gt;White cotton university T&lt;/td&gt;
&lt;td&gt;mediamodifier-…-unsplash.jpg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P003&lt;/td&gt;
&lt;td&gt;Dark blue denim jacket&lt;/td&gt;
&lt;td&gt;caio-coelho-…-unsplash.jpg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P004&lt;/td&gt;
&lt;td&gt;Beige knitted shawl&lt;/td&gt;
&lt;td&gt;milada-vigerova-…-unsplash.jpg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P005&lt;/td&gt;
&lt;td&gt;Light blue simple T-shirt&lt;/td&gt;
&lt;td&gt;cristofer-maximilian-…-unsplash.jpg&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The image bytes read have two uses:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Embedded as a &lt;code&gt;FunctionResponseBlob&lt;/code&gt; for Gemini to analyze—real photos let Gemini describe the actual fabric texture and tailoring details&lt;/li&gt;
&lt;li&gt; Temporarily stored in the &lt;code&gt;image_cache&lt;/code&gt; dict and served to the LINE client for display via the FastAPI &lt;code&gt;/images/{uuid}&lt;/code&gt; endpoint&lt;/li&gt;
&lt;/ol&gt;
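&lt;p&gt;The &lt;code&gt;image_cache&lt;/code&gt; side is just a UUID-keyed in-memory dict. A minimal sketch (the names follow the article's description; the repo's actual implementation may differ):&lt;/p&gt;

```python
import uuid

# In-memory cache mapping a generated UUID to raw image bytes; the FastAPI
# /images/{uuid} endpoint looks images up here and returns them as JPEG.
image_cache: dict[str, bytes] = {}

def store_image(data: bytes) -> str:
    """Cache image bytes and return the UUID used in the public image URL."""
    image_id = str(uuid.uuid4())
    image_cache[image_id] = data
    return image_id

def build_image_url(bot_host_url: str, image_id: str) -> str:
    """URL handed to LINE's ImageSendMessage (BOT_HOST_URL from config)."""
    return f"{bot_host_url.rstrip('/')}/images/{image_id}"
```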




&lt;h2&gt;
  
  
  Detailed explanation of the core code
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Define tools (FunctionDeclaration)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;ECOMMERCE_TOOLS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;function_declarations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionDeclaration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_order_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Query the current user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s order history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;time_range&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Time range: all / last_month / last_3_months&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;enum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_3_months&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="c1"&gt;# ... search_products, get_product_details
&lt;/span&gt;    &lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Function call cycle (up to 5 iterations)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line_user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_get_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line_user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_iteration&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="c1"&gt;# Up to 5 times, to prevent infinite loops
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;system_instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_SYSTEM_INSTRUCTION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ECOMMERCE_TOOLS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;model_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
        &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Find all function_call parts
&lt;/span&gt;        &lt;span class="n"&gt;fc_parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model_content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function_call&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;fc_parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# No function call → final text response
&lt;/span&gt;            &lt;span class="n"&gt;final_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model_content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

        &lt;span class="c1"&gt;# Has function call → execute tool, include image
&lt;/span&gt;        &lt;span class="n"&gt;tool_parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fc_part&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;fc_parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result_dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;fc_part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc_part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;line_user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;tool_parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_build_multimodal_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc_part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result_dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tool_parts&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Construct Multimodal Function Response (the most critical step)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_build_multimodal_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;func_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result_dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;multimodal_parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# ⚠️ Note: Use FunctionResponseBlob here, not types.Blob!
&lt;/span&gt;        &lt;span class="n"&gt;multimodal_parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionResponsePart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionResponseBlob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# raw bytes, SDK handles base64 internally
&lt;/span&gt;                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_function_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;func_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result_dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Structured JSON data
&lt;/span&gt;        &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;multimodal_parts&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# ← Image is here! Gemini can "see" it after receiving it
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemini will receive both &lt;code&gt;result_dict&lt;/code&gt; (order JSON) and &lt;code&gt;image_bytes&lt;/code&gt; (product image) in the next &lt;code&gt;generate_content&lt;/code&gt; call, and the generated answer can therefore describe the visual content of the image.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: LINE Bot simultaneously returns text + image
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.py
&lt;/span&gt;
&lt;span class="n"&gt;ai_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ecommerce_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line_user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;reply_messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;TextSendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ai_text&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;image_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;image_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;image_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt; &lt;span class="c1"&gt;# Temporary storage
&lt;/span&gt;    &lt;span class="n"&gt;image_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;BOT_HOST_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/images/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;image_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# FastAPI provides service
&lt;/span&gt;    &lt;span class="n"&gt;reply_messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;ImageSendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;original_content_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;preview_image_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;get_line_bot_api&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;reply_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reply_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reply_messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LINE Bot's &lt;code&gt;reply_message&lt;/code&gt; supports returning multiple messages at once (up to 5), so text and images can be sent simultaneously.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pitfalls
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ❌ Pitfall 1: &lt;code&gt;FunctionResponseBlob&lt;/code&gt; is not &lt;code&gt;Blob&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The most common pitfall: when constructing multimodal image parts, &lt;strong&gt;you cannot use &lt;code&gt;types.Blob&lt;/code&gt;&lt;/strong&gt;; you must use &lt;code&gt;types.FunctionResponseBlob&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Wrong (rejected by Pydantic validation)
&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionResponsePart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Blob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ✅ Correct
&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionResponsePart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionResponseBlob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Although both types have &lt;code&gt;mime_type&lt;/code&gt; and &lt;code&gt;data&lt;/code&gt; fields, the &lt;code&gt;inline_data&lt;/code&gt; field of &lt;code&gt;FunctionResponsePart&lt;/code&gt; is typed as &lt;code&gt;FunctionResponseBlob&lt;/code&gt;, so Pydantic validation rejects a &lt;code&gt;Blob&lt;/code&gt; outright. You can confirm this with &lt;code&gt;python -c "from google.genai import types; print(types.FunctionResponsePart.model_fields)"&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  ❌ Pitfall 2: &lt;code&gt;aiohttp.ClientSession&lt;/code&gt; cannot be created at the module level
&lt;/h3&gt;

&lt;p&gt;The original code directly created &lt;code&gt;aiohttp.ClientSession()&lt;/code&gt; at the module level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Old method: module level
&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Will warn or error if there is no running event loop
&lt;/span&gt;&lt;span class="n"&gt;async_http_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AiohttpAsyncHttpClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;main.py&lt;/code&gt; is imported in pytest tests, no event loop is running, so this raises &lt;code&gt;RuntimeError: no running event loop&lt;/code&gt;. The fix is lazy initialization: create the session only the first time it is actually needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ✅ New method: lazy init
&lt;/span&gt;&lt;span class="n"&gt;_line_bot_api&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_line_bot_api&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;_line_bot_api&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_line_bot_api&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Called within the async route handler, guaranteeing an event loop
&lt;/span&gt;        &lt;span class="n"&gt;_line_bot_api&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AsyncLineBotApi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;channel_access_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;AiohttpAsyncHttpClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_line_bot_api&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
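&lt;p&gt;The same lazy-init pattern can be reproduced with plain &lt;code&gt;asyncio&lt;/code&gt; to see why it matters: a loop-bound resource created at import time fails, while one deferred to first use inside a coroutine succeeds. (Generic sketch; &lt;code&gt;make_resource&lt;/code&gt; stands in for &lt;code&gt;aiohttp.ClientSession&lt;/code&gt;.)&lt;/p&gt;

```python
import asyncio

_resource = None

def make_resource():
    # Stand-in for aiohttp.ClientSession(): needs a running event loop.
    return asyncio.get_running_loop()

def get_resource():
    global _resource
    if _resource is None:
        _resource = make_resource()  # first call happens inside a coroutine
    return _resource

async def handler():
    # Route handlers always run inside the event loop, so lazy init is safe here.
    return get_resource()

loop = asyncio.run(handler())
```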



&lt;h3&gt;
  
  
  ❌ Pitfall 3: LINE Bot needs HTTPS URL to send images
&lt;/h3&gt;

&lt;p&gt;Gemini receives raw bytes, but LINE Bot's &lt;code&gt;ImageSendMessage&lt;/code&gt; requires a &lt;strong&gt;publicly accessible HTTPS URL&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The solution is to add a &lt;code&gt;/images/{image_id}&lt;/code&gt; endpoint in FastAPI, cache the image bytes in the &lt;code&gt;image_cache&lt;/code&gt; dict, and let LINE fetch the image through this endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/images/{image_id}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;serve_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;image_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;image_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Image not found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;media_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For local development, expose port 8000 with &lt;code&gt;ngrok&lt;/code&gt;; after deploying to Cloud Run, the service URL can be used directly.&lt;/p&gt;
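&lt;p&gt;One caveat of the in-memory &lt;code&gt;image_cache&lt;/code&gt; dict is that it grows without bound. A small TTL wrapper is enough for demo purposes (an illustrative sketch; the open-source repo may handle cleanup differently):&lt;/p&gt;

```python
import time

class TTLImageCache:
    """Dict-like cache that drops entries older than ttl_seconds."""

    def __init__(self, ttl_seconds: float = 600.0):
        self._ttl = ttl_seconds
        self._store = {}  # image_id -> (timestamp, bytes)

    def __setitem__(self, image_id: str, data: bytes):
        self._store[image_id] = (time.monotonic(), data)

    def get(self, image_id: str):
        entry = self._store.get(image_id)
        if entry is None:
            return None
        ts, data = entry
        if time.monotonic() - ts > self._ttl:
            del self._store[image_id]  # expired: evict and miss
            return None
        return data
```

Because `serve_image` already treats a `None` lookup as a 404, this drops in as a replacement for the plain dict.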




&lt;h2&gt;
  
  
  Demo Walkthrough
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Mock database (default data for demo)
&lt;/h3&gt;

&lt;p&gt;The system ships with 5 built-in products (all with real Unsplash photos), and each LINE user is automatically assigned two demo orders the first time they query their order history:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Order number&lt;/th&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ORD-2026-0115&lt;/td&gt;
&lt;td&gt;2026-01-15&lt;/td&gt;
&lt;td&gt;P001 Brown pilot jacket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ORD-2026-0108&lt;/td&gt;
&lt;td&gt;2026-01-08&lt;/td&gt;
&lt;td&gt;P003 Dark blue denim jacket&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
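&lt;p&gt;The first-query binding described above can be sketched like this (mock data for illustration; the product names and prices follow the article, but the field names are assumptions):&lt;/p&gt;

```python
# Mock catalog plus per-user demo-order binding (illustrative field names).

PRODUCTS = {
    "P001": {"name": "Brown pilot jacket", "price": 1890},
    "P003": {"name": "Dark blue denim jacket", "price": 1490},
}

DEMO_ORDERS = [
    {"order_id": "ORD-2026-0115", "date": "2026-01-15", "product_id": "P001"},
    {"order_id": "ORD-2026-0108", "date": "2026-01-08", "product_id": "P003"},
]

_user_orders: dict[str, list] = {}

def get_orders(line_user_id: str) -> list:
    # Bind the two demo orders the first time this user queries.
    if line_user_id not in _user_orders:
        _user_orders[line_user_id] = [dict(o) for o in DEMO_ORDERS]
    return _user_orders[line_user_id]
```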

&lt;h3&gt;
  
  
  Scenario 1: "Help me see the jacket I bought before"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User sends: "Help me see the jacket I bought before"

[Gemini → function_call]
  get_order_history(time_range="all")

[_execute_tool execution]
  - get_order_history() returns two orders (P001, P003)
  - Read img/tobias-tullius-...-unsplash.jpg → Brown pilot jacket real photo bytes

[Multimodal Function Response]
  Part.from_function_response(
    name="get_order_history",
    response={"orders": [...], "order_count": 2},
    parts=[FunctionResponsePart(inline_data=FunctionResponseBlob(data=&amp;lt;photo&amp;gt;))]
  )

[Gemini responds after seeing the real photo]
  "From the photo, you can see that this is a brown pilot jacket, made of lightweight nylon with
   a glossy feel, and a metal zipper pocket on the left sleeve. This is your January 15, 2026
   order ORD-2026-0115, for a total of NT$1,890, status: delivered."

LINE displays: [Text] + [Brown pilot jacket real photo]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 2: "Are there any dark blue jackets?"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Gemini → function_call]
  search_products(description="dark blue jacket", color="dark blue")

[Gemini sees the real photo of the P003 dark blue denim jacket]
  "Yes! This dark blue denim jacket (P003) in the photo features a retro stitching design,
   a lapel with metal buttons, and a very complete garment feel, priced at NT$1,490, with 8 in stock."

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 3: "What are the features of the P004 knitted shawl?"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Gemini → function_call]
  get_product_details(product_id="P004")

[Gemini sees the real photo of the beige knitted shawl]
  "The photo shows a beige handmade crochet shawl, with a V-neck design and tassels at the bottom,
   you can see the light lace-like mesh weave, elegant texture, priced at NT$1,290."

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Traditional Function Response vs Multimodal Function Response
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Traditional&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Multimodal&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Function return&lt;/td&gt;
&lt;td&gt;Pure JSON&lt;/td&gt;
&lt;td&gt;JSON + image/PDF bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini perception&lt;/td&gt;
&lt;td&gt;Text data&lt;/td&gt;
&lt;td&gt;Text + visual content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answer quality&lt;/td&gt;
&lt;td&gt;"You bought a brown pilot jacket"&lt;/td&gt;
&lt;td&gt;"You can see the nylon texture in the photo, with a zipper pocket on the left sleeve..."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API difference&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Part.from_function_response(name, response)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Part.from_function_response(name, response, parts=[...])&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Applicable scenarios&lt;/td&gt;
&lt;td&gt;Pure text data queries&lt;/td&gt;
&lt;td&gt;Scenarios that require visual recognition/confirmation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Analysis and Outlook
&lt;/h2&gt;

&lt;p&gt;This implementation gave me a new understanding of Gemini's Function Calling capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem Multimodal Function Response truly solves&lt;/strong&gt; is letting an AI agent receive visual information as part of the very act of calling an external system, instead of fetching text first and uploading images in a separate step. This will be a foundational capability in visually driven domains such as e-commerce, medicine, and design.&lt;/p&gt;

&lt;p&gt;However, there are still a few limitations worth noting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Image URLs cannot be used directly&lt;/strong&gt;: Gemini's &lt;code&gt;FunctionResponseBlob&lt;/code&gt; requires raw bytes; a URL cannot be passed in directly (unlike images supplied in the prompt). If the image lives at a URL, download it to bytes first, e.g. with &lt;code&gt;requests.get()&lt;/code&gt;, before passing it in.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;display_name is optional&lt;/strong&gt;: The official documentation examples include &lt;code&gt;display_name&lt;/code&gt; and a &lt;code&gt;$ref&lt;/code&gt; JSON reference, but in my testing with google-genai 1.49.0 everything works without &lt;code&gt;display_name&lt;/code&gt;; Gemini can still see and analyze the image.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model support&lt;/strong&gt;: Officially, only the Gemini 3 series is listed as supported, but in my testing &lt;code&gt;gemini-2.0-flash&lt;/code&gt; handled it normally as well, with the same API structure.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
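&lt;p&gt;The workaround for the first limitation can be sketched with the standard library alone (the helper names here are mine, not part of the SDK; the article's &lt;code&gt;requests.get()&lt;/code&gt; works the same way):&lt;/p&gt;

```python
import mimetypes
import urllib.request

def guess_image_mime(url: str) -> str:
    """Guess a MIME type from the URL path; default to JPEG."""
    return mimetypes.guess_type(url)[0] or "image/jpeg"

def image_url_to_blob_args(url: str, timeout: float = 10.0) -> dict:
    """Download an image URL into the raw-bytes form FunctionResponseBlob needs."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
        # Prefer the server's Content-Type header; fall back to the URL extension.
        content_type = resp.headers.get("Content-Type")
        mime = (content_type or guess_image_mime(url)).split(";")[0].strip()
    return {"mime_type": mime, "data": data}

# Then: types.FunctionResponseBlob(**image_url_to_blob_args(product_image_url))
```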

&lt;p&gt;There are many directions to extend this in the future: let users send their own product photos for the Bot to compare, include PDF catalogs in the function response for Gemini to read directly, or, in medical scenarios, let the Bot analyze report images converted from DICOM. As long as visual data can be obtained from an external system, Multimodal Function Response can make the AI's answers more in-depth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The takeaway from this LINE Bot implementation fits in one sentence: &lt;strong&gt;let the function response carry the image, and Gemini's answers are upgraded from "restating data" to "telling a story based on the picture"&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The core API is only a few lines, though plenty of details are needed to make the whole flow work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The complete way for Gemini to see the image returned by the function
&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_function_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_order_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[...]},&lt;/span&gt;
    &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionResponsePart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionResponseBlob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="c1"&gt;# ← Not types.Blob!
&lt;/span&gt;                &lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The complete code is on &lt;a href="https://github.com/kkdai/linebot-gemini-multimodel-funcal" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, feel free to clone and play with it.&lt;/p&gt;

&lt;p&gt;See you next time!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>gemini</category>
      <category>llm</category>
    </item>
    <item>
      <title>Gemini Tool Combo: Building a LINE Meetup Helper with Maps Grounding and Places API in a Single API Call</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Sun, 29 Mar 2026 02:07:59 +0000</pubDate>
      <link>https://forem.com/gde/gemini-tool-combo-building-a-line-meetup-helper-with-maps-grounding-and-places-api-in-a-single-api-3ppd</link>
      <guid>https://forem.com/gde/gemini-tool-combo-building-a-line-meetup-helper-with-maps-grounding-and-places-api-in-a-single-api-3ppd</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ljj7q6yd4dju6v6uxg2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ljj7q6yd4dju6v6uxg2.png" alt="image-20260327164715459" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reference articles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemini-api-tooling-updates/" rel="noopener noreferrer"&gt;Gemini API tooling updates: context circulation, tool combos and Maps grounding for Gemini 3&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://developers.google.com/maps/documentation/places/web-service/nearby-search" rel="noopener noreferrer"&gt;Google Places API (New) - searchNearby&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://github.com/kkdai/linebot-spot-finder" rel="noopener noreferrer"&gt;GitHub: linebot-spot-finder&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Complete code &lt;a href="https://github.com/kkdai/linebot-spot-finder" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; (Meeting Helper LINE Bot Spot Finder)&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;p&gt;The combination of LINE Bot + Gemini is already very common. Whether it's using Google Search Grounding to let the model look up real-time information or Function Calling to let it invoke custom logic, each is mature when used on its own.&lt;/p&gt;

&lt;p&gt;But what if you want to achieve both "map location context" and "query real ratings" &lt;strong&gt;in the same question&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;Taking restaurant search as an example, the traditional approach usually looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "Help me find a hot pot restaurant nearby with a rating of 4 stars or above"

Solution A (using only Maps Grounding):
Gemini has map context, but the ratings come from the model's own description, so accuracy is not guaranteed.

Solution B (using only Places API):
You can get real ratings, but there is no map context, and Gemini doesn't know where the user is.

To have both, you usually need to make two API calls, or manually connect them yourself.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Letting the AI search maps and call external APIs in a single request&lt;/strong&gt; had always been an awkward gap in the old Gemini API architecture.&lt;/p&gt;

&lt;p&gt;That changed on March 17, 2026, when Google released the &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemini-api-tooling-updates/" rel="noopener noreferrer"&gt;Gemini API Tooling Updates&lt;/a&gt; (by Mariano Cocirio), which provided an official solution to this problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  What are Tool Combinations?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl69w6em7cc4jzvdxmdiu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl69w6em7cc4jzvdxmdiu.png" alt="image-20260327163136077" width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Google announced three core features in this &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemini-api-tooling-updates/" rel="noopener noreferrer"&gt;update&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Tool Combinations&lt;/strong&gt; Developers can now attach built-in tools (such as Google Search, Google Maps) and custom Function Declarations simultaneously in a &lt;strong&gt;single Gemini API call&lt;/strong&gt;. The model decides which tool to call and when to call it, and finally integrates the results to generate an answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Maps Grounding&lt;/strong&gt; Gemini can now directly perceive map data, not just text descriptions of "location", but truly has spatial context—knowing where the user is and what's nearby.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Context Circulation&lt;/strong&gt; Context now flows naturally across multi-turn tool calls, so the model fully remembers the results of the first tool call when making the second.&lt;/p&gt;

&lt;p&gt;The key to this change is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Old approach (two tools cannot coexist)
&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;google_search&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GoogleSearch&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;function_declarations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MY_FN&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# New approach (the same Tool object, both coexist)
&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;google_maps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GoogleMaps&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;function_declarations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MY_FN&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A one-line change opens up an entirely new way to combine tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Goal
&lt;/h2&gt;

&lt;p&gt;This time, I used Tool Combinations to transform the existing &lt;strong&gt;linebot-spot-finder&lt;/strong&gt;, upgrading it from "only Maps Grounding for rough answers" to "Google Maps context + Places API real data":&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;After the user sends their GPS location, they enter: "Please find a hot pot restaurant with a rating of 4 stars or above, suitable for group dining, and list the name, address, and review summary."&lt;/p&gt;

&lt;p&gt;Bot (old Maps Grounding): "There are several hot pot restaurants nearby, and the ratings are good." (the AI's own description, which may not be accurate)&lt;/p&gt;

&lt;p&gt;Bot (new Tool Combo): "Lao Wang Hot Pot | 100 Shimin Avenue, Xinyi District, Taipei City | Rating 4.6 (312) | Reviews: Large portions, great value for money, suitable for group dining; efficient service, fast serving."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The difference: Gemini now receives both the map context (where you are) and the &lt;strong&gt;real structured data&lt;/strong&gt; (rating numbers, review text) from the Places API, so its answer moves from a vague description to grounded, concrete information.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Overall Message Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LINE User sends GPS location
    │
    ▼
handle_location() → session.metadata stores lat/lng
    │
    └──► Returns Quick Reply (restaurant / gas station / parking lot)

LINE User sends text question (e.g. "Find a hot pot restaurant with a rating of 4 stars or above")
    │
    ▼
handle_text()
    │
    ├── session has lat/lng?
    │ Yes → tool_combo_search(query, lat, lng) ← Focus of this article
    │ No → fallback: Gemini Chat + Google Search
    │
    └──► Returns natural language answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
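&lt;p&gt;The routing in &lt;code&gt;handle_text()&lt;/code&gt; comes down to one coordinate check. A minimal sketch, with a plain dict standing in for the real session object (the helper name is mine):&lt;/p&gt;

```python
def pick_search_path(session: dict) -> str:
    """Route a text message: Tool Combo when GPS is known, fallback otherwise."""
    meta = session.get("metadata", {})
    if meta.get("lat") is not None and meta.get("lng") is not None:
        return "tool_combo_search"       # Maps Grounding + Places API combo
    return "google_search_fallback"      # Gemini Chat + Google Search only

# handle_location() stores the user's GPS into session["metadata"] beforehand,
# so later text messages from the same user can take the Tool Combo path.
```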



&lt;h3&gt;
  
  
  Tool Combo Agentic Loop
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tool_combo_search(query, lat, lng)
         │
         ▼
  Step 1: generate_content()
  tools = [google_maps + search_nearby_restaurants]
         │
         ▼
  response.candidates[0].content.parts has function_call?
       ╱ ╲
      Yes   No
      │     │
      ▼     ▼
  _execute_function()  Directly returns response.text
  → _call_places_api()
    (Places API searchNearby)
    Returns rating, address, reviews
      │
      ▼
  Collect into a single Content(role="user")
  Add to history
      │
      ▼
  Step 3: generate_content(contents=history)
  Gemini integrates map context + Places data
      │
      ▼
  Returns final.text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
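&lt;p&gt;The loop above can be sketched as plain control flow. This illustration hides the real google-genai calls behind a duck-typed &lt;code&gt;client&lt;/code&gt;; the names &lt;code&gt;tool_combo_loop&lt;/code&gt; and &lt;code&gt;execute_fn&lt;/code&gt; are mine, not the SDK's:&lt;/p&gt;

```python
def tool_combo_loop(client, query, execute_fn, max_rounds=3):
    """Sketch of the agentic loop: generate, run any function calls, generate again.

    client.generate(history) must return (function_calls, text): function_calls
    is a list of (name, args) pairs, empty when the model answers directly.
    execute_fn(name, args) performs the real external call (e.g. Places API).
    """
    history = [("user", query)]
    for _ in range(max_rounds):
        calls, text = client.generate(history)
        if not calls:
            return text  # No function_call in the parts: answer directly.
        # Collect every function result into one tool turn, mirroring the
        # single Content(role="user") appended to history in the diagram.
        for name, args in calls:
            history.append(("tool", name, execute_fn(name, args)))
    calls, text = client.generate(history)  # Final integration pass.
    return text
```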



&lt;h3&gt;
  
  
  Why not put lat/lng in Function Declaration?
&lt;/h3&gt;

&lt;p&gt;This is an important design decision.&lt;/p&gt;

&lt;p&gt;If you add &lt;code&gt;lat&lt;/code&gt;/&lt;code&gt;lng&lt;/code&gt; to the parameters of &lt;code&gt;SEARCH_NEARBY_RESTAURANTS_FN&lt;/code&gt;, Gemini will fill in the coordinates itself—but it fills in the "approximate location" inferred from the conversation, not the user's actual GPS coordinates, and the error can be as high as several kilometers.&lt;/p&gt;

&lt;p&gt;The correct approach is to let the Python dispatcher extract the precise coordinates from &lt;code&gt;session.metadata&lt;/code&gt; and &lt;strong&gt;inject&lt;/strong&gt; them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_execute_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lng&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_nearby_restaurants&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;_call_places_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lng&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;lng&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# ← Inject from session, don't let Gemini guess
&lt;/span&gt;            &lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;min_rating&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;min_rating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;4.0&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Core Code Details
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Define Function Declaration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;SEARCH_NEARBY_RESTAURANTS_FN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionDeclaration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_nearby_restaurants&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search for nearby restaurants using Google Places API, and return the rating, address, and user reviews.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lat/lng is automatically included by the system and does not need to be provided.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Restaurant type or keyword, such as: hot pot, hot pot, Italian&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;min_rating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Minimum rating threshold (1–5), default 4.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;radius_m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search radius (meters), default 1000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The description explicitly tells the model that "lat/lng is included by the system", which keeps the model from inventing coordinates in the args.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Places API Call
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="n"&gt;PLACES_API_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://places.googleapis.com/v1/places:searchNearby&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;PLACES_FIELD_MASK&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;places.displayName,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;places.rating,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;places.userRatingCount,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;places.formattedAddress,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;places.reviews&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_call_places_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lng&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_rating&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;4.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;radius_m&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;includedTypes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;restaurant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxResultCount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;locationRestriction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;circle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;center&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latitude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;longitude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;lng&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;radiusMeters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;radius_m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;PLACES_API_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-Goog-Api-Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_MAPS_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-Goog-FieldMask&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PLACES_FIELD_MASK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;restaurants&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;place&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;places&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
        &lt;span class="n"&gt;rating&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;place&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;min_rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;reviews&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;place&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reviews&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])[:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;restaurants&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;place&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;displayName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;place&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;formattedAddress&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rating_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;place&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;userRatingCount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reviews&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reviews&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;restaurants&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;restaurants&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Tool Combo Main Function (Agentic Loop)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tool_combo_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lng&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_CLOUD_LOCATION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;http_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HttpOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;enriched_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s current location: latitude &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, longitude &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;lng&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please answer in traditional Chinese using Taiwanese terminology, and do not use markdown format.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;tool_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;google_maps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GoogleMaps&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="c1"&gt;# ← Maps grounding
&lt;/span&gt;                &lt;span class="n"&gt;function_declarations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;SEARCH_NEARBY_RESTAURANTS_FN&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="c1"&gt;# ← Places API
&lt;/span&gt;            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# ── Step 1 ──────────────────────────────────────────────────────
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TOOL_COMBO_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;enriched_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tool_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;（Unable to get a reply）&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;enriched_query&lt;/span&gt;&lt;span class="p"&gt;)]),&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# ── Step 2：Processing function_call ──────────────────────────────────
&lt;/span&gt;    &lt;span class="n"&gt;function_response_parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function_call&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;fn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function_call&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_execute_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;{}),&lt;/span&gt; &lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lng&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;function_response_parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;function_response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;function_response_parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;function_response_parts&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="c1"&gt;# ── Step 3 ────────────────────────────────────────────────────
&lt;/span&gt;        &lt;span class="n"&gt;final&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TOOL_COMBO_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tool_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;final&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;（Unable to get a reply）&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;（Unable to get a reply）&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Pitfalls Encountered
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ❌ Pitfall 1: &lt;code&gt;Part.from_function_response()&lt;/code&gt; does not accept the &lt;code&gt;id&lt;/code&gt; parameter
&lt;/h3&gt;

&lt;p&gt;This was the easiest pitfall to fall into. The error only surfaces when a &lt;strong&gt;real model call&lt;/strong&gt; is made, so unit tests almost never catch it.&lt;/p&gt;

&lt;p&gt;Following the official example, I originally wrote:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Error——TypeError occurs at runtime
&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_function_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# ← This parameter does not exist!
&lt;/span&gt;    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fn_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The actual signature of &lt;code&gt;from_function_response&lt;/code&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(*, name: str, response: dict, parts: Optional[list] = None) -&amp;gt; Part
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no &lt;code&gt;id&lt;/code&gt; parameter at all. Every time the model actually triggers a function_call, this line throws a &lt;code&gt;TypeError&lt;/code&gt;, the code silently falls into the Step 3 exception handler and returns an error message, and the Places API results never make it back to Gemini.&lt;/p&gt;

&lt;p&gt;The fix is to construct &lt;code&gt;types.FunctionResponse&lt;/code&gt; directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ✅ Correct
&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;function_response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fn_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can immediately confirm the parameter list with &lt;code&gt;python -c "from google.genai import types; help(types.Part.from_function_response)"&lt;/code&gt;.&lt;/p&gt;
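&lt;p&gt;The failure mode is easy to reproduce without the SDK. A standalone stand-in with the same keyword-only signature (this is a mock, not the real &lt;code&gt;types.Part&lt;/code&gt;) shows the exact &lt;code&gt;TypeError&lt;/code&gt;:&lt;/p&gt;

```python
import inspect

# Stand-in for the SDK's Part.from_function_response: the real method's
# signature is keyword-only (*, name, response, parts=None), per help().
def from_function_response(*, name, response, parts=None):
    return {"name": name, "response": response, "parts": parts}

params = inspect.signature(from_function_response).parameters
assert "id" not in params          # the parameter simply does not exist

try:
    from_function_response(id="call-1", name="f", response={})
except TypeError as exc:
    print("TypeError:", exc)       # ...got an unexpected keyword argument 'id'
```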

&lt;h3&gt;
  
  
  ❌ Pitfall 2: &lt;code&gt;include_server_side_tool_invocations=True&lt;/code&gt; causes Pydantic to explode
&lt;/h3&gt;

&lt;p&gt;After seeing it in an official documentation example, I assumed I should add this parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Error
&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...],&lt;/span&gt;
    &lt;span class="n"&gt;include_server_side_tool_invocations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# ← The installed SDK version does not support it
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;google-genai 1.49.0&lt;/code&gt;, this field is not among the model fields of &lt;code&gt;GenerateContentConfig&lt;/code&gt;, so Pydantic immediately raises an &lt;code&gt;extra_forbidden&lt;/code&gt; validation error. Simply remove it and everything works as expected.&lt;/p&gt;
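&lt;p&gt;If the code needs to survive SDK upgrades and downgrades, one defensive option is to filter kwargs against the config model's declared fields before constructing it. This is a sketch: &lt;code&gt;FakeConfig&lt;/code&gt; stands in for &lt;code&gt;types.GenerateContentConfig&lt;/code&gt;, and &lt;code&gt;model_fields&lt;/code&gt; is the Pydantic v2 attribute name:&lt;/p&gt;

```python
# Hedged sketch: drop config kwargs the installed SDK doesn't declare,
# instead of letting Pydantic raise extra_forbidden.
def safe_config_kwargs(config_cls, **kwargs):
    allowed = set(getattr(config_cls, "model_fields", {}))
    return {k: v for k, v in kwargs.items() if k in allowed}

class FakeConfig:  # stand-in for types.GenerateContentConfig
    model_fields = {"tools": None, "temperature": None}

kwargs = safe_config_kwargs(
    FakeConfig,
    tools=["..."],
    include_server_side_tool_invocations=True,  # silently dropped
)
print(kwargs)  # {'tools': ['...']}
```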

&lt;h3&gt;
  
  
  ❌ Pitfall 3: &lt;code&gt;textQuery&lt;/code&gt; is a parameter of &lt;code&gt;searchText&lt;/code&gt;, not &lt;code&gt;searchNearby&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;I thought "if there is a keyword, then bring it into the Places API", and intuitively added it to the request body:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Error——Invalid field for searchNearby endpoint
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;textQuery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;searchNearby&lt;/code&gt; only accepts fields such as &lt;code&gt;includedTypes&lt;/code&gt; and &lt;code&gt;locationRestriction&lt;/code&gt;; &lt;code&gt;textQuery&lt;/code&gt; belongs to the &lt;code&gt;searchText&lt;/code&gt; endpoint. Adding the field may not raise an error (in some versions), but the keyword simply has no effect.&lt;/p&gt;

&lt;p&gt;The correct approach is to describe the keyword in the Function Declaration so Gemini can refer to it: the model translates the user's intent in &lt;code&gt;enriched_query&lt;/code&gt;, Maps Grounding handles the keyword semantics, and the Places API is responsible only for returning real rating data.&lt;/p&gt;
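&lt;p&gt;A sketch of what that separation looks like. The field names follow the Gemini function-calling schema and the Places API, but this is not the repo's exact declaration: the keyword lives only in the declaration's description, while the HTTP body keeps strictly to fields &lt;code&gt;searchNearby&lt;/code&gt; accepts:&lt;/p&gt;

```python
# Sketch only: the repo's SEARCH_NEARBY_RESTAURANTS_FN may differ.
SEARCH_NEARBY_FN = {
    "name": "search_nearby_restaurants",
    "description": (
        "Find restaurants near the user. Put the cuisine the user asked "
        "for (e.g. 'hot pot') in 'keyword'; it guides the model's intent "
        "only and is never sent to searchNearby as textQuery."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "keyword": {"type": "string"},
            "min_rating": {"type": "number"},
            "radius_m": {"type": "integer"},
        },
    },
}

def build_search_nearby_body(lat, lng, radius_m=1500):
    # Only fields the searchNearby endpoint accepts; no textQuery.
    return {
        "includedTypes": ["restaurant"],
        "locationRestriction": {
            "circle": {
                "center": {"latitude": lat, "longitude": lng},
                "radius": radius_m,
            }
        },
    }

body = build_search_nearby_body(25.0441, 121.5598)
assert "textQuery" not in body
```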

&lt;h3&gt;
  
  
  ❌ Pitfall 4: No guard for &lt;code&gt;response.candidates[0]&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;When the model hits safety filtering, RECITATION, or another abnormal termination, &lt;code&gt;candidates&lt;/code&gt; may be an empty list, so accessing &lt;code&gt;response.candidates[0]&lt;/code&gt; directly raises an &lt;code&gt;IndexError&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ No guard
&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;enriched_query&lt;/span&gt;&lt;span class="p"&gt;)]),&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# ← If candidates is empty, it will explode
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# ✅ Add guard
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;（Unable to get a reply）&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Demo Display
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvionmd6lsyr2srm5gg87.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvionmd6lsyr2srm5gg87.png" alt="image-20260327163200329" width="800" height="1739"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 1: "Find a hot pot restaurant with a rating of 4 stars or above for group dining"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User sends: GPS location (Xinyi District, Taipei City, 25.0441, 121.5598)

User enters: "Please find a hot pot restaurant with a rating of 4 stars or above, suitable for group dining, and list the name, address, and review summary."

[Step 1: Gemini receives query + map context]
  → Detects the need for restaurant data, emit function_call:
    search_nearby_restaurants(keyword="hot pot", min_rating=4.0)

[Step 2: Python calls Places API]
  → lat=25.0441, lng=121.5598 injected from session
  → Returns 3 restaurants with a rating ≥ 4.0, including review text

[Step 3: Gemini integrates Maps context + Places data]
  → "Lao Wang Hot Pot｜100 Shimin Avenue, Xinyi District｜⭐ 4.6 (312)
      Review summary: Large portions, great value for money, a top choice for friends to dine; fast service, fresh dishes.
     ... (3 restaurants in total)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 2: "Are there any high-value Japanese restaurants?"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User enters: "Are there any high-value Japanese restaurants nearby?"

[Step 1: Gemini]
  → function_call: search_nearby_restaurants(keyword="Japanese cuisine", min_rating=4.0)

[Step 2: Places API]
  → Returns 2 Japanese restaurants that meet the rating criteria

[Step 3: Gemini]
  → "There are two recommendations:
      Washoku ○○｜...｜⭐ 4.4｜Reviews: Weekday lunch set is only 280 yuan, very fresh.
      ..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Demo Script Quick Test
&lt;/h3&gt;

&lt;p&gt;No LINE Bot needed; you can test directly on your local machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Only test Tool Combo (main function)&lt;/span&gt;
python demo.py combo

&lt;span class="c"&gt;# Run all three functions&lt;/span&gt;
python demo.py all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Old Architecture vs. New Architecture
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Old Architecture (Maps Grounding only)&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;New Architecture (Tool Combo)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tool&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;google_maps&lt;/code&gt; (built-in)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;google_maps&lt;/code&gt; + &lt;code&gt;search_nearby_restaurants&lt;/code&gt; (custom)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rating Data&lt;/td&gt;
&lt;td&gt;Described by Gemini itself (may be inaccurate)&lt;/td&gt;
&lt;td&gt;Places API real numbers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reviews&lt;/td&gt;
&lt;td&gt;AI generated&lt;/td&gt;
&lt;td&gt;Real user reviews (up to 3)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Call Count&lt;/td&gt;
&lt;td&gt;1 call&lt;/td&gt;
&lt;td&gt;1 call (Step 1) + 1 call (Step 3) = 2 calls, but transparent to the user&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Filtering&lt;/td&gt;
&lt;td&gt;Relies on the prompt&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;min_rating&lt;/code&gt;, &lt;code&gt;radius_m&lt;/code&gt; precise control&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Analysis and Outlook
&lt;/h2&gt;

&lt;p&gt;This implementation has given me a clearer understanding of the potential of Gemini Tool Combinations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem Tool Combinations truly solves&lt;/strong&gt; is that Grounding and Function Calling are no longer mutually exclusive. Previously, to combine map context with real external data, you had to wire the two APIs together manually at the application layer, or rely on Gemini's text generation to "simulate" external data (unreliable). Now the model itself knows when to use map context and when to call the Places API; developers only need to attach the tools.&lt;/p&gt;

&lt;p&gt;However, there are also a few things to note about this implementation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The &lt;code&gt;lat/lng&lt;/code&gt; injection pattern is crucial&lt;/strong&gt;: don't let the model guess the coordinates; inject them from the session, or positioning accuracy will be very poor. The same pattern applies to any function-calling scenario that carries session state.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The cost of two &lt;code&gt;generate_content&lt;/code&gt; calls&lt;/strong&gt;: The agentic loop of Tool Combo requires two model calls, and the token consumption is about 1.5–2 times that of a single call. This needs to be especially considered for scenarios with high latency requirements.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SDK version differences&lt;/strong&gt;: different versions of &lt;code&gt;google-genai&lt;/code&gt; support different &lt;code&gt;GenerateContentConfig&lt;/code&gt; fields, so new fields like &lt;code&gt;include_server_side_tool_invocations&lt;/code&gt; should only be used after confirming the installed version; the resulting Pydantic validation errors are otherwise hard to track down.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
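&lt;p&gt;To make the second point concrete, the two-call agentic loop can be sketched in plain Python. This is a minimal sketch with stubbed names; &lt;code&gt;model_call&lt;/code&gt; and &lt;code&gt;execute_tool&lt;/code&gt; are hypothetical stand-ins for &lt;code&gt;generate_content&lt;/code&gt; and the Places API wrapper, not the repo's actual code:&lt;/p&gt;

```python
# Minimal sketch of the Tool Combo agentic loop (hypothetical stubs).
def agentic_loop(model_call, execute_tool, contents):
    # Call 1: the model either answers directly or requests a function call.
    first = model_call(contents)
    fc = first.get("function_call")
    if fc is None:
        return first["text"], 1  # grounding alone was enough: one call

    # Run the declared function locally (in the real bot: hit the Places API).
    tool_result = execute_tool(fc["name"], fc["args"])

    # Call 2: feed the function result back so the model writes the final answer.
    followup = contents + [{"function_response": {"name": fc["name"],
                                                  "response": tool_result}}]
    final = model_call(followup)
    return final["text"], 2  # two model calls, hence the ~1.5-2x token cost
```

The returned call count makes the cost trade-off explicit: a query that never triggers the tool stays at one call, while any tool-using query pays for two.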

&lt;p&gt;Future directions that can be extended:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Connect the Postback quick replies (click the "Find Restaurant" button) to Tool Combo, so that each entry can get real ratings&lt;/li&gt;
&lt;li&gt;  Add the &lt;code&gt;searchText&lt;/code&gt; endpoint to support more complex keyword searches (e.g. Michelin recommendations)&lt;/li&gt;
&lt;li&gt;  Tool Combo combined with other built-in tools (such as &lt;code&gt;google_search&lt;/code&gt;) to achieve more complex multi-tool chaining&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The core concept of this change can be stated in one sentence: &lt;strong&gt;Put Google Maps grounding and the Places API function tool in the same &lt;code&gt;types.Tool&lt;/code&gt;, and Gemini will coordinate the two in a single conversation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The key code is only these few lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This is all the magic of Tool Combo
&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;google_maps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GoogleMaps&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="c1"&gt;# ← Maps context
&lt;/span&gt;    &lt;span class="n"&gt;function_declarations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;SEARCH_NEARBY_RESTAURANTS_FN&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="c1"&gt;# ← Places API
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But to make it really work, you also need to get a few details right: how &lt;code&gt;FunctionResponse&lt;/code&gt; is constructed, the guard on &lt;code&gt;candidates&lt;/code&gt;, the correct fields for the Places API endpoint, and injecting &lt;code&gt;lat/lng&lt;/code&gt; from the session instead of letting the model guess.&lt;/p&gt;
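&lt;p&gt;The &lt;code&gt;candidates&lt;/code&gt; guard in particular is easy to get wrong: a safety-blocked or empty response has no candidates, and indexing it blindly raises. A defensive extraction helper might look like this (a plain-Python sketch using duck typing, not the exact code from the repo):&lt;/p&gt;

```python
def extract_function_call(response):
    """Safely pull the first function_call part out of a generate_content
    response. Returns None for empty or blocked responses instead of raising."""
    candidates = getattr(response, "candidates", None)
    if not candidates:
        return None
    content = getattr(candidates[0], "content", None)
    parts = getattr(content, "parts", None) if content else None
    if not parts:
        return None
    for part in parts:
        fc = getattr(part, "function_call", None)
        if fc is not None:
            return fc
    return None
```

With a guard like this, Step 1's response can be checked for a function call without risking an &lt;code&gt;IndexError&lt;/code&gt; on degenerate responses.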

&lt;p&gt;The complete code is on &lt;a href="https://github.com/kkdai/linebot-spot-finder" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, feel free to clone and play with it.&lt;/p&gt;

&lt;p&gt;See you next time!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Gemini 3.1: Real-World Voice Recognition with Flash Live: Making Your LINE Bot Understand You</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Sun, 29 Mar 2026 02:07:26 +0000</pubDate>
      <link>https://forem.com/gde/gemini-31-real-world-voice-recognition-with-flash-live-making-your-line-bot-understand-you-560o</link>
      <guid>https://forem.com/gde/gemini-31-real-world-voice-recognition-with-flash-live-making-your-line-bot-understand-you-560o</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjl6ycarc8j4uczflmtoj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjl6ycarc8j4uczflmtoj.png" alt="image-20260328203306501" width="800" height="1739"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;p&gt;Google released &lt;strong&gt;Gemini 3.1 Flash Live&lt;/strong&gt; at the end of &lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-live/" rel="noopener noreferrer"&gt;March 2026&lt;/a&gt;, focusing on "making audio AI more natural and reliable." The model is built for real-time two-way voice conversations, with low latency, interruptibility, and multi-language support.&lt;/p&gt;

&lt;p&gt;I happened to have a LINE Bot project (&lt;a href="https://github.com/kkdai/linebot-helper-python" rel="noopener noreferrer"&gt;linebot-helper-python&lt;/a&gt;) on hand, which already handles text, images, URLs, PDFs, and YouTube, but completely ignores voice messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User sends a voice message
Bot: (Silence)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time, I'll add voice support and share a few pitfalls I encountered.&lt;/p&gt;




&lt;h2&gt;
  
  
  Design Decision: Flash Live or Standard Gemini API?
&lt;/h2&gt;

&lt;p&gt;The first question: Gemini 3.1 Flash Live is designed for &lt;strong&gt;real-time streaming&lt;/strong&gt;, but LINE's voice messages are &lt;strong&gt;pre-recorded m4a files&lt;/strong&gt;, not real-time audio streams.&lt;/p&gt;

&lt;p&gt;Using Flash Live to process pre-recorded files is like using a live streaming camera to take photos – technically feasible, but the wrong tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I decided to use the standard Gemini API&lt;/strong&gt;: pass the audio bytes as inline data and get the transcribed text back in a single call. It's simpler and better suited to this scenario.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdx9agqz7jujs89xky8x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdx9agqz7jujs89xky8x.png" alt="image-20260328203340798" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Integration Approach
&lt;/h3&gt;

&lt;p&gt;This repo already has a complete Orchestrator architecture, which automatically routes to different Agents (Chat, Content, Location, Vision, GitHub) based on the message content. The goal for voice messages is clear:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Convert voice to text, and then treat it as a regular text message and pass it into the Orchestrator – allowing all existing features to automatically support voice input.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;User says "Help me search for nearby gas stations" → transcribed into text → Orchestrator determines it's a location query → LocationAgent processes it. No need to implement separate logic for voice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Complete Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User sends AudioMessage (m4a)
    │
    ▼ handle_audio_message()
    │
    ├─ ① LINE SDK downloads audio bytes
    │ get_message_content(message_id) → iter_content()
    │
    ├─ ② Gemini transcription
    │ tools/audio_tool.py → transcribe_audio()
    │ model: gemini-3.1-flash-lite-preview
    │
    ├─ ③ Reply #1: "You said: {transcription}"
    │ reply_message() (consumes reply token)
    │
    └─ ④ Reply #2: Orchestrator routing
            handle_text_message_via_orchestrator(push_user_id=user_id)
            ↓
            push_message() (reply token already used, use push instead)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why two replies?
&lt;/h3&gt;

&lt;p&gt;The reply is split in two so the user can &lt;strong&gt;see the transcription result immediately&lt;/strong&gt;, without having to wait for the Orchestrator to finish before knowing whether the Bot understood them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core Code Explanation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Audio Transcription Tool (tools/audio_tool.py)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;TRANSCRIPTION_MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3.1-flash-lite-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transcribe_audio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/mp4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Transcribe audio bytes to text using Gemini.
    LINE voice messages are always m4a, MIME type is always audio/mp4.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_CLOUD_LOCATION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;audio_part&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;audio_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TRANSCRIPTION_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="n"&gt;audio_part&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please transcribe the above audio content into text completely, preserving the original language, and do not add any explanations or prefixes.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Design principle: The function itself does not catch exceptions, allowing the upper-level handler to handle error responses uniformly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Handler Main Flow (main.py)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_audio_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MessageEvent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Handle audio (voice) messages — transcribe and route through Orchestrator.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
    &lt;span class="n"&gt;replied&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt; &lt;span class="c1"&gt;# Track if the reply token has been used
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Download audio
&lt;/span&gt;        &lt;span class="n"&gt;message_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;line_bot_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_message_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;audio_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;message_content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iter_content&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;audio_bytes&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;

        &lt;span class="c1"&gt;# Transcription
&lt;/span&gt;        &lt;span class="n"&gt;transcription&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;transcribe_audio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Empty transcription (silent or too short)
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;line_bot_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reply_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reply_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;TextSendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unable to recognize voice content, please re-record.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;

        &lt;span class="c1"&gt;# Reply #1: Let the user confirm the transcription result (consumes reply token)
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;line_bot_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reply_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reply_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;TextSendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You said: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;replied&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

        &lt;span class="c1"&gt;# Reply #2: Send to Orchestrator, using push_message (token already used)
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;handle_text_message_via_orchestrator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;push_user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error handling audio for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_info&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;error_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LineService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format_error_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processing voice message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;error_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TextSendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;error_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;replied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# reply token has been consumed, use push instead
&lt;/span&gt;            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;line_bot_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;error_msg&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;line_bot_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reply_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reply_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;error_msg&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Enabling Orchestrator to Support External Text Input
&lt;/h3&gt;

&lt;p&gt;The original &lt;code&gt;handle_text_message_via_orchestrator&lt;/code&gt; directly reads &lt;code&gt;event.message.text&lt;/code&gt;. AudioMessage doesn't have &lt;code&gt;.text&lt;/code&gt;, so add two optional parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_text_message_via_orchestrator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MessageEvent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# ← External text input (voice transcription)
&lt;/span&gt;    &lt;span class="n"&gt;push_user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# ← Use push_message when set
&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;orchestrator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;format_orchestrator_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;reply_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TextSendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;push_user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;line_bot_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;push_user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;reply_msg&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;line_bot_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reply_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reply_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;reply_msg&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;error_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TextSendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;LineService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format_error_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processing your question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;push_user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;line_bot_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;push_user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;error_msg&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;line_bot_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reply_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reply_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;error_msg&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using &lt;code&gt;text is not None&lt;/code&gt; (rather than &lt;code&gt;text or ...&lt;/code&gt;) is intentional: if the voice transcription yields an empty string, the empty string should pass through (to be intercepted by the upper-level &lt;code&gt;if not transcription.strip()&lt;/code&gt; check) rather than falling back to &lt;code&gt;event.message.text&lt;/code&gt;.&lt;/p&gt;
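&lt;p&gt;The distinction boils down to a minimal sketch (the function and variable names here are illustrative, not the project's actual code):&lt;/p&gt;

```python
def pick_text(transcription, fallback):
    """Return the transcription unless it is None.

    An empty string is a valid result (the audio contained no words)
    and must NOT fall back; only None means "no transcription at all".
    """
    return transcription if transcription is not None else fallback

# Empty transcription passes through, so the caller's
# `if not transcription.strip()` check can intercept it.
assert pick_text("", "typed text") == ""
# Only None falls back to the original message text.
assert pick_text(None, "typed text") == "typed text"
# The truthiness-based version would wrongly fall back:
assert ("" or "typed text") == "typed text"
```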




&lt;h2&gt;
  
  
  Pitfalls Encountered
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ❌ Pitfall 1: &lt;code&gt;Part.from_text()&lt;/code&gt; does not accept positional arguments
&lt;/h3&gt;

&lt;p&gt;The first TypeError encountered:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Error (TypeError: Part.from_text() takes 1 positional argument but 2 were given)
&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please transcribe the above audio content into text completely, preserving the original language, and do not add any explanations or prefixes.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ✅ Correct
&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please transcribe the above audio content into text completely, preserving the original language, and do not add any explanations or prefixes.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this version of the SDK, &lt;code&gt;text&lt;/code&gt; in &lt;code&gt;Part.from_text()&lt;/code&gt; is a keyword-only argument; alternatively, use the &lt;code&gt;Part(text=...)&lt;/code&gt; constructor directly, which is safer.&lt;/p&gt;

&lt;h3&gt;
  
  
  ❌ Pitfall 2: LINE reply token can only be used once
&lt;/h3&gt;

&lt;p&gt;LINE's reply token is &lt;strong&gt;one-time use&lt;/strong&gt;. Once &lt;code&gt;reply_message()&lt;/code&gt; is called, the token is invalidated.&lt;/p&gt;

&lt;p&gt;This project's voice flow replies twice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Reply #1 (display transcription text) → &lt;strong&gt;consumes token&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; Reply #2 (Orchestrator result) → &lt;strong&gt;the token is already invalid; LINE returns a 400 error&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The solution is to make the Orchestrator handler support &lt;code&gt;push_message&lt;/code&gt; mode (via the &lt;code&gt;push_user_id&lt;/code&gt; parameter), so Reply #2 goes out as a &lt;code&gt;push_message&lt;/code&gt; instead.&lt;/p&gt;

&lt;p&gt;Error handling deserves the same care: if the Orchestrator throws an exception after Reply #1 has succeeded, &lt;code&gt;reply_message&lt;/code&gt; can no longer be used in the except block, so it too must switch to &lt;code&gt;push_message&lt;/code&gt;. This is the purpose of the &lt;code&gt;replied&lt;/code&gt; flag in the code.&lt;/p&gt;
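&lt;p&gt;The reply-or-push decision reduces to a few lines. A self-contained sketch with a fake LINE client (names are illustrative; the real project passes &lt;code&gt;push_user_id&lt;/code&gt; into the Orchestrator handler):&lt;/p&gt;

```python
import asyncio

class FakeLineAPI:
    """Stand-in for the LINE SDK client (illustrative only)."""
    def __init__(self):
        self.calls = []
    async def reply_message(self, token, messages):
        self.calls.append(("reply", token))
    async def push_message(self, user_id, messages):
        self.calls.append(("push", user_id))

async def send(api, reply_token, push_user_id, messages):
    """Use reply_message while the one-time token is unused,
    otherwise fall back to push_message."""
    if push_user_id:
        await api.push_message(push_user_id, messages)
    else:
        await api.reply_message(reply_token, messages)

async def demo():
    api = FakeLineAPI()
    # Reply #1: token still fresh, reply_message consumes it.
    await send(api, "token-abc", None, ["You said: ..."])
    # Reply #2: token already used, so switch to push_message.
    await send(api, "token-abc", "U1234", ["Orchestrator result"])
    return api.calls

calls = asyncio.run(demo())
# calls == [("reply", "token-abc"), ("push", "U1234")]
```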

&lt;h3&gt;
  
  
  ❌ Pitfall 3: Gemini Flash Live is not suitable for pre-recorded files
&lt;/h3&gt;

&lt;p&gt;Not a real "pitfall", but worth clarifying:&lt;/p&gt;

&lt;p&gt;Gemini 3.1 Flash Live is designed for &lt;strong&gt;real-time two-way streaming&lt;/strong&gt;, which carries the overhead of connection establishment and streaming protocols. LINE voice messages are complete pre-recorded m4a files that can be processed in a single pass.&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;client.aio.models.generate_content()&lt;/code&gt; directly with inline audio bytes is simpler, and the latency is acceptable. Save Flash Live for scenarios that truly require real-time conversation.&lt;/p&gt;
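&lt;p&gt;A one-shot transcription call looks roughly like this – a sketch based on the google-genai SDK, with the model name and prompt taken from this article; the project name, mime type, and omitted error handling are assumptions:&lt;/p&gt;

```python
async def transcribe_m4a(audio_bytes: bytes):
    """One-shot transcription of a pre-recorded m4a clip via the
    standard generate_content API (no Live session needed)."""
    from google import genai          # imported lazily for this sketch
    from google.genai import types
    client = genai.Client(vertexai=True, project="my-project",
                          location="us-central1")
    response = await client.aio.models.generate_content(
        model="gemini-3.1-flash-lite-preview",
        contents=[
            # LINE voice messages arrive as m4a (MPEG-4 audio container).
            types.Part.from_bytes(data=audio_bytes, mime_type="audio/mp4"),
            types.Part(text="Please transcribe the above audio content "
                            "into text completely, preserving the original "
                            "language, and do not add any explanations "
                            "or prefixes."),
        ],
    )
    return (response.text or "").strip()
```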




&lt;h2&gt;
  
  
  Effect Demonstration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario 1: Voice Command Query
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User sends: [Voice] Help me search for cafes near Taipei Main Station

Bot Reply #1: You said: Help me search for cafes near Taipei Main Station
Bot Reply #2: [LocationAgent replies with a list of nearby cafes]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 2: Voice Question
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User sends: [Voice] What's the difference between Gemini and GPT-4

Bot Reply #1: You said: What's the difference between Gemini and GPT-4
Bot Reply #2: [ChatAgent with Google Search Grounding replies with comparison results]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 3: Voice Send URL
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User sends: [Voice] Help me summarize this article https://example.com/article

Bot Reply #1: You said: Help me summarize this article https://example.com/article
Bot Reply #2: [ContentAgent fetches and summarizes the article]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The text transcribed from voice goes straight into the Orchestrator; all existing URL detection and intent determination work as usual, with zero extra logic.&lt;/p&gt;




&lt;h2&gt;
  
  
  Traditional Text Input vs. Voice Input
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Text Input&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Voice Input&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Input Format&lt;/td&gt;
&lt;td&gt;TextMessage&lt;/td&gt;
&lt;td&gt;AudioMessage (m4a)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-processing&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Gemini transcription&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;reply token&lt;/td&gt;
&lt;td&gt;Direct use&lt;/td&gt;
&lt;td&gt;Reply #1 consumes, Reply #2 changes to push&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orchestrator&lt;/td&gt;
&lt;td&gt;Direct routing&lt;/td&gt;
&lt;td&gt;Route after transcription&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supported Functions&lt;/td&gt;
&lt;td&gt;All&lt;/td&gt;
&lt;td&gt;All (no additional settings required)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error Handling&lt;/td&gt;
&lt;td&gt;reply_message&lt;/td&gt;
&lt;td&gt;replied flag determines reply/push&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Analysis and Outlook
&lt;/h2&gt;

&lt;p&gt;What I am most satisfied with in this integration is that &lt;strong&gt;I hardly need to change the Orchestrator itself&lt;/strong&gt;. As long as the voice is converted to text at the input end, all the routing logic, Agent calls, and error handling are automatically inherited.&lt;/p&gt;

&lt;p&gt;Gemini's multimodal audio understanding performs very stably in this scenario – Traditional Chinese, Taiwanese-accented speech, and sentences mixed with English are all transcribed accurately most of the time.&lt;/p&gt;

&lt;p&gt;Future directions for extension:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Multi-language automatic detection&lt;/strong&gt;: Instruct Gemini to preserve the original language during transcription (Japanese voice → Japanese transcription), and let the Orchestrator decide whether to translate&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Group voice support&lt;/strong&gt;: Currently limited to 1:1 chats; voice messages in groups are ignored for now&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Long recording summary&lt;/strong&gt;: Recordings over a certain length go straight to ContentAgent for summarization instead of being treated as commands&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Extension: 🔊 Read Summary Aloud – Make the Bot Speak
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghuiv4o4wq6yt1jyupmu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghuiv4o4wq6yt1jyupmu.png" alt="Preview Program 2026-03-28 20.33.53" width="800" height="1739"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Voice recognition allows the Bot to "understand" what the user is saying. After this is done, the next question naturally arises:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can the Bot respond by speaking?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The Gemini Live API has a setting &lt;code&gt;response_modalities: ["AUDIO"]&lt;/code&gt;, which can directly output an audio PCM stream. I connected it to another scenario – &lt;strong&gt;reading summaries aloud&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Function Design
&lt;/h3&gt;

&lt;p&gt;Each time the Bot summarizes a URL, YouTube video, or PDF, a "🔊 Read Aloud" QuickReply button appears below the message. When the user presses it, the Bot feeds the summary text into Gemini Live TTS, converts the PCM audio to m4a, and sends it back with &lt;code&gt;AudioSendMessage&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;URL summary complete
    │
    ▼ [🔊 Read Aloud] QuickReply button
    │
User presses the button → PostbackEvent
    │
    ▼ handle_read_aloud_postback()
    │
    ├─ ① Retrieve the summary text from summary_store (10 minutes TTL)
    │
    ├─ ② Gemini Live API → PCM audio
    │ model: gemini-live-2.5-flash-native-audio
    │ response_modalities: ["AUDIO"]
    │
    ├─ ③ ffmpeg transcoding: PCM → m4a
    │ s16le, 16kHz, mono → AAC
    │
    └─ ④ AudioSendMessage sent to the user
            original_content_url: /audio/{uuid}
            duration: {ms}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
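&lt;p&gt;The QuickReply button in step ① of the flow above is just a small payload attached to the summary message. A sketch of building it as raw Messaging API JSON (the &lt;code&gt;read_aloud:{summary_id}&lt;/code&gt; postback data format is an assumption for illustration):&lt;/p&gt;

```python
def make_read_aloud_quick_reply(summary_id: str):
    """Build the LINE quick-reply payload that attaches a
    '🔊 Read Aloud' postback button to a summary message."""
    return {
        "items": [
            {
                "type": "action",
                "action": {
                    "type": "postback",
                    "label": "🔊 Read Aloud",
                    # Carries the summary_store key; the postback
                    # handler looks the text up within the 10-min TTL.
                    "data": f"read_aloud:{summary_id}",
                },
            }
        ]
    }

payload = make_read_aloud_quick_reply("demo-id")
assert payload["items"][0]["action"]["data"] == "read_aloud:demo-id"
```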



&lt;h3&gt;
  
  
  Core Code (tools/tts_tool.py)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;LIVE_MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-live-2.5-flash-native-audio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;text_to_speech&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;VERTEX_PROJECT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_modalities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AUDIO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;live&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;LIVE_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_client_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;turns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)]),&lt;/span&gt;
            &lt;span class="n"&gt;turn_complete&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;pcm_chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;receive&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;server_content&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;server_content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_turn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;server_content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_turn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inline_data&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="n"&gt;pcm_chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;server_content&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;server_content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;turn_complete&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="n"&gt;pcm_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pcm_chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;duration_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pcm_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;32000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# 16kHz × 16-bit mono
&lt;/span&gt;
    &lt;span class="c1"&gt;# PCM → m4a (temp file mode, avoid moov atom problem)
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tempfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;NamedTemporaryFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.pcm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delete&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pcm_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;pcm_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
    &lt;span class="n"&gt;m4a_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pcm_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.pcm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.m4a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ffmpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s16le&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-ar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-ac&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-i&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pcm_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c:a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aac&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m4a_path&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m4a_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;duration_ms&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
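&lt;p&gt;The &lt;code&gt;duration_ms&lt;/code&gt; formula in the code above deserves a sanity check: 16,000 samples/sec × 2 bytes/sample × 1 channel = 32,000 bytes per second of raw s16le PCM. A minimal verification:&lt;/p&gt;

```python
def pcm_duration_ms(num_bytes, sample_rate=16000, sample_width=2, channels=1):
    """Duration of raw s16le PCM in milliseconds."""
    bytes_per_second = sample_rate * sample_width * channels  # 32000 here
    return int(num_bytes / bytes_per_second * 1000)

# One second of 16 kHz 16-bit mono PCM is exactly 32,000 bytes.
assert pcm_duration_ms(32000) == 1000
# A 5-second clip:
assert pcm_duration_ms(5 * 32000) == 5000
```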






&lt;h2&gt;
  
  
  Pitfalls of Read Aloud Function
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ❌ Pitfall 4: Completely Different Model Name
&lt;/h3&gt;

&lt;p&gt;The first attempt at Gemini Live TTS was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;LIVE_MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3.1-flash-live-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Extrapolating from the &lt;code&gt;gemini-3.1-flash-lite-preview&lt;/code&gt; name used for voice recognition, the result was an immediate 1008 (policy violation) disconnect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Publisher Model `projects/line-vertex/locations/global/publishers/google/
models/gemini-3.1-flash-live-preview` was not found
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Listing the available models on Vertex AI revealed that the model naming rules for Live/native audio are completely different:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ✅ Correct
&lt;/span&gt;&lt;span class="n"&gt;LIVE_MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-live-2.5-flash-native-audio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is &lt;strong&gt;no Live version&lt;/strong&gt; of Gemini 3.1 on Vertex AI. Live/native audio is currently only available in the 2.5 generation, and the naming format is &lt;code&gt;gemini-live-{version}-{variant}-native-audio&lt;/code&gt;, a completely separate line from the general models &lt;code&gt;gemini-{version}-flash-{variant}&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  ❌ Pitfall 5: &lt;code&gt;GOOGLE_CLOUD_LOCATION=global&lt;/code&gt; Causes Live API to Disconnect
&lt;/h3&gt;

&lt;p&gt;After changing to the correct model name, the error message was still the same:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Publisher Model `projects/line-vertex/locations/global/...` was not found
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time the model name was correct, but &lt;code&gt;locations/global&lt;/code&gt; was suspicious – we had clearly set &lt;code&gt;us-central1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Investigating the source code of the Google GenAI SDK revealed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# _api_client.py
&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;env_location&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;global&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="c1"&gt;# ← here
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;location or env_location&lt;/code&gt; – if the &lt;code&gt;location&lt;/code&gt; passed in is an empty string, it falls through to the environment variable; and if that is also empty (with no API key set), the SDK defaults to &lt;code&gt;global&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The root cause of the problem is the environment variable of Cloud Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GOOGLE_CLOUD_LOCATION"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"global"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;GOOGLE_CLOUD_LOCATION&lt;/code&gt; was set to the string &lt;code&gt;"global"&lt;/code&gt;, so &lt;code&gt;os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1")&lt;/code&gt; returned &lt;code&gt;"global"&lt;/code&gt;, not &lt;code&gt;"us-central1"&lt;/code&gt; – the SDK then dutifully connected to the global endpoint, where &lt;code&gt;gemini-live-2.5-flash-native-audio&lt;/code&gt; has no BidiGenerateContent support.&lt;/p&gt;
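&lt;p&gt;The subtlety is that the second argument of &lt;code&gt;os.getenv&lt;/code&gt; only applies when the variable is completely unset, not when it holds an unwanted value:&lt;/p&gt;

```python
import os

# What Cloud Run injected:
os.environ["GOOGLE_CLOUD_LOCATION"] = "global"
loc = os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1")
assert loc == "global"          # the default never kicks in

# The default applies only when the variable is unset entirely:
del os.environ["GOOGLE_CLOUD_LOCATION"]
loc = os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1")
assert loc == "us-central1"
```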

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Endpoint&lt;/th&gt;
&lt;th&gt;Standard API&lt;/th&gt;
&lt;th&gt;Live API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;global&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅ Available&lt;/td&gt;
&lt;td&gt;❌ Model not here&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;us-central1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅ Available&lt;/td&gt;
&lt;td&gt;✅ &lt;code&gt;gemini-live-2.5-flash-native-audio&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Solution: hardcode the Live API's location instead of reading it from the env var:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Affected by GOOGLE_CLOUD_LOCATION=global
&lt;/span&gt;&lt;span class="n"&gt;VERTEX_LOCATION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_CLOUD_LOCATION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ✅ Hardcoded, not affected by env var
&lt;/span&gt;&lt;span class="n"&gt;VERTEX_LOCATION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# Live API needs a regional endpoint
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Voice Recognition vs. Read Summary Aloud
&lt;/h2&gt;

&lt;p&gt;The two functions use completely different Gemini APIs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Voice Recognition&lt;/th&gt;
&lt;th&gt;Read Summary Aloud&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Direction&lt;/td&gt;
&lt;td&gt;Audio → Text&lt;/td&gt;
&lt;td&gt;Text → Audio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;Standard &lt;code&gt;generate_content&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Live API &lt;code&gt;BidiGenerateContent&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemini-3.1-flash-lite-preview&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemini-live-2.5-flash-native-audio&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Location&lt;/td&gt;
&lt;td&gt;Follows env var&lt;/td&gt;
&lt;td&gt;Hardcoded &lt;code&gt;us-central1&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output Format&lt;/td&gt;
&lt;td&gt;text&lt;/td&gt;
&lt;td&gt;PCM → ffmpeg → m4a&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LINE Message Type&lt;/td&gt;
&lt;td&gt;Input: &lt;code&gt;AudioMessage&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Output: &lt;code&gt;AudioSendMessage&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
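The PCM → ffmpeg → m4a step from the table can be sketched as a command builder. This is a hypothetical helper, not code from the repo; the file names and the raw-PCM parameters (24 kHz, 16-bit, mono) are assumptions about the Live API output:

```python
def pcm_to_m4a_cmd(pcm_path: str, m4a_path: str, sample_rate: int = 24000) -> list:
    """Build the ffmpeg command that wraps raw PCM output into an m4a file."""
    return [
        "ffmpeg", "-y",
        "-f", "s16le",            # raw 16-bit little-endian PCM
        "-ar", str(sample_rate),  # assumed Live API output rate: 24 kHz
        "-ac", "1",               # mono
        "-i", pcm_path,
        "-c:a", "aac",            # the m4a container carries AAC audio
        m4a_path,
    ]

cmd = pcm_to_m4a_cmd("summary.pcm", "summary.m4a")
print(" ".join(cmd))
# Execute with subprocess.run(cmd, check=True) once summary.pcm exists.
```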




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The release of Gemini 3.1 Flash Live makes audio AI worth taking seriously. This time, both voice recognition and read-aloud summaries were integrated into the LINE Bot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Voice Recognition&lt;/strong&gt;: Standard Gemini API; one-shot transcription of a pre-recorded m4a, wired into the existing Orchestrator&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Read Summary Aloud&lt;/strong&gt;: Gemini Live TTS; the summary text is converted to PCM, transcoded to m4a with ffmpeg, and returned as an &lt;code&gt;AudioSendMessage&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most troublesome part was not the feature itself, but &lt;strong&gt;finding the correct model name&lt;/strong&gt; and &lt;strong&gt;tracing the SDK's location logic&lt;/strong&gt;. Neither is stated prominently in the documentation; the answers only came from listing the available models and reading the SDK source code.&lt;/p&gt;

&lt;p&gt;The full code is on &lt;a href="https://github.com/kkdai/linebot-helper-python" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;; feel free to take a look.&lt;/p&gt;

&lt;p&gt;See you next time!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gemini</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Building an Agent Skill Hub: From Skill Development to Automated Multilingual Documentation Deployment on GitHub Pages</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Fri, 27 Mar 2026 01:45:20 +0000</pubDate>
      <link>https://forem.com/evanlin/building-an-agent-skill-hub-from-skill-development-to-automated-multilingual-documentation-5ae7</link>
      <guid>https://forem.com/evanlin/building-an-agent-skill-hub-from-skill-development-to-automated-multilingual-documentation-5ae7</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cu37bxccsz7i7k6wl0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cu37bxccsz7i7k6wl0n.png" alt="image-20260322225856161" width="800" height="692"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reference links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/kkdai/agent-skill-hub" rel="noopener noreferrer"&gt;Agent Skill Hub Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://plateaukao.github.io/whisperASR/" rel="noopener noreferrer"&gt;whisperASR Reference Example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.github.com/en/pages" rel="noopener noreferrer"&gt;GitHub Pages Official Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article documents how, while developing the &lt;strong&gt;Agent Skill Hub (2026 Skill Library)&lt;/strong&gt;, I built a skill-description specification from scratch and created a bilingual (Chinese and English) GitHub Pages documentation site inspired by minimalist aesthetics.&lt;/p&gt;

&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;p&gt;With the rise of AI Agents (such as OpenClaw or Gemini CLI), the key question has become how to let an Agent quickly understand and execute specific tasks. Instead of writing long prompts every time, it is better to package common operations into standardized &lt;strong&gt;Skills&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To facilitate community sharing and Agent consumption, I created &lt;code&gt;agent-skill-hub&lt;/code&gt;. But code alone is not enough; the project also needs a decent "facade": a documentation site that is both aesthetically pleasing and technically detailed.&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ Step 1: Standardize Skill Descriptions (SKILL.md)
&lt;/h2&gt;

&lt;p&gt;In &lt;code&gt;agent-skill-hub&lt;/code&gt;, each skill (such as &lt;code&gt;gcp-helper&lt;/code&gt; or &lt;code&gt;n8n-executor&lt;/code&gt;) has a &lt;code&gt;SKILL.md&lt;/code&gt;. The structure of this file is crucial because it's not just for humans to read, but also for LLMs to read:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Name &amp;amp; Description&lt;/strong&gt;: Let the Agent know what this is.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When to Use&lt;/strong&gt;: Define trigger scenarios.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Core Pattern&lt;/strong&gt;: Provide standard instruction examples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Common Mistakes&lt;/strong&gt;: Reduce errors caused by Agent hallucinations.&lt;/li&gt;
&lt;/ul&gt;
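Because SKILL.md is meant to be machine-readable, its structure is easy to lint. A minimal sketch of such a check, using the section names listed above (the validator itself and the sample document are hypothetical):

```python
# Section headings taken from the SKILL.md structure described above
REQUIRED_SECTIONS = ["Name", "Description", "When to Use", "Core Pattern", "Common Mistakes"]

def missing_sections(skill_md: str) -> list:
    """Return the required SKILL.md sections that do not appear in the text."""
    return [s for s in REQUIRED_SECTIONS if s not in skill_md]

doc = """# gcp-helper
## Description
Runs gcloud commands safely.
## When to Use
When the user asks about GCP resources.
"""
print(missing_sections(doc))  # → ['Name', 'Core Pattern', 'Common Mistakes']
```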




&lt;h2&gt;
  
  
  🎨 Step 2: Design Style — Tribute to Minimalist Aesthetics
&lt;/h2&gt;

&lt;p&gt;When designing the pages under the &lt;code&gt;docs&lt;/code&gt; directory, I referenced the style of &lt;strong&gt;whisperASR&lt;/strong&gt;. Its dark background with a bright teal accent fits modern developer aesthetics well:&lt;/p&gt;

&lt;h3&gt;
  
  
  Visual Element Highlights:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Gradient Title&lt;/strong&gt;: Use &lt;code&gt;linear-gradient&lt;/code&gt; to give the title a polished, high-end look.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Teal Accent Color&lt;/strong&gt;: Use &lt;code&gt;#14b8a6&lt;/code&gt; as the highlight color for key buttons and titles.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Card-style Layout&lt;/strong&gt;: Clearly present the icons and introductions of each skill, with good responsive design.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🌐 Step 3: Multilingual Support and Automatic Switching
&lt;/h2&gt;

&lt;p&gt;To make the site usable by developers worldwide, I adopted a directory-based language layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docs/
├── index.html (Language detection and redirection)
├── en/ (English version)
│   └── skills/
└── zh/ (Traditional Chinese version)
    └── skills/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I added a simple JavaScript snippet to the root directory's &lt;code&gt;index.html&lt;/code&gt;, which automatically redirects to the correct language based on the user's browser settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;lang&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;navigator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;language&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;navigator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userLanguage&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zh&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;href&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./zh/index.html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;href&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./en/index.html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🚀 Step 4: GitHub Pages Deployment Process
&lt;/h2&gt;

&lt;p&gt;In 2026, the most recommended deployment method is to serve the site from the &lt;code&gt;docs/&lt;/code&gt; directory of the &lt;code&gt;main&lt;/code&gt; branch: no separate &lt;code&gt;gh-pages&lt;/code&gt; branch is needed, and development and documentation stay synchronized in one repository.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Prepare the Directory Structure
&lt;/h3&gt;

&lt;p&gt;Create all the necessary directories at once using the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; docs/en/skills docs/zh/skills docs/assets/css

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Git Commit and Push
&lt;/h3&gt;

&lt;p&gt;After completing HTML/CSS development, execute the standard Git process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git add docs/
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"docs: add GitHub Pages documentation in English and Chinese"&lt;/span&gt;
git push origin main

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Enable GitHub Pages Settings
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; Go to &lt;strong&gt;Settings &amp;gt; Pages&lt;/strong&gt; in the GitHub repository.&lt;/li&gt;
&lt;li&gt; Under &lt;strong&gt;Build and deployment&lt;/strong&gt;, in &lt;strong&gt;Branch&lt;/strong&gt;, select the &lt;code&gt;main&lt;/code&gt; branch and the &lt;code&gt;/docs&lt;/code&gt; folder.&lt;/li&gt;
&lt;li&gt; Click &lt;strong&gt;Save&lt;/strong&gt;, and the website will be online in a few minutes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnj277v30sramun4c3ae5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnj277v30sramun4c3ae5.png" alt="image-20260322225932252" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ Common Pitfalls and Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ❓ Why can't the webpage style (CSS) be loaded?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Reason:&lt;/strong&gt; HTML files under subdirectories (such as &lt;code&gt;en/skills/&lt;/code&gt;) must reference the stylesheet with a relative path that matches their depth. &lt;strong&gt;Correction:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="c"&gt;&amp;lt;!-- In the home page index.html --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"stylesheet"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"../assets/css/style.css"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="c"&gt;&amp;lt;!-- In the skill detail page --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"stylesheet"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"../../assets/css/style.css"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
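The rule behind the correction is mechanical: one `../` per directory level below `docs/`. A small sketch of that rule (hypothetical helper and file names):

```python
def css_href(depth: int) -> str:
    """Relative href from a page `depth` directories below docs/ to the shared stylesheet."""
    return "../" * depth + "assets/css/style.css"

print(css_href(1))  # e.g. en/index.html      → ../assets/css/style.css
print(css_href(2))  # e.g. en/skills/a.html   → ../../assets/css/style.css
```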



&lt;h3&gt;
  
  
  ❓ How to ensure that the Agent can correctly read the document?
&lt;/h3&gt;

&lt;p&gt;The HTML retains semantic tags (&lt;code&gt;article&lt;/code&gt;, &lt;code&gt;h2&lt;/code&gt;, &lt;code&gt;pre&lt;/code&gt;, &lt;code&gt;code&lt;/code&gt;) so that an Agent performing RAG (Retrieval-Augmented Generation), or reading the page directly, can capture the core logic more accurately.&lt;/p&gt;




&lt;h2&gt;
  
  
  🏁 Conclusion
&lt;/h2&gt;

&lt;p&gt;Through this project, I came to appreciate the importance of treating documentation as a product. A good AI skill library needs not only solid program logic but also a clear, intuitive, multilingual-friendly navigation system.&lt;/p&gt;

&lt;p&gt;If you also want to give your AI project a professional facade, feel free to borrow this &lt;code&gt;docs/&lt;/code&gt; structure. Happy Coding! 🦞&lt;/p&gt;




</description>
      <category>agents</category>
      <category>automation</category>
      <category>documentation</category>
      <category>github</category>
    </item>
    <item>
      <title>Security Declaration for AI Agents: Deep Dive into A2AS (Agent-to-Agent Security) Certification Mechanism</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Fri, 27 Mar 2026 01:45:10 +0000</pubDate>
      <link>https://forem.com/evanlin/security-declaration-for-ai-agents-deep-dive-into-a2as-agent-to-agent-security-certification-2okf</link>
      <guid>https://forem.com/evanlin/security-declaration-for-ai-agents-deep-dive-into-a2as-agent-to-agent-security-certification-2okf</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fbadge%2FA2AS-CERTIFIED-f3af80" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fbadge%2FA2AS-CERTIFIED-f3af80" alt="A2AS-CERTIFIED" width="110" height="20"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reference links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://a2as.org" rel="noopener noreferrer"&gt;A2AS.org Official Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.a2as.org/certified/agents/kkdai/linebot-adk" rel="noopener noreferrer"&gt;linebot-adk Project Certification Page&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article documents an interesting Pull Request I received while maintaining &lt;strong&gt;linebot-adk (LINE Bot Agent Development Kit)&lt;/strong&gt;: adding the &lt;strong&gt;A2AS security certificate&lt;/strong&gt; to the project. This is not just a YAML file, but a significant milestone for AI Agents to move towards "industrial-grade security" in 2026.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa0eydihb3vmxsrfh8k20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa0eydihb3vmxsrfh8k20.png" alt="Google Chrome 2026-03-26 22.45.44" width="800" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;p&gt;When we develop Agents like &lt;code&gt;linebot-adk&lt;/code&gt; that have Tool Use (Function Calling) capabilities, the biggest concern for users is often: "Will this Agent issue commands without my permission?" or "What data can it access?".&lt;/p&gt;

&lt;p&gt;Traditionally, we could only write explanations in &lt;code&gt;README.md&lt;/code&gt;, but that is for humans to read, not for systems to verify. This is where &lt;strong&gt;A2AS (Agent-to-Agent Security)&lt;/strong&gt; comes in; it has been hailed as the "HTTPS of the AI world".&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ Step 1: Understanding the BASIC Model of A2AS
&lt;/h2&gt;

&lt;p&gt;A2AS is not just a name; it has a complete &lt;strong&gt;BASIC security model&lt;/strong&gt; behind it, designed to solve the trust issue between AI Agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;(B)ehavior Certificates&lt;/strong&gt;: Declarative certificates that clearly define the behavior boundaries of the Agent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;(A)uthenticated Prompts&lt;/strong&gt;: Ensures that the source of prompts is trustworthy and traceable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;(S)ecurity Boundaries&lt;/strong&gt;: Uses structured tags (such as &lt;code&gt;&amp;lt;a2as:user&amp;gt;&lt;/code&gt;) to isolate untrusted input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;(I)n-Context Defenses&lt;/strong&gt;: Embeds defense logic in prompts to reject malicious injections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;(C)odified Policies&lt;/strong&gt;: Writes business rules into code and enforces them during inference.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🎨 Step 2: Deconstructing a2as.yaml – The Agent's ID Card
&lt;/h2&gt;

&lt;p&gt;In PR #1 received by &lt;code&gt;linebot-adk&lt;/code&gt;, the core change was the addition of &lt;code&gt;a2as.yaml&lt;/code&gt;. This file acts as the Agent's "digital signature", making the code's capabilities explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;manifest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;subject&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kkdai/linebot-adk&lt;/span&gt;
    &lt;span class="na"&gt;scope&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main.py&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;multi_tool_agent/agent.py&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;issued&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;by&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;A2AS.org&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://a2as.org/certified/agents/kkdai/linebot-adk&lt;/span&gt;

&lt;span class="na"&gt;agents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;root_agent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;instance&lt;/span&gt;
    &lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;get_weather&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;get_current_time&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why is this important?
&lt;/h3&gt;

&lt;p&gt;This certificate is directly linked to the content of our &lt;code&gt;main.py&lt;/code&gt;. When the certificate declares &lt;code&gt;tools: [get_weather, get_current_time]&lt;/code&gt;, it means this is a &lt;strong&gt;limited-authorization&lt;/strong&gt; Agent. If it tries to execute &lt;code&gt;delete_database&lt;/code&gt;, the security monitoring system can immediately detect that it is outside the certificate scope.&lt;/p&gt;
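That out-of-scope detection can be illustrated with a simple gate. This is a hypothetical sketch of the idea, not A2AS's actual enforcement code; the tool set mirrors the a2as.yaml above:

```python
# Tool list mirrored from the a2as.yaml manifest above
DECLARED_TOOLS = {"get_weather", "get_current_time"}

def is_within_certificate(tool_name: str) -> bool:
    """Allow a tool call only if the certificate declares it."""
    return tool_name in DECLARED_TOOLS

print(is_within_certificate("get_weather"))      # → True
print(is_within_certificate("delete_database"))  # → False
```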




&lt;h2&gt;
  
  
  🌐 Step 3: Combining Code Logic
&lt;/h2&gt;

&lt;p&gt;In &lt;code&gt;linebot-adk&lt;/code&gt;, we used Google's &lt;strong&gt;ADK (Agent Development Kit)&lt;/strong&gt; to build the Agent. The A2AS certificate maps cleanly onto our program architecture:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Tool Declaration and Implementation
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;multi_tool_agent/agent.py&lt;/code&gt;, we defined two tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Implement the logic to get the weather
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_current_time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Implement the logic to get the time
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The A2AS certificate will register these &lt;code&gt;function&lt;/code&gt;s in the &lt;code&gt;tools&lt;/code&gt; block, ensuring that the Agent's capability boundaries are transparent and auditable.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Runner and Execution Loop
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;main.py&lt;/code&gt;, we start the Agent through &lt;code&gt;Runner&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;runner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;root_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;APP_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;session_service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;manifest.subject.scope&lt;/code&gt; in the certificate marks &lt;code&gt;main.py&lt;/code&gt;, which means the entire startup process (including FastAPI's Webhook processing) is within the A2AS compliant scope.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 Step 4: Why is this the "HTTPS of the AI world"?
&lt;/h2&gt;

&lt;p&gt;Imagine if you want a "travel agent Agent" to talk to a "hotel reservation Agent".&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Without A2AS&lt;/strong&gt;: The travel Agent can only "blindly trust" the hotel Agent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;With A2AS&lt;/strong&gt;: The travel Agent can first check the other party's &lt;code&gt;a2as.yaml&lt;/code&gt; certificate. If the other party claims to have the right to "modify orders" but the certificate doesn't say so, the travel Agent can refuse the transaction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This &lt;strong&gt;"verify first, then execute"&lt;/strong&gt; model is the trust network that A2AS wants to build.&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ Common Pitfalls and Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ❓ What if the certificate expires or the Commit Hash doesn't match?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Reason:&lt;/strong&gt; A2AS certificates are bound to a specific Git Commit. When you modify the logic of &lt;code&gt;agent.py&lt;/code&gt; but don't update the certificate, the verification will fail. &lt;strong&gt;Correction:&lt;/strong&gt; Every time you modify the core functions of the Agent (such as adding a Tool or changing the Model), you must regenerate and sign &lt;code&gt;a2as.yaml&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  ❓ Does using A2AS increase latency?
&lt;/h3&gt;

&lt;p&gt;No. A2AS is mainly a declarative, structured specification. During inference, its structured tags (the S in the BASIC model) help the LLM distinguish instructions from data, which reduces hallucinations caused by instruction/data confusion and can even improve execution efficiency.&lt;/p&gt;




&lt;h2&gt;
  
  
  🏁 Conclusion
&lt;/h2&gt;

&lt;p&gt;Through the introduction of this A2AS certificate, &lt;code&gt;linebot-adk&lt;/code&gt; is no longer just a simple LINE Bot example; it has become a transparent Agent that meets the 2026 security standards. In an era where AI agents are gradually penetrating our lives, "transparency" is the best defense.&lt;/p&gt;

&lt;p&gt;If you are also developing AI Agents, you might as well go to &lt;a href="https://a2as.org" rel="noopener noreferrer"&gt;A2AS.org&lt;/a&gt; and add that badge of trust to your project. Happy Coding! 🦞&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>opensource</category>
      <category>security</category>
    </item>
    <item>
      <title>Deploying OpenClaw on Google Cloud VM: Avoiding Sudo and NVM Pitfalls</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Sun, 01 Mar 2026 14:04:54 +0000</pubDate>
      <link>https://forem.com/gde/deploying-openclaw-on-google-cloud-vm-avoiding-sudo-and-nvm-pitfalls-92k</link>
      <guid>https://forem.com/gde/deploying-openclaw-on-google-cloud-vm-avoiding-sudo-and-nvm-pitfalls-92k</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatem7u193qqdcdo7bfox.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatem7u193qqdcdo7bfox.png" alt="OpenClaw on GCP" width="800" height="436"&gt;&lt;/a&gt;&lt;em&gt;(Image generated by &lt;a href="https://github.com/kkdai/nanobanana" rel="noopener noreferrer"&gt;Nano Banana&lt;/a&gt; - Gemini Image Generation)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;References:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openclaw.ai/" rel="noopener noreferrer"&gt;OpenClaw Official Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://yu-wenhao.com/zh-TW/blog/openclaw-tools-skills-tutorial/" rel="noopener noreferrer"&gt;OpenClaw Practical Tutorial: Chinese FAQ and Recommended Skills&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://yu-wenhao.com/zh-TW/blog/2026-02-04-is-openclaw-safe-security-guide/" rel="noopener noreferrer"&gt;OpenClaw Security Guide: Security Enhancement Recommendations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://youtu.be/FC3Wo3ew130" rel="noopener noreferrer"&gt;YouTube Tutorial: Deploying OpenClaw on GCP&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article documents the complete solution process for the permission, environment variable, and process persistence issues encountered when installing &lt;strong&gt;OpenClaw (2026 Latest Version)&lt;/strong&gt; in a Debian/Ubuntu environment on Google Cloud Platform (GCP).&lt;/p&gt;

&lt;h1&gt;
  
  
  Preface
&lt;/h1&gt;

&lt;p&gt;The AI Agent field has been very popular recently. &lt;strong&gt;OpenClaw&lt;/strong&gt;, an open-source AI agent that can run around the clock, has impressed people with its powerful system access and browsing capabilities. For security reasons, deploying it on a cloud VM (such as a GCE instance) is ideal: it keeps the agent online 24/7 and isolates it from sensitive local data.&lt;/p&gt;

&lt;p&gt;However, in GCP's default Debian/Ubuntu environment, the permission setup differs slightly from a typical desktop Linux, so following the official install script often runs into pitfalls.&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ Basic Installation Process of OpenClaw on GCP
&lt;/h2&gt;

&lt;p&gt;Before we get into troubleshooting, let's quickly go through the standard installation logic:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Create a VM Instance
&lt;/h3&gt;

&lt;p&gt;Create a new VM in the GCP Console:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Machine type&lt;/strong&gt;: Recommended &lt;code&gt;e2-small&lt;/code&gt; or &lt;code&gt;e2-medium&lt;/code&gt; (depending on your Agent load).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operating system&lt;/strong&gt;: Recommended to choose &lt;strong&gt;Ubuntu 24.04 LTS&lt;/strong&gt; or &lt;strong&gt;Debian 12&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hard disk&lt;/strong&gt;: Recommended 20GB or more.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Connect and Basic Updates
&lt;/h3&gt;

&lt;p&gt;After entering the VM via SSH, first perform a system update:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update &amp;amp;&amp;amp; sudo apt upgrade -y
sudo apt install -y git curl build-essential

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Officially Install OpenClaw
&lt;/h3&gt;

&lt;p&gt;The official website provides a one-click installation script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -fsSL https://openclaw.ai/install.sh | bash

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;But!&lt;/strong&gt; If you directly execute the above script, you will usually encounter the following two serious permission and path problems on GCP.&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ Problem 1: "HAL 9000" Style Denial of sudo-rs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; When executing the official installation script, the following error is encountered with &lt;code&gt;sudo-rs&lt;/code&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;sudo-rs: I'm sorry evanslin. I'm afraid I can't do that&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Reason:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Interaction Restriction&lt;/strong&gt;: The script executed via &lt;code&gt;curl ... | bash&lt;/code&gt; cannot obtain password input from the terminal when &lt;code&gt;sudo&lt;/code&gt; is required.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;No Password Account&lt;/strong&gt;: GCP defaults to using SSH Key login, and the user account usually does not have a physical password set, leading to &lt;code&gt;sudo&lt;/code&gt; authentication failure.&lt;/li&gt;
&lt;/ol&gt;
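&lt;p&gt;The stdin problem is easy to reproduce without any installer: when a script is piped into bash, stdin &lt;em&gt;is&lt;/em&gt; the script stream, so any interactive read (such as a password prompt) swallows script text instead of waiting for the user. A minimal, generic demonstration (not the OpenClaw installer itself):&lt;/p&gt;

```shell
# When a script is piped into bash, stdin is the script stream, so an
# interactive "read" (like sudo's password prompt) consumes the next
# line of the script instead of waiting for keyboard input.
out=$(printf 'read -r answer; echo "got: $answer"\necho LAST-LINE\n' | bash)
echo "$out"   # prints: got: echo LAST-LINE
```

&lt;p&gt;Note that &lt;code&gt;echo LAST-LINE&lt;/code&gt; never executes; it was eaten as "input". This is exactly why a piped installer cannot prompt for a &lt;code&gt;sudo&lt;/code&gt; password.&lt;/p&gt;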

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Use &lt;strong&gt;NVM (Node Version Manager)&lt;/strong&gt; to install Node.js, and build the environment under the user directory, completely avoiding the &lt;code&gt;sudo&lt;/code&gt; requirement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 1. Install NVM
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash

# Reload shell configuration
source ~/.bashrc

# 2. Install Node.js
nvm install node # Recommended version v25.7.0+

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🛠️ Problem 2: NVM Path and Environment Variables
&lt;/h2&gt;

&lt;p&gt;Using NVM avoids &lt;code&gt;sudo&lt;/code&gt;, but a new problem appears: when you log in again, or run commands from a non-interactive shell, the system may fail to find the &lt;code&gt;node&lt;/code&gt; or &lt;code&gt;openclaw&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;This is because the NVM path is dynamically loaded. It is recommended to ensure that the following content exists in &lt;code&gt;~/.bashrc&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] &amp;amp;&amp;amp; \. "$NVM_DIR/nvm.sh"
[ -s "$NVM_DIR/bash_completion" ] &amp;amp;&amp;amp; \. "$NVM_DIR/bash_completion"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
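&lt;p&gt;Watch the spacing around the brackets: &lt;code&gt;[&lt;/code&gt; is an ordinary command (&lt;code&gt;test&lt;/code&gt;), so &lt;code&gt;[-s file]&lt;/code&gt; without spaces is "command not found" and the NVM lines silently never load. A self-contained check of the correct form:&lt;/p&gt;

```shell
# "[" is a regular command, so it must be separated from its arguments
# by spaces: "[ -s file ]" works, "[-s file]" fails with "command not
# found" -- turning the NVM lines in ~/.bashrc into silent no-ops.
tmpfile=$(mktemp)
echo data > "$tmpfile"

result="not loaded"
if [ -s "$tmpfile" ]; then   # true: file exists and is non-empty
  result="loaded"
fi
echo "$result"               # prints: loaded

rm -f "$tmpfile"
```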






&lt;h2&gt;
  
  
  🛠️ Problem 3: How to Make OpenClaw Run 24/7 Stably?
&lt;/h2&gt;

&lt;p&gt;After installation, to keep the Agent running after the SSH window closes, I switched from GCP's Web SSH to the local &lt;code&gt;gcloud&lt;/code&gt; CLI, and ran into another small pitfall.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Why Can't gcloud ssh Find openclaw?
&lt;/h3&gt;

&lt;p&gt;This is usually because GCP's &lt;code&gt;gcloud compute ssh&lt;/code&gt; may create a new username based on your &lt;strong&gt;local account name&lt;/strong&gt;, instead of using the account you used when installing on the VM (e.g., &lt;code&gt;evanslin&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verification method:&lt;/strong&gt; Please enter the following in the "Web SSH" and "Local gcloud SSH" windows respectively:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;whoami

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; If the web version shows &lt;code&gt;evanslin&lt;/code&gt; but the gcloud version shows something like &lt;code&gt;evan_lin_yourdomain_com&lt;/code&gt;, the two sessions have completely different home directories, so your NVM and OpenClaw setups naturally "disappear".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; When executing the &lt;code&gt;gcloud&lt;/code&gt; command, &lt;strong&gt;explicitly specify&lt;/strong&gt; the account to log in to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gcloud compute ssh evanslin@openclaw-evanlin

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will ensure that you return to the correct environment!&lt;/p&gt;
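&lt;p&gt;A quick follow-up sanity check after logging in with the explicit account: both values below should match what the Web SSH session reports. (A minimal diagnostic sketch, not an OpenClaw command.)&lt;/p&gt;

```shell
# If $HOME differs between the two sessions, you are in the wrong
# account, and NVM / OpenClaw will appear to be "missing" again.
echo "user: $(id -un)"
echo "home: $HOME"
```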

&lt;h3&gt;
  
  
  2. Use tmux and a Startup Script for Reliable Execution
&lt;/h3&gt;

&lt;p&gt;To make sure environment variables load correctly in any SSH session (web or gcloud), and to keep OpenClaw running stably in the background, the following "scripted" startup method is recommended.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: Create a Startup Script
&lt;/h4&gt;

&lt;p&gt;In a window where you can normally execute &lt;code&gt;openclaw&lt;/code&gt; (usually Web SSH), create a startup script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat &amp;lt;&amp;lt; 'EOF' &amp;gt; ~/start_openclaw.sh
#!/bin/bash
# 1. Force loading NVM path
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] &amp;amp;&amp;amp; \. "$NVM_DIR/nvm.sh"

# 2. Automatically correct PATH (please adjust the path according to your Node version)
export PATH="$HOME/.nvm/versions/node/v25.7.0/bin:$PATH"

# 3. Execute command
openclaw "$@"
EOF

# Grant execution permission
chmod +x ~/start_openclaw.sh

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
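&lt;p&gt;One caveat in the script above: the &lt;code&gt;v25.7.0&lt;/code&gt; path is hardcoded, so it breaks on the next &lt;code&gt;nvm install&lt;/code&gt;. A sketch of a version-agnostic alternative, demonstrated against a throwaway fake layout so it is safe to run anywhere:&lt;/p&gt;

```shell
# Pick the newest node bin directory under an nvm-style layout instead
# of hardcoding a version. Demonstrated on a temporary fake tree.
nvm_root=$(mktemp -d)
mkdir -p "$nvm_root/versions/node/v24.1.0/bin" \
         "$nvm_root/versions/node/v25.7.0/bin"

# sort -V orders version strings numerically; tail takes the newest.
latest_bin=$(ls -d "$nvm_root/versions/node/"*/bin | sort -V | tail -n1)
echo "$latest_bin"   # ends with .../v25.7.0/bin

# In the real startup script, replace the hardcoded line with:
#   export PATH="$latest_bin:$PATH"
rm -rf "$nvm_root"
```

&lt;p&gt;In your actual &lt;code&gt;~/start_openclaw.sh&lt;/code&gt;, point the glob at &lt;code&gt;$HOME/.nvm&lt;/code&gt; instead of the temporary directory.&lt;/p&gt;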



&lt;h4&gt;
  
  
  Step 2: Verify the Script
&lt;/h4&gt;

&lt;p&gt;From now on, use this script no matter where you log in from. Test it in the &lt;code&gt;gcloud ssh&lt;/code&gt; window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/start_openclaw.sh gateway

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it runs successfully, the path has been wired up correctly!&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 3: Combine with tmux to Survive Disconnections
&lt;/h4&gt;

&lt;p&gt;Now we combine the script with &lt;code&gt;tmux&lt;/code&gt; to achieve true 24/7 background operation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Open a new session&lt;/strong&gt;: &lt;code&gt;tmux new -s openclaw&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Execute the script inside&lt;/strong&gt;: &lt;code&gt;~/start_openclaw.sh gateway&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Perfectly detach&lt;/strong&gt;: Press &lt;code&gt;Ctrl + B&lt;/code&gt; and release, then press &lt;code&gt;D&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Reconnect at any time&lt;/strong&gt;: Next time you log in, execute &lt;code&gt;tmux a -t openclaw&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
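&lt;p&gt;The four steps above can be collapsed into one idempotent launcher. This is a sketch using the session name and script path from this article; &lt;code&gt;tmux new -d -s&lt;/code&gt; simply combines "create session" with "start detached":&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Start the gateway in a detached tmux session, but only if a session
# with this name is not already running, so the script is safe to re-run.
SESSION=openclaw
CMD="$HOME/start_openclaw.sh gateway"

if ! command -v tmux >/dev/null; then
  status="tmux not installed"
elif tmux has-session -t "$SESSION" 2>/dev/null; then
  status="already running"
else
  tmux new -d -s "$SESSION" "$CMD" || true
  status="launch attempted"
fi
echo "$status"
```

&lt;p&gt;You can even call this from &lt;code&gt;cron&lt;/code&gt; or a login hook as a crude watchdog, since re-running it never spawns a second copy.&lt;/p&gt;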




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The key to deploying OpenClaw on GCP is &lt;strong&gt;"user directory first"&lt;/strong&gt;: installing via NVM sidesteps the system-level &lt;code&gt;sudo-rs&lt;/code&gt; restriction, makes the installation smoother, and makes it easy to switch Node.js versions as OpenClaw's requirements evolve.&lt;/p&gt;

&lt;p&gt;After successful deployment, don't forget to use &lt;code&gt;openclaw onboard&lt;/code&gt; to start configuring your API Keys and communication channels (such as Telegram or Discord).&lt;/p&gt;

&lt;p&gt;I hope this note can help developers who are also working hard on GCP. See you next time!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cloud</category>
      <category>google</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
