Building Idea2Socdia: A Multimodal AI Agent with Gemini and Vertex AI

dang phan — Sun, 15 Mar 2026 15:26:01 +0000

I created this post as my entry for the Gemini Live Agent Challenge.

As a recent Computer Science graduate, I found the move from academic machine learning models to deploying a production-ready, cloud-native AI system a thrilling leap. For this hackathon, I wanted to tackle a real-world problem: the "context-switching fatigue" that content creators face when juggling scriptwriting, image generation, and video rendering tools.

The solution is Idea2Socdia, a Human-In-The-Loop (HITL) multimodal AI agent. Here is a deep dive into how I built it using Google's ecosystem.

The Architecture: 100% Cloud Native

To ensure scalability and maintain a stateless architecture, the system is fully deployed on Google Cloud:

Frontend: A responsive Next.js application hosted on Google Cloud Run.
Backend: A high-performance FastAPI server, also containerized via Docker and running on Cloud Run.
Media Storage: Google Cloud Storage (GCS) securely holds all generated assets and serves public URLs back to the client.
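
To make the stateless idea concrete, here is a minimal sketch of how a Cloud Run service can keep all of its configuration in the environment and build public URLs for stored assets. It is an illustration under stated assumptions, not the project's actual code: the MEDIA_BUCKET variable name, the default bucket name, and both helper functions are hypothetical. The PORT variable and the storage.googleapis.com URL form for publicly readable objects are standard Google Cloud behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping
from urllib.parse import quote


@dataclass(frozen=True)
class Settings:
    port: int          # Cloud Run passes the listening port in the PORT variable
    media_bucket: str  # hypothetical setting name, not taken from the project


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read every setting from the environment so no instance keeps local state."""
    return Settings(
        port=int(env.get("PORT", "8080")),
        media_bucket=env.get("MEDIA_BUCKET", "idea2socdia-media"),  # assumed default
    )


def public_url(bucket: str, object_name: str) -> str:
    """Build the HTTPS URL of a publicly readable GCS object."""
    # quote() keeps "/" intact and percent-encodes spaces and other unsafe characters.
    return f"https://storage.googleapis.com/{bucket}/{quote(object_name)}"


if __name__ == "__main__":
    settings = load_settings()
    print(settings)
    print(public_url(settings.media_bucket, "campaigns/42/hero image.png"))
```

Because nothing is read from or written to the container's disk, any instance can serve any request, which is what lets Cloud Run scale the service up and down freely.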

The Brain: Gemini 3 Flash & Interleaved Generation

The core orchestration relies on Gemini 3 Flash via the new google-genai SDK. Instead of traditional multi-step prompting, Idea2Socdia leverages interleaved generation: text and media requests are produced within a single streamed response.

The LLM acts as a "Content Director." As it streams the strategic outline and script back to the Next.js frontend via NDJSON, it autonomously evaluates the narrative. When it determines a visual is needed, it pauses the text stream, constructs a highly contextual prompt, and triggers a media generation tool call.
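
The flow described above can be sketched roughly as follows. This is a simplified model under assumptions, not the real backend: the event dictionaries ("text" and "tool_call"), the field names, and the generate_media callback are all hypothetical stand-ins for whatever the SDK's streaming response and the media tool actually return. The only fixed point is the NDJSON framing, which is one JSON object per line.

```python
import json
from typing import Callable, Iterable, Iterator


def to_ndjson(event: dict) -> str:
    """Serialize one event as a single NDJSON line."""
    return json.dumps(event, ensure_ascii=False) + "\n"


def interleave(
    model_events: Iterable[dict],
    generate_media: Callable[[str], str],
) -> Iterator[str]:
    """Yield NDJSON lines, resolving media tool calls in place between text chunks."""
    for event in model_events:
        if event["type"] == "text":
            yield to_ndjson({"type": "text", "content": event["content"]})
        elif event["type"] == "tool_call":
            # The text stream pauses here until the media tool returns a URL.
            url = generate_media(event["prompt"])
            yield to_ndjson({"type": "media", "prompt": event["prompt"], "url": url})


if __name__ == "__main__":
    # Hypothetical model output: two script chunks with one visual in between.
    fake_events = [
        {"type": "text", "content": "Hook: why mornings matter."},
        {"type": "tool_call", "prompt": "sunrise over a tidy desk, warm light"},
        {"type": "text", "content": "Step 1: plan the night before."},
    ]
    for line in interleave(fake_events, lambda prompt: "https://example.com/media/1.png"):
        print(line, end="")
```

On the frontend, each line can be parsed as it arrives, so text renders immediately and a media item appears in the position where the model asked for it.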

The Media Engine: Vertex AI

For the visual components, the backend securely authenticates with Google Cloud via OAuth2 to access Vertex AI endpoints.
Depending on the target platform (e.g., a Facebook post vs. a YouTube Short), the agent dynamically decides whether to call state-of-the-art text-to-image models (like Nano Banana) or text-to-video models (like Veo). Once Vertex AI returns the media bytes, the backend streams them directly to GCS, bypassing the need for local persistent storage on Cloud Run.
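
As a rough illustration of these two steps, the sketch below picks a media kind from the target platform and copies returned bytes chunk by chunk into a writable sink. It rests on several assumptions: in Idea2Socdia the agent itself makes the image-or-video decision, so the rule-based choose_media_kind function is only a stand-in, and the platform identifiers are hypothetical. The sink here is an in-memory buffer; in a real deployment it could be a file-like writer such as the one returned by Blob.open("wb") in the google-cloud-storage library.

```python
import io
from typing import BinaryIO, Iterable

# Hypothetical platform identifiers; only these two examples appear in the post.
VIDEO_PLATFORMS = {"youtube_short"}


def choose_media_kind(platform: str) -> str:
    """Rule-based stand-in for the agent's decision: 'video' or 'image'."""
    return "video" if platform.lower() in VIDEO_PLATFORMS else "image"


def stream_to_sink(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Write chunks to the sink as they arrive and return the total byte count."""
    total = 0
    for chunk in chunks:
        sink.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    print(choose_media_kind("facebook_post"))
    print(choose_media_kind("youtube_short"))

    buffer = io.BytesIO()  # stands in for a GCS writer
    written = stream_to_sink([b"fake-", b"media-", b"bytes"], buffer)
    print(written, buffer.getvalue())
```

Writing chunk by chunk means the full asset never has to be held on the container's filesystem, which matches the goal of avoiding local persistent storage on Cloud Run.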

Conclusion

Building Idea2Socdia gave me deep, practical experience in orchestrating complex LLM workflows and managing enterprise-grade cloud resources. By strictly grounding the model and keeping the human in the loop, the system turns raw ideas into ready-to-publish social campaigns.

You can check out the public repository here: Idea2Socdia
