<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: dang phan</title>
    <description>The latest articles on Forem by dang phan (@dangineer_4k2).</description>
    <link>https://forem.com/dangineer_4k2</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3825502%2Fedd2d47e-799c-40da-8ca3-0e5b1ad3940a.png</url>
      <title>Forem: dang phan</title>
      <link>https://forem.com/dangineer_4k2</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dangineer_4k2"/>
    <language>en</language>
    <item>
      <title>Building Idea2Socdia: A Multimodal AI Agent with Gemini and Vertex AI</title>
      <dc:creator>dang phan</dc:creator>
      <pubDate>Sun, 15 Mar 2026 15:26:01 +0000</pubDate>
      <link>https://forem.com/dangineer_4k2/building-idea2socdia-a-multimodal-ai-agent-with-gemini-and-vertex-ai-364c</link>
      <guid>https://forem.com/dangineer_4k2/building-idea2socdia-a-multimodal-ai-agent-with-gemini-and-vertex-ai-364c</guid>
      <description>&lt;p&gt;&lt;em&gt;I created this piece of content for the purposes of entering the Gemini Live Agent Challenge.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As a recent Computer Science graduate, I found the move from academic machine learning models to deploying a production-ready, cloud-native AI system a thrilling leap. For this hackathon, I wanted to tackle a real-world problem: the "context-switching fatigue" that content creators face when juggling scriptwriting, image generation, and video rendering tools.&lt;/p&gt;

&lt;p&gt;The solution is &lt;strong&gt;Idea2Socdia&lt;/strong&gt;, a Human-In-The-Loop (HITL) multimodal AI agent. Here is a deep dive into how I built it using Google's ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture: 100% Cloud Native
&lt;/h2&gt;

&lt;p&gt;To ensure scalability and maintain a stateless architecture, the system is fully deployed on Google Cloud:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend:&lt;/strong&gt; A responsive Next.js application hosted on Google Cloud Run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend:&lt;/strong&gt; A high-performance FastAPI server, also containerized via Docker and running on Cloud Run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Media Storage:&lt;/strong&gt; Google Cloud Storage (GCS) holds all generated assets and exposes them to the client through public URLs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Brain: Gemini 3 Flash &amp;amp; Interleaved Generation
&lt;/h2&gt;

&lt;p&gt;The core orchestration relies on &lt;strong&gt;Gemini 3 Flash&lt;/strong&gt; via the new &lt;code&gt;google-genai&lt;/code&gt; SDK. Instead of traditional multi-step prompting, Idea2Socdia leverages &lt;strong&gt;interleaved generation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The LLM acts as a "Content Director." As it streams the strategic outline and script back to the Next.js frontend as NDJSON (newline-delimited JSON, one JSON object per line), it autonomously evaluates the narrative. When it determines that a visual is needed, it pauses the text stream, constructs a highly contextual prompt, and triggers a media generation tool call.&lt;/p&gt;
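
&lt;p&gt;The NDJSON framing described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not code from the Idea2Socdia repository: the event shapes (&lt;code&gt;text&lt;/code&gt;, &lt;code&gt;media_request&lt;/code&gt;) and the helper names &lt;code&gt;to_ndjson&lt;/code&gt;, &lt;code&gt;director_events&lt;/code&gt;, and &lt;code&gt;parse_ndjson&lt;/code&gt; are assumptions, and the generator stands in for the real model output.&lt;/p&gt;

```python
import json
from typing import Iterable, Iterator


def to_ndjson(events: Iterable[dict]) -> Iterator[str]:
    """Serialize each event as one JSON object per line (NDJSON)."""
    for event in events:
        # json.dumps escapes embedded newlines, so the trailing "\n"
        # is the only record delimiter in each emitted line.
        yield json.dumps(event, ensure_ascii=False) + "\n"


def director_events() -> Iterator[dict]:
    """Hypothetical stand-in for the model's interleaved output."""
    yield {"type": "text", "content": "Hook: open on the messy desk."}
    # The text stream pauses here while a media tool call is issued.
    yield {
        "type": "media_request",
        "kind": "image",
        "prompt": "A cluttered creator desk at dawn",
    }
    yield {"type": "text", "content": "Then cut to the finished post."}


def parse_ndjson(lines: Iterable[str]) -> list[dict]:
    """Client-side counterpart: decode one event per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


stream = list(to_ndjson(director_events()))
events = parse_ndjson(stream)
for event in events:
    print(event["type"])
```

&lt;p&gt;Because every line is a complete JSON document, a client can render text events as they arrive and treat a &lt;code&gt;media_request&lt;/code&gt; event as the cue to show a placeholder until the asset is ready.&lt;/p&gt;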

&lt;h2&gt;
  
  
  The Media Engine: Vertex AI
&lt;/h2&gt;

&lt;p&gt;For the visual components, the backend securely authenticates with Google Cloud via OAuth2 to access &lt;strong&gt;Vertex AI&lt;/strong&gt; endpoints. Depending on the target platform (e.g., a Facebook post vs. a YouTube Short), the agent dynamically decides whether to call state-of-the-art text-to-image models (like Nano Banana) or text-to-video models (like Veo). Once Vertex AI returns the media bytes, the backend streams them directly to GCS, avoiding any need for local persistent storage on Cloud Run.&lt;/p&gt;
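
&lt;p&gt;To make the platform-dependent choice concrete, here is a hedged sketch of the routing step alone. In the project the agent makes this decision dynamically; the fixed lookup table, the &lt;code&gt;MediaJob&lt;/code&gt; type, the &lt;code&gt;plan_media_job&lt;/code&gt; function, and the object naming scheme below are all hypothetical simplifications. The Vertex AI and GCS calls are deliberately left out.&lt;/p&gt;

```python
from dataclasses import dataclass

# Assumed mapping for illustration only; in the article the agent makes this
# decision dynamically rather than through a fixed table.
MEDIA_KIND_BY_PLATFORM = {
    "facebook_post": "image",
    "youtube_short": "video",
}


@dataclass(frozen=True)
class MediaJob:
    platform: str
    kind: str         # "image" or "video"
    prompt: str
    object_name: str  # hypothetical destination path inside a GCS bucket


def plan_media_job(platform: str, prompt: str, job_id: str) -> MediaJob:
    """Pick a media kind for the platform and derive a storage object name."""
    kind = MEDIA_KIND_BY_PLATFORM.get(platform)
    if kind is None:
        raise ValueError(f"unsupported platform: {platform!r}")
    extension = "png" if kind == "image" else "mp4"
    return MediaJob(platform, kind, prompt, f"media/{job_id}.{extension}")


job = plan_media_job(
    "youtube_short", "Timelapse of a sketch becoming a post", "demo-001"
)
print(job.kind, job.object_name)
```

&lt;p&gt;Keeping the plan as a small immutable record means the generation call and the upload step can both be driven from the same object, which fits a stateless Cloud Run service.&lt;/p&gt;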

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building Idea2Socdia gave me deep, practical experience in orchestrating complex LLM workflows and managing enterprise-grade cloud resources. By strictly grounding the model and keeping the human in the loop, the system turns raw ideas into ready-to-publish social campaigns.&lt;/p&gt;

&lt;p&gt;You can check out the public repository here: &lt;a href="https://github.com/PTD504/idea-to-socdia" rel="noopener noreferrer"&gt;Idea2Socdia&lt;/a&gt;&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>gemini</category>
      <category>geminiliveagentchallenge</category>
      <category>automation</category>
    </item>
  </channel>
</rss>
