<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Dileep Kumar Sharma</title>
    <description>The latest articles on Forem by Dileep Kumar Sharma (@dileep_kumarsharma_f76b7).</description>
    <link>https://forem.com/dileep_kumarsharma_f76b7</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3826317%2F707e95d4-038b-4772-9019-459c465af9d0.png</url>
      <title>Forem: Dileep Kumar Sharma</title>
      <link>https://forem.com/dileep_kumarsharma_f76b7</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dileep_kumarsharma_f76b7"/>
    <language>en</language>
    <item>
      <title>Building Reveria: An AI Story Engine with Gemini #GeminiLiveAgentChallenge</title>
      <dc:creator>Dileep Kumar Sharma</dc:creator>
      <pubDate>Mon, 16 Mar 2026 05:27:50 +0000</pubDate>
      <link>https://forem.com/dileep_kumarsharma_f76b7/building-reveria-an-ai-story-engine-with-gemini-geminiliveagentchallenge-53c3</link>
      <guid>https://forem.com/dileep_kumarsharma_f76b7/building-reveria-an-ai-story-engine-with-gemini-geminiliveagentchallenge-53c3</guid>
      <description>&lt;p&gt;&lt;em&gt;Describe a story. Watch it come alive. That's the pitch. Here's how I actually built it.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Built for the &lt;a href="https://devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt; hackathon (Creative Storyteller Track). #GeminiLiveAgentChallenge&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What is Reveria?
&lt;/h2&gt;

&lt;p&gt;Reveria is an interactive story engine. You type (or say) something like "a noir detective story in a rain-soaked city at midnight," and it generates an illustrated storybook in real time: narrative text, scene illustrations, voice narration, and an interactive flipbook you can page through. Everything streams in live as four AI agents work in parallel.&lt;/p&gt;

&lt;p&gt;What makes it different from "give me a story" ChatGPT wrappers is the &lt;strong&gt;Director Chat&lt;/strong&gt;. You open a voice conversation with an AI Director character, brainstorm your story idea out loud, and when the Director decides you're ready, it triggers generation automatically. During generation, the Director watches each scene and offers creative analysis in real time. It suggests what should happen next, and the Narrator picks up that suggestion in the following scene. Two agents shaping a story together, with you steering.&lt;/p&gt;

&lt;p&gt;This isn't a single API call. It's a multi-agent pipeline built on Google's Agent Development Kit (ADK), with Gemini 2.0 Flash for text, Imagen 3 for illustrations, Gemini Live API for voice, and Gemini Native Audio for narration. Each agent runs at a different temperature tuned for its task.&lt;/p&gt;

&lt;p&gt;Beyond generation, Reveria is a full application: a Library for your saved stories, an Explore page for discovering published work from other users, Reading Mode with karaoke-style narration, PDF export, 8-language support, 9 story templates, 30+ art styles, social features (likes, ratings, comments), and share links for public viewing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live app&lt;/strong&gt;: &lt;a href="https://reveria.web.app" rel="noopener noreferrer"&gt;reveria.web.app&lt;/a&gt; | &lt;strong&gt;Source&lt;/strong&gt;: &lt;a href="https://github.com/Dileep2896/reveria" rel="noopener noreferrer"&gt;github.com/Dileep2896/reveria&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick stats&lt;/strong&gt;: 4 AI Agents · 30+ Art Styles · 9 Story Templates · 8 Languages&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4qyg6yme0fbl3v6zn4d.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4qyg6yme0fbl3v6zn4d.jpg" alt="9 story templates from Storybook to Manga to Photo Journal" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Template Chooser - pick from 9 story templates via a 3D coverflow carousel&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;p&gt;Reveria runs four specialist agents coordinated by ADK's SequentialAgent. The key design decision: &lt;strong&gt;different temperatures for different tasks&lt;/strong&gt;. Story writing needs high creativity (temp 0.9). Image prompts need precision (temp 0.3). Character extraction needs determinism (temp 0.1). Director analysis needs structured JSON output (temp 0.3). A single Gemini call can't do all of these well.&lt;/p&gt;
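&lt;p&gt;A minimal sketch of how that per-task tuning might be wired up; names and model strings here are illustrative, not Reveria's actual code:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentConfig:
    name: str
    model: str
    temperature: float

# Hypothetical registry mirroring the temperatures described above.
AGENTS = {
    "narrator": AgentConfig("narrator", "gemini-2.0-flash", 0.9),       # creative prose
    "illustrator": AgentConfig("illustrator", "gemini-2.0-flash", 0.3), # precise image prompts
    "extractor": AgentConfig("extractor", "gemini-2.0-flash", 0.1),     # deterministic character sheets
    "director": AgentConfig("director", "gemini-2.0-flash", 0.3),       # structured JSON analysis
}

def temperature_for(task: str) -> float:
    """Look up the sampling temperature tuned for a pipeline task."""
    return AGENTS[task].temperature
```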

&lt;p&gt;The pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;StoryOrchestrator (SequentialAgent)
  +-- NarratorADKAgent (per-scene streaming loop)
  |     |
  |     +-- Scene 1 text ready ──&amp;gt; Illustrator (Imagen 3)
  |     |                     ──&amp;gt; TTS (Gemini Native Audio)
  |     |                     ──&amp;gt; Director Live (commentary)
  |     |
  |     +-- [Check steering queue → inject user direction]
  |     |
  |     +-- Scene 2 text ready ──&amp;gt; (same parallel tasks)
  |     |
  |     +-- await all pending tasks
  |
  +-- PostNarrationAgent (ParallelAgent)
        +-- Director Agent (full post-batch analysis)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Four agents, four roles:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Narrator Agent&lt;/strong&gt; (Gemini 2.0 Flash, temp 0.9): writes each scene with consistent characters and plot threads, streams text chunk-by-chunk over WebSocket&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Illustrator Agent&lt;/strong&gt; (Gemini + Imagen 3, temp 0.1–0.3): four-stage hybrid prompt pipeline for visually consistent illustrations across scenes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTS Agent&lt;/strong&gt; (Gemini Native Audio): audiobook-quality narration that varies tone with mood&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Director Agent&lt;/strong&gt; (Gemini Flash, temp 0.3): per-scene live commentary with mood, tension, craft notes, and creative suggestions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each prompt generates exactly one scene. This keeps the feedback loop tight. Everything streams over a single WebSocket: text chunk-by-chunk, images as each Imagen call completes, audio per-scene, Director analysis as structured JSON.&lt;/p&gt;
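&lt;p&gt;Multiplexing those event types onto one socket suggests a small tagged envelope; this wire format is a hypothetical sketch, not the actual protocol:&lt;/p&gt;

```python
import json

# Hypothetical event kinds for the single streaming channel.
ALLOWED_KINDS = {"text_chunk", "image", "audio", "director"}

def make_event(kind: str, scene: int, payload) -> str:
    """Serialize one streaming event so the frontend can dispatch on
    `kind` and route the payload to the right scene in the flipbook."""
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"unknown event kind: {kind}")
    return json.dumps({"kind": kind, "scene": scene, "payload": payload})
```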

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1jlgrlu8f8ztkosacwcq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1jlgrlu8f8ztkosacwcq.jpg" alt="System Architecture - four agents coordinated by ADK SequentialAgent" width="800" height="366"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;System Architecture - four agents coordinated by ADK's SequentialAgent&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The Build
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Week 1&lt;/strong&gt; was about proving the core pipeline. Day 1: can we get Gemini to generate story text, stream it over WebSocket, and render it in a flipbook? Day 2 brought the first big challenge, image generation: Imagen 3 produces stunning illustrations, but characters looked completely different across scenes. Day 3 was the Firebase integration marathon: auth, Firestore persistence, save flows, Library, URL routing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 2&lt;/strong&gt; was about solving character consistency (the hardest problem, described below), building Director Mode with live commentary, adding templates and art styles, and getting per-scene streaming working.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 3&lt;/strong&gt; was the Director Chat integration with the Gemini Live API, the safety and content filtering system, social features, multi-language support, Reading Mode, the CI/CD pipeline, and a lot of polish. The interaction-flow audit at the end caught 9 bugs that would have been embarrassing in production.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Biggest Challenge: Character Consistency
&lt;/h2&gt;

&lt;p&gt;This was the hardest technical problem I solved.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;The naive approach: send scene text to Gemini ("write an image prompt"), get a 100-word prompt, send to Imagen. Gemini would receive a scene about "Elena, a woman in her late 20s with pale skin, long dark wavy hair, green eyes, wearing a high-collar black Victorian dress" and compress it to "woman in dark dress." Characters changed faces, hair color, and outfits between every scene.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Fix: Hybrid Prompt Construction
&lt;/h3&gt;

&lt;p&gt;We split image prompt creation into four stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Character Sheet Extraction&lt;/strong&gt; (Gemini, temp 0.1): reads the full story and outputs structured character descriptions with hex color codes, face shapes, signature items, and dominant palette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Character Identification&lt;/strong&gt; (Gemini, temp 0.0): identifies which characters appear in each scene&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scene Composition&lt;/strong&gt; (Gemini, temp 0.3): writes ONLY setting, lighting, mood, and camera angle; it is explicitly told "do NOT describe characters"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assembly&lt;/strong&gt;: character descriptions + anti-drift anchor + scene composition + art style suffix, concatenated programmatically&lt;/li&gt;
&lt;/ol&gt;
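&lt;p&gt;Stage 4 is deliberately dumb: plain string concatenation with no model in the loop, so nothing can paraphrase the character sheet away. A sketch, with helper names assumed:&lt;/p&gt;

```python
ANTI_DRIFT_ANCHOR = "IMPORTANT: Render each character EXACTLY as described above."

def assemble_image_prompt(character_blocks, scene_composition, style_suffix):
    """Stage 4 of the hybrid pipeline: programmatic assembly of the
    stage-1 character sheets, the anti-drift anchor, the stage-3 scene
    composition, and the art-style suffix."""
    parts = list(character_blocks)
    parts.append(ANTI_DRIFT_ANCHOR)
    parts.append(scene_composition)
    parts.append(style_suffix)
    return "\n\n".join(p.strip() for p in parts if p.strip())
```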

&lt;p&gt;The final prompt sent to Imagen looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Elena: [gender: woman], [age: late 20s], [skin: pale ivory #F5E6D3],
[hair: dark wavy #2A1810 shoulder-length], [face: oval, green #4A7C59 eyes,
high cheekbones], [outfit: black #1A1A2E Victorian dress, silver moon pendant],
[signature items: silver moon pendant, lace gloves],
[palette: #1A1A2E, #F5E6D3, #4A7C59, #C0C0C0]

IMPORTANT: Render each character EXACTLY as described above.

Elena stands at the edge of a moonlit cliff, wind catching her dress.
Low angle, dramatic backlighting, cinematic digital painting,
highly detailed, dramatic volumetric lighting, depth of field.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The hex color codes give Imagen specific, unambiguous visual targets instead of subjective descriptions like "pretty woman in dark clothing."&lt;/p&gt;

&lt;h3&gt;
  
  
  Anchor Portraits and Visual DNA
&lt;/h3&gt;

&lt;p&gt;We pushed this further. Before generating any scene images, the Illustrator creates a 1:1 close-up portrait of each character via Imagen 3, then feeds it to Gemini Vision for &lt;strong&gt;visual DNA extraction&lt;/strong&gt;: a 100–150 word description of exactly what was rendered. Subsequent scene prompts reference this visual DNA instead of the original text description. Characters look recognizably like &lt;em&gt;themselves&lt;/em&gt; across every scene, because every prompt references a description derived from a real rendered image.&lt;/p&gt;




&lt;h2&gt;
  
  
  Director Chat: Talking to Your Story's AI Director
&lt;/h2&gt;

&lt;p&gt;This is the feature I'm most excited about. Director Chat is a real-time voice conversation with an AI Director character, built on the &lt;strong&gt;Gemini Live API&lt;/strong&gt; (&lt;code&gt;gemini-live-2.5-flash-native-audio&lt;/code&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start session&lt;/strong&gt;: Frontend sends story context. Backend opens a persistent bidirectional Gemini Live session with function calling, native audio transcription, and context window compression.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conversation&lt;/strong&gt;: User speaks. Web Audio's AnalyserNode detects 800ms of silence to auto-stop the recorder. Audio goes over WebSocket to the Live session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool-driven generation&lt;/strong&gt;: When the model decides brainstorming is done, it calls the &lt;code&gt;generate_story&lt;/code&gt; tool with a vivid prompt distilled from your conversation. No external classifier needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual fallback&lt;/strong&gt;: A "Suggest" button handles cases where tool calling doesn't fire in audio mode.&lt;/li&gt;
&lt;/ol&gt;
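&lt;p&gt;The silence detection in step 2 boils down to an amplitude threshold with a hold timer. The real code runs in the browser on AnalyserNode data; this Python sketch shows the same logic, with threshold and frame size as assumptions:&lt;/p&gt;

```python
import math

def rms(samples) -> float:
    """Root-mean-square amplitude of one audio frame."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def detect_auto_stop(frames, threshold=0.02, frame_ms=50, silence_ms=800):
    """Return the index of the frame at which 800 ms of continuous
    silence has elapsed (time to auto-stop the recorder), or None if
    speech is still ongoing. Any loud frame resets the silence timer."""
    needed = silence_ms // frame_ms   # consecutive quiet frames required
    quiet = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            quiet = 0
        else:
            quiet += 1
            if quiet >= needed:
                return i
    return None
```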

&lt;h3&gt;
  
  
  Zero Extra API Calls
&lt;/h3&gt;

&lt;p&gt;The previous architecture made 3–5 separate Gemini calls per interaction: one for conversation, one for user transcription, one for Director transcription, one for intent detection, one for prompt suggestion. The result was massive latency and API waste.&lt;/p&gt;

&lt;p&gt;The rewrite eliminated ALL extra calls using three native Live API features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Native transcription&lt;/strong&gt; (&lt;code&gt;input_audio_transcription&lt;/code&gt; / &lt;code&gt;output_audio_transcription&lt;/code&gt;): transcripts arrive in the receive stream. No separate STT calls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native function calling&lt;/strong&gt;: the model decides when to generate. Replaces the external intent classifier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context window compression&lt;/strong&gt;: sliding window handles long brainstorming sessions automatically.&lt;/li&gt;
&lt;/ul&gt;
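&lt;p&gt;Put together, the session options might look like this sketch. Field names follow the article's description; the exact SDK structure may differ:&lt;/p&gt;

```python
def build_live_config(tools):
    """Sketch of a Live session config enabling the three native
    features described above, plus function declarations for the
    Director's tools (e.g. generate_story)."""
    return {
        "response_modalities": ["AUDIO"],
        "input_audio_transcription": {},    # user transcript in the receive stream
        "output_audio_transcription": {},   # Director transcript, no separate STT call
        "context_window_compression": {"sliding_window": {}},
        "tools": tools,
    }
```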

&lt;h3&gt;
  
  
  Streaming Audio: Eliminating the "Thinking" Gap
&lt;/h3&gt;

&lt;p&gt;The original Director Chat had a noticeable delay: full audio had to be collected, encoded as WAV, and sent as a data URL. The fix: stream raw PCM chunks incrementally. A &lt;code&gt;useStreamingAudio&lt;/code&gt; hook feeds each chunk into Web Audio &lt;code&gt;AudioBufferSourceNode&lt;/code&gt; instances for gapless playback. The Director's voice starts within 200–400ms instead of 1–2 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Voice-Reactive Orb
&lt;/h3&gt;

&lt;p&gt;The voice orb is a canvas-based organic visualization: four overlapping soft blobs driven by real-time audio amplitude. Six visual modes (idle, recording, speaking, loading, watching, waiting) transition smoothly via per-frame color and speed lerping. Asymmetric smoothing (fast attack, slow decay) makes it feel alive.&lt;/p&gt;

&lt;p&gt;For accessibility, a text input mode lets users type messages to the Director instead of speaking.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F782b7xx1pd6ger5dn9rd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F782b7xx1pd6ger5dn9rd.jpg" alt="Director Chat - voice brainstorming with the AI Director" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Director Chat - voice brainstorming with the AI Director, then watching generation unfold&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx1s9rwtca0ne75m0ag9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx1s9rwtca0ne75m0ag9.jpg" alt="Director Chat architecture - Gemini Live API with native tool calling" width="800" height="352"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Director Chat architecture - Gemini Live API with native tool calling and transcription&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Per-Scene Streaming: Making It Feel Alive
&lt;/h2&gt;

&lt;p&gt;The original pipeline was batch-sequential: Narrator generates ALL text, then ALL images, then ALL audio. Users stared at a spinner for 15–30 seconds.&lt;/p&gt;

&lt;p&gt;The rewrite fires image, audio, and Director commentary tasks &lt;strong&gt;per-scene&lt;/strong&gt; as each scene's text completes. Scene 1's image paints in while Scene 2's text is still streaming.&lt;/p&gt;

&lt;p&gt;A module-level &lt;code&gt;asyncio.Semaphore(1)&lt;/code&gt; serializes Imagen calls for rate limiting, but they start as soon as each scene's text is ready. &lt;code&gt;handle_generate&lt;/code&gt; runs as &lt;code&gt;asyncio.create_task()&lt;/code&gt; so the WebSocket loop stays responsive; users can send steer messages ("make it scarier") during generation.&lt;/p&gt;
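&lt;p&gt;That pattern, reduced to its essentials (function names here are assumed, not Reveria's actual code):&lt;/p&gt;

```python
import asyncio

imagen_semaphore = asyncio.Semaphore(1)  # serialize Imagen calls for rate limiting

async def generate_scene_image(scene_id: int, render) -> str:
    # Tasks are created eagerly, but at most one Imagen call runs at a time.
    async with imagen_semaphore:
        return await render(scene_id)

async def pipeline(scene_ids, render):
    # Fire each image task as soon as its scene text is ready; the
    # enclosing event loop stays free to receive steer messages.
    tasks = [asyncio.create_task(generate_scene_image(s, render)) for s in scene_ids]
    return await asyncio.gather(*tasks)
```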

&lt;h3&gt;
  
  
  Director as Creative Partner
&lt;/h3&gt;

&lt;p&gt;The Director's live commentary includes a &lt;code&gt;suggestion&lt;/code&gt; field that proposes what should happen next. This is stored on shared state and prepended to the Narrator's input for the next scene. The Director doesn't just observe; it drives. It spots an opportunity ("Reveal that the stranger is her long-lost sister"), and the Narrator runs with it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxhmu549nz8c3n0d9u9x.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxhmu549nz8c3n0d9u9x.jpg" alt="Live story generation with Director analysis panel" width="800" height="504"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Story Generation - live text, image, and audio streaming with Director analysis panel&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Visual Narratives: Comics, Manga, and Webtoons
&lt;/h2&gt;

&lt;p&gt;Templates aren't skins. Each one reshapes the entire pipeline. A Manga template changes the scene composer to use character-dominant framing, activates the text-free image defense, adjusts TTS to narrate only overlay text, and shifts the Narrator toward visual storytelling.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Text-in-Image Problem
&lt;/h3&gt;

&lt;p&gt;Comic art styles triggered Imagen to render speech bubbles with garbled AI text. Our fix is a &lt;strong&gt;triple-layer defense&lt;/strong&gt;: a positive "Text-free panel art:" prefix at the start of the prompt (where attention weight is highest), explicit composer instructions, and negative constraints at the end. We learned the hard way that putting negative constraints first consumed Imagen's attention budget and degraded character consistency.&lt;/p&gt;
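&lt;p&gt;The ordering lesson can be captured in a few lines; the negative-suffix wording below is illustrative, not the production prompt:&lt;/p&gt;

```python
TEXT_FREE_PREFIX = "Text-free panel art:"
NEGATIVE_SUFFIX = "No speech bubbles, no captions, no lettering, no watermarks."

def build_comic_prompt(character_block: str, composition: str) -> str:
    """Order matters: the positive prefix sits where attention weight is
    highest, and the negative constraints go last so they cannot crowd
    out the character description in the middle."""
    return " ".join([TEXT_FREE_PREFIX, character_block, composition, NEGATIVE_SUFFIX])
```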




&lt;h2&gt;
  
  
  The UI: Glassmorphism Meets Interactive Fiction
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Cinematic Book Opening
&lt;/h3&gt;

&lt;p&gt;New stories trigger a choreographed entrance: the book materializes with a brightness bloom at 60%, then the cover flips open in an overlapping motion that starts at 350ms (before the entrance finishes). The overlap creates one fluid motion.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gemini Native Audio Narration
&lt;/h3&gt;

&lt;p&gt;We replaced Cloud TTS with Gemini's native audio output. The difference is striking: audiobook-quality narration that varies tone with mood instead of robotic voices. Reading Mode adds word-by-word karaoke highlighting synced to the audio.&lt;/p&gt;

&lt;h3&gt;
  
  
  Library and Social Features
&lt;/h3&gt;

&lt;p&gt;The Library renders 3D CSS book cards with perspective transforms, spine shadows, and page edges. Published stories get a BookDetailsPage with likes, star ratings, and threaded comments, all denormalized on the story document for zero pop-in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkf9gr1v371hw4oaadqf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkf9gr1v371hw4oaadqf.jpg" alt="Published story with characters, ratings, and social features" width="800" height="450"&gt;&lt;/a&gt;&lt;em&gt;Book Details - published story with characters, ratings, and social features&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Safety and Content Filtering
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Pre-pipeline&lt;/strong&gt;: A Gemini Flash classifier (temp 0, ~200ms) catches non-story prompts in any language, and it fails open on errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Post-generation&lt;/strong&gt;: Pattern matching in 6 languages for edge cases.&lt;/p&gt;

&lt;p&gt;For borderline content, the Narrator redirects in-character: &lt;em&gt;"That part of the library is forbidden! Let's explore this mysterious path instead..."&lt;/em&gt;&lt;/p&gt;
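&lt;p&gt;The fail-open behavior is the interesting design choice: a broken safety check should degrade to normal service, not an outage. A sketch, with the Gemini Flash call stubbed out as a &lt;code&gt;classify&lt;/code&gt; parameter:&lt;/p&gt;

```python
def is_story_prompt(prompt: str, classify) -> bool:
    """Gate the pipeline on a fast classifier, failing open on errors.
    `classify` stands in for the temp-0 Gemini Flash call and is
    expected to return "story" or "not_story" (assumed labels)."""
    try:
        return classify(prompt) == "story"
    except Exception:
        return True  # fail open: classifier outage must not block users
```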




&lt;h2&gt;
  
  
  Multi-Language Support
&lt;/h2&gt;

&lt;p&gt;Reveria generates stories in 8 languages: English, Spanish, French, German, Japanese, Hindi, Portuguese, and Chinese. Language flows through &lt;code&gt;SharedPipelineState&lt;/code&gt; and touches every agent: Narrator prompt, TTS voice selection, title generation, content filtering, and Director Chat personality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gemini Native Interleaved Output
&lt;/h3&gt;

&lt;p&gt;The primary generation path uses &lt;code&gt;response_modalities: ["TEXT", "IMAGE"]&lt;/code&gt;: Gemini generates text and images together in a single call. But &lt;strong&gt;Imagen 3 is always primary for images&lt;/strong&gt;. The Gemini native image is a tier-0 fallback when Imagen fails. Why? Character consistency: our full pipeline (character sheets, visual DNA, hybrid prompts) only works with Imagen.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cloud Infrastructure
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Run&lt;/strong&gt;: containerized FastAPI backend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Firebase Hosting&lt;/strong&gt;: React SPA frontend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Firestore&lt;/strong&gt;: story persistence, social features&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Cloud Storage&lt;/strong&gt;: scene images, covers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vertex AI&lt;/strong&gt;: Gemini 2.0 Flash, Imagen 3, Gemini Native Audio, Live API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Actions CI/CD&lt;/strong&gt;: 4 jobs (backend tests, frontend tests, Cloud Run deploy, Firebase deploy)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key resilience patterns: per-user circuit breaker for Imagen quota, retry utility with transient error classification, GCS signed URL fallback, atomic Firestore transactions for usage tracking, first-message WebSocket auth (no credentials in URLs).&lt;/p&gt;
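&lt;p&gt;The retry utility with transient-error classification might look like this sketch; the error markers and backoff values are assumptions, not Reveria's actual settings:&lt;/p&gt;

```python
import time

# Substrings that mark an error as transient and therefore retryable.
TRANSIENT_MARKERS = ("429", "deadline", "unavailable", "reset")

def is_transient(err: Exception) -> bool:
    msg = str(err).lower()
    return any(m in msg for m in TRANSIENT_MARKERS)

def with_retry(fn, attempts=3, base_delay=0.0):
    """Retry only errors classified as transient, with exponential
    backoff; permanent failures surface immediately."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as err:
            last = attempt == attempts - 1
            if last or not is_transient(err):
                raise
            time.sleep(base_delay * (2 ** attempt))
```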

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcuu95b9wmxuqav6x0zyc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcuu95b9wmxuqav6x0zyc.jpg" alt="Cloud Infrastructure - GCP deployment architecture" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Cloud Infrastructure - full GCP deployment architecture&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5z7eimlnqc90lj92yx0f.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5z7eimlnqc90lj92yx0f.jpg" alt="Explore - discover published stories from the community" width="800" height="504"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Explore - discover published stories from the community&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flw42fp3itqhv0mfc6p8q.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flw42fp3itqhv0mfc6p8q.jpg" alt="Subscription and Usage - Free, Standard, and Pro tiers" width="800" height="504"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Subscription &amp;amp; Usage - Free, Standard, and Pro tiers with usage tracking&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Prompt Engineering is Architecture.&lt;/strong&gt; When your prompt construction has four stages with different temperatures, it's not a template; it's a data pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Use Native API Features First.&lt;/strong&gt; Our Director Chat went from 3–5 Gemini calls per interaction to zero extra calls by enabling native transcription, function calling, and context compression.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Per-Scene is the Right Granularity.&lt;/strong&gt; Scene-level parallelism (fire tasks as each scene completes) makes the experience feel live. The UX improvement is dramatic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Make Agents Proactive, Not Just Reactive.&lt;/strong&gt; The Director started as a passive observer. The breakthrough was giving it a suggestion field that feeds the Narrator. A read-only analyst became an active creative partner.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Voice UX Needs Silence Detection.&lt;/strong&gt; Web Audio's AnalyserNode detects speech-to-silence transitions and auto-stops the recorder. One tap to start, zero taps after.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Flow Audits Find Crashes, Code Audits Find Patterns.&lt;/strong&gt; The critical bug: silently dropping a Gemini Live API tool call. The protocol requires a &lt;code&gt;FunctionResponse&lt;/code&gt; for every tool call. Dropping it corrupted the session permanently.&lt;/p&gt;
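&lt;p&gt;A defensive dispatcher that never drops a reply might look like this; the dict shapes are illustrative, not the Live API's exact types:&lt;/p&gt;

```python
def handle_tool_call(call: dict, tools: dict) -> dict:
    """Always produce a FunctionResponse, even for unknown tools or
    handler crashes: an unanswered tool call corrupts the session."""
    name = call.get("name", "")
    handler = tools.get(name)
    if handler is None:
        result = {"error": f"unknown tool: {name}"}
    else:
        try:
            result = handler(**call.get("args", {}))
        except Exception as err:
            result = {"error": str(err)}
    return {"function_response": {"name": name, "response": result}}
```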

&lt;p&gt;&lt;strong&gt;7. Templates Are Modes, Not Skins.&lt;/strong&gt; When a config option touches four pipeline stages, it's architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Character Consistency Requires Structural Solutions.&lt;/strong&gt; You can't prompt-engineer your way to consistent characters with a single call. Separate extraction from composition, use hex color codes, and anchor to rendered portraits via Gemini Vision.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: React + CSS (glassmorphism) + Vite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;: Python 3.12 + FastAPI + Uvicorn&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Framework&lt;/strong&gt;: Google ADK (SequentialAgent + ParallelAgent)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM&lt;/strong&gt;: Gemini 2.0 Flash via Vertex AI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interleaved Output&lt;/strong&gt;: Gemini native text+image (Imagen primary, Gemini fallback)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image Generation&lt;/strong&gt;: Imagen 3 via Vertex AI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Director Chat&lt;/strong&gt;: Gemini Live API (gemini-live-2.5-flash-native-audio)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice&lt;/strong&gt;: Web Audio API + Gemini Native Audio&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth&lt;/strong&gt;: Firebase Authentication (Google Sign-In)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database&lt;/strong&gt;: Cloud Firestore&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: Google Cloud Storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hosting&lt;/strong&gt;: Cloud Run + Firebase Hosting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD&lt;/strong&gt;: GitHub Actions (4-job pipeline)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Live app&lt;/strong&gt;: &lt;a href="https://reveria.web.app" rel="noopener noreferrer"&gt;reveria.web.app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source code&lt;/strong&gt;: &lt;a href="https://github.com/Dileep2896/reveria" rel="noopener noreferrer"&gt;github.com/Dileep2896/reveria&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built for the &lt;a href="https://devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt; hackathon (Creative Storyteller Track) using Google's AI technologies including Gemini 2.0 Flash, Imagen 3, Gemini Live API, Gemini Native Audio, and the Agent Development Kit (ADK).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#GeminiLiveAgentChallenge&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Describe a story. Watch it come alive.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>ai</category>
      <category>googlecloud</category>
      <category>gemini</category>
    </item>
  </channel>
</rss>
