If you’ve ever wanted to take control of Google Veo’s powerful video generation but felt boxed in by vague prompts, you’re not alone. Luckily, there’s a hack going around the creative corners of the internet that lets you fine-tune every single element of your video—using a clean JSON format.
Before we dive deep into crafting cinematic prompts with JSON, here’s a tip for devs building anything around video generation tools, APIs, or creative workflows: Apidog Docs is perfect for documenting and testing your API endpoints in one clean interface.
In this guide, we’ll break down what this JSON hack looks like, why it’s blowing up, and how you can use it to replicate cinematic aesthetics, lens types, wardrobe styles, ambient sound, and even tone of voice. Whether you’re building a fashion short film or an anime-inspired clip, this method gives you the building blocks.
What’s the Deal with the Veo JSON Hack?
Instead of feeding Veo 3 a vague block of text and hoping it gets it right, this JSON-format approach gives you something better: structure and control.
It’s like giving the AI a shot list and creative brief in one — and suddenly, your output starts to feel like it had a human director.
Here’s why this works:
Why JSON Makes Sense for Veo Prompts:
- Cleaner input: Each section of your idea (camera, subject, audio, lighting, etc.) is broken down clearly.
- Modular editing: Want to change the mood or location? Just tweak one section—no need to rewrite the whole thing.
- Cinematic control: You can define:
  - Lens type and film grain
  - Camera movement (e.g., Steadicam, handheld)
  - Ambient sound and vocal tone
  - Lighting style and time of day
  - Specific wardrobe and styling cues
- No surprises: Want no subtitles or overlays? Just say it outright in the `visual_rules` section.
What This Means for Creators:
- You're not guessing what Veo “might” generate anymore.
- You’re guiding the visuals like a director using a script.
- You can replicate or remix your style across scenes or projects.
So instead of hoping for great results, you’re engineering them—one field at a time.
Full JSON Example Breakdown
Let’s break down this example JSON block that generated a stylish Tokyo street-style morning scene:
```json
{
  "shot": {
    "composition": "Medium tracking shot, 50mm lens, shot on RED V-Raptor 8K with Netflix-approved HDR setup, shallow depth of field",
    "camera_motion": "smooth Steadicam walk-along, slight handheld bounce for naturalistic rhythm",
    "frame_rate": "24fps",
    "film_grain": "clean digital with film-emulated LUT for warmth and vibrancy"
  },
  "subject": {
    "description": "A young woman with a petite frame and soft porcelain complexion. She has oversized, almond-shaped eyes with long lashes, subtle pink-tinted cheeks, and a heart-shaped face. Her inky-black bob is slightly tousled and clipped to one side with a small red strawberry hairpin. Her style blends playful retro and modern Tokyo streetwear: she wears a crocheted ivory halter top with scalloped edges, high-waisted denim shorts with a wide brown belt and a red enamel star buckle, and a loose red gingham blouse draped off one shoulder. Her accessories include glossy cherry lip tint, a beaded bracelet stack, and soft shimmer eyeshadow.",
    "wardrobe": "Crocheted ivory halter with scalloped trim, fitted high-waisted denim shorts, wide tan belt with red enamel star buckle, oversized red gingham blouse slipped off one shoulder, strawberry hairpin in side-parted bob, and translucent plastic bead bracelets in pink and cream tones."
  },
  "scene": {
    "location": "a quiet urban street bathed in early morning sunlight",
    "time_of_day": "early morning",
    "environment": "empty sidewalks, golden sunlight reflecting off puddles and windows, occasional birds fluttering by, street slightly wet from overnight rain"
  },
  "visual_details": {
    "action": "she walks rhythmically down the sidewalk, swinging her hips slightly with the beat, one hand gesturing playfully, the other adjusting her shirt sleeve as she sings",
    "props": "morning mist, traffic light turning green in the distance, reflective puddles, subtle sun flare"
  },
  "cinematography": {
    "lighting": "natural golden-hour lighting with soft HDR bounce, gentle lens flare through morning haze",
    "tone": "playful, stylish, vibrant",
    "notes": "STRICTLY NO on-screen subtitles, lyrics, captions, or text overlays. Final render must be clean visual-only."
  },
  "audio": {
    "ambient": "city birds chirping, distant traffic hum, her boots tapping pavement",
    "voice": {
      "tone": "light, teasing, and melodic",
      "style": "pop-rap delivery in Japanese with flirtatious rhythm, confident breath control, playful pacing and bounce"
    },
    "lyrics": "ラーメンはもういらない、キャビアだけでいいの。 ファイナンスのおかげで、私、星みたいに輝いてる。"
  },
  "color_palette": "sun-warmed pastels with vibrant reds and denim blues, soft contrast with warm film LUT",
  "dialogue": {
    "character": "Woman (singing in Japanese)",
    "line": "ラーメンはもういらない、キャビアだけでいいの。 ファイナンスのおかげで、私、星みたいに輝いてる。",
    "subtitles": false
  },
  "visual_rules": {
    "prohibited_elements": [
      "subtitles",
      "captions",
      "karaoke-style lyrics",
      "text overlays",
      "lower thirds",
      "any written language appearing on screen"
    ]
  }
}
```
Rather than paste the entire block again, here’s what this structured prompt includes:
Shot
- Composition type (medium tracking shot, 50mm lens)
- Motion style (Steadicam, with a touch of handheld)
- Frame rate and LUT film grain
In short, you get full cinematographer-level control here.
Subject & Wardrobe
The description is highly detailed—down to accessories like strawberry hairpins and cherry lip gloss. The character is described in visual, tactile language that helps the AI model generate vivid results.
Scene & Environment
- Time of day: Early morning
- Atmosphere: Golden light, empty street, wet pavement
- It even includes birds and puddle reflections.
Visual Details & Props
- Physical actions like walking, singing, adjusting clothes
- Elements like sun flares and mist
- Props (traffic light in distance, puddles, etc.)
Lighting & Tone
Golden hour with HDR bounce and soft lens flares. Think soft, dreamy, but vibrant. It also sets the mood: “playful, stylish, vibrant.”
Audio & Lyrics
- Ambient audio: birds, distant cars, shoes tapping
- Voice tone: melodic, teasing, playful
- Lyrics in Japanese: flashy, finance-themed
No subtitles, no captions—this is a strict “visual-only” policy.
Why This Method Works
AI video generators like Veo thrive on structure. While most prompt-based tools respond to loose storytelling instructions, JSON gives your request:
- Clarity: No confusion about what goes where
- Control: Set every scene element like a director
- Reproducibility: You can tweak one part at a time
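That reproducibility is easy to see in code. Here's a minimal Python sketch, using a pared-down subset of the article's fields (the replacement values are just illustrative), showing how one field can change while everything else stays untouched:

```python
import json

# A pared-down version of the article's prompt (subset of fields).
prompt = {
    "scene": {"time_of_day": "early morning"},
    "cinematography": {
        "lighting": "natural golden-hour lighting",
        "tone": "playful, stylish, vibrant",
    },
}

# Tweak only the mood; the tone and any other fields stay untouched.
prompt["scene"]["time_of_day"] = "dusk"
prompt["cinematography"]["lighting"] = "cool blue-hour light with soft neon reflections"

# ensure_ascii=False keeps any non-ASCII text (like Japanese lyrics) readable.
print(json.dumps(prompt, ensure_ascii=False, indent=2))
```

Because each concern lives in its own key, you can diff two versions of a prompt and know exactly which creative choice changed between renders.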
Customize It for Your Own Videos
Want to use this format for your own project? Here’s a simple way to do it:
You can plug in your own style references, film gear, mood, and tone. The more specific, the better.
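One way to make that repeatable is a small builder function. This is purely a sketch in Python; the `build_veo_prompt` helper and its defaults are my own invention, not a Veo API. It just assembles the same field names the example above uses:

```python
import json

def build_veo_prompt(subject, location, time_of_day, lens="50mm lens",
                     tone="cinematic", no_text=True):
    """Assemble a Veo-style JSON prompt from a few high-level choices.

    The field names mirror the example in this article; the helper
    itself is illustrative, not an official Veo interface.
    """
    prompt = {
        "shot": {"composition": f"Medium shot, {lens}", "frame_rate": "24fps"},
        "subject": {"description": subject},
        "scene": {"location": location, "time_of_day": time_of_day},
        "cinematography": {"tone": tone},
    }
    if no_text:
        # Mirror the article's "visual-only" policy by default.
        prompt["visual_rules"] = {
            "prohibited_elements": ["subtitles", "captions", "text overlays"]
        }
    return json.dumps(prompt, ensure_ascii=False, indent=2)

print(build_veo_prompt(
    subject="a cyclist in a yellow raincoat",
    location="a rain-slicked canal street",
    time_of_day="dusk",
))
```

Swap the arguments and you get a fresh, fully structured prompt without rewriting the boilerplate each time.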
Tips to Nail the Perfect Veo JSON Prompt
- Stick to film language: Use words like “lens,” “frame rate,” “cinematic motion,” “bokeh,” etc.
- Describe subject like you’re painting: Facial structure, clothing texture, accessories
- Set tone with lighting and audio: Warm/cold, sharp/soft, ambient/clean
- Use verbs: Have your character walk, spin, sing, adjust, etc.
- Declare prohibited elements: as the example JSON does in `visual_rules`; leave them out and stray on-screen text can sneak into your render.
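If you want those tips to stick, you can run a quick sanity check before submitting a prompt. The `lint_prompt` function below is a sketch of my own, not an official validator; its checks simply mirror the tips above:

```python
def lint_prompt(prompt: dict) -> list[str]:
    """Return warnings for common Veo-prompt omissions.

    Illustrative only: the checks correspond to the tips in this
    article, not to any documented Veo validation rules.
    """
    warnings = []
    # Tip: stick to film language, e.g. a named lens.
    if "shot" not in prompt or "lens" not in prompt.get("shot", {}).get("composition", ""):
        warnings.append("No lens specified; add film language like '50mm lens'.")
    # Tip: declare prohibited elements explicitly.
    if "visual_rules" not in prompt:
        warnings.append("No visual_rules; on-screen text may sneak in.")
    # Tip: use verbs so your character actually does something.
    if not prompt.get("visual_details", {}).get("action"):
        warnings.append("No action verbs; give your character something to do.")
    return warnings

issues = lint_prompt({"shot": {"composition": "wide shot"}})
for issue in issues:
    print("-", issue)
```

Running it on a sparse prompt like the one above surfaces all three omissions at once, which is much faster than discovering them one render at a time.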
Before You Try It...
This method isn't "official," but it’s shockingly effective. Don’t be afraid to experiment. Start small—change the lighting, add props, or switch the scene—and compare the results. That’s where the magic happens.
If Google ever decides to expose a formal JSON interface, you’ll already be ahead of the game.
Why This Matters for Creators and Developers
Generative video tools like Veo 3 aren’t just about clicking “generate” and hoping for the best anymore. They’re evolving into precision instruments—and this JSON approach proves it. For creators, that means you don’t need to settle for generic outputs. With a structured format, you can dial in exactly what you want, from lens type to lighting mood, all the way to wardrobe details and ambient audio.
For developers, this opens up exciting possibilities:
- You could build custom prompt templates for different aesthetics.
- Automate prompt generation based on mood boards or UI inputs.
- Even integrate with APIs to create video production pipelines.
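As one concrete sketch of that second idea, a mood-board picker could map directly onto cinematography fields. The preset names and values here are invented for illustration:

```python
import json

# Hypothetical mapping from a UI mood picker to cinematography fields.
MOOD_PRESETS = {
    "dreamy": {"lighting": "soft golden-hour haze", "tone": "wistful, warm"},
    "noir":   {"lighting": "hard key light, deep shadows", "tone": "tense, moody"},
}

def prompt_from_mood(mood: str, subject: str) -> str:
    """Generate a Veo-style prompt from a mood keyword and a subject."""
    preset = MOOD_PRESETS[mood]
    prompt = {
        "subject": {"description": subject},
        "cinematography": preset,
    }
    return json.dumps(prompt, ensure_ascii=False, indent=2)

print(prompt_from_mood("noir", "a detective under a flickering streetlamp"))
```

From here it's a short step to wiring the output into a render queue or batch pipeline, since every prompt is just serializable data.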
It's like turning generative video into a programmable medium—and that’s a big deal. It means your creative vision doesn’t get lost in vague prompts. Instead, it’s translated clearly, line by line, into a stunning visual output.
This isn’t just a hack. It’s a new workflow. One that’s structured, repeatable, and tailored to your vision.
Final Thoughts
This JSON-style hack shows that cinematic video generation is entering its prompt-engineering era. With the right structure, you can make Veo 3 do things that feel hand-directed.
Whether you’re making moody cityscapes or fun music video snippets, the format is flexible enough to match your vision.
Let your JSON tell the story—and let your tools bring it to life.
Top comments (11)
Great guide! Never thought of this method. Good work Emmanuel!
You are welcome Gary. It's quite a hidden gem.
Impressive work! I've been watching so many Google Veo videos in various styles, and now I know how they cooked!
Good job!
Glad you find it helpful. That's awesome.😎
Nice one Emmanuel
I'll definitely try it out
This is extremely impressive, honestly. I've wasted too much time on scattered prompts that never quite deliver - having this kind of control with JSON feels like finally getting to actually direct instead of just hoping for the best
So hey... am I the only one who didn't actually see a JSON example? Like, it's mentioned, a lot, and never actually shown. Yet none of the other commenters noticed this, so maybe it's just me.
Hey Raymond, thanks for pointing that out. I might have mistakenly edited that part out. It's now added back.
Cool tips! Do you have a prompt or n8n workflow that can automate this process?
Yes! 🔥 And definitely keep an eye out for my next post — I’ll be sharing more on that soon.
Cool! I'm doing something similar with Sora using a DSL based on Lisp S-expressions mixed with JSON (why? vibe coding xD). Messing with functions and parameter rules; currently settled on PascalCase but about to move to NLP for the parameters.
```
SceneID("scene_id", Duration=5s) {
  Environment(Room=..., Style=..., Time=..., Lighting(...))
  Characters(Main=..., BodyState=..., Gesture=..., EyeContact=...)
  Props(...)
  Camera(Angle=..., Motion=..., Framing=...)
  Emotion(Outer=..., Inner=...)
  FocalPoint("...")
  Rhythm(Beat=..., Cut=...)
  SceneArc(Action=..., Outcome=...)
  Echo("Previous scene_id")
}
```