<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: QAYS KADHIM</title>
    <description>The latest articles on Forem by QAYS KADHIM (@qays_kadhim_c3fea1c94957f).</description>
    <link>https://forem.com/qays_kadhim_c3fea1c94957f</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3800166%2Fad696a4b-647f-49b3-8be8-097a1c50aa14.jpg</url>
      <title>Forem: QAYS KADHIM</title>
      <link>https://forem.com/qays_kadhim_c3fea1c94957f</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/qays_kadhim_c3fea1c94957f"/>
    <language>en</language>
    <item>
      <title>I Tested My AI Ad Generator on 3 Completely Different Ad Formats — Here's What Actually Happened</title>
      <dc:creator>QAYS KADHIM</dc:creator>
      <pubDate>Tue, 03 Mar 2026 02:39:15 +0000</pubDate>
      <link>https://forem.com/qays_kadhim_c3fea1c94957f/i-tested-my-ai-ad-generator-on-3-completely-different-ad-formats-heres-what-actually-happened-1e0j</link>
      <guid>https://forem.com/qays_kadhim_c3fea1c94957f/i-tested-my-ai-ad-generator-on-3-completely-different-ad-formats-heres-what-actually-happened-1e0j</guid>
      <description>&lt;p&gt;I recently open-sourced &lt;a href="https://github.com/UrNas/advideo-creator" rel="noopener noreferrer"&gt;AdVideo Creator&lt;/a&gt;, a CLI tool that lets Claude generate complete video ads — script, images, voiceover, music, and final video — through a single prompt. In my &lt;a href="https://dev.to"&gt;first post&lt;/a&gt;, I walked through the architecture: 45 MCP tools, 5 quality gates, and a 15-step pipeline.&lt;/p&gt;

&lt;p&gt;The response was great. But one comment stuck with me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Would love to see a follow-up post benchmarking output quality across different ad formats."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Fair point. Architecture posts are nice, but what actually comes out the other end? So I picked 3 very different ad scenarios, ran them through the full pipeline, and recorded everything — scores, retries, failures, and the final videos.&lt;/p&gt;

&lt;p&gt;Here's what happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 3 Tests
&lt;/h2&gt;

&lt;p&gt;I deliberately chose formats that stress different parts of the pipeline:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Product Demo&lt;/th&gt;
&lt;th&gt;Storytelling&lt;/th&gt;
&lt;th&gt;CTA / Urgency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Product&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HydroSync (smart water bottle)&lt;/td&gt;
&lt;td&gt;Ember &amp;amp; Oak (coffee roastery)&lt;/td&gt;
&lt;td&gt;SkillSprint (online courses)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Template&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Product Demo (5 scenes)&lt;/td&gt;
&lt;td&gt;Storytelling (5 scenes)&lt;/td&gt;
&lt;td&gt;Countdown/Urgency (4 scenes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Platform&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;TikTok 1080×1920&lt;/td&gt;
&lt;td&gt;Instagram Reel 1080×1920&lt;/td&gt;
&lt;td&gt;Instagram Feed 1080×1080&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Duration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;15 seconds&lt;/td&gt;
&lt;td&gt;30 seconds&lt;/td&gt;
&lt;td&gt;15 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Image Style&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Photorealistic&lt;/td&gt;
&lt;td&gt;Watercolor&lt;/td&gt;
&lt;td&gt;Flat-design&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;English&lt;/td&gt;
&lt;td&gt;English&lt;/td&gt;
&lt;td&gt;Arabic (RTL)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Voice&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ElevenLabs Elli&lt;/td&gt;
&lt;td&gt;ElevenLabs Rachel&lt;/td&gt;
&lt;td&gt;ElevenLabs Adam&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each test uses a different template, platform, aspect ratio, image style, and voice. The Arabic test also throws RTL text rendering into the mix.&lt;/p&gt;




&lt;h2&gt;
  
  
  Test 1: Product Demo — HydroSync Smart Water Bottle
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The prompt:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Create a 15 second TikTok product demo ad for HydroSync — a smart water bottle that tracks your daily hydration and syncs with your phone app. Target audience is fitness-conscious millennials. Tone: energetic and modern.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was the smoothest run. The script passed on the first attempt at 8.05/10. Claude wrote a tight 5-scene structure: bold product reveal, two feature highlights (hydration tracking, phone sync), a lifestyle benefit shot, and a CTA.&lt;/p&gt;

&lt;p&gt;Image generation was fast — all 5 scenes generated via Replicate Flux Schnell in about 2 seconds each. The photorealistic style produced clean, product-shot-style images that scored 9.88/10 average. Voiceover landed at 9.67/10 on the first try.&lt;/p&gt;

&lt;p&gt;The final video exported at 14.4 seconds, 1080×1920, 12.9 MB. Hardware acceleration kicked in via Apple VideoToolbox.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The catch:&lt;/strong&gt; The pipeline hit the 20-tool round limit before it could add subtitles or run the final composition scoring. The video still exported fine — it just skipped those last two steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Product demos are the tool's sweet spot. Clear features, simple structure, photorealistic images — everything lines up.&lt;/p&gt;




&lt;h2&gt;
  
  
  Test 2: Storytelling — Ember &amp;amp; Oak Coffee Roastery
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The prompt:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Create a 30 second Instagram Reel storytelling ad for Ember &amp;amp; Oak, a small-batch coffee roastery that partners directly with farmers in Colombia. The story should follow a farmer's journey from harvest to your cup. Tone: warm, authentic, emotional.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where the self-grading system proved its value.&lt;/p&gt;

&lt;p&gt;The first script scored 7.7/10. The grading system flagged two specific problems: the hook was generic (7/10) and the CTA was weak (6/10). Claude rewrote the script. The new hook — a pattern interrupt about coffee traveling 3,000 miles — scored 9/10. The CTA got specific. Version 2 passed at 8.4/10.&lt;/p&gt;

&lt;p&gt;The watercolor image style was interesting. Four of the five scenes looked cohesive and atmospheric. Scene 2 (the discovery scene) scored lowest at 7.91 — its watercolor treatment was slightly less consistent with the other scenes. The average still held strong at 9.36/10.&lt;/p&gt;

&lt;p&gt;Voiceover had a hiccup. The first attempt ran 32.79 seconds — almost 3 seconds over the 30-second target. The quality gate caught it, auto-shortened the text, and the retry came in at 29.95 seconds with a 9.0/10 score.&lt;/p&gt;
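&lt;p&gt;The duration half of that gate is simple to picture. A minimal sketch of the pass/fail logic: the 5% tolerance is my assumption for illustration, not the tool's actual threshold.&lt;/p&gt;

```python
# Sketch of the duration check inside the voiceover gate: fail the
# attempt when it misses the target length, and report how much
# shorter the retry text should be. The 5% tolerance is an assumed
# value, not the tool's real setting.

def check_duration(actual_s, target_s, tolerance=0.05):
    """Return (passed, trim_ratio); trim_ratio is how much shorter
    the retry script should be, or 0.0 when within tolerance."""
    overshoot = actual_s / target_s - 1.0
    if abs(overshoot) > tolerance:
        return False, max(overshoot, 0.0)
    return True, 0.0

print(check_duration(32.79, 30.0))   # fails: about 9% over target
print(check_duration(29.95, 30.0))   # passes
```

&lt;p&gt;For the 32.79-second attempt against a 30-second target, this flags a roughly 9% overshoot, which is about how much the retry text was trimmed.&lt;/p&gt;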

&lt;p&gt;This was the only test where the full pipeline completed — including subtitles and composition scoring (8.35/10). The final video landed at exactly 30.0 seconds, 27.5 MB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Storytelling ads need more iteration, but the quality gates handle it. The self-grading loop catching the weak hook is exactly what you want from an automated system.&lt;/p&gt;




&lt;h2&gt;
  
  
  Test 3: CTA / Urgency — SkillSprint Flash Sale (Arabic)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The prompt:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Create a 15 second Instagram Feed ad in Arabic for SkillSprint — an online learning platform running a 48-hour flash sale with 60% off all courses. Target audience: Arabic-speaking young professionals. Tone: urgent and exciting.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was the hardest test by design — Arabic RTL, urgency template, square format, flat-design style. I wanted to push the tool.&lt;/p&gt;

&lt;p&gt;The script passed first try at 8.4/10. Urgency ads have a clear structure (limited offer → value → scarcity → CTA), and Claude wrote strong Arabic copy with the right energy.&lt;/p&gt;

&lt;p&gt;Then the voiceover became a challenge. Attempt 1 came back at 21.69 seconds — over 6 seconds too long for a 15-second ad. The quality gate caught it and auto-shortened. Attempt 2 scored 7.24/10 — below the 7.5 threshold due to pacing issues. Attempt 3 finally passed at 7.55/10 with 14.35 seconds duration.&lt;/p&gt;

&lt;p&gt;Three attempts for voiceover. That's the most retries across all tests.&lt;/p&gt;

&lt;p&gt;The cross-asset consistency check scored 6.45/10 — just below the 6.5 threshold. It flagged color palette variations between the flat-design scenes. The pipeline marked the result as needing review but continued with the export.&lt;/p&gt;

&lt;p&gt;The final video: 14.4 seconds, 1080×1080, 6.4 MB. RTL text overlays rendered correctly with &lt;code&gt;lang: ar&lt;/code&gt;. Arabic metadata and hashtags were generated automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Arabic ads work, but they're the hardest path. Voice generation needs more attempts, and flat-design consistency across scenes is trickier than photorealistic or watercolor. The pipeline handles it — it just works harder.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Numbers Side by Side
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Product Demo&lt;/th&gt;
&lt;th&gt;Storytelling&lt;/th&gt;
&lt;th&gt;CTA/Urgency (Arabic)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Script grade&lt;/td&gt;
&lt;td&gt;8.05/10&lt;/td&gt;
&lt;td&gt;8.4/10 (v2)&lt;/td&gt;
&lt;td&gt;8.4/10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Script iterations&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image quality (avg)&lt;/td&gt;
&lt;td&gt;9.88/10&lt;/td&gt;
&lt;td&gt;9.36/10&lt;/td&gt;
&lt;td&gt;8.99/10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voice quality&lt;/td&gt;
&lt;td&gt;9.67/10&lt;/td&gt;
&lt;td&gt;9.0/10&lt;/td&gt;
&lt;td&gt;7.55/10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voice retries&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Music quality&lt;/td&gt;
&lt;td&gt;8.1/10&lt;/td&gt;
&lt;td&gt;8.1/10&lt;/td&gt;
&lt;td&gt;7.98/10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Consistency score&lt;/td&gt;
&lt;td&gt;9.25/10&lt;/td&gt;
&lt;td&gt;7.65/10&lt;/td&gt;
&lt;td&gt;6.45/10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pipeline time&lt;/td&gt;
&lt;td&gt;~5 min&lt;/td&gt;
&lt;td&gt;~6 min&lt;/td&gt;
&lt;td&gt;~3 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File size&lt;/td&gt;
&lt;td&gt;12.9 MB&lt;/td&gt;
&lt;td&gt;27.5 MB&lt;/td&gt;
&lt;td&gt;6.4 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A clear pattern: simpler formats score higher, but the quality gates keep complex formats in check.&lt;/p&gt;




&lt;h2&gt;
  
  
  5 Things I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Self-grading is the most valuable feature.&lt;/strong&gt;&lt;br&gt;
The storytelling test proved it. A 7.7 script became an 8.4 script because the system knew the hook was weak. Without that feedback loop, the first draft would have gone straight to production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Voice generation is the bottleneck for non-English.&lt;/strong&gt;&lt;br&gt;
English voiceovers passed on the first try in both tests. Arabic needed 3 attempts. The issue is duration estimation — Arabic speech pacing differs from English, and the first-pass text is often too long. This is a clear area for improvement.&lt;/p&gt;
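&lt;p&gt;A pre-flight estimate would catch this before the first TTS call. Here's a sketch using the language-specific WPM ranges the grader already applies (English 130-170, Arabic 100-140, per the architecture post), taking each range's midpoint as the expected pace:&lt;/p&gt;

```python
# Pre-flight duration estimate from language-specific speaking rates.
# Ranges match the grader's WPM config (English 130-170, Arabic
# 100-140); the midpoint of each range stands in for the actual pace.

WPM_RANGES = {"en": (130, 170), "ar": (100, 140)}

def estimate_seconds(word_count, lang):
    low, high = WPM_RANGES[lang]
    return word_count / ((low + high) / 2) * 60.0

def max_words(target_seconds, lang):
    low, high = WPM_RANGES[lang]
    return int(target_seconds * ((low + high) / 2) / 60.0)

print(max_words(15, "ar"))   # 30 words max for a 15s Arabic ad
print(max_words(15, "en"))   # 37 for English at the same length
```

&lt;p&gt;Trimming the script to that budget before generating would likely land attempt 1 much closer to target.&lt;/p&gt;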

&lt;p&gt;&lt;strong&gt;3. Photorealistic is the easiest style for consistency.&lt;/strong&gt;&lt;br&gt;
The product demo scored 9.25 on consistency. Watercolor dropped to 7.65. Flat-design hit 6.45. Stylized images have more variance between scenes, which makes cross-scene consistency harder. A style-locking mechanism could help here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The tool limit is a real constraint.&lt;/strong&gt;&lt;br&gt;
Two of three tests hit the 20-tool round limit before completing subtitles and composition scoring. The videos still exported fine, but the pipeline should be optimized to fit within fewer tool calls — or the limit needs to increase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Every ad format exported a real video.&lt;/strong&gt;&lt;br&gt;
Despite all the retries and edge cases, every test produced a platform-ready video with correct specs. That's the baseline promise, and it held.&lt;/p&gt;


&lt;h2&gt;
  
  
  What I'd Improve Next
&lt;/h2&gt;

&lt;p&gt;Based on these tests, here's what's on the roadmap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Arabic voice calibration&lt;/strong&gt; — Pre-calculate duration estimates using Arabic-specific WPM ranges to reduce retries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Style consistency locking&lt;/strong&gt; — Extract color palette and visual parameters from Scene 1 and enforce them across all subsequent scenes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipeline optimization&lt;/strong&gt; — Reduce tool calls by batching operations (generate all images in one call, grade them in one call)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subtitle fallback&lt;/strong&gt; — Prioritize subtitle generation over composition scoring when approaching the tool limit&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The tool is open source. Pick one of these three prompts, run it, and see what comes out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/UrNas/advideo-creator.git
&lt;span class="nb"&gt;cd &lt;/span&gt;advideo-creator
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env  &lt;span class="c"&gt;# Add your API keys&lt;/span&gt;
uv run main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Star the repo if you find it useful. Open an issue if something breaks. PRs are welcome.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/UrNas/advideo-creator" rel="noopener noreferrer"&gt;GitHub: UrNas/advideo-creator&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is part 2 of a series on building AI-powered ad generation. Part 1 covered the architecture. Part 3 will go deeper on the quality gate system and how self-grading actually works under the hood.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>beginners</category>
      <category>python</category>
    </item>
    <item>
      <title>I Built an AI Video Ad Generator with Claude + MCP — Here's the Architecture</title>
      <dc:creator>QAYS KADHIM</dc:creator>
      <pubDate>Sun, 01 Mar 2026 16:38:50 +0000</pubDate>
      <link>https://forem.com/qays_kadhim_c3fea1c94957f/i-built-an-ai-video-ad-generator-with-claude-mcp-heres-the-architecture-1kei</link>
      <guid>https://forem.com/qays_kadhim_c3fea1c94957f/i-built-an-ai-video-ad-generator-with-claude-mcp-heres-the-architecture-1kei</guid>
      <description>&lt;p&gt;I wanted to see what happens when you give Claude real tools — not a weather API, not a todo app — but image generation, voice synthesis, video composition, and quality grading. Could it orchestrate a full creative pipeline from a single prompt?&lt;/p&gt;

&lt;p&gt;The result is &lt;strong&gt;AdVideo Creator&lt;/strong&gt;: an open-source CLI where you type "create a 15-second TikTok ad for artisan coffee" and get back a finished &lt;code&gt;.mp4&lt;/code&gt; file. Script, images, voiceover, music, transitions, subtitles — all generated and composed automatically.&lt;/p&gt;

&lt;p&gt;Here's how it works under the hood.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Claude Has No Hands
&lt;/h2&gt;

&lt;p&gt;Claude can write an excellent marketing script. Give it a product, a target audience, and a tone — it'll produce a hook, emotional beats, and a call to action that actually works.&lt;/p&gt;

&lt;p&gt;Then what?&lt;/p&gt;

&lt;p&gt;You still need images. A voiceover. Background music. Video editing. Platform-specific export. And if the script doesn't fit the timing after you lay it over the visuals, you go back to Claude, ask for a rewrite, and start the cycle again.&lt;/p&gt;

&lt;p&gt;This is the gap between "AI chatbot" and "AI application." Claude can &lt;em&gt;think&lt;/em&gt; about your ad, but it can't &lt;em&gt;make&lt;/em&gt; it. It has no hands.&lt;/p&gt;

&lt;h2&gt;
  
  
  Giving Claude Hands with MCP
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt; solves this. MCP is an open protocol that defines how AI models discover and use external tools. Think of it like HTTP but for AI capabilities — a standardized way for a client (the AI) and a server (the tools) to talk to each other.&lt;/p&gt;

&lt;p&gt;The architecture is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────┐
│           CLIENT (Python)           │
│  User ←→ Claude API ←→ Tool Router  │
└──────────────┬──────────────────────┘
               │ stdio (JSON-RPC)
┌──────────────┴──────────────────────┐
│           MCP SERVER (Python)       │
│  Image │ Voice │ Video │ Grading    │
│  Stock │ Brand │ Cache │ System     │
└─────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;client&lt;/strong&gt; handles conversation with Claude. The &lt;strong&gt;server&lt;/strong&gt; handles doing things — generating images, producing voiceover, composing video. They communicate through stdio using the MCP protocol.&lt;/p&gt;

&lt;p&gt;Claude never talks to the server directly. The client is always the intermediary: Claude decides what tools to call, the client routes those calls to the server, and the server executes them.&lt;/p&gt;
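&lt;p&gt;Stripped of the MCP SDK, the client's routing role reduces to a dispatch table. This sketch uses a simplified message shape and a made-up handler, not the project's actual code:&lt;/p&gt;

```python
# Stripped-down version of the client's routing step: Claude emits a
# tool call, the client looks up the matching handler and forwards
# the arguments. The real project speaks full MCP (JSON-RPC over
# stdio); handler and field names here are simplified placeholders.

import json

def handle_create_project(name, platform, duration):
    return {"project_id": f"{platform}-{name}", "duration": duration}

TOOL_HANDLERS = {"create_project": handle_create_project}

def route_tool_call(raw_request):
    """Dispatch one serialized tool call to its registered handler."""
    req = json.loads(raw_request)
    handler = TOOL_HANDLERS[req["name"]]
    return handler(**req["arguments"])

msg = json.dumps({"name": "create_project",
                  "arguments": {"name": "coffee-ad",
                                "platform": "tiktok",
                                "duration": 15}})
print(route_tool_call(msg))
```

&lt;p&gt;The real client does the same lookup-and-forward over stdio with the full MCP handshake around it; the dispatch step is the core of the intermediary role.&lt;/p&gt;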

&lt;h2&gt;
  
  
  The 15-Step Pipeline
&lt;/h2&gt;

&lt;p&gt;When you ask for an ad, Claude doesn't just run one tool. It orchestrates a 15-step pipeline, calling different tools at each stage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Brief → Template → Project → Script → Grade → Iterate → Save
  → Images (Gate 1) → Voiceover (Gate 2) → Music (Gate 3)
  → Consistency (Gate 4) → Compose (Gate 5)
  → Subtitles → Export → Deliver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical thing: &lt;strong&gt;Claude decides the order, not the code.&lt;/strong&gt; There's no hardcoded workflow. Claude sees all 45 tools and their descriptions, and it figures out which ones to call and when. The system prompt gives it a recommended pipeline, but Claude adapts — if the user imports their own images, it skips image generation. If they want stock footage instead of AI images, it searches Pexels.&lt;/p&gt;

&lt;p&gt;Here's what a real session looks like behind the scenes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "Create a 15s TikTok ad for artisan coffee beans"

Claude → create_project("coffee-ad", "tiktok", 15)
Claude → get_ad_template("problem-agitate-solve")
Claude → save_script(project_id, script_json)
Claude → search_stock_video("tired person morning")
Claude → use_stock_video(project_id, scene_0, video_id)
Claude → generate_scene_image(project_id, 1, "coffee bag close-up...")
Claude → evaluate_scene_image(project_id, 1)        ← Quality Gate
Claude → generate_scene_image(project_id, 2, "person smiling...")
Claude → evaluate_scene_image(project_id, 2)        ← Quality Gate
Claude → generate_voiceover(project_id, script_text)
Claude → evaluate_voiceover(project_id)              ← Quality Gate
Claude → generate_background_music(project_id, "energetic")
Claude → evaluate_background_music(project_id)       ← Quality Gate
Claude → evaluate_asset_consistency(project_id)       ← Quality Gate
Claude → compose_video(project_id, timeline)
Claude → evaluate_composition(project_id)             ← Quality Gate
Claude → add_subtitles(project_id, "word_highlight")
Claude → export_video(project_id, "tiktok")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's ~18 tool calls from a single user message. Each one goes through the MCP protocol: Claude emits a tool call → client routes to server → server executes → result goes back to Claude → Claude decides what's next.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Part I'm Most Proud Of: Quality Gates
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. Most AI pipelines generate output and hope for the best. AdVideo Creator has &lt;strong&gt;5 quality gates&lt;/strong&gt; that grade every generated asset automatically:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gate&lt;/th&gt;
&lt;th&gt;What It Checks&lt;/th&gt;
&lt;th&gt;Pass Threshold&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scene Image&lt;/td&gt;
&lt;td&gt;CLIP similarity to prompt, safe-zone compliance, framing&lt;/td&gt;
&lt;td&gt;7.0/10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voiceover&lt;/td&gt;
&lt;td&gt;Whisper transcription vs script, WPM pacing, duration fit&lt;/td&gt;
&lt;td&gt;7.5/10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Background Music&lt;/td&gt;
&lt;td&gt;BPM, duration match, loop quality, mix compatibility&lt;/td&gt;
&lt;td&gt;7.0/10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-Asset Consistency&lt;/td&gt;
&lt;td&gt;Color palette coherence, pacing alignment, energy match&lt;/td&gt;
&lt;td&gt;6.5/10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final Composition&lt;/td&gt;
&lt;td&gt;Duration accuracy, audio balance, platform spec compliance&lt;/td&gt;
&lt;td&gt;7.5/10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When an asset fails a gate, Claude retries — but not randomly. The system follows a &lt;strong&gt;drift prevention&lt;/strong&gt; rule: always retry from the &lt;em&gt;original&lt;/em&gt; parameters with a targeted fix, never modify the previous retry's parameters. This prevents the common problem where each retry drifts further from the creative direction.&lt;/p&gt;

&lt;p&gt;For images, the fix is additive — append a composition hint like "leave center space for text." For voiceover, it's subtractive — shorten the text if pacing is too fast. For music, it's a swap — try a different mood keyword. For consistency, it's surgical — only regenerate the outlier assets.&lt;/p&gt;
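&lt;p&gt;The rule itself fits in a few lines. A sketch of the additive image case, with hypothetical hint strings:&lt;/p&gt;

```python
# Drift-prevention retry: every attempt is built from the ORIGINAL
# prompt plus one targeted fix, never from the previous retry's
# result. Hint strings are illustrative, not the tool's actual text.

def build_retry_prompt(original_prompt, attempt, fix_hints):
    """Attempt 0 is the first try; each retry appends exactly one
    hint to the untouched original, so failures never compound."""
    if attempt == 0:
        return original_prompt
    hint = fix_hints[min(attempt, len(fix_hints)) - 1]
    return f"{original_prompt}. {hint}"

hints = ["Leave center space for text overlay",
         "Use a single consistent light source"]
p0 = build_retry_prompt("coffee bag close-up", 0, hints)
p1 = build_retry_prompt("coffee bag close-up", 1, hints)
p2 = build_retry_prompt("coffee bag close-up", 2, hints)
print(p2)   # built from the original prompt, not from p1
```

&lt;p&gt;Note that attempt 2 is derived from the original prompt, not from attempt 1, so a bad retry can never become the new baseline.&lt;/p&gt;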

&lt;p&gt;The graders themselves use real signal processing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Image grading&lt;/strong&gt;: CLIP similarity score between the prompt and generated image, plus safe-zone compliance checking that important content isn't cut off at platform edges&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voiceover grading&lt;/strong&gt;: Whisper transcription compared against the original script text, words-per-minute checking against language-specific ranges (English: 130-170 WPM, Arabic: 100-140 WPM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Music grading&lt;/strong&gt;: librosa for BPM extraction, pydub for loudness analysis and loop-point detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency grading&lt;/strong&gt;: K-means clustering on color palettes across all scene images, BPM-to-pacing correlation&lt;/li&gt;
&lt;/ul&gt;
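&lt;p&gt;To make the consistency idea concrete, here's a deliberately simplified stand-in for the palette check: mean scene color instead of K-means palettes, with a made-up distance threshold.&lt;/p&gt;

```python
# Simplified stand-in for the cross-scene palette check: instead of
# K-means palettes, compare each scene's mean RGB against the
# cross-scene average and penalize outliers. The max_dist threshold
# and 0-10 scaling are assumptions, not the grader's real values.

def mean_color(pixels):
    n = len(pixels)
    return tuple(sum(px[i] for px in pixels) / n for i in range(3))

def palette_consistency(scene_pixel_lists, max_dist=120.0):
    """Score 0-10: 10 means every scene's mean color matches the
    cross-scene average; 0 means an outlier is max_dist or more away."""
    means = [mean_color(p) for p in scene_pixel_lists]
    overall = mean_color(means)
    worst = 0.0
    for m in means:
        d = sum((a - b) ** 2 for a, b in zip(m, overall)) ** 0.5
        worst = max(worst, d)
    return round(10.0 * max(0.0, 1.0 - worst / max_dist), 2)

warm = [[(200, 150, 100), (210, 140, 90)]] * 4        # cohesive scenes
mixed = warm[:3] + [[(40, 60, 200), (50, 70, 210)]]   # one outlier
print(palette_consistency(warm), palette_consistency(mixed))
```

&lt;p&gt;Four cohesive warm scenes score a perfect 10.0; swap in one cool-blue outlier and the score collapses to 0.0.&lt;/p&gt;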

&lt;h2&gt;
  
  
  Script Self-Grading
&lt;/h2&gt;

&lt;p&gt;Before any assets are generated, Claude grades its own script on 6 marketing criteria:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;Weight&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hook Strength&lt;/td&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Emotional Appeal&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CTA Clarity&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audience Targeting&lt;/td&gt;
&lt;td&gt;15%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pacing &amp;amp; Flow&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memorability&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Scripts must score &lt;strong&gt;8.0/10&lt;/strong&gt; or higher. If they don't, Claude identifies the weakest criterion and rewrites targeting that specific weakness — up to 3 iterations. This means the script is already strong before the expensive image and voice generation starts.&lt;/p&gt;
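&lt;p&gt;The overall grade is a weighted sum. Applying the table's weights to an invented set of per-criterion scores:&lt;/p&gt;

```python
# Weighted overall grade from the six rubric criteria. Weights are
# the ones in the table above; the per-criterion scores are invented
# to illustrate the 8.0 pass threshold.

WEIGHTS = {"hook": 0.25, "emotion": 0.20, "cta": 0.20,
           "audience": 0.15, "pacing": 0.10, "memorability": 0.10}

def overall_score(scores):
    return round(sum(scores[k] * w for k, w in WEIGHTS.items()), 2)

def weakest_criterion(scores):
    """The rewrite targets whichever criterion scores lowest."""
    return min(scores, key=lambda k: scores[k])

draft = {"hook": 7, "emotion": 9, "cta": 6, "audience": 8,
         "pacing": 8, "memorability": 8}
print(overall_score(draft), weakest_criterion(draft))   # 7.55 cta
```

&lt;p&gt;A 7.55 fails the 8.0 gate, and &lt;code&gt;weakest_criterion&lt;/code&gt; tells the rewrite where to aim: here, the CTA.&lt;/p&gt;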

&lt;p&gt;The grading rubric lives in the MCP server as a resource (&lt;code&gt;config://grading-rubric&lt;/code&gt;), not hardcoded in the prompt. Claude reads it at runtime. This means you can modify the rubric without touching any code.&lt;/p&gt;

&lt;h2&gt;
  
  
  8 Ad Templates
&lt;/h2&gt;

&lt;p&gt;Claude doesn't write scripts from scratch — it uses proven frameworks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Problem-Agitate-Solve&lt;/strong&gt; — Hook with pain point, amplify the problem, reveal the solution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Before/After&lt;/strong&gt; — Show the transformation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testimonial&lt;/strong&gt; — Social proof format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product Demo&lt;/strong&gt; — Feature showcase&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trend Hijack&lt;/strong&gt; — Ride a current trend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Countdown/Urgency&lt;/strong&gt; — Limited time offers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storytelling&lt;/strong&gt; — Mini narrative arc&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UGC Style&lt;/strong&gt; — Raw, authentic feel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each template defines a scene structure — how many scenes, what each scene should contain, where the hook goes, where the CTA lands. Claude selects the best template for the product type and follows its structure while adapting the content.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Backend Architecture
&lt;/h2&gt;

&lt;p&gt;The tool has tiered fallbacks for each capability:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image generation:&lt;/strong&gt; Replicate (Flux Schnell, ~1-2s, ~$0.003/image) → HuggingFace (free, ~3-5s) → Local SDXL (free, requires GPU)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Voice synthesis:&lt;/strong&gt; ElevenLabs (ultra-natural, ~$0.06/ad) → OpenAI TTS (~$0.003/ad)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stock video:&lt;/strong&gt; Pexels API (free, 200 req/hour)&lt;/p&gt;

&lt;p&gt;The factory pattern makes this transparent — &lt;code&gt;create_image_engine()&lt;/code&gt; checks which API keys are available and returns the best backend. Add a new key to &lt;code&gt;.env&lt;/code&gt; and the entire pipeline upgrades automatically. Remove it and it gracefully falls back.&lt;/p&gt;
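&lt;p&gt;A sketch of that factory (class and environment-variable names here are placeholders, not the repo's actual identifiers):&lt;/p&gt;

```python
# Engine factory with tiered fallback: return the best backend the
# available API keys allow. Class and environment-variable names are
# placeholders, not the repo's actual identifiers.

import os

class ReplicateEngine:   name = "replicate"    # paid, fastest
class HuggingFaceEngine: name = "huggingface"  # free, slower
class LocalSDXLEngine:   name = "local-sdxl"   # free, needs a GPU

def create_image_engine(env=None):
    env = os.environ if env is None else env
    if env.get("REPLICATE_API_TOKEN"):
        return ReplicateEngine()
    if env.get("HF_TOKEN"):
        return HuggingFaceEngine()
    return LocalSDXLEngine()

print(create_image_engine({"HF_TOKEN": "hf_xxx"}).name)   # huggingface
print(create_image_engine({}).name)                       # local-sdxl
```

&lt;p&gt;Callers never know which backend they got; they just call the engine interface, which is what makes the upgrade-by-adding-a-key behavior work.&lt;/p&gt;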

&lt;p&gt;The minimum setup is just an Anthropic API key. Everything else is optional. You can generate a complete ad for as little as &lt;strong&gt;$0.01&lt;/strong&gt; (Anthropic only, no images/voice) or &lt;strong&gt;$0.10-$0.15&lt;/strong&gt; with all premium backends.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multilingual: Arabic RTL Support
&lt;/h2&gt;

&lt;p&gt;This was the hardest engineering challenge. The pipeline supports full Arabic ads with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RTL text rendering&lt;/strong&gt; — Pillow's HarfBuzz backend with Noto Sans Arabic font, automatic text reshaping&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-language voice defaults&lt;/strong&gt; — Arabic uses ElevenLabs &lt;code&gt;eleven_multilingual_v2&lt;/code&gt; with stability tuned to 0.50 (vs 0.35 default) for more consistent Arabic pronunciation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Language-aware grading&lt;/strong&gt; — Arabic has different WPM ranges (100-140 vs English's 130-170), and the voiceover grader normalizes Arabic text (strips tashkeel, normalizes hamza) before comparing against Whisper transcription&lt;/li&gt;
&lt;/ul&gt;
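&lt;p&gt;The normalization step in the last bullet is plain standard-library Python. This sketch covers only a minimal subset of the character ranges a full normalizer would handle:&lt;/p&gt;

```python
# Sketch of Arabic text normalization before comparing the script to
# the Whisper transcript: strip tashkeel (diacritics) and collapse
# alef/hamza variants so cosmetic spelling differences don't count
# as transcription errors. Character sets here are a minimal subset.

import re

TASHKEEL = re.compile("[\u064B-\u0652]")      # fathatan through sukun
ALEF_VARIANTS = str.maketrans("أإآ", "ااا")   # hamza forms to bare alef

def normalize_arabic(text):
    return TASHKEEL.sub("", text).translate(ALEF_VARIANTS)

print(normalize_arabic("الْقَهْوَة"))   # القهوة (diacritics removed)
```

&lt;p&gt;Without this step, a perfectly read voiceover can "fail" transcription comparison purely on diacritic and hamza spelling differences.&lt;/p&gt;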

&lt;p&gt;Not many AI tools handle RTL correctly. Getting Arabic subtitles to render properly over video, with the right font and correct text direction, required diving deep into Pillow's text rendering internals.&lt;/p&gt;

&lt;h2&gt;
  
  
  The MCP Server: 45 Tools, 12 Resources
&lt;/h2&gt;

&lt;p&gt;The server exposes everything through MCP's three primitives:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools (45)&lt;/strong&gt; — actions Claude can take. Project management, image generation, voice synthesis, video composition, quality grading, brand profiles, stock video search, asset import, cache management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources (12)&lt;/strong&gt; — read-only data Claude can access. Platform specs, style presets, grading rubrics, pricing info, voice catalogs, ad templates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompts (8)&lt;/strong&gt; — reusable instruction templates. The main system prompt with the 15-step workflow, the script grader, the asset grader with drift prevention rules.&lt;/p&gt;

&lt;p&gt;The key design decision: everything is discoverable at runtime. When the client connects, it calls &lt;code&gt;tools/list&lt;/code&gt; and gets back all 45 tools with their schemas. It calls &lt;code&gt;resources/list&lt;/code&gt; and gets all 12 resources. Claude sees everything and decides what to use. Add a new tool to the server? Claude picks it up on the next connection.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;Building this taught me patterns that apply to any AI application:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude is better at orchestration than you'd expect.&lt;/strong&gt; Given clear tool descriptions and a recommended workflow, Claude makes remarkably good decisions about which tools to call and in what order. The key is writing descriptive tool descriptions — Claude reads them carefully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quality gates change everything.&lt;/strong&gt; Without them, you get "generate and pray." With them, you get consistent, predictable output. The cost overhead is small (~5-10% of total pipeline cost for grading) and the quality improvement is significant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drift prevention matters.&lt;/strong&gt; When retrying failed generations, always go back to the original parameters and apply a targeted fix. Never modify the previous retry's parameters. This single rule eliminated most of my "the 3rd retry looks nothing like what was requested" problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP's separation of concerns pays off.&lt;/strong&gt; Building the server independently from the client made development much faster. I could test every tool with MCP Inspector (a web UI) without making a single Claude API call. And the same server works with Claude Desktop, no modifications needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;AdVideo Creator is MIT licensed and open source:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/UrNas/advideo-creator" rel="noopener noreferrer"&gt;github.com/UrNas/advideo-creator&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Minimum setup: Python 3.12+, FFmpeg, and an Anthropic API key. That's it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/UrNas/advideo-creator.git
&lt;span class="nb"&gt;cd &lt;/span&gt;advideo-creator
uv &lt;span class="nb"&gt;sync
cp&lt;/span&gt; .env.example .env    &lt;span class="c"&gt;# add your ANTHROPIC_API_KEY&lt;/span&gt;
uv run python main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add more API keys (Replicate, ElevenLabs, OpenAI, Pexels) to unlock premium features.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you're interested in learning how to build this kind of AI application from scratch — tool design, agentic loops, quality gates, engine abstractions — I'm working on a full course covering every module in detail. Star the repo and follow for updates.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>beginners</category>
      <category>python</category>
    </item>
  </channel>
</rss>
