<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Usman Mehfooz</title>
    <description>The latest articles on Forem by Usman Mehfooz (@firevibe).</description>
    <link>https://forem.com/firevibe</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3493456%2Feea1ad35-3f6e-4ae8-95fe-7b652cc7749b.png</url>
      <title>Forem: Usman Mehfooz</title>
      <link>https://forem.com/firevibe</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/firevibe"/>
    <language>en</language>
    <item>
      <title>SelfSprite Maze Game: Take a Selfie and Become an Animated Playable Character in the Game (+ GIF Anything)</title>
      <dc:creator>Usman Mehfooz</dc:creator>
      <pubDate>Sun, 14 Sep 2025 19:03:32 +0000</pubDate>
      <link>https://forem.com/firevibe/selfsprite-maze-game-take-a-selfie-and-become-an-animated-playable-character-in-the-game-3h2h</link>
      <guid>https://forem.com/firevibe/selfsprite-maze-game-take-a-selfie-and-become-an-animated-playable-character-in-the-game-3h2h</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-ai-studio-2025-09-03"&gt;Google AI Studio Multimodal Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjy2eu9pvrmrtuwnuntd.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjy2eu9pvrmrtuwnuntd.gif" alt="Swat Kats"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Demo Video&lt;br&gt;
  &lt;iframe src="https://www.youtube.com/embed/HjY5rvxPi0s"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;A live demo of the applet is right here: &lt;a href="https://selfsprite-maze-110292485924.us-west1.run.app" rel="noopener noreferrer"&gt;Selfsprite Maze Demo&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;Generic gaming avatars are dead. I built &lt;strong&gt;Selfsprite Maze&lt;/strong&gt; to fix the disconnect between player and character.&lt;/p&gt;

&lt;p&gt;It's a retro game that uses multimodal GenAI to rip your actual face from a selfie and mint a custom, animated 8-bit sprite. &lt;strong&gt;You are the hero.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The gameplay loop is brutally simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📸 &lt;strong&gt;Create Your Hero:&lt;/strong&gt; Snap a selfie, pick a class like 'Wizard' or 'Cyberpunk', and the AI spits out a personalized sprite sheet. Done in seconds.&lt;/li&gt;
&lt;li&gt;😈 &lt;strong&gt;Design Your Enemy:&lt;/strong&gt; Here's the twist. Run the process again, but this time you're creating the enemy guards. Now you can literally fight your friends, a celebrity, or a weird alternate-reality version of yourself.&lt;/li&gt;
&lt;li&gt;🏃 &lt;strong&gt;Escape the Maze:&lt;/strong&gt; You're dropped into a procedurally generated maze. The goal? Hit the exit. The problem? The guards you just made are running pathfinding algorithms to hunt you down. No pressure.&lt;/li&gt;
&lt;/ul&gt;
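The procedurally generated maze from the loop above can be sketched with a classic recursive backtracker. This is a minimal illustrative sketch, not the game's actual generator; the grid encoding (0 = floor, 1 = wall) and function name are assumptions.

```typescript
// Illustrative recursive-backtracker maze generator (not the game's actual code).
// Produces a grid of 0 = floor, 1 = wall; odd width/height give the cleanest mazes.
function generateMaze(width: number, height: number): number[][] {
  const grid: number[][] = Array.from({ length: height }, () => Array(width).fill(1));
  const carve = (r: number, c: number): void => {
    grid[r][c] = 0;
    // Visit neighbours two cells away in random order, knocking down the wall between.
    const dirs = [[-2, 0], [2, 0], [0, -2], [0, 2]].sort(() => Math.random() - 0.5);
    for (const [dr, dc] of dirs) {
      const nr = r + dr, nc = c + dc;
      if (nr > 0 && nr < height - 1 && nc > 0 && nc < width - 1 && grid[nr][nc] === 1) {
        grid[r + dr / 2][c + dc / 2] = 0; // open the wall between the two cells
        carve(nr, nc);
      }
    }
  };
  carve(1, 1);
  return grid;
}
```

Because each carve step only connects an unvisited cell, the result is a perfect maze: exactly one path between any two floor cells, so every run feels different.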

&lt;p&gt;The game itself is a creative engine.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Infinite Replayability:&lt;/strong&gt; AI-driven level-gen means you'll never play the same maze twice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smart Enemies:&lt;/strong&gt; Guards use line-of-sight and A* pathfinding. They aren't stormtroopers; they will find you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Digital Swag:&lt;/strong&gt; Beat the level and you get to download your character as a high-quality GIF and PNG frames. Your new profile pic is waiting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline Mode:&lt;/strong&gt; Upload an existing sprite sheet to save API calls, or generate sprite sheets for free in Google AI Studio.&lt;/li&gt;
&lt;/ul&gt;
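The guard behaviour described above boils down to grid pathfinding. Here is a minimal A* sketch under the same assumptions (0 = floor, 1 = wall); the type and function names are hypothetical, not the game's source.

```typescript
// Illustrative A* pathfinding for the guards on a grid of 0 = floor, 1 = wall.
type Cell = { r: number; c: number };

function findPath(grid: number[][], start: Cell, goal: Cell): Cell[] | null {
  const rows = grid.length, cols = grid[0].length;
  const key = (p: Cell) => p.r * cols + p.c;
  const h = (p: Cell) => Math.abs(p.r - goal.r) + Math.abs(p.c - goal.c); // Manhattan heuristic
  const open: { p: Cell; g: number; f: number }[] = [{ p: start, g: 0, f: h(start) }];
  const cameFrom = new Map<number, Cell>();
  const gScore = new Map<number, number>([[key(start), 0]]);

  while (open.length > 0) {
    open.sort((a, b) => a.f - b.f); // a real game would use a binary heap here
    const { p, g } = open.shift()!;
    if (p.r === goal.r && p.c === goal.c) {
      const path: Cell[] = [p]; // walk the cameFrom links back to the start
      let k = key(p);
      while (cameFrom.has(k)) {
        const prev = cameFrom.get(k)!;
        path.unshift(prev);
        k = key(prev);
      }
      return path;
    }
    for (const [dr, dc] of [[-1, 0], [1, 0], [0, -1], [0, 1]]) {
      const n = { r: p.r + dr, c: p.c + dc };
      if (n.r < 0 || n.r >= rows || n.c < 0 || n.c >= cols || grid[n.r][n.c] === 1) continue;
      const tentative = g + 1;
      const known = gScore.get(key(n));
      if (known === undefined || tentative < known) {
        gScore.set(key(n), tentative);
        cameFrom.set(key(n), p);
        open.push({ p: n, g: tentative, f: tentative + h(n) });
      }
    }
  }
  return null; // goal unreachable
}
```

A guard would call this each time it gains line-of-sight to the player and then follow the returned cells.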

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;The live demo below unfortunately runs without a paid API key. For testing, please use the upload option in the game with a sprite sheet generated for free directly in Google AI Studio; no API key is needed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://selfsprite-maze-110292485924.us-west1.run.app" rel="noopener noreferrer"&gt;SelfSprite Maze Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F071s4s6hgdga0o79rqng.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F071s4s6hgdga0o79rqng.gif" alt="Boxer Trump"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F46ufgxcfqfr5cxroakl1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F46ufgxcfqfr5cxroakl1.png" alt="Main Screen"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feo8asilpne59unci8a6a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feo8asilpne59unci8a6a.png" alt="Generated Sprites"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9hgc4l20f4rb96ik9jhz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9hgc4l20f4rb96ik9jhz.png" alt="Maze"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8ju8p42x1fq4u7ucsqe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8ju8p42x1fq4u7ucsqe.png" alt="Download Gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Demo Video&lt;br&gt;
  &lt;iframe src="https://www.youtube.com/embed/HjY5rvxPi0s"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Intro:&lt;/strong&gt; Kicks off with a retro instruction screen. You'll know how to play in 10 seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Character Creation:&lt;/strong&gt; Watch the full flow: snap a selfie, select the 'character' class, and fire it off to the AI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Motion Creation:&lt;/strong&gt; The same flow for movement: select the 'motion' class and fire it off to the AI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Generation:&lt;/strong&gt; The AI-generated sprite sheet appears and gets sliced into animation frames on the fly. Pure visual feedback.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Showcase:&lt;/strong&gt; A classic "VS" screen showcases your hero and the enemy you designed, building hype for the showdown.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Gameplay:&lt;/strong&gt; The real deal. My sprite navigating a maze, getting spotted, and a tense chase kicking off with the AI-driven guards.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Victory &amp;amp; Download:&lt;/strong&gt; Make it to the exit, get the "Level Complete" screen, and one-click download a ZIP file with your GIF and all the frames. Ship it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How I Used Google AI Studio
&lt;/h2&gt;

&lt;p&gt;This entire project runs on Google AI. I used AI Studio for rapid-fire prompt engineering, and the final build uses the &lt;code&gt;@google/genai&lt;/code&gt; SDK exclusively. The magic is in how it orchestrates two different Gemini models.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;gemini-2.5-flash-image-preview&lt;/code&gt; (Nano Banana) for Multimodal Generation:&lt;/strong&gt; This is the creative engine. It takes two inputs—an &lt;strong&gt;image prompt&lt;/strong&gt; (your selfie) and a &lt;strong&gt;text prompt&lt;/strong&gt; (my instructions for style, class, etc.)—and fuses them into a brand new sprite sheet. This image-plus-text-to-image pipeline is the core feature.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision-Based Analysis with &lt;code&gt;gemini-2.5-flash&lt;/code&gt;:&lt;/strong&gt; After an image is generated, I need its grid dimensions. Instead of guessing, I just show the image back to a vision model and ask, "How many columns and rows?" I use &lt;code&gt;responseSchema&lt;/code&gt; to force the output into a clean JSON object. The AI becomes a reliable, automated data-processing tool.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Multimodal Features
&lt;/h2&gt;

&lt;p&gt;Multimodal isn't just a feature here; it's the entire foundation of the app.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deep Personalization through Image-to-Image Transformation:&lt;/strong&gt; This isn't just &lt;code&gt;text-to-image&lt;/code&gt;; it's &lt;code&gt;image-plus-text-to-image&lt;/code&gt;. The user's photo is the actual foundational reference, not just a loose inspiration. Seeing an animated 8-bit version of &lt;em&gt;yourself&lt;/em&gt; being chased through a dungeon hits different than playing as a generic knight.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision-Powered Automation:&lt;/strong&gt; I built a closed-loop pipeline. The AI generates a creative asset, and then another AI analyzes that asset to provide the technical data needed for the next step (Image -&amp;gt; JSON). It bridges the gap between the creative and the technical, making a complex process feel instant and overcoming hallucination limitations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creativity as a Gameplay Mechanic:&lt;/strong&gt; The AI is so fast that the creation process &lt;em&gt;is&lt;/em&gt; part of the gameplay. The user is both the hero designer and the monster designer. This dual role is a novel gameplay loop that's only possible with powerful and flexible multimodal AI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhsj8e22rwlfyoenkhkx.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhsj8e22rwlfyoenkhkx.gif" alt="Grumpy Cat "&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>googleaichallenge</category>
      <category>ai</category>
      <category>gemini</category>
    </item>
    <item>
      <title>Architecting a Generative AI Pipeline for Automated Sprite Sheet Creation for Animation</title>
      <dc:creator>Usman Mehfooz</dc:creator>
      <pubDate>Thu, 11 Sep 2025 16:39:37 +0000</pubDate>
      <link>https://forem.com/firevibe/architecting-a-generative-ai-pipeline-for-automated-sprite-sheet-creation-3877</link>
      <guid>https://forem.com/firevibe/architecting-a-generative-ai-pipeline-for-automated-sprite-sheet-creation-3877</guid>
      <description>&lt;h3&gt;
  
  
  The Engineering Challenge of Creative Scale
&lt;/h3&gt;

&lt;p&gt;If you've ever delved into game development, you know the drill. Character sprites—those tiny, animated heroes and villains—are a massive investment of time and artistic skill. It's a classic creative bottleneck: a single walking animation can demand dozens of individual frames, each needing to be drawn with perfect consistency.&lt;/p&gt;

&lt;p&gt;I wanted to solve this problem by automating the most painful part of the process. This post isn't a conceptual overview; it's a detailed technical blueprint for building a generative AI pipeline that takes a single character image and programmatically generates a full &lt;strong&gt;16-frame animated sprite sheet&lt;/strong&gt;. &lt;br&gt;
With &lt;strong&gt;nano banana&lt;/strong&gt;, the latest vision model from Google AI, this is now quite doable in an automated pipeline.&lt;br&gt;
We'll cover the tech stack, the system architecture, and provide code-level insights into the backend logic that orchestrates this powerful multimodal workflow.&lt;/p&gt;


&lt;h3&gt;
  
  
  The Core Architecture: A System Overview
&lt;/h3&gt;

&lt;p&gt;To build a robust and scalable application, you have to decouple your concerns. My system is broken down into four primary components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend Client:&lt;/strong&gt; A web UI (React/Next.js) for uploading the source image and displaying the final grid of generated sprites.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend API Service:&lt;/strong&gt; The central orchestrator (Node.js/Cloud Run). This is the brain that manages the entire workflow, stores files, makes parallel calls to the AI model, and processes the results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Storage:&lt;/strong&gt; A scalable object storage service like Google Cloud Storage (GCS) to hold the source image and generated frames.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Model Service:&lt;/strong&gt; The external API for the generative model, which in this case is Google's Gemini via Vertex AI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The data flow is orchestrated entirely by our backend:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;[Frontend Client]&lt;/code&gt; --(Uploads Image)--&amp;gt; &lt;code&gt;[Backend API]&lt;/code&gt; --&amp;gt; &lt;code&gt;[Google Cloud Storage]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;[Backend API]&lt;/code&gt; --(Triggers 16x API calls w/ GCS URI + Prompts)--&amp;gt; &lt;code&gt;[Vertex AI Gemini API]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;[Vertex AI Gemini API]&lt;/code&gt; --(Returns 16x Generated Images)--&amp;gt; &lt;code&gt;[Backend API]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;[Backend API]&lt;/code&gt; --(Saves Images to GCS &amp;amp; Returns URLs)--&amp;gt; &lt;code&gt;[Frontend Client]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This decoupled architecture ensures that each component can be scaled and maintained independently.&lt;/p&gt;


&lt;h3&gt;
  
  
  The Tech Stack in Detail
&lt;/h3&gt;

&lt;p&gt;Choosing the right tools is critical for a project like this. Here’s a recommended stack for the pipeline:&lt;/p&gt;
&lt;h4&gt;
  
  
  Frontend
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Framework:&lt;/strong&gt; &lt;strong&gt;Next.js 14&lt;/strong&gt;. Its integrated API routes provide a simple way to build the backend logic, making it a great choice for a full-stack application.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UI/Styling:&lt;/strong&gt; &lt;strong&gt;Tailwind CSS&lt;/strong&gt; with a component library like &lt;strong&gt;Shadcn/ui&lt;/strong&gt; for building a clean UI quickly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Fetching:&lt;/strong&gt; &lt;strong&gt;React Query (TanStack Query)&lt;/strong&gt; is ideal for managing the asynchronous state of the generation process (loading, errors, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File Uploads:&lt;/strong&gt; &lt;strong&gt;React-Dropzone&lt;/strong&gt; for a clean, accessible drag-and-drop interface.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Backend &amp;amp; Deployment
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runtime &amp;amp; Language:&lt;/strong&gt; &lt;strong&gt;Node.js&lt;/strong&gt; with &lt;strong&gt;TypeScript&lt;/strong&gt;. Type safety is invaluable when dealing with API contracts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment Environment:&lt;/strong&gt; &lt;strong&gt;Google Cloud Run&lt;/strong&gt;. Deploying the Next.js app in a Docker container on Cloud Run provides exceptional scalability, including the ability to scale to zero when not in use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image Processing:&lt;/strong&gt; &lt;strong&gt;Sharp&lt;/strong&gt;. A high-performance Node.js library for stitching the final frames into a single sprite sheet on the backend.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Cloud Services &amp;amp; AI
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Storage:&lt;/strong&gt; &lt;strong&gt;Google Cloud Storage (GCS)&lt;/strong&gt;. Its tight integration with other Google Cloud services allows us to directly reference GCS objects in our Vertex AI calls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI SDK:&lt;/strong&gt; &lt;strong&gt;Google's Vertex AI SDK for Node.js (&lt;code&gt;@google-cloud/vertexai&lt;/code&gt;)&lt;/strong&gt;. This is the official way to interact with Gemini models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Model:&lt;/strong&gt; The &lt;code&gt;gemini-2.5-flash-image-preview&lt;/code&gt; model, a new model specifically for image editing that Google has nicknamed "nano banana." Its multimodal capabilities, speed, and cost-effectiveness make it the perfect fit for this project.&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Backend Logic: A Code-Level Deep Dive
&lt;/h3&gt;

&lt;p&gt;This is the heart of the system. Let's walk through the backend orchestration, which would live inside a Next.js API route (e.g., &lt;code&gt;src/app/api/generate-sprites/route.ts&lt;/code&gt;).&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 1: The API Endpoint and File Upload
&lt;/h4&gt;

&lt;p&gt;The endpoint must handle &lt;code&gt;multipart/form-data&lt;/code&gt;. The Next.js &lt;code&gt;req.formData()&lt;/code&gt; method makes this straightforward.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/app/api/generate-sprites/route.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;NextRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;NextResponse&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;next/server&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Storage&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@google-cloud/storage&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;VertexAI&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@google-cloud/aiplatform&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;POST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;NextRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;formData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;formData&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;formData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;file&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;File&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;NextResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;No file provided.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arrayBuffer&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
    &lt;span class="c1"&gt;// ... rest of the logic follows&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 2: Uploading the Source Image to GCS
&lt;/h4&gt;

&lt;p&gt;We must store the source image in GCS so the Gemini API can access it directly via its URI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ... inside the POST function&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;storage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Storage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;your-gcp-project-id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;your-gcs-bucket-name&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fileName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`uploads/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;gcsFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fileName&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;gcsFile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;contentType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;gcsUri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`gs://&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;fileName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 3: Orchestrating the 16 Generative Calls
&lt;/h4&gt;

&lt;p&gt;For maximum efficiency, we use &lt;code&gt;Promise.all&lt;/code&gt; to fire off all 16 requests to the Vertex AI API in parallel. The key is to define a suite of prompts, each describing a specific frame in the animation sequence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prompts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Using the character from the image, generate a full-body sprite of them walking forward, towards the camera...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// ... add all 15 other detailed prompts here&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;vertexAI&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;VertexAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;your-gcp-project-id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;us-central1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;generativeModel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;vertexAI&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getGenerativeModel&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gemini-2.5-flash-image-preview&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;generationPromises&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="c1"&gt;// Reference the GCS file directly&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;fileData&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;mimeType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;fileUri&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;gcsUri&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;generativeModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;responses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;generationPromises&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 4: Processing Responses and Saving Generated Frames
&lt;/h4&gt;

&lt;p&gt;The responses will contain the generated image data as a base64 string. We decode this, convert it to a buffer, and upload it back to GCS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;generatedImageUrls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;frameCounter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;base64Data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;fileData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;imageBuffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;base64Data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;base64&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;outputFileName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`generated/sprite-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;frameCounter&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.png`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;outputFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;outputFileName&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;outputFile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;imageBuffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;contentType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;image/png&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Create a signed URL for the frontend to access the image&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;publicUrl&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;outputFile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getSignedUrl&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; 
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;read&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="na"&gt;expires&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;2026-09-12&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; 
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nx"&gt;generatedImageUrls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;publicUrl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Finally, return the array of URLs to the client&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;NextResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;generatedImageUrls&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
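&lt;p&gt;One caveat with the snippet above: the hardcoded &lt;code&gt;expires: '2026-09-12'&lt;/code&gt; will silently stop working once that date passes. A sketch of a relative expiry instead, assuming the same &lt;code&gt;getSignedUrl&lt;/code&gt; call from &lt;code&gt;@google-cloud/storage&lt;/code&gt;, which also accepts a millisecond timestamp:&lt;/p&gt;

```typescript
// Build signed-URL options with an expiry relative to "now" rather than a fixed date.
// Seven days is an illustrative choice; V4 signed URLs cap out at seven days.
const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

function signedUrlOptions(now: number = Date.now()) {
  return {
    version: 'v4' as const,
    action: 'read' as const,
    expires: now + SEVEN_DAYS_MS, // millisecond timestamp
  };
}

// Usage (sketch): const [publicUrl] = await outputFile.getSignedUrl(signedUrlOptions());
```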



&lt;h3&gt;
  
  
  The Frontend: Bringing It to Life
&lt;/h3&gt;

&lt;p&gt;On the frontend, the UI simply calls our API and displays the results in a grid. &lt;strong&gt;React Query&lt;/strong&gt; handles the asynchronous state and renders the images as they're generated. A final server-side step can then download all the images from their URLs, use the &lt;strong&gt;Sharp&lt;/strong&gt; library to composite them into a single 4x4 grid, and return the final sprite sheet for download.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// A simplified React component using TanStack Query&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;useMutation&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@tanstack/react-query&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;SpriteGenerator&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;isPending&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useMutation&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;mutationFn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;File&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;formData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FormData&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="nx"&gt;formData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;file&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api/generate-sprites&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;formData&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Network response was not ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// ... file upload logic using react-dropzone that calls mutate(file)&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isPending&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nx"&gt;Generating&lt;/span&gt; &lt;span class="nx"&gt;your&lt;/span&gt; &lt;span class="nx"&gt;sprite&lt;/span&gt; &lt;span class="nx"&gt;sheet&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&amp;gt;&lt;/span&gt;&lt;span class="err"&gt;;
&lt;/span&gt;
  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt; &lt;span class="nx"&gt;className&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;grid grid-cols-4 gap-4&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;img&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="nx"&gt;src&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="nx"&gt;alt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Generated sprite frame&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;/&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
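&lt;p&gt;The Sharp compositing step mentioned above boils down to placement math: frame &lt;code&gt;i&lt;/code&gt; of a 4x4 sheet lands in column &lt;code&gt;i % 4&lt;/code&gt; and row &lt;code&gt;floor(i / 4)&lt;/code&gt;. Here is a minimal sketch of the overlay list that Sharp's &lt;code&gt;composite()&lt;/code&gt; expects; the 64-pixel frame size is an assumption for illustration:&lt;/p&gt;

```typescript
// Build the { input, left, top } overlay list for sharp().composite(),
// laying out equally sized frames on a COLS-wide grid.
const COLS = 4;
const FRAME = 64; // assumed square frame size in pixels

function spritePlacements(frames: Uint8Array[]) {
  return frames.map((input, i) => ({
    input,
    left: (i % COLS) * FRAME,          // column offset
    top: Math.floor(i / COLS) * FRAME, // row offset
  }));
}

// Usage (sketch): sharp({ create: { width: COLS * FRAME, height: COLS * FRAME,
//   channels: 4, background: { r: 0, g: 0, b: 0, alpha: 0 } } })
//   .composite(spritePlacements(frames)).png().toBuffer();
```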



&lt;h3&gt;
  
  
  A New Era of Creative Tools
&lt;/h3&gt;

&lt;p&gt;The emergence of powerful multimodal AI models like Gemini marks a paradigm shift. We're moving from a world where creative professionals spend countless hours on repetitive, manual tasks to one where they can focus on high-level vision and ideation.&lt;/p&gt;

&lt;p&gt;Tools like the one I've outlined here pose a direct challenge to traditional creative software companies like &lt;strong&gt;Adobe&lt;/strong&gt; and others in the digital art space. Instead of a user having to master a complex suite of tools (Photoshop for editing, Animate for frame-by-frame animation, After Effects for motion), an entire workflow can now be encapsulated in a single API call. This doesn't eliminate the need for human creativity, but it shifts the focus dramatically: the engineer becomes a co-creator, building tools that accelerate the artist's workflow by automating the tedious parts. The future of creative software isn't just a new UI; it's generative intelligence embedded directly into the core of the tool itself.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>googlecloud</category>
      <category>nanobanana</category>
      <category>spriteanimation</category>
    </item>
    <item>
      <title>Convert Any UI Images to Multi-Page HTML Website With UI Editor</title>
      <dc:creator>Usman Mehfooz</dc:creator>
      <pubDate>Thu, 11 Sep 2025 14:15:46 +0000</pubDate>
      <link>https://forem.com/firevibe/convert-any-ui-images-to-multi-page-html-website-with-google-ai-studio-1in8</link>
      <guid>https://forem.com/firevibe/convert-any-ui-images-to-multi-page-html-website-with-google-ai-studio-1in8</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F73xfwsp3139npvmvs71y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F73xfwsp3139npvmvs71y.png" alt="Sample HTML Generated"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-ai-studio-2025-09-03"&gt;Google AI Studio Multimodal Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/z3MebyOz1ME"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;I built &lt;strong&gt;ProtoHTML&lt;/strong&gt;, a web-based tool designed to bridge the gap between design and development. It transforms static website mockups (like screenshots or design files) into fully functional, multi-page HTML websites styled with Tailwind CSS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz31mvtb9dyz94p0nymr6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz31mvtb9dyz94p0nymr6.png" alt="Main UI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem ProtoHTML solves is the tedious and time-consuming process of manually converting a visual design into code. For developers and designers, this "mockup-to-code" phase can be a major bottleneck. ProtoHTML automates this by using a powerful multimodal AI to analyze the images and write the code, turning a process that could take hours into one that takes just a few seconds.&lt;/p&gt;

&lt;p&gt;Key features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Page Site Generation:&lt;/strong&gt; Upload multiple image mockups at once to generate a complete website structure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI-Powered Code Generation:&lt;/strong&gt; Leverages the &lt;strong&gt;gemini-2.5-flash-image-preview&lt;/strong&gt; model to produce clean, semantic HTML and Tailwind CSS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live Editable Previews:&lt;/strong&gt; Instantly preview the generated pages and edit text content directly in the browser, with the underlying code updating in real-time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One-Click Export:&lt;/strong&gt; Package the entire multi-page website into a single, downloadable &lt;code&gt;.zip&lt;/code&gt; file, ready for immediate deployment.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Note: the demo runs on a free API key, so it may not always be working.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can try the live application here: &lt;strong&gt;&lt;a href="https://ai-multi-page-architect-626025278302.us-west1.run.app/" rel="noopener noreferrer"&gt;https://ai-multi-page-architect-626025278302.us-west1.run.app/&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;


&lt;p&gt;The application has a simple UI for uploading mockups and editing the results; the AI outputs clean HTML and Tailwind CSS.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Used Google AI Studio
&lt;/h2&gt;

&lt;p&gt;Google AI Studio was the complete development environment for building and iterating on ProtoHTML. The core of the application is powered by the &lt;strong&gt;gemini-2.5-flash-image-preview&lt;/strong&gt; model (affectionately known as 'nano banana'), which I used during the free trial period on Sept 6-7. Its fast, powerful multimodal capabilities made it the perfect choice.&lt;/p&gt;

&lt;p&gt;The key to getting high-quality, consistent output was &lt;strong&gt;prompt engineering&lt;/strong&gt;. I crafted a detailed &lt;code&gt;systemInstruction&lt;/code&gt; that sets the persona for the AI as an "expert senior frontend developer" and provides a strict set of rules it must follow. These rules dictate everything from the output format (raw HTML only) to technical requirements like including the Tailwind CSS CDN link, using semantic HTML5 tags, and implementing responsive design patterns.&lt;/p&gt;
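&lt;p&gt;To make this concrete, here is a trimmed sketch of what such a &lt;code&gt;systemInstruction&lt;/code&gt; can look like; the exact wording is illustrative, not the production prompt:&lt;/p&gt;

```typescript
// Illustrative systemInstruction: the rule list mirrors the constraints
// described above, but the wording is not the production prompt.
const systemInstruction = [
  'You are an expert senior frontend developer.',
  'Rules:',
  '1. Respond with a single raw HTML file only: no markdown fences, no commentary.',
  '2. Include the Tailwind CSS CDN script tag in the head.',
  '3. Use semantic HTML5 tags (header, nav, main, section, footer).',
  '4. Make every layout responsive using Tailwind breakpoint prefixes.',
].join('\n');
```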

&lt;p&gt;Each API call is a multimodal request, sending both the visual image data and a concise text prompt (e.g., &lt;code&gt;Based on the provided image, generate the complete HTML file for the "About Us" page now.&lt;/code&gt;) to the Gemini model. This combination allows the AI to understand both the visual layout from the image and the specific context for the page from the text.&lt;/p&gt;
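&lt;p&gt;Sketched as code, one such request might be assembled like this; the page name and base64 placeholder are illustrative, and the &lt;code&gt;contents&lt;/code&gt;/&lt;code&gt;parts&lt;/code&gt; shape follows the Gemini API's standard multimodal request format:&lt;/p&gt;

```typescript
// Assemble the image-plus-text parts for one Gemini page-generation call.
// base64Png and pageName are placeholders for illustration.
function buildParts(base64Png: string, pageName: string) {
  return [
    { inlineData: { mimeType: 'image/png', data: base64Png } },
    { text: 'Based on the provided image, generate the complete HTML file for the "' + pageName + '" page now.' },
  ];
}

// The full request then nests these parts:
// { contents: [{ role: 'user', parts: buildParts(imageData, 'About Us') }] }
```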

&lt;h2&gt;
  
  
  Multimodal Features
&lt;/h2&gt;

&lt;p&gt;The primary multimodal feature of ProtoHTML is &lt;strong&gt;Image-to-Code Generation&lt;/strong&gt;. The application takes a visual input (a webpage mockup) and translates it into a structured, textual output (a complete HTML file with Tailwind CSS classes).&lt;/p&gt;

&lt;p&gt;This functionality fundamentally enhances the user experience in several ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accelerates Prototyping:&lt;/strong&gt; It dramatically reduces the friction between a visual idea and a functional prototype. Users can go from a set of static images to an interactive, multi-page website in minutes, allowing for rapid iteration and feedback.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Empowers Non-Coders:&lt;/strong&gt; Designers or project managers can bring their visions to life without needing to write a single line of code, making web development more accessible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creates a Tangible Feedback Loop:&lt;/strong&gt; The most powerful part of the experience is the immediate connection between the visual input and the interactive output. Seeing your static mockup rendered as a live, editable webpage in the "Preview &amp;amp; Edit" tab is a powerful "wow" moment. It makes the AI's "understanding" of the image tangible and gives the user immediate control to refine the result.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devchallenge</category>
      <category>googleaichallenge</category>
      <category>ai</category>
      <category>gemini</category>
    </item>
  </channel>
</rss>
