<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sameer Shah</title>
    <description>The latest articles on Forem by Sameer Shah (@sameershahh).</description>
    <link>https://forem.com/sameershahh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3406145%2Fd4575899-caee-4fea-a905-e67d6ed671b6.png</url>
      <title>Forem: Sameer Shah</title>
      <link>https://forem.com/sameershahh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sameershahh"/>
    <language>en</language>
    <item>
      <title>Building an Enterprise-Grade AI Voice Agent with Twilio, Deepgram, and Groq Llama-3.3 (Real-Time Telephony Automation)</title>
      <dc:creator>Sameer Shah</dc:creator>
      <pubDate>Mon, 13 Apr 2026 17:38:30 +0000</pubDate>
      <link>https://forem.com/sameershahh/building-an-enterprise-grade-ai-voice-agent-with-twilio-deepgram-and-groq-llama-33-real-time-3ckd</link>
      <guid>https://forem.com/sameershahh/building-an-enterprise-grade-ai-voice-agent-with-twilio-deepgram-and-groq-llama-33-real-time-3ckd</guid>
      <description>&lt;p&gt;Building real-time AI voice agents over actual phone calls is one of the hardest engineering problems you can take on. The latency requirements are brutal (humans notice any delay over ~500ms), the audio pipeline is full of edge cases, and coordinating three different external services — telephony, speech recognition, and LLM — in real time requires careful architectural thinking.&lt;/p&gt;

&lt;p&gt;I built a production-ready, low-latency AI telephony agent from scratch. Here's the full technical breakdown — architecture, implementation details, and the lessons learned along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This System Does
&lt;/h2&gt;

&lt;p&gt;When someone calls your Twilio phone number, this system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Captures the incoming audio stream in real time via Twilio Media Streams (WebSockets)&lt;/li&gt;
&lt;li&gt;Streams the audio to Deepgram Nova-2 for sub-second speech-to-text transcription&lt;/li&gt;
&lt;li&gt;Sends the transcript to Groq's Llama-3.3-70b for contextually aware response generation&lt;/li&gt;
&lt;li&gt;Converts the LLM response to natural-sounding speech using Deepgram Aura TTS&lt;/li&gt;
&lt;li&gt;Streams the audio back to the caller in 20ms frames&lt;/li&gt;
&lt;li&gt;Monitors LLM output for emergency trigger phrases — and redirects the call instantly if detected&lt;/li&gt;
&lt;/ol&gt;
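&lt;p&gt;Step 1 reduces to a small amount of decoding logic: Twilio delivers each WebSocket message as JSON with a base64-encoded mu-law payload. Here is a minimal sketch of the receive path (the endpoint route and Deepgram wiring are placeholders, not the repo's actual names):&lt;/p&gt;

```python
import base64
import json

def decode_media_frame(message: str):
    """Decode one Twilio Media Streams WebSocket message into raw mu-law
    bytes. Returns None for control events like 'start', 'mark', 'stop'."""
    data = json.loads(message)
    if data.get("event") != "media":
        return None
    return base64.b64decode(data["media"]["payload"])

# Hypothetical FastAPI receive loop (route name assumed for illustration):
# @app.websocket("/media-stream")
# async def media_stream(ws: WebSocket):
#     await ws.accept()
#     while True:
#         frame = decode_media_frame(await ws.receive_text())
#         if frame is not None:
#             await deepgram_socket.send(frame)  # forward straight to STT
```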

&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ze9wgkypzsggs2wiz8v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ze9wgkypzsggs2wiz8v.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack Breakdown
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Telephony&lt;/td&gt;
&lt;td&gt;Twilio Media Streams&lt;/td&gt;
&lt;td&gt;Battle-tested, global infrastructure with WSS support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;STT&lt;/td&gt;
&lt;td&gt;Deepgram Nova-2&lt;/td&gt;
&lt;td&gt;Best-in-class accuracy + sub-second latency on 8kHz audio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;td&gt;Groq Llama-3.3-70b&lt;/td&gt;
&lt;td&gt;Fastest inference available — critical for real-time voice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTS&lt;/td&gt;
&lt;td&gt;Deepgram Aura&lt;/td&gt;
&lt;td&gt;Low-latency, natural-sounding speech synthesis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Server&lt;/td&gt;
&lt;td&gt;FastAPI + WebSockets&lt;/td&gt;
&lt;td&gt;Async-first, handles concurrent connections cleanly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Why Groq? The Latency Problem
&lt;/h2&gt;

&lt;p&gt;This is the most important architectural decision in the whole system. In a voice conversation, you have maybe 300–400ms of budget for the entire round-trip from when speech ends to when the response starts playing. Breaking that down:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5mvrnhyk9foqv0wtcgu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5mvrnhyk9foqv0wtcgu.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's already 310ms with zero slack. Standard LLM APIs would blow this budget entirely. Groq's purpose-built LPU (Language Processing Unit) hardware is what makes real-time voice agents feasible — it's genuinely 10–20x faster than GPU-based inference at token generation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; For voice AI, LLM inference speed matters more than raw model size. A model served with faster inference (Llama-3.3-70b on Groq) will consistently outperform a slower, larger model for real-time telephony.&lt;/p&gt;
&lt;/blockquote&gt;
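&lt;p&gt;Inference speed pays off most when paired with streaming: rather than waiting for the full completion, you flush each finished sentence to TTS as soon as it closes. A sketch of that pattern against Groq's OpenAI-compatible streaming API (the model id and callback shape are illustrative assumptions, not the repo's code):&lt;/p&gt;

```python
def stream_reply(client, history, on_sentence):
    """Stream tokens from Groq and flush each completed sentence to TTS
    immediately, instead of waiting for the full completion."""
    stream = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # assumed model id
        messages=history,
        stream=True,
    )
    buffer = ""
    for chunk in stream:
        buffer += chunk.choices[0].delta.content or ""
        # Flush on sentence boundaries so TTS can begin within one sentence.
        while any(p in buffer for p in ".!?"):
            cut = min(i for i in (buffer.find(p) for p in ".!?") if i >= 0)
            on_sentence(buffer[: cut + 1].strip())
            buffer = buffer[cut + 1 :]
    if buffer.strip():
        on_sentence(buffer.strip())
```

&lt;p&gt;This turns perceived latency into time-to-first-sentence rather than time-to-full-response, which is what the caller actually experiences.&lt;/p&gt;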

&lt;h2&gt;
  
  
  The Audio Pipeline: Technical Specifications
&lt;/h2&gt;

&lt;p&gt;Twilio's Media Streams deliver audio in a very specific format that the entire pipeline is built around:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Encoding&lt;/strong&gt;: 8-bit PCMU (G.711 mu-law) — the standard for telephony&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sample rate&lt;/strong&gt;: 8000 Hz — lower than modern audio, but universal across phone networks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Channel&lt;/strong&gt;: Mono&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frame size&lt;/strong&gt;: 160 bytes = 20ms of audio per WebSocket message&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deepgram Nova-2 handles 8kHz mu-law natively — resampling on the fly would add latency. The TTS output from Deepgram Aura is similarly fragmented into 20ms frames for smooth playback through the telephony channel.&lt;/p&gt;
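&lt;p&gt;Frame handling is simple but unforgiving: the outbound channel expects exactly 160 bytes of 8kHz mu-law per 20ms message, so the TTS byte stream has to be re-chunked before it goes back out. A minimal sketch (0xFF encodes mu-law digital silence, used here to pad the final partial frame):&lt;/p&gt;

```python
FRAME_BYTES = 160  # 160 bytes of 8 kHz mu-law = exactly 20 ms of audio

def frame_audio(mulaw: bytes, frame_bytes: int = FRAME_BYTES):
    """Re-chunk a TTS byte stream into fixed 20 ms telephony frames,
    padding the final frame with mu-law silence (0xFF) to keep timing exact."""
    for i in range(0, len(mulaw), frame_bytes):
        frame = mulaw[i : i + frame_bytes]
        if len(frame) != frame_bytes:
            frame = frame + b"\xff" * (frame_bytes - len(frame))
        yield frame
```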

&lt;h2&gt;
  
  
  Emergency Triage Logic
&lt;/h2&gt;

&lt;p&gt;One of the most critical features for production deployment is the emergency fallback system. The LLM output monitor runs concurrently with response generation and watches for a configurable set of trigger phrases.&lt;/p&gt;

&lt;p&gt;When a trigger is detected:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Current audio playback is interrupted&lt;/li&gt;
&lt;li&gt;The system calls the Twilio REST API immediately&lt;/li&gt;
&lt;li&gt;The call is redirected to the configured &lt;code&gt;EMERGENCY_FALLBACK_NUMBER&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The event is logged for auditing&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is essential for any real-world deployment — medical triage, mental health lines, technical support escalation — where certain situations require immediate human intervention rather than continued AI interaction.&lt;/p&gt;
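&lt;p&gt;The triage flow reduces to a phrase scan plus one Twilio REST call. In twilio-python, redirecting a live call means updating it to fetch new TwiML. A hedged sketch (the trigger list and handoff URL are illustrative, not the repo's actual configuration):&lt;/p&gt;

```python
EMERGENCY_TRIGGERS = ["chest pain", "can't breathe", "unconscious"]  # example set

def detect_trigger(llm_output: str, triggers=EMERGENCY_TRIGGERS):
    """Return the first trigger phrase found in the LLM output, else None."""
    lowered = llm_output.lower()
    return next((t for t in triggers if t in lowered), None)

def redirect_call(client, call_sid: str, handoff_url: str):
    """twilio-python: point the live call at new TwiML that dials the
    configured EMERGENCY_FALLBACK_NUMBER (served at handoff_url)."""
    client.calls(call_sid).update(url=handoff_url, method="POST")
```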

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI-voice-agent/
├── app/
│   └── core/
│       └── config.py    # SYSTEM_PROMPT lives here — customize behavior
├── .env.example          # All required environment variables documented
├── requirements.txt
└── run.py                # Single entry point — manages full lifecycle
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The unified &lt;code&gt;run.py&lt;/code&gt; entry point is a deliberate design decision: it manages ngrok tunnel setup, Twilio webhook synchronization, and FastAPI startup in the correct order — deployment is a single command.&lt;/p&gt;

&lt;h2&gt;
  
  
  Customizing the Agent's Behavior
&lt;/h2&gt;

&lt;p&gt;The agent's entire personality and domain expertise are controlled by a single system prompt in &lt;code&gt;app/core/config.py&lt;/code&gt;. This makes it trivially easy to redeploy the same infrastructure for completely different use cases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Medical triage agent
&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a medical intake assistant...
Emergency triggers: [&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chest pain&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;can&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t breathe&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;unconscious&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# Technical support agent
&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a tier-1 technical support agent...
Escalation triggers: [&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;billing issue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data loss&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;security breach&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# Appointment scheduling agent
&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a scheduling assistant...&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Sameershahh/AI-voice-agent
&lt;span class="nb"&gt;cd &lt;/span&gt;AI-voice-agent
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate  &lt;span class="c"&gt;# Windows: .venv\Scripts\activate&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Configure credentials&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Fill in GROQ_API_KEY, DEEPGRAM_API_KEY, TWILIO_ACCOUNT_SID,&lt;/span&gt;
&lt;span class="c"&gt;# TWILIO_AUTH_TOKEN, TWILIO_PHONE_NUMBER, EMERGENCY_FALLBACK_NUMBER, PUBLIC_URL&lt;/span&gt;

&lt;span class="c"&gt;# Start everything&lt;/span&gt;
python run.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Session Logging
&lt;/h2&gt;

&lt;p&gt;All session interactions and transcripts are automatically persisted to the &lt;code&gt;logs/&lt;/code&gt; directory. This is non-optional for production — you need a complete audit trail for compliance, debugging, and performance analysis. Logs include full call transcripts, LLM responses, latency measurements, and any emergency triage events.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Use Cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Medical triage&lt;/strong&gt; — AI handles initial intake, escalates critical cases to on-call staff&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technical support&lt;/strong&gt; — tier-1 resolution with intelligent escalation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Appointment scheduling&lt;/strong&gt; — natural conversation flow for booking and rescheduling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lead qualification&lt;/strong&gt; — automated inbound sales calls with CRM integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Emergency hotlines&lt;/strong&gt; — AI-assisted triage with guaranteed human escalation path&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Sameershahh/AI-voice-agent" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.twilio.com/docs/voice/media-streams" rel="noopener noreferrer"&gt;Twilio Media Streams Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.deepgram.com/docs/nova-2" rel="noopener noreferrer"&gt;Deepgram Nova-2 Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://console.groq.com/docs/llama" rel="noopener noreferrer"&gt;Groq Llama-3.3 Docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building voice AI infrastructure or have questions about latency optimization, WebSocket audio pipelines, or the emergency triage implementation — drop a comment. There aren't many publicly documented implementations of this full stack yet, and I'm happy to go deeper on any part of it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by Sameer Shah — AI &amp;amp; Full-Stack Developer | &lt;a href="https://sameershah-portfolio.vercel.app/" rel="noopener noreferrer"&gt;Portfolio&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>opensource</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Building an AI-Powered Event Feedback System: Gemini 2.5 + FastAPI + Supabase + Automated PDF Reports</title>
      <dc:creator>Sameer Shah</dc:creator>
      <pubDate>Mon, 13 Apr 2026 17:16:16 +0000</pubDate>
      <link>https://forem.com/sameershahh/building-an-ai-powered-event-feedback-system-gemini-25-fastapi-supabase-automated-pdf-eii</link>
      <guid>https://forem.com/sameershahh/building-an-ai-powered-event-feedback-system-gemini-25-fastapi-supabase-automated-pdf-eii</guid>
      <description>&lt;p&gt;After running an event, the feedback collection process is almost always broken. You get a spreadsheet of raw responses, manually tally the ratings, write a summary email to the organizer, and hope you didn't miss anything important. It's tedious, slow, and doesn't scale.&lt;/p&gt;

&lt;p&gt;I built a full-stack AI-powered event feedback system that automates this entire pipeline end-to-end — from the moment an attendee submits feedback to a branded PDF report landing in the organizer's inbox, complete with AI-generated sentiment analysis and improvement suggestions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the System Does
&lt;/h2&gt;

&lt;p&gt;Every feedback submission triggers a fully automated chain:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Form data is cleaned and validated by the FastAPI backend&lt;/li&gt;
&lt;li&gt;Gemini 2.5 analyzes the text for sentiment, urgency, highlights, and suggestions&lt;/li&gt;
&lt;li&gt;A branded HTML report is generated and converted to PDF&lt;/li&gt;
&lt;li&gt;The PDF is emailed to the event organizer via Gmail SMTP&lt;/li&gt;
&lt;li&gt;All data (including AI output) is persisted to a Supabase PostgreSQL database&lt;/li&gt;
&lt;li&gt;The analytics dashboard reflects the new submission in real time&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Form Submission
      │
      ▼
Data Cleaning &amp;amp; Normalization
      │
      ▼
Gemini 2.5 AI Analysis
      │
      ▼
HTML Report Generation (Branded)
      │
      ▼
PDF Conversion
      │
      ├──▶ SMTP Email to Organizer
      │
      └──▶ Supabase Database Storage
                │
                ▼
      Dashboard &amp;amp; Analytics
                │
                ▼
      Date-Range Bulk Report (PDF + Email)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Frontend&lt;/td&gt;
&lt;td&gt;Next.js + Tailwind CSS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backend&lt;/td&gt;
&lt;td&gt;FastAPI (Python)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI Analysis&lt;/td&gt;
&lt;td&gt;Google Gemini 2.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Email&lt;/td&gt;
&lt;td&gt;Gmail SMTP (TLS, Port 587)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PDF Generation&lt;/td&gt;
&lt;td&gt;HTML-to-PDF conversion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;td&gt;Supabase (PostgreSQL)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The AI Analysis Layer: Gemini 2.5
&lt;/h2&gt;

&lt;p&gt;The most interesting part of this system is the AI pipeline. Rather than using a generic sentiment library, I integrated Google Gemini 2.5 directly to generate structured, actionable intelligence from each feedback submission.&lt;/p&gt;

&lt;p&gt;The prompt is designed to return a consistent JSON structure every time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sentiment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Positive | Neutral | Negative"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2-3 sentence summary of the feedback"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"highlights"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"key point 1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"key point 2"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"improvementSuggestions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"suggestion 1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"suggestion 2"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"urgency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Low | Medium | High"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;urgency&lt;/code&gt; field is particularly useful — it lets organizers immediately identify submissions that need follow-up action, without reading every comment manually.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; When prompting Gemini for structured JSON output, always specify the expected schema explicitly in your prompt and validate the response server-side. LLMs can occasionally deviate from the schema in edge cases.&lt;/p&gt;
&lt;/blockquote&gt;
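&lt;p&gt;That server-side check can be a thin gate in front of the rest of the pipeline. Shown here with the standard library for brevity (the project's FastAPI layer uses Pydantic; the allowed values come from the schema above):&lt;/p&gt;

```python
import json

ALLOWED_SENTIMENT = {"Positive", "Neutral", "Negative"}
ALLOWED_URGENCY = {"Low", "Medium", "High"}

def parse_analysis(raw: str):
    """Parse Gemini's response and enforce the schema contract.
    Returns None on any deviation so the caller can retry or fall back."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if data.get("sentiment") not in ALLOWED_SENTIMENT:
        return None
    if data.get("urgency") not in ALLOWED_URGENCY:
        return None
    if not isinstance(data.get("highlights"), list):
        return None
    return data
```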

&lt;h2&gt;
  
  
  Automated PDF Report Generation
&lt;/h2&gt;

&lt;p&gt;After AI analysis, the backend generates a branded HTML report and converts it to PDF. The report includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event metadata (name, date, organizer)&lt;/li&gt;
&lt;li&gt;Attendee information&lt;/li&gt;
&lt;li&gt;Star rating visualization&lt;/li&gt;
&lt;li&gt;AI-generated summary and highlights&lt;/li&gt;
&lt;li&gt;Improvement suggestions with urgency indicator&lt;/li&gt;
&lt;li&gt;Submission ID and timestamp for traceability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using HTML-to-PDF conversion rather than a PDF library gives you full control over visual design — CSS, fonts, colors, layout — without being constrained by a library's API.&lt;/p&gt;

&lt;h2&gt;
  
  
  SMTP Email Delivery
&lt;/h2&gt;

&lt;p&gt;The PDF report is automatically emailed to the event organizer via Gmail SMTP:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Email config
&lt;/span&gt;&lt;span class="n"&gt;SMTP_HOST&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;smtp.gmail.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;SMTP_PORT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;587&lt;/span&gt;
&lt;span class="n"&gt;SMTP_SECURITY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TLS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# The SMTP password is stored in .env — never hardcoded
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Gmail requires an App Password (not your account password) when authenticating via SMTP with 2FA enabled. Store it exclusively in environment secrets.&lt;/p&gt;
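&lt;p&gt;With the App Password in place, attaching and sending the PDF is pure standard library. A sketch (sender, recipient, and filename conventions are placeholders):&lt;/p&gt;

```python
import smtplib
from email.message import EmailMessage

def build_report_email(sender: str, recipient: str, pdf_bytes: bytes, event_name: str):
    """Assemble the organizer email with the PDF report attached."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Feedback report: {event_name}"
    msg.set_content("The AI-generated feedback report for your event is attached.")
    msg.add_attachment(
        pdf_bytes,
        maintype="application",
        subtype="pdf",
        filename=f"{event_name}-report.pdf",
    )
    return msg

def send_report(msg, app_password: str):
    """Gmail SMTP with STARTTLS on port 587; the App Password comes from .env."""
    with smtplib.SMTP("smtp.gmail.com", 587) as server:
        server.starttls()
        server.login(msg["From"], app_password)
        server.send_message(msg)
```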

&lt;h2&gt;
  
  
  Database Schema: Supabase
&lt;/h2&gt;

&lt;p&gt;All submissions and AI outputs are persisted to a Supabase PostgreSQL table (&lt;code&gt;feedback_reports&lt;/code&gt;). The schema captures everything — raw input, AI analysis fields, PDF URL, and email delivery status — in a single row per submission, making dashboard queries simple and fast.&lt;/p&gt;

&lt;p&gt;Key fields: &lt;code&gt;submissionId&lt;/code&gt;, &lt;code&gt;eventName&lt;/code&gt;, &lt;code&gt;eventDate&lt;/code&gt;, &lt;code&gt;rating&lt;/code&gt;, &lt;code&gt;sentiment&lt;/code&gt;, &lt;code&gt;summary&lt;/code&gt;, &lt;code&gt;urgency&lt;/code&gt;, &lt;code&gt;highlights&lt;/code&gt;, &lt;code&gt;improvementSuggestions&lt;/code&gt;, &lt;code&gt;pdfUrl&lt;/code&gt;, &lt;code&gt;emailSent&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Analytics Dashboard
&lt;/h2&gt;

&lt;p&gt;The Next.js dashboard connects to the API and provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time submission counts and average ratings&lt;/li&gt;
&lt;li&gt;Sentiment breakdown (Positive / Neutral / Negative)&lt;/li&gt;
&lt;li&gt;High-urgency submission highlighting&lt;/li&gt;
&lt;li&gt;Filterable table with date range, event name, and sentiment filters&lt;/li&gt;
&lt;li&gt;Per-submission PDF access links&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;bulk report feature&lt;/strong&gt; is especially powerful: select a date range, and the system retrieves all submissions in that window, passes the full dataset to Gemini for a consolidated analysis, generates a combined PDF, and emails it to the admin. A multi-hour manual reporting task reduced to a single button click.&lt;/p&gt;
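&lt;p&gt;The aggregation step before the consolidated Gemini call isn't shown in the repo excerpt, but it amounts to collapsing the date-range rows into a compact summary the model can reason over. A plausible sketch, using the field names from the schema above:&lt;/p&gt;

```python
def consolidate(rows):
    """Aggregate a date-range batch of feedback rows into the compact
    summary passed to Gemini for the bulk report (hypothetical shape)."""
    total = len(rows)
    by_sentiment = {"Positive": 0, "Neutral": 0, "Negative": 0}
    ratings = []
    urgent = []
    for row in rows:
        by_sentiment[row["sentiment"]] += 1
        ratings.append(row["rating"])
        if row["urgency"] == "High":
            urgent.append(row["submissionId"])
    return {
        "submissions": total,
        "avg_rating": round(sum(ratings) / total, 2) if total else None,
        "sentiment_breakdown": by_sentiment,
        "high_urgency_ids": urgent,
    }
```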

&lt;h2&gt;
  
  
  Security Considerations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;SMTP credentials stored exclusively in environment variables — never in the codebase&lt;/li&gt;
&lt;li&gt;Supabase connection credentials managed via env secrets&lt;/li&gt;
&lt;li&gt;Input validation at the FastAPI layer using Pydantic before data reaches AI or storage&lt;/li&gt;
&lt;li&gt;Gemini API key scoped to the backend only — never exposed to the client&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Sameershahh/feedback-form-system
&lt;span class="nb"&gt;cd &lt;/span&gt;feedback-form-system

&lt;span class="c"&gt;# Backend&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;api
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="c"&gt;# Add .env with Supabase, Gmail SMTP, and Gemini credentials&lt;/span&gt;
uvicorn main:app &lt;span class="nt"&gt;--reload&lt;/span&gt;

&lt;span class="c"&gt;# Frontend&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ../frontend
npm &lt;span class="nb"&gt;install
&lt;/span&gt;npm run dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Real-World Use Cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Conference and workshop organizers collecting post-event feedback&lt;/li&gt;
&lt;li&gt;Corporate training teams tracking session quality&lt;/li&gt;
&lt;li&gt;SaaS products embedding feedback collection into onboarding flows&lt;/li&gt;
&lt;li&gt;Event agencies needing automated client-ready reports&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Sameershahh/feedback-form-system" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ai.google.dev/gemini-api/docs" rel="noopener noreferrer"&gt;Google Gemini API Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://supabase.com/docs" rel="noopener noreferrer"&gt;Supabase Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Curious about the AI prompt engineering behind the Gemini integration, or how the bulk report aggregation works? Drop a comment — happy to go deeper on any part of the stack.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by Sameer Shah — AI &amp;amp; Full-Stack Developer | &lt;a href="https://sameershah-portfolio.vercel.app/" rel="noopener noreferrer"&gt;Portfolio&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>fastapi</category>
      <category>nextjs</category>
      <category>automation</category>
    </item>
    <item>
      <title>I Built a Full-Stack Page Generation Engine with FastAPI + Next.js (And Here's the Architecture)</title>
      <dc:creator>Sameer Shah</dc:creator>
      <pubDate>Mon, 13 Apr 2026 09:15:35 +0000</pubDate>
      <link>https://forem.com/sameershahh/i-built-a-full-stack-page-generation-engine-with-fastapi-nextjs-and-heres-the-architecture-37p6</link>
      <guid>https://forem.com/sameershahh/i-built-a-full-stack-page-generation-engine-with-fastapi-nextjs-and-heres-the-architecture-37p6</guid>
      <description>&lt;p&gt;There's a common problem in modern web development: you have structured data — JSON from a CMS, a database, or a configuration file — and you need to turn it into fully-rendered, styled web pages dynamically. Most solutions either lock you into a specific CMS or require a lot of glue code across disconnected systems.&lt;/p&gt;

&lt;p&gt;So I built &lt;strong&gt;PageForge API&lt;/strong&gt; — a full-stack page generation engine that takes structured JSON input and forges dynamic, styled web pages through a clean API-first architecture. Here's how it works and why I made the architectural choices I did.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Structured Data → Web Pages
&lt;/h2&gt;

&lt;p&gt;Imagine you're building a platform where clients need unique landing pages, product pages, or documentation pages generated from configuration data. Hard-coding templates doesn't scale. A headless CMS is overkill. What you need is a programmable, API-driven page forge.&lt;/p&gt;

&lt;p&gt;PageForge API solves this by acting as a bridge: you send it JSON, it returns rendered pages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The project is split into two distinct layers that communicate cleanly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Backend&lt;/td&gt;
&lt;td&gt;FastAPI (Python)&lt;/td&gt;
&lt;td&gt;Data ingestion, processing, template logic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frontend&lt;/td&gt;
&lt;td&gt;Next.js (TypeScript)&lt;/td&gt;
&lt;td&gt;Page rendering, routing, client-side display&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data&lt;/td&gt;
&lt;td&gt;JSON schemas&lt;/td&gt;
&lt;td&gt;Input contracts for page structures&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The backend handles all the heavy lifting — validation, data normalization, template selection, and response construction. The frontend consumes the processed output and renders it via Next.js's hybrid rendering capabilities (SSR + CSR depending on the use case).&lt;/p&gt;

&lt;h2&gt;
  
  
  Repository Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;page-forge-api/
├── fastapi/          # Python backend — core API logic
├── nextjs/           # TypeScript frontend — rendering layer
├── forge-data.json   # Sample forged page data
├── input-data.json   # Example input schema
├── new-sample.json   # Additional test fixtures
└── sample-data.json  # Reference data structures
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The multiple JSON fixture files aren't just test data — they demonstrate the range of input schemas the engine is designed to handle. Real-world page generators need to be resilient to varied input shapes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The FastAPI Backend
&lt;/h2&gt;

&lt;p&gt;FastAPI was the natural choice here for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic OpenAPI docs&lt;/strong&gt; — as soon as you define your Pydantic models, you get interactive API documentation for free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type safety from day one&lt;/strong&gt; — Pydantic models enforce your input schema contract, so malformed data is rejected at the boundary, not deep inside your logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async-first&lt;/strong&gt; — if the engine needs to fetch external resources or call downstream services, async handlers ensure the server stays non-blocking.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The backend receives JSON input representing a page definition, validates it against a schema, applies transformation logic, and returns a structured response that the frontend can render.&lt;/p&gt;
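&lt;p&gt;As a concrete example of that boundary, the input contract can be expressed as nested Pydantic models mirroring the conceptual JSON structure shown later in this post (field names follow the sample fixtures; Pydantic v2 is assumed):&lt;/p&gt;

```python
from typing import List, Literal
from pydantic import BaseModel

class PageMeta(BaseModel):
    title: str
    description: str
    slug: str

class Section(BaseModel):
    # In the full model, each section type would carry its own optional fields.
    type: Literal["hero", "features", "cta", "content"]

class PageDefinition(BaseModel):
    pageType: Literal["landing", "product", "docs"]
    meta: PageMeta
    sections: List[Section]
```

&lt;p&gt;Anything that fails these models is rejected at the API boundary with a 422, before any template logic runs.&lt;/p&gt;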

&lt;h2&gt;
  
  
  The Next.js Frontend
&lt;/h2&gt;

&lt;p&gt;Next.js sits at the frontend layer and handles rendering. The key design decision here was keeping the frontend &lt;em&gt;dumb&lt;/em&gt; — it doesn't make business logic decisions. It receives processed data from the API and maps it to components.&lt;/p&gt;

&lt;p&gt;This decoupling is powerful because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can swap the rendering layer without touching backend logic&lt;/li&gt;
&lt;li&gt;The API can serve multiple frontends or even mobile clients&lt;/li&gt;
&lt;li&gt;Testing is clean — you can unit test the API independently of rendering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;TypeScript (93.4% of the codebase) ensures that data contracts between the API response and frontend components are enforced at compile time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data Layer: JSON Contracts
&lt;/h2&gt;

&lt;p&gt;The most interesting design problem in a page generation engine is the input schema. Too rigid and it's unusable. Too flexible and validation becomes a nightmare.&lt;/p&gt;

&lt;p&gt;PageForge solves this with layered JSON schemas — a base contract all inputs must satisfy, with optional extension fields for more complex page types. Here's a simplified conceptual structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pageType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"landing | product | docs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"slug"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sections"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hero | features | cta | content"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The backend validates this structure, resolves any references, applies defaults, and returns a fully-resolved page object ready to render.&lt;/p&gt;
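&lt;p&gt;The resolve-and-apply-defaults pass can be sketched in plain Python. The default values below are invented for illustration; the real defaults live in the PageForge backend:&lt;/p&gt;

```python
# Illustrative sketch of the "apply defaults" pass -- these default
# values are made up, not PageForge's actual defaults.
SECTION_DEFAULTS = {
    "hero": {"headline": "", "subheadline": "", "alignment": "center"},
    "features": {"columns": 3, "items": []},
    "cta": {"label": "Learn more", "href": "#"},
    "content": {"body": ""},
}

def resolve_page(page: dict) -> dict:
    """Return a fully-resolved copy: every section carries its defaults,
    overridden by whatever the caller supplied."""
    resolved = {**page, "sections": []}
    for section in page.get("sections", []):
        defaults = SECTION_DEFAULTS.get(section["type"], {})
        merged = {**defaults, **section.get("data", {})}
        resolved["sections"].append({"type": section["type"], "data": merged})
    return resolved

page = resolve_page({
    "pageType": "landing",
    "meta": {"title": "Home", "description": "", "slug": "home"},
    "sections": [{"type": "features", "data": {"columns": 2}}],
})
# The features section keeps columns=2 (caller) and gains items=[] (default).
```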

&lt;h2&gt;
  
  
  Why API-First Page Generation Matters
&lt;/h2&gt;

&lt;p&gt;The API-first approach to page generation is increasingly relevant as teams move toward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Programmatic content generation at scale&lt;/li&gt;
&lt;li&gt;AI-driven page creation pipelines&lt;/li&gt;
&lt;li&gt;Multi-channel publishing from a single data source&lt;/li&gt;
&lt;li&gt;Headless architectures where the rendering layer is replaceable&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;PageForge is a great foundation if you're building a white-label page generator, an AI-assisted website builder, or any system where page structure is driven by data rather than manual design.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Running It Locally
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repo&lt;/span&gt;
git clone https://github.com/candidateconnectt/page-forge-api

&lt;span class="c"&gt;# Backend setup&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;fastapi
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
uvicorn main:app &lt;span class="nt"&gt;--reload&lt;/span&gt;

&lt;span class="c"&gt;# Frontend setup (new terminal)&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ../nextjs
npm &lt;span class="nb"&gt;install
&lt;/span&gt;npm run dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Some natural extensions I'm considering for this engine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI-powered section generation — pass a prompt, get a page section back&lt;/li&gt;
&lt;li&gt;Template versioning and A/B testing support&lt;/li&gt;
&lt;li&gt;Webhook support for triggering page regeneration on data changes&lt;/li&gt;
&lt;li&gt;Export to static HTML for edge deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/candidateconnectt/page-forge-api" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fastapi.tiangolo.com/" rel="noopener noreferrer"&gt;FastAPI Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nextjs.org/docs" rel="noopener noreferrer"&gt;Next.js Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building anything similar — headless page builders, dynamic content engines, or API-first web tools — I'd love to hear about your approach in the comments. And if this was useful, a ⭐ on GitHub goes a long way!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by Sameer Shah — AI &amp;amp; Full-Stack Developer | &lt;a href="https://sameershah-portfolio.vercel.app/" rel="noopener noreferrer"&gt;Portfolio&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>nextjs</category>
      <category>fastapi</category>
      <category>python</category>
      <category>webdev</category>
    </item>
    <item>
      <title>How I Built a Facial-Expression Recognition Model with PyTorch (FER-2013, 72% Val Acc)</title>
      <dc:creator>Sameer Shah</dc:creator>
      <pubDate>Tue, 12 Aug 2025 21:30:24 +0000</pubDate>
      <link>https://forem.com/sameershahh/how-i-built-a-facial-expression-recognition-model-with-pytorch-fer-2013-72-val-acc-2oc3</link>
      <guid>https://forem.com/sameershahh/how-i-built-a-facial-expression-recognition-model-with-pytorch-fer-2013-72-val-acc-2oc3</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvkxm9qcjimu4uzao48y7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvkxm9qcjimu4uzao48y7.png" alt=" " width="586" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I trained a 3-block CNN in PyTorch on the FER-2013 dataset to classify 7 emotions. This post explains the dataset challenges, preprocessing and augmentation, exact model architecture, training recipe, evaluation (confusion matrix + per-class F1), and next steps for deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Emotion recognition enables richer human–computer interactions. I chose FER-2013 because it’s realistic: low-resolution (48×48), grayscale, and class-imbalanced. The goal: produce a reproducible, deployment-ready CNN pipeline that balances accuracy and efficiency for real-time inference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem statement
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Input: 48×48 grayscale faces.&lt;/li&gt;
&lt;li&gt;Task: 7-class classification — Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral.&lt;/li&gt;
&lt;li&gt;Challenges: small images → limited features, class imbalance, noisy labels, and intra-class variation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Dataset &amp;amp; preprocessing
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Source: FER-2013 (Kaggle). Split into train/val/test following the Usage column in the original CSV, or define your own split.&lt;/li&gt;
&lt;li&gt;Preprocessing pipeline (PyTorch transforms):
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from torchvision import transforms

train_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.RandomHorizontalFlip(),     # sensible for faces
    transforms.RandomRotation(10),         # small rotations
    transforms.RandomResizedCrop(48, scale=(0.9,1.0)),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

val_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((48,48)),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Model architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikuuv837jg3ojwncgddl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikuuv837jg3ojwncgddl.png" alt=" " width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input: 48 × 48 × 1 (grayscale)&lt;/li&gt;
&lt;li&gt;Block 1: Conv2d(1 → 64, 3×3, pad=1) → BatchNorm2d(64) → ReLU → MaxPool2d(2×2) → OUTPUT 24×24×64&lt;/li&gt;
&lt;li&gt;Block 2: Conv2d(64 → 128, 3×3, pad=1) → BatchNorm2d(128) → ReLU → MaxPool2d(2×2) → OUTPUT 12×12×128&lt;/li&gt;
&lt;li&gt;Block 3: Conv2d(128 → 256, 3×3, pad=1) → BatchNorm2d(256) → ReLU → MaxPool2d(2×2) → OUTPUT 6×6×256&lt;/li&gt;
&lt;li&gt;Dropout2d(p=0.25) → Flatten (9216) → FC(9216 → 512) → ReLU → Dropout(p=0.5) → FC(512 → 7) → Softmax (inference)&lt;/li&gt;
&lt;/ul&gt;
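&lt;p&gt;The blocks above translate almost line-for-line into a PyTorch module. This is my reading of the listed layers as a sketch, not necessarily the repo's exact code:&lt;/p&gt;

```python
# Sketch of the 3-block CNN described above for 48x48 grayscale FER-2013
# faces -- my reading of the layer list, not the repo's exact code.
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()

        def block(c_in, c_out):
            # Conv -> BatchNorm -> ReLU -> MaxPool, halving spatial size.
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            )

        self.features = nn.Sequential(
            block(1, 64),     # 48x48 -> 24x24
            block(64, 128),   # 24x24 -> 12x12
            block(128, 256),  # 12x12 -> 6x6
            nn.Dropout2d(0.25),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                 # 6 * 6 * 256 = 9216
            nn.Linear(9216, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes),  # raw logits; softmax only at inference
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = EmotionCNN()
logits = model(torch.randn(2, 1, 48, 48))  # logits.shape == (2, 7)
```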

&lt;h2&gt;
  
  
  Training recipe
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Loss: CrossEntropyLoss()&lt;/li&gt;
&lt;li&gt;Optimizer: AdamW(lr=1e-3, weight_decay=1e-4)&lt;/li&gt;
&lt;li&gt;Scheduler: ReduceLROnPlateau or CosineAnnealingLR (I used ReduceLROnPlateau on val loss)&lt;/li&gt;
&lt;li&gt;Batch size: 64 (adjust by GPU memory)&lt;/li&gt;
&lt;li&gt;Epochs: 30–60 with early stopping (patience 7 on val loss)&lt;/li&gt;
&lt;li&gt;Checkpoint: save best_model.pt by val F1 (or loss)&lt;/li&gt;
&lt;/ul&gt;
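&lt;p&gt;One thing the recipe doesn't show is handling FER-2013's class imbalance (Disgust is tiny compared to Happy). A common fix is weighting the loss by inverse class frequency. A sketch, using the widely quoted training-split counts (double-check them against your own split):&lt;/p&gt;

```python
# Sketch of class-weighted loss for FER-2013's imbalance. The counts
# below are the widely quoted training-split numbers in the order
# [Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral] -- verify
# against your own split before relying on them.
import torch
import torch.nn as nn

class_counts = torch.tensor(
    [3995, 436, 4097, 7215, 4830, 3171, 4965], dtype=torch.float
)

# Inverse-frequency weights: rarer classes get larger weights.
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)
```

&lt;p&gt;A &lt;code&gt;WeightedRandomSampler&lt;/code&gt; on the DataLoader is the other standard option; weighting the loss is simpler and changes nothing else in the pipeline.&lt;/p&gt;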

&lt;h2&gt;
  
  
  Minimal training loop snippet
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', patience=3)

best_f1 = 0.0
for epoch in range(1, epochs + 1):
    train_one_epoch(model, train_loader, optimizer, criterion)
    val_loss, val_metrics = validate(model, val_loader, criterion)
    scheduler.step(val_loss)
    if val_metrics['f1_macro'] &amp;gt; best_f1:
        best_f1 = val_metrics['f1_macro']
        torch.save(model.state_dict(), 'best_model.pt')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Reproducibility
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random, numpy as np, torch
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
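&lt;p&gt;The snippet above pins the main process, but &lt;code&gt;DataLoader&lt;/code&gt; workers spawn with their own RNG state, so if you use &lt;code&gt;num_workers&lt;/code&gt; above zero it's worth seeding them too. A sketch following the PyTorch reproducibility notes:&lt;/p&gt;

```python
# Seeding DataLoader workers, per the PyTorch reproducibility notes.
import random
import numpy as np
import torch

def seed_worker(worker_id):
    # Each worker derives its seed deterministically from the base seed.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(42)

# Pass both to your DataLoader:
# DataLoader(train_dataset, batch_size=64, shuffle=True,
#            worker_init_fn=seed_worker, generator=g)
```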



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The full code is on &lt;a href="https://github.com/Sameershahh/Facial_Expression_Recognizer" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. If you'd like this adapted for real-time webcam inference or a Django web deployment, reach out.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
