<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Boopathi</title>
    <description>The latest articles on Forem by Boopathi (@programmerraja).</description>
    <link>https://forem.com/programmerraja</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F430913%2F46bcaec6-10d0-4395-8161-5b0b881ac731.jpg</url>
      <title>Forem: Boopathi</title>
      <link>https://forem.com/programmerraja</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/programmerraja"/>
    <language>en</language>
    <item>
      <title>Sherlock Holmes: The Case Of AI Brought Down Our Servers</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Sun, 01 Mar 2026 23:08:46 +0000</pubDate>
      <link>https://forem.com/programmerraja/sherlock-holmes-the-case-of-ai-brought-down-our-servers-5002</link>
      <guid>https://forem.com/programmerraja/sherlock-holmes-the-case-of-ai-brought-down-our-servers-5002</guid>
      <description>&lt;p&gt;There are two kinds of production incidents.&lt;/p&gt;

&lt;p&gt;The first kind gives you signals. Metrics drift slightly off baseline. Latency edges upward. Dashboards turn yellow long before anything turns red. You have time to reason about it.&lt;/p&gt;

&lt;p&gt;The second kind doesn’t negotiate. It lets you sleep peacefully and then informs you in the morning that your server died multiple times overnight.&lt;/p&gt;

&lt;p&gt;This was the second kind.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;We’re building a voice agent platform.&lt;/p&gt;

&lt;p&gt;Calls come in from users. Audio streams over WebSocket. We integrate with Twilio for real-time media streams. AI agents process the conversation, decide what to say next, and occasionally invoke tools. Some of those tools query our database to fetch context or perform actions.&lt;/p&gt;

&lt;p&gt;Architecturally, nothing unusual. A fairly standard real-time pipeline: streaming input, AI orchestration, tool execution, database lookups.&lt;/p&gt;

&lt;p&gt;And everything had been working fine.&lt;/p&gt;

&lt;p&gt;Then one night, one of our Kubernetes pods, limited to 1GB of memory, started crashing repeatedly. There was no deployment. No configuration change. No obvious traffic spike. No infrastructure event. Just restarts.&lt;/p&gt;

&lt;p&gt;That’s always unsettling. When nothing changed, but something clearly broke.&lt;/p&gt;

&lt;h2&gt;
  
  
  The First Suspect: Streaming
&lt;/h2&gt;

&lt;p&gt;When memory spikes in a real-time system, your instinct immediately points to streaming.&lt;/p&gt;

&lt;p&gt;WebSockets can buffer unexpectedly. Audio chunks might accumulate if something downstream slows down. Garbage collection might not keep up under bursty traffic. Maybe some array was growing quietly in memory.&lt;/p&gt;

&lt;p&gt;All of those were reasonable hypotheses.&lt;/p&gt;

&lt;p&gt;We spun up a test environment and tried to simulate the issue. We created parallel calls. We streamed audio continuously. We monitored memory closely, expecting to see the same runaway pattern.&lt;/p&gt;

&lt;p&gt;Nothing happened.&lt;/p&gt;

&lt;p&gt;Memory usage remained stable. The heap grew and shrank normally. No vertical spikes. No crashes.&lt;/p&gt;

&lt;p&gt;That almost made it worse. Because in production, it was reproducible, just not consistently. During US night hours, when traffic was low, we triggered calls manually and could sometimes reproduce the crash. Other times, everything behaved perfectly.&lt;/p&gt;

&lt;p&gt;Intermittent, probabilistic failures are far harder to reason about than deterministic ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Heap Snapshots and False Leads
&lt;/h2&gt;

&lt;p&gt;Next, we went for heap snapshots using V8 and Chrome DevTools. If something large was being retained, the snapshot would reveal it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;SIGUSR1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Received SIGUSR1 event on proc. Executing heap snapshot.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;v8&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;v8&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;writeHeapSnapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`/profiling/heap-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;.heapsnapshot`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;snapshotStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;v8&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getHeapSnapshot&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fileStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createWriteStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="nx"&gt;snapshotStream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fileStream&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Heap snapshot saved as &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nf"&gt;writeHeapSnapshot&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqvrblg3m1o9howfkrmk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqvrblg3m1o9howfkrmk.png" alt="preview" width="685" height="668"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We added a signal handler to our Node.js process so we could trigger heap snapshots on demand. The plan was simple: wait until memory rose, send the signal, capture the snapshot, and analyze it offline.&lt;/p&gt;

&lt;p&gt;There’s a catch, though. Generating a heap snapshot requires additional memory. If your pod is already close to its limit, the snapshot process itself can push it over.&lt;/p&gt;

&lt;p&gt;That’s exactly what happened.&lt;/p&gt;

&lt;p&gt;Sometimes the pod crashed before the snapshot completed. Other times it succeeded, but the analysis didn’t reveal anything clearly catastrophic. We saw objects. We saw JSON structures. We saw logs. But nothing that screamed “this is it.”&lt;/p&gt;

&lt;p&gt;We compared multiple snapshots: normal state versus spike moments. The differences weren’t obvious enough to explain a near-1GB allocation.&lt;/p&gt;

&lt;p&gt;Meanwhile, the day was progressing. It was evening in India. Which meant it was morning in the US.&lt;/p&gt;

&lt;p&gt;Traffic was about to return.&lt;/p&gt;

&lt;h2&gt;
  
  
  Watching It Happen Live
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbm9jdux95wovkhowjzbg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbm9jdux95wovkhowjzbg.png" alt="preview" width="480" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As calls started coming in, we stopped theorizing and simply watched production.&lt;/p&gt;

&lt;p&gt;There’s something tense about staring at live memory graphs when you know a crash is possible.&lt;/p&gt;

&lt;p&gt;At first, everything looked normal. Heap usage was steady. CPU was fine. Calls were connecting. Conversations were flowing.&lt;/p&gt;

&lt;p&gt;One call completed. No issue.&lt;/p&gt;

&lt;p&gt;Another started. Still stable.&lt;/p&gt;

&lt;p&gt;A few more came in. The graph moved slightly, but within normal range. For a moment, we thought maybe the issue had somehow resolved itself.&lt;/p&gt;

&lt;p&gt;Then it happened.&lt;/p&gt;

&lt;p&gt;The memory line didn’t drift upward gradually. It didn’t climb in a smooth curve. It jumped. A sharp vertical spike, as if a massive object had been allocated in a single operation.&lt;/p&gt;

&lt;p&gt;Within seconds, the pod was OOM-killed and terminated.&lt;/p&gt;

&lt;p&gt;Restarting.&lt;/p&gt;

&lt;p&gt;This wasn’t a leak accumulating over time. This was a sudden allocation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling Didn’t Save Us
&lt;/h2&gt;

&lt;p&gt;Under pressure, and with leadership understandably concerned, we tried the obvious mitigation: horizontal scaling.&lt;/p&gt;

&lt;p&gt;If one pod was overloaded, maybe splitting the traffic would help. So we spun up an additional pod and routed traffic between them.&lt;/p&gt;

&lt;p&gt;The assumption was simple: less load per instance means less memory pressure.&lt;/p&gt;

&lt;p&gt;It didn’t help.&lt;/p&gt;

&lt;p&gt;Both pods eventually crashed.&lt;/p&gt;

&lt;p&gt;That clarified something important. Scaling helps when the issue is cumulative load. It does not help when a single request is catastrophic. If one request allocates hundreds of megabytes, any pod that processes that request will fail independently.&lt;/p&gt;

&lt;p&gt;The problem wasn’t load distribution. It was logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observing Memory in Real Time
&lt;/h2&gt;

&lt;p&gt;Instead of relying on snapshots, I added periodic memory logging directly in the application. Node.js exposes memory usage metrics like &lt;code&gt;rss&lt;/code&gt;, &lt;code&gt;heapTotal&lt;/code&gt;, &lt;code&gt;heapUsed&lt;/code&gt;, &lt;code&gt;external&lt;/code&gt;, and &lt;code&gt;arrayBuffers&lt;/code&gt;. We logged them every few seconds.&lt;/p&gt;

&lt;p&gt;If you’re a Node.js developer, you probably know what each of these means. If not, here’s a quick rundown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;rss&lt;/code&gt; (Resident Set Size): The total memory allocated for the process execution in main memory, including the heap, stack, and code segments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;heapTotal&lt;/code&gt;: The total size of the allocated memory heap, which is managed by the V8 engine and stores objects, strings, and closures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;heapUsed&lt;/code&gt;: The actual memory currently being used within the &lt;code&gt;heapTotal&lt;/code&gt;. This is often the most relevant parameter for identifying memory leaks in the JavaScript code itself.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;external&lt;/code&gt;: Memory used by C++ objects that are bound to JavaScript objects managed by V8.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;arrayBuffers&lt;/code&gt;: Memory allocated for &lt;code&gt;ArrayBuffer&lt;/code&gt;s and &lt;code&gt;SharedArrayBuffer&lt;/code&gt;s, which is also included in the &lt;code&gt;external&lt;/code&gt; value&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I added the following code to our codebase to log memory usage over time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nf"&gt;setInterval&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;memoryUsage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`[MEMORY]&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="p"&gt;${(&lt;/span&gt;&lt;span class="nx"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt; MB`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our thinking: if streaming were the issue, we would see &lt;code&gt;arrayBuffers&lt;/code&gt; or &lt;code&gt;external&lt;/code&gt; increase.&lt;/p&gt;

&lt;p&gt;But surprisingly, it was &lt;code&gt;heapUsed&lt;/code&gt; that grew, and the pattern was consistent.&lt;/p&gt;

&lt;p&gt;Then suddenly, &lt;code&gt;heapUsed&lt;/code&gt; would spike dramatically, hundreds of megabytes in a short window, and the pod would be killed.&lt;/p&gt;

&lt;p&gt;This ruled out slow leaks and Twilio audio streams. Garbage collection wasn’t failing. Something large was being allocated all at once.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern
&lt;/h2&gt;

&lt;p&gt;Eventually, one of our developers noticed something interesting in the logs around the spike.&lt;/p&gt;

&lt;p&gt;A tool call.&lt;/p&gt;

&lt;p&gt;More specifically, a tool call with an empty object as parameters.&lt;/p&gt;

&lt;p&gt;Our AI agents can invoke tools. One of those tools performs a database search and expects a required parameter to filter the collection. What we saw in the logs was an empty object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At first glance, it didn’t look dangerous. It was syntactically valid. It didn’t throw an error. The function executed normally.&lt;/p&gt;

&lt;p&gt;But that empty object changed everything. After receiving the params, we used them directly to query the database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our collection contained around one million documents.&lt;/p&gt;

&lt;p&gt;When you execute &lt;code&gt;Model.find({})&lt;/code&gt; in MongoDB, you are not asking for nothing. You are asking for everything.&lt;/p&gt;

&lt;p&gt;MongoDB did exactly what we requested. It returned all documents.&lt;/p&gt;

&lt;p&gt;The Node.js driver then deserialized those documents into JavaScript objects in memory before our code could process them. That meant potentially hundreds of megabytes being allocated almost instantly.&lt;/p&gt;

&lt;p&gt;Inside a pod limited to 1GB.&lt;/p&gt;

&lt;p&gt;The vertical memory spike finally made sense.&lt;/p&gt;

&lt;p&gt;This wasn’t a memory leak. It wasn’t streaming buffers accumulating. It wasn’t garbage collection lag. It was a full-collection query triggered by an empty filter.&lt;/p&gt;
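&lt;p&gt;A defense-in-depth guard makes this class of failure impossible at the query layer. Below is a minimal sketch of the idea; the &lt;code&gt;buildSafeQuery&lt;/code&gt; helper and the &lt;code&gt;MAX_DOCS&lt;/code&gt; cap are illustrative names, not our actual code.&lt;/p&gt;

```javascript
// Hedged sketch: refuse unbounded queries before they reach the driver.
// buildSafeQuery and MAX_DOCS are illustrative, not our real codebase.
const MAX_DOCS = 100;

function buildSafeQuery(filter) {
  // An empty filter means "the whole collection", so refuse it outright.
  if (!filter || typeof filter !== "object" || Object.keys(filter).length === 0) {
    throw new Error("Refusing unbounded query: empty filter");
  }
  // Even with a valid filter, cap the result set so one bad call
  // can never deserialize an entire collection into memory.
  return { filter, options: { limit: MAX_DOCS } };
}
```

&lt;p&gt;In the real handler, the returned filter and options would then be passed to &lt;code&gt;db.collection.find(filter, options)&lt;/code&gt;.&lt;/p&gt;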

&lt;h2&gt;
  
  
  Why It Was So Hard to Reproduce
&lt;/h2&gt;

&lt;p&gt;It didn’t happen on every call.&lt;/p&gt;

&lt;p&gt;Only one agent had access to that tool. Only certain conversation flows triggered it. Only when the AI decided the tool was relevant. And only when the model generated an empty object instead of a properly populated parameter set.&lt;/p&gt;

&lt;p&gt;Unless that exact probabilistic sequence occurred, the system behaved perfectly.&lt;/p&gt;

&lt;p&gt;Traditional bugs are deterministic. Given the same input, you get the same output.&lt;/p&gt;

&lt;p&gt;AI-integrated systems introduce probabilistic behavior. The model didn’t crash the server directly. It generated a syntactically valid tool call that was semantically unsafe. And we trusted it.&lt;/p&gt;

&lt;p&gt;That trust was the real bug.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;Once understood, the fix was straightforward.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7u8b1rvgc5vbcv1arkz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7u8b1rvgc5vbcv1arkz.png" alt="GIF" width="384" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We added strict schema validation before executing any tool call. If required parameters were missing, the call was rejected immediately. Empty filters were explicitly disallowed. We chose to fail fast instead of querying blindly.&lt;/p&gt;
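&lt;p&gt;As a rough sketch of what that fail-fast validation looks like (the &lt;code&gt;searchCustomers&lt;/code&gt; tool and its schema here are hypothetical, not our real definitions):&lt;/p&gt;

```javascript
// Hedged sketch of fail-fast tool-call validation.
// "searchCustomers" and its schema are hypothetical examples.
const toolSchemas = {
  searchCustomers: { required: ["query"] },
};

function validateToolCall(toolName, params) {
  const schema = toolSchemas[toolName];
  if (!schema) throw new Error(`Unknown tool: ${toolName}`);
  // Empty parameter objects are rejected before any query runs.
  if (!params || Object.keys(params).length === 0) {
    throw new Error(`Rejected ${toolName}: empty parameters`);
  }
  for (const key of schema.required) {
    if (params[key] === undefined || params[key] === "") {
      throw new Error(`Rejected ${toolName}: missing required "${key}"`);
    }
  }
  return true;
}
```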

&lt;p&gt;There was no infrastructure change. No scaling adjustment. No tuning of garbage collection.&lt;/p&gt;

&lt;p&gt;Just validation.&lt;/p&gt;

&lt;p&gt;After that, the crashes stopped. It took us almost 24 hours to find and fix.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>sre</category>
    </item>
    <item>
      <title>2025 Voice AI Guide How to Make Your Own Real-Time Voice Agent (Part-3)</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Sun, 11 Jan 2026 12:25:50 +0000</pubDate>
      <link>https://forem.com/programmerraja/2025-voice-ai-guide-how-to-make-your-own-real-time-voice-agent-part-3-3ocb</link>
      <guid>https://forem.com/programmerraja/2025-voice-ai-guide-how-to-make-your-own-real-time-voice-agent-part-3-3ocb</guid>
<description>&lt;p&gt;Welcome back! The waiting is over. In Part 3, we’ll see how to run the components of our voice agent locally, even on a CPU. At the end, you’ll get homework: integrating all of these pieces into generic code that runs locally.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Performance Reality: Setting Expectations with Latency Budgets
&lt;/h2&gt;

&lt;p&gt;Before we dive into running components, &lt;strong&gt;you need to understand what "fast" actually means in voice AI&lt;/strong&gt;. Industry benchmarks show that users perceive natural conversation when end-to-end latency (time from user finishing speaking to hearing the agent's response) is &lt;strong&gt;under 800ms&lt;/strong&gt;, with the gold standard being &lt;strong&gt;under 500ms&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let's break down where those milliseconds go:&lt;/p&gt;

&lt;h3&gt;
  
  
  Latency Budget Breakdown
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Target Latency&lt;/th&gt;
&lt;th&gt;Upper Limit&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speech-to-Text (STT)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;200-350ms&lt;/td&gt;
&lt;td&gt;500ms&lt;/td&gt;
&lt;td&gt;Measured from silence detection to final transcript&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM Time-to-First-Token (TTFT)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100-200ms&lt;/td&gt;
&lt;td&gt;400ms&lt;/td&gt;
&lt;td&gt;First token generation (not full response)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Text-to-Speech TTFB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;75-150ms&lt;/td&gt;
&lt;td&gt;250ms&lt;/td&gt;
&lt;td&gt;Time to first byte of audio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Network &amp;amp; Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50-100ms&lt;/td&gt;
&lt;td&gt;150ms&lt;/td&gt;
&lt;td&gt;WebSocket hops, service-to-service handoff&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total Mouth-to-Ear Gap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;500-800ms&lt;/td&gt;
&lt;td&gt;1100ms&lt;/td&gt;
&lt;td&gt;Complete turn latency&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this matters&lt;/strong&gt;: If your STT alone takes 500ms, you've already exhausted most of your latency budget. This is why model choice and orchestration matter a lot.&lt;/p&gt;

&lt;p&gt;If you want more depth on latency and related topics, check out the Pipecat article &lt;a href="https://voiceaiandvoiceagents.com/" rel="noopener noreferrer"&gt;Conversational Voice AI in 2025&lt;/a&gt;, which covers them in depth.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;local inference on CPU/modest GPU&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expect 1.2-1.5s latency for the first response&lt;/li&gt;
&lt;li&gt;Subsequent turns may hit 800-1000ms as models warm up&lt;/li&gt;
&lt;li&gt;This is acceptable for local development; production requires better hardware or cloud providers&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Hardware Reality: CPU vs GPU
&lt;/h2&gt;

&lt;p&gt;Before we run anything, we need to address the elephant in the room: &lt;strong&gt;Computation&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why do models crave GPUs?
&lt;/h3&gt;

&lt;p&gt;AI models are essentially giant math problems involving billions of matrix multiplications.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPUs&lt;/strong&gt; are like a Ferrari: insanely fast at doing one or two complex things at a time (Sequential Processing).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPUs&lt;/strong&gt; are like a bus service: slower at individual tasks, but can transport thousands of people (numbers) at once (Parallel Processing).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since neural networks need to calculate billions of numbers simultaneously, GPUs are exponentially faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"But I only have a CPU!"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't worry. We can still run these models using a technique called &lt;strong&gt;Quantization&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Standard models use 16-bit floating-point numbers (e.g., &lt;code&gt;3.14159...&lt;/code&gt;). Quantization rounds these down to 4-bit or 8-bit integers (e.g., &lt;code&gt;3&lt;/code&gt;). This drastically reduces the size of the model and makes the math simple enough for a CPU to handle reasonably well, though it will practically always be slower than a GPU.&lt;/p&gt;
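&lt;p&gt;The savings are easy to estimate with back-of-envelope arithmetic: a model’s footprint is roughly its parameter count times the bytes per parameter (ignoring activations and runtime overhead).&lt;/p&gt;

```javascript
// Back-of-envelope model footprint: parameters x bytes per parameter.
// Ignores activations, KV cache, and runtime overhead.
function modelSizeGB(paramsBillions, bitsPerParam) {
  return (paramsBillions * 1e9 * (bitsPerParam / 8)) / 1e9;
}

// A 7B model drops from 14 GB at 16-bit to 3.5 GB at 4-bit quantization
```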

&lt;h3&gt;
  
  
  Minimum System Requirements for Local Voice Agents
&lt;/h3&gt;

&lt;p&gt;Here's what you actually need to get started:&lt;/p&gt;

&lt;h4&gt;
  
  
  For Development (CPU-Only)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Minimum&lt;/th&gt;
&lt;th&gt;Recommended&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4-core modern processor (Intel i5/AMD Ryzen 5)&lt;/td&gt;
&lt;td&gt;8-core or better&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;16GB&lt;/td&gt;
&lt;td&gt;32GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50GB SSD&lt;/td&gt;
&lt;td&gt;100GB NVMe SSD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None required&lt;/td&gt;
&lt;td&gt;NVIDIA GTX 1070 or better&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.5-2.5s per turn&lt;/td&gt;
&lt;td&gt;800-1200ms per turn&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  For Production (GPU-Accelerated)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Entry&lt;/th&gt;
&lt;th&gt;Mid-Range&lt;/th&gt;
&lt;th&gt;High-Performance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;NVIDIA RTX 3060 (12GB)&lt;/td&gt;
&lt;td&gt;RTX 3080 (10GB)&lt;/td&gt;
&lt;td&gt;RTX 4090 (24GB) or Tesla A100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;VRAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8-12GB&lt;/td&gt;
&lt;td&gt;16GB&lt;/td&gt;
&lt;td&gt;24GB+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;System RAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;32GB&lt;/td&gt;
&lt;td&gt;64GB&lt;/td&gt;
&lt;td&gt;128GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8-core (Intel i7/Ryzen 7)&lt;/td&gt;
&lt;td&gt;16-core&lt;/td&gt;
&lt;td&gt;32-core workstation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency Target&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;800-1000ms&lt;/td&gt;
&lt;td&gt;500-700ms&lt;/td&gt;
&lt;td&gt;&amp;lt;500ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The 2x VRAM Rule&lt;/strong&gt;: Your system RAM should be &lt;strong&gt;at least double your total GPU VRAM&lt;/strong&gt;. If you have a single RTX 3080 (10GB), you need at least 20GB of system RAM; 32GB+ is better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speech-to-Text (STT)
&lt;/h2&gt;

&lt;p&gt;First, we are going to see how to run the STT component. As mentioned in Part 1, we are using Whisper from OpenAI. But before we blindly pick a model, we need to know what to look for.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Blueprints of Hearing: STT Selection Criteria
&lt;/h3&gt;

&lt;p&gt;When selecting a Speech-to-Text model for production, "it works" isn't enough. You need to verify specific metrics to ensure it won't break your conversational flow.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Word Error Rate (WER)
&lt;/h4&gt;

&lt;p&gt;This is the cornerstone accuracy metric. It calculates the percentage of incorrect words.&lt;br&gt;
&lt;strong&gt;Formula&lt;/strong&gt;: &lt;code&gt;WER = (Substitutions + Deletions + Insertions) / Total Words&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Pro systems aim for &lt;strong&gt;5-10% WER&lt;/strong&gt; (90-95% accuracy).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reality Check&lt;/strong&gt;: For casual voice chats, anything under &lt;strong&gt;15-20%&lt;/strong&gt; is often acceptable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Matters&lt;/strong&gt;: A "digit recognition" task might have 0.3% WER, while "broadcast news" might have 15%. Don’t blindly trust paper benchmarks; test on &lt;em&gt;your&lt;/em&gt; audio.&lt;/li&gt;
&lt;/ul&gt;
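&lt;p&gt;The formula is simple enough to sketch directly; the edit counts here are assumed to come from a word-level alignment step (e.g. Levenshtein), which is out of scope.&lt;/p&gt;

```javascript
// WER = (substitutions + deletions + insertions) / total reference words.
// Edit counts are assumed to come from a word-level alignment step.
function wordErrorRate(substitutions, deletions, insertions, totalWords) {
  return (substitutions + deletions + insertions) / totalWords;
}

// 3 substitutions + 1 deletion + 1 insertion over 50 words gives 0.1 (10% WER)
```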
&lt;h4&gt;
  
  
  2. Latency &amp;amp; Real-Time Factor (RTF)
&lt;/h4&gt;

&lt;p&gt;Speed is more than just feeling fast; it's about physics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Time to First Byte (TTFB)&lt;/strong&gt;: Time from "speech start" to "partial transcript". Target &lt;strong&gt;&amp;lt;300ms&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Factor (RTF)&lt;/strong&gt;: &lt;code&gt;Processing Time / Audio Duration&lt;/code&gt;.

&lt;ul&gt;
&lt;li&gt;If RTF &amp;gt; 1.0, the system is slower than real-time (impossible for live agents).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target&lt;/strong&gt;: You want an RTF of &lt;strong&gt;0.5 or lower&lt;/strong&gt; (processing 10s of audio in 5s) to handle overheads.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "Flush Trick"&lt;/strong&gt;: Advanced pipelines don't wait. When VAD detects silence, they "flush" the buffer immediately, cutting latency from ~500ms to ~125ms.&lt;/li&gt;
&lt;/ul&gt;
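&lt;p&gt;RTF is just a ratio, so it’s easy to compute from your own measurements:&lt;/p&gt;

```javascript
// RTF = processing time / audio duration. Below 1.0 is faster than real-time.
function realTimeFactor(processingSeconds, audioSeconds) {
  return processingSeconds / audioSeconds;
}

// Processing 10s of audio in 5s gives RTF 0.5, within the target
```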
&lt;h4&gt;
  
  
  3. Noise Robustness &amp;amp; SNR
&lt;/h4&gt;

&lt;p&gt;Lab audio is clean; user audio is messy. Performance drops sharply when &lt;strong&gt;Signal-to-Noise Ratio (SNR)&lt;/strong&gt; falls below 3dB.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"Talking" Noise&lt;/strong&gt;: Background chatter usually doesn't break modern models like Whisper.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Crowded" Noise&lt;/strong&gt;: Train stations or cafes are the hardest tests. If your users are mobile, prioritize noise-robust models (like &lt;code&gt;distil-whisper&lt;/code&gt;) over pure accuracy models.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  4. Critical Features for Agents
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speaker Diarization&lt;/strong&gt;: "Who spoke when?" Essential if you want your agent to talk to multiple people, though it adds latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Punctuation &amp;amp; Capitalization&lt;/strong&gt;: Raw STT is lowercase streams (&lt;code&gt;hello world&lt;/code&gt;). Good models add punctuation (&lt;code&gt;Hello, world.&lt;/code&gt;) which is &lt;strong&gt;critical&lt;/strong&gt; for the LLM to understand semantics and mood.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Model Selection for Real-Time Performance
&lt;/h3&gt;

&lt;p&gt;With &lt;code&gt;faster-whisper&lt;/code&gt;, we use the &lt;code&gt;Systran/faster-distil-whisper-medium.en&lt;/code&gt; checkpoint from Hugging Face, but feel free to explore others:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model name&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Real-Time Factor (RTF)*&lt;/th&gt;
&lt;th&gt;Typical use case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;tiny&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;39M&lt;/td&gt;
&lt;td&gt;Multilingual&lt;/td&gt;
&lt;td&gt;0.05 (20x real-time)&lt;/td&gt;
&lt;td&gt;Very fast, rough drafts, low-end CPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;tiny.en&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;39M&lt;/td&gt;
&lt;td&gt;English-only&lt;/td&gt;
&lt;td&gt;0.08 (12x real-time)&lt;/td&gt;
&lt;td&gt;Fast English-only STT with small footprint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;base&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;74M&lt;/td&gt;
&lt;td&gt;Multilingual&lt;/td&gt;
&lt;td&gt;0.15 (6.5x real-time)&lt;/td&gt;
&lt;td&gt;Better than tiny, still lightweight&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;base.en&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;74M&lt;/td&gt;
&lt;td&gt;English-only&lt;/td&gt;
&lt;td&gt;0.20 (5x real-time)&lt;/td&gt;
&lt;td&gt;Accurate English with low compute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;small&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;244M&lt;/td&gt;
&lt;td&gt;Multilingual&lt;/td&gt;
&lt;td&gt;0.35 (2.8x real-time)&lt;/td&gt;
&lt;td&gt;Good balance of speed and quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;small.en&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;244M&lt;/td&gt;
&lt;td&gt;English-only&lt;/td&gt;
&lt;td&gt;0.40 (2.5x real-time)&lt;/td&gt;
&lt;td&gt;Higher-quality English on moderate hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;distil-medium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;394M&lt;/td&gt;
&lt;td&gt;English-only&lt;/td&gt;
&lt;td&gt;0.25 (4x real-time)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Best local balance&lt;/strong&gt;: 49% smaller, within 1% WER of full medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;medium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;769M&lt;/td&gt;
&lt;td&gt;Multilingual&lt;/td&gt;
&lt;td&gt;0.80 (1.25x real-time)&lt;/td&gt;
&lt;td&gt;High accuracy, slower; needs stronger machine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;medium.en&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;769M&lt;/td&gt;
&lt;td&gt;English-only&lt;/td&gt;
&lt;td&gt;0.85 (1.17x real-time)&lt;/td&gt;
&lt;td&gt;Very accurate English, heavier compute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;large / v2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.55B&lt;/td&gt;
&lt;td&gt;Multilingual&lt;/td&gt;
&lt;td&gt;2.5 (0.4x real-time)&lt;/td&gt;
&lt;td&gt;Best quality older large models, GPU required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;large-v3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.55B&lt;/td&gt;
&lt;td&gt;Multilingual&lt;/td&gt;
&lt;td&gt;3.2 (0.3x real-time)&lt;/td&gt;
&lt;td&gt;Latest, improved multilingual, GPU strongly recommended&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;RTF (Real-Time Factor) = Time to process audio / Length of audio. 0.05 = 20x faster than real-time.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommendation for local voice agents&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU-only&lt;/strong&gt;: &lt;code&gt;distil-medium&lt;/code&gt; or &lt;code&gt;small.en&lt;/code&gt; (aim for &amp;lt;300ms latency)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU with 8GB VRAM&lt;/strong&gt;: &lt;code&gt;medium.en&lt;/code&gt; (aim for 200-250ms latency)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU with 16GB+ VRAM&lt;/strong&gt;: &lt;code&gt;large-v3&lt;/code&gt; (aim for 150-200ms latency)&lt;/li&gt;
&lt;/ul&gt;
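&lt;p&gt;As a rough sketch, the recommendation tiers above can be encoded in a small helper (the function and tier cutoffs are my own, not an official API; validate against your own latency tests):&lt;/p&gt;

```python
def pick_whisper_model(device, vram_gb=0):
    """Map available hardware to a reasonable Whisper variant.

    The tiers mirror the recommendations above.
    """
    if device == "cpu":
        return "distil-medium"     # or "small.en" if you only need English
    if vram_gb >= 16:
        return "large-v3"          # best accuracy, needs a big GPU
    if vram_gb >= 8:
        return "medium.en"
    return "small.en"              # small GPUs: stay lightweight

print(pick_whisper_model("cpu"))        # distil-medium
print(pick_whisper_model("cuda", 8))    # medium.en
print(pick_whisper_model("cuda", 24))   # large-v3
```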
&lt;h3&gt;
  
  
  The Interruptibility Problem: Barge-In and VAD
&lt;/h3&gt;

&lt;p&gt;Here's something rarely discussed openly: &lt;strong&gt;VAD isn't just for silence detection; it's also a critical component of interruption handling (barge-in).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a user speaks while your agent is talking, three things must happen instantly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Echo Cancellation (AEC)&lt;/strong&gt;: Remove your agent's voice from the audio stream so the STT doesn't get confused hearing itself&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice Activity Detection (VAD)&lt;/strong&gt;: Detect the user speaking (probability-based, not just volume threshold)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Immediate TTS Cancellation&lt;/strong&gt;: Stop the agent from continuing mid-sentence&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Typical barge-in detection requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;VAD Latency&lt;/strong&gt;: 85-100ms (using models like Silero VAD, which is a neural, probability-based detector rather than a simple energy threshold)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Barge-in Stop Latency&lt;/strong&gt;: &amp;lt;200ms (system must stop speaking within 200ms of user interruption for natural feel)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt;: 95%+ (must not false-trigger on background noise)&lt;/li&gt;
&lt;/ul&gt;
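&lt;p&gt;The barge-in logic above can be sketched as a tiny state machine (pure Python, hypothetical names; a real pipeline wires this into the VAD callback):&lt;/p&gt;

```python
class BargeInController:
    """Cancel TTS playback as soon as VAD reports user speech."""

    def __init__(self, speech_threshold=0.5):
        self.speech_threshold = speech_threshold  # VAD speech-probability cutoff
        self.tts_playing = False
        self.cancelled = False

    def start_tts(self):
        self.tts_playing = True
        self.cancelled = False

    def on_vad_frame(self, speech_probability):
        """Called every 10-20ms with the VAD's speech probability for that frame."""
        if self.tts_playing and speech_probability >= self.speech_threshold:
            # User barged in: stop speaking immediately (target under 200ms).
            self.tts_playing = False
            self.cancelled = True

ctrl = BargeInController()
ctrl.start_tts()
ctrl.on_vad_frame(0.1)   # background noise: keep talking
ctrl.on_vad_frame(0.9)   # user speech detected: cancel TTS
print(ctrl.cancelled)    # True
```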

&lt;p&gt;Without proper barge-in handling, your voice agent sounds robotic because users can't interrupt; they must wait for the full response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's better: simple energy-based VAD that misses some speech, or Silero VAD that uses neural networks?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use &lt;strong&gt;Silero VAD&lt;/strong&gt;, which has built-in support in Pipecat, so there's little to configure; it handles both CPU and GPU automatically. It's trained to estimate "speech probability" rather than just volume, so it handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whispers and soft speech&lt;/li&gt;
&lt;li&gt;Background noise (doesn't trigger on dog barks)&lt;/li&gt;
&lt;li&gt;Different accents and speech patterns&lt;/li&gt;
&lt;li&gt;Real-time streaming (10-20ms window processing)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  How to run STT
&lt;/h3&gt;

&lt;p&gt;To serve this, we need a server or inference engine. While &lt;code&gt;faster-whisper&lt;/code&gt; ships as a library, we need a server-style architecture (similar to Deepgram) where we connect to a WebSocket server, send audio, and receive text. I have written a simple WebSocket server that runs the model on either CPU or GPU.&lt;/p&gt;

&lt;p&gt;I have dockerized everything to make our lives easier.&lt;/p&gt;

&lt;p&gt;All the code for this component is located in &lt;code&gt;code/Models/STT&lt;/code&gt;. Let's look at what's inside:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;server.py&lt;/code&gt;: The heart of the STT. It starts a &lt;strong&gt;WebSocket server&lt;/strong&gt; that receives audio chunks, runs them through the Whisper model, and streams back text.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;download_model.py&lt;/code&gt;: A helper script to download the specific &lt;code&gt;faster-whisper&lt;/code&gt; model weights from HuggingFace.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;docker-gpu.dockerfile&lt;/code&gt;: The environment setup for NVIDIA GPU users (installs CUDA drivers).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;docker-cpu.dockerfile&lt;/code&gt;: The environment for CPU users (lighter setup).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Architecture Flow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;WebSocket Connection&lt;/strong&gt;: We use WebSockets instead of REST API because we need a persistent connection to stream audio continuously.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Audio Chunking&lt;/strong&gt;: The client (your browser/mic) records audio and chops it into small "chunks" (bytes).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Streaming&lt;/strong&gt;: These chunks are sent over the WebSocket instantly.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Processing&lt;/strong&gt;: The server receives these raw bytes (usually Int16 format), converts them to floating-point numbers (Float32), and feeds them into the Whisper model.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Voice Activity Detection (VAD)&lt;/strong&gt;: The server listens to your audio stream. When it detects silence (you stopped speaking), it commits the transcription and sends it out.&lt;/li&gt;
&lt;/ol&gt;
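&lt;p&gt;Step 4 (converting Int16 bytes to Float32) is a one-liner in practice. Here is a stdlib-only sketch (production code would typically use numpy; this assumes native little-endian byte order):&lt;/p&gt;

```python
import array

def int16_bytes_to_float32(chunk):
    """Convert PCM Int16 samples to floats in [-1.0, 1.0] for the model."""
    samples = array.array("h")          # "h" = signed 16-bit integers
    samples.frombytes(chunk)
    return [s / 32768.0 for s in samples]

# A chunk containing the samples 0, 16384, -32768:
chunk = array.array("h", [0, 16384, -32768]).tobytes()
print(int16_bytes_to_float32(chunk))    # [0.0, 0.5, -1.0]
```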

&lt;p&gt;&lt;strong&gt;Example Scenario&lt;/strong&gt;:&lt;br&gt;
Imagine you say &lt;strong&gt;"Hello Agent"&lt;/strong&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Your microphone captures 1 second of audio.&lt;/li&gt;
&lt;li&gt; The browser slices this into 20 tiny audio packets and shoots them to the server one by one.&lt;/li&gt;
&lt;li&gt; The Server processes them in real-time. It hears "He...", then "Hello...", then "Hello A...".&lt;/li&gt;
&lt;li&gt; You stop talking. The VAD logic sees 500ms of silence.&lt;/li&gt;
&lt;li&gt; It shouts "STOP!" and sends the final text &lt;code&gt;"Hello Agent"&lt;/code&gt; to the next step.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  How to Run
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;On GPU (Recommended):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-f&lt;/span&gt; docker-gpu.dockerfile &lt;span class="nt"&gt;-t&lt;/span&gt; stt-gpu &lt;span class="nb"&gt;.&lt;/span&gt;
docker run &lt;span class="nt"&gt;--gpus&lt;/span&gt; all &lt;span class="nt"&gt;-p&lt;/span&gt; 8000:8000 stt-gpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;On CPU:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-f&lt;/span&gt; docker-cpu.dockerfile &lt;span class="nt"&gt;-t&lt;/span&gt; stt-cpu &lt;span class="nb"&gt;.&lt;/span&gt;
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8000:8000 stt-cpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Large Language Model (LLM)
&lt;/h2&gt;

&lt;p&gt;Next, we need a brain. But before we just pick "Llama 3", we need to understand the physics of running a brain on your computer.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Blueprints of Thinking: LLM Selection Criteria
&lt;/h3&gt;

&lt;p&gt;Choosing an LLM for voice isn't about choosing the smartest one; it's about choosing the one that fits.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. The VRAM Formula
&lt;/h4&gt;

&lt;p&gt;Will it fit? Don't guess. Use the math.&lt;br&gt;
&lt;strong&gt;Formula&lt;/strong&gt;: &lt;code&gt;VRAM (GB) ≈ Params (Billions) × Precision (Bytes) × 1.2 (Overhead)&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Precision Refresher&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FP16 (16-bit)&lt;/strong&gt;: 2 Bytes/param. (The standard).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;INT8 (8-bit)&lt;/strong&gt;: 1 Byte/param. (50% smaller than standard).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;INT4 (4-bit)&lt;/strong&gt;: 0.5 Bytes/param. (The sweet spot for local setups).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example Calculation (Llama 3 8B)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;@ FP16: &lt;code&gt;8 × 2 × 1.2&lt;/code&gt; = &lt;strong&gt;19.2 GB&lt;/strong&gt; (Needs A100/3090/4090)&lt;/li&gt;
&lt;li&gt;@ INT4: &lt;code&gt;8 × 0.5 × 1.2&lt;/code&gt; = &lt;strong&gt;4.8 GB&lt;/strong&gt; (Runs on almost any modern GPU/Laptop!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Note: Context window (KV Cache) adds variable memory. 8K context is usually +1GB.&lt;/em&gt;&lt;/p&gt;
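&lt;p&gt;The formula translates directly to code; a quick sketch:&lt;/p&gt;

```python
def vram_gb(params_billions, bytes_per_param, overhead=1.2):
    """Estimate weight VRAM: params x precision x overhead. KV cache is extra."""
    return params_billions * bytes_per_param * overhead

# Llama 3 8B at different precisions:
print(vram_gb(8, 2.0))   # FP16: 19.2 GB
print(vram_gb(8, 1.0))   # INT8: 9.6 GB
print(vram_gb(8, 0.5))   # INT4: 4.8 GB
```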
&lt;h4&gt;
  
  
  2. Throughput vs. Latency
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tokens Per Second (TPS)&lt;/strong&gt;: How fast it reads/generates.

&lt;ul&gt;
&lt;li&gt;Humans read/listen at ~4 TPS.&lt;/li&gt;
&lt;li&gt;&amp;gt; 8 TPS is diminishing returns for voice.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time To First Token (TTFT)&lt;/strong&gt;: &lt;strong&gt;This is the King metric&lt;/strong&gt;.

&lt;ul&gt;
&lt;li&gt;Sub-200ms = Instant.&lt;/li&gt;
&lt;li&gt;&amp;gt; 2s = "Is it broken?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Optimize for TTFT, not max throughput.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  3. Benchmarks That Actually Matter
&lt;/h4&gt;

&lt;p&gt;Don't just look at the leaderboard. Look at the right columns.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MMLU&lt;/strong&gt;: General knowledge. Good baseline, but vague.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IFEval (Instruction Following)&lt;/strong&gt;: &lt;strong&gt;Crucial for Agents&lt;/strong&gt;. Can it follow your system prompt instructions? Current small models (~2B) are getting good at this (80%+).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GSM8K&lt;/strong&gt;: Logic/Math. Good proxy for "reasoning" capability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a local voice agent, a &lt;strong&gt;high IFEval&lt;/strong&gt; score is often more valuable than a high MMLU score because if the agent ignores your "Keep responses short" instruction, the user experience fails.&lt;/p&gt;
&lt;h3&gt;
  
  
  Inference Engines
&lt;/h3&gt;

&lt;p&gt;To run a model locally, we need an &lt;strong&gt;Inference Engine&lt;/strong&gt;. If you search Google, you will find many options. Here are a few popular ones:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;Primary Use&lt;/th&gt;
&lt;th&gt;Hardware&lt;/th&gt;
&lt;th&gt;Quantization Support&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ollama&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local single-machine LLM serving&lt;/td&gt;
&lt;td&gt;CPU, GPU (NVIDIA, Apple Metal)&lt;/td&gt;
&lt;td&gt;GGUF (Q4, Q5, Q8)&lt;/td&gt;
&lt;td&gt;Local dev, prototypes, low traffic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;llama.cpp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CPU-optimized inference&lt;/td&gt;
&lt;td&gt;CPU (x86, ARM), GPU&lt;/td&gt;
&lt;td&gt;GGUF (Q2-Q8, AWQ, IQ2-IQ4)&lt;/td&gt;
&lt;td&gt;Resource-constrained, edge devices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;vLLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High-throughput production LLM serving&lt;/td&gt;
&lt;td&gt;NVIDIA GPU, AMD, Intel&lt;/td&gt;
&lt;td&gt;INT8, FP8, FP16, AWQ, GPTQ&lt;/td&gt;
&lt;td&gt;Production APIs, high concurrency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TensorRT-LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Maximum NVIDIA performance&lt;/td&gt;
&lt;td&gt;NVIDIA GPU only (CC &amp;gt;= 7.0)&lt;/td&gt;
&lt;td&gt;INT8, FP16, FP8 (H100+)&lt;/td&gt;
&lt;td&gt;Ultra-low latency, NVIDIA-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SGLang&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High-throughput production LLM serving&lt;/td&gt;
&lt;td&gt;NVIDIA GPU, AMD, Intel&lt;/td&gt;
&lt;td&gt;FP16, INT8&lt;/td&gt;
&lt;td&gt;Research, RadixAttention, multi-turn&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;From this list, we are going to use &lt;strong&gt;SGLang&lt;/strong&gt; to run our model on GPU; for CPU, we can go with &lt;strong&gt;Ollama&lt;/strong&gt;, which is very simple and easy to set up.&lt;/p&gt;

&lt;p&gt;We are using &lt;strong&gt;Llama 3.1 8B&lt;/strong&gt;, one of the strongest small open-source models currently available.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why TTFT (Time-to-First-Token) Is What Matters
&lt;/h3&gt;

&lt;p&gt;When users wait for a response, what they perceive is &lt;strong&gt;how long until they hear the first word&lt;/strong&gt;. Here's why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prefill Phase&lt;/strong&gt;: Model processes your entire prompt (100-500ms for 8B models)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decoding Phase&lt;/strong&gt;: Model generates one token at a time, streams it immediately to TTS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key Insight&lt;/strong&gt;: TTS can start speaking as soon as token #1 arrives&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So if your TTFT is 150ms, users hear the first word in 150ms + TTS latency (75-150ms) = &lt;strong&gt;225-300ms total&lt;/strong&gt;. The full response might take 5 seconds to complete, but the user hears audio within 300ms.&lt;/p&gt;

&lt;p&gt;This is why raw token generation speed (throughput) matters less than TTFT in conversational AI.&lt;/p&gt;
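&lt;p&gt;The arithmetic above, as a sketch (the function name is my own):&lt;/p&gt;

```python
def time_to_first_audio_ms(ttft_ms, tts_ttfb_ms):
    """What the user perceives: LLM first-token latency plus TTS time-to-first-byte."""
    return ttft_ms + tts_ttfb_ms

print(time_to_first_audio_ms(150, 75))    # 225  -- feels instant
print(time_to_first_audio_ms(150, 150))   # 300  -- still conversational
print(time_to_first_audio_ms(2000, 100))  # 2100 -- "is it broken?"
```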
&lt;h3&gt;
  
  
  Folder Structure
&lt;/h3&gt;

&lt;p&gt;Code location: &lt;code&gt;code/Models/LLM&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;llama-gpu.dockerfile&lt;/code&gt;: Setup for vLLM or SGLang (GPU).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llama-cpu.dockerfile&lt;/code&gt;: Setup for Ollama (CPU).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Architecture Flow
&lt;/h3&gt;

&lt;p&gt;The LLM server isn't just a text-in/text-out box. It handles queuing and batching to keep up.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Request Queue&lt;/strong&gt;: Your prompt enters a waiting line.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Batching&lt;/strong&gt;: The server groups your request with others (if any).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Prefill&lt;/strong&gt;: It processes your input text (Prompt) to understand the context.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Decoding (Token by Token)&lt;/strong&gt;: It generates one word-part (token) at a time.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Streaming&lt;/strong&gt;: As soon as a token is generated, it is sent back. It doesn't wait for the full sentence.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example Scenario&lt;/strong&gt;:&lt;br&gt;
Input: &lt;strong&gt;"What is 2+2?"&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Tokenizer&lt;/strong&gt;: Converts text to numbers &lt;code&gt;[123, 84, 99]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Inference&lt;/strong&gt;: The model calculates the most likely next number.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Token 1&lt;/strong&gt;: Generates &lt;code&gt;"It"&lt;/code&gt;. Sends it immediately.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Token 2&lt;/strong&gt;: Generates &lt;code&gt;"is"&lt;/code&gt;. Sends it.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Token 3&lt;/strong&gt;: Generates &lt;code&gt;"4"&lt;/code&gt;. Sends it.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;End&lt;/strong&gt;: Sends &lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt; (End of Sequence).&lt;/li&gt;
&lt;/ol&gt;
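&lt;p&gt;The decode-and-stream loop above can be simulated in a few lines (pure Python; a real server would yield tokens from the model rather than a hard-coded list):&lt;/p&gt;

```python
def fake_decode():
    """Stand-in for the decoding phase: yields one token at a time."""
    for token in ["It", " is", " 4", "EOS"]:
        yield token

def stream_response():
    """Forward each token downstream the moment it is generated."""
    for token in fake_decode():
        if token == "EOS":
            break                 # end of sequence: stop streaming
        yield token               # TTS can start on the very first token

print("".join(stream_response()))   # It is 4
```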
&lt;h3&gt;
  
  
  How to Run
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. On GPU (using SGLang/vLLM):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-f&lt;/span&gt; llama-gpu.dockerfile &lt;span class="nt"&gt;-t&lt;/span&gt; llm-gpu &lt;span class="nb"&gt;.&lt;/span&gt;
docker run &lt;span class="nt"&gt;--gpus&lt;/span&gt; all &lt;span class="nt"&gt;-p&lt;/span&gt; 30000:30000 llm-gpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: This exposes an OpenAI-compatible endpoint at port 30000.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. On CPU (using Ollama):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Easy method: Just install Ollama from ollama.com&lt;/span&gt;
ollama run llama3.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Or using our dockerfile:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-f&lt;/span&gt; llama-cpu.dockerfile &lt;span class="nt"&gt;-t&lt;/span&gt; llm-cpu &lt;span class="nb"&gt;.&lt;/span&gt;
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 11434:11434 llm-cpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
Text-to-Speech (TTS)
&lt;/h2&gt;

&lt;p&gt;Finally, for the &lt;strong&gt;Mouth&lt;/strong&gt;, we use &lt;strong&gt;&lt;a href="https://huggingface.co/hexgrad/Kokoro-82M" rel="noopener noreferrer"&gt;Kokoro&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Blueprints of Speaking: TTS Selection Criteria
&lt;/h3&gt;

&lt;p&gt;Evaluating a "Mouth" is tricky because it's both objective (speed) and subjective (beauty).&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Latency &amp;amp; Real-Time Factor
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TTFB (Time To First Byte)&lt;/strong&gt;: How fast does the first sound play?

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt;100ms&lt;/strong&gt;: The Gold Standard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt;300ms&lt;/strong&gt;: Acceptable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;gt;500ms&lt;/strong&gt;: Breaks immersion.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Real-Time Factor (RTF)&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Anything &lt;strong&gt;&amp;lt; 0.1&lt;/strong&gt; (generating 10s audio in 1s) is amazing.&lt;/li&gt;
&lt;li&gt;Production systems target &lt;strong&gt;&amp;lt; 0.5&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Human Quality Metrics (MOS)
&lt;/h4&gt;

&lt;p&gt;There isn't a "perfect" score, but we use &lt;strong&gt;Mean Opinion Score (MOS)&lt;/strong&gt; (rated 1-5 by humans).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;4.0 - 5.0&lt;/strong&gt;: Near Human. (Modern models like Kokoro/ElevenLabs).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2.5&lt;/strong&gt;: "Robot Voice". (Old school accessibility TTS).&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Naturalness &amp;amp; Prosody
&lt;/h4&gt;

&lt;p&gt;"Prosody" is the rhythm and intonation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context Awareness&lt;/strong&gt;: Does it raise its pitch at a question mark? Does it pause for a period?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSML Support&lt;/strong&gt;: Can you control it? (e.g. &lt;code&gt;&amp;lt;break time="500ms"/&amp;gt;&lt;/code&gt; or &lt;code&gt;&amp;lt;emphasis&amp;gt;&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice Cloning&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-Shot&lt;/strong&gt;: 3s audio clip -&amp;gt; new voice. (Good for dynamic users).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-Tuned&lt;/strong&gt;: 3-5 hours of audio training. (Necessary for branded, professional voices).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Critical: TTS Context Window &amp;amp; Streaming
&lt;/h3&gt;

&lt;p&gt;Here's a nuance many developers miss: &lt;strong&gt;TTS models like Kokoro need context windows to avoid sounding robotic when receiving partial text&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem Without Context Awareness:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM sends: "It"     → Kokoro generates audio for just "It" → sounds like grunt
LLM sends: "is"     → Kokoro generates audio for just "is" → new voice, disconnected
LLM sends: "4"      → Kokoro generates audio for just "4" → jumpy prosody
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Solution: Context Window in Streaming TTS:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM sends: "It"     → Kokoro waits (buffering)
LLM sends: "is"     → Kokoro now has "It is" → generates better prosody
LLM sends: "4"      → Kokoro has "It is 4" → natural cadence
OR, Kokoro predicts: "wait for punctuation before speaking"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kokoro uses a &lt;strong&gt;250-word context window&lt;/strong&gt; internally. This means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It buffers incoming tokens until it reaches punctuation (&lt;code&gt;.&lt;/code&gt;, &lt;code&gt;!&lt;/code&gt;, &lt;code&gt;?&lt;/code&gt;, or a configurable threshold)&lt;/li&gt;
&lt;li&gt;Once it has enough context, it generates audio with proper intonation&lt;/li&gt;
&lt;li&gt;As more text arrives, it streams the audio bytes back without waiting for the full response&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is why &lt;strong&gt;Kokoro excels at streaming&lt;/strong&gt;: it doesn't try to speak partial fragments; it waits just enough to sound natural.&lt;/p&gt;
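&lt;p&gt;The buffering behaviour is easy to sketch (pure Python; Kokoro's internals differ, this just models flush-on-punctuation with a size cap):&lt;/p&gt;

```python
class SentenceBuffer:
    """Accumulate LLM tokens; flush on punctuation or when the buffer gets long."""

    PUNCTUATION = (".", "!", "?")

    def __init__(self, max_tokens=64):
        self.max_tokens = max_tokens
        self.tokens = []

    def push(self, token):
        """Returns text ready to synthesize, or None if we should keep buffering."""
        self.tokens.append(token)
        if token.strip().endswith(self.PUNCTUATION) or len(self.tokens) >= self.max_tokens:
            text, self.tokens = "".join(self.tokens), []
            return text
        return None

buf = SentenceBuffer()
for t in ["It", " is", " 4"]:
    assert buf.push(t) is None     # no punctuation yet: keep buffering
print(buf.push("."))               # It is 4.
```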

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM stream: "Let me think... " (no punctuation yet)
  └─ Kokoro buffers silently

LLM stream: "Let me think... 2+2 equals 4." (full sentence)
  └─ Kokoro now has context → generates natural speech with correct stress
  └─ Streams audio back in chunks (50-100ms windows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We'll also use the &lt;strong&gt;Kokoro library&lt;/strong&gt; and build a &lt;strong&gt;server to expose it as a service&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Folder Structure
&lt;/h3&gt;

&lt;p&gt;Code location: &lt;code&gt;code/Models/TTS/Kokoro&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;server.py&lt;/code&gt;: Takes text input and streams out audio bytes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;download_model.py&lt;/code&gt;: Fetches the model weights (&lt;code&gt;v0_19&lt;/code&gt; weights).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kokoro-gpu.dockerfile&lt;/code&gt;: GPU setup (Requires NVIDIA container toolkit).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kokoro-cpu.dockerfile&lt;/code&gt;: CPU setup (Works on standard laptops).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you'd like a minimal Kokoro-FastAPI server implementation, you can check it out &lt;a href="https://github.com/programmerraja/Kokoro-FastAPI" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture Flow
&lt;/h3&gt;

&lt;p&gt;The TTS server receives a stream of text tokens from the LLM. It immediately starts converting them to Phonemes (sound units) and generating audio. It streams this audio back to the user &lt;em&gt;before&lt;/em&gt; the LLM has even finished the sentence. This &lt;strong&gt;Streaming Pipeline&lt;/strong&gt; is crucial for low latency and natural feel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Token Buffering&lt;/strong&gt;: TTS receives token #1 from LLM. Checks if it's punctuation.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;If no punctuation: buffer and wait for more tokens.&lt;/li&gt;
&lt;li&gt;If punctuation or buffer size &amp;gt; 64 tokens: proceed.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Phonemization&lt;/strong&gt;: Convert buffered text to phonetic units (e.g., "Hello" → &lt;code&gt;/həˈloʊ/&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model Inference&lt;/strong&gt;: Kokoro generates audio features (mel-spectrogram) from phonemes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Waveform Generation&lt;/strong&gt;: iSTFTNet vocoder converts mel-spec to raw audio bytes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Streaming&lt;/strong&gt;: Audio chunks (50-100ms windows) stream back immediately over WebSocket.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Repeat&lt;/strong&gt;: As LLM sends token #2, buffer grows, phonemization updates, new audio generates.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example Scenario&lt;/strong&gt;:&lt;br&gt;
Input Stream: &lt;strong&gt;"It"&lt;/strong&gt; → &lt;strong&gt;"is"&lt;/strong&gt; → &lt;strong&gt;"4"&lt;/strong&gt; → &lt;strong&gt;"."&lt;/strong&gt; (with timestamps)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T=0ms:   LLM sends "It"
         Kokoro: "No punctuation, buffering..."

T=150ms: LLM sends " is"
         Kokoro: "Still buffering: 'It is'"

T=300ms: LLM sends " 4"
         Kokoro: "Still buffering: 'It is 4'"

T=400ms: LLM sends "."
         Kokoro: "Got punctuation! Phonemize: 'ɪt ɪz fɔr'"
         → Infer mel-spec (100ms)
         → Vocoder (50ms)
         → Stream chunk #1 (40ms audio) at T=550ms ✓ User hears "It"

T=550ms: More tokens arrive, regenerate from updated context "It is 4."
         → Refined mel-spec (includes proper prosody now)
         → Stream chunk #2 at T=650ms ✓ User hears "is"
         → Stream chunk #3 at T=750ms ✓ User hears "4"

Total latency: ~550ms to first audio, streaming continues until EOS token.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Performance Benchmarks
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Model Size&lt;/th&gt;
&lt;th&gt;TTFB&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU (Intel i7, 32GB RAM)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kokoro 82M&lt;/td&gt;
&lt;td&gt;500-800ms&lt;/td&gt;
&lt;td&gt;3-11x RT&lt;/td&gt;
&lt;td&gt;Suitable for dev&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU (RTX 3060, 12GB VRAM)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kokoro 82M&lt;/td&gt;
&lt;td&gt;97-150ms&lt;/td&gt;
&lt;td&gt;100x RT&lt;/td&gt;
&lt;td&gt;Production-ready&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU (RTX 4090, 24GB VRAM)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kokoro 82M&lt;/td&gt;
&lt;td&gt;40-60ms&lt;/td&gt;
&lt;td&gt;210x RT&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Quantized (4-bit)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kokoro INT4&lt;/td&gt;
&lt;td&gt;200-300ms&lt;/td&gt;
&lt;td&gt;8-15x RT&lt;/td&gt;
&lt;td&gt;Good balance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  How to Run
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. On GPU:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-f&lt;/span&gt; kokoro-gpu.dockerfile &lt;span class="nt"&gt;-t&lt;/span&gt; tts-gpu &lt;span class="nb"&gt;.&lt;/span&gt;
docker run &lt;span class="nt"&gt;--gpus&lt;/span&gt; all &lt;span class="nt"&gt;-p&lt;/span&gt; 8880:8880 tts-gpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. On CPU:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-f&lt;/span&gt; kokoro-cpu.dockerfile &lt;span class="nt"&gt;-t&lt;/span&gt; tts-cpu &lt;span class="nb"&gt;.&lt;/span&gt;
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8880:8880 tts-cpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Putting It Together: End-to-End Latency
&lt;/h2&gt;

&lt;p&gt;Now that we understand each component, here's what your full local pipeline looks like:&lt;/p&gt;

&lt;h3&gt;
  
  
  Realistic Local Performance (8B LLM + Kokoro + Whisper on RTX 3060)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User speaks: "What is 2+2?"
    ↓
STT (faster-distil-whisper-medium)     : 200ms ✓
LLM (Llama 3.1 8B, TTFT)               : 120ms ✓
    └─ Token 1 "It" available at 120ms
    ↓
TTS (Kokoro buffering for punctuation) : 400ms ✓
    └─ Buffering tokens until "4." (takes ~300ms for full sentence)
    └─ Phonemization + inference: 100ms
    ↓
Streaming audio starts back to user    : 120 + 400 = 520ms ✓
User hears first word "It"

Subsequent tokens stream in background:
    Token 2 "is" available at 180ms    → Audio generated in parallel
    Token 3 "4" available at 250ms     → User hears full "It is 4" by 650ms
    Token EOS at 300ms                 → Stop TTS

TOTAL MOUTH-TO-EAR: ~650ms (acceptable for local, within production &amp;lt;800ms)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compare to production APIs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deepgram STT + GPT-4 + ElevenLabs TTS (cloud)&lt;/strong&gt;: 200-300ms (optimized, lower variance)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your local setup&lt;/strong&gt;: 650-800ms (good for dev, acceptable for many use cases)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Homework: Integrate With Pipecat
&lt;/h2&gt;

&lt;p&gt;So now that all three components are up and running, it's your turn to think through how we can integrate them with &lt;strong&gt;Pipecat&lt;/strong&gt; and get a fully local &lt;strong&gt;"Hello World"&lt;/strong&gt; working end to end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run all three Docker containers (STT, LLM, TTS) locally&lt;/li&gt;
&lt;li&gt;Create a Pipecat pipeline that:

&lt;ul&gt;
&lt;li&gt;Accepts WebSocket audio from client&lt;/li&gt;
&lt;li&gt;Sends to STT server (port 8000)&lt;/li&gt;
&lt;li&gt;Streams STT output to LLM server (port 30000)&lt;/li&gt;
&lt;li&gt;Streams LLM tokens to TTS server (port 8880)&lt;/li&gt;
&lt;li&gt;Streams TTS audio back to client&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Implement &lt;strong&gt;barge-in handling&lt;/strong&gt;: If user speaks while TTS is playing, cancel TTS and process new input&lt;/li&gt;
&lt;li&gt;Measure latency at each step&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Tips&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;asyncio&lt;/code&gt; and &lt;code&gt;WebSocket&lt;/code&gt; for non-blocking streaming&lt;/li&gt;
&lt;li&gt;Implement a simple latency meter to log timestamps&lt;/li&gt;
&lt;li&gt;Test with quiet and noisy audio to validate VAD&lt;/li&gt;
&lt;li&gt;Start with synchronous (blocking) calls for simplicity, then optimize&lt;/li&gt;
&lt;/ul&gt;
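&lt;p&gt;The latency-meter tip can be sketched in a few lines. Everything below is a hypothetical helper of my own (the &lt;code&gt;LatencyMeter&lt;/code&gt; class and the stage names are placeholders, not Pipecat APIs):&lt;/p&gt;

```python
import time
from contextlib import contextmanager

# Hypothetical helper: a minimal latency meter for timing pipeline stages.
# Stage names like "stt" are placeholders, not Pipecat APIs.
class LatencyMeter:
    def __init__(self):
        self.timings = {}  # stage name -> elapsed milliseconds

    @contextmanager
    def measure(self, stage):
        start = time.perf_counter()
        yield
        self.timings[stage] = (time.perf_counter() - start) * 1000  # ms

    def report(self):
        return {stage: round(ms, 1) for stage, ms in self.timings.items()}

meter = LatencyMeter()
with meter.measure("stt"):
    time.sleep(0.01)  # stand-in for the real STT round trip
print(meter.report())
```

Wrap each stage (STT call, LLM first token, TTS first audio chunk) in its own `measure(...)` block and log the report per turn to see where your budget goes.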

&lt;p&gt;If you'd like to share your implementation, feel free to raise a PR on our GitHub repo here:&lt;br&gt;
&lt;strong&gt;&lt;a href="https://github.com/programmerraja/VoiceAgentGuide" rel="noopener noreferrer"&gt;https://github.com/programmerraja/VoiceAgentGuide&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>voiceagent</category>
      <category>llm</category>
      <category>openai</category>
      <category>pipecat</category>
    </item>
    <item>
      <title>Prompt Engineering From First Principles: The Mechanics They Don't Teach You part-1</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Sun, 28 Dec 2025 08:42:21 +0000</pubDate>
      <link>https://forem.com/programmerraja/prompt-engineering-from-first-principles-the-mechanics-they-dont-teach-you-part-1-12nb</link>
      <guid>https://forem.com/programmerraja/prompt-engineering-from-first-principles-the-mechanics-they-dont-teach-you-part-1-12nb</guid>
      <description>&lt;p&gt;This is my new series on Prompt Engineering and it's different from everything else out there.&lt;/p&gt;

&lt;p&gt;Most blogs give you templates: "Try this prompt!" or "Use these 10 techniques!" That's not what we're doing here.&lt;/p&gt;

&lt;p&gt;We're going deep: &lt;strong&gt;How do LLMs actually process your prompts? What makes a prompt effective at the mechanical level? Where do LLMs fail and why?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This series will give you the mental models to engineer prompts yourself, not just copy someone else's examples. Let's dive in.&lt;/p&gt;

&lt;p&gt;We're going to have five parts (so far; we may add more in the future):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The Foundation - How LLMs Really Work&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Art &amp;amp; Science of Prompting&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prompting techniques and optimization&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prompt Evaluation and Scaling&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tips, Tricks, and Experience&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's jump into Part 1.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do LLMs Think Like Humans?
&lt;/h2&gt;

&lt;p&gt;Let me ask you something: &lt;strong&gt;Do you think LLMs are intelligent like humans?&lt;/strong&gt; Do they have a "brain" that understands your questions and thinks through answers?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you answered yes, you're wrong.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LLMs don't think. They don't understand. They're just &lt;strong&gt;next-token predictors&lt;/strong&gt;—sophisticated autocomplete machines that guess what word (or rather, "token") should come next based on patterns they've seen before.&lt;/p&gt;

&lt;p&gt;Now, you might be thinking: &lt;em&gt;"Wait, how can simple next-word prediction answer complex questions, write code, or have conversations?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's a great question, and the answer involves some fascinating engineering. But we're not going to dive too deep into the theoretical computer science here—that would make this series endless. We're focusing on &lt;strong&gt;what you need to know to write better prompts&lt;/strong&gt;, nothing more, nothing less.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fned9tmt8d4clwjb2jhrb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fned9tmt8d4clwjb2jhrb.png" alt="Preview" width="465" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're really interested in understanding How Machines Learn, check out &lt;a href="https://dev.to/programmerraja/how-machines-learn-understanding-the-core-concepts-of-neural-networks-3a9j"&gt;this detailed post&lt;/a&gt; I've written on the topic.&lt;/p&gt;

&lt;p&gt;Let's start with the basics.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Does an LLM Process Your Prompt?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14kn6apq5y0tgfaoaohm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14kn6apq5y0tgfaoaohm.png" alt="Prompt Preview" width="800" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you type a prompt and hit enter, here's the simplified workflow of what happens inside the model:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: English Text → Tokens
&lt;/h3&gt;

&lt;p&gt;Your text doesn't go directly into the model. First, it gets broken down into &lt;strong&gt;tokens&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A token is roughly a chunk of text—sometimes a whole word, sometimes part of a word, sometimes punctuation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;"Hello world"&lt;/code&gt; → &lt;code&gt;["Hello", " world"]&lt;/code&gt; (2 tokens)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"apple"&lt;/code&gt; → &lt;code&gt;["apple"]&lt;/code&gt; (1 token)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"12345"&lt;/code&gt; → &lt;code&gt;["123", "45"]&lt;/code&gt; (2 tokens)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why does this matter? Because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Models have &lt;strong&gt;token limits&lt;/strong&gt; (context windows), not word limits&lt;/li&gt;
&lt;li&gt;The way text is tokenized affects how the model "sees" it&lt;/li&gt;
&lt;li&gt;The model handles some words better because they're single tokens, while others get split into pieces&lt;/li&gt;
&lt;/ol&gt;
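&lt;p&gt;To build intuition for the splitting step, here's a toy greedy longest-match tokenizer over a tiny made-up vocabulary. It's purely illustrative: real tokenizers (like the BPE tokenizers GPT models use) learn their vocabularies from data and behave differently.&lt;/p&gt;

```python
# Toy greedy longest-match tokenizer over a made-up vocabulary.
# Purely illustrative: real BPE tokenizers learn their merges from data.
VOCAB = {"Hello", " world", "apple", "123", "45", "1", "2", "3", "4", "5"}

def tokenize(text):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in VOCAB:
                tokens.append(piece)
                i += length
                break
        else:
            # Unknown character: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("Hello world"))  # ['Hello', ' world']  -> 2 tokens
print(tokenize("12345"))        # ['123', '45']        -> 2 tokens
```

Note how "12345" never becomes one token here: the model "sees" two chunks, which is part of why LLMs can struggle with digit-level arithmetic.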

&lt;h3&gt;
  
  
  Step 2: Tokens → Numbers (Embeddings)
&lt;/h3&gt;

&lt;p&gt;The model can't work with text directly—it only understands numbers. Each token gets converted into a long list of numbers called an &lt;strong&gt;embedding&lt;/strong&gt; (basically a mathematical representation of that token).&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: The Transformer Does Its Magic
&lt;/h3&gt;

&lt;p&gt;Your tokens (now numbers) pass through the &lt;strong&gt;Transformer architecture&lt;/strong&gt; layers of neural network computations. Here's where the attention mechanism kicks in, letting the model figure out which tokens relate to which.&lt;/p&gt;

&lt;p&gt;Example: In the sentence &lt;em&gt;"The bank of the river was muddy"&lt;/em&gt;, the model's attention mechanism connects &lt;code&gt;bank&lt;/code&gt; with &lt;code&gt;river&lt;/code&gt; and &lt;code&gt;muddy&lt;/code&gt; to understand we're talking about a riverbank, not a financial institution.&lt;/p&gt;

&lt;p&gt;Note: There are other emerging LLM architectures, such as &lt;a href="https://www.ibm.com/think/topics/diffusion-models" rel="noopener noreferrer"&gt;&lt;strong&gt;Diffusion Models&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://huggingface.co/blog/lbourdois/get-on-the-ssm-train" rel="noopener noreferrer"&gt;&lt;strong&gt;State Space Models&lt;/strong&gt;&lt;/a&gt;, but for simplicity I cover only &lt;strong&gt;Transformer&lt;/strong&gt;-based models here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Predict the Next Token (Probabilities)
&lt;/h3&gt;

&lt;p&gt;At the end of all this processing, the model outputs a &lt;strong&gt;probability distribution&lt;/strong&gt; over all possible next tokens in its vocabulary (which can be 50,000+ tokens).&lt;/p&gt;

&lt;p&gt;It looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Paris:       0.85  (85% probability)
the:         0.05  (5% probability)
beautiful:   0.03  (3% probability)
London:      0.02  (2% probability)
[thousands of other tokens with tiny probabilities...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model doesn't "know" Paris is the capital of France. It just calculates that based on the patterns it learned during training, &lt;code&gt;Paris&lt;/code&gt; has the highest probability of being the next token after &lt;em&gt;"The capital of France is"&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Select a Token &amp;amp; Repeat
&lt;/h3&gt;

&lt;p&gt;The model picks a token based on these probabilities (we'll talk about &lt;em&gt;how&lt;/em&gt; it picks in a moment), adds it to the sequence, and repeats the whole process to generate the next token, then the next, until it decides to stop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's it. That's all an LLM does: predict the next token, over and over, based on probability.&lt;/strong&gt;&lt;/p&gt;
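&lt;p&gt;The select-and-repeat step fits in a few lines of Python. The distribution below is made up for illustration; a real model recomputes probabilities over its full vocabulary at every step:&lt;/p&gt;

```python
import random

# Made-up next-token distribution for "The capital of France is".
next_token_probs = {"Paris": 0.85, "the": 0.05, "beautiful": 0.03, "London": 0.02}

def pick_token(probs, greedy=False):
    if greedy:
        # Greedy decoding: always take the highest-probability token.
        return max(probs, key=probs.get)
    # Sampling: tokens are chosen in proportion to their probability.
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights)[0]

print(pick_token(next_token_probs, greedy=True))  # Paris
```

Generation is just this pick appended to the sequence, then the whole forward pass again for the next token, until an end-of-sequence token is chosen.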

&lt;h2&gt;
  
  
  But Wait... How Does This Answer My Questions?
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. If LLMs are just probability machines playing "guess the next word," how do they:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Answer questions correctly?&lt;/li&gt;
&lt;li&gt;Write code that actually works?&lt;/li&gt;
&lt;li&gt;Hold coherent conversations?&lt;/li&gt;
&lt;li&gt;Follow complex instructions?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The answer is &lt;strong&gt;training&lt;/strong&gt;: specifically, the two major phases that shape model behavior.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Pre-Training (Learning Patterns from the Internet)
&lt;/h3&gt;

&lt;p&gt;In this phase, the model reads &lt;strong&gt;trillions of tokens&lt;/strong&gt; from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Websites (Wikipedia, forums, blogs)&lt;/li&gt;
&lt;li&gt;Books&lt;/li&gt;
&lt;li&gt;Code repositories (GitHub)&lt;/li&gt;
&lt;li&gt;Research papers&lt;/li&gt;
&lt;li&gt;Social media&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What it learns:&lt;/strong&gt; Statistical patterns. If it sees "The capital of France is Paris" thousands of times, it learns that &lt;code&gt;Paris&lt;/code&gt; has a high probability of following "The capital of France is".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it doesn't learn:&lt;/strong&gt; How to answer questions like an assistant. A pre-trained "base model" has knowledge but no manners.&lt;/p&gt;

&lt;p&gt;Ask a base model: &lt;em&gt;"What is the capital of France?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It might respond: &lt;em&gt;"What is the capital of Germany? What is the capital of Spain?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Why? Because it's just completing patterns it saw in training data, probably quiz lists from forums. It has information, but no concept of "answering questions."&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2: Post-Training (Teaching It to Be an Assistant)
&lt;/h3&gt;

&lt;p&gt;This is where base models become &lt;strong&gt;ChatGPT&lt;/strong&gt;, &lt;strong&gt;Claude&lt;/strong&gt;, or other chat assistants. Two key steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Supervised Fine-Tuning (SFT):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Humans write thousands of example conversations: questions and good answers&lt;/li&gt;
&lt;li&gt;The model learns: &lt;em&gt;"Oh, when I see a question, I should provide a helpful answer, not continue the question"&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Reinforcement Learning from Human Feedback (RLHF):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Humans rate different model responses as "good" or "bad"&lt;/li&gt;
&lt;li&gt;The model learns to optimize for helpful, harmless, and honest responses&lt;/li&gt;
&lt;li&gt;This is why models refuse harmful requests or add disclaimers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; A model that not only predicts the next token, but predicts tokens that &lt;em&gt;look like helpful assistant responses&lt;/em&gt; because that pattern now has the highest probability in its training.&lt;/p&gt;

&lt;p&gt;So when you ask &lt;em&gt;"What is the capital of France?"&lt;/em&gt;, the model isn't "thinking" about geography. It's predicting that tokens forming a helpful answer have higher probability than tokens that continue the question—because that's what its training reinforced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's all still just next-token prediction. The training just shaped &lt;em&gt;which&lt;/em&gt; predictions have high probability.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Configuration: Controlling the Output
&lt;/h2&gt;

&lt;p&gt;Remember that probability distribution we talked about? Here's where you get control. The model gives you probabilities, but &lt;strong&gt;configuration parameters&lt;/strong&gt; decide how tokens are actually selected from those probabilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  Temperature: The Creativity Dial
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Temperature&lt;/strong&gt; controls how "random" the model's choices are.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example scenario:&lt;/strong&gt; The model predicts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Paris:     85%
beautiful: 3%
London:    2%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Low Temperature (e.g., 0.2):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model becomes more "confident" and almost always picks the top choice&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Paris&lt;/code&gt; might effectively become 95%+ likely&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Deterministic, focused, repetitive outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use for:&lt;/strong&gt; Code generation, data extraction, factual answers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;High Temperature (e.g., 0.8):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model flattens the probability curve&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Paris&lt;/code&gt; might drop to 60%, &lt;code&gt;beautiful&lt;/code&gt; rises to 10%, &lt;code&gt;London&lt;/code&gt; to 8%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; More varied, creative, unpredictable outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use for:&lt;/strong&gt; Creative writing, brainstorming, multiple perspectives&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Prompt: &lt;em&gt;"The sky is"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Temperature 0.2:&lt;/strong&gt; "blue" (almost always)&lt;br&gt;
&lt;strong&gt;Temperature 0.8:&lt;/strong&gt; "blue" or "cloudy" or "vast" or "filled with stars" (varies)&lt;/p&gt;
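&lt;p&gt;Mechanically, temperature divides the model's raw scores (logits) by the temperature before the softmax turns them into probabilities. A sketch with made-up logits:&lt;/p&gt;

```python
import math

def softmax_with_temperature(logits, temperature):
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [x / temperature for x in logits.values()]
    total = sum(math.exp(s) for s in scaled)
    return {tok: math.exp(s) / total for tok, s in zip(logits, scaled)}

# Made-up logits for the prompt "The capital of France is".
logits = {"Paris": 4.0, "beautiful": 0.6, "London": 0.2}

cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 0.8)
print(round(cold["Paris"], 3))  # near 1.0: almost always picked
print(round(hot["Paris"], 3))   # smaller: alternatives get a real chance
```

Dividing by 0.2 multiplies every logit by 5, so the gap between "Paris" and the rest explodes; dividing by a large temperature shrinks the gaps toward a uniform distribution.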

&lt;h3&gt;
  
  
  Top-P (Nucleus Sampling): Cutting Off the Nonsense
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Top-P&lt;/strong&gt; (also called nucleus sampling) sets a probability threshold.&lt;/p&gt;

&lt;p&gt;If you set &lt;strong&gt;Top-P = 0.9&lt;/strong&gt;, the model only considers tokens that together make up the top 90% of probability, ignoring everything else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without Top-P, even with reasonable temperature, the model might occasionally pick a token with 0.001% probability—resulting in complete nonsense.&lt;/p&gt;

&lt;p&gt;With Top-P = 0.9, those ultra-low-probability tokens are never even considered. The model stays coherent while still being creative.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical combination:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Temperature 0.7 + Top-P 0.9&lt;/strong&gt; = Creative but coherent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temperature 0.2 + Top-P 1.0&lt;/strong&gt; = Deterministic and focused&lt;/li&gt;
&lt;/ul&gt;
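&lt;p&gt;A sketch of the nucleus filter with a made-up distribution: sort tokens by probability, keep the smallest set whose cumulative probability reaches the threshold, and renormalize:&lt;/p&gt;

```python
def top_p_filter(probs, p=0.9):
    # Keep the smallest set of top tokens whose cumulative probability >= p.
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}

probs = {"Paris": 0.85, "the": 0.05, "beautiful": 0.03, "London": 0.02,
         "nonsense": 0.00001}
filtered = top_p_filter(probs, p=0.9)
print(sorted(filtered))  # ['Paris', 'the']
```

The 0.001%-probability "nonsense" token never survives the cut, which is exactly how Top-P keeps sampling coherent.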

&lt;h3&gt;
  
  
  Top-K: Limiting Choices
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Top-K&lt;/strong&gt; simply limits the model to considering only the K most likely tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Top-K = 50 means the model only looks at the 50 highest-probability tokens and ignores the rest.&lt;/p&gt;

&lt;p&gt;This is a simpler version of Top-P and less commonly used in modern systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting It Together
&lt;/h2&gt;

&lt;p&gt;Let's trace through a complete example:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your prompt:&lt;/strong&gt; &lt;em&gt;"Explain photosynthesis in simple terms"&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tokenization:&lt;/strong&gt; &lt;code&gt;["Explain", " photosynthesis", " in", " simple", " terms"]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model processing:&lt;/strong&gt; Transformer calculates relationships between tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Probability distribution for next token:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Photosynthesis:  40%
   It:             15%
   The:            12%
   In:              8%
   [...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;&lt;strong&gt;Configuration applied (Temperature 0.3, Top-P 0.9):&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Low temperature sharpens: &lt;code&gt;Photosynthesis&lt;/code&gt; → 65%&lt;/li&gt;
&lt;li&gt;Model picks &lt;code&gt;Photosynthesis&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="5"&gt;
&lt;li&gt;
&lt;strong&gt;Repeat:&lt;/strong&gt; Now the sequence is &lt;em&gt;"Explain photosynthesis in simple terms Photosynthesis"&lt;/em&gt;

&lt;ul&gt;
&lt;li&gt;Calculate probabilities for the next token&lt;/li&gt;
&lt;li&gt;Pick based on configuration&lt;/li&gt;
&lt;li&gt;Continue until complete answer is generated&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The model never "understood" photosynthesis. It predicted tokens that statistically form explanations based on patterns from its training data.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Now You Have the Mental Model
&lt;/h2&gt;

&lt;p&gt;You now understand the fundamental truth: &lt;strong&gt;LLMs are probability engines, not reasoning machines.&lt;/strong&gt; Every response is just a statistical prediction of the next token, shaped by training data and controlled by configuration parameters.&lt;/p&gt;

&lt;p&gt;But here's where it gets powerful: &lt;strong&gt;If you understand the mechanism, you can engineer the probabilities.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your prompt doesn't just ask a question; it shapes the entire probability landscape the model uses to generate its response. Change a few words, reorder your instructions, add an example, and suddenly different tokens become more likely. Different tokens mean different outputs.&lt;/p&gt;

&lt;p&gt;In the next part, we're going to explore &lt;strong&gt;The Art &amp;amp; Science of Prompting&lt;/strong&gt;: how to deliberately craft prompts that steer those probabilities in your favor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The foundation is set. Now let's learn to build on it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I’ve also set up a &lt;a href="https://github.com/programmerraja/PromptEngineering" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; repository for this series, where I’ll be sharing the code and additional resources. Make sure to check it out and give it a star!&lt;/p&gt;

&lt;p&gt;Feel free to share your thoughts, comments, and insights below. Let’s learn and grow together!&lt;/p&gt;

</description>
      <category>promptengineering</category>
      <category>llm</category>
      <category>openai</category>
      <category>gemin</category>
    </item>
    <item>
      <title>Spotify Wrapped, But for Developers: Introducing GitHub Wrapped 2025</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Sat, 20 Dec 2025 02:02:50 +0000</pubDate>
      <link>https://forem.com/programmerraja/spotify-wrapped-but-for-developers-introducing-github-wrapped-2025-21lm</link>
      <guid>https://forem.com/programmerraja/spotify-wrapped-but-for-developers-introducing-github-wrapped-2025-21lm</guid>
      <description>&lt;p&gt;&lt;strong&gt;GitHub Wrapped 2025 is LIVE!&lt;/strong&gt; 🎉&lt;/p&gt;

&lt;p&gt;Your year of code &lt;strong&gt;wrapped&lt;/strong&gt; and ready to explore!&lt;/p&gt;

&lt;p&gt;Ever wondered what your year looked like in code?&lt;br&gt;
&lt;em&gt;GitHub Wrapped&lt;/em&gt; turns your GitHub activity into a visual story that celebrates the work you &lt;em&gt;actually did&lt;/em&gt;, not just green dots on a graph.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is GitHub Wrapped?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9pxhja746b8vdf99newu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9pxhja746b8vdf99newu.png" alt="Product preview" width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Wrapped&lt;/strong&gt; is your personal year-in-review for GitHub activity.&lt;br&gt;
Inspired by &lt;em&gt;Spotify Wrapped&lt;/em&gt; and other year review tools, it takes your contributions and turns them into insights you can &lt;em&gt;see, analyze, and share&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;With it you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📆 See your contribution graph for the year&lt;/li&gt;
&lt;li&gt;🏆 Unlock fun &lt;strong&gt;badges&lt;/strong&gt; that reflect your coding persona&lt;/li&gt;
&lt;li&gt;📌 Discover your &lt;strong&gt;top repositories&lt;/strong&gt; and where you spent most of your energy&lt;/li&gt;
&lt;li&gt;🎨 Apply themes like &lt;strong&gt;Cyberpunk&lt;/strong&gt; and &lt;strong&gt;Sunset Vibes&lt;/strong&gt; for personalized style&lt;/li&gt;
&lt;li&gt;🔐 Optionally include &lt;strong&gt;private contributions&lt;/strong&gt; using a personal access token (the token is used locally only; no tracking or storage)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why You’ll Love It
&lt;/h2&gt;

&lt;p&gt;For us developers, GitHub has become our &lt;em&gt;public portfolio&lt;/em&gt;, showing commits, collaborations, issues, and code that tell a story of growth, consistency, and learning.&lt;/p&gt;

&lt;p&gt;But GitHub itself doesn’t wrap that story up for you at the end of the year. &lt;em&gt;GitHub Wrapped&lt;/em&gt; fills that gap by:&lt;br&gt;
✔ giving you &lt;strong&gt;a snapshot of the year&lt;/strong&gt;&lt;br&gt;
✔ surfacing patterns you might miss in the daily grind&lt;br&gt;
✔ making something shareable and fun&lt;/p&gt;

&lt;h2&gt;
  
  
  What You’ll Get
&lt;/h2&gt;

&lt;p&gt;Once you generate your wrapped summary, you’ll see:&lt;/p&gt;

&lt;p&gt;✨ Your &lt;strong&gt;year’s contribution graph&lt;/strong&gt;&lt;br&gt;
🎭 Your &lt;strong&gt;developer persona&lt;/strong&gt;  who &lt;em&gt;you&lt;/em&gt; were in code this year&lt;br&gt;
🏆 &lt;strong&gt;Badges&lt;/strong&gt; that celebrate your style and activity&lt;br&gt;
📊 &lt;strong&gt;Top projects&lt;/strong&gt; that defined your GitHub year&lt;br&gt;
🎨 Themes to make it &lt;em&gt;your own&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Visit 👉 &lt;strong&gt;&lt;a href="https://programmerraja.is-a.dev/githubwrap/" rel="noopener noreferrer"&gt;Githubwrapup&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Enter your &lt;strong&gt;GitHub username&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Optionally add a &lt;strong&gt;personal access token&lt;/strong&gt; to include private contributions&lt;/li&gt;
&lt;li&gt;Pick your &lt;strong&gt;year &amp;amp; theme&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Generate&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Your GitHub year in code appears, ready to explore and share!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s my GitHub 2025 Wrapped summary 👇&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87a36qhgjd30y2ogbb0p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87a36qhgjd30y2ogbb0p.png" alt="Preview" width="800" height="1769"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>githubwrapped</category>
      <category>2025</category>
      <category>programming</category>
    </item>
    <item>
      <title>How Your LLM Costs 5X More If You Don't Speak English</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Sat, 25 Oct 2025 07:22:29 +0000</pubDate>
      <link>https://forem.com/programmerraja/why-your-llm-costs-5x-more-if-you-dont-speak-english-2d0e</link>
      <guid>https://forem.com/programmerraja/why-your-llm-costs-5x-more-if-you-dont-speak-english-2d0e</guid>
      <description>&lt;p&gt;&lt;strong&gt;Are you overpaying for AI because of your language?&lt;/strong&gt; If you're building LLM applications in Spanish, Hindi, or Greek, you could be spending up to &lt;strong&gt;6 times more&lt;/strong&gt; than English users for the exact same functionality. &lt;/p&gt;

&lt;p&gt;This blog was inspired by the research paper &lt;a href="https://arxiv.org/pdf/2305.13707" rel="noopener noreferrer"&gt;Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Tokenization Tax
&lt;/h2&gt;

&lt;p&gt;When you send text to GPT-4, Claude, or Gemini, your input gets broken into &lt;strong&gt;tokens&lt;/strong&gt;: chunks roughly 3-4 characters long in English. You pay per token for both input and output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The shocking truth:&lt;/strong&gt; The same sentence costs &lt;strong&gt;wildly different amounts&lt;/strong&gt; depending on your language.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real Example: "Hello, my name is Sarah"
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Tokens Needed&lt;/th&gt;
&lt;th&gt;Cost vs English&lt;/th&gt;
&lt;th&gt;Annual Cost (10K msgs/day)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;English&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7 tokens&lt;/td&gt;
&lt;td&gt;1.0x baseline&lt;/td&gt;
&lt;td&gt;$16,425&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Spanish&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;11 tokens&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.5x more&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$24,638 (+$8,213)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hindi&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;35 tokens&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.0x more&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$82,125 (+$65,700)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Greek&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;42 tokens&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6.0x more&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$98,550 (+$82,125)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's an &lt;strong&gt;$82,000 annual difference&lt;/strong&gt; for the exact same chatbot, purely because of language.&lt;/p&gt;
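&lt;p&gt;The table's arithmetic is easy to reproduce. The per-token price below is just calibrated so the English row works out (it is not a quoted provider price); the point is that cost scales linearly with token count:&lt;/p&gt;

```python
# Reproduce the table's scaling. PRICE_PER_TOKEN is a placeholder calibrated
# to the English row, not a real provider price; only the ratios matter.
PRICE_PER_TOKEN = 16425 / (7 * 10_000 * 365)

def annual_cost(tokens_per_msg, msgs_per_day=10_000):
    return tokens_per_msg * msgs_per_day * 365 * PRICE_PER_TOKEN

english = annual_cost(7)     # ~$16,425
hindi = annual_cost(35)      # 5x the tokens -> 5x the cost, ~$82,125
print(round(hindi - english))  # ~$65,700 extra per year
```

Swap in your own provider's per-token price and your real traffic numbers; the multiplier between languages stays the same because it comes entirely from token counts.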

&lt;h2&gt;
  
  
  The Complete Language Cost Breakdown
&lt;/h2&gt;

&lt;p&gt;Research from ACL 2023 and recent LLM benchmarks reveals systematic bias in how models tokenize different languages. Here's what it costs to process 24 major languages:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3zc8ct92f2z627ynl5a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3zc8ct92f2z627ynl5a.png" alt="Tokenization cost comparison across 24 languages showing how many times more expensive each language is compared to English due to tokenization differences" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tokenization cost comparison across 24 languages showing how many times more expensive each language is compared to English due to tokenization differences&lt;/p&gt;

&lt;h3&gt;
  
  
  Most Efficient Languages (1.0-1.5x English)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;English:&lt;/strong&gt; 1.0x (baseline)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;French:&lt;/strong&gt; 1.2x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Italian:&lt;/strong&gt; 1.2x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portuguese:&lt;/strong&gt; 1.3x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spanish:&lt;/strong&gt; 1.5x&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Moderately Expensive (1.6-2.5x)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Korean:&lt;/strong&gt; 1.6x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Japanese:&lt;/strong&gt; 1.8x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chinese (Simplified):&lt;/strong&gt; 2.0x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Arabic:&lt;/strong&gt; 2.0x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Russian:&lt;/strong&gt; 2.5x&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Highly Expensive (3.0-6.0x)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ukrainian:&lt;/strong&gt; 3.0x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bengali:&lt;/strong&gt; 4.0x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thai:&lt;/strong&gt; 4.0x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hindi:&lt;/strong&gt; 5.0x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tamil:&lt;/strong&gt; 5.0x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telugu:&lt;/strong&gt; 5.0x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Greek:&lt;/strong&gt; 6.0x (most expensive)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Writing Systems Matter
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftd3uf4lg27axvrz1ali9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftd3uf4lg27axvrz1ali9.png" alt="Comparison of tokenization costs and efficiency across different writing systems, showing why Latin-based languages are most cost-effective for LLM applications" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Comparison of tokenization costs and efficiency across different writing systems, showing why Latin-based languages are most cost-effective for LLM applications&lt;/p&gt;

&lt;p&gt;The script your language uses creates dramatic efficiency gaps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latin script:&lt;/strong&gt; 1.4x average (73.5% efficient)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hangul (Korean):&lt;/strong&gt; 1.6x (63% efficient)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Han/Japanese:&lt;/strong&gt; 1.8-2.0x (50-56% efficient)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cyrillic:&lt;/strong&gt; 2.75x average (36.5% efficient)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indic scripts:&lt;/strong&gt; 4-5x average (20% efficient)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Greek:&lt;/strong&gt; 6.0x (17% efficient—worst)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why This Inequality Exists
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Training Data Bias
&lt;/h3&gt;

&lt;p&gt;GPT-4, Claude, and Gemini are trained on English-dominant datasets. The Common Crawl corpus shows stark imbalance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;~60% English&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;~10-15% combined for Spanish/French/German&lt;/li&gt;
&lt;li&gt;&amp;lt;5% for most other languages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tokenizers learn to compress what they see most. English gets ultra-efficient encoding; everything else is treated as "foreign."&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Morphological Complexity
&lt;/h3&gt;

&lt;p&gt;Languages with rich morphology generate far more word variations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;English:&lt;/strong&gt; "run" → run, runs, running, ran (4 forms)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Turkish:&lt;/strong&gt; Single root → 50+ forms with suffixes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Arabic:&lt;/strong&gt; Root system → thousands of variations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hindi:&lt;/strong&gt; Complex verb conjugations with gender/number/tense&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tokenizers can't learn compact patterns for high-variation, low-data languages.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Unicode Encoding Overhead
&lt;/h3&gt;

&lt;p&gt;Different scripts need different byte counts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latin:&lt;/strong&gt; 1 byte per character&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cyrillic:&lt;/strong&gt; 2 bytes per character&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Devanagari/Tamil:&lt;/strong&gt; 3+ bytes per character&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More bytes = more tokens = higher cost—even for the same semantic content.&lt;/p&gt;
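&lt;p&gt;The byte-count gap is easy to verify with plain Python. The greetings below are illustrative samples I chose, not from any benchmark:&lt;/p&gt;

```python
# UTF-8 byte overhead per script: similar words, very different byte counts.
samples = {
    "English (Latin)": "hello",       # 1 byte per character
    "Russian (Cyrillic)": "привет",   # 2 bytes per character
    "Tamil (Indic)": "வணக்கம்",        # 3 bytes per code point
}

for language, word in samples.items():
    chars = len(word)
    nbytes = len(word.encode("utf-8"))
    print(f"{language}: {chars} chars -> {nbytes} bytes")
```

&lt;p&gt;Byte-level BPE tokenizers operate on exactly these bytes, so scripts with a higher bytes-per-character ratio start at a disadvantage before training-data imbalance even enters the picture.&lt;/p&gt;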

&lt;h2&gt;
  
  
  Real-World Cost Impact
&lt;/h2&gt;

&lt;p&gt;Here's what tokenization inequality means for actual business applications:&lt;/p&gt;

&lt;h3&gt;
  
  
  Customer Support Chatbot (10,000 messages/day)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;English:&lt;/strong&gt; \$16,425/year&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spanish:&lt;/strong&gt; \$24,638/year (+50%, +\$8,213)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hindi:&lt;/strong&gt; \$82,125/year (+400%, +\$65,700)&lt;/li&gt;
&lt;/ul&gt;
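&lt;p&gt;The chatbot figures above are consistent with a simple model: an English per-message cost scaled by each language's token multiplier. Here is a sketch; the $0.0045 per-message baseline is my assumption, chosen so the English figure matches the $16,425/year quoted above:&lt;/p&gt;

```python
# Annual LLM spend for a chatbot, scaled by the tokenization multiplier.
# BASELINE_COST_PER_MSG_USD is an assumed figure, reverse-engineered so
# that English at 1.0x reproduces the $16,425/year number above.
BASELINE_COST_PER_MSG_USD = 0.0045
MESSAGES_PER_DAY = 10_000

def annual_cost(multiplier: float) -> float:
    """Yearly cost in USD for a given language's token multiplier."""
    return MESSAGES_PER_DAY * BASELINE_COST_PER_MSG_USD * multiplier * 365

for language, mult in [("English", 1.0), ("Spanish", 1.5), ("Hindi", 5.0)]:
    print(f"{language}: ${annual_cost(mult):,.0f}/year")
```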

&lt;h3&gt;
  
  
  Content Generation Platform (1M words/month)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;English:&lt;/strong&gt; \$14,400/year&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spanish:&lt;/strong&gt; \$21,600/year&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hindi:&lt;/strong&gt; \$72,000/year&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Document Translation Service (100K words/day)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;English:&lt;/strong&gt; \$65,700/year&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spanish:&lt;/strong&gt; \$98,550/year (+\$32,850)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hindi:&lt;/strong&gt; \$328,500/year (+\$262,800)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Code Assistant (50K queries/day)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;English:&lt;/strong&gt; \$91,250/year&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spanish:&lt;/strong&gt; \$136,875/year&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hindi:&lt;/strong&gt; \$456,250/year (+\$365,000)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; A company serving Hindi users pays &lt;strong&gt;\$262,800-\$365,000 more annually&lt;/strong&gt; than an identical English service.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Socioeconomic Dimension
&lt;/h2&gt;

&lt;p&gt;Research reveals a disturbing &lt;strong&gt;-0.5 correlation&lt;/strong&gt; between a country's Human Development Index and LLM tokenization cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Translation:&lt;/strong&gt; Less developed countries often speak languages that cost more to process.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users in developing nations pay premium rates&lt;/li&gt;
&lt;li&gt;Communities with fewer resources face higher AI barriers&lt;/li&gt;
&lt;li&gt;This creates "double unfairness" in AI democratization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A startup in India building a Hindi customer service bot pays 5x more than a US competitor despite likely having far less funding.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future of Fair AI
&lt;/h2&gt;

&lt;p&gt;Language should never determine how much intelligence costs. Yet today, the world’s most spoken tongues pay a silent premium just to access the same models. Fixing this isn’t about optimization; it’s about fairness. Until every language is tokenized equally, AI remains fluent in inequality.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>2025 Voice AI Guide: How to Make Your Own Real-Time Voice Agent (Part-2)</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Sat, 11 Oct 2025 12:50:32 +0000</pubDate>
      <link>https://forem.com/programmerraja/2025-voice-ai-guide-how-to-make-your-own-real-time-voice-agent-part-2-1288</link>
      <guid>https://forem.com/programmerraja/2025-voice-ai-guide-how-to-make-your-own-real-time-voice-agent-part-2-1288</guid>
      <description>&lt;p&gt;Welcome to &lt;strong&gt;Part 2 of the 2025 Voice AI Guide: How to Build Your Own Real-Time Voice Agent.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this section, we’ll dive deep into &lt;strong&gt;Pipecat&lt;/strong&gt; and create a simple &lt;strong&gt;“Hello World” program&lt;/strong&gt; to understand how real-time voice AI works in practice. &lt;/p&gt;

&lt;p&gt;If you have not read Part 1 yet, read it first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pipecat
&lt;/h2&gt;

&lt;p&gt;Pipecat is an &lt;strong&gt;open-source Python framework&lt;/strong&gt; developed by Daily.co for building &lt;strong&gt;real-time voice and multimodal conversational AI agents&lt;/strong&gt;. It provides a powerful yet intuitive way to orchestrate audio, video, AI services, and transport protocols to create sophisticated voice assistants, AI companions, and interactive conversational experiences.&lt;/p&gt;

&lt;p&gt;What makes Pipecat special is its &lt;strong&gt;voice-first approach&lt;/strong&gt; combined with a &lt;strong&gt;modular, composable architecture&lt;/strong&gt;. Instead of building everything from scratch, you can focus on what makes your agent unique while Pipecat handles the complex orchestration of real-time audio processing, speech recognition, language models, and speech synthesis.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You Can Build with Pipecat
&lt;/h3&gt;

&lt;p&gt;Pipecat enables a wide range of applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Voice Assistants&lt;/strong&gt; - Natural, streaming conversations with AI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Companions&lt;/strong&gt; - Coaches, meeting assistants, and interactive characters
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phone Agents&lt;/strong&gt; - Customer support, intake bots, and automated calling systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal Interfaces&lt;/strong&gt; - Applications combining voice, video, and images&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business Agents&lt;/strong&gt; - Customer service bots and guided workflow systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interactive Games&lt;/strong&gt; - Voice-controlled gaming experiences&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creative Tools&lt;/strong&gt; - Interactive storytelling with generative media&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Pipecat Architecture: How It Works
&lt;/h2&gt;

&lt;p&gt;Understanding Pipecat's architecture is crucial for building effective voice agents. The framework is built around three core concepts:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Frames
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Frames&lt;/strong&gt; are data packages that move through your application. Think of them as containers that hold specific types of information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Audio frames&lt;/strong&gt; - Raw audio data from microphones&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text frames&lt;/strong&gt; - Transcribed speech or generated responses
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image frames&lt;/strong&gt; - Visual data for multimodal applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control frames&lt;/strong&gt; - System messages like start/stop signals&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Frame Processors
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Frame processors&lt;/strong&gt; are specialized workers that handle specific tasks. Each processor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Receives specific frame types as input&lt;/li&gt;
&lt;li&gt;Performs a specialized transformation (transcription, language processing, etc.)&lt;/li&gt;
&lt;li&gt;Outputs new frames for the next processor&lt;/li&gt;
&lt;li&gt;Passes through frames it doesn't handle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common processor types include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;STT (Speech-to-Text)&lt;/strong&gt; processors that convert audio frames to text frames&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM processors&lt;/strong&gt; that take text frames and generate response frames&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTS (Text-to-Speech)&lt;/strong&gt; processors that convert text frames to audio frames&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context aggregators&lt;/strong&gt; that manage conversation history&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Pipelines
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pipelines&lt;/strong&gt; connect processors together, creating a structured path for data to flow through your application. They handle orchestration automatically and enable &lt;strong&gt;parallel processing&lt;/strong&gt; - while the LLM generates later parts of a response, earlier parts are already being converted to speech and played back to users.&lt;/p&gt;
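&lt;p&gt;To make the frame → processor → pipeline idea concrete, here is a framework-free toy in plain Python. This is &lt;em&gt;not&lt;/em&gt; Pipecat's actual API (Pipecat's processors are async and stream frames continuously); every name below is invented for illustration:&lt;/p&gt;

```python
from dataclasses import dataclass

# Toy model of frames, processors, and a pipeline. All names are
# hypothetical; Pipecat's real classes differ and run asynchronously.

@dataclass
class ToyFrame:
    kind: str   # e.g. "audio" or "text"
    data: str

class ToyProcessor:
    def process(self, frame: ToyFrame) -> ToyFrame:
        # Default: pass through frames this processor doesn't handle.
        return frame

class ToySTT(ToyProcessor):
    def process(self, frame: ToyFrame) -> ToyFrame:
        if frame.kind == "audio":
            return ToyFrame("text", f"transcript of {frame.data}")
        return frame

class ToyLLM(ToyProcessor):
    def process(self, frame: ToyFrame) -> ToyFrame:
        if frame.kind == "text":
            return ToyFrame("text", f"reply to '{frame.data}'")
        return frame

class ToyPipeline:
    def __init__(self, processors):
        self.processors = processors

    def run(self, frame: ToyFrame) -> ToyFrame:
        # Each processor transforms the frame and hands it to the next.
        for processor in self.processors:
            frame = processor.process(frame)
        return frame

pipeline = ToyPipeline([ToySTT(), ToyLLM()])
result = pipeline.run(ToyFrame("audio", "mic chunk #1"))
print(result.kind, result.data)
```

&lt;p&gt;Pipecat layers async streaming and parallelism on top of this basic shape, which is what enables the low-latency behavior described below.&lt;/p&gt;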

&lt;h3&gt;
  
  
  Voice AI Processing Flow
&lt;/h3&gt;

&lt;p&gt;Here's how a typical voice conversation flows through a Pipecat pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Audio Input&lt;/strong&gt; - User speaks → Transport receives streaming audio → Creates audio frames&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speech Recognition&lt;/strong&gt; - STT processor receives audio → Transcribes in real-time → Outputs text frames
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Management&lt;/strong&gt; - Context processor aggregates text with conversation history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Language Processing&lt;/strong&gt; - LLM processor generates streaming response → Outputs text frames&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speech Synthesis&lt;/strong&gt; - TTS processor converts text to speech → Outputs audio frames&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio Output&lt;/strong&gt; - Transport streams audio to user's device → User hears response&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight is that &lt;strong&gt;everything happens in parallel&lt;/strong&gt; - this parallel processing enables the ultra-low latency that makes conversations feel natural.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hello World Voice Agent: Complete Implementation
&lt;/h2&gt;

&lt;p&gt;Now let's build a complete "Hello World" voice agent that demonstrates all the core concepts. This example creates a friendly AI assistant you can have real-time voice conversations with.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Before we start, you'll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Python 3.10 or later&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;uv package manager&lt;/strong&gt; (or pip)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API keys&lt;/strong&gt; from three services:

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://deepgram.com" rel="noopener noreferrer"&gt;Deepgram&lt;/a&gt; for Speech-to-Text&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openai.com" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt; for the language model&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cartesia.ai" rel="noopener noreferrer"&gt;Cartesia&lt;/a&gt; for Text-to-Speech&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Project Setup
&lt;/h3&gt;

&lt;p&gt;First, let's set up our project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Pipecat with required services&lt;/span&gt;
uv add &lt;span class="s2"&gt;"pipecat-ai[deepgram,openai,cartesia,webrtc]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Environment Configuration
&lt;/h3&gt;

&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file with your API keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# .env&lt;/span&gt;
&lt;span class="nv"&gt;DEEPGRAM_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_deepgram_api_key_here
&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_openai_api_key_here  
&lt;span class="nv"&gt;CARTESIA_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_cartesia_api_key_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full code is a bit long to include here, so I have only shared the key parts; you can check out the complete code &lt;a href="https://github.com/programmerraja/VoiceAgentGuide/tree/main/code/HelloWorld" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Main entry point for the Hello World bot.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HelloWorldVoiceBot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;bot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_bot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Understanding the Code Structure
&lt;/h3&gt;

&lt;p&gt;Let's break down the key components of our Hello World implementation:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Service Initialization
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Speech-to-Text service
&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DeepgramSTTService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEEPGRAM_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Language Model service  
&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAILLMService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-3.5-turbo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Text-to-Speech service
&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CartesiaTTSService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CARTESIA_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;voice_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each service is a &lt;strong&gt;frame processor&lt;/strong&gt; that handles a specific part of the voice AI pipeline.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Pipeline Configuration
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;              &lt;span class="c1"&gt;# Audio input from browser
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                      &lt;span class="c1"&gt;# Speech → Text
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context_aggregator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;user&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="c1"&gt;# Add to conversation history
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                      &lt;span class="c1"&gt;# Generate response
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                      &lt;span class="c1"&gt;# Text → Speech  
&lt;/span&gt;    &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;output&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;            &lt;span class="c1"&gt;# Audio output to browser
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context_aggregator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assistant&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="c1"&gt;# Save response to history
&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline defines the &lt;strong&gt;data flow&lt;/strong&gt;: each processor receives frames, transforms them, and passes them to the next processor.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Event-Driven Interactions
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@transport.event_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;on_first_participant_joined&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_participant_joined&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;participant&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Trigger bot to greet the user
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;queue_frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LLMMessagesFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Event handlers manage the &lt;strong&gt;conversation lifecycle&lt;/strong&gt; - when users join/leave, when they start/stop speaking, etc.&lt;/p&gt;

&lt;p&gt;The diagram below shows a typical voice assistant pipeline, where each step happens in real-time:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwug7hlfym5t96bqfnqy7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwug7hlfym5t96bqfnqy7.png" alt="pipecat" width="800" height="572"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Running Your Hello World Bot
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Save the code&lt;/strong&gt; as &lt;code&gt;hello_world_bot.py&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run the bot&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   python hello_world_bot.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Open your browser&lt;/strong&gt; to &lt;code&gt;http://localhost:7860&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Click "Connect"&lt;/strong&gt; and allow microphone access&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start talking!&lt;/strong&gt; Say something like "Hello, how are you?"&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The bot will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Listen to your speech (STT)&lt;/li&gt;
&lt;li&gt;Process it with OpenAI (LLM) &lt;/li&gt;
&lt;li&gt;Respond with natural speech (TTS)&lt;/li&gt;
&lt;li&gt;Remember the conversation context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For more examples and advanced features, check out the &lt;a href="https://docs.pipecat.ai" rel="noopener noreferrer"&gt;Pipecat documentation&lt;/a&gt; and &lt;a href="https://github.com/pipecat-ai/pipecat-examples" rel="noopener noreferrer"&gt;example repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Next?
&lt;/h2&gt;

&lt;p&gt;Now that you’re familiar with &lt;strong&gt;Pipecat&lt;/strong&gt; and can build your own real-time voice agent, it’s time to take the next step.&lt;/p&gt;

&lt;p&gt;In the upcoming part, we’ll explore &lt;strong&gt;how to run all models locally even on a CPU&lt;/strong&gt; and build a fully offline voice agent.&lt;/p&gt;

&lt;p&gt;I’ve created a GitHub repository &lt;a href="https://github.com/programmerraja/VoiceAgentGuide" rel="noopener noreferrer"&gt;VoiceAgentGuide&lt;/a&gt; for this series, where we can store our notes and related resources. Don’t forget to check it out and share your feedback. Feel free to contribute or add missing content by submitting a pull request (PR).&lt;/p&gt;

&lt;p&gt;Stay tuned for the next part of the &lt;strong&gt;2025 Voice AI Guide!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>voiceagent</category>
      <category>pipecat</category>
      <category>llm</category>
    </item>
    <item>
      <title>How Machines Learn: Understanding the Core Concepts of Neural Networks</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Sat, 04 Oct 2025 08:22:57 +0000</pubDate>
      <link>https://forem.com/programmerraja/how-machines-learn-understanding-the-core-concepts-of-neural-networks-3a9j</link>
      <guid>https://forem.com/programmerraja/how-machines-learn-understanding-the-core-concepts-of-neural-networks-3a9j</guid>
      <description>&lt;p&gt;Imagine trying to teach a child who’s never seen the world  to recognize a face, feel that fire is hot, or sense when it might rain. How would you do it?&lt;/p&gt;

&lt;p&gt;For centuries, we thought intelligence required something mystical: a soul, consciousness, a divine spark. But what if it’s just &lt;strong&gt;pattern recognition&lt;/strong&gt; at an extraordinary scale? What if learning is simply tuning millions of tiny parameters until inputs map correctly to outputs?&lt;/p&gt;

&lt;p&gt;That’s the bold idea behind &lt;strong&gt;deep learning&lt;/strong&gt;: mathematical systems that can learn any pattern, approximate any function, and tackle problems once thought uniquely human.&lt;/p&gt;

&lt;p&gt;In 1989, mathematicians proved the &lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Universal_approximation_theorem" rel="noopener noreferrer"&gt;Universal Approximation Theorem&lt;/a&gt;&lt;/strong&gt;, showing that even a neural network with a single hidden layer can approximate any continuous function. In theory, such a network can learn to translate, recognize, play, or predict anything.&lt;/p&gt;

&lt;p&gt;But theory alone isn’t enough. The theorem says such a network &lt;em&gt;exists&lt;/em&gt;, not how to build or train it. That’s where the real craft of deep learning begins: finding the right weights, training efficiently, and learning patterns that generalize.&lt;/p&gt;

&lt;p&gt;Let’s unpack the six core ideas that make this possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This is a deep dive, not a skim. Grab a coffee, settle in, and take your time. By the end, you’ll understand neural networks from the ground up, not just in words but in logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Neural Networks: Universal Function Approximators
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Are We Trying to Do?
&lt;/h3&gt;

&lt;p&gt;Before we understand neural networks, let's start with something simpler: what is a function?&lt;/p&gt;

&lt;p&gt;In mathematics, a function is a relationship that maps inputs to outputs. &lt;code&gt;f(x) = 2x + 1&lt;/code&gt; is a function. You give it &lt;code&gt;x = 3&lt;/code&gt;, it returns &lt;code&gt;7&lt;/code&gt;. Simple, deterministic, predictable.&lt;/p&gt;

&lt;p&gt;But real-world problems involve functions we can't write down. Consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;f(image) = "cat" or "dog"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;f(email_text) = "spam" or "not spam"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;f(patient_symptoms) = disease_probability&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are still functions (they map inputs to outputs), but we don't know their mathematical form. Traditional programming can't help us here because we can't write explicit rules for every possible image or email.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building Blocks: The Artificial Neuron
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5pad2olztck0jmxyxdsi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5pad2olztck0jmxyxdsi.png" alt="alt text" width="679" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's build from the ground up. Start with a single neuron, the atomic unit of a neural network.&lt;/p&gt;

&lt;p&gt;A neuron does three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Receives multiple inputs&lt;/strong&gt; (x₁, x₂, x₃, ...)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiplies each input by a weight&lt;/strong&gt; (w₁, w₂, w₃, ...)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sums everything up and adds a bias&lt;/strong&gt;: &lt;code&gt;z = w₁x₁ + w₂x₂ + w₃x₃ + ... + b&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Why this structure? Because it's the simplest way to combine multiple pieces of information into a single decision.&lt;/p&gt;
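&lt;p&gt;Those three steps fit in a few lines of plain Python. The inputs, weights, and bias below are arbitrary illustrative values:&lt;/p&gt;

```python
# A single artificial neuron: weighted sum of inputs plus a bias.
def neuron(inputs, weights, bias):
    """Compute z = w1*x1 + w2*x2 + ... + b for one neuron."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

# Example with made-up values: z = 0.5*1.0 + (-1.0)*2.0 + 0.25*3.0 + 0.1
z = neuron(inputs=[1.0, 2.0, 3.0], weights=[0.5, -1.0, 0.25], bias=0.1)
print(z)
```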

&lt;h3&gt;
  
  
  Geometry of a Neuron: Drawing a Line
&lt;/h3&gt;

&lt;p&gt;Let’s ground this in a real example.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: You're a bank deciding whether to approve loans. You have two pieces of information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;x₁ = Annual income (in thousands)&lt;/li&gt;
&lt;li&gt;x₂ = Credit score&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Separate "approve" from "reject" applications.&lt;/p&gt;

&lt;h4&gt;
  
  
  A Single Neuron Creates a Line (2D) or Hyperplane (Higher Dimensions)
&lt;/h4&gt;

&lt;p&gt;The equation &lt;code&gt;z = w₁x₁ + w₂x₂ + b&lt;/code&gt; is actually the equation of a line! Let's see how:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example neuron with specific weights:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z = 0.5·income + 2·credit_score - 150
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This neuron outputs positive values for "approve" and negative for "reject". The &lt;strong&gt;decision boundary&lt;/strong&gt; is where z = 0:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0 = 0.5·income + 2·credit_score - 150
credit_score = 75 - 0.25·income
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a line! Let's plot it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukswvy01qsmym9a2a5ba.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukswvy01qsmym9a2a5ba.png" alt="alt text" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;What the weights mean geometrically:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;w₁ = 0.5&lt;/strong&gt;: For every $1000 increase in income, the decision shifts by 0.5 units toward approval&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;w₂ = 2.0&lt;/strong&gt;: For every 1-point increase in credit score, the decision shifts by 2 units toward approval (4× more important than income!)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;b = -150&lt;/strong&gt;: The bias shifts the entire line. Without it, the line would pass through origin (0,0)&lt;/li&gt;
&lt;/ul&gt;
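&lt;p&gt;You can run the loan neuron directly. The weights come from the example above; the two applicants are made-up numbers just to show the sign of z deciding the outcome:&lt;/p&gt;

```python
def loan_neuron(income, credit_score):
    # z = 0.5*income + 2*credit_score - 150 (weights from the example above)
    return 0.5 * income + 2.0 * credit_score - 150.0

# Two illustrative applicants (income in thousands):
print(loan_neuron(100, 60))  # 50 + 120 - 150 = 20.0  -> positive: approve
print(loan_neuron(40, 55))   # 20 + 110 - 150 = -20.0 -> negative: reject
```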

&lt;p&gt;&lt;strong&gt;The learning process is finding the right line:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with a random line (random weights)&lt;/li&gt;
&lt;li&gt;See which points it classifies wrong&lt;/li&gt;
&lt;li&gt;Adjust the weights to rotate and shift the line&lt;/li&gt;
&lt;li&gt;Repeat until the line best separates the two groups&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  What One Neuron Can and Cannot Do
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwjpd2zz3kbyqrtovbwd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwjpd2zz3kbyqrtovbwd.png" alt="alt text" width="690" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cannot separate (non-linearly separable):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;XOR is the classic example: you need (0,1) and (1,0) to be class 1, but (0,0) and (1,1) to be class 0. No single line can achieve this separation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is why we need multiple layers.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Multiple Neurons, Multiple Lines: Building Complex Boundaries
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wkyz1thrtvbi5h0oopf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wkyz1thrtvbi5h0oopf.png" alt="alt text" width="800" height="635"&gt;&lt;/a&gt;&lt;br&gt;
If one neuron creates one line, what happens with multiple neurons in one layer?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: 3 neurons in one layer&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Neuron 1: z₁ = w₁₁x₁ + w₁₂x₂ + b₁ [Line 1]
Neuron 2: z₂ = w₂₁x₁ + w₂₂x₂ + b₂ [Line 2]
Neuron 3: z₃ = w₃₁x₁ + w₃₂x₂ + b₃ [Line 3]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each neuron draws a different line! But without additional layers, we still can't solve XOR. Why? Because a weighted sum of lines is still just a line; we need a non-linear way to combine them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The key insight&lt;/strong&gt;: We need to combine these lines non-linearly. This is where activation functions and depth come in.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Layer Abstraction
&lt;/h3&gt;

&lt;p&gt;Now stack multiple neurons side by side: that's a layer. Each neuron in the layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Receives the same inputs&lt;/li&gt;
&lt;li&gt;Has its own unique weights and bias&lt;/li&gt;
&lt;li&gt;Produces its own output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A layer with 10 neurons transforms one input vector into 10 different outputs, each representing a different "feature" or "pattern" it has detected.&lt;/p&gt;

&lt;h4&gt;
  
  
  Solving XOR: A Complete Example
&lt;/h4&gt;

&lt;p&gt;Let's solve XOR step-by-step to understand how layers work together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The XOR Problem:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input (x₁, x₂) → Output
(0, 0) → 0
(0, 1) → 1
(1, 0) → 1
(1, 1) → 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Two-Layer Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Create useful features (2 neurons with ReLU)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Neuron 1&lt;/strong&gt;: Detects "at least one input is 1"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z₁ = x₁ + x₂ - 0.5
a₁ = ReLU(z₁)

Testing:

(0,0): z₁ = -0.5, a₁ = 0
(0,1): z₁ = 0.5, a₁ = 0.5
(1,0): z₁ = 0.5, a₁ = 0.5
(1,1): z₁ = 1.5, a₁ = 1.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Neuron 2&lt;/strong&gt;: Detects "both inputs are 1"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z₂ = x₁ + x₂ - 1.5
a₂ = ReLU(z₂)

Testing:

(0,0): z₂ = -1.5, a₂ = 0

(0,1): z₂ = -0.5, a₂ = 0

(1,0): z₂ = -0.5, a₂ = 0

(1,1): z₂ = 0.5, a₂ = 0.5

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Layer 2: Combine features (1 neuron with Sigmoid)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z₃ = a₁ - 2·a₂ - 0.25

output = Sigmoid(z₃)

Testing:
(0,0): z₃ = 0 - 0 - 0.25 = -0.25 → ≈0 ✓

(0,1): z₃ = 0.5 - 0 - 0.25 = 0.25 → ≈1 ✓

(1,0): z₃ = 0.5 - 0 - 0.25 = 0.25 → ≈1 ✓

(1,1): z₃ = 1.5 - 1 - 0.25 = 0.25 → ≈0 ✓

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What happened geometrically?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkh97kfg4xvhipjm0o5q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkh97kfg4xvhipjm0o5q.png" alt="alt text" width="600" height="306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1 transformed the space:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first layer created new features where the problem becomes linearly separable!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a₁ captures "OR-ness" (at least one is true)&lt;/li&gt;
&lt;li&gt;a₂ captures "AND-ness" (both are true)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 2 drew a simple line&lt;/strong&gt; in this new space:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a₁ - 2·a₂ = 0.25 [decision boundary]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This line easily separates XOR in the transformed space!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The key insight&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Layer 1&lt;/strong&gt;: Creates useful intermediate features by drawing multiple lines/planes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 2&lt;/strong&gt;: Combines these features with another line/plane&lt;/li&gt;
&lt;li&gt;Together: They can represent any decision boundary!&lt;/li&gt;
&lt;/ul&gt;
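&lt;p&gt;The whole two-layer XOR network fits in a few lines. A minimal sketch (the output-layer weight on a₂ is taken as -3 here so that all four inputs land on the correct side of the decision boundary):&lt;/p&gt;

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_net(x1, x2):
    # Layer 1: two feature detectors
    a1 = relu(x1 + x2 - 0.5)   # "at least one input is 1" (OR-ness)
    a2 = relu(x1 + x2 - 1.5)   # "both inputs are 1" (AND-ness)
    # Layer 2: one line in the transformed (a1, a2) space
    z3 = a1 - 3.0 * a2 - 0.25
    return sigmoid(z3)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    p = xor_net(x1, x2)
    print((x1, x2), round(p, 3), "->", int(p > 0.5))
```

Thresholding the sigmoid output at 0.5 reproduces the XOR truth table exactly.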

&lt;h3&gt;
  
  
  The Complete Architecture
&lt;/h3&gt;

&lt;p&gt;A typical neural network:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input Layer (raw data) 
    → Hidden Layer 1 (low-level features)
    → Hidden Layer 2 (mid-level features)
    → Hidden Layer 3 (high-level features)
    → Output Layer (predictions)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The power lies not in any single neuron, but in the billions of connections between them, each with its own weight, collectively forming a function approximator of extraordinary flexibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Universal Approximation Theorem
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Universal Approximation Theorem (1989)&lt;/strong&gt; proves:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A neural network with just one hidden layer can approximate any continuous function, given enough neurons.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But “enough” might mean billions of neurons, which is impractical.&lt;br&gt;&lt;br&gt;
Deep (multi-layer) networks achieve the same expressive power &lt;em&gt;more efficiently&lt;/em&gt; through &lt;strong&gt;hierarchical composition&lt;/strong&gt;, reusing simple features to build abstractions.&lt;/p&gt;

&lt;p&gt;So, in theory, neural networks can learn any mapping; in practice, depth makes it tractable.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. Activation Functions: Breaking Linearity
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The Linear Trap: A Fundamental Problem
&lt;/h3&gt;

&lt;p&gt;Imagine we build a neural network with three layers, but we don't use activation functions. Let's trace through what happens mathematically:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1&lt;/strong&gt;: &lt;code&gt;z₁ = W₁x + b₁&lt;/code&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Layer 2&lt;/strong&gt;: &lt;code&gt;z₂ = W₂z₁ + b₂ = W₂(W₁x + b₁) + b₂ = W₂W₁x + W₂b₁ + b₂&lt;/code&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Layer 3&lt;/strong&gt;: &lt;code&gt;z₃ = W₃z₂ + b₃ = W₃(W₂W₁x + W₂b₁ + b₂) + b₃&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Simplifying: &lt;code&gt;z₃ = (W₃W₂W₁)x + (W₃W₂b₁ + W₃b₂ + b₃)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Notice what happened? No matter how many layers we add, we always end up with &lt;code&gt;Wx + b&lt;/code&gt;: a simple linear function. A product of matrices is still a matrix. We've built an expensive way to do linear regression.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is catastrophic.&lt;/strong&gt; Linear functions can only model linear relationships. The real world is non-linear. The path of a thrown ball, the spread of a virus, the relationship between study time and test scores—all non-linear.&lt;/p&gt;
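&lt;p&gt;You can verify the collapse numerically. A quick sketch with random weights (the shapes are arbitrary): three activation-free layers equal one linear layer whose weight is the product W₃W₂W₁ and whose bias folds the three biases together.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, W2, W3 = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
b1, b2, b3 = rng.normal(size=4), rng.normal(size=4), rng.normal(size=2)

# Three "layers" with no activation functions
z3 = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3

# One equivalent linear layer
W = W3 @ W2 @ W1
b = W3 @ W2 @ b1 + W3 @ b2 + b3

print(np.allclose(z3, W @ x + b))  # True: the depth bought us nothing
```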
&lt;h3&gt;
  
  
  The Solution: Non-Linear Activation Functions
&lt;/h3&gt;

&lt;p&gt;After each neuron computes its weighted sum, we pass it through a non-linear activation function: &lt;code&gt;a = σ(z)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This single addition breaks the linear trap. Now our layers actually do different things, building increasingly complex representations.&lt;/p&gt;
&lt;h3&gt;
  
  
  What Makes a Good Activation Function?
&lt;/h3&gt;

&lt;p&gt;Let's think about what properties we need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Non-linearity&lt;/strong&gt; (obviously, or we're back where we started)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Differentiability&lt;/strong&gt; (we'll need derivatives for learning)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Computational efficiency&lt;/strong&gt; (we'll apply it billions of times)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid saturation&lt;/strong&gt; (outputs shouldn't always be at extremes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-centered or positive&lt;/strong&gt; (depending on the problem)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Common Activation Functions
&lt;/h3&gt;
&lt;h4&gt;
  
  
  ReLU (Rectified Linear Unit): &lt;code&gt;f(x) = max(0, x)&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33dr2n7h7gyni9o6c42r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33dr2n7h7gyni9o6c42r.png" alt="alt text" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dead simple: if input is positive, output equals input; if negative, output is zero&lt;/li&gt;
&lt;li&gt;Non-linear despite looking linear (the "kink" at zero creates non-linearity)&lt;/li&gt;
&lt;li&gt;Computationally trivial: just one comparison, no multiplications or exponentials&lt;/li&gt;
&lt;li&gt;Doesn't saturate for positive values (unlike sigmoid)&lt;/li&gt;
&lt;li&gt;Induces sparsity: many neurons output exactly zero, creating efficient representations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Dying ReLU": if a neuron's weights push it permanently into negative territory, its gradient becomes zero and it stops learning forever&lt;/li&gt;
&lt;li&gt;Not zero-centered: all outputs are positive, which can slow convergence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Variants:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Leaky ReLU&lt;/strong&gt;: &lt;code&gt;f(x) = max(0.01x, x)&lt;/code&gt; — allows small gradients when x &amp;lt; 0, preventing death&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ELU (Exponential Linear Unit)&lt;/strong&gt;: Smooth curve for negative values, better learning dynamics&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Sigmoid: &lt;code&gt;f(x) = 1/(1 + e^(-x))&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7b8sv3se51qlj9vepv2k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7b8sv3se51qlj9vepv2k.png" alt="alt text" width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it exists:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Squashes any input into range (0, 1)&lt;/li&gt;
&lt;li&gt;Historically motivated by biological neurons (firing rates between 0 and 1)&lt;/li&gt;
&lt;li&gt;Output can be interpreted as probability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What's happening?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For large positive x: &lt;code&gt;e^(-x)&lt;/code&gt; approaches 0, so output approaches 1&lt;/li&gt;
&lt;li&gt;For large negative x: &lt;code&gt;e^(-x)&lt;/code&gt; approaches infinity, so output approaches 0&lt;/li&gt;
&lt;li&gt;At x = 0: output is 0.5&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why it's problematic:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vanishing gradients&lt;/strong&gt;: For large positive or negative inputs, the sigmoid is nearly flat, so its derivative approaches zero. During backpropagation, gradients are multiplied across layers, and near-zero factors shrink the product toward zero. Deep networks can't learn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not zero-centered&lt;/strong&gt;: Outputs always positive (0 to 1), causing zig-zagging during optimization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Computationally expensive&lt;/strong&gt;: Exponential function&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where it's still used:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Output layer for binary classification (want probability between 0 and 1)&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Tanh: &lt;code&gt;f(x) = (e^x - e^(-x))/(e^x + e^(-x))&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqid12xsk2zwj7kh5jom4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqid12xsk2zwj7kh5jom4.png" alt="alt text" width="587" height="405"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Advantages over sigmoid:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero-centered: outputs range from -1 to 1&lt;/li&gt;
&lt;li&gt;Stronger gradients: derivative at zero is 1 (compared to 0.25 for sigmoid)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Still suffers from:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vanishing gradients for extreme values&lt;/li&gt;
&lt;li&gt;Computational cost of exponentials&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Softmax: &lt;code&gt;f(x_i) = e^(x_i) / Σe^(x_j)&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Completely different purpose:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not used between hidden layers&lt;/li&gt;
&lt;li&gt;Exclusively for multi-class classification output layers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In simple terms:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Takes a vector of arbitrary values (logits)&lt;/li&gt;
&lt;li&gt;Converts them into probabilities that sum to 1&lt;/li&gt;
&lt;li&gt;Exponentiation ensures all values are positive&lt;/li&gt;
&lt;li&gt;Division by sum ensures they sum to 1&lt;/li&gt;
&lt;li&gt;Higher inputs get exponentially higher probabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input: [2.0, 1.0, 0.1]&lt;/li&gt;
&lt;li&gt;After softmax: [0.659, 0.242, 0.099]&lt;/li&gt;
&lt;li&gt;Notice: still ordered the same way, but now they're probabilities&lt;/li&gt;
&lt;/ul&gt;
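&lt;p&gt;All four activations fit in a few lines of NumPy. A minimal sketch that also reproduces the softmax example above (the max-subtraction trick is a standard stability detail, not part of the formula):&lt;/p&gt;

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

print(sigmoid(0.0))                        # 0.5
print(softmax(np.array([2.0, 1.0, 0.1])))  # ≈ [0.659, 0.242, 0.099], sums to 1
```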
&lt;h3&gt;
  
  
  Why Different Layers Need Different Activations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hidden Layers&lt;/strong&gt;: ReLU family (efficiency, avoiding vanishing gradients)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Binary Classification Output&lt;/strong&gt;: Sigmoid (get probability for one class)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-class Classification Output&lt;/strong&gt;: Softmax (get probability distribution over all classes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regression Output&lt;/strong&gt;: Often no activation (or linear) — we want the raw value, not a bounded one&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  3. Forward Propagation: The Prediction Process
&lt;/h2&gt;
&lt;h3&gt;
  
  
  What is Propagation?
&lt;/h3&gt;

&lt;p&gt;"Propagation" is just a fancy word for "passing information through." Forward propagation is the process of taking input data and pushing it through every layer until we get a prediction.&lt;/p&gt;

&lt;p&gt;Let's build this concept from absolute scratch.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Single Neuron Case
&lt;/h3&gt;

&lt;p&gt;You have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input: &lt;code&gt;x = 3&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Weight: &lt;code&gt;w = 2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Bias: &lt;code&gt;b = 1&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Linear combination&lt;/strong&gt; &lt;code&gt;z = wx + b = 2(3) + 1 = 7&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Activation&lt;/strong&gt; &lt;code&gt;a = ReLU(z) = max(0, 7) = 7&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That's it. The neuron outputs 7. This output might be the final prediction (if it's the only neuron), or it might be input to the next layer.&lt;/p&gt;
&lt;h3&gt;
  
  
  Multiple Inputs, Single Neuron
&lt;/h3&gt;

&lt;p&gt;Now you have three inputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inputs: &lt;code&gt;x = [x₁=2, x₂=3, x₃=1]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Weights: &lt;code&gt;w = [w₁=0.5, w₂=-1, w₃=2]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Bias: &lt;code&gt;b = 1&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Weighted sum&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z = w₁x₁ + w₂x₂ + w₃x₃ + b
z = 0.5(2) + (-1)(3) + 2(1) + 1
z = 1 - 3 + 2 + 1 = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Activation&lt;/strong&gt; &lt;code&gt;a = ReLU(1) = 1&lt;/code&gt;&lt;/p&gt;
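&lt;p&gt;The same two steps in code, checking the arithmetic above:&lt;/p&gt;

```python
x = [2, 3, 1]
w = [0.5, -1, 2]
b = 1

z = sum(wi * xi for wi, xi in zip(w, x)) + b  # 1 - 3 + 2 + 1 = 1
a = max(0, z)                                 # ReLU
print(z, a)  # 1.0 1.0
```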

&lt;h3&gt;
  
  
  Single Layer: Multiple Neurons
&lt;/h3&gt;

&lt;p&gt;Now suppose we have 3 neurons in one layer, all receiving the same 3 inputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Neuron 1:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weights: &lt;code&gt;[w₁₁, w₁₂, w₁₃]&lt;/code&gt;, Bias: &lt;code&gt;b₁&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Output: &lt;code&gt;a₁ = ReLU(w₁₁x₁ + w₁₂x₂ + w₁₃x₃ + b₁)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Neuron 2:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weights: &lt;code&gt;[w₂₁, w₂₂, w₂₃]&lt;/code&gt;, Bias: &lt;code&gt;b₂&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Output: &lt;code&gt;a₂ = ReLU(w₂₁x₁ + w₂₂x₂ + w₂₃x₃ + b₂)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Neuron 3:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weights: &lt;code&gt;[w₃₁, w₃₂, w₃₃]&lt;/code&gt;, Bias: &lt;code&gt;b₃&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Output: &lt;code&gt;a₃ = ReLU(w₃₁x₁ + w₃₂x₂ + w₃₃x₃ + b₃)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The layer transforms input vector &lt;code&gt;[x₁, x₂, x₃]&lt;/code&gt; into output vector &lt;code&gt;[a₁, a₂, a₃]&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Matrix Representation: Scaling to Thousands of Neurons
&lt;/h3&gt;

&lt;p&gt;Writing out every neuron individually is tedious. We use matrix notation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weight Matrix W:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;W = [w₁₁  w₁₂  w₁₃]
    [w₂₁  w₂₂  w₂₃]
    [w₃₁  w₃₂  w₃₃]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each row represents one neuron's weights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input Vector x:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = [x₁]
    [x₂]
    [x₃]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Forward propagation for the layer:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z = Wx + b
a = ReLU(z)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This single matrix multiplication computes all neurons simultaneously. With modern GPUs optimized for matrix operations, we can process thousands of neurons in parallel.&lt;/p&gt;
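&lt;p&gt;Here's the whole layer as one matrix multiply in NumPy (the weights are made up; the first row reuses the single-neuron example above):&lt;/p&gt;

```python
import numpy as np

# One layer: 3 neurons, 3 inputs. Each ROW of W holds one neuron's weights.
W = np.array([[0.5, -1.0,  2.0],
              [1.0,  0.5, -0.5],
              [-2.0, 1.0,  1.0]])
b = np.array([1.0, 0.0, -1.0])
x = np.array([2.0, 3.0, 1.0])

z = W @ x + b            # all three weighted sums in one matrix multiply
a = np.maximum(0.0, z)   # ReLU applied elementwise

print(z)  # weighted sums: 1, 3, -1
print(a)  # after ReLU:    1, 3,  0
```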

&lt;h3&gt;
  
  
  Deep Networks: Chaining Layers
&lt;/h3&gt;

&lt;p&gt;Now stack multiple layers. The output of layer 1 becomes the input to layer 2:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z¹ = W¹x + b¹
a¹ = ReLU(z¹)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Layer 2:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z² = W²a¹ + b²
a² = ReLU(z²)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Layer 3 (output):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z³ = W³a² + b³
ŷ = softmax(z³)  [if classification]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final output &lt;code&gt;ŷ&lt;/code&gt; is our prediction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Concrete Example: Digit Recognition
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw9tuoc4vs2qmzeoe470b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw9tuoc4vs2qmzeoe470b.png" alt="alt text" width="372" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input&lt;/strong&gt;: 28×28 pixel image of a handwritten digit (flattened to 784 values)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input layer: 784 neurons&lt;/li&gt;
&lt;li&gt;Hidden layer 1: 128 neurons (with ReLU)&lt;/li&gt;
&lt;li&gt;Hidden layer 2: 64 neurons (with ReLU)&lt;/li&gt;
&lt;li&gt;Output layer: 10 neurons (with softmax for digits 0-9)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Forward propagation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z¹ = W¹x + b¹           [128 values]
a¹ = ReLU(z¹)           [128 values]

z² = W²a¹ + b²          [64 values]
a² = ReLU(z²)           [64 values]

z³ = W³a² + b³          [10 values]
ŷ = softmax(z³)         [10 probabilities summing to 1]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output might be: &lt;code&gt;[0.01, 0.02, 0.05, 0.7, 0.1, 0.05, 0.03, 0.02, 0.01, 0.01]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The network predicts "3" with 70% confidence (index 3 has highest probability).&lt;/p&gt;
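&lt;p&gt;The full forward pass for this architecture is short. A sketch with randomly initialized (untrained) weights, just to show the shapes flowing through 784 → 128 → 64 → 10:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())  # stable softmax
    return e / e.sum()

# Randomly initialized 784 -> 128 -> 64 -> 10 network (shapes only, not trained)
W1, b1 = rng.normal(0, 0.05, (128, 784)), np.zeros(128)
W2, b2 = rng.normal(0, 0.05, (64, 128)),  np.zeros(64)
W3, b3 = rng.normal(0, 0.05, (10, 64)),   np.zeros(10)

x = rng.random(784)            # stand-in for a flattened 28x28 image

a1 = relu(W1 @ x + b1)         # 128 values
a2 = relu(W2 @ a1 + b2)        # 64 values
y_hat = softmax(W3 @ a2 + b3)  # 10 probabilities summing to 1

print(y_hat.shape, round(y_hat.sum(), 6))  # (10,) 1.0
print("predicted digit:", int(y_hat.argmax()))
```

With random weights the prediction is meaningless; training (covered next) is what makes argmax line up with the true digit.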

&lt;h3&gt;
  
  
  Why "Forward"?
&lt;/h3&gt;

&lt;p&gt;Because information flows in one direction: from input → through hidden layers → to output. No loops, no feedback (in standard feedforward networks). Each layer only looks forward, never backward.&lt;/p&gt;

&lt;p&gt;Later, during learning, we'll propagate in the opposite direction (backward) to adjust weights. But prediction is always forward.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Loss Functions: Quantifying Error
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxrn81o50b9cin3alg19.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxrn81o50b9cin3alg19.png" alt="alt text" width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Do We Need Loss?
&lt;/h3&gt;

&lt;p&gt;Imagine you're teaching a child to draw circles. They draw something. How do you tell them how "wrong" it is? You need a measurement: some way to quantify the difference between what they drew and a perfect circle.&lt;/p&gt;

&lt;p&gt;Neural networks face the same problem. After forward propagation, we have a prediction &lt;code&gt;ŷ&lt;/code&gt;. We also have the true answer &lt;code&gt;y&lt;/code&gt;. The &lt;strong&gt;loss function&lt;/strong&gt; &lt;code&gt;L(ŷ, y)&lt;/code&gt; measures how wrong the prediction is.&lt;/p&gt;

&lt;p&gt;This single number is crucial because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It tells us how well the model is performing&lt;/li&gt;
&lt;li&gt;It guides the learning process (we'll adjust weights to minimize this number)&lt;/li&gt;
&lt;li&gt;Different problems need different ways of measuring "wrongness"&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Property Requirements for Loss Functions
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Non-negative&lt;/strong&gt;: &lt;code&gt;L ≥ 0&lt;/code&gt; always (can't be "negative wrong")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero when perfect&lt;/strong&gt;: &lt;code&gt;L = 0&lt;/code&gt; when &lt;code&gt;ŷ = y&lt;/code&gt; exactly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increases with error&lt;/strong&gt;: Worse predictions → higher loss&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Differentiable&lt;/strong&gt;: We need gradients for learning (calculus requirement)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Appropriate for the task&lt;/strong&gt;: Regression vs classification need different measures&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Mean Squared Error (MSE): For Regression
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;: Predict a continuous value (house price, temperature, stock price)&lt;/p&gt;

&lt;p&gt;The most intuitive approach: absolute difference &lt;code&gt;|ŷ - y|&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If true value is 100 and we predict 90, error = 10&lt;/li&gt;
&lt;li&gt;Simple, interpretable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But there's a problem: absolute value isn't differentiable at zero (the derivative has a discontinuity). This complicates learning algorithms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Better approach: Square the difference&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L = (ŷ - y)²
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why squaring?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Always positive (negative errors don't cancel positive ones)&lt;/li&gt;
&lt;li&gt;Differentiable everywhere: &lt;code&gt;dL/dŷ = 2(ŷ - y)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Penalizes large errors more (error of 10 contributes 100, but error of 1 contributes only 1)&lt;/li&gt;
&lt;li&gt;Mathematically convenient (leads to elegant solutions)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For multiple predictions (a batch):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MSE = (1/n) Σᵢ(ŷᵢ - yᵢ)²
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We average across all samples to get a single loss value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concrete Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predicting house prices&lt;/li&gt;
&lt;li&gt;True prices: [200k, 300k, 250k]&lt;/li&gt;
&lt;li&gt;Predicted: [210k, 280k, 255k]&lt;/li&gt;
&lt;li&gt;Errors: [10k, -20k, 5k]&lt;/li&gt;
&lt;li&gt;Squared errors: [100M, 400M, 25M]&lt;/li&gt;
&lt;li&gt;MSE = (100M + 400M + 25M) / 3 = 175M&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The large middle error dominates the loss, signaling that's where improvement is needed most.&lt;/p&gt;
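&lt;p&gt;The house-price MSE above, computed directly (prices in thousands, so the squared errors are in thousands², matching the "M" figures):&lt;/p&gt;

```python
true_prices = [200, 300, 250]   # in thousands of dollars
predicted   = [210, 280, 255]

errors  = [p - t for p, t in zip(predicted, true_prices)]  # [10, -20, 5]
squared = [e ** 2 for e in errors]                         # [100, 400, 25]
mse     = sum(squared) / len(squared)

print(mse)  # 175.0 (thousands squared, i.e. the 175M above)
```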

&lt;p&gt;&lt;strong&gt;Variant: MAE (Mean Absolute Error)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MAE = (1/n) Σᵢ|ŷᵢ - yᵢ|
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;More robust to outliers (doesn't square them)&lt;/li&gt;
&lt;li&gt;Less sensitive to large errors&lt;/li&gt;
&lt;li&gt;Harder to optimize (non-smooth at zero)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cross-Entropy Loss: For Classification
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;: Predict discrete categories (cat vs dog, spam vs ham, digit 0-9)&lt;/p&gt;

&lt;p&gt;MSE doesn't work well here. Why? Because classification outputs are probabilities, and we need to measure "how wrong" a probability distribution is.&lt;/p&gt;

&lt;h4&gt;
  
  
  Binary Cross-Entropy (Two Classes)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;True label: &lt;code&gt;y ∈ {0, 1}&lt;/code&gt; (e.g., 0 = not spam, 1 = spam)&lt;/li&gt;
&lt;li&gt;Predicted probability: &lt;code&gt;ŷ ∈ [0, 1]&lt;/code&gt; (from sigmoid activation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If true label is 1 (positive class):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If we predict ŷ = 1.0 (certain it's positive): perfect, loss should be 0&lt;/li&gt;
&lt;li&gt;If we predict ŷ = 0.9 (very confident): small loss&lt;/li&gt;
&lt;li&gt;If we predict ŷ = 0.5 (uncertain): moderate loss&lt;/li&gt;
&lt;li&gt;If we predict ŷ = 0.1 (confident it's negative): large loss&lt;/li&gt;
&lt;li&gt;If we predict ŷ = 0.0 (certain it's negative): infinite loss (catastrophically wrong)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The formula that captures this:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Case 1: y = 1 (true class is positive)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L = -log(ŷ)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If ŷ = 1: L = -log(1) = 0 ✓&lt;/li&gt;
&lt;li&gt;If ŷ = 0.5: L = -log(0.5) ≈ 0.69&lt;/li&gt;
&lt;li&gt;If ŷ = 0.1: L = -log(0.1) ≈ 2.30&lt;/li&gt;
&lt;li&gt;If ŷ → 0: L → ∞ (massive penalty for confident wrong answer)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Case 2: y = 0 (true class is negative)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L = -log(1-ŷ)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If ŷ = 0: L = -log(1) = 0 ✓&lt;/li&gt;
&lt;li&gt;If ŷ = 0.5: L = -log(0.5) ≈ 0.69&lt;/li&gt;
&lt;li&gt;If ŷ = 0.9: L = -log(0.1) ≈ 2.30&lt;/li&gt;
&lt;li&gt;If ŷ → 1: L → ∞&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The logarithm creates the right penalty structure: small errors have small losses, but confident mistakes are punished severely.&lt;/p&gt;
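&lt;p&gt;The penalty structure is easy to verify numerically (a minimal sketch; the &lt;code&gt;eps&lt;/code&gt; clamp is a standard guard against &lt;code&gt;log(0)&lt;/code&gt;):&lt;/p&gt;

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """BCE for a single example; eps guards against log(0)."""
    y_hat = min(max(y_hat, eps), 1 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# True label is 1: loss grows as the predicted probability drops
for y_hat in [0.9, 0.5, 0.1]:
    print(f"yhat={y_hat}: loss={binary_cross_entropy(1, y_hat):.2f}")
# 0.9 -> 0.11, 0.5 -> 0.69, 0.1 -> 2.30, matching the table above
```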

&lt;p&gt;&lt;strong&gt;Why "cross-entropy"?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It comes from information theory. Cross-entropy measures the average number of bits needed to encode data from one distribution using another distribution. Here, we're measuring the "distance" between the true distribution (y) and predicted distribution (ŷ).&lt;/p&gt;

&lt;h4&gt;
  
  
  Categorical Cross-Entropy (Multiple Classes)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;True label: one-hot encoded vector (e.g., [0, 0, 1, 0, 0] for class 3)&lt;/li&gt;
&lt;li&gt;Predicted: probability distribution from softmax (e.g., [0.1, 0.2, 0.5, 0.15, 0.05])&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L = -Σᵢ yᵢ·log(ŷᵢ)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since y is one-hot (only one element is 1, rest are 0), this simplifies to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L = -log(ŷ_true_class)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example: Digit classification (0-9)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;True label: 7 → one-hot: [0,0,0,0,0,0,0,1,0,0]&lt;/li&gt;
&lt;li&gt;Predicted: [0.05, 0.05, 0.1, 0.05, 0.05, 0.05, 0.1, 0.4, 0.1, 0.05]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Loss = -log(0.4) ≈ 0.916&lt;/p&gt;

&lt;p&gt;If the model had predicted 7 with 0.9 probability: Loss = -log(0.9) ≈ 0.105 (much better)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intuition&lt;/strong&gt;: We only care about the probability assigned to the correct class. The loss increases as this probability decreases.&lt;/p&gt;
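&lt;p&gt;Since only the correct class matters, the whole loss reduces to one line of code (a sketch using the digit example above):&lt;/p&gt;

```python
import math

def categorical_cross_entropy(true_class, probs):
    """Loss is just -log of the probability assigned to the correct class."""
    return -math.log(probs[true_class])

# Digit-classification example from above: true digit is 7
probs = [0.05, 0.05, 0.1, 0.05, 0.05, 0.05, 0.1, 0.4, 0.1, 0.05]
print(categorical_cross_entropy(7, probs))  # ~0.916
```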

&lt;h3&gt;
  
  
  Choosing the Right Loss Function
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Regression (predicting continuous values):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MSE: Standard choice, penalizes large errors heavily&lt;/li&gt;
&lt;li&gt;MAE: More robust to outliers&lt;/li&gt;
&lt;li&gt;Huber Loss: Combines benefits of both (MSE for small errors, MAE for large)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Binary Classification:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Binary Cross-Entropy: Standard choice when using sigmoid output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multi-class Classification:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Categorical Cross-Entropy: When labels are one-hot encoded&lt;/li&gt;
&lt;li&gt;Sparse Categorical Cross-Entropy: When labels are integers (more memory efficient)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Custom Loss Functions:&lt;/strong&gt; Sometimes you need domain-specific losses. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Medical diagnosis: False negatives might be more costly than false positives&lt;/li&gt;
&lt;li&gt;Image generation: Perceptual losses that compare high-level features, not pixels&lt;/li&gt;
&lt;li&gt;Reinforcement learning: Reward-based losses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The loss function is the objective we're optimizing. Choose it carefully—your model will become excellent at minimizing it, for better or worse.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Backpropagation: The Learning Algorithm
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdq6nclr8x9jdso1qt75e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdq6nclr8x9jdso1qt75e.png" alt="alt text" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This step is crucial: it’s where the real &lt;em&gt;learning&lt;/em&gt; happens.&lt;/p&gt;

&lt;p&gt;Our neural network has millions of tiny adjustable numbers called &lt;strong&gt;weights&lt;/strong&gt;. We make a prediction, compare it with the correct answer, and realize we’re off. The big question is: &lt;strong&gt;how do we tweak those millions of weights to make the next prediction better?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s not as simple as it sounds. Each weight affects many others, and changing even one can ripple through the entire network. Should we increase it or decrease it? And by how much?&lt;/p&gt;

&lt;p&gt;That’s where &lt;strong&gt;backpropagation&lt;/strong&gt; comes in: a beautifully systematic way to figure out exactly how every single weight should change to reduce the overall error.&lt;/p&gt;

&lt;p&gt;To really &lt;em&gt;grasp&lt;/em&gt; what’s happening here, you’ll need a bit of comfort with &lt;strong&gt;calculus&lt;/strong&gt;, especially with &lt;strong&gt;derivatives&lt;/strong&gt; and how small changes in one variable affect another. &lt;/p&gt;

&lt;h3&gt;
  
  
  The Core Insight: The Chain Rule of Calculus
&lt;/h3&gt;

&lt;p&gt;Everything in backpropagation stems from one calculus concept: &lt;a href="https://en.wikipedia.org/wiki/Chain_rule" rel="noopener noreferrer"&gt;the chain rule&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple example:&lt;/strong&gt; If &lt;code&gt;z = f(y)&lt;/code&gt; and &lt;code&gt;y = g(x)&lt;/code&gt;, then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dz/dx = (dz/dy) · (dy/dx)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;In words&lt;/strong&gt;: The rate of change of z with respect to x equals the rate of change of z with respect to y, multiplied by the rate of change of y with respect to x.&lt;/p&gt;

&lt;p&gt;This might seem abstract, so let's make it concrete.&lt;/p&gt;

&lt;h3&gt;
  
  
  Concrete Example: A Tiny Network
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One input: x = 2&lt;/li&gt;
&lt;li&gt;One weight: w = 3&lt;/li&gt;
&lt;li&gt;One bias: b = 1&lt;/li&gt;
&lt;li&gt;Activation: ReLU&lt;/li&gt;
&lt;li&gt;True output: y = 15&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Forward pass:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z = wx + b = 3(2) + 1 = 7
a = ReLU(z) = 7
L = (a - y)² = (7 - 15)² = 64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Loss is 64. We want to reduce it. Should we increase or decrease w?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backward pass (backpropagation):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We need &lt;code&gt;dL/dw&lt;/code&gt; (how much does loss change when we change w?).&lt;/p&gt;

&lt;p&gt;Using the chain rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dL/dw = (dL/da) · (da/dz) · (dz/dw)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's calculate each piece:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: dL/da&lt;/strong&gt; (how does loss change with activation?)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L = (a - y)²
dL/da = 2(a - y) = 2(7 - 15) = -16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: da/dz&lt;/strong&gt; (how does activation change with pre-activation?)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a = ReLU(z) = max(0, z)
For z &amp;gt; 0: da/dz = 1
For z ≤ 0: da/dz = 0
Since z = 7 &amp;gt; 0: da/dz = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: dz/dw&lt;/strong&gt; (how does pre-activation change with weight?)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z = wx + b
dz/dw = x = 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Combine them:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dL/dw = (dL/da) · (da/dz) · (dz/dw)
dL/dw = (-16) · (1) · (2) = -32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpretation&lt;/strong&gt;: The gradient is -32. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If we increase w by a tiny amount, the loss will decrease by approximately 32 times that amount&lt;/li&gt;
&lt;li&gt;The negative sign tells us to increase w (move opposite to the gradient)&lt;/li&gt;
&lt;li&gt;The magnitude (32) tells us how sensitive the loss is to changes in w&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Update the weight:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;w_new = w_old - learning_rate · (dL/dw)
w_new = 3 - 0.01 · (-32) = 3 + 0.32 = 3.32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We've just learned! The network adjusted its weight to reduce the loss.&lt;/p&gt;
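&lt;p&gt;The whole worked example fits in a dozen lines; this is a direct transcription of the numbers above:&lt;/p&gt;

```python
# Tiny network from the example: one weight, one bias, ReLU, squared error.
x, w, b, y = 2.0, 3.0, 1.0, 15.0
lr = 0.01

# Forward pass
z = w * x + b                  # 7
a = max(0.0, z)                # ReLU -> 7
loss = (a - y) ** 2            # 64

# Backward pass (chain rule)
dL_da = 2 * (a - y)            # -16
da_dz = 1.0 if z > 0 else 0.0  # ReLU derivative -> 1
dz_dw = x                      # 2
dL_dw = dL_da * da_dz * dz_dw  # -32

# Gradient descent step
w_new = w - lr * dL_dw
print(loss, dL_dw, w_new)      # 64.0 -32.0 3.32
```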

&lt;h3&gt;
  
  
  Scaling to Deep Networks
&lt;/h3&gt;

&lt;p&gt;In real networks with many layers, we calculate gradients layer by layer, moving backward from the output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: 3-layer network&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forward pass:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer 1: z¹ = W¹x + b¹,  a¹ = ReLU(z¹)
Layer 2: z² = W²a¹ + b², a² = ReLU(z²)
Layer 3: z³ = W³a² + b³, ŷ = softmax(z³)
Loss: L = CrossEntropy(ŷ, y)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Backward pass:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3 (output layer):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dL/dz³ = ŷ - y  [derivative of softmax + cross-entropy]
dL/dW³ = (dL/dz³) · a²ᵀ
dL/db³ = dL/dz³
dL/da² = W³ᵀ · (dL/dz³)  [pass gradient to previous layer]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Layer 2:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dL/dz² = (dL/da²) ⊙ ReLU'(z²)  [⊙ is element-wise multiplication]
dL/dW² = (dL/dz²) · a¹ᵀ
dL/db² = dL/dz²
dL/da¹ = W²ᵀ · (dL/dz²)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Layer 1:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dL/dz¹ = (dL/da¹) ⊙ ReLU'(z¹)
dL/dW¹ = (dL/dz¹) · xᵀ
dL/db¹ = dL/dz¹
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Calculate gradient with respect to pre-activation (z)&lt;/li&gt;
&lt;li&gt;Calculate gradient for weights: &lt;code&gt;dL/dW = (dL/dz) · inputᵀ&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Calculate gradient for bias: &lt;code&gt;dL/db = dL/dz&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Pass gradient backward: &lt;code&gt;dL/d(previous_activation) = Wᵀ · (dL/dz)&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
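&lt;p&gt;The four-step pattern can be packed into a short NumPy sketch. Layer sizes and the random data here are illustrative, columns are individual samples (matching the column-vector equations above), and &lt;code&gt;ReLU'&lt;/code&gt; is implemented as a 0/1 mask:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))  # stabilized
    return e / e.sum(axis=0, keepdims=True)

# Illustrative shapes: 4 inputs -> 5 hidden -> 5 hidden -> 3 classes, batch of 2
x = rng.standard_normal((4, 2))
y = np.eye(3)[:, [0, 2]]  # one-hot targets, one column per sample
W1, b1 = rng.standard_normal((5, 4)), np.zeros((5, 1))
W2, b2 = rng.standard_normal((5, 5)), np.zeros((5, 1))
W3, b3 = rng.standard_normal((3, 5)), np.zeros((3, 1))

# Forward pass
z1 = W1 @ x + b1;  a1 = relu(z1)
z2 = W2 @ a1 + b2; a2 = relu(z2)
z3 = W3 @ a2 + b3; y_hat = softmax(z3)

# Backward pass, mirroring the equations above
dz3 = y_hat - y                     # softmax + cross-entropy
dW3 = dz3 @ a2.T; db3 = dz3.sum(axis=1, keepdims=True)
da2 = W3.T @ dz3                    # pass gradient to previous layer
dz2 = da2 * (z2 > 0)                # elementwise ReLU'(z2)
dW2 = dz2 @ a1.T; db2 = dz2.sum(axis=1, keepdims=True)
da1 = W2.T @ dz2
dz1 = da1 * (z1 > 0)
dW1 = dz1 @ x.T;  db1 = dz1.sum(axis=1, keepdims=True)

print(dW1.shape, dW2.shape, dW3.shape)  # (5, 4) (5, 5) (3, 5)
```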

&lt;h3&gt;
  
  
  Why "Backpropagation"?
&lt;/h3&gt;

&lt;p&gt;Because we propagate gradients backward through the network, from output to input. Each layer receives the gradient from the layer ahead, computes its own gradients, and passes gradients to the layer behind.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Vanishing Gradient Problem
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbfd5eukeb9l212u5396a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbfd5eukeb9l212u5396a.png" alt="alt text" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fundamental issue in deep networks:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When we multiply many small numbers (gradients) together through many layers, the product can become vanishingly small—approaching zero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; If each layer has gradient 0.1, after 10 layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.1¹⁰ = 0.0000000001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The early layers receive essentially zero gradient and stop learning. The network is deep but only the last few layers are actually training.&lt;/p&gt;
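&lt;p&gt;You can see the shrinkage directly: the gradient reaching an early layer is a product of per-layer factors.&lt;/p&gt;

```python
# A gradient flowing backward through 10 layers, each contributing a factor of 0.1
per_layer = 0.1
grad = 1.0
for _ in range(10):
    grad *= per_layer

print(grad)  # ~1e-10: the signal reaching the first layer is effectively zero
```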

&lt;p&gt;&lt;strong&gt;Solutions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ReLU activation&lt;/strong&gt;: Gradient is 1 for positive inputs (doesn't shrink)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Residual connections&lt;/strong&gt;: Skip connections that allow gradients to bypass layers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch normalization&lt;/strong&gt;: Keeps activations in a healthy range&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Careful initialization&lt;/strong&gt;: Start with weights that don't lead to extreme activations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Exploding Gradient Problem
&lt;/h3&gt;

&lt;p&gt;The opposite issue: gradients grow exponentially.&lt;/p&gt;

&lt;p&gt;If each layer has gradient 2, after 10 layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2¹⁰ = 1024
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Weights update by huge amounts, causing wild oscillations and instability. The model never converges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gradient clipping&lt;/strong&gt;: Cap gradients at a maximum value&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Careful initialization&lt;/strong&gt;: Start with smaller weights&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch normalization&lt;/strong&gt;: Stabilizes the scale of activations and gradients&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lower learning rates&lt;/strong&gt;: Smaller update steps&lt;/li&gt;
&lt;/ul&gt;
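&lt;p&gt;Gradient clipping is simple to sketch. One common variant caps the gradient's L2 norm while preserving its direction (the &lt;code&gt;max_norm&lt;/code&gt; value here is illustrative):&lt;/p&gt;

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the gradient if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

exploded = np.array([30.0, 40.0])        # norm 50
clipped = clip_by_norm(exploded, max_norm=5.0)
print(clipped, np.linalg.norm(clipped))  # direction preserved, norm capped at 5
```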

&lt;h3&gt;
  
  
  Computational Efficiency: Why Backpropagation is Brilliant
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Naive approach to finding gradients:&lt;/strong&gt; For each weight, we could:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make a tiny change: &lt;code&gt;w → w + ε&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Recalculate the entire loss&lt;/li&gt;
&lt;li&gt;Compute: &lt;code&gt;(L_new - L_old) / ε&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For a network with 1 million weights, this requires 1 million forward passes. Computationally prohibitive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backpropagation insight:&lt;/strong&gt; Calculate all gradients in a single backward pass by reusing intermediate calculations. For N weights, we need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 forward pass&lt;/li&gt;
&lt;li&gt;1 backward pass&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it. Backpropagation computes all million gradients with just two passes through the network. This is why deep learning became practical.&lt;/p&gt;
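&lt;p&gt;The naive approach isn't useless, though: it doubles as a sanity check for backpropagation. Here it is on the tiny network from earlier, agreeing with the chain-rule answer of -32:&lt;/p&gt;

```python
# Finite differences: perturb w, recompute the loss, divide by the perturbation.
def loss_fn(w, x=2.0, b=1.0, y=15.0):
    a = max(0.0, w * x + b)  # linear layer + ReLU
    return (a - y) ** 2

eps = 1e-6
w = 3.0
numerical = (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)
analytic = 2 * (max(0.0, w * 2.0 + 1.0) - 15.0) * 1.0 * 2.0  # chain rule

print(numerical, analytic)  # both ~ -32
```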

&lt;h3&gt;
  
  
  The Mathematics: Derivatives of Common Components
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ReLU:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;f(x) = max(0, x)
f'(x) = 1 if x &amp;gt; 0, else 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sigmoid:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;σ(x) = 1/(1 + e^(-x))
σ'(x) = σ(x)(1 - σ(x))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tanh:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tanh(x) = (e^x - e^(-x))/(e^x + e^(-x))
tanh'(x) = 1 - tanh²(x)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Softmax + Cross-Entropy (combined):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dL/dz = ŷ - y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This remarkably simple gradient is why we use softmax with cross-entropy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MSE:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L = (ŷ - y)²
dL/dŷ = 2(ŷ - y)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Memory Requirements
&lt;/h3&gt;

&lt;p&gt;Backpropagation requires storing all activations from the forward pass to compute gradients in the backward pass. For a network with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Batch size: 32&lt;/li&gt;
&lt;li&gt;4 layers with 1000 neurons each&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We must store: 32 × 4 × 1000 = 128,000 activation values in memory.&lt;/p&gt;

&lt;p&gt;This is why training large models requires substantial GPU memory, and why techniques like &lt;strong&gt;gradient checkpointing&lt;/strong&gt; (recomputing some activations rather than storing them) become necessary.&lt;/p&gt;
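&lt;p&gt;The arithmetic above, with a byte count added (assuming 4-byte float32 activations):&lt;/p&gt;

```python
# Activation count for the example network above
batch, layers, neurons = 32, 4, 1000
values = batch * layers * neurons
print(values)               # 128000 stored activation values
print(values * 4 / 1024)    # 500.0 KiB at 4 bytes per float32 value
```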




&lt;h2&gt;
  
  
  6. Gradient Descent: The Optimization Algorithm
&lt;/h2&gt;

&lt;p&gt;Imagine you're standing on a mountain in thick fog. You can't see the bottom of the valley, but you can feel the slope beneath your feet. Your goal: reach the lowest point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strategy&lt;/strong&gt;: Take a step in the direction of steepest descent.&lt;/p&gt;

&lt;p&gt;This is gradient descent. The "mountain" is the &lt;strong&gt;loss landscape&lt;/strong&gt;—a high-dimensional surface where each dimension represents one weight, and the height represents the loss.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Mathematical Foundation
&lt;/h3&gt;

&lt;p&gt;After backpropagation, we have gradients: &lt;code&gt;∂L/∂w₁, ∂L/∂w₂, ..., ∂L/∂wₙ&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Each gradient tells us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direction&lt;/strong&gt;: Positive gradient means loss increases when weight increases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Magnitude&lt;/strong&gt;: Large gradient means weight strongly affects loss&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Gradient descent update rule:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;w_new = w_old - α · (∂L/∂w)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;α&lt;/code&gt; (alpha) is the &lt;strong&gt;learning rate&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why subtract?&lt;/strong&gt; The gradient points in the direction of &lt;em&gt;increasing&lt;/em&gt; loss. We want to &lt;em&gt;decrease&lt;/em&gt; loss, so we move in the opposite direction (negative gradient).&lt;/p&gt;
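&lt;p&gt;The update rule in action on a simple bowl-shaped loss, &lt;code&gt;L(w) = (w - 5)²&lt;/code&gt;, whose gradient is &lt;code&gt;2(w - 5)&lt;/code&gt; (a one-dimensional sketch; the minimum is at w = 5):&lt;/p&gt;

```python
# Repeatedly step against the gradient of L(w) = (w - 5)^2
w = 0.0
alpha = 0.1  # learning rate

for step in range(50):
    grad = 2 * (w - 5)
    w = w - alpha * grad  # w_new = w_old - alpha * dL/dw

print(w)  # ~5.0: converged to the minimum
```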

&lt;h3&gt;
  
  
  The Learning Rate: The Most Critical Hyperparameter
&lt;/h3&gt;

&lt;p&gt;The learning rate controls the step size. Choosing it is an art and science.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Too large (α = 1.0):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Iteration 1: Loss = 100
Iteration 2: Loss = 250  [overshot the minimum]
Iteration 3: Loss = 80
Iteration 4: Loss = 300  [wild oscillations]
...never converges
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Too small (α = 0.000001):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Iteration 1: Loss = 100.00
Iteration 2: Loss = 99.99
Iteration 3: Loss = 99.98
...painfully slow, might get stuck in local minimum
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Just right (α = 0.01):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Iteration 1: Loss = 100
Iteration 2: Loss = 85
Iteration 3: Loss = 73
...steady progress toward minimum
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Typical ranges:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small networks&lt;/strong&gt;: 0.001 - 0.01&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large networks&lt;/strong&gt;: 0.0001 - 0.001&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;With Adam optimizer&lt;/strong&gt;: 0.001 (default)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Variants of Gradient Descent
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Batch Gradient Descent
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Approach&lt;/strong&gt;: Use the entire dataset to compute one gradient update.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_epochs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Compute gradient using ALL training samples
&lt;/span&gt;    &lt;span class="n"&gt;gradient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_gradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gradient&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smooth convergence&lt;/li&gt;
&lt;li&gt;Guaranteed to find the minimum (for convex functions)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slow: One update per epoch&lt;/li&gt;
&lt;li&gt;Memory intensive: Must load entire dataset&lt;/li&gt;
&lt;li&gt;Gets stuck in local minima (for non-convex functions)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Stochastic Gradient Descent (SGD)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Approach&lt;/strong&gt;: Use one random sample at a time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_epochs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Compute gradient using ONE sample
&lt;/span&gt;        &lt;span class="n"&gt;gradient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_gradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gradient&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast updates: One update per sample&lt;/li&gt;
&lt;li&gt;Can escape local minima (due to noise)&lt;/li&gt;
&lt;li&gt;Memory efficient&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Noisy updates: path to minimum is erratic&lt;/li&gt;
&lt;li&gt;Doesn't fully utilize parallel computing (GPUs)&lt;/li&gt;
&lt;li&gt;May oscillate around minimum without settling&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Mini-Batch Gradient Descent (Most Common)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Approach&lt;/strong&gt;: Use a small batch of samples (typically 32, 64, 128, or 256).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_epochs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;create_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Compute gradient using BATCH of samples
&lt;/span&gt;        &lt;span class="n"&gt;gradient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_gradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gradient&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Balanced: More stable than SGD, faster than batch GD&lt;/li&gt;
&lt;li&gt;Efficient: Perfect for GPU parallelization&lt;/li&gt;
&lt;li&gt;Moderate memory usage&lt;/li&gt;
&lt;li&gt;Noise helps escape local minima, but not too much&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Another hyperparameter to tune (batch size)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This is the standard in modern deep learning.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Advanced Optimizers: Beyond Basic Gradient Descent
&lt;/h3&gt;

&lt;p&gt;Basic gradient descent treats all parameters equally and uses a fixed learning rate. Modern optimizers are more sophisticated.&lt;/p&gt;

&lt;h4&gt;
  
  
  Momentum
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem with basic GD:&lt;/strong&gt; Imagine a narrow valley: steep sides, gentle slope toward minimum. Basic GD oscillates between sides while slowly progressing forward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution: Momentum&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;velocity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;gradient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_gradient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;velocity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;β&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;velocity&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gradient&lt;/span&gt;
    &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;velocity&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Intuition&lt;/strong&gt;: Remember previous gradients. If we keep going in the same direction, accelerate. If we oscillate, dampen the movement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Effect:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster convergence in consistent directions&lt;/li&gt;
&lt;li&gt;Reduced oscillations&lt;/li&gt;
&lt;li&gt;Can roll through small local minima&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Typical β&lt;/strong&gt;: 0.9 (use 90% of previous velocity)&lt;/p&gt;

&lt;h4&gt;
  
  
  RMSprop (Root Mean Square Propagation)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Some parameters need large updates, others need small ones. A single learning rate is suboptimal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Adapt the learning rate for each parameter based on recent gradient magnitudes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;squared_gradient_avg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;gradient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_gradient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;squared_gradient_avg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;β&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;squared_gradient_avg&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;β&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gradient&lt;/span&gt;&lt;span class="err"&gt;²&lt;/span&gt;
    &lt;span class="n"&gt;adjusted_gradient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gradient&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;squared_gradient_avg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ε&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;adjusted_gradient&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Intuition&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parameters with consistently large gradients get smaller effective learning rates (divided by large number)&lt;/li&gt;
&lt;li&gt;Parameters with small gradients get larger effective learning rates (divided by small number)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Effect&lt;/strong&gt;: Each parameter gets its own adaptive learning rate.&lt;/p&gt;
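&lt;p&gt;The pseudocode above, made runnable on the same toy one-parameter objective f(w) = w² (the objective is an illustrative choice; the hyperparameters are typical defaults):&lt;/p&gt;

```python
import math

# The RMSprop update, runnable on the toy objective f(w) = w**2.
def rmsprop(w0, lr=0.01, beta=0.9, eps=1e-8, steps=1000):
    w, squared_gradient_avg = w0, 0.0
    for _ in range(steps):
        gradient = 2 * w
        squared_gradient_avg = beta * squared_gradient_avg + (1 - beta) * gradient ** 2
        adjusted = gradient / (math.sqrt(squared_gradient_avg) + eps)  # per-parameter scaling
        w = w - lr * adjusted
    return w
```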

&lt;h4&gt;
  
  
  Adam (Adaptive Moment Estimation)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;The gold standard&lt;/strong&gt;: Combines momentum and RMSprop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="c1"&gt;# first moment (momentum)
&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="c1"&gt;# second moment (RMSprop)
&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;gradient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_gradient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Update moments
&lt;/span&gt;    &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;β&lt;/span&gt;&lt;span class="err"&gt;₁&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;β&lt;/span&gt;&lt;span class="err"&gt;₁&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gradient&lt;/span&gt;
    &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;β&lt;/span&gt;&lt;span class="err"&gt;₂&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;β&lt;/span&gt;&lt;span class="err"&gt;₂&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gradient&lt;/span&gt;&lt;span class="err"&gt;²&lt;/span&gt;

    &lt;span class="c1"&gt;# Bias correction (important in early iterations)
&lt;/span&gt;    &lt;span class="n"&gt;m_corrected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;β&lt;/span&gt;&lt;span class="err"&gt;₁&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;v_corrected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;β&lt;/span&gt;&lt;span class="err"&gt;₂&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Update weights
&lt;/span&gt;    &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;m_corrected&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v_corrected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ε&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why Adam dominates:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Combines best of both worlds: momentum + adaptive learning rates&lt;/li&gt;
&lt;li&gt;Robust to hyperparameter choices (default values work well)&lt;/li&gt;
&lt;li&gt;Efficient and converges quickly&lt;/li&gt;
&lt;li&gt;Works across diverse problem types&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Default hyperparameters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;learning_rate = 0.001&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;β₁ = 0.9&lt;/code&gt; (momentum)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;β₂ = 0.999&lt;/code&gt; (RMSprop)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ε = 1e-8&lt;/code&gt; (numerical stability)&lt;/li&gt;
&lt;/ul&gt;
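&lt;p&gt;Here is a runnable version of the Adam pseudocode with the default hyperparameters listed above, again on a toy one-parameter objective f(w) = w² (an illustrative choice, not a real training problem):&lt;/p&gt;

```python
import math

# Adam with the default hyperparameters, on the toy objective f(w) = w**2.
def adam(w0, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        gradient = 2 * w
        m = beta1 * m + (1 - beta1) * gradient         # first moment (momentum)
        v = beta2 * v + (1 - beta2) * gradient ** 2    # second moment (RMSprop)
        m_hat = m / (1 - beta1 ** t)                   # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w
```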

&lt;h3&gt;
  
  
  Learning Rate Schedules
&lt;/h3&gt;

&lt;p&gt;Even with Adam, learning rates can be adjusted during training.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Step Decay
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Epochs 1-30:   lr = 0.001
Epochs 31-60:  lr = 0.0001
Epochs 61+:    lr = 0.00001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why&lt;/strong&gt;: Start with larger steps to quickly find the general region, then smaller steps to fine-tune.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Exponential Decay
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lr(t) = lr₀ * e^(-kt)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Smoothly decreases learning rate over time.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Cosine Annealing
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(πt/T))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gradually reduces learning rate following a cosine curve.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Warm Restarts
&lt;/h4&gt;

&lt;p&gt;Periodically reset learning rate to initial value. Helps escape local minima by occasionally taking large steps again.&lt;/p&gt;

&lt;h4&gt;
  
  
  5. Learning Rate Warmup
&lt;/h4&gt;

&lt;p&gt;Start with very small learning rate, gradually increase to target value over first few epochs. Prevents instability in early training.&lt;/p&gt;
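&lt;p&gt;The schedules above can be sketched as small functions. The boundaries and constants follow the examples in this section; the function names and the warmup length are illustrative choices:&lt;/p&gt;

```python
import math

# Learning rate schedules as small pure functions.
def step_decay(epoch, lr0=0.001, drop=0.1, every=30):
    return lr0 * drop ** ((epoch - 1) // every)   # 10x drop every 30 epochs (1-indexed)

def exponential_decay(t, lr0=0.001, k=0.05):
    return lr0 * math.exp(-k * t)                 # lr(t) = lr0 * e^(-kt)

def cosine_annealing(t, T=100, lr_min=0.0, lr_max=0.001):
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

def warmup(step, warmup_steps=5, target=0.001):
    return target * min(1.0, (step + 1) / warmup_steps)  # linear ramp to target
```

&lt;p&gt;Warm restarts simply re-run one of these schedules from the top each cycle, e.g. by resetting t to 0 in the cosine schedule.&lt;/p&gt;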

&lt;h3&gt;
  
  
  The Convergence Question: When to Stop?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Training loss keeps decreasing&lt;/strong&gt;, but should we keep training?&lt;/p&gt;

&lt;h4&gt;
  
  
  Early Stopping
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Concept&lt;/strong&gt;: Monitor performance on a validation set (data the model hasn't trained on).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Epoch 1:  Train Loss = 2.5, Val Loss = 2.6
Epoch 5:  Train Loss = 1.2, Val Loss = 1.3
Epoch 10: Train Loss = 0.8, Val Loss = 0.9
Epoch 15: Train Loss = 0.4, Val Loss = 0.85  [val loss barely improving]
Epoch 20: Train Loss = 0.2, Val Loss = 0.9   [val loss increasing!]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stop around epoch 15&lt;/strong&gt;: Validation loss has bottomed out while training loss keeps falling. The model is starting to overfit (memorizing training data rather than learning generalizable patterns).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;best_val_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;infinity&lt;/span&gt;
&lt;span class="n"&gt;patience&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;  &lt;span class="c1"&gt;# epochs to wait for improvement
&lt;/span&gt;&lt;span class="n"&gt;patience_counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;val_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;val_loss&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;best_val_loss&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;best_val_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val_loss&lt;/span&gt;
        &lt;span class="nf"&gt;save_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;patience_counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;patience_counter&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;patience_counter&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;patience&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Early stopping!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Challenges in the Optimization Landscape
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Local Minima
&lt;/h4&gt;

&lt;p&gt;The loss surface has multiple valleys. Gradient descent might settle into a shallow local minimum instead of the deep global minimum.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Momentum (can roll over small bumps)&lt;/li&gt;
&lt;li&gt;Multiple random initializations&lt;/li&gt;
&lt;li&gt;Stochastic updates (noise helps escape)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Saddle Points
&lt;/h4&gt;

&lt;p&gt;Points where gradient is zero but it's neither a minimum nor maximum—a "saddle" shape. More common than local minima in high dimensions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Momentum helps push through&lt;/li&gt;
&lt;li&gt;Second-order methods (Newton's method)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Plateaus
&lt;/h4&gt;

&lt;p&gt;Flat regions where gradients are nearly zero. Progress stalls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adaptive learning rates (Adam)&lt;/li&gt;
&lt;li&gt;Patience (eventually gradients increase again)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Batching and Parallelization
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why batches matter for GPUs:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Modern GPUs have thousands of cores. Computing gradients for 32 samples independently is slow. Computing them in parallel is fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Matrix operations on batches:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input batch:  [32 × 784] (32 images, 784 pixels each)
Weights:      [784 × 128]
Output:       [32 × 128] (32 outputs, 128 neurons)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Single matrix multiplication computes all 32 samples simultaneously. This is why GPUs are essential for deep learning.&lt;/p&gt;
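&lt;p&gt;The shape arithmetic above is easy to check with NumPy; random numbers stand in for a batch of 32 flattened 28×28 images and a 128-neuron layer's weights:&lt;/p&gt;

```python
import numpy as np

# One matrix multiplication processes the whole batch at once.
batch = np.random.randn(32, 784)      # 32 images, 784 pixels each
weights = np.random.randn(784, 128)   # one weight column per neuron
output = batch @ weights              # all 32 samples in a single matmul

print(output.shape)  # (32, 128)
```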

&lt;p&gt;&lt;strong&gt;Batch size trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Small batches (e.g., 8-32):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More frequent updates&lt;/li&gt;
&lt;li&gt;More noise (helps generalization)&lt;/li&gt;
&lt;li&gt;Less memory&lt;/li&gt;
&lt;li&gt;Slower per epoch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Large batches (e.g., 256-1024):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fewer updates per epoch&lt;/li&gt;
&lt;li&gt;Smoother gradients&lt;/li&gt;
&lt;li&gt;More memory required&lt;/li&gt;
&lt;li&gt;Faster per epoch&lt;/li&gt;
&lt;li&gt;Risk of poor generalization (too smooth)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Sweet spot&lt;/strong&gt;: Usually 32-128 for most applications.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Complete Training Loop: Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Now we understand all the pieces. Here's how they work together:&lt;/p&gt;

&lt;h3&gt;
  
  
  Initialization
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Initialize weights (Xavier/He initialization)
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;random_normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;n_inputs&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;biases&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize optimizer
&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why careful initialization matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Too large: Exploding activations and gradients&lt;/li&gt;
&lt;li&gt;Too small: Vanishing gradients&lt;/li&gt;
&lt;li&gt;Xavier/He initialization: Scaled to maintain activation variance across layers&lt;/li&gt;
&lt;/ul&gt;
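&lt;p&gt;A quick NumPy check of the He initialization idea: with ReLU activations, drawing weights from N(0, 2/n_inputs) keeps the second moment of activations near 1 as the signal passes through a layer. The layer sizes here are arbitrary:&lt;/p&gt;

```python
import numpy as np

# He initialization keeps the activation scale stable through a ReLU layer.
rng = np.random.default_rng(0)

def he_init(n_inputs, n_outputs):
    return rng.normal(0.0, np.sqrt(2.0 / n_inputs), size=(n_inputs, n_outputs))

x = rng.normal(size=(1000, 784))             # unit-variance inputs
a = np.maximum(0.0, x @ he_init(784, 256))   # one ReLU layer
second_moment = float((a ** 2).mean())       # stays close to 1.0
```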

&lt;h3&gt;
  
  
  The Training Loop
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_epochs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Shuffle data for randomness
&lt;/span&gt;    &lt;span class="nf"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;create_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# 1. FORWARD PROPAGATION
&lt;/span&gt;        &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_true&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;

        &lt;span class="n"&gt;z1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;
        &lt;span class="n"&gt;a1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;z2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;a1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt;
        &lt;span class="n"&gt;a2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;z3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;W3&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;a2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b3&lt;/span&gt;
        &lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. COMPUTE LOSS
&lt;/span&gt;        &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 3. BACKPROPAGATION
&lt;/span&gt;        &lt;span class="n"&gt;dL_dz3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y_true&lt;/span&gt;
        &lt;span class="n"&gt;dL_dW3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dL_dz3&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;a2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;
        &lt;span class="n"&gt;dL_db3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dL_dz3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;dL_da2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;W3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;dL_dz3&lt;/span&gt;

        &lt;span class="n"&gt;dL_dz2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dL_da2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;relu_derivative&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;dL_dW2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dL_dz2&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;a1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;
        &lt;span class="n"&gt;dL_db2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dL_dz2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;dL_da1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;dL_dz2&lt;/span&gt;

        &lt;span class="n"&gt;dL_dz1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dL_da1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;relu_derivative&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;dL_dW1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dL_dz1&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;
        &lt;span class="n"&gt;dL_db1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dL_dz1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 4. OPTIMIZATION (using Adam)
&lt;/span&gt;        &lt;span class="n"&gt;W3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;W3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dL_dW3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dL_db3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;W2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;W2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dL_dW2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dL_db2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;W1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;W1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dL_dW1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dL_db1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 5. VALIDATION
&lt;/span&gt;    &lt;span class="n"&gt;val_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;validation_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Epoch &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;epoch&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: Train Loss = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, Val Loss = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;val_loss&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 6. EARLY STOPPING CHECK
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;should_stop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_loss&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;

&lt;span class="c1"&gt;# 7. FINAL EVALUATION
&lt;/span&gt;&lt;span class="n"&gt;test_accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Final Test Accuracy: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;test_accuracy&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What Happens Over Time
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Epoch 1:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weights are random&lt;/li&gt;
&lt;li&gt;Predictions are terrible (10% accuracy on 10 classes = random guessing)&lt;/li&gt;
&lt;li&gt;Loss is high (maybe 2.3)&lt;/li&gt;
&lt;li&gt;Large gradients&lt;/li&gt;
&lt;li&gt;Big weight updates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Epoch 10:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network learned basic patterns&lt;/li&gt;
&lt;li&gt;Accuracy improved to 60%&lt;/li&gt;
&lt;li&gt;Loss decreased to 1.2&lt;/li&gt;
&lt;li&gt;Moderate gradients&lt;/li&gt;
&lt;li&gt;Steady learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Epoch 50:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network refined understanding&lt;/li&gt;
&lt;li&gt;Accuracy at 92%&lt;/li&gt;
&lt;li&gt;Loss at 0.3&lt;/li&gt;
&lt;li&gt;Small gradients&lt;/li&gt;
&lt;li&gt;Fine-tuning details&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Epoch 100:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Diminishing returns&lt;/li&gt;
&lt;li&gt;Accuracy 93% (validation starting to plateau)&lt;/li&gt;
&lt;li&gt;Risk of overfitting&lt;/li&gt;
&lt;li&gt;Time to stop&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Monitoring Training: What to Watch
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Training Loss&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Should decrease steadily&lt;/li&gt;
&lt;li&gt;If fluctuating wildly: learning rate too high&lt;/li&gt;
&lt;li&gt;If barely moving: learning rate too low or stuck in minimum&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Validation Loss&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Should track training loss initially&lt;/li&gt;
&lt;li&gt;If diverging: overfitting&lt;/li&gt;
&lt;li&gt;If much higher from start: train/val data distribution mismatch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Gradient Norms&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Should be moderate (0.001 - 1.0)&lt;/li&gt;
&lt;li&gt;If very small (&amp;lt; 0.0001): vanishing gradients&lt;/li&gt;
&lt;li&gt;If very large (&amp;gt; 10): exploding gradients&lt;/li&gt;
&lt;/ul&gt;
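&lt;p&gt;One simple way to monitor this is the global L2 norm over all parameter gradients. The helper below is a sketch rather than any framework's API; the thresholds mirror the rules of thumb above:&lt;/p&gt;

```python
import numpy as np

# Global gradient norm across a list of gradient tensors.
def global_grad_norm(grads):
    return float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))

grads = [np.full((3, 3), 0.01), np.full(3, 0.01)]  # stand-ins for dL_dW, dL_db
norm = global_grad_norm(grads)

if norm < 1e-4:
    print("warning: possible vanishing gradients")
elif norm > 10:
    print("warning: possible exploding gradients")
```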

&lt;p&gt;&lt;strong&gt;4. Activation Statistics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mean should be near zero&lt;/li&gt;
&lt;li&gt;Std should be moderate (~1)&lt;/li&gt;
&lt;li&gt;If activations saturate (all 0 or all max): architectural problem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Learning Rate&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can be adjusted based on progress&lt;/li&gt;
&lt;li&gt;Too aggressive: divergence&lt;/li&gt;
&lt;li&gt;Too conservative: slow progress&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion: The Symphony of Learning
&lt;/h2&gt;

&lt;p&gt;Machine learning is not one algorithm—it's a carefully orchestrated system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Architecture&lt;/strong&gt; provides the capacity to represent complex functions (Universal Approximation Theorem)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Activation functions&lt;/strong&gt; enable non-linear transformations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forward propagation&lt;/strong&gt; generates predictions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loss functions&lt;/strong&gt; quantify error&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backpropagation&lt;/strong&gt; computes gradients efficiently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gradient descent&lt;/strong&gt; iteratively improves weights&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each component is essential. Remove any one, and learning fails.&lt;/p&gt;

&lt;p&gt;The beauty lies in the simplicity of each piece and the power of their combination. From these building blocks—matrix multiplications, non-linear functions, derivatives, and iterative updates—emerges the capability to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recognize faces in photos&lt;/li&gt;
&lt;li&gt;Translate between languages&lt;/li&gt;
&lt;li&gt;Generate realistic images&lt;/li&gt;
&lt;li&gt;Play games at superhuman levels&lt;/li&gt;
&lt;li&gt;Predict protein structures&lt;/li&gt;
&lt;li&gt;Drive cars autonomously&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All from the same fundamental algorithm, repeated billions of times, gradually sculpting random weights into a representation of the world's patterns.&lt;/p&gt;

&lt;p&gt;This is how machines learn: not through magic, but through mathematics, iteration, and the elegant interplay of calculus and optimization across high-dimensional spaces.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>neurons</category>
      <category>deeplearning</category>
      <category>webdev</category>
    </item>
    <item>
      <title>2025 Voice AI Guide: How to Make Your Own Real-Time Voice Agent (Part-1)</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Sun, 21 Sep 2025 04:34:25 +0000</pubDate>
      <link>https://forem.com/programmerraja/2025-voice-ai-guide-how-to-make-your-own-real-time-voice-agent-part-1-45hl</link>
      <guid>https://forem.com/programmerraja/2025-voice-ai-guide-how-to-make-your-own-real-time-voice-agent-part-1-45hl</guid>
      <description>&lt;p&gt;Over the past few months I’ve been building a fully open-source voice agent, exploring the stack end-to-end and learning a ton along the way. Now I’m ready to share everything I discovered.  &lt;/p&gt;

&lt;p&gt;The best part? In 2025 you actually &lt;strong&gt;can&lt;/strong&gt; build one yourself. With today’s open-source models and frameworks you can piece together a real-time voice agent that listens, reasons, and talks back almost like a human, without relying on closed platforms.  &lt;/p&gt;

&lt;p&gt;Let’s walk through the building blocks, step by step.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Core Pipeline
&lt;/h2&gt;

&lt;p&gt;At a high level, a modern voice agent looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F883x74ub7svc8h60c14p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F883x74ub7svc8h60c14p.png" alt="overview" width="744" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pretty simple on paper, but each step has its own challenges. Let’s dig deeper.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speech-to-Text (STT)
&lt;/h2&gt;

&lt;p&gt;Speech is a &lt;strong&gt;continuous audio wave&lt;/strong&gt;; it doesn’t naturally have clear sentence boundaries or pauses. That’s where &lt;strong&gt;Voice Activity Detection (VAD)&lt;/strong&gt; comes in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;VAD (Voice Activity Detection):&lt;/strong&gt; Detects when the user starts and stops talking. Without it, your bot either cuts you off too soon or stares at you blankly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the boundaries are clear, the audio is passed into an &lt;strong&gt;STT model&lt;/strong&gt; for transcription.&lt;/p&gt;

&lt;h4&gt;
  
  
  Popular VAD Options
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Silero VAD&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;WebRTC VAD&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;TEN VAD&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Yamnet VAD&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Cobra (Picovoice)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Accuracy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;State-of-the-art, &amp;gt;95% in multi-noise&lt;a href="https://www.qed42.com/insights/voice-activity-detection-in-text-to-speech-how-real-time-vad-works" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Good for silence/non-silence; lower speech/noise discrimination&lt;/td&gt;
&lt;td&gt;High, lower false positives than WebRTC/Silero&lt;a href="https://www.communeify.com/en/blog/ten-vad-webrtc-killer-opensource-ai-voice-detection/" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Good, multi-class capable&lt;/td&gt;
&lt;td&gt;Top-tier (see Picovoice benchmarks)&lt;a href="https://picovoice.ai/docs/benchmark/vad/" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms per 30+ms chunk (CPU/GPU/ONNX)&lt;a href="https://github.com/snakers4/silero-vad" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;10-30ms frame decision, ultra low-lag&lt;a href="https://github.com/wiseman/py-webrtcvad/issues/68" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;2-5ms (real-time capable)&lt;/td&gt;
&lt;td&gt;5–10ms/classify&lt;/td&gt;
&lt;td&gt;5–10ms/classify&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chunk Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30, 60, 100ms selector&lt;/td&gt;
&lt;td&gt;10–30ms&lt;/td&gt;
&lt;td&gt;20ms, 40ms custom&lt;/td&gt;
&lt;td&gt;30-50ms&lt;/td&gt;
&lt;td&gt;30-50ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Noise Robustness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Excellent, trained on 100+ noises&lt;a href="https://github.com/snakers4/silero-vad" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Poor for some background noise/overlapping speech&lt;a href="https://github.com/wiseman/py-webrtcvad/issues/68" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6000+ languages/no domain restriction&lt;a href="https://github.com/snakers4/silero-vad" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Language-agnostic, good for basic speech/silence&lt;/td&gt;
&lt;td&gt;Language-agnostic&lt;/td&gt;
&lt;td&gt;Multi-language possible&lt;/td&gt;
&lt;td&gt;Language-agnostic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Footprint&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~2MB JIT, &amp;lt;1MB ONNX, minimal CPU/edge&lt;a href="https://github.com/snakers4/silero-vad" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;~158KB binary, extremely light&lt;/td&gt;
&lt;td&gt;~400KB&lt;a href="https://www.reddit.com/r/selfhosted/comments/1lvdfaq/found_a_really_wellmade_opensource_vad_great/" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;~2MB (.tflite format)&lt;/td&gt;
&lt;td&gt;Small, edge-ready&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Streaming Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes, supports real-time pipelines&lt;/td&gt;
&lt;td&gt;Yes, designed for telecom/audio streams&lt;/td&gt;
&lt;td&gt;Yes, real-time&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python, ONNX, PyTorch, Pipecat, edge/IoT data&lt;a href="https://github.com/snakers4/silero-vad" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;C/C++/Python, embedded/web/mobile&lt;/td&gt;
&lt;td&gt;Python, C++, web&lt;/td&gt;
&lt;td&gt;TensorFlow Lite APIs&lt;/td&gt;
&lt;td&gt;Python, C, web, WASM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Licensing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MIT (commercial/edge/distribution OK)&lt;a href="https://github.com/snakers4/silero-vad" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;BSD (very permissive)&lt;/td&gt;
&lt;td&gt;Apache 2.0, open&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://github.com/snakers4/silero-vad" rel="noopener noreferrer"&gt;Silero VAD&lt;/a&gt; is the gold standard and pipecat has builtin support so I have choosen that :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sub-1ms per chunk on CPU&lt;/li&gt;
&lt;li&gt;Just 2MB in size&lt;/li&gt;
&lt;li&gt;Handles 6000+ languages&lt;/li&gt;
&lt;li&gt;Works with 8kHz &amp;amp; 16kHz audio&lt;/li&gt;
&lt;li&gt;MIT license (unrestricted use)&lt;/li&gt;
&lt;/ul&gt;
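&lt;p&gt;To make the chunked flow concrete, here’s a minimal sketch of frame-based endpointing. The VAD itself is a stand-in: a naive energy threshold replaces the real Silero model, which scores each 30ms chunk the same way but with a small neural network.&lt;/p&gt;

```python
# Minimal sketch of frame-based voice activity gating.
# Assumption: a naive energy threshold stands in for the real Silero VAD,
# which returns a per-chunk speech probability instead.

FRAME_MS = 30          # Silero accepts 30/60/100 ms chunks
SAMPLE_RATE = 16000    # 16 kHz audio
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000  # 480 samples per chunk

def is_speech(frame, threshold=0.01):
    """Stand-in for the VAD model: mean absolute amplitude vs. a threshold."""
    return sum(abs(s) for s in frame) / len(frame) > threshold

def segment_utterances(frames, hangover_frames=10):
    """Group consecutive speech frames; close an utterance after
    hangover_frames of silence (300 ms here), mimicking endpointing."""
    utterances, current, silence = [], [], 0
    for frame in frames:
        if is_speech(frame):
            current.append(frame)
            silence = 0
        elif current:
            silence += 1
            if silence >= hangover_frames:
                utterances.append(current)
                current, silence = [], 0
    if current:
        utterances.append(current)
    return utterances

# Two bursts of "speech" separated by a long silence -> two utterances,
# each of which would be handed to the STT model.
loud = [[0.5] * FRAME_SAMPLES] * 5
quiet = [[0.0] * FRAME_SAMPLES] * 12
print(len(segment_utterances(loud + quiet + loud)))  # 2
```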

&lt;h4&gt;
  
  
  Popular STT Options
&lt;/h4&gt;

&lt;p&gt;What should we focus on when choosing an STT model for a voice agent?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Word Error Rate (WER):&lt;/strong&gt; Measures transcription mistakes (lower is better).

&lt;ul&gt;
&lt;li&gt;Example: WER 5% means 5 mistakes per 100 words.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Sentence-level correctness:&lt;/strong&gt; Some models may get individual words right but fail on sentence structure.&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Multilingual support:&lt;/strong&gt; If your users speak multiple languages, check language coverage.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Noise tolerance:&lt;/strong&gt; Can it handle background noise, music, or multiple speakers?&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Accent/voice variation handling:&lt;/strong&gt; Works across accents, genders, and speech speeds.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Voice Activity Detection (VAD) integration:&lt;/strong&gt; Detects when speech starts and ends.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Streaming:&lt;/strong&gt; Most STT models work in batch mode (great for YouTube captions, bad for live conversations). For real-time agents, we need streaming output: words should appear &lt;em&gt;while you’re still speaking&lt;/em&gt;.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Low Latency:&lt;/strong&gt; Even 300–500ms delays feel unnatural. Target &lt;strong&gt;sub-second responses&lt;/strong&gt;.&lt;/li&gt;

&lt;/ul&gt;
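&lt;p&gt;The WER metric above is just word-level edit distance divided by the length of the reference transcript. A minimal sketch:&lt;/p&gt;

```python
# Word Error Rate: (substitutions + insertions + deletions) divided by the
# number of reference words. Lower is better.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, computed over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (free on match)
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word in a five-word reference -> WER 0.2 (20%).
print(wer("turn on the kitchen light", "turn on the kitchen lights"))  # 0.2
```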

&lt;p&gt;For most people, Whisper is the first name that comes to mind for speech-to-text: it has a large community, numerous variants, and OpenAI’s backing. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenAI Whisper Family&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/openai/whisper-large-v3" rel="noopener noreferrer"&gt;Whisper Large V3&lt;/a&gt; — State-of-the-art accuracy with multilingual support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/SYSTRAN/faster-whisper" rel="noopener noreferrer"&gt;Faster-Whisper&lt;/a&gt;&lt;/strong&gt; — Optimized implementation using CTranslate2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/huggingface/distil-whisper" rel="noopener noreferrer"&gt;Distil-Whisper&lt;/a&gt;&lt;/strong&gt; — Lightweight for resource-constrained environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/m-bain/whisperX" rel="noopener noreferrer"&gt;WhisperX&lt;/a&gt;&lt;/strong&gt; — Enhanced timestamps and speaker diarization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;NVIDIA&lt;/strong&gt; also offers some interesting STT models, though I haven’t tried them yet since Whisper works well for my use case. I’m just listing them here for you to explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/nvidia/canary-1b" rel="noopener noreferrer"&gt;Canary Qwen 2.5B&lt;/a&gt; — Leading performance, 5.63% WER&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/nvidia/parakeet-tdt-0.6b" rel="noopener noreferrer"&gt;Parakeet TDT 0.6B V2&lt;/a&gt; — Ultra-fast inference (3,386 RTFx)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s the comparison table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;WER (EN, Public Bench.)&lt;/th&gt;
&lt;th&gt;Multilingual&lt;/th&gt;
&lt;th&gt;Noise/Accent/Voice&lt;/th&gt;
&lt;th&gt;Sentence Accuracy&lt;/th&gt;
&lt;th&gt;VAD Integration&lt;/th&gt;
&lt;th&gt;Streaming&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Whisper Large V3&lt;/td&gt;
&lt;td&gt;2–5%&lt;/td&gt;
&lt;td&gt;99+&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Yes (Silero)&lt;/td&gt;
&lt;td&gt;Batch†&lt;/td&gt;
&lt;td&gt;~700ms†&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Faster-Whisper&lt;/td&gt;
&lt;td&gt;2–5%&lt;/td&gt;
&lt;td&gt;99+&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Yes (Silero)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;~300ms‡&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Canary 1B&lt;/td&gt;
&lt;td&gt;3.06% (MLS EN) &lt;a href="https://huggingface.co/nvidia/canary-1b" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;4 (EN, DE, ES, FR)&lt;/td&gt;
&lt;td&gt;Top-tier, fair on voice/gender/age&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;~500ms–&amp;lt;1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parakeet TDT 0.6B&lt;/td&gt;
&lt;td&gt;5–7%&lt;/td&gt;
&lt;td&gt;3 (EN, DE, FR)&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Very Good&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Ultra low (3,386× real-time throughput)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Why I Chose &lt;strong&gt;FastWhisper&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;After testing, my pick is &lt;strong&gt;FastWhisper&lt;/strong&gt;, an optimized inference engine for Whisper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;12.5× faster&lt;/strong&gt; than original Whisper&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3× faster&lt;/strong&gt; than Faster-Whisper with batching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sub-200ms latency&lt;/strong&gt; possible with proper tuning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same accuracy&lt;/strong&gt; as Whisper&lt;/li&gt;
&lt;li&gt;Runs on &lt;strong&gt;CPU &amp;amp; GPU&lt;/strong&gt; with automatic fallback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s built in &lt;strong&gt;C++ + CTranslate2&lt;/strong&gt;, supports batching, and integrates neatly with &lt;strong&gt;VAD&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For more, check the &lt;a href="https://artificialanalysis.ai/speech-to-text" rel="noopener noreferrer"&gt;Speech to Text AI Model &amp;amp; Provider Leaderboard&lt;/a&gt;. &lt;/p&gt;
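&lt;p&gt;One practical wrinkle with streaming STT: partial hypotheses can revise earlier words, so you shouldn’t forward them downstream the moment they arrive. A common trick (an assumption on the consumer side, not something the STT engine does for you) is to commit only the prefix that two consecutive partials agree on:&lt;/p&gt;

```python
# "Local agreement" stabilization for streaming STT partials: only the
# word prefix shared by two consecutive hypotheses is committed, so the
# LLM stage never sees words that later get revised.

def stable_prefix(prev_partial: str, new_partial: str) -> str:
    """Return the longest common word prefix of two partial transcripts."""
    common = []
    for a, b in zip(prev_partial.split(), new_partial.split()):
        if a != b:
            break
        common.append(a)
    return " ".join(common)

partials = [
    "turn",
    "turn on",
    "turn on the light",   # "light" may still be revised...
    "turn on the lights",  # ...and indeed becomes "lights"
]
committed = ""
for prev, new in zip(partials, partials[1:]):
    committed = stable_prefix(prev, new)
    # forward `committed` to the next stage as it grows
print(committed)  # "turn on the"
```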

&lt;h3&gt;
  
  
  Large Language Model (LLM)
&lt;/h3&gt;

&lt;p&gt;Once speech is transcribed, the text goes into an &lt;strong&gt;LLM, the “brain” of your agent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What we want in an LLM for voice agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understands prompts, history, and context&lt;/li&gt;
&lt;li&gt;Generates responses quickly&lt;/li&gt;
&lt;li&gt;Supports &lt;strong&gt;tool calls&lt;/strong&gt; (search, RAG, memory, APIs)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Leading Open-Source LLMs
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Meta Llama Family&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct" rel="noopener noreferrer"&gt;Llama 3.3 70B&lt;/a&gt; — Open-source leader&lt;/li&gt;
&lt;li&gt;Llama 3.2 (1B, 3B, 11B) — Scaled for different deployments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;128K context window&lt;/strong&gt; — remembers long conversations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool calling support&lt;/strong&gt; — built-in function execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Others&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/mistralai/Mistral-7B-v0.1" rel="noopener noreferrer"&gt;Mistral 7B&lt;/a&gt; / &lt;strong&gt;Mixtral 8x7B&lt;/strong&gt; — Efficient and competitive&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/Qwen/Qwen2.5-72B-Instruct" rel="noopener noreferrer"&gt;Qwen 2.5&lt;/a&gt; — Strong multilingual support&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/google/gemma-2-27b-it" rel="noopener noreferrer"&gt;Google Gemma&lt;/a&gt; — Lightweight but solid&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  My Choice: Llama 3.3 70B Versatile
&lt;/h4&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Large context window&lt;/strong&gt; → keeps conversations coherent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool use&lt;/strong&gt; built-in&lt;/li&gt;
&lt;li&gt;Widely supported in the open-source community&lt;/li&gt;
&lt;/ul&gt;
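&lt;p&gt;Here’s a rough sketch of what a tool-call round trip looks like. The JSON shape mirrors the OpenAI-style tool-calling convention that Llama 3.3 follows; the &lt;code&gt;get_weather&lt;/code&gt; tool and its output are hypothetical stubs:&lt;/p&gt;

```python
import json

# Sketch of a tool-call round trip. The dict shape mirrors OpenAI-style
# tool calling; get_weather is a hypothetical stub for a real API.

def get_weather(city: str) -> str:
    return f"22C and clear in {city}"   # stub for a real weather API call

TOOLS = {"get_weather": get_weather}

def dispatch(tool_call: dict) -> dict:
    """Run the function the model asked for; return a tool-result message
    to append to the conversation history."""
    fn = TOOLS[tool_call["name"]]
    args = json.loads(tool_call["arguments"])
    return {"role": "tool", "name": tool_call["name"], "content": fn(**args)}

# What the LLM might emit instead of a plain-text reply:
model_turn = {"name": "get_weather", "arguments": '{"city": "Chennai"}'}
result = dispatch(model_turn)
print(result["content"])  # 22C and clear in Chennai
# The result goes back into the message history, and the LLM's next
# completion turns it into the sentence the agent actually speaks.
```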

&lt;h2&gt;
  
  
  Text-to-Speech (TTS)
&lt;/h2&gt;

&lt;p&gt;Now the agent needs to &lt;strong&gt;speak back&lt;/strong&gt;, and this is where quality can make or break the experience.&lt;/p&gt;

&lt;p&gt;A poor TTS voice instantly ruins immersion. The key requirements are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low latency&lt;/strong&gt;: avoid awkward pauses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Natural speech&lt;/strong&gt;: no robotic tone&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming output&lt;/strong&gt;: start speaking mid-sentence&lt;/li&gt;
&lt;/ul&gt;
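&lt;p&gt;Streaming output in practice means chunking the LLM’s token stream into speakable clauses so TTS can start before the full reply exists. A minimal sketch (the punctuation-boundary rule is an assumption; frameworks like Pipecat do something similar internally):&lt;/p&gt;

```python
# Chunk an incoming LLM token stream at punctuation boundaries so the TTS
# engine can start synthesizing the first clause while later tokens are
# still being generated.

BOUNDARIES = ".,!?;:"

def clause_chunks(token_stream):
    """Yield speakable text chunks as tokens arrive."""
    buffer = ""
    for token in token_stream:
        buffer += token
        last = buffer.rstrip()[-1:]
        if last and last in BOUNDARIES:
            yield buffer.strip()   # hand this clause to TTS immediately
            buffer = ""
    if buffer.strip():             # flush whatever remains at end of stream
        yield buffer.strip()

tokens = ["Sure", ",", " I can", " help", ".", " What", " city", "?"]
print(list(clause_chunks(tokens)))
# ['Sure,', 'I can help.', 'What city?']
```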

&lt;h4&gt;
  
  
  Open-Source TTS Models I’ve Tried
&lt;/h4&gt;

&lt;p&gt;There are plenty of open-source TTS models available. Here’s a snapshot of the ones I experimented with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/hexgrad/Kokoro-82M" rel="noopener noreferrer"&gt;&lt;strong&gt;Kokoro-82M&lt;/strong&gt;&lt;/a&gt; — Lightweight, #1 on HuggingFace TTS Arena, blazing fast&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/ResembleAI/chatterbox" rel="noopener noreferrer"&gt;&lt;strong&gt;Chatterbox&lt;/strong&gt;&lt;/a&gt; — Built on Llama, fast inference, rising adoption&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/coqui-ai/TTS" rel="noopener noreferrer"&gt;&lt;strong&gt;XTTS-v2&lt;/strong&gt;&lt;/a&gt; — Zero-shot voice cloning, 17 languages, streaming support&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/fishaudio/fish-speech" rel="noopener noreferrer"&gt;&lt;strong&gt;FishSpeech&lt;/strong&gt;&lt;/a&gt; — Natural dialogue flow&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/CanopyLabs/Orpheus-3B" rel="noopener noreferrer"&gt;&lt;strong&gt;Orpheus&lt;/strong&gt;&lt;/a&gt; — Scales from 150M–3B&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/nari-labs/dia?tab=readme-ov-file" rel="noopener noreferrer"&gt;Dia&lt;/a&gt; — A TTS model capable of generating ultra-realistic dialogue in one pass.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;Kokoro-82M&lt;/th&gt;
&lt;th&gt;Chatterbox&lt;/th&gt;
&lt;th&gt;XTTS-v2&lt;/th&gt;
&lt;th&gt;FishSpeech&lt;/th&gt;
&lt;th&gt;Orpheus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Voice Naturalness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Human-like, top-rated in community&lt;a href="https://www.digitalocean.com/community/tutorials/best-text-to-speech-models" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Very natural, quickly improving&lt;/td&gt;
&lt;td&gt;High, especially with good samples&lt;/td&gt;
&lt;td&gt;Natural, especially for dialogue&lt;/td&gt;
&lt;td&gt;Good, scales with model size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Expressiveness / Emotion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Moderate, some emotional range&lt;/td&gt;
&lt;td&gt;Good, improving&lt;/td&gt;
&lt;td&gt;High, can mimic sample emotion&lt;/td&gt;
&lt;td&gt;Moderate, aims for conversational flow&lt;/td&gt;
&lt;td&gt;Moderate-High, model-dependent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Accent / Language Coverage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8+ languages (EN, JP, ZH, FR, more)&lt;a href="https://www.digitalocean.com/community/tutorials/best-text-to-speech-models" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;EN-focused, expanding&lt;/td&gt;
&lt;td&gt;17+ languages, strong global support&lt;/td&gt;
&lt;td&gt;Several; focus varies&lt;/td&gt;
&lt;td&gt;Varies by checkpoint (3B supports many)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency / Inference&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt;300ms for any length, streaming-first&lt;a href="https://www.digitalocean.com/community/tutorials/best-text-to-speech-models" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Fast inference, suitable for real-time&lt;/td&gt;
&lt;td&gt;~500ms (depends on hardware), good streaming support&lt;/td&gt;
&lt;td&gt;~400ms, streaming variants&lt;/td&gt;
&lt;td&gt;3B: ~1s+ (large), 150M: fast (CPU/no-GPU)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Streaming Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes, natural dialogue with chunked streaming&lt;a href="https://www.digitalocean.com/community/tutorials/best-text-to-speech-models" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes, early output&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (3B may be slower)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource Usage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extremely light (&amp;lt;300MB), great for CPU/edge&lt;a href="https://www.digitalocean.com/community/tutorials/best-text-to-speech-models" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Moderate (500M params), GPU preferred&lt;/td&gt;
&lt;td&gt;Moderate-high, 500M+ params, GPU preferred&lt;/td&gt;
&lt;td&gt;Moderate, CPU/GPU&lt;/td&gt;
&lt;td&gt;150M–3B options (higher = more GPU/memory)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Quantization / Optimization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8-bit available, runs on most hardware&lt;/td&gt;
&lt;td&gt;Some support&lt;/td&gt;
&lt;td&gt;Yes, 8-bit/4-bit&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Voice Cloning / Custom&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not by default, needs training&lt;/td&gt;
&lt;td&gt;Via fine-tuning&lt;/td&gt;
&lt;td&gt;Zero-shot (few seconds of target voice)&lt;/td&gt;
&lt;td&gt;Beta, improving cloning&lt;/td&gt;
&lt;td&gt;Fine-tuning supported for custom voices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Documentation / Community&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Active, rich demos, open source, growing&lt;a href="https://github.com/hexgrad/kokoro" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Good docs, quickly growing&lt;/td&gt;
&lt;td&gt;Very large (Coqui), strong docs&lt;/td&gt;
&lt;td&gt;Medium but positive community&lt;/td&gt;
&lt;td&gt;Medium, active research group&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;License&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0 (commercial OK)&lt;/td&gt;
&lt;td&gt;Commercial/Proprietary use may require license&lt;/td&gt;
&lt;td&gt;LGPL-3.0, open (see repo)&lt;/td&gt;
&lt;td&gt;See repo, mostly permissive&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pretrained Voices / Demos&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (multiple voices, demos available)&lt;a href="https://www.digitalocean.com/community/tutorials/best-text-to-speech-models" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Yes, continually adding more&lt;/td&gt;
&lt;td&gt;Yes, huge library, instant demo&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (many public models on Hugging Face)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Why I Chose &lt;strong&gt;Kokoro-82M&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Key Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;5–15× smaller&lt;/strong&gt; than competing models while maintaining high quality&lt;/li&gt;
&lt;li&gt;Runs under &lt;strong&gt;300MB&lt;/strong&gt; — edge-device friendly&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sub-300ms latency&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;High-fidelity &lt;strong&gt;24kHz audio&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming-first design&lt;/strong&gt; — natural conversation flow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No zero-shot voice cloning (uses a fixed voice library)&lt;/li&gt;
&lt;li&gt;Less expressive than XTTS-v2&lt;/li&gt;
&lt;li&gt;Relatively new model with a smaller community&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also check out my minimal &lt;strong&gt;&lt;a href="https://github.com/programmerraja/Kokoro-FastAPI" rel="noopener noreferrer"&gt;Kokoro-FastAPI server&lt;/a&gt;&lt;/strong&gt; to experiment with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speech-to-Speech Models
&lt;/h2&gt;

&lt;p&gt;Speech-to-Speech (S2S) models represent an exciting advancement in AI, combining &lt;strong&gt;speech recognition, language understanding, and text-to-speech synthesis&lt;/strong&gt; into a single, end-to-end pipeline. These models allow &lt;strong&gt;natural, real-time conversations&lt;/strong&gt; by converting speech input directly into speech output, reducing latency and minimizing intermediate processing steps.&lt;/p&gt;

&lt;p&gt;Some notable models in this space include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/kyutai-labs/moshi" rel="noopener noreferrer"&gt;&lt;strong&gt;Moshi&lt;/strong&gt;&lt;/a&gt;: Developed by Kyutai-Labs, Moshi is a state-of-the-art speech-text foundation model designed for &lt;strong&gt;real-time full-duplex dialogue&lt;/strong&gt;. Unlike traditional voice agents that process ASR, LLM, and TTS separately, Moshi handles the entire flow end-to-end.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/SesameAILabs/csm" rel="noopener noreferrer"&gt;CSM&lt;/a&gt; (Conversational Speech Model) is a speech generation model from &lt;a href="https://www.sesame.com/" rel="noopener noreferrer"&gt;Sesame&lt;/a&gt; that generates RVQ audio codes from text and audio inputs. The model architecture employs a &lt;a href="https://www.llama.com/" rel="noopener noreferrer"&gt;Llama&lt;/a&gt; backbone and a smaller audio decoder that produces &lt;a href="https://huggingface.co/kyutai/mimi" rel="noopener noreferrer"&gt;Mimi&lt;/a&gt; audio codes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.microsoft.com/en-us/research/project/vall-e-x/vall-e-x/" rel="noopener noreferrer"&gt;VALL-E &amp;amp; VALL-E X (Microsoft)&lt;/a&gt;&lt;/strong&gt;: These models support &lt;strong&gt;zero-shot voice conversion&lt;/strong&gt; and speech-to-speech synthesis from limited voice samples.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://research.google/blog/audiolm-a-language-modeling-approach-to-audio-generation/" rel="noopener noreferrer"&gt;AudioLM (Google Research)&lt;/a&gt;&lt;/strong&gt;: Leverages &lt;strong&gt;language modeling on audio tokens&lt;/strong&gt; to generate high-quality speech continuation and synthesis.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Among these, I’ve primarily worked with &lt;strong&gt;Moshi&lt;/strong&gt;. I’ve implemented it on a &lt;strong&gt;FastAPI server with streaming support&lt;/strong&gt;, which allows you to test and interact with it in real-time. You can explore the FastAPI implementation here: &lt;a href="https://github.com/programmerraja/FastAPI_Moshi" rel="noopener noreferrer"&gt;FastAPI + Moshi GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Framework (The Glue)
&lt;/h2&gt;

&lt;p&gt;Finally, you need something to tie all the pieces together: &lt;strong&gt;streaming audio, message passing, and orchestration&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open-Source Frameworks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/pipecat-ai/pipecat" rel="noopener noreferrer"&gt;Pipecat&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Purpose-built for &lt;strong&gt;voice-first agents&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming-first&lt;/strong&gt; (ultra-low latency)&lt;/li&gt;
&lt;li&gt;Modular design — swap models easily&lt;/li&gt;
&lt;li&gt;Active community&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/vocodedev/vocode-python" rel="noopener noreferrer"&gt;Vocode&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developer-friendly, good docs&lt;/li&gt;
&lt;li&gt;Direct telephony integration&lt;/li&gt;
&lt;li&gt;Smaller community, less active&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/livekit/agents" rel="noopener noreferrer"&gt;LiveKit Agents&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Based on WebRTC&lt;/li&gt;
&lt;li&gt;Supports voice, video, text&lt;/li&gt;
&lt;li&gt;Self-hosting options&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Traditional Orchestration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangChain&lt;/strong&gt; — great for docs, weak at streaming&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LlamaIndex&lt;/strong&gt; — RAG-focused, not optimized for voice&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom builds&lt;/strong&gt; — total control, but high overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Why I Recommend Pipecat
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Voice-Centric Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Streaming-first, frame-based pipeline (TTS can start before text is done)&lt;/li&gt;
&lt;li&gt;Smart Turn Detection v2 (intonation-aware)&lt;/li&gt;
&lt;li&gt;Built-in interruption handling&lt;/li&gt;
&lt;/ul&gt;
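&lt;p&gt;Pipecat’s frame-based design is easiest to picture as typed frames flowing through a chain of processors. This toy version is a simplification, not the real Pipecat API, but it captures the idea that any stage can transform, pass through, or react to frames:&lt;/p&gt;

```python
# Toy frame-based pipeline (a simplification; the real Pipecat API differs,
# but the core idea is the same: typed frames flow through a chain of
# processors, and each stage handles only the frame kinds it cares about).

class Frame:
    def __init__(self, kind, data=None):
        self.kind, self.data = kind, data

class Processor:
    def process(self, frame):
        yield frame  # default behavior: pass the frame through unchanged

class Uppercase(Processor):
    """Stand-in for a real stage such as STT, LLM, or TTS."""
    def process(self, frame):
        if frame.kind == "text":
            yield Frame("text", frame.data.upper())
        else:
            yield frame  # non-text frames (e.g. audio) pass through

def run_pipeline(stages, frames):
    """Push every frame through every stage, in order."""
    for stage in stages:
        frames = [out for f in frames for out in stage.process(f)]
    return frames

out = run_pipeline([Uppercase()], [Frame("text", "hello"), Frame("audio", b"..")])
print([f.data for f in out if f.kind == "text"])  # ['HELLO']
```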

&lt;p&gt;&lt;strong&gt;Production Ready&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sub-500ms latency achievable&lt;/li&gt;
&lt;li&gt;Efficient for long-running agents&lt;/li&gt;
&lt;li&gt;Excellent docs + examples&lt;/li&gt;
&lt;li&gt;Strong, growing community&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-World Performance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~500ms voice-to-voice latency in production&lt;/li&gt;
&lt;li&gt;Works with Twilio + phone systems&lt;/li&gt;
&lt;li&gt;Supports multi-agent orchestration&lt;/li&gt;
&lt;li&gt;Scales to thousands of concurrent users&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Pipecat&lt;/th&gt;
&lt;th&gt;Vocode&lt;/th&gt;
&lt;th&gt;LiveKit&lt;/th&gt;
&lt;th&gt;LangChain&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Voice-First Design&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-Time Streaming&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vendor Neutral&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Turn Detection&lt;/td&gt;
&lt;td&gt;✅ Smart V2&lt;/td&gt;
&lt;td&gt;⚠️ Basic&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Community Activity&lt;/td&gt;
&lt;td&gt;✅ High&lt;/td&gt;
&lt;td&gt;⚠️ Moderate&lt;/td&gt;
&lt;td&gt;✅ High&lt;/td&gt;
&lt;td&gt;✅ High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning Curve&lt;/td&gt;
&lt;td&gt;⚠️ Moderate&lt;/td&gt;
&lt;td&gt;⚠️ Moderate&lt;/td&gt;
&lt;td&gt;❌ Steep&lt;/td&gt;
&lt;td&gt;✅ Easy&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;In this first part, we’ve covered the &lt;strong&gt;core tech stack and models&lt;/strong&gt; needed to build a real-time voice agent.&lt;/p&gt;

&lt;p&gt;In the next part of the series, we’ll dive into &lt;strong&gt;integration with Pipecat&lt;/strong&gt;, explore our &lt;strong&gt;voice architecture&lt;/strong&gt;, and walk through &lt;strong&gt;deployment strategies&lt;/strong&gt;. Later, we’ll show how to enhance your agent with &lt;strong&gt;RAG (Retrieval-Augmented Generation)&lt;/strong&gt;, &lt;strong&gt;memory features&lt;/strong&gt;, and other advanced capabilities to make your voice assistant truly intelligent.&lt;/p&gt;

&lt;p&gt;Stay tuned: the next guide will turn all these building blocks into a working, real-time voice agent you can actually deploy.&lt;/p&gt;

&lt;p&gt;I’ve created a &lt;strong&gt;GitHub repository&lt;/strong&gt; &lt;strong&gt;&lt;a href="https://github.com/programmerraja/VoiceAgentGuide" rel="noopener noreferrer"&gt;VoiceAgentGuide&lt;/a&gt;&lt;/strong&gt; for this series, where we can store our notes and related resources. Don’t forget to &lt;strong&gt;check it out&lt;/strong&gt; and share your &lt;strong&gt;feedback&lt;/strong&gt;. Feel free to &lt;strong&gt;contribute or add missing content&lt;/strong&gt; by submitting a pull request (PR).&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://voiceaiandvoiceagents.com/" rel="noopener noreferrer"&gt;Voice AI &amp;amp; Voice Agents An Illustrated Primer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>generativeai</category>
      <category>voiceagent</category>
      <category>pipecat</category>
    </item>
    <item>
      <title>The Other Side of OpenAI: 12 Surprising Stories You Haven’t Heard</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Wed, 10 Sep 2025 01:13:35 +0000</pubDate>
      <link>https://forem.com/programmerraja/the-other-side-of-openai-12-surprising-stories-you-havent-heard-c9</link>
      <guid>https://forem.com/programmerraja/the-other-side-of-openai-12-surprising-stories-you-havent-heard-c9</guid>
      <description>&lt;p&gt;While browsing YouTube, I stumbled across a video titled &lt;a href="https://www.youtube.com/watch?v=lYWtMF_eurA" rel="noopener noreferrer"&gt;&lt;em&gt;This Book Changed How I Think About AI&lt;/em&gt;&lt;/a&gt;. Curious, I clicked, and it introduced me to &lt;a href="https://en.wikipedia.org/wiki/Empire_of_AI" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;Empire of AI&lt;/em&gt; by Karen Hao&lt;/strong&gt;&lt;/a&gt;, a book that dives deep into the evolution of OpenAI.&lt;/p&gt;

&lt;p&gt;The book explores OpenAI’s history, its culture of secrecy, and its almost single-minded pursuit of artificial general intelligence (AGI). Drawing on interviews with more than 260 people, along with correspondence and internal documents, Hao paints a revealing picture of the company.&lt;/p&gt;

&lt;p&gt;After reading it, I uncovered 12 particularly fascinating facts about OpenAI that most people don’t know. Let’s dive in.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. The “Open” in OpenAI Was More Branding Than Belief&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The name sounds noble; who doesn’t like the idea of “open” AI? But here’s the catch: from the very beginning, openness was more narrative than commitment. Founders Sam Altman, Greg Brockman, and Elon Musk leaned into it because it helped them stand out. Behind closed doors, though, cofounder Ilya Sutskever was already suggesting they could scale it back once the story had served its purpose. In other words: open, until it wasn’t convenient.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Elon Musk’s Billion-Dollar Promise? Mostly Smoke and Mirrors&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Remember Musk’s flashy $1 billion funding pledge? Turns out, OpenAI only ever saw about $130 million of it. And less than $45 million came directly from Musk himself. His back-and-forth on funding almost pushed the organization into crisis, forcing Altman to hunt down new sources of money.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. The For-Profit Shift Was More About Survival Than Vision&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In 2019, OpenAI unveiled its “capped-profit” structure, pitching it as an innovative way to balance mission and money. But the truth is far less glamorous: the nonprofit model wasn’t bringing in the billions needed to compete with tech giants. At one point, Brockman and Sutskever even discussed merging with a chip startup. Creating OpenAI LP wasn’t a bold vision; it was a lifeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. The “Capped-Profit” Model Looked Unlimited to Critics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Investors were told their returns would be capped at 100x. Sounds responsible, right? But do the math: a $10 million check could still turn into a $1 billion payout. Critics quickly called it “basically unlimited,” arguing the cap only looked meaningful until you saw the actual numbers.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5. GPT-2’s “Too Dangerous” Storyline Was a PR Masterstroke&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In 2019, OpenAI said its GPT-2 model was so powerful it had to be withheld for safety reasons. Headlines exploded. But here’s the twist: many researchers thought the risk claims were overblown and saw the whole thing as a publicity stunt engineered by Jack Clark, OpenAI’s communications chief at the time. The stunt worked—the company was suddenly everywhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;6. OpenAI’s Culture Had Clashing “Tribes”&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Inside OpenAI, things weren’t exactly harmonious. Sam Altman himself described the organization as divided into three factions: research explorers, safety advocates, and startup-minded builders. He even warned of “tribal warfare” if they couldn’t pull together. That’s not just workplace tension; it’s a sign of deep conflict over the company’s direction.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;7. ChatGPT’s Global Debut Was Basically an Accident&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Think ChatGPT’s launch was carefully choreographed? Not at all. The product that made OpenAI a household name was released in just two weeks as a “research preview,” right after Thanksgiving 2022. The rush was partly to get ahead of a rumored chatbot from Anthropic. Even Microsoft, OpenAI’s biggest partner, was caught off guard and reportedly annoyed.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;8. Training Data Included Pirated Books and YouTube Videos&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Where do you get enough data to train something like GPT-3 or GPT-4? In OpenAI’s case, by scraping almost everything it could. GPT-3 used a secret dataset nicknamed “Books2,” which reportedly included pirated works from Library Genesis. GPT-4 went even further, with employees transcribing YouTube videos and scooping up anything online without explicit “do not scrape” warnings.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;9. “AI Safety” Initially Ignored Social Harms&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;OpenAI loves to talk about AI safety now. But early on, executives resisted calls to broaden the term to include real-world harms like discrimination and bias. When pressed, one leader bluntly said, “That’s not our role.” The message was clear: safety meant existential risks, not everyday impacts.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;10. Scaling Up Came with Hidden Environmental Costs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Bigger models require more compute and more resources. Training GPT-4 in Microsoft’s Iowa data centers consumed roughly 11.5 million gallons of water in a single month, during a drought. Strikingly, Altman and other leaders reportedly never discussed these environmental costs in company-wide meetings.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;11. “SummerSafe LP” Had a Dark Inspiration&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before OpenAI LP had its public name, it was secretly incorporated as “SummerSafe LP.” The reference? An episode of &lt;em&gt;Rick and Morty&lt;/em&gt; where a car, tasked with keeping Summer safe, resorts to murder and torture. Internally, it was an ironic nod to how AI systems can twist well-meaning goals into dangerous outcomes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;12. Departing Employees Faced Equity Pressure&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Leaked documents revealed OpenAI used a hardball tactic with departing employees: sign a strict nondisparagement agreement or risk losing vested equity. This essentially forced people into lifelong silence. Altman later said he didn’t know this was happening and was embarrassed, but records show he had signed paperwork granting the company those rights a year earlier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;OpenAI’s story is anything but straightforward. From broken promises and internal clashes to controversial data practices, the company has often operated in ways that don’t match its public messaging. Whether you see that as savvy strategy, messy growing pains, or something more troubling depends on your perspective.&lt;/p&gt;

&lt;p&gt;But one thing’s clear: the “open” in OpenAI has always been complicated.&lt;/p&gt;

</description>
      <category>openai</category>
      <category>books</category>
      <category>elonmusk</category>
    </item>
    <item>
      <title>Rust Tools That Made Our Dev Team Productive Again</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Sat, 09 Aug 2025 02:58:09 +0000</pubDate>
      <link>https://forem.com/programmerraja/rust-tools-that-made-our-dev-team-productive-again-479</link>
      <guid>https://forem.com/programmerraja/rust-tools-that-made-our-dev-team-productive-again-479</guid>
      <description>&lt;p&gt;As regular readers of my blog may know, our primary technology stack is the &lt;strong&gt;MERN stack&lt;/strong&gt;: MongoDB, Express, React, and Node.js. On the frontend, we use React with TypeScript; on the backend, Node.js with TypeScript; and MongoDB serves as our database.&lt;/p&gt;

&lt;p&gt;While this stack has served us well, we encountered significant challenges as our application scaled, particularly around build times, memory usage, and developer experience. In this post, I will outline two key areas where &lt;strong&gt;Rust-based tools&lt;/strong&gt; helped us resolve these issues and substantially improved our team’s development velocity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improving Frontend Performance
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem: Slow Builds and Poor Developer Experience&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As our frontend codebase grew, we began facing several recurring issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local development startup times became painfully slow.&lt;/li&gt;
&lt;li&gt;Build processes consumed large amounts of memory.&lt;/li&gt;
&lt;li&gt;On lower-end machines, builds caused systems to hang or crash.&lt;/li&gt;
&lt;li&gt;Developers regularly raised concerns about delays and performance bottlenecks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These issues were primarily due to our use of &lt;strong&gt;Create React App (CRA)&lt;/strong&gt; with an &lt;strong&gt;ejected Webpack configuration&lt;/strong&gt;. While powerful, this setup became increasingly inefficient for our scale and complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  First Attempt: Migrating to Vite
&lt;/h3&gt;

&lt;p&gt;In search of a solution, I explored &lt;strong&gt;&lt;a href="https://vite.dev/" rel="noopener noreferrer"&gt;Vite&lt;/a&gt;&lt;/strong&gt;, a build tool known for its speed and modern architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster initial load times due to native ES module imports.&lt;/li&gt;
&lt;li&gt;Noticeable improvement in development server startup.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Challenges:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Migrating from an ejected CRA setup was complex due to custom Webpack configurations.&lt;/li&gt;
&lt;li&gt;Issues arose with lazy-loaded routes, SVG assets, and ESLint/type-checking delays.&lt;/li&gt;
&lt;li&gt;Certain runtime errors occurred during navigation, likely due to missing or incorrect Vite configurations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ultimately, while Vite offered some performance benefits, it did not fully resolve our problems and introduced new complications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Final Solution: Adopting Rspack
&lt;/h3&gt;

&lt;p&gt;After further research, we came across &lt;a href="https://rspack.rs/" rel="noopener noreferrer"&gt;&lt;strong&gt;Rspack&lt;/strong&gt;&lt;/a&gt;, a high-performance Webpack-compatible bundler written in &lt;strong&gt;Rust&lt;/strong&gt;. What caught my attention was its focus on performance and ease of migration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8f8rkkudvzz2fyxu4bum.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8f8rkkudvzz2fyxu4bum.png" alt="rspack" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key advantages of Rspack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Significantly faster build times, with up to a 70% improvement in our case.&lt;/li&gt;
&lt;li&gt;Reduced memory consumption during both build and development.&lt;/li&gt;
&lt;li&gt;Compatibility with existing Webpack plugins and configurations, which simplified migration.&lt;/li&gt;
&lt;li&gt;Designed as a drop-in replacement for Webpack.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After resolving a few initial issues, we successfully integrated Rspack into our frontend build system. The migration resulted in substantial improvements in build speed and developer satisfaction. The system is now in production with no reported issues, and developers are once again comfortable working on the frontend.&lt;/p&gt;

&lt;h2&gt;
  
  
  Accelerating Backend Testing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem: Slow Kubernetes-Based Testing Cycle&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our backend uses Kubernetes for deployment and testing. The typical development workflow looked like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A developer makes code changes.&lt;/li&gt;
&lt;li&gt;A Docker image is built and pushed to a registry using GitHub Actions.&lt;/li&gt;
&lt;li&gt;The updated image is deployed to the Kubernetes cluster.&lt;/li&gt;
&lt;li&gt;Testers verify the changes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This process, while standard, became inefficient. Even small changes (such as adding a log statement) required a full image build and redeployment, resulting in delays of 15 minutes or more per test cycle.&lt;/p&gt;
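
&lt;p&gt;Concretely, each cycle boiled down to something like this; the registry, image, and deployment names below are placeholders, not our actual setup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Build and push a fresh image for every change
docker build -t registry.example.com/backend:$GIT_SHA .
docker push registry.example.com/backend:$GIT_SHA

# Roll the new image out and wait for the pods to cycle
kubectl set image deployment/backend backend=registry.example.com/backend:$GIT_SHA
kubectl rollout status deployment/backend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Every one of those steps runs even for a one-line change, which is where the 15 minutes went.&lt;/p&gt;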

&lt;h3&gt;
  
  
  Optimization: Runtime Code Sync
&lt;/h3&gt;

&lt;p&gt;To address this, we wrote a shell script that runs whenever a pod starts or restarts; it pulls the latest changes from GitHub and runs the code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git reset &lt;span class="nt"&gt;--hard&lt;/span&gt; origin/&lt;span class="nv"&gt;$BRANCH_NAME&lt;/span&gt;
git pull origin &lt;span class="nv"&gt;$BRANCH_NAME&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This significantly reduced testing turnaround time for JavaScript-based services.&lt;/p&gt;
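
&lt;p&gt;A fuller version of that startup script might look like the following sketch; the working directory, install step, and start command are illustrative and would be adapted per service:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/bin/sh
set -e

# Discard any local state and sync the pod to the latest commit on the branch
cd /app
git fetch origin "$BRANCH_NAME"
git reset --hard "origin/$BRANCH_NAME"

# Pick up any newly added dependencies, then start the service
npm install
node server.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;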

&lt;h3&gt;
  
  
  The TypeScript Bottleneck
&lt;/h3&gt;

&lt;p&gt;However, for services written in TypeScript, the situation was more complex. After pulling the latest code, we needed to transpile TypeScript to JavaScript using &lt;code&gt;tsc&lt;/code&gt; or &lt;code&gt;npm run build&lt;/code&gt;. Unfortunately, this process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consumed excessive memory.&lt;/li&gt;
&lt;li&gt;Took too long to complete.&lt;/li&gt;
&lt;li&gt;Caused pods to crash, especially in test environments with limited resources.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Solution: Integrating SWC
&lt;/h3&gt;

&lt;p&gt;To solve this, we adopted &lt;strong&gt;SWC&lt;/strong&gt;, a Rust-based TypeScript compiler. Unlike &lt;code&gt;tsc&lt;/code&gt;, SWC focuses on speed and performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Results after integrating SWC:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compilation time reduced to approximately 250 milliseconds.&lt;/li&gt;
&lt;li&gt;Memory usage dropped significantly.&lt;/li&gt;
&lt;li&gt;Allowed us to support live code updates without full builds or redeployments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because SWC does not perform type checking, we use it only in test environments. This tradeoff allows testers to verify code changes rapidly, without impacting our production pipeline.&lt;/p&gt;
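
&lt;p&gt;In our test environments, the build step effectively reduces to a single SWC invocation; this sketch assumes &lt;code&gt;@swc/cli&lt;/code&gt; and &lt;code&gt;@swc/core&lt;/code&gt; are installed, and the paths are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Transpile TypeScript to JavaScript fast, with no type checking
npx swc src -d dist

# Type checking still happens, but in CI rather than on the pod
npx tsc --noEmit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;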

&lt;h3&gt;
  
  
  Conclusion: Rust’s Impact on Team Efficiency
&lt;/h3&gt;

&lt;p&gt;In both our frontend and backend workflows, &lt;strong&gt;Rust-based tools&lt;/strong&gt;, Rspack and SWC, delivered substantial improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend build times&lt;/strong&gt; were reduced by more than 70%, with better memory efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing cycles&lt;/strong&gt; became significantly faster, especially for TypeScript services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer experience&lt;/strong&gt; improved across the board, reducing frustration and increasing velocity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rust’s performance characteristics, coupled with thoughtful tool design, played a critical role in resolving bottlenecks in our JavaScript-based systems. For teams facing similar challenges, especially around build performance and scalability, we strongly recommend exploring Rust-powered tools as a viable solution.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>rust</category>
      <category>javascript</category>
      <category>react</category>
    </item>
    <item>
      <title>This AI Interview Assistant Chrome Extension Was My Weekend Project</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Sat, 24 May 2025 05:23:09 +0000</pubDate>
      <link>https://forem.com/programmerraja/i-built-a-chrome-extension-that-reads-resumes-and-generates-interview-questions-with-ai-2ppk</link>
      <guid>https://forem.com/programmerraja/i-built-a-chrome-extension-that-reads-resumes-and-generates-interview-questions-with-ai-2ppk</guid>
      <description>&lt;p&gt;Hiring has always been one of those tasks that &lt;em&gt;seems&lt;/em&gt; easy until you’re knee-deep in resumes, trying to remember who did what and, more importantly, what to ask them during the interview.&lt;/p&gt;

&lt;p&gt;A few weeks ago, I had this exact moment. I was preparing for an interview and had a resume open in one tab, a notepad in another, and ChatGPT somewhere in the background trying to help me brainstorm questions. That's when the idea hit me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Why am I juggling between tabs? What if this entire process could live in a single Chrome extension?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I built one. It’s called &lt;strong&gt;HireZen&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem I Kept Running Into
&lt;/h2&gt;

&lt;p&gt;Every time I had to take an interview, I’d start by opening the candidate’s resume. But even after reading it top to bottom, I wasn’t always sure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;What’s the best way to dig deeper into their projects?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Are they really comfortable with the tools they listed?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What kinds of behavioral or situational questions would be relevant?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’d often resort to generic questions or spend too much time prepping just one resume. It felt repetitive and inefficient, and I knew there had to be a better way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter HireZen
&lt;/h2&gt;

&lt;p&gt;HireZen is a Chrome extension that does one simple thing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You upload a resume, and it generates personalized interview questions for you using AI.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That’s it. No over-engineering. No login required. Just upload, generate, copy or print. Done.&lt;/p&gt;

&lt;p&gt;Here’s what it currently supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;🧠 Reads and parses PDF resumes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🤖 Uses LLMs (like GPT-4) to generate questions based on the candidate’s experience&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🖨️ Lets you print the generated questions or share them with HR&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea is to take the mental load off interviewers and let AI handle the repetitive thinking.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;By default, when you visit Google Meet, HireZen will auto-open as a sidebar so you can prep questions while you’re on the call.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Press Ctrl + M to hide or show the extension anytime (toggle view).&lt;/li&gt;
&lt;li&gt;Click the Settings icon to:
&lt;ul&gt;
&lt;li&gt;Choose your LLM provider (OpenAI, Claude, etc.)&lt;/li&gt;
&lt;li&gt;Enter your API key&lt;/li&gt;
&lt;li&gt;Select the model you prefer (e.g., GPT-4, GPT-3.5)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything is stored securely inside your browser. Once configured, just upload a resume and start generating questions instantly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcltwdnifnyao1u9kmt4l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcltwdnifnyao1u9kmt4l.png" alt="Preview" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxqd0kqtmxyl22kg9ikok.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxqd0kqtmxyl22kg9ikok.png" alt="preview2" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Behind It
&lt;/h2&gt;

&lt;p&gt;Initially, I was using a GitHub-hosted API to call OpenAI’s models. It worked well, but it obviously wasn’t scalable for others. So I added a &lt;strong&gt;Settings page&lt;/strong&gt; where anyone using the extension can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Choose their &lt;strong&gt;LLM provider&lt;/strong&gt; (e.g., OpenAI or others)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set their &lt;strong&gt;preferred model&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Enter their own &lt;strong&gt;API key&lt;/strong&gt;, which is stored securely in the browser (not sent to me or any server)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No backend. No database. Just local storage via Chrome's &lt;code&gt;storage.local&lt;/code&gt; API.&lt;/p&gt;

&lt;p&gt;It’s simple and, more importantly, safe.&lt;/p&gt;

&lt;h2&gt;
  
  
  On Security
&lt;/h2&gt;

&lt;p&gt;One thing I was very cautious about was handling the API key. I didn’t want to mess around with storing sensitive data anywhere outside the user’s browser. So everything (model, provider, key) is stored &lt;strong&gt;locally&lt;/strong&gt; and is only accessible to the extension.&lt;/p&gt;

&lt;p&gt;You control your own usage. You bring your own key. I never see it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;This is just the beginning. I’m planning to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add support for exporting question sets as PDF&lt;/li&gt;
&lt;li&gt;Build a small feedback form to help interviewers leave notes&lt;/li&gt;
&lt;li&gt;Eventually list it on the Chrome Web Store&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Right now, it’s all open and available to try.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Out
&lt;/h2&gt;

&lt;p&gt;Here’s the &lt;a href="https://programmerraja.is-a.dev/hirezen/" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It works on Chrome and Chromium-based browsers. Just open it, upload a resume, and let it do the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I’m Sharing This
&lt;/h2&gt;

&lt;p&gt;I’m a solo developer. I build things out of curiosity and real-world pain points I face at work. HireZen is one of those small tools I wish I had earlier, so I built it and put it out there.&lt;/p&gt;

&lt;p&gt;If it saves you time or makes your interviews a little smoother  that’s all I hoped for.&lt;/p&gt;

&lt;p&gt;And hey, if you found it helpful and want to support my work...&lt;/p&gt;

&lt;p&gt;☕️ You can &lt;strong&gt;&lt;a href="https://buymeacoffee.com/programmerraja" rel="noopener noreferrer"&gt;buy me a coffee&lt;/a&gt;&lt;/strong&gt; – it helps me keep building little tools like this and pushing updates.&lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>sideprojects</category>
      <category>llm</category>
    </item>
    <item>
      <title>Prototyping AI Agents with GitHub Models (for Free!) 🤖💸</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Sun, 20 Apr 2025 06:48:47 +0000</pubDate>
      <link>https://forem.com/programmerraja/prototyping-ai-agents-with-github-models-for-free-1go9</link>
      <guid>https://forem.com/programmerraja/prototyping-ai-agents-with-github-models-for-free-1go9</guid>
      <description>&lt;p&gt;I &lt;em&gt;love&lt;/em&gt; when companies roll out generous free tiers. It feels like they’re saying, “Hey, go build your thing — we’ve got your back.” And if you’re a student, between jobs, or just tired of racking up charges for every API call (yep, been there too), free tiers can be a total game-changer.&lt;/p&gt;

&lt;p&gt;That’s exactly why &lt;strong&gt;GitHub Models&lt;/strong&gt; stood out to me — it’s like an AI candy shop, completely free to explore, as long as you have a GitHub account.&lt;/p&gt;

&lt;p&gt;Here’s what’s on the shelf:&lt;/p&gt;

&lt;p&gt;🔮 OpenAI models like &lt;code&gt;gpt-4o&lt;/code&gt;, &lt;code&gt;o3-mini&lt;/code&gt;&lt;br&gt;&lt;br&gt;
🧠 Research favorites like &lt;code&gt;Phi&lt;/code&gt; and &lt;code&gt;LLaMA&lt;/code&gt;&lt;br&gt;&lt;br&gt;
🌍 Multimodal models like &lt;code&gt;llama-vision-instruct&lt;/code&gt;&lt;br&gt;&lt;br&gt;
📚 Embeddings from Cohere and OpenAI&lt;br&gt;&lt;br&gt;
⚡ Plus providers like &lt;code&gt;Mistral&lt;/code&gt;, &lt;code&gt;Jamba&lt;/code&gt;, &lt;code&gt;Codestral&lt;/code&gt; and more&lt;/p&gt;

&lt;p&gt;Oh, and the best part? Many of these models support &lt;strong&gt;function calling&lt;/strong&gt;, making them perfect for agent-style apps.&lt;/p&gt;

&lt;p&gt;Now here’s the real kicker:&lt;br&gt;&lt;br&gt;
GitHub Models speak &lt;strong&gt;OpenAI-compatible API&lt;/strong&gt; — which means any Python framework that already works with OpenAI's ChatCompletion API... just works out of the box.&lt;/p&gt;
&lt;h3&gt;
  
  
  Example 1: Connecting &lt;code&gt;openai&lt;/code&gt; SDK to GitHub Models
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://models.inference.ai.azure.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now go ahead and use it like you would with OpenAI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple, clean, no surprises.&lt;/p&gt;
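
&lt;p&gt;Because the endpoint is OpenAI-compatible, you can also sanity-check it with plain &lt;code&gt;curl&lt;/code&gt; before wiring up any framework, using the same endpoint and token as above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# A single chat completion against the GitHub Models endpoint
curl https://models.inference.ai.azure.com/chat/completions \
  -H "Authorization: Bearer $GITHUB_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Say hello in one word."}]
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;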

&lt;h3&gt;
  
  
  Example 2: Running AutoGen with GitHub Models
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;autogen_ext.models.openai&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;autogen_agentchat.agents&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;autogen_ext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OpenAIChatCompletionClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://models.inference.ai.azure.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;math_teacher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;autogen_agentchat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AssistantAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Math_teacher&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You only teach maths.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just like that, your agent is ready to go.&lt;/p&gt;

&lt;p&gt;You can plug GitHub Models into &lt;em&gt;tons&lt;/em&gt; of other Python libraries too — LangGraph, PydanticAI, LlamaIndex, you name it.&lt;/p&gt;

&lt;p&gt;Go build something fun. Happy tinkering!&lt;/p&gt;

</description>
      <category>github</category>
      <category>openai</category>
      <category>free</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
