<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mart Schweiger</title>
    <description>The latest articles on Forem by Mart Schweiger (@martschweiger).</description>
    <link>https://forem.com/martschweiger</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3802221%2Fcdb4c7a2-d4f4-444d-908e-30d6ea3bd1a7.png</url>
      <title>Forem: Mart Schweiger</title>
      <link>https://forem.com/martschweiger</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/martschweiger"/>
    <language>en</language>
    <item>
      <title>AssemblyAI Voice Agent API vs OpenAI Realtime API: Which should you use?</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Thu, 07 May 2026 17:37:03 +0000</pubDate>
      <link>https://forem.com/martschweiger/assemblyai-voice-agent-api-vs-openai-realtime-api-which-should-you-use-1gln</link>
      <guid>https://forem.com/martschweiger/assemblyai-voice-agent-api-vs-openai-realtime-api-which-should-you-use-1gln</guid>
      <description>&lt;p&gt;OpenAI's Realtime API was one of the first products to make building &lt;a href="https://www.assemblyai.com/blog/ai-voice-agents" rel="noopener noreferrer"&gt;voice agents&lt;/a&gt; feel accessible. Stream audio in, get audio back—simple idea, big impact. But as developers move from prototype to production, a different set of requirements kicks in: speech accuracy on real-world entities, cost predictability, and a developer experience that doesn't fight you.&lt;/p&gt;

&lt;p&gt;AssemblyAI's &lt;a href="https://www.assemblyai.com/products/voice-agent-api" rel="noopener noreferrer"&gt;Voice Agent API&lt;/a&gt; launched in April 2026 as a direct alternative—same simplicity, fundamentally different architecture. Here's an honest comparison of the two.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;AssemblyAI Voice Agent API&lt;/th&gt;
&lt;th&gt;OpenAI Realtime API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pricing&lt;/td&gt;
&lt;td&gt;$4.50/hr flat&lt;/td&gt;
&lt;td&gt;~$18/hr (per-token)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ASR model&lt;/td&gt;
&lt;td&gt;Universal-3 Pro Streaming (#1 WER)&lt;/td&gt;
&lt;td&gt;GPT-4o multimodal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Word accuracy&lt;/td&gt;
&lt;td&gt;94.07% (6.3% mean WER)&lt;/td&gt;
&lt;td&gt;93.13%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Missed entity rate (emails, phones, names)&lt;/td&gt;
&lt;td&gt;16.7%&lt;/td&gt;
&lt;td&gt;23.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;End-to-end latency&lt;/td&gt;
&lt;td&gt;~1 second (~150ms P50 STT)&lt;/td&gt;
&lt;td&gt;~1 second&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Languages&lt;/td&gt;
&lt;td&gt;EN, ES, FR, DE, IT, PT&lt;/td&gt;
&lt;td&gt;99+ (lower accuracy)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Turn detection&lt;/td&gt;
&lt;td&gt;Speech-aware VAD (semantic + neural)&lt;/td&gt;
&lt;td&gt;Basic VAD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mid-session updates&lt;/td&gt;
&lt;td&gt;Prompt + voice + tools + turn detection&lt;/td&gt;
&lt;td&gt;Prompt + tools only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Session resumption&lt;/td&gt;
&lt;td&gt;30-second reconnect window&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool calling behavior&lt;/td&gt;
&lt;td&gt;Handles with intermediate speech&lt;/td&gt;
&lt;td&gt;Goes silent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compliance&lt;/td&gt;
&lt;td&gt;HIPAA (BAA), SOC 2 Type 2, ISO 27001&lt;/td&gt;
&lt;td&gt;SOC 2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Architecture: multimodal vs. dedicated pipeline&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is the foundational difference, and everything else flows from it.&lt;/p&gt;

&lt;p&gt;OpenAI's Realtime API uses GPT-4o as a single multimodal model. Audio goes in, audio comes out, and one model handles speech understanding, reasoning, and voice generation all at once. It's architecturally clean, and if you're already building on OpenAI, it feels like a natural extension.&lt;/p&gt;

&lt;p&gt;AssemblyAI uses a &lt;a href="https://www.assemblyai.com/blog/voice-agent-api-vs-universal-3-pro-streaming" rel="noopener noreferrer"&gt;dedicated pipeline&lt;/a&gt;: Universal-3 Pro Streaming for &lt;a href="https://www.assemblyai.com/products/streaming-speech-to-text" rel="noopener noreferrer"&gt;speech-to-text&lt;/a&gt; (ranked #1 on the Hugging Face Open ASR Leaderboard), a separate LLM for reasoning, and purpose-built TTS for voice generation. Each model is optimized for its specific job.&lt;/p&gt;

&lt;p&gt;Why does this matter? Because a model that also handles text, images, and video isn't specifically tuned for the nuances of speech. Turn detection, entity accuracy on names and numbers, handling disfluencies and accented speech—these improve when the model is purpose-built for listening, not doing everything at once.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Speech accuracy: where the gap is real&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Voice agents live or die by what they hear. If the speech-to-text layer gets an email address wrong, the agent sends a confirmation to the wrong inbox. If it misses a digit in an account number, the agent looks up the wrong record.&lt;/p&gt;

&lt;p&gt;AssemblyAI's Universal-3 Pro Streaming model achieves 94.07% word accuracy on &lt;a href="https://www.assemblyai.com/blog/real-time-speech-to-text-for-voice-agents" rel="noopener noreferrer"&gt;real-world audio&lt;/a&gt;, with a 6.3% mean word error rate across English domains—the lowest among major providers in independent benchmarks. On &lt;a href="https://www.assemblyai.com/blog/choosing-a-stt-api-for-voice-agents" rel="noopener noreferrer"&gt;entity accuracy&lt;/a&gt; specifically, Universal-3 Pro Streaming has a 16.7% average missed entity rate on names, emails, phone numbers, and credit card numbers. Compare that to OpenAI GPT-4o Transcribe at 23.3% and Deepgram Nova-3 at 25.2%. In head-to-head benchmarks, Universal-3 Pro Streaming shows a 59.64% lower missed email rate and 34.79% lower missed phone rate compared to competitors.&lt;/p&gt;

&lt;p&gt;That gap might look small on paper. In practice, it's the difference between an agent that captures "RX-7704132" correctly and one that transcribes "dash seven seven zero four one three two." When your agent acts on that transcript, accuracy isn't a nice-to-have—it's the whole game.&lt;/p&gt;

&lt;p&gt;For specialized domains, AssemblyAI also offers &lt;a href="https://www.assemblyai.com/medical-mode" rel="noopener noreferrer"&gt;Medical Mode&lt;/a&gt;, which enhances accuracy on clinical terminology—medication names, procedures, dosages. OpenAI doesn't have an equivalent specialized mode.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pricing: $4.50/hr vs. ~$18/hr&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is the most straightforward comparison.&lt;/p&gt;

&lt;p&gt;AssemblyAI's Voice Agent API costs $4.50/hr flat. That covers everything—speech understanding, LLM reasoning, and voice generation. No per-token math, no separate input/output charges. Your monthly cost is simply: hours of agent usage times $4.50.&lt;/p&gt;

&lt;p&gt;OpenAI's Realtime API uses per-token billing for both input and output audio tokens. When you do the math on a typical voice agent conversation, that comes out to roughly $18/hr—about 4x the cost.&lt;/p&gt;

&lt;p&gt;At prototype scale, both are affordable. At production scale, the difference compounds fast. A customer support operation running 5,000 hours of agent conversations per month is looking at $22,500/month with AssemblyAI vs. $90,000/month with OpenAI. That's $67,500/month in savings—enough to fund an entire engineering team.&lt;/p&gt;
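
&lt;p&gt;If you want to sanity-check that math, here it is as a quick Python sketch (rates from the comparison above; the 5,000-hour workload is just the example figure):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ASSEMBLYAI_RATE = 4.50   # USD/hr, flat (covers STT + LLM reasoning + TTS)
OPENAI_RATE = 18.00      # USD/hr, approximate per-token equivalent

hours_per_month = 5_000  # the example support workload above

assemblyai_cost = hours_per_month * ASSEMBLYAI_RATE  # 22,500
openai_cost = hours_per_month * OPENAI_RATE          # 90,000
print(f"Monthly savings: ${openai_cost - assemblyai_cost:,.0f}")  # $67,500
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;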

&lt;h2&gt;
  
  
  &lt;strong&gt;Developer experience&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;OpenAI's Realtime API has over 30 event types to handle. For developers who want fine-grained control over every aspect of the audio stream, that's powerful. For developers who want to get a voice agent running quickly, it's a lot of surface area.&lt;/p&gt;

&lt;p&gt;AssemblyAI's approach is deliberately minimalist. One WebSocket, a handful of JSON message types, and the entire API reference is readable in about 10 minutes. Most developers &lt;a href="https://www.assemblyai.com/blog/how-to-vibe-code-a-voice-agent" rel="noopener noreferrer"&gt;ship a working agent the same day&lt;/a&gt;. No SDK required—it works with any WebSocket client and integrates natively with tools like Claude Code.&lt;/p&gt;

&lt;p&gt;The standout feature is live mid-conversation updates. You can change the system prompt, swap voices, add or remove tools, and adjust VAD settings—all via a JSON message without dropping the connection. OpenAI's Realtime API supports some mid-session configuration, but AssemblyAI's implementation is more comprehensive.&lt;/p&gt;
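
&lt;p&gt;Here's a sketch of what a mid-session update can look like in Python. The session.update message shape follows AssemblyAI's quick-start examples; the specific prompt, voice, and threshold values are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

async def escalate(ws):
    """Sketch: swap prompt, voice, and VAD sensitivity mid-conversation
    over the already-open WebSocket. All values are illustrative."""
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "system_prompt": "You are an escalation specialist. Be concise.",
            "output": {"voice": "james"},
            "input": {"turn_detection": {"vad_threshold": 0.6}},
        },
    }))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;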

&lt;p&gt;Tool calling works similarly in both: you register functions with JSON Schema, and the agent calls them when appropriate. AssemblyAI's agent speaks a natural transition while waiting for tool results, which keeps the conversation flowing.&lt;/p&gt;
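
&lt;p&gt;As a rough sketch of the registration side (the lookup_order tool and its fields are invented for illustration; check each provider's tool-calling docs for the exact field names):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

# Hypothetical lookup_order tool; parameters are plain JSON Schema and the
# definition rides along on a session.update message.
tools_update = {
    "type": "session.update",
    "session": {
        "tools": [{
            "name": "lookup_order",
            "description": "Fetch an order's status by order number.",
            "parameters": {
                "type": "object",
                "properties": {"order_number": {"type": "string"}},
                "required": ["order_number"],
            },
        }],
    },
}
print(json.dumps(tools_update, indent=2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;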

&lt;p&gt;Session resumption is another practical difference. If the WebSocket drops (mobile networks, flaky connections), AssemblyAI preserves context and lets you reconnect within 30 seconds. That's critical for phone-based voice agents where network reliability isn't guaranteed.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Turn detection and conversation flow&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is where you really feel the architectural difference.&lt;/p&gt;

&lt;p&gt;AssemblyAI's Voice Agent API uses speech-aware &lt;a href="https://www.assemblyai.com/blog/turn-detection-endpointing-voice-agent" rel="noopener noreferrer"&gt;Voice Activity Detection&lt;/a&gt; built into the Universal-3 Pro Streaming model. Rather than relying solely on silence thresholds, Universal-3 Pro Streaming uses audio-contextual signals—tonality, pacing, and speech patterns—to determine when a speaker is done. It distinguishes between a thoughtful pause and the end of a turn, so the agent doesn't cut you off mid-thought and doesn't sit in dead air for three seconds waiting for you to continue. When you interrupt, it stops and listens naturally.&lt;/p&gt;

&lt;p&gt;OpenAI's turn detection relies on more basic VAD that lacks the acoustic sophistication of a purpose-built speech model. The conversation intelligence that comes from a dedicated STT layer shows up most clearly in turn-taking quality.&lt;/p&gt;

&lt;p&gt;The honest test: have a real conversation with both. The difference in how natural the interaction feels is something you notice immediately, even if it's hard to put numbers on it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;When to choose each&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Choose AssemblyAI's Voice Agent API when:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You need the highest &lt;a href="https://www.assemblyai.com/solutions/voice-agents" rel="noopener noreferrer"&gt;speech accuracy on entities&lt;/a&gt; (names, numbers, addresses). You want predictable flat-rate pricing. You want the simplest possible developer experience. You're building for production scale where cost matters. You need healthcare compliance with Medical Mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose OpenAI's Realtime API when:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You're deeply embedded in OpenAI's ecosystem and value API consistency across products. You need broader language support beyond 6 languages. You want a single-model architecture for simplicity (at the cost of speech-specific optimization). Cost isn't the primary concern.&lt;/p&gt;

&lt;p&gt;For most production voice agent use cases—customer support, phone agents, &lt;a href="https://www.assemblyai.com/blog/medical-voice-recognition" rel="noopener noreferrer"&gt;clinical workflows&lt;/a&gt;, coaching platforms—AssemblyAI's combination of superior speech accuracy, 4x lower cost, and simpler developer experience makes it the stronger choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How much cheaper is AssemblyAI's Voice Agent API compared to OpenAI's Realtime API?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI's Voice Agent API costs $4.50/hr flat, covering speech understanding, LLM reasoning, and voice generation. OpenAI's Realtime API costs approximately $18/hr with per-token billing. That's roughly a 4x cost difference—at 5,000 hours/month of agent usage, the savings exceed $67,000/month.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Which has better speech accuracy for voice agents—AssemblyAI or OpenAI?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI's Universal-3 Pro Streaming model achieves 94.07% word accuracy and a 16.7% average missed entity rate on names, emails, phone numbers, and credit card numbers. OpenAI GPT-4o Transcribe scores 93.13% word accuracy with a 23.3% missed entity rate. Universal-3 Pro Streaming also shows 59.64% fewer missed emails and 34.79% fewer missed phone numbers—the entities voice agents actually act on.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What's the difference between a multimodal and a cascaded voice agent architecture?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;OpenAI uses a single multimodal model (GPT-4o) for speech understanding, reasoning, and voice generation. AssemblyAI uses a cascaded pipeline with dedicated models for each step—Universal-3 Pro Streaming for STT, a separate LLM for reasoning, and purpose-built TTS. The cascaded approach delivers better speech accuracy because each model is optimized for its specific task.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Can I migrate from OpenAI's Realtime API to AssemblyAI?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Yes. AssemblyAI's Voice Agent API uses a standard WebSocket with JSON messages—most developers get a working agent running in an afternoon, even when migrating from a different provider. Tool definitions transfer easily since both use JSON Schema. The main adjustment is the simpler event model (fewer event types to handle).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Does AssemblyAI's Voice Agent API support the same languages as OpenAI?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI currently supports 6 languages: English, Spanish, French, German, Italian, and Portuguese. OpenAI's Realtime API supports a broader set through GPT-4o's multilingual capabilities. If your use case requires languages beyond AssemblyAI's current 6, check both providers' documentation as language support is actively expanding.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Which voice agent API is better for healthcare applications?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI has a clear advantage for healthcare with Medical Mode—a dedicated add-on that enhances accuracy on medical terminology (medication names, procedures, dosages). AssemblyAI enables covered entities and their business associates subject to HIPAA to use AssemblyAI services to process protected health information (PHI), with a Business Associate Addendum (BAA) available. Combined with SOC 2 Type 2 certification, it's purpose-built for clinical voice agent workflows. OpenAI offers SOC 2 compliance but no equivalent medical speech optimization.&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>ai</category>
      <category>comparison</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>AssemblyAI Voice Agent API vs ElevenLabs Conversational AI: Which is better for voice agents?</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Thu, 07 May 2026 17:36:19 +0000</pubDate>
      <link>https://forem.com/martschweiger/assemblyai-voice-agent-api-vs-elevenlabs-conversational-ai-which-is-better-for-voice-agents-4a5i</link>
      <guid>https://forem.com/martschweiger/assemblyai-voice-agent-api-vs-elevenlabs-conversational-ai-which-is-better-for-voice-agents-4a5i</guid>
      <description>&lt;p&gt;ElevenLabs started as a text-to-speech company. Their Conversational AI product (sometimes called Eleven Agents) extends that TTS focus into the &lt;a href="https://www.assemblyai.com/blog/ai-voice-agents" rel="noopener noreferrer"&gt;voice agent&lt;/a&gt; space—combining speech input, reasoning, and voice output into a single managed platform.&lt;/p&gt;

&lt;p&gt;AssemblyAI's &lt;a href="https://www.assemblyai.com/products/voice-agent-api" rel="noopener noreferrer"&gt;Voice Agent API&lt;/a&gt; was built for production voice agents from the ground up. Powered by Universal-3 Pro Streaming—the #1 model on the Hugging Face Open ASR Leaderboard—it starts with world-class speech understanding and builds the rest of the pipeline around getting the input right.&lt;/p&gt;

&lt;p&gt;One of these approaches is purpose-built for voice agents that need to complete real tasks. The other is a TTS company expanding into a space that demands much more than good-sounding output. Here's how they compare.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;AssemblyAI Voice Agent API&lt;/th&gt;
&lt;th&gt;ElevenLabs Conversational AI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pricing&lt;/td&gt;
&lt;td&gt;$4.50/hr flat&lt;/td&gt;
&lt;td&gt;Per-character + platform fees&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ASR model&lt;/td&gt;
&lt;td&gt;Universal-3 Pro Streaming (#1 WER)&lt;/td&gt;
&lt;td&gt;Scribe v2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Word accuracy&lt;/td&gt;
&lt;td&gt;94.07%&lt;/td&gt;
&lt;td&gt;93.48%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Missed entity rate (emails, phones, names)&lt;/td&gt;
&lt;td&gt;16.7%&lt;/td&gt;
&lt;td&gt;Higher (25.2%+)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Concurrency&lt;/td&gt;
&lt;td&gt;Unlimited (autoscaling)&lt;/td&gt;
&lt;td&gt;30-agent cap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Languages&lt;/td&gt;
&lt;td&gt;EN, ES, FR, DE, IT, PT&lt;/td&gt;
&lt;td&gt;29+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architecture&lt;/td&gt;
&lt;td&gt;Raw API (one WebSocket, JSON)&lt;/td&gt;
&lt;td&gt;Managed platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mid-session updates&lt;/td&gt;
&lt;td&gt;Prompt + voice + tools + VAD&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Session resumption&lt;/td&gt;
&lt;td&gt;30-second reconnect window&lt;/td&gt;
&lt;td&gt;Not available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool calling&lt;/td&gt;
&lt;td&gt;Custom functions via JSON Schema&lt;/td&gt;
&lt;td&gt;Custom functions supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compliance&lt;/td&gt;
&lt;td&gt;HIPAA (BAA), SOC 2 Type 2, ISO 27001&lt;/td&gt;
&lt;td&gt;SOC 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medical terminology&lt;/td&gt;
&lt;td&gt;Medical Mode&lt;/td&gt;
&lt;td&gt;No equivalent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The core trade-off: input accuracy vs. output quality&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is the question at the heart of the comparison—and it's not as close as it might seem.&lt;/p&gt;

&lt;p&gt;ElevenLabs has historically been known for voice synthesis quality. But voice quality across the industry has converged significantly. The TTS outputs from dedicated voice agent APIs—including AssemblyAI's—are natural, professional, and indistinguishable from ElevenLabs in most business contexts. Voice output is no longer the differentiator it was two years ago.&lt;/p&gt;

&lt;p&gt;What hasn't converged is speech understanding. AssemblyAI's Universal-3 Pro Streaming achieves 94.07% word accuracy with a 6.3% mean word error rate across English domains. On &lt;a href="https://www.assemblyai.com/blog/choosing-a-stt-api-for-voice-agents" rel="noopener noreferrer"&gt;entity accuracy&lt;/a&gt;, Universal-3 Pro Streaming has a 16.7% average missed entity rate on names, emails, phone numbers, and credit card numbers—with a 59.64% lower missed email rate and 34.79% lower missed phone rate compared to competitors including ElevenLabs Scribe v2 (93.48% word accuracy).&lt;/p&gt;

&lt;p&gt;For production voice agent use cases—&lt;a href="https://www.assemblyai.com/solutions/contact-centers" rel="noopener noreferrer"&gt;customer support&lt;/a&gt;, phone agents, &lt;a href="https://www.assemblyai.com/blog/medical-voice-recognition" rel="noopener noreferrer"&gt;clinical workflows&lt;/a&gt;, order processing—input accuracy is the foundation everything else depends on. An agent that misheard your prescription number or email address fails the task regardless of how its voice sounds. The agent that captures "RX-7704132" correctly every time will outperform the one with marginally different TTS but weaker speech understanding.&lt;/p&gt;

&lt;p&gt;This is where ElevenLabs' origins as a TTS company become a liability. Building great voice agents requires purpose-built speech understanding—not a synthesis engine with transcription bolted on.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Platform vs. API: a fundamental design difference&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;ElevenLabs' Conversational AI is a managed platform. You configure agents through their interface, use their pre-built conversation flows, and deploy within their ecosystem. That might seem like faster setup—until you hit the walls.&lt;/p&gt;

&lt;p&gt;And you hit them quickly. ElevenLabs is opinionated about conversation design, which means less control over how your agent behaves in edge cases, how tools integrate, and how the conversation flows. Need a support agent that switches languages mid-conversation? A coaching app that adapts its personality? An agent connected to a niche CRM? You're working within the platform's boundaries—and those boundaries are tight.&lt;/p&gt;

&lt;p&gt;AssemblyAI's Voice Agent API is infrastructure, not a platform. One WebSocket, JSON messages, no SDK required. You build the product on top; the API is invisible to your end users. As the team puts it: "Your customers should feel like you built it from scratch."&lt;/p&gt;

&lt;p&gt;Full control means full control. System prompt, voice, tools, VAD settings, &lt;a href="https://www.assemblyai.com/blog/turn-detection-endpointing-voice-agent" rel="noopener noreferrer"&gt;turn detection&lt;/a&gt; timing, interruption behavior—all configurable via JSON and all updateable mid-conversation without dropping the connection. Your voice agent can be genuinely unique because nothing about the API imposes a default behavior pattern.&lt;/p&gt;

&lt;p&gt;The developer experience reflects this philosophy. Most developers &lt;a href="https://www.assemblyai.com/blog/how-to-vibe-code-a-voice-agent" rel="noopener noreferrer"&gt;get a working agent running in an afternoon&lt;/a&gt;. The API reference takes about 10 minutes to read. It works natively with Claude Code—you can copy the docs, paste them in, and scaffold a working integration.&lt;/p&gt;
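
&lt;p&gt;To make "one WebSocket, JSON messages" concrete, here's a minimal connect sketch in Python. The endpoint and session.update shape match AssemblyAI's quick-start docs; the prompt text is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio, json
import websockets  # pip install websockets

async def main():
    url = "wss://agents.assemblyai.com/v1/ws"
    headers = {"Authorization": "Bearer YOUR_API_KEY"}
    async with websockets.connect(url, additional_headers=headers) as ws:
        # One configuration message; from here on it's audio in, audio out.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"system_prompt": "You are a support agent."},
        }))

asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;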

&lt;h2&gt;
  
  
  &lt;strong&gt;Scaling: concurrency limits vs. unlimited sessions&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is a practical difference that matters at production scale.&lt;/p&gt;

&lt;p&gt;ElevenLabs caps concurrent agents at 30. That's not a soft guideline—it's a hard ceiling. For a customer support operation that handles hundreds of simultaneous calls during peak hours, 30 concurrent agents isn't a limitation you can work around. It's a dealbreaker.&lt;/p&gt;

&lt;p&gt;AssemblyAI's Voice Agent API has no concurrency limits. Autoscaling is included—the infrastructure handles traffic spikes without manual intervention. Combined with flat $4.50/hr pricing (no concurrency metering), costs scale linearly and predictably.&lt;/p&gt;

&lt;p&gt;This alone disqualifies ElevenLabs for most serious production deployments. If you're choosing infrastructure you'll grow with, hitting a 30-agent ceiling six months in means a painful and expensive re-platforming.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pricing&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AssemblyAI charges $4.50/hr flat. That covers the full pipeline—&lt;a href="https://www.assemblyai.com/products/streaming-speech-to-text" rel="noopener noreferrer"&gt;speech-to-text&lt;/a&gt;, LLM reasoning, and voice generation. No per-token math, no separate charges for input vs. output, no concurrency metering. One bill.&lt;/p&gt;

&lt;p&gt;ElevenLabs' pricing is more complex and significantly higher at scale. Their per-character TTS pricing and platform fees add up quickly, especially at production volumes. For high-volume voice agent workloads, AssemblyAI's flat rate is consistently more cost-effective—and the pricing gap widens as you scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Use case fit&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Choose AssemblyAI when:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your voice agent needs to &lt;a href="https://www.assemblyai.com/blog/real-time-entity-extraction-from-audio" rel="noopener noreferrer"&gt;capture entities accurately&lt;/a&gt; (account numbers, emails, addresses, medical terms). You need API-level control over conversation behavior. You're building for production scale without concurrency limits. Cost predictability matters. You need healthcare compliance with &lt;a href="https://www.assemblyai.com/medical-mode" rel="noopener noreferrer"&gt;Medical Mode&lt;/a&gt; or &lt;a href="https://www.assemblyai.com/solutions/voice-agents" rel="noopener noreferrer"&gt;voice agent solutions&lt;/a&gt; built for regulated industries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose ElevenLabs when:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You're building a small-scale prototype that won't exceed 30 concurrent agents. You need 29+ language support. You prefer a managed platform and are willing to trade control and accuracy for faster initial setup.&lt;/p&gt;

&lt;p&gt;For production voice agents—the ones that need to hear correctly, complete tasks, and scale—AssemblyAI's combination of superior speech understanding, unlimited concurrency, flat pricing, and full API control makes it the clear choice.&lt;/p&gt;

&lt;p&gt;ElevenLabs built a strong TTS product. But voice agents aren't a TTS problem. They're a speech understanding problem, a scaling problem, and a developer control problem. On every dimension that matters for production deployment, AssemblyAI leads.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Is ElevenLabs or AssemblyAI better for building voice agents?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For production voice agents, AssemblyAI is the stronger choice. Universal-3 Pro Streaming achieves 94.07% word accuracy with a 16.7% average missed entity rate—the foundation voice agents need to complete real tasks. ElevenLabs started as a TTS company, and while their voice output sounds good, their speech understanding, scalability (capped at 30 concurrent agents), and developer control all trail AssemblyAI significantly.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is ElevenLabs' concurrency limit for Conversational AI?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;ElevenLabs caps Conversational AI at 30 concurrent agents. This is fine for development and small deployments but becomes a hard scaling constraint for production workloads. AssemblyAI's Voice Agent API has no concurrency limits—autoscaling is included, and pricing remains flat at $4.50/hr regardless of how many simultaneous sessions you run.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How does speech accuracy compare between AssemblyAI and ElevenLabs?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI's Universal-3 Pro Streaming achieves 94.07% word accuracy (6.3% mean WER) compared to ElevenLabs Scribe v2 at 93.48%. The bigger difference is in entity accuracy: Universal-3 Pro Streaming has a 16.7% average missed entity rate, with 59.64% fewer missed emails and 34.79% fewer missed phone numbers compared to competitors. For voice agents that need to capture and act on specific details, this accuracy gap directly impacts task completion rates.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How does AssemblyAI's voice output compare to ElevenLabs?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI's Voice Agent API produces natural, professional voice output that's on par with ElevenLabs for business and production applications. The TTS quality gap that existed across the industry two years ago has largely closed. ElevenLabs offers more voice variety, but for customer support, clinical workflows, and business voice agents, the voice output is not a meaningful differentiator—speech understanding accuracy is.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Which is more cost-effective for production voice agents—AssemblyAI or ElevenLabs?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI's $4.50/hr flat rate covering the full pipeline (STT, LLM, TTS) is consistently more cost-effective for production workloads than ElevenLabs' per-character TTS pricing combined with platform fees. The flat-rate model also makes budgeting straightforward—your cost is simply hours of usage times $4.50, with no concurrency metering or per-token math.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Does AssemblyAI's Voice Agent API support as many languages as ElevenLabs?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI currently supports 6 languages (English, Spanish, French, German, Italian, Portuguese) compared to ElevenLabs' 29+ languages. If your use case requires broad multilingual support beyond these 6, ElevenLabs has an advantage. For the most common production voice agent deployments in North America and Western Europe, AssemblyAI's language coverage handles the majority of use cases.&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>ai</category>
      <category>comparison</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Build a Voice Agent in 5 Minutes with AssemblyAI’s Voice Agent API</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Thu, 07 May 2026 17:36:10 +0000</pubDate>
      <link>https://forem.com/martschweiger/build-a-voice-agent-in-5-minutes-with-assemblyais-voice-agent-api-2joo</link>
      <guid>https://forem.com/martschweiger/build-a-voice-agent-in-5-minutes-with-assemblyais-voice-agent-api-2joo</guid>
      <description>&lt;p&gt;No separate STT, LLM, or TTS services to wire up. The &lt;strong&gt;AssemblyAI Voice Agent API&lt;/strong&gt; handles the entire pipeline server-side: speech recognition, the language model that decides what to say, and the voice that speaks it back. Turn detection, barge-in, and tool calling are built in.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Why one WebSocket beats a multi-service pipeline&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;A traditional voice agent needs you to wire up at least three providers — a streaming STT, an LLM, and a TTS — and orchestrate the audio routing between them yourself. Every hop adds latency, every provider adds an API key, and every glue layer adds a place for the conversation to fall apart.&lt;/p&gt;

&lt;p&gt;The Voice Agent API collapses all of that into a single connection:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Multi-service pipeline&lt;/th&gt;
&lt;th&gt;Voice Agent API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Services to wire up&lt;/td&gt;
&lt;td&gt;STT + LLM + TTS (3+ vendors)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API keys to manage&lt;/td&gt;
&lt;td&gt;3+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Round trips per turn&lt;/td&gt;
&lt;td&gt;3 (mic→STT→LLM→TTS→speaker)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Turn detection&lt;/td&gt;
&lt;td&gt;Configure separately&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Barge-in / interruption&lt;/td&gt;
&lt;td&gt;Implement yourself&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool calling&lt;/td&gt;
&lt;td&gt;Wire LLM tools manually&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The endpoint is one URL: wss://agents.assemblyai.com/v1/ws. Send PCM16 audio, get PCM16 audio back. That’s it.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Architecture&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;The system is a single Python script that opens a WebSocket to the Voice Agent API: microphone audio streams up, the agent's audio streams back, and all of the STT, LLM, and TTS work happens server-side. Turn detection is controlled by a handful of parameters under session.input.turn_detection:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vad_threshold&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.0–1.0&lt;/td&gt;
&lt;td&gt;Voice activity detection sensitivity. Lower = more sensitive to speech. Raise for noisy environments.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;min_silence&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ms&lt;/td&gt;
&lt;td&gt;Minimum silence duration before a confident end-of-turn check fires.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_silence&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ms&lt;/td&gt;
&lt;td&gt;Hard cap on silence before forcing end-of-turn. Raise for deliberate speech (healthcare, eldercare).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;interrupt_response&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;boolean&lt;/td&gt;
&lt;td&gt;Set to &lt;code&gt;False&lt;/code&gt; to disable barge-in entirely.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Prerequisites&lt;/strong&gt;
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.10+&lt;/li&gt;
&lt;li&gt;A microphone — &lt;strong&gt;headphones strongly recommended&lt;/strong&gt; (terminal apps don’t get OS-level echo cancellation, so the agent will interrupt itself if the mic picks up its own voice)&lt;/li&gt;
&lt;li&gt;An &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;AssemblyAI API key&lt;/a&gt; — free tier available&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On macOS, install PortAudio so sounddevice can access your mic:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;brew install portaudio
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;Quick start&lt;/strong&gt;
&lt;/h1&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;1. Clone and Install&lt;/strong&gt;
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; git clone https://github.com/kelsey-aai/voice-agent-5min
cd voice-agent-5min

python -m venv .venv &amp;amp;&amp;amp; source .venv/bin/activate
pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;2. Configure Your API Key&lt;/strong&gt;
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; cp .env.example .env
# Edit .env — drop in your AssemblyAI API key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;3. Run the Agent&lt;/strong&gt;
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; python agent.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Plug in your headphones, wait for Connected (session ...), and start talking. You’ll see your transcript and the agent’s replies stream to the terminal in real time.&lt;/p&gt;

&lt;p&gt;That’s the whole thing. Five minutes from clone to conversation.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;How it works&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;The full file is under 100 lines. Three pieces do the actual work.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Connect and Configure&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; URL = "wss://agents.assemblyai.com/v1/ws"
headers = {"Authorization": f"Bearer {API_KEY}"}

async with websockets.connect(URL, additional_headers=headers) as ws:
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "system_prompt": "You are a friendly voice assistant.",
            "greeting": "Hi there — what can I help you with?",
            "output": {"voice": "ivy"},
        },
    }))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;session.update is the first message you send. It sets the agent’s personality (system_prompt), what it says when the user picks up (greeting), and which voice it speaks in (voice: "ivy"). Every field is optional — you can update any of them mid-conversation by sending another session.update.&lt;/p&gt;
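
&lt;p&gt;For example, later in the same session you could send another update without reconnecting (a minimal sketch; the prompt text is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Later, inside the same async with block: change the personality
# mid-conversation without reconnecting.
await ws.send(json.dumps({
    "type": "session.update",
    "session": {"system_prompt": "Keep answers to one sentence."},
}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;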

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Stream Microphone Audio&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; def mic_callback(indata, *_):
    if session_ready.is_set():
        loop.call_soon_threadsafe(mic_queue.put_nowait, bytes(indata))

async def send_audio() -&amp;gt; None:
    while True:
        chunk = await mic_queue.get()
        await ws.send(json.dumps({
            "type": "input.audio",
            "audio": base64.b64encode(chunk).decode(),
        }))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;sounddevice calls mic_callback on its own thread every 50 ms with a fresh PCM16 chunk. We hand it off to the asyncio event loop, base64-encode it, and ship it as an input.audio event. The session_ready gate keeps us from sending audio before the server says it’s ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Play the Agent’s Response&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; elif t == "reply.audio":
    pcm = np.frombuffer(base64.b64decode(event["data"]), dtype=np.int16)
    speaker.write(pcm)

elif t == "reply.done" and event.get("status") == "interrupted":
    speaker.abort()
    speaker.start()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The server streams reply.audio chunks as the LLM generates the response — you don’t wait for the full reply to start playing. speaker.write() copies samples into the OS audio buffer; the hardware drains them at 24 kHz on its own clock.&lt;/p&gt;

&lt;p&gt;When the user interrupts the agent mid-reply (barge-in), the server emits reply.done with status: "interrupted". We flush the speaker buffer with abort() then start() so the user doesn’t hear stale audio.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;What you get for free&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;These are all handled by the API — you don’t write any code for them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Neural turn detection.&lt;/strong&gt; The server decides when the user has finished speaking using both acoustic and linguistic signals, so it knows the difference between a thinking pause and an actual end-of-turn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Barge-in.&lt;/strong&gt; When the user speaks over the agent, the server stops generating, sends reply.done with status: "interrupted", and trims the agent transcript to what was actually spoken.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time partial transcripts.&lt;/strong&gt; transcript.user.delta events stream as the user talks, so you can show what they’re saying live.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final transcripts for both speakers.&lt;/strong&gt; transcript.user and transcript.agent events arrive after each turn — perfect for logging or displaying chat history (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
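
&lt;p&gt;Here's a sketch of what handling those transcript events can look like, in the same elif chain as the audio handlers above. The event names come from the list above; the "text" payload field is an assumption, so verify the exact shape against the events reference:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: log live and final transcripts in the same event loop as the
# audio handlers. Assumes the text arrives in event["text"]; verify the
# exact payload shape against the events reference.
elif t == "transcript.user.delta":
    print(f'\ruser: {event["text"]}', end="", flush=True)  # live partial
elif t == "transcript.user":
    print(f'\nuser: {event["text"]}')   # final user turn
elif t == "transcript.agent":
    print(f'agent: {event["text"]}')    # final agent turn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;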

&lt;h1&gt;
  
  
  &lt;strong&gt;Tuning the agent&lt;/strong&gt;
&lt;/h1&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pick a different voice&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Eighteen English voices and 16 multilingual voices are available. Drop any voice ID into session.output.voice:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"output": {"voice": "james"}    # conversational US male
"output": {"voice": "sophie"}   # clear UK female
"output": {"voice": "diego"}    # Latin American Spanish
"output": {"voice": "arjun"}    # Hindi/Hinglish
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;See the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/voices" rel="noopener noreferrer"&gt;Voices catalog&lt;/a&gt; for samples of each voice. Multilingual voices code-switch with English automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Adjust turn detection&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Default settings work well for most apps. Override anything you want under session.input.turn_detection:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"input": {
    "turn_detection": {
        "vad_threshold": 0.5,        # 0.0–1.0; lower = more sensitive
        "min_silence": 600,          # ms; min silence before end-of-turn
        "max_silence": 1500,         # ms; max silence before forcing end-of-turn
        "interrupt_response": True,  # set False to disable barge-in
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For noisy environments, raise vad_threshold. For deliberate speech (healthcare, eldercare), raise max_silence. Settings can be updated mid-session.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Boost domain-specific terms&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If your conversation involves rare words — product names, medical terms, customer names — add them to session.input.keyterms to bias the speech recognition model toward them:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"input": {&lt;br&gt;
    "keyterms": ["Ozempic", "Salesforce", "AssemblyAI"]&lt;br&gt;
}&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;Common issues&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;The agent keeps interrupting itself.&lt;/strong&gt; Your microphone is picking up the agent’s TTS output through the speakers. Use headphones, or switch to a &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/browser-integration" rel="noopener noreferrer"&gt;browser-based client&lt;/a&gt; which gets free echo cancellation from getUserMedia.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audio sounds garbled or pitched up/down.&lt;/strong&gt; The Voice Agent API uses 24 kHz by default. Make sure both sd.InputStream and sd.OutputStream are set to samplerate=24000.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UNAUTHORIZED close on connect.&lt;/strong&gt; Your API key is missing, expired, or wrong. Grab a fresh one from the &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;AssemblyAI dashboard&lt;/a&gt; and double-check .env.&lt;/p&gt;

&lt;p&gt;The full troubleshooting guide is in the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/troubleshooting" rel="noopener noreferrer"&gt;Voice Agent API docs&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Where to go next&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Once you’ve got the basic agent talking, layer in capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/tool-calling" rel="noopener noreferrer"&gt;Add tools&lt;/a&gt; — give the agent the ability to look up information, hit APIs, or trigger workflows.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/browser-integration" rel="noopener noreferrer"&gt;Move to the browser&lt;/a&gt; — generate temporary tokens server-side so users can talk to the agent from a webpage with built-in echo cancellation.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/events-reference#sessionresume" rel="noopener noreferrer"&gt;Resume sessions&lt;/a&gt; — preserve conversation context across dropped connections.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h1&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is AssemblyAI’s Voice Agent API?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI’s Voice Agent API is a single WebSocket endpoint that handles the entire voice agent pipeline server-side — speech-to-text, LLM reasoning, and text-to-speech — so you can build a conversational voice agent without wiring up separate STT, LLM, and TTS providers. It includes neural turn detection, barge-in, tool calling, and 30+ voices out of the box.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How is it different from streaming speech-to-text?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Streaming speech-to-text only gives you transcripts — you still have to send those transcripts to an LLM, take the LLM’s response, and send it through a TTS service before playing it back. The Voice Agent API does all of that for you: you stream microphone audio in and you get the agent’s spoken audio back.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do I authenticate?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Pass your AssemblyAI API key as a Bearer token in the Authorization header during the WebSocket upgrade. For browser apps where you can’t expose your API key, generate a short-lived &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/browser-integration" rel="noopener noreferrer"&gt;temporary token&lt;/a&gt; on your server and pass it as a ?token= query parameter instead. Each token is single-use.&lt;/p&gt;
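
&lt;p&gt;A sketch of both paths in Python (the header form matches the quick-start code above; connect_with_temp_token assumes the ?token= pattern described here):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import websockets  # pip install websockets

URL = "wss://agents.assemblyai.com/v1/ws"

async def connect_server_side(api_key: str):
    # Server-side: API key as a Bearer token in the upgrade headers.
    return await websockets.connect(
        URL, additional_headers={"Authorization": f"Bearer {api_key}"}
    )

async def connect_with_temp_token(temp_token: str):
    # Browser-style: a short-lived, single-use token minted on your server,
    # passed as ?token= so the real API key never reaches the client.
    return await websockets.connect(f"{URL}?token={temp_token}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;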

&lt;h3&gt;
  
  
  &lt;strong&gt;What audio format does it expect?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;By default, audio/pcm — 16-bit signed little-endian PCM at 24,000 Hz, mono, base64-encoded. For telephony integrations (like Twilio), you can switch to audio/pcmu (G.711 μ-law, 8 kHz) or audio/pcma (G.711 A-law, 8 kHz) under session.input.format and session.output.format.&lt;/p&gt;
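
&lt;p&gt;A minimal sketch of that configuration (the format fields mirror the telephony setup described here; the voice choice is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

# G.711 μ-law at 8 kHz on both legs: what telephony providers like Twilio
# deliver natively, so no resampling is needed anywhere in between.
telephony_update = {
    "type": "session.update",
    "session": {
        "input":  {"format": {"encoding": "audio/pcmu"}},
        "output": {"voice": "ivy", "format": {"encoding": "audio/pcmu"}},
    },
}
print(json.dumps(telephony_update))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;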

&lt;h3&gt;
  
  
  &lt;strong&gt;Can it call tools or functions?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Yes. Register tool definitions in session.tools on a session.update event. When the agent decides to call a tool, the server emits a tool.call event. Execute the tool in your client code, then send back a tool.result event once the next reply.done arrives. See the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/tool-calling" rel="noopener noreferrer"&gt;tool calling guide&lt;/a&gt; for the full pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why does my voice agent keep interrupting itself?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The most common cause is acoustic echo: the microphone picks up the agent’s TTS output through your speakers. The fix is either headphones or moving the client into a browser, where getUserMedia({ audio: { echoCancellation: true } }) gives you OS-level acoustic echo cancellation for free.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How much does it cost?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI offers a free tier so you can build and test without a credit card. For current pricing, see the &lt;a href="https://www.assemblyai.com/pricing" rel="noopener noreferrer"&gt;AssemblyAI pricing page&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>ai</category>
      <category>webdev</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Build a voice agent that can make outbound calls with AssemblyAI</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Thu, 07 May 2026 17:35:26 +0000</pubDate>
      <link>https://forem.com/martschweiger/build-a-voice-agent-that-can-make-outbound-calls-with-assemblyai-4p9a</link>
      <guid>https://forem.com/martschweiger/build-a-voice-agent-that-can-make-outbound-calls-with-assemblyai-4p9a</guid>
      <description>&lt;h1&gt;
  
  
  &lt;strong&gt;Why outbound voice agents matter&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;A voice agent that can dial out, not just answer, unlocks workflows that text channels drop the ball on:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;What the agent does&lt;/th&gt;
&lt;th&gt;Why outbound beats inbound&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Appointment reminders&lt;/td&gt;
&lt;td&gt;Calls the patient 24 hours before, confirms or reschedules&lt;/td&gt;
&lt;td&gt;Reaches people who never read the SMS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lead qualification&lt;/td&gt;
&lt;td&gt;Calls a fresh inbound lead, qualifies, books with sales&lt;/td&gt;
&lt;td&gt;Engages while interest is still hot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Survey + NPS&lt;/td&gt;
&lt;td&gt;Reads the prompt, captures freeform answers&lt;/td&gt;
&lt;td&gt;Higher response rate than email&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Past-due collections&lt;/td&gt;
&lt;td&gt;Calls account, takes payment via tool call&lt;/td&gt;
&lt;td&gt;Lower agent cost than a human dialer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recall and renewal&lt;/td&gt;
&lt;td&gt;Notifies of a recall, prescription refill, or expiring policy&lt;/td&gt;
&lt;td&gt;Cuts through inbox noise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Customer winback&lt;/td&gt;
&lt;td&gt;Reaches lapsed customers with a personalized offer&lt;/td&gt;
&lt;td&gt;More personal than a marketing email&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In every case the win is the same: the agent reaches the customer through the channel they actually pick up, holds a real conversation, and writes the outcome to your system of record.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Architecture&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;The system has three components connected by two WebSockets: Twilio (which owns the phone leg), your Node.js bridge server in the middle, and the Voice Agent API. Twilio streams call audio to your server over one WebSocket, your server relays it to AssemblyAI over the other, and the agent's audio flows back the same way. The familiar turn-detection parameters apply on the phone as well:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vad_threshold&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.0–1.0&lt;/td&gt;
&lt;td&gt;Voice activity detection sensitivity. Raise for noisy phone lines.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;min_silence&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ms&lt;/td&gt;
&lt;td&gt;Minimum silence before the end-of-turn check fires. Raise for deliberate speech.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_silence&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ms&lt;/td&gt;
&lt;td&gt;Hard cap on silence before forcing end-of-turn.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;interrupt_response&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;boolean&lt;/td&gt;
&lt;td&gt;Set to &lt;code&gt;false&lt;/code&gt; to disable barge-in entirely.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: both legs use audio/pcmu (G.711 μ-law at 8 kHz). Twilio Media Streams already deliver base64-encoded μ-law audio, and the Voice Agent API accepts and emits the same format natively. That means zero resampling — bytes pass through end-to-end.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Prerequisites&lt;/strong&gt;
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Node.js 18+ and npm&lt;/li&gt;
&lt;li&gt;An &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;AssemblyAI API key&lt;/a&gt; — free tier available&lt;/li&gt;
&lt;li&gt;A &lt;a href="https://console.twilio.com" rel="noopener noreferrer"&gt;Twilio account&lt;/a&gt; plus a voice-capable phone number in your console&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ngrok.com/download" rel="noopener noreferrer"&gt;ngrok&lt;/a&gt; (or any public HTTPS tunnel) so Twilio can reach your dev machine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Consent matters.&lt;/strong&gt; Automated outbound calls are regulated almost everywhere — TCPA in the US, the various state DNC registries, GDPR in the EU, two-party-consent rules for recording, and more. Disclose that the call is automated in the opener, honor “remove me from the list” requests, and consult counsel before dialing real prospects.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Quick start&lt;/strong&gt;
&lt;/h1&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Clone and install&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; git clone https://github.com/kelsey-aai/voice-agent-outbound-calls
cd voice-agent-outbound-calls
npm install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;2. Configure your environment&lt;/strong&gt;
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; cp .env.example .env
# Fill in:
#   ASSEMBLYAI_API_KEY     — from the AssemblyAI dashboard
#   TWILIO_ACCOUNT_SID     — from console.twilio.com
#   TWILIO_AUTH_TOKEN      — from console.twilio.com
#   TWILIO_FROM_NUMBER     — your Twilio voice number, e.g. +15551234567
#   PUBLIC_URL             — leave blank for now; we'll fill it after ngrok
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Run the server&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; npm start
# → Listening on http://localhost:3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;4. Expose it with ngrok&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In a second terminal:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ngrok http 3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Copy the HTTPS forwarding URL (e.g. &lt;a href="https://ab12cd34.ngrok-free.app" rel="noopener noreferrer"&gt;https://ab12cd34.ngrok-free.app&lt;/a&gt;) and paste it into .env as PUBLIC_URL. Restart npm start.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Place a call&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; curl -X POST http://localhost:3000/call \
  -H 'content-type: application/json' \
  -d '{"to":"+15551234567"}''
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Use your own phone number for the first call so you become the prospect. The phone rings, the agent greets you with the disclosure, and you can talk to it like a human.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;How it works&lt;/strong&gt;
&lt;/h1&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Place the call&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;POST /call receives a JSON body with the target number and asks Twilio to dial it. Twilio's Calls API does the actual dialing and, when the recipient picks up, fetches the URL we passed as url: to get TwiML instructions for the call.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const call = await twilioClient.calls.create({
  to,
  from: TWILIO_FROM_NUMBER,
  url: `${PUBLIC_URL}/twiml`,
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;2. Return TwiML that opens a media stream&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When Twilio fetches /twiml, the server returns a tiny piece of XML that wraps the live call in a &amp;lt;Connect&amp;gt;&amp;lt;Stream&amp;gt; verb. That verb tells Twilio to open a WebSocket back to our server and pipe the call audio over it.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;app.post("/twiml", (_req, res) =&amp;gt; {
  const wsUrl = PUBLIC_URL.replace(/^http/, "ws") + "/twilio-stream";
  res.type("text/xml").send(`&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;
&amp;lt;Response&amp;gt;
  &amp;lt;Connect&amp;gt;
    &amp;lt;Stream url="${wsUrl}" /&amp;gt;
  &amp;lt;/Connect&amp;gt;
&amp;lt;/Response&amp;gt;`);
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;3. Bridge two WebSockets&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When Twilio connects to /twilio-stream, we open a second WebSocket to AssemblyAI and shuttle messages between them. The first message we send to AssemblyAI is session.update — it configures the agent's personality, voice, and audio formats.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aaiWs.send(JSON.stringify({
  type: "session.update",
  session: {
    system_prompt: SYSTEM_PROMPT,
    greeting: GREETING,
    input:  { format: { encoding: "audio/pcmu" } },
    output: { voice: "ivy", format: { encoding: "audio/pcmu" } },
  },
}));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Both formats are audio/pcmu.&lt;/strong&gt; Twilio Media Streams already deliver base64-encoded μ-law 8 kHz audio. AssemblyAI accepts that format natively and can emit it back, which means we never decode, resample, or re-encode any audio in this server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Greeting is set in session.update.&lt;/strong&gt; Outbound calls need the agent to speak first — the prospect has no idea why their phone is ringing. Setting session.greeting makes the agent open the conversation as soon as the session is ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. Forward audio in both directions&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The Twilio side emits connected, start, media, and stop events. We capture streamSid from start, forward media payloads to AssemblyAI as input.audio events, and close the AAI socket on stop.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;case "media": {
  const payload = msg.media.payload;  // already base64 μ-law 8 kHz
  aaiWs.send(JSON.stringify({ type: "input.audio", audio: payload }));
  break;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each reply.audio chunk from AssemblyAI is base64 μ-law that we wrap in a Twilio media event and ship straight back to the call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;case "reply.audio":
  twilioWs.send(JSON.stringify({
    event: "media",
    streamSid,
    media: { payload: evt.data },
  }));
  break;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;5. Handle barge-in cleanly&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When the user speaks while the agent is talking, AssemblyAI emits reply.done with status: "interrupted". On a phone call we also need to flush whatever audio Twilio still has buffered. Twilio supports a clear event for exactly this:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;case "reply.done":
  if (evt.status === "interrupted" &amp;amp;&amp;amp; streamSid) {
    twilioWs.send(JSON.stringify({ event: "clear", streamSid }));
  }
  break;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;6. Echo cancellation is the carrier's job&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;On a phone call you don't have to think about acoustic echo cancellation — the carrier and the handset handle it. That's a meaningful difference from terminal-based clients, which need headphones to keep the agent from interrupting itself.&lt;/p&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;Tuning the agent&lt;/strong&gt;
&lt;/h1&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Voice&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Drop any voice ID from the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/voices" rel="noopener noreferrer"&gt;Voices catalog&lt;/a&gt; into session.output.voice. There are 18 English voices and 16 multilingual ones; multilingual voices code-switch with English automatically.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output: { voice: "james",  format: { encoding: "audio/pcmu" } }
output: { voice: "sophie", format: { encoding: "audio/pcmu" } }
output: { voice: "diego",  format: { encoding: "audio/pcmu" } }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;System prompt and greeting&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Both live near the top of server.js. Keep them short — phone-call replies should be one or two sentences. Always disclose that the call is automated in the first sentence; several US states require it.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Turn detection&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Outbound calls often run on noisier lines than browser-based agents. The defaults in server.js are tuned a little tighter:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;turn_detection: {
  vad_threshold: 0.5,        // 0.0–1.0; raise for noisy lines
  min_silence: 400,          // ms; raise for deliberate speech
  max_silence: 1200,         // ms; max wait before forcing end-of-turn
  interrupt_response: true,  // false to disable barge-in
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;See the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/session-configuration" rel="noopener noreferrer"&gt;session configuration reference&lt;/a&gt; for every available knob.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Recording, machine detection, and time limits&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Twilio's Calls API takes optional flags that you almost certainly want in production:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const call = await twilioClient.calls.create({
  to,
  from: TWILIO_FROM_NUMBER,
  url: `${PUBLIC_URL}/twiml`,
  record: true,
  machineDetection: "Enable",
  timeLimit: 600,  // hard cap in seconds
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;record: true saves the call to Twilio's media store. machineDetection: "Enable" lets you branch on voicemail vs. live human. timeLimit puts a ceiling on a single call so a stuck LLM can't burn budget.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Tools (Function Calling)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once the conversation works, add tools to let the agent do things — book a meeting, look up an account, mark the lead as DNC. Tools register on the same session.update you already send. The full pattern is covered in the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/tool-calling" rel="noopener noreferrer"&gt;tool-calling guide&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Troubleshooting&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;The phone rings but the call drops immediately.&lt;/strong&gt; Check the Twilio console call log. Most often it's a TwiML fetch failure — Twilio couldn't reach PUBLIC_URL/twiml because ngrok died, the URL still says localhost, or the protocol is http:// instead of https://.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Twilio connects but the agent never speaks.&lt;/strong&gt; Look for [aai] session.ready in your server logs. If you see UNAUTHORIZED, your AssemblyAI key is wrong. If you see no AAI logs at all, your environment variables aren't loaded — confirm .env is next to server.js.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The agent's voice sounds chipmunky or muffled.&lt;/strong&gt; Both session.input.format.encoding and session.output.format.encoding must be audio/pcmu. If either is left at the default audio/pcm (24 kHz), the formats won't match and Twilio will play the audio at the wrong rate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The agent keeps talking over me after I interrupt.&lt;/strong&gt; Make sure you forward the clear event to Twilio when you receive reply.done with status: "interrupted". Without it, Twilio plays out the rest of its buffered audio.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Twilio trial accounts only call verified numbers.&lt;/strong&gt; That's a Twilio limitation, not a bug in this code. Verify the recipient number in the Twilio console, or upgrade the account.&lt;/p&gt;

&lt;p&gt;The full troubleshooting guide is in the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/troubleshooting" rel="noopener noreferrer"&gt;Voice Agent API docs&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h1&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do I make a voice agent that places outbound phone calls?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Use Twilio's Calls API to dial the target number and pass it TwiML that opens a &amp;lt;Connect&amp;gt;&amp;lt;Stream&amp;gt; back to your server. On your server, accept the resulting Media Streams WebSocket and bridge it to AssemblyAI's Voice Agent API. Configure both session.input.format.encoding and session.output.format.encoding as audio/pcmu so Twilio's μ-law 8 kHz audio passes through without resampling.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What audio format should I use for a Twilio voice agent?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Use audio/pcmu (G.711 μ-law, 8 kHz) on both the input and output of the Voice Agent API. Twilio Media Streams emit base64-encoded μ-law 8 kHz audio natively, and the Voice Agent API accepts and emits the same format. That means no decoding, no resampling, and no re-encoding.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How does the Voice Agent API handle barge-in over a phone call?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When the user speaks while the agent is talking, the Voice Agent API emits reply.done with status: "interrupted". On a Twilio call you also need to flush Twilio's outbound buffer by sending {event: "clear", streamSid} over the Media Streams WebSocket.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Do I need separate STT, LLM, and TTS for an outbound voice agent?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;No. The AssemblyAI Voice Agent API bundles speech recognition, the language model, and text-to-speech behind a single WebSocket. You stream telephony audio in and get the agent's spoken audio back, with neural turn detection, barge-in, and tool calling built in.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do I authenticate from a Node.js server?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Pass your AssemblyAI API key as a Bearer token in the Authorization HTTP header on the WebSocket upgrade request: new WebSocket(url, { headers: { Authorization: "Bearer YOUR_API_KEY" } }).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Is it legal to call prospects with an AI voice agent?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;It depends on jurisdiction and use case. In the US, the TCPA and state DNC registries restrict automated calls. Several states require AI disclosure in the opener. The EU's GDPR and ePrivacy rules add their own requirements. Disclose that the call is automated, honor opt-out requests, and consult counsel before dialing real prospects.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How much does it cost?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI offers a free tier so you can prototype without a credit card. For current pricing, see the &lt;a href="https://www.assemblyai.com/pricing" rel="noopener noreferrer"&gt;AssemblyAI pricing page&lt;/a&gt;. Twilio bills separately for outbound minutes and the phone number itself.&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>ai</category>
      <category>telephony</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Build a voice agent with LiveKit and AssemblyAI’s Voice Agent API</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Thu, 07 May 2026 17:35:18 +0000</pubDate>
      <link>https://forem.com/martschweiger/build-a-voice-agent-with-livekit-and-assemblyais-voice-agent-api-3mnm</link>
      <guid>https://forem.com/martschweiger/build-a-voice-agent-with-livekit-and-assemblyais-voice-agent-api-3mnm</guid>
      <description>&lt;h1&gt;
  
  
  &lt;strong&gt;Why combine LiveKit and the Voice Agent API&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;WebRTC and AI are different problems with different best-in-class solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LiveKit&lt;/strong&gt; is the easiest way to ship production-grade real-time audio. SDKs for Web, iOS, Android, React Native, Flutter, and Unity. Built-in recording, simulcast, adaptive bitrate, and end-to-end encryption. A managed cloud and a self-hostable open-source server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AssemblyAI’s Voice Agent API&lt;/strong&gt; is the easiest way to ship a voice agent. One WebSocket gives you Universal-3 Pro Streaming for speech-to-text, an LLM, a TTS engine with 30+ voices, plus neural turn detection, barge-in, and tool calling — all server-side.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use them together and you get multi-user voice rooms with a real AI agent inside, without writing a STT/LLM/TTS orchestration layer or building your own WebRTC stack.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;How this differs from the LiveKit Agents framework&lt;/strong&gt;
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;LiveKit Agents framework&lt;/th&gt;
&lt;th&gt;This tutorial (Voice Agent API + LiveKit transport)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Where the AI lives&lt;/td&gt;
&lt;td&gt;You configure STT, LLM, and TTS plugins separately&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Services to wire up&lt;/td&gt;
&lt;td&gt;3+ (one per plugin)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API keys to manage&lt;/td&gt;
&lt;td&gt;3+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Turn detection&lt;/td&gt;
&lt;td&gt;Plugin-dependent; configure VAD + endpointing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Barge-in&lt;/td&gt;
&lt;td&gt;Framework handles it across plugins&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool calling&lt;/td&gt;
&lt;td&gt;LLM-plugin-specific&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What LiveKit does&lt;/td&gt;
&lt;td&gt;Transport + agent runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Architecture&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;The system has four layers: the browser or mobile client, the LiveKit server (cloud or self-hosted), the Python worker that joins the room as a server-side participant, and the Voice Agent API. On the API side, turn detection is governed by four session parameters:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vad_threshold&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.0–1.0&lt;/td&gt;
&lt;td&gt;Voice activity detection sensitivity. Higher = ignore more background noise.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;min_silence&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ms&lt;/td&gt;
&lt;td&gt;Minimum silence before a confident end-of-turn. Drop to 300 for fast-paced conversation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_silence&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ms&lt;/td&gt;
&lt;td&gt;Hard ceiling on silence before forcing end-of-turn. Raise to 2500 for deliberate speech (eldercare, healthcare).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;interrupt_response&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;boolean&lt;/td&gt;
&lt;td&gt;Set to &lt;code&gt;False&lt;/code&gt; to disable barge-in entirely.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Audio flows at 24 kHz mono PCM16 between the worker and the Voice Agent API. LiveKit’s native FFI resampler handles the conversion between WebRTC’s internal 48 kHz and the 24 kHz the API expects.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Prerequisites&lt;/strong&gt;
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Python 3.10+&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;An &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;AssemblyAI API key&lt;/a&gt; — free tier available, no credit card&lt;/li&gt;
&lt;li&gt;A LiveKit server — either a free &lt;a href="https://cloud.livekit.io" rel="noopener noreferrer"&gt;LiveKit Cloud project&lt;/a&gt; or a self-hosted livekit-server&lt;/li&gt;
&lt;li&gt;A LiveKit client to talk to the agent — the fastest path is the hosted &lt;a href="https://agents-playground.livekit.io" rel="noopener noreferrer"&gt;LiveKit Agents Playground&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don’t need a microphone or speakers on the worker machine — the worker is a server-side participant. All audio I/O happens in the browser/mobile client.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Quick start&lt;/strong&gt;
&lt;/h1&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Clone and Install&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; git clone https://github.com/kelsey-aai/voice-agent-livekit
cd voice-agent-livekit

python -m venv .venv &amp;amp;&amp;amp; source .venv/bin/activate
pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;2. Configure Environment&lt;/strong&gt;
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; cp .env.example .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Fill in .env:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ASSEMBLYAI_API_KEY=           # https://www.assemblyai.com/dashboard/signup
LIVEKIT_URL=wss://&amp;lt;project&amp;gt;.livekit.cloud
LIVEKIT_API_KEY=              # LiveKit Cloud → Settings → Keys
LIVEKIT_API_SECRET=
ROOM_NAME=voice-agent-demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For self-hosted LiveKit, run livekit-server --dev and use LIVEKIT_URL=ws://localhost:7880.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Run the Worker&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; python worker.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;4. Connect a Client&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The fastest way is the &lt;a href="https://agents-playground.livekit.io" rel="noopener noreferrer"&gt;LiveKit Agents Playground&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open the playground.&lt;/li&gt;
&lt;li&gt;Paste your LIVEKIT_URL and a token. Generate the token from the LiveKit Cloud dashboard, setting the room to voice-agent-demo and the identity to anything other than voice-agent.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Connect&lt;/strong&gt;, allow microphone access, and start talking.&lt;/li&gt;
&lt;/ol&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;How it works&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;The worker is one file (worker.py) and roughly 250 lines. Six steps do the actual work.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;1. Mint a LiveKit Token and Join the Room&lt;/strong&gt;
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; from livekit import api, rtc

token = (
    api.AccessToken(LIVEKIT_API_KEY, LIVEKIT_API_SECRET)
    .with_identity("voice-agent")
    .with_grants(api.VideoGrants(
        room_join=True, room=ROOM_NAME,
        can_publish=True, can_subscribe=True,
    ))
    .to_jwt()
)

room = rtc.Room()
await room.connect(LIVEKIT_URL, token)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;AccessToken builds a signed JWT with the grants the worker needs: subscribe to incoming audio, publish a reply track. room.connect() opens the WebRTC signaling and media path.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Publish a Local Audio Track for the Agent’s Voice&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; audio_source = rtc.AudioSource(sample_rate=24_000, num_channels=1)
local_track = rtc.LocalAudioTrack.create_audio_track("agent-voice", audio_source)

await room.local_participant.publish_track(
    local_track,
    rtc.TrackPublishOptions(source=rtc.TrackSource.SOURCE_MICROPHONE),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;AudioSource is LiveKit’s pump for sending audio into a room. We configure it at 24 kHz mono — the Voice Agent API’s default format — so reply audio goes straight in without resampling.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Subscribe to the User’s Audio Track&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; @room.on("track_subscribed")
def on_track_subscribed(track, publication, participant):
    if track.kind == rtc.TrackKind.KIND_AUDIO:
        asyncio.create_task(bridge_to_voice_agent(track))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;LiveKit emits track_subscribed when a remote participant publishes a track and it gets routed to us. We only care about audio.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. Forward Microphone Audio to the Voice Agent API&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; stream = rtc.AudioStream.from_track(
    track=mic_track,
    sample_rate=24_000,    # ask LiveKit to resample to 24 kHz
    num_channels=1,
)

async for event in stream:
    pcm16_bytes = bytes(event.frame.data)
    await ws.send(json.dumps({
        "type": "input.audio",
        "audio": base64.b64encode(pcm16_bytes).decode("ascii"),
    }))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;AudioStream does the resampling. WebRTC carries audio at 48 kHz internally, but we ask for 24 kHz mono and the LiveKit FFI resampler handles the conversion. Each AudioFrame exposes data as a memoryview of int16 samples — base64-encode and ship as input.audio.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Play the Agent’s Reply Back into the Room&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; elif t == "reply.audio":
    pcm = base64.b64decode(event["data"])
    samples = len(pcm) // 2  # 2 bytes per int16, mono
    frame = rtc.AudioFrame(
        data=pcm,
        sample_rate=24_000,
        num_channels=1,
        samples_per_channel=samples,
    )
    await audio_source.capture_frame(frame)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The agent streams reply.audio events as soon as the LLM begins generating. Each chunk is wrapped in an AudioFrame and pushed into the AudioSource, which queues it up to 1 second deep and drains at 24 kHz on its own clock.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;6. Handle Barge-In&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; elif t == "input.speech.started":
    # User started talking; stop playback.
    audio_source.clear_queue()

elif t == "reply.done":
    if event.get("status") == "interrupted":
        audio_source.clear_queue()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;AudioSource.clear_queue() immediately discards every queued frame so the user doesn’t hear stale agent audio after they’ve spoken over it.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Tuning the agent&lt;/strong&gt;
&lt;/h1&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pick a Voice&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; "output": {"voice": "james"}     # conversational US male
"output": {"voice": "sophie"}    # clear UK female
"output": {"voice": "diego"}     # Latin American Spanish
"output": {"voice": "arjun"}     # Hindi/Hinglish
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;See the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/voices" rel="noopener noreferrer"&gt;Voices catalog&lt;/a&gt; for samples. Multilingual voices code-switch automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Adjust the System Prompt and Greeting&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; "session": {
    "system_prompt": (
        "You are a customer support agent for Acme. Speak in 1–2 short "
        "sentences. Confirm the user's question before answering."
    ),
    "greeting": "Hi, this is Acme support — what's going on?",
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can re-send session.update mid-conversation to swap the prompt or voice. greeting is locked once spoken, but system_prompt and voice are not.&lt;/p&gt;
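&lt;p&gt;As a minimal sketch, a mid-session update over the worker's existing WebSocket might look like this (assuming fields omitted from session.update keep their current values — an assumption worth verifying against the session configuration reference):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Swap the prompt and voice mid-conversation on the same socket.
await ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "system_prompt": "You are now an escalation specialist. Apologize once, then fix the issue.",
        "output": {"voice": "sophie"},
    },
}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;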

&lt;h2&gt;
  
  
  &lt;strong&gt;Tune Turn Detection&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; "input": {
    "turn_detection": {
        "vad_threshold": 0.5,        # 0.0–1.0; higher = ignore more noise
        "min_silence": 600,          # ms before confident end-of-turn
        "max_silence": 1500,         # ms hard ceiling
        "interrupt_response": True,  # set False to disable barge-in
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For deliberate speech (eldercare, healthcare), raise max_silence to 2500. For fast-paced conversation, drop min_silence to 300.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Boost domain-specific terms&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If your conversation includes product names, medical terms, or rare proper nouns, add them to session.input.keyterms:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"input": { "keyterms": ["Universal-3 Pro Streaming", "AssemblyAI", "LiveKit"] }&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Multiple participants in one room&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This worker bridges one remote audio track to the Voice Agent API. Two ways to scale:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;One agent per room.&lt;/strong&gt; Spin up a separate worker process per room. Best for 1-on-1 use cases like phone-style support agents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mix participants before sending.&lt;/strong&gt; If you want a meeting-style multi-talker agent, mix all remote audio with rtc.AudioMixer and send the mix to one Voice Agent API session (a minimal mixing sketch follows this list).&lt;/li&gt;
&lt;/ol&gt;
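
&lt;p&gt;To make option 2 concrete, here is a hand-rolled stand-in for the mixing step: summing equal-length mono PCM16 buffers with numpy and clipping. It sketches the idea only; rtc.AudioMixer's actual API may differ.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def mix_pcm16(frames: list[bytes]) -&gt; bytes:
    """Sum several equal-length mono PCM16 buffers into one, with clipping."""
    if not frames:
        return b""
    acc = np.zeros(len(frames[0]) // 2, dtype=np.int32)  # int32 to avoid overflow
    for f in frames:
        acc += np.frombuffer(f, dtype=np.int16)
    return np.clip(acc, -32768, 32767).astype(np.int16).tobytes()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The mixed buffer then goes through the same input.audio path as step 4 above.&lt;/p&gt;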

&lt;h1&gt;
  
  
  &lt;strong&gt;Troubleshooting&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;The worker connects but the client never hears the agent.&lt;/strong&gt; Make sure your client subscribed to the agent’s track. Confirm can_subscribe=True on the client’s token.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UNAUTHORIZED close on the AssemblyAI WebSocket.&lt;/strong&gt; Your ASSEMBLYAI_API_KEY is missing, expired, or pasted with whitespace. Grab a fresh key from the &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;AssemblyAI dashboard&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LiveKit ConnectError: invalid token.&lt;/strong&gt; The JWT signature didn’t validate against the LIVEKIT_API_SECRET. Check that the URL, key, and secret all come from the same LiveKit project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audio is choppy or robotic.&lt;/strong&gt; Almost always the audio buffer running dry. Run the worker close to your network egress. Inside AudioSource(... queue_size_ms=1000) you have one second of headroom; raise it to 2000 if you see transient stalls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audio sounds pitched up or down.&lt;/strong&gt; Sample-rate mismatch. Both AudioSource and AudioStream.from_track must be configured at sample_rate=24_000, num_channels=1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent keeps interrupting itself.&lt;/strong&gt; Browser clients with getUserMedia({ audio: { echoCancellation: true } }) handle this automatically. On custom mobile clients, make sure AEC is enabled on the capture side.&lt;/p&gt;

&lt;p&gt;The full troubleshooting guide is in the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/troubleshooting" rel="noopener noreferrer"&gt;Voice Agent API docs&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h1&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is AssemblyAI’s Voice Agent API?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A single WebSocket endpoint that handles the full voice agent pipeline server-side: speech-to-text via Universal-3 Pro Streaming, an LLM, and a TTS engine with 30+ voices. It includes neural turn detection, barge-in, and tool calling out of the box.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why use LiveKit with the Voice Agent API instead of going direct?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;LiveKit handles real-time audio transport (WebRTC, mobile and browser SDKs, recording, scaling, and global edge distribution). The Voice Agent API handles the AI. Combining them gives you multi-user voice rooms, mobile clients, and recording without building a WebRTC stack.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Is this the LiveKit Agents framework?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;No. The LiveKit Agents framework expects separate STT, LLM, and TTS plugins. This tutorial uses livekit-rtc directly to join a room as a server-side participant, then forwards audio to the Voice Agent API, which replaces all three.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What audio format does the Voice Agent API expect?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;By default, audio/pcm — 16-bit signed little-endian PCM at 24,000 Hz, mono, base64-encoded. This worker configures both LiveKit AudioStream and AudioSource at 24 kHz mono so no manual resampling is needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Can the Voice Agent API call tools from inside a LiveKit room?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Yes. Register tool definitions in session.tools on session.update. When the agent decides to invoke one, the server emits a tool.call event. Run the tool in your worker, then send back a tool.result after receiving reply.done.&lt;/p&gt;
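&lt;p&gt;As a rough sketch of that loop in the worker — field names such as name, call_id, and arguments are assumptions here; the tool-calling guide has the exact schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 1. Register the tool on the same session.update you already send.
await ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "tools": [{
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }],
    },
}))

# 2. In the event loop: queue tool.call, answer after reply.done.
elif t == "tool.call":
    pending_tools.append(event)
elif t == "reply.done" and pending_tools:
    call = pending_tools.pop()
    await ws.send(json.dumps({
        "type": "tool.result",
        "call_id": call.get("call_id"),
        "result": json.dumps({"temp_c": 21}),
    }))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;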

&lt;h3&gt;
  
  
  &lt;strong&gt;How do I scale to many concurrent rooms?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Run one worker per room. LiveKit Cloud’s agent dispatch can spin up a worker per active room, and each worker holds one Voice Agent API WebSocket. Both scale horizontally.&lt;/p&gt;
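&lt;p&gt;A minimal version of the one-worker-per-room pattern using only the standard library (the room names are hypothetical; worker.py and ROOM_NAME come from this tutorial):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os, subprocess

def spawn_worker(room_name: str) -&gt; subprocess.Popen:
    """Launch one worker.py per active room, passing the room via the env."""
    env = dict(os.environ, ROOM_NAME=room_name)
    return subprocess.Popen(["python", "worker.py"], env=env)

workers = {room: spawn_worker(room) for room in ["support-1", "support-2"]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;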

&lt;h3&gt;
  
  
  &lt;strong&gt;How much does it cost?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI offers a free tier. For current pricing, see the &lt;a href="https://www.assemblyai.com/pricing" rel="noopener noreferrer"&gt;AssemblyAI pricing page&lt;/a&gt; and the &lt;a href="https://cloud.livekit.io" rel="noopener noreferrer"&gt;LiveKit Cloud pricing page&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>ai</category>
      <category>webrtc</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Build an Agora voice agent with AssemblyAI's Voice Agent API</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Thu, 07 May 2026 17:34:33 +0000</pubDate>
      <link>https://forem.com/martschweiger/build-an-agora-voice-agent-with-assemblyais-voice-agent-api-53pe</link>
      <guid>https://forem.com/martschweiger/build-an-agora-voice-agent-with-assemblyais-voice-agent-api-53pe</guid>
      <description>&lt;h1&gt;
  
  
  &lt;strong&gt;Why combine Agora and the Voice Agent API&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Agora gives you battle-tested WebRTC: low-latency audio routing across 200+ countries, automatic codec negotiation, jitter buffers, NAT traversal, and SDKs for every client platform. The Voice Agent API is the AI brain in one connection.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;DIY pipeline behind Agora&lt;/th&gt;
&lt;th&gt;Voice Agent API behind Agora&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vendors to integrate&lt;/td&gt;
&lt;td&gt;STT + LLM + TTS (3+)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API keys to rotate&lt;/td&gt;
&lt;td&gt;3+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Round trips per turn&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Turn detection&lt;/td&gt;
&lt;td&gt;Plug in a separate VAD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Barge-in&lt;/td&gt;
&lt;td&gt;Implement yourself&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool calling&lt;/td&gt;
&lt;td&gt;Wire LLM tools yourself&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voices&lt;/td&gt;
&lt;td&gt;Pick a TTS vendor&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Architecture&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;The system has three layers: Agora clients in the browser or on mobile, the Python bot that joins the channel as a server-side participant, and the Voice Agent API. On the API side, turn detection is governed by four session parameters:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vad_threshold&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.0–1.0&lt;/td&gt;
&lt;td&gt;Voice activity detection sensitivity. Raise for noisy call environments.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;min_silence&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ms&lt;/td&gt;
&lt;td&gt;Minimum silence before a confident end-of-turn check fires.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_silence&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ms&lt;/td&gt;
&lt;td&gt;Hard cap on silence before forcing end-of-turn. Raise for deliberate speakers.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;interrupt_response&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;boolean&lt;/td&gt;
&lt;td&gt;Set to &lt;code&gt;False&lt;/code&gt; to disable barge-in entirely.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The bot resamples between Agora's 16 kHz and the Voice Agent API's 24 kHz using SciPy's polyphase filter. Both sides use PCM16 mono.&lt;/p&gt;
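&lt;p&gt;The resample_pcm16 helper referenced below isn't shown in full in this post; a minimal version built on SciPy's polyphase filter might look like this (a sketch, not necessarily the repo's exact code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from math import gcd

import numpy as np
from scipy.signal import resample_poly

def resample_pcm16(pcm: bytes, src_hz: int, dst_hz: int) -&gt; bytes:
    """Resample mono PCM16 between sample rates with a polyphase filter."""
    if src_hz == dst_hz:
        return pcm
    samples = np.frombuffer(pcm, dtype=np.int16).astype(np.float32)
    g = gcd(src_hz, dst_hz)
    out = resample_poly(samples, dst_hz // g, src_hz // g)
    return np.clip(out, -32768, 32767).astype(np.int16).tobytes()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;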

&lt;h1&gt;
  
  
  &lt;strong&gt;Prerequisites&lt;/strong&gt;
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.10+&lt;/li&gt;
&lt;li&gt;An &lt;a href="https://console.agora.io" rel="noopener noreferrer"&gt;Agora project&lt;/a&gt; with an App ID (and App Certificate if enabled)&lt;/li&gt;
&lt;li&gt;An &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;AssemblyAI API key&lt;/a&gt; — free tier available&lt;/li&gt;
&lt;li&gt;Linux or macOS (the Agora native server SDK does not officially ship Windows wheels; use WSL2 or a Linux container on Windows)&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Quick start&lt;/strong&gt;
&lt;/h1&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Clone and install&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; git clone https://github.com/kelsey-aai/voice-agent-agora
cd voice-agent-agora

python -m venv .venv &amp;amp;&amp;amp; source .venv/bin/activate
pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;2. Configure credentials&lt;/strong&gt;
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; cp .env.example .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Edit .env:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ASSEMBLYAI_API_KEY=your_assemblyai_key
AGORA_APP_ID=your_agora_app_id
AGORA_APP_CERTIFICATE=your_agora_app_certificate
AGORA_CHANNEL=voice-agent-demo
AGORA_BOT_UID=9999
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If your Agora project has App Certificate disabled, leave AGORA_APP_CERTIFICATE blank.&lt;/p&gt;
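&lt;p&gt;If the certificate is enabled, the bot must present a signed RTC token when joining. One way to mint one is the agora-token-builder package — a sketch under that assumption; the repo may generate tokens differently:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time

from agora_token_builder import RtcTokenBuilder

ROLE_PUBLISHER = 1                   # agora-token-builder role constant
expire_ts = int(time.time()) + 3600  # token valid for one hour

token = RtcTokenBuilder.buildTokenWithUid(
    AGORA_APP_ID, AGORA_APP_CERTIFICATE,
    AGORA_CHANNEL, int(AGORA_BOT_UID),
    ROLE_PUBLISHER, expire_ts,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;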

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Run the bot&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; python bot.py --channel voice-agent-demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;4. Connect a client&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Open Agora's &lt;a href="https://webdemo.agora.io/basicVideoCall/index.html" rel="noopener noreferrer"&gt;Web demo&lt;/a&gt;, enter your App ID, the same channel name, a different UID, and click &lt;strong&gt;Join&lt;/strong&gt;. Speak — the bot transcribes you live, the LLM replies, and the synthesized voice plays back through your browser.&lt;/p&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;How it works&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;The bridge is two cooperating asyncio tasks — one pulling caller audio out of Agora and pushing it to AssemblyAI, the other pulling reply audio out of AssemblyAI and pushing it back into Agora.&lt;/p&gt;
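&lt;p&gt;In outline, the bridge looks roughly like this (a structural sketch; send_input_audio and push_to_agora are hypothetical stand-ins for the real code shown in the steps below):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio

inbound_audio: asyncio.Queue = asyncio.Queue()   # caller -&gt; AssemblyAI
outbound_audio: asyncio.Queue = asyncio.Queue()  # AssemblyAI -&gt; caller

async def pump_to_assemblyai(ws):
    while True:
        chunk = await inbound_audio.get()
        await send_input_audio(ws, chunk)   # base64 + input.audio (step 3)

async def pump_to_agora(pcm_sender):
    while True:
        pcm24 = await outbound_audio.get()
        push_to_agora(pcm_sender, pcm24)    # resample + publish (step 4)

async def run_bridge(ws, pcm_sender):
    await asyncio.gather(pump_to_assemblyai(ws), pump_to_agora(pcm_sender))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;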
&lt;h2&gt;
  
  
  &lt;strong&gt;1. Connect to the Voice Agent API&lt;/strong&gt;
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; URL = "wss://agents.assemblyai.com/v1/ws"
headers = {"Authorization": f"Bearer {API_KEY}"}

async with websockets.connect(URL, additional_headers=headers) as ws:
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "system_prompt": "You are a friendly voice assistant.",
            "greeting": "Hi — I just joined the call.",
            "input": {"format": {"encoding": "audio/pcm"}},
            "output": {"voice": "ivy", "format": {"encoding": "audio/pcm"}},
        },
    }))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;session.update is the first message and configures personality, greeting, and voice. The default audio format is audio/pcm — 24 kHz, 16-bit signed LE, mono.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Pull caller audio out of Agora&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The bot registers an IAudioFrameObserver whose on_playback_audio_frame_before_mixing hook fires every 10 ms with one participant's audio frame. We resample 16 kHz → 24 kHz with SciPy's polyphase filter:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def on_playback_audio_frame_before_mixing(self, channel_id, uid, frame):
    pcm16 = bytes(frame.buffer)              # 16 kHz PCM16
    pcm24 = resample_pcm16(pcm16, 16_000, 24_000)
    loop.call_soon_threadsafe(agent.inbound_audio.put_nowait, pcm24)
    return 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;call_soon_threadsafe is required because Agora's observer runs on a native C++ thread, not the asyncio loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Stream audio to AssemblyAI&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; chunk = await mic_queue.get()
await ws.send(json.dumps({
    "type": "input.audio",
    "audio": base64.b64encode(chunk).decode(),
}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;4. Publish the reply back into Agora&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When reply.audio events arrive, we decode the base64 PCM, resample 24 kHz → 16 kHz, and hand it to AudioPcmDataSender:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;elif t == "reply.audio":
    pcm = base64.b64decode(event["data"])
    await self.outbound_audio.put(pcm)

pcm16 = resample_pcm16(pcm24, 24_000, 16_000)
self.pcm_sender.send_audio_pcm_data(
    pcm16, 0, len(pcm16)//2, 2, 1, 16_000,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We pace the pushes to wall-clock time so a long reply doesn't blast into Agora's buffer in one go — that keeps barge-in responsive.&lt;/p&gt;
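&lt;p&gt;Pacing can be as simple as sleeping for each chunk's playback duration before sending the next one (a sketch; resample_pcm16 is the helper above, and production code would track a deadline instead of sleeping naively to avoid drift):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async def pace_playback(outbound_audio, pcm_sender):
    while True:
        pcm24 = await outbound_audio.get()
        pcm16 = resample_pcm16(pcm24, 24_000, 16_000)
        samples = len(pcm16) // 2               # mono PCM16: 2 bytes/sample
        pcm_sender.send_audio_pcm_data(pcm16, 0, samples, 2, 1, 16_000)
        await asyncio.sleep(samples / 16_000)   # sleep for the chunk's duration
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;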

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Handle barge-in&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; elif t == "reply.done" and event.get("status") == "interrupted":
    while not outbound_audio.empty():
        outbound_audio.get_nowait()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Voice Agent API also trims the transcript.agent event to what the bot actually got out before it was cut off — useful for accurate logging.&lt;/p&gt;
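&lt;p&gt;That makes transcript logging a one-liner in the event loop (the text field name is an assumption; check the event reference for the exact payload):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;elif t == "transcript.agent":
    # Already trimmed to what was actually spoken if the reply was interrupted.
    logging.info("agent: %s", event.get("text", ""))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;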

&lt;h1&gt;
  
  
  &lt;strong&gt;Tuning&lt;/strong&gt;
&lt;/h1&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pick a different voice&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; "output": {"voice": "james"}    # conversational US male
"output": {"voice": "sophie"}   # clear UK female
"output": {"voice": "diego"}    # Latin American Spanish
"output": {"voice": "arjun"}    # Hindi/Hinglish
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Browse the full &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/voices" rel="noopener noreferrer"&gt;Voices catalog&lt;/a&gt;. Multilingual voices code-switch with English automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Adjust turn detection&lt;/strong&gt;
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; "input": {&lt;br&gt;
    "turn_detection": {&lt;br&gt;
        "vad_threshold": 0.5,&lt;br&gt;
        "min_silence": 600,&lt;br&gt;
        "max_silence": 1500,&lt;br&gt;
        "interrupt_response": True,&lt;br&gt;
    }&lt;br&gt;
}&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Boost domain-specific words&lt;/strong&gt;
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; "input": {"keyterms": ["AssemblyAI", "Agora", "Universal-3"]}&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Add tools&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Register functions on session.tools to let the agent look up data, hit APIs, or trigger workflows. Full pattern in the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/tool-calling" rel="noopener noreferrer"&gt;tool calling docs&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Troubleshooting&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;agora-python-server-sdk install fails on macOS.&lt;/strong&gt; The package ships pre-built C++ wheels for Linux and macOS. If pip falls back to source build, install Xcode command-line tools (xcode-select --install) or run the bot in a Linux container.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bot joins but stays silent.&lt;/strong&gt; Check that your client connected with the same AGORA_CHANNEL name and a different UID than AGORA_BOT_UID. Agora rejects duplicate UIDs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UNAUTHORIZED close from AssemblyAI.&lt;/strong&gt; API key missing, expired, or wrong. Pull a fresh one from the &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;AssemblyAI dashboard&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audio sounds chipmunky or sluggish.&lt;/strong&gt; Sample-rate mismatch. Confirm set_playback_audio_frame_before_mixing_parameters(channels=1, sample_rate_hz=16000) and that resampling is on between Agora's 16 kHz and the API's 24 kHz.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bot interrupts itself.&lt;/strong&gt; Acoustic loop somewhere — usually one client has speakers + mic open without echo cancellation. Browser clients should request getUserMedia({ audio: { echoCancellation: true } }).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token errors from Agora.&lt;/strong&gt; If your project has App Certificate enabled, AGORA_APP_CERTIFICATE must be set and the bot UID + channel name must match what you signed.&lt;/p&gt;

&lt;p&gt;Full troubleshooting guide: &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/troubleshooting" rel="noopener noreferrer"&gt;Voice Agent API docs&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Known limitations&lt;/strong&gt;
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;agora-python-server-sdk is a beta wrapper&lt;/strong&gt; around Agora's native C++ SDK. Class layouts have moved between minor versions. We pin 2.2.4 and document the exact API surface the bot uses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agora's recommended path for new voice-agent projects&lt;/strong&gt; is the Conversational AI Engine — a hosted REST service. Use this tutorial when you want the full AI pipeline on AssemblyAI's Voice Agent API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Windows wheels.&lt;/strong&gt; Run inside WSL2 or a Linux Docker container.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h1&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is the AssemblyAI Voice Agent API?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A single WebSocket endpoint that handles the entire voice agent pipeline server-side — speech recognition on Universal-3 Pro Streaming, LLM reasoning, and TTS with 30+ voices. It includes neural turn detection, barge-in, and tool calling.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do I connect the Voice Agent API to Agora?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Run a server-side bot with agora-python-server-sdk. The bot joins the Agora channel, registers an IAudioFrameObserver to capture caller audio (16 kHz PCM), resamples to 24 kHz, and forwards each chunk to the Voice Agent API. Reply audio comes back, gets resampled to 16 kHz, and is published via AudioPcmDataSender.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Can I use Agora's Conversational AI Engine instead?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Yes — it supports AssemblyAI as the STT provider, but uses Agora's LLM and TTS layers. Use this tutorial when you want the full AI pipeline on AssemblyAI's Voice Agent API.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What audio format does it use with Agora?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The Voice Agent API defaults to audio/pcm at 24 kHz. Agora delivers 16 kHz PCM, so the bot resamples 16 kHz ↔ 24 kHz on each side using SciPy's polyphase filter.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How does barge-in work?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The Voice Agent API emits reply.done with status: "interrupted". The bridge flushes its outbound audio queue so the bot stops talking immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Do I need an Agora App Certificate?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Only if your Agora project has it enabled. If so, set AGORA_APP_CERTIFICATE in .env. If disabled, leave it blank.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How much does it cost?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI offers a free tier. For current pricing, see the &lt;a href="https://www.assemblyai.com/pricing" rel="noopener noreferrer"&gt;AssemblyAI pricing page&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>ai</category>
      <category>webrtc</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Build a Daily.co voice agent with AssemblyAI's Voice Agent API</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Thu, 07 May 2026 17:34:25 +0000</pubDate>
      <link>https://forem.com/martschweiger/build-a-dailyco-voice-agent-with-assemblyais-voice-agent-api-152c</link>
      <guid>https://forem.com/martschweiger/build-a-dailyco-voice-agent-with-assemblyais-voice-agent-api-152c</guid>
      <description>&lt;p&gt;Daily.co handles the WebRTC plumbing — rooms, participants, audio tracks, NAT traversal — across web, mobile, and SIP. AssemblyAI's Voice Agent API handles the AI: speech recognition, the LLM that decides what to say, and the voice that speaks it back, all over a single connection. This tutorial bridges the two with the daily-python SDK.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Why combine Daily.co with the Voice Agent API&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Most voice-agent stacks pick a transport and then bolt on a pipeline of AI services behind it. With Daily.co + the Voice Agent API, both halves collapse to a single managed dependency each.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Self-hosted WebRTC + multi-vendor AI&lt;/th&gt;
&lt;th&gt;Daily.co + Voice Agent API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Transport infrastructure&lt;/td&gt;
&lt;td&gt;Run your own SFU/TURN servers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-platform clients&lt;/td&gt;
&lt;td&gt;Build native WebRTC clients per platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI services to wire&lt;/td&gt;
&lt;td&gt;Streaming STT + LLM + TTS (3+ vendors)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API keys&lt;/td&gt;
&lt;td&gt;4+ (TURN + 3 AI services)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Round trips per turn&lt;/td&gt;
&lt;td&gt;4 (WebRTC→STT→LLM→TTS→WebRTC)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Turn detection / barge-in&lt;/td&gt;
&lt;td&gt;Implement yourself&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool calling&lt;/td&gt;
&lt;td&gt;Wire LLM tools manually&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Architecture&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;The system has three layers: Daily.co clients (browser, mobile, or SIP), the Python bot built on daily-python, and the Voice Agent API. On the API side, turn detection is governed by four session parameters:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vad_threshold&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.0–1.0&lt;/td&gt;
&lt;td&gt;Voice activity detection sensitivity. Raise for noisy call environments.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;min_silence&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ms&lt;/td&gt;
&lt;td&gt;Minimum silence before a confident end-of-turn check fires.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_silence&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ms&lt;/td&gt;
&lt;td&gt;Hard cap on silence before forcing end-of-turn. Raise for deliberate speakers.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;interrupt_response&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;boolean&lt;/td&gt;
&lt;td&gt;Set to &lt;code&gt;False&lt;/code&gt; to disable barge-in entirely.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The bot resamples between Daily's 16 kHz and the Voice Agent API's 24 kHz. Both sides use PCM16 mono.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Prerequisites&lt;/strong&gt;
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Python 3.10+&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;A &lt;a href="https://dashboard.daily.co" rel="noopener noreferrer"&gt;Daily.co account&lt;/a&gt; with an API key — free tier includes 10,000 participant minutes/month&lt;/li&gt;
&lt;li&gt;An &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;AssemblyAI API key&lt;/a&gt; — free tier available, no credit card&lt;/li&gt;
&lt;li&gt;A Daily.co room URL — create in the dashboard or with the script shown below&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Quick start&lt;/strong&gt;
&lt;/h1&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Clone and install&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; git clone https://github.com/kelsey-aai/voice-agent-daily
cd voice-agent-daily

python -m venv .venv &amp;amp;&amp;amp; source .venv/bin/activate
pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;2. Configure your keys&lt;/strong&gt;
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; cp .env.example .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Fill in .env:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ASSEMBLYAI_API_KEY=...   # https://www.assemblyai.com/dashboard/signup
DAILY_API_KEY=...        # https://dashboard.daily.co/developers
DAILY_ROOM_URL=https://yourname.daily.co/voice-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;3. Create a Daily room (optional)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you don't have a room URL, create one via the Daily REST API:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os, requests
from dotenv import load_dotenv

load_dotenv()
HEADERS = {"Authorization": f"Bearer {os.environ['DAILY_API_KEY']}"}

r = requests.post(
    "https://api.daily.co/v1/rooms",
    headers=HEADERS,
    json={"properties": {"enable_prejoin_ui": False, "exp": 3600}},
)
r.raise_for_status()
print("Room URL:", r.json()["url"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;4. Run the bot&lt;/strong&gt;
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; python bot.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Open the same room URL in a browser. Speak — you'll see your transcript and the agent's reply stream to the terminal. The agent's voice plays back through the room.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;How it works&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;The whole bot is in bot.py, in four pieces.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Initialize Daily and join the room&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; import daily

daily.Daily.init()
client = daily.CallClient(event_handler=self)

client.join(
    DAILY_ROOM_URL,
    meeting_token=DAILY_TOKEN,
    client_settings=daily.ClientSettings(
        inputs=daily.InputSettings(
            microphone=daily.MicrophoneSettings(
                is_enabled=True, device_id="vaa-mic",
            ),
            camera=daily.CameraSettings(is_enabled=False),
        ),
        publishing=daily.PublishingSettings(
            microphone=daily.MicrophonePublishingSettings(is_enabled=True),
        ),
    ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Three things happen: daily.Daily.init() boots the WebRTC stack. CallClient is the per-room handle. The client_settings say: don't open a real mic or camera, but publish a virtual mic named vaa-mic where we'll write the agent's reply audio.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Subscribe to remote audio&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; client.update_subscriptions(
    participant_settings={
        "*": daily.ParticipantSubscriptionSettings(
            media=daily.MediaSubscription.SUBSCRIBED_ALL,
        )
    }
)

client.set_participant_audio_renderer(
    participant_id,
    callback=self.on_audio_data,
    sample_rate=16_000,
    num_channels=1,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;update_subscriptions tells Daily to pull all media from every remote participant. set_participant_audio_renderer converts the remote audio track into mono PCM16 at 16 kHz and calls your callback for each chunk.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Bridge to the Voice Agent API&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; async with websockets.connect(
    "wss://agents.assemblyai.com/v1/ws",
    additional_headers={"Authorization": f"Bearer {ASSEMBLYAI_API_KEY}"},
) as ws:
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "system_prompt": "You are a friendly voice assistant.",
            "greeting": "Hi - I'm joining the call.",
            "input":  {"format": {"encoding": "audio/pcm"}},
            "output": {"voice": "ivy", "format": {"encoding": "audio/pcm"}},
        },
    }))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After session.ready, every chunk of room audio gets resampled to 24 kHz, base64-encoded, and shipped as input.audio.&lt;/p&gt;
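&lt;p&gt;A sketch of that glue — audio_data.audio_frames is an assumption about daily-python's callback payload, and resample_pcm16 is the same kind of SciPy polyphase helper sketched in the Agora tutorial:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def on_audio_data(self, participant_id, audio_data):
    # Daily calls this from a non-asyncio thread with 16 kHz mono PCM16.
    pcm24 = resample_pcm16(bytes(audio_data.audio_frames), 16_000, 24_000)
    self.loop.call_soon_threadsafe(self.mic_queue.put_nowait, pcm24)

async def pump_audio(self, ws):
    while True:
        chunk = await self.mic_queue.get()
        await ws.send(json.dumps({
            "type": "input.audio",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;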

&lt;h2&gt;
  
  
  &lt;strong&gt;4. Publish reply audio back into the room&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; elif t == "reply.audio":
    pcm24 = base64.b64decode(event["data"])
    pcm_out = resample_pcm16(pcm24, 24_000, 16_000)
    await asyncio.to_thread(mic_device.write_frames, pcm_out)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;mic_device is the virtual microphone registered with daily.Daily.create_microphone_device(...). Anything you write into it gets published into the Daily room as if a real human's mic produced it.&lt;/p&gt;
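&lt;p&gt;Creating that device happens once, before joining the room (a sketch; check daily-python's docs for the exact arguments):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import daily

daily.Daily.init()
# Virtual mic the bot publishes into; 16 kHz mono to match the Daily side.
mic_device = daily.Daily.create_microphone_device(
    "vaa-mic", sample_rate=16_000, channels=1,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;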

&lt;h2&gt;
  
  
  &lt;strong&gt;Interruption (Barge-In) handling&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When a participant speaks while the agent is replying, the Voice Agent API emits reply.done with status: "interrupted". The bot doesn't need to flush anything on the Daily side — mic_device.write_frames only plays what you hand it, so stopping writes stops the agent.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;elif t == "reply.done" and event.get("status") == "interrupted":
    pending_tools.clear()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;Tuning&lt;/strong&gt;
&lt;/h1&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Pick a different voice&lt;/strong&gt;
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; "output": {"voice": "james"}    # conversational US male
"output": {"voice": "sophie"}   # UK female
"output": {"voice": "diego"}    # Latin American Spanish
"output": {"voice": "arjun"}    # Hindi/Hinglish
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Browse the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/voices" rel="noopener noreferrer"&gt;Voices catalog&lt;/a&gt;. Multilingual voices code-switch automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Tune the system prompt&lt;/strong&gt;
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; "session": {&lt;br&gt;
    "system_prompt": (&lt;br&gt;
        "You are a sales-qualifying agent for Acme Corp on a video call. "&lt;br&gt;
        "Ask one question at a time. Keep answers under two short sentences."&lt;br&gt;
    ),&lt;br&gt;
    "greeting": "Hey - thanks for hopping on. What brings you in today?",&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Multi-participant calls&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;update_subscriptions with "*" subscribes the bot to every remote participant. The audio renderer fires per-participant. Today the bot mixes everyone into a single stream — for cleaner handling, either subscribe only to a designated primary participant or tag transcripts with the participant ID at the bridge.&lt;/p&gt;
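&lt;p&gt;A sketch of the designated-primary option, reusing the calls from earlier; primary_id is a hypothetical variable holding that participant's ID.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; # Subscribe to the designated primary participant instead of "*".
client.update_subscriptions(
    participant_settings={
        primary_id: daily.ParticipantSubscriptionSettings(
            media=daily.MediaSubscription.SUBSCRIBED_ALL,
        ),
    }
)

# Attach the audio renderer to that participant alone.
client.set_participant_audio_renderer(
    primary_id,
    callback=self.on_audio_data,
    sample_rate=16_000,
    num_channels=1,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;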

&lt;h2&gt;
  
  
  &lt;strong&gt;Telephony in the same room&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Daily supports SIP and PSTN dial-in, so phone callers can join the same room. Daily transcodes the carrier's 8 kHz G.711 audio for you — your renderer callback still gets PCM16 at whatever sample rate you asked for. No extra code.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Troubleshooting&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Bot connects but never speaks.&lt;/strong&gt; Check that MicrophonePublishingSettings(is_enabled=True) is set, and that you're calling mic_device.write_frames(...) with non-empty bytes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent transcript is garbled.&lt;/strong&gt; Wrong sample rate. The API expects 24 kHz for audio/pcm, not 16 kHz. Confirm your resample step is running.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent keeps interrupting itself.&lt;/strong&gt; Two causes: (1) The bot is subscribing to its own virtual mic — filter out self.client.local_participant().id. (2) A participant's speakers are feeding the agent's audio back into a mic with no echo cancellation — ask them to use headphones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UNAUTHORIZED close on the Voice Agent API socket.&lt;/strong&gt; Bad or missing AssemblyAI key — check .env.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Daily join fails with 403.&lt;/strong&gt; Room URL is wrong, the room expired, or it requires a meeting token.&lt;/p&gt;

&lt;p&gt;Full troubleshooting guide: &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/troubleshooting" rel="noopener noreferrer"&gt;Voice Agent API docs&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h1&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is AssemblyAI's Voice Agent API?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A single WebSocket endpoint that handles the entire voice-agent pipeline server-side: speech recognition, LLM reasoning, and TTS. You send PCM audio in and get PCM audio back, with neural turn detection, barge-in, and tool calling built in.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How does the Daily.co Python SDK send audio to a voice agent?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The daily-python SDK exposes per-participant audio renderers via set_participant_audio_renderer. Daily decodes the remote WebRTC track to mono PCM16 and invokes your callback. You forward those bytes as input.audio events to the Voice Agent API.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What sample rate does it use?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The Voice Agent API defaults to 24 kHz PCM16 in both directions. Daily's renderer delivers 16 kHz. The bot resamples 16 kHz ↔ 24 kHz at the bridge.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do I publish reply audio back into a Daily room?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Register a virtual mic with daily.Daily.create_microphone_device(...), enable it on join, and call mic_device.write_frames(pcm_bytes) with each reply chunk. Daily publishes it like a normal participant's mic.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Can the bot handle multiple humans?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Yes. update_subscriptions(participant_settings={"*": ...}) subscribes to everyone. The audio renderer fires per-participant so you know who's speaking.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Does this work with phone callers?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Yes. Daily supports SIP/PSTN dial-in. Daily transcodes the carrier audio for you — no extra code on your side.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How much does it cost?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI offers a free tier. Daily offers 10,000 free participant minutes/month. See the &lt;a href="https://www.assemblyai.com/pricing" rel="noopener noreferrer"&gt;AssemblyAI pricing page&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>ai</category>
      <category>webrtc</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>AssemblyAI vs Deepgram: what's the best voice agent API?</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Thu, 07 May 2026 17:33:47 +0000</pubDate>
      <link>https://forem.com/martschweiger/assemblyai-vs-deepgram-whats-the-best-voice-agent-api-2jk7</link>
      <guid>https://forem.com/martschweiger/assemblyai-vs-deepgram-whats-the-best-voice-agent-api-2jk7</guid>
      <description>&lt;p&gt;Both AssemblyAI and Deepgram now offer dedicated &lt;a href="https://www.assemblyai.com/blog/ai-voice-agents" rel="noopener noreferrer"&gt;voice agent&lt;/a&gt; APIs. Both use a cascaded architecture—separate STT, LLM, and TTS models working in sequence rather than a single multimodal model. Both charge around $4.50/hr. On the surface, they look pretty similar.&lt;/p&gt;

&lt;p&gt;But when you dig into the details that actually matter for production voice agents—speech accuracy on real-world entities, developer experience, and mid-conversation flexibility—meaningful differences emerge. Here's an honest comparison.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;AssemblyAI Voice Agent API&lt;/th&gt;
&lt;th&gt;Deepgram Voice Agent API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pricing&lt;/td&gt;
&lt;td&gt;$4.50/hr flat&lt;/td&gt;
&lt;td&gt;~$4.50/hr + concurrency metering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ASR model&lt;/td&gt;
&lt;td&gt;Universal-3 Pro Streaming (#1 WER)&lt;/td&gt;
&lt;td&gt;Nova-3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Word accuracy&lt;/td&gt;
&lt;td&gt;94.07% (6.3% mean WER)&lt;/td&gt;
&lt;td&gt;92.10%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Missed entity rate (emails, phones, names)&lt;/td&gt;
&lt;td&gt;16.7%&lt;/td&gt;
&lt;td&gt;25.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;End-to-end latency&lt;/td&gt;
&lt;td&gt;~1 second&lt;/td&gt;
&lt;td&gt;~1–1.5 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Languages&lt;/td&gt;
&lt;td&gt;EN, ES, FR, DE, IT, PT&lt;/td&gt;
&lt;td&gt;EN, ES, NL, FR, DE, IT, JA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Turn detection&lt;/td&gt;
&lt;td&gt;Speech-aware VAD (semantic + neural)&lt;/td&gt;
&lt;td&gt;Traditional VAD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mid-session updates&lt;/td&gt;
&lt;td&gt;Prompt + voice + tools + VAD&lt;/td&gt;
&lt;td&gt;Prompt + voice only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Session resumption&lt;/td&gt;
&lt;td&gt;30-second reconnect window&lt;/td&gt;
&lt;td&gt;Not available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Billing model&lt;/td&gt;
&lt;td&gt;Flat per-minute&lt;/td&gt;
&lt;td&gt;Concurrency metering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool calling&lt;/td&gt;
&lt;td&gt;Custom functions via JSON Schema&lt;/td&gt;
&lt;td&gt;Custom functions supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compliance&lt;/td&gt;
&lt;td&gt;HIPAA (BAA), SOC 2 Type 2, ISO 27001&lt;/td&gt;
&lt;td&gt;SOC 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medical terminology&lt;/td&gt;
&lt;td&gt;Medical Mode&lt;/td&gt;
&lt;td&gt;No equivalent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The architecture: similar approach, different foundations&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AssemblyAI and Deepgram both chose the cascaded pipeline architecture for their voice agent APIs: &lt;a href="https://www.assemblyai.com/products/streaming-speech-to-text" rel="noopener noreferrer"&gt;speech-to-text&lt;/a&gt; feeds into an LLM, which feeds into TTS. This is the right call for most production voice agents. Dedicated models for each step outperform multimodal approaches (like OpenAI's Realtime API) on speech understanding tasks because each model is optimized for its specific job.&lt;/p&gt;

&lt;p&gt;The key difference is the STT foundation. AssemblyAI's &lt;a href="https://www.assemblyai.com/products/voice-agent-api" rel="noopener noreferrer"&gt;Voice Agent API&lt;/a&gt; is built on Universal-3 Pro Streaming—the #1-ranked model on the Hugging Face Open ASR Leaderboard. Deepgram uses their Nova-3 model. Both are capable speech models, but they perform very differently where it counts most for voice agents: entity accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Speech accuracy: where the gap shows up&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's the thing about voice agents: they're not just transcribing for the record. The transcript feeds directly into the LLM that decides what to do next. If the speech-to-text layer gets an email address wrong, the agent sends a confirmation to the wrong person. If it misses a digit in an account number, the agent looks up the wrong account.&lt;/p&gt;

&lt;p&gt;AssemblyAI's Universal-3 Pro Streaming achieves 94.07% word accuracy with a 6.3% mean word error rate across English domains. On &lt;a href="https://www.assemblyai.com/blog/real-time-entity-extraction-from-audio" rel="noopener noreferrer"&gt;entity accuracy&lt;/a&gt; specifically, Universal-3 Pro Streaming has a 16.7% average missed entity rate on names, emails, phone numbers, and credit card numbers. Deepgram Nova-3's missed entity rate on the same type of content runs at 25.5%. That's not a small gap—it's the difference between an agent that completes tasks on the first try and one that needs to ask "could you repeat that?" regularly.&lt;/p&gt;

&lt;p&gt;The Voice Agent API product page includes a side-by-side comparison on a pharmacy refill scenario. The results are telling: AssemblyAI correctly transcribes the RX number, medication dosage, street address, and phone number. Deepgram's transcription misses formatting on the date of birth, drops the "RX-" prefix, and garbles the medication dosage format.&lt;/p&gt;

&lt;p&gt;This isn't cherry-picked—it reflects the systematic accuracy advantage that comes from building the entire voice agent pipeline around a purpose-built speech model.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Developer experience&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Both APIs use WebSocket connections and JSON messages—the fundamentals are similar. (For a deeper look at &lt;a href="https://www.assemblyai.com/blog/choosing-a-stt-api-for-voice-agents" rel="noopener noreferrer"&gt;choosing an STT API for voice agents&lt;/a&gt;, we've covered that separately.) But the details of the developer experience differ.&lt;/p&gt;

&lt;p&gt;AssemblyAI's approach is deliberately minimalist. A handful of JSON message types, no SDK required, and the entire API reference is readable in about 10 minutes. The team designed the API so it works natively with tools like Claude Code—you can literally copy the docs, paste them in, and &lt;a href="https://www.assemblyai.com/blog/how-to-vibe-code-a-voice-agent" rel="noopener noreferrer"&gt;scaffold a working integration&lt;/a&gt;. Most developers get a working agent running the same afternoon.&lt;/p&gt;

&lt;p&gt;AssemblyAI also supports live mid-conversation updates. You can change the system prompt, swap voices, add or remove tools, and adjust VAD settings—all via a JSON message without dropping the connection. For applications that need dynamic behavior (a &lt;a href="https://www.assemblyai.com/solutions/contact-centers" rel="noopener noreferrer"&gt;support agent&lt;/a&gt; escalating from English to Spanish, a coaching app switching modes), this is a major advantage.&lt;/p&gt;
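&lt;p&gt;For a sense of what that looks like on the wire, a mid-call update is just another JSON message on the open socket. An illustrative sketch (field names follow AssemblyAI's session.update shape; the values are made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "type": "session.update",
  "session": {
    "system_prompt": "Continue the conversation in Spanish. Keep replies brief.",
    "input": { "turn_detection": { "vad_threshold": 0.7 } }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;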

&lt;p&gt;Deepgram's developer experience is solid but more conventional. Their documentation is well-organized, and the API follows patterns familiar to developers who've used their transcription products. If you're already building on Deepgram, adding their voice agent API is a natural extension.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pricing and scaling&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Both APIs come in at roughly $4.50/hr for the full pipeline (STT + LLM + TTS). But the billing models have differences worth understanding.&lt;/p&gt;

&lt;p&gt;AssemblyAI uses straightforward per-minute billing with a flat rate. $4.50/hr covers everything—speech understanding, LLM reasoning, and voice generation. No separate input/output token charges, no per-feature add-ons. Your cost model is simple: hours of usage times $4.50.&lt;/p&gt;

&lt;p&gt;Deepgram uses concurrency metering alongside usage-based pricing. This means your costs depend not just on total usage but on how many simultaneous sessions you're running. For applications with bursty traffic patterns—a customer support center during peak hours, for instance—concurrency metering can make costs harder to predict and potentially more expensive during spikes.&lt;/p&gt;

&lt;p&gt;Neither platform charges for the initial API key or free tier usage, making it easy to evaluate both before committing.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Turn detection and conversation flow&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is where voice agents live or die in production—and it's hard to evaluate from docs alone.&lt;/p&gt;

&lt;p&gt;AssemblyAI's Voice Agent API uses a speech-aware &lt;a href="https://www.assemblyai.com/blog/turn-detection-endpointing-voice-agent" rel="noopener noreferrer"&gt;Voice Activity Detection&lt;/a&gt; (VAD) system that distinguishes between a thoughtful pause and a conversation ending. The turn detection is baked into Universal-3 Pro Streaming, which means it benefits from the same acoustic understanding that drives transcription accuracy. Interruption handling supports natural barge-in—when someone cuts in, the agent stops and listens.&lt;/p&gt;

&lt;p&gt;Deepgram also offers turn detection and VAD, but developers have reported that AssemblyAI's implementation feels more natural in practice—particularly around the "pause vs. done talking" distinction that makes or breaks conversation flow.&lt;/p&gt;

&lt;p&gt;The honest recommendation: try both. Have a real conversation with agents built on each platform. The difference in conversational feel is something you notice immediately, even if it's hard to quantify in a feature comparison.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Which should you choose?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If speech accuracy is your top priority—and for most production voice agents, it should be—AssemblyAI's Voice Agent API has a clear advantage. The gap in entity accuracy is real and it directly impacts whether your agent can complete tasks on the first try.&lt;/p&gt;

&lt;p&gt;For teams starting fresh, AssemblyAI's combination of best-in-class speech understanding, simpler pricing, and richer mid-conversation controls makes it the stronger foundation for building &lt;a href="https://www.assemblyai.com/solutions/voice-agents" rel="noopener noreferrer"&gt;voice agents that work reliably in production&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Which has better speech accuracy for voice agents—AssemblyAI or Deepgram?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI's Universal-3 Pro Streaming achieves 94.07% word accuracy (6.3% mean WER) with a 16.7% average missed entity rate on names, emails, phone numbers, and credit card numbers. Deepgram Nova-3 achieves 92.10% word accuracy with a 25.5% missed entity rate on the same content types. This means AssemblyAI captures significantly more entities correctly, which directly impacts whether a voice agent can complete tasks on the first try without asking users to repeat themselves.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How does pricing compare between AssemblyAI and Deepgram voice agent APIs?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Both APIs cost approximately $4.50/hr for the full pipeline (STT, LLM, TTS). The key difference is billing structure: AssemblyAI uses flat per-minute pricing with no concurrency metering, making costs straightforward to predict. Deepgram adds concurrency metering, which can increase costs during traffic spikes and makes budgeting slightly more complex.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Can I migrate from Deepgram's voice agent API to AssemblyAI?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Yes. Both use WebSocket connections and JSON messages, so the architectural patterns are similar. AssemblyAI's API is designed to be simple enough that most developers get a working agent running in an afternoon—even when migrating from a different provider. The main integration work is mapping your existing tool definitions and system prompts to AssemblyAI's JSON format.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is a cascaded voice agent architecture and why does it matter?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A cascaded architecture uses separate, specialized models for speech-to-text, LLM reasoning, and text-to-speech rather than one multimodal model doing everything. Both AssemblyAI and Deepgram use this approach because it allows each model to be optimized for its specific task, resulting in better accuracy and more predictable behavior than multimodal alternatives like OpenAI's Realtime API.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Which voice agent API is better for healthcare applications?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI has an edge for healthcare use cases thanks to Medical Mode—a paid add-on that enhances accuracy for medical terminology like medication names, procedures, and conditions. AssemblyAI enables covered entities and their business associates subject to HIPAA to use AssemblyAI services to process protected health information (PHI), with a Business Associate Addendum (BAA) available. Combined with SOC 2 Type 2 certification and the lowest missed entity rate in the market, it's purpose-built for clinical voice agent workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Does AssemblyAI's Voice Agent API support the same languages as Deepgram?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI's Voice Agent API currently supports six languages: English, Spanish, French, German, Italian, and Portuguese. Deepgram supports a similar set including English, Spanish, Dutch, French, German, Italian, and Japanese. For applications requiring broader language support, check both providers' current documentation as language support is actively expanding.&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>ai</category>
      <category>comparison</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Build a voice assistant app with AssemblyAI’s Voice Agent API</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Thu, 07 May 2026 17:33:42 +0000</pubDate>
      <link>https://forem.com/martschweiger/build-a-voice-assistant-app-with-assemblyais-voice-agent-api-3gfm</link>
      <guid>https://forem.com/martschweiger/build-a-voice-assistant-app-with-assemblyais-voice-agent-api-3gfm</guid>
      <description>&lt;p&gt;This is the “real app” version of the 5-minute quickstart: a polished UI, AudioWorklet mic capture, temporary-token auth, and full barge-in handling. The &lt;a href="https://www.assemblyai.com/products/voice-agent-api" rel="noopener noreferrer"&gt;&lt;strong&gt;AssemblyAI Voice Agent API&lt;/strong&gt;&lt;/a&gt; does the speech recognition, the LLM, and the TTS server-side — you’re just shuttling audio bytes.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Why One WebSocket Beats a Multi-Service Pipeline&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;A traditional browser voice agent needs you to wire up streaming STT, an LLM, and a TTS provider, then orchestrate audio routing between them in the browser. Every hop adds latency, every provider needs a key, and every glue layer adds a failure mode.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Multi-service browser pipeline&lt;/th&gt;
&lt;th&gt;Voice Agent API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Services to wire up&lt;/td&gt;
&lt;td&gt;STT + LLM + TTS (3+ vendors)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API keys to manage&lt;/td&gt;
&lt;td&gt;3+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Round trips per turn&lt;/td&gt;
&lt;td&gt;3 (mic→STT→LLM→TTS→speaker)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Browser key exposure&lt;/td&gt;
&lt;td&gt;Hard to avoid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Turn detection&lt;/td&gt;
&lt;td&gt;Configure separately&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Barge-in / interruption&lt;/td&gt;
&lt;td&gt;Implement yourself&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool calling&lt;/td&gt;
&lt;td&gt;Wire LLM tools manually&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The endpoint is one URL: wss://agents.assemblyai.com/v1/ws. Send 24 kHz PCM, get 24 kHz PCM back. That’s it.&lt;/p&gt;
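&lt;p&gt;Concretely, the steady-state traffic is two message shapes, audio up and audio down (placeholders stand in for the real base64 payloads):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;→ { "type": "input.audio", "audio": "&amp;lt;base64 24 kHz PCM16&amp;gt;" }
← { "type": "reply.audio", "data": "&amp;lt;base64 24 kHz PCM16&amp;gt;" }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;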

&lt;h1&gt;
  
  
  &lt;strong&gt;Architecture&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;The system has two halves: a browser client and a lightweight Node server. Most of the client's work is reacting to server events:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;What we do with it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;session.ready&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Save the &lt;code&gt;session_id&lt;/code&gt;, start sending audio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;transcript.user.delta&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Render a partial bubble (italic, low-opacity) as the user speaks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;transcript.user&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Promote the partial to a final user message&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;reply.audio&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Decode base64 PCM, schedule playback (see below)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;transcript.agent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Render the agent’s final reply (with &lt;code&gt;interrupted&lt;/code&gt; if applicable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;reply.done&lt;/code&gt; (with &lt;code&gt;status: "interrupted"&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Flush queued audio — user barged in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;session.error&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Surface error code in the status indicator&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Data flow:&lt;/strong&gt; The browser gets a token from the Node server, opens a WebSocket to the Voice Agent API with that token, streams mic PCM up, and receives reply PCM + transcript events back.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Prerequisites&lt;/strong&gt;
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Node.js 18+ (uses native fetch and ES modules)&lt;/li&gt;
&lt;li&gt;A modern browser (Chrome 66+, Firefox 76+, Safari 14.1+ — anything with AudioWorklet)&lt;/li&gt;
&lt;li&gt;An &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;AssemblyAI API key&lt;/a&gt; — free tier available&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The browser needs a secure origin to access the mic. &lt;a href="http://localhost" rel="noopener noreferrer"&gt;http://localhost&lt;/a&gt; counts as secure, so you can develop locally without TLS. If you deploy elsewhere, serve over HTTPS.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Quick Start&lt;/strong&gt;
&lt;/h1&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Clone and Install&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; git clone https://github.com/kelsey-aai/voice-assistant-app
cd voice-assistant-app

npm install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;2. Configure Your API Key&lt;/strong&gt;
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; cp .env.example .env
# Edit .env — drop in your AssemblyAI API key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;3. Run the App&lt;/strong&gt;
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; npm start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Open &lt;a href="http://localhost:3000" rel="noopener noreferrer"&gt;http://localhost:3000&lt;/a&gt;, pick a voice, hit &lt;strong&gt;Connect&lt;/strong&gt;, grant mic permission, and start talking. You’ll see your speech transcribed live as a partial bubble, then committed to a final bubble, with the agent’s reply streaming back in audio and text.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;How It Works&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;There are four moving parts: the token mint, the AudioWorklet that captures mic audio, the WebSocket loop that drives the conversation, and the playback scheduler that turns reply.audio events back into sound.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. The Server Mints a Temporary Token&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Your AssemblyAI API key never leaves the server. The browser asks /api/voice-token for a single-use token, valid for 5 minutes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// server.js
app.get("/api/voice-token", async (_req, res) =&amp;gt; {
  const url = new URL("https://agents.assemblyai.com/v1/token");
  url.searchParams.set("expires_in_seconds", "300");

  const response = await fetch(url, {
    headers: { Authorization: `Bearer ${API_KEY}` },
  });
  const { token } = await response.json();
  res.json({ token });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Tokens are single-use — you fetch a fresh one for every connection. The browser then opens the WebSocket with the token as a query parameter:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ws = new WebSocket(`wss://agents.assemblyai.com/v1/ws?token=${token}`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;2. The AudioWorklet Captures Mic Audio&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AudioWorklet runs your PCM conversion off the main thread, which keeps it glitch-free. The worklet receives Float32 samples, clips them to range, and posts them back as a transferable Int16Array buffer:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// public/pcm-processor.js
class PCMProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const channel = inputs[0]?.[0];
    if (!channel) return true;

    const pcm = new Int16Array(channel.length);
    for (let i = 0; i &amp;lt; channel.length; i++) {
      const s = Math.max(-1, Math.min(1, channel[i]));
      pcm[i] = s &amp;lt; 0 ? s * 0x8000 : s * 0x7FFF;
    }
    this.port.postMessage(pcm.buffer, [pcm.buffer]);
    return true;
  }
}

// AudioWorkletProcessor subclasses must be registered by name; "pcm-processor"
// is assumed to match the name the app passes to new AudioWorkletNode(...).
registerProcessor("pcm-processor", PCMProcessor);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Voice Agent API expects 24 kHz PCM by default, so we force the entire AudioContext to 24 kHz on creation — no resampling needed:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;audioCtx = new AudioContext({ sampleRate: SAMPLE_RATE }); // 24000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We also enable browser-level acoustic echo cancellation when grabbing the mic, so the agent doesn’t interrupt itself by hearing its own TTS through the speakers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true,
    channelCount: 1,
  },
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;3. The WebSocket Drives the Conversation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;session.update is the first message — it configures system prompt, greeting, and voice. After that, you stream input.audio events whenever the worklet hands you a frame:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ws.onopen = () =&amp;gt; {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      system_prompt: prompt.value,
      greeting: "Hi there — what can I help you with?",
      output: { voice: "ivy" },
    },
  }));
};

workletNode.port.onmessage = (e) =&amp;gt; {
  ws.send(JSON.stringify({
    type: "input.audio",
    audio: arrayBufferToBase64(e.data),
  }));
};

// Helper used above: pack an ArrayBuffer into a base64 string.
function arrayBufferToBase64(buf) {
  let binary = "";
  for (const b of new Uint8Array(buf)) binary += String.fromCharCode(b);
  return btoa(binary);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The server replies with a stream of events; the table in the Architecture section shows how each one drives the UI. The session config also accepts turn-detection parameters under input.turn_detection:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Range / unit&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vad_threshold&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.0–1.0&lt;/td&gt;
&lt;td&gt;Voice activity detection sensitivity. Higher = less sensitive. Raise for noisy environments.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;min_silence&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ms&lt;/td&gt;
&lt;td&gt;Minimum silence duration before the end-of-turn check fires.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_silence&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ms&lt;/td&gt;
&lt;td&gt;Hard cap on silence before forcing end-of-turn. Raise for deliberate speech (healthcare, eldercare).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;interrupt_response&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;boolean&lt;/td&gt;
&lt;td&gt;Set to &lt;code&gt;false&lt;/code&gt; to disable barge-in entirely.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. Reply Playback Uses a Scheduling Cursor&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;reply.audio chunks arrive faster than they play — sometimes the whole reply is buffered before the first sample hits the speaker. Naively calling source.start(0) would overlap the chunks. Instead, we keep a playbackTime cursor and schedule each chunk back-to-back:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function playPCM(b64) {
  const bytes  = Uint8Array.from(atob(b64), (c) =&amp;gt; c.charCodeAt(0));
  const int16  = new Int16Array(bytes.buffer, bytes.byteOffset,
                                bytes.byteLength / 2);
  const float  = new Float32Array(int16.length);
  for (let i = 0; i &amp;lt; int16.length; i++)
    float[i] = int16[i] / 0x8000;

  const buffer = audioCtx.createBuffer(1, float.length, SAMPLE_RATE);
  buffer.getChannelData(0).set(float);

  const source = audioCtx.createBufferSource();
  source.buffer = buffer;
  source.connect(audioCtx.destination);

  const now = audioCtx.currentTime;
  if (playbackTime &amp;lt; now) playbackTime = now;
  source.start(playbackTime);
  playbackTime += buffer.duration;

  scheduledSources.push(source);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;5. Barge-In Stops Scheduled Audio Cold&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When the user speaks over the agent, the server emits reply.done with status: "interrupted" and trims transcript.agent to what was actually spoken. The client’s job is to drop any audio that was scheduled but hasn’t played yet:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function flushPlayback() {
  for (const src of scheduledSources) {
    try { src.stop(); } catch (_) {}
  }
  scheduledSources = [];
  playbackTime = audioCtx.currentTime;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That’s the entire interruption story: stop every scheduled AudioBufferSourceNode, reset the cursor to “now.”&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Customization&lt;/strong&gt;
&lt;/h1&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pick a Different Voice&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The voice picker is wired to session.output.voice. The dropdown ships 16 popular options. Eighteen English voices and 16 multilingual voices are available in total. See the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/voices" rel="noopener noreferrer"&gt;Voices catalog&lt;/a&gt; for samples of each. Multilingual voices code-switch with English automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Change the Personality with the System Prompt&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The textarea is bound to session.system_prompt. Tighten it for shorter replies, give it a persona, or scope it to a specific use case:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a customer-support agent for Acme Cloud Storage. Only answer
questions about Acme’s products, plans, and account billing. If the user
asks about anything else, politely redirect.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can also re-send session.update mid-conversation to swap personas live. Note: greeting and output are immutable after the first apply — only system_prompt, tools, and input can change mid-session.&lt;/p&gt;
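&lt;p&gt;As a sketch, the mid-conversation swap is just another session.update carrying only the mutable fields:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "type": "session.update",
  "session": {
    "system_prompt": "You are now a billing specialist. Keep replies brief."
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;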

&lt;h2&gt;
  
  
  &lt;strong&gt;Tune Turn Detection&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Add input.turn_detection to the session.update payload to control how patient the agent feels:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;session: {
  input: {
    turn_detection: {
      vad_threshold: 0.5,        // 0.0–1.0; higher = less sensitive
      min_silence: 600,          // ms; min silence before end-of-turn
      max_silence: 1500,         // ms; hard cap before forcing end-of-turn
      interrupt_response: true,  // false to disable barge-in entirely
    },
  },
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For noisy environments, raise vad_threshold. For deliberate speech (healthcare, eldercare) raise max_silence.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Troubleshooting&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Mic blocked or no audio going up.&lt;/strong&gt; Browsers only allow getUserMedia on secure origins. &lt;a href="http://localhost" rel="noopener noreferrer"&gt;http://localhost&lt;/a&gt; counts; &lt;a href="http://your-laptop.local" rel="noopener noreferrer"&gt;http://your-laptop.local&lt;/a&gt; doesn’t. If you’re testing across devices, terminate TLS in front of the Node server (Caddy, Cloudflare Tunnel, ngrok).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent keeps interrupting itself.&lt;/strong&gt; Acoustic echo — the mic is picking up the speakers. Use headphones, or confirm echoCancellation: true is set on getUserMedia.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audio sounds chipmunky or slowed-down.&lt;/strong&gt; Sample-rate mismatch. The AudioContext must be created with { sampleRate: 24000 }. If you skip that, the browser creates the context at 44.1 or 48 kHz and the math falls apart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UNAUTHORIZED close on connect.&lt;/strong&gt; Token wasn’t included, expired, or was already used. Tokens are single-use — fetch a fresh one for every connection. Confirm ASSEMBLYAI_API_KEY is set on the server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebSocket closes with code 1006 and no error.&lt;/strong&gt; Pre-handshake failure. In browsers, that’s usually a stale or invalid token. Re-fetch the token before reconnecting.&lt;/p&gt;

&lt;p&gt;The full troubleshooting guide is in the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/troubleshooting" rel="noopener noreferrer"&gt;Voice Agent API docs&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h1&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is AssemblyAI’s Voice Agent API?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI’s Voice Agent API is a single WebSocket endpoint that handles the entire voice agent pipeline server-side — speech-to-text, LLM reasoning, and text-to-speech — so you can build a conversational voice agent without wiring up separate STT, LLM, and TTS providers. It includes neural turn detection, barge-in, tool calling, and 30+ voices out of the box.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do I use the Voice Agent API from the browser without exposing my API key?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Run a small backend that mints short-lived temporary tokens by calling GET &lt;a href="https://agents.assemblyai.com/v1/token" rel="noopener noreferrer"&gt;https://agents.assemblyai.com/v1/token&lt;/a&gt; with your API key in an Authorization: Bearer header. The browser fetches a fresh token before each WebSocket connection and passes it as ?token= in the URL. Tokens are single-use and expire in 1–600 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What audio format does the browser need to send?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;By default, the Voice Agent API expects audio/pcm — 16-bit signed little-endian PCM at 24,000 Hz, mono, base64-encoded. Create the AudioContext with { sampleRate: 24000 } so no resampling is needed, then use an AudioWorklet to convert Float32 mic samples to Int16Array and base64-encode the buffer.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do I handle interruption (barge-in)?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When the user speaks while the agent is replying, the server emits reply.done with status: "interrupted". The browser must stop any scheduled AudioBufferSourceNodes and reset its playbackTime cursor to audioCtx.currentTime so the next reply starts cleanly.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why does my voice agent keep interrupting itself?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Almost always acoustic echo: the mic is picking up the agent’s TTS output through the speakers. Pass echoCancellation: true to getUserMedia to enable the browser’s OS-level acoustic echo cancellation, and prefer headphones during development.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Can the Voice Agent API call tools or functions from the browser?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Yes — tool calling works the same way client-side. Register tool definitions in session.tools on a session.update event. When the agent decides to call a tool, the server emits a tool.call event. Execute the tool in your client code, then send a tool.result event when you receive the next reply.done. See the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/tool-calling" rel="noopener noreferrer"&gt;tool calling guide&lt;/a&gt; for the full pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How much does the Voice Agent API cost?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI offers a free tier so you can build and test without a credit card. For current pricing, see the &lt;a href="https://www.assemblyai.com/pricing" rel="noopener noreferrer"&gt;AssemblyAI pricing page&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>ai</category>
      <category>webdev</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Node.js voice agent with AssemblyAI's Voice Agent API</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Thu, 07 May 2026 17:33:16 +0000</pubDate>
      <link>https://forem.com/martschweiger/nodejs-voice-agent-with-assemblyais-voice-agent-api-1eei</link>
      <guid>https://forem.com/martschweiger/nodejs-voice-agent-with-assemblyais-voice-agent-api-1eei</guid>
      <description>&lt;h1&gt;
  
  
  &lt;strong&gt;Why Node.js + the Voice Agent API&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Most JavaScript voice agent tutorials stitch together a separate STT service, an LLM, and a TTS provider, and every layer adds latency, cost, and a place to fail.&lt;/p&gt;

&lt;p&gt;The Voice Agent API replaces that pipeline with a single WebSocket:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Multi-vendor JS pipeline&lt;/th&gt;
&lt;th&gt;Voice Agent API in Node.js&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;npm packages for AI&lt;/td&gt;
&lt;td&gt;3+ (one per vendor)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API keys to manage&lt;/td&gt;
&lt;td&gt;3+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Round trips per turn&lt;/td&gt;
&lt;td&gt;3 (mic→STT→LLM→TTS→speaker)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Turn detection&lt;/td&gt;
&lt;td&gt;Wire up VAD or LLM endpointing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Barge-in&lt;/td&gt;
&lt;td&gt;Implement yourself&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool calling&lt;/td&gt;
&lt;td&gt;Bridge LLM tool defs to your runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The endpoint is one URL: wss://agents.assemblyai.com/v1/ws. Send PCM16, get PCM16. That's the whole protocol surface.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Architecture&lt;/strong&gt;
&lt;/h1&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; Microphone (mic + sox/arecord, 24 kHz PCM16)
    │
    │  base64-encoded chunks
    │  { type: "input.audio", audio: "..." }
    ▼
┌─────────────────────────────────────────────────┐
│  wss://agents.assemblyai.com/v1/ws              │
│                                                 │
│  AssemblyAI Voice Agent API                     │
│  ├── Universal-3 Pro Streaming  (speech → text) │
│  ├── LLM                        (text → reply)  │
│  └── TTS                        (reply → audio) │
│                                                 │
│  + neural turn detection                        │
│  + barge-in                                     │
│  + tool calling                                 │
└─────────────────────────────────────────────────┘
    │
    │  base64-encoded chunks
    │  { type: "reply.audio", data: "..." }
    ▼
Speakers (speaker, 24 kHz PCM16)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;Prerequisites&lt;/strong&gt;
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Node.js 20+&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;A microphone — &lt;strong&gt;headphones strongly recommended&lt;/strong&gt; (terminal apps don't get OS-level echo cancellation, so mic-to-speaker feedback will trigger barge-in)&lt;/li&gt;
&lt;li&gt;An &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;AssemblyAI API key&lt;/a&gt; — free tier available&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mic and speaker packages call into native audio backends:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;macOS:&lt;/strong&gt; brew install sox&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debian/Ubuntu:&lt;/strong&gt; sudo apt-get install sox libsox-fmt-all libasound2-dev&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Windows:&lt;/strong&gt; install SoX and the Visual Studio C++ build tools&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;Quick start&lt;/strong&gt;
&lt;/h1&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;1. Clone and install&lt;/strong&gt;
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; git clone https://github.com/kelsey-aai/voice-agent-nodejs
cd voice-agent-nodejs

npm install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;2. Configure your API key&lt;/strong&gt;
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; cp .env.example .env
# Edit .env — drop in your AssemblyAI API key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;3. Run the agent&lt;/strong&gt;
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; npm start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Plug in your headphones, wait for Connected (session ...), and start talking.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Connected (session sess_abc123).
Speak now. Press Ctrl+C to quit.

You:   What can you do?
Agent: I can chat with you, answer questions, or help work through ideas — whatever you'd like.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That's the whole thing. Under 100 lines of Node.js.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;How it works&lt;/strong&gt;
&lt;/h1&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Open the WebSocket and configure the session&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; import "dotenv/config";
import WebSocket from "ws";

const ws = new WebSocket("wss://agents.assemblyai.com/v1/ws", {
  headers: { Authorization: `Bearer ${process.env.ASSEMBLYAI_API_KEY}` },
});

ws.on("open", () =&amp;gt; {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      system_prompt: "You are a friendly voice assistant. Keep replies short.",
      greeting: "Hi there — what can I help you with?",
      output: { voice: "ivy" },
    },
  }));
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;session.update is always your first message. It sets the agent's personality (system_prompt), what it says when the call connects (greeting), and the voice it speaks in (voice). All fields are optional, and you can re-send session.update at any time during the conversation to change them.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Stream microphone audio&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; import mic from "mic";

const micInstance = mic({
  rate: "24000", channels: "1", encoding: "signed-integer",
  bitwidth: "16", endian: "little",
});
const micStream = micInstance.getAudioStream();

micStream.on("data", (chunk) =&amp;gt; {
  if (sessionReady &amp;amp;&amp;amp; ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({
      type: "input.audio",
      audio: chunk.toString("base64"),
    }));
  }
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;mic shells out to sox (or arecord on Linux) and emits raw PCM16 buffers as data events. We base64-encode each buffer and ship it as an input.audio event. The sessionReady gate keeps us from sending audio before the server has acknowledged the configuration with session.ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Play the agent's response&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; import Speaker from "speaker";

let speaker = new Speaker({ channels: 1, bitDepth: 16, sampleRate: 24000 });
let sessionReady = false;  // flipped once the server acknowledges session.update

ws.on("message", (raw) =&amp;gt; {
  const event = JSON.parse(raw.toString());

  if (event.type === "session.ready") {
    sessionReady = true;  // opens the mic gate in the streaming step above
  } else if (event.type === "reply.audio") {
    speaker.write(Buffer.from(event.data, "base64"));
  } else if (event.type === "reply.done" &amp;amp;&amp;amp; event.status === "interrupted") {
    speaker.end();
    speaker = new Speaker({ channels: 1, bitDepth: 16, sampleRate: 24000 });
  }
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;reply.audio events stream as the agent generates each phrase, so playback starts within hundreds of milliseconds of the end of the user's turn. Each chunk is base64-decoded and written straight to the speaker stream, which feeds the OS audio buffer.&lt;/p&gt;

&lt;p&gt;When the user interrupts the agent mid-reply (barge-in), the server emits reply.done with status: "interrupted". Node's speaker package doesn't expose a clean flush, so the simplest reliable pattern is to end the current speaker stream and create a fresh one for the next reply.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;What you get for free&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;These are all handled by the API — you write zero code for them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Neural turn detection.&lt;/strong&gt; The server combines acoustic and linguistic signals to decide when the user has finished speaking, so it knows the difference between a thinking pause and an actual end-of-turn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Barge-in.&lt;/strong&gt; When the user speaks over the agent, the server stops generating, sends reply.done with status: "interrupted", and trims the agent transcript to what was actually spoken.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time partial transcripts.&lt;/strong&gt; transcript.user.delta events stream as the user talks, so you can show what they're saying live.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final transcripts both ways.&lt;/strong&gt; transcript.user and transcript.agent events arrive after each turn — perfect for logging, chat history, or moderation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Customizing the agent&lt;/strong&gt;
&lt;/h1&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pick a Different Voice&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Eighteen English voices and 16 multilingual voices are available. Drop any voice ID into session.output.voice:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output: { voice: "james" }    // conversational US male
output: { voice: "sophie" }   // clear UK female
output: { voice: "diego" }    // Latin American Spanish
output: { voice: "arjun" }    // Hindi/Hinglish
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;See the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/voices" rel="noopener noreferrer"&gt;Voices catalog&lt;/a&gt; for samples. Multilingual voices code-switch with English automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Tune turn detection&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Defaults work well for most apps. Override anything you want under session.input.turn_detection:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input: {
  turn_detection: {
    vad_threshold: 0.5,        // 0.0–1.0; lower = more sensitive
    min_silence: 600,          // ms; min silence before confident end-of-turn
    max_silence: 1500,         // ms; max silence before forcing end-of-turn
    interrupt_response: true,  // false to disable barge-in
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For noisy environments, raise vad_threshold. For deliberate speech (healthcare, eldercare), raise max_silence.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Boost domain-specific terms&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If your conversation involves rare words — product names, medical terms, customer names — add them to session.input.keyterms to bias speech recognition toward them:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input: { keyterms: ["Ozempic", "Salesforce", "AssemblyAI"] }&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;Troubleshooting&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;The agent keeps interrupting itself.&lt;/strong&gt; Your microphone is picking up the agent's TTS output. Use headphones, or move to a &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/browser-integration" rel="noopener noreferrer"&gt;browser-based client&lt;/a&gt; which gets free echo cancellation from getUserMedia.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;speaker install fails on Linux.&lt;/strong&gt; Install ALSA dev headers: sudo apt-get install libasound2-dev.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;speaker install fails on macOS with Node 22+.&lt;/strong&gt; Some node-gyp-based packages lag the latest Node major release. Use Node 20 LTS, or replace speaker with the bundled wav decoder + play-sound if you don't need streaming playback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;mic produces silence.&lt;/strong&gt; Check that sox (macOS) or arecord (Linux) is on your PATH and that your terminal has microphone permission (macOS: System Settings → Privacy &amp;amp; Security → Microphone).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UNAUTHORIZED close on connect.&lt;/strong&gt; Your API key is missing, expired, or wrong. Grab a fresh one from the &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;AssemblyAI dashboard&lt;/a&gt; and re-check .env.&lt;/p&gt;

&lt;p&gt;The full guide is in the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/troubleshooting" rel="noopener noreferrer"&gt;Voice Agent API troubleshooting docs&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h1&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is AssemblyAI's Voice Agent API?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A single WebSocket endpoint that handles the entire voice agent pipeline server-side — speech-to-text, LLM reasoning, and text-to-speech — so you can build a conversational voice agent without wiring up separate providers. It includes neural turn detection, barge-in, tool calling, and 30+ voices out of the box.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do I build a voice agent in Node.js?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The simplest pattern is ws + mic + speaker: open a WebSocket to the Voice Agent API, send a session.update with your system prompt and voice, then pipe microphone audio in and speaker audio out. Under 100 lines. No LLM or TTS SDK needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What audio format does the Voice Agent API expect from a Node.js client?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;By default, audio/pcm — 16-bit signed little-endian PCM at 24,000 Hz, mono, base64-encoded. Configure the mic package with rate: "24000", channels: "1", encoding: "signed-integer", bitwidth: "16". For telephony integrations you'd switch to audio/pcmu (G.711 μ-law, 8 kHz).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do I authenticate to the Voice Agent API from Node.js?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Pass your AssemblyAI API key as a Bearer token in the Authorization header during the WebSocket upgrade. For browser apps where you can't expose your API key, mint a short-lived temporary token on your Node server and pass it as a ?token= query parameter from the browser instead.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why is the agent interrupting itself in my Node.js terminal app?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The microphone is picking up the agent's TTS output through your speakers, the server interprets that as the user speaking, and barge-in fires. The fix is either headphones or running the client in a browser, where getUserMedia({ audio: { echoCancellation: true } }) gives you OS-level acoustic echo cancellation for free.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Can the Voice Agent API call functions from Node.js?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Yes. Register tool definitions in session.tools on a session.update event. When the agent decides to call a tool, the server emits a tool.call event. Run your function in Node, then send a tool.result event when you receive the next reply.done.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How much does the AssemblyAI Voice Agent API cost?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI offers a free tier so you can build and test without a credit card. For current pricing, see the &lt;a href="https://www.assemblyai.com/pricing" rel="noopener noreferrer"&gt;AssemblyAI pricing page&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>ai</category>
      <category>node</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Raw WebSocket Voice Agent with AssemblyAI's Voice Agent API</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Thu, 07 May 2026 17:33:11 +0000</pubDate>
      <link>https://forem.com/martschweiger/raw-websocket-voice-agent-with-assemblyais-voice-agent-api-4ibm</link>
      <guid>https://forem.com/martschweiger/raw-websocket-voice-agent-with-assemblyais-voice-agent-api-4ibm</guid>
      <description>&lt;h1&gt;
  
  
  &lt;strong&gt;What "raw" means here&lt;/strong&gt;
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;5-minute quickstart&lt;/th&gt;
&lt;th&gt;Raw WebSocket (this tutorial)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lines of Python&lt;/td&gt;
&lt;td&gt;~80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Events handled&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partial transcripts (&lt;code&gt;transcript.user.delta&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;✖&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool calling&lt;/td&gt;
&lt;td&gt;✖&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Session resume on reconnect&lt;/td&gt;
&lt;td&gt;✖&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speech start/stop logging&lt;/td&gt;
&lt;td&gt;✖&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error code handling&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you want the fastest path to a working agent, start with the 5-minute quickstart. If you want to ship the Voice Agent API into a real product, build on this one — every edge case the protocol expresses is already in here.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Architecture&lt;/strong&gt;
&lt;/h1&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; Microphone (sounddevice, 24 kHz PCM16)
    │
    │  ┌──── client → server ────┐
    │  │  session.update         │  config (1st message)
    │  │  session.resume         │  reconnect within 30s
    │  │  input.audio            │  base64 PCM16 chunks
    │  │  tool.result            │  send on next reply.done
    │  └────────────────────────┘
    ▼
wss://agents.assemblyai.com/v1/ws
    ▲
    │  ┌──── server → client ────┐
    │  │  session.ready          │  save session_id
    │  │  session.updated        │
    │  │  input.speech.started   │
    │  │  input.speech.stopped   │
    │  │  transcript.user.delta  │  partial — live transcript
    │  │  transcript.user        │  final user transcript
    │  │  reply.started          │
    │  │  reply.audio            │  base64 PCM16 chunks
    │  │  transcript.agent       │  full agent transcript
    │  │  reply.done             │  status: "interrupted" on barge-in
    │  │  tool.call              │  arguments is a dict
    │  │  session.error          │  code + message
    │  └────────────────────────┘
Speakers (sounddevice, 24 kHz PCM16)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;Prerequisites&lt;/strong&gt;
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Python 3.10+&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;A microphone — &lt;strong&gt;headphones strongly recommended&lt;/strong&gt; (terminal apps don't get OS-level echo cancellation)&lt;/li&gt;
&lt;li&gt;An &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;AssemblyAI API key&lt;/a&gt; — free tier available&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On macOS, install PortAudio for sounddevice:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;brew install portaudio
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;Quick start&lt;/strong&gt;
&lt;/h1&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; git clone https://github.com/kelsey-aai/voice-agent-raw-websocket
cd voice-agent-raw-websocket

python -m venv .venv &amp;amp;&amp;amp; source .venv/bin/activate
pip install -r requirements.txt

cp .env.example .env
# Edit .env with your AssemblyAI API key

python agent.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You'll see every event flow through the terminal as you talk:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[client→server] session.update (initial config)
[server→client] session.ready (sess_abc123)
Speak now. Press Ctrl+C to quit.

[server→client] input.speech.started
  …  what's the weather in tok
You:   What's the weather in Tokyo?
[server→client] reply.started (reply_xyz)
[server→client] tool.call get_weather({'location': 'Tokyo'}) id=call_abc
[server→client] reply.done
[client→server] tool.result id=call_abc
[server→client] reply.started (reply_xyz2)
Agent: It's currently 22 degrees and sunny in Tokyo.
[server→client] reply.done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;Every event, explained&lt;/strong&gt;
&lt;/h1&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Client → Server&lt;/strong&gt;
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;session.update — First Message, Also Re-sendable&lt;/strong&gt;
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; await ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "system_prompt": "...",
        "greeting": "...",
        "output": {"voice": "ivy"},
        "tools": [...],
    },
}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Always your first message. Configures the agent's personality, the spoken greeting, the voice, registered tools, turn detection sensitivity, and audio format. All fields are optional, and you can re-send session.update mid-conversation to update most of them; greeting and the output block are immutable after the first apply.&lt;/p&gt;
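
&lt;p&gt;For example, a later re-send that narrows only the system prompt, leaving every other field at its current value (a minimal sketch):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; # Mid-conversation: any field you omit keeps its current value.
await ws.send(json.dumps({
    "type": "session.update",
    "session": {"system_prompt": "Keep answers to two sentences."},
}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;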

&lt;h3&gt;
  
  
  &lt;strong&gt;input.audio — Stream Microphone Audio&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; await ws.send(json.dumps({
    "type": "input.audio",
    "audio": base64.b64encode(pcm_bytes).decode(),
}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Send only after session.ready. ~50 ms chunks at 24 kHz PCM16 mono. The server buffers across chunks, so chunk size isn't strict.&lt;/p&gt;
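
&lt;p&gt;As a sanity check on chunk size: 24,000 samples/s × 2 bytes/sample × 0.05 s = 2,400 bytes per chunk. A minimal capture loop with sounddevice could look like this (a sketch; a blocking read inside an async loop is fine for a demo, less so for production):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; # Assumes: import base64, json and import sounddevice as sd
CHUNK_FRAMES = int(24_000 * 0.05)  # 50 ms at 24 kHz = 1,200 frames (2,400 bytes)

with sd.RawInputStream(samplerate=24_000, channels=1, dtype="int16") as mic:
    while True:
        pcm, _overflowed = mic.read(CHUNK_FRAMES)
        await ws.send(json.dumps({
            "type": "input.audio",
            "audio": base64.b64encode(bytes(pcm)).decode(),
        }))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;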

&lt;h3&gt;
  
  
  &lt;strong&gt;tool.result — Answer a tool.call&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; await ws.send(json.dumps({
    "type": "tool.result",
    "call_id": "call_abc123",
    "result": json.dumps({"temp_c": 22}),
}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The critical pattern:&lt;/strong&gt; accumulate tool.result events and only send them when you receive reply.done for the turn that contained the tool.call. Sending them any earlier, while the turn is still streaming, is the classic source of timing bugs. If the user interrupts mid-turn (reply.done with status: "interrupted"), discard the accumulated results.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;session.resume — Reconnect Within 30 Seconds&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; await ws.send(json.dumps({
    "type": "session.resume",
    "session_id": "sess_abc123",
}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If your WebSocket drops and you reconnect within 30 seconds with the previous session_id, the server preserves conversation context. Past 30 seconds you'll get session_not_found and need to start fresh.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Server → Client&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Carries&lt;/th&gt;
&lt;th&gt;What to do&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;session.ready&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;session_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Save the id, start sending &lt;code&gt;input.audio&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;session.updated&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;A &lt;code&gt;session.update&lt;/code&gt; was applied&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;input.speech.started&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;VAD detected speech onset&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;input.speech.stopped&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;VAD detected end of speech&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;transcript.user.delta&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;text&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Live partial — overwrite a single UI line&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;transcript.user&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;text&lt;/code&gt;, &lt;code&gt;item_id&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Final user transcript for the turn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;reply.started&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;reply_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Agent began generating&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;reply.audio&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;data&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Base64 PCM chunk — decode and play immediately&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;transcript.agent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;text&lt;/code&gt;, &lt;code&gt;interrupted&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Full agent transcript (trimmed if interrupted)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;reply.done&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;optional &lt;code&gt;status&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Reply complete; on &lt;code&gt;"interrupted"&lt;/code&gt;, flush speaker and discard pending tool results&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tool.call&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;call_id&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;arguments&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Run the tool, accumulate the result, send on next &lt;code&gt;reply.done&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;session.error&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;code&lt;/code&gt;, &lt;code&gt;message&lt;/code&gt;, &lt;code&gt;timestamp&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;See error table below&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Error codes&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Recovery&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;UNAUTHORIZED&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Bad/missing API key — fetch a fresh one&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;FORBIDDEN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Token valid but lacks permission&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;session_not_found&lt;/code&gt; / &lt;code&gt;session_forbidden&lt;/code&gt; / &lt;code&gt;session_expired&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;session.resume&lt;/code&gt; failed — start a fresh session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;agent_init_failed&lt;/code&gt; / &lt;code&gt;agent_timeout&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Server-side worker issue — retry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;invalid_format&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Bad JSON or unknown &lt;code&gt;type&lt;/code&gt; field&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;invalid_audio&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;base64 decode or PCM conversion failed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;invalid_value&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Bad voice or wrong field type in &lt;code&gt;session.update&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;immutable_field&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Tried to change &lt;code&gt;greeting&lt;/code&gt; or &lt;code&gt;output&lt;/code&gt; after first apply&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;server_error&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;At capacity — exponential backoff and retry&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;How the tool-calling pattern works&lt;/strong&gt;
&lt;/h1&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; pending_tools: list[dict] = []

if t == "tool.call":
    # 1. Accumulate — don't send a tool.result yet.
    if event["name"] == "get_weather":
        result = fake_get_weather(event["arguments"]["location"])
    pending_tools.append({"call_id": event["call_id"], "result": result})

elif t == "reply.done":
    if event.get("status") == "interrupted":
        # 2a. Barge-in: discard pending results.
        pending_tools.clear()
    else:
        # 2b. Reply finished cleanly: send all accumulated results.
        for tool in pending_tools:
            await ws.send(json.dumps({
                "type": "tool.result",
                "call_id": tool["call_id"],
                "result": json.dumps(tool["result"]),
            }))
        pending_tools.clear()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;While the server waits for your tool.result, the agent speaks a transition phrase ("Let me check that for you") so the conversation doesn't go silent. You can steer that phrase by including instructions in the system prompt.&lt;/p&gt;
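
&lt;p&gt;For example, one line in the initial session.update usually suffices (illustrative wording, not a required phrase):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; "system_prompt": (
    "..."
    "When you call a tool, say a brief phrase such as "
    "'One sec, let me pull that up' while you wait for the result."
),
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;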

&lt;h1&gt;
  
  
  &lt;strong&gt;How session resume works&lt;/strong&gt;
&lt;/h1&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; session_id: str | None = None
while True:
    try:
        session_id = await run_session(session_id)
    except websockets.ConnectionClosed:
        # Reconnect within 30s — server preserves conversation.
        await asyncio.sleep(1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;run_session returns the session_id it received from the most recent session.ready. On the next iteration, that id is passed back as prev_session_id and the first message becomes session.resume instead of session.update. If the server replies session_not_found or session_expired, we clear the saved id and start a fresh session.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Tuning the agent&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Voice, system prompt, turn detection, and key terms are all configured under session.update. See &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/session-configuration" rel="noopener noreferrer"&gt;Session configuration&lt;/a&gt; for every field. The same tuning options apply to every Voice Agent API tutorial — pick the one closest to your stack and the protocol underneath stays the same.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Common issues&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;The agent keeps interrupting itself.&lt;/strong&gt; Mic feedback. Use headphones, or move to a &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/browser-integration" rel="noopener noreferrer"&gt;browser-based client&lt;/a&gt; which gets free echo cancellation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audio sounds garbled or pitched up/down.&lt;/strong&gt; The Voice Agent API uses 24 kHz by default. Make sure both sd.InputStream and sd.OutputStream are set to samplerate=24000.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;session_not_found immediately after reconnect.&lt;/strong&gt; The 30 s grace window expired or your network dropped for too long. Clear the saved session_id and start fresh.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UNAUTHORIZED close on connect.&lt;/strong&gt; Your API key is missing, expired, or wrong. Grab a fresh one from the &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;AssemblyAI dashboard&lt;/a&gt; and re-check .env.&lt;/p&gt;

&lt;p&gt;The full troubleshooting guide is in the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/troubleshooting" rel="noopener noreferrer"&gt;Voice Agent API docs&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h1&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is AssemblyAI's Voice Agent API?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A single WebSocket endpoint that handles the entire voice agent pipeline server-side — speech-to-text, LLM reasoning, and text-to-speech — so you can build a conversational voice agent without wiring up separate providers. It includes neural turn detection, barge-in, tool calling, and 30+ voices out of the box.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How many events does the Voice Agent API protocol have?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Four client-to-server events (session.update, input.audio, session.resume, tool.result) and twelve server-to-client events. All sixteen are documented in the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/events-reference" rel="noopener noreferrer"&gt;events reference&lt;/a&gt; and handled explicitly in this tutorial's agent.py.&lt;/p&gt;
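
&lt;p&gt;A dispatch loop that names each one explicitly is the backbone of the tutorial's agent.py. Abbreviated, the shape is roughly this (a sketch; speaker stands in for the output stream, and most handlers are elided):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; async for raw in ws:
    event = json.loads(raw)
    t = event["type"]
    if t == "session.ready":
        session_id = event["session_id"]  # save for session.resume
    elif t == "reply.audio":
        speaker.write(base64.b64decode(event["data"]))
    elif t == "session.error":
        print(f"error {event['code']}: {event['message']}")
    # ...the remaining nine handlers follow the same shape
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;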

&lt;h3&gt;
  
  
  &lt;strong&gt;When should I send a tool.result to the Voice Agent API?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Always wait until you receive reply.done for the turn that contained the tool.call — never immediately. The recommended pattern is to accumulate tool results in a list as tool.call events arrive, then send them all in the reply.done handler. If reply.done arrives with status: "interrupted", discard the pending results.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How does session resume work?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Sessions are preserved for 30 seconds after disconnection. Reconnect within that window with {"type": "session.resume", "session_id": "sess_abc123"} (your saved session_id) as your first message. Past 30 seconds, you'll get session_not_found and need to send a fresh session.update.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Can I show live partial transcripts to the user?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Yes — that's what transcript.user.delta is for. The server emits a delta event every few hundred milliseconds while the user is speaking, then a final transcript.user event when the turn ends. Overwrite a single line on each delta and commit it when the final arrives.&lt;/p&gt;
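
&lt;p&gt;In a terminal client the overwrite is just a carriage return (a minimal sketch):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; if t == "transcript.user.delta":
    print("\r  …  " + event["text"], end="", flush=True)  # redraw in place
elif t == "transcript.user":
    print("\rYou: " + event["text"])  # commit the final line
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;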

&lt;h3&gt;
  
  
  &lt;strong&gt;What happens if the user interrupts the agent?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The server stops generating, emits reply.done with status: "interrupted", and emits transcript.agent with the text trimmed to what was actually spoken. Your client must flush its audio output buffer and discard any pending tool results.&lt;/p&gt;
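
&lt;p&gt;With sounddevice, flushing can be as blunt as aborting and restarting the output stream (a sketch that reuses the pending_tools list from the tool-calling section):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; if t == "reply.done" and event.get("status") == "interrupted":
    speaker.abort()        # drop buffered agent audio immediately
    speaker.start()        # ready for the next reply
    pending_tools.clear()  # never answer a tool.call from a dead turn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;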

</description>
      <category>voiceai</category>
      <category>ai</category>
      <category>websockets</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to vibe code a voice agent with AssemblyAI's Voice Agent API</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Thu, 07 May 2026 17:31:40 +0000</pubDate>
      <link>https://forem.com/martschweiger/how-to-vibe-code-a-voice-agent-with-assemblyais-voice-agent-api-4f8k</link>
      <guid>https://forem.com/martschweiger/how-to-vibe-code-a-voice-agent-with-assemblyais-voice-agent-api-4f8k</guid>
      <description>&lt;p&gt;Vibe coding a voice agent used to mean wrangling three separate APIs—one for speech-to-text, one for the LLM, one for text-to-speech—and hoping your AI coding assistant could hold the glue logic together across all three. That's changed.&lt;/p&gt;

&lt;p&gt;AssemblyAI's &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api" rel="noopener noreferrer"&gt;Voice Agent API&lt;/a&gt; collapses the entire pipeline into a single WebSocket. You send mic audio in, you get spoken audio back. Speech recognition, LLM reasoning, text-to-speech, turn detection, and barge-in handling all happen server-side. The integration surface is small enough that Claude Code, ChatGPT, or Cursor can generate a working voice agent on the first prompt.&lt;/p&gt;

&lt;p&gt;This post is a prompt guide. You'll get the exact prompts to build different types of voice agents, the setup that keeps your AI coding agent grounded in current docs, and the follow-up prompts that customize behavior without you touching the code directly. There's a &lt;a href="https://github.com/kelsey-aai/voice-assistant-app" rel="noopener noreferrer"&gt;companion repo&lt;/a&gt; with the complete output if you want to see what these prompts produce.&lt;/p&gt;

&lt;h2&gt;
  
  
  Set up your AI coding agent first
&lt;/h2&gt;

&lt;p&gt;AI coding assistants work best when they have current context about the APIs they're integrating with. AssemblyAI's API surface changes—model names, parameters, and SDK methods evolve—so agents that rely on training data alone will sometimes produce outdated code. There are three ways to ground your agent, ranked by how reliably they keep output correct.&lt;/p&gt;

&lt;h3&gt;
  
  
  Project instructions (most effective)
&lt;/h3&gt;

&lt;p&gt;Add this to your project's CLAUDE.md, .cursorrules, or equivalent agent instructions file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Always fetch https://www.assemblyai.com/docs/llms.txt before writing AssemblyAI code.
The API has changed — do not rely on memorized parameter names.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This runs on every prompt. The agent will check current docs before generating any AssemblyAI code, catching breaking changes automatically.&lt;/p&gt;

&lt;p&gt;You can also filter the full docs by language to reduce token usage:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/docs/llms-full.txt?lang=python
/docs/llms-full.txt?lang=typescript
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  MCP server (tool-based doc access)
&lt;/h3&gt;

&lt;p&gt;Connect the AssemblyAI docs MCP server to give your agent on-demand access to search and read documentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; claude mcp add assemblyai-docs --transport http https://mcp.assemblyai.com/docs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Cursor&lt;/strong&gt; (.cursor/mcp.json):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "mcpServers": {
    "assemblyai-docs": {
      "url": "https://mcp.assemblyai.com/docs"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This gives your agent four tools: search_docs, get_pages, list_sections, and get_api_reference. Any MCP client that supports &lt;a href="https://modelcontextprotocol.io/specification/2025-03-26/basic/transports#streamable-http" rel="noopener noreferrer"&gt;Streamable HTTP transport&lt;/a&gt; can connect using the server URL.&lt;/p&gt;

&lt;p&gt;If your agent still relies on training data instead of looking up the docs, add this to your project instructions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;For anything AssemblyAI related, use the assemblyai-docs MCP tools first.
Do not rely on training data.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  AssemblyAI skill (deep SDK context)
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/AssemblyAI/assemblyai-skill" rel="noopener noreferrer"&gt;AssemblyAI skill&lt;/a&gt; gives your AI coding assistant curated instructions and context for the Python and JavaScript SDKs, streaming, voice agents, audio intelligence, and more. It works with Claude Code, Cursor, Copilot, and 60+ other coding agents via the universal skills CLI:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx skills add AssemblyAI/assemblyai-skill
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can also clone it directly into your skills directory. For Claude Code that's &lt;code&gt;~/.claude/skills/assemblyai-skill/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer all three for best results.&lt;/strong&gt; Project instructions catch every prompt, the MCP server provides on-demand lookups, and the skill gives deep SDK context. You don't need all three to get started—but if you're building anything beyond a proof-of-concept, using them together prevents the most common mistakes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Voice Agent API is ideal for vibe coding
&lt;/h2&gt;

&lt;p&gt;Before the prompts, it helps to understand why this particular API plays so well with AI coding assistants.&lt;/p&gt;

&lt;p&gt;A traditional voice agent needs you to wire up streaming STT, an LLM, and a TTS provider, then orchestrate audio routing between them. That's three different SDKs, three auth patterns, and audio format conversions at every boundary. AI coding assistants struggle with this because there are too many moving parts and too many ways to get the integration wrong.&lt;/p&gt;

&lt;p&gt;The Voice Agent API is one WebSocket URL: &lt;code&gt;wss://agents.assemblyai.com/v1/ws&lt;/code&gt;. You send 24 kHz PCM audio, you get 24 kHz PCM audio back. One API key, one auth pattern (temporary tokens for browser apps), and a handful of JSON event types. That's a small enough surface that an AI coding assistant can generate it correctly without follow-up corrections.&lt;/p&gt;
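
&lt;p&gt;As a rough illustration of how small that surface is, here is the entire connection setup in Python (a sketch using the websockets package; the header argument is named extra_headers on releases older than v14):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio, json, os
import websockets  # pip install websockets

async def main():
    async with websockets.connect(
        "wss://agents.assemblyai.com/v1/ws",
        additional_headers={"Authorization": f"Bearer {os.environ['ASSEMBLYAI_API_KEY']}"},
    ) as ws:
        # First message configures the session; after that it's audio both ways.
        await ws.send(json.dumps({"type": "session.update", "session": {}}))

asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;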

&lt;p&gt;It also handles the hard stuff server-side—neural turn detection that knows the difference between a pause and the end of a sentence, barge-in that stops the agent immediately when the user interrupts, and tool calling that lets the agent execute real functions mid-conversation. These are features that would each take hundreds of lines to implement in a multi-service pipeline. With the Voice Agent API, they're just configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starter prompts
&lt;/h2&gt;

&lt;p&gt;These prompts work with Claude Code, ChatGPT, Cursor, and any capable AI coding assistant. Each one builds a different type of voice agent. Copy them as-is, or use them as templates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Browser voice assistant (full-featured)
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Build a browser voice assistant using AssemblyAI's Voice Agent API.

Requirements:
- Node.js Express server that mints temporary tokens at /api/voice-token
  (GET https://agents.assemblyai.com/v1/token with Bearer auth)
- Browser client that opens a WebSocket to wss://agents.assemblyai.com/v1/ws?token=...
- AudioWorklet for mic capture at 24 kHz (Int16 PCM, base64-encoded)
- Chat-style transcript UI with user/agent bubbles
- Barge-in support: flush scheduled AudioBufferSourceNodes on reply.done interrupted
- Voice picker dropdown, system prompt textarea, greeting selector
- Dark theme, polished CSS
- Use .env for the API key. Never expose the key to the browser.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This prompt produced the &lt;a href="https://github.com/kelsey-aai/voice-assistant-app" rel="noopener noreferrer"&gt;companion app&lt;/a&gt;—about 70 lines of server code and ~400 lines of client code. It's specific enough that the AI gets the architecture right on the first try: temporary tokens on the server, AudioWorklet for glitch-free mic capture, a playback cursor for scheduling reply audio chunks, and barge-in handling that flushes queued audio when the user interrupts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Minimal proof-of-concept
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Build the simplest possible browser voice agent with AssemblyAI's Voice Agent API.

Single HTML file served by a tiny Express server. The server mints tokens at
/api/voice-token, the browser opens a WebSocket with the token, sends mic audio
as base64 PCM at 24 kHz, and plays reply.audio back through AudioBufferSourceNode.
Handle session.update, session.ready, reply.audio, and transcript events. No frameworks.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Good when you want to validate the concept in 15 minutes before committing to a polished build.&lt;/p&gt;

&lt;h3&gt;
  
  
  Customer-support agent with tool calling
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Build a customer-support voice agent for a SaaS product using AssemblyAI's
Voice Agent API. Browser-based, with a Node.js backend for token minting.

The agent should:
- Greet callers professionally
- Answer questions about pricing, features, and account billing
- Use tool calling to look up account status (mock the tool handler)
- Escalate to a human when it can't help

System prompt should scope the agent to support-only conversations.
Include turn detection tuning for patient, deliberate speech.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This exercises tool calling and custom turn detection—two features the Voice Agent API handles natively that would take significant effort to wire up across three separate services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phone agent with Twilio
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Build a voice agent that handles inbound phone calls using AssemblyAI's
Voice Agent API and Twilio Media Streams. Python backend.

- Twilio sends audio as PCMU (mulaw) at 8 kHz via WebSocket
- Bridge Twilio's Media Stream WebSocket to wss://agents.assemblyai.com/v1/ws
- Set BOTH input and output encoding to audio/pcmu in session.update
  (otherwise the agent's reply comes back as 24 kHz PCM that Twilio can't play
  without transcoding)
- Handle call events: connected, media, stop
- System prompt: appointment scheduling agent for a dental office
- Use tool calling to check available time slots (mock data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Telephony audio is harder than microphone audio—8 kHz sampling, compression artifacts, background noise. This prompt tells the AI exactly which encoding to use on both sides of the bridge so it doesn't default to the browser's 24 kHz PCM format and force unnecessary transcoding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python terminal agent (no browser)
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Build a terminal-based voice agent in Python using AssemblyAI's Voice Agent API.

- Use sounddevice for mic input and speaker output at 24 kHz
- Open a WebSocket to wss://agents.assemblyai.com/v1/ws with API key in
  Authorization header (no token needed for server-side apps)
- Send input.audio with base64-encoded PCM frames from the mic
- Play reply.audio PCM frames through the speaker
- Print transcript.user and transcript.agent events to the console
- Handle barge-in: stop playback on reply.done with status interrupted
- .env for the API key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;No browser, no server, no frontend. Useful for testing prompts and voice configurations quickly from the terminal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Follow-up prompts for customization
&lt;/h2&gt;

&lt;p&gt;Once the base app is running, you iterate with follow-up prompts. This is where vibe coding really shines—you describe what you want changed and the AI handles the implementation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Change the voice
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Change the default voice to "james" and add these voices to the picker:
diego (Spanish), arjun (Hindi/Hinglish), pierre (French), ren (Japanese)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Voice Agent API offers 18 English voices and 16 multilingual voices. Multilingual voices code-switch with English automatically. See the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/voices" rel="noopener noreferrer"&gt;voices catalog&lt;/a&gt; for samples.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tune turn detection
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Add turn detection tuning to the session config so the agent is more patient
and doesn't cut people off mid-sentence. Raise min_silence to 800ms and
max_silence to 2500ms. Leave vad_threshold at the default of 0.5.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For noisy environments, raise vad_threshold. For deliberate speech (healthcare, eldercare), raise max_silence. Set interrupt_response: false to disable barge-in entirely.&lt;/p&gt;
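
&lt;p&gt;Under the hood that maps to a session.update payload along these lines (a sketch; the field names come from the prompts above, but the exact nesting key is an assumption, so check the session-configuration docs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;await ws.send(json.dumps({
    "type": "session.update",
    "session": {
        # "turn_detection" as the wrapper key is assumed here.
        "turn_detection": {
            "min_silence": 800,          # ms of silence before a turn can end
            "max_silence": 2500,         # ms after which the turn always ends
            "vad_threshold": 0.5,        # raise in noisy environments
            "interrupt_response": True,  # False disables barge-in entirely
        },
    },
}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;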

&lt;h3&gt;
  
  
  Add tool calling
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Add a tool called "check_order_status" that takes an order_id string parameter.
Register it in session.tools on session.update. When the agent calls the tool,
return a mock response with order status "shipped" and tracking number
"1Z999AA10123456784". Send the tool.result on the next reply.done, not
immediately on tool.call.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Tool calling is what separates a voice agent from a voice chatbot. The Voice Agent API handles it natively—register tools with JSON Schema, and the agent calls your functions when appropriate, speaking a natural transition while it waits for results.&lt;/p&gt;
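
&lt;p&gt;For the check_order_status prompt above, the registration would look roughly like this (a sketch; the shape follows standard JSON Schema tool definitions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"tools": [{
    "name": "check_order_status",
    "description": "Look up the shipping status of an order.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "The order to look up"}
        },
        "required": ["order_id"]
    }
}]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;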

&lt;h3&gt;
  
  
  Add session resumption
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If the WebSocket disconnects, automatically reconnect within 30 seconds using&lt;br&gt;
session.resume. Fetch a new temporary token for the reconnection. Pass the&lt;br&gt;
previous session_id in the session.resume event so context is preserved.&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
&lt;br&gt;
  &lt;br&gt;
  &lt;br&gt;
  Swap the system prompt live&lt;br&gt;
&lt;/h3&gt;
&lt;br&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Add a button that sends a new session.update mid-conversation to change the&lt;br&gt;
system prompt without disconnecting. Note: greeting and output.voice are&lt;br&gt;
immutable after the first apply — only system_prompt, tools, and input can&lt;br&gt;
change mid-session.&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
&lt;br&gt;
  &lt;br&gt;
  &lt;br&gt;
  Prompting tips that prevent common mistakes&lt;br&gt;
&lt;/h2&gt;

&lt;p&gt;A few specific details in your prompts make the difference between code that works on the first run and code that needs debugging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specify the audio format.&lt;/strong&gt; Include "24 kHz PCM, 16-bit signed, little-endian, mono, base64-encoded" in your prompt. Without this, AI coding assistants often generate code that captures audio at 44.1 or 48 kHz—which sounds chipmunky or slowed-down when the Voice Agent API interprets it at 24 kHz.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Name the auth pattern.&lt;/strong&gt; "Temporary tokens minted server-side, passed as a query param to the WebSocket" prevents the AI from putting your API key in browser code. For server-side apps (Python, Node.js), specify "API key in Authorization header" instead—no token needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mention barge-in explicitly.&lt;/strong&gt; If you don't ask for interruption handling, most AI coding assistants skip it. Without barge-in handling, queued audio keeps playing after the user has already interrupted and moved on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ask for AudioWorklet by name (browser apps).&lt;/strong&gt; The alternative is ScriptProcessorNode, which is deprecated and runs on the main thread. Specifying AudioWorklet gets you the modern, glitch-free approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;List the event types you need handled.&lt;/strong&gt; Including session.ready, transcript.user.delta, transcript.user, reply.audio, transcript.agent, reply.done, and session.error in your prompt means the AI covers the full event surface instead of guessing which events matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tell the AI about echo cancellation.&lt;/strong&gt; For browser apps, include "enable echoCancellation, noiseSuppression, and autoGainControl on getUserMedia." Without this, the mic picks up the agent's TTS through the speakers and the agent interrupts itself constantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Twilio bridges, set encoding on both input and output.&lt;/strong&gt; A common mistake is setting only &lt;code&gt;input.format.encoding&lt;/code&gt; to &lt;code&gt;audio/pcmu&lt;/code&gt; and leaving output at the default. The agent then replies in 24 kHz PCM, Twilio can't play it without transcoding, and you lose the zero-transcoding benefit of the integration. Set both to &lt;code&gt;audio/pcmu&lt;/code&gt;.&lt;/p&gt;
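
&lt;p&gt;Concretely, the relevant slice of the session.update for a Twilio bridge (a sketch built from the field paths named above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"session": {
    "input":  {"format": {"encoding": "audio/pcmu"}},   # what Twilio sends (8 kHz mu-law)
    "output": {"format": {"encoding": "audio/pcmu"}},   # what Twilio plays back
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;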

&lt;h2&gt;
  
  
  Troubleshooting common issues
&lt;/h2&gt;

&lt;p&gt;When your vibe-coded agent doesn't work on the first run, here are the most common causes and what to tell your AI coding assistant to fix them.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;th&gt;Follow-up prompt&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Audio sounds chipmunky or slowed-down&lt;/td&gt;
&lt;td&gt;Sample-rate mismatch—AudioContext not set to 24 kHz&lt;/td&gt;
&lt;td&gt;"Create the AudioContext with { sampleRate: 24000 } so it matches the Voice Agent API's expected format"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent keeps interrupting itself&lt;/td&gt;
&lt;td&gt;No echo cancellation—mic picks up speaker output&lt;/td&gt;
&lt;td&gt;"Add echoCancellation: true to the getUserMedia audio constraints"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UNAUTHORIZED close on connect&lt;/td&gt;
&lt;td&gt;Token expired, missing, or already used&lt;/td&gt;
&lt;td&gt;"Fetch a fresh token immediately before each WebSocket connection. Tokens are single-use."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WebSocket closes with code 1006&lt;/td&gt;
&lt;td&gt;Pre-handshake failure, usually a stale token&lt;/td&gt;
&lt;td&gt;"Re-fetch the token if the WebSocket closes unexpectedly before session.ready"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mic blocked error&lt;/td&gt;
&lt;td&gt;Page not on a secure origin&lt;/td&gt;
&lt;td&gt;"The page must be served over HTTPS or localhost. Use npx serve or Express on localhost:3000"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Queued audio plays after user interrupts&lt;/td&gt;
&lt;td&gt;Missing barge-in handling&lt;/td&gt;
&lt;td&gt;"On reply.done with status: 'interrupted', stop all scheduled AudioBufferSourceNodes and reset playbackTime to audioCtx.currentTime"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Twilio call has no audio on the agent side&lt;/td&gt;
&lt;td&gt;Output encoding left at default 24 kHz PCM—Twilio can't play it without transcoding&lt;/td&gt;
&lt;td&gt;"Set output.format.encoding to audio/pcmu in session.update so Twilio can play the reply without transcoding"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The full troubleshooting guide is in the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/troubleshooting" rel="noopener noreferrer"&gt;Voice Agent API docs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is AssemblyAI's Voice Agent API?
&lt;/h3&gt;

&lt;p&gt;AssemblyAI's Voice Agent API is a single WebSocket endpoint that handles the entire voice agent pipeline server-side—speech-to-text with Universal-3 Pro Streaming, LLM reasoning, and text-to-speech—so you can build a conversational voice agent without wiring up separate STT, LLM, and TTS providers. It includes neural turn detection, barge-in, tool calling, session resumption, and 30+ voices out of the box. The endpoint is wss://agents.assemblyai.com/v1/ws, and it costs $4.50/hr all-in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I vibe code a voice agent with Claude Code, ChatGPT, or Cursor?
&lt;/h3&gt;

&lt;p&gt;Yes—and AI coding assistants do particularly well with the Voice Agent API because the integration surface is small. One WebSocket, a handful of JSON event types, and a simple auth pattern. The prompts in this guide work with Claude Code, ChatGPT, Cursor, and any capable coding assistant. For best results, set up the MCP server or project instructions so your agent pulls current docs instead of relying on training data.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I keep my AI coding agent up to date with AssemblyAI's API?
&lt;/h3&gt;

&lt;p&gt;Three methods, ranked by effectiveness: (1) Add &lt;code&gt;Always fetch https://www.assemblyai.com/docs/llms.txt before writing AssemblyAI code&lt;/code&gt; to your project instructions file. (2) Connect the MCP server at &lt;a href="https://mcp.assemblyai.com/docs" rel="noopener noreferrer"&gt;https://mcp.assemblyai.com/docs&lt;/a&gt; for on-demand doc access. (3) Install the &lt;a href="https://github.com/AssemblyAI/assemblyai-skill" rel="noopener noreferrer"&gt;AssemblyAI skill&lt;/a&gt; with &lt;code&gt;npx skills add AssemblyAI/assemblyai-skill&lt;/code&gt; for deep SDK context. Layer all three for the best results. See AssemblyAI's &lt;a href="https://www.assemblyai.com/docs/coding-agent-prompts" rel="noopener noreferrer"&gt;coding agent prompts&lt;/a&gt; page for full setup instructions.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the most common mistake when vibe coding a voice agent?
&lt;/h3&gt;

&lt;p&gt;Sample-rate mismatch. The Voice Agent API expects 24 kHz PCM audio. If your prompt doesn't specify this, AI coding assistants often generate code that captures audio at 44.1 or 48 kHz—the browser's default—which produces garbled audio. Include "24 kHz PCM, 16-bit signed, little-endian, mono, base64-encoded" in your prompt to avoid this entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need to understand WebSockets to vibe code a voice agent?
&lt;/h3&gt;

&lt;p&gt;Not really. You need to understand the concept—a persistent two-way connection where you send audio in and get audio back—but the AI coding assistant handles the implementation details. The important thing is to include the right details in your prompt: the WebSocket URL (wss://agents.assemblyai.com/v1/ws), the auth pattern (temporary tokens for browser apps, API key header for server apps), and the event types you need handled.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much does the Voice Agent API cost?
&lt;/h3&gt;

&lt;p&gt;$4.50/hr all-in—speech recognition, LLM, and voice synthesis included. No per-token math across three separate invoices. AssemblyAI offers a free tier so you can build and test without a credit card. See the &lt;a href="https://www.assemblyai.com/pricing" rel="noopener noreferrer"&gt;pricing page&lt;/a&gt; for current details.&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>ai</category>
      <category>webdev</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
