<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Smallest AI</title>
    <description>The latest articles on Forem by Smallest AI (@smallestai-community).</description>
    <link>https://forem.com/smallestai-community</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3855198%2Fadd9875a-d296-4e90-aa87-d43c66a27157.png</url>
      <title>Forem: Smallest AI</title>
      <link>https://forem.com/smallestai-community</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/smallestai-community"/>
    <language>en</language>
    <item>
      <title>What Speech Recognition APIs Get Wrong About Human Speech</title>
      <dc:creator>Smallest AI</dc:creator>
      <pubDate>Fri, 03 Apr 2026 10:39:29 +0000</pubDate>
      <link>https://forem.com/smallestai-community/what-speech-recognition-apis-get-wrong-about-human-speech-42n4</link>
      <guid>https://forem.com/smallestai-community/what-speech-recognition-apis-get-wrong-about-human-speech-42n4</guid>
      <description>&lt;p&gt;We've spent decades teaching computers to read. It took considerably longer to teach them to listen, and if you have the wrong accent or work in a noisy room, the honest answer is that we haven't fully managed it yet. &lt;strong&gt;AI speech recognition&lt;/strong&gt; is one of the most impressive technologies of the last decade, and one of the most inconsistently experienced.&lt;/p&gt;

&lt;p&gt;That gap between what your voice says and what the machine hears is the subject of this piece. Not because the technology isn't impressive (it genuinely is), but because the conditions under which it impresses are far narrower than the marketing suggests. Background noise, regional accents, technical jargon, multiple languages switching mid-sentence: each one chips away at the headline accuracy numbers until what's left barely resembles the promise.&lt;/p&gt;

&lt;p&gt;Understanding why this happens and what engineers are doing about it is worth the effort. Especially now, when voice commands are moving from novelty to infrastructure across &lt;strong&gt;healthcare, automotive, customer service, and industrial safety&lt;/strong&gt;. When these systems fail, they don't fail quietly.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How a Machine Learns to Listen&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before we can understand why &lt;strong&gt;automatic speech recognition&lt;/strong&gt; fails, it helps to understand what it's actually doing, because it's stranger and more impressive than most people realise.&lt;/p&gt;

&lt;p&gt;The process is not translation in the simple sense. It's closer to a high-frequency interpretation problem. A raw audio signal arrives as an analog sound wave. The system samples it digitally, then breaks it into tiny windows and converts each window into a visual representation called a &lt;strong&gt;log-Mel spectrogram&lt;/strong&gt;. This spectrogram maps the intensity of frequencies over time, mimicking the way the human inner ear processes sound. The machine isn't listening to your words. It's looking at pictures of your voice.&lt;/p&gt;
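
&lt;p&gt;The windowing step described above can be sketched in a few lines. The sketch below is plain Python with illustrative parameters (25ms windows, 10ms hop, a Hann taper); it shows how a raw sample stream becomes the overlapping frames a spectrogram is built from, before any frequency analysis happens.&lt;/p&gt;

```python
import math

def frame_signal(samples, sample_rate=16000, win_ms=25, hop_ms=10):
    """Split raw audio samples into short overlapping analysis windows,
    the first step toward a spectrogram."""
    win = int(sample_rate * win_ms / 1000)   # samples per window
    hop = int(sample_rate * hop_ms / 1000)   # samples per hop
    frames = []
    for start in range(0, len(samples) - win + 1, hop):
        chunk = samples[start:start + win]
        # A Hann window tapers the frame edges so each frame's spectrum is clean
        frames.append([s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (win - 1)))
                       for i, s in enumerate(chunk)])
    return frames

# One second of audio at 16 kHz -> 25 ms windows every 10 ms
frames = frame_signal([0.0] * 16000)
```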

&lt;p&gt;In modern architectures like &lt;a href="https://docs.smallest.ai/waves/v-4-0-0/documentation/getting-started/models#speech-to-text-stt-models" rel="noopener noreferrer"&gt;Smallest.ai's Pulse STT&lt;/a&gt;, the system scans these pictures for patterns (consonants, vowels, the edges between them) before anything resembling a word takes shape.&lt;/p&gt;

&lt;p&gt;What comes next is the part that changed everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Encoder-Decoder Transformer&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The heart of a modern &lt;strong&gt;ASR&lt;/strong&gt; system is an &lt;strong&gt;encoder-decoder transformer&lt;/strong&gt;, and understanding it explains both the power and the fragility of what these systems do.&lt;/p&gt;

&lt;p&gt;The encoder takes the sequence of audio features and transforms them into a &lt;em&gt;context vector,&lt;/em&gt; a rich mathematical blueprint of the entire audio window. The critical mechanism here is &lt;em&gt;self-attention&lt;/em&gt;, which lets the model look at the entire 30-second audio window simultaneously rather than processing it word by word. This global perspective matters: if a speaker says "bank" early in a sentence, the model uses context from the end of the sentence to determine whether the reference is financial or geographical.&lt;/p&gt;

&lt;p&gt;The decoder then writes the transcript one token at a time, using &lt;em&gt;cross-attention&lt;/em&gt; to refer back to specific parts of the audio blueprint as it goes. Each predicted word corresponds to an exact moment in the original sound.&lt;/p&gt;
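
&lt;p&gt;The attention mechanism itself is compact enough to sketch. The toy implementation below (pure Python, tiny vectors, no batching and no learned projection matrices) shows the core of scaled dot-product attention: every query position weighs every key position at once, which is exactly the global view described above.&lt;/p&gt;

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: each query mixes information from
    every key/value position simultaneously."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dimension)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # how much each position matters (sums to 1)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Two toy audio-frame features attending over themselves (self-attention)
x = [[1.0, 0.0], [0.0, 1.0]]
y = attention(x, x, x)
```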

&lt;p&gt;What made this architecture a step-change was what it replaced. Earlier systems needed separate &lt;strong&gt;acoustic modeling&lt;/strong&gt;, &lt;strong&gt;lexicon&lt;/strong&gt;, and &lt;strong&gt;language modeling&lt;/strong&gt; components, each trained and maintained independently and each introducing its own failure modes. The &lt;strong&gt;encoder-decoder&lt;/strong&gt; approach collapses all of this into a single end-to-end system, reducing development complexity and dramatically improving performance on well-represented speech. The tradeoff is that failures are also more holistic: when the model doesn't know how to handle something, there's no fallback.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Accent Problem Is a Data Problem&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's the uncomfortable truth about &lt;strong&gt;speech-to-text&lt;/strong&gt; accuracy statistics: they're almost always measured on audio that sounds like the training data.&lt;/p&gt;

&lt;p&gt;Accents and dialects are not minor stylistic variations. They're complex shifts in phonetics, intonation, rhythm, and timing. A speaker from West Africa may use fundamentally different vowel lengths than a speaker from Appalachia, even while saying identical words in the same language. The model's job, what researchers call &lt;em&gt;phonetic fuzzy matching,&lt;/em&gt; is to recognise that "savins" and "savings" are likely the same word despite a regional clip. When models aren't trained on sufficient diversity, they don't develop this tolerance.&lt;/p&gt;

&lt;p&gt;The numbers tell the story clearly. A well-resourced English model might achieve a Word Error Rate (WER) of 3–5% in ideal conditions. Put that same model in a real-world environment with a non-standard accent, and WER can climb past 25%. For low-resource languages like Hindi or Mizo, real-world error rates of 30–50% are not uncommon.&lt;/p&gt;
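
&lt;p&gt;WER itself is simple to compute: it's word-level edit distance divided by the length of the reference transcript. A minimal sketch, using the "savins"/"savings" example from above:&lt;/p&gt;

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with standard edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution in five reference words -> 20% WER
wer = word_error_rate("move my savings to checking", "move my savins to checking")
```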

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgopkfgv7ailnzjz02ss1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgopkfgv7ailnzjz02ss1.png" alt=" " width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Modern &lt;strong&gt;neural networks&lt;/strong&gt; attempt to close this gap through continuous learning, feeding more diverse speech data into the system over time to expand its phonetic tolerance. Deep Neural Networks (DNNs) analyse audio signals for subtle variations in pitch and tone, learning to generalise across regional variation. The challenge is that this requires data, and collecting diverse, labelled speech data is expensive and slow. The communities most underserved by these systems are typically the communities least represented in training datasets. It's a self-reinforcing gap.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Code-Switching and the Multilingual Problem&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The accent problem compounds significantly in &lt;strong&gt;multilingual recognition&lt;/strong&gt; environments. Code-switching, where a speaker moves between languages in the same sentence, as hundreds of millions of people do naturally every day, breaks most conventional ASR pipelines entirely. The model expects one language at a time; it gets two, mixed without warning.&lt;/p&gt;

&lt;p&gt;Modern systems like Smallest.ai's Pulse STT address this through auto-language detection and adaptive modeling, switching linguistic contexts mid-stream as evidence accumulates. The more advanced frontier is &lt;strong&gt;zero-shot performance&lt;/strong&gt;, a model that can recognise or translate a language it has never explicitly trained on.&lt;/p&gt;

&lt;p&gt;This is achieved by learning language-agnostic speech representations of the fundamental acoustic properties that all human speech shares regardless of language. By mapping these properties to a shared latent space, a model can extend support to new languages with minimal labelled data. &lt;strong&gt;Large Language Models (LLMs)&lt;/strong&gt; increasingly act as the reasoning engine for this acoustic output, applying contextual understanding to bridge gaps where phonetic training is sparse.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What This Looks Like in Practice: The Multilingual Translator&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Smallest.ai's &lt;a href="https://showcase.smallest.ai/projects/multilingual-translator" rel="noopener noreferrer"&gt;Multilingual Translator&lt;/a&gt; is a working demonstration of these principles. The system provides real-time translation and voice output across multiple languages, a meaningful feature for educators and travellers in low-connectivity environments.&lt;/p&gt;

&lt;p&gt;It's a useful case study because it makes the engineering tradeoffs visible. Supporting many languages isn't just a matter of adding more training data; it requires architectural decisions about how the model represents language, how it handles uncertainty, and how latency is managed when the system needs to &lt;strong&gt;detect, transcribe, and translate&lt;/strong&gt; in near real-time. Privacy is handled by keeping inference local: no audio leaves the device.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Background Noise Is Not a Special Case. It's the Default.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If the accent problem is about &lt;em&gt;variety&lt;/em&gt;, the noise problem is about &lt;em&gt;interference&lt;/em&gt;. And interference is not the exception in real-world audio; it's the condition.&lt;/p&gt;

&lt;p&gt;Traffic, machinery, HVAC systems, overlapping speakers, music bleeding from nearby rooms: these sounds contaminate almost every audio environment where &lt;strong&gt;voice-activated&lt;/strong&gt; systems are actually deployed. Noise breaks &lt;strong&gt;speech-to-text&lt;/strong&gt; by interfering with the acoustic cues a model depends on: formants, pitch contours, the micro-pauses that signal word boundaries. At a Signal-to-Noise Ratio (SNR) below 10 dB, most conventionally-trained models begin to fail badly.&lt;/p&gt;
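
&lt;p&gt;SNR is worth being able to compute, because it's the unit these failure thresholds are quoted in. A minimal sketch using the standard mean-power definition (the amplitudes below are illustrative):&lt;/p&gt;

```python
import math

def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise).
    Below roughly 10 dB, conventionally trained ASR models degrade sharply."""
    p_signal = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10 * math.log10(p_signal / p_noise)

# Speech at amplitude 1.0 against noise at amplitude 0.1 -> 20 dB, comfortable
quiet = snr_db([1.0, -1.0] * 100, [0.1, -0.1] * 100)
# Noise at amplitude 0.5 -> about 6 dB, inside the danger zone
loud = snr_db([1.0, -1.0] * 100, [0.5, -0.5] * 100)
```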

&lt;p&gt;The instinct is to clean the audio before transcribing it. Spectral subtraction, Wiener filtering, noise gates: decades of preprocessing research. The problem is what engineers have started calling the &lt;em&gt;noise reduction paradox&lt;/em&gt;: every filter designed to remove background hum also risks erasing the subtle speech harmonics the recogniser needs to identify a word. Spectral subtraction can improve SNR by 8 dB and simultaneously drive WER up by 15% through the distortion it introduces. You solve one problem and create another.&lt;/p&gt;

&lt;p&gt;Current best practice has shifted toward &lt;strong&gt;noise-trained models&lt;/strong&gt;: systems trained on datasets that &lt;em&gt;deliberately&lt;/em&gt; include chaotic acoustic conditions, rather than clean recordings. Instead of preprocessing the audio into something more tractable, the model learns to find stable acoustic features that persist even under heavy noise. The architecture learns noise tolerance rather than having it bolted on afterward.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Noise Robustness Method&lt;/th&gt;
&lt;th&gt;Advantage&lt;/th&gt;
&lt;th&gt;Disadvantage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Preprocessing (Denoising)&lt;/td&gt;
&lt;td&gt;Works with legacy ASR backends&lt;/td&gt;
&lt;td&gt;Can erase speech harmonics; adds latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Noise-Trained Models&lt;/td&gt;
&lt;td&gt;Handles chaotic audio without cascade errors&lt;/td&gt;
&lt;td&gt;High training cost and data requirements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VAD Buffering&lt;/td&gt;
&lt;td&gt;Trims 30–40% of compute costs&lt;/td&gt;
&lt;td&gt;Introduces 20–50ms of additional latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-Channel Processing&lt;/td&gt;
&lt;td&gt;Uses microphone arrays to isolate voice&lt;/td&gt;
&lt;td&gt;Requires specialised hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Voice Activity Detection (VAD) plays a critical supporting role here, identifying which segments of audio contain speech and which don't, reducing the computational load on the transcription model. But VAD introduces its own failure mode: if the frame window is too short, a low-energy consonant can be misclassified as silence, creating a deletion error in the final transcript that looks like a simple mishear but originates in preprocessing.&lt;/p&gt;
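
&lt;p&gt;That failure mode is easy to reproduce with a naive energy-based VAD. In the sketch below (toy frames, illustrative threshold), a quiet fricative falls below the energy threshold and is classified as silence, exactly the preprocessing deletion described above:&lt;/p&gt;

```python
def energy_vad(frames, threshold=0.01):
    """Classify each frame as speech (True) or silence (False) by mean
    energy. Low-energy consonants can fall below the threshold and be
    dropped before the recogniser ever sees them."""
    return [sum(s * s for s in f) / len(f) >= threshold for f in frames]

vowel     = [0.5, -0.5, 0.5, -0.5]      # loud, clearly voiced
fricative = [0.05, -0.05, 0.05, -0.05]  # quiet consonant like /f/ or /s/
silence   = [0.0, 0.0, 0.0, 0.0]

# The fricative is misclassified as silence: a deletion error in the making
decisions = energy_vad([vowel, fricative, silence])
```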

&lt;p&gt;Sector requirements drawn from real deployments underscore how high the stakes are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Sector&lt;/th&gt;
&lt;th&gt;Primary Use Case&lt;/th&gt;
&lt;th&gt;Critical Requirement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Healthcare&lt;/td&gt;
&lt;td&gt;Real-time patient monitoring and documentation&lt;/td&gt;
&lt;td&gt;High transcription accuracy for medical terms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automotive&lt;/td&gt;
&lt;td&gt;Voice-activated navigation and multimedia&lt;/td&gt;
&lt;td&gt;Robustness to background noise and engine hum&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Customer Service&lt;/td&gt;
&lt;td&gt;Virtual assistants and automated triage&lt;/td&gt;
&lt;td&gt;Low latency and accurate intent detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Industrial Safety&lt;/td&gt;
&lt;td&gt;Hands-free data collection and reporting&lt;/td&gt;
&lt;td&gt;Resilience to 90+ dBA acoustic environments&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Latency Problem Nobody Talks About Enough&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Accuracy is the metric people quote. Latency is the metric that determines whether anyone uses the product.&lt;/p&gt;

&lt;p&gt;A conversation feels natural only when response time stays under 300ms. For a developer building a voice agent, the pipeline is to capture audio, transcribe it, pass it to an &lt;strong&gt;NLU&lt;/strong&gt; layer, run it through an LLM, generate a response, synthesise speech, and stream audio back. Every step costs time. The cumulative budget is brutal.&lt;/p&gt;
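
&lt;p&gt;The arithmetic is worth doing explicitly. The stage names and numbers below are illustrative, not measured values, but they show how quickly independent stages eat a sub-800ms total-latency goal:&lt;/p&gt;

```python
# Rough per-stage budget for one voice-agent turn, in milliseconds.
# These figures are hypothetical, for illustration only.
budget_ms = {
    "audio capture + network": 50,
    "speech-to-text":          100,
    "NLU / intent":            50,
    "LLM response":            400,
    "text-to-speech":          150,
    "audio streaming back":    100,
}

total_ms = sum(budget_ms.values())
over_budget = total_ms > 800  # exceeds the sub-800ms conversational goal
```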

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4p1z9z74r810wznpkve.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4p1z9z74r810wznpkve.png" alt=" " width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Modern systems prioritise &lt;em&gt;Time to First Transcript&lt;/em&gt; (TTFT): the delay between a speaker stopping and the first words appearing as text. Pulse STT achieves a TTFT of 64ms, which creates the perceptual &lt;em&gt;illusion&lt;/em&gt; of real-time interaction by returning partial transcripts while the speaker is still talking. These partials update continuously until the model commits to a final transcript at a natural pause, a process called &lt;em&gt;endpointing&lt;/em&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Performance Dimension&lt;/th&gt;
&lt;th&gt;Goal for Natural Conversation&lt;/th&gt;
&lt;th&gt;Typical Cloud API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TTFT&lt;/td&gt;
&lt;td&gt;&amp;lt; 100ms&lt;/td&gt;
&lt;td&gt;200ms – 500ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total Response Latency&lt;/td&gt;
&lt;td&gt;&amp;lt; 800ms&lt;/td&gt;
&lt;td&gt;1500ms – 3000ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transcription Accuracy&lt;/td&gt;
&lt;td&gt;&amp;gt; 95%&lt;/td&gt;
&lt;td&gt;80% – 90% (in noise)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Endpointing Delay&lt;/td&gt;
&lt;td&gt;&amp;lt; 300ms&lt;/td&gt;
&lt;td&gt;500ms – 1000ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Streaming via WebSockets&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The architectural mechanism that makes low-latency &lt;strong&gt;real-time transcription&lt;/strong&gt; possible is the WebSocket connection. Unlike REST APIs, which require a new handshake for every audio packet, WebSockets maintain a persistent, bidirectional link between client and server. The server pushes transcript fragments back as soon as they're processed, rather than waiting for the full audio to arrive.&lt;/p&gt;

&lt;p&gt;A typical streaming architecture flows like this: establish an authenticated WSS connection, stream 40ms audio packets (roughly 640 bytes at 8kHz sampling) at a continuous 1:1 real-time rate, then receive a stream of JSON objects containing partial results, final results, and word-level timestamps. The client gets a live view into what the model is thinking, not just a final answer. For a technical deep dive, refer to the &lt;a href="https://docs.smallest.ai/waves/v-4-0-0/documentation/speech-to-text/realtime-web-socket/quickstart" rel="noopener noreferrer"&gt;realtime audio transcription&lt;/a&gt; guide.&lt;/p&gt;
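
&lt;p&gt;The packet arithmetic above is easy to verify. A minimal sketch of the client-side chunking (plain Python; the actual WSS handshake, authentication, and JSON handling are omitted, so this is the framing logic only):&lt;/p&gt;

```python
def packet_size_bytes(ms, sample_rate=8000, bytes_per_sample=2):
    """Size of one audio packet: duration x sampling rate x sample width."""
    return int(sample_rate * ms / 1000) * bytes_per_sample

def chunk_audio(pcm_bytes, ms=40, sample_rate=8000, bytes_per_sample=2):
    """Yield fixed-duration packets ready to stream over a WebSocket."""
    size = packet_size_bytes(ms, sample_rate, bytes_per_sample)
    for start in range(0, len(pcm_bytes), size):
        yield pcm_bytes[start:start + size]

# 40 ms of 16-bit mono audio at 8 kHz is 8000 * 0.040 * 2 = 640 bytes,
# matching the packet size described above.
one_second = bytes(8000 * 2)          # 1 s of silence as raw 16-bit PCM
packets = list(chunk_audio(one_second))
```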

&lt;h2&gt;
  
  
  &lt;strong&gt;Beyond Transcription: What Speech Intelligence Actually Means&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Transcription is the starting point, not the destination. The more interesting question is what you can &lt;em&gt;infer&lt;/em&gt; from speech that doesn't survive the conversion to text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speaker diarization:&lt;/strong&gt; Answering "who spoke when?" is one of the most practically valuable capabilities. It's an unsupervised clustering problem: the system segments the audio, converts each segment into a high-dimensional numerical embedding of the speaker's unique vocal characteristics, estimates how many distinct speakers are present, then assigns labels (Speaker 1, Speaker 2, etc.). The output transforms a raw transcript into a structured conversation.&lt;/p&gt;
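
&lt;p&gt;The clustering shape of the problem can be sketched with toy embeddings. The greedy cosine-similarity version below is far simpler than production diarization, but the structure (embed each segment, compare, assign to an existing speaker or create a new one) is the same:&lt;/p&gt;

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def diarize(embeddings, threshold=0.8):
    """Greedy clustering over speaker embeddings: a new segment joins the
    first speaker whose prototype embedding it resembles, otherwise it
    starts a new speaker. Real systems are more sophisticated; the shape
    of the problem is the same."""
    prototypes, labels = [], []
    for emb in embeddings:
        for idx, proto in enumerate(prototypes):
            if cosine(emb, proto) >= threshold:
                labels.append(idx)
                break
        else:
            prototypes.append(emb)       # unseen voice -> new speaker label
            labels.append(len(prototypes) - 1)
    return labels

# Two distinct voices alternating across four segments
segments = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
labels = diarize(segments)
```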

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatekwx6ubipffapubmdc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatekwx6ubipffapubmdc.png" alt=" " width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Word-level confidence scores&lt;/strong&gt;: Each word in a transcript carries a probability score, typically 0.0 to 1.0, representing how certain the model is about that prediction. A score of 0.95 is reliable; 0.60 is a flag. By setting a confidence threshold, an application can automatically route uncertain words to human review, ask the user for clarification, or simply annotate the output with uncertainty markers. In healthcare or legal contexts, where a single misheard word has real consequences, this metadata is not optional.&lt;/p&gt;

&lt;p&gt;More advanced uncertainty estimation uses entropy-based measures that provide more calibrated estimates of correctness than raw probability scores alone.&lt;/p&gt;
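
&lt;p&gt;Acting on confidence scores is straightforward once they're exposed. A minimal routing sketch, with a hypothetical transcript and an illustrative 0.90 threshold:&lt;/p&gt;

```python
def route_by_confidence(words, threshold=0.90):
    """Split a transcript into words the application can trust and words
    that should be flagged for human review or user clarification."""
    trusted = [w for w, conf in words if conf >= threshold]
    flagged = [w for w, conf in words if conf < threshold]
    return trusted, flagged

# Hypothetical (word, confidence) pairs from a medical dictation
transcript = [("patient", 0.98), ("denies", 0.97), ("chest", 0.95),
              ("pain", 0.93), ("since", 0.96), ("Tuesday", 0.61)]
trusted, flagged = route_by_confidence(transcript)
```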

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metadata Feature&lt;/th&gt;
&lt;th&gt;Data Content&lt;/th&gt;
&lt;th&gt;Key Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Speaker ID&lt;/td&gt;
&lt;td&gt;Integer / label for unique voices&lt;/td&gt;
&lt;td&gt;Meeting minutes, interview archives&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Emotion Tag&lt;/td&gt;
&lt;td&gt;Sentiment (happy, angry, neutral, etc.)&lt;/td&gt;
&lt;td&gt;Call centre coaching, sentiment analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PII Detection&lt;/td&gt;
&lt;td&gt;Flagged sensitive data&lt;/td&gt;
&lt;td&gt;HIPAA, PCI, GDPR compliance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Confidence Score&lt;/td&gt;
&lt;td&gt;Probability (0.0 – 1.0)&lt;/td&gt;
&lt;td&gt;Quality assurance and error correction&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Happens When You Chain These Systems Together&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;One of the more revealing experiments in &lt;strong&gt;ASR&lt;/strong&gt; research isn't a benchmark; it's a failure mode. Smallest.ai's &lt;a href="https://showcase.smallest.ai/projects/voice-chinese-whispers" rel="noopener noreferrer"&gt;Voice Chinese Whispers&lt;/a&gt; demonstrates what happens when you chain transcription, translation, and speech synthesis in repeated loops.&lt;/p&gt;

&lt;p&gt;In a single pass, a misheard word shifts meaning slightly. By the fifth iteration, the system is producing phrases that have no relationship to the original utterance. The model hasn't hallucinated in the classic LLM sense; it's been faithfully following the degraded output of the previous step. Each stage introduces a small amount of &lt;em&gt;acoustic drift&lt;/em&gt; or &lt;em&gt;contextual drift&lt;/em&gt;, and the errors compound geometrically.&lt;/p&gt;
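
&lt;p&gt;The compounding is easy to model. If each stage independently preserves meaning with some probability (0.96 here, an illustrative figure, not a measured one), fidelity decays exponentially with the number of chained stages:&lt;/p&gt;

```python
# Each loop runs three stages: transcribe -> translate -> synthesise.
# If a stage preserves meaning with probability 0.96, fidelity after
# n full loops is 0.96^(3n).
per_stage = 0.96
stages_per_loop = 3

def fidelity_after(loops):
    return per_stage ** (stages_per_loop * loops)

one_pass = fidelity_after(1)    # ~0.885: a slightly shifted sentence
five_pass = fidelity_after(5)   # ~0.542: barely half the meaning survives
```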

&lt;p&gt;It's a useful demonstration because it makes visible something that's easy to miss in production systems: the output of an ASR model is not a stable foundation. It's a probabilistic estimate, and downstream systems that treat it as ground truth will inherit and amplify its errors. &lt;em&gt;Transcript stability&lt;/em&gt; (ensuring that once a word is committed it stays committed, and that confidence scores accurately reflect uncertainty) is an engineering discipline, not a given.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;From Transcription to Action: The Real Ambition&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The most significant shift happening in &lt;strong&gt;speech intelligence&lt;/strong&gt; right now isn't about accuracy or latency. It's about what the transcript &lt;em&gt;does&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In the &lt;em&gt;speech-to-action&lt;/em&gt; paradigm, the ASR transcript is fed directly into an LLM that can call external tools, query databases, trigger workflows, and manage complex dialogue. The voice interface becomes a reasoning interface. The gap between "I said a thing" and "something happened" collapses.&lt;/p&gt;

&lt;p&gt;This requires a level of integration between the speech layer and the reasoning layer that earlier architectures couldn't support. The emerging answer is &lt;em&gt;full-duplex multimodal models&lt;/em&gt;, where a single model handles voice input, reasoning, and voice output in one pipeline, rather than piping data between separate ASR, LLM, and TTS services. &lt;a href="https://smallest.ai/speech-to-speech" rel="noopener noreferrer"&gt;Smallest.ai's Hydra&lt;/a&gt; takes this approach, handling intent detection and voice synthesis together to eliminate the inter-service latency that makes stitched-together pipelines feel unnatural.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Real-Time Voice AI Looks Like in Practice
&lt;/h3&gt;

&lt;p&gt;Smallest.ai's &lt;a href="https://showcase.smallest.ai/projects/debate-arena" rel="noopener noreferrer"&gt;Debate Arena&lt;/a&gt; is a working demonstration of how far orchestration has come. The system stages a philosophical debate between AI agents, Socrates arguing &lt;em&gt;for&lt;/em&gt; and Aristotle arguing &lt;em&gt;against&lt;/em&gt; any topic the user proposes with distinct voices, expressive vocal parameters (emotion, pitch, volume, prosody) predicted by the LLM each round, and an ancient Athenian judge scoring the exchange.&lt;/p&gt;

&lt;p&gt;For a system like this to work through voice, the ASR layer needs to maintain multi-speaker tracking, support adversarial turn-taking without the agents talking over each other, and do all of this at low enough latency that the conversation feels alive. The Debate Arena uses Lightning TTS v3.2 WebSocket streaming, with voice parameters generated dynamically per round by GPT-4o-mini. It supports two modes, Philosophical and Roast Battle, with escalating arguments and audience voting.&lt;/p&gt;

&lt;p&gt;It's a playful project, but it demonstrates something serious: the engineering required to make multi-agent, multi-voice, real-time voice interaction work is now tractable. The primitives exist. The question is how to compose them well.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Where This Is Going&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The next decade of &lt;strong&gt;AI speech recognition&lt;/strong&gt; is likely to diverge along two paths that are pulling in opposite directions.&lt;/p&gt;

&lt;p&gt;The first is scale: massive cloud models trained on ever-larger and more diverse datasets, capable of handling more languages, more accents, more acoustic conditions. The second is compression: hyper-efficient on-device models that run locally on a phone or an industrial edge device without sending audio to the cloud. Privacy, data sovereignty, and latency concerns are all pushing toward the second path, even as raw capability improvements come from the first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adaptive and personalised speech models&lt;/strong&gt; represent a third direction that cuts across both. Rather than building a single model that tries to be equally good at everything, future systems will adapt in real time to an individual speaker's specific pitch, pace, and vocabulary. Zero-shot adaptation (learning to recognise a specific voice from a few seconds of reference audio) makes this tractable without requiring per-user retraining at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Building Things That Actually Work&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For developers, the translation from research benchmarks to production systems requires moving past Word Error Rate as the primary metric. WER tells you how accurate the model is on a test set. It doesn't tell you whether users can trust it.&lt;/p&gt;

&lt;p&gt;The metrics that matter in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tail latency (P99):&lt;/strong&gt; Does the system respond quickly under heavy load, or does it occasionally spike in ways that break the conversation?
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calibrated confidence:&lt;/strong&gt; When the model reports 90% certainty, is it actually right 90% of the time? Overconfident models are more dangerous than uncertain ones.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain-specific adaptation:&lt;/strong&gt; Does the system handle your vocabulary? Medical terms, product names, and technical jargon that don't appear in general training data can be addressed through word boosting and custom dictionaries.&lt;/li&gt;
&lt;/ul&gt;
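
&lt;p&gt;Calibration, in particular, is checkable with a few lines. The sketch below (hypothetical evaluation data) buckets predictions by stated confidence and compares each bucket against its observed accuracy; large gaps mean the scores can't be trusted as probabilities:&lt;/p&gt;

```python
def calibration_gap(predictions):
    """Compare stated confidence against observed accuracy in coarse
    buckets. A well-calibrated model shows small gaps; an overconfident
    one reports high confidence on words it gets wrong."""
    buckets = {}
    for conf, correct in predictions:
        key = round(conf, 1)                  # e.g. the 0.9 bucket
        buckets.setdefault(key, []).append(correct)
    return {k: abs(k - sum(v) / len(v)) for k, v in buckets.items()}

# Hypothetical eval data: (model confidence, was the word correct?)
preds = [(0.9, True)] * 9 + [(0.9, False)]        # 0.9 bucket, 90% right: calibrated
preds += [(0.6, True)] * 3 + [(0.6, False)] * 7   # 0.6 bucket, only 30% right
gaps = calibration_gap(preds)
```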

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Best Practice&lt;/th&gt;
&lt;th&gt;Implementation&lt;/th&gt;
&lt;th&gt;Outcome&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Handle low confidence&lt;/td&gt;
&lt;td&gt;Flag words below 0.90 for human review&lt;/td&gt;
&lt;td&gt;Reduced error rate in high-stakes documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use WebSockets&lt;/td&gt;
&lt;td&gt;Implement persistent WSS connections&lt;/td&gt;
&lt;td&gt;Sub-500ms response times for voice agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adopt noise-trained models&lt;/td&gt;
&lt;td&gt;Skip preprocessing in chaotic environments&lt;/td&gt;
&lt;td&gt;Better performance in factories and vehicles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitor RTF&lt;/td&gt;
&lt;td&gt;Track the Real-Time Factor of inference&lt;/td&gt;
&lt;td&gt;Guaranteed responsiveness under load&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Smallest.ai's ecosystem offers a set of tools built around these production constraints. Pulse STT delivers 64ms TTFT with built-in diarization across 30+ languages. &lt;a href="https://smallest.ai/blog/evaluating-lightning-asr-against-leading-streaming-speech-recognition-models" rel="noopener noreferrer"&gt;Lightning ASR&lt;/a&gt; is optimised for sub-300ms latency, with particular strength in non-English languages. &lt;a href="https://smallest.ai/" rel="noopener noreferrer"&gt;Hydra&lt;/a&gt; handles the full voice conversation pipeline (input, reasoning, and output) in a single model.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;A Note on What "Working" Really Means&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Market projections ($19.5 billion by 2030; 27% of the global population already using voice commands) tend to measure adoption, not satisfaction. A system that works for one speaker in a quiet room and fails another speaker in a noisy one is not a solved problem, even if it ships with impressive accuracy numbers.&lt;/p&gt;

&lt;p&gt;The history of &lt;strong&gt;automatic speech recognition&lt;/strong&gt; is a history of systems getting impressively good at well-resourced voices and incrementally better at everyone else. The architecture has genuinely improved: encoder-decoder transformers, end-to-end training, and noise-robust learning are meaningful advances over the rule-based systems of the 1990s. But the generalisation gap that makes a 95% accuracy number in a lab become a 75% accuracy number in the field is not a technical afterthought. It's the central problem.&lt;/p&gt;

&lt;p&gt;Building voice interfaces that are worth trusting means taking that gap seriously in the training data you choose, the confidence metadata you expose, the noise conditions you test against, and the communities whose voices you treat as primary cases rather than edge cases.&lt;/p&gt;

&lt;p&gt;The era of voice-first interfaces hasn't simply arrived. It's arriving unevenly. And the engineers who understand why have a real opportunity to build something better.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tools referenced in this piece: &lt;a href="https://smallest.ai/" rel="noopener noreferrer"&gt;Pulse STT&lt;/a&gt;, &lt;a href="https://smallest.ai/blog/evaluating-lightning-asr-against-leading-streaming-speech-recognition-models" rel="noopener noreferrer"&gt;Lightning ASR&lt;/a&gt;, &lt;a href="https://smallest.ai/" rel="noopener noreferrer"&gt;Hydra&lt;/a&gt;, &lt;a href="https://showcase.smallest.ai/projects/multilingual-translator" rel="noopener noreferrer"&gt;Multilingual Translator&lt;/a&gt;, &lt;a href="https://showcase.smallest.ai/projects/voice-chinese-whispers" rel="noopener noreferrer"&gt;Voice Chinese Whispers&lt;/a&gt;, &lt;a href="https://showcase.smallest.ai/projects/debate-arena" rel="noopener noreferrer"&gt;Debate Arena&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>voiceagents</category>
      <category>smallestai</category>
    </item>
    <item>
      <title>What Speech Recognition APIs Get Wrong About Human Speech</title>
      <dc:creator>Smallest AI</dc:creator>
      <pubDate>Fri, 03 Apr 2026 10:39:29 +0000</pubDate>
      <link>https://forem.com/smallestai-community/what-speech-recognition-apis-get-wrong-about-human-speech-4op9</link>
      <guid>https://forem.com/smallestai-community/what-speech-recognition-apis-get-wrong-about-human-speech-4op9</guid>
      <description>&lt;p&gt;We've spent decades teaching computers to read. It took considerably longer to teach them to listen and if you have the wrong accent, or work in a noisy room, the honest answer is, we haven't managed it yet. &lt;strong&gt;AI speech recognition&lt;/strong&gt; is one of the most impressive technologies of the last decade and one of the most inconsistently experienced.&lt;/p&gt;

&lt;p&gt;That gap between what your voice says and what the machine hears is the subject of this piece. Not because the technology isn't impressive (it genuinely is) but because the conditions under which it impresses are far narrower than the marketing suggests. Background noise, regional accents, technical jargon, languages switching mid-sentence: each one chips away at headline accuracy numbers until what's left barely resembles the promise.&lt;/p&gt;

&lt;p&gt;Understanding why this happens and what engineers are doing about it is worth the effort. Especially now, when voice commands are moving from novelty to infrastructure across &lt;strong&gt;healthcare, automotive, customer service, and industrial safety&lt;/strong&gt;. When these systems fail, they don't fail quietly.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How a Machine Learns to Listen&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before we can understand why &lt;strong&gt;automatic speech recognition&lt;/strong&gt; fails, it helps to understand what it's actually doing because it's stranger and more impressive than most people realise.&lt;/p&gt;

&lt;p&gt;The process is not translation in the simple sense. It's closer to a high-frequency interpretation problem. A raw audio signal arrives as an analog sound wave. The system samples it digitally, then breaks it into tiny windows and converts each window into a visual representation called a &lt;strong&gt;log-Mel spectrogram&lt;/strong&gt;. This spectrogram maps the intensity of frequencies over time, mimicking the way the human inner ear processes sound. The machine isn't listening to your words. It's looking at pictures of your voice.&lt;/p&gt;

&lt;p&gt;In modern architectures like &lt;a href="https://docs.smallest.ai/waves/v-4-0-0/documentation/getting-started/models#speech-to-text-stt-models" rel="noopener noreferrer"&gt;Smallest.ai's Pulse STT&lt;/a&gt;, the system scans these pictures for patterns (consonants, vowels, the edges between them) before anything resembling a word takes shape.&lt;/p&gt;
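&lt;p&gt;The spectrogram step can be sketched in a few lines. What follows is a deliberately naive, pure-Python illustration: real pipelines use an FFT and 80+ mel bands rather than a naive DFT and a toy filterbank of four, but every stage (window, magnitude spectrum, mel weighting, log compression) is the same.&lt;/p&gt;

```python
import math

def hz_to_mel(f):
    # Standard conversion from frequency in Hz to the perceptual mel scale.
    return 2595 * math.log10(1 + f / 700)

def log_mel_frame(frame, sample_rate, n_mels=4):
    """Naive log-Mel features for a single audio frame (illustrative, not fast)."""
    n = len(frame)
    # Hann window: suppress spectral leakage at the frame edges.
    windowed = [x * 0.5 * (1 - math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    # Magnitude spectrum via a naive DFT (an FFT in any real system).
    mags = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(windowed))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(windowed))
        mags.append(math.hypot(re, im))
    # Triangular mel filterbank: band centres equally spaced on the mel scale,
    # mimicking the inner ear's coarser resolution at high frequencies.
    mel_max = hz_to_mel(sample_rate / 2)
    width = mel_max / (n_mels + 1)
    feats = []
    for m in range(n_mels):
        centre = width * (m + 1)
        energy = sum(mag * max(0.0, 1 - abs(hz_to_mel(k * sample_rate / n) - centre) / width)
                     for k, mag in enumerate(mags))
        # Log compression mimics the ear's loudness response.
        feats.append(math.log(energy + 1e-10))
    return feats
```

&lt;p&gt;Feeding in a 256-sample frame of a 440 Hz tone at 8 kHz yields one column of the "picture" the model looks at.&lt;/p&gt;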

&lt;p&gt;What comes next is the part that changed everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Encoder-Decoder Transformer&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The heart of a modern &lt;strong&gt;ASR&lt;/strong&gt; system is an &lt;strong&gt;encoder-decoder transformer&lt;/strong&gt;, and understanding it explains both the power and the fragility of what these systems do.&lt;/p&gt;

&lt;p&gt;The encoder takes the sequence of audio features and transforms them into a &lt;em&gt;context vector&lt;/em&gt;, a rich mathematical blueprint of the entire audio window. The critical mechanism here is &lt;em&gt;self-attention&lt;/em&gt;, which lets the model look at the entire 30-second audio window simultaneously rather than processing it word by word. This global perspective matters: if a speaker says "bank" early in a sentence, the model uses context from the end of the sentence to determine whether the reference is financial or geographical.&lt;/p&gt;

&lt;p&gt;The decoder then writes the transcript one token at a time, using &lt;em&gt;cross-attention&lt;/em&gt; to refer back to specific parts of the audio blueprint as it goes. Each predicted word corresponds to an exact moment in the original sound.&lt;/p&gt;

&lt;p&gt;What made this architecture a step-change was what it replaced. Earlier systems needed separate &lt;strong&gt;acoustic modeling&lt;/strong&gt;, &lt;strong&gt;lexicon&lt;/strong&gt;, and &lt;strong&gt;language modeling&lt;/strong&gt; components, each trained and maintained independently and each introducing its own failure modes. The &lt;strong&gt;encoder-decoder&lt;/strong&gt; approach collapses all of this into a single end-to-end system, reducing development complexity and dramatically improving performance on well-represented speech. The tradeoff is that failures are also more holistic: when the model doesn't know how to handle something, there's no fallback.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Accent Problem Is a Data Problem&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's the uncomfortable truth about &lt;strong&gt;speech-to-text&lt;/strong&gt; accuracy statistics: they're almost always measured on audio that sounds like the training data.&lt;/p&gt;

&lt;p&gt;Accents and dialects are not minor stylistic variations. They're complex shifts in phonetics, intonation, rhythm, and timing. A speaker from West Africa may use fundamentally different vowel lengths than a speaker from Appalachia, even while saying identical words in the same language. The model's job, what researchers call &lt;em&gt;phonetic fuzzy matching,&lt;/em&gt; is to recognise that "savins" and "savings" are likely the same word despite a regional clip. When models aren't trained on sufficient diversity, they don't develop this tolerance.&lt;/p&gt;

&lt;p&gt;The numbers tell the story clearly. A well-resourced English model might achieve a Word Error Rate (WER) of 3–5% in ideal conditions. Put that same model in a real-world environment with a non-standard accent, and WER can climb past 25%. For low-resource languages like Hindi or Mizo, real-world error rates of 30–50% are not uncommon.&lt;/p&gt;
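&lt;p&gt;Word Error Rate itself is simple to compute: it's the word-level edit distance between a reference transcript and the model's hypothesis, divided by the reference length. A minimal implementation:&lt;/p&gt;

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with the standard edit-distance dynamic programme."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

&lt;p&gt;The "savins" example from above is one substitution in a five-word sentence, a WER of 20%, which is already worse than most headline accuracy claims allow.&lt;/p&gt;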

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgopkfgv7ailnzjz02ss1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgopkfgv7ailnzjz02ss1.png" alt=" " width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Modern &lt;strong&gt;neural networks&lt;/strong&gt; attempt to close this gap through continuous learning, feeding more diverse speech data into the system over time to expand its phonetic tolerance. Deep Neural Networks (DNNs) analyse audio signals for subtle variations in pitch and tone, learning to generalise across regional variation. The challenge is that this requires data, and collecting diverse, labelled speech data is expensive and slow. The communities most underserved by these systems are typically the communities least represented in training datasets. It's a self-reinforcing gap.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Code-Switching and the Multilingual Problem&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The accent problem compounds significantly in &lt;strong&gt;multilingual recognition&lt;/strong&gt; environments. Code-switching, where a speaker moves between languages in the same sentence, as hundreds of millions of people do naturally every day, breaks most conventional ASR pipelines entirely. The model expects one language at a time; it gets two, mixed without warning.&lt;/p&gt;

&lt;p&gt;Modern systems like Smallest.ai's Pulse STT address this through auto-language detection and adaptive modeling, switching linguistic contexts mid-stream as evidence accumulates. The more advanced frontier is &lt;strong&gt;zero-shot performance&lt;/strong&gt;, a model that can recognise or translate a language it has never explicitly trained on.&lt;/p&gt;

&lt;p&gt;This is achieved by learning language-agnostic speech representations of the fundamental acoustic properties that all human speech shares regardless of language. By mapping these properties to a shared latent space, a model can extend support to new languages with minimal labelled data. &lt;strong&gt;Large Language Models (LLMs)&lt;/strong&gt; increasingly act as the reasoning engine for this acoustic output, applying contextual understanding to bridge gaps where phonetic training is sparse.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What This Looks Like in Practice: The Multilingual Translator&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Smallest.ai's &lt;a href="https://showcase.smallest.ai/projects/multilingual-translator" rel="noopener noreferrer"&gt;Multilingual Translator&lt;/a&gt; is a working demonstration of these principles. The system provides real-time translation and voice output across multiple languages, a meaningful feature for educators and travellers in low-connectivity environments.&lt;/p&gt;

&lt;p&gt;It's a useful case study because it makes the engineering tradeoffs visible. Supporting many languages isn't just a matter of adding more training data; it requires architectural decisions about how the model represents language, how it handles uncertainty, and how latency is managed when the system needs to &lt;strong&gt;detect, transcribe, and translate&lt;/strong&gt; in near real-time. Privacy is handled by keeping inference local: no audio leaves the device.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Background Noise Is Not a Special Case. It's the Default.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If the accent problem is about &lt;em&gt;variety&lt;/em&gt;, the noise problem is about &lt;em&gt;interference&lt;/em&gt;. And interference is not the exception in real-world audio, it's the condition.&lt;/p&gt;

&lt;p&gt;Traffic, machinery, HVAC systems, overlapping speakers, music bleeding from nearby rooms: these sounds contaminate almost every audio environment where &lt;strong&gt;voice-activated&lt;/strong&gt; systems are actually deployed. Noise breaks &lt;strong&gt;speech-to-text&lt;/strong&gt; by interfering with the acoustic cues a model depends on: formants, pitch contours, the micro-pauses that signal word boundaries. At a Signal-to-Noise Ratio (SNR) below 10 dB, most conventionally-trained models begin to fail badly.&lt;/p&gt;
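&lt;p&gt;SNR is worth being able to measure in your own test harness rather than trusting vendor conditions. The standard definition, as a sketch:&lt;/p&gt;

```python
import math

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise),
    where P is the mean squared amplitude of each recording."""
    p_sig = sum(x * x for x in signal) / len(signal)
    p_noise = sum(x * x for x in noise) / len(noise)
    return 10 * math.log10(p_sig / p_noise)
```

&lt;p&gt;A signal ten times the amplitude of the noise is 20 dB SNR; below the 10 dB mark mentioned above, the noise power is within a factor of ten of the speech itself.&lt;/p&gt;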

&lt;p&gt;The instinct is to clean the audio before transcribing it: spectral subtraction, Wiener filtering, noise gates, decades of preprocessing research. The problem is what engineers have started calling the &lt;em&gt;noise reduction paradox&lt;/em&gt;: every filter designed to remove background hum also risks erasing the subtle speech harmonics the recogniser needs to identify a word. Spectral subtraction can improve SNR by 8 dB and simultaneously drive WER up by 15% through the distortion it introduces. You solve one problem and create another.&lt;/p&gt;
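&lt;p&gt;The paradox is easy to see in code. Classic spectral subtraction is a one-liner per frame, and the flooring step that prevents negative magnitudes is exactly where the distortion (the "musical noise" artefact) enters:&lt;/p&gt;

```python
def spectral_subtract(frame_mags, noise_mags, floor=0.01):
    """Classic spectral subtraction on one frame's magnitude spectrum.
    Wherever the noise estimate exceeds the observed magnitude, the result
    is clamped to a small floor; those clamped bins are no longer faithful
    to the speech and distort whatever harmonics lived there."""
    return [max(m - n, floor * m) for m, n in zip(frame_mags, noise_mags)]
```

&lt;p&gt;In the second bin below, the noise estimate overshoots the signal and the true speech energy is simply gone, which is why modern systems prefer to train through the noise instead of subtracting it.&lt;/p&gt;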

&lt;p&gt;Current best practice has shifted toward &lt;strong&gt;noise-trained models&lt;/strong&gt;: systems trained on datasets that &lt;em&gt;deliberately&lt;/em&gt; include chaotic acoustic conditions rather than clean recordings. Instead of preprocessing the audio into something more tractable, the model learns to find stable acoustic features that persist even under heavy noise. The architecture learns noise tolerance rather than having it bolted on afterward.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Noise Robustness Method&lt;/th&gt;
&lt;th&gt;Advantage&lt;/th&gt;
&lt;th&gt;Disadvantage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Preprocessing (Denoising)&lt;/td&gt;
&lt;td&gt;Works with legacy ASR backends&lt;/td&gt;
&lt;td&gt;Can erase speech harmonics; adds latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Noise-Trained Models&lt;/td&gt;
&lt;td&gt;Handles chaotic audio without cascade errors&lt;/td&gt;
&lt;td&gt;High training cost and data requirements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VAD Buffering&lt;/td&gt;
&lt;td&gt;Trims 30–40% of compute costs&lt;/td&gt;
&lt;td&gt;Introduces 20–50ms of additional latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-Channel Processing&lt;/td&gt;
&lt;td&gt;Uses microphone arrays to isolate voice&lt;/td&gt;
&lt;td&gt;Requires specialised hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Voice Activity Detection (VAD) plays a critical supporting role here, identifying which segments of audio contain speech and which don't, and reducing the computational load on the transcription model. But VAD introduces its own failure mode: if the frame window is too short, a low-energy consonant can be misclassified as silence, creating a deletion error in the final transcript that looks like a simple mishear but originates in preprocessing.&lt;/p&gt;
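&lt;p&gt;A minimal energy-based VAD makes both the mechanism and that failure mode concrete. The frame length and threshold below are illustrative values; a quiet consonant whose frame energy falls under the threshold simply disappears from the transcript's input:&lt;/p&gt;

```python
def vad_flags(samples, frame_len=320, threshold=0.01):
    """Energy-based voice activity detection: flag each fixed-length frame
    as speech when its mean energy exceeds a threshold. Frames below the
    threshold are dropped before transcription, which is where low-energy
    consonants can be lost."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        flags.append(energy > threshold)
    return flags
```

&lt;p&gt;Production VADs use trained classifiers rather than a raw energy gate, but the tradeoff between window size, threshold, and deletion errors is the same.&lt;/p&gt;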

&lt;p&gt;Requirements from real deployments underscore how high the stakes are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Sector&lt;/th&gt;
&lt;th&gt;Primary Use Case&lt;/th&gt;
&lt;th&gt;Critical Requirement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Healthcare&lt;/td&gt;
&lt;td&gt;Real-time patient monitoring and documentation&lt;/td&gt;
&lt;td&gt;High transcription accuracy for medical terms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automotive&lt;/td&gt;
&lt;td&gt;Voice-activated navigation and multimedia&lt;/td&gt;
&lt;td&gt;Robustness to background noise and engine hum&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Customer Service&lt;/td&gt;
&lt;td&gt;Virtual assistants and automated triage&lt;/td&gt;
&lt;td&gt;Low latency and accurate intent detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Industrial Safety&lt;/td&gt;
&lt;td&gt;Hands-free data collection and reporting&lt;/td&gt;
&lt;td&gt;Resilience to 90+ dBA acoustic environments&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Latency Problem Nobody Talks About Enough&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Accuracy is the metric people quote. Latency is the metric that determines whether anyone uses the product.&lt;/p&gt;

&lt;p&gt;A conversation feels natural only when response time stays under 300ms. For a developer building a voice agent, the pipeline is: capture audio, transcribe it, pass it to an &lt;strong&gt;NLU&lt;/strong&gt; layer, run it through an LLM, generate a response, synthesise speech, and stream audio back. Every step costs time. The cumulative budget is brutal.&lt;/p&gt;
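&lt;p&gt;The budget arithmetic is worth making explicit. The per-stage numbers below are illustrative assumptions, not measurements of any particular stack, but they show how quickly a stitched pipeline consumes a sub-800ms budget:&lt;/p&gt;

```python
# Illustrative per-stage latencies (ms) for one voice-agent turn.
# Assumed round numbers for the sketch, not benchmarks of any real system.
PIPELINE_MS = {
    "audio capture + network": 60,
    "speech-to-text (TTFT)": 100,
    "NLU / intent detection": 40,
    "LLM first token": 350,
    "TTS first audio byte": 120,
    "playback start": 30,
}

def total_latency_ms(stages):
    """A sequential pipeline pays every stage's latency before the user hears anything."""
    return sum(stages.values())
```

&lt;p&gt;Even with each stage individually fast, the sequential total here is 700ms, nearly the whole conversational budget, which is why shaving tens of milliseconds off the STT stage matters.&lt;/p&gt;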

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4p1z9z74r810wznpkve.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4p1z9z74r810wznpkve.png" alt=" " width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Modern systems prioritise &lt;em&gt;Time to First Transcript&lt;/em&gt; (TTFT), the delay between a speaker stopping and the first words appearing as text. Pulse STT achieves a TTFT of 64ms, which creates the perceptual &lt;em&gt;illusion&lt;/em&gt; of real-time interaction by returning partial transcripts while the speaker is still talking. These partials update continuously until the model commits to a final transcript at a natural pause, a process called &lt;em&gt;endpointing&lt;/em&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Performance Dimension&lt;/th&gt;
&lt;th&gt;Goal for Natural Conversation&lt;/th&gt;
&lt;th&gt;Typical Cloud API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TTFT&lt;/td&gt;
&lt;td&gt;&amp;lt; 100ms&lt;/td&gt;
&lt;td&gt;200ms – 500ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total Response Latency&lt;/td&gt;
&lt;td&gt;&amp;lt; 800ms&lt;/td&gt;
&lt;td&gt;1500ms – 3000ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transcription Accuracy&lt;/td&gt;
&lt;td&gt;&amp;gt; 95%&lt;/td&gt;
&lt;td&gt;80% – 90% (in noise)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Endpointing Delay&lt;/td&gt;
&lt;td&gt;&amp;lt; 300ms&lt;/td&gt;
&lt;td&gt;500ms – 1000ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Streaming via WebSockets&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The architectural mechanism that makes low-latency &lt;strong&gt;real-time transcription&lt;/strong&gt; possible is the WebSocket connection. Unlike REST APIs, which require a new request for every audio packet, WebSockets maintain a persistent, bidirectional link between client and server. The server pushes transcript fragments back as soon as they're processed, rather than waiting for the full audio to arrive.&lt;/p&gt;

&lt;p&gt;A typical streaming architecture flows like this: establish an authenticated WSS connection, stream 40ms audio packets (roughly 640 bytes at 8kHz sampling) at a continuous 1:1 real-time rate, then receive a stream of JSON objects containing partial results, final results, and word-level timestamps. The client gets a live view into what the model is thinking, not just a final answer. For a technical deep dive, refer to the &lt;a href="https://docs.smallest.ai/waves/v-4-0-0/documentation/speech-to-text/realtime-web-socket/quickstart" rel="noopener noreferrer"&gt;realtime audio transcription&lt;/a&gt; guide.&lt;/p&gt;
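&lt;p&gt;On the client side, the partial/final distinction maps naturally onto a small folding function. The message shape here ({"type": ..., "text": ...}) is an assumption for illustration; consult the provider's schema for the actual field names:&lt;/p&gt;

```python
import json

def fold_transcript(messages):
    """Fold a stream of transcription messages into (committed, pending) text.
    Partial results overwrite one another; final results are committed and
    never revised. The message field names here are illustrative only."""
    committed, pending = [], ""
    for raw in messages:
        msg = json.loads(raw)
        if msg["type"] == "partial":
            pending = msg["text"]          # each partial replaces the last
        elif msg["type"] == "final":
            committed.append(msg["text"])  # finals are committed for good
            pending = ""
    return committed, pending
```

&lt;p&gt;Keeping committed and pending text separate in the UI is what lets partials flicker and correct themselves without the finished transcript ever appearing to change.&lt;/p&gt;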

&lt;h2&gt;
  
  
  &lt;strong&gt;Beyond Transcription: What Speech Intelligence Actually Means&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Transcription is the starting point, not the destination. The more interesting question is what you can &lt;em&gt;infer&lt;/em&gt; from speech that doesn't survive the conversion to text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speaker diarization:&lt;/strong&gt; Answering "who spoke when?" is one of the most practically valuable capabilities. It's an unsupervised clustering problem: the system segments the audio, converts each segment into a high-dimensional numerical embedding of the speaker's unique vocal characteristics, estimates how many distinct speakers are present, then assigns labels (Speaker 1, Speaker 2, etc.). The output transforms a raw transcript into a structured conversation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatekwx6ubipffapubmdc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatekwx6ubipffapubmdc.png" alt=" " width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Word-level confidence scores&lt;/strong&gt;: Each word in a transcript carries a probability score, typically 0.0 to 1.0, representing how certain the model is about that prediction. A score of 0.95 is reliable; 0.60 is a flag. By setting a confidence threshold, an application can automatically route uncertain words to human review, ask the user for clarification, or simply annotate the output with uncertainty markers. In healthcare or legal contexts, where a single misheard word has real consequences, this metadata is not optional.&lt;/p&gt;
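&lt;p&gt;Acting on those scores can be as simple as a threshold split. A sketch, assuming the API returns (word, confidence) pairs:&lt;/p&gt;

```python
def route_words(words, threshold=0.90):
    """Split transcript words into auto-accepted vs needs-review buckets,
    based on the word-level confidence returned alongside each token.
    `words` is a list of (word, confidence) pairs."""
    accepted = [w for w, c in words if c >= threshold]
    review = [(w, c) for w, c in words if c < threshold]
    return accepted, review
```

&lt;p&gt;In a medical or legal workflow the review bucket goes to a human; in a consumer app it might trigger a clarifying question instead.&lt;/p&gt;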

&lt;p&gt;More advanced uncertainty estimation uses entropy-based measures that provide more calibrated estimates of correctness than raw probability scores alone.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metadata Feature&lt;/th&gt;
&lt;th&gt;Data Content&lt;/th&gt;
&lt;th&gt;Key Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Speaker ID&lt;/td&gt;
&lt;td&gt;Integer / label for unique voices&lt;/td&gt;
&lt;td&gt;Meeting minutes, interview archives&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Emotion Tag&lt;/td&gt;
&lt;td&gt;Sentiment (happy, angry, neutral, etc.)&lt;/td&gt;
&lt;td&gt;Call centre coaching, sentiment analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PII Detection&lt;/td&gt;
&lt;td&gt;Flagged sensitive data&lt;/td&gt;
&lt;td&gt;HIPAA, PCI, GDPR compliance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Confidence Score&lt;/td&gt;
&lt;td&gt;Probability (0.0 – 1.0)&lt;/td&gt;
&lt;td&gt;Quality assurance and error correction&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Happens When You Chain These Systems Together&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;One of the more revealing experiments in &lt;strong&gt;ASR&lt;/strong&gt; research isn't a benchmark, it's a failure mode. Smallest.ai's &lt;a href="https://showcase.smallest.ai/projects/voice-chinese-whispers" rel="noopener noreferrer"&gt;Voice Chinese Whispers&lt;/a&gt; demonstrates what happens when you chain transcription, translation, and speech synthesis in repeated loops.&lt;/p&gt;

&lt;p&gt;In a single pass, a misheard word shifts meaning slightly. By the fifth iteration, the system is producing phrases that have no relationship to the original utterance. The model hasn't hallucinated in the classic LLM sense; it's been faithfully following the degraded output of the previous step. Each stage introduces a small amount of &lt;em&gt;acoustic drift&lt;/em&gt; or &lt;em&gt;contextual drift&lt;/em&gt;, and the errors compound geometrically.&lt;/p&gt;
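&lt;p&gt;The compounding is easy to quantify. As a rough model (an assumption for illustration, not a measurement of the demo itself), if each chained stage preserves meaning with probability p, only about p to the power n survives n stages:&lt;/p&gt;

```python
def survival_after_stages(per_stage_accuracy, stages):
    """If each chained stage (transcribe, translate, synthesise, ...) preserves
    meaning independently with probability p, roughly p**n of the original
    meaning survives n stages."""
    return per_stage_accuracy ** stages
```

&lt;p&gt;A 95% per-stage rate sounds excellent, yet five chained stages retain under 78% of the original, which is why the fifth Chinese Whispers iteration barely resembles the first utterance.&lt;/p&gt;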

&lt;p&gt;It's a useful demonstration because it makes visible something that's easy to miss in production systems: the output of an ASR model is not a stable foundation. It's a probabilistic estimate, and downstream systems that treat it as ground truth will inherit and amplify its errors. &lt;em&gt;Transcript stability&lt;/em&gt; (ensuring that once a word is committed it stays committed, and that confidence scores accurately reflect uncertainty) is an engineering discipline, not a given.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;From Transcription to Action: The Real Ambition&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The most significant shift happening in &lt;strong&gt;speech intelligence&lt;/strong&gt; right now isn't about accuracy or latency. It's about what the transcript &lt;em&gt;does&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In the &lt;em&gt;speech-to-action&lt;/em&gt; paradigm, the ASR transcript is fed directly into an LLM that can call external tools, query databases, trigger workflows, and manage complex dialogue. The voice interface becomes a reasoning interface. The gap between "I said a thing" and "something happened" collapses.&lt;/p&gt;

&lt;p&gt;This requires a level of integration between the speech layer and the reasoning layer that earlier architectures couldn't support. The emerging answer is &lt;em&gt;full-duplex multimodal models&lt;/em&gt; where a single model handles voice input, reasoning, and voice output in one pipeline, rather than piping data between separate ASR, LLM, and TTS services. &lt;a href="https://smallest.ai/speech-to-speech" rel="noopener noreferrer"&gt;Smallest.ai's Hydra&lt;/a&gt; takes this approach, handling intent detection and voice synthesis together to eliminate the inter-service latency that makes stitched-together pipelines feel unnatural.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Real-Time Voice AI Looks Like in Practice
&lt;/h3&gt;

&lt;p&gt;Smallest.ai's &lt;a href="https://showcase.smallest.ai/projects/debate-arena" rel="noopener noreferrer"&gt;Debate Arena&lt;/a&gt; is a working demonstration of how far orchestration has come. The system stages a philosophical debate between AI agents (Socrates arguing &lt;em&gt;for&lt;/em&gt; and Aristotle arguing &lt;em&gt;against&lt;/em&gt; any topic the user proposes) with distinct voices, expressive vocal parameters (emotion, pitch, volume, prosody) predicted by the LLM each round, and an ancient Athenian judge scoring the exchange.&lt;/p&gt;

&lt;p&gt;For a system like this to work through voice, the ASR layer needs to maintain multi-speaker tracking, support adversarial turn-taking without the agents talking over each other, and do all of this at low enough latency that the conversation feels alive. The Debate Arena uses Lightning TTS v3.2 WebSocket streaming, with voice parameters generated dynamically per round by GPT-4o-mini. It supports two modes, Philosophical and Roast Battle, with escalating arguments and audience voting.&lt;/p&gt;

&lt;p&gt;It's a playful project, but it demonstrates something serious: the engineering required to make multi-agent, multi-voice, real-time voice interaction work is now tractable. The primitives exist. The question is how to compose them well.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Where This Is Going&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The next decade of &lt;strong&gt;AI speech recognition&lt;/strong&gt; is likely to diverge along two paths that are pulling in opposite directions.&lt;/p&gt;

&lt;p&gt;The first is scale: massive cloud models trained on ever-larger and more diverse datasets, capable of handling more languages, more accents, more acoustic conditions. The second is compression: hyper-efficient on-device models that run locally on a phone or an industrial edge device without sending audio to the cloud. Privacy, data sovereignty, and latency concerns are all pushing toward the second path, even as raw capability improvements come from the first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adaptive and personalised speech models&lt;/strong&gt; represent a third direction that cuts across both. Rather than building a single model that tries to be equally good at everything, future systems will adapt in real-time to an individual speaker's specific pitch, pace, and vocabulary. Zero-shot adaptation (learning to recognise a specific voice from a few seconds of reference audio) makes this tractable without requiring per-user retraining at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Building Things That Actually Work&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For developers, the translation from research benchmarks to production systems requires moving past Word Error Rate as the primary metric. WER tells you how accurate the model is on a test set. It doesn't tell you whether users can trust it.&lt;/p&gt;

&lt;p&gt;The metrics that matter in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tail latency (P99):&lt;/strong&gt; Does the system respond quickly under heavy load, or does it occasionally spike in ways that break the conversation?
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calibrated confidence:&lt;/strong&gt; When the model reports 90% certainty, is it actually right 90% of the time? Overconfident models are more dangerous than uncertain ones.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain-specific adaptation:&lt;/strong&gt; Does the system handle your vocabulary? Medical terms, product names, and technical jargon that don't appear in general training data can be addressed through word boosting and custom dictionaries.&lt;/li&gt;
&lt;/ul&gt;
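&lt;p&gt;Checking calibration doesn't require heavy tooling. A sketch that compares stated confidence with observed accuracy over a labelled sample:&lt;/p&gt;

```python
def calibration_gap(predictions):
    """Compare a model's stated confidence with its observed accuracy.
    `predictions` is a list of (confidence, was_correct) pairs, e.g. from
    spot-checking transcripts against human-verified ground truth.
    Returns (mean confidence, observed accuracy, gap); a large positive
    gap means the model is overconfident."""
    mean_conf = sum(c for c, _ in predictions) / len(predictions)
    accuracy = sum(1 for _, ok in predictions if ok) / len(predictions)
    return mean_conf, accuracy, mean_conf - accuracy
```

&lt;p&gt;A model that says 0.90 but is right only 75% of the time has a +0.15 gap, and every downstream threshold built on its scores is miscalibrated with it.&lt;/p&gt;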

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Best Practice&lt;/th&gt;
&lt;th&gt;Implementation&lt;/th&gt;
&lt;th&gt;Outcome&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Handle low confidence&lt;/td&gt;
&lt;td&gt;Flag words below 0.90 for human review&lt;/td&gt;
&lt;td&gt;Reduced error rate in high-stakes documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use WebSockets&lt;/td&gt;
&lt;td&gt;Implement persistent WSS connections&lt;/td&gt;
&lt;td&gt;Sub-500ms response times for voice agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adopt noise-trained models&lt;/td&gt;
&lt;td&gt;Skip preprocessing in chaotic environments&lt;/td&gt;
&lt;td&gt;Better performance in factories and vehicles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitor RTF&lt;/td&gt;
&lt;td&gt;Track the Real-Time Factor of inference&lt;/td&gt;
&lt;td&gt;Guaranteed responsiveness under load&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
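&lt;p&gt;The Real-Time Factor from the table above is straightforward to instrument. A sketch:&lt;/p&gt;

```python
import time

def real_time_factor(process, audio_seconds):
    """RTF = processing time / audio duration. Below 1.0 means the system
    keeps up with live audio; measure it under production load, not at idle,
    since the P99 tail is where conversations break."""
    start = time.perf_counter()
    process()  # run inference on `audio_seconds` worth of audio
    return (time.perf_counter() - start) / audio_seconds
```

&lt;p&gt;An RTF of 0.3 means one second of speech takes 300ms to transcribe; if load pushes it past 1.0, partial transcripts start arriving later than the speech they describe.&lt;/p&gt;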

&lt;p&gt;Smallest.ai's ecosystem offers a set of tools built around these production constraints. Pulse STT delivers 64ms TTFT with built-in diarization across 30+ languages. &lt;a href="https://smallest.ai/blog/evaluating-lightning-asr-against-leading-streaming-speech-recognition-models" rel="noopener noreferrer"&gt;Lightning ASR&lt;/a&gt; is optimised for sub-300ms latency, with particular strength in non-English languages. &lt;a href="https://smallest.ai/" rel="noopener noreferrer"&gt;Hydra&lt;/a&gt; handles the full voice conversation pipeline (input, reasoning, and output) in a single model.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;A Note on What "Working" Really Means&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Market projections ($19.5 billion by 2030, 27% of the global population already using voice commands) tend to measure adoption, not satisfaction. A system that works for one speaker in a quiet room and fails another speaker in a noisy one is not a solved problem, even if it ships with impressive accuracy numbers.&lt;/p&gt;

&lt;p&gt;The history of &lt;strong&gt;automatic speech recognition&lt;/strong&gt; is a history of systems getting impressively good at well-resourced voices and incrementally better at everyone else. The architecture has genuinely improved: encoder-decoder transformers, end-to-end training, and noise-robust learning are meaningful advances over the rule-based systems of the 1990s. But the generalisation gap that turns a 95% accuracy number in the lab into a 75% accuracy number in the field is not a technical afterthought. It's the central problem.&lt;/p&gt;

&lt;p&gt;Building voice interfaces that are worth trusting means taking that gap seriously: in the training data you choose, the confidence metadata you expose, the noise conditions you test against, and the communities whose voices you treat as primary cases rather than edge cases.&lt;/p&gt;

&lt;p&gt;The era of voice-first interfaces hasn't simply arrived. It's arriving unevenly. And the engineers who understand why have a real opportunity to build something better.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tools referenced in this piece: &lt;a href="https://smallest.ai/" rel="noopener noreferrer"&gt;Pulse STT&lt;/a&gt;, &lt;a href="https://smallest.ai/blog/evaluating-lightning-asr-against-leading-streaming-speech-recognition-models" rel="noopener noreferrer"&gt;Lightning ASR&lt;/a&gt;, &lt;a href="https://smallest.ai/" rel="noopener noreferrer"&gt;Hydra&lt;/a&gt;, &lt;a href="https://showcase.smallest.ai/projects/multilingual-translator" rel="noopener noreferrer"&gt;Multilingual Translator&lt;/a&gt;, &lt;a href="https://showcase.smallest.ai/projects/voice-chinese-whispers" rel="noopener noreferrer"&gt;Voice Chinese Whispers&lt;/a&gt;, &lt;a href="https://showcase.smallest.ai/projects/debate-arena" rel="noopener noreferrer"&gt;Debate Arena&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>voiceagents</category>
      <category>smallestai</category>
    </item>
    <item>
      <title>Why Speech Recognition API Requires a Different Architecture</title>
      <dc:creator>Smallest AI</dc:creator>
      <pubDate>Wed, 01 Apr 2026 09:18:26 +0000</pubDate>
      <link>https://forem.com/smallestai-community/why-speech-recognition-api-requires-a-different-architecture-46ed</link>
      <guid>https://forem.com/smallestai-community/why-speech-recognition-api-requires-a-different-architecture-46ed</guid>
      <description>&lt;h1&gt;
  
  
  Speech Recognition API: Streaming, WebSockets and Latency
&lt;/h1&gt;

&lt;p&gt;A speech recognition API that accepts a file and returns a transcript is a solved problem. The architecture is simple because the constraints are simple.&lt;/p&gt;

&lt;p&gt;Real-time transcription is different. The audio doesn't exist yet when processing needs to begin. The user is still speaking while the system builds a hypothesis about what they have said so far. The application needs a partial answer now, not a complete answer in two seconds. These constraints change the architecture at every layer, from how audio is captured and transmitted to how the recognition model processes it and how results flow back to the client.&lt;/p&gt;

&lt;p&gt;This piece walks through that architecture end to end. Not as an API reference, but as an explanation of what is actually happening inside a streaming speech recognition system and why each component is designed the way it is.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fundamental problem with batch transcription for real-time use
&lt;/h2&gt;

&lt;p&gt;Before looking at how streaming ASR works, it helps to understand precisely why the batch approach breaks down when applied to real-time audio.&lt;/p&gt;

&lt;p&gt;In a batch system, the flow is straightforward. Audio is captured, buffered until complete, sent to a recognition service via an HTTP POST request, processed server-side, and a transcript is returned in the response body. The model sees the entire utterance before producing any output. This gives it full context, which tends to produce accurate results.&lt;/p&gt;

&lt;p&gt;The problem is time. If a user speaks for five seconds, the system cannot return any transcript until those five seconds of audio have been captured, transmitted, and processed. Even with a fast model, the user experiences a dead pause after finishing their sentence before anything happens. In a voice agent or real-time captioning system, that pause breaks the interaction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flr6isplxifqdni7x58ey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flr6isplxifqdni7x58ey.png" alt="Speech Recognition API" width="800" height="612"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The deeper problem is that buffering introduces a fundamental floor on latency that no amount of model optimization can eliminate. Even an infinitely fast model cannot return a transcript before the audio has been collected and sent. The latency is baked into the architecture.&lt;/p&gt;
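&lt;p&gt;A quick back-of-envelope calculation makes that floor concrete. The upload and inference times below are assumed, illustrative figures, not measurements of any particular service:&lt;/p&gt;

```python
# Back-of-envelope latency floor for batch transcription (illustrative numbers).
speech_s = 5.0     # the user speaks for five seconds
upload_s = 0.4     # time to transmit the finished file (assumed)
inference_s = 0.6  # server-side model time (assumed)

# The user hears nothing until all three phases complete in sequence.
batch_wait = speech_s + upload_s + inference_s

# Even with an infinitely fast model (inference_s -> 0), the wait after the
# user stops talking cannot drop below the transmission time.
floor_after_speech = upload_s

print(f"wait after speech ends: {batch_wait - speech_s:.1f}s "
      f"(architectural floor: {floor_after_speech:.1f}s)")
```

The floor is a property of the collect-then-process contract, which is why no amount of model optimization removes it.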

&lt;p&gt;Streaming ASR removes this floor by changing the fundamental contract. Rather than collecting audio and then processing it, the system processes audio as it arrives.&lt;/p&gt;

&lt;h2&gt;
  
  
  How a streaming ASR API receives audio
&lt;/h2&gt;

&lt;p&gt;The first architectural shift in a streaming system is the transport layer. HTTP request-response is the wrong shape for continuous audio delivery.&lt;/p&gt;

&lt;p&gt;A new HTTP connection carries significant overhead: DNS resolution, a TCP handshake, TLS negotiation, and HTTP headers on every request. For a file upload, this overhead is negligible relative to the payload. For 20-millisecond audio packets arriving fifty times per second, it is prohibitive. The connection overhead would dominate the actual audio data.&lt;/p&gt;
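&lt;p&gt;The arithmetic is worth seeing. Assuming 16 kHz, 16-bit mono PCM and roughly 500 bytes of HTTP and TLS framing per request (both illustrative figures), the per-request overhead rivals the audio it carries:&lt;/p&gt;

```python
# Why per-request HTTP is the wrong shape for 20 ms audio packets.
# Assumed figures: 16 kHz, 16-bit mono PCM; ~500 B of request framing.
sample_rate = 16_000
bytes_per_sample = 2
packet_ms = 20

payload = sample_rate * bytes_per_sample * packet_ms // 1000  # bytes of audio
http_overhead = 500                                           # framing (assumed)

overhead_ratio = http_overhead / payload
print(f"payload per packet: {payload} B, overhead ratio: {overhead_ratio:.0%}")
# At 50 packets per second the framing is comparable to the audio itself;
# a persistent WebSocket pays the handshake once, then sends bare frames.
```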

&lt;p&gt;A WebSocket connection solves this by establishing a single persistent connection that remains open for the duration of the session. The initial handshake happens once. After that, both sides can send data at any time without per-message overhead. The client pushes audio packets as they arrive from the microphone. The server pushes transcript events as they are produced by the recognition model. Neither side waits for the other to finish.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python Batch Processing Example
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Prerequisite: &lt;a href="https://app.smallest.ai/login?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_campaign=medium_article" rel="noopener noreferrer"&gt;SmallestAI's API Key&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SMALLEST_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;audio_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meeting_recording.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.smallest.ai/api/v1/pulse/get_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pulse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;word_timestamps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;diarization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;emotion_detection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;audio_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;audio_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Transcription:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcription&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;words&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
    &lt;span class="n"&gt;speaker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speaker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  [Speaker &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;speaker&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;word&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;emotions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Emotions detected:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;emotion&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;emotions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;emotion&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Output
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8okgvmss0j3bwqbmdswf.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8okgvmss0j3bwqbmdswf.gif" alt="smallest realtime terminal" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The audio capture and network transmission run concurrently. The recognition server receives a continuous stream of small packets rather than waiting for a complete file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inside the recognition model: how streaming inference works
&lt;/h2&gt;

&lt;p&gt;Once audio packets arrive at the recognition server, the ASR model needs to produce transcript output without waiting for the utterance to complete. This requires a different inference architecture from batch transcription.&lt;/p&gt;

&lt;p&gt;Modern streaming &lt;a href="https://smallest.ai/blog/what-makes-a-high-performance-real-time-asr-api" rel="noopener noreferrer"&gt;ASR systems&lt;/a&gt; use a buffer that accumulates incoming audio packets and runs the recognition model against overlapping windows of that buffer. The window size is typically 250 to 500 milliseconds, much longer than the 20ms packet size, because the model needs enough acoustic context to make meaningful predictions. Each time new audio arrives, the window advances and the model produces an updated hypothesis.&lt;/p&gt;
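&lt;p&gt;The window-over-buffer pattern can be sketched in a few lines. Everything here is illustrative: the sizes are assumptions, and &lt;code&gt;run_model&lt;/code&gt; stands in for the actual recognition model:&lt;/p&gt;

```python
# Sketch of the window-over-buffer pattern (assumed sizes, toy model hook).
SAMPLE_RATE = 16_000
PACKET = SAMPLE_RATE * 20 // 1000    # 320 samples per 20 ms packet
WINDOW = SAMPLE_RATE * 400 // 1000   # 6400 samples, ~400 ms of context

buffer = []

def on_packet(samples, run_model):
    """Append one packet, then re-run recognition on the trailing window."""
    buffer.extend(samples)
    window = buffer[-WINDOW:]        # rolling window; shorter at stream start
    return run_model(window)         # returns an updated hypothesis

# Each call may revise the previous hypothesis; the client treats every
# result as provisional until the server marks it final.
```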

&lt;p&gt;The model's job at each step is to answer the same question. Given all the audio seen so far, what is the most likely transcript? The answer changes as more audio arrives. A word that looked like "their" in the first pass might resolve to "there" when the following words provide context. These updates produce the partial transcript stream.&lt;/p&gt;

&lt;p&gt;The internal architecture of the recognition model is typically an encoder-decoder transformer. The encoder converts the incoming audio frames into a sequence of dense vector representations capturing phonetic and prosodic features. The decoder attends to those representations to produce token predictions, one sub-word token at a time, building the transcript incrementally.&lt;/p&gt;

&lt;p&gt;What makes this work in streaming mode is a technique called chunked attention, where the encoder is constrained to attend only to audio within a rolling window rather than the full utterance. This means the model can produce outputs without waiting for the sentence to end, at the cost of slightly reduced accuracy on words near the end of the window where context is limited.&lt;/p&gt;
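&lt;p&gt;A chunked attention mask is easy to visualise with a toy sketch. Frames attend freely within their own chunk and to earlier chunks, but never to later ones, so the encoder never waits on future audio. The sizes below are illustrative; real systems tune the chunk length against the latency budget:&lt;/p&gt;

```python
def chunked_mask(n_frames, chunk):
    """mask[i][j] is True where frame i may attend to frame j:
    its own chunk and any earlier chunk, never a later one."""
    return [[(i // chunk) >= (j // chunk) for j in range(n_frames)]
            for i in range(n_frames)]

mask = chunked_mask(n_frames=8, chunk=4)
# Frames 0-3 see only chunk 0; frames 4-7 see chunks 0 and 1.
```

The limited right-context inside the current chunk is exactly why accuracy dips for words near the end of the window, as noted above.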

&lt;h3&gt;
  
  
  WebSocket Connection Example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# file name: websocket.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;websockets&lt;/span&gt;

&lt;span class="n"&gt;API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SMALLEST_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transcribe_stream&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Build the WebSocket URL with query parameters
&lt;/span&gt;    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;encoding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;linear16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;word_timestamps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;query_string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wss://waves-api.smallest.ai/api/v1/pulse/get_text?&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query_string&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Connect with Bearer token authentication header
&lt;/span&gt;    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;websockets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;extra_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Connected to Pulse STT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Launch concurrent tasks: send audio &amp;amp; receive transcripts
&lt;/span&gt;        &lt;span class="n"&gt;send_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;send_audio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;recv_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;receive_transcripts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;send_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recv_task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_audio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Read audio from a source and stream it to the WebSocket.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio_16k_mono.raw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Recommended chunk size
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Pace stream to simulate real-time
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;receive_transcripts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Receive and process transcript responses from the server.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FINAL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_final&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PARTIAL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;lang&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;transcript&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Access word timestamps if enabled
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;words&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;words&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;word&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;transcribe_stream&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Output
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqqhju04bdemehrfopza.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqqhju04bdemehrfopza.gif" alt="smallest-websocket-terminal" width="1920" height="1080"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Word timestamps are a byproduct of the encoder's attention alignment. The model learns which audio frames correspond to which output tokens during training, and this alignment is surfaced as timing metadata without additional inference cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Endpointing and deciding when an utterance ends
&lt;/h2&gt;

&lt;p&gt;One of the harder problems in streaming ASR is endpointing, which is detecting when the speaker has finished a turn rather than simply paused mid-sentence. This matters because the system needs to know when to commit a final transcript and when to keep accumulating audio for the current hypothesis.&lt;/p&gt;

&lt;p&gt;Getting endpointing wrong in either direction has visible consequences. An endpointer that fires too early cuts off sentences, producing truncated transcripts. One that fires too late adds perceptible delay after the speaker finishes, because the application has to wait for the endpointer before it can act on what was said.&lt;/p&gt;

&lt;p&gt;The simplest approach is energy-based voice activity detection. If the audio energy drops below a threshold for a fixed duration, the system assumes the speaker has finished. This works adequately in quiet environments but fails under noise, where energy never drops cleanly to silence, and for speakers who naturally pause mid-thought.&lt;/p&gt;
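The energy-threshold approach described above can be sketched in a few lines. This is an illustrative, self-contained example rather than any provider's endpointer: the frame size, threshold, and silence duration are all assumptions chosen for the sketch.

```python
import struct

def is_endpoint(frames, threshold=500.0, silence_frames_needed=25):
    """Return True if the tail of `frames` is sustained low energy.

    frames: list of bytes objects, each one 20 ms of 16-bit mono PCM.
    threshold: mean absolute amplitude below which a frame counts as silence.
    silence_frames_needed: consecutive silent frames that end the turn
    (25 frames x 20 ms = 500 ms of quiet).
    """
    silent = 0
    for frame in frames:
        samples = struct.unpack(f"<{len(frame) // 2}h", frame)
        # Mean absolute amplitude as a cheap proxy for frame energy.
        energy = sum(abs(s) for s in samples) / max(len(samples), 1)
        silent = silent + 1 if energy < threshold else 0
    return silent >= silence_frames_needed
```

As the text notes, this breaks down under noise: a noisy room keeps `energy` above the threshold even when the speaker has finished, so the counter never accumulates.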

&lt;p&gt;Better endpointing systems combine acoustic signals with semantic signals. The acoustic layer watches for energy drops and spectral changes that characterize sentence endings. The semantic layer, often a small language model running on the partial transcript, checks whether the utterance is syntactically complete. A partial transcript ending mid-noun-phrase is unlikely to represent a complete turn. One ending with a complete declarative sentence is more likely to be a real turn boundary.&lt;/p&gt;

&lt;p&gt;The output of the endpointer determines when partial transcript events transition to final transcript events in the client. A partial event is a hypothesis update. A final event is a committed result that the application can act on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The transcript event stream and how to handle it
&lt;/h2&gt;

&lt;p&gt;A streaming ASR API produces a continuous stream of events rather than a single response. Each event carries a type field distinguishing partial from final results, the current transcript text, word-level metadata if requested, and timestamps indicating where in the audio the result falls.&lt;/p&gt;

&lt;h3&gt;
  
  
  Partial Response (&lt;code&gt;is_final: false&lt;/code&gt;)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sess_12345abcde&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the customer said they want&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_final&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_last&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;words&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;word&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.12&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;word&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.52&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;word&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;said&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.54&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.74&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;word&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;they&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.76&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.90&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;word&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;want&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.10&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Final Response (&lt;code&gt;is_final: true&lt;/code&gt;)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sess_12345abcde&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the customer said they want a refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_final&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_last&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;words&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;word&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.12&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;word&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.52&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;word&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;said&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.54&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.74&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;word&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;they&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.76&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.90&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;word&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;want&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.10&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;word&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.14&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;word&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.48&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compare the two events. The partial transcript ends mid-clause at "want" because the model has not yet heard the audio that completes the phrase. The final event, produced with the full context, extends the transcript to "a refund" and commits it. Words near the tail of a partial hypothesis are the ones most likely to be revised as more audio arrives.&lt;/p&gt;

&lt;p&gt;This is why acting on partial transcript content is architecturally risky. A word at low confidence in a partial might resolve to something different in the final. Any downstream action triggered by the partial would have been based on an unstable input.&lt;/p&gt;

&lt;p&gt;The correct pattern is to use partial transcripts for user-facing display only, where visible corrections feel natural and expected, and to gate all application logic on final transcripts.&lt;/p&gt;
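The display-versus-gate split can be expressed as a small event handler. This is a hypothetical sketch, assuming events arrive as dicts shaped like the examples above (`transcript`, `is_final`); `update_display` and `handle_final` stand in for whatever your application does with each.

```python
def make_event_handler(update_display, handle_final):
    """Route streaming ASR events: partials to the UI, finals to app logic."""
    def on_event(event):
        # Partials are unstable hypotheses: show them, never act on them.
        update_display(event["transcript"], committed=event["is_final"])
        if event["is_final"]:
            # Finals are committed results: safe to trigger downstream actions.
            handle_final(event["transcript"])
    return on_event
```

The point of the closure is that the gating decision lives in exactly one place, so no downstream code ever has to re-check `is_final` itself.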

&lt;h2&gt;
  
  
  Confidence scores and what they actually measure
&lt;/h2&gt;

&lt;p&gt;Every word in a streaming transcript carries a confidence score, typically a probability between 0.0 and 1.0, representing how certain the model is about that prediction.&lt;/p&gt;

&lt;p&gt;The confidence score is not a measure of whether the word is correct. It measures how much probability mass the model assigned to this particular output versus the alternatives it considered. A score of 0.95 means the model strongly preferred this word over all others it evaluated. A score of 0.60 means there were plausible alternatives that the model considered seriously.&lt;/p&gt;

&lt;p&gt;Words with low confidence scores are disproportionately likely to be wrong, but the relationship is not one-to-one. A model can be highly confident and wrong, particularly on proper nouns or domain-specific terms not present in training data. And it can be somewhat uncertain and still produce the correct output.&lt;/p&gt;

&lt;p&gt;The most useful application of confidence scores is flagging rather than filtering. Rather than discarding low-confidence words, mark them for downstream attention. In a customer service context, a low-confidence stretch in a critical part of a call is a signal to route the transcript for human review. In a voice agent, a low-confidence final transcript is a signal to ask for clarification rather than proceeding.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transcribe_with_flagging&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;AsyncWavesClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;word_timestamps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;words&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;

    &lt;span class="c1"&gt;# Flag low-confidence words for review instead of discarding them.&lt;/span&gt;
    &lt;span class="c1"&gt;# Whether a per-word "confidence" key is present depends on the response.&lt;/span&gt;
    &lt;span class="n"&gt;flagged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcription&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flagged&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Paralinguistic signals alongside the transcript
&lt;/h2&gt;

&lt;p&gt;The acoustic signal carries information that survives the conversion to text and information that does not. Tone, emotional register, and cues to speaker age and gender are all present in the audio and absent from the transcript. A recognition service that surfaces these as structured metadata gives the application something the transcript alone cannot provide.&lt;/p&gt;

&lt;p&gt;Emotion detection tags the emotional register of each segment. Gender and age detection provide demographic signals. These are extracted during the same inference pass as the transcript, using acoustic features the encoder computes regardless. They come without the cost of a separate processing pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;full_acoustic_analysis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;AsyncWavesClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;word_timestamps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;age_detection&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;gender_detection&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;emotion_detection&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Transcript:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcription&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Handle emotions as a dictionary
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;emotions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Detected emotions:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;emotion&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;emotions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;emotion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;capitalize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Detected gender:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gender&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Estimated age range:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are probabilistic estimates. They work well in aggregate and should be treated as signals rather than ground truth, particularly for individual utterances where the acoustic evidence may be ambiguous.&lt;/p&gt;

&lt;h2&gt;
  
  
  Streaming TTS and closing the audio loop
&lt;/h2&gt;

&lt;p&gt;A streaming ASR system rarely lives alone. In a voice agent, the transcript from the recognition service feeds a language model, which generates a response that must be converted back to audio and played to the user. The latency of that full loop determines whether the agent feels conversational.&lt;/p&gt;

&lt;p&gt;The dominant contributor is LLM reasoning, typically 300 to 500ms even with a fast model. The highest-impact optimization is therefore starting TTS synthesis before the LLM has finished generating, feeding the streaming token output rather than waiting for the complete response.&lt;/p&gt;
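The token-to-sentence buffering this requires can be sketched independently of any SDK. `stream_sentences` below is a hypothetical helper (the name and the sentence-terminator heuristic are assumptions) that groups an LLM token stream into sentence-sized chunks so synthesis can start before the full response exists.

```python
def stream_sentences(tokens):
    """Group a stream of LLM text tokens into sentence-sized TTS chunks.

    Yields each chunk as soon as a sentence terminator arrives, so
    synthesis of sentence one can begin while sentence two is still
    being generated.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        # Naive boundary check; real systems handle abbreviations, numbers, etc.
        if buffer.rstrip().endswith((".", "!", "?")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush any trailing partial sentence
        yield buffer.strip()
```

Each yielded chunk would then be handed to the streaming synthesis client as soon as it is produced.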

&lt;p&gt;&lt;a href="https://docs.smallest.ai/waves/documentation/text-to-speech-lightning/streaming" rel="noopener noreferrer"&gt;Smallest.ai's WavesStreamingTTS&lt;/a&gt; connects to the synthesis service via a persistent WebSocket and accepts text chunks as they arrive from the LLM token stream. Audio chunks come back as each sentence is ready, so playback can begin in under 100ms from the first LLM token.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basic Setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;smallestai.waves&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TTSConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;WavesStreamingTTS&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;wave&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TTSConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;voice_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;magnus&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_SMALLEST_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sample_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;24000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;speed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_buffer_flush_ms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;streaming_tts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WavesStreamingTTS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Standard Streaming
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Streaming delivers audio in real-time for voice assistants and chatbots.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;audio_chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;streaming_tts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synthesize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;wave&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;streamed.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setnchannels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setsampwidth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setframerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;24000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeframes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_chunks&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Multilingual streaming and code-switching&lt;/h2&gt;

&lt;p&gt;Streaming ASR systems designed around a single language make assumptions that break in multilingual environments. A model trained primarily on English will produce poor results on Hindi and may fail on code-switching utterances that mix the two within a single sentence.&lt;/p&gt;

&lt;p&gt;Smallest.ai's Lightning ASR model supports 30 languages including Hindi, German, French, Spanish, Italian, Portuguese, Russian, Arabic, Polish, Dutch, Tamil, Bengali, Gujarati, Kannada, and Malayalam, with a &lt;code&gt;multi&lt;/code&gt; mode for code-switching environments where speakers alternate between languages.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;multilingual_transcription&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;AsyncWavesClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SMALLEST_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Auto Language Detection
&lt;/span&gt;            &lt;span class="n"&gt;word_timestamps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;emotion_detection&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcription&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Code-switching is architecturally harder than single-language recognition because the model cannot assume a stable phoneme inventory or language model distribution. The &lt;code&gt;multi&lt;/code&gt; mode handles detection and switching internally, removing the need for the application to route audio to different models based on detected language.&lt;/p&gt;

&lt;h2&gt;What the architecture means for how you build&lt;/h2&gt;

&lt;p&gt;Understanding how streaming ASR works at each layer changes what you build around it.&lt;/p&gt;

&lt;p&gt;Because partial transcripts are revised as more audio arrives, any application logic that needs to act on what the user said must wait for a final transcript event. Displaying partials in a UI is fine because visible corrections feel natural. Triggering downstream actions, database lookups, tool calls, or routing decisions on partial content is architecturally unsound. The final event is the contract.&lt;/p&gt;
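&lt;p&gt;As a minimal sketch of that contract (the event shape here is hypothetical, not any SDK's actual schema): partial events only update what the user sees, and side effects fire only when &lt;code&gt;is_final&lt;/code&gt; is set.&lt;/p&gt;

```python
# Illustrative sketch: partials update the UI; only final transcripts
# trigger downstream actions. The event dicts are a made-up shape.
def handle_transcript_event(event, ui, actions):
    if event["is_final"]:
        actions.append(event["text"])   # safe: the final event is the contract
    else:
        ui["display"] = event["text"]   # partials may still be revised

ui, actions = {"display": ""}, []
for ev in [
    {"is_final": False, "text": "book a fl"},
    {"is_final": False, "text": "book a flight"},
    {"is_final": True,  "text": "book a flight to Denver"},
]:
    handle_transcript_event(ev, ui, actions)

# Only the final utterance reached the action queue; the two partials
# were display-only and were never acted on.
```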

&lt;p&gt;Because word confidence scores reflect model uncertainty rather than correctness, the right use is flagging rather than filtering. A word with 0.60 confidence is a candidate for human review or a clarification prompt, not something to silently drop from the transcript.&lt;/p&gt;
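&lt;p&gt;A sketch of flagging rather than filtering, with a made-up word list and an arbitrary review threshold:&lt;/p&gt;

```python
# Illustrative sketch: flag low-confidence words for human review instead of
# silently dropping them from the transcript.
REVIEW_THRESHOLD = 0.70  # hypothetical cutoff, tune per application

def flag_uncertain_words(words):
    # The transcript keeps every word; uncertainty is surfaced, not erased.
    transcript = " ".join(w["word"] for w in words)
    flagged = [w["word"] for w in words if REVIEW_THRESHOLD > w["confidence"]]
    return transcript, flagged

words = [
    {"word": "refund", "confidence": 0.95},
    {"word": "order",  "confidence": 0.92},
    {"word": "7741",   "confidence": 0.60},  # candidate for a clarification prompt
]
transcript, flagged = flag_uncertain_words(words)
```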

&lt;p&gt;Because the endpointer determines when final transcripts are emitted, the responsiveness of the system to turn endings is determined by endpointing quality, not model speed. A fast model behind a slow endpointer still feels slow. Endpointing latency is worth measuring explicitly and separately from overall model throughput.&lt;/p&gt;
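&lt;p&gt;One way to measure it, sketched with invented timestamps: subtract the model's decode time from the total gap between the end of speech and the final transcript event, and what remains is endpointing latency.&lt;/p&gt;

```python
# Illustrative timing breakdown. All numbers here are made up; the point is
# that endpointing latency is measured separately from model decode time.
def latency_breakdown(speech_end_ts, final_emitted_ts, decode_ms):
    total_ms = (final_emitted_ts - speech_end_ts) * 1000
    endpoint_ms = total_ms - decode_ms  # time spent waiting on the endpointer
    return round(total_ms), round(endpoint_ms)

# A fast model behind a slow endpointer still feels slow:
total, endpointing = latency_breakdown(
    speech_end_ts=10.00,     # user stopped speaking
    final_emitted_ts=10.95,  # final transcript arrived 950 ms later
    decode_ms=120,           # the model itself took only 120 ms
)
```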

&lt;p&gt;Because WebSocket connections carry state, connection management becomes an application concern. Dropped connections need reconnection logic. Audio buffered during a reconnect gap either needs to be replayed or explicitly acknowledged as lost. These failure modes do not exist in batch transcription, so streaming systems need to design for them from the start.&lt;/p&gt;
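&lt;p&gt;A minimal sketch of one design for that, a bounded client-side buffer (nothing here comes from any SDK): chunks captured while disconnected are queued, overflow is counted as explicit loss, and the queue is replayed on reconnect.&lt;/p&gt;

```python
from collections import deque

# Illustrative client-side reconnection handling: audio captured while the
# WebSocket is down is buffered, then either replayed on reconnect or
# explicitly counted as lost once the buffer cap is hit.
class ReconnectBuffer:
    def __init__(self, max_chunks=50):
        self.pending = deque(maxlen=max_chunks)  # oldest chunks drop first
        self.lost = 0

    def on_chunk_while_disconnected(self, chunk):
        if len(self.pending) == self.pending.maxlen:
            self.lost += 1  # acknowledge loss instead of silently dropping
        self.pending.append(chunk)

    def replay_on_reconnect(self, send):
        while self.pending:
            send(self.pending.popleft())

buf = ReconnectBuffer(max_chunks=2)
for chunk in [b"a", b"b", b"c"]:  # third chunk overflows the buffer
    buf.on_chunk_while_disconnected(chunk)

sent = []
buf.replay_on_reconnect(sent.append)
```

Whether to replay or discard is a product decision; what matters architecturally is that the loss is recorded rather than silent.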

&lt;p&gt;The technology is capable. The architecture around it determines whether that capability reaches the user.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tools referenced in this piece:&lt;/em&gt; &lt;a href="https://github.com/smallest-inc/smallest-python-sdk" rel="noopener noreferrer"&gt;AsyncWavesClient&lt;/a&gt;, &lt;a href="https://docs.smallest.ai/waves/v-4-0-0/documentation/text-to-speech-lightning/streaming#websocket-streaming" rel="noopener noreferrer"&gt;WavesStreamingTTS&lt;/a&gt;, &lt;a href="https://docs.smallest.ai/waves/documentation/speech-to-text-pulse/quickstart" rel="noopener noreferrer"&gt;Pulse STT&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>voice</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
