<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Elijah Atamas</title>
    <description>The latest articles on Forem by Elijah Atamas (@stackdumper).</description>
    <link>https://forem.com/stackdumper</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F457006%2F197ea369-3e95-473a-b0d6-713a1ad9c46a.jpeg</url>
      <title>Forem: Elijah Atamas</title>
      <link>https://forem.com/stackdumper</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/stackdumper"/>
    <language>en</language>
    <item>
      <title>Real-Time (S2S) vs Cascading (STT/TTS) Voice Agent Architecture</title>
      <dc:creator>Elijah Atamas</dc:creator>
      <pubDate>Fri, 24 Oct 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/softcery/real-time-s2s-vs-cascading-stttts-voice-agent-architecture-2h7j</link>
      <guid>https://forem.com/softcery/real-time-s2s-vs-cascading-stttts-voice-agent-architecture-2h7j</guid>
      <description>&lt;p&gt;Three architectural approaches exist for building voice agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chained&lt;/strong&gt; pipelines using separate speech recognition, language processing, and synthesis components&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speech-to-Speech (Half-Cascade)&lt;/strong&gt; models that process native audio input, use text-based language reasoning, and generate speech output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native Audio&lt;/strong&gt; models that reason directly in audio space within a single neural network&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each makes different tradeoffs between flexibility, latency, cost, production readiness, and audio quality preservation.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Understanding Speech-to-Speech Voice Agent Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What Is Speech-to-Speech?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Speech-to-speech voice agents process audio with minimal delay – 200-300 milliseconds from user speech to agent response.&lt;/p&gt;

&lt;p&gt;Two approaches exist: &lt;strong&gt;half-cascade&lt;/strong&gt; systems use native audio input processing with text-based language model reasoning and speech synthesis output. &lt;strong&gt;Native audio models&lt;/strong&gt; handle everything within a single neural network that reasons directly in audio space.&lt;/p&gt;

&lt;p&gt;Both encode incoming sound into vectors capturing linguistic content, tone, and emotion. They begin generating responses while the user is speaking or immediately after. Native audio maintains more audio information throughout processing, while half-cascade systems balance modularity with lower latency than chained architectures.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Chained vs. Speech-to-Speech Voice Agent Architecture&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Chained pipelines follow a sequential flow: Voice → STT → LLM → TTS → Voice. Each component waits for the previous one to finish before processing. Speech-to-speech architectures stream input and output concurrently across the stack, reducing perceived delay in scenarios that involve rapid turn-taking or mid-utterance interactions.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Chained Voice Agent&lt;/th&gt;
&lt;th&gt;Speech-to-Speech Voice Agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;STT Processing&lt;/td&gt;
&lt;td&gt;Can stream partial transcripts, but waits for end-of-utterance to finalize&lt;/td&gt;
&lt;td&gt;Continuously streams partial transcripts as user speaks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM Behavior&lt;/td&gt;
&lt;td&gt;Waits for complete STT output before processing&lt;/td&gt;
&lt;td&gt;Begins processing from partial input while user is still speaking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTS Synthesis&lt;/td&gt;
&lt;td&gt;Can stream audio chunks, but starts after LLM generates first chunks (TTFT)&lt;/td&gt;
&lt;td&gt;Starts speaking immediately as first tokens are generated, fully streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;Higher due to sequential handoffs between components&lt;/td&gt;
&lt;td&gt;Lower – concurrent streaming across all components&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flexibility&lt;/td&gt;
&lt;td&gt;High – easy to swap out STT, TTS, and LLM independently&lt;/td&gt;
&lt;td&gt;Less flexible – components must support tight integration and real-time coordination&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Risks / Challenges&lt;/td&gt;
&lt;td&gt;Requires careful orchestration between components to minimize latency&lt;/td&gt;
&lt;td&gt;Significantly higher cost (~10x chained pipeline); requires stream orchestration to avoid mishearing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User Experience&lt;/td&gt;
&lt;td&gt;Structured and clear, but less dynamic; noticeable pauses between turns&lt;/td&gt;
&lt;td&gt;Agent can begin replying before user finishes speaking; maintains emotional tone through audio processing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best Use Cases&lt;/td&gt;
&lt;td&gt;All use cases, especially when cost control and flexibility are priorities&lt;/td&gt;
&lt;td&gt;Best when ultra-low latency is critical and budget allows (AI concierges, premium live support)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Technical Requirements&lt;/td&gt;
&lt;td&gt;Moderate – most providers offer PaaS solutions; focus on linking components and fallback strategy&lt;/td&gt;
&lt;td&gt;Moderate – cloud APIs handle infrastructure; high only if self-hosting open-source models&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Core Architectures for Voice AI Agents&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Three fundamental architectural approaches exist for building voice AI agents. Each has distinct trade-offs in latency, flexibility, and naturalness:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Chained Pipeline (Cascaded STT→LLM→TTS Architecture)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Schema:&lt;/strong&gt; Voice → STT → LLM → TTS → Voice&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;br&gt;
The system converts speech to text, processes it through a language model, and turns it back into audio.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy to build and debug;&lt;/li&gt;
&lt;li&gt;Works well with existing LLM APIs;&lt;/li&gt;
&lt;li&gt;Reliable and predictable;&lt;/li&gt;
&lt;li&gt;High flexibility – easy to swap out STT, TTS, and LLM independently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High latency since each component waits for the previous one to complete;&lt;/li&gt;
&lt;li&gt;Loses tone and emotion when converting to text;&lt;/li&gt;
&lt;li&gt;Less natural feel, limited interruptibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example implementations:&lt;/strong&gt; Deepgram STT + GPT-4.1 + Cartesia TTS, Gladia STT + Gemini 2.5 Flash + ElevenLabs TTS&lt;/p&gt;
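&lt;p&gt;The sequential handoff is the defining property of this architecture. A minimal sketch, using stub functions in place of real providers (the &lt;code&gt;stt&lt;/code&gt;, &lt;code&gt;llm&lt;/code&gt;, and &lt;code&gt;tts&lt;/code&gt; helpers below are hypothetical stand-ins, not any vendor's API):&lt;/p&gt;

```python
# Chained pipeline: each stage blocks until the previous one finishes,
# so per-stage latencies add up end to end.

def stt(audio: bytes) -> str:
    return "hello agent"              # stand-in for a real STT provider

def llm(transcript: str) -> str:
    return f"You said: {transcript}"  # stand-in for a real LLM call

def tts(text: str) -> bytes:
    return text.encode()              # stand-in for a real TTS provider

def handle_turn(audio: bytes) -> bytes:
    transcript = stt(audio)   # 1. wait for the full transcript
    reply = llm(transcript)   # 2. wait for the full LLM response
    return tts(reply)         # 3. only then synthesize speech
```

&lt;p&gt;Streaming variants interleave these stages (partial transcripts feed the LLM early) – exactly the optimization that speech-to-speech architectures build in.&lt;/p&gt;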

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Speech-to-Speech (Half-Cascade Architecture)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Schema:&lt;/strong&gt; Voice → Audio Encoder → Text-based LLM → TTS → Voice&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;br&gt;
The model processes audio input directly through an encoder, uses a text-based language model to reason and respond, then generates speech via synthesis. This combines native audio input with text-based reasoning and speech output.&lt;/p&gt;

&lt;p&gt;Google and OpenAI use this half-cascade architecture, balancing speed, performance, and reliability. This works well for production use and tool integration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lower latency with streaming capability;&lt;/li&gt;
&lt;li&gt;Retains tone and prosody cues;&lt;/li&gt;
&lt;li&gt;Natural conversational flow;&lt;/li&gt;
&lt;li&gt;More interruptible than chained pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Still has a separate LLM reasoning layer (text-native);&lt;/li&gt;
&lt;li&gt;TTS quality is lower than specialized TTS models (e.g., ElevenLabs, Cartesia) – voice sounds less natural and expressive;&lt;/li&gt;
&lt;li&gt;Less flexible than fully modular approach.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example systems:&lt;/strong&gt; &lt;a href="https://deepmind.google/technologies/gemini/flash/" rel="noopener noreferrer"&gt;Google Gemini Live 2.5 Flash&lt;/a&gt;, &lt;a href="https://openai.com/index/introducing-gpt-realtime/" rel="noopener noreferrer"&gt;OpenAI Realtime API (gpt-realtime)&lt;/a&gt;, &lt;a href="https://www.ultravox.ai/" rel="noopener noreferrer"&gt;Ultravox&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Native Audio Model (End-to-End Speech-to-Speech AI)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Schema:&lt;/strong&gt; Voice → Unified Model → Voice&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;br&gt;
A single model listens, reasons, and speaks – all within one neural network. It encodes audio into latent vectors that capture meaning, emotion, and acoustic context, then directly generates output audio from those same representations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Very low latency (true real-time);&lt;/li&gt;
&lt;li&gt;Maintains emotional tone and voice consistency;&lt;/li&gt;
&lt;li&gt;Most natural conversational quality;&lt;/li&gt;
&lt;li&gt;Supports full-duplex with natural interruptions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hard to train and control;&lt;/li&gt;
&lt;li&gt;Opaque reasoning (no clear text layer);&lt;/li&gt;
&lt;li&gt;Needs huge, high-quality audio datasets;&lt;/li&gt;
&lt;li&gt;Limited flexibility for voice customization.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example systems:&lt;/strong&gt; &lt;a href="https://deepmind.google/technologies/gemini/flash/" rel="noopener noreferrer"&gt;Gemini 2.5 Flash Native Audio&lt;/a&gt;, &lt;a href="https://vita-home.github.io/" rel="noopener noreferrer"&gt;VITA-Audio&lt;/a&gt;, SALMONN-Omni, &lt;a href="https://github.com/kyutai-labs/moshi" rel="noopener noreferrer"&gt;Moshi by Kyutai Labs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gemini 2.5 Flash Native Audio (gemini-2.5-flash-native-audio-preview) provides true native audio processing – reasoning and generating speech natively in audio space. It includes affective (emotion-aware) dialogue, proactive audio capabilities, and "thinking" features. This represents Google's experimental approach to end-to-end audio reasoning without text intermediation.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Available Speech-to-Speech Models &amp;amp; Platforms&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;As of 2025, both commercial APIs and open-source projects offer speech-to-speech voice models:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Leading Proprietary Platforms&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;OpenAI and Google offer three speech-to-speech voice models – two generally available and one in experimental preview:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;OpenAI Realtime API (gpt-realtime)&lt;/th&gt;
&lt;th&gt;Google Gemini Live 2.5 Flash&lt;/th&gt;
&lt;th&gt;Google Gemini 2.5 Flash Native Audio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Architecture Type&lt;/td&gt;
&lt;td&gt;Half-Cascade (Speech-to-Speech)&lt;/td&gt;
&lt;td&gt;Half-Cascade (Speech-to-Speech)&lt;/td&gt;
&lt;td&gt;Native Audio (End-to-End)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider&lt;/td&gt;
&lt;td&gt;OpenAI (also via Azure)&lt;/td&gt;
&lt;td&gt;Google / DeepMind&lt;/td&gt;
&lt;td&gt;Google / DeepMind&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model Type&lt;/td&gt;
&lt;td&gt;Multimodal LLM with realtime audio streaming support&lt;/td&gt;
&lt;td&gt;Multimodal flash LLM optimized for speed and interactivity&lt;/td&gt;
&lt;td&gt;Native audio-to-audio model with affective dialogue and proactive audio capabilities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency – Time to First Token&lt;/td&gt;
&lt;td&gt;~280 ms&lt;/td&gt;
&lt;td&gt;~280 ms&lt;/td&gt;
&lt;td&gt;~200-250 ms (experimental)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audio Input&lt;/td&gt;
&lt;td&gt;Streaming audio via WebRTC + WebSocket API&lt;/td&gt;
&lt;td&gt;Streaming audio via Multimodal Live API (likely gRPC-based)&lt;/td&gt;
&lt;td&gt;Native audio streaming via Multimodal Live API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token Generation Speed&lt;/td&gt;
&lt;td&gt;~70–100 tokens/second&lt;/td&gt;
&lt;td&gt;~155–160 tokens/second&lt;/td&gt;
&lt;td&gt;N/A (generates audio directly, not token-based)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hosting / Access&lt;/td&gt;
&lt;td&gt;Cloud only (OpenAI API / Azure OpenAI Service)&lt;/td&gt;
&lt;td&gt;Cloud only (Google AI Studio / Vertex AI)&lt;/td&gt;
&lt;td&gt;Cloud only (Google AI Studio / Vertex AI) – Preview only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Developer Integration&lt;/td&gt;
&lt;td&gt;Open-source reference stack with LiveKit + WebRTC + OpenAI Streaming API&lt;/td&gt;
&lt;td&gt;Access via Google's Vertex AI or AI Studio; endpoint: gemini-2.5-flash-live-001&lt;/td&gt;
&lt;td&gt;Access via Google AI Studio; endpoint: gemini-2.5-flash-native-audio-preview&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multimodal Capabilities&lt;/td&gt;
&lt;td&gt;Yes – audio input, speech output; also supports vision&lt;/td&gt;
&lt;td&gt;Yes – audio, video, text input; supports images and rolling context in conversation&lt;/td&gt;
&lt;td&gt;Yes – native audio reasoning with emotion awareness, "thinking" mode, proactive audio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throughput Capacity&lt;/td&gt;
&lt;td&gt;~800K tokens/min, ~1,000 req/min (Azure OpenAI, realtime mode)&lt;/td&gt;
&lt;td&gt;N/A (not publicly specified, but optimized for high concurrency and streaming)&lt;/td&gt;
&lt;td&gt;N/A (experimental preview, not intended for production scale)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production Readiness&lt;/td&gt;
&lt;td&gt;Generally Available&lt;/td&gt;
&lt;td&gt;Generally Available&lt;/td&gt;
&lt;td&gt;Experimental Preview only – not production-ready&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Open-Source Alternatives&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Two open-source projects offer alternatives to proprietary models:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Ultravox (by Fixie.ai)&lt;/th&gt;
&lt;th&gt;Moshi (by Kyutai Labs)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Architecture Type&lt;/td&gt;
&lt;td&gt;Half-Cascade (Speech-to-Speech)&lt;/td&gt;
&lt;td&gt;Native Audio (End-to-End)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model Type&lt;/td&gt;
&lt;td&gt;Multimodal LLM (audio + text encoder, outputs text)&lt;/td&gt;
&lt;td&gt;Audio-to-audio LLM (integrated STT and TTS – speech in, speech out)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architecture&lt;/td&gt;
&lt;td&gt;Voice → LLM → Text (planned speech output in future versions)&lt;/td&gt;
&lt;td&gt;Voice → LLM → Voice (fully integrated speech-to-speech pipeline)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Streaming Support&lt;/td&gt;
&lt;td&gt;Streaming text output with low latency&lt;/td&gt;
&lt;td&gt;Full-duplex streaming (supports overlap and interruption)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to First Token (TTFT)&lt;/td&gt;
&lt;td&gt;~190 ms (on smaller variant)&lt;/td&gt;
&lt;td&gt;~160 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token Generation Speed&lt;/td&gt;
&lt;td&gt;~200+ tokens/sec&lt;/td&gt;
&lt;td&gt;Not token-based; generates speech waveform directly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Base Models&lt;/td&gt;
&lt;td&gt;Built on open LLMs (e.g., LLaMA 3 – 8B / 70B)&lt;/td&gt;
&lt;td&gt;Proprietary foundation model trained by Kyutai&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audio Processing&lt;/td&gt;
&lt;td&gt;Projects audio into same token space as text using custom audio encoder&lt;/td&gt;
&lt;td&gt;End-to-end audio encoder and decoder (neural codec pipeline)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output Type&lt;/td&gt;
&lt;td&gt;Text (for now), with plans for speech token output&lt;/td&gt;
&lt;td&gt;Audio (neural codec speech)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hosting / Deployment&lt;/td&gt;
&lt;td&gt;Self-hostable; requires GPU infra, especially for 70B variant&lt;/td&gt;
&lt;td&gt;Self-hostable (heavy); public demo available at moshi.chat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open-Source Status&lt;/td&gt;
&lt;td&gt;Fully open: model weights, architecture, and code available on GitHub&lt;/td&gt;
&lt;td&gt;Fully open: code and demos available; weights provided (early stage)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extensibility&lt;/td&gt;
&lt;td&gt;Can plug in any open-weight LLM; attach custom audio projector&lt;/td&gt;
&lt;td&gt;Closed model structure for now; focused on turnkey audio-agent use&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use Case Fit&lt;/td&gt;
&lt;td&gt;Voice-enabled bots with real-time understanding, using custom TTS for output&lt;/td&gt;
&lt;td&gt;Full voice agents with natural interruptions and direct speech response&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Integration Frameworks and Tools&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Integration frameworks include Pipecat (vendor-agnostic voice agent framework used by Daily.co), LiveKit (WebRTC streaming, used by OpenAI), and FastRTC (Python streaming audio). For comprehensive platform comparisons including deployment options and integration approaches, see the &lt;a href="https://dev.to/lab/choosing-the-right-voice-agent-platform-in-2025"&gt;voice agent platform guide&lt;/a&gt;. Developers can assemble open-source speech recognition (Vosk, NeMo) and TTS (VITS, FastSpeech) components into speech-to-speech agents without using end-to-end models.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Performance Metrics That Matter&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Three metrics determine voice agent performance: speed (time to first token), accuracy (word error rate), and processing efficiency (real-time factor).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Time to First Token (TTFT)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Time to First Token (TTFT) measures latency from end-of-user-speech to start-of-agent-speech. Current models achieve TTFT in the 200-300 millisecond range: Google's Gemini Flash logs ~280 ms, OpenAI's GPT-4o realtime ~250-300 ms. Human response latencies in conversation average around 200 ms.&lt;/p&gt;

&lt;p&gt;Network latency affects cloud API measurements, so real-world TTFT runs higher than lab values. Published TTFT figures may be measured under controlled conditions rather than end-to-end, so compare vendors on a like-for-like basis.&lt;/p&gt;

&lt;p&gt;Lower TTFT is better, though extremely low values may indicate the model responds before fully processing user intent.&lt;/p&gt;
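&lt;p&gt;When benchmarking your own stack, TTFT can be measured as the wall-clock gap between request start and the first audio chunk the agent emits. A minimal sketch (the &lt;code&gt;agent_stream&lt;/code&gt; iterable is a hypothetical stand-in for your agent's audio output stream):&lt;/p&gt;

```python
import time

def measure_ttft(agent_stream):
    """Seconds from request start to the first audio chunk the agent yields."""
    start = time.monotonic()
    for _chunk in agent_stream:
        return time.monotonic() - start
    return float("inf")  # the agent produced no audio at all
```

&lt;p&gt;Run this against the same network path your users take; measuring from a machine next to the API endpoint understates real-world TTFT.&lt;/p&gt;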

&lt;h3&gt;
  
  
  &lt;strong&gt;Word Error Rate (WER)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Word Error Rate (WER) measures the percentage of words incorrectly recognized in the transcript. Lower WER means more accurate transcription. Meta AI's research on streaming LLM-based ASR achieved ~3.0% WER on Librispeech test-clean (~7.4% on test-other) in real-time mode, approaching offline model accuracy.&lt;/p&gt;
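&lt;p&gt;WER is the word-level edit distance (substitutions, insertions, deletions) between reference and hypothesis, divided by the reference length. A straightforward dynamic-programming sketch:&lt;/p&gt;

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

&lt;p&gt;Production evaluation normally also normalizes case, punctuation, and number formatting before scoring, which this sketch omits.&lt;/p&gt;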

&lt;p&gt;Recognition errors can lead the LLM astray. Cloud providers publish WER on benchmarks, but real-world WER runs higher. Real-time agents may correct some ASR errors via context, though lower baseline WER remains preferable.&lt;/p&gt;

&lt;p&gt;Domain adaptation through custom vocabulary or fine-tuning helps with specialized terminology.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Real-Time Factor (RTF)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Real-Time Factor (RTF) measures processing speed relative to input duration. RTF &amp;lt; 1.0 means the system processes faster than real time. Each component has its own RTF: STT engines typically process at 0.2× real time, LLMs generate at 50+ tokens/sec, modern TTS synthesizes at RTF 0.1 or better (10 seconds of speech generated in 1 second).&lt;/p&gt;

&lt;p&gt;Systems must maintain RTF &amp;lt; 1 under load to prevent latency accumulation. Smaller models often achieve better RTF at the cost of language quality, making token generation speed a determining factor for ultra-low latency requirements.&lt;/p&gt;
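&lt;p&gt;Using the illustrative numbers above (STT at 0.2× real time, an LLM at 50 tokens/sec, TTS at RTF 0.1), a rough worst-case turn latency for a fully sequential pipeline can be estimated. The sketch below ignores streaming overlap and network time, and its defaults are illustrative assumptions:&lt;/p&gt;

```python
def turn_latency_s(user_speech_s, stt_rtf=0.2, reply_tokens=60,
                   llm_tokens_per_s=50.0, reply_speech_s=4.0, tts_rtf=0.1):
    """Worst-case sequential latency: every stage waits for the previous one."""
    stt_time = user_speech_s * stt_rtf          # transcribe the whole utterance
    llm_time = reply_tokens / llm_tokens_per_s  # generate the whole reply
    tts_time = reply_speech_s * tts_rtf         # synthesize the whole reply
    return stt_time + llm_time + tts_time
```

&lt;p&gt;For a 5-second utterance this gives roughly 2.6 seconds without streaming – which is why concurrent streaming across components matters so much for perceived responsiveness.&lt;/p&gt;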




&lt;h2&gt;
  
  
  &lt;strong&gt;Cost Analysis and Scalability for Speech-to-Speech Voice Agents&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Speech-to-speech voice agent costs break down into five categories: cloud API usage, self-hosting compute, scalability limits, bandwidth, and enterprise overhead.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cost Category&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Examples / Benchmarks&lt;/th&gt;
&lt;th&gt;Key Considerations&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Usage-Based Pricing (Cloud APIs)&lt;/td&gt;
&lt;td&gt;Pay-per-token/minute for any architecture (STT, LLM, TTS, or integrated multimodal)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;OpenAI Realtime:&lt;/strong&gt; ~$0.30/min (baseline), increases significantly with turns&lt;br&gt;&lt;strong&gt;Gemini Live:&lt;/strong&gt; ~$0.22/min (baseline), increases with turns&lt;br&gt;&lt;strong&gt;Gemini Native Audio:&lt;/strong&gt; ~$0.50/min with typical conversation turns (experimental)&lt;br&gt;&lt;strong&gt;Chained pipeline:&lt;/strong&gt; ~$0.15/min (no context accumulation)&lt;/td&gt;
&lt;td&gt;Speech-to-speech models have context accumulation that dramatically increases costs with conversation turns; chained pipelines maintain consistent per-minute pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compute Costs (Self-Hosting)&lt;/td&gt;
&lt;td&gt;Run open-source models like Ultravox/Moshi on your own infra&lt;/td&gt;
&lt;td&gt;Hosting Ultravox 70B may need A100/H100 GPU per concurrent session; GPU costs: ~$2–$3/hr (cloud)&lt;/td&gt;
&lt;td&gt;Lower marginal cost at scale; Requires infra &amp;amp; DevOps team; Harder to spin up instantly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scalability / Rate Limits&lt;/td&gt;
&lt;td&gt;Limits on concurrent sessions, tokens per minute, request rate&lt;/td&gt;
&lt;td&gt;OpenAI GPT-4o: 800K tokens/min, 1K requests/min; Enterprise: up to 30M tokens/min&lt;/td&gt;
&lt;td&gt;Watch for WebSocket caps or long-lived session constraints; Request enterprise quotas if needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bandwidth Overhead&lt;/td&gt;
&lt;td&gt;Cost of streaming audio data over network&lt;/td&gt;
&lt;td&gt;~8–64 kbps per stream; Telephony codecs (e.g. G.711 vs G.729) can affect costs&lt;/td&gt;
&lt;td&gt;Minor cost per stream, but adds up at scale; Ensure egress limits aren't exceeded in cloud setups&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise Overhead&lt;/td&gt;
&lt;td&gt;SLAs, premium support, custom deployments, fallback systems&lt;/td&gt;
&lt;td&gt;Regional/on-prem hosting; Redundancy systems (e.g. backup STT or fallback bots)&lt;/td&gt;
&lt;td&gt;Adds reliability and control; Contractual/licensing complexity increases total cost of ownership&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Understanding Speech-to-Speech Pricing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Speech-to-speech models like GPT-4o Realtime and Gemini 2.5 Flash Live have different cost structures than chained STT/TTS pipelines. For detailed provider comparisons with latency benchmarks and accuracy metrics, see the &lt;a href="https://dev.to/lab/how-to-choose-stt-tts-for-ai-voice-agents-in-2025-a-comprehensive-guide"&gt;complete STT and TTS selection guide&lt;/a&gt;. Three factors drive higher costs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Proprietary multimodal infrastructure&lt;/strong&gt; – These models require specialized neural architectures that process audio natively, maintaining acoustic features throughout the pipeline rather than collapsing to text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud-only deployment&lt;/strong&gt; – No self-hosting option means paying for enterprise-grade streaming infrastructure, low-latency global endpoints, and WebRTC/gRPC orchestration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced real-time capabilities&lt;/strong&gt; – Support for interruptions, emotional tone preservation, and sub-300ms latency requires substantial compute resources per session&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Real-world cost reports from OpenAI's developer community:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;$3 spent on "a few short test conversations"&lt;/strong&gt; in the playground (simple questions like bedtime stories)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$10 consumed during weekend integration testing&lt;/strong&gt;, leading developers to call the API "unusable at the moment" due to cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Costs increase per minute as conversations get longer&lt;/strong&gt; – in a 15-minute session, one developer reported $5.28 for audio input vs $0.65 for output. This happens because tokens accumulate in the context window, and the model re-charges for all previous tokens on each turn, making longer conversations disproportionately more expensive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;User-reported costs differ from official per-minute estimates because actual costs depend on &lt;strong&gt;conversation length&lt;/strong&gt; (context accumulation), &lt;strong&gt;system prompt size&lt;/strong&gt; (larger prompts = more tokens per turn), and &lt;strong&gt;conversation complexity&lt;/strong&gt; (more back-and-forth = more context to maintain). A 5-minute conversation might cost $0.30/min, while a 30-minute conversation could cost $1.50/min or more due to accumulated context.&lt;/p&gt;
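&lt;p&gt;The superlinear growth comes from re-billing the accumulated context on every turn. A simplified model (the flat per-1K-token price and fixed tokens per turn are illustrative assumptions, not any provider's actual rate card):&lt;/p&gt;

```python
def session_cost_usd(turns, tokens_per_turn=400, usd_per_1k_tokens=0.06):
    """Model a realtime session where each turn re-bills the whole context."""
    billed_tokens = 0
    context_tokens = 0
    for _ in range(turns):
        context_tokens += tokens_per_turn  # this turn's new audio/text tokens
        billed_tokens += context_tokens    # provider charges the full context
    return billed_tokens * usd_per_1k_tokens / 1000
```

&lt;p&gt;Billed tokens grow quadratically with turn count, so the effective per-minute rate climbs as the conversation runs – consistent with the developer reports above.&lt;/p&gt;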

&lt;p&gt;Native Audio Models (Moshi, VITA-Audio) are early-stage and experimental. While they promise the lowest latency and most natural interactions, they:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are mostly research projects, not production-ready&lt;/li&gt;
&lt;li&gt;Require significant GPU resources for self-hosting (A100/H100 class)&lt;/li&gt;
&lt;li&gt;Lack the ecosystem support, tooling, and reliability of commercial offerings&lt;/li&gt;
&lt;li&gt;Offer limited voice customization and control compared to modular approaches&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Match Cost Strategy to Deployment Scale&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Early-stage projects with low volume benefit from cloud APIs: fast setup, predictable pricing, pay-per-use. As usage grows, self-hosting economics may improve, particularly when requiring tight control, data locality, or custom model tuning.&lt;/p&gt;

&lt;p&gt;Enterprise scale depends on reliability, rate limits, support agreements, and long-term flexibility – not just price per minute. Total cost of ownership (TCO) includes processing minutes, bandwidth, DevOps effort, redundancy, and support.&lt;/p&gt;

&lt;p&gt;Cost calculation for specific scenarios: average conversation length × conversations per day × per-minute pricing = daily cost; multiply by ~30 for a monthly estimate. Compare that figure against the self-hosting infrastructure investment, and monitor usage limits and enterprise tier requirements.&lt;/p&gt;
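&lt;p&gt;As a sketch, that estimate translates directly into code (all inputs are your own assumptions, not provider figures):&lt;/p&gt;

```python
def monthly_cost_usd(avg_call_minutes, calls_per_day, usd_per_minute, days=30):
    """Rough monthly spend for a usage-priced voice agent."""
    return avg_call_minutes * calls_per_day * usd_per_minute * days
```

&lt;p&gt;For example, 100 five-minute calls per day at $0.30/min is about $4,500/month – often the point at which self-hosting economics deserve a second look.&lt;/p&gt;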




&lt;h2&gt;
  
  
  &lt;strong&gt;Technical Implementation Challenges in Speech-to-Speech Voice Agent Deployment&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Deploying to production requires integrating streaming transports, connecting to telephony, handling background noise, and orchestrating concurrent streams.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Streaming Integration (WebRTC, WebSockets, etc.)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Low latency requires appropriate streaming mechanisms. Three options: WebRTC, WebSockets, and streaming HTTP/gRPC.&lt;/p&gt;

&lt;h4&gt;
  
  
  WebRTC
&lt;/h4&gt;

&lt;p&gt;Web Real-Time Communication is the standard for low-latency audio/video streaming in browsers and mobile apps. Uses UDP for fast transmission and handles packet loss gracefully. Both OpenAI and Google use WebRTC for client-side audio capture and playback.&lt;/p&gt;

&lt;p&gt;Browser and mobile app interactions use WebRTC to send microphone audio to the server. Includes Acoustic Echo Cancellation (AEC), noise reduction, and automatic gain control (AGC). Libraries like LiveKit, mediasoup, or Twilio provide WebRTC integration.&lt;/p&gt;

&lt;h4&gt;
  
  
  WebSockets and gRPC
&lt;/h4&gt;

&lt;p&gt;Server-side connections between application servers and AI services use persistent bidirectional connections. OpenAI's voice API uses WebSockets – client sends audio chunks and receives tokens continuously. Google's API uses gRPC streaming over HTTP/2.&lt;/p&gt;

&lt;p&gt;Both provide continuous streams rather than discrete HTTP requests. Implementation requires proper binary audio frame handling and maintaining open connections for conversation duration.&lt;/p&gt;

&lt;h4&gt;
  
  
  Audio Encoding
&lt;/h4&gt;

&lt;p&gt;Audio format choice depends on API requirements. PCM raw audio is simple but bulky. Opus codec (used by WebRTC) provides high quality at low bitrate, though not all APIs accept Opus packets. Some APIs accept WAV or FLAC frames.&lt;/p&gt;

&lt;p&gt;Compressed codecs save bandwidth for mobile users. Phone calls use G.711 µ-law 8kHz, requiring transcoding to 16kHz linear PCM for most ASR systems (Whisper, DeepSpeech).&lt;/p&gt;
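&lt;p&gt;G.711 µ-law stores each 8 kHz sample as one logarithmically companded byte. A minimal sketch of the decode-and-upsample step (pure Python, following the standard µ-law expansion formula; production code would use an audio library and a proper resampler rather than linear interpolation):&lt;/p&gt;

```python
def ulaw_decode(byte: int) -> int:
    """Decode one G.711 mu-law byte (0-255) to a 16-bit linear PCM sample."""
    byte = 255 - byte                  # mu-law bytes are stored bit-inverted
    sign = byte // 128                 # 1 for negative samples
    exponent = (byte // 16) % 8
    mantissa = byte % 16
    magnitude = (mantissa * 8 + 0x84) * (2 ** exponent) - 0x84
    return -magnitude if sign else magnitude

def upsample_2x(samples):
    """Naive 8 kHz to 16 kHz: insert the midpoint between neighbouring samples."""
    out = []
    for a, b in zip(samples, samples[1:]):
        out.append(a)
        out.append((a + b) // 2)
    if samples:
        out.append(samples[-1])
    return out
```

&lt;p&gt;Note that upsampling cannot restore frequency content lost at 8 kHz; it only matches the sample rate the ASR model expects.&lt;/p&gt;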

&lt;h4&gt;
  
  
  Latency Tuning
&lt;/h4&gt;

&lt;p&gt;Streaming systems use buffers to smooth network variation. WebRTC jitter buffers trade smooth audio for added delay. Default WebRTC parameters suffice for most deployments.&lt;/p&gt;

&lt;p&gt;WebSocket implementations send data immediately (20ms audio frame every 20ms) without batching. Most WebSocket libraries disable Nagle's algorithm by default to avoid delaying small packets.&lt;/p&gt;
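&lt;p&gt;Frame sizing follows directly from the audio format. A sketch for 16-bit PCM (the 20 ms figure matches the pacing described above; adjust for your codec and sample rate):&lt;/p&gt;

```python
def frame_bytes(sample_rate_hz=16000, bytes_per_sample=2, frame_ms=20):
    """Bytes in one audio frame of the given duration."""
    return sample_rate_hz * bytes_per_sample * frame_ms // 1000

def split_frames(pcm: bytes, size: int):
    """Split raw PCM into fixed-size frames; a trailing partial frame is dropped."""
    return [pcm[i:i + size] for i in range(0, len(pcm) - size + 1, size)]
```

&lt;p&gt;At 16 kHz, 16-bit mono that is 640 bytes every 20 ms, with each frame written to the socket as soon as it is captured.&lt;/p&gt;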

&lt;h4&gt;
  
  
  Handling Network Issues
&lt;/h4&gt;

&lt;p&gt;WebRTC handles packet loss through loss concealment, filling missing audio chunks with plausible noise. WebSocket implementations lack this: TCP retransmits lost packets rather than concealing them, so loss shows up as added delay instead of missing audio – acceptable on decent networks.&lt;/p&gt;

&lt;p&gt;Output packet loss can cause audio blips. Some systems use redundant packets or forward error correction on unreliable networks.&lt;/p&gt;

&lt;p&gt;Many implementations combine approaches: WebRTC from client to relay server, then WebSocket from server to AI API. OpenAI's example follows this pattern. WebRTC handles unpredictable client networks while WebSocket simplifies AI model interfacing.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Telephony Integration (8 kHz and PSTN)&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Call Quality Challenges
&lt;/h4&gt;

&lt;p&gt;Phone deployments reveal quality issues absent in web-based implementations. Standard PSTN uses 8 kHz audio (G.711 codec), which severely degrades both speech recognition accuracy and TTS naturalness compared to 16 kHz+ web audio.&lt;/p&gt;

&lt;p&gt;Most high-quality ASR models (including GPT-4o Realtime, Gemini Live, Whisper) train primarily on 16 kHz audio, so 8 kHz telephony input reduces their accuracy significantly.&lt;/p&gt;

&lt;h4&gt;
  
  
  Provider Support
&lt;/h4&gt;

&lt;p&gt;Twilio's standard codecs operate at 8 kHz with limited support for higher-quality audio streaming needed for AI models. Telnyx offers native 16 kHz support via G.722 wideband codec through their owned infrastructure, but requires more expertise to configure properly.&lt;/p&gt;

&lt;p&gt;Speech-to-speech models (GPT-4o Realtime, Gemini Live) optimized for high-quality web audio don't perform as well over standard PSTN. Their latency and integration benefits disappear over phone while premium pricing remains. This makes chained STT/LLM/TTS pipelines with telephony-optimized components often more reliable and cost-effective for phone-based deployments.&lt;/p&gt;

&lt;h4&gt;
  
  
  SIP and VoIP Integration
&lt;/h4&gt;

&lt;p&gt;Telephony integration uses services like Twilio, Nexmo, or on-premises SIP systems. These provide audio via WebSocket (Twilio Media Streams delivers 8 kHz µ-law audio in real time) or media servers. The architecture must ingest these streams and connect them to the AI pipeline.&lt;/p&gt;

&lt;h4&gt;
  
  
  DTMF and Control
&lt;/h4&gt;

&lt;p&gt;Telephony providers detect DTMF tones (touch-tone input) out-of-band to avoid confusing ASR. Twilio sends webhook events for DTMF. Speech-to-speech voice agents minimize DTMF menus but users may still attempt touch-tone input.&lt;/p&gt;

&lt;h4&gt;
  
  
  Telephony Latency
&lt;/h4&gt;

&lt;p&gt;Phone networks add 100-200ms fixed latency. Processing pipelines should minimize additional overhead. Hosting AI services near telephony ingress points reduces roundtrip latency.&lt;/p&gt;

&lt;h4&gt;
  
  
  Human Agent Handoffs
&lt;/h4&gt;

&lt;p&gt;Human agent handoffs benefit from passing conversation context. AI conversations that escalate after collecting information should provide transcribed summaries to avoid user repetition.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Handling Background Noise &amp;amp; Voice Variability&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Noise Suppression
&lt;/h4&gt;

&lt;p&gt;Noise suppression algorithms applied before ASR improve recognition accuracy. ML models like RNNoise remove background noise (keyboard sounds, fans) in real time. Picovoice's Koala demonstrates intelligibility improvements.&lt;/p&gt;

&lt;p&gt;Tradeoff: slightly distorts voice and consumes extra CPU.&lt;/p&gt;

&lt;h4&gt;
  
  
  Microphone Differences
&lt;/h4&gt;

&lt;p&gt;Audio quality varies across headsets, speakerphones, and car Bluetooth (frequency response, echo). Echo cancellation prevents the agent's voice from being picked up by the microphone. WebRTC's AEC handles most cases.&lt;/p&gt;

&lt;p&gt;Telephone scenarios rely on network echo cancellers or require adaptive echo cancelers in the pipeline.&lt;/p&gt;

&lt;h4&gt;
  
  
  VAD and Barge-In
&lt;/h4&gt;

&lt;p&gt;Voice Activity Detection (VAD) distinguishes speech from noise. Noisy conditions cause false positives/negatives. Combining VAD with ASR confidence improves accuracy. Treat silence as end-of-utterance only when ASR confirms finality.&lt;/p&gt;

&lt;p&gt;Continue assuming speech while ASR generates transcribed words. End turn after 500ms silence. Barge-in requires monitoring microphone during agent speech to stop TTS when user interrupts.&lt;/p&gt;
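&lt;p&gt;A minimal end-of-turn detector built on these rules might look like the sketch below, assuming 20ms frames and a per-frame energy value; the energy threshold is illustrative:&lt;/p&gt;

```python
# Sketch: energy-based end-of-turn detection. The turn ends after
# 500 ms (25 consecutive 20 ms frames) of silence. Real systems
# combine this with ASR finality signals, as described above.

FRAME_MS = 20
END_OF_TURN_MS = 500
SILENCE_FRAMES = END_OF_TURN_MS // FRAME_MS  # 25

class TurnDetector:
    def __init__(self, energy_threshold=500.0):
        self.energy_threshold = energy_threshold
        self.silent_frames = 0

    def push(self, frame_energy: float) -> bool:
        """Feed one frame's energy; returns True when the turn has ended."""
        if frame_energy >= self.energy_threshold:
            self.silent_frames = 0          # still speaking
        else:
            self.silent_frames += 1
        return self.silent_frames >= SILENCE_FRAMES

det = TurnDetector()
speech = [1000.0] * 10          # 200 ms of speech
silence = [10.0] * 25           # 500 ms of silence
events = [det.push(e) for e in speech + silence]
```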

&lt;h4&gt;
  
  
  Accents and Languages
&lt;/h4&gt;

&lt;p&gt;Diverse user bases require testing across accents and dialects. Cloud ASRs support accent/locale specifications for improved accuracy. Open models benefit from fine-tuning on accented data.&lt;/p&gt;

&lt;p&gt;Bilingual support requires models supporting multiple languages (Google, OpenAI). Multi-language detection works through auto-detection or routing to language-specific models.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Stream Management and Orchestration&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Continuous conversation streams require managing concurrent input/output and conversation state.&lt;/p&gt;

&lt;h4&gt;
  
  
  Half-Duplex vs Full-Duplex
&lt;/h4&gt;

&lt;p&gt;Most systems use half-duplex with barge-in – users can interrupt agents, but agents don't interrupt users except for short backchannel utterances ("uh-huh", "I see"). Backchannel implementation requires detecting pauses and generating quick responses without disrupting ASR.&lt;/p&gt;

&lt;h4&gt;
  
  
  Prompt Management
&lt;/h4&gt;

&lt;p&gt;Persistent conversation state requires maintaining rolling prompts for the LLM. APIs with persistent sessions handle this up to context limits. Manual implementations append each utterance and reply.&lt;/p&gt;

&lt;p&gt;Long conversations require summarizing older content to stay within context windows. Important user-provided facts need re-injection into prompts as needed.&lt;/p&gt;
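&lt;p&gt;One way to sketch the rolling-prompt logic, with the summarization call stubbed out (a real implementation would call the LLM, and the 4-chars-per-token estimate is a rough heuristic):&lt;/p&gt;

```python
# Sketch: keep a rolling conversation history under a token budget by
# folding the oldest turns into a summary. `summarize` is a stand-in
# for an LLM call; shown here only to illustrate the control flow.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough heuristic: ~4 chars per token

def summarize(turns):
    # Hypothetical LLM call; stubbed for illustration.
    return "Summary of earlier conversation (%d turns)." % len(turns)

def trim_history(turns, budget=1000):
    """Fold the two oldest turns into one summary until the budget is met."""
    while sum(estimate_tokens(t) for t in turns) > budget and len(turns) > 2:
        turns = [summarize(turns[:2])] + turns[2:]
    return turns
```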

&lt;h4&gt;
  
  
  Ensuring Required Steps
&lt;/h4&gt;

&lt;p&gt;Flows requiring specific actions (identity verification, mandatory questions) benefit from checkpoints. Teams can implement checkpoints through LLM prompt instructions or external state machines.&lt;/p&gt;

&lt;p&gt;Some systems prevent sending queries to LLM until prerequisite steps complete, or override LLM responses that skip required actions. This combines rule-based flow with AI – trusting AI for understanding and generation while enforcing action sequences.&lt;/p&gt;
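&lt;p&gt;An external state machine gating the LLM can be as small as the sketch below; the step name is illustrative:&lt;/p&gt;

```python
# Sketch: block LLM queries until prerequisite steps complete.
# The required-step list would come from the flow definition.

class FlowGuard:
    REQUIRED = ["identity_verified"]   # illustrative prerequisite

    def __init__(self):
        self.completed = set()

    def mark_done(self, step: str):
        self.completed.add(step)

    def can_query_llm(self) -> bool:
        return all(step in self.completed for step in self.REQUIRED)

guard = FlowGuard()
blocked = guard.can_query_llm()        # identity not yet verified
guard.mark_done("identity_verified")
allowed = guard.can_query_llm()        # prerequisites satisfied
```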

&lt;p&gt;Speech-to-speech agent orchestration requires managing concurrent input/output streams. Best practices and libraries exist for common patterns. Testing should include scenarios like users interrupting agent speech to verify barge-in logic stops TTS promptly.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Speech-to-speech voice agents reduce latency to 200-300ms, approaching human response times. Proprietary platforms (OpenAI GPT-4o, Google Gemini 2.5 Flash) and open-source options (Ultravox, Moshi) are production-ready or nearing maturity.&lt;/p&gt;

&lt;p&gt;Architecture choice depends on deployment environment and constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chained Pipeline&lt;/strong&gt; – Voice → STT → LLM → TTS → Voice – provides maximum flexibility and reliability but higher latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speech-to-Speech (Half-Cascade)&lt;/strong&gt; – Voice → Audio Encoder → Text-based LLM → TTS → Voice – balances performance with production readiness, but at significantly higher cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native Audio&lt;/strong&gt; – Voice → Unified Model → Voice – offers the lowest latency and most natural interactions, but remains experimental and not production-ready&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Implementation factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance requirements:&lt;/strong&gt; TTFT, WER, and RTF targets for the use case&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost structure:&lt;/strong&gt; Cloud APIs vs. self-hosting economics at expected scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technical complexity:&lt;/strong&gt; Streaming integration, telephony connectivity, noise handling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment environment:&lt;/strong&gt; Phone systems (8kHz PSTN) vs. web-based (16kHz+ audio)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;System design includes audio streaming, orchestration, testing, and optimization for specific constraints. Cloud APIs enable rapid prototyping. Production deployment requires testing with real user patterns and audio conditions.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>ai</category>
      <category>architecture</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>You Can't Fix What You Can't See: Production AI Agent Observability Guide</title>
      <dc:creator>Elijah Atamas</dc:creator>
      <pubDate>Mon, 20 Oct 2025 20:11:14 +0000</pubDate>
      <link>https://forem.com/softcery/you-cant-fix-what-you-cant-see-production-ai-agent-observability-guide-2jen</link>
      <guid>https://forem.com/softcery/you-cant-fix-what-you-cant-see-production-ai-agent-observability-guide-2jen</guid>
      <description>&lt;p&gt;An AI agent runs perfectly for two weeks. Users love it. Then it starts failing for 30% of conversations. No code changes. No infrastructure issues. Just random failures.&lt;/p&gt;

&lt;p&gt;Customer support tickets pile up. Attempts to reproduce the failures in staging come up empty. Error logs show nothing useful. No way to tell if the problem is the model, the prompts, the RAG retrieval, or the tool integrations.&lt;/p&gt;

&lt;p&gt;Debugging AI agents while users complain is brutal. Traditional debugging doesn't work. Error logs are useless. The failures can't be reproduced.&lt;/p&gt;

&lt;p&gt;This pattern appears in 95% of AI implementations. MIT research analyzing 300 public AI deployments found that most pilots stall and deliver little measurable impact. &lt;/p&gt;

&lt;p&gt;The survivors share one characteristic: production AI monitoring that reveals what's actually happening inside their agents.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why AI Agent Debugging Is Different&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Traditional debugging techniques collapse when facing AI agents. Breakpoints assume deterministic execution. Unit tests expect consistent outputs. Log files track linear flows. AI agents break all three assumptions.&lt;/p&gt;

&lt;p&gt;The same prompt with temperature 0.9 generates different responses every time. Setting breakpoints in probabilistic execution paths provides little value when the next run takes a completely different path. A single misstep hides among thousands of chat turns, and multi-step reasoning chains amplify unpredictability.&lt;/p&gt;

&lt;p&gt;Carnegie Mellon's "TheAgentCompany" experiment demonstrates this. AI agents from OpenAI, Meta, Google, and Anthropic worked as software engineers. Result: zero functional product. The agents excelled at meetings and internal conflicts, but debugging their failures proved impossible without proper tracing.&lt;/p&gt;

&lt;p&gt;The Replit AI coding assistant incident shows the stakes. In July 2025, it deleted a production database, modified production code despite explicit instructions not to, and concealed bugs by generating 4,000 fake users. It fabricated reports and lied about unit test results. Standard debugging offered no visibility into these failure modes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What changes with AI agents:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional software monitoring tracks system metrics: CPU usage, memory consumption, error rates. AI agent observability adds quality evaluation as a core component. Monitoring whether an agent responds quickly matters less if that response hallucinates facts or ignores retrieved context.&lt;/p&gt;

&lt;p&gt;Context explains behavior more than code. The same agent code produces wildly different outcomes based on conversation history, retrieved documents, tool availability, and model state. Observability must capture this context, not just log statements and stack traces.&lt;/p&gt;

&lt;p&gt;Multi-agent systems multiply the challenges. When three agents collaborate to complete a task, failures cascade in unpredictable ways. Agent A's output influences Agent B's reasoning, which affects Agent C's tool selection. Traditional logs show three separate execution traces but miss the emergent behavior arising from their interaction.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Diagnosing Your Observability Gap&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Three distinct problems require three different observability solutions. Identify which gap exists before implementing tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Reproducibility Problem&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Failures can't be reproduced in staging&lt;/li&gt;
&lt;li&gt;Same input produces different outputs unpredictably&lt;/li&gt;
&lt;li&gt;Bugs disappear when investigated, then return later&lt;/li&gt;
&lt;li&gt;No way to debug issues from days or weeks ago&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Diagnostic question:&lt;/strong&gt; Can the team pull up the exact execution state (conversation history, temperature, tool calls, random seed) for a specific failed conversation from last week?&lt;/p&gt;

&lt;p&gt;If no: Reproducibility gap. Need request IDs, conversation versioning, and state snapshots.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution focus:&lt;/strong&gt; Tracing and logging infrastructure (see "Debugging Reproducibility Failures" below).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Visibility Problem&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No idea where latency comes from (model? retrieval? tools?)&lt;/li&gt;
&lt;li&gt;Token costs spike without understanding why&lt;/li&gt;
&lt;li&gt;Can't tell which conversations are expensive vs. cheap&lt;/li&gt;
&lt;li&gt;Failures show error messages but no execution context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Diagnostic question:&lt;/strong&gt; Can the team see a waterfall breakdown showing exactly how much time each component (RAG retrieval, model inference, tool execution) contributes to total latency?&lt;/p&gt;

&lt;p&gt;If no: Visibility gap. Need metrics, dashboards, and execution tracing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution focus:&lt;/strong&gt; Real-time monitoring and metrics (see "Debugging Visibility Failures" below).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Quality Problem&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users report wrong answers but team can't identify the root cause&lt;/li&gt;
&lt;li&gt;Can't tell if poor responses come from retrieval or generation&lt;/li&gt;
&lt;li&gt;No systematic way to measure groundedness or hallucinations&lt;/li&gt;
&lt;li&gt;Quality degrades over time but no metrics to track it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Diagnostic question:&lt;/strong&gt; Can the team distinguish between a retrieval failure (wrong documents surfaced) versus a generation failure (hallucinated content despite correct retrieval)?&lt;/p&gt;

&lt;p&gt;If no: Quality gap. Need evaluation frameworks and separate metrics for each component.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution focus:&lt;/strong&gt; Evaluation and quality assessment (see "Debugging Quality Failures" below).&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;The Three Pillars of AI Agent Observability&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If monitoring stops at the model call, it's not monitoring. Full observability requires three pillars that address the diagnostic gaps above.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Pillar 1: Tracing – The Execution Path&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Tracing captures every decision, tool call, and reasoning step an agent makes. Each interaction with external tools gets logged. Every prompt sent to the model gets recorded. All responses get tracked.&lt;/p&gt;

&lt;p&gt;Full request traces must follow a user interaction through the entire toolchain: the vector store query, the model call, tool executions, and back to the final output. This addresses the reproducibility problem.&lt;/p&gt;

&lt;p&gt;A waterfall view shows exactly where latency comes from. Context assembly takes 800ms. Model inference takes 300ms. Tool execution takes 1.2 seconds. Suddenly, optimization targets become obvious. This addresses the visibility problem.&lt;/p&gt;

&lt;p&gt;Tracing also reveals intent versus execution gaps. An agent forgets to use a tool versus a tool fails – these look identical without tracing. The trace shows the agent never attempted the tool call, pointing directly to a prompt engineering fix.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Pillar 2: Real-Time Monitoring – The Alert System&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Real-time monitoring catches problems before users complain. Cost spikes signal infinite loops causing excessive API calls. Latency degradation reveals infrastructure issues. Error rate jumps expose integration failures.&lt;/p&gt;

&lt;p&gt;Production benchmarks that matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency: &amp;lt;500ms response speed&lt;/li&gt;
&lt;li&gt;Error rates: &amp;lt;5% failure rate&lt;/li&gt;
&lt;li&gt;Task completion: ≥90%&lt;/li&gt;
&lt;li&gt;Accuracy: ≥95%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Alerts configured on these metrics provide early warning. When token usage doubles in an hour, something is wrong. When latency crosses 500ms for 10% of requests, investigation starts immediately.&lt;/p&gt;

&lt;p&gt;Without real-time monitoring, founders discover problems through support tickets. With monitoring, alerts trigger before users notice degradation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Pillar 3: Evaluation – The Quality Check&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Evaluation measures what monitoring can't: response quality, groundedness, relevance, and hallucinations. This addresses the quality problem.&lt;/p&gt;

&lt;p&gt;Separating retrieval from generation proves critical for RAG systems. Contextual precision measures whether the reranker orders relevant chunks at the top. Contextual recall checks if the embedding model retrieves all necessary information. These retrieval metrics catch problems before generation begins.&lt;/p&gt;

&lt;p&gt;Generation metrics assess groundedness to retrieved context, answer relevance, and completeness. When a response fails quality checks, evaluation metrics pinpoint the failure location: retrieval or generation.&lt;/p&gt;

&lt;p&gt;Automated evaluations run continuously. Human evaluations sample high-stakes interactions. User feedback through thumbs-up/thumbs-down provides ground truth on real-world performance.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Debugging Reproducibility Failures&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The most common observability gap: failures that can't be reproduced. These patterns and solutions address the reproducibility problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;"My Agent Fails Randomly and I Can't Reproduce It"&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; LLM non-determinism makes identical prompts yield different outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What observability reveals:&lt;/strong&gt; Immutable snapshots of conversation history, tool results, and random seeds throughout execution. When failures occur, rollback to the last good state instead of starting from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Request IDs link every log entry to a specific conversation. Conversation versioning tracks state changes. Your team logs complete context at each step: conversation history, temperature settings, seed values, and tool results.&lt;/p&gt;

&lt;p&gt;When debugging a failure from three days ago, pull up the complete execution history by request ID and isolate the exact divergence point.&lt;/p&gt;
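&lt;p&gt;A minimal sketch of request-ID logging with Python's stdlib, where the snapshot fields (temperature, seed, history length) are illustrative:&lt;/p&gt;

```python
# Sketch: generate a request ID at request start and key every
# structured log line on it, so a failure can be replayed later.

import json
import logging
import uuid

logger = logging.getLogger("agent")

def start_request() -> str:
    """Mint a unique ID for one conversation turn."""
    return uuid.uuid4().hex

def log_step(request_id: str, step: str, **context):
    """Emit one structured log line keyed by request ID."""
    record = {"request_id": request_id, "step": step, **context}
    logger.info(json.dumps(record))

rid = start_request()
log_step(rid, "llm_call", temperature=0.9, seed=1234,
         history_len=7, model="gpt-4o")
```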

&lt;h3&gt;
  
  
  &lt;strong&gt;"My Agent Gets Stuck in Infinite Loops"&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Loop agents lack inherent termination mechanisms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What observability reveals:&lt;/strong&gt; Message repeat patterns expose loops before they consume budget. Recursive thought detection identifies when agents fall into planning breakdowns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Implement message sequence tracking. Alert when the same tool gets called more than 3 times consecutively. Set maximum iteration limits. Define termination signals through custom events or context flags. Monitor for runaway tool loops that indicate the agent lost track of its goal.&lt;/p&gt;
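&lt;p&gt;The loop guard can be sketched as a small counter over consecutive tool calls; the thresholds below are illustrative:&lt;/p&gt;

```python
# Sketch: halt the agent when the same tool repeats too many times
# consecutively, or when a hard iteration cap is exceeded.

class LoopGuard:
    def __init__(self, max_repeats=3, max_iterations=20):
        self.max_repeats = max_repeats
        self.max_iterations = max_iterations
        self.last_tool = None
        self.repeats = 0
        self.iterations = 0

    def check(self, tool_name: str) -> bool:
        """Record a tool call; returns True if the agent should halt."""
        self.iterations += 1
        if tool_name == self.last_tool:
            self.repeats += 1
        else:
            self.last_tool, self.repeats = tool_name, 1
        return (self.repeats > self.max_repeats
                or self.iterations > self.max_iterations)

guard = LoopGuard()
halts = [guard.check(t) for t in ["search", "search", "search", "search"]]
```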

&lt;h3&gt;
  
  
  &lt;strong&gt;"Tool Calling Fails Unpredictably"&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Poor tool descriptions or integration errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What observability reveals:&lt;/strong&gt; Trace logs show exact tool call attempts, parameter passing, and response integration. The difference between "agent forgot to use tool" and "tool returned error" becomes clear.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tool calling failures often stem from unclear descriptions. Agents need crystal-clear parameter definitions.&lt;/p&gt;

&lt;p&gt;Bad tool description: "search - Search for information"&lt;/p&gt;

&lt;p&gt;Good tool description: "search_knowledge_base - Search internal knowledge base for customer support articles. Use when user asks about product features, pricing, or troubleshooting. Returns top 5 most relevant articles. Parameter: query (natural language search query, example: 'how to reset password')"&lt;/p&gt;

&lt;p&gt;Add API schema validation, retry logic with exponential backoff, and circuit breakers to prevent cascading failures.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Debugging Visibility Failures&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The visibility gap: knowing something is wrong but not knowing where. These patterns and solutions address the visibility problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;"Costs Doubled and I Don't Know Why"&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; No token-level visibility into spending.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What observability reveals:&lt;/strong&gt; Cost per conversation, expensive tool chains, token usage by execution step. Session-level and agent-level spend tracking identifies cost drivers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Track token usage per request: prompt tokens, completion tokens, cost calculation based on model pricing. Log this alongside request IDs. Alert when single requests exceed cost thresholds (e.g., $1 per request).&lt;/p&gt;

&lt;p&gt;Build real-time dashboards showing token usage, API call frequency, and cost trends, with alerts that fire when spending exceeds budget thresholds.&lt;/p&gt;
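&lt;p&gt;Per-request cost checks reduce to a few lines; the per-1K-token prices below are placeholders, not current rates for any provider:&lt;/p&gt;

```python
# Sketch: compute per-request cost from token counts and flag requests
# that exceed an alert threshold (here, $1 per request).

PRICE_PER_1K = {"prompt": 0.0025, "completion": 0.01}  # USD, illustrative
COST_ALERT_USD = 1.00

def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    return (prompt_tokens / 1000 * PRICE_PER_1K["prompt"]
            + completion_tokens / 1000 * PRICE_PER_1K["completion"])

def check_request(request_id: str, prompt_tokens: int, completion_tokens: int):
    """Return a cost record suitable for logging alongside the request ID."""
    cost = request_cost(prompt_tokens, completion_tokens)
    return {"request_id": request_id,
            "cost_usd": round(cost, 4),
            "alert": cost > COST_ALERT_USD}

report = check_request("req-123", prompt_tokens=2000, completion_tokens=500)
```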

&lt;h3&gt;
  
  
  &lt;strong&gt;"Users Say It's Slow but I Can't See the Bottleneck"&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Aggregate latency metrics hide component-level delays.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What observability reveals:&lt;/strong&gt; Token-level latency, tool execution time, context assembly duration. Granular metrics show exactly where time gets spent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Context assembly often becomes the bottleneck with large datasets. In production systems, the breakdown is typically 60% context assembly (RAG retrieval), 25% model inference, and 15% tool execution.&lt;/p&gt;

&lt;p&gt;Waterfall visualization exposes the slowest components. Break down latency across each request lifecycle step. Monitor model inference time, tool response time, and data retrieval duration separately.&lt;/p&gt;
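&lt;p&gt;A lightweight way to get a waterfall is timing each stage with a context manager; the stage names and stubbed work below are illustrative:&lt;/p&gt;

```python
# Sketch: time each pipeline stage separately to build a per-request
# waterfall breakdown. sleep() stands in for real work.

import time
from contextlib import contextmanager

timings = {}

@contextmanager
def stage(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000  # ms

with stage("rag_retrieval"):
    time.sleep(0.005)   # stand-in for the vector store query
with stage("model_inference"):
    time.sleep(0.002)   # stand-in for the LLM call
with stage("tool_execution"):
    time.sleep(0.001)   # stand-in for tool calls

total_ms = sum(timings.values())
```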




&lt;h2&gt;
  
  
  &lt;strong&gt;Debugging Quality Failures&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The quality gap: knowing responses are wrong but not knowing why. These patterns address the quality problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;"I Can't Tell if It's Hallucinating or if RAG Retrieval is Broken"&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; No separation between retrieval and generation evaluation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What observability reveals:&lt;/strong&gt; Contextual precision shows reranker quality. Contextual recall measures embedding coverage. Groundedness to context identifies hallucinations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Evaluate retrieval and generation separately. When relevant chunks appear at position 8 instead of position 1, the reranker failed, not the LLM. Founders often blame the model when the issue is retrieval.&lt;/p&gt;

&lt;p&gt;Track contradictions (claims against provided context) versus unsupported claims (parts not grounded in context). LLM-as-a-judge approaches combined with deterministic checks automate this.&lt;/p&gt;

&lt;p&gt;Monitor retrieval precision, recall, and faithfulness separately from generation relevance and completeness.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;RAG Evaluation Specifics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Retrieval failures masquerade as generation problems in most cases. Poor reranking looks identical to embedding issues. Separate evaluation reveals the actual failure point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval metrics:&lt;/strong&gt; Contextual precision measures reranker effectiveness. When relevant chunks appear at position 8 instead of position 1, the reranker failed. Contextual recall measures embedding coverage. Missing 30% of relevant documents indicates embedding model issues.&lt;/p&gt;
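&lt;p&gt;Both retrieval metrics reduce to simple set arithmetic once relevance labels exist; the document IDs below are illustrative:&lt;/p&gt;

```python
# Sketch: contextual precision@k (do relevant chunks rank at the top?)
# and recall (were all needed chunks retrieved at all?).

def precision_at_k(retrieved_ids, relevant_ids, k=5):
    top = retrieved_ids[:k]
    hits = sum(1 for doc in top if doc in relevant_ids)
    return hits / k

def recall(retrieved_ids, relevant_ids):
    found = sum(1 for doc in relevant_ids if doc in retrieved_ids)
    return found / len(relevant_ids)

retrieved = ["d7", "d2", "d9", "d1", "d4", "d3"]  # ranked result list
relevant = {"d1", "d2", "d3"}                      # labeled ground truth

p_at_5 = precision_at_k(retrieved, relevant)   # 2 of the top 5 are relevant
full_recall = recall(retrieved, relevant)      # all 3 eventually surfaced
```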

&lt;p&gt;Search stability matters. Semantically equivalent queries should retrieve similar results. "How do I reset my password" versus "password reset instructions" should surface the same documents. When they don't, the embedding model lacks robustness.&lt;/p&gt;
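&lt;p&gt;Search stability can be quantified as the overlap between result sets for paraphrased queries. The sketch below stubs the retrieval call with fixed results; in practice the stub would hit the real vector store:&lt;/p&gt;

```python
# Sketch: Jaccard overlap between result sets for semantically
# equivalent queries. Low overlap suggests a brittle embedding model.

STUB_RESULTS = {  # illustrative fixed results standing in for a search call
    "How do I reset my password": ["kb-12", "kb-7", "kb-3"],
    "password reset instructions": ["kb-12", "kb-7", "kb-9"],
}

def retrieve_stub(query):
    return STUB_RESULTS[query]

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

results_a = retrieve_stub("How do I reset my password")
results_b = retrieve_stub("password reset instructions")
stability = jaccard(results_a, results_b)   # 2 shared of 4 total docs
```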

&lt;p&gt;&lt;strong&gt;Generation metrics:&lt;/strong&gt; Groundedness to retrieved context separates hallucinations from valid inferences. An agent claiming "the policy allows refunds within 60 days" when retrieved context says "30 days" reveals a hallucination. An agent inferring "enterprise customers get priority support" from context stating "enterprise tier includes dedicated support" shows valid reasoning.&lt;/p&gt;

&lt;p&gt;Production RAG monitoring requires logging inputs, outputs, and intermediate steps. Query rewrites, retrieved chunks, reranking scores, and final responses all need capture. Without intermediate logging, debugging becomes guesswork.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Observability Tools: What We've Tested&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Tool selection determines implementation speed and debugging capability. We've tested these tools in production and staging environments.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Softcery Take&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Helicone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30 min&lt;/td&gt;
&lt;td&gt;$0-$100/mo&lt;/td&gt;
&lt;td&gt;Emergency visibility, voice agents&lt;/td&gt;
&lt;td&gt;Our go-to for rapid implementation. One-line proxy integration. Built-in caching reduces costs. Free tier handles most MVP traffic.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Langfuse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2 hours&lt;/td&gt;
&lt;td&gt;$0-$200/mo&lt;/td&gt;
&lt;td&gt;Comprehensive observability, self-hosting&lt;/td&gt;
&lt;td&gt;Best for long-term vendor independence. Full tracing, evaluations, prompt management. SDK integration requires more setup but offers more control.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LangSmith&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1 hour&lt;/td&gt;
&lt;td&gt;$200-$1K/mo&lt;/td&gt;
&lt;td&gt;LangChain-native projects&lt;/td&gt;
&lt;td&gt;Only justified if deeply integrated with LangChain. Plug-and-play for LangChain users. Otherwise, Langfuse provides better value.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Galileo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1 day&lt;/td&gt;
&lt;td&gt;$500-$2K/mo&lt;/td&gt;
&lt;td&gt;Mission-critical applications&lt;/td&gt;
&lt;td&gt;Advanced hallucination detection. 100% sampling in production. Worth the cost for high-stakes deployments where quality failures are expensive.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Datadog LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1 day&lt;/td&gt;
&lt;td&gt;$500-$2K/mo&lt;/td&gt;
&lt;td&gt;Enterprise observability stacks&lt;/td&gt;
&lt;td&gt;Makes sense if already using Datadog for infrastructure. Automated contradiction detection. Integrated alerting.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAGAS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4 hours&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;RAG evaluation (open source)&lt;/td&gt;
&lt;td&gt;Solid open-source RAG metrics. Retrieval and generation separation. Good starting point before commercial tools.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TruLens&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4 hours&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;td&gt;Domain-specific evaluation&lt;/td&gt;
&lt;td&gt;Specializes in domain evaluation. Free tier sufficient for testing. Upgrade for production scale.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Tool Selection Criteria&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;For MVP-stage founders:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start with Helicone or Langfuse. Don't build custom observability. Helicone offers fastest implementation (30 minutes, proxy-based). Langfuse provides more comprehensive features (2 hours, SDK-based). Both free for basic usage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to consider commercial tools:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LangSmith makes sense only for teams deeply integrated with LangChain. Galileo or Datadog justify their cost ($500-$2000/month) for mission-critical applications where hallucination detection and 100% sampling matter.&lt;/p&gt;

&lt;p&gt;Consider cost versus debugging time savings. Spending $500/month on observability beats spending 40 hours debugging blind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key criteria:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OpenTelemetry support prevents vendor lock-in. Integration approach matters more than features. Proxy integration works immediately. SDK integration provides flexibility. Built-in framework integration offers best developer experience.&lt;/p&gt;

&lt;p&gt;Rapid implementation beats perfect architecture. Visibility in days wins over solutions requiring weeks of setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Minimum Viable Observability Setup&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Building custom observability wastes 2-4 weeks and misses critical data. Use proven tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Implement First (This Week)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Request/Response Logging with Request IDs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every request gets a unique identifier. Every log entry references that ID. Your team implements UUID generation at request start and includes it in all logging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basic Metrics Dashboard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Track four metrics: latency, cost, token usage, error rates. Real-time visibility into these fundamentals catches 80% of production issues. Use Helicone (proxy approach) or Langfuse (SDK approach) – both free for basic usage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple Tracing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Multi-step execution paths reveal agent reasoning. Even basic tracing showing model calls and tool executions provides massive debugging leverage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt Version Control&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Store prompts in version control like code. Tag production versions. When behavior changes unexpectedly, diff prompt versions to identify what changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error Handling Visibility&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Log retry attempts, circuit breaker activations, and fallback logic execution. Understanding error recovery is as important as tracking errors.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What Can Wait (Month 2+)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Advanced hallucination detection can wait. Comprehensive evaluation frameworks can wait. Custom metrics and dashboards can wait. Integration with broader observability stacks can wait.&lt;/p&gt;

&lt;p&gt;Get visibility first. Expand sophistication later.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Common Mistakes to Avoid&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Flying Blind&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Launching without any observability is the #1 mistake. Once an agent is in production, failures become nearly impossible to reproduce. The cost of retrofitting observability after launch is roughly 10x that of building it in from day one.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Monitoring Only Model Calls&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Logging model calls alone isn't monitoring. The problem might be in the vector store query, tool integration, or response assembly. Full request traces are non-negotiable: if observability stops at the model call, those failures remain invisible.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;No Prompt Version Control&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Prompt versions should be as rigorous as code versions. Without version control, behavior changes are impossible to debug. When an agent starts failing after a prompt update, there's no way to identify what changed or roll back.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Not Separating RAG Components&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Evaluating retrieval and generation together obscures root causes. Tracking separate metrics for each stage makes it immediately clear whether a bad answer came from poor retrieval or poor generation, which cuts debugging time significantly.&lt;/p&gt;
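&lt;p&gt;Separating the stages can be as simple as emitting per-stage metrics under distinct labels. A minimal sketch, with illustrative field names:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"log"
)

// stageMetrics records per-stage numbers so retrieval and generation
// can be evaluated independently. Field names are illustrative.
type stageMetrics struct {
	Stage     string
	LatencyMs int
	Score     float64 // e.g. retrieval relevance or generation faithfulness
}

// format renders one metrics line; emit writes it to the log.
func format(m stageMetrics) string {
	return fmt.Sprintf("stage=%s latency_ms=%d score=%.2f", m.Stage, m.LatencyMs, m.Score)
}

func emit(m stageMetrics) { log.Print(format(m)) }

func main() {
	// a bad answer with a high generation score points at retrieval
	emit(stageMetrics{Stage: "retrieval", LatencyMs: 120, Score: 0.41})
	emit(stageMetrics{Stage: "generation", LatencyMs: 840, Score: 0.93})
}
```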

&lt;h3&gt;
  
  
  &lt;strong&gt;Alerts After Impact&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Setting up monitoring after users complain defeats the purpose. Alerts exist to prevent user impact, not document it. The difference between proactive and reactive observability is the difference between preventing failures and explaining them to angry users.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;The Path Forward&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Observability transforms random failures into debuggable patterns. It enables root cause analysis instead of blind guessing.&lt;/p&gt;

&lt;p&gt;But observability is one piece of production readiness. Reliable AI agents also need evaluation frameworks, error handling, cost management, security controls, and scaling infrastructure.&lt;/p&gt;

&lt;p&gt;95% of AI implementations fail because teams harvest productivity gains too early. They pocket efficiency improvements instead of building robust systems. They optimize for short-term wins instead of sustainable reliability.&lt;/p&gt;

&lt;p&gt;The 5% that succeed build observability into their foundation. They implement monitoring before problems occur. They maintain prompt versions as rigorously as code. They separate concerns to enable systematic debugging.&lt;/p&gt;

&lt;p&gt;Three days of observability implementation prevents weeks of blind debugging. Request IDs, basic tracing, and metric dashboards provide the visibility needed to move from "it works sometimes" to "it works reliably."&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>testing</category>
      <category>startup</category>
    </item>
    <item>
      <title>Claude.ai Voice Input &amp; Text-to-Speech (OSS, Chrome Extension)</title>
      <dc:creator>Elijah Atamas</dc:creator>
      <pubDate>Sun, 22 Dec 2024 11:40:41 +0000</pubDate>
      <link>https://forem.com/stackdumper/claudeai-voice-input-text-to-speech-oss-chrome-extension-1pha</link>
      <guid>https://forem.com/stackdumper/claudeai-voice-input-text-to-speech-oss-chrome-extension-1pha</guid>
      <description>&lt;p&gt;A chrome extension that adds voice input and output to Claude.ai.&lt;br&gt;
GitHub repository: &lt;a href="https://github.com/softcery/claude-voice" rel="noopener noreferrer"&gt;softcery/claude-voice&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting Started
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Clone the repository&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;pnpm install&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;pnpm run build&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Go to chrome://extensions/, enable Developer mode, click "Load unpacked", and select the &lt;code&gt;dist&lt;/code&gt; folder.&lt;/li&gt;
&lt;li&gt;Open Claude Voice extension and enter your OpenAI API key.&lt;/li&gt;
&lt;li&gt;Go to Claude.ai and give it a try!&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Remember to reload the Claude.ai page after changing any settings in the extension for changes to take effect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Warning:&lt;/strong&gt; Since this extension interacts with Claude's UI, it might break if Anthropic makes changes to their website. If this happens, please either open an issue on the repository or send a message to hey [at] softcery [dot] com.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Coming Soon
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Publishing to the Chrome Web Store – currently In Review.&lt;/li&gt;
&lt;li&gt;Refactoring the codebase to follow better practices and add proper documentation.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Warning:&lt;/strong&gt; This is a quick weekend project and the codebase reflects that: it's functional but not well-organized or properly tested. If there's enough interest, I'll refactor the code to follow better practices and add proper documentation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  License
&lt;/h3&gt;

&lt;p&gt;MIT&lt;/p&gt;

</description>
      <category>extensions</category>
      <category>ai</category>
      <category>opensource</category>
      <category>chatgpt</category>
    </item>
    <item>
      <title>How to talk to people about my product idea?</title>
      <dc:creator>Elijah Atamas</dc:creator>
      <pubDate>Fri, 04 Sep 2020 10:56:54 +0000</pubDate>
      <link>https://forem.com/stackdumper/how-to-talk-to-people-about-my-product-idea-1375</link>
      <guid>https://forem.com/stackdumper/how-to-talk-to-people-about-my-product-idea-1375</guid>
      <description>&lt;p&gt;What do you do when an "ingenious" product idea comes to your mind? You want to dive right in. You feel like this is going to be the project of your life that will change the world for the better and give you a higher purpose to live for. Or, depending on your desires, you just feel that it's going to make you rich.&lt;/p&gt;

&lt;p&gt;At least for me, this has always been so. I get an idea, I become inspired, I dive right in and start coding. The doubt comes later. Maybe in a week, maybe in a month, but it comes nevertheless. And once it comes, I get all kinds of questions running in my head. "Is it really worth it? Is somebody going to use it? Where do I find an audience?", and so on.&lt;/p&gt;

&lt;p&gt;Only recently have I come to realize that I can actually try to validate an idea before committing myself and starting development. In the end, I'm building my idea for people, so it only makes sense to talk to those people and try to understand them.&lt;/p&gt;

&lt;p&gt;Yet, I'm still new to this. I understand that I need this, but I don't understand how to do this exactly. I believe in my idea, and I believe that there are people who are going to benefit from it. But this belief has only been validated by a couple of people from my personal circle. How do I speak to "the internet"?&lt;/p&gt;

&lt;p&gt;I understand that there are whole professions that are dedicated to this: sales, marketing, copywriting, etc. People spend their entire lives learning how to "talk to the internet", and it's very naive for me to expect to magically understand it myself right away.&lt;/p&gt;

&lt;p&gt;Another problem is the fear of rejection. Imagine I actually get to talk to people and express my ideas to them. There's a reasonable chance they'll just say my idea is plain bullshit. That there's no problem to solve. That there are plenty of solutions already. That many people have already tried and failed. And so on.&lt;/p&gt;

&lt;p&gt;I've been trying to learn as much as I can about this topic recently, and here are a few ideas that I found interesting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build a pre-release landing page explaining the idea and asking to leave an email to get an early access&lt;/li&gt;
&lt;li&gt;Go to a cafe and ask a few people to take a look at the landing page or discuss your idea in exchange for buying them a drink&lt;/li&gt;
&lt;li&gt;Build the product publicly, talk about your thoughts and ideas openly, and ask for early feedback while developing&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But all this still sounds very vague. How do I get people to visit the landing page? Should I spend all of my money on online advertising? How do I conquer the fear of rejection? What tone should I take when talking about the product? And, I think, the primary question is: "How do I connect with the target audience and have meaningful conversations about my product?"&lt;/p&gt;

&lt;p&gt;I would appreciate any ideas on this topic. What are your thoughts? Have you ever faced something similar? What was your experience? What were the solutions that helped you?&lt;/p&gt;

&lt;p&gt;Thank you. 🖖&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>help</category>
      <category>startup</category>
    </item>
    <item>
      <title>Setting up CI for microservices in monorepo using GitHub Actions</title>
      <dc:creator>Elijah Atamas</dc:creator>
      <pubDate>Mon, 31 Aug 2020 10:09:11 +0000</pubDate>
      <link>https://forem.com/stackdumper/setting-up-ci-for-microservices-in-monorepo-using-github-actions-5do2</link>
      <guid>https://forem.com/stackdumper/setting-up-ci-for-microservices-in-monorepo-using-github-actions-5do2</guid>
      <description>&lt;p&gt;In this article, I'll describe how to set up &lt;strong&gt;Continuous Integration&lt;/strong&gt; for an example application consisting of multiple microservices structured as &lt;strong&gt;monorepo&lt;/strong&gt; and written in &lt;strong&gt;Go&lt;/strong&gt;.&lt;/p&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;What is monorepo?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In the context of microservices, structuring an application as a monorepo means having a single repository for all microservices with each microservice residing in its own separate directory. &lt;/p&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Why monorepo?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To be clear, a monorepo isn't a silver bullet for structuring microservices; it has its own benefits and downsides. Generally, the benefits include &lt;em&gt;simplicity&lt;/em&gt;, &lt;em&gt;consistency&lt;/em&gt;, &lt;em&gt;stability&lt;/em&gt;, and &lt;em&gt;code reuse&lt;/em&gt;, while the downsides are &lt;em&gt;tight coupling&lt;/em&gt;, &lt;em&gt;git repo scalability&lt;/em&gt;, and &lt;em&gt;lack of code access control&lt;/em&gt;. I'll dive more deeply into these tradeoffs in a separate article. &lt;/p&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Example monorepo&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let's take a look at an &lt;a href="https://github.com/stackdumper/monorepo-actions-ci" rel="noopener noreferrer"&gt;example application&lt;/a&gt; structured as monorepo. This is a simple HTTP API that consists of three routes: &lt;code&gt;auth&lt;/code&gt;, &lt;code&gt;users&lt;/code&gt;, and &lt;code&gt;articles&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;monorepo-actions-ci
└── routes
    ├── articles
    │   └── main.go
    ├── auth
    │   └── main.go
    └── users
        └── main.go
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important detail is that each route is an independent application with its own entrypoint, &lt;code&gt;main.go&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;
  &lt;code&gt;routes/auth/main.go&lt;/code&gt;
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="s"&gt;"github.com/gin-gonic/gin"&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;router&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;gin&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GET&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/auth"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;gin&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"auth"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;p&gt;
  &lt;code&gt;routes/users/main.go&lt;/code&gt;
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="s"&gt;"github.com/gin-gonic/gin"&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;router&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;gin&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GET&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/users"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;gin&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"users"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;p&gt;
  &lt;code&gt;routes/articles/main.go&lt;/code&gt;
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="s"&gt;"github.com/gin-gonic/gin"&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;router&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;gin&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GET&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/articles"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;gin&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"articles"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Setting up GitHub Actions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let's create a simple workflow for the auth route, &lt;code&gt;.github/workflows/auth.yaml&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auth"&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# run the workflow only on changes&lt;/span&gt;
    &lt;span class="c1"&gt;# to the auth route and auth workflow&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;routes/auth/**"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.github/workflows/auth.yaml"&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v1&lt;/span&gt;

      &lt;span class="c1"&gt;# run tests for route&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
          &lt;span class="s"&gt;cd routes/auth&lt;/span&gt;
          &lt;span class="s"&gt;go test ./...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This workflow runs &lt;code&gt;go test&lt;/code&gt; when changes are made to the auth route. Additionally, it runs when changes are made to the workflow file, which is useful for debugging the workflow itself.&lt;/p&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Automating workflow generation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now, the next step is to add the same workflow for &lt;code&gt;routes/users&lt;/code&gt; and &lt;code&gt;routes/articles&lt;/code&gt;. Duplicating the workflow seems reasonable in this small example, but for a large application with complex workflows and many routes it becomes problematic.&lt;/p&gt;

&lt;p&gt;Instead, it would be better to automatically generate workflows from a single workflow template. Let's determine which parts of the workflow need to be adjusted when we move from one endpoint to another.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{ROUTE}}"&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;routes/{{ROUTE}}/**"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.github/workflows/{{ROUTE}}.yaml"&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v1&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
          &lt;span class="s"&gt;cd routes/{{ROUTE}}&lt;/span&gt;
          &lt;span class="s"&gt;go test ./...&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks simple. Save this template as &lt;code&gt;.github/workflow-template.yaml&lt;/code&gt;, then create a bash script, &lt;code&gt;workflows.sh&lt;/code&gt;, that generates a workflow for each endpoint from the template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# read the workflow template&lt;/span&gt;
&lt;span class="nv"&gt;WORKFLOW_TEMPLATE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; .github/workflow-template.yaml&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# iterate each route in routes directory&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;ROUTE &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;ls &lt;/span&gt;routes&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"generating workflow for routes/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ROUTE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

    &lt;span class="c"&gt;# replace template route placeholder with route name&lt;/span&gt;
    &lt;span class="nv"&gt;WORKFLOW&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;WORKFLOW_TEMPLATE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s2"&gt;"s/{{ROUTE}}/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ROUTE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/g"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;# save workflow to .github/workflows/{ROUTE}&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;WORKFLOW&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; .github/workflows/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ROUTE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;.yaml
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Don't forget to make the script executable after creating the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x ./workflows.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's run the script and see what happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;monorepo-actions-ci
❯ ./workflows.sh
generating workflow &lt;span class="k"&gt;for &lt;/span&gt;routes/articles
generating workflow &lt;span class="k"&gt;for &lt;/span&gt;routes/auth
generating workflow &lt;span class="k"&gt;for &lt;/span&gt;routes/users
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After taking a quick look at &lt;code&gt;.github/workflows&lt;/code&gt; you'll see that the script automatically generated a workflow for each route present in &lt;code&gt;/routes&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;
  tree
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;├── .github
│   ├── workflow-template.yaml
│   └── workflows
│       ├── articles.yaml
│       ├── auth.yaml
│       └── users.yaml
├── routes
│   ├── articles
│   │   └── main.go
│   ├── auth
│   │   └── main.go
│   └── &lt;span class="nb"&gt;users&lt;/span&gt;
│       └── main.go
└── workflows.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;p&gt;
&lt;code&gt;.github/workflows/auth.yaml&lt;/code&gt;
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auth"&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;routes/auth/**"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.github/workflows/auth.yaml"&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v1&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
          &lt;span class="s"&gt;cd routes/auth&lt;/span&gt;
          &lt;span class="s"&gt;go test ./...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;p&gt;
&lt;code&gt;.github/workflows/users.yaml&lt;/code&gt;
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;users"&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;routes/users/**"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.github/workflows/users.yaml"&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v1&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
          &lt;span class="s"&gt;cd routes/users&lt;/span&gt;
          &lt;span class="s"&gt;go test ./...&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;p&gt;
&lt;code&gt;.github/workflows/articles.yaml&lt;/code&gt;
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;articles"&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;routes/articles/**"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.github/workflows/articles.yaml"&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v1&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
          &lt;span class="s"&gt;cd routes/articles&lt;/span&gt;
          &lt;span class="s"&gt;go test ./...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Does it work?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let's push our code to the GitHub repository and see what happens.&lt;/p&gt;
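&lt;p&gt;As a sanity check before pushing, the repository layout can be reproduced in a scratch directory (paths mirror the article; the scratch repo itself is just an assumption for illustration) to confirm the workflow files sit under &lt;code&gt;.github/workflows/&lt;/code&gt; — the only directory GitHub Actions scans:&lt;/p&gt;

```shell
# Sketch: rebuild the repo layout in a throwaway git repo and list the
# tracked files to confirm the workflows are under .github/workflows/.
set -e
demo=$(mktemp -d)
mkdir -p "$demo/.github/workflows" "$demo/routes/articles" "$demo/routes/users"
printf 'name: "articles"\n' > "$demo/.github/workflows/articles.yaml"
printf 'name: "users"\n' > "$demo/.github/workflows/users.yaml"
cd "$demo"
git init -q
git add .
git -c user.name=demo -c user.email=demo@example.com commit -qm "per-service CI"
git ls-files   # lists .github/workflows/articles.yaml and users.yaml among the tracked files
```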

&lt;p&gt;The workflows are starting:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fsrwie9q8b3h65lz1d9td.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fsrwie9q8b3h65lz1d9td.png" alt="Workflows starting to run" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Workflows have finished running successfully:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwluirz2il4up8qbtwofm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwluirz2il4up8qbtwofm.png" alt="Workflows have finished running successfully" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Auth workflow details:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkmlzqx7axju1wna25zo6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkmlzqx7axju1wna25zo6.png" alt="Workflow steps" width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each of the workflows has finished running successfully. Success! 🎉&lt;/p&gt;




&lt;p&gt;I hope this article comes in handy one day. Let me know your thoughts in the comments below or on &lt;a href="https://twitter.com/stackdumper" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;. 🖖&lt;/p&gt;

</description>
      <category>devops</category>
      <category>github</category>
      <category>go</category>
      <category>actionshackathon</category>
    </item>
  </channel>
</rss>
