Forem: Elizabeth Fuentes L

AI Context Window Overflow: A Memory Pointer Solution

Elizabeth Fuentes L — Thu, 14 May 2026 07:00:00 +0000

Context window overflow occurs when an AI agent's tool outputs exceed the token limit that the large language model (LLM) can process at once. The agent does not crash: it silently truncates data, loses earlier context, or produces incomplete results. This post shows how the Memory Pointer Pattern fixes it, from a single agent up to multi-agent coordination where 145KB of data never enters any LLM context.

This demo uses Strands Agents. The Memory Pointer Pattern is framework-independent and can be applied with LangGraph, AutoGen, or other agent frameworks that support tool context.

Working code: github.com/aws-samples/sample-why-agents-fail

Series: Why AI Agents Fail

Context Window Overflow (this post): Memory Pointer Pattern for large data
MCP Tools That Never Respond: async pattern for slow external APIs
Reasoning Loops in AI Agents: detect and block repeated tool calls

The Problem: Agents Can't Handle Large Tool Outputs

When an AI agent calls a tool that returns large data (server logs, database results, file contents), the response can overflow the LLM's context window. The agent does not fail with a clear error. It degrades silently: it truncates data, loses context, or fails to complete the task.

Research from IBM (Solving Context Window Overflow in AI Agents, 2025) quantifies this:

In Materials Science workflows, tool outputs can exceed 2 million elements
The traditional approach consumed 20,822,181 tokens and failed
The same workflow with memory pointers used 1,234 tokens and succeeded
That is a reduction of more than 16,000x in this workflow
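
As a quick sanity check on that last figure, the reduction factor can be recomputed from the two token counts reported above. This is plain arithmetic on the numbers quoted from the IBM study, not a re-run of the benchmark.

```python
# Token counts as quoted above from the IBM study.
traditional_tokens = 20_822_181
pointer_tokens = 1_234

# Ratio between the two approaches.
reduction = traditional_tokens / pointer_tokens
print(f"Reduction factor: {reduction:,.0f}x")
```

The ratio lands a little under 17,000, which is consistent with the "more than 16,000x" claim.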

A community observation (Context Window Limits Explained, Airbyte 2025) confirms that teams discover these limits "the hard way" through silent errors. The agent appears to work but produces incomplete or incorrect results.

The idea of passing references instead of raw data has also been validated in multi-agent setups. Research from Amazon (Towards Effective GenAI Multi-Agent Collaboration, 2024) introduces "payload referencing," where agents exchange pointers to shared data instead of embedding large payloads in messages. This improved performance on code-intensive tasks by 23% and reached end-to-end goal success rates of 90% on enterprise benchmarks. This is exactly what we implement below with Strands Swarm.

Why This Happens

When the tool output is small (a few KB), injecting it directly into the conversation works fine. But when a tool returns 200KB of server logs:

The full output is injected into the conversation
The LLM's context window fills up
The oldest context (including the original question) is evicted
The LLM cannot reason over the data because it cannot see all of it
The agent fails or produces incomplete answers
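
The steps above can be sketched with a toy conversation history that evicts its oldest messages once a token budget is exceeded. This is a simplified illustration, not how any particular framework manages context: the 4-characters-per-token heuristic, the 8,000-token budget, and the function names are all assumptions made for the example.

```python
from collections import deque

def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)

def append_with_eviction(history: deque, message: str, max_tokens: int) -> list:
    """Append a message, then evict the oldest ones until the history fits the budget."""
    history.append(message)
    evicted = []
    # Always keep at least the newest message, even if it alone exceeds the budget.
    while sum(estimate_tokens(m) for m in history) > max_tokens and len(history) > 1:
        evicted.append(history.popleft())
    return evicted

history = deque()
append_with_eviction(history, "User: which service had the most errors?", max_tokens=8_000)
# A 200KB tool output arrives and pushes the original question out.
evicted = append_with_eviction(history, "Tool: " + "x" * 200_000, max_tokens=8_000)

question_in_context = any(m.startswith("User:") for m in history)
print("Evicted messages:", len(evicted))
print("Question still in context:", question_in_context)
```

In this toy run the user's question is the first thing to go, and the tool output that remains still does not fit the budget, which is the point where truncation or failure sets in.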

Solution 1: Single Agent with Strands ToolContext

The first approach uses agent.state, a native key-value store scoped to each agent instance. Tools write large data there via ToolContext and return a short pointer string to the context:

from strands import Agent, tool, ToolContext

# context=True injects ToolContext into the tool_context parameter; required to access agent.state
@tool(context=True)
def fetch_application_logs(app_name: str, tool_context: ToolContext, hours: int = 24) -> str:
    """Fetch application logs. Returns a memory pointer for large datasets."""
    logs = generate_logs(app_name, hours)  # Could be 200KB+

    if len(str(logs)) > 20_000:  # Threshold: store externally above 20KB
        pointer = f"logs-{app_name}"
        # Store the full payload in agent.state; it never enters the LLM context
        tool_context.agent.state.set(pointer, logs)
        # Return only the pointer key (52 bytes); this is all the LLM sees
        return f"Data stored as pointer '{pointer}'. Use analysis tools to query it."
    return str(logs)  # Small enough to return directly

@tool(context=True)
def analyze_error_patterns(data_pointer: str, tool_context: ToolContext) -> str:
    """Analyze errors by resolving the pointer from agent.state."""
    # Retrieve the full dataset from agent.state using the pointer key
    data = tool_context.agent.state.get(data_pointer)
    if data is None:  # Unknown pointer: tell the model instead of raising
        return f"No data found for pointer '{data_pointer}'."
    errors = [e for e in data if e["level"] == "ERROR"]
    # Return a summary (not raw data) to keep the response small
    return f"Found {len(errors)} errors across {len(set(e['service'] for e in errors))} services"

The LLM never sees the 200KB. It only sees the short pointer message for 'logs-payment-service' (52 bytes in the demo). The next tool reads the full data from agent.state and returns a summary. Strands provides this capability natively: no global dictionaries, no hashlib, no external infrastructure.
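
Because the pattern is framework-independent, the same idea can be sketched in plain Python. The example below is not the Strands API: the module-level `STORE` dict stands in for agent.state, the log entries are synthetic, and `fetch_logs` and `count_errors` are hypothetical names. It reuses the 20KB threshold from the example above.

```python
import json

# Hypothetical in-process store standing in for agent.state.
STORE: dict = {}
THRESHOLD_BYTES = 20_000  # same 20KB threshold as the example above

def fetch_logs(app_name: str, n_entries: int) -> str:
    """Return small payloads directly, or a pointer key for large ones."""
    # Synthetic log entries: every 10th one is an error.
    logs = [
        {"level": "ERROR" if i % 10 == 0 else "INFO", "service": f"svc-{i % 3}", "msg": "x" * 50}
        for i in range(n_entries)
    ]
    payload = json.dumps(logs)
    if len(payload) > THRESHOLD_BYTES:
        pointer = f"logs-{app_name}"
        STORE[pointer] = logs  # the full payload stays outside the model's context
        return f"Stored as pointer '{pointer}'."
    return payload

def count_errors(pointer: str) -> str:
    """Resolve the pointer and return only a summary."""
    data = STORE[pointer]
    errors = [e for e in data if e["level"] == "ERROR"]
    return f"Found {len(errors)} errors"

tool_output = fetch_logs("payment-service", n_entries=2_000)
summary = count_errors("logs-payment-service")
print(len(tool_output), "bytes reach the model;", summary)
```

The only strings that would travel through the model are the pointer message and the summary. The 2,000 log entries stay in the store.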

Single-Agent Results

Metric	Without Pointers	With Memory Pointers
Data in context	214KB (full logs)	52 bytes (pointer)
Agent behavior	Truncates/fails	Processes all data
Errors detected	Partial	Complete (all services)

Solution 2: Multi-Agent with Strands Swarm

A single agent works for linear pipelines. But real-world incident response involves specialized roles: someone fetches data, someone analyzes it, someone writes the report. Strands Swarm coordinates multiple agents autonomously: you define agents with different tools, and the Swarm handles the handoffs.

This is the same "payload referencing" pattern from Amazon's multi-agent collaboration paper. Agents exchange pointers to shared data instead of passing raw payloads. The difference is that Strands Swarm handles coordination automatically and provides invocation_state as the official API for sharing data between agents.

import json

from strands import Agent, tool, ToolContext
from strands.multiagent import Swarm

# invocation_state is a dict shared by all agents in the Swarm: the cross-agent store
@tool(context=True)
def fetch_application_logs(app_name: str, tool_context: ToolContext, hours: int = 6) -> str:
    logs = generate_logs(app_name, hours)  # 145KB+
    pointer = f"logs-{app_name}"
    # Store in invocation_state so downstream agents can access it without re-fetching
    tool_context.invocation_state[pointer] = logs
    # Only the pointer string travels through the LLM context to the next agent
    return f"Stored as '{pointer}'. Hand off to analyzer."

@tool(context=True)
def analyze_error_patterns(logs_pointer: str, tool_context: ToolContext) -> str:
    # Resolve the pointer to the full dataset; no LLM context consumed
    logs = tool_context.invocation_state.get(logs_pointer)
    errors = [entry for entry in logs if entry["level"] == "ERROR"]
    result = {"total_errors": len(errors)}  # additional fields omitted for brevity
    # Store the analysis results as another pointer for the reporter agent
    tool_context.invocation_state["error_analysis"] = result
    return json.dumps(result)

# Each agent has a focused role; the Swarm decides the handoff order autonomously
# (MODEL, detect_latency_anomalies and generate_incident_report are not shown here)
collector = Agent(name="collector", tools=[fetch_application_logs], model=MODEL)
analyzer = Agent(name="analyzer", tools=[analyze_error_patterns, detect_latency_anomalies], model=MODEL)
reporter = Agent(name="reporter", tools=[generate_incident_report], model=MODEL)

swarm = Swarm([collector, analyzer, reporter], entry_point=collector)
result = swarm("Fetch logs, analyze them, and generate an incident report.")

The Swarm automatically:

Starts with the collector, which fetches 145KB of logs and stores them in invocation_state
The collector hands off to the analyzer with the pointer "logs-payment-service"
The analyzer runs error and latency analysis, stores the results in invocation_state, and hands off to the reporter
The reporter generates the final incident report

No orchestration code or manual handoff logic is needed. Each agent has its own tools, and the Swarm determines the flow from the agent descriptions and the task. All data exchange happens via tool_context.invocation_state, the same ToolContext API used in the single-agent case, with a different store.
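
To make the data flow concrete without a model in the loop, here is a minimal plain-Python sketch of the same handoff chain. It is not Strands code: the three functions stand in for the agents, `shared_state` stands in for invocation_state, and the log entries are synthetic. The point is to compare how many bytes sit in the shared store against how many are passed from stage to stage.

```python
import json

def collector(shared: dict) -> str:
    # Synthetic log entries (assumption: real logs would come from a log API).
    shared["logs-demo"] = [
        {"level": "ERROR" if i % 4 == 0 else "INFO", "latency_ms": 100 + i}
        for i in range(1_000)
    ]
    return "logs-demo"  # only this key is handed to the next stage

def analyzer(shared: dict, logs_pointer: str) -> str:
    logs = shared[logs_pointer]
    shared["error_analysis"] = {"total_errors": sum(e["level"] == "ERROR" for e in logs)}
    return "error_analysis"

def reporter(shared: dict, analysis_pointer: str) -> str:
    return f"Incident report: {shared[analysis_pointer]['total_errors']} errors found"

shared_state: dict = {}
logs_pointer = collector(shared_state)
analysis_pointer = analyzer(shared_state, logs_pointer)
report = reporter(shared_state, analysis_pointer)

handoff_bytes = len(logs_pointer) + len(analysis_pointer)
for key, value in shared_state.items():
    print(f"{key}: {len(json.dumps(value)):,} bytes in shared store")
print("Bytes passed between stages:", handoff_bytes)
print(report)
```

The stages exchange only two short keys, while the full log list lives in the shared dict for any later stage to read.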

Swarm Results

Status: COMPLETED
Agents: collector → analyzer → reporter
Time: ~14s
Shared store:
  logs-payment-service: 145,310 bytes
  error_analysis: 135 bytes
  latency_analysis: 70 bytes

145KB of logs processed by three agents. None of it ever entered any LLM context window.

Follow-Up Investigation

After the swarm completes, the data remains in the shared store. A separate investigator agent can dig into specific services without re-fetching:

# The investigator reuses the invocation_state populated by the swarm; no data re-fetching
investigator = Agent(
    name="investigator",
    tools=[get_error_details, analyze_error_patterns],
    model=MODEL,
)

# Each question resolves the pointer from invocation_state and runs the analysis in memory
investigator("Which service had the most errors?")
investigator("Show me the error logs for cache-layer")
investigator("Which status codes do those errors return?")
# All queries read from the same 145KB already in invocation_state: no re-fetching, no context overflow

When to Use Each Approach

Single agent + agent.state: linear pipelines where one agent handles fetching, analysis, and reporting. Use ToolContext to access tool_context.agent.state from tools.

Swarm + invocation_state: specialized roles, complex flows, or when you want autonomous coordination. Use ToolContext to access tool_context.invocation_state, the official Strands API for multi-agent data exchange. The Swarm manages handoffs, timeouts, and detection of repetitive handoffs.

Both: use SlidingWindowConversationManager as an additional safeguard. It automatically trims the conversation history and handles ContextWindowOverflowException with a retry.

These approaches are part of context engineering for AI agents: the practice of deciding what information enters the LLM's context window, and when.

Try It Yourself

You need Python 3.9+, uv, and an OpenAI API key.

git clone https://github.com/aws-samples/sample-why-agents-fail
cd sample-why-agents-fail/stop-ai-agents-wasting-tokens/01-context-overflow-demo
uv venv && uv pip install -r requirements.txt
export OPENAI_API_KEY="your-key-here"

uv run python test_context_overflow.py   # Single agent: 4 scenarios
uv run python swarm_demo.py              # Multi-agent: Collector → Analyzer → Reporter

Or open test_context_overflow.ipynb in Kiro, VS Code, or your preferred notebook environment.

Key Takeaways

Context overflow is silent: agents don't crash, they produce incorrect results
Memory pointers fix it: store large data externally, pass references
More than 16,000x token reduction: validated by IBM Research on the Materials Science benchmark
Single agent uses agent.state: @tool(context=True) + ToolContext to store and retrieve data outside the context
Multi-agent uses invocation_state: same ToolContext API, shared by all agents in the Swarm. No orchestration code needed
Data persists for follow-up: after the pipeline completes, the stored data is available for investigation without re-fetching

Frequently Asked Questions

Why do AI agents run out of context?

AI agents run out of context when tool responses are injected directly into the LLM's conversation history. Every response consumes tokens. When the accumulated tool outputs exceed the model's context window limit, the LLM loses earlier context, truncates data, or fails entirely. This happens silently: the agent appears to work but produces incomplete or incorrect results.

What is the Memory Pointer Pattern for AI agents?

The Memory Pointer Pattern stores large tool outputs (logs, datasets, query results) in external state instead of in the LLM's context window. Tools return a short reference key (the "pointer") that subsequent tools use to retrieve the full data. IBM Research validated this pattern with a reduction of more than 16,000x on the Materials Science benchmark.

How does agent.state differ from invocation_state in Strands Agents?

agent.state is scoped to a single agent instance. Use it for linear pipelines where one agent handles every step. invocation_state is shared by all agents in a Strands Swarm. Use it when multiple specialized agents need to exchange data without passing large payloads through the LLM context.

Can I use the Memory Pointer Pattern with LangGraph or other frameworks?

Yes. The pattern requires two capabilities: a shared key-value store accessible from tools, and the ability to pass short reference strings through the LLM context. LangGraph provides this through its state management, AutoGen through shared memory, and CrewAI through task context. The Strands implementation uses ToolContext as the native API.

References

Research

Solving Context Window Overflow in AI Agents, IBM Research, Nov 2025
Towards Effective GenAI Multi-Agent Collaboration, Amazon, Dec 2024
Context Window Limits Explained, Airbyte blog (community observation), Dec 2025
Efficient On-Device Agents via Adaptive Context Management, Nov 2025

Implementation

Strands Agent State: ToolContext and agent.state
Strands Swarm: multi-agent orchestration
Strands Conversation Management: sliding window and context overflow

Have you hit context window limits in your agents? Which strategies worked for you? Share in the comments.

Next in this series: MCP Tools That Never Respond, covering async patterns for slow external APIs.

All code in this series is open source under the MIT-0 License. Star the repository to follow updates.

Thanks!


Built-in Token Counting: Telemetry for Production AI Agents

Elizabeth Fuentes L — Wed, 13 May 2026 07:00:00 +0000

Strands Agents provides native telemetry and cost tracking out of the box. Stop writing custom token counters.

Building AI agents is easy. Deploying them to production is where most teams hit a wall.

One of the first questions from finance: "How much will this cost per request?"

Most agent frameworks make you build your own token counter. Strands Agents gives you one.

The Problem with Custom Token Counting

Every AI application needs cost monitoring. But tracking tokens across:

Multiple model calls
Tool invocations
Prompt caching
Multi-agent workflows

...requires custom infrastructure most teams rebuild from scratch.

Native Telemetry in Strands Agents

Strands Agents includes production-grade telemetry by default:

from strands import Agent
from strands_tools import calculator

# Create an agent with tools
agent = Agent(tools=[calculator])

# Invoke the agent with a prompt and get an AgentResult
result = agent("What is the square root of 144?")

# Access metrics through the AgentResult
print(f"Total tokens: {result.metrics.accumulated_usage['totalTokens']}")
print(f"Execution time: {sum(result.metrics.cycle_durations):.2f} seconds")
print(f"Tools used: {list(result.metrics.tool_metrics.keys())}")

# Cache metrics (when available)
if 'cacheReadInputTokens' in result.metrics.accumulated_usage:
    print(f"Cache read tokens: {result.metrics.accumulated_usage['cacheReadInputTokens']}")
if 'cacheWriteInputTokens' in result.metrics.accumulated_usage:
    print(f"Cache write tokens: {result.metrics.accumulated_usage['cacheWriteInputTokens']}")

No configuration. No custom code. It just works.

What You Get

Every AgentResult includes:

Metric	Description
`inputTokens`	Tokens sent to the model
`outputTokens`	Tokens generated by the model
`totalTokens`	Total tokens (input + output)
`cacheReadInputTokens`	Tokens read from cache (Bedrock prompt caching)
`cacheWriteInputTokens`	Tokens written to cache
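
To answer the finance question from the introduction, these token counts still have to be turned into money. Here is a minimal sketch of that conversion. The usage dict is a hypothetical stand-in shaped like `accumulated_usage`, and the per-1K-token prices are placeholders, not real pricing for any provider.

```python
def estimate_cost_usd(usage: dict, input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Convert token counts into an estimated dollar cost."""
    input_cost = usage.get("inputTokens", 0) / 1000 * input_price_per_1k
    output_cost = usage.get("outputTokens", 0) / 1000 * output_price_per_1k
    return input_cost + output_cost

# Hypothetical usage dict and hypothetical prices, for illustration only.
usage = {"inputTokens": 2_000, "outputTokens": 500, "totalTokens": 2_500}
cost = estimate_cost_usd(usage, input_price_per_1k=0.003, output_price_per_1k=0.015)
print(f"Estimated cost per request: ${cost:.4f}")
```

Input and output tokens are priced separately here because providers typically charge different rates for each, so `totalTokens` alone is not enough for a cost estimate.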

Multi-Agent Token Tracking

For multi-agent systems (executor → validator → critic), aggregate metrics across all agents:

from strands.multiagent import Swarm

swarm = Swarm([executor, validator, critic])
result = swarm("Query")

total_tokens = 0
for node_result in result.results.values():
    usage = node_result.result.metrics.accumulated_usage
    total_tokens += usage['totalTokens']

print(f"Total cost across all agents: {total_tokens} tokens")
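
If you also want input, output, and cache tokens broken out across agents, the same loop can sum every field at once. The sketch below uses plain dicts shaped like `accumulated_usage` as a stand-in, so the agent names and numbers are hypothetical and no model is called.

```python
from collections import Counter

# Hypothetical per-agent usage dicts, shaped like accumulated_usage.
per_agent_usage = {
    "executor": {"inputTokens": 1_200, "outputTokens": 300, "totalTokens": 1_500},
    "validator": {"inputTokens": 800, "outputTokens": 100, "totalTokens": 900},
    "critic": {"inputTokens": 600, "outputTokens": 200, "totalTokens": 800, "cacheReadInputTokens": 400},
}

totals = Counter()
for usage in per_agent_usage.values():
    totals.update(usage)  # Counter.update adds counts key by key

print(dict(totals))
```

Using a `Counter` means optional fields such as cache tokens are summed when present and simply absent otherwise, without extra `if` checks.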

Per-Cycle Tracking

For agents that run multiple reasoning cycles, track tokens per cycle:

from strands import Agent
from strands_tools import calculator

agent = Agent(tools=[calculator])

# First invocation
result1 = agent("What is 5 + 3?")

# Second invocation
result2 = agent("What is the square root of 144?")

# Access metrics for the latest invocation
latest_invocation = result2.metrics.latest_agent_invocation
cycles = latest_invocation.cycles
usage = latest_invocation.usage

# Or access all invocations
for invocation in result2.metrics.agent_invocations:
    print(f"Invocation usage: {invocation.usage}")
    for cycle in invocation.cycles:
        print(f"  Cycle {cycle.event_loop_cycle_id}: {cycle.usage}")

# Or print the summary (includes all invocations)
print(result2.metrics.get_summary())

For a complete list of attributes and their types, see the EventLoopMetrics API reference.

Why This Matters

Cost visibility is the difference between a prototype and production AI.

With Strands telemetry:

✅ Budget AI workloads before deployment
✅ Identify expensive queries in production
✅ Optimize prompts with real token data
✅ Track prompt caching savings

All without writing a single line of telemetry code.

Works with All Model Providers

Token tracking works regardless of your model provider:

Amazon Bedrock (Claude, Llama, Mistral)
OpenAI (GPT-4, GPT-3.5)
Anthropic API
Ollama (local models)

Same API, same metrics, zero config changes.

Try It

pip install strands-agents

Full documentation: strandsagents.com/docs/user-guide/concepts/agents/

Thanks!



How to Guide AI Assistants to Build Production-Ready Agents: 8 Essential Patterns

Elizabeth Fuentes L — Mon, 11 May 2026 18:54:52 +0000

When you ask an AI assistant such as Kiro (AWS's AI assistant), Claude Code, or ChatGPT to "build an agent," you get working code. But you don't see the architecture decisions made behind the scenes. The agent answers queries, but it might waste tokens in reasoning loops, hallucinate answers from incomplete data, or freeze on slow APIs. These failures stay silent until production.

Those silent decisions cover retrieval strategies, validation approaches, and error-handling patterns. These 8 patterns give you the vocabulary to specify production-grade decisions in your prompts, preventing hallucinations and token waste before any code is generated.

This post closes two series I wrote documenting the most costly agent failures in production: Stop AI Agent Hallucinations (5 techniques) and Why AI Agents Fail (3 failure modes). If you know these 8 patterns, you can guide AI assistants to avoid those failures from the start.

This is not a step-by-step implementation guide. It is a reference for knowing what exists, so you can recognize when to use each pattern for your use case.

Working code for all 8 techniques: linked in each section

Why This Matters

AI assistants generate agent code in seconds. Kiro, Claude Code, Cursor, and ChatGPT can scaffold tools, configure LLM calls, and wire up retrieval systems faster than you can code them by hand.

But speed creates a problem: you get working code without seeing the trade-offs.

When you write "build a booking agent with RAG," the assistant makes decisions:

Which retrieval strategy? (vector similarity, graph queries, hybrid)
How to handle large outputs? (truncate, summarize, external storage)
What validation runs before a tool is used? (none, prompts, framework hooks)
How to handle slow APIs? (block, timeout, async patterns)

Your prompt does not specify any of this. The assistant picks defaults. Those defaults create the failure modes this post documents.

The 8 Failure Patterns (Quick Reference)

Hallucination Failures (5 patterns):

GraphRAG - Vector RAG fabricates statistics from incomplete chunks
Semantic Tool Selection - Too many tools, the agent picks the wrong ones
Neurosymbolic Guardrails - The agent ignores business rules stated in prompts
Runtime Guardrails (Steering) - The agent violates rules and needs correction, not blocking
Multi-Agent Validation - A single agent claims success when operations fail

Silent Token Waste (3 patterns):

Memory Pointer Pattern - Large data overflows the context and causes truncation
Async HandleId Pattern - Slow APIs block the agent indefinitely
DebounceHook + Explicit States - The agent loops on the same call without making progress

You don't implement all 8. You learn what each one solves, then specify the ones your use case needs when prompting.

What Are These 8 Patterns?

These patterns address the most costly production failures: hallucinations from incomplete data (GraphRAG, Semantic Tool Selection, Guardrails, Steering, Multi-Agent), and silent token waste (Memory Pointers, Async HandleId, DebounceHook). Naming the right patterns in your prompt up front saves you from debugging black-box code in production.

Measured Impact in Production

Pattern	Result	Source
GraphRAG	Exact counts vs fabricated approximations	RAG vs GraphRAG
Semantic Tool Selection	86.4% fewer errors, 89% lower token costs	Tool Selection
Memory Pointers	20M tokens reduced to 1,234 tokens	IBM Materials Science study
Async HandleId	18-second block eliminated, no 424 timeouts	MCP Timeouts
Explicit States	14 calls reduced to 2 (7x improvement)	Reasoning Loops

Pattern 1: GraphRAG for Precise Queries

What Is GraphRAG?

GraphRAG replaces vector similarity with graph database queries for structured data. When your agent needs exact counts, aggregations, or relationship traversal, GraphRAG translates natural language into Cypher queries that return precise results from structured data instead of statistics hallucinated from text chunks. Use it for structured queries, and keep vector RAG for semantic search.

What Breaks

Vector RAG fabricates statistics. You ask "How many hotels in Miami have a pool and breakfast?" and vector similarity retrieves 3 text chunks that mention pools and breakfast. The LLM sees incomplete data, extrapolates from the samples, and returns "approximately 120 hotels" (fabricated from 3 chunks out of 200 hotels).

Out-of-domain queries return hallucinated answers instead of admitting that no data exists.

The Solution

Replace vector retrieval with graph queries for structured data. Store hotels, amenities, and relationships in Neo4j. The LLM translates "hotels with pools and breakfast" into Cypher:

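MATCH (h:Hotel)-[:HAS_AMENITY]->(a:Amenity)
WHERE a.name IN ['pool', 'breakfast']
// Keep only hotels that matched both amenities, not just one of them
WITH h, count(DISTINCT a) AS matched
WHERE matched = 2
RETURN count(h)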

Result: 133 hotels (exact count from the database).

Out-of-domain query: "No hotels found in Antarctica" instead of fabricated results.
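
The gap between an exact count and an extrapolated one can be shown without a graph database. In this sketch an in-memory list stands in for the structured store and a slice of 3 records stands in for what a vector retriever might return. The dataset, the amenity distribution, and the function name are all hypothetical.

```python
# Hypothetical in-memory dataset standing in for the graph database:
# every hotel has a pool, every 5th hotel also offers breakfast.
hotels = [
    {"name": f"hotel-{i}", "amenities": {"pool"} | ({"breakfast"} if i % 5 == 0 else set())}
    for i in range(200)
]

def count_with_all(records: list, required: set) -> int:
    """Exact count: records whose amenities include every required one."""
    return sum(required <= r["amenities"] for r in records)

# Structured query over the full dataset.
exact = count_with_all(hotels, {"pool", "breakfast"})

# What a retriever returning only 3 chunks would let the model see.
retrieved = hotels[:3]
seen = count_with_all(retrieved, {"pool", "breakfast"})
extrapolated = round(seen / len(retrieved) * len(hotels))

print(f"Exact count: {exact}, extrapolated from 3 samples: {extrapolated}")
```

The extrapolated figure looks plausible but is wrong, which is the failure mode a structured query avoids.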

What to Tell Your AI Assistant

"Build a travel agent using GraphRAG with Neo4j. For structured 
queries (hotels, amenities, availability), translate to Cypher 
and run against the graph. Only use vector RAG for unstructured 
descriptions. Return exact counts from graph traversal."

When to Use

Structured data with relationships (products, inventory, locations)
Queries that require counts, aggregations, or multi-hop traversal
Domains where fabricated statistics create legal or financial risk

Full details: RAG vs GraphRAG: When Agents Hallucinate Answers

Learn more: Neo4j Cypher documentation

Pattern 2: Semantic Tool Selection

What Is Semantic Tool Selection?

Semantic tool selection uses vector embeddings to filter tools before the LLM sees them. When your agent has more than 10 tools, sending every description on every call raises error rates (the agent picks the wrong tools) and token costs (you pay for unused descriptions). Semantic filtering embeds the tool descriptions offline, then at runtime matches the query against them and keeps only the 5 most relevant tools, reducing errors by 86.4% and costs by 89%.

What Breaks

With 50 tools, two failures occur: (1) the agent picks the wrong tools because descriptions overlap, and (2) token costs explode because all 50 tool descriptions are sent on every LLM call.

Measured impact: error rates rise with the number of tools, and token costs scale linearly.

The Solution

Use vector embeddings to filter tools before the LLM sees them. Embed the tool descriptions offline. At runtime, embed the user query, compute similarity, and pass only the 5 most relevant tools to the agent.

Results in production:

Errors reduced: 86.4%
Token costs reduced: 89%
Latency: <10ms for tool filtering
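
The filtering step can be sketched in a few lines of plain Python. This is a toy version under loud assumptions: a bag-of-words counter stands in for a real embedding model, a linear scan stands in for a vector index, the four-tool catalog is hypothetical, and `k` is 2 instead of 5 because the catalog is so small.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" (assumption); a real system would use a sentence-embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical tool catalog: name -> description.
TOOLS = {
    "search_hotels": "search hotels by city and amenities",
    "book_flight": "book a flight between two airports",
    "get_weather": "get the weather forecast for a city",
    "convert_currency": "convert an amount between currencies",
}
# Computed once, offline, and reused for every query.
TOOL_VECTORS = {name: embed(desc) for name, desc in TOOLS.items()}

def select_tools(query: str, k: int = 2) -> list:
    """Return the names of the k tools whose descriptions are most similar to the query."""
    q = embed(query)
    ranked = sorted(TOOL_VECTORS, key=lambda name: cosine(q, TOOL_VECTORS[name]), reverse=True)
    return ranked[:k]

selected = select_tools("hotels with pool in this city")
print(selected)
```

Only the selected tool descriptions would then be passed to the agent for that call, so the prompt size depends on `k` rather than on the size of the catalog.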

What to Tell Your AI Assistant

"Build a multi-tool agent with semantic tool selection. Use FAISS 
and SentenceTransformers to embed tool descriptions offline. At 
runtime, embed the query, retrieve the 5 most similar tools, 
and pass only those to the agent. Keep conversation memory, swap tools dynamically."

When to Use

Agents with more than 10 tools
Tools with overlapping descriptions
Cost-sensitive applications

Full details: Reduce Agent Errors and Token Costs with Semantic Tool Selection

Pattern 3: Neurosymbolic Guardrails (Block)

What Are Neurosymbolic Guardrails?

Neurosymbolic guardrails enforce business rules at the framework level, below the LLM's control. When prompts alone cannot enforce constraints (max guests, valid dates, budget limits), guardrails use pre-execution hooks to validate parameters and cancel invalid operations. Rules live in code, not prompts, so the LLM cannot bypass them. Use blocking guardrails for hard constraints that cannot be violated.

What Breaks

Prompts cannot enforce business rules. Even with clear docstrings ("max_guests must be ≤10"), the LLM passes max_guests=15 under pressure because prompts are suggestions, not constraints. The agent violates rules silently.

The Fix

Use framework hooks to validate parameters before tool execution. If validation fails, cancel the tool call and return corrective guidance. Rules live in code at the framework level, below the LLM's control.

Measured impact: Zero violations in 100-query test (vs. 12 violations with prompts alone).

What to Tell Your AI Assistant

"Build a booking agent with guardrails using Strands Agents hooks. 
Create a BeforeToolCallEvent hook that validates:
- max_guests ≤ 10
- check_in_date > today
- budget > 0

If validation fails, cancel the tool call with event.cancel_tool() 
and return error message. Do not rely on prompts for validation."

When to Use

Business rules that cannot be violated (compliance, legal, financial)
Validation requiring computation (date math, inventory checks)
Rules that change frequently

Full details: AI Agent Guardrails: Rules That LLMs Cannot Bypass

Pattern 4: Runtime Guardrails (Steer, Don't Block)

What Is Steering vs Blocking?

Steering guardrails return corrective guidance instead of blocking operations. When the agent violates a soft rule (format issues, parameter adjustments, data redaction), steering returns instructions via Guide() so the agent self-corrects and retries. This differs from blocking guardrails (Pattern 3), which stop workflows entirely. Use steering for rules where the agent can fix itself, blocking for hard constraints.

What Breaks

Hard guardrails (Pattern 3) block operations and stop workflows. For soft rules where the agent can self-correct (format issues, parameter adjustments, redacting sensitive data), blocking creates friction. The agent could fix the problem itself if given guidance.

The Fix

Use Agent Control to return corrective guidance via Guide() instead of blocking. When the agent violates a soft rule, the control plane returns instructions: "Adjust parameter X to Y and retry." The agent self-corrects and completes the task without human intervention.

Difference from Pattern 3:

Block (Pattern 3): Hard constraints, workflow stops
Steer (Pattern 4): Soft rules, agent self-corrects

What to Tell Your AI Assistant

"Build a booking agent with Agent Control for soft rules. Connect 
to Agent Control server. For soft rules (parameter formatting, 
date adjustments, data redaction), return Guide() with correction 
instructions instead of blocking. Agent should retry with fix applied.

Use hard blocks (Pattern 3) only for compliance rules that cannot 
be violated under any circumstance."

When to Use

Rules where the agent can self-correct (format, adjust parameters)
Workflows where blocking creates poor UX
Rules managed centrally via API/dashboard (update without redeploying)

Full details: Runtime Guardrails for AI Agents: Steer, Don't Block

Pattern 5: Multi-Agent Validation

What Is Multi-Agent Validation?

Multi-agent validation deploys specialized agents with different roles (Executor, Validator, Critic) that cross-check each other's work. Single agents optimize for appearing successful, not for verifying outcomes. Multiple agents with different optimization functions catch errors the others miss. The Executor performs tasks, the Validator cross-checks against ground truth, and the Critic provides a final review before returning to the user.

What Breaks

Single agents cannot self-validate. When an agent books a hotel, it claims "Success: Booked Grand Plaza Hotel" even if the API returned an error or the hotel doesn't exist in the database. The agent optimizes for appearing successful, not for verifying outcomes.

The Fix

Deploy multiple agents with different roles: the Executor performs tasks, the Validator cross-checks against ground truth, and the Critic provides a final review. Agents share context and hand off control autonomously when their role completes.

Measured impact: Multi-agent catches errors that a single agent misses (e.g., booking non-existent hotels).

What to Tell Your AI Assistant

"Build a multi-agent system using Strands Swarm with 3 agents:
1. Executor: Books hotels, searches flights
2. Validator: Cross-checks operations against database
3. Critic: Final review before returning to user

Agents share context via swarm.context. Use autonomous handoffs. 
Agents decide when to hand off based on task completion."

When to Use

High-stakes operations (financial, medical, legal)
Tasks where "appears successful" differs from "actually successful"
Complex workflows with multiple verification points

Full details: How to Stop AI Agents from Hallucinating Silently with Multi-Agent Validation

Pattern 6: Memory Pointer Pattern

What Is the Memory Pointer Pattern?

The Memory Pointer Pattern stores large data outside the LLM context and passes short references instead. When tools return 200KB+ logs or 1000-row database results, passing them directly causes silent truncation. Memory pointers store data in agent.state, return a pointer to the LLM, and provide separate tools that resolve pointers to access the full data. IBM reduced 20M tokens to 1,234 tokens using this pattern.

What Breaks

Context window overflow occurs when tools return more data than the LLM can process (200KB+ logs, 1000-row database results). The agent doesn't crash. It silently truncates data, loses context, and produces incomplete answers.

Real production case (IBM Materials Science):

Before: 20 million tokens, workflow failed
After: 1,234 tokens, workflow succeeded

The Fix

Store large data in agent.state and pass short references to the LLM. Tools return pointers like "logs-app-server". Subsequent tools resolve pointers to access the full data. The LLM only sees: "Data stored as logs-app-server. Use analyze_errors(pointer)."

Data in context reduced: 214KB → 52 bytes

What to Tell Your AI Assistant

"Build a log analysis agent using Memory Pointer Pattern. When 
fetch_logs returns >20KB:
1. Store in agent.state with unique pointer ID
2. Return to LLM: 'Data stored as logs-{app}. Use analyze_logs(pointer).'
3. Implement analyze_logs(pointer) that resolves from agent.state

Never pass large data directly to LLM context."

When to Use

Tools returning large outputs (logs, database queries, files)
Workflows with multiple processing steps on the same large data
Cost-sensitive applications

Full details: AI Context Window Overflow: Memory Pointer Fix

Pattern 7: Async HandleId Pattern

What Is the Async HandleId Pattern?

The async handleId pattern prevents slow external APIs from blocking your agent. When an API takes 30+ seconds, synchronous calls freeze the entire agent. Async handleId returns a job ID immediately, letting the agent continue with other tasks. A separate check_status tool polls for results when ready. This eliminates 424 timeout errors and keeps agents responsive.

What Breaks

External APIs that take 30+ seconds block the agent indefinitely. No other tools can run. After ~7 seconds, many implementations return 424 timeout errors, freezing the workflow.

The Fix

Tools return immediately with a job ID instead of waiting. The agent stores the handleId and continues. A separate check_status(job_id) tool polls for results asynchronously.

Measured impact:

Before: 18-second API blocks agent, 424 timeout
After: Tool returns in under 1 second, agent polls when ready

What to Tell Your AI Assistant

"Build an agent with async handleId pattern for slow APIs:

1. start_analysis(data): Submit job, return job_id immediately
2. check_status(job_id): Poll for results

Agent calls start_analysis, stores job_id, continues with other 
tasks, calls check_status when ready. Do not implement blocking calls."

When to Use

External APIs with >5 second response times
Batch processing (video analysis, large transforms)
Any system outside your control

Full details: Fix MCP Timeouts: Async HandleId Pattern

Pattern 8: DebounceHook + Explicit States

What Prevents Reasoning Loops?

Reasoning loops occur when ambiguous tool feedback ("more may be available") signals that retrying might help. Two fixes work together: explicit terminal states (return SUCCESS/FAILED so the LLM knows when to stop) and DebounceHook (a framework hook that blocks duplicate calls). Production tests showed explicit states reduced calls from 14 to 2, while DebounceHook provides a safety net for edge cases.

What Breaks

Agents loop, calling the same tool repeatedly without progress. Ambiguous feedback like "Found 3 results. More may be available" signals that retrying might help. The agent loops indefinitely.

Real production case: 847 reasoning steps at $47/minute, no answer delivered.

The Fix (Two Parts)

Part A: Explicit Terminal States

Return clear SUCCESS or FAILED states. Change "More may be available" to "SUCCESS: Found all 3 matching flights."
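A minimal sketch of Part A, assuming a hypothetical search_flights tool and a small in-memory flight list (neither comes from the demo). The point is only that every return string starts with an explicit terminal state, so the LLM has no ambiguous hint that a retry might find more.

```python
from typing import List

# Hypothetical in-memory data standing in for a real flight API.
FLIGHTS = [
    {"id": "F1", "route": "MIA-JFK"},
    {"id": "F2", "route": "MIA-JFK"},
    {"id": "F3", "route": "MIA-JFK"},
    {"id": "F4", "route": "MIA-LAX"},
]


def search_flights(route: str) -> str:
    """Return a tool result that always starts with an explicit terminal state."""
    matches: List[str] = [f["id"] for f in FLIGHTS if f["route"] == route]
    if not matches:
        # A clear failure tells the model that repeating the same call will not help.
        return f"FAILED: No flights found for {route}. Do not retry with the same input."
    return f"SUCCESS: Found all {len(matches)} matching flights: {', '.join(matches)}."
```

Calling search_flights("MIA-JFK") reports all three matches in a single SUCCESS message, which is the signal the agent needs to stop searching.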

Part B: DebounceHook Safety Net

A framework hook tracks recent tool calls. When the same (tool_name, input) pair has already appeared twice, it blocks the third attempt.

Measured impact (travel booking demo):

Ambiguous feedback: 14 calls
Explicit SUCCESS: 2 calls (7x reduction)
DebounceHook: 12 calls (2 blocked)

What to Tell Your AI Assistant

"Build a travel agent with anti-loop protection:

1. All tools return explicit states:
   - SUCCESS: [clear completion]
   - FAILED: [clear error]
   Never return 'more may be available'

2. Implement DebounceHook:
   - Track last 3 tool calls as (tool_name, input)
   - If same pair appears twice, block third attempt
   - Return 'BLOCKED: Duplicate detected'

This prevents loops without manual retry limits."

When to Use

Agents prone to retry loops (search, API aggregators)
Cost-sensitive applications where unbounded retries are expensive
Production systems where infinite loops create availability risk

Full details: How to Prevent AI Agent Reasoning Loops from Wasting Tokens

Common Mistakes

Mistake 1: Assuming Defaults Are Best Practices

Problem: "Build a production agent" assumes the assistant knows what production means.

Fix: Specify patterns: "Use GraphRAG, guardrails, async patterns, etc."

Mistake 2: Relying Only on Prompts for Validation

Problem: "Make sure max_guests < 10" in the system prompt is ignored under pressure.

Fix: "Implement a BeforeToolCallEvent hook that validates and cancels invalid calls."

Mistake 3: Not Recognizing When Patterns Apply

Problem: The agent works in the demo and breaks on edge cases.

Fix: Know the 8 patterns. When you see hallucinations, timeouts, or loops, you will recognize which pattern solves it.

What This Means for AI-Assisted Development

AI assistants will keep getting better at generating working code. But working code and production-ready architecture remain different goals.

The gap is not the assistant's capability. It is the specificity of the prompt.

When you write "build a booking agent," the assistant optimizes for code that compiles and responds to queries.

When you write "build a booking agent using GraphRAG for structured queries, guardrails for validation, and async patterns for booking APIs," the assistant optimizes for code that compiles, responds to queries, prevents hallucinations, enforces business rules, and handles slow APIs.

These 8 patterns are the vocabulary for communicating production intent.

You don't implement all 8. You learn what they solve. When you see hallucinations, you recognize that GraphRAG applies. When you see timeouts, you recognize that async handleId applies. When you see loops, you recognize that explicit states + DebounceHook apply.

This knowledge changes how you prompt Kiro, Claude Code, Cursor, and ChatGPT. Instead of debugging black-box failures in production, you specify the patterns that prevent them during generation.

Learn More (Full Implementation Guides)

Each pattern has a full guide with working code:

GraphRAG: RAG vs GraphRAG: When Agents Hallucinate Answers
Semantic Tool Selection: Reduce Agent Errors and Token Costs
Neurosymbolic Guardrails: AI Agent Guardrails: Rules That LLMs Cannot Bypass
Runtime Guardrails (Steering): Runtime Guardrails for AI Agents: Steer, Don't Block
Multi-Agent Validation: Stop AI Agents from Hallucinating Silently
Memory Pointers: AI Context Window Overflow: Memory Pointer Fix
Async HandleId: Fix MCP Timeouts: Async HandleId Pattern
DebounceHook: Prevent AI Agent Reasoning Loops

Closing

Every pattern in this post exists because something broke in production. Agents that hallucinated statistics in customer demos. Loops that burned tokens at $47/minute. Context overflows that truncated critical data. Timeouts that froze workflows.

Now you know what breaks and how to prevent it by prompting correctly.

When you ask Kiro, Claude Code, or ChatGPT to build an agent, you can specify which patterns apply. That is the difference between prototypes that break and agents that scale.

Use it.

Thanks!

Prompt AI Coding Assistants to Build Production-Ready Agents: 8 Essential Patterns

Elizabeth Fuentes L — Mon, 11 May 2026 07:00:00 +0000

When you ask an AI assistant like Kiro (AWS's AI coding assistant), Claude Code, or ChatGPT to "build me an agent," you get working code. But you don't see the architecture decisions happening behind the scenes. The agent responds to queries, but it might waste tokens in reasoning loops, hallucinate answers from incomplete data, or freeze on slow APIs. These failures are silent until production.

When you prompt AI coding assistants to build agents, they make architecture decisions silently—choosing retrieval strategies, validation approaches, and error handling patterns. These 8 patterns give you the vocabulary to specify production-grade decisions in your prompts, preventing hallucinations and token waste before code is generated.

This post closes two series I wrote documenting the most expensive agent failures in production: Stop AI Agent Hallucinations (5 techniques) and Why AI Agents Fail (3 failure modes). If you know these 8 patterns, you can guide AI assistants to avoid them from the start.

This isn't a step-by-step implementation guide. It's a reference for knowing what exists so you can recognize when to use each pattern based on your use case.

Working code for all 8 techniques: Linked in each section

Why This Matters

AI coding assistants generate agent code in seconds. Kiro, Claude Code, Cursor, and ChatGPT can scaffold tools, configure LLM calls, and wire up retrieval systems faster than manual coding.

But speed creates a problem: you get working code without seeing the tradeoffs.

When you prompt "build a booking agent with RAG," the assistant makes decisions:

Which retrieval strategy? (vector similarity, graph queries, hybrid)
How to handle large outputs? (truncate, summarize, external storage)
What validation runs before tool execution? (none, prompts, framework hooks)
How to handle slow APIs? (block, timeout, async patterns)

Your prompt doesn't specify these. The assistant picks defaults. Those defaults create the failure modes this post documents.

The 8 Failure Patterns

Hallucination Failures (5 patterns):

GraphRAG - Vector RAG fabricates statistics from incomplete chunks
Semantic Tool Selection - Too many tools, agent picks wrong ones
Neurosymbolic Guardrails - Agent ignores business rules in prompts
Runtime Guardrails (Steering) - Agent violates rules, needs correction not blocking
Multi-Agent Validation - Single agent claims success when operations fail

Silent Token Waste (3 patterns):

Memory Pointer Pattern - Large data overflows context, causes truncation
Async HandleId Pattern - Slow APIs block agent indefinitely
DebounceHook + Explicit States - Agent loops same tool call without progress

You don't implement all 8. You learn what they solve, then specify the ones your use case needs when prompting.

What Are These 8 Patterns?

These patterns solve the most expensive production failures: hallucinations from incomplete data (GraphRAG, Semantic Tool Selection, Guardrails, Steering, Multi-Agent), and silent token waste (Memory Pointers, Async HandleId, DebounceHook). You learn what each solves, then specify the ones your use case needs when prompting AI assistants. This prevents debugging black-box code in production.

Measured Impact from Production

Pattern	Result	Source
GraphRAG	Exact counts vs fabricated approximations	RAG vs GraphRAG
Semantic Tool Selection	86.4% fewer errors, 89% lower token costs	Tool Selection
Memory Pointers	20M tokens reduced to 1,234 tokens	IBM Materials Science study
Async HandleId	18-second block eliminated, no 424 timeouts	MCP Timeouts
Explicit States	14 calls reduced to 2 (7x improvement)	Reasoning Loops

Pattern 1: GraphRAG for Precise Queries

What Is GraphRAG?

GraphRAG replaces vector similarity with graph database queries for structured data. When your agent needs exact counts, aggregations, or relationship traversal, GraphRAG translates natural language to Cypher queries that return precise results from structured data instead of hallucinated statistics from incomplete text chunks. Use it for structured queries, keep vector RAG for semantic search.

What Breaks

Vector RAG fabricates statistics. Ask "How many hotels in Miami have pools and breakfast?" and vector similarity retrieves 3 text chunks mentioning Miami, pools and breakfast. The LLM sees incomplete data, calculates from samples, and returns "approximately 120 hotels" (fabricated from 3 chunks out of 200 hotels).

Out-of-domain queries return hallucinated answers instead of admitting no data exists.

The Fix

Replace vector retrieval with graph queries for structured data. Store hotels, amenities, and relationships in Neo4j. The LLM translates "hotels with pools and breakfast" into Cypher:

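// Match hotels linked to either amenity, then keep only those that have both
MATCH (h:Hotel)-[:HAS_AMENITY]->(a:Amenity)
WHERE a.name IN ['pool', 'breakfast']
WITH h, count(DISTINCT a.name) AS matched
WHERE matched = 2
RETURN count(h) AS hotels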

Result: 133 hotels (exact count from database).

Out-of-domain query: "No hotels found in Antarctica" instead of fabricating results.
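To make the "exact count from traversal" idea concrete, here is a small sketch that uses an in-memory dict as a stand-in for the graph. The hotel names and amenity sets are invented for illustration, and a real system would run Cypher against Neo4j instead. It also shows a subtlety worth checking in generated queries: an IN filter alone matches hotels with either amenity, while "pools and breakfast" needs both.

```python
# Hypothetical stand-in for (h:Hotel)-[:HAS_AMENITY]->(a:Amenity) edges.
HAS_AMENITY = {
    "Hotel A": {"pool", "breakfast"},
    "Hotel B": {"pool"},
    "Hotel C": {"breakfast", "gym"},
    "Hotel D": {"pool", "breakfast", "gym"},
}


def count_hotels_with_all(required: set) -> int:
    """Exact count: visit every hotel and keep those with all required amenities."""
    return sum(1 for amenities in HAS_AMENITY.values() if required <= amenities)


def count_hotels_with_any(required: set) -> int:
    """What an IN filter alone counts: hotels with at least one of the amenities."""
    return sum(1 for amenities in HAS_AMENITY.values() if required & amenities)
```

An out-of-domain request such as count_hotels_with_all({"igloo"}) returns 0, which the agent can report as "no hotels found" rather than inventing a number.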

What to Tell Your AI Assistant

"Build a travel agent using GraphRAG. For structured 
queries (hotels, amenities, availability), translate to Cypher 
and execute against the graph. Only use vector RAG for unstructured descriptions. Return exact counts from graph traversal."

When to Use

Structured data with relationships (products, inventory, locations)
Queries requiring counts, aggregations, or multi-hop traversal
Domains where fabricating statistics creates legal/financial risk

Full details: RAG vs GraphRAG: When Agents Hallucinate Answers

Learn more: Neo4j Cypher Documentation

Pattern 2: Semantic Tool Selection

What Is Semantic Tool Selection?

Semantic tool selection uses vector embeddings to filter tools before the LLM sees them. When your agent has 10+ tools, sending all descriptions on every call increases error rates (agent picks wrong tools) and token costs (paying for unused descriptions). Semantic filtering embeds tool descriptions offline, then at runtime matches the query to top-5 relevant tools, reducing errors by 86.4% and costs by 89%.

What Breaks

With 50 tools, two failures occur: (1) agent picks wrong tools because descriptions overlap, and (2) token costs explode from sending all 50 tool descriptions on every LLM call.

Measured impact: Error rates increase with tool count, token costs scale linearly.

The Fix

Use vector embeddings to filter tools before the LLM sees them. Embed tool descriptions offline. At runtime, embed the user query, compute similarity, pass only top-5 relevant tools to the agent.

Results from production:

Errors reduced: 86.4%
Token costs reduced: 89%
Latency: <10ms for tool filtering
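The embed-offline, match-at-runtime flow can be sketched with the standard library only. This is an illustration under stated assumptions: the tool catalog is hypothetical, and the bag-of-words embed function is a toy stand-in for a real embedding model and vector index, so it shows the filtering logic rather than production retrieval quality.

```python
import math
import re
from collections import Counter

# Hypothetical tool catalog; names and descriptions are illustrative only.
TOOLS = {
    "search_flights": "search flights between two airports on a date",
    "book_hotel": "book a hotel room for guests and dates",
    "get_weather": "get the weather forecast for a city",
    "convert_currency": "convert an amount between two currencies",
}


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words term counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# "Offline" step: embed every tool description once, before any query arrives.
TOOL_VECTORS = {name: embed(desc) for name, desc in TOOLS.items()}


def select_tools(query: str, k: int = 5) -> list:
    """Runtime step: embed the query and return the k most similar tool names."""
    q = embed(query)
    ranked = sorted(TOOL_VECTORS, key=lambda n: cosine(q, TOOL_VECTORS[n]), reverse=True)
    return ranked[:k]
```

Only the names returned by select_tools would be passed to the agent for that turn, so the LLM never sees the descriptions of the tools that were filtered out.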

What to Tell Your AI Assistant

"Build a multi-tool agent with semantic tool selection. Use FAISS 
and SentenceTransformers to embed tool descriptions offline. At 
runtime, embed the query, retrieve top-5 similar tools, pass only 
those to the agent. Keep conversation memory, dynamically swap tools."

When to Use

Agents with 10+ tools
Tools with overlapping descriptions
Cost-sensitive applications

Full details: Reduce Agent Errors and Token Costs with Semantic Tool Selection

Pattern 3: Neurosymbolic Guardrails (Block)

What Are Neurosymbolic Guardrails?

Neurosymbolic guardrails enforce business rules at the framework level, below the LLM's control. When prompts alone cannot enforce constraints (max guests, valid dates, budget limits), guardrails use pre-execution hooks to validate parameters and cancel invalid operations. Rules live in code, not prompts, so the LLM cannot bypass them. Use blocking guardrails for hard constraints that cannot be violated.

What Breaks

Prompts cannot enforce business rules. Even with clear docstrings ("max_guests must be ≤10"), the LLM passes max_guests=15 under pressure because prompts are suggestions, not constraints. The agent violates rules silently.

The Fix

Use framework hooks to validate parameters before tool execution. If validation fails, cancel the tool call and return corrective guidance. Rules live in code at the framework level, below the LLM's control.

Measured impact: Zero violations in 100-query test (vs. 12 violations with prompts alone).
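The idea of rules living in code can be sketched in plain Python. This is not the Strands Agents hook API: guarded_call and book_hotel are hypothetical names, and in a real agent the same check would run inside a pre-execution hook rather than a wrapper function. What the sketch shows is that the tool body never runs when a rule fails.

```python
from datetime import date
from typing import Callable, Optional

MAX_GUESTS = 10


def validate_booking(params: dict, today: Optional[date] = None) -> Optional[str]:
    """Return None if the call is allowed, otherwise a corrective message."""
    today = today or date.today()
    if params["max_guests"] > MAX_GUESTS:
        return f"BLOCKED: max_guests must be <= {MAX_GUESTS}, got {params['max_guests']}."
    if params["check_in_date"] <= today:
        return "BLOCKED: check_in_date must be after today."
    if params["budget"] <= 0:
        return "BLOCKED: budget must be greater than 0."
    return None


def guarded_call(tool: Callable[..., str], params: dict, today: Optional[date] = None) -> str:
    """Validate before execution; cancel the call if any rule fails."""
    error = validate_booking(params, today)
    if error is not None:
        return error  # the tool is never executed
    return tool(**params)


def book_hotel(max_guests: int, check_in_date: date, budget: float) -> str:
    """Hypothetical tool body."""
    return f"SUCCESS: Booked for {max_guests} guests on {check_in_date.isoformat()}."
```

Because the check runs outside the model, a request with max_guests=15 is rejected regardless of how the prompt was phrased.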

What to Tell Your AI Assistant

"Build a booking agent with guardrails using Strands Agents hooks. 
Create a BeforeToolCallEvent hook that validates:
- max_guests ≤ 10
- check_in_date > today
- budget > 0

If validation fails, cancel the tool call with event.cancel_tool() 
and return error message. Do not rely on prompts for validation."

When to Use

Business rules that cannot be violated (compliance, legal, financial)
Validation requiring computation (date math, inventory checks)
Rules that change frequently

Full details: AI Agent Guardrails: Rules That LLMs Cannot Bypass

Pattern 4: Runtime Guardrails (Steer, Don't Block)

What Is Steering vs Blocking?

Steering guardrails return corrective guidance instead of blocking operations. When the agent violates a soft rule (format issues, parameter adjustments, data redaction), steering returns instructions via Guide() so the agent self-corrects and retries. This differs from blocking guardrails (Pattern 3) which stop workflows entirely. Use steering for rules where the agent can fix itself, blocking for hard constraints.

What Breaks

Hard guardrails (Pattern 3) block operations and stop workflows. For soft rules where the agent can self-correct (format issues, parameter adjustments, redacting sensitive data), blocking creates friction. The agent could fix the problem itself if given guidance.

The Fix

Use Agent Control to return corrective guidance via Guide() instead of blocking. When the agent violates a soft rule, the control plane returns instructions: "Adjust parameter X to Y and retry." The agent self-corrects and completes the task without human intervention.

Difference from Pattern 3:

Block (Pattern 3): Hard constraints, workflow stops
Steer (Pattern 4): Soft rules, agent self-corrects
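A rough sketch of the steer-and-retry loop, with several assumptions: Guidance is a hypothetical dataclass and not the Agent Control Guide() API, the soft rule is an invented date-format check, and the "agent" is simulated by applying the suggested correction directly. In a real system the LLM would read the instruction and issue the corrected call itself.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Guidance:
    """Hypothetical steering response; not the Agent Control API."""
    instruction: str
    corrected_value: str


def check_date_format(value: str) -> Optional[Guidance]:
    """Soft rule: dates must be YYYY-MM-DD. Return guidance instead of blocking."""
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return None
    m = re.fullmatch(r"(\d{2})/(\d{2})/(\d{4})", value)  # assumes MM/DD/YYYY input
    if m:
        month, day, year = m.groups()
        fixed = f"{year}-{month}-{day}"
        return Guidance(f"Adjust check_in_date to {fixed} and retry.", fixed)
    return Guidance("Use the YYYY-MM-DD format for check_in_date and retry.", value)


def run_with_steering(tool: Callable[[str], str], value: str, max_retries: int = 2) -> str:
    """Simulate the agent applying guidance and retrying instead of stopping."""
    for _ in range(max_retries + 1):
        guidance = check_date_format(value)
        if guidance is None:
            return tool(value)
        if guidance.corrected_value == value:
            break  # no automatic correction is available
        value = guidance.corrected_value
    return "FAILED: could not correct check_in_date."
```

The contrast with Pattern 3 is in the return path: a blocking guardrail would stop at the first violation, while this loop gives the caller one more attempt with the corrected value.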

What to Tell Your AI Assistant

"Build a booking agent with Agent Control for soft rules. Connect 
to Agent Control server. For soft rules (parameter formatting, 
date adjustments, data redaction), return Guide() with correction 
instructions instead of blocking. Agent should retry with fix applied.

Use hard blocks (Pattern 3) only for compliance rules that cannot 
be violated under any circumstance."

When to Use

Rules where agent can self-correct (format, adjust parameters)
Workflows where blocking creates poor UX
Rules managed centrally via API/dashboard (update without redeploying)

Full details: Runtime Guardrails for AI Agents: Steer, Don't Block

Pattern 5: Multi-Agent Validation

What Is Multi-Agent Validation?

Multi-agent validation deploys specialized agents with different roles (Executor, Validator, Critic) that cross-check each other's work. Single agents optimize for appearing successful, not verifying outcomes. Multiple agents with different optimization functions catch errors the others miss. Executor performs tasks, Validator cross-checks against ground truth, Critic provides final review before returning to the user.

What Breaks

Single agents cannot self-validate. When an agent books a hotel, it claims "Success: Booked Grand Plaza Hotel" even if the API returned an error or the hotel doesn't exist in the database. The agent optimizes for appearing successful, not verifying outcomes.

The Fix

Deploy multiple agents with different roles: Executor performs tasks, Validator cross-checks against ground truth, Critic provides final review. Agents share context and hand off control autonomously when their role completes.

Measured impact: Multi-agent catches errors single agent misses (e.g., booking non-existent hotels).
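The role separation can be illustrated with three plain functions. This is a simplification under stated assumptions: it is a fixed pipeline rather than autonomous handoffs, it does not use the Strands Swarm API, and HOTEL_DB is a hypothetical ground-truth set. The key detail is that the Validator checks the database, not the Executor's success message.

```python
# Hypothetical ground-truth database of hotels that actually exist.
HOTEL_DB = {"Grand Plaza Hotel", "Seaside Inn"}


def executor(hotel: str) -> dict:
    """Executor: attempts the booking and reports what it believes happened."""
    return {"hotel": hotel, "claim": f"Success: Booked {hotel}"}


def validator(result: dict) -> dict:
    """Validator: cross-checks against the database, ignoring the claim text."""
    checked = dict(result)
    checked["verified"] = result["hotel"] in HOTEL_DB
    return checked


def critic(result: dict) -> str:
    """Critic: final review; only verified results reach the user."""
    if result["verified"]:
        return result["claim"]
    return f"FAILED: {result['hotel']} does not exist in the database."


def run_pipeline(hotel: str) -> str:
    """Run Executor, Validator, and Critic in sequence."""
    return critic(validator(executor(hotel)))
```

A booking for a hotel that is not in HOTEL_DB produces a confident claim from the Executor, but the user only sees the FAILED message from the Critic.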

What to Tell Your AI Assistant

"Build a multi-agent system using Strands Swarm with 3 agents:
1. Executor: Books hotels, searches flights
2. Validator: Cross-checks operations against database
3. Critic: Final review before returning to user

Agents share context via swarm.context. Use autonomous handoffs. 
Agents decide when to hand off based on task completion."

When to Use

High-stakes operations (financial, medical, legal)
Tasks where "appears successful" differs from "actually successful"
Complex workflows with multiple verification points

Full details: How to Stop AI Agents from Hallucinating Silently with Multi-Agent Validation

Pattern 6: Memory Pointer Pattern

What Is the Memory Pointer Pattern?

The Memory Pointer Pattern stores large data outside the LLM context and passes short references instead. When tools return 200KB+ logs or 1000-row database results, passing them directly causes silent truncation. Memory pointers store data in agent.state, return a pointer to the LLM, and provide separate tools that resolve pointers to access full data. IBM reduced 20M tokens to 1,234 tokens using this pattern.

What Breaks

Context window overflow occurs when tools return more data than the LLM can process (200KB+ logs, 1000-row database results). The agent doesn't crash. It silently truncates data, loses context, produces incomplete answers.

Real production case (IBM Materials Science):

Before: 20 million tokens, workflow failed
After: 1,234 tokens, workflow succeeded

The Fix

Store large data in agent.state, pass short references to the LLM. Tools return pointers like "logs-app-server". Subsequent tools resolve pointers to access full data. LLM only sees: "Data stored as logs-app-server. Use analyze_errors(pointer)."

Data in context reduced: 214KB → 52 bytes
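A minimal sketch of the store-and-resolve flow, assuming a plain dict named STATE as a stand-in for agent.state and synthetic log lines generated in place of a real log source. The function names follow the ones mentioned above; everything else is illustrative.

```python
# Hypothetical stand-in for agent.state: a dict kept outside the LLM context.
STATE = {}


def fetch_logs(app: str) -> str:
    """Store the large payload in state and return only a short pointer message."""
    logs = "\n".join(
        f"2026-05-11 12:00:{i % 60:02d} {'ERROR' if i % 50 == 0 else 'INFO'} request {i}"
        for i in range(5000)
    )  # synthetic log data for this sketch
    pointer = f"logs-{app}"
    STATE[pointer] = logs
    return f"Data stored as {pointer}. Use analyze_errors(pointer)."


def analyze_errors(pointer: str) -> str:
    """Resolve the pointer and return a small summary instead of the raw data."""
    logs = STATE.get(pointer)
    if logs is None:
        return f"FAILED: unknown pointer {pointer}."
    errors = sum(1 for line in logs.splitlines() if " ERROR " in line)
    return f"SUCCESS: {errors} ERROR lines in {pointer}."
```

Both return values are short strings, so the full log text stays in STATE and only the pointer message and the summary would ever enter the model's context.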

What to Tell Your AI Assistant

"Build a log analysis agent using Memory Pointer Pattern. When 
fetch_logs returns >20KB:
1. Store in agent.state with unique pointer ID
2. Return to LLM: 'Data stored as logs-{app}. Use analyze_logs(pointer).'
3. Implement analyze_logs(pointer) that resolves from agent.state

Never pass large data directly to LLM context."

When to Use

Tools returning large outputs (logs, database queries, files)
Workflows with multiple processing steps on same large data
Cost-sensitive applications

Full details: AI Context Window Overflow: Memory Pointer Fix

Pattern 7: Async HandleId Pattern

What Is the Async HandleId Pattern?

The async handleId pattern prevents slow external APIs from blocking your agent. When an API takes 30+ seconds, synchronous calls freeze the entire agent. Async handleId returns a job ID immediately, letting the agent continue with other tasks. A separate check_status tool polls for results when ready. This eliminates 424 timeout errors and keeps agents responsive.

What Breaks

External APIs that take 30+ seconds block the agent indefinitely. No other tools can run. After ~7 seconds, many implementations return 424 timeout errors, freezing the workflow.

The Fix

Tools return immediately with a job ID instead of waiting. Agent stores handleId and continues. Separate check_status(job_id) tool polls for results asynchronously.

Measured impact:

Before: 18-second API blocks agent, 424 timeout
After: Tool returns <1 second, agent polls when ready

What to Tell Your AI Assistant

"Build an agent with async handleId pattern for slow APIs:

1. start_analysis(data): Submit job, return job_id immediately
2. check_status(job_id): Poll for results

Agent calls start_analysis, stores job_id, continues with other 
tasks, calls check_status when ready. Do not implement blocking calls."

When to Use

External APIs with >5 second response times
Batch processing (video analysis, large transforms)
Any system outside your control

Full details: Fix MCP Timeouts: Async HandleId Pattern

Pattern 8: DebounceHook + Explicit States

What Prevents Reasoning Loops?

Reasoning loops occur when ambiguous tool feedback ("more may be available") signals that retrying might help. Two fixes work together: explicit terminal states (return SUCCESS/FAILED so the LLM knows when to stop) and DebounceHook (framework hook that blocks duplicate calls). Production tests showed explicit states reduced calls from 14 to 2, while DebounceHook provides a safety net for edge cases.

What Breaks

Agents loop calling the same tool repeatedly without progress. Ambiguous feedback like "Found 3 results. More may be available" signals that retrying might help. The agent loops indefinitely.

Real production case: 847 reasoning steps at $47/minute, no answer delivered.

The Fix (Two Parts)

Part A: Explicit Terminal States

Return clear SUCCESS or FAILED states. Change "More may be available" to "SUCCESS: Found all 3 matching flights."
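One way to make terminal states hard to forget is to route every tool return through a small result type. The sketch below is an assumption for illustration: `Status`, `ToolResult`, and `search_result` are hypothetical names, not part of any framework.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"


@dataclass(frozen=True)
class ToolResult:
    """A tool outcome that always carries an explicit terminal state."""
    status: Status
    detail: str

    def render(self) -> str:
        # The string the LLM sees always starts with SUCCESS or FAILED.
        return f"{self.status.value}: {self.detail}"


def search_result(matches: list[str]) -> str:
    """Format a search outcome with no 'more may be available' hint."""
    if not matches:
        return ToolResult(Status.FAILED, "No matching flights found.").render()
    detail = f"Found all {len(matches)} matching flights."
    return ToolResult(Status.SUCCESS, detail).render()
```

For three matches this renders the same message the text recommends: "SUCCESS: Found all 3 matching flights."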

Part B: DebounceHook Safety Net

A framework hook tracks recent tool calls. When the same (tool_name, input) pair has already appeared twice, the hook blocks the third attempt.

Measured impact (travel booking demo):

Ambiguous feedback: 14 calls
Explicit SUCCESS: 2 calls (7x reduction)
DebounceHook: 12 calls (2 blocked)

What to Tell Your AI Assistant

"Build a travel agent with anti-loop protection:

1. All tools return explicit states:
   - SUCCESS: [clear completion]
   - FAILED: [clear error]
   Never return 'more may be available'

2. Implement DebounceHook:
   - Track last 3 tool calls as (tool_name, input)
   - If same pair appears twice, block third attempt
   - Return 'BLOCKED: Duplicate detected'

This prevents loops without manual retry limits."

When to Use

Agents prone to retry loops (search, API aggregators)
Cost-sensitive applications where unbounded retries are expensive
Production systems where infinite loops create availability risk

Full details: How to Prevent AI Agent Reasoning Loops from Wasting Tokens

Example: Generic vs Informed Prompting

❌ Generic Prompt

"Build a customer support agent that searches our knowledge base 
and books appointments"

What you get:

Vector RAG (may hallucinate on structured queries)
Synchronous booking API (may timeout)
No validation (can book invalid times)
Single agent (claims success even when booking fails)

Result: Works in demo, fails in production.

✅ Informed Prompt

"Build a customer support agent:

Knowledge Base:
- Use Neo4j GraphRAG for structured queries (pricing, features)
- Use vector RAG only for semantic search (descriptions)

Booking:
- Validate appointment_time > now() before booking
- Use async handleId for booking API (10+ seconds)
- Return explicit states: SUCCESS / FAILED

Validation:
- Multi-agent: Executor (search/book), Validator (cross-check), 
  Critic (final review)
- Use Strands Swarm for autonomous handoffs

Loop Prevention:
- DebounceHook blocks duplicate calls
- All tools return terminal states"

What you get:

GraphRAG prevents hallucinations
Async prevents timeouts
Guardrails prevent invalid bookings
Multi-agent catches false successes
DebounceHook prevents loops

Result: Production-ready agent.

Common Mistakes

Mistake 1: Assuming Defaults Are Best Practices

Problem: "Build a production agent" assumes the assistant knows what production means.

Fix: Specify patterns: "Use GraphRAG, guardrails, async patterns."

Mistake 2: Relying Only on Prompts for Validation

Problem: "Make sure max_guests < 10" in system prompt gets ignored under pressure.

Fix: "Implement BeforeToolCallEvent hook that validates and cancels invalid calls."
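The validation itself can be plain code that runs before the tool does. The sketch below is framework-free and rests on assumptions: `book_table` is a hypothetical tool name, and `validate_booking` returns either a cancellation message or None. In Strands, a BeforeToolCallEvent callback could assign that message to event.cancel_tool, the cancellation mechanism used by the hooks in this series.

```python
from typing import Optional

MAX_GUESTS = 10  # rule from the example: max_guests must stay below 10


def validate_booking(tool_name: str, tool_input: dict) -> Optional[str]:
    """Return a cancellation message if the call breaks a rule, else None."""
    if tool_name != "book_table":  # hypothetical tool name
        return None
    guests = tool_input.get("max_guests")
    if not isinstance(guests, int) or guests < 1:
        return "BLOCKED: max_guests must be a positive integer"
    if guests >= MAX_GUESTS:
        return f"BLOCKED: max_guests must be below {MAX_GUESTS}, got {guests}"
    return None
```

Because the check is ordinary code rather than a prompt instruction, the model cannot talk its way past it.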

Mistake 3: Not Recognizing When Patterns Apply

Problem: Agent works in demo, breaks on edge cases.

Fix: Know the 8 patterns. When you see hallucinations, timeouts, or loops, you'll recognize which pattern solves it.

My Thoughts

AI coding assistants will keep improving at generating working code. But working code and production-ready architecture remain different targets.

The gap isn't the assistant's capability. It's the prompt's specificity.

Next Steps

If You're Building a New Agent

Identify which patterns apply (use symptom checklist)
Specify patterns in your prompt
Verify generated code implements them
Test failure modes (timeouts, invalid inputs, non-existent data)

If You're Debugging an Existing Agent

Identify the symptom (hallucinations, loops, timeouts, rule violations)
Map symptom to pattern (see Step 1: Recognize the Symptom)
Prompt your assistant to add the pattern: "Add DebounceHook to prevent loops"
Verify fix with targeted tests

Learn More (Full Implementation Guides)

Each pattern has a complete guide with working code:

GraphRAG: RAG vs GraphRAG: When Agents Hallucinate Answers
Semantic Tool Selection: Reduce Agent Errors and Token Costs
Neurosymbolic Guardrails: AI Agent Guardrails: Rules That LLMs Cannot Bypass
Runtime Guardrails (Steering): Runtime Guardrails for AI Agents: Steer, Don't Block
Multi-Agent Validation: Stop AI Agents from Hallucinating Silently
Memory Pointers: AI Context Window Overflow: Memory Pointer Fix
Async HandleId: Fix MCP Timeouts: Async HandleId Pattern
DebounceHook: Prevent AI Agent Reasoning Loops

Complete series:

Gracias!


Elizabeth Fuentes L — Sat, 09 May 2026 00:58:52 +0000

Elizabeth Fuentes L for AWS

May 4

How to Prevent AI Agent Reasoning Loops from Wasting Tokens


Why AI Agents Fail: 3 Failure Modes That Cost Tokens and Time

Elizabeth Fuentes L — Fri, 08 May 2026 23:19:28 +0000

AI agents don't fail like traditional software: they don't crash with a stack trace. They fail silently: they return incomplete answers, freeze on slow APIs, or burn tokens calling the same tool over and over. The agent appears to work, but the output is wrong, late, or expensive.

This series covers the three most common failure modes with research-backed solutions. Each technique has a runnable demo that measures the before/after difference.

Working code: github.com/aws-samples/sample-why-agents-fail

The demos use Strands Agents with OpenAI (GPT-4o-mini). The patterns are framework-agnostic: they apply to LangGraph, AutoGen, CrewAI, or any framework that supports tool calls and lifecycle hooks.

This Series: 3 Essential Fixes

Context Window Overflow: Memory Pointer Pattern for large data
MCP Tools That Never Respond: Async handleId pattern for slow external APIs
AI Agent Reasoning Loops: DebounceHook + clear tool states to block repeated calls

What Happens When Tool Outputs Overflow the Context Window?

Context window overflow occurs when a tool returns more data than the LLM can process: server logs, database results, or file contents that exceed the token limit. The agent does not fail with an error. It degrades silently: it truncates data, loses context, or produces incomplete answers.

IBM research quantifies this: a Materials Science workflow consumed more than 20 million tokens and failed. The same workflow with memory pointers used 1,234 tokens and succeeded.

The fix is the Memory Pointer Pattern: store large data in agent.state and return a short pointer to the context. The next tool resolves the pointer to access the full data:

from strands import tool, ToolContext

@tool(context=True)
def fetch_application_logs(app_name: str, tool_context: ToolContext, hours: int = 24) -> str:
    """Fetch logs. Stores large data as a pointer to avoid context overflow."""
    logs = generate_logs(app_name, hours)  # Could be 200KB+

    if len(str(logs)) > 20_000:
        pointer = f"logs-{app_name}"
        tool_context.agent.state.set(pointer, logs)
        return f"Data stored as pointer '{pointer}'. Use the analysis tools to query it."
    return str(logs)

@tool(context=True)
def analyze_error_patterns(data_pointer: str, tool_context: ToolContext) -> str:
    """Analyze errors: resolves the pointer from agent.state."""
    data = tool_context.agent.state.get(data_pointer)
    if data is None:
        # Guard against a pointer the LLM mistyped or that was never stored
        return f"FAILED: pointer '{data_pointer}' not found"
    errors = [e for e in data if e["level"] == "ERROR"]
    return f"Found {len(errors)} errors across {len(set(e['service'] for e in errors))} services"

The LLM never sees the 200KB: it sees only a short pointer message such as "Data stored as pointer 'logs-payment-service'" (52 bytes in the demo's measurement).
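The same idea can be tried without any agent framework. The sketch below is an illustration under stated assumptions: a module-level dict named `STATE` stands in for agent.state, and `store_or_inline` and `count_errors` are hypothetical helpers, not Strands APIs. The log records are synthetic.

```python
# Hypothetical stand-in for agent.state: pointer name -> full data.
STATE: dict[str, list[dict]] = {}
THRESHOLD = 20_000  # characters; same cutoff as the example above


def store_or_inline(name: str, records: list[dict]) -> str:
    """Return the data itself if small, otherwise store it and return a pointer."""
    text = str(records)
    if len(text) > THRESHOLD:
        pointer = f"logs-{name}"
        STATE[pointer] = records
        return f"Data stored as pointer '{pointer}'."
    return text


def count_errors(pointer: str) -> str:
    """Resolve the pointer and summarize; only the summary enters the context."""
    records = STATE.get(pointer)
    if records is None:
        return f"FAILED: unknown pointer '{pointer}'"
    errors = [r for r in records if r["level"] == "ERROR"]
    services = {r["service"] for r in errors}
    return f"SUCCESS: {len(errors)} errors across {len(services)} services"


# Synthetic logs: 2,000 records, every 10th one an ERROR, spread over 3 services.
logs = [
    {"level": "ERROR" if i % 10 == 0 else "INFO", "service": f"svc-{i % 3}", "msg": "x" * 50}
    for i in range(2000)
]
message = store_or_inline("payment-service", logs)
summary = count_errors("logs-payment-service")
```

Both `message` and `summary` are short strings, while the full records stay in `STATE` and never pass through the model.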

Why Strands Agents? The ToolContext API provides agent.state as a native key-value store scoped to each agent: no global dictionaries, no external infrastructure. For multi-agent workflows, invocation_state shares data between agents in a Swarm with the same API.

Metric	Without pointers	With Memory Pointers
Data in context	214KB (full logs)	52 bytes (pointer)
Agent behavior	Truncates or fails	Processes all data
Errors detected	Partial	Complete

Full demo: 01-context-overflow-demo, with single-agent and multi-agent (Swarm) implementations and notebooks.

Why Do AI Agents Freeze When Calling External APIs?

AI agents freeze when MCP tools call slow or unresponsive external APIs. The agent blocks on the tool call, the user sees no progress, and after 7 seconds many implementations return a 424 error. MCP (Model Context Protocol) lets agents call external tools, but it does not handle timeouts or retries by default.

The fix is the async handleId pattern: the tool returns a job ID immediately. The agent then polls a separate check_job_status tool:

import asyncio
import uuid

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("timeout-demo")
JOBS = {}

@mcp.tool()
async def start_long_job(task: str) -> str:
    """Returns a handle immediately, which prevents the timeout."""
    job_id = str(uuid.uuid4())[:8]
    JOBS[job_id] = {"status": "processing", "task": task}
    # Background work; _process_job is not shown here and is assumed to update JOBS[job_id]
    asyncio.create_task(_process_job(job_id))
    return f"Job started. Handle: {job_id}. Use check_job_status to poll."

@mcp.tool()
async def check_job_status(job_id: str) -> str:
    """Checks job status: returns 'processing' or 'completed' with the result."""
    job = JOBS.get(job_id)
    if not job:
        return f"FAILED: Job '{job_id}' not found"
    return f"{job['status'].upper()}: {job.get('result', 'Still processing...')}"

Scenario	Response time	UX
Fast API (1s)	3s total	OK
Slow API (15s)	18s blocked	Agent frozen
Failing API	424 error after 7s	Agent fails
Async handleId	~4s (immediate + poll)	Agent responsive

Why Strands Agents? The MCPClient connects to any MCP server. The agent discovers tools at runtime via list_tools_sync(), with no hardcoded tool list. When the MCP server implements the async pattern, the agent polls automatically without extra orchestration code.

Full demo: 02-mcp-timeout-demo, a local MCP server with all 4 scenarios and a notebook.

Why Do AI Agents Repeat the Same Tool Call?

Reasoning loops in AI agents occur when the agent calls the same tool repeatedly with identical parameters, without making progress. The root cause is ambiguous tool feedback: responses such as "more results may be available" make the agent think another call will produce better results. Research shows that agents can loop hundreds of times without delivering an answer.

Solution 1, clear terminal states: tools return an explicit SUCCESS or FAILED instead of ambiguous messages:

# Ambiguous (causes loops)
return f"Flights found: {results}. More results may be available."

# Clear (the agent stops)
return f"SUCCESS: Flight {conf_id} booked for {passenger}. Confirmation sent."

Solution 2, DebounceHook: detects and blocks duplicate tool calls at the framework level:

import json

from strands.hooks.registry import HookProvider, HookRegistry
from strands.hooks.events import BeforeToolCallEvent

class DebounceHook(HookProvider):
    """Blocks duplicate tool calls within a sliding window."""
    def __init__(self, window_size=3):
        self.call_history = []
        self.window_size = window_size

    def register_hooks(self, registry: HookRegistry) -> None:
        registry.add_callback(BeforeToolCallEvent, self.check_duplicate)

    def check_duplicate(self, event: BeforeToolCallEvent) -> None:
        # Fingerprint: tool name plus its serialized input
        key = (event.tool_use["name"], json.dumps(event.tool_use.get("input", {})))
        if self.call_history.count(key) >= 2:
            event.cancel_tool = f"BLOCKED: Duplicate call to {event.tool_use['name']}"
        self.call_history.append(key)
        # Keep only the most recent window_size calls
        self.call_history = self.call_history[-self.window_size:]

Strategy	Tool calls	Result
Ambiguous feedback (baseline)	14 calls	No definitive answer
DebounceHook	12 calls (2 blocked)	Completes with blocks
Clear SUCCESS states	2 calls	Completes immediately

Why Strands Agents? The HookProvider API intercepts tool calls via BeforeToolCallEvent before they execute. Setting event.cancel_tool blocks execution at the framework level: the LLM cannot bypass it. This makes hooks composable, so you can stack DebounceHook, LimitToolCounts, and custom validators on the same agent.

Full demo: 03-reasoning-loops-demo, all 4 scenarios with hooks and a notebook.

Prerequisites

You need Python 3.9+, uv (a fast Python package manager), and an OpenAI API key.

git clone https://github.com/aws-samples/sample-why-agents-fail
cd sample-why-agents-fail/stop-ai-agents-wasting-tokens

# Pick any demo
cd 01-context-overflow-demo   # or 02-mcp-timeout-demo, 03-reasoning-loops-demo
uv venv && uv pip install -r requirements.txt
export OPENAI_API_KEY="your-key-here"

uv run python test_*.py

Each demo is self-contained, with its own dependencies, test script, and Jupyter notebook.

Frequently Asked Questions

What are the most common failure modes in AI agents?

The three most common failure modes are context window overflow (a tool returns more data than the LLM can process), MCP tool timeouts (external APIs block the agent indefinitely), and reasoning loops (the agent repeats the same tool call without making progress). Each failure mode wastes tokens and degrades response quality.

How do I reduce the token costs of an AI agent?

The two most effective techniques are memory pointers and clear tool states. The Memory Pointer Pattern stores large tool outputs in external state and passes short references into the LLM context, shrinking what enters the context from more than 200KB to under 100 bytes per tool call. Clear terminal states (SUCCESS/FAILED) in tool responses keep the agent from retrying completed operations, which can reduce tool calls from 14 to 2.

Can I use these patterns with frameworks other than Strands Agents?

Yes. The Memory Pointer Pattern works with any framework that supports tool context (passing state between tools). The async handleId pattern is an MCP server design pattern: it works with any MCP-compatible agent. DebounceHook requires lifecycle hooks, which are available in LangGraph, AutoGen, and CrewAI with different APIs.

References

Research

Solving Context Window Overflow in AI Agents, IBM Research, Nov 2025
Towards Effective GenAI Multi-Agent Collaboration, Amazon, Dec 2024
Resilient AI Agents With MCP, Octopus, May 2025
Language models can overthink, The Decoder, Jan 2025

Implementation

Strands Agent State: ToolContext and agent.state
Strands MCP Tools: Connect any MCP server
Strands Hooks: Lifecycle events and tool cancellation

Which failure mode have you run into with your agents? Share in the comments.

Gracias!


How to Prevent AI Agent Reasoning Loops from Wasting Tokens

Elizabeth Fuentes L — Mon, 04 May 2026 23:00:00 +0000

AI agent reasoning loops occur when an agent calls the same tool repeatedly without making progress, convinced that one more attempt will produce the perfect answer. The agent wastes tokens, time, and money without delivering a result. This post shows how to detect and block repeated calls, validated with a demo where ambiguous tools caused 14 calls vs clear SUCCESS states that stopped in 2.

This demo uses Strands Agents. The patterns — debounce hooks, clear tool states, and call limits — are framework-agnostic and apply to any agent that supports lifecycle hooks, including LangGraph, AutoGen, and CrewAI.

Working code: github.com/aws-samples/sample-why-agents-fail

Series: Why AI Agents Fail

Context Window Overflow — Memory Pointer Pattern for large data
MCP Tools That Never Respond — Async pattern for slow external APIs
AI Agent Reasoning Loops (this post) — Detect and block repeated tool calls

The Problem: Agents That Overthink

AI agents don't just fail by giving wrong answers; they fail by never finishing. Research shows agents get trapped in reasoning loops where they call the same tool repeatedly, convinced that "one more step" will produce the perfect answer.

The Decoder (Jan 2025) found that even with unlimited computing power, overthinking leads to poor decisions. Incomplete understanding of the world causes compounding errors. Each additional reasoning step makes things worse, not better.

Particula (Jul 2025) (community observation) documented an extreme case: an agent executed 847 reasoning steps at $47 per minute and never delivered a final answer. It kept refining logic, questioning conclusions, and requesting more data in an endless cycle.

CodiesHub (Dec 2025) (community observation) identifies the root causes:

Unclear goals — agent doesn't know when the task is complete
Ambiguous tool feedback — tools don't return clear success/failure states
No stopping criteria — no hard limits on iterations or time

Why Loops Happen: Ambiguous Tool Feedback

Ambiguous tool feedback occurs when tools return partial results or suggest "more data may be available" without a clear terminal state. The agent reads that as a reason to retry the same call:

import random

from strands import tool

@tool
def search_flights(origin: str, destination: str, max_price: float) -> str:
    """Search for flights under a max price."""
    prices = [random.randint(200, 800) for _ in range(3)]
    matching = [p for p in prices if p <= max_price]
    # The problem: "More results may be available" signals the LLM to retry
    # The agent interprets this as "I should search again to find a better deal"
    return (
        f"Found {len(matching)} flights under ${max_price} "
        f"(out of {len(prices)} checked). "
        "Note: More results may be available. Prices change frequently."
    )

That "Note: More results may be available" triggers the loop. The agent sees it and thinks: "Maybe if I search again, I'll find a better deal." It retries with the same parameters, gets similar results, and the cycle continues.

Solution 1: Debounce Hook with Strands

Strands Hooks intercept the agent lifecycle at any point. A Debounce Hook uses BeforeToolCallEvent to detect duplicate calls before they execute:

from strands import Agent
from strands.hooks import HookProvider, BeforeToolCallEvent, BeforeInvocationEvent

class DebounceHook(HookProvider):
    def __init__(self, window_size=3):
        self.call_history = []       # Tracks (tool_name, input) pairs
        self.window_size = window_size  # Sliding window size for duplicate detection
        self.blocked_count = 0

    def register_hooks(self, registry):
        # BeforeInvocationEvent fires once at the start of each agent invocation
        registry.add_callback(BeforeInvocationEvent, self.reset)
        # BeforeToolCallEvent fires before every tool execution — this is where we intercept
        registry.add_callback(BeforeToolCallEvent, self.check_duplicate)

    def reset(self, event):
        # Clear history at the start of each invocation so limits don't bleed across calls
        self.call_history = []

    def check_duplicate(self, event):
        # Build a fingerprint from tool name + exact inputs
        key = (event.tool_use["name"], str(event.tool_use["input"]))
        recent = self.call_history[-self.window_size:]

        if recent.count(key) >= 2:
            # cancel_tool is a native Strands API: blocks execution and returns this message to the LLM
            event.cancel_tool = "BLOCKED: Duplicate call detected"
            self.blocked_count += 1
            return

        self.call_history.append(key)

agent = Agent(tools=[search_flights], hooks=[DebounceHook()])

The hook tracks the last 3 tool calls. If the same tool with the same parameters appears twice, the third attempt is blocked via event.cancel_tool, a native Strands API that prevents tool execution and returns an error message to the LLM.
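To see the sliding-window logic in isolation, here is a framework-free sketch. `Debouncer` is a hypothetical class written for illustration, not a Strands API. It fingerprints inputs with json.dumps(..., sort_keys=True) so that two dicts with the same keys in a different order count as the same call, an assumption about what "identical parameters" should mean.

```python
import json
from collections import deque


class Debouncer:
    """Allow a (tool, input) pair at most max_repeats times within the window."""

    def __init__(self, window_size: int = 3, max_repeats: int = 2):
        self.recent = deque(maxlen=window_size)  # oldest entries drop off automatically
        self.max_repeats = max_repeats

    def allow(self, tool_name: str, tool_input: dict) -> bool:
        # Stable fingerprint: key order in the input dict does not matter.
        key = (tool_name, json.dumps(tool_input, sort_keys=True))
        if self.recent.count(key) >= self.max_repeats:
            return False  # blocked calls are not added to the history
        self.recent.append(key)
        return True


debouncer = Debouncer()
call = {"origin": "SCL", "destination": "CCS", "max_price": 500}
decisions = [debouncer.allow("search_flights", call) for _ in range(4)]
# decisions == [True, True, False, False]
```

The first two identical calls pass and every further identical call inside the window is refused, which matches the "third attempt is blocked" behavior described above.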

Solution 2: Clear SUCCESS/FAILED States

Tools that return explicit terminal states help agents know when to stop:

import random

from strands import tool

@tool
def book_hotel(hotel: str, guest: str, nights: int) -> str:
    """Book a hotel room. Returns clear SUCCESS or FAILED.

    Returns:
        SUCCESS: Booking confirmed with ID
        FAILED: Booking failed with reason
    """
    if random.random() > 0.15:
        conf = f"HT{random.randint(10000, 99999)}"
        price = random.randint(150, 350)
        return f"SUCCESS: Booking {conf} confirmed — {guest} at {hotel}, {nights} nights, ${price * nights} total"
    return f"FAILED: {hotel} fully booked"

When the agent receives "SUCCESS: Booking HT79265 confirmed", it knows the task is done. No ambiguity, no extra calls.

Solution 3: Hard Limits with LimitToolCounts

CodiesHub recommends: "Iterations, tokens, time, spend are non-negotiable." Strands provides LimitToolCounts in the Hooks Cookbook — a hook that caps tool calls per invocation:

from threading import Lock

from strands import Agent
from strands.hooks import HookProvider, BeforeToolCallEvent, BeforeInvocationEvent

class LimitToolCounts(HookProvider):
    """Limits tool calls per invocation. From Strands Hooks Cookbook."""

    def __init__(self, max_tool_counts: dict[str, int]):
        # Per-tool call budgets: {"search_flights": 2} means max 2 searches per invocation
        self.max_tool_counts = max_tool_counts
        self.tool_counts = {}
        self._lock = Lock()  # Thread-safe for concurrent tool calls in Swarm scenarios

    def register_hooks(self, registry):
        registry.add_callback(BeforeInvocationEvent, self.reset_counts)
        registry.add_callback(BeforeToolCallEvent, self.intercept_tool)

    def reset_counts(self, event):
        # Reset per invocation so limits apply per task, not per agent lifetime
        with self._lock:
            self.tool_counts = {}

    def intercept_tool(self, event):
        tool_name = event.tool_use["name"]
        with self._lock:
            max_count = self.max_tool_counts.get(tool_name)
            count = self.tool_counts.get(tool_name, 0) + 1
            self.tool_counts[tool_name] = count

            if max_count and count > max_count:
                # Hard ceiling: block the call and tell the LLM explicitly to stop
                event.cancel_tool = f"Tool '{tool_name}' limit reached. DO NOT CALL ANYMORE."

# Enforce a hard limit of 2 flight searches per booking task — prevents runaway costs
limit_hook = LimitToolCounts(max_tool_counts={"search_flights": 2})
agent = Agent(tools=[search_flights], hooks=[limit_hook])

Even if the agent wants to search 10 times, it's capped at 2. Hard ceiling, predictable costs.
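Call counts are one budget; wall-clock time is another. The sketch below combines both in a framework-free guard. `InvocationBudget` is a hypothetical helper, and the injectable `clock` parameter exists only to make it testable. A hook could call `check()` before each tool call and cancel the call when a message comes back.

```python
import time
from typing import Callable, Optional


class InvocationBudget:
    """Caps the number of tool calls and the elapsed time for one invocation."""

    def __init__(self, max_calls: int, max_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_calls = max_calls
        self.max_seconds = max_seconds
        self._clock = clock
        self.reset()

    def reset(self) -> None:
        """Start a fresh budget; call this at the start of each invocation."""
        self.calls = 0
        self.started = self._clock()

    def check(self) -> Optional[str]:
        """Count one call; return a stop message if a limit is exceeded, else None."""
        self.calls += 1
        if self.calls > self.max_calls:
            return f"LIMIT: call budget of {self.max_calls} exhausted"
        if self._clock() - self.started > self.max_seconds:
            return f"LIMIT: time budget of {self.max_seconds}s exhausted"
        return None
```

time.monotonic is used instead of time.time so that system clock adjustments cannot extend or shorten the budget.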

Demo Results

We tested with a travel booking agent that searches for flights and hotels:

Scenario	Tool Calls	Time	Result
Ambiguous Feedback	14	21s	Agent retried organically — "prices may change" caused loops
DebounceHook	12	15s	Reduced retries but some variation in parameters
Clear SUCCESS States	2	4s	Agent stopped immediately after SUCCESS
LimitToolCounts	6 (2 blocked)	6s	Hard ceiling enforced — no runaway

The contrast is dramatic: 14 calls with ambiguous tools vs 2 calls with clear SUCCESS states. That is a 7x difference caused purely by tool feedback design.

When to Use Each Solution

DebounceHook — prevents duplicate calls with identical parameters. Use when tools are idempotent and retrying with the same input is wasteful.

Clear SUCCESS/FAILED states — the simplest solution. Design tools to return explicit terminal states. The agent knows when to stop.

LimitToolCounts — hard ceiling on tool calls per invocation. Use in production to prevent runaway costs regardless of tool design. From the Strands Hooks Cookbook.

All three together — defense in depth. Clear states prevent most loops, debounce catches duplicates, and hard limits guarantee bounded execution.

Try It Yourself

You need Python 3.9+, uv, and an OpenAI API key.

git clone https://github.com/aws-samples/sample-why-agents-fail
cd sample-why-agents-fail/stop-ai-agents-wasting-tokens/03-reasoning-loops-demo
uv venv && uv pip install -r requirements.txt
export OPENAI_API_KEY="your-key-here"

uv run python test_reasoning_loops.py   # Runs all 4 scenarios

Or open test_reasoning_loops.ipynb in Jupyter, JupyterLab, VS Code, or your preferred notebook environment.

Key Takeaways

Ambiguous tool feedback causes organic loops — "more results may be available" makes agents retry
14 calls vs 2 calls — clear SUCCESS states reduce calls by 7x in our demo
Hooks intercept before execution — BeforeToolCallEvent.cancel_tool blocks the call before the tool runs. The DebounceHook is ~30 lines of code
Hard limits are mandatory — every agent needs caps on iterations, time, and spend
847 steps at $47/min was documented (Particula, community observation) — unbounded agents burn money without delivering answers

Frequently Asked Questions

Why do AI agents repeat the same tool call?

Agents repeat tool calls when tool responses contain ambiguous feedback such as "more results may be available" or "prices change frequently." The LLM interprets these signals as a reason to retry, expecting different or better results. Without clear terminal states (SUCCESS/FAILED), the agent has no way to know the task is complete.

What is a DebounceHook and how does it prevent reasoning loops?

A DebounceHook tracks recent tool calls in a sliding window. When the same tool is called with identical parameters more than a set threshold (typically 2 times within a window of 3), the hook blocks the call using event.cancel_tool before the tool executes. The LLM receives a "BLOCKED: Duplicate call" message and must try a different approach. In Strands Agents, this is about 30 lines of code using the HookProvider API.

How do clear SUCCESS/FAILED states reduce tool calls?

When a tool returns "SUCCESS: Booking HT79265 confirmed," the LLM recognizes the task is complete and stops calling that tool. Ambiguous responses such as "Found 2 flights, more may be available" lack this signal, causing the agent to retry. In our demo, clear states reduced tool calls from 14 to 2, a 7x improvement.

References

Research

Language models can overthink — The Decoder, Jan 2025
How many reasoning steps do AI agents need — Particula (community observation), Jul 2025
How to Prevent Infinite Loops and Spiraling Costs — CodiesHub (community observation), Dec 2025

Implementation

Strands Hooks — Lifecycle event interception and tool cancellation

Gracias!


Elizabeth Fuentes L — Sat, 02 May 2026 20:00:22 +0000

Francisco

May 1

De bloquear a autocorregir: una demo práctica de guardrails para agentes de IA con Laravel, Grok y OpenSpec


Fix MCP Timeouts: Async HandleId Pattern

Elizabeth Fuentes L — Thu, 30 Apr 2026 19:28:53 +0000

MCP tools freeze AI agents when external APIs are slow, causing 424 errors. The async handleId pattern returns immediately with a job ID and polls for results without blocking.

MCP tool timeout occurs when an AI agent calls a Model Context Protocol (MCP) tool that depends on a slow external API. The tool blocks the agent indefinitely instead of returning an error. The result is a 424 (Failed Dependency) error or a frozen workflow with no user feedback. This post shows the problem with real scenarios and how the async handleId pattern provides immediate responses.

This demo uses Strands Agents with MCP (Model Context Protocol). The async pattern is framework-agnostic and applies to any agent that calls external APIs through MCP.

Working code: github.com/aws-samples/sample-why-agents-fail

Series: Why AI Agents Fail

Context Window Overflow — Memory Pointer Pattern for large data
MCP Tools That Never Respond (this post) — Async pattern for slow external APIs
AI Agent Reasoning Loops — Detect and block repeated tool calls

The Problem: MCP Tools That Never Respond

The Model Context Protocol (MCP) enables AI agents to call external tools. But when those tools depend on slow APIs, the entire agent workflow freezes. The agent waits. The user waits. Nothing happens.

Community observation from Octopus (Resilient AI Agents With MCP, 2025) identifies the core issue: as external system integrations increase, so does the likelihood of failure. Systems become unavailable, slow to respond, or return errors. Agents have no built-in strategy to handle this.

OpenAI Community reports confirm the real-world impact:

424 errors when MCP tools take too long
Unresponsive states where requests neither succeed nor fail
Tools that pass handshake validation but timeout during execution

Why This Happens

MCP expects tools to respond quickly. When a tool calls a slow external API, the tool call blocks until that API returns, and the agent blocks with it.

MCP has implicit timeout expectations. If the tool doesn't respond within ~7-10 seconds, the connection may drop with a 424 (Failed Dependency) error. The agent receives an error instead of data, and the user gets no useful response.

Three failure modes:

Slow API — Tool waits 15+ seconds, poor UX but eventually responds
Failing API — External service unavailable, 424 error after timeout
Unresponsive state — Request accepted but never returns, requires session restart
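A complementary safeguard, separate from the handleId pattern this post focuses on, is to put an explicit deadline around the external call so that a slow or hung dependency becomes a clear FAILED result. The sketch below uses asyncio.wait_for from the standard library; `call_with_deadline` and the shortened `slow_api` are hypothetical names for illustration.

```python
import asyncio


async def call_with_deadline(coro_fn, *args, timeout_s: float = 5.0) -> str:
    """Run an async call with a hard deadline and return an explicit state."""
    try:
        result = await asyncio.wait_for(coro_fn(*args), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the inner call when the deadline passes
        return f"FAILED: no response within {timeout_s}s"
    except Exception as exc:
        return f"FAILED: {exc}"
    return f"SUCCESS: {result}"


async def slow_api(query: str) -> str:
    """Stand-in for a slow external service (0.5s here instead of 15s)."""
    await asyncio.sleep(0.5)
    return f"result for {query}"


on_time = asyncio.run(call_with_deadline(slow_api, "q", timeout_s=2.0))
too_slow = asyncio.run(call_with_deadline(slow_api, "q", timeout_s=0.05))
```

A deadline bounds the wait but still discards the work; when the result is needed, returning a handle and polling avoids that loss.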

The Demo: Simulating Real Timeout Scenarios

We built an MCP server that simulates these real-world scenarios:

from mcp.server import FastMCP
import asyncio

# FastMCP is a lightweight MCP server framework — tools are registered with @mcp.tool()
mcp = FastMCP("Timeout Demo Server")

# Baseline: responds in 1s, well within MCP's implicit timeout threshold (~7-10s)
@mcp.tool(description="Fast API - responds in 1 second")
async def fast_api(query: str) -> str:
    await asyncio.sleep(1)
    return f"Fast result for: {query}"

# Problem case: 15s delay exceeds MCP timeout — the agent freezes waiting for this
@mcp.tool(description="Slow API - responds in 15 seconds")
async def slow_api(query: str) -> str:
    await asyncio.sleep(15)  # Simulates a slow external service (data pipeline, batch job)
    return f"Slow result for: {query}"

# Failure case: 7s delay triggers the timeout, then raises Failed Dependency (424)
@mcp.tool(description="Failing API - returns 424 after delay")
async def failing_api(query: str) -> str:
    await asyncio.sleep(7)
    raise Exception("Failed Dependency: External service unavailable")

The Async HandleId Solution

Instead of waiting for slow operations, return immediately with a tracking ID:

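import asyncio
import uuid

# In-memory job store: maps job_id → {status, query, result}
# For production, replace with a persistent store (Redis, DynamoDB) for durability across restarts
JOBS = {}

# Background worker (illustrative stand-in; the real slow work goes here)
async def do_work(job_id: str, query: str) -> None:
    await asyncio.sleep(15)  # Simulates the slow external call
    JOBS[job_id].update(status="completed", result=f"Slow result for: {query}")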

# The handleId pattern: return a tracking ID immediately instead of blocking
@mcp.tool(description="Start a long-running job, returns immediately with job ID")
async def start_async_job(query: str) -> str:
    job_id = str(uuid.uuid4())[:8]  # Short ID the LLM can pass in follow-up calls
    JOBS[job_id] = {"status": "processing", "query": query}

    # Fire-and-forget: slow work runs in background, tool returns before it finishes.
    # Keep a reference to the task so it is not garbage-collected mid-run.
    JOBS[job_id]["task"] = asyncio.create_task(do_work(job_id, query))

    # The agent receives this in < 1s — no timeout, no frozen UI
    return f"Job started: {job_id}. Use check_job_status to poll for results."

# Polling endpoint: the agent calls this repeatedly until status is "completed"
@mcp.tool(description="Check status of a running job")
async def check_job_status(job_id: str) -> str:
    job = JOBS.get(job_id)
    if not job:
        return f"Job {job_id} not found"
    if job["status"] == "completed":
        return f"COMPLETED: {job['result']}"  # Return the actual result to the agent
    return f"PROCESSING: Job {job_id} still running"  # Agent polls again after a short wait
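
The polling side of the pattern can be sketched without an MCP server. The example below assumes a check function that returns the same "COMPLETED: ..." and "PROCESSING: ..." strings as check_job_status above. poll_until_done, make_fake_checker, and the interval and poll limits are hypothetical, chosen only to illustrate a bounded polling loop.

```python
import asyncio

async def poll_until_done(check, job_id: str, interval_s: float = 0.05, max_polls: int = 20) -> str:
    """Call check(job_id) until the job completes or is unknown, or the poll budget runs out."""
    for _ in range(max_polls):
        reply = await check(job_id)
        if reply.startswith("COMPLETED") or "not found" in reply:
            return reply
        await asyncio.sleep(interval_s)  # Short wait between polls
    return f"GAVE_UP: Job {job_id} still running after {max_polls} polls"

def make_fake_checker(ready_after: int):
    """Hypothetical stand-in for check_job_status: reports completion on the Nth call."""
    calls = {"n": 0}

    async def check(job_id: str) -> str:
        calls["n"] += 1
        if calls["n"] >= ready_after:
            return "COMPLETED: demo result"
        return f"PROCESSING: Job {job_id} still running"

    return check
```

Capping the number of polls matters: without a limit, a job that never completes would turn the polling loop into the same indefinite wait the pattern is meant to avoid.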

Demo Results

We tested all four scenarios with a Strands Agent connected to the MCP server:

Scenario	Response Time	User Experience	Research Finding
Fast API (1s delay)	3.2s total	✅ Good UX	Baseline
Slow API (15s delay)	17.8s total	❌ Poor UX — agent waits	Octopus: "agent waits indefinitely"
Failing API (424)	7.7s total	❌ Error after wait	OpenAI Community: 424 errors
Async pattern (handleId)	3.7s total	✅ Immediate response	Solution: "respond ASAP with handleId"

The async pattern cuts the time to first response from 17.8s to 3.7s. The agent tells the user "job started" and checks the status later, with no frozen UI and no timeout errors.

Why Strands Agents for MCP Integration?

The MCPClient connects to any MCP server in two lines. The agent discovers available tools at runtime through list_tools_sync(), so you don't maintain a hardcoded tool list. When the MCP server implements the async handleId pattern, the agent polls by calling check_job_status, as the start tool's return message instructs, without extra orchestration code.

Strands supports multiple model providers (OpenAI, Amazon Bedrock, Anthropic, Ollama). The MCP timeout patterns shown here work identically across all providers.

When to Use Each Pattern

Direct call (fast tools < 5s):

Lookups, calculations, small API calls
No timeout risk

Async handleId (slow tools > 5s):

External API calls with unpredictable latency
Data processing, report generation
Any operation that might exceed the MCP timeout

Retry with backoff (intermittent failures):

Services that occasionally fail but recover
Network-dependent operations
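
Retry with backoff is not shown in the demo server, so here is a minimal sketch of the idea, assuming the failing call is an async function that raises on failure. retry_with_backoff, make_flaky, and the delay values are hypothetical and would need tuning for a real service.

```python
import asyncio
import random

async def retry_with_backoff(op, max_attempts: int = 4, base_delay_s: float = 0.5, max_delay_s: float = 8.0):
    """Retry an async operation, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await op()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Small random jitter so many clients do not retry in lockstep
            await asyncio.sleep(delay + random.uniform(0, delay * 0.1))

def make_flaky(fail_times: int):
    """Hypothetical service that fails fail_times times, then recovers."""
    state = {"calls": 0}

    async def op() -> str:
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise ConnectionError("service unavailable")
        return f"ok after {state['calls']} calls"

    return op
```

Keep the total retry budget below the timeout threshold for direct calls, or run the retries inside a background job started with the async handleId pattern.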

Try It Yourself

You need Python 3.9+, uv, and an OpenAI API key. The MCP server runs locally as a subprocess, so no external services are needed.

git clone https://github.com/aws-samples/sample-why-agents-fail
cd sample-why-agents-fail/stop-ai-agents-wasting-tokens/02-mcp-timeout-demo
uv venv && uv pip install -r requirements.txt
export OPENAI_API_KEY="your-key-here"

uv run python test_mcp_timeout.py   # Runs all 4 scenarios

Or open test_mcp_timeout.ipynb in Jupyter, JupyterLab, VS Code, or your preferred notebook environment.

Key Takeaways

MCP tools can fail without a clear signal: 424 errors with no built-in recovery
Slow APIs freeze the entire agent: 17.8s wait with no feedback
The async handleId pattern solves it: immediate response, then poll for results
Design for failure: every external call can time out, so plan accordingly

Frequently Asked Questions

What causes 424 errors in MCP tool calls?

A 424 (Failed Dependency) error occurs when an MCP tool takes longer than the implicit timeout threshold (typically 7-10 seconds) to respond. The MCP protocol expects tools to return quickly. When an external API blocks the tool beyond this threshold, the connection drops and the agent receives a 424 error instead of data.

When should I use the async handleId pattern instead of a direct MCP tool call?

Use the async handleId pattern for any tool that calls an external API with unpredictable latency: data processing, report generation, third-party service calls, or any operation that might exceed 5 seconds. For fast lookups, calculations, and small API calls under 5 seconds, direct calls work fine.

Does the async handleId pattern work with any MCP client, not only Strands?

Yes. The async handleId pattern is an MCP server design pattern, not a framework feature. Any MCP-compatible agent can call the start_async_job and check_job_status tools. The pattern works with OpenAI Agents, LangChain MCP integrations, and any client that supports the Model Context Protocol.