<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: ClevAgent</title>
    <description>The latest articles on Forem by ClevAgent (@clevagent).</description>
    <link>https://forem.com/clevagent</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3856167%2Fe4b30e16-0103-4b82-aadd-35957653ac46.png</url>
      <title>Forem: ClevAgent</title>
      <link>https://forem.com/clevagent</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/clevagent"/>
    <language>en</language>
    <item>
      <title>How to Monitor CrewAI Agents in Production</title>
      <dc:creator>ClevAgent</dc:creator>
      <pubDate>Sat, 04 Apr 2026 15:43:39 +0000</pubDate>
      <link>https://forem.com/clevagent/how-to-monitor-crewai-agents-in-production-k6i</link>
      <guid>https://forem.com/clevagent/how-to-monitor-crewai-agents-in-production-k6i</guid>
      <description>&lt;p&gt;If you're running CrewAI crews in production, you've probably hit this: your cron job exits with code 0, but the crew didn't actually finish its work. The researcher agent got stuck retrying a rate-limited API, the analyst never received input, and nobody noticed until Friday.&lt;/p&gt;

&lt;p&gt;Multi-agent orchestration frameworks like CrewAI fail differently from traditional services. &lt;strong&gt;A crew can fail without crashing.&lt;/strong&gt; Here's how to catch those failures with heartbeat monitoring — in about 3 lines of code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why CrewAI crews need dedicated monitoring
&lt;/h2&gt;

&lt;p&gt;CrewAI orchestrates multiple agents that call LLMs, use tools, and pass context to each other. Each agent is a potential failure point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent hangs&lt;/strong&gt;: One agent waits indefinitely for an LLM response. The crew stalls, but the process stays alive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infinite loops&lt;/strong&gt;: An agent retries a failed tool call endlessly. Your token meter spins, but no useful output appears.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silent quality degradation&lt;/strong&gt;: The LLM returns garbage, the next agent processes it anyway, and the final output is subtly wrong. No error thrown.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost spikes&lt;/strong&gt;: A single crew run normally costs $0.15. One bad run costs $12 because an agent kept rephrasing the same request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional process monitoring (systemd, Docker health checks) only tells you the process is alive. It tells you nothing about whether the &lt;em&gt;crew&lt;/em&gt; is making progress.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Try it now&lt;/strong&gt; — monitor your CrewAI agent in 2 lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;clevagent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;clevagent&lt;/span&gt;
&lt;span class="n"&gt;clevagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-crew&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Free for up to 3 agents. No credit card required. &lt;a href="https://clevagent.io/signup?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=crewai-monitoring-guide" rel="noopener noreferrer"&gt;Get your API key →&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Add ClevAgent to your CrewAI crew in 3 lines
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://clevagent.io?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=crewai-monitoring-guide" rel="noopener noreferrer"&gt;ClevAgent&lt;/a&gt; monitors your crew at the agent level — heartbeats, loop detection, and per-run cost tracking. Setup takes about 30 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;clevagent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Initialize before kickoff
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;clevagent&lt;/span&gt;

&lt;span class="n"&gt;clevagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CLEVAGENT_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-research-crew&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. ClevAgent starts sending heartbeats automatically. If your crew hangs or the process dies, you get alerted within 120 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3 (optional): Add a step callback for per-agent tracking
&lt;/h3&gt;

&lt;p&gt;CrewAI supports a &lt;code&gt;step_callback&lt;/code&gt; on each agent. Wire it to ClevAgent to get visibility into each agent's work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;track_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step_output&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;clevagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step_complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;step_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pass this callback when defining your agents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;researcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research Analyst&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find the latest market data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a senior research analyst...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;step_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;track_step&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now every agent step shows up on your dashboard with timing and metadata.&lt;/p&gt;

&lt;h2&gt;
  
  
  Complete example: 2-agent crew with monitoring
&lt;/h2&gt;

&lt;p&gt;Here's a full working example — a research crew with two agents, monitored by ClevAgent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Process&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;clevagent&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize monitoring
&lt;/span&gt;&lt;span class="n"&gt;clevagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CLEVAGENT_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily-research-crew&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;track_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step_output&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;clevagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step_complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;step_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define agents
&lt;/span&gt;&lt;span class="n"&gt;researcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research Analyst&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find the 3 most important tech news stories today&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a senior research analyst who reads dozens of sources daily.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;step_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;track_step&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Report Writer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a concise morning briefing from the research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a technical writer who distills complex topics into clear summaries.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;step_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;track_step&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define tasks
&lt;/span&gt;&lt;span class="n"&gt;research_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search for today&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s top 3 tech news stories. Include source URLs.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A list of 3 news items with title, summary, and source URL.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;researcher&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;writing_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a 200-word morning briefing based on the research.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A formatted briefing email ready to send.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Assemble and run
&lt;/span&gt;&lt;span class="n"&gt;crew&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;researcher&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;research_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;writing_task&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sequential&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;crew&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kickoff&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Report completion with output metadata
&lt;/span&gt;&lt;span class="n"&gt;clevagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crew_complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agents_used&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The entire monitoring integration is three small pieces — the &lt;code&gt;init()&lt;/code&gt;, the &lt;code&gt;track_step&lt;/code&gt; callback, and the final &lt;code&gt;ping()&lt;/code&gt;. Your existing CrewAI code stays exactly the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  What ClevAgent catches
&lt;/h2&gt;

&lt;p&gt;Once connected, ClevAgent watches for three categories of problems:&lt;/p&gt;

&lt;h3&gt;
  
  
  Crew hangs
&lt;/h3&gt;

&lt;p&gt;If no heartbeat arrives for 120 seconds, ClevAgent sends an alert to Telegram or Slack. This catches the most common CrewAI failure: an agent waiting on an LLM call that never returns. Your cron job sees a running process. ClevAgent sees a silent agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent loops
&lt;/h3&gt;

&lt;p&gt;ClevAgent tracks the frequency and pattern of &lt;code&gt;ping()&lt;/code&gt; calls. If an agent sends 50 step completions in 30 seconds with identical metadata, that's a loop. You get a warning before the token bill becomes a problem.&lt;/p&gt;
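&lt;p&gt;For intuition, here is a small self-contained sketch of how that kind of detection can work (again illustrative, not ClevAgent's real algorithm): count pings with identical metadata inside a sliding time window and flag when the count crosses a threshold.&lt;/p&gt;

```python
from collections import deque

def make_loop_detector(window_seconds=30, max_identical=50):
    """Return a recorder that flags a probable loop when too many
    pings with identical metadata arrive inside the window."""
    recent = deque()  # (timestamp, metadata fingerprint) pairs

    def record(timestamp, meta):
        fingerprint = tuple(sorted(meta.items()))
        recent.append((timestamp, fingerprint))
        # Evict pings that have aged out of the window.
        while recent and timestamp - recent[0][0] > window_seconds:
            recent.popleft()
        identical = sum(1 for _, fp in recent if fp == fingerprint)
        return identical >= max_identical

    return record
```

&lt;p&gt;Fifty step completions with the same agent name and output length inside 30 seconds trip the flag; normal varied progress does not.&lt;/p&gt;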

&lt;h3&gt;
  
  
  Token cost spikes
&lt;/h3&gt;

&lt;p&gt;Every &lt;code&gt;ping()&lt;/code&gt; with metadata feeds into per-run cost estimates. ClevAgent compares the current run against your historical average. A run that's 5x the normal cost triggers a warning. You can set a hard budget ceiling per agent in the dashboard — if exceeded, ClevAgent sends an immediate alert.&lt;/p&gt;
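&lt;p&gt;The comparison itself is easy to reason about. Here is an illustrative sketch of that check, using the 5x factor and the hard budget ceiling described above (the real comparison happens server-side):&lt;/p&gt;

```python
def is_cost_anomaly(run_cost, historical_costs, spike_factor=5.0, budget_ceiling=None):
    """Flag a run that busts a hard budget or costs far more than
    the historical average. Costs are dollars per run."""
    if budget_ceiling is not None and run_cost > budget_ceiling:
        return True
    if not historical_costs:
        return False  # no baseline yet, nothing to compare against
    average = sum(historical_costs) / len(historical_costs)
    return run_cost > spike_factor * average

# The $12 bad run from earlier, against a $0.15 average, is an easy catch.
```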

&lt;h2&gt;
  
  
  Use clevagent.ping() for work-progress tracking
&lt;/h2&gt;

&lt;p&gt;Beyond failure detection, &lt;code&gt;ping()&lt;/code&gt; is useful for tracking that your crew is actually doing its job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;crew&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kickoff&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;clevagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crew_complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stories_found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stories&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;word_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the ClevAgent dashboard, this creates a timeline of crew runs. You can see at a glance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did today's 6 AM run actually complete?&lt;/li&gt;
&lt;li&gt;How many stories did it find compared to yesterday?&lt;/li&gt;
&lt;li&gt;Is the output length consistent, or did something degrade?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the difference between "the process ran" and "the crew did useful work." Process monitoring gives you the first. Ping metadata gives you the second.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/clevagent/why-your-ai-agent-health-check-is-lying-to-you-2ib1"&gt;Why Your AI Agent Health Check Is Lying to You&lt;/a&gt; — The hidden gap between "process alive" and "agent working."&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/clevagent/three-ai-agent-failure-modes-that-traditional-monitoring-will-never-catch-2ik4"&gt;Three AI Agent Failure Modes That Traditional Monitoring Will Never Catch&lt;/a&gt; — Silent exits, zombie agents, and runaway loops with real examples.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/clevagent/how-to-monitor-langchain-agents-in-production-2aic"&gt;How to Monitor LangChain Agents in Production&lt;/a&gt; — LangChain callback handler and LangGraph node monitoring.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://clevagent.io/blog/how-to-monitor-ai-agents?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=crewai-monitoring-guide" rel="noopener noreferrer"&gt;How to Monitor AI Agents in Production&lt;/a&gt; — The complete guide to heartbeat-based monitoring for any AI agent framework.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;ClevAgent is free for up to 3 agents. No credit card, no config files, no separate infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://clevagent.io/signup?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=crewai-monitoring-guide" rel="noopener noreferrer"&gt;Start monitoring your CrewAI crews →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>How to Monitor LangChain Agents in Production</title>
      <dc:creator>ClevAgent</dc:creator>
      <pubDate>Fri, 03 Apr 2026 22:01:39 +0000</pubDate>
      <link>https://forem.com/clevagent/how-to-monitor-langchain-agents-in-production-2aic</link>
      <guid>https://forem.com/clevagent/how-to-monitor-langchain-agents-in-production-2aic</guid>
      <description>&lt;p&gt;Your LangChain agent works in development. Chains resolve, tools return, the ReAct loop converges. Ship it. Day one — fine. Day two — 200 requests, zero errors. Day three — your OpenAI bill says $340.&lt;/p&gt;

&lt;p&gt;The agent got stuck in a tool-retry loop at 2 AM. It kept calling a search tool that returned empty results, parsing the response, deciding to search again, and repeating. No exceptions, no crashes, every health check returned 200 OK.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tracing tools like LangSmith would show you the traces after the fact. Nobody would have woken you up at 2 AM when it started.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're running LangChain or LangGraph agents in production, this is the gap between &lt;em&gt;observability&lt;/em&gt; and &lt;em&gt;runtime monitoring&lt;/em&gt;. Here's how to close it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why LangChain agents need runtime monitoring
&lt;/h2&gt;

&lt;p&gt;LangChain agents fail differently from web services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stuck chains&lt;/strong&gt;: An HTTP tool call hangs indefinitely. The chain never completes. The process is alive, the health endpoint responds, but no work is happening.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infinite ReAct loops&lt;/strong&gt;: The agent keeps calling tools without converging. &lt;code&gt;max_iterations&lt;/code&gt; helps, but only caps iteration count — not cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silent cost spikes&lt;/strong&gt;: A loop making 50 LLM calls in 30 seconds doesn't spike CPU. It spikes your API bill. By the time you see the invoice, the damage is done.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zombie agents&lt;/strong&gt;: The callback thread is alive, traces are flowing to LangSmith, but the actual work loop is stuck on a deadlocked resource.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LangSmith and Langfuse are excellent for tracing — understanding &lt;em&gt;what happened&lt;/em&gt; after the fact. But they don't answer the real-time question: &lt;strong&gt;is this agent alive and making progress right now?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Add ClevAgent to your LangChain agent in 3 lines
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1.&lt;/strong&gt; Install the SDK.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;clevagent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2.&lt;/strong&gt; Initialize ClevAgent with your API key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;clevagent&lt;/span&gt;

&lt;span class="n"&gt;clevagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-key-here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;langchain-research-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3.&lt;/strong&gt; Add the callback handler to your LLM or chain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;clevagent.integrations.langchain&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ClevAgentCallbackHandler&lt;/span&gt;

&lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ClevAgentCallbackHandler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every LLM call now sends a heartbeat with token usage. If the agent stops calling the LLM — because a chain hung, a tool timed out, or the process crashed — ClevAgent detects the silence and alerts you.&lt;/p&gt;
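&lt;p&gt;For intuition, the handler follows the standard LangChain callback shape: ping on each LLM call. Here is a stdlib-only sketch of that pattern; in real code it would subclass &lt;code&gt;BaseCallbackHandler&lt;/code&gt; from &lt;code&gt;langchain_core.callbacks&lt;/code&gt; and call &lt;code&gt;clevagent.ping()&lt;/code&gt;, both of which are stubbed here:&lt;/p&gt;

```python
import time

class HeartbeatHandler:
    """Illustrative stand-in for a LangChain callback handler that
    emits a heartbeat with timing metadata on every LLM call."""

    def __init__(self, ping):
        self._ping = ping          # e.g. clevagent.ping in production
        self._started_at = None

    def on_llm_start(self, serialized, prompts, **kwargs):
        self._started_at = time.monotonic()

    def on_llm_end(self, response, **kwargs):
        elapsed = time.monotonic() - self._started_at
        self._ping({"status": "llm_call_complete",
                    "elapsed_s": round(elapsed, 3)})

# Collect pings locally to show the flow.
pings = []
handler = HeartbeatHandler(ping=pings.append)
handler.on_llm_start(serialized={}, prompts=["hello"])
handler.on_llm_end(response=None)
```

&lt;p&gt;When heartbeats like these stop arriving, the monitoring side knows the agent went silent, even though the process itself is still up.&lt;/p&gt;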

&lt;h2&gt;
  
  
  Complete example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;clevagent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;clevagent.integrations.langchain&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ClevAgentCallbackHandler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentExecutor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;create_react_agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Tool&lt;/span&gt;

&lt;span class="n"&gt;clevagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-key-here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ClevAgentCallbackHandler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;search_web&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search the web&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calculate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;calculator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Do math&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_react_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentExecutor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Every LLM call and tool use is now monitored
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research the latest AI agent frameworks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  LangGraph agents: use the node decorator
&lt;/h2&gt;

&lt;p&gt;For LangGraph's graph-based agents, ClevAgent provides a &lt;code&gt;@monitored_node&lt;/code&gt; decorator that wraps each node with automatic heartbeat monitoring:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;clevagent.integrations.langgraph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;monitored_node&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;

&lt;span class="nd"&gt;@monitored_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;research_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

&lt;span class="nd"&gt;@monitored_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarize_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

&lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;research_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;summarize_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each node execution sends a heartbeat. If a node hangs — because an API call never returns or an LLM request times out — ClevAgent detects the gap and alerts you.&lt;/p&gt;

&lt;h2&gt;
  
  
  What ClevAgent catches
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Stuck chains and hung tools
&lt;/h3&gt;

&lt;p&gt;Your agent calls an external API inside a tool. The API hangs. The chain never completes. &lt;code&gt;systemctl status&lt;/code&gt; says "running" — but no heartbeats are arriving.&lt;/p&gt;

&lt;p&gt;ClevAgent detects the silence within your configured threshold (default: 120 seconds) and sends an alert.&lt;/p&gt;
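&lt;p&gt;The server side of this check is conceptually just a timestamp comparison. Here is a minimal sketch of the idea in plain Python (illustrative only, not ClevAgent's internals):&lt;/p&gt;

```python
import time

HEARTBEAT_TIMEOUT = 120  # seconds of silence before alerting

last_seen = {}  # agent name -> unix timestamp of the last heartbeat

def record_heartbeat(agent: str) -> None:
    """Called whenever a heartbeat arrives from an agent."""
    last_seen[agent] = time.time()

def silent_agents() -> list[str]:
    """Agents whose heartbeats have stopped arriving."""
    now = time.time()
    return [name for name, ts in last_seen.items() if now - ts > HEARTBEAT_TIMEOUT]

record_heartbeat("research-agent")
print(silent_agents())  # -> [] while heartbeats are fresh
```

&lt;p&gt;Run the check on a schedule and alert whenever the list is non-empty.&lt;/p&gt;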

&lt;h3&gt;
  
  
  Infinite ReAct loops
&lt;/h3&gt;

&lt;p&gt;The agent enters a loop: call tool → parse result → decide to call tool again → repeat. An agent making 15 iterations of GPT-4o calls in 30 seconds burns through tokens fast.&lt;/p&gt;

&lt;p&gt;ClevAgent tracks cumulative token usage per heartbeat cycle. If tokens spike 10-100x above your agent's baseline, you get a cost alert — while the loop is still running, not after.&lt;/p&gt;
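&lt;p&gt;The spike check itself reduces to comparing each heartbeat cycle against a rolling baseline. A hedged sketch in plain Python (the window size and 10x factor are illustrative, not ClevAgent's actual defaults):&lt;/p&gt;

```python
from collections import deque

class CostSpikeDetector:
    """Flag heartbeat cycles whose token usage far exceeds the recent baseline."""

    def __init__(self, window: int = 50, spike_factor: float = 10.0):
        self.history = deque(maxlen=window)  # tokens used in recent cycles
        self.spike_factor = spike_factor

    def observe(self, tokens: int) -> bool:
        """Record one cycle's token count; True means it looks like a runaway."""
        baseline = sum(self.history) / len(self.history) if self.history else None
        self.history.append(tokens)
        return baseline is not None and tokens > baseline * self.spike_factor

detector = CostSpikeDetector()
for _ in range(20):
    detector.observe(200)        # normal cycles establish the baseline
print(detector.observe(40_000))  # runaway cycle -> True
```

&lt;p&gt;Alerting on the &lt;code&gt;True&lt;/code&gt; result catches the loop while it is still burning tokens, not after the invoice arrives.&lt;/p&gt;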

&lt;h3&gt;
  
  
  Silent exits
&lt;/h3&gt;

&lt;p&gt;The process gets OOM-killed at 3 AM. No traceback, no error log, no alert. ClevAgent expects a heartbeat every N seconds. When it stops arriving, you get an alert within one missed interval. Optional auto-restart brings the agent back without manual intervention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;pip install clevagent&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Get your API key from &lt;a href="https://clevagent.io/signup?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=langchain-monitoring-guide" rel="noopener noreferrer"&gt;clevagent.io/signup&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;clevagent.init()&lt;/code&gt; and the callback handler&lt;/li&gt;
&lt;li&gt;Deploy — ClevAgent starts monitoring immediately&lt;/li&gt;
&lt;li&gt;Configure alerts in the dashboard: Telegram, Slack, Discord, or email&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Free for 3 agents. No credit card required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/clevagent/three-ai-agent-failure-modes-that-traditional-monitoring-will-never-catch-2ik4"&gt;Three AI Agent Failure Modes That Traditional Monitoring Will Never Catch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/clevagent/why-your-ai-agent-health-check-is-lying-to-you-2ib1"&gt;Why Your AI Agent Health Check Is Lying to You&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://clevagent.io/blog/track-llm-token-costs?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=langchain-monitoring-guide" rel="noopener noreferrer"&gt;How to Track LLM Token Costs in Production&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://clevagent.io/blog/crewai-monitoring-guide?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=langchain-monitoring-guide" rel="noopener noreferrer"&gt;How to Monitor CrewAI Agents in Production&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://clevagent.io?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=langchain-monitoring-guide" rel="noopener noreferrer"&gt;ClevAgent&lt;/a&gt; monitors LangChain and LangGraph agents with heartbeat detection, cost tracking, and auto-restart. Free for up to 3 agents — &lt;a href="https://clevagent.io/signup?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=langchain-monitoring-guide" rel="noopener noreferrer"&gt;start monitoring →&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Three AI Agent Failure Modes That Traditional Monitoring Will Never Catch</title>
      <dc:creator>ClevAgent</dc:creator>
      <pubDate>Thu, 02 Apr 2026 19:23:31 +0000</pubDate>
      <link>https://forem.com/clevagent/three-ai-agent-failure-modes-that-traditional-monitoring-will-never-catch-2ik4</link>
      <guid>https://forem.com/clevagent/three-ai-agent-failure-modes-that-traditional-monitoring-will-never-catch-2ik4</guid>
      <description>&lt;p&gt;One of my agents exited cleanly at 3 AM, another sat "healthy" while doing zero useful work for four hours, and a third burned through $50 in API credits in 40 minutes without throwing a single error.&lt;/p&gt;

&lt;p&gt;Those incidents looked unrelated at first. They weren't. All three slipped past the usual stack of process checks, log watchers, and CPU or memory alerts because those tools were measuring infrastructure symptoms, not whether the agent was still doing useful work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure #1: The Silent Exit
&lt;/h2&gt;

&lt;p&gt;One of my agents exited cleanly at 3 AM. No traceback. No error log. No crash dump. The Python process simply stopped. My log monitoring saw nothing because there was nothing to log.&lt;/p&gt;

&lt;p&gt;I found out &lt;strong&gt;six hours later&lt;/strong&gt; when I noticed the bot hadn't posted since 3 AM.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happened
&lt;/h3&gt;

&lt;p&gt;The OS killed the process to reclaim memory. The agent was slowly leaking — a library was caching LLM responses in memory with no eviction policy, and RSS grew from 200MB to 4GB over a few days. The OOM killer sent SIGKILL, which leaves no Python traceback.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why traditional monitoring missed it
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Process monitoring (systemd, supervisor):&lt;/strong&gt; Saw the exit code, but by the time you check alerts, the damage is done&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log monitoring (Datadog, CloudWatch):&lt;/strong&gt; Nothing to see — OOM kill happens below the application layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU/memory dashboards:&lt;/strong&gt; Would have caught it &lt;em&gt;if&lt;/em&gt; someone was watching. Nobody watches dashboards at 3 AM.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The pattern that catches this
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Positive heartbeat.&lt;/strong&gt; Instead of monitoring for bad signals (errors, crashes), monitor for the &lt;em&gt;absence&lt;/em&gt; of a good signal. The agent must actively report "I'm alive" every N seconds. If the heartbeat stops for any reason — clean exit, OOM, segfault, kernel panic — you know immediately.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Inside your agent's main loop
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;do_work&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;heartbeat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# This is the line that matters
&lt;/span&gt;    &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;heartbeat()&lt;/code&gt; doesn't fire, something is wrong. You don't need to know &lt;em&gt;what&lt;/em&gt; — you need to know &lt;em&gt;when&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure #2: The Zombie Agent
&lt;/h2&gt;

&lt;p&gt;This one is more insidious. The process was running. CPU usage normal. Memory stable. Every health check said "healthy."&lt;/p&gt;

&lt;p&gt;But the agent hadn't done useful work in &lt;strong&gt;four hours&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happened
&lt;/h3&gt;

&lt;p&gt;The agent was stuck on an HTTP request. An upstream API had rotated its TLS certificate, and the request was hanging — the socket was open, the connection was established, but the TLS handshake never completed. No timeout was set on the request (a classic oversight).&lt;/p&gt;

&lt;p&gt;From the outside, the process was "running." From the inside, the main loop was blocked on line 47 of &lt;code&gt;api_client.py&lt;/code&gt;, and it would stay blocked forever.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why traditional monitoring missed it
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PID checks:&lt;/strong&gt; Process exists ✓&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port checks:&lt;/strong&gt; Agent's HTTP server responds ✓ (the health endpoint runs on a separate thread)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU/memory:&lt;/strong&gt; Normal ✓&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The health check thread was fine. The &lt;em&gt;work&lt;/em&gt; thread was dead.&lt;/p&gt;

&lt;h3&gt;
  
  
  The pattern that catches this
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Work-progress heartbeat.&lt;/strong&gt; A background-thread heartbeat (like the one in Failure #1) catches crashes and OOM kills — it proves the &lt;em&gt;process&lt;/em&gt; is alive. But it can't catch zombies, because the health-check thread keeps running even when the work loop is stuck.&lt;/p&gt;

&lt;p&gt;For zombie detection, the heartbeat must come from &lt;em&gt;inside the work loop&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Level 1 — Liveness (background thread)
# Catches: crashes, OOM kills, clean exits
# Misses: zombies, hung calls, deadlocks
&lt;/span&gt;&lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;heartbeat&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Level 2 — Work-progress (inside the loop)
# Catches: everything above + zombies, hung API calls, logic deadlocks
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_from_api&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    &lt;span class="c1"&gt;# If this hangs...
&lt;/span&gt;    &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;heartbeat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;                &lt;span class="c1"&gt;# ...this never fires
&lt;/span&gt;    &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both levels are valid — they answer different questions. A background thread measures "is the process alive?" A work-loop heartbeat measures "is the agent making progress?" For full coverage, you want both.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Try it now&lt;/strong&gt; — monitor your agent in 2 lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;clevagent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;clevagent&lt;/span&gt;
&lt;span class="n"&gt;clevagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;***&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Free for 3 agents. No credit card required. &lt;a href="https://clevagent.io/signup?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=three-failure-modes" rel="noopener noreferrer"&gt;Get your API key →&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure #3: The Runaway Loop
&lt;/h2&gt;

&lt;p&gt;This is the scariest failure mode because the agent looks &lt;em&gt;great&lt;/em&gt;. It's running. It's doing work. It's calling the LLM API, getting responses, processing them, and calling again. Every metric says "healthy."&lt;/p&gt;

&lt;p&gt;Except your bill is exploding.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happened
&lt;/h3&gt;

&lt;p&gt;The agent received a malformed response from an API. It asked the LLM to parse it. The LLM returned a structured output that triggered the same code path again. The agent asked the LLM to re-parse. Same result. Repeat.&lt;/p&gt;

&lt;p&gt;Token usage went from 200/min (normal) to &lt;strong&gt;40,000/min&lt;/strong&gt;. In 40 minutes, it burned through about $50 of API credits. Not catastrophic for a single incident, but imagine this happening overnight with a larger model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why traditional monitoring missed it
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Process health:&lt;/strong&gt; Running ✓&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heartbeat:&lt;/strong&gt; Firing normally ✓ (the loop is &lt;em&gt;running&lt;/em&gt;, just wastefully)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error rate:&lt;/strong&gt; Zero ✓ (no errors — the LLM is responding successfully every time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU/memory:&lt;/strong&gt; Normal ✓ (LLM calls are I/O-bound, not compute-bound)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The pattern that catches this
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cost as a health metric.&lt;/strong&gt; Track token usage (or API cost) per heartbeat cycle. If it spikes 10-100x above baseline, flag it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;start_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_token_count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;do_llm_work&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;end_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_token_count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nf"&gt;heartbeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tokens_used&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;end_tokens&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;cost_estimate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;calculate_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;end_tokens&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the one metric that's unique to LLM-backed agents. Traditional services don't have a per-request cost that can spike 200x. AI agents do.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Monitoring Stack for AI Agents
&lt;/h2&gt;

&lt;p&gt;After dealing with all three failures, I realized the monitoring requirements for AI agents are fundamentally different from those for web services:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What to monitor&lt;/th&gt;
&lt;th&gt;Web service&lt;/th&gt;
&lt;th&gt;AI agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Is it alive?&lt;/td&gt;
&lt;td&gt;Process check&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Positive heartbeat&lt;/strong&gt; (agent must prove it's alive)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is it working?&lt;/td&gt;
&lt;td&gt;Request latency&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Application-level heartbeat&lt;/strong&gt; (from inside the work loop)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is it healthy?&lt;/td&gt;
&lt;td&gt;Error rate&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Cost per cycle&lt;/strong&gt; (token usage as health signal)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The minimum viable version of this is surprisingly simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Put a heartbeat call inside your main loop (not only in a separate health-check thread)&lt;/li&gt;
&lt;li&gt;Include token/cost data in each heartbeat&lt;/li&gt;
&lt;li&gt;Alert on silence (missed heartbeat) and on cost spikes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That alone would have caught all three of my failures within 60 seconds instead of hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where ClevAgent fits
&lt;/h2&gt;

&lt;p&gt;If you don't want to wire this up yourself, ClevAgent packages the same pattern: heartbeat freshness checks, loop and cost-spike detection, auto-restart, and daily reporting for long-running agents.&lt;/p&gt;

&lt;p&gt;But the pattern matters more than the product mention here. Even if you roll your own with a webhook plus PagerDuty, the three signals above — heartbeat, work-progress freshness, and cost tracking — will catch most of the failures that basic infra monitoring misses.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The dangerous cases are not just crashes. They are the hours where the process still looks alive while useful work has stopped or spend has detached from baseline. If you want a runtime watchdog built around those signals, &lt;a href="https://clevagent.io/signup?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=three-failure-modes" rel="noopener noreferrer"&gt;start monitoring with ClevAgent&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/clevagent/why-your-ai-agent-health-check-is-lying-to-you-2ib1"&gt;Why Your AI Agent Health Check Is Lying to You&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://clevagent.io/blog/track-llm-token-costs?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=three-failure-modes" rel="noopener noreferrer"&gt;How to Track LLM Token Costs in Production&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Start monitoring your agents for free
&lt;/h2&gt;

&lt;p&gt;ClevAgent is free for up to 3 agents — no credit card required. Add one line to your agent loop and get heartbeat monitoring, zombie detection, runaway cost alerts, and auto-restart in minutes.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://clevagent.io/signup?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=three-failure-modes" rel="noopener noreferrer"&gt;Start free at clevagent.io →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>Why Your AI Agent Health Check Is Lying to You</title>
      <dc:creator>ClevAgent</dc:creator>
      <pubDate>Wed, 01 Apr 2026 21:36:58 +0000</pubDate>
      <link>https://forem.com/clevagent/why-your-ai-agent-health-check-is-lying-to-you-2ib1</link>
      <guid>https://forem.com/clevagent/why-your-ai-agent-health-check-is-lying-to-you-2ib1</guid>
      <description>&lt;p&gt;Your monitoring dashboard shows green across the board. Process running. Port responding. CPU normal. Memory stable.&lt;/p&gt;

&lt;p&gt;But your AI agent hasn't done anything useful in four hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with traditional health checks
&lt;/h2&gt;

&lt;p&gt;Traditional health checks answer one question: "Is the process alive?" For web servers, that's usually enough. If Nginx is running and responding on port 80, it's probably serving pages.&lt;/p&gt;

&lt;p&gt;AI agents are different. An agent can be alive without being productive. The process is running, but the main work loop is stuck on a hung HTTP call, waiting on a deadlocked mutex, or spinning in a retry loop that will never succeed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three ways health checks lie
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. PID exists ≠ working
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;systemctl status my-agent&lt;/code&gt; says "active (running)". But the agent's main loop has been blocked on &lt;code&gt;requests.get()&lt;/code&gt; for three hours because an upstream API rotated its TLS certificate and the connection is hanging without a timeout.&lt;/p&gt;

&lt;p&gt;The health-check thread runs independently and reports "I'm fine" every 30 seconds.&lt;/p&gt;
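&lt;p&gt;A minimal sketch of that split: the shared counter below stands in for a real &lt;code&gt;/health&lt;/code&gt; endpoint, and a plain &lt;code&gt;sleep&lt;/code&gt; stands in for the hung &lt;code&gt;requests.get()&lt;/code&gt;. The checker keeps reporting even though the work loop never moves.&lt;/p&gt;

```python
import threading
import time

status = {"reports": 0}

def health_check():
    # Runs on its own thread, so it has no idea what the work loop is doing.
    while True:
        status["reports"] += 1  # "I'm fine"
        time.sleep(0.1)

threading.Thread(target=health_check, daemon=True).start()

# Simulate the main work loop blocked on a hung call with no timeout.
time.sleep(0.5)

# The work loop made zero progress, yet health reports kept flowing.
print(status["reports"])
```

&lt;p&gt;The fix isn't a better health endpoint; it's a signal that originates from the thread doing the work.&lt;/p&gt;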

&lt;h3&gt;
  
  
  2. Port responds ≠ working
&lt;/h3&gt;

&lt;p&gt;Many agents expose an HTTP health endpoint. A load balancer pings &lt;code&gt;/health&lt;/code&gt;, gets &lt;code&gt;200 OK&lt;/code&gt;, and assumes everything is fine.&lt;/p&gt;

&lt;p&gt;But the &lt;code&gt;/health&lt;/code&gt; handler runs on a different thread from the agent's work loop. The work loop is dead. The health endpoint is alive. Two completely different things.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. No errors ≠ working
&lt;/h3&gt;

&lt;p&gt;Your error tracking shows zero exceptions. Must be healthy, right?&lt;/p&gt;

&lt;p&gt;Except the agent is caught in a logic loop: parse response → ask LLM to fix → get the same malformed response → repeat. Every request succeeds. Every response is valid. The agent just isn't making progress, and it's burning through API credits at 200x the normal rate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually works
&lt;/h2&gt;

&lt;p&gt;There are two levels of heartbeat protection, and they catch different failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1 — Liveness heartbeat&lt;/strong&gt; (background thread or sidecar). This proves the process is alive. It catches crashes, OOM kills, and clean exits. But it doesn't catch zombies — the health-check thread keeps ticking even when the work loop is stuck on a hung API call.&lt;/p&gt;
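&lt;p&gt;A Level 1 heartbeat can be sketched as a daemon thread. Here &lt;code&gt;send_heartbeat()&lt;/code&gt; just records a timestamp; in production it would be whatever ping your monitoring service expects (the function name and list are for illustration only):&lt;/p&gt;

```python
import threading
import time

beats = []

def send_heartbeat():
    # Stand-in for an HTTP ping to your monitoring service.
    beats.append(time.monotonic())

def start_liveness_heartbeat(interval=0.1):
    def loop():
        while True:
            send_heartbeat()
            time.sleep(interval)
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t

start_liveness_heartbeat()
time.sleep(0.35)
# The daemon thread dies with the process, so a crash or OOM kill
# silences the beats -- which is exactly what Level 1 detects.
print(len(beats))
```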

&lt;p&gt;&lt;strong&gt;Level 2 — Work-progress heartbeat&lt;/strong&gt; (inside the work loop). This proves the agent is doing useful work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;      &lt;span class="c1"&gt;# If this hangs...
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;heartbeat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;              &lt;span class="c1"&gt;# ...this never fires
&lt;/span&gt;    &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;heartbeat()&lt;/code&gt; doesn't fire within the expected interval, something is wrong. You don't need to know what — you need to know when.&lt;/p&gt;

&lt;p&gt;A background-thread heartbeat is better than nothing because it solves the silent-exit problem. But for zombie failures, the heartbeat needs to come from inside the loop that does the actual work. For full coverage, use both.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding cost as a health signal
&lt;/h2&gt;

&lt;p&gt;For LLM-backed agents, there's a third dimension: cost per cycle. A runaway loop doesn't spike CPU because LLM calls are I/O-bound. But it does spike token usage.&lt;/p&gt;

&lt;p&gt;Track tokens per heartbeat cycle. If it jumps 10-100x above baseline, you have a loop even if every other metric says "healthy."&lt;/p&gt;
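&lt;p&gt;One way to sketch that check, with an illustrative rolling window and a 10x threshold (both numbers are assumptions; tune them to your workload):&lt;/p&gt;

```python
from collections import deque

class CostMonitor:
    """Track tokens per heartbeat cycle against a rolling baseline."""

    def __init__(self, window=20, spike_factor=10.0):
        self.history = deque(maxlen=window)
        self.spike_factor = spike_factor

    def record(self, tokens):
        spike = False
        if len(self.history) >= 5:  # wait for a minimal baseline
            baseline = sum(self.history) / len(self.history)
            # A 10x jump in tokens per cycle with flat CPU is the
            # signature of a runaway LLM loop.
            spike = tokens > baseline * self.spike_factor
        self.history.append(tokens)
        return spike

monitor = CostMonitor()
normal = [monitor.record(t) for t in [900, 1100, 1000, 950, 1050, 1000]]
spike = monitor.record(40_000)  # ~40x baseline
print(normal, spike)
```

&lt;p&gt;The bounded &lt;code&gt;deque&lt;/code&gt; keeps the baseline adaptive: gradual cost growth shifts the window, while a sudden jump trips the alert.&lt;/p&gt;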

&lt;h2&gt;
  
  
  The monitoring stack for AI agents
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Web server&lt;/th&gt;
&lt;th&gt;AI agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Is it alive?&lt;/td&gt;
&lt;td&gt;Process check&lt;/td&gt;
&lt;td&gt;Positive heartbeat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is it working?&lt;/td&gt;
&lt;td&gt;Request latency&lt;/td&gt;
&lt;td&gt;Heartbeat from work loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is it healthy?&lt;/td&gt;
&lt;td&gt;Error rate&lt;/td&gt;
&lt;td&gt;Cost per cycle&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The minimum version is simple: put a heartbeat inside your main loop, include token count, and alert on silence and cost spikes. That catches most AI agent failures that traditional monitoring misses.&lt;/p&gt;

&lt;p&gt;I originally wrote this pattern up after debugging long-running agent failures in production. If you want the fuller walkthrough, the canonical version lives on the ClevAgent blog.&lt;/p&gt;

</description>
      <category>productivity</category>
    </item>
  </channel>
</rss>
