<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: ClevAgent</title>
    <description>The latest articles on Forem by ClevAgent (@clevagent).</description>
    <link>https://forem.com/clevagent</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3856167%2Fe4b30e16-0103-4b82-aadd-35957653ac46.png</url>
      <title>Forem: ClevAgent</title>
      <link>https://forem.com/clevagent</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/clevagent"/>
    <language>en</language>
    <item>
      <title>How to Monitor CrewAI Agents in Production</title>
      <dc:creator>ClevAgent</dc:creator>
      <pubDate>Sat, 04 Apr 2026 15:43:39 +0000</pubDate>
      <link>https://forem.com/clevagent/how-to-monitor-crewai-agents-in-production-k6i</link>
      <guid>https://forem.com/clevagent/how-to-monitor-crewai-agents-in-production-k6i</guid>
      <description>&lt;p&gt;If you're running CrewAI crews in production, you've probably hit this: your cron job exits with code 0, but the crew didn't actually finish its work. The researcher agent got stuck retrying a rate-limited API, the analyst never received input, and nobody noticed until Friday.&lt;/p&gt;

&lt;p&gt;Multi-agent orchestration frameworks like CrewAI fail differently from traditional services. &lt;strong&gt;A crew can fail without crashing.&lt;/strong&gt; Here's how to catch those failures with heartbeat monitoring — in about 3 lines of code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why CrewAI crews need dedicated monitoring
&lt;/h2&gt;

&lt;p&gt;CrewAI orchestrates multiple agents that call LLMs, use tools, and pass context to each other. Each agent is a potential failure point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent hangs&lt;/strong&gt;: One agent waits indefinitely for an LLM response. The crew stalls, but the process stays alive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infinite loops&lt;/strong&gt;: An agent retries a failed tool call endlessly. Your token meter spins, but no useful output appears.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silent quality degradation&lt;/strong&gt;: The LLM returns garbage, the next agent processes it anyway, and the final output is subtly wrong. No error thrown.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost spikes&lt;/strong&gt;: A single crew run normally costs $0.15. One bad run costs $12 because an agent kept rephrasing the same request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional process monitoring (systemd, Docker health checks) only tells you the process is alive. It tells you nothing about whether the &lt;em&gt;crew&lt;/em&gt; is making progress.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Try it now&lt;/strong&gt; — monitor your CrewAI agent in 2 lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;clevagent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;clevagent&lt;/span&gt;
&lt;span class="n"&gt;clevagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-crew&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Free for up to 3 agents. No credit card required. &lt;a href="https://clevagent.io/signup?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=crewai-monitoring-guide" rel="noopener noreferrer"&gt;Get your API key →&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Add ClevAgent to your CrewAI crew in 3 lines
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://clevagent.io?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=crewai-monitoring-guide" rel="noopener noreferrer"&gt;ClevAgent&lt;/a&gt; monitors your crew at the agent level — heartbeats, loop detection, and per-run cost tracking. Setup takes about 30 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;clevagent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Initialize before kickoff
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;clevagent&lt;/span&gt;

&lt;span class="n"&gt;clevagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CLEVAGENT_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-research-crew&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. ClevAgent starts sending heartbeats automatically. If your crew hangs or the process dies, you get alerted within 120 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3 (optional): Add a step callback for per-agent tracking
&lt;/h3&gt;

&lt;p&gt;CrewAI supports a &lt;code&gt;step_callback&lt;/code&gt; on each agent. Wire it to ClevAgent to get visibility into each agent's work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;track_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step_output&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;clevagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step_complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;step_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pass this callback when defining your agents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;researcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research Analyst&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find the latest market data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a senior research analyst...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;step_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;track_step&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now every agent step shows up on your dashboard with timing and metadata.&lt;/p&gt;

&lt;h2&gt;
  
  
  Complete example: 2-agent crew with monitoring
&lt;/h2&gt;

&lt;p&gt;Here's a full working example — a research crew with two agents, monitored by ClevAgent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Process&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;clevagent&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize monitoring
&lt;/span&gt;&lt;span class="n"&gt;clevagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CLEVAGENT_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily-research-crew&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;track_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step_output&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;clevagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step_complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;step_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define agents
&lt;/span&gt;&lt;span class="n"&gt;researcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research Analyst&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find the 3 most important tech news stories today&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a senior research analyst who reads dozens of sources daily.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;step_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;track_step&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Report Writer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a concise morning briefing from the research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a technical writer who distills complex topics into clear summaries.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;step_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;track_step&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define tasks
&lt;/span&gt;&lt;span class="n"&gt;research_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search for today&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s top 3 tech news stories. Include source URLs.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A list of 3 news items with title, summary, and source URL.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;researcher&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;writing_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a 200-word morning briefing based on the research.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A formatted briefing email ready to send.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Assemble and run
&lt;/span&gt;&lt;span class="n"&gt;crew&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;researcher&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;research_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;writing_task&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sequential&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;crew&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kickoff&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Report completion with output metadata
&lt;/span&gt;&lt;span class="n"&gt;clevagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crew_complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agents_used&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The entire monitoring integration is three small pieces — the &lt;code&gt;init()&lt;/code&gt;, the &lt;code&gt;track_step&lt;/code&gt; callback, and the final &lt;code&gt;ping()&lt;/code&gt;. Your existing CrewAI code stays exactly the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  What ClevAgent catches
&lt;/h2&gt;

&lt;p&gt;Once connected, ClevAgent watches for three categories of problems:&lt;/p&gt;

&lt;h3&gt;
  
  
  Crew hangs
&lt;/h3&gt;

&lt;p&gt;If no heartbeat arrives for 120 seconds, ClevAgent sends an alert to Telegram or Slack. This catches the most common CrewAI failure: an agent waiting on an LLM call that never returns. Your cron job sees a running process. ClevAgent sees a silent agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent loops
&lt;/h3&gt;

&lt;p&gt;ClevAgent tracks the frequency and pattern of &lt;code&gt;ping()&lt;/code&gt; calls. If an agent sends 50 step completions in 30 seconds with identical metadata, that's a loop. You get a warning before the token bill becomes a problem.&lt;/p&gt;
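&lt;p&gt;For intuition, here is a small self-contained sketch of how that kind of detection can work (again illustrative, not ClevAgent's real algorithm): count pings with identical metadata inside a sliding time window and flag when the count crosses a threshold.&lt;/p&gt;

```python
from collections import deque

def make_loop_detector(window_seconds=30, max_identical=50):
    """Return a recorder that flags a probable loop when too many
    pings with identical metadata arrive inside the window."""
    recent = deque()  # (timestamp, metadata fingerprint) pairs

    def record(timestamp, meta):
        fingerprint = tuple(sorted(meta.items()))
        recent.append((timestamp, fingerprint))
        # Evict pings that have aged out of the window.
        while recent and timestamp - recent[0][0] > window_seconds:
            recent.popleft()
        identical = sum(1 for _, fp in recent if fp == fingerprint)
        return identical >= max_identical

    return record
```

&lt;p&gt;Fifty step completions with the same agent name and output length inside 30 seconds trip the flag; normal varied progress does not.&lt;/p&gt;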

&lt;h3&gt;
  
  
  Token cost spikes
&lt;/h3&gt;

&lt;p&gt;Every &lt;code&gt;ping()&lt;/code&gt; with metadata feeds into per-run cost estimates. ClevAgent compares the current run against your historical average. A run that's 5x the normal cost triggers a warning. You can set a hard budget ceiling per agent in the dashboard — if exceeded, ClevAgent sends an immediate alert.&lt;/p&gt;
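&lt;p&gt;The comparison itself is easy to reason about. Here is an illustrative sketch of that check, using the 5x factor and the hard budget ceiling described above (the real comparison happens server-side):&lt;/p&gt;

```python
def is_cost_anomaly(run_cost, historical_costs, spike_factor=5.0, budget_ceiling=None):
    """Flag a run that busts a hard budget or costs far more than
    the historical average. Costs are dollars per run."""
    if budget_ceiling is not None and run_cost > budget_ceiling:
        return True
    if not historical_costs:
        return False  # no baseline yet, nothing to compare against
    average = sum(historical_costs) / len(historical_costs)
    return run_cost > spike_factor * average

# The $12 bad run from earlier, against a $0.15 average, is an easy catch.
```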

&lt;h2&gt;
  
  
  Use clevagent.ping() for work-progress tracking
&lt;/h2&gt;

&lt;p&gt;Beyond failure detection, &lt;code&gt;ping()&lt;/code&gt; is useful for tracking that your crew is actually doing its job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;crew&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kickoff&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;clevagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crew_complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stories_found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stories&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;word_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the ClevAgent dashboard, this creates a timeline of crew runs. You can see at a glance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did today's 6 AM run actually complete?&lt;/li&gt;
&lt;li&gt;How many stories did it find compared to yesterday?&lt;/li&gt;
&lt;li&gt;Is the output length consistent, or did something degrade?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the difference between "the process ran" and "the crew did useful work." Process monitoring gives you the first. Ping metadata gives you the second.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/clevagent/why-your-ai-agent-health-check-is-lying-to-you-2ib1"&gt;Why Your AI Agent Health Check Is Lying to You&lt;/a&gt; — The hidden gap between "process alive" and "agent working."&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/clevagent/three-ai-agent-failure-modes-that-traditional-monitoring-will-never-catch-2ik4"&gt;Three AI Agent Failure Modes That Traditional Monitoring Will Never Catch&lt;/a&gt; — Silent exits, zombie agents, and runaway loops with real examples.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/clevagent/how-to-monitor-langchain-agents-in-production-2aic"&gt;How to Monitor LangChain Agents in Production&lt;/a&gt; — LangChain callback handler and LangGraph node monitoring.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://clevagent.io/blog/how-to-monitor-ai-agents?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=crewai-monitoring-guide" rel="noopener noreferrer"&gt;How to Monitor AI Agents in Production&lt;/a&gt; — The complete guide to heartbeat-based monitoring for any AI agent framework.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;ClevAgent is free for up to 3 agents. No credit card, no config files, no separate infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://clevagent.io/signup?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=crewai-monitoring-guide" rel="noopener noreferrer"&gt;Start monitoring your CrewAI crews →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>How to Monitor LangChain Agents in Production</title>
      <dc:creator>ClevAgent</dc:creator>
      <pubDate>Fri, 03 Apr 2026 22:01:39 +0000</pubDate>
      <link>https://forem.com/clevagent/how-to-monitor-langchain-agents-in-production-2aic</link>
      <guid>https://forem.com/clevagent/how-to-monitor-langchain-agents-in-production-2aic</guid>
      <description>&lt;p&gt;Your LangChain agent works in development. Chains resolve, tools return, the ReAct loop converges. Ship it. Day one — fine. Day two — 200 requests, zero errors. Day three — your OpenAI bill says $340.&lt;/p&gt;

&lt;p&gt;The agent got stuck in a tool-retry loop at 2 AM. It kept calling a search tool that returned empty results, parsing the response, deciding to search again, and repeating. No exceptions, no crashes, every health check returned 200 OK.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tracing tools like LangSmith would show you the traces after the fact. Nobody would have woken you up at 2 AM when it started.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're running LangChain or LangGraph agents in production, this is the gap between &lt;em&gt;observability&lt;/em&gt; and &lt;em&gt;runtime monitoring&lt;/em&gt;. Here's how to close it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why LangChain agents need runtime monitoring
&lt;/h2&gt;

&lt;p&gt;LangChain agents fail differently from web services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stuck chains&lt;/strong&gt;: An HTTP tool call hangs indefinitely. The chain never completes. The process is alive, the health endpoint responds, but no work is happening.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infinite ReAct loops&lt;/strong&gt;: The agent keeps calling tools without converging. &lt;code&gt;max_iterations&lt;/code&gt; helps, but only caps iteration count — not cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silent cost spikes&lt;/strong&gt;: A loop making 50 LLM calls in 30 seconds doesn't spike CPU. It spikes your API bill. By the time you see the invoice, the damage is done.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zombie agents&lt;/strong&gt;: The callback thread is alive, traces are flowing to LangSmith, but the actual work loop is stuck on a deadlocked resource.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LangSmith and Langfuse are excellent for tracing — understanding &lt;em&gt;what happened&lt;/em&gt; after the fact. But they don't answer the real-time question: &lt;strong&gt;is this agent alive and making progress right now?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Add ClevAgent to your LangChain agent in 3 lines
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1.&lt;/strong&gt; Install the SDK.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;clevagent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2.&lt;/strong&gt; Initialize ClevAgent with your API key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;clevagent&lt;/span&gt;

&lt;span class="n"&gt;clevagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-key-here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;langchain-research-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3.&lt;/strong&gt; Add the callback handler to your LLM or chain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;clevagent.integrations.langchain&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ClevAgentCallbackHandler&lt;/span&gt;

&lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ClevAgentCallbackHandler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every LLM call now sends a heartbeat with token usage. If the agent stops calling the LLM — because a chain hung, a tool timed out, or the process crashed — ClevAgent detects the silence and alerts you.&lt;/p&gt;
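&lt;p&gt;For intuition, the handler follows the standard LangChain callback shape: ping on each LLM call. Here is a stdlib-only sketch of that pattern; in real code it would subclass &lt;code&gt;BaseCallbackHandler&lt;/code&gt; from &lt;code&gt;langchain_core.callbacks&lt;/code&gt; and call &lt;code&gt;clevagent.ping()&lt;/code&gt;, both of which are stubbed here:&lt;/p&gt;

```python
import time

class HeartbeatHandler:
    """Illustrative stand-in for a LangChain callback handler that
    emits a heartbeat with timing metadata on every LLM call."""

    def __init__(self, ping):
        self._ping = ping          # e.g. clevagent.ping in production
        self._started_at = None

    def on_llm_start(self, serialized, prompts, **kwargs):
        self._started_at = time.monotonic()

    def on_llm_end(self, response, **kwargs):
        elapsed = time.monotonic() - self._started_at
        self._ping({"status": "llm_call_complete",
                    "elapsed_s": round(elapsed, 3)})

# Collect pings locally to show the flow.
pings = []
handler = HeartbeatHandler(ping=pings.append)
handler.on_llm_start(serialized={}, prompts=["hello"])
handler.on_llm_end(response=None)
```

&lt;p&gt;When heartbeats like these stop arriving, the monitoring side knows the agent went silent, even though the process itself is still up.&lt;/p&gt;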

&lt;h2&gt;
  
  
  Complete example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;clevagent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;clevagent.integrations.langchain&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ClevAgentCallbackHandler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentExecutor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;create_react_agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Tool&lt;/span&gt;

&lt;span class="n"&gt;clevagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-key-here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ClevAgentCallbackHandler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;search_web&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search the web&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calculate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;calculator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Do math&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_react_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentExecutor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Every LLM call and tool use is now monitored
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research the latest AI agent frameworks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  LangGraph agents: use the node decorator
&lt;/h2&gt;

&lt;p&gt;For LangGraph's graph-based agents, ClevAgent provides a &lt;code&gt;@monitored_node&lt;/code&gt; decorator that wraps each node with automatic heartbeat monitoring:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;clevagent.integrations.langgraph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;monitored_node&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;

&lt;span class="nd"&gt;@monitored_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;research_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

&lt;span class="nd"&gt;@monitored_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarize_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

&lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;research_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;summarize_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each node execution sends a heartbeat. If a node hangs — because an API call never returns or an LLM request times out — ClevAgent detects the gap and alerts you.&lt;/p&gt;

&lt;h2&gt;
  
  
  What ClevAgent catches
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Stuck chains and hung tools
&lt;/h3&gt;

&lt;p&gt;Your agent calls an external API inside a tool. The API hangs. The chain never completes. &lt;code&gt;systemctl status&lt;/code&gt; says "running" — but no heartbeats are arriving.&lt;/p&gt;

&lt;p&gt;ClevAgent detects the silence within your configured threshold (default: 120 seconds) and sends an alert.&lt;/p&gt;
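&lt;p&gt;The server side of this check is conceptually just a timestamp comparison. Here is a minimal sketch of the idea in plain Python (illustrative only, not ClevAgent's internals):&lt;/p&gt;

```python
import time

HEARTBEAT_TIMEOUT = 120  # seconds of silence before alerting

last_seen = {}  # agent name -> unix timestamp of the last heartbeat

def record_heartbeat(agent: str) -> None:
    """Called whenever a heartbeat arrives from an agent."""
    last_seen[agent] = time.time()

def silent_agents() -> list[str]:
    """Agents whose heartbeats have stopped arriving."""
    now = time.time()
    return [name for name, ts in last_seen.items() if now - ts > HEARTBEAT_TIMEOUT]

record_heartbeat("research-agent")
print(silent_agents())  # -> [] while heartbeats are fresh
```

&lt;p&gt;Run the check on a schedule and alert whenever the list is non-empty.&lt;/p&gt;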

&lt;h3&gt;
  
  
  Infinite ReAct loops
&lt;/h3&gt;

&lt;p&gt;The agent enters a loop: call tool → parse result → decide to call tool again → repeat. An agent making 15 iterations of GPT-4o calls in 30 seconds burns through tokens fast.&lt;/p&gt;

&lt;p&gt;ClevAgent tracks cumulative token usage per heartbeat cycle. If tokens spike 10-100x above your agent's baseline, you get a cost alert — while the loop is still running, not after.&lt;/p&gt;
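&lt;p&gt;The spike check itself reduces to comparing each heartbeat cycle against a rolling baseline. A hedged sketch in plain Python (the window size and 10x factor are illustrative, not ClevAgent's actual defaults):&lt;/p&gt;

```python
from collections import deque

class CostSpikeDetector:
    """Flag heartbeat cycles whose token usage far exceeds the recent baseline."""

    def __init__(self, window: int = 50, spike_factor: float = 10.0):
        self.history = deque(maxlen=window)  # tokens used in recent cycles
        self.spike_factor = spike_factor

    def observe(self, tokens: int) -> bool:
        """Record one cycle's token count; True means it looks like a runaway."""
        baseline = sum(self.history) / len(self.history) if self.history else None
        self.history.append(tokens)
        return baseline is not None and tokens > baseline * self.spike_factor

detector = CostSpikeDetector()
for _ in range(20):
    detector.observe(200)        # normal cycles establish the baseline
print(detector.observe(40_000))  # runaway cycle -> True
```

&lt;p&gt;Alerting on the &lt;code&gt;True&lt;/code&gt; result catches the loop while it is still burning tokens, not after the invoice arrives.&lt;/p&gt;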

&lt;h3&gt;
  
  
  Silent exits
&lt;/h3&gt;

&lt;p&gt;The process gets OOM-killed at 3 AM. No traceback, no error log, no alert. ClevAgent expects a heartbeat every N seconds. When it stops arriving, you get an alert within one missed interval. Optional auto-restart brings the agent back without manual intervention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;pip install clevagent&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Get your API key from &lt;a href="https://clevagent.io/signup?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=langchain-monitoring-guide" rel="noopener noreferrer"&gt;clevagent.io/signup&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;clevagent.init()&lt;/code&gt; and the callback handler&lt;/li&gt;
&lt;li&gt;Deploy — ClevAgent starts monitoring immediately&lt;/li&gt;
&lt;li&gt;Configure alerts in the dashboard: Telegram, Slack, Discord, or email&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Free for 3 agents. No credit card required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/clevagent/three-ai-agent-failure-modes-that-traditional-monitoring-will-never-catch-2ik4"&gt;Three AI Agent Failure Modes That Traditional Monitoring Will Never Catch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/clevagent/why-your-ai-agent-health-check-is-lying-to-you-2ib1"&gt;Why Your AI Agent Health Check Is Lying to You&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://clevagent.io/blog/track-llm-token-costs?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=langchain-monitoring-guide" rel="noopener noreferrer"&gt;How to Track LLM Token Costs in Production&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://clevagent.io/blog/crewai-monitoring-guide?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=langchain-monitoring-guide" rel="noopener noreferrer"&gt;How to Monitor CrewAI Agents in Production&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://clevagent.io?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=langchain-monitoring-guide" rel="noopener noreferrer"&gt;ClevAgent&lt;/a&gt; monitors LangChain and LangGraph agents with heartbeat detection, cost tracking, and auto-restart. Free for up to 3 agents — &lt;a href="https://clevagent.io/signup?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=langchain-monitoring-guide" rel="noopener noreferrer"&gt;start monitoring →&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Three AI Agent Failure Modes That Traditional Monitoring Will Never Catch</title>
      <dc:creator>ClevAgent</dc:creator>
      <pubDate>Thu, 02 Apr 2026 19:23:31 +0000</pubDate>
      <link>https://forem.com/clevagent/three-ai-agent-failure-modes-that-traditional-monitoring-will-never-catch-2ik4</link>
      <guid>https://forem.com/clevagent/three-ai-agent-failure-modes-that-traditional-monitoring-will-never-catch-2ik4</guid>
      <description>&lt;p&gt;One of my agents exited cleanly at 3 AM, another sat "healthy" while doing zero useful work for four hours, and a third burned through $50 in API credits in 40 minutes without throwing a single error.&lt;/p&gt;

&lt;p&gt;Those incidents looked unrelated at first. They weren't. All three slipped past the usual stack of process checks, log watchers, and CPU or memory alerts because those tools were measuring infrastructure symptoms, not whether the agent was still doing useful work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure #1: The Silent Exit
&lt;/h2&gt;

&lt;p&gt;One of my agents exited cleanly at 3 AM. No traceback. No error log. No crash dump. The Python process simply stopped. My log monitoring saw nothing because there was nothing to log.&lt;/p&gt;

&lt;p&gt;I found out &lt;strong&gt;six hours later&lt;/strong&gt; when I noticed the bot hadn't posted since 3 AM.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happened
&lt;/h3&gt;

&lt;p&gt;The OS killed the process to reclaim memory. The agent was slowly leaking — a library was caching LLM responses in memory with no eviction policy, and RSS grew from 200MB to 4GB over a few days. The OOM killer sent SIGKILL, which leaves no Python traceback.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why traditional monitoring missed it
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Process monitoring (systemd, supervisor):&lt;/strong&gt; Saw the exit code, but by the time you check alerts, the damage is done&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log monitoring (Datadog, CloudWatch):&lt;/strong&gt; Nothing to see — OOM kill happens below the application layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU/memory dashboards:&lt;/strong&gt; Would have caught it &lt;em&gt;if&lt;/em&gt; someone was watching. Nobody watches dashboards at 3 AM.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The pattern that catches this
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Positive heartbeat.&lt;/strong&gt; Instead of monitoring for bad signals (errors, crashes), monitor for the &lt;em&gt;absence&lt;/em&gt; of a good signal. The agent must actively report "I'm alive" every N seconds. If the heartbeat stops for any reason — clean exit, OOM, segfault, kernel panic — you know immediately.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Inside your agent's main loop
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;do_work&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;heartbeat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# This is the line that matters
&lt;/span&gt;    &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;heartbeat()&lt;/code&gt; doesn't fire, something is wrong. You don't need to know &lt;em&gt;what&lt;/em&gt; — you need to know &lt;em&gt;when&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure #2: The Zombie Agent
&lt;/h2&gt;

&lt;p&gt;This one is more insidious. The process was running. CPU usage normal. Memory stable. Every health check said "healthy."&lt;/p&gt;

&lt;p&gt;But the agent hadn't done useful work in &lt;strong&gt;four hours&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happened
&lt;/h3&gt;

&lt;p&gt;The agent was stuck on an HTTP request. An upstream API had rotated its TLS certificate, and the request was hanging — the socket was open, the connection was established, but the TLS handshake never completed. No timeout was set on the request (a classic oversight).&lt;/p&gt;

&lt;p&gt;From the outside, the process was "running." From the inside, the main loop was blocked on line 47 of &lt;code&gt;api_client.py&lt;/code&gt;, and it would stay blocked forever.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why traditional monitoring missed it
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PID checks:&lt;/strong&gt; Process exists ✓&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port checks:&lt;/strong&gt; Agent's HTTP server responds ✓ (the health endpoint runs on a separate thread)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU/memory:&lt;/strong&gt; Normal ✓&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The health check thread was fine. The &lt;em&gt;work&lt;/em&gt; thread was dead.&lt;/p&gt;

&lt;h3&gt;
  
  
  The pattern that catches this
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Work-progress heartbeat.&lt;/strong&gt; A background-thread heartbeat (like the one in Failure #1) catches crashes and OOM kills — it proves the &lt;em&gt;process&lt;/em&gt; is alive. But it can't catch zombies, because the health-check thread keeps running even when the work loop is stuck.&lt;/p&gt;

&lt;p&gt;For zombie detection, the heartbeat must come from &lt;em&gt;inside the work loop&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Level 1 — Liveness (background thread)
# Catches: crashes, OOM kills, clean exits
# Misses: zombies, hung calls, deadlocks
&lt;/span&gt;&lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;heartbeat&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Level 2 — Work-progress (inside the loop)
# Catches: everything above + zombies, hung API calls, logic deadlocks
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_from_api&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    &lt;span class="c1"&gt;# If this hangs...
&lt;/span&gt;    &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;heartbeat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;                &lt;span class="c1"&gt;# ...this never fires
&lt;/span&gt;    &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both levels are valid — they answer different questions. A background thread measures "is the process alive?" A work-loop heartbeat measures "is the agent making progress?" For full coverage, you want both.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Try it now&lt;/strong&gt; — monitor your agent in 2 lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;clevagent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;clevagent&lt;/span&gt;
&lt;span class="n"&gt;clevagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;***&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Free for 3 agents. No credit card required. &lt;a href="https://clevagent.io/signup?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=three-failure-modes" rel="noopener noreferrer"&gt;Get your API key →&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure #3: The Runaway Loop
&lt;/h2&gt;

&lt;p&gt;This is the scariest failure mode because the agent looks &lt;em&gt;great&lt;/em&gt;. It's running. It's doing work. It's calling the LLM API, getting responses, processing them, and calling again. Every metric says "healthy."&lt;/p&gt;

&lt;p&gt;Except your bill is exploding.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happened
&lt;/h3&gt;

&lt;p&gt;The agent received a malformed response from an API. It asked the LLM to parse it. The LLM returned a structured output that triggered the same code path again. The agent asked the LLM to re-parse. Same result. Repeat.&lt;/p&gt;

&lt;p&gt;Token usage went from 200/min (normal) to &lt;strong&gt;40,000/min&lt;/strong&gt;. In 40 minutes, it burned through about $50 of API credits. Not catastrophic for a single incident, but imagine this happening overnight with a larger model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why traditional monitoring missed it
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Process health:&lt;/strong&gt; Running ✓&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heartbeat:&lt;/strong&gt; Firing normally ✓ (the loop is &lt;em&gt;running&lt;/em&gt;, just wastefully)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error rate:&lt;/strong&gt; Zero ✓ (no errors — the LLM is responding successfully every time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU/memory:&lt;/strong&gt; Normal ✓ (LLM calls are I/O-bound, not compute-bound)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The pattern that catches this
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cost as a health metric.&lt;/strong&gt; Track token usage (or API cost) per heartbeat cycle. If it spikes 10-100x above baseline, flag it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;start_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_token_count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;do_llm_work&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;end_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_token_count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nf"&gt;heartbeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tokens_used&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;end_tokens&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;cost_estimate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;calculate_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;end_tokens&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the one metric that's unique to LLM-backed agents. Traditional services don't have a per-request cost that can spike 200x. AI agents do.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Monitoring Stack for AI Agents
&lt;/h2&gt;

&lt;p&gt;After dealing with all three failures, I realized the monitoring requirements for AI agents are fundamentally different from those for web services:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What to monitor&lt;/th&gt;
&lt;th&gt;Web service&lt;/th&gt;
&lt;th&gt;AI agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Is it alive?&lt;/td&gt;
&lt;td&gt;Process check&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Positive heartbeat&lt;/strong&gt; (agent must prove it's alive)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is it working?&lt;/td&gt;
&lt;td&gt;Request latency&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Application-level heartbeat&lt;/strong&gt; (from inside the work loop)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is it healthy?&lt;/td&gt;
&lt;td&gt;Error rate&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Cost per cycle&lt;/strong&gt; (token usage as health signal)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The minimum viable version of this is surprisingly simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Put a heartbeat call inside your main loop (not only in a separate health-check thread)&lt;/li&gt;
&lt;li&gt;Include token/cost data in each heartbeat&lt;/li&gt;
&lt;li&gt;Alert on silence (missed heartbeat) and on cost spikes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That alone would have caught all three of my failures within 60 seconds instead of hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where ClevAgent fits
&lt;/h2&gt;

&lt;p&gt;If you don't want to wire this up yourself, ClevAgent packages the same pattern: heartbeat freshness checks, loop and cost-spike detection, auto-restart, and daily reporting for long-running agents.&lt;/p&gt;

&lt;p&gt;But the pattern matters more than the product mention here. Even if you roll your own with a webhook plus PagerDuty, the three signals above — heartbeat, work-progress freshness, and cost tracking — will catch most of the failures that basic infra monitoring misses.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The dangerous cases are not just crashes. They are the hours where the process still looks alive while useful work has stopped or spend has detached from baseline. If you want a runtime watchdog built around those signals, &lt;a href="https://clevagent.io/signup?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=three-failure-modes" rel="noopener noreferrer"&gt;start monitoring with ClevAgent&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/clevagent/why-your-ai-agent-health-check-is-lying-to-you-2ib1"&gt;Why Your AI Agent Health Check Is Lying to You&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://clevagent.io/blog/track-llm-token-costs?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=three-failure-modes" rel="noopener noreferrer"&gt;How to Track LLM Token Costs in Production&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Start monitoring your agents for free
&lt;/h2&gt;

&lt;p&gt;ClevAgent is free for up to 3 agents — no credit card required. Add one line to your agent loop and get heartbeat monitoring, zombie detection, runaway cost alerts, and auto-restart in minutes.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://clevagent.io/signup?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=three-failure-modes" rel="noopener noreferrer"&gt;Start free at clevagent.io →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>Why Your AI Agent Health Check Is Lying to You</title>
      <dc:creator>ClevAgent</dc:creator>
      <pubDate>Wed, 01 Apr 2026 21:36:58 +0000</pubDate>
      <link>https://forem.com/clevagent/why-your-ai-agent-health-check-is-lying-to-you-2ib1</link>
      <guid>https://forem.com/clevagent/why-your-ai-agent-health-check-is-lying-to-you-2ib1</guid>
      <description>&lt;p&gt;Your monitoring dashboard shows green across the board. Process running. Port responding. CPU normal. Memory stable.&lt;/p&gt;

&lt;p&gt;But your AI agent hasn't done anything useful in four hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with traditional health checks
&lt;/h2&gt;

&lt;p&gt;Traditional health checks answer one question: "Is the process alive?" For web servers, that's usually enough. If Nginx is running and responding on port 80, it's probably serving pages.&lt;/p&gt;

&lt;p&gt;AI agents are different. An agent can be alive without being productive. The process is running, but the main work loop is stuck on a hung HTTP call, waiting on a deadlocked mutex, or spinning in a retry loop that will never succeed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three ways health checks lie
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. PID exists ≠ working
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;systemctl status my-agent&lt;/code&gt; says "active (running)". But the agent's main loop has been blocked on &lt;code&gt;requests.get()&lt;/code&gt; for three hours because an upstream API rotated its TLS certificate and the connection is hanging without a timeout.&lt;/p&gt;

&lt;p&gt;The health-check thread runs independently and reports "I'm fine" every 30 seconds.&lt;/p&gt;
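&lt;p&gt;A minimal sketch of that split: the shared counter below stands in for a real &lt;code&gt;/health&lt;/code&gt; endpoint, and a plain &lt;code&gt;sleep&lt;/code&gt; stands in for the hung &lt;code&gt;requests.get()&lt;/code&gt;. The checker keeps reporting even though the work loop never moves.&lt;/p&gt;

```python
import threading
import time

status = {"reports": 0}

def health_check():
    # Runs on its own thread, so it has no idea what the work loop is doing.
    while True:
        status["reports"] += 1  # "I'm fine"
        time.sleep(0.1)

threading.Thread(target=health_check, daemon=True).start()

# Simulate the main work loop blocked on a hung call with no timeout.
time.sleep(0.5)

# The work loop made zero progress, yet health reports kept flowing.
print(status["reports"])
```

&lt;p&gt;The fix isn't a better health endpoint; it's a signal that originates from the thread doing the work.&lt;/p&gt;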

&lt;h3&gt;
  
  
  2. Port responds ≠ working
&lt;/h3&gt;

&lt;p&gt;Many agents expose an HTTP health endpoint. A load balancer pings &lt;code&gt;/health&lt;/code&gt;, gets &lt;code&gt;200 OK&lt;/code&gt;, and assumes everything is fine.&lt;/p&gt;

&lt;p&gt;But the &lt;code&gt;/health&lt;/code&gt; handler runs on a different thread from the agent's work loop. The work loop is dead. The health endpoint is alive. Two completely different things.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. No errors ≠ working
&lt;/h3&gt;

&lt;p&gt;Your error tracking shows zero exceptions. Must be healthy, right?&lt;/p&gt;

&lt;p&gt;Except the agent is caught in a logic loop: parse response → ask LLM to fix → get the same malformed response → repeat. Every request succeeds. Every response is valid. The agent just isn't making progress, and it's burning through API credits at 200x the normal rate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually works
&lt;/h2&gt;

&lt;p&gt;There are two levels of heartbeat protection, and they catch different failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1 — Liveness heartbeat&lt;/strong&gt; (background thread or sidecar). This proves the process is alive. It catches crashes, OOM kills, and clean exits. But it doesn't catch zombies — the health-check thread keeps ticking even when the work loop is stuck on a hung API call.&lt;/p&gt;
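&lt;p&gt;A Level 1 heartbeat can be sketched as a daemon thread. Here &lt;code&gt;send_heartbeat()&lt;/code&gt; just records a timestamp; in production it would be whatever ping your monitoring service expects (the function name and list are for illustration only):&lt;/p&gt;

```python
import threading
import time

beats = []

def send_heartbeat():
    # Stand-in for an HTTP ping to your monitoring service.
    beats.append(time.monotonic())

def start_liveness_heartbeat(interval=0.1):
    def loop():
        while True:
            send_heartbeat()
            time.sleep(interval)
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t

start_liveness_heartbeat()
time.sleep(0.35)
# The daemon thread dies with the process, so a crash or OOM kill
# silences the beats -- which is exactly what Level 1 detects.
print(len(beats))
```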

&lt;p&gt;&lt;strong&gt;Level 2 — Work-progress heartbeat&lt;/strong&gt; (inside the work loop). This proves the agent is doing useful work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;      &lt;span class="c1"&gt;# If this hangs...
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;heartbeat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;              &lt;span class="c1"&gt;# ...this never fires
&lt;/span&gt;    &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;heartbeat()&lt;/code&gt; doesn't fire within the expected interval, something is wrong. You don't need to know what — you need to know when.&lt;/p&gt;

&lt;p&gt;A background-thread heartbeat is better than nothing because it solves the silent-exit problem. But for zombie failures, the heartbeat needs to come from inside the loop that does the actual work. For full coverage, use both.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding cost as a health signal
&lt;/h2&gt;

&lt;p&gt;For LLM-backed agents, there's a third dimension: cost per cycle. A runaway loop doesn't spike CPU because LLM calls are I/O-bound. But it does spike token usage.&lt;/p&gt;

&lt;p&gt;Track tokens per heartbeat cycle. If it jumps 10-100x above baseline, you have a loop even if every other metric says "healthy."&lt;/p&gt;
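&lt;p&gt;One way to sketch that check, with an illustrative rolling window and a 10x threshold (both numbers are assumptions; tune them to your workload):&lt;/p&gt;

```python
from collections import deque

class CostMonitor:
    """Track tokens per heartbeat cycle against a rolling baseline."""

    def __init__(self, window=20, spike_factor=10.0):
        self.history = deque(maxlen=window)
        self.spike_factor = spike_factor

    def record(self, tokens):
        spike = False
        if len(self.history) >= 5:  # wait for a minimal baseline
            baseline = sum(self.history) / len(self.history)
            # A 10x jump in tokens per cycle with flat CPU is the
            # signature of a runaway LLM loop.
            spike = tokens > baseline * self.spike_factor
        self.history.append(tokens)
        return spike

monitor = CostMonitor()
normal = [monitor.record(t) for t in [900, 1100, 1000, 950, 1050, 1000]]
spike = monitor.record(40_000)  # ~40x baseline
print(normal, spike)
```

&lt;p&gt;The bounded &lt;code&gt;deque&lt;/code&gt; keeps the baseline adaptive: gradual cost growth shifts the window, while a sudden jump trips the alert.&lt;/p&gt;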

&lt;h2&gt;
  
  
  The monitoring stack for AI agents
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Web server&lt;/th&gt;
&lt;th&gt;AI agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Is it alive?&lt;/td&gt;
&lt;td&gt;Process check&lt;/td&gt;
&lt;td&gt;Positive heartbeat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is it working?&lt;/td&gt;
&lt;td&gt;Request latency&lt;/td&gt;
&lt;td&gt;Heartbeat from work loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is it healthy?&lt;/td&gt;
&lt;td&gt;Error rate&lt;/td&gt;
&lt;td&gt;Cost per cycle&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The minimum version is simple: put a heartbeat inside your main loop, include token count, and alert on silence and cost spikes. That catches most AI agent failures that traditional monitoring misses.&lt;/p&gt;

&lt;p&gt;I originally wrote this pattern up after debugging long-running agent failures in production. If you want the fuller walkthrough, the canonical version lives on the ClevAgent blog.&lt;/p&gt;

</description>
      <category>productivity</category>
    </item>
  </channel>
</rss>
