<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: tomas maiorino</title>
    <description>The latest articles on Forem by tomas maiorino (@tomasmaiorino).</description>
    <link>https://forem.com/tomasmaiorino</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F105630%2F081fcae6-cc80-480f-8648-4afa53d1ed59.jpeg</url>
      <title>Forem: tomas maiorino</title>
      <link>https://forem.com/tomasmaiorino</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/tomasmaiorino"/>
    <language>en</language>
    <item>
      <title>Using AI Agents to Debug Distributed Systems in Under a Minute</title>
      <dc:creator>tomas maiorino</dc:creator>
      <pubDate>Wed, 01 Apr 2026 22:30:57 +0000</pubDate>
      <link>https://forem.com/tomasmaiorino/using-ai-agents-to-debug-distributed-systems-in-under-a-minute-4j20</link>
      <guid>https://forem.com/tomasmaiorino/using-ai-agents-to-debug-distributed-systems-in-under-a-minute-4j20</guid>
      <description>&lt;h2&gt;
  
  
  Using AI Agents to Debug Distributed Systems Faster
&lt;/h2&gt;

&lt;p&gt;At my company, we have a feature that allows customers to export large volumes of data to cloud providers.&lt;/p&gt;

&lt;p&gt;Under the hood, this export process is split into multiple &lt;strong&gt;tasks&lt;/strong&gt;, where each task is responsible for exporting a subset of objects. These tasks are executed by pods in a &lt;strong&gt;multi-tenant Kubernetes environment&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;From time to time, we receive alerts indicating that some tasks are taking too long to start and remain in the queue for an extended period.&lt;/p&gt;

&lt;p&gt;When that happens, an investigation begins.&lt;/p&gt;

&lt;p&gt;The challenge is that this analysis is usually &lt;strong&gt;slow, manual, and repetitive&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A typical investigation involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checking the status of each task and validating key attributes&lt;/li&gt;
&lt;li&gt;Reviewing tenant configurations to identify values that may cause issues&lt;/li&gt;
&lt;li&gt;Inspecting overall cluster health&lt;/li&gt;
&lt;li&gt;Analyzing how many tasks each tenant has created&lt;/li&gt;
&lt;li&gt;Cross-checking configuration in Bitbucket&lt;/li&gt;
&lt;li&gt;Making multiple API calls across services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This process can easily take several minutes, and sometimes much longer, especially during active incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Idea: Automate the Investigation with an AI Agent
&lt;/h2&gt;

&lt;p&gt;We decided to speed things up by building an &lt;strong&gt;AI-powered agent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The goal was simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Automatically gather the relevant data, analyze it, and provide a probable root cause in seconds.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;To achieve this, we built two main components.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. MCP Server
&lt;/h3&gt;

&lt;p&gt;We created an MCP server that exposes a set of tools wrapping our internal APIs.&lt;/p&gt;

&lt;p&gt;These tools allow the agent to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query task status&lt;/li&gt;
&lt;li&gt;Fetch tenant configurations&lt;/li&gt;
&lt;li&gt;Inspect system limits such as max replicas&lt;/li&gt;
&lt;li&gt;Retrieve cluster-level information&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. AI Agent
&lt;/h3&gt;

&lt;p&gt;On top of that, we built an AI agent that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses the MCP tools&lt;/li&gt;
&lt;li&gt;Analyzes the collected data&lt;/li&gt;
&lt;li&gt;Produces a structured diagnostic report&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Input: From Alert Logs to Insight
&lt;/h2&gt;

&lt;p&gt;At the time, we did not have a direct integration with our alerting system.&lt;/p&gt;

&lt;p&gt;Because of that, we designed the agent to interpret the log lines included in the alert and generate a report from them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example Input
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Generate report for test environment
Long waiting in queue tasks amount:13 ; Affected tenants: tenant3, tenant2, tenant5, tenant4, tenant1,tenant6

Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant1 - groupId: 528f72d7-4f5d-4559-a825-b05014114dc7 - taskId: 528f72d7-4f5d-4559-a825-b05014114dc7 - 1h 33m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant1 - groupId: e7b2449f-0ded-4627-a6b6-eb305e074503 - taskId: e7b2449f-0ded-4627-a6b6-eb305e074503 - 1h 33m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant2 - groupId: b481240d-d166-45cd-92dd-9f45d99c17f0 - taskId: b481240d-d166-45cd-92dd-9f45d99c17f0 - 1h 33m 6s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant3 - groupId: e1514831-b1b3-444f-a209-118c82718fbe - taskId: e1514831-b1b3-444f-a209-118c82718fbe - 1h 32m 10s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant1 - groupId: bad9fb64-ec1a-46e8-b3cd-91ded32cd551 - taskId: bad9fb64-ec1a-46e8-b3cd-91ded32cd551 - 1h 32m 0s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant4 - groupId: c875921a-03d7-498b-aba4-848703e398d7 - taskId: c875921a-03d7-498b-aba4-848703e398d7 - 1h 24m 40s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant4 - groupId: b2eb4f12-439f-4027-9b07-be8d38c18d91 - taskId: b2eb4f12-439f-4027-9b07-be8d38c18d91 - 1h 24m 26s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant5 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: 29b656a1-e77b-41b9-bcdc-4d0ee9edc282 - 1h 21m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant5 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: 2fe2d3a1-0ac1-42d2-996e-0edc5a075fe7 - 1h 21m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant6 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: 45e198f9-e629-47c0-ac5f-cf8eabbdcaf6 - 1h 21m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant5 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: 9ba638b7-ed61-4977-991b-2aa691a38d73 - 1h 21m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant5 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: cfc26a65-30eb-4264-88e0-307b878e4d3c - 1h 21m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant5 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: d3748c08-baf5-43e4-98e7-7dd75d1ce43d - 1h 21m 43s - com.export.tasks.local.ExportTask
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
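
&lt;p&gt;To make the input format concrete, here is a minimal sketch of how one of the log lines above could be turned into structured data before any tool is called. It uses only the JDK. The names &lt;code&gt;QueueAlertLineParser&lt;/code&gt; and &lt;code&gt;QueuedTask&lt;/code&gt; are hypothetical and are not taken from the demo repository, and the regular expression assumes every alert line follows the exact layout shown in the example.&lt;/p&gt;

```java
import java.time.Duration;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the demo repository.
final class QueueAlertLineParser {

    // Hypothetical value type; the field names mirror the labels in the log line.
    record QueuedTask(String tenant, String groupId, String taskId, Duration waiting) {}

    // Groups: 1 tenant, 2 groupId, 3 taskId, 4 hours (optional), 5 minutes (optional), 6 seconds.
    private static final Pattern LINE = Pattern.compile(
        "Long waiting in queue task: (\\S+) - groupId: (\\S+) - taskId: (\\S+) - "
            + "(?:(\\d+)h )?(?:(\\d+)m )?(\\d+)s - ");

    // Returns null when the line is not a "Long waiting in queue task" entry.
    static QueuedTask parse(String logLine) {
        Matcher m = LINE.matcher(logLine);
        if (!m.find()) {
            return null;
        }
        long hours = m.group(4) == null ? 0 : Long.parseLong(m.group(4));
        long minutes = m.group(5) == null ? 0 : Long.parseLong(m.group(5));
        long seconds = Long.parseLong(m.group(6));
        Duration waiting = Duration.ofHours(hours).plusMinutes(minutes).plusSeconds(seconds);
        return new QueuedTask(m.group(1), m.group(2), m.group(3), waiting);
    }

    public static void main(String[] args) {
        String line = "Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: "
            + "tenant5 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - "
            + "taskId: 29b656a1-e77b-41b9-bcdc-4d0ee9edc282 - 1h 21m 43s - "
            + "com.export.tasks.local.ExportTask";
        QueuedTask task = parse(line);
        // prints: tenant5 waited 81 minutes
        System.out.println(task.tenant() + " waited " + task.waiting().toMinutes() + " minutes");
    }
}
```

&lt;p&gt;In the setup described in this article the language model interprets the alert text itself, so a parser like this is optional. It could still be useful for validating the tenant, group, and task IDs the agent extracts.&lt;/p&gt;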



&lt;h2&gt;
  
  
  Example Response
&lt;/h2&gt;

&lt;p&gt;Below is an example of the report generated by the agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tenant Summary
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tenant ID:&lt;/strong&gt; &lt;code&gt;tenant11&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tenant Name:&lt;/strong&gt; &lt;code&gt;Unknown&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total Tasks:&lt;/strong&gt; &lt;code&gt;58&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tasks Status
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Group ID:&lt;/strong&gt; &lt;code&gt;e4361de7-280e-4906-9cc5-7a56c056a959&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status:&lt;/strong&gt; &lt;code&gt;SCHEDULED&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tasks in Queue
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task ID:&lt;/strong&gt; &lt;code&gt;7b67dd23-fdca-4acb-ab28-88bf8597c7a0&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Group ID:&lt;/strong&gt; &lt;code&gt;951b184b-e639-43cc-b837-0768b1f447ab&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;parallelExecution:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tenant Config
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;maxPeriodicTasksPerTenant:&lt;/strong&gt; &lt;code&gt;35&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;maxTaskPartsCount:&lt;/strong&gt; &lt;code&gt;13&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Credit Balance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;priorityCredits:&lt;/strong&gt; &lt;code&gt;23408&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;standardSyncCredits:&lt;/strong&gt; &lt;code&gt;2796&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;standardAsyncCredits:&lt;/strong&gt; &lt;code&gt;-18368&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;internalCredits:&lt;/strong&gt; &lt;code&gt;-89636&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Problem Areas
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Tasks with &lt;code&gt;parallelExecution=false&lt;/code&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tenant:&lt;/strong&gt; &lt;code&gt;tenant1&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Group ID:&lt;/strong&gt; &lt;code&gt;ec39f0e0-2093-43c0-93ed-73cc2675337f&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task ID:&lt;/strong&gt; &lt;code&gt;5b9280cd-9a10-4786-9ae0-000a26e6b0ce&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Tenant:&lt;/strong&gt; &lt;code&gt;tenant3&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Group ID:&lt;/strong&gt; &lt;code&gt;46588442-a969-4568-9a81-a7356727180f&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task ID:&lt;/strong&gt; &lt;code&gt;56193050-578e-4803-bca1-5c059e64fe3e&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Tenant:&lt;/strong&gt; &lt;code&gt;tenant2&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Group ID:&lt;/strong&gt; &lt;code&gt;951b184b-e639-43cc-b837-0768b1f447ab&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task ID:&lt;/strong&gt; &lt;code&gt;7b67dd23-fdca-4acb-ab28-88bf8597c7a0&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  Tenants with &lt;code&gt;maxTaskPartsCount&lt;/code&gt; or &lt;code&gt;maxPeriodicTasksPerTenant&lt;/code&gt; equal to or higher than 30% of the max replicas
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Max Replicas:&lt;/strong&gt; &lt;code&gt;84&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tenant:&lt;/strong&gt; &lt;code&gt;tenant6&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;maxTaskPartsCount:&lt;/strong&gt; &lt;code&gt;30&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Usage:&lt;/strong&gt; &lt;code&gt;35.7% of max replicas&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Tenant:&lt;/strong&gt; &lt;code&gt;tenant5&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;maxPeriodicTasksPerTenant:&lt;/strong&gt; &lt;code&gt;52&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Usage:&lt;/strong&gt; &lt;code&gt;61.9% of max replicas&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
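
&lt;p&gt;The &lt;code&gt;Usage&lt;/code&gt; figures in this section are consistent with dividing the tenant limit by max replicas (30 / 84 and 52 / 84). The sketch below reproduces that arithmetic and the 30% threshold from the heading. It is an illustration only: the class and method names are hypothetical, and the article does not show how the agent itself performs this check.&lt;/p&gt;

```java
import java.util.Locale;

// Hypothetical helper, not part of the demo repository.
final class ReplicaLimitCheck {

    // Threshold from the report heading: limits at or above 30% of max replicas are flagged.
    static final int THRESHOLD_PERCENT = 30;

    // Share of the cluster's max replicas that a single tenant limit could occupy.
    static double usageRatio(int tenantLimit, int maxReplicas) {
        return (double) tenantLimit / maxReplicas;
    }

    // Integer arithmetic avoids floating-point rounding exactly at the threshold.
    static boolean isFlagged(int tenantLimit, int maxReplicas) {
        return tenantLimit * 100L >= (long) maxReplicas * THRESHOLD_PERCENT;
    }

    static String describe(String tenant, String limitName, int tenantLimit, int maxReplicas) {
        return String.format(Locale.ROOT, "%s %s=%d -> %.1f%% of max replicas (flagged: %b)",
            tenant, limitName, tenantLimit,
            100.0 * usageRatio(tenantLimit, maxReplicas),
            isFlagged(tenantLimit, maxReplicas));
    }

    public static void main(String[] args) {
        int maxReplicas = 84; // value from the sample report
        // prints: tenant6 maxTaskPartsCount=30 -> 35.7% of max replicas (flagged: true)
        System.out.println(describe("tenant6", "maxTaskPartsCount", 30, maxReplicas));
        // prints: tenant5 maxPeriodicTasksPerTenant=52 -> 61.9% of max replicas (flagged: true)
        System.out.println(describe("tenant5", "maxPeriodicTasksPerTenant", 52, maxReplicas));
    }
}
```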

&lt;h2&gt;
  
  
  Another Example of Agent Output
&lt;/h2&gt;

&lt;p&gt;In another test, the agent produced the following findings:&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem Areas
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Tasks with &lt;code&gt;parallelExecution=false&lt;/code&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tenant ID:&lt;/strong&gt; &lt;code&gt;tenant6&lt;/code&gt;, &lt;strong&gt;Group ID:&lt;/strong&gt; &lt;code&gt;ab92df1b-9c82-4940-8e25-dfa77f275ebb&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tenant ID:&lt;/strong&gt; &lt;code&gt;tenant3&lt;/code&gt;, &lt;strong&gt;Group ID:&lt;/strong&gt; &lt;code&gt;e8143f64-11b1-48a7-960f-b005e5805871&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tenant ID:&lt;/strong&gt; &lt;code&gt;tenant5&lt;/code&gt;, &lt;strong&gt;Group ID:&lt;/strong&gt; &lt;code&gt;d5812c89-3961-4f4d-bf1b-b08c36833a06&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tenant ID:&lt;/strong&gt; &lt;code&gt;tenant4&lt;/code&gt;, &lt;strong&gt;Group ID:&lt;/strong&gt; &lt;code&gt;c875921a-03d7-498b-aba4-848703e398d7&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tenant ID:&lt;/strong&gt; &lt;code&gt;tenant1&lt;/code&gt;, &lt;strong&gt;Group ID:&lt;/strong&gt; &lt;code&gt;8ed4824d-3333-4f9b-b442-123446b2006d&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tenant ID:&lt;/strong&gt; &lt;code&gt;tenant2&lt;/code&gt;, &lt;strong&gt;Group ID:&lt;/strong&gt; &lt;code&gt;63801bb0-b5f7-44a9-bd94-84ccb9cc845e&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cluster Saturation
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total active tasks:&lt;/strong&gt; &lt;code&gt;86&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Max replicas:&lt;/strong&gt; &lt;code&gt;38&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means the total number of active tasks was already higher than the available capacity for that environment.&lt;/p&gt;

&lt;h4&gt;
  
  
  Tasks with Lower Throughput
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tenant ID:&lt;/strong&gt; &lt;code&gt;tenant1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Group ID:&lt;/strong&gt; &lt;code&gt;528f72d7-4f5d-4559-a825-b05014114dc7&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Task Statuses Seen in the Analysis
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CANCELED&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;COMPLETED&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SCHEDULED&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;PROCESSING&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;PAUSED&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SCHEDULED_POLL&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Tenants with High Limits Relative to Cluster Capacity
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tenant:&lt;/strong&gt; &lt;code&gt;tenant1&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;maxPeriodicTasksPerTenant:&lt;/strong&gt; &lt;code&gt;86&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;This is equal to or higher than 30% of the max replicas for that environment.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  What the Agent Detects
&lt;/h2&gt;

&lt;p&gt;From reports like the ones above, the agent can highlight:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tasks that cannot run in parallel because &lt;code&gt;parallelExecution=false&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Cluster saturation issues, where active tasks exceed available replicas&lt;/li&gt;
&lt;li&gt;Misconfigured tenants exceeding safe limits&lt;/li&gt;
&lt;li&gt;Credit imbalances that may impact execution&lt;/li&gt;
&lt;li&gt;Low-throughput or blocked task groups&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Impact
&lt;/h2&gt;

&lt;p&gt;This approach reduced investigation time from:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Manual analysis that could take several minutes or longer&lt;br&gt;&lt;br&gt;
to&lt;br&gt;&lt;br&gt;
automated analysis that usually takes less than one minute&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It can take a bit longer when the agent needs to inspect more than 20 different tasks, but it is still significantly faster than the manual process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;p&gt;I created a Java project using &lt;strong&gt;Spring AI&lt;/strong&gt; and the following modules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mockoon&lt;/strong&gt; for generating mock API data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;java-project&lt;/strong&gt; for the MCP tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;agent-ui&lt;/strong&gt; for the user interface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;agent&lt;/strong&gt; for the orchestration and reasoning layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This agent was originally built in &lt;strong&gt;Python&lt;/strong&gt;, but I wanted to build a small project with Spring AI to learn more about the framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;A lot of operational analysis follows a repeatable pattern.&lt;/p&gt;

&lt;p&gt;That makes it a good fit for an AI agent that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gather data from multiple sources&lt;/li&gt;
&lt;li&gt;Correlate information across APIs and configuration&lt;/li&gt;
&lt;li&gt;Suggest a likely root cause&lt;/li&gt;
&lt;li&gt;Produce a report that engineers can use immediately&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real value comes from combining &lt;strong&gt;tooling&lt;/strong&gt; and &lt;strong&gt;reasoning&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The next step is to rebuild the project using an &lt;strong&gt;agent + agent.md design pattern&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The goal is to make the solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More modular&lt;/li&gt;
&lt;li&gt;Easier to maintain&lt;/li&gt;
&lt;li&gt;Easier to evolve as new investigation paths are added&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This project started from a very practical problem: incident investigations were taking too much time.&lt;/p&gt;

&lt;p&gt;By turning the investigation flow into a set of tools plus an AI reasoning layer, we were able to automate a large part of the process and dramatically reduce the time to get useful answers.&lt;/p&gt;

&lt;p&gt;Instead of manually checking dashboards, configs, and APIs, we now have an agent that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read alert context&lt;/li&gt;
&lt;li&gt;Investigate the system&lt;/li&gt;
&lt;li&gt;Summarize the findings&lt;/li&gt;
&lt;li&gt;Suggest likely root causes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All in under a minute.&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚠️ Disclosure
&lt;/h2&gt;

&lt;p&gt;At my company, the solution was originally designed in &lt;strong&gt;Python&lt;/strong&gt;, as it was the primary language used for working with MCP (Model Context Protocol) and AI agents at the time.&lt;/p&gt;

&lt;p&gt;However, I decided to build this demo in &lt;strong&gt;Java&lt;/strong&gt; to gain hands-on experience with &lt;strong&gt;Spring AI&lt;/strong&gt; while applying it to a real-world problem. You can find the demo code &lt;a href="https://github.com/tomasmaiorino/export-troubleshooting" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;All values and data shown in this article were generated using &lt;strong&gt;Mockoon&lt;/strong&gt; and do not represent real production data. They are simplified examples meant to reflect the kind of information handled by the actual application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feedback
&lt;/h2&gt;

&lt;p&gt;I would love to hear your thoughts.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Would you trust an AI agent to help debug production issues?&lt;/li&gt;
&lt;li&gt;How would you design this differently?&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>java</category>
      <category>springai</category>
      <category>troubleshooting</category>
    </item>
  </channel>
</rss>
