<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Roopa Venkatesh</title>
    <description>The latest articles on Forem by Roopa Venkatesh (@roops).</description>
    <link>https://forem.com/roops</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1227036%2F867e97e9-cf6b-4098-9239-8a4728c085f3.jpeg</url>
      <title>Forem: Roopa Venkatesh</title>
      <link>https://forem.com/roops</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/roops"/>
    <language>en</language>
    <item>
      <title>Topology-Aware AI Agents for Observability: Automating SLO Breach Root Cause Analysis</title>
      <dc:creator>Roopa Venkatesh</dc:creator>
      <pubDate>Fri, 06 Mar 2026 06:25:06 +0000</pubDate>
      <link>https://forem.com/roops/topology-aware-ai-agents-for-observability-automating-slo-breach-root-cause-analysis-60i</link>
      <guid>https://forem.com/roops/topology-aware-ai-agents-for-observability-automating-slo-breach-root-cause-analysis-60i</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrv656yu78ce48mfzncy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrv656yu78ce48mfzncy.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Topology-Aware AI Agents for Observability: Automating SLO Breach Root Cause Analysis&lt;/h1&gt;

&lt;p&gt;Modern cloud systems are complex distributed architectures where a single user journey may depend on dozens of services running across multiple infrastructure layers.&lt;/p&gt;

&lt;p&gt;When a &lt;strong&gt;Service Level Objective (SLO)&lt;/strong&gt; breach occurs, identifying the root cause often requires navigating logs, metrics, traces, service dependencies, and infrastructure relationships.&lt;/p&gt;

&lt;p&gt;In many organizations, this investigation is still &lt;strong&gt;manual and time-consuming&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In a recent project, I explored how &lt;strong&gt;AI agents can automate incident investigation&lt;/strong&gt; by combining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Observability data&lt;/li&gt;
&lt;li&gt;Service topology&lt;/li&gt;
&lt;li&gt;Kubernetes infrastructure context&lt;/li&gt;
&lt;li&gt;Historical incident knowledge&lt;/li&gt;
&lt;li&gt;Graph-based reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach reduced investigation time from &lt;strong&gt;20–30 minutes to under a minute&lt;/strong&gt; for certain SLO breaches.&lt;/p&gt;

&lt;p&gt;This article introduces the concept of &lt;strong&gt;Topology-Aware AI Agents&lt;/strong&gt; and how such a system can be implemented using AWS services and graph-based system modeling.&lt;/p&gt;




&lt;h1&gt;The Problem: Traditional Incident Investigation&lt;/h1&gt;

&lt;p&gt;When an SLO breach occurs, SRE teams typically perform the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identify the impacted user journey&lt;/li&gt;
&lt;li&gt;Check monitoring dashboards&lt;/li&gt;
&lt;li&gt;Inspect logs and traces&lt;/li&gt;
&lt;li&gt;Identify impacted services&lt;/li&gt;
&lt;li&gt;Traverse upstream and downstream dependencies&lt;/li&gt;
&lt;li&gt;Correlate incidents with infrastructure problems&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In large microservice environments, this investigation becomes difficult because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logs lack &lt;strong&gt;system-wide context&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Metrics show &lt;strong&gt;symptoms but not relationships&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Service dependencies are hard to traverse quickly&lt;/li&gt;
&lt;li&gt;Infrastructure and application layers are often disconnected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even with powerful observability tools, &lt;strong&gt;humans still perform most correlation tasks manually&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;Why Logs Alone Are Not Enough for AI&lt;/h1&gt;

&lt;p&gt;Many AI troubleshooting systems rely on &lt;strong&gt;RAG (Retrieval Augmented Generation)&lt;/strong&gt; using logs or documentation.&lt;/p&gt;

&lt;p&gt;However, logs alone do not provide &lt;strong&gt;system relationships&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Example log entry: &lt;code&gt;Payment API latency spike&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Without topology context, an AI system cannot determine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which upstream service triggered the issue&lt;/li&gt;
&lt;li&gt;Which downstream dependency failed&lt;/li&gt;
&lt;li&gt;Whether the issue originated from infrastructure or application layers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To solve this, we need &lt;strong&gt;structural knowledge about the system architecture&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;Introducing Topology-Aware AI Agents&lt;/h1&gt;

&lt;p&gt;A &lt;strong&gt;Topology-Aware AI Agent&lt;/strong&gt; combines three major sources of context:&lt;/p&gt;

&lt;p&gt;Observability Data&lt;br&gt;
+&lt;br&gt;
Service Topology&lt;br&gt;
+&lt;br&gt;
Historical Incident Knowledge&lt;/p&gt;

&lt;p&gt;The agent uses this combined knowledge to automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify impacted services&lt;/li&gt;
&lt;li&gt;Traverse dependency graphs&lt;/li&gt;
&lt;li&gt;Correlate incidents&lt;/li&gt;
&lt;li&gt;Suggest root causes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This transforms incident troubleshooting from &lt;strong&gt;log searching&lt;/strong&gt; into &lt;strong&gt;graph-based reasoning&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;Platform Context: Microservices Running on Amazon EKS&lt;/h1&gt;

&lt;p&gt;In this environment, the application platform was built using &lt;strong&gt;Kubernetes&lt;/strong&gt; running on &lt;strong&gt;Amazon Elastic Kubernetes Service (EKS)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Each user request travels across multiple layers:&lt;/p&gt;

&lt;p&gt;User Request&lt;br&gt;
↓&lt;br&gt;
API Gateway / Entry Service&lt;br&gt;
↓&lt;br&gt;
Microservices running on Kubernetes&lt;br&gt;
↓&lt;br&gt;
Databases / external dependencies&lt;/p&gt;

&lt;p&gt;Each microservice runs inside containers deployed on Kubernetes pods.&lt;/p&gt;

&lt;p&gt;To enable automated incident analysis, the system needed visibility into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud infrastructure&lt;/li&gt;
&lt;li&gt;Kubernetes resources&lt;/li&gt;
&lt;li&gt;Application services&lt;/li&gt;
&lt;li&gt;Runtime service interactions&lt;/li&gt;
&lt;li&gt;Observability signals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These relationships were modeled as a &lt;strong&gt;graph database&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;Building the Service Relationship Graph&lt;/h1&gt;

&lt;p&gt;The system used &lt;strong&gt;Neo4j&lt;/strong&gt; to build a &lt;strong&gt;knowledge graph representing the full platform topology&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The graph captured relationships across multiple layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud infrastructure&lt;/li&gt;
&lt;li&gt;Kubernetes platform&lt;/li&gt;
&lt;li&gt;Application services&lt;/li&gt;
&lt;li&gt;Service interactions&lt;/li&gt;
&lt;li&gt;Historical incidents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This structure allowed the AI agent to reason about &lt;strong&gt;how failures propagate across the system&lt;/strong&gt;.&lt;/p&gt;
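&lt;p&gt;To make the idea concrete, here is a minimal in-memory sketch of such a layered graph, using illustrative node and relationship names (the real system stored this in Neo4j):&lt;/p&gt;

```python
# Minimal in-memory sketch of the layered topology graph.
# Node names and relationships are illustrative, not the production Neo4j model.
TOPOLOGY = [
    # Cloud infrastructure layer
    ("AWS Account", "DEPLOYS", "EKS Cluster"),
    ("EKS Cluster", "RUNS_ON", "EC2 Worker Node"),
    # Kubernetes platform layer
    ("EKS Cluster", "CONTAINS", "Namespace"),
    ("Namespace", "CONTAINS", "Pod"),
    ("Pod", "RUNS", "Container"),
    # Application layer
    ("Checkout Service", "RUNS_AS", "Process Group"),
    ("Process Group", "HOSTED_ON", "Pod"),
    # Service interactions
    ("Checkout Service", "CALLS", "Payment Service"),
    ("Payment Service", "CALLS", "Payment Database"),
]

def neighbours(node, relationship):
    """Return all targets reachable from `node` via `relationship`."""
    return [dst for src, rel, dst in TOPOLOGY if src == node and rel == relationship]
```

&lt;p&gt;A graph database adds indexed traversal, Cypher queries, and persistence on top of this basic structure.&lt;/p&gt;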




&lt;h1&gt;Modeling the Infrastructure Layer&lt;/h1&gt;

&lt;p&gt;The first layer of the graph represented the &lt;strong&gt;cloud infrastructure&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Example nodes:&lt;/p&gt;

&lt;p&gt;Cloud Provider&lt;br&gt;
AWS Account&lt;br&gt;
Region&lt;br&gt;
Availability Zone&lt;br&gt;
Host (EC2)&lt;/p&gt;

&lt;p&gt;Example relationships:&lt;br&gt;
AWS Account&lt;br&gt;
│&lt;br&gt;
DEPLOYS&lt;br&gt;
▼&lt;br&gt;
EKS Cluster&lt;br&gt;
│&lt;br&gt;
RUNS_ON&lt;br&gt;
▼&lt;br&gt;
EC2 Worker Node&lt;/p&gt;

&lt;p&gt;This enables the system to correlate incidents with infrastructure-level problems such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;node failures&lt;/li&gt;
&lt;li&gt;CPU saturation&lt;/li&gt;
&lt;li&gt;network issues&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;Modeling the Kubernetes Platform&lt;/h1&gt;

&lt;p&gt;The next layer represents Kubernetes resources running on the EKS cluster.&lt;/p&gt;

&lt;p&gt;Example nodes:&lt;/p&gt;

&lt;p&gt;EKS Cluster&lt;br&gt;
Namespace&lt;br&gt;
Pod&lt;br&gt;
Container&lt;br&gt;
Process Group&lt;/p&gt;

&lt;p&gt;Example relationships:&lt;/p&gt;

&lt;p&gt;EKS Cluster&lt;br&gt;
│&lt;br&gt;
CONTAINS&lt;br&gt;
▼&lt;br&gt;
Namespace&lt;br&gt;
│&lt;br&gt;
CONTAINS&lt;br&gt;
▼&lt;br&gt;
Pod&lt;br&gt;
│&lt;br&gt;
RUNS&lt;br&gt;
▼&lt;br&gt;
Container&lt;/p&gt;

&lt;p&gt;Each container instance is mapped to a &lt;strong&gt;process group representing a running microservice instance&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This structure allows the graph to capture &lt;strong&gt;runtime relationships between services and infrastructure nodes&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;Modeling Application Services&lt;/h1&gt;

&lt;p&gt;At the application level, the graph represents each microservice as a service node.&lt;/p&gt;

&lt;p&gt;Example nodes:&lt;/p&gt;

&lt;p&gt;Service&lt;br&gt;
API&lt;br&gt;
Database&lt;br&gt;
External Dependency&lt;/p&gt;

&lt;p&gt;Services are connected to the runtime processes executing them.&lt;/p&gt;

&lt;p&gt;Example relationship:&lt;/p&gt;

&lt;p&gt;Checkout Service&lt;br&gt;
│&lt;br&gt;
RUNS_AS&lt;br&gt;
▼&lt;br&gt;
Process Group&lt;br&gt;
│&lt;br&gt;
HOSTED_ON&lt;br&gt;
▼&lt;br&gt;
Kubernetes Pod&lt;/p&gt;

&lt;p&gt;This mapping enables the system to trace incidents from &lt;strong&gt;application failures down to infrastructure components&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;Modeling Caller–Callee Relationships&lt;/h1&gt;

&lt;p&gt;One of the most critical aspects of the topology graph is capturing &lt;strong&gt;service interaction flows&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Microservices communicate through APIs, forming &lt;strong&gt;caller–callee relationships&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Checkout Service&lt;br&gt;
│&lt;br&gt;
CALLS&lt;br&gt;
▼&lt;br&gt;
Payment Service&lt;br&gt;
│&lt;br&gt;
CALLS&lt;br&gt;
▼&lt;br&gt;
Payment Database&lt;/p&gt;

&lt;p&gt;These relationships represent the &lt;strong&gt;actual runtime service communication paths&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;By modeling these relationships, the AI agent can identify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;downstream dependencies&lt;/li&gt;
&lt;li&gt;cascading failures&lt;/li&gt;
&lt;li&gt;shared services impacting multiple user journeys&lt;/li&gt;
&lt;/ul&gt;
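&lt;p&gt;Conceptually, finding downstream dependencies is a breadth-first traversal over CALLS edges. A self-contained sketch with hypothetical services:&lt;/p&gt;

```python
from collections import deque

# Illustrative caller-callee edges; in the real system these come from the topology graph.
CALLS = {
    "Checkout Service": ["Payment Service", "Inventory Service"],
    "Payment Service": ["Payment Database"],
    "Inventory Service": [],
    "Payment Database": [],
}

def downstream(service):
    """Collect every transitive downstream dependency of `service`."""
    seen = set()
    queue = deque([service])
    while queue:
        for callee in CALLS.get(queue.popleft(), []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen
```

&lt;p&gt;Running the same traversal in the reverse direction yields the upstream callers, which is how shared services impacting multiple user journeys are found.&lt;/p&gt;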




&lt;h1&gt;Linking Observability Data to the Graph&lt;/h1&gt;

&lt;p&gt;Observability signals such as logs and errors are attached to graph nodes.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Payment Service&lt;br&gt;
│&lt;br&gt;
HAS_ERROR&lt;br&gt;
▼&lt;br&gt;
Timeout Exception&lt;/p&gt;

&lt;p&gt;Infrastructure events can also be attached:&lt;/p&gt;

&lt;p&gt;EC2 Worker Node&lt;br&gt;
│&lt;br&gt;
HAS_EVENT&lt;br&gt;
▼&lt;br&gt;
CPU Spike&lt;/p&gt;

&lt;p&gt;This allows the agent to correlate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;infrastructure issues&lt;/li&gt;
&lt;li&gt;application errors&lt;/li&gt;
&lt;li&gt;service dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;within a &lt;strong&gt;single reasoning model&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;Learning from Historical Incidents&lt;/h1&gt;

&lt;p&gt;Each investigated incident is also stored in the graph.&lt;/p&gt;

&lt;p&gt;Example structure:&lt;/p&gt;

&lt;p&gt;Incident&lt;br&gt;
├ impacted service&lt;br&gt;
├ root cause&lt;br&gt;
├ infrastructure correlation&lt;br&gt;
└ resolution&lt;/p&gt;

&lt;p&gt;Over time, this builds a &lt;strong&gt;knowledge graph of operational incidents&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The AI agent can then detect patterns such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;recurring failures&lt;/li&gt;
&lt;li&gt;common dependency issues&lt;/li&gt;
&lt;li&gt;infrastructure patterns impacting multiple services&lt;/li&gt;
&lt;/ul&gt;
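&lt;p&gt;Even a simple aggregation over stored incidents surfaces such patterns. A sketch with hypothetical incident records:&lt;/p&gt;

```python
from collections import Counter

# Hypothetical incident records, as they might be read back from the graph.
incidents = [
    {"service": "Payment Service", "root_cause": "connection pool exhaustion"},
    {"service": "Payment Service", "root_cause": "connection pool exhaustion"},
    {"service": "Checkout Service", "root_cause": "node CPU saturation"},
]

# Surface recurring root causes so the agent can rank likely hypotheses first.
recurring = Counter(i["root_cause"] for i in incidents).most_common()
```
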




&lt;h1&gt;Architecture Overview&lt;/h1&gt;

&lt;p&gt;A simplified architecture for this approach looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SLO Breach Alert
        │
        ▼
Event Trigger (Monitoring / EventBridge)
        │
        ▼
Incident AI Agent
        │
        ├── Service Topology Graph (Neo4j)
        ├── Observability Data (Logs / Traces)
        └── Historical Incident Knowledge
        │
        ▼
     LLM Reasoning
        │
        ▼
 Root Cause Hypothesis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;AWS services that can support this architecture include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon EKS&lt;/li&gt;
&lt;li&gt;AWS Lambda&lt;/li&gt;
&lt;li&gt;Amazon EventBridge&lt;/li&gt;
&lt;li&gt;Amazon Bedrock&lt;/li&gt;
&lt;li&gt;Amazon OpenSearch&lt;/li&gt;
&lt;li&gt;Amazon Neptune (as a managed graph alternative)&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;Agent Workflow&lt;/h1&gt;

&lt;p&gt;When a new SLO breach occurs, the AI agent performs the following steps.&lt;/p&gt;

&lt;h3&gt;Step 1 — Detect SLO Breach&lt;/h3&gt;

&lt;p&gt;Monitoring tools trigger an alert event.&lt;/p&gt;

&lt;h3&gt;Step 2 — Identify Impacted Services&lt;/h3&gt;

&lt;p&gt;The agent queries the service topology graph.&lt;/p&gt;

&lt;h3&gt;Step 3 — Traverse Dependencies&lt;/h3&gt;

&lt;p&gt;The graph traversal identifies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;upstream services&lt;/li&gt;
&lt;li&gt;downstream dependencies&lt;/li&gt;
&lt;li&gt;infrastructure nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Step 4 — Retrieve Observability Signals&lt;/h3&gt;

&lt;p&gt;Logs and errors are retrieved from observability platforms.&lt;/p&gt;

&lt;h3&gt;Step 5 — LLM Reasoning&lt;/h3&gt;

&lt;p&gt;Structured context is sent to the LLM.&lt;/p&gt;

&lt;p&gt;Example prompt:&lt;/p&gt;

&lt;p&gt;SLO breach detected in Checkout Service&lt;/p&gt;

&lt;p&gt;Impacted services:&lt;br&gt;
Checkout Service&lt;br&gt;
Payment Service&lt;br&gt;
Payment Database&lt;/p&gt;

&lt;p&gt;Recent errors:&lt;br&gt;
Timeout errors in Payment Service&lt;/p&gt;

&lt;p&gt;Historical incident:&lt;br&gt;
Database connection pool exhaustion&lt;/p&gt;

&lt;p&gt;The LLM then generates a &lt;strong&gt;root cause hypothesis&lt;/strong&gt;.&lt;/p&gt;
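&lt;p&gt;The structured context can be assembled into the prompt programmatically; a minimal sketch (the exact prompt layout is a design choice, not a fixed format):&lt;/p&gt;

```python
def build_prompt(breached, impacted, errors, history):
    """Flatten graph and observability context into an LLM prompt string."""
    lines = [f"SLO breach detected in {breached}", "", "Impacted services:"]
    lines += impacted
    lines += ["", "Recent errors:"] + errors
    lines += ["", "Historical incident:"] + history
    return "\n".join(lines)

prompt = build_prompt(
    "Checkout Service",
    ["Checkout Service", "Payment Service", "Payment Database"],
    ["Timeout errors in Payment Service"],
    ["Database connection pool exhaustion"],
)
```
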




&lt;h1&gt;Results from the Prototype&lt;/h1&gt;

&lt;p&gt;In the prototype implementation:&lt;/p&gt;

&lt;p&gt;Manual investigation time:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;20–30 minutes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AI-assisted investigation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Under 1 minute&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For a specific &lt;strong&gt;platinum user journey SLO&lt;/strong&gt;, the agent achieved:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;~52% correlation accuracy&lt;/strong&gt; between SLO breaches and underlying service problems.&lt;/p&gt;

&lt;p&gt;While not perfect, it significantly accelerates incident triage.&lt;/p&gt;




&lt;h1&gt;Why Graph-Based Observability Matters&lt;/h1&gt;

&lt;p&gt;Traditional observability focuses on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;metrics&lt;/li&gt;
&lt;li&gt;logs&lt;/li&gt;
&lt;li&gt;traces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, modern systems also require &lt;strong&gt;relationship awareness&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Graph-based models enable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;dependency reasoning&lt;/li&gt;
&lt;li&gt;cross-service correlation&lt;/li&gt;
&lt;li&gt;historical incident learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Combining &lt;strong&gt;graph knowledge with LLM reasoning&lt;/strong&gt; enables a new class of systems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI-assisted incident response agents.&lt;/strong&gt;&lt;/p&gt;




&lt;h1&gt;Future Directions&lt;/h1&gt;

&lt;p&gt;This concept can evolve further with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;autonomous remediation agents&lt;/li&gt;
&lt;li&gt;continuous incident learning&lt;/li&gt;
&lt;li&gt;multi-agent observability systems&lt;/li&gt;
&lt;li&gt;integration with CI/CD pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As distributed architectures continue to grow in complexity, &lt;strong&gt;topology-aware AI agents may become an essential part of SRE operations&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;Final Thoughts&lt;/h1&gt;

&lt;p&gt;AI-powered incident investigation is still in its early stages.&lt;/p&gt;

&lt;p&gt;However, combining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;observability data&lt;/li&gt;
&lt;li&gt;service topology graphs&lt;/li&gt;
&lt;li&gt;Kubernetes infrastructure knowledge&lt;/li&gt;
&lt;li&gt;historical incident intelligence&lt;/li&gt;
&lt;li&gt;LLM reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;creates a powerful approach to &lt;strong&gt;automated root cause analysis&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Topology-aware AI agents represent a promising direction for improving &lt;strong&gt;SRE productivity and incident response time in modern cloud-native systems&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;If you're exploring AI for SRE, observability, or incident automation, I would love to hear your thoughts or experiences.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>From Chatbot to Cloud CFO: Building an Autonomous FinOps Agent with Amazon Bedrock</title>
      <dc:creator>Roopa Venkatesh</dc:creator>
      <pubDate>Wed, 14 Jan 2026 02:30:46 +0000</pubDate>
      <link>https://forem.com/roops/from-chatbot-to-cloud-cfo-building-an-autonomous-finops-agent-with-amazon-bedrock-2epc</link>
      <guid>https://forem.com/roops/from-chatbot-to-cloud-cfo-building-an-autonomous-finops-agent-with-amazon-bedrock-2epc</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e2duujddg4k9f0w4f7c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e2duujddg4k9f0w4f7c.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;TL;DR&lt;/h2&gt;

&lt;p&gt;Learn how to build an AI agent that autonomously optimises your AWS costs by analysing CloudWatch metrics, identifying underutilised resources, and making intelligent decisions—all while maintaining production safety through AWS X-Ray observability and human-in-the-loop approval workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll learn:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to build a Bedrock Agent with custom action groups&lt;/li&gt;
&lt;li&gt;Implementing X-Ray tracing to audit AI decision-making&lt;/li&gt;
&lt;li&gt;Production safety patterns for autonomous infrastructure agents&lt;/li&gt;
&lt;li&gt;Human-in-the-loop approval workflows for high-risk actions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS account with Bedrock access (ap-southeast-2 recommended for Australia)&lt;/li&gt;
&lt;li&gt;Python 3.9+ and AWS CLI configured&lt;/li&gt;
&lt;li&gt;Basic understanding of Lambda, EC2, and CloudWatch&lt;/li&gt;
&lt;li&gt;Estimated cost: ~$20/month for development use&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;The Problem&lt;/h2&gt;

&lt;p&gt;We have all been there. You spin up a p3.2xlarge instance for a quick Friday afternoon experiment. You go to happy hour, the weekend hits, and you forget about it. Two weeks later, the AWS bill arrives, and panic sets in.&lt;/p&gt;

&lt;p&gt;For years, we solved this with "dumb" scripts—cron jobs that shut down everything tagged &lt;code&gt;dev&lt;/code&gt; at 7 PM. But scripts lack context. They kill long-running training jobs just as often as they save money.&lt;/p&gt;

&lt;p&gt;We don't need a script. We need an &lt;strong&gt;Agent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this post, I’ll walk through how to build an &lt;strong&gt;Autonomous FinOps Agent&lt;/strong&gt; using &lt;strong&gt;Amazon Bedrock&lt;/strong&gt; and &lt;strong&gt;Python&lt;/strong&gt;. More importantly, I will show you how to use &lt;strong&gt;AWS X-Ray&lt;/strong&gt; to "audit the brain" of the agent, ensuring it never deletes production resources by mistake.&lt;/p&gt;

&lt;h2&gt;The Difference: Automation vs. Agentic AI&lt;/h2&gt;

&lt;p&gt;Why use an Agent instead of a Lambda function triggered by a CloudWatch Alarm?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automation (The Script):&lt;/strong&gt; "If CPU &amp;lt; 5% for 1 hour, terminate instance."&lt;br&gt;
Risk: It kills a critical database waiting for connections.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agentic AI (The CFO):&lt;/strong&gt; "I see this instance has low CPU. Let me check the tags. It belongs to the 'Data Science' team. Let me check the git logs on the attached volume. It seems inactive. I will Slack the owner, and if they don't reply in 24 hours, I will snapshot and terminate it."&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Agent adds &lt;strong&gt;reasoning&lt;/strong&gt; to the automation.&lt;/p&gt;

&lt;h2&gt;The Architecture&lt;/h2&gt;

&lt;p&gt;We will use the Amazon Bedrock Agents framework, which simplifies the orchestration of tools and provides built-in reasoning capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Component Overview:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Brain:&lt;/strong&gt; Amazon Bedrock (Model: Claude 3.5 Sonnet) - Handles reasoning and decision-making&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Hands:&lt;/strong&gt; Python Lambda function (Action Group) - Executes AWS API calls via Boto3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Eyes:&lt;/strong&gt; AWS X-Ray + CloudWatch Logs - Traces every decision and API call&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Safety Net:&lt;/strong&gt; SNS notifications for human approval on destructive actions
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────┐
│                     User Query                           │
│        "Find under-utilised resources in ap-southeast-2" │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ▼
         ┌───────────────────────┐
         │  Amazon Bedrock Agent │
         │  (Claude 3.5 Sonnet)  │──── X-Ray Tracing
         └───────────┬───────────┘
                     │
                     │ Invokes Action Group
                     ▼
         ┌───────────────────────┐
         │   Lambda Function     │
         │  (Action Router)      │──── CloudWatch Logs
         └───────────┬───────────┘
                     │
         ┌───────────┴───────────┐
         │                       │
         ▼                       ▼
    ┌─────────┐           ┌─────────┐
    │ EC2 API │           │ CW API  │
    └─────────┘           └─────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;Step 1: Building the Action Group Lambda&lt;/h2&gt;

&lt;p&gt;In Bedrock, an "Action Group" is an OpenAPI schema that maps natural language intents to Lambda functions. The Lambda acts as a &lt;strong&gt;central router&lt;/strong&gt; that executes different tools based on what the agent decides.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Critical:&lt;/strong&gt; We use &lt;code&gt;aws_xray_sdk&lt;/code&gt; to patch all Boto3 calls. In Agentic AI, observability isn't optional—it's the only way to debug hallucinations and verify the agent's reasoning.&lt;/p&gt;
&lt;/blockquote&gt;
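&lt;p&gt;For reference, the two tools used by the router below could be declared roughly like this. This is a hedged sketch of a function-details definition; verify the exact &lt;code&gt;functionSchema&lt;/code&gt; shape against the current Bedrock Agents documentation before deploying:&lt;/p&gt;

```python
# Sketch of a function-details definition for the action group.
# Field names follow the Bedrock Agents functionSchema format as I understand it;
# check the current API reference before relying on this shape.
function_schema = {
    "functions": [
        {
            "name": "analyse_underutilised_resources",
            "description": "Scan CloudWatch CPU metrics for idle EC2 instances.",
            "parameters": {
                "region": {
                    "type": "string",
                    "description": "AWS region to scan",
                    "required": False,
                }
            },
        },
        {
            "name": "stop_resource",
            "description": "Stop a non-production EC2 instance by ID.",
            "parameters": {
                "resource_id": {
                    "type": "string",
                    "description": "EC2 instance ID",
                    "required": True,
                }
            },
        },
    ]
}
```
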

&lt;h3&gt;The Lambda Handler (Action Router)&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_xray_sdk.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;xray_recorder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;patch_all&lt;/span&gt;

&lt;span class="c1"&gt;# Automatically trace all AWS API calls
&lt;/span&gt;&lt;span class="nf"&gt;patch_all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;ec2_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ec2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Central router for Bedrock Agent action groups.
    Receives function name + parameters, executes the tool, returns structured response.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;function_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])}&lt;/span&gt;

    &lt;span class="c1"&gt;# Start X-Ray subsegment to track this specific tool execution
&lt;/span&gt;    &lt;span class="n"&gt;subsegment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xray_recorder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;begin_subsegment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ToolExecution:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;function_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;subsegment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_annotation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;function_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;analyse_underutilised_resources&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;response_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;check_cpu_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;function_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stop_resource&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# CRITICAL SAFETY CHECK: Never stop production instances
&lt;/span&gt;            &lt;span class="n"&gt;resource_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;resource_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_production&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;response_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DENIED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Production resources require manual approval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; Agent attempted to stop PROD: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;response_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stop_ec2_instance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tool execution failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;subsegment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;response_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Internal tool error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;xray_recorder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end_subsegment&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Return in Bedrock's expected format
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;messageVersion&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1.0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;actionGroup&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;actionGroup&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;function_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;functionResponse&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;responseBody&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TEXT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_body&lt;/span&gt;&lt;span class="p"&gt;)}}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Design Patterns:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Subsegments for granular tracing&lt;/strong&gt; - Each tool execution gets its own X-Ray subsegment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production safety guard&lt;/strong&gt; - Tag-based checks prevent accidental destruction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured responses&lt;/strong&gt; - JSON format allows the agent to reason about results&lt;/li&gt;
&lt;/ol&gt;
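&lt;p&gt;The handler above calls &lt;code&gt;is_production()&lt;/code&gt; without showing it. Here is a minimal sketch of that guard, assuming the (hypothetical) convention that environments are recorded in an &lt;code&gt;Environment&lt;/code&gt; tag on the instance; adapt the tag key and values to your own tagging standard:&lt;/p&gt;

```python
def env_from_tags(tags):
    """Classify an Environment tag value; unknown values count as production."""
    env = str(tags.get("Environment", "")).lower()
    if env in ("prod", "production"):
        return "production"
    if env in ("dev", "development", "test", "staging", "sandbox"):
        return "non-production"
    return "production"  # fail closed on missing or unrecognised tags

def is_production(resource_id, region="ap-southeast-2"):
    """Look up the instance's tags and classify them (fails closed on error)."""
    try:
        import boto3  # imported here so env_from_tags stays dependency-free
        ec2 = boto3.client("ec2", region_name=region)
        response = ec2.describe_tags(
            Filters=[{"Name": "resource-id", "Values": [resource_id]}]
        )
        tags = {t["Key"]: t["Value"] for t in response["Tags"]}
        return env_from_tags(tags) == "production"
    except Exception:
        return True  # cannot classify, so refuse to treat it as safe
```

&lt;p&gt;Failing closed matters here: an instance the agent cannot classify is treated as production, so a tagging gap can never turn into an accidental stop.&lt;/p&gt;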




&lt;h2&gt;
  
  
  Step 2: The OpenAPI Schema (Connecting Agent to Tools)
&lt;/h2&gt;

&lt;p&gt;Before the agent can call your Lambda, you need to define an OpenAPI schema that describes the available tools. This is what Bedrock uses to understand &lt;strong&gt;when&lt;/strong&gt; and &lt;strong&gt;how&lt;/strong&gt; to invoke your functions. (Bedrock action groups alternatively accept a simpler &lt;em&gt;function details&lt;/em&gt; definition; the Lambda response shown earlier uses that function-style format, so whichever convention you pick, keep the schema and the response format consistent.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;openapi&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3.0.0&lt;/span&gt;
&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FinOps Agent Tools&lt;/span&gt;
  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1.0.0&lt;/span&gt;
&lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;/analyse&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;post&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Analyse underutilised EC2 resources&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Scans a region for instances with low CPU utilisation over the past 7 days&lt;/span&gt;
      &lt;span class="na"&gt;operationId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;analyse_underutilised_resources&lt;/span&gt;
      &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;region&lt;/span&gt;
          &lt;span class="na"&gt;in&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;query&lt;/span&gt;
          &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
          &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
            &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ap-southeast-2&lt;/span&gt;
      &lt;span class="na"&gt;responses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;200'&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Analysis results with list of underutilised instances&lt;/span&gt;

  &lt;span class="na"&gt;/stop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;post&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Stop an EC2 instance&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Stops a non-production EC2 instance to save costs&lt;/span&gt;
      &lt;span class="na"&gt;operationId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stop_resource&lt;/span&gt;
      &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;resource_id&lt;/span&gt;
          &lt;span class="na"&gt;in&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;query&lt;/span&gt;
          &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;The EC2 instance ID to stop&lt;/span&gt;
      &lt;span class="na"&gt;responses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;200'&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Instance stop status&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User asks: &lt;em&gt;"Find idle instances in ap-southeast-2"&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Bedrock maps this to &lt;code&gt;analyse_underutilised_resources&lt;/code&gt; with &lt;code&gt;region=ap-southeast-2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Lambda receives the function name + parameters&lt;/li&gt;
&lt;li&gt;Executes &lt;code&gt;check_cpu_metrics('ap-southeast-2')&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Returns structured JSON back to Bedrock&lt;/li&gt;
&lt;li&gt;Agent reasons about the results and responds to the user&lt;/li&gt;
&lt;/ol&gt;
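&lt;p&gt;Concretely, Bedrock delivers tool parameters to the Lambda as a list of name/type/value objects, so most handlers begin by flattening that list into a dict. A sketch of that first step (the event below is illustrative, trimmed to the fields the handler reads):&lt;/p&gt;

```python
def extract_parameters(event):
    """Flatten Bedrock's parameter list ([{name, type, value}, ...]) into a dict."""
    return {p["name"]: p["value"] for p in event.get("parameters", [])}

# Illustrative event, trimmed to the fields the handler reads:
event = {
    "actionGroup": "FinOpsTools",
    "function": "stop_resource",
    "parameters": [
        {"name": "resource_id", "type": "string", "value": "i-0123456789"}
    ],
}

params = extract_parameters(event)
# params["resource_id"] now holds "i-0123456789", matching the
# parameters['resource_id'] lookup used in the handler above.
```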




&lt;h2&gt;
  
  
  Step 3: Implementing the Cost Analysis Tool
&lt;/h2&gt;

&lt;p&gt;Now let's implement the &lt;code&gt;check_cpu_metrics()&lt;/code&gt; function that does the actual CloudWatch analysis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_cpu_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Analyses EC2 instances for low CPU utilisation.
    Returns actionable insights for cost optimisation.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;cw_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cloudwatch&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ec2_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ec2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;underutilised&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="c1"&gt;# Get all running instances
&lt;/span&gt;    &lt;span class="n"&gt;instances&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ec2_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe_instances&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;Filters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;instance-state-name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Values&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;running&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Analyse last 7 days of metrics
&lt;/span&gt;    &lt;span class="n"&gt;end_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;reservation&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;instances&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Reservations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reservation&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Instances&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;instance_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;InstanceId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

            &lt;span class="c1"&gt;# Query CloudWatch for CPU metrics
&lt;/span&gt;            &lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cw_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_metric_statistics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;Namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AWS/EC2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;MetricName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CPUUtilisation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;Dimensions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;InstanceId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
                &lt;span class="n"&gt;StartTime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;EndTime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;Period&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;86400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Daily aggregation
&lt;/span&gt;                &lt;span class="n"&gt;Statistics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Average&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Datapoints&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="n"&gt;avg_cpu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Average&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;dp&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Datapoints&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Datapoints&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

                &lt;span class="c1"&gt;# Flag instances below 10% CPU
&lt;/span&gt;                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;avg_cpu&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tags&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])}&lt;/span&gt;
                    &lt;span class="n"&gt;underutilised&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;instance_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;instance_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;InstanceType&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avg_cpu_percent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_cpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;environment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Environment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unknown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;recommendation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Consider downsizing or stopping&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
                    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;underutilised_instances&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;underutilised&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;potential_monthly_savings&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;underutilised&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this approach works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;7-day analysis window&lt;/strong&gt; catches weekend/holiday idle time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily aggregation&lt;/strong&gt; reduces API costs while maintaining accuracy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tag extraction&lt;/strong&gt; gives the agent context about ownership and environment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured output&lt;/strong&gt; allows the agent to present findings naturally&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro Tip:&lt;/strong&gt; For production, add network I/O metrics and disk usage to avoid flagging batch-processing instances that are I/O-bound but CPU-light.&lt;/p&gt;
&lt;/blockquote&gt;
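&lt;p&gt;A sketch of that extra check, reusing the &lt;code&gt;get_metric_statistics&lt;/code&gt; pattern from above. The &lt;code&gt;NetworkIn&lt;/code&gt; metric is a standard &lt;code&gt;AWS/EC2&lt;/code&gt; metric, but the ~5 MB/day idle threshold is an arbitrary illustration to tune per workload:&lt;/p&gt;

```python
CPU_IDLE_PERCENT = 10.0          # same threshold as the CPU check above
NETWORK_IDLE_BYTES = 5_000_000   # ~5 MB/day; illustrative, tune per workload

def is_truly_idle(avg_cpu, avg_network_bytes):
    """Flag an instance only when BOTH CPU and network traffic are low,
    so I/O-bound batch workers are not marked as waste."""
    if avg_cpu >= CPU_IDLE_PERCENT:
        return False
    if avg_network_bytes >= NETWORK_IDLE_BYTES:
        return False
    return True

def avg_network_in(cw_client, instance_id, start_time, end_time):
    """Average daily NetworkIn bytes over the window (0.0 when no datapoints)."""
    metrics = cw_client.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="NetworkIn",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start_time,
        EndTime=end_time,
        Period=86400,            # daily aggregation, as above
        Statistics=["Average"],
    )
    points = metrics["Datapoints"]
    if not points:
        return 0.0
    return sum(dp["Average"] for dp in points) / len(points)
```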




&lt;h2&gt;
  
  
  Step 4: X-Ray Observability - Auditing the Agent's Brain
&lt;/h2&gt;

&lt;p&gt;When you deploy a chatbot, users give feedback via thumbs up/down. When you deploy an infrastructure agent, the "feedback" might be &lt;strong&gt;your production database going offline&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We need &lt;strong&gt;deep observability&lt;/strong&gt; to audit every decision.&lt;/p&gt;

&lt;h3&gt;
  
  
  What X-Ray Gives You
&lt;/h3&gt;

&lt;p&gt;By using &lt;code&gt;patch_all()&lt;/code&gt; and custom subsegments, we generate a full trace of the agent's decision-making process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
    ↓
Bedrock Agent (Reasoning: "Low CPU detected, checking tags...")
    ↓
Lambda Action Group (Executing: analyse_underutilised_resources)
    ↓
CloudWatch API (Fetching metrics)
    ↓
EC2 API (Reading instance tags)
    ↓
Response back to agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
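&lt;p&gt;The trace above falls out of instrumentation along these lines: a helper that wraps each tool call in a subsegment with searchable annotations, mirroring the handler's &lt;code&gt;try&lt;/code&gt;/&lt;code&gt;finally&lt;/code&gt;. The helper name is ours, not part of the &lt;code&gt;aws-xray-sdk&lt;/code&gt;; the recorder calls (&lt;code&gt;begin_subsegment&lt;/code&gt;, &lt;code&gt;put_annotation&lt;/code&gt;, &lt;code&gt;end_subsegment&lt;/code&gt;) are the SDK's:&lt;/p&gt;

```python
from contextlib import contextmanager

@contextmanager
def traced_tool(recorder, tool_name, annotations):
    """Open an X-Ray subsegment around a tool call, attach indexed
    annotations, and always close the subsegment, even on error."""
    subsegment = recorder.begin_subsegment(f"ToolExecution:{tool_name}")
    try:
        for key, value in annotations.items():
            subsegment.put_annotation(key, value)
        yield subsegment
    finally:
        recorder.end_subsegment()

# Usage inside the handler (xray_recorder from aws_xray_sdk.core):
#
# with traced_tool(xray_recorder, "stop_resource",
#                  {"Agent": "FinOpsAgent-v1", "resource_id": resource_id}):
#     response_body = stop_ec2_instance(resource_id)
```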



&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;If the agent fails to stop an instance, CloudWatch Logs alone won't tell you &lt;strong&gt;why&lt;/strong&gt;. You need to know:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Did the tool fail?&lt;/strong&gt; (e.g., Boto3 &lt;code&gt;AccessDenied&lt;/code&gt; error)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Did the agent fail to reason correctly?&lt;/strong&gt; (e.g., The agent concluded "CPU at 10% is actually high load for this workload" and didn't call the stop function)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;X-Ray lets you overlay the &lt;strong&gt;Reasoning Trace&lt;/strong&gt; (Bedrock) with the &lt;strong&gt;Execution Trace&lt;/strong&gt; (Lambda/AWS APIs).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example X-Ray Insight:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"subsegment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ToolExecution:stop_resource"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"annotation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Agent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FinOpsAgent-v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"resource_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"i-0123456789"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"decision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DENIED - Production tag detected"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"duration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;245&lt;/span&gt;&lt;span class="err"&gt;ms&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells you the agent &lt;strong&gt;correctly refused&lt;/strong&gt; to stop a production instance—critical for compliance audits.&lt;/p&gt;
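&lt;p&gt;Because X-Ray indexes annotations, an auditor can pull every trace for a given agent (and scan its decisions for &lt;code&gt;DENIED&lt;/code&gt; entries) with a filter expression. A sketch using boto3's X-Ray client; the annotation key matches the &lt;code&gt;Agent&lt;/code&gt; annotation set above:&lt;/p&gt;

```python
import datetime

def build_audit_query(agent_name, hours=24):
    """Build the time window and X-Ray filter expression for an audit pull."""
    end = datetime.datetime.utcnow()
    start = end - datetime.timedelta(hours=hours)
    expression = f'annotation.Agent = "{agent_name}"'
    return start, end, expression

def fetch_agent_traces(agent_name, region="ap-southeast-2", hours=24):
    """List trace summaries for one agent, e.g. to review denied actions."""
    import boto3  # imported here so build_audit_query stays testable offline
    start, end, expression = build_audit_query(agent_name, hours)
    xray = boto3.client("xray", region_name=region)
    response = xray.get_trace_summaries(
        StartTime=start, EndTime=end, FilterExpression=expression
    )
    return response["TraceSummaries"]
```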




&lt;h2&gt;
  
  
  Step 5: The "Human-in-the-Loop" Safety Net
&lt;/h2&gt;

&lt;p&gt;The biggest fear with Agentic AI is the &lt;strong&gt;"Runaway Robot"&lt;/strong&gt; scenario—what if the agent misinterprets data and terminates a critical database?&lt;/p&gt;

&lt;p&gt;The solution: &lt;strong&gt;Don't give the agent destructive permissions&lt;/strong&gt;. Instead, use an approval workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation: Request Approval Tool
&lt;/h3&gt;

&lt;p&gt;Add a third function to your Lambda that sends termination requests to humans:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="n"&gt;sns_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sns&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;SNS_TOPIC_ARN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;arn:aws:sns:ap-southeast-2:123456789:FinOpsApprovals&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;request_termination_approval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;justification&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Requests human approval before terminating an instance.
    Publishes to SNS topic monitored by DevOps team.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Get instance details for context
&lt;/span&gt;    &lt;span class="n"&gt;ec2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ec2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;instance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe_instances&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;InstanceIds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Reservations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Instances&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tags&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])}&lt;/span&gt;
    &lt;span class="n"&gt;instance_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;InstanceType&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Calculate estimated savings
&lt;/span&gt;    &lt;span class="c1"&gt;# Rough estimates: t3.medium=$30/mo, t3.large=$60/mo, etc.
&lt;/span&gt;    &lt;span class="n"&gt;hourly_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_instance_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;instance_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;monthly_savings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hourly_cost&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;730&lt;/span&gt;  &lt;span class="c1"&gt;# hours/month
&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
 FinOps Agent Termination Request

Instance: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Type: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;instance_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Name: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Environment: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Environment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unknown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

AI Analysis:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;justification&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Estimated Monthly Savings: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;monthly_savings&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Actions:
✅ Approve: Reply with &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;APPROVE &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
❌ Deny: Reply with &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DENY &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
⏸️ Snooze 7 days: Reply with &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SNOOZE &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;

View X-Ray Trace: https://console.aws.amazon.com/xray/...
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;sns_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;TopicArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SNS_TOPIC_ARN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Subject&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; FinOps Agent: Approval needed for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;approval_requested&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;instance_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Notification sent to DevOps team&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Update the OpenAPI Schema
&lt;/h3&gt;

&lt;p&gt;Add this to your schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;/request-termination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;post&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Request approval to terminate an instance&lt;/span&gt;
      &lt;span class="na"&gt;operationId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;request_termination_approval&lt;/span&gt;
      &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;instance_id&lt;/span&gt;
          &lt;span class="na"&gt;in&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;query&lt;/span&gt;
          &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;justification&lt;/span&gt;
          &lt;span class="na"&gt;in&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;query&lt;/span&gt;
          &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AI-generated explanation for why termination is recommended&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Agent's New Workflow
&lt;/h3&gt;

&lt;p&gt;Now when the agent finds a zombie instance, it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Analyses&lt;/strong&gt; the metrics (CPU, network, disk)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasons&lt;/strong&gt; about the context (tags, uptime, cost)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requests approval&lt;/strong&gt; instead of acting immediately&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Includes the X-Ray trace link&lt;/strong&gt; so humans can audit the decision&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Waits&lt;/strong&gt; for human confirmation before taking action&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This gives you the &lt;strong&gt;speed of AI analysis&lt;/strong&gt; with the &lt;strong&gt;safety of human judgment&lt;/strong&gt;.&lt;/p&gt;
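The reply-handling side of the loop is not shown above; here is a minimal sketch of parsing the APPROVE/DENY/SNOOZE replies, assuming they arrive as raw text from whatever inbox or queue is subscribed to the SNS topic (the `parse_approval_reply` helper is hypothetical, not part of the agent's code):

```python
import re

# Replies look like "APPROVE i-0abc123", "DENY i-0def456" or "SNOOZE i-0ghi789",
# matching the instructions embedded in the SNS message body.
REPLY_PATTERN = re.compile(r"^(APPROVE|DENY|SNOOZE)\s+(i-[0-9a-z]+)\s*$", re.IGNORECASE)

def parse_approval_reply(reply_text):
    """Parse a human reply into (action, instance_id); return None if malformed."""
    match = REPLY_PATTERN.match(reply_text.strip())
    if not match:
        return None
    return match.group(1).upper(), match.group(2)
```

A dispatcher would then call `ec2.stop_instances` only on an APPROVE, record a DENY, or re-queue the finding after seven days on a SNOOZE.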




&lt;h2&gt;
  
  
  Deployment &amp;amp; Testing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Create the Lambda Function
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Package dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;aws-xray-sdk boto3 &lt;span class="nt"&gt;-t&lt;/span&gt; ./package
&lt;span class="nb"&gt;cd &lt;/span&gt;package &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; zip &lt;span class="nt"&gt;-r&lt;/span&gt; ../lambda.zip &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd&lt;/span&gt; ..
zip &lt;span class="nt"&gt;-g&lt;/span&gt; lambda.zip lambda_function.py

&lt;span class="c"&gt;# Deploy to AWS&lt;/span&gt;
aws lambda create-function &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--function-name&lt;/span&gt; FinOpsAgentTools &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--runtime&lt;/span&gt; python3.11 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt; arn:aws:iam::YOUR_ACCOUNT:role/FinOpsLambdaRole &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--handler&lt;/span&gt; lambda_function.lambda_handler &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--zip-file&lt;/span&gt; fileb://lambda.zip &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--timeout&lt;/span&gt; 60 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory-size&lt;/span&gt; 256 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tracing-config&lt;/span&gt; &lt;span class="nv"&gt;Mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Active
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Create IAM Policy for Lambda
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"EC2ReadOnly"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"ec2:DescribeInstances"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"ec2:DescribeTags"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"EC2StopNonProd"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ec2:StopInstances"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:ec2:*:*:instance/*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StringNotEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"ec2:ResourceTag/Environment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Production"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CloudWatchMetrics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cloudwatch:GetMetricStatistics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SNSPublish"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sns:Publish"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:sns:*:*:FinOpsApprovals"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"XRayTracing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"xray:PutTraceSegments"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"xray:PutTelemetryRecords"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Security Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EC2 stop is tag-restricted&lt;/strong&gt; - Cannot stop instances with &lt;code&gt;Environment=Production&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read-only for metrics&lt;/strong&gt; - No write access to CloudWatch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SNS scope limited&lt;/strong&gt; - Can only publish to approval topic&lt;/li&gt;
&lt;/ul&gt;
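Defence in depth suggests mirroring the IAM tag condition inside the tool code as well; a sketch of such a guard (the function is illustrative, not from the Lambda above):

```python
def is_protected(tags):
    """Mirror the IAM condition: never auto-stop Environment=Production instances.

    IAM already denies ec2:StopInstances on these instances, but checking in the
    tool lets the agent return a clear explanation instead of an AccessDenied error.
    """
    return tags.get("Environment") == "Production"
```

With this check in place, the agent can respond with the explanation shown in Test Query 2 below rather than surfacing a raw permission error.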

&lt;h3&gt;
  
  
  Step 3: Create the Bedrock Agent
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create the agent&lt;/span&gt;
aws bedrock-agent create-agent &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--agent-name&lt;/span&gt; FinOpsAgent &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--foundation-model&lt;/span&gt; anthropic.claude-3-5-sonnet-20241022-v2:0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--instruction&lt;/span&gt; &lt;span class="s2"&gt;"You are a FinOps specialist. Analyse AWS costs and recommend optimisations. Always request approval before suggesting destructive actions."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--idle-session-ttl&lt;/span&gt; 600

&lt;span class="c"&gt;# Create action group with OpenAPI schema&lt;/span&gt;
aws bedrock-agent create-agent-action-group &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--agent-id&lt;/span&gt; YOUR_AGENT_ID &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--action-group-name&lt;/span&gt; FinOpsTools &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--action-group-executor&lt;/span&gt; &lt;span class="nv"&gt;lambda&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;arn:aws:lambda:ap-southeast-2:ACCOUNT:function:FinOpsAgentTools &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--api-schema&lt;/span&gt; file://openapi-schema.yaml

&lt;span class="c"&gt;# Prepare and create alias&lt;/span&gt;
aws bedrock-agent prepare-agent &lt;span class="nt"&gt;--agent-id&lt;/span&gt; YOUR_AGENT_ID
aws bedrock-agent create-agent-alias &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--agent-id&lt;/span&gt; YOUR_AGENT_ID &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--agent-alias-name&lt;/span&gt; prod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Test the Agent
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Test Query 1: Analyse Resources&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "Find underutilised instances in ap-southeast-2 that are costing us money"

Agent Response:
"I found 3 underutilised instances in ap-southeast-2:

1. i-0abc123 (t3.large) - Avg CPU: 3.2% - Est. savings: $60/mo
2. i-0def456 (m5.xlarge) - Avg CPU: 7.8% - Est. savings: $140/mo
3. i-0ghi789 (t3.medium) - Avg CPU: 5.1% - Est. savings: $30/mo

Total potential monthly savings: $230

Would you like me to request approval to stop these instances?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Test Query 2: Safety Check&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "Stop instance i-prod-db-001"

Agent Response:
"I cannot stop i-prod-db-001 because it has the Environment=Production tag.
Production resources require manual approval through the change management process."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Monitor with X-Ray
&lt;/h3&gt;

&lt;p&gt;Navigate to AWS X-Ray console and filter by:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;annotation.Agent = "FinOpsAgent-v1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see traces showing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which tools the agent invoked&lt;/li&gt;
&lt;li&gt;How long each API call took&lt;/li&gt;
&lt;li&gt;Whether any errors occurred&lt;/li&gt;
&lt;li&gt;The full decision path from user query → response&lt;/li&gt;
&lt;/ul&gt;
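The same console filter can be run programmatically; a sketch using boto3's X-Ray `get_trace_summaries` paginator (the 24-hour window and agent name are illustrative):

```python
from datetime import datetime, timedelta, timezone

def build_agent_filter(agent_name):
    """Build the X-Ray filter expression used in the console above."""
    return f'annotation.Agent = "{agent_name}"'

def recent_agent_traces(agent_name, hours=24):
    """Fetch trace summaries for the agent over the last `hours` hours."""
    import boto3  # imported lazily so the pure helper above has no dependency

    xray = boto3.client("xray")
    end = datetime.now(timezone.utc)
    summaries = []
    for page in xray.get_paginator("get_trace_summaries").paginate(
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        FilterExpression=build_agent_filter(agent_name),
    ):
        summaries.extend(page["TraceSummaries"])
    return summaries
```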




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building Agentic AI on AWS is about more than just Prompt Engineering—it's about &lt;strong&gt;Reliability Engineering&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;By treating the agent's "thoughts" as loggable events (X-Ray traces) and wrapping its "hands" (tools) in strict safety checks (tag validation, human approvals), we can build a FinOps assistant that is not only smart but &lt;strong&gt;trustworthy&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Agents &amp;gt; Scripts&lt;/strong&gt; - Reasoning capabilities allow context-aware decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability is mandatory&lt;/strong&gt; - X-Ray traces give you an auditable record of every AI decision&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety through architecture&lt;/strong&gt; - Use IAM policies and human-in-the-loop for destructive actions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structure matters&lt;/strong&gt; - Well-designed tool outputs enable better agent reasoning&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The era of static Cron jobs is ending. The era of the &lt;strong&gt;Cloud CFO Agent&lt;/strong&gt; has begun.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cost Estimate
&lt;/h2&gt;

&lt;p&gt;Running this setup for development/testing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bedrock Agent (Claude 3.5 Sonnet)&lt;/td&gt;
&lt;td&gt;~$5 (100 queries)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda invocations&lt;/td&gt;
&lt;td&gt;~$0.20 (1000 invocations)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch API calls&lt;/td&gt;
&lt;td&gt;~$0.10 (detailed monitoring)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;X-Ray tracing&lt;/td&gt;
&lt;td&gt;~$2 (10,000 traces)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SNS notifications&lt;/td&gt;
&lt;td&gt;~$0.50 (100 emails)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$8/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For production use with 10,000 queries/month: &lt;strong&gt;~$50-75/month&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Potential savings identified: &lt;strong&gt;Hundreds to thousands per month&lt;/strong&gt; depending on environment size.&lt;/p&gt;




&lt;h2&gt;
  
  
  Next Steps &amp;amp; Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Enhancements to Consider
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multi-region support&lt;/strong&gt; - Extend analysis to all AWS regions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Additional metrics&lt;/strong&gt; - Network I/O, disk usage, memory utilisation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Historical trending&lt;/strong&gt; - Track savings over time in DynamoDB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slack integration&lt;/strong&gt; - Send reports to team channels instead of email&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-remediation&lt;/strong&gt; - After 30 days of approvals, enable auto-stop for specific tags&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/agents.html" rel="noopener noreferrer"&gt;Amazon Bedrock Agents Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/xray/latest/devguide/" rel="noopener noreferrer"&gt;AWS X-Ray Developer Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.finops.org/" rel="noopener noreferrer"&gt;FinOps Foundation Best Practices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/claude" rel="noopener noreferrer"&gt;Anthropic Claude 3.5 Model Card&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Migrating Traditional SFTP Servers to AWS SFTP Transfer Family: A Secure and Serverless Approach</title>
      <dc:creator>Roopa Venkatesh</dc:creator>
      <pubDate>Mon, 16 Dec 2024 07:12:56 +0000</pubDate>
      <link>https://forem.com/roops/migrating-traditional-sftp-servers-to-aws-sftp-transfer-family-a-secure-and-serverless-approach-2jp8</link>
      <guid>https://forem.com/roops/migrating-traditional-sftp-servers-to-aws-sftp-transfer-family-a-secure-and-serverless-approach-2jp8</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Managing traditional SFTP servers on-premises often comes with its share of challenges. Organizations struggle with maintaining the infrastructure, ensuring high availability, scaling storage, and securing user access. These systems require regular patching, upgrades, and constant monitoring to prevent downtime or security breaches. For businesses handling increasing file transfer demands, these limitations can result in operational inefficiencies and spiraling costs.&lt;/p&gt;

&lt;p&gt;Thankfully, AWS SFTP Transfer Family offers a modern solution to these issues. With its serverless and fully managed setup, you can eliminate the overhead of managing hardware while leveraging the scalability and cost-effectiveness of the AWS Cloud. This blog post will guide you through migrating your traditional SFTP server to AWS SFTP Transfer Family, focusing on a secure, scalable, and highly available architecture.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Migrate to AWS SFTP Transfer Family?
&lt;/h2&gt;

&lt;p&gt;AWS SFTP Transfer Family provides a robust, serverless alternative to traditional SFTP servers. Here are some key benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ease of Management: AWS handles the underlying infrastructure, reducing operational burden.&lt;/li&gt;
&lt;li&gt;High Availability: Native multi-AZ support ensures uninterrupted service.&lt;/li&gt;
&lt;li&gt;Scalability: Seamless integration with Amazon S3 allows for virtually unlimited storage capacity.&lt;/li&gt;
&lt;li&gt;Security: Built-in key-based authentication and IAM for granular control over user access.&lt;/li&gt;
&lt;li&gt;Cost-Effectiveness: Pay only for the resources you use, with no upfront investment in hardware.&lt;/li&gt;
&lt;li&gt;Customizable Architecture: AWS lets you tailor the SFTP architecture to specific customer requirements, using network components such as VPCs, security groups, and NLBs to achieve even the most stringent security configurations.&lt;/li&gt;
&lt;li&gt;Seamless Integration for Microservices: Because the Transfer Family stores files directly in Amazon S3, they are immediately available to microservices built on AWS.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Proposed Architecture
&lt;/h2&gt;

&lt;p&gt;For this migration, I propose a secure and serverless architecture with the following components, based on an on-premises-to-AWS migration scenario from one of our Financial Services customers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Internal Setup with a Firewall: Use a firewall (e.g., Fortinet) in front of your SFTP server to secure and inspect incoming traffic.&lt;br&gt;
Route traffic through a Network Load Balancer (NLB) configured for high availability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SFTP Server in a VPC: Deploy the SFTP server in a Virtual Private Cloud (VPC) for network isolation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;High Availability with 2 AZs: Configure the SFTP server to operate across two Availability Zones (AZs) for resilience.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Service-Managed Users: Use service-managed user accounts with key-based authentication for secure access.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;S3 Backend Storage: Store files in Amazon S3 for scalability and durability, and leverage lifecycle policies to optimize storage costs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fine-Grained Access Control with IAM: Use IAM roles to enforce folder-level permissions for secure and organized access.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Implementation Steps
&lt;/h2&gt;

&lt;p&gt;The following are the step-by-step implementation details.&lt;/p&gt;
&lt;h4&gt;
  
  
  Set Up AWS SFTP Transfer Family
&lt;/h4&gt;

&lt;p&gt;a. Navigate to the AWS Transfer Family service in the AWS Management Console.&lt;br&gt;
b. Create a new SFTP server and configure it as VPC-hosted with internal access, spanning two (or three) Availability Zones.&lt;br&gt;
c. Attach a security group that permits traffic only from the firewall.&lt;br&gt;
d. Configure a server host key by adding a previously generated private SSH key; the corresponding host key is presented to users when they connect to the SFTP server.&lt;/p&gt;
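&lt;p&gt;Steps a to d above can be sketched with boto3. The helper below only assembles the arguments (its name and the VPC, subnet, and security group IDs are illustrative placeholders); the final create_server call is left commented out since it requires AWS credentials:&lt;/p&gt;

```python
def build_sftp_server_config(vpc_id, subnet_ids, security_group_ids, host_key_pem=None):
    """Assemble kwargs for transfer.create_server: a VPC-hosted,
    internal SFTP endpoint with service-managed users."""
    config = {
        "Protocols": ["SFTP"],
        "IdentityProviderType": "SERVICE_MANAGED",
        "EndpointType": "VPC",
        "EndpointDetails": {
            "VpcId": vpc_id,
            "SubnetIds": subnet_ids,                 # two (or three) AZs for HA
            "SecurityGroupIds": security_group_ids,  # allow traffic only from the firewall
        },
    }
    if host_key_pem:
        config["HostKey"] = host_key_pem  # pre-generated private SSH host key (step d)
    return config

config = build_sftp_server_config(
    "vpc-0abc", ["subnet-az1", "subnet-az2"], ["sg-firewall-only"],
)
# boto3.client("transfer").create_server(**config)
```

&lt;p&gt;Supplying your own pre-generated host key keeps the server fingerprint stable for clients migrating from an existing SFTP server.&lt;/p&gt;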
&lt;h4&gt;
  
  
  Configure Network Load Balancer (NLB)
&lt;/h4&gt;

&lt;p&gt;a. Deploy an NLB in front of the SFTP server.&lt;br&gt;
b. Configure the NLB to route traffic to the SFTP endpoint on port 22.&lt;br&gt;
c. Set up health checks for continuous monitoring.&lt;/p&gt;
&lt;h4&gt;
  
  
  Integrate the Firewall
&lt;/h4&gt;

&lt;p&gt;a. Use a Fortinet (or similar) firewall to control, inspect, and monitor incoming requests.&lt;br&gt;
b. Allow only specific IP ranges or VPN traffic through the firewall to the NLB.&lt;br&gt;
c. Whitelist SFTP partner/customer IP addresses so that only the required inbound connections are accepted.&lt;/p&gt;
&lt;h4&gt;
  
  
  Set Up Service-Managed Users
&lt;/h4&gt;

&lt;p&gt;a. As a prerequisite, ask each SFTP partner/customer to provide an SSH public key for secure access, or reuse the public keys from your on-premises or current SFTP setup.&lt;br&gt;
b. Define user accounts in AWS Transfer Family and assign each user a unique SSH public key.&lt;br&gt;
c. Map each user to an S3 bucket or folder for isolated file access.&lt;br&gt;
d. Set up a home directory for each user, pointing to the bucket, an individual username folder, or whatever directory structure you want to use. Check the 'Restricted' checkbox so the SFTP user cannot access anything outside this folder; when &lt;em&gt;Restricted&lt;/em&gt; is checked, the user cannot see the S3 bucket or folder name.&lt;/p&gt;
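&lt;p&gt;As a rough boto3 sketch of these steps (usernames, bucket, and ARNs are placeholders): the console's 'Restricted' checkbox corresponds to a logical home directory that maps the user's root to their own folder:&lt;/p&gt;

```python
def build_sftp_user(user_name, bucket, role_arn, ssh_public_key):
    """Kwargs for transfer.create_user: a service-managed user locked
    ('Restricted') into their own folder via a logical home directory."""
    return {
        "UserName": user_name,
        "Role": role_arn,                    # IAM role granting S3 access
        "SshPublicKeyBody": ssh_public_key,  # public key supplied by the partner
        "HomeDirectoryType": "LOGICAL",      # what the console's 'Restricted' box sets
        "HomeDirectoryMappings": [
            # The user sees "/" and never the bucket or folder name.
            {"Entry": "/", "Target": f"/{bucket}/{user_name}"}
        ],
    }

user = build_sftp_user(
    "partner-a", "your-bucket",
    "arn:aws:iam::111122223333:role/sftp-access", "ssh-rsa AAAA...",
)
# boto3.client("transfer").create_user(ServerId="s-1234567890abcdef0", **user)
```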

&lt;p&gt;AWS SFTP Server summary page for reference:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy7f8iu6ol1bd5tobqbu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy7f8iu6ol1bd5tobqbu.png" alt="Image description" width="800" height="604"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Connect S3 as Backend Storage
&lt;/h4&gt;

&lt;p&gt;a. Attach Amazon S3 as the backend storage for the SFTP server.&lt;br&gt;
b. Configure lifecycle policies to transition data to lower-cost storage classes (e.g., S3 Glacier).&lt;br&gt;
c. Optionally, configure S3 replication or a copy/backup job to another S3 bucket so that your applications can process incoming files independently of the SFTP bucket.&lt;/p&gt;
&lt;h4&gt;
  
  
  Configure IAM Roles for Fine-Grained Access Control
&lt;/h4&gt;

&lt;p&gt;a. Define IAM roles to control access to specific S3 folders with read/write/delete permissions.&lt;br&gt;
b. You can use different IAM roles for different users if you want to provide additional access like 's3:DeleteObject' for certain folders within the home directory.&lt;/p&gt;

&lt;p&gt;Example policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::your-bucket",
      "Condition": {
        "StringLike": {
          "s3:prefix": ["user-folder/*"]
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::your-bucket/user-folder/*"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trust policy for the IAM role should be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "transfer.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {}
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
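&lt;p&gt;If many users share the same IAM role, a per-user session policy is an alternative worth knowing about: AWS Transfer Family substitutes the ${transfer:UserName} variable at session time, scoping each user to their own prefix. A minimal sketch (the bucket name is a placeholder):&lt;/p&gt;

```python
import json

def scoped_session_policy(bucket):
    """A session policy that lets one shared IAM role serve all users:
    ${transfer:UserName} scopes each session to the user's own folder."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": ["${transfer:UserName}/*"]}},
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/${{transfer:UserName}}/*",
            },
        ],
    })

policy = scoped_session_policy("your-bucket")
# Pass as Policy=policy in transfer.create_user(...) so each user
# can only reach their own prefix, even though the role is shared.
```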



&lt;h4&gt;
  
  
  Test and Validate
&lt;/h4&gt;

&lt;p&gt;a. Enable CloudWatch Logs for logging and troubleshooting of SFTP connections.&lt;br&gt;
b. Test user connections through the firewall and NLB. Ask your external SFTP users to connect with any SFTP client, such as FileZilla or WinSCP. Note that the AWS SFTP server does not allow SSH shell connections; you must connect over SFTP (for example, with the sftp command).&lt;br&gt;
c. Validate user access permissions to S3 folders.&lt;br&gt;
d. Simulate failover scenarios to confirm high availability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migration Plan
&lt;/h2&gt;

&lt;p&gt;Use this simple migration plan when moving from a traditional/on-premises SFTP server to AWS SFTP Transfer Family.&lt;/p&gt;

&lt;h4&gt;
  
  
  Inform SFTP Users About Changes
&lt;/h4&gt;

&lt;p&gt;a. Notify all existing SFTP users about the migration to the new AWS SFTP setup.&lt;br&gt;
b. Share details on timelines, new connection endpoints, and any required actions from their side.&lt;/p&gt;

&lt;h4&gt;
  
  
  Transition to Key-Based Authentication
&lt;/h4&gt;

&lt;p&gt;a. Convert all users from password-based authentication to SSH key-based authentication; AWS SFTP Transfer Family's service-managed users support only key-based logins, which are also more secure than passwords.&lt;br&gt;
b. Assist users in generating and uploading their SSH public keys.&lt;/p&gt;

&lt;h4&gt;
  
  
  Onboard and Migrate Users
&lt;/h4&gt;

&lt;p&gt;a. Create service-managed user accounts in AWS Transfer Family.&lt;br&gt;
b. Migrate users' home directories and set up their specific access permissions in Amazon S3.&lt;/p&gt;

&lt;h4&gt;
  
  
  Set Up and Validate Access
&lt;/h4&gt;

&lt;p&gt;a. Validate that all users can access their respective directories and files as expected.&lt;br&gt;
b. Conduct testing to ensure smooth operations and troubleshoot any access issues.&lt;/p&gt;

&lt;h4&gt;
  
  
  Go Live
&lt;/h4&gt;

&lt;p&gt;a. Update DNS or endpoint configurations to point to the new endpoint; in this case, a public IP on the Fortinet firewall (set up a DNS record for this IP).&lt;br&gt;
b. Officially transition all users to the new setup and monitor for any post-migration issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Optimization Tips
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Enable S3 lifecycle policies to automatically move infrequently accessed data to lower-cost storage classes.&lt;/li&gt;
&lt;li&gt;Monitor usage with AWS Cost Explorer and set up budgets for cost control.&lt;/li&gt;
&lt;li&gt;Use Savings Plans for additional savings.&lt;/li&gt;
&lt;/ol&gt;
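&lt;p&gt;Tip 1 can be sketched with boto3; the rule below (with illustrative day counts and prefix) transitions aging SFTP files to Glacier and later expires them:&lt;/p&gt;

```python
def build_lifecycle_rule(prefix, glacier_after_days=90, expire_after_days=365):
    """One S3 lifecycle rule: move aging SFTP files to Glacier, then
    expire them (day counts are examples; tune to your retention needs)."""
    return {
        "ID": f"sftp-archive-{prefix.strip('/')}",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [{"Days": glacier_after_days, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": expire_after_days},
    }

rule = build_lifecycle_rule("partner-a/")
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="your-bucket", LifecycleConfiguration={"Rules": [rule]})
```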

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Migrating from traditional SFTP servers to AWS SFTP Transfer Family offers significant advantages in terms of scalability, security, and cost efficiency. By leveraging the architecture and steps outlined in this guide, you can seamlessly transition to a serverless, fully managed solution that simplifies operations and improves reliability. Embrace AWS SFTP Transfer Family and future-proof your file transfer needs today.&lt;/p&gt;

</description>
      <category>awssftptransferfamily</category>
      <category>migratesftptoaws</category>
      <category>finegrainedsftpsetuponaws</category>
      <category>awssftpfortinetfirewall</category>
    </item>
    <item>
      <title>Presenting at DataEngBytes 2024 Sydney: Building a Transactional Data Lakehouse on AWS with Apache Iceberg</title>
      <dc:creator>Roopa Venkatesh</dc:creator>
      <pubDate>Sat, 09 Nov 2024 01:52:54 +0000</pubDate>
      <link>https://forem.com/roops/presenting-at-dataengbytes-2024-sydney-building-a-transactional-data-lakehouse-on-aws-with-apache-iceberg-1f7a</link>
      <guid>https://forem.com/roops/presenting-at-dataengbytes-2024-sydney-building-a-transactional-data-lakehouse-on-aws-with-apache-iceberg-1f7a</guid>
      <description>&lt;p&gt;I had the pleasure of presenting at DataEngBytes 2024 in Sydney, where I discussed an exciting topic that’s transforming the data management landscape: Building a Transactional Data Lakehouse on AWS with Apache Iceberg. &lt;br&gt;
This blog post captures the key content and insights shared during the session for those who couldn’t attend and as a record of my talk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why a Data Lakehouse?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As organisations scale and diversify their data sources, they increasingly seek the flexibility of a data lake combined with the transactional reliability of a data warehouse. The data lakehouse architecture bridges this gap by delivering a unified platform that supports both analytical and transactional workloads, making it ideal for managing structured, semi-structured, and unstructured data at scale.&lt;/p&gt;

&lt;p&gt;During my talk, I explained that a data lakehouse:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensures ACID compliance for data consistency and reliability.&lt;/li&gt;
&lt;li&gt;Supports time travel to query historical data.&lt;/li&gt;
&lt;li&gt;Provides real-time insights by processing batch and streaming data seamlessly.&lt;/li&gt;
&lt;li&gt;Reduces storage costs by leveraging data lakes for large volumes of data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Challenges in Traditional Data Lakes and Warehouses&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I highlighted the challenges organisations often face with traditional data lakes, such as the lack of transaction support, complex schema management, and inconsistent data views. At the same time, data warehouses, though highly consistent, can be expensive and struggle with scalability when handling semi-structured and unstructured data.&lt;/p&gt;

&lt;p&gt;To solve these challenges, I introduced the concept of a data lakehouse built with Apache Iceberg on AWS, combining the benefits of both lakes and warehouses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Apache Iceberg?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apache Iceberg is an open table format that makes it possible to manage large-scale, transactional data in data lake environments. Here’s why it’s ideal for a lakehouse:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ACID Transactions: Iceberg supports ACID compliance, allowing for consistent data updates, deletes, and inserts.&lt;/li&gt;
&lt;li&gt;Schema Evolution: It gracefully handles schema changes, a common requirement in dynamic data environments.&lt;/li&gt;
&lt;li&gt;Partitioning and Performance: Automatic partitioning optimises query performance, making it efficient even for large datasets.&lt;/li&gt;
&lt;li&gt;Time Travel: Iceberg’s time travel functionality enables querying historical data versions, making it invaluable for auditing, troubleshooting, and compliance. It's like Git for code: every change is versioned, and you can revert to any commit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These features make Iceberg a strong foundation for building a transactional lakehouse that balances flexibility and consistency.&lt;/p&gt;
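&lt;p&gt;As a small illustration of time travel, Athena (engine version 3) can query an Iceberg table as of a point in time with FOR TIMESTAMP AS OF. A sketch with boto3, where the database, table, and results bucket are hypothetical:&lt;/p&gt;

```python
def time_travel_query(table, as_of_iso):
    """Build an Athena SQL statement that reads an Iceberg table
    as of a given point in time (Athena engine v3 syntax)."""
    return (
        f"SELECT * FROM {table} "
        f"FOR TIMESTAMP AS OF TIMESTAMP '{as_of_iso}'"
    )

sql = time_travel_query("lakehouse.orders", "2024-11-01 00:00:00 UTC")
# boto3.client("athena").start_query_execution(
#     QueryString=sql,
#     ResultConfiguration={"OutputLocation": "s3://query-results-bucket/"})
```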

&lt;p&gt;&lt;strong&gt;How Iceberg Integrates with AWS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the session's focal points was explaining how Apache Iceberg works within the AWS ecosystem. Here’s a quick recap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Storage in Amazon S3: Iceberg tables are stored in Amazon S3, benefiting from scalable and cost-effective object storage.&lt;/li&gt;
&lt;li&gt;Data Processing with AWS Glue: AWS Glue allows serverless ETL processing of data into Iceberg tables, making it possible to handle batch and real-time updates.&lt;/li&gt;
&lt;li&gt;Querying with Amazon Athena: Athena supports SQL queries on Iceberg tables directly from S3, making it easy to query and analyse data without dedicated infrastructure.&lt;/li&gt;
&lt;li&gt;Governance with AWS Lake Formation: Lake Formation provides fine-grained access control, ensuring data security and governance within the lakehouse.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, these services create a robust lakehouse environment on AWS, leveraging Iceberg for consistency and scalability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Case: Financial Data Lakehouse&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To illustrate how a transactional data lakehouse works in practice, I shared a use case in the financial services industry. Financial institutions need real-time data consistency, compliance, and performance for analytics and regulatory reporting. In this scenario, a data lakehouse with Iceberg allows for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time analytics with consistent, ACID-compliant data.&lt;/li&gt;
&lt;li&gt;Historical data access through time travel for auditing and compliance.&lt;/li&gt;
&lt;li&gt;Cost efficiency by storing data in S3 and using Athena for on-demand queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This use case highlighted the lakehouse’s potential to streamline data management in industries requiring both performance and data governance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In my session, I walked through an architectural diagram illustrating how to build a lakehouse on AWS with Iceberg:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ingestion Layer: Data is ingested from multiple sources into S3 using AWS Glue or Kinesis.&lt;/li&gt;
&lt;li&gt;Storage Layer: Iceberg tables reside in Amazon S3, with metadata management to handle partitions, schema evolution, and versioning.&lt;/li&gt;
&lt;li&gt;Processing Layer: Glue ETL jobs process and transform data, supporting both batch and streaming.&lt;/li&gt;
&lt;li&gt;Query Layer: Athena enables SQL-based querying of Iceberg tables for flexible analytics.&lt;/li&gt;
&lt;li&gt;Governance Layer: AWS Lake Formation secures and governs access to sensitive data within the lakehouse.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture demonstrates a scalable, cost-effective approach to building a transactional lakehouse that supports data consistency and flexibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lessons Learned&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From working with Iceberg on AWS, I shared a few key lessons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Partitioning Strategy: Efficient partitioning is essential for Iceberg to deliver high performance. Planning for your data distribution patterns is crucial.&lt;/li&gt;
&lt;li&gt;Schema Evolution: Although Iceberg handles schema changes well, backward compatibility is vital to avoid breaking data pipelines.&lt;/li&gt;
&lt;li&gt;Cost Management: Data lakehouses on S3 are cost-effective, but monitoring Glue jobs and optimising Athena queries help keep costs in check.&lt;/li&gt;
&lt;li&gt;Data Governance: Fine-grained access control with Lake Formation ensures data security, which is particularly important for multi-user environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best Practices for Building a Data Lakehouse with Iceberg&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To wrap up my talk, I outlined some best practices for those considering building a lakehouse with Iceberg on AWS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data Modelling: Design Iceberg tables with a strong partitioning strategy to optimise performance and query efficiency.&lt;/li&gt;
&lt;li&gt;Governance: Leverage Lake Formation for access control to ensure secure data access.&lt;/li&gt;
&lt;li&gt;Time Travel for Compliance: Use Iceberg’s time travel feature to maintain historical records for regulatory compliance.&lt;/li&gt;
&lt;li&gt;Optimise Glue Jobs: Efficiently schedule Glue jobs to process incremental updates and avoid unnecessary compute costs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In Closing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Presenting at DataEngBytes 2024 Sydney was a fantastic opportunity to share insights into building a transactional data lakehouse on AWS with Apache Iceberg. This architecture offers a powerful approach to managing and analysing data with both the flexibility of a lake and the consistency of a warehouse, unlocking new possibilities for data-driven organisations.&lt;/p&gt;

&lt;p&gt;If you’re exploring a lakehouse approach in your own organisation, I’d highly recommend considering Apache Iceberg and AWS as the foundation. By combining Iceberg’s transactional capabilities with AWS’s scalability, you can build a data lakehouse that adapts to your evolving data needs while ensuring reliability and governance.&lt;/p&gt;

&lt;p&gt;I hope this recap provides a clear overview of the content and insights from my talk. If you have questions or want to learn more about building data lakehouses, feel free to reach out or stay tuned for more blog posts on advanced data architectures!&lt;/p&gt;

</description>
      <category>apacheiceberg</category>
      <category>aws</category>
      <category>etl</category>
      <category>datalakehouse</category>
    </item>
    <item>
      <title>Replacing E or G Drive On-Premises Shared Drives with AWS FSx for Windows</title>
      <dc:creator>Roopa Venkatesh</dc:creator>
      <pubDate>Wed, 03 Jan 2024 10:16:03 +0000</pubDate>
      <link>https://forem.com/roops/replacing-e-or-g-drive-on-premises-shared-drives-with-aws-fsx-for-windows-23pm</link>
      <guid>https://forem.com/roops/replacing-e-or-g-drive-on-premises-shared-drives-with-aws-fsx-for-windows-23pm</guid>
      <description>&lt;h2&gt;
  
  
  Migrate Your File Shares to the AWS Cloud with Ease
&lt;/h2&gt;

&lt;p&gt;On-premises shared drives, like E: and G:, have been the norm for file storage for years. But as businesses move to the cloud, these traditional file systems can become a bottleneck. They're often cumbersome to manage, scale, and secure.&lt;/p&gt;

&lt;p&gt;That's where AWS FSx for Windows comes in. It's a fully managed, highly available, and scalable file system service built on Windows Server. A common business use case arises when you migrate a Windows server hosting a traditional Windows-based application from on-premises to the AWS cloud and encounter shared drives that must be migrated along with the servers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Migrate to AWS FSx for Windows?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scalability and Flexibility:&lt;br&gt;
AWS FSx for Windows provides scalable file storage, allowing you to easily adjust your storage capacity based on your organization's growing needs.&lt;br&gt;
Flexible deployment options enable you to choose the appropriate file system size and performance characteristics for your applications.&lt;/p&gt;

&lt;p&gt;High Performance:&lt;br&gt;
FSx for Windows delivers high-performance file systems that can support a wide range of workloads, including Windows applications and enterprise applications like SQL Server and SAP.&lt;/p&gt;

&lt;p&gt;Integration with Active Directory:&lt;br&gt;
Seamlessly integrate with your existing Active Directory, ensuring a smooth transition for users and maintaining a unified access control system.&lt;/p&gt;

&lt;p&gt;Data Durability:&lt;br&gt;
Benefit from robust data durability and automatic daily backups, reducing the risk of data loss and ensuring business continuity.&lt;/p&gt;

&lt;p&gt;Reduced Management Overhead:&lt;br&gt;
AWS FSx for Windows takes care of routine maintenance tasks, reducing the burden on your IT team and allowing them to focus on more strategic initiatives.&lt;/p&gt;

&lt;p&gt;In this blog post, we'll walk you through the steps of planning and migrating your on-premises shared drives to AWS FSx for Windows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Planning Your Migration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before you start migrating your files, it's important to do some planning. Here are a few things to consider:&lt;/p&gt;

&lt;p&gt;What data will you migrate? Not all data needs to be migrated to the cloud. Consider which files are actively used and which can be archived on-premises.&lt;br&gt;
How will you migrate the data? There are a few different ways to migrate your data to FSx, such as AWS Transfer Family (SFTP), AWS DataSync, or my favorite, Robocopy.&lt;br&gt;
What will you do with your old on-premises shared drives? Once you've migrated your data to FSx, you can decide whether to keep your old shared drives online or decommission them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps to Migrate Your Shared Drives&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before initiating the migration process, conduct a thorough assessment of your current on-premises shared drives. Identify the data volume, access patterns, and any specific dependencies on existing configurations to define the FSx configuration.&lt;/p&gt;

&lt;p&gt;Once you've done your planning, you're ready to start migrating your data. Here are the steps involved:&lt;/p&gt;

&lt;p&gt;Create an FSx file system. Choose the right file system size and performance tier for your needs, and create it in the AWS console or provision it with a CloudFormation template.&lt;br&gt;
Connect your FSx file system to your on-premises network. You can use AWS Direct Connect or a VPN to connect your FSx file system to your on-premises network.&lt;br&gt;
Migrate your data. Choose the migration method that's right for you and start migrating your data to FSx.&lt;br&gt;
Test and verify your migration. Once your data is migrated, test it to make sure everything is working correctly.&lt;br&gt;
Decommission your old on-premises shared drives (optional). Once you're happy with your FSx file system, you can decommission your old on-premises shared drives.&lt;/p&gt;
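&lt;p&gt;The first step, creating the FSx file system, can be sketched with boto3; the helper below assembles the arguments for a Multi-AZ deployment (the subnet and directory IDs, size, and throughput are placeholders):&lt;/p&gt;

```python
def build_fsx_windows_config(subnet_ids, ad_id, storage_gib=1024, throughput_mbps=32):
    """Kwargs for fsx.create_file_system: a Multi-AZ FSx for Windows
    file system joined to a managed Active Directory."""
    return {
        "FileSystemType": "WINDOWS",
        "StorageCapacity": storage_gib,
        "SubnetIds": subnet_ids,
        "WindowsConfiguration": {
            "DeploymentType": "MULTI_AZ_1",        # standby file server in a second AZ
            "ThroughputCapacity": throughput_mbps,
            "ActiveDirectoryId": ad_id,            # lets AD groups and NTFS ACLs resolve
            "PreferredSubnetId": subnet_ids[0],    # where the active file server lives
        },
    }

cfg = build_fsx_windows_config(["subnet-az1", "subnet-az2"], "d-1234567890")
# boto3.client("fsx").create_file_system(**cfg)
```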

&lt;p&gt;File shares are usually shared with AD groups or individual users, so you may need to migrate the shares and permissions along with the data/files. Once FSx is provisioned in AWS and reachable from the on-premises server, you can run Robocopy, which can copy data together with its security settings and permissions. Mount the FSx file system as a new drive on your on-premises Windows server and run this command at the command prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;robocopy e:\source f:\Share\dst /e /mir
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;/e copies empty directories&lt;br&gt;
/mir mirrors a directory tree (it implies /e and /purge)&lt;br&gt;
/copyall copies all file information, including NTFS security (ACLs), ownership, and auditing data&lt;br&gt;
For more robocopy parameters, refer to:&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/robocopy" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--iMpGPUDN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://learn.microsoft.com/en-us/media/open-graph-image.png" height="420" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/robocopy" rel="noopener noreferrer" class="c-link"&gt;
          robocopy | Microsoft Learn
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Reference article for the robocopy command, which copies file data from one location to another.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
        learn.microsoft.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;p&gt;Note that the AD groups used for file permissions must be resolvable by FSx through your domain controllers; this depends on how your AWS environment handles the domain join.&lt;/p&gt;

&lt;p&gt;Ensure that all client applications and users point to the new FSx for Windows file system. Update network mappings, drive letters, and any references to the old on-premises shared drives.&lt;/p&gt;

&lt;p&gt;Here are some additional tips for a successful migration:&lt;/p&gt;

&lt;p&gt;Start with a small subset of data and test it out before migrating everything. Test the parameters to arrive at the final command, then use it to migrate in one pass; you can re-run the robocopy command just before cutover to copy any new delta files/folders.&lt;br&gt;
Use a migration tool like Robocopy. It can automate the process and make it easier to keep track of your progress.&lt;br&gt;
Communicate with your users. Let them know what you're doing and why; this will help minimize disruption during the migration process.&lt;/p&gt;

&lt;p&gt;Migrating your on-premises shared drives to AWS FSx for Windows can be a great way to improve the manageability, scalability, and security of your file storage. By following the tips in this blog post, you can make the migration process smooth and successful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Additional Resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS FSx for Windows: &lt;a href="https://aws.amazon.com/fsx/windows/"&gt;https://aws.amazon.com/fsx/windows/&lt;/a&gt;&lt;br&gt;
AWS Transfer for SFTP: &lt;a href="https://aws.amazon.com/aws-transfer-family/"&gt;https://aws.amazon.com/aws-transfer-family/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hope this blog post has been helpful. If you have any questions, please feel free to leave a comment below.&lt;/p&gt;

</description>
      <category>awsfsx</category>
      <category>migrateshareddrivetoaws</category>
      <category>windowsworkloadmigrationtoaws</category>
      <category>onpremtoaws</category>
    </item>
    <item>
      <title>Decoding Gen AI Platforms on AWS: Navigating the Landscape with AWS SageMaker, Bedrock, and AI on EKS</title>
      <dc:creator>Roopa Venkatesh</dc:creator>
      <pubDate>Sun, 24 Dec 2023 04:46:51 +0000</pubDate>
      <link>https://forem.com/roops/decoding-gen-ai-platforms-on-aws-navigating-the-landscape-with-aws-sagemaker-bedrock-and-ai-on-eks-4nd6</link>
      <guid>https://forem.com/roops/decoding-gen-ai-platforms-on-aws-navigating-the-landscape-with-aws-sagemaker-bedrock-and-ai-on-eks-4nd6</guid>
      <description>&lt;p&gt;Having recently immersed myself in the cutting-edge realm of Gen AI at this year's AWS re:Invent at Las Vegas, I eagerly dived into workshops and sessions dedicated to three standout AI platforms: AWS SageMaker, AWS Bedrock, and AI on EKS. The wealth of insights gained from these experiences prompted me to document and compare the diverse AI options available on AWS. Join me on this journey as we dissect the strengths, nuances, and unique features of each platform, offering a comprehensive guide for navigating the dynamic landscape of Gen AI solutions within the AWS ecosystem.&lt;/p&gt;

&lt;p&gt;Generative Artificial Intelligence (Gen AI) has become a cornerstone of modern technology, enabling businesses to harness the power of Machine Learning and Large Language Models (LLM) for various applications. As the demand for AI solutions continues to grow, several platforms have emerged to facilitate the development and deployment of AI models. In this blog post, we will compare three prominent AI platforms available on AWS Cloud: AWS SageMaker, Bedrock, and AI on EKS (Elastic Kubernetes Service).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SageMaker: The Veteran Warrior&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e9a3svkA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uvokobw4wfiv2qryzmi4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e9a3svkA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uvokobw4wfiv2qryzmi4.png" alt="Sagemaker logo" width="338" height="161"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amazon SageMaker is the granddaddy of AWS AI, a fully managed platform designed for end-to-end machine learning (ML) workflows. From data preparation and model training to deployment and monitoring, SageMaker streamlines the process, making it accessible to both ML experts and novices with more than 150 open-source models. Its pre-built algorithms, notebooks, and tooling take the grunt work out of ML, while robust scalability handles demanding workloads.&lt;/p&gt;

&lt;p&gt;But is SageMaker the one-size-fits-all hero? Not quite. While powerful, it comes with a learning curve: you need to know the service well, and it is best suited to ML engineers who want flexible model development.&lt;/p&gt;

&lt;p&gt;Key features of SageMaker include:&lt;/p&gt;

&lt;p&gt;End-to-End Workflow: SageMaker provides a seamless workflow from data labeling and preparation to model training and deployment.&lt;/p&gt;

&lt;p&gt;Built-in Algorithms: A variety of pre-built algorithms are available, reducing the need for custom development and accelerating model training.&lt;/p&gt;

&lt;p&gt;Scalability: With SageMaker, you can build, train, and deploy ML models at scale using tools like notebooks, debuggers, profilers, pipelines, MLOps, and more – all in one integrated development environment (IDE).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bedrock: The New Kid on the Block&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AdJGc_wc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n3geiaklug17jdp3mhjq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AdJGc_wc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n3geiaklug17jdp3mhjq.png" alt="Bedrock Logo" width="478" height="272"&gt;&lt;/a&gt;&lt;br&gt;
Bedrock takes a different route, focusing on the cutting-edge world of generative AI. This fully managed service provides access to pre-trained "foundation models" (FMs) like language and image generators. No need to build models from scratch - simply fine-tune these FMs to your specific needs. Bedrock promises speed and ease of use.&lt;/p&gt;

&lt;p&gt;It offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon.&lt;/p&gt;

&lt;p&gt;So, is Bedrock the magic solution? Not entirely. Bedrock is young, and its FM options are currently limited compared to SageMaker's.&lt;/p&gt;

&lt;p&gt;Notable features of Bedrock include:&lt;/p&gt;

&lt;p&gt;Ease of use: Using Amazon Bedrock, you can easily experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources.&lt;/p&gt;

&lt;p&gt;AutoML Capabilities: Bedrock includes AutoML functionality, allowing users to automate the model selection and hyperparameter tuning processes.&lt;/p&gt;

&lt;p&gt;Serverless: Amazon Bedrock is serverless, so you don't have to manage any infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications.&lt;/p&gt;
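&lt;p&gt;As a quick illustration of the serverless model, invoking a foundation model is a single API call. The request body schema is model-specific; the sketch below assumes an Anthropic Claude model on Bedrock (the model ID and prompt are illustrative):&lt;/p&gt;

```python
import json

def build_claude_request(prompt, max_tokens=256):
    """Request body for invoking an Anthropic Claude model via Bedrock's
    InvokeModel API (the body schema shown here is Claude's, not universal)."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    })

body = build_claude_request("Summarize our incident report in three bullet points.")
# boto3.client("bedrock-runtime").invoke_model(
#     modelId="anthropic.claude-3-sonnet-20240229-v1:0", body=body)
```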

&lt;p&gt;&lt;strong&gt;AI on EKS (Elastic Kubernetes Service): The Custom Craftsman&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qQSdXIr0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qcz4w26ah16mg3anc5rn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qQSdXIr0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qcz4w26ah16mg3anc5rn.png" alt="AI on EKS" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For those who need ultimate control and flexibility, AI on EKS comes knocking. This approach leverages Amazon Elastic Kubernetes Service (EKS) to deploy and manage containerized AI workloads. You get granular control over infrastructure, tools, and models, allowing for bespoke solutions.&lt;/p&gt;

&lt;p&gt;Running AI workloads on Kubernetes has gained popularity due to its flexibility and scalability. AI on EKS brings Kubernetes to the forefront of AI development.&lt;/p&gt;

&lt;p&gt;But is EKS the DIY path to AI nirvana? Be warned: this path requires significant expertise in both Kubernetes and ML. It's a complex beast, demanding ongoing maintenance and potentially higher operational costs.&lt;/p&gt;

&lt;p&gt;There is a vast ecosystem of tools available to build and run models within the Kubernetes landscape. One emerging stack combines JupyterHub, Argo Workflows, Ray, and Kubernetes; AWS and the community call it the JARK stack, and you can run the entire stack on Amazon EKS alongside your other workloads.&lt;br&gt;
For more details, see &lt;a href="https://aws.amazon.com/blogs/containers/deploy-generative-ai-models-on-amazon-eks/"&gt;https://aws.amazon.com/blogs/containers/deploy-generative-ai-models-on-amazon-eks/&lt;/a&gt;&lt;/p&gt;
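&lt;p&gt;As a flavor of the granular control EKS gives you, here is a sketch of a Deployment manifest for a GPU-backed model server, built in Python. The names, image, and GPU count are hypothetical assumptions; applying it would require a cluster with GPU nodes and the NVIDIA device plugin installed:&lt;/p&gt;

```python
import json

# Sketch of a Kubernetes Deployment for a containerized model server
# on EKS. The image, labels, and GPU count are hypothetical. kubectl
# also accepts JSON manifests, so the printed output could be saved
# and applied with `kubectl apply -f`.
def model_server_manifest(name="llm-server",
                          image="my-registry/llm-server:latest",
                          gpus=1):
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        # Requests one GPU per pod via the NVIDIA device plugin.
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }],
                },
            },
        },
    }

print(json.dumps(model_server_manifest(), indent=2))
```

&lt;p&gt;This is exactly the kind of knob (GPU counts, node placement, autoscaling) that managed services abstract away and EKS hands back to you.&lt;/p&gt;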

&lt;p&gt;&lt;strong&gt;Choosing the Right Tool&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The best choice for you depends on your needs and priorities.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;SageMaker&lt;/u&gt;: Ideal for established ML workflows, businesses prioritizing ease of use and scalability, and those comfortable with AWS tooling. Pricing is pay-as-you-go, based on the compute instances used for training and inference.&lt;br&gt;
&lt;u&gt;Bedrock&lt;/u&gt;: Perfect for experimenting with generative AI, those seeking quick solutions with pre-trained models, and businesses already using SageMaker. Pricing is based on model inference and customization, with a choice of two inference plans: On-Demand and Provisioned Throughput. For many workloads, this works out less expensive than SageMaker.&lt;br&gt;
&lt;u&gt;AI on EKS&lt;/u&gt;: Best for organizations with deep technical expertise, complex AI needs requiring specific customization, and those comfortable managing Kubernetes infrastructure and using EKS as their strategic platform for all of their data and AI needs. Pricing is based on the size and instance types of the underlying cluster infrastructure (Kubernetes-based).&lt;/p&gt;
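&lt;p&gt;A quick back-of-the-envelope sketch can help compare Bedrock's two inference plans. The per-token and hourly rates below are made-up placeholders, not real AWS prices; always check the current pricing pages before deciding:&lt;/p&gt;

```python
# Toy comparison of Bedrock's On-Demand vs Provisioned Throughput
# pricing models. All rates are invented placeholder numbers.
def on_demand_cost(input_tokens, output_tokens,
                   in_rate_per_1k=0.003, out_rate_per_1k=0.015):
    # On-Demand: pay per token processed and generated.
    return (input_tokens / 1000 * in_rate_per_1k
            + output_tokens / 1000 * out_rate_per_1k)

def provisioned_cost(hours, hourly_rate=20.0):
    # Provisioned Throughput: pay per hour of reserved model capacity.
    return hours * hourly_rate

# A month of moderate traffic vs a month of reserved capacity.
monthly_od = on_demand_cost(5_000_000, 1_000_000)
monthly_pt = provisioned_cost(24 * 30)
print(f"on-demand: ${monthly_od:,.2f}  provisioned: ${monthly_pt:,.2f}")
```

&lt;p&gt;Spiky, low-volume traffic tends to favor On-Demand, while sustained high throughput is what justifies Provisioned Throughput.&lt;/p&gt;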

&lt;p&gt;Remember, there's no silver bullet. Each approach has its advantages and drawbacks. Carefully assess your requirements, evaluate your resources, and choose the tool that empowers you to conquer the AI frontier.&lt;/p&gt;

&lt;p&gt;This post is just the beginning of your AI adventure. Stay tuned for further explorations into specific use cases and deep dives into these powerful tools!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Choosing the right AI platform depends on specific project requirements, team expertise, organizational goals, and the existing ecosystem. AWS SageMaker, Bedrock, and AI on EKS each offer unique advantages, catering to different use cases and preferences. As the AI landscape continues to evolve, staying aware of the strengths and limitations of these platforms is crucial for making informed decisions in a rapidly advancing field.&lt;/p&gt;

</description>
      <category>awssagemaker</category>
      <category>awsbedrock</category>
      <category>awseks</category>
      <category>aiplatform</category>
    </item>
  </channel>
</rss>
