<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Roopa Venkatesh</title>
    <description>The latest articles on Forem by Roopa Venkatesh (@roops).</description>
    <link>https://forem.com/roops</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1227036%2F867e97e9-cf6b-4098-9239-8a4728c085f3.jpeg</url>
      <title>Forem: Roopa Venkatesh</title>
      <link>https://forem.com/roops</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/roops"/>
    <language>en</language>
    <item>
      <title>Topology-Aware AI Agents for Observability: Automating SLO Breach Root Cause Analysis</title>
      <dc:creator>Roopa Venkatesh</dc:creator>
      <pubDate>Fri, 06 Mar 2026 06:25:06 +0000</pubDate>
      <link>https://forem.com/roops/topology-aware-ai-agents-for-observability-automating-slo-breach-root-cause-analysis-60i</link>
      <guid>https://forem.com/roops/topology-aware-ai-agents-for-observability-automating-slo-breach-root-cause-analysis-60i</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrv656yu78ce48mfzncy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrv656yu78ce48mfzncy.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Topology-Aware AI Agents for Observability: Automating SLO Breach Root Cause Analysis&lt;/h1&gt;

&lt;p&gt;Modern cloud systems are complex distributed architectures where a single user journey may depend on dozens of services running across multiple infrastructure layers.&lt;/p&gt;

&lt;p&gt;When a &lt;strong&gt;Service Level Objective (SLO)&lt;/strong&gt; breach occurs, identifying the root cause often requires navigating logs, metrics, traces, service dependencies, and infrastructure relationships.&lt;/p&gt;

&lt;p&gt;In many organizations, this investigation is still &lt;strong&gt;manual and time-consuming&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In a recent project, I explored how &lt;strong&gt;AI agents can automate incident investigation&lt;/strong&gt; by combining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Observability data&lt;/li&gt;
&lt;li&gt;Service topology&lt;/li&gt;
&lt;li&gt;Kubernetes infrastructure context&lt;/li&gt;
&lt;li&gt;Historical incident knowledge&lt;/li&gt;
&lt;li&gt;Graph-based reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach reduced investigation time from &lt;strong&gt;20–30 minutes to under a minute&lt;/strong&gt; for certain SLO breaches.&lt;/p&gt;

&lt;p&gt;This article introduces the concept of &lt;strong&gt;Topology-Aware AI Agents&lt;/strong&gt; and how such a system can be implemented using AWS services and graph-based system modeling.&lt;/p&gt;




&lt;h1&gt;The Problem: Traditional Incident Investigation&lt;/h1&gt;

&lt;p&gt;When an SLO breach occurs, SRE teams typically perform the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identify the impacted user journey&lt;/li&gt;
&lt;li&gt;Check monitoring dashboards&lt;/li&gt;
&lt;li&gt;Inspect logs and traces&lt;/li&gt;
&lt;li&gt;Identify impacted services&lt;/li&gt;
&lt;li&gt;Traverse upstream and downstream dependencies&lt;/li&gt;
&lt;li&gt;Correlate incidents with infrastructure problems&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In large microservice environments, this investigation becomes difficult because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logs lack &lt;strong&gt;system-wide context&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Metrics show &lt;strong&gt;symptoms but not relationships&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Service dependencies are hard to traverse quickly&lt;/li&gt;
&lt;li&gt;Infrastructure and application layers are often disconnected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even with powerful observability tools, &lt;strong&gt;humans still perform most correlation tasks manually&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;Why Logs Alone Are Not Enough for AI&lt;/h1&gt;

&lt;p&gt;Many AI troubleshooting systems rely on &lt;strong&gt;RAG (Retrieval Augmented Generation)&lt;/strong&gt; using logs or documentation.&lt;/p&gt;

&lt;p&gt;However, logs alone do not provide &lt;strong&gt;system relationships&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Example log entry: &lt;code&gt;Payment API latency spike&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Without topology context, an AI system cannot determine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which upstream service triggered the issue&lt;/li&gt;
&lt;li&gt;Which downstream dependency failed&lt;/li&gt;
&lt;li&gt;Whether the issue originated from infrastructure or application layers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To solve this, we need &lt;strong&gt;structural knowledge about the system architecture&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;Introducing Topology-Aware AI Agents&lt;/h1&gt;

&lt;p&gt;A &lt;strong&gt;Topology-Aware AI Agent&lt;/strong&gt; combines three major sources of context:&lt;/p&gt;

&lt;p&gt;Observability Data&lt;br&gt;
+&lt;br&gt;
Service Topology&lt;br&gt;
+&lt;br&gt;
Historical Incident Knowledge&lt;/p&gt;

&lt;p&gt;The agent uses this combined knowledge to automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify impacted services&lt;/li&gt;
&lt;li&gt;Traverse dependency graphs&lt;/li&gt;
&lt;li&gt;Correlate incidents&lt;/li&gt;
&lt;li&gt;Suggest root causes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This transforms incident troubleshooting from &lt;strong&gt;log searching&lt;/strong&gt; into &lt;strong&gt;graph-based reasoning&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;Platform Context: Microservices Running on Amazon EKS&lt;/h1&gt;

&lt;p&gt;In this environment, the application platform was built using &lt;strong&gt;Kubernetes&lt;/strong&gt; running on &lt;strong&gt;Amazon Elastic Kubernetes Service (EKS)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Each user request travels across multiple layers:&lt;/p&gt;

&lt;p&gt;User Request&lt;br&gt;
↓&lt;br&gt;
API Gateway / Entry Service&lt;br&gt;
↓&lt;br&gt;
Microservices running on Kubernetes&lt;br&gt;
↓&lt;br&gt;
Databases / external dependencies&lt;/p&gt;

&lt;p&gt;Each microservice runs inside containers deployed on Kubernetes pods.&lt;/p&gt;

&lt;p&gt;To enable automated incident analysis, the system needed visibility into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud infrastructure&lt;/li&gt;
&lt;li&gt;Kubernetes resources&lt;/li&gt;
&lt;li&gt;Application services&lt;/li&gt;
&lt;li&gt;Runtime service interactions&lt;/li&gt;
&lt;li&gt;Observability signals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These relationships were modeled as a &lt;strong&gt;graph database&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;Building the Service Relationship Graph&lt;/h1&gt;

&lt;p&gt;The system used &lt;strong&gt;Neo4j&lt;/strong&gt; to build a &lt;strong&gt;knowledge graph representing the full platform topology&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The graph captured relationships across multiple layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud infrastructure&lt;/li&gt;
&lt;li&gt;Kubernetes platform&lt;/li&gt;
&lt;li&gt;Application services&lt;/li&gt;
&lt;li&gt;Service interactions&lt;/li&gt;
&lt;li&gt;Historical incidents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This structure allowed the AI agent to reason about &lt;strong&gt;how failures propagate across the system&lt;/strong&gt;.&lt;/p&gt;
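&lt;p&gt;To make the idea concrete, here is a minimal in-memory sketch of such a layered graph, using illustrative node and relationship names (the real system stored this in Neo4j):&lt;/p&gt;

```python
# Minimal in-memory sketch of the layered topology graph.
# Node names and relationships are illustrative, not the production Neo4j model.
TOPOLOGY = [
    # Cloud infrastructure layer
    ("AWS Account", "DEPLOYS", "EKS Cluster"),
    ("EKS Cluster", "RUNS_ON", "EC2 Worker Node"),
    # Kubernetes platform layer
    ("EKS Cluster", "CONTAINS", "Namespace"),
    ("Namespace", "CONTAINS", "Pod"),
    ("Pod", "RUNS", "Container"),
    # Application layer
    ("Checkout Service", "RUNS_AS", "Process Group"),
    ("Process Group", "HOSTED_ON", "Pod"),
    # Service interactions
    ("Checkout Service", "CALLS", "Payment Service"),
    ("Payment Service", "CALLS", "Payment Database"),
]

def neighbours(node, relationship):
    """Return all targets reachable from `node` via `relationship`."""
    return [dst for src, rel, dst in TOPOLOGY if src == node and rel == relationship]
```

&lt;p&gt;A graph database adds indexed traversal, Cypher queries, and persistence on top of this basic structure.&lt;/p&gt;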




&lt;h1&gt;Modeling the Infrastructure Layer&lt;/h1&gt;

&lt;p&gt;The first layer of the graph represented the &lt;strong&gt;cloud infrastructure&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Example nodes:&lt;/p&gt;

&lt;p&gt;Cloud Provider&lt;br&gt;
AWS Account&lt;br&gt;
Region&lt;br&gt;
Availability Zone&lt;br&gt;
Host (EC2)&lt;/p&gt;

&lt;p&gt;Example relationships:&lt;br&gt;
AWS Account&lt;br&gt;
│&lt;br&gt;
DEPLOYS&lt;br&gt;
▼&lt;br&gt;
EKS Cluster&lt;br&gt;
│&lt;br&gt;
RUNS_ON&lt;br&gt;
▼&lt;br&gt;
EC2 Worker Node&lt;/p&gt;

&lt;p&gt;This enables the system to correlate incidents with infrastructure-level problems such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;node failures&lt;/li&gt;
&lt;li&gt;CPU saturation&lt;/li&gt;
&lt;li&gt;network issues&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;Modeling the Kubernetes Platform&lt;/h1&gt;

&lt;p&gt;The next layer represents Kubernetes resources running on the EKS cluster.&lt;/p&gt;

&lt;p&gt;Example nodes:&lt;/p&gt;

&lt;p&gt;EKS Cluster&lt;br&gt;
Namespace&lt;br&gt;
Pod&lt;br&gt;
Container&lt;br&gt;
Process Group&lt;/p&gt;

&lt;p&gt;Example relationships:&lt;/p&gt;

&lt;p&gt;EKS Cluster&lt;br&gt;
│&lt;br&gt;
CONTAINS&lt;br&gt;
▼&lt;br&gt;
Namespace&lt;br&gt;
│&lt;br&gt;
CONTAINS&lt;br&gt;
▼&lt;br&gt;
Pod&lt;br&gt;
│&lt;br&gt;
RUNS&lt;br&gt;
▼&lt;br&gt;
Container&lt;/p&gt;

&lt;p&gt;Each container instance is mapped to a &lt;strong&gt;process group representing a running microservice instance&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This structure allows the graph to capture &lt;strong&gt;runtime relationships between services and infrastructure nodes&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;Modeling Application Services&lt;/h1&gt;

&lt;p&gt;At the application level, the graph represents each microservice as a service node.&lt;/p&gt;

&lt;p&gt;Example nodes:&lt;/p&gt;

&lt;p&gt;Service&lt;br&gt;
API&lt;br&gt;
Database&lt;br&gt;
External Dependency&lt;/p&gt;

&lt;p&gt;Services are connected to the runtime processes executing them.&lt;/p&gt;

&lt;p&gt;Example relationship:&lt;/p&gt;

&lt;p&gt;Checkout Service&lt;br&gt;
│&lt;br&gt;
RUNS_AS&lt;br&gt;
▼&lt;br&gt;
Process Group&lt;br&gt;
│&lt;br&gt;
HOSTED_ON&lt;br&gt;
▼&lt;br&gt;
Kubernetes Pod&lt;/p&gt;

&lt;p&gt;This mapping enables the system to trace incidents from &lt;strong&gt;application failures down to infrastructure components&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;Modeling Caller–Callee Relationships&lt;/h1&gt;

&lt;p&gt;One of the most critical aspects of the topology graph is capturing &lt;strong&gt;service interaction flows&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Microservices communicate through APIs, forming &lt;strong&gt;caller–callee relationships&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Checkout Service&lt;br&gt;
│&lt;br&gt;
CALLS&lt;br&gt;
▼&lt;br&gt;
Payment Service&lt;br&gt;
│&lt;br&gt;
CALLS&lt;br&gt;
▼&lt;br&gt;
Payment Database&lt;/p&gt;

&lt;p&gt;These relationships represent the &lt;strong&gt;actual runtime service communication paths&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;By modeling these relationships, the AI agent can identify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;downstream dependencies&lt;/li&gt;
&lt;li&gt;cascading failures&lt;/li&gt;
&lt;li&gt;shared services impacting multiple user journeys&lt;/li&gt;
&lt;/ul&gt;
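&lt;p&gt;Conceptually, finding downstream dependencies is a breadth-first traversal over CALLS edges. A self-contained sketch with hypothetical services:&lt;/p&gt;

```python
from collections import deque

# Illustrative caller-callee edges; in the real system these come from the topology graph.
CALLS = {
    "Checkout Service": ["Payment Service", "Inventory Service"],
    "Payment Service": ["Payment Database"],
    "Inventory Service": [],
    "Payment Database": [],
}

def downstream(service):
    """Collect every transitive downstream dependency of `service`."""
    seen = set()
    queue = deque([service])
    while queue:
        for callee in CALLS.get(queue.popleft(), []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen
```

&lt;p&gt;Running the same traversal in the reverse direction yields the upstream callers, which is how shared services impacting multiple user journeys are found.&lt;/p&gt;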




&lt;h1&gt;Linking Observability Data to the Graph&lt;/h1&gt;

&lt;p&gt;Observability signals such as logs and errors are attached to graph nodes.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Payment Service&lt;br&gt;
│&lt;br&gt;
HAS_ERROR&lt;br&gt;
▼&lt;br&gt;
Timeout Exception&lt;/p&gt;

&lt;p&gt;Infrastructure events can also be attached:&lt;/p&gt;

&lt;p&gt;EC2 Worker Node&lt;br&gt;
│&lt;br&gt;
HAS_EVENT&lt;br&gt;
▼&lt;br&gt;
CPU Spike&lt;/p&gt;

&lt;p&gt;This allows the agent to correlate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;infrastructure issues&lt;/li&gt;
&lt;li&gt;application errors&lt;/li&gt;
&lt;li&gt;service dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;within a &lt;strong&gt;single reasoning model&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;Learning from Historical Incidents&lt;/h1&gt;

&lt;p&gt;Each investigated incident is also stored in the graph.&lt;/p&gt;

&lt;p&gt;Example structure:&lt;/p&gt;

&lt;p&gt;Incident&lt;br&gt;
├ impacted service&lt;br&gt;
├ root cause&lt;br&gt;
├ infrastructure correlation&lt;br&gt;
└ resolution&lt;/p&gt;

&lt;p&gt;Over time, this builds a &lt;strong&gt;knowledge graph of operational incidents&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The AI agent can then detect patterns such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;recurring failures&lt;/li&gt;
&lt;li&gt;common dependency issues&lt;/li&gt;
&lt;li&gt;infrastructure patterns impacting multiple services&lt;/li&gt;
&lt;/ul&gt;
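&lt;p&gt;Even a simple aggregation over stored incidents surfaces such patterns. A sketch with hypothetical incident records:&lt;/p&gt;

```python
from collections import Counter

# Hypothetical incident records, as they might be read back from the graph.
incidents = [
    {"service": "Payment Service", "root_cause": "connection pool exhaustion"},
    {"service": "Payment Service", "root_cause": "connection pool exhaustion"},
    {"service": "Checkout Service", "root_cause": "node CPU saturation"},
]

# Surface recurring root causes so the agent can rank likely hypotheses first.
recurring = Counter(i["root_cause"] for i in incidents).most_common()
```
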




&lt;h1&gt;Architecture Overview&lt;/h1&gt;

&lt;p&gt;A simplified architecture for this approach looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SLO Breach Alert
        │
        ▼
Event Trigger (Monitoring / EventBridge)
        │
        ▼
Incident AI Agent
        │
        ├── Service Topology Graph (Neo4j)
        ├── Observability Data (Logs / Traces)
        └── Historical Incident Knowledge
        │
        ▼
     LLM Reasoning
        │
        ▼
 Root Cause Hypothesis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;AWS services that can support this architecture include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon EKS&lt;/li&gt;
&lt;li&gt;AWS Lambda&lt;/li&gt;
&lt;li&gt;Amazon EventBridge&lt;/li&gt;
&lt;li&gt;Amazon Bedrock&lt;/li&gt;
&lt;li&gt;Amazon OpenSearch&lt;/li&gt;
&lt;li&gt;Amazon Neptune (as a managed graph alternative)&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;Agent Workflow&lt;/h1&gt;

&lt;p&gt;When a new SLO breach occurs, the AI agent performs the following steps.&lt;/p&gt;

&lt;h3&gt;Step 1 — Detect SLO Breach&lt;/h3&gt;

&lt;p&gt;Monitoring tools trigger an alert event.&lt;/p&gt;

&lt;h3&gt;Step 2 — Identify Impacted Services&lt;/h3&gt;

&lt;p&gt;The agent queries the service topology graph.&lt;/p&gt;

&lt;h3&gt;Step 3 — Traverse Dependencies&lt;/h3&gt;

&lt;p&gt;The graph traversal identifies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;upstream services&lt;/li&gt;
&lt;li&gt;downstream dependencies&lt;/li&gt;
&lt;li&gt;infrastructure nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Step 4 — Retrieve Observability Signals&lt;/h3&gt;

&lt;p&gt;Logs and errors are retrieved from observability platforms.&lt;/p&gt;

&lt;h3&gt;Step 5 — LLM Reasoning&lt;/h3&gt;

&lt;p&gt;Structured context is sent to the LLM.&lt;/p&gt;

&lt;p&gt;Example prompt:&lt;/p&gt;

&lt;p&gt;SLO breach detected in Checkout Service&lt;/p&gt;

&lt;p&gt;Impacted services:&lt;br&gt;
Checkout Service&lt;br&gt;
Payment Service&lt;br&gt;
Payment Database&lt;/p&gt;

&lt;p&gt;Recent errors:&lt;br&gt;
Timeout errors in Payment Service&lt;/p&gt;

&lt;p&gt;Historical incident:&lt;br&gt;
Database connection pool exhaustion&lt;/p&gt;

&lt;p&gt;The LLM then generates a &lt;strong&gt;root cause hypothesis&lt;/strong&gt;.&lt;/p&gt;
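&lt;p&gt;The structured context can be assembled into the prompt programmatically; a minimal sketch (the exact prompt layout is a design choice, not a fixed format):&lt;/p&gt;

```python
def build_prompt(breached, impacted, errors, history):
    """Flatten graph and observability context into an LLM prompt string."""
    lines = [f"SLO breach detected in {breached}", "", "Impacted services:"]
    lines += impacted
    lines += ["", "Recent errors:"] + errors
    lines += ["", "Historical incident:"] + history
    return "\n".join(lines)

prompt = build_prompt(
    "Checkout Service",
    ["Checkout Service", "Payment Service", "Payment Database"],
    ["Timeout errors in Payment Service"],
    ["Database connection pool exhaustion"],
)
```
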




&lt;h1&gt;Results from the Prototype&lt;/h1&gt;

&lt;p&gt;In the prototype implementation:&lt;/p&gt;

&lt;p&gt;Manual investigation time:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;20–30 minutes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AI-assisted investigation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Under 1 minute&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For a specific &lt;strong&gt;platinum user journey SLO&lt;/strong&gt;, the agent achieved:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;~52% correlation accuracy&lt;/strong&gt; between SLO breaches and underlying service problems.&lt;/p&gt;

&lt;p&gt;While not perfect, it significantly accelerates incident triage.&lt;/p&gt;




&lt;h1&gt;Why Graph-Based Observability Matters&lt;/h1&gt;

&lt;p&gt;Traditional observability focuses on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;metrics&lt;/li&gt;
&lt;li&gt;logs&lt;/li&gt;
&lt;li&gt;traces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, modern systems also require &lt;strong&gt;relationship awareness&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Graph-based models enable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;dependency reasoning&lt;/li&gt;
&lt;li&gt;cross-service correlation&lt;/li&gt;
&lt;li&gt;historical incident learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Combining &lt;strong&gt;graph knowledge with LLM reasoning&lt;/strong&gt; enables a new class of systems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI-assisted incident response agents.&lt;/strong&gt;&lt;/p&gt;




&lt;h1&gt;Future Directions&lt;/h1&gt;

&lt;p&gt;This concept can evolve further with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;autonomous remediation agents&lt;/li&gt;
&lt;li&gt;continuous incident learning&lt;/li&gt;
&lt;li&gt;multi-agent observability systems&lt;/li&gt;
&lt;li&gt;integration with CI/CD pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As distributed architectures continue to grow in complexity, &lt;strong&gt;topology-aware AI agents may become an essential part of SRE operations&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;Final Thoughts&lt;/h1&gt;

&lt;p&gt;AI-powered incident investigation is still in its early stages.&lt;/p&gt;

&lt;p&gt;However, combining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;observability data&lt;/li&gt;
&lt;li&gt;service topology graphs&lt;/li&gt;
&lt;li&gt;Kubernetes infrastructure knowledge&lt;/li&gt;
&lt;li&gt;historical incident intelligence&lt;/li&gt;
&lt;li&gt;LLM reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;creates a powerful approach to &lt;strong&gt;automated root cause analysis&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Topology-aware AI agents represent a promising direction for improving &lt;strong&gt;SRE productivity and incident response time in modern cloud-native systems&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;If you're exploring AI for SRE, observability, or incident automation, I would love to hear your thoughts or experiences.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>From Chatbot to Cloud CFO: Building an Autonomous FinOps Agent with Amazon Bedrock</title>
      <dc:creator>Roopa Venkatesh</dc:creator>
      <pubDate>Wed, 14 Jan 2026 02:30:46 +0000</pubDate>
      <link>https://forem.com/roops/from-chatbot-to-cloud-cfo-building-an-autonomous-finops-agent-with-amazon-bedrock-2epc</link>
      <guid>https://forem.com/roops/from-chatbot-to-cloud-cfo-building-an-autonomous-finops-agent-with-amazon-bedrock-2epc</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e2duujddg4k9f0w4f7c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e2duujddg4k9f0w4f7c.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;TL;DR&lt;/h2&gt;

&lt;p&gt;Learn how to build an AI agent that autonomously optimises your AWS costs by analysing CloudWatch metrics, identifying underutilised resources, and making intelligent decisions—all while maintaining production safety through AWS X-Ray observability and human-in-the-loop approval workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll learn:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to build a Bedrock Agent with custom action groups&lt;/li&gt;
&lt;li&gt;Implementing X-Ray tracing to audit AI decision-making&lt;/li&gt;
&lt;li&gt;Production safety patterns for autonomous infrastructure agents&lt;/li&gt;
&lt;li&gt;Human-in-the-loop approval workflows for high-risk actions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS account with Bedrock access (ap-southeast-2 recommended for Australia)&lt;/li&gt;
&lt;li&gt;Python 3.9+ and AWS CLI configured&lt;/li&gt;
&lt;li&gt;Basic understanding of Lambda, EC2, and CloudWatch&lt;/li&gt;
&lt;li&gt;Estimated cost: ~$20/month for development use&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;The Problem&lt;/h2&gt;

&lt;p&gt;We have all been there. You spin up a p3.2xlarge instance for a quick Friday afternoon experiment. You go to happy hour, the weekend hits, and you forget about it. Two weeks later, the AWS bill arrives, and panic sets in.&lt;/p&gt;

&lt;p&gt;For years, we solved this with "dumb" scripts—cron jobs that shut down everything tagged &lt;code&gt;dev&lt;/code&gt; at 7 PM. But scripts lack context. They kill long-running training jobs just as often as they save money.&lt;/p&gt;

&lt;p&gt;We don't need a script. We need an &lt;strong&gt;Agent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this post, I’ll walk through how to build an &lt;strong&gt;Autonomous FinOps Agent&lt;/strong&gt; using &lt;strong&gt;Amazon Bedrock&lt;/strong&gt; and &lt;strong&gt;Python&lt;/strong&gt;. More importantly, I will show you how to use &lt;strong&gt;AWS X-Ray&lt;/strong&gt; to "audit the brain" of the agent, ensuring it never deletes production resources by mistake.&lt;/p&gt;

&lt;h2&gt;The Difference: Automation vs. Agentic AI&lt;/h2&gt;

&lt;p&gt;Why use an Agent instead of a Lambda function triggered by a CloudWatch Alarm?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automation (The Script):&lt;/strong&gt; "If CPU &amp;lt; 5% for 1 hour, terminate instance."&lt;br&gt;
Risk: It kills a critical database waiting for connections.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agentic AI (The CFO):&lt;/strong&gt; "I see this instance has low CPU. Let me check the tags. It belongs to the 'Data Science' team. Let me check the git logs on the attached volume. It seems inactive. I will Slack the owner, and if they don't reply in 24 hours, I will snapshot and terminate it."&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Agent adds &lt;strong&gt;reasoning&lt;/strong&gt; to the automation.&lt;/p&gt;

&lt;h2&gt;The Architecture&lt;/h2&gt;

&lt;p&gt;We will use the Amazon Bedrock Agents framework, which simplifies the orchestration of tools and provides built-in reasoning capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Component Overview:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Brain:&lt;/strong&gt; Amazon Bedrock (Model: Claude 3.5 Sonnet) - Handles reasoning and decision-making&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Hands:&lt;/strong&gt; Python Lambda function (Action Group) - Executes AWS API calls via Boto3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Eyes:&lt;/strong&gt; AWS X-Ray + CloudWatch Logs - Traces every decision and API call&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Safety Net:&lt;/strong&gt; SNS notifications for human approval on destructive actions
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────┐
│                     User Query                           │
│        "Find under-utilised resources in ap-southeast-2" │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ▼
         ┌───────────────────────┐
         │  Amazon Bedrock Agent │
         │  (Claude 3.5 Sonnet)  │──── X-Ray Tracing
         └───────────┬───────────┘
                     │
                     │ Invokes Action Group
                     ▼
         ┌───────────────────────┐
         │   Lambda Function     │
         │  (Action Router)      │──── CloudWatch Logs
         └───────────┬───────────┘
                     │
         ┌───────────┴───────────┐
         │                       │
         ▼                       ▼
    ┌─────────┐           ┌─────────┐
    │ EC2 API │           │ CW API  │
    └─────────┘           └─────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;Step 1: Building the Action Group Lambda&lt;/h2&gt;

&lt;p&gt;In Bedrock, an "Action Group" is an OpenAPI schema that maps natural language intents to Lambda functions. The Lambda acts as a &lt;strong&gt;central router&lt;/strong&gt; that executes different tools based on what the agent decides.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Critical:&lt;/strong&gt; We use &lt;code&gt;aws_xray_sdk&lt;/code&gt; to patch all Boto3 calls. In Agentic AI, observability isn't optional—it's the only way to debug hallucinations and verify the agent's reasoning.&lt;/p&gt;
&lt;/blockquote&gt;
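&lt;p&gt;For reference, the two tools used by the router below could be declared roughly like this. This is a hedged sketch of a function-details definition; verify the exact &lt;code&gt;functionSchema&lt;/code&gt; shape against the current Bedrock Agents documentation before deploying:&lt;/p&gt;

```python
# Sketch of a function-details definition for the action group.
# Field names follow the Bedrock Agents functionSchema format as I understand it;
# check the current API reference before relying on this shape.
function_schema = {
    "functions": [
        {
            "name": "analyse_underutilised_resources",
            "description": "Scan CloudWatch CPU metrics for idle EC2 instances.",
            "parameters": {
                "region": {
                    "type": "string",
                    "description": "AWS region to scan",
                    "required": False,
                }
            },
        },
        {
            "name": "stop_resource",
            "description": "Stop a non-production EC2 instance by ID.",
            "parameters": {
                "resource_id": {
                    "type": "string",
                    "description": "EC2 instance ID",
                    "required": True,
                }
            },
        },
    ]
}
```
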

&lt;h3&gt;The Lambda Handler (Action Router)&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_xray_sdk.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;xray_recorder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;patch_all&lt;/span&gt;

&lt;span class="c1"&gt;# Automatically trace all AWS API calls
&lt;/span&gt;&lt;span class="nf"&gt;patch_all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;ec2_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ec2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Central router for Bedrock Agent action groups.
    Receives function name + parameters, executes the tool, returns structured response.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;function_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])}&lt;/span&gt;

    &lt;span class="c1"&gt;# Start X-Ray subsegment to track this specific tool execution
&lt;/span&gt;    &lt;span class="n"&gt;subsegment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xray_recorder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;begin_subsegment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ToolExecution:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;function_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;subsegment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_annotation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;function_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;analyse_underutilised_resources&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;response_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;check_cpu_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;function_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stop_resource&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# CRITICAL SAFETY CHECK: Never stop production instances
&lt;/span&gt;            &lt;span class="n"&gt;resource_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;resource_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_production&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;response_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DENIED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Production resources require manual approval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; Agent attempted to stop PROD: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;response_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stop_ec2_instance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tool execution failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;subsegment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;response_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Internal tool error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;xray_recorder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end_subsegment&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Return in Bedrock's expected format
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;messageVersion&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1.0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;actionGroup&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;actionGroup&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;function_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;functionResponse&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;responseBody&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TEXT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_body&lt;/span&gt;&lt;span class="p"&gt;)}}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Design Patterns:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Subsegments for granular tracing&lt;/strong&gt; - Each tool execution gets its own X-Ray subsegment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production safety guard&lt;/strong&gt; - Tag-based checks prevent accidental destruction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured responses&lt;/strong&gt; - JSON format allows the agent to reason about results&lt;/li&gt;
&lt;/ol&gt;
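&lt;p&gt;The handler above calls &lt;code&gt;is_production()&lt;/code&gt; without showing it. Here is a minimal sketch of that guard, assuming the (hypothetical) convention that environments are recorded in an &lt;code&gt;Environment&lt;/code&gt; tag on the instance; adapt the tag key and values to your own tagging standard:&lt;/p&gt;

```python
def env_from_tags(tags):
    """Classify an Environment tag value; unknown values count as production."""
    env = str(tags.get("Environment", "")).lower()
    if env in ("prod", "production"):
        return "production"
    if env in ("dev", "development", "test", "staging", "sandbox"):
        return "non-production"
    return "production"  # fail closed on missing or unrecognised tags

def is_production(resource_id, region="ap-southeast-2"):
    """Look up the instance's tags and classify them (fails closed on error)."""
    try:
        import boto3  # imported here so env_from_tags stays dependency-free
        ec2 = boto3.client("ec2", region_name=region)
        response = ec2.describe_tags(
            Filters=[{"Name": "resource-id", "Values": [resource_id]}]
        )
        tags = {t["Key"]: t["Value"] for t in response["Tags"]}
        return env_from_tags(tags) == "production"
    except Exception:
        return True  # cannot classify, so refuse to treat it as safe
```

&lt;p&gt;Failing closed matters here: an instance the agent cannot classify is treated as production, so a tagging gap can never turn into an accidental stop.&lt;/p&gt;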




&lt;h2&gt;
  
  
  Step 2: The OpenAPI Schema (Connecting Agent to Tools)
&lt;/h2&gt;

&lt;p&gt;Before the agent can call your Lambda, you need to define an OpenAPI schema that describes the available tools. This is what Bedrock uses to understand &lt;strong&gt;when&lt;/strong&gt; and &lt;strong&gt;how&lt;/strong&gt; to invoke your functions. (Bedrock action groups alternatively accept a simpler &lt;em&gt;function details&lt;/em&gt; definition; the Lambda response shown earlier uses that function-style format, so whichever convention you pick, keep the schema and the response format consistent.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;openapi&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3.0.0&lt;/span&gt;
&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FinOps Agent Tools&lt;/span&gt;
  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1.0.0&lt;/span&gt;
&lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;/analyse&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;post&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Analyse underutilised EC2 resources&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Scans a region for instances with low CPU utilisation over the past 7 days&lt;/span&gt;
      &lt;span class="na"&gt;operationId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;analyse_underutilised_resources&lt;/span&gt;
      &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;region&lt;/span&gt;
          &lt;span class="na"&gt;in&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;query&lt;/span&gt;
          &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
          &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
            &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ap-southeast-2&lt;/span&gt;
      &lt;span class="na"&gt;responses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;200'&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Analysis results with list of underutilised instances&lt;/span&gt;

  &lt;span class="na"&gt;/stop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;post&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Stop an EC2 instance&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Stops a non-production EC2 instance to save costs&lt;/span&gt;
      &lt;span class="na"&gt;operationId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stop_resource&lt;/span&gt;
      &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;resource_id&lt;/span&gt;
          &lt;span class="na"&gt;in&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;query&lt;/span&gt;
          &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;The EC2 instance ID to stop&lt;/span&gt;
      &lt;span class="na"&gt;responses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;200'&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Instance stop status&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User asks: &lt;em&gt;"Find idle instances in ap-southeast-2"&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Bedrock maps this to &lt;code&gt;analyse_underutilised_resources&lt;/code&gt; with &lt;code&gt;region=ap-southeast-2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Lambda receives the function name + parameters&lt;/li&gt;
&lt;li&gt;Executes &lt;code&gt;check_cpu_metrics('ap-southeast-2')&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Returns structured JSON back to Bedrock&lt;/li&gt;
&lt;li&gt;Agent reasons about the results and responds to the user&lt;/li&gt;
&lt;/ol&gt;
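&lt;p&gt;Concretely, Bedrock delivers tool parameters to the Lambda as a list of name/type/value objects, so most handlers begin by flattening that list into a dict. A sketch of that first step (the event below is illustrative, trimmed to the fields the handler reads):&lt;/p&gt;

```python
def extract_parameters(event):
    """Flatten Bedrock's parameter list ([{name, type, value}, ...]) into a dict."""
    return {p["name"]: p["value"] for p in event.get("parameters", [])}

# Illustrative event, trimmed to the fields the handler reads:
event = {
    "actionGroup": "FinOpsTools",
    "function": "stop_resource",
    "parameters": [
        {"name": "resource_id", "type": "string", "value": "i-0123456789"}
    ],
}

params = extract_parameters(event)
# params["resource_id"] now holds "i-0123456789", matching the
# parameters['resource_id'] lookup used in the handler above.
```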




&lt;h2&gt;
  
  
  Step 3: Implementing the Cost Analysis Tool
&lt;/h2&gt;

&lt;p&gt;Now let's implement the &lt;code&gt;check_cpu_metrics()&lt;/code&gt; function that does the actual CloudWatch analysis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_cpu_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Analyses EC2 instances for low CPU utilisation.
    Returns actionable insights for cost optimisation.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;cw_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cloudwatch&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ec2_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ec2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;underutilised&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="c1"&gt;# Get all running instances
&lt;/span&gt;    &lt;span class="n"&gt;instances&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ec2_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe_instances&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;Filters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;instance-state-name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Values&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;running&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Analyse last 7 days of metrics
&lt;/span&gt;    &lt;span class="n"&gt;end_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;reservation&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;instances&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Reservations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reservation&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Instances&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;instance_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;InstanceId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

            &lt;span class="c1"&gt;# Query CloudWatch for CPU metrics
&lt;/span&gt;            &lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cw_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_metric_statistics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;Namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AWS/EC2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;MetricName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CPUUtilisation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;Dimensions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;InstanceId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
                &lt;span class="n"&gt;StartTime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;EndTime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;Period&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;86400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Daily aggregation
&lt;/span&gt;                &lt;span class="n"&gt;Statistics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Average&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Datapoints&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="n"&gt;avg_cpu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Average&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;dp&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Datapoints&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Datapoints&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

                &lt;span class="c1"&gt;# Flag instances below 10% CPU
&lt;/span&gt;                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;avg_cpu&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tags&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])}&lt;/span&gt;
                    &lt;span class="n"&gt;underutilised&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;instance_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;instance_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;InstanceType&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avg_cpu_percent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_cpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;environment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Environment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unknown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;recommendation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Consider downsizing or stopping&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
                    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;underutilised_instances&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;underutilised&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;potential_monthly_savings&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;underutilised&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this approach works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;7-day analysis window&lt;/strong&gt; catches weekend/holiday idle time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily aggregation&lt;/strong&gt; reduces API costs while maintaining accuracy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tag extraction&lt;/strong&gt; gives the agent context about ownership and environment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured output&lt;/strong&gt; allows the agent to present findings naturally&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro Tip:&lt;/strong&gt; For production, add network I/O metrics and disk usage to avoid flagging batch-processing instances that are I/O-bound but CPU-light.&lt;/p&gt;
&lt;/blockquote&gt;
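&lt;p&gt;A sketch of that extra check, reusing the &lt;code&gt;get_metric_statistics&lt;/code&gt; pattern from above. The &lt;code&gt;NetworkIn&lt;/code&gt; metric is a standard &lt;code&gt;AWS/EC2&lt;/code&gt; metric, but the ~5 MB/day idle threshold is an arbitrary illustration to tune per workload:&lt;/p&gt;

```python
CPU_IDLE_PERCENT = 10.0          # same threshold as the CPU check above
NETWORK_IDLE_BYTES = 5_000_000   # ~5 MB/day; illustrative, tune per workload

def is_truly_idle(avg_cpu, avg_network_bytes):
    """Flag an instance only when BOTH CPU and network traffic are low,
    so I/O-bound batch workers are not marked as waste."""
    if avg_cpu >= CPU_IDLE_PERCENT:
        return False
    if avg_network_bytes >= NETWORK_IDLE_BYTES:
        return False
    return True

def avg_network_in(cw_client, instance_id, start_time, end_time):
    """Average daily NetworkIn bytes over the window (0.0 when no datapoints)."""
    metrics = cw_client.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="NetworkIn",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start_time,
        EndTime=end_time,
        Period=86400,            # daily aggregation, as above
        Statistics=["Average"],
    )
    points = metrics["Datapoints"]
    if not points:
        return 0.0
    return sum(dp["Average"] for dp in points) / len(points)
```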




&lt;h2&gt;
  
  
  Step 4: X-Ray Observability - Auditing the Agent's Brain
&lt;/h2&gt;

&lt;p&gt;When you deploy a chatbot, users give feedback via thumbs up/down. When you deploy an infrastructure agent, the "feedback" might be &lt;strong&gt;your production database going offline&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We need &lt;strong&gt;deep observability&lt;/strong&gt; to audit every decision.&lt;/p&gt;

&lt;h3&gt;
  
  
  What X-Ray Gives You
&lt;/h3&gt;

&lt;p&gt;By using &lt;code&gt;patch_all()&lt;/code&gt; and custom subsegments, we generate a full trace of the agent's decision-making process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
    ↓
Bedrock Agent (Reasoning: "Low CPU detected, checking tags...")
    ↓
Lambda Action Group (Executing: analyse_underutilised_resources)
    ↓
CloudWatch API (Fetching metrics)
    ↓
EC2 API (Reading instance tags)
    ↓
Response back to agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
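&lt;p&gt;The trace above falls out of instrumentation along these lines: a helper that wraps each tool call in a subsegment with searchable annotations, mirroring the handler's &lt;code&gt;try&lt;/code&gt;/&lt;code&gt;finally&lt;/code&gt;. The helper name is ours, not part of the &lt;code&gt;aws-xray-sdk&lt;/code&gt;; the recorder calls (&lt;code&gt;begin_subsegment&lt;/code&gt;, &lt;code&gt;put_annotation&lt;/code&gt;, &lt;code&gt;end_subsegment&lt;/code&gt;) are the SDK's:&lt;/p&gt;

```python
from contextlib import contextmanager

@contextmanager
def traced_tool(recorder, tool_name, annotations):
    """Open an X-Ray subsegment around a tool call, attach indexed
    annotations, and always close the subsegment, even on error."""
    subsegment = recorder.begin_subsegment(f"ToolExecution:{tool_name}")
    try:
        for key, value in annotations.items():
            subsegment.put_annotation(key, value)
        yield subsegment
    finally:
        recorder.end_subsegment()

# Usage inside the handler (xray_recorder from aws_xray_sdk.core):
#
# with traced_tool(xray_recorder, "stop_resource",
#                  {"Agent": "FinOpsAgent-v1", "resource_id": resource_id}):
#     response_body = stop_ec2_instance(resource_id)
```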



&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;If the agent fails to stop an instance, CloudWatch Logs alone won't tell you &lt;strong&gt;why&lt;/strong&gt;. You need to know:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Did the tool fail?&lt;/strong&gt; (e.g., Boto3 &lt;code&gt;AccessDenied&lt;/code&gt; error)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Did the agent fail to reason correctly?&lt;/strong&gt; (e.g., The agent concluded "CPU at 10% is actually high load for this workload" and didn't call the stop function)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;X-Ray lets you overlay the &lt;strong&gt;Reasoning Trace&lt;/strong&gt; (Bedrock) with the &lt;strong&gt;Execution Trace&lt;/strong&gt; (Lambda/AWS APIs).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example X-Ray Insight:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"subsegment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ToolExecution:stop_resource"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"annotation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Agent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FinOpsAgent-v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"resource_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"i-0123456789"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"decision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DENIED - Production tag detected"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"duration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;245&lt;/span&gt;&lt;span class="err"&gt;ms&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells you the agent &lt;strong&gt;correctly refused&lt;/strong&gt; to stop a production instance—critical for compliance audits.&lt;/p&gt;
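&lt;p&gt;Because X-Ray indexes annotations, an auditor can pull every trace for a given agent (and scan its decisions for &lt;code&gt;DENIED&lt;/code&gt; entries) with a filter expression. A sketch using boto3's X-Ray client; the annotation key matches the &lt;code&gt;Agent&lt;/code&gt; annotation set above:&lt;/p&gt;

```python
import datetime

def build_audit_query(agent_name, hours=24):
    """Build the time window and X-Ray filter expression for an audit pull."""
    end = datetime.datetime.utcnow()
    start = end - datetime.timedelta(hours=hours)
    expression = f'annotation.Agent = "{agent_name}"'
    return start, end, expression

def fetch_agent_traces(agent_name, region="ap-southeast-2", hours=24):
    """List trace summaries for one agent, e.g. to review denied actions."""
    import boto3  # imported here so build_audit_query stays testable offline
    start, end, expression = build_audit_query(agent_name, hours)
    xray = boto3.client("xray", region_name=region)
    response = xray.get_trace_summaries(
        StartTime=start, EndTime=end, FilterExpression=expression
    )
    return response["TraceSummaries"]
```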




&lt;h2&gt;
  
  
  Step 5: The "Human-in-the-Loop" Safety Net
&lt;/h2&gt;

&lt;p&gt;The biggest fear with Agentic AI is the &lt;strong&gt;"Runaway Robot"&lt;/strong&gt; scenario—what if the agent misinterprets data and terminates a critical database?&lt;/p&gt;

&lt;p&gt;The solution: &lt;strong&gt;Don't give the agent destructive permissions&lt;/strong&gt;. Instead, use an approval workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation: Request Approval Tool
&lt;/h3&gt;

&lt;p&gt;Add a third function to your Lambda that sends termination requests to humans:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="n"&gt;sns_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sns&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;SNS_TOPIC_ARN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;arn:aws:sns:ap-southeast-2:123456789:FinOpsApprovals&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;request_termination_approval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;justification&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Requests human approval before terminating an instance.
    Publishes to SNS topic monitored by DevOps team.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Get instance details for context
&lt;/span&gt;    &lt;span class="n"&gt;ec2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ec2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;instance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe_instances&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;InstanceIds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Reservations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Instances&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tags&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])}&lt;/span&gt;
    &lt;span class="n"&gt;instance_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;InstanceType&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Calculate estimated savings
&lt;/span&gt;    &lt;span class="c1"&gt;# Rough estimates: t3.medium=$30/mo, t3.large=$60/mo, etc.
&lt;/span&gt;    &lt;span class="n"&gt;hourly_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_instance_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;instance_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;monthly_savings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hourly_cost&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;730&lt;/span&gt;  &lt;span class="c1"&gt;# hours/month
&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
 FinOps Agent Termination Request

Instance: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Type: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;instance_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Name: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Environment: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Environment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unknown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

AI Analysis:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;justification&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Estimated Monthly Savings: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;monthly_savings&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Actions:
✅ Approve: Reply with &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;APPROVE &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
❌ Deny: Reply with &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DENY &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
⏸️ Snooze 7 days: Reply with &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SNOOZE &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;

View X-Ray Trace: https://console.aws.amazon.com/xray/...
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;sns_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;TopicArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SNS_TOPIC_ARN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Subject&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; FinOps Agent: Approval needed for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;approval_requested&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;instance_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Notification sent to DevOps team&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Update the OpenAPI Schema
&lt;/h3&gt;

&lt;p&gt;Add this to your schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;/request-termination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;post&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Request approval to terminate an instance&lt;/span&gt;
      &lt;span class="na"&gt;operationId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;request_termination_approval&lt;/span&gt;
      &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;instance_id&lt;/span&gt;
          &lt;span class="na"&gt;in&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;query&lt;/span&gt;
          &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;justification&lt;/span&gt;
          &lt;span class="na"&gt;in&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;query&lt;/span&gt;
          &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AI-generated explanation for why termination is recommended&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Agent's New Workflow
&lt;/h3&gt;

&lt;p&gt;Now when the agent finds a zombie instance, it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Analyses&lt;/strong&gt; the metrics (CPU, network, disk)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasons&lt;/strong&gt; about the context (tags, uptime, cost)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requests approval&lt;/strong&gt; instead of acting immediately&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Includes the X-Ray trace link&lt;/strong&gt; so humans can audit the decision&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Waits&lt;/strong&gt; for human confirmation before taking action&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This gives you the &lt;strong&gt;speed of AI analysis&lt;/strong&gt; with the &lt;strong&gt;safety of human judgment&lt;/strong&gt;.&lt;/p&gt;
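The reply-handling side of the loop is not shown above; here is a minimal sketch of parsing the APPROVE/DENY/SNOOZE replies, assuming they arrive as raw text from whatever inbox or queue is subscribed to the SNS topic (the `parse_approval_reply` helper is hypothetical, not part of the agent's code):

```python
import re

# Replies look like "APPROVE i-0abc123", "DENY i-0def456" or "SNOOZE i-0ghi789",
# matching the instructions embedded in the SNS message body.
REPLY_PATTERN = re.compile(r"^(APPROVE|DENY|SNOOZE)\s+(i-[0-9a-z]+)\s*$", re.IGNORECASE)

def parse_approval_reply(reply_text):
    """Parse a human reply into (action, instance_id); return None if malformed."""
    match = REPLY_PATTERN.match(reply_text.strip())
    if not match:
        return None
    return match.group(1).upper(), match.group(2)
```

A dispatcher would then call `ec2.stop_instances` only on an APPROVE, record a DENY, or re-queue the finding after seven days on a SNOOZE.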




&lt;h2&gt;
  
  
  Deployment &amp;amp; Testing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Create the Lambda Function
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Package dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;aws-xray-sdk boto3 &lt;span class="nt"&gt;-t&lt;/span&gt; ./package
&lt;span class="nb"&gt;cd &lt;/span&gt;package &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; zip &lt;span class="nt"&gt;-r&lt;/span&gt; ../lambda.zip &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd&lt;/span&gt; ..
zip &lt;span class="nt"&gt;-g&lt;/span&gt; lambda.zip lambda_function.py

&lt;span class="c"&gt;# Deploy to AWS&lt;/span&gt;
aws lambda create-function &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--function-name&lt;/span&gt; FinOpsAgentTools &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--runtime&lt;/span&gt; python3.11 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt; arn:aws:iam::YOUR_ACCOUNT:role/FinOpsLambdaRole &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--handler&lt;/span&gt; lambda_function.lambda_handler &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--zip-file&lt;/span&gt; fileb://lambda.zip &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--timeout&lt;/span&gt; 60 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory-size&lt;/span&gt; 256 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tracing-config&lt;/span&gt; &lt;span class="nv"&gt;Mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Active
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Create IAM Policy for Lambda
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"EC2ReadOnly"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"ec2:DescribeInstances"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"ec2:DescribeTags"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"EC2StopNonProd"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ec2:StopInstances"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:ec2:*:*:instance/*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StringNotEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"ec2:ResourceTag/Environment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Production"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CloudWatchMetrics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cloudwatch:GetMetricStatistics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SNSPublish"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sns:Publish"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:sns:*:*:FinOpsApprovals"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"XRayTracing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"xray:PutTraceSegments"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"xray:PutTelemetryRecords"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Security Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EC2 stop is tag-restricted&lt;/strong&gt; - Cannot stop instances with &lt;code&gt;Environment=Production&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read-only for metrics&lt;/strong&gt; - No write access to CloudWatch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SNS scope limited&lt;/strong&gt; - Can only publish to approval topic&lt;/li&gt;
&lt;/ul&gt;
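Defence in depth suggests mirroring the IAM tag condition inside the tool code as well; a sketch of such a guard (the function is illustrative, not from the Lambda above):

```python
def is_protected(tags):
    """Mirror the IAM condition: never auto-stop Environment=Production instances.

    IAM already denies ec2:StopInstances on these instances, but checking in the
    tool lets the agent return a clear explanation instead of an AccessDenied error.
    """
    return tags.get("Environment") == "Production"
```

With this check in place, the agent can respond with the explanation shown in Test Query 2 below rather than surfacing a raw permission error.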

&lt;h3&gt;
  
  
  Step 3: Create the Bedrock Agent
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create the agent&lt;/span&gt;
aws bedrock-agent create-agent &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--agent-name&lt;/span&gt; FinOpsAgent &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--foundation-model&lt;/span&gt; anthropic.claude-3-5-sonnet-20241022-v2:0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--instruction&lt;/span&gt; &lt;span class="s2"&gt;"You are a FinOps specialist. Analyse AWS costs and recommend optimisations. Always request approval before suggesting destructive actions."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--idle-session-ttl&lt;/span&gt; 600

&lt;span class="c"&gt;# Create action group with OpenAPI schema&lt;/span&gt;
aws bedrock-agent create-agent-action-group &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--agent-id&lt;/span&gt; YOUR_AGENT_ID &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--action-group-name&lt;/span&gt; FinOpsTools &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--action-group-executor&lt;/span&gt; &lt;span class="nv"&gt;lambda&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;arn:aws:lambda:ap-southeast-2:ACCOUNT:function:FinOpsAgentTools &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--api-schema&lt;/span&gt; file://openapi-schema.yaml

&lt;span class="c"&gt;# Prepare and create alias&lt;/span&gt;
aws bedrock-agent prepare-agent &lt;span class="nt"&gt;--agent-id&lt;/span&gt; YOUR_AGENT_ID
aws bedrock-agent create-agent-alias &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--agent-id&lt;/span&gt; YOUR_AGENT_ID &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--agent-alias-name&lt;/span&gt; prod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Test the Agent
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Test Query 1: Analyse Resources&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "Find underutilised instances in ap-southeast-2 that are costing us money"

Agent Response:
"I found 3 underutilised instances in ap-southeast-2:

1. i-0abc123 (t3.large) - Avg CPU: 3.2% - Est. savings: $60/mo
2. i-0def456 (m5.xlarge) - Avg CPU: 7.8% - Est. savings: $140/mo
3. i-0ghi789 (t3.medium) - Avg CPU: 5.1% - Est. savings: $30/mo

Total potential monthly savings: $230

Would you like me to request approval to stop these instances?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Test Query 2: Safety Check&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "Stop instance i-prod-db-001"

Agent Response:
"I cannot stop i-prod-db-001 because it has the Environment=Production tag.
Production resources require manual approval through the change management process."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Monitor with X-Ray
&lt;/h3&gt;

&lt;p&gt;Navigate to AWS X-Ray console and filter by:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;annotation.Agent = "FinOpsAgent-v1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see traces showing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which tools the agent invoked&lt;/li&gt;
&lt;li&gt;How long each API call took&lt;/li&gt;
&lt;li&gt;Whether any errors occurred&lt;/li&gt;
&lt;li&gt;The full decision path from user query → response&lt;/li&gt;
&lt;/ul&gt;
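The same console filter can be run programmatically; a sketch using boto3's X-Ray `get_trace_summaries` paginator (the 24-hour window and agent name are illustrative):

```python
from datetime import datetime, timedelta, timezone

def build_agent_filter(agent_name):
    """Build the X-Ray filter expression used in the console above."""
    return f'annotation.Agent = "{agent_name}"'

def recent_agent_traces(agent_name, hours=24):
    """Fetch trace summaries for the agent over the last `hours` hours."""
    import boto3  # imported lazily so the pure helper above has no dependency

    xray = boto3.client("xray")
    end = datetime.now(timezone.utc)
    summaries = []
    for page in xray.get_paginator("get_trace_summaries").paginate(
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        FilterExpression=build_agent_filter(agent_name),
    ):
        summaries.extend(page["TraceSummaries"])
    return summaries
```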




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building Agentic AI on AWS is about more than just Prompt Engineering—it's about &lt;strong&gt;Reliability Engineering&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;By treating the agent's "thoughts" as loggable events (X-Ray traces) and wrapping its "hands" (tools) in strict safety checks (tag validation, human approvals), we can build a FinOps assistant that is not only smart but &lt;strong&gt;trustworthy&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Agents &amp;gt; Scripts&lt;/strong&gt; - Reasoning capabilities allow context-aware decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability is mandatory&lt;/strong&gt; - X-Ray traces give you an auditable record of every AI decision&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety through architecture&lt;/strong&gt; - Use IAM policies and human-in-the-loop for destructive actions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structure matters&lt;/strong&gt; - Well-designed tool outputs enable better agent reasoning&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The era of static Cron jobs is ending. The era of the &lt;strong&gt;Cloud CFO Agent&lt;/strong&gt; has begun.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cost Estimate
&lt;/h2&gt;

&lt;p&gt;Running this setup for development/testing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bedrock Agent (Claude 3.5 Sonnet)&lt;/td&gt;
&lt;td&gt;~$5 (100 queries)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda invocations&lt;/td&gt;
&lt;td&gt;~$0.20 (1000 invocations)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch API calls&lt;/td&gt;
&lt;td&gt;~$0.10 (detailed monitoring)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;X-Ray tracing&lt;/td&gt;
&lt;td&gt;~$2 (10,000 traces)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SNS notifications&lt;/td&gt;
&lt;td&gt;~$0.50 (100 emails)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$8/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For production use with 10,000 queries/month: &lt;strong&gt;~$50-75/month&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Potential savings identified: &lt;strong&gt;Hundreds to thousands per month&lt;/strong&gt; depending on environment size.&lt;/p&gt;




&lt;h2&gt;
  
  
  Next Steps &amp;amp; Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Enhancements to Consider
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multi-region support&lt;/strong&gt; - Extend analysis to all AWS regions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Additional metrics&lt;/strong&gt; - Network I/O, disk usage, memory utilisation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Historical trending&lt;/strong&gt; - Track savings over time in DynamoDB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slack integration&lt;/strong&gt; - Send reports to team channels instead of email&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-remediation&lt;/strong&gt; - After 30 days of approvals, enable auto-stop for specific tags&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/agents.html" rel="noopener noreferrer"&gt;Amazon Bedrock Agents Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/xray/latest/devguide/" rel="noopener noreferrer"&gt;AWS X-Ray Developer Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.finops.org/" rel="noopener noreferrer"&gt;FinOps Foundation Best Practices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/claude" rel="noopener noreferrer"&gt;Anthropic Claude 3.5 Model Card&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Migrating Traditional SFTP Servers to AWS SFTP Transfer Family: A Secure and Serverless Approach</title>
      <dc:creator>Roopa Venkatesh</dc:creator>
      <pubDate>Mon, 16 Dec 2024 07:12:56 +0000</pubDate>
      <link>https://forem.com/roops/migrating-traditional-sftp-servers-to-aws-sftp-transfer-family-a-secure-and-serverless-approach-2jp8</link>
      <guid>https://forem.com/roops/migrating-traditional-sftp-servers-to-aws-sftp-transfer-family-a-secure-and-serverless-approach-2jp8</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Managing traditional SFTP servers on-premises often comes with its share of challenges. Organizations struggle with maintaining the infrastructure, ensuring high availability, scaling storage, and securing user access. These systems require regular patching, upgrades, and constant monitoring to prevent downtime or security breaches. For businesses handling increasing file transfer demands, these limitations can result in operational inefficiencies and spiraling costs.&lt;/p&gt;

&lt;p&gt;Thankfully, AWS SFTP Transfer Family offers a modern solution to these issues. With its serverless and fully managed setup, you can eliminate the overhead of managing hardware while leveraging the scalability and cost-effectiveness of the AWS Cloud. This blog post will guide you through migrating your traditional SFTP server to AWS SFTP Transfer Family, focusing on a secure, scalable, and highly available architecture.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Migrate to AWS SFTP Transfer Family?
&lt;/h2&gt;

&lt;p&gt;AWS SFTP Transfer Family provides a robust, serverless alternative to traditional SFTP servers. Here are some key benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ease of Management: AWS handles the underlying infrastructure, reducing operational burden.&lt;/li&gt;
&lt;li&gt;High Availability: Native multi-AZ support ensures uninterrupted service.&lt;/li&gt;
&lt;li&gt;Scalability: Seamless integration with Amazon S3 allows for virtually unlimited storage capacity.&lt;/li&gt;
&lt;li&gt;Security: Built-in key-based authentication and IAM for granular control over user access.&lt;/li&gt;
&lt;li&gt;Cost-Effectiveness: Pay only for the resources you use, with no upfront investment in hardware.&lt;/li&gt;
&lt;li&gt;Customizable Architecture: AWS lets you tailor the SFTP architecture to specific customer requirements, using network components such as VPCs, security groups, and NLBs to achieve even the most stringent security configurations.&lt;/li&gt;
&lt;li&gt;Seamless Integration for Microservices: Because the Transfer Family stores files directly in Amazon S3, they are immediately available to microservices built on AWS.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Proposed Architecture
&lt;/h2&gt;

&lt;p&gt;For this migration, I propose a secure and serverless architecture with the following components, based on an on-premises-to-AWS migration scenario from one of our Financial Services customers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Internal Setup with a Firewall: Use a firewall (e.g., Fortinet) in front of your SFTP server to secure and inspect incoming traffic.&lt;br&gt;
Route traffic through a Network Load Balancer (NLB) configured for high availability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SFTP Server in a VPC: Deploy the SFTP server in a Virtual Private Cloud (VPC) for network isolation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;High Availability with 2 AZs: Configure the SFTP server to operate across two Availability Zones (AZs) for resilience.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Service-Managed Users: Use service-managed user accounts with key-based authentication for secure access.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;S3 Backend Storage: Store files in Amazon S3 for scalability and durability, and leverage lifecycle policies to optimize storage costs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fine-Grained Access Control with IAM: Use IAM roles to enforce folder-level permissions for secure and organized access.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Implementation Steps
&lt;/h2&gt;

&lt;p&gt;The following are the step-by-step implementation details.&lt;/p&gt;
&lt;h4&gt;
  
  
  Set Up AWS SFTP Transfer Family
&lt;/h4&gt;

&lt;p&gt;a. Navigate to the AWS Transfer Family service in the AWS Management Console.&lt;br&gt;
b. Create a new SFTP server and configure it as VPC-hosted with internal access, spanning two (or three) Availability Zones.&lt;br&gt;
c. Attach a security group that permits traffic only from the firewall.&lt;br&gt;
d. Configure a server host key by adding a previously generated private SSH key; the corresponding host key is presented to users when they connect to the SFTP server.&lt;/p&gt;
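&lt;p&gt;Steps a to d above can be sketched with boto3. The helper below only assembles the arguments (its name and the VPC, subnet, and security group IDs are illustrative placeholders); the final create_server call is left commented out since it requires AWS credentials:&lt;/p&gt;

```python
def build_sftp_server_config(vpc_id, subnet_ids, security_group_ids, host_key_pem=None):
    """Assemble kwargs for transfer.create_server: a VPC-hosted,
    internal SFTP endpoint with service-managed users."""
    config = {
        "Protocols": ["SFTP"],
        "IdentityProviderType": "SERVICE_MANAGED",
        "EndpointType": "VPC",
        "EndpointDetails": {
            "VpcId": vpc_id,
            "SubnetIds": subnet_ids,                 # two (or three) AZs for HA
            "SecurityGroupIds": security_group_ids,  # allow traffic only from the firewall
        },
    }
    if host_key_pem:
        config["HostKey"] = host_key_pem  # pre-generated private SSH host key (step d)
    return config

config = build_sftp_server_config(
    "vpc-0abc", ["subnet-az1", "subnet-az2"], ["sg-firewall-only"],
)
# boto3.client("transfer").create_server(**config)
```

&lt;p&gt;Supplying your own pre-generated host key keeps the server fingerprint stable for clients migrating from an existing SFTP server.&lt;/p&gt;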
&lt;h4&gt;
  
  
  Configure Network Load Balancer (NLB)
&lt;/h4&gt;

&lt;p&gt;a. Deploy an NLB in front of the SFTP server.&lt;br&gt;
b. Configure the NLB to route traffic to the SFTP endpoint on port 22.&lt;br&gt;
c. Set up health checks for continuous monitoring.&lt;/p&gt;
&lt;h4&gt;
  
  
  Integrate the Firewall
&lt;/h4&gt;

&lt;p&gt;a. Use a Fortinet (or similar) firewall to control, inspect, and monitor incoming requests.&lt;br&gt;
b. Allow only specific IP ranges or VPN traffic through the firewall to the NLB.&lt;br&gt;
c. Whitelist SFTP partner/customer IP addresses so that only the required inbound connections are accepted.&lt;/p&gt;
&lt;h4&gt;
  
  
  Set Up Service-Managed Users
&lt;/h4&gt;

&lt;p&gt;a. As a prerequisite, ask each SFTP partner/customer to provide an SSH public key for secure access, or reuse the public keys from your on-premises or current SFTP setup.&lt;br&gt;
b. Define user accounts in AWS Transfer Family and assign each user a unique SSH public key.&lt;br&gt;
c. Map each user to an S3 bucket or folder for isolated file access.&lt;br&gt;
d. Set up a home directory for each user, pointing to the bucket, an individual username folder, or whatever directory structure you want to use. Check the 'Restricted' checkbox so the SFTP user cannot access anything outside this folder; when &lt;em&gt;Restricted&lt;/em&gt; is checked, the user cannot see the S3 bucket or folder name.&lt;/p&gt;
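&lt;p&gt;As a rough boto3 sketch of these steps (usernames, bucket, and ARNs are placeholders): the console's 'Restricted' checkbox corresponds to a logical home directory that maps the user's root to their own folder:&lt;/p&gt;

```python
def build_sftp_user(user_name, bucket, role_arn, ssh_public_key):
    """Kwargs for transfer.create_user: a service-managed user locked
    ('Restricted') into their own folder via a logical home directory."""
    return {
        "UserName": user_name,
        "Role": role_arn,                    # IAM role granting S3 access
        "SshPublicKeyBody": ssh_public_key,  # public key supplied by the partner
        "HomeDirectoryType": "LOGICAL",      # what the console's 'Restricted' box sets
        "HomeDirectoryMappings": [
            # The user sees "/" and never the bucket or folder name.
            {"Entry": "/", "Target": f"/{bucket}/{user_name}"}
        ],
    }

user = build_sftp_user(
    "partner-a", "your-bucket",
    "arn:aws:iam::111122223333:role/sftp-access", "ssh-rsa AAAA...",
)
# boto3.client("transfer").create_user(ServerId="s-1234567890abcdef0", **user)
```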

&lt;p&gt;AWS SFTP Server summary page for reference:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy7f8iu6ol1bd5tobqbu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy7f8iu6ol1bd5tobqbu.png" alt="Image description" width="800" height="604"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Connect S3 as Backend Storage
&lt;/h4&gt;

&lt;p&gt;a. Attach Amazon S3 as the backend storage for the SFTP server.&lt;br&gt;
b. Configure lifecycle policies to transition data to lower-cost storage classes (e.g., S3 Glacier).&lt;br&gt;
c. Optionally, configure S3 replication or a copy/backup job to another S3 bucket so that your applications can process incoming files independently of the SFTP bucket.&lt;/p&gt;
&lt;h4&gt;
  
  
  Configure IAM Roles for Fine-Grained Access Control
&lt;/h4&gt;

&lt;p&gt;a. Define IAM roles to control access to specific S3 folders with read/write/delete permissions.&lt;br&gt;
b. You can use different IAM roles for different users if you want to provide additional access like 's3:DeleteObject' for certain folders within the home directory.&lt;/p&gt;

&lt;p&gt;Example policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::your-bucket",
      "Condition": {
        "StringLike": {
          "s3:prefix": ["user-folder/*"]
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::your-bucket/user-folder/*"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trust policy for the IAM role should be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "transfer.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {}
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
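&lt;p&gt;If many users share the same IAM role, a per-user session policy is an alternative worth knowing about: AWS Transfer Family substitutes the ${transfer:UserName} variable at session time, scoping each user to their own prefix. A minimal sketch (the bucket name is a placeholder):&lt;/p&gt;

```python
import json

def scoped_session_policy(bucket):
    """A session policy that lets one shared IAM role serve all users:
    ${transfer:UserName} scopes each session to the user's own folder."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": ["${transfer:UserName}/*"]}},
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/${{transfer:UserName}}/*",
            },
        ],
    })

policy = scoped_session_policy("your-bucket")
# Pass as Policy=policy in transfer.create_user(...) so each user
# can only reach their own prefix, even though the role is shared.
```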



&lt;h4&gt;
  
  
  Test and Validate
&lt;/h4&gt;

&lt;p&gt;a. Enable CloudWatch Logs for logging and troubleshooting of SFTP connections.&lt;br&gt;
b. Test user connections through the firewall and NLB. Ask your external SFTP users to connect with any SFTP client, such as FileZilla or WinSCP. Note that the AWS SFTP server does not allow SSH shell connections; you must connect over SFTP (for example, with the sftp command).&lt;br&gt;
c. Validate user access permissions to S3 folders.&lt;br&gt;
d. Simulate failover scenarios to confirm high availability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migration Plan
&lt;/h2&gt;

&lt;p&gt;Use this simple migration plan when moving from a traditional/on-premises SFTP server to AWS SFTP Transfer Family.&lt;/p&gt;

&lt;h4&gt;
  
  
  Inform SFTP Users About Changes
&lt;/h4&gt;

&lt;p&gt;a. Notify all existing SFTP users about the migration to the new AWS SFTP setup.&lt;br&gt;
b. Share details on timelines, new connection endpoints, and any required actions from their side.&lt;/p&gt;

&lt;h4&gt;
  
  
  Transition to Key-Based Authentication
&lt;/h4&gt;

&lt;p&gt;a. Convert all users from password-based authentication to SSH key-based authentication; AWS SFTP Transfer Family's service-managed users support only key-based logins, which are also more secure than passwords.&lt;br&gt;
b. Assist users in generating and uploading their SSH public keys.&lt;/p&gt;

&lt;h4&gt;
  
  
  Onboard and Migrate Users
&lt;/h4&gt;

&lt;p&gt;a. Create service-managed user accounts in AWS Transfer Family.&lt;br&gt;
b. Migrate users' home directories and set up their specific access permissions in Amazon S3.&lt;/p&gt;

&lt;h4&gt;
  
  
  Set Up and Validate Access
&lt;/h4&gt;

&lt;p&gt;a. Validate that all users can access their respective directories and files as expected.&lt;br&gt;
b. Conduct testing to ensure smooth operations and troubleshoot any access issues.&lt;/p&gt;

&lt;h4&gt;
  
  
  Go Live
&lt;/h4&gt;

&lt;p&gt;a. Update DNS or endpoint configurations to point to the new endpoint; in this case, a public IP on the Fortinet firewall (set up a DNS record for this IP).&lt;br&gt;
b. Officially transition all users to the new setup and monitor for any post-migration issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Optimization Tips
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Enable S3 lifecycle policies to automatically move infrequently accessed data to lower-cost storage classes.&lt;/li&gt;
&lt;li&gt;Monitor usage with AWS Cost Explorer and set up budgets for cost control.&lt;/li&gt;
&lt;li&gt;Use Savings Plans for additional savings.&lt;/li&gt;
&lt;/ol&gt;
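&lt;p&gt;Tip 1 can be sketched with boto3; the rule below (with illustrative day counts and prefix) transitions aging SFTP files to Glacier and later expires them:&lt;/p&gt;

```python
def build_lifecycle_rule(prefix, glacier_after_days=90, expire_after_days=365):
    """One S3 lifecycle rule: move aging SFTP files to Glacier, then
    expire them (day counts are examples; tune to your retention needs)."""
    return {
        "ID": f"sftp-archive-{prefix.strip('/')}",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [{"Days": glacier_after_days, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": expire_after_days},
    }

rule = build_lifecycle_rule("partner-a/")
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="your-bucket", LifecycleConfiguration={"Rules": [rule]})
```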

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Migrating from traditional SFTP servers to AWS SFTP Transfer Family offers significant advantages in terms of scalability, security, and cost efficiency. By leveraging the architecture and steps outlined in this guide, you can seamlessly transition to a serverless, fully managed solution that simplifies operations and improves reliability. Embrace AWS SFTP Transfer Family and future-proof your file transfer needs today.&lt;/p&gt;

</description>
      <category>awssftptransferfamily</category>
      <category>migratesftptoaws</category>
      <category>finegrainedsftpsetuponaws</category>
      <category>awssftpfortinetfirewall</category>
    </item>
    <item>
      <title>Presenting at DataEngBytes 2024 Sydney: Building a Transactional Data Lakehouse on AWS with Apache Iceberg</title>
      <dc:creator>Roopa Venkatesh</dc:creator>
      <pubDate>Sat, 09 Nov 2024 01:52:54 +0000</pubDate>
      <link>https://forem.com/roops/presenting-at-dataengbytes-2024-sydney-building-a-transactional-data-lakehouse-on-aws-with-apache-iceberg-1f7a</link>
      <guid>https://forem.com/roops/presenting-at-dataengbytes-2024-sydney-building-a-transactional-data-lakehouse-on-aws-with-apache-iceberg-1f7a</guid>
      <description>&lt;p&gt;I had the pleasure of presenting at DataEngBytes 2024 in Sydney, where I discussed an exciting topic that’s transforming the data management landscape: Building a Transactional Data Lakehouse on AWS with Apache Iceberg. &lt;br&gt;
This blog post captures the key content and insights shared during the session for those who couldn’t attend and as a record of my talk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why a Data Lakehouse?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As organisations scale and diversify their data sources, they increasingly seek the flexibility of a data lake combined with the transactional reliability of a data warehouse. The data lakehouse architecture bridges this gap by delivering a unified platform that supports both analytical and transactional workloads, making it ideal for managing structured, semi-structured, and unstructured data at scale.&lt;/p&gt;

&lt;p&gt;During my talk, I explained that a data lakehouse:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensures ACID compliance for data consistency and reliability.&lt;/li&gt;
&lt;li&gt;Supports time travel to query historical data.&lt;/li&gt;
&lt;li&gt;Provides real-time insights by processing batch and streaming data seamlessly.&lt;/li&gt;
&lt;li&gt;Reduces storage costs by leveraging data lakes for large volumes of data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Challenges in Traditional Data Lakes and Warehouses&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I highlighted the challenges organisations often face with traditional data lakes, such as the lack of transaction support, complex schema management, and inconsistent data views. At the same time, data warehouses, though highly consistent, can be expensive and struggle with scalability when handling semi-structured and unstructured data.&lt;/p&gt;

&lt;p&gt;To solve these challenges, I introduced the concept of a data lakehouse built with Apache Iceberg on AWS, combining the benefits of both lakes and warehouses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Apache Iceberg?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apache Iceberg is an open table format that makes it possible to manage large-scale, transactional data in data lake environments. Here’s why it’s ideal for a lakehouse:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ACID Transactions: Iceberg supports ACID compliance, allowing for consistent data updates, deletes, and inserts.&lt;/li&gt;
&lt;li&gt;Schema Evolution: It gracefully handles schema changes, a common requirement in dynamic data environments.&lt;/li&gt;
&lt;li&gt;Partitioning and Performance: Automatic partitioning optimises query performance, making it efficient even for large datasets.&lt;/li&gt;
&lt;li&gt;Time Travel: Iceberg’s time travel functionality enables querying historical data versions, making it invaluable for auditing, troubleshooting, and compliance. It's like Git for code: every change is versioned, and you can revert to any commit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These features make Iceberg a strong foundation for building a transactional lakehouse that balances flexibility and consistency.&lt;/p&gt;
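&lt;p&gt;As a small illustration of time travel, Athena (engine version 3) can query an Iceberg table as of a point in time with FOR TIMESTAMP AS OF. A sketch with boto3, where the database, table, and results bucket are hypothetical:&lt;/p&gt;

```python
def time_travel_query(table, as_of_iso):
    """Build an Athena SQL statement that reads an Iceberg table
    as of a given point in time (Athena engine v3 syntax)."""
    return (
        f"SELECT * FROM {table} "
        f"FOR TIMESTAMP AS OF TIMESTAMP '{as_of_iso}'"
    )

sql = time_travel_query("lakehouse.orders", "2024-11-01 00:00:00 UTC")
# boto3.client("athena").start_query_execution(
#     QueryString=sql,
#     ResultConfiguration={"OutputLocation": "s3://query-results-bucket/"})
```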

&lt;p&gt;&lt;strong&gt;How Iceberg Integrates with AWS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the session's focal points was explaining how Apache Iceberg works within the AWS ecosystem. Here’s a quick recap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Storage in Amazon S3: Iceberg tables are stored in Amazon S3, benefiting from scalable and cost-effective object storage.&lt;/li&gt;
&lt;li&gt;Data Processing with AWS Glue: AWS Glue allows serverless ETL processing of data into Iceberg tables, making it possible to handle batch and real-time updates.&lt;/li&gt;
&lt;li&gt;Querying with Amazon Athena: Athena supports SQL queries on Iceberg tables directly from S3, making it easy to query and analyse data without dedicated infrastructure.&lt;/li&gt;
&lt;li&gt;Governance with AWS Lake Formation: Lake Formation provides fine-grained access control, ensuring data security and governance within the lakehouse.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, these services create a robust lakehouse environment on AWS, leveraging Iceberg for consistency and scalability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Case: Financial Data Lakehouse&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To illustrate how a transactional data lakehouse works in practice, I shared a use case in the financial services industry. Financial institutions need real-time data consistency, compliance, and performance for analytics and regulatory reporting. In this scenario, a data lakehouse with Iceberg allows for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time analytics with consistent, ACID-compliant data.&lt;/li&gt;
&lt;li&gt;Historical data access through time travel for auditing and compliance.&lt;/li&gt;
&lt;li&gt;Cost efficiency by storing data in S3 and using Athena for on-demand queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This use case highlighted the lakehouse’s potential to streamline data management in industries requiring both performance and data governance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In my session, I walked through an architectural diagram illustrating how to build a lakehouse on AWS with Iceberg:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ingestion Layer: Data is ingested from multiple sources into S3 using AWS Glue or Kinesis.&lt;/li&gt;
&lt;li&gt;Storage Layer: Iceberg tables reside in Amazon S3, with metadata management to handle partitions, schema evolution, and versioning.&lt;/li&gt;
&lt;li&gt;Processing Layer: Glue ETL jobs process and transform data, supporting both batch and streaming.&lt;/li&gt;
&lt;li&gt;Query Layer: Athena enables SQL-based querying of Iceberg tables for flexible analytics.&lt;/li&gt;
&lt;li&gt;Governance Layer: AWS Lake Formation secures and governs access to sensitive data within the lakehouse.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture demonstrates a scalable, cost-effective approach to building a transactional lakehouse that supports data consistency and flexibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lessons Learned&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From working with Iceberg on AWS, I shared a few key lessons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Partitioning Strategy: Efficient partitioning is essential for Iceberg to deliver high performance. Planning for your data distribution patterns is crucial.&lt;/li&gt;
&lt;li&gt;Schema Evolution: Although Iceberg handles schema changes well, backward compatibility is vital to avoid breaking data pipelines.&lt;/li&gt;
&lt;li&gt;Cost Management: Data lakehouses on S3 are cost-effective, but monitoring Glue jobs and optimising Athena queries help keep costs in check.&lt;/li&gt;
&lt;li&gt;Data Governance: Fine-grained access control with Lake Formation ensures data security, which is particularly important for multi-user environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best Practices for Building a Data Lakehouse with Iceberg&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To wrap up my talk, I outlined some best practices for those considering building a lakehouse with Iceberg on AWS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data Modelling: Design Iceberg tables with a strong partitioning strategy to optimise performance and query efficiency.&lt;/li&gt;
&lt;li&gt;Governance: Leverage Lake Formation for access control to ensure secure data access.&lt;/li&gt;
&lt;li&gt;Time Travel for Compliance: Use Iceberg’s time travel feature to maintain historical records for regulatory compliance.&lt;/li&gt;
&lt;li&gt;Optimise Glue Jobs: Efficiently schedule Glue jobs to process incremental updates and avoid unnecessary compute costs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In Closing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Presenting at DataEngBytes 2024 Sydney was a fantastic opportunity to share insights into building a transactional data lakehouse on AWS with Apache Iceberg. This architecture offers a powerful approach to managing and analysing data with both the flexibility of a lake and the consistency of a warehouse, unlocking new possibilities for data-driven organisations.&lt;/p&gt;

&lt;p&gt;If you’re exploring a lakehouse approach in your own organisation, I’d highly recommend considering Apache Iceberg and AWS as the foundation. By combining Iceberg’s transactional capabilities with AWS’s scalability, you can build a data lakehouse that adapts to your evolving data needs while ensuring reliability and governance.&lt;/p&gt;

&lt;p&gt;I hope this recap provides a clear overview of the content and insights from my talk. If you have questions or want to learn more about building data lakehouses, feel free to reach out or stay tuned for more blog posts on advanced data architectures!&lt;/p&gt;

</description>
      <category>apacheiceberg</category>
      <category>aws</category>
      <category>etl</category>
      <category>datalakehouse</category>
    </item>
    <item>
      <title>Replacing E or G Drive On-Premises Shared Drives with AWS FSx for Windows</title>
      <dc:creator>Roopa Venkatesh</dc:creator>
      <pubDate>Wed, 03 Jan 2024 10:16:03 +0000</pubDate>
      <link>https://forem.com/roops/replacing-e-or-g-drive-on-premises-shared-drives-with-aws-fsx-for-windows-23pm</link>
      <guid>https://forem.com/roops/replacing-e-or-g-drive-on-premises-shared-drives-with-aws-fsx-for-windows-23pm</guid>
      <description>&lt;h2&gt;
  
  
  Migrate Your File Shares to the AWS Cloud with Ease
&lt;/h2&gt;

&lt;p&gt;On-premises shared drives, like E: and G:, have been the norm for file storage for years. But as businesses move to the cloud, these traditional file systems can become a bottleneck. They're often cumbersome to manage, scale, and secure.&lt;/p&gt;

&lt;p&gt;That's where AWS FSx for Windows comes in. It's a fully managed, highly available, and scalable file system service built on Windows Server. A common business use case arises when you migrate a Windows server hosting a traditional Windows-based application from on-premises to the AWS cloud and encounter shared drives that must be migrated along with the servers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Migrate to AWS FSx for Windows?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scalability and Flexibility:&lt;br&gt;
AWS FSx for Windows provides scalable file storage, allowing you to easily adjust your storage capacity based on your organization's growing needs.&lt;br&gt;
Flexible deployment options enable you to choose the appropriate file system size and performance characteristics for your applications.&lt;/p&gt;

&lt;p&gt;High Performance:&lt;br&gt;
FSx for Windows delivers high-performance file systems that can support a wide range of workloads, including Windows applications and enterprise applications like SQL Server and SAP.&lt;/p&gt;

&lt;p&gt;Integration with Active Directory:&lt;br&gt;
Seamlessly integrate with your existing Active Directory, ensuring a smooth transition for users and maintaining a unified access control system.&lt;/p&gt;

&lt;p&gt;Data Durability:&lt;br&gt;
Benefit from robust data durability and automatic daily backups, reducing the risk of data loss and ensuring business continuity.&lt;/p&gt;

&lt;p&gt;Reduced Management Overhead:&lt;br&gt;
AWS FSx for Windows takes care of routine maintenance tasks, reducing the burden on your IT team and allowing them to focus on more strategic initiatives.&lt;/p&gt;

&lt;p&gt;In this blog post, we'll walk you through the steps of planning and migrating your on-premises shared drives to AWS FSx for Windows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Planning Your Migration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before you start migrating your files, it's important to do some planning. Here are a few things to consider:&lt;/p&gt;

&lt;p&gt;What data will you migrate? Not all data needs to be migrated to the cloud. Consider which files are actively used and which can be archived on-premises.&lt;br&gt;
How will you migrate the data? There are a few different ways to migrate your data to FSx, such as AWS Transfer Family (SFTP), AWS DataSync, or my favorite, Robocopy.&lt;br&gt;
What will you do with your old on-premises shared drives? Once you've migrated your data to FSx, you can decide whether to keep your old shared drives online or decommission them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps to Migrate Your Shared Drives&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before initiating the migration process, conduct a thorough assessment of your current on-premises shared drives. Identify the data volume, access patterns, and any specific dependencies on existing configurations to define the FSx configuration.&lt;/p&gt;

&lt;p&gt;Once you've done your planning, you're ready to start migrating your data. Here are the steps involved:&lt;/p&gt;

&lt;p&gt;Create an FSx file system. Choose the right file system size and performance tier for your needs, and create it in the AWS console or provision it with a CloudFormation template.&lt;br&gt;
Connect your FSx file system to your on-premises network. You can use AWS Direct Connect or a VPN to connect your FSx file system to your on-premises network.&lt;br&gt;
Migrate your data. Choose the migration method that's right for you and start migrating your data to FSx.&lt;br&gt;
Test and verify your migration. Once your data is migrated, test it to make sure everything is working correctly.&lt;br&gt;
Decommission your old on-premises shared drives (optional). Once you're happy with your FSx file system, you can decommission your old on-premises shared drives.&lt;/p&gt;
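&lt;p&gt;The first step, creating the FSx file system, can be sketched with boto3; the helper below assembles the arguments for a Multi-AZ deployment (the subnet and directory IDs, size, and throughput are placeholders):&lt;/p&gt;

```python
def build_fsx_windows_config(subnet_ids, ad_id, storage_gib=1024, throughput_mbps=32):
    """Kwargs for fsx.create_file_system: a Multi-AZ FSx for Windows
    file system joined to a managed Active Directory."""
    return {
        "FileSystemType": "WINDOWS",
        "StorageCapacity": storage_gib,
        "SubnetIds": subnet_ids,
        "WindowsConfiguration": {
            "DeploymentType": "MULTI_AZ_1",        # standby file server in a second AZ
            "ThroughputCapacity": throughput_mbps,
            "ActiveDirectoryId": ad_id,            # lets AD groups and NTFS ACLs resolve
            "PreferredSubnetId": subnet_ids[0],    # where the active file server lives
        },
    }

cfg = build_fsx_windows_config(["subnet-az1", "subnet-az2"], "d-1234567890")
# boto3.client("fsx").create_file_system(**cfg)
```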

&lt;p&gt;File shares are usually shared with AD groups or individual users, so you may need to migrate the shares and permissions along with the data/files. Once FSx is provisioned in AWS and reachable from the on-premises server, you can run Robocopy, which can copy data together with its security settings and permissions. Mount the FSx file system as a new drive on your on-premises Windows server and run this command at the command prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;robocopy e:\source f:\Share\dst /e /mir
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;/e copies empty directories&lt;br&gt;
/mir mirrors a directory tree (it implies /e and /purge)&lt;br&gt;
/copyall copies all file information, including NTFS security (ACLs), ownership, and auditing data&lt;br&gt;
For more robocopy parameters, refer to:&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/robocopy" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--iMpGPUDN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://learn.microsoft.com/en-us/media/open-graph-image.png" height="420" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/robocopy" rel="noopener noreferrer" class="c-link"&gt;
          robocopy | Microsoft Learn
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Reference article for the robocopy command, which copies file data from one location to another.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
        learn.microsoft.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;p&gt;Note that the AD groups used for file permissions must be resolvable by FSx through your domain controllers; this depends on how your AWS environment handles the domain join.&lt;/p&gt;

&lt;p&gt;Ensure that all client applications and users point to the new FSx for Windows file system. Update network mappings, drive letters, and any references to the old on-premises shared drives.&lt;/p&gt;

&lt;p&gt;Here are some additional tips for a successful migration:&lt;/p&gt;

&lt;p&gt;Start with a small subset of data and test it out before migrating everything. Test the parameters to arrive at the final command, then use it to migrate in one pass; you can re-run the robocopy command just before cutover to copy any new delta files/folders.&lt;br&gt;
Use a migration tool like Robocopy. It can automate the process and make it easier to keep track of your progress.&lt;br&gt;
Communicate with your users. Let them know what you're doing and why; this will help minimize disruption during the migration process.&lt;/p&gt;

&lt;p&gt;Migrating your on-premises shared drives to AWS FSx for Windows can be a great way to improve the manageability, scalability, and security of your file storage. By following the tips in this blog post, you can make the migration process smooth and successful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Additional Resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS FSx for Windows: &lt;a href="https://aws.amazon.com/fsx/windows/"&gt;https://aws.amazon.com/fsx/windows/&lt;/a&gt;&lt;br&gt;
AWS Transfer for SFTP: &lt;a href="https://aws.amazon.com/aws-transfer-family/"&gt;https://aws.amazon.com/aws-transfer-family/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hope this blog post has been helpful. If you have any questions, please feel free to leave a comment below.&lt;/p&gt;

</description>
      <category>awsfsx</category>
      <category>migrateshareddrivetoaws</category>
      <category>windowsworkloadmigrationtoaws</category>
      <category>onpremtoaws</category>
    </item>
    <item>
      <title>Decoding Gen AI Platforms on AWS: Navigating the Landscape with AWS SageMaker, Bedrock, and AI on EKS</title>
      <dc:creator>Roopa Venkatesh</dc:creator>
      <pubDate>Sun, 24 Dec 2023 04:46:51 +0000</pubDate>
      <link>https://forem.com/roops/decoding-gen-ai-platforms-on-aws-navigating-the-landscape-with-aws-sagemaker-bedrock-and-ai-on-eks-4nd6</link>
      <guid>https://forem.com/roops/decoding-gen-ai-platforms-on-aws-navigating-the-landscape-with-aws-sagemaker-bedrock-and-ai-on-eks-4nd6</guid>
      <description>&lt;p&gt;Having recently immersed myself in the cutting-edge realm of Gen AI at this year's AWS re:Invent at Las Vegas, I eagerly dived into workshops and sessions dedicated to three standout AI platforms: AWS SageMaker, AWS Bedrock, and AI on EKS. The wealth of insights gained from these experiences prompted me to document and compare the diverse AI options available on AWS. Join me on this journey as we dissect the strengths, nuances, and unique features of each platform, offering a comprehensive guide for navigating the dynamic landscape of Gen AI solutions within the AWS ecosystem.&lt;/p&gt;

&lt;p&gt;Generative Artificial Intelligence (Gen AI) has become a cornerstone of modern technology, enabling businesses to harness the power of Machine Learning and Large Language Models (LLM) for various applications. As the demand for AI solutions continues to grow, several platforms have emerged to facilitate the development and deployment of AI models. In this blog post, we will compare three prominent AI platforms available on AWS Cloud: AWS SageMaker, Bedrock, and AI on EKS (Elastic Kubernetes Service).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SageMaker: The Veteran Warrior&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e9a3svkA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uvokobw4wfiv2qryzmi4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e9a3svkA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uvokobw4wfiv2qryzmi4.png" alt="Sagemaker logo" width="338" height="161"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amazon SageMaker is the granddaddy of AWS AI, a fully managed platform designed for end-to-end machine learning (ML) workflows. From data preparation and model training to deployment and monitoring, SageMaker streamlines the process, making it accessible to both ML experts and novices with more than 150 open-source models. Its pre-built algorithms, notebooks, and tooling take the grunt work out of ML, while robust scalability handles demanding workloads.&lt;/p&gt;

&lt;p&gt;But is SageMaker the one-size-fits-all hero? Not quite. While powerful, it comes with a learning curve: you need to know the service well, and it is best suited to ML engineers who want flexible model development.&lt;/p&gt;

&lt;p&gt;Key features of SageMaker include:&lt;/p&gt;

&lt;p&gt;End-to-End Workflow: SageMaker provides a seamless workflow from data labeling and preparation to model training and deployment.&lt;/p&gt;

&lt;p&gt;Built-in Algorithms: A variety of pre-built algorithms are available, reducing the need for custom development and accelerating model training.&lt;/p&gt;

&lt;p&gt;Scalability: With SageMaker, you can build, train, and deploy ML models at scale using tools like notebooks, debuggers, profilers, pipelines, MLOps, and more – all in one integrated development environment (IDE).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bedrock: The New Kid on the Block&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AdJGc_wc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n3geiaklug17jdp3mhjq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AdJGc_wc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n3geiaklug17jdp3mhjq.png" alt="Bedrock Logo" width="478" height="272"&gt;&lt;/a&gt;&lt;br&gt;
Bedrock takes a different route, focusing on the cutting-edge world of generative AI. This fully managed service provides access to pre-trained "foundation models" (FMs) like language and image generators. No need to build models from scratch - simply fine-tune these FMs to your specific needs. Bedrock promises speed and ease of use.&lt;/p&gt;

&lt;p&gt;It offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon.&lt;/p&gt;

&lt;p&gt;So, is Bedrock the magic solution? Not entirely. Bedrock is young, and its FM options are currently limited compared to SageMaker's.&lt;/p&gt;

&lt;p&gt;Notable features of Bedrock include:&lt;/p&gt;

&lt;p&gt;Ease of use: Using Amazon Bedrock, you can easily experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources.&lt;/p&gt;

&lt;p&gt;AutoML Capabilities: Bedrock includes AutoML functionality, allowing users to automate the model selection and hyperparameter tuning processes.&lt;/p&gt;

&lt;p&gt;Serverless: Amazon Bedrock is serverless, so you don't have to manage any infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications.&lt;/p&gt;
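&lt;p&gt;As a quick illustration of the serverless model, invoking a foundation model is a single API call. The request body schema is model-specific; the sketch below assumes an Anthropic Claude model on Bedrock (the model ID and prompt are illustrative):&lt;/p&gt;

```python
import json

def build_claude_request(prompt, max_tokens=256):
    """Request body for invoking an Anthropic Claude model via Bedrock's
    InvokeModel API (the body schema shown here is Claude's, not universal)."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    })

body = build_claude_request("Summarize our incident report in three bullet points.")
# boto3.client("bedrock-runtime").invoke_model(
#     modelId="anthropic.claude-3-sonnet-20240229-v1:0", body=body)
```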

&lt;p&gt;&lt;strong&gt;AI on EKS (Elastic Kubernetes Service): The Custom Craftsman&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qQSdXIr0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qcz4w26ah16mg3anc5rn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qQSdXIr0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qcz4w26ah16mg3anc5rn.png" alt="AI on EKS" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For those who need ultimate control and flexibility, AI on EKS comes knocking. This approach leverages Amazon Elastic Kubernetes Service (EKS) to deploy and manage containerized AI workloads. You get granular control over infrastructure, tools, and models, allowing for bespoke solutions.&lt;/p&gt;

&lt;p&gt;Running AI workloads on Kubernetes has gained popularity due to its flexibility and scalability. AI on EKS brings Kubernetes to the forefront of AI development.&lt;/p&gt;

&lt;p&gt;But is EKS the DIY path to AI nirvana? Be warned: this path requires significant expertise in both Kubernetes and ML. It's a complex beast, demanding ongoing maintenance and potentially higher operational costs.&lt;/p&gt;

&lt;p&gt;There is a vast ecosystem of tools available to build and run models within the Kubernetes landscape. One emerging stack combines JupyterHub, Argo Workflows, Ray, and Kubernetes; AWS and the community call it the JARK stack, and you can run the entire stack on Amazon EKS alongside your other workloads.&lt;br&gt;
For more details, see &lt;a href="https://aws.amazon.com/blogs/containers/deploy-generative-ai-models-on-amazon-eks/"&gt;https://aws.amazon.com/blogs/containers/deploy-generative-ai-models-on-amazon-eks/&lt;/a&gt;&lt;/p&gt;
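&lt;p&gt;As a flavor of the granular control EKS gives you, here is a sketch of a Deployment manifest for a GPU-backed model server, built in Python. The names, image, and GPU count are hypothetical assumptions; applying it would require a cluster with GPU nodes and the NVIDIA device plugin installed:&lt;/p&gt;

```python
import json

# Sketch of a Kubernetes Deployment for a containerized model server
# on EKS. The image, labels, and GPU count are hypothetical. kubectl
# also accepts JSON manifests, so the printed output could be saved
# and applied with `kubectl apply -f`.
def model_server_manifest(name="llm-server",
                          image="my-registry/llm-server:latest",
                          gpus=1):
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        # Requests one GPU per pod via the NVIDIA device plugin.
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }],
                },
            },
        },
    }

print(json.dumps(model_server_manifest(), indent=2))
```

&lt;p&gt;This is exactly the kind of knob (GPU counts, node placement, autoscaling) that managed services abstract away and EKS hands back to you.&lt;/p&gt;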

&lt;p&gt;&lt;strong&gt;Choosing the Right Tool&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The best choice for you depends on your needs and priorities.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;SageMaker&lt;/u&gt;: Ideal for established ML workflows, businesses prioritizing ease of use and scalability, and those comfortable with AWS tooling. Pricing is pay-as-you-go, based on the compute instances used for training and inference.&lt;br&gt;
&lt;u&gt;Bedrock&lt;/u&gt;: Perfect for experimenting with generative AI, those seeking quick solutions with pre-trained models, and businesses already using SageMaker. Pricing is based on model inference and customization, with a choice of two inference plans: On-Demand and Provisioned Throughput. For many workloads, this works out less expensive than SageMaker.&lt;br&gt;
&lt;u&gt;AI on EKS&lt;/u&gt;: Best for organizations with deep technical expertise, complex AI needs requiring specific customization, and those comfortable managing Kubernetes infrastructure and using EKS as their strategic platform for all of their data and AI needs. Pricing is based on the size and instance types of the underlying cluster infrastructure (Kubernetes-based).&lt;/p&gt;
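&lt;p&gt;A quick back-of-the-envelope sketch can help compare Bedrock's two inference plans. The per-token and hourly rates below are made-up placeholders, not real AWS prices; always check the current pricing pages before deciding:&lt;/p&gt;

```python
# Toy comparison of Bedrock's On-Demand vs Provisioned Throughput
# pricing models. All rates are invented placeholder numbers.
def on_demand_cost(input_tokens, output_tokens,
                   in_rate_per_1k=0.003, out_rate_per_1k=0.015):
    # On-Demand: pay per token processed and generated.
    return (input_tokens / 1000 * in_rate_per_1k
            + output_tokens / 1000 * out_rate_per_1k)

def provisioned_cost(hours, hourly_rate=20.0):
    # Provisioned Throughput: pay per hour of reserved model capacity.
    return hours * hourly_rate

# A month of moderate traffic vs a month of reserved capacity.
monthly_od = on_demand_cost(5_000_000, 1_000_000)
monthly_pt = provisioned_cost(24 * 30)
print(f"on-demand: ${monthly_od:,.2f}  provisioned: ${monthly_pt:,.2f}")
```

&lt;p&gt;Spiky, low-volume traffic tends to favor On-Demand, while sustained high throughput is what justifies Provisioned Throughput.&lt;/p&gt;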

&lt;p&gt;Remember, there's no silver bullet. Each approach has its advantages and drawbacks. Carefully assess your requirements, evaluate your resources, and choose the tool that empowers you to conquer the AI frontier.&lt;/p&gt;

&lt;p&gt;This post is just the beginning of your AI adventure. Stay tuned for further explorations into specific use cases and deep dives into these powerful tools!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Choosing the right AI platform depends on specific project requirements, team expertise, organizational goals, and the existing ecosystem. AWS SageMaker, Bedrock, and AI on EKS each offer unique advantages, catering to different use cases and preferences. As the AI landscape continues to evolve, staying aware of the strengths and limitations of these platforms is crucial for making informed decisions in a rapidly advancing field.&lt;/p&gt;

</description>
      <category>awssagemaker</category>
      <category>awsbedrock</category>
      <category>awseks</category>
      <category>aiplatform</category>
    </item>
  </channel>
</rss>
