<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mark Laszlo</title>
    <description>The latest articles on Forem by Mark Laszlo (@marklaszlo9).</description>
    <link>https://forem.com/marklaszlo9</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F915186%2Ff5af3b7e-dd4a-47c9-8171-b89c94f25702.gif</url>
      <title>Forem: Mark Laszlo</title>
      <link>https://forem.com/marklaszlo9</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/marklaszlo9"/>
    <language>en</language>
    <item>
      <title>Enterprise-Hardening: Memory, Secure Tools, and Observability</title>
      <dc:creator>Mark Laszlo</dc:creator>
      <pubDate>Thu, 16 Oct 2025 07:53:44 +0000</pubDate>
      <link>https://forem.com/aws-builders/enterprise-hardening-memory-secure-tools-and-observability-49je</link>
      <guid>https://forem.com/aws-builders/enterprise-hardening-memory-secure-tools-and-observability-49je</guid>
      <description>&lt;p&gt;In the previous parts, we built a customer support agent with Strands and deployed it to the cloud using Bedrock Agentcore Runtime. We now have a scalable, secure service. However, to be truly enterprise-ready, our agent needs to overcome several final hurdles: it needs a persistent memory, a way to securely connect to real backend systems, and a "black box" flight recorder to understand its behavior. In this final installment, we will use the modular services of Bedrock Agentcore to add these critical capabilities, transforming our application into a robust, stateful, and observable system.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Amnesia to Awareness: Implementing Agentcore Memory
&lt;/h2&gt;

&lt;p&gt;Our agent can currently maintain context within a single session, but once that session ends, everything is forgotten. This is a poor user experience. Amazon Bedrock Agentcore Memory solves this by providing a fully managed, persistent memory store that operates on two levels:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Short-Term Memory:&lt;/strong&gt; Captures the raw turn-by-turn conversation history within a session. This is used for immediate conversational context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-Term Memory:&lt;/strong&gt; Intelligently extracts and stores persistent insights across many sessions. Agentcore provides built-in strategies to automatically identify and save user preferences, semantic facts, and conversation summaries.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The most elegant way to integrate Agentcore Memory with our Strands agent is to use Strands' &lt;strong&gt;hook system&lt;/strong&gt;. Hooks allow us to inject custom logic at specific points in the agent's lifecycle without cluttering the main agent code. We will create two hooks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;on_agent_initiate&lt;/em&gt;: Before the agent processes a new request, this hook will retrieve relevant long-term memories and short-term conversation history for the user and inject them into the agent's context.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;on_message_add&lt;/em&gt;: After each turn of the conversation (both user and agent messages), this hook will save the interaction to Agentcore Memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is a conceptual implementation of how these hooks could use the Boto3 SDK to interact with the Memory service API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Code for memory_hooks.py
import boto3

memory_client = boto3.client("bedrock-agentcore-data")
MEMORY_ID = "your-memory-store-id" # Created via the AWS console or SDK

def on_agent_initiate(agent_context):
    """Hook to load memory before the agent runs."""
    user_id = agent_context.get("user_id")

    # Retrieve long-term facts about the user
    long_term_memories = memory_client.retrieve_memories(
        memoryId=MEMORY_ID,
        namespace=f"/facts/{user_id}",
        query=agent_context.get("current_prompt")
    )

    # Retrieve the last 5 turns of the conversation
    short_term_history = memory_client.list_events(
        memoryId=MEMORY_ID,
        actorId=user_id,
        sessionId=agent_context.get("session_id"),
        maxResults=5
    )

    # Inject this context into the agent's system prompt or message history
    #... logic to format and add context...

def on_message_add(message):
    """Hook to save conversation turns to memory."""
    memory_client.create_event(
        memoryId=MEMORY_ID,
        actorId=message.get("user_id"),
        sessionId=message.get("session_id"),
        messages=[(message.get("text"), message.get("role"))] # e.g., ("Hello", "USER")
    )

# In the agent setup, you would register these hooks:
# support_agent.add_hook("on_agent_initiate", on_agent_initiate)
# support_agent.add_hook("on_message_add", on_message_add)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this pattern, memory management becomes an automatic, background process, making the agent instantly more intelligent and context-aware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond Python Functions: Securely Connecting to APIs with Agentcore Gateway
&lt;/h2&gt;

&lt;p&gt;Our current agent's tools are Python functions deployed within the same container. This is fine for a prototype, but in an enterprise environment, tools are often separate microservices, Lambda functions, or third-party APIs. Tightly coupling them to the agent code creates maintenance bottlenecks and security risks.&lt;/p&gt;

&lt;p&gt;Amazon Bedrock Agentcore Gateway solves this by acting as a managed, centralized tool server for your agents. It can take any existing REST API (defined by an OpenAPI spec) or AWS Lambda function and instantly transform it into a secure, discoverable tool that speaks the Model Context Protocol (MCP).   &lt;/p&gt;

&lt;p&gt;Let's imagine our &lt;em&gt;get_order_status&lt;/em&gt; logic is now a dedicated Lambda function. To expose it through Gateway, we would:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Navigate to the Agentcore Gateway console.&lt;/li&gt;
&lt;li&gt;Create a new Gateway.&lt;/li&gt;
&lt;li&gt;Add a new "target," selecting "Lambda function" as the type.&lt;/li&gt;
&lt;li&gt;Provide the ARN of our order status Lambda function. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Gateway provides a single, stable MCP endpoint. Our Strands agent can now discover and use this tool without any code changes. This decouples the tool's implementation from the agent's logic, allowing different teams to own and update their respective services independently.&lt;/p&gt;
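As a sketch of what tool discovery looks like from the agent's side, the snippet below builds a Gateway MCP endpoint URL and shows, in comments, how a Strands agent could consume it. The URL shape, the `MCPClient` wiring, and the bearer-token handling are assumptions to verify against the current Gateway and Strands documentation:

```python
# Conceptual sketch: consuming a Gateway tool from a Strands agent.
# The endpoint shape below is an assumption; check your Gateway's
# details page in the console for the real MCP URL.

def gateway_mcp_url(gateway_id: str, region: str) -> str:
    """Builds the assumed MCP endpoint URL for an Agentcore Gateway."""
    return (
        f"https://{gateway_id}.gateway.bedrock-agentcore."
        f"{region}.amazonaws.com/mcp"
    )

# With the Strands MCP client (hypothetical wiring, requires an OAuth token):
#
# from mcp.client.streamable_http import streamablehttp_client
# from strands import Agent
# from strands.tools.mcp import MCPClient
#
# url = gateway_mcp_url("my-gateway-id", "us-east-1")
# mcp_client = MCPClient(lambda: streamablehttp_client(
#     url, headers={"Authorization": f"Bearer {access_token}"}))
# with mcp_client:
#     agent = Agent(tools=mcp_client.list_tools_sync())

print(gateway_mcp_url("my-gateway-id", "us-east-1"))
```

The point of the helper is simply that one stable URL per Gateway is all the agent needs; individual tools are discovered at runtime over MCP rather than imported in code.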

&lt;h2&gt;
  
  
  Implementing Zero-Trust: Securing Tools with Agentcore Identity
&lt;/h2&gt;

&lt;p&gt;Connecting to APIs is one thing; connecting securely is another. Agentcore Identity provides a robust framework for managing authentication and authorization for agents and their tools, following zero-trust principles. It handles the complex machinery of credential management and token exchange.   &lt;/p&gt;

&lt;p&gt;We can secure our Gateway using a dual-sided approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Inbound Authorization:&lt;/strong&gt; We can protect the Gateway itself by requiring that any client (our agent) present a valid OAuth 2.0 token. We can configure this in the Gateway settings to use an identity provider like Amazon Cognito. Only agents that have successfully authenticated with Cognito can invoke our tools.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outbound Authentication:&lt;/strong&gt; If our backend Lambda function is also protected (as it should be), the Gateway needs to authenticate itself. We can configure the Gateway target to fetch an API key or OAuth token from Agentcore Identity's secure token vault and include it in the downstream call to the Lambda. This ensures that credentials are never hardcoded and are managed centrally. &lt;/li&gt;
&lt;/ol&gt;
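To make the inbound leg concrete, here is a small sketch of the standard OAuth 2.0 client-credentials exchange against a Cognito hosted domain. The domain, client ID, and scope are placeholders, and the live HTTP call is left commented:

```python
import base64
import urllib.parse

def build_token_request(domain_url: str, client_id: str,
                        client_secret: str, scope: str):
    """Builds a client-credentials token request for a Cognito domain.
    Returns (url, headers, body) ready for an HTTP POST."""
    creds = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    headers = {
        "Authorization": f"Basic {creds}",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": scope}
    )
    return f"{domain_url}/oauth2/token", headers, body

# Live call (hypothetical values; also needs: import json, urllib.request):
# url, headers, body = build_token_request(
#     "https://my-domain.auth.us-east-1.amazoncognito.com",
#     "my-client-id", "my-client-secret", "gateway/invoke")
# req = urllib.request.Request(url, data=body.encode(), headers=headers)
# token = json.loads(urllib.request.urlopen(req).read())["access_token"]
```

The resulting bearer token is what the agent presents to the Gateway's MCP endpoint on every call.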

&lt;p&gt;This architecture ensures that every step of the tool invocation process is authenticated and authorized, providing enterprise-grade security for our agent's actions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Opening the Black Box: Monitoring with Agentcore Observability
&lt;/h2&gt;

&lt;p&gt;The final piece of the production puzzle is knowing what your agent is doing. Agentic systems can be complex, and when something goes wrong, you need to be able to trace the chain of reasoning. Amazon Bedrock Agentcore Observability provides deep, real-time visibility into agent performance and behavior out of the box.   &lt;/p&gt;

&lt;p&gt;When our agent is deployed on Agentcore Runtime, it automatically sends detailed telemetry data. In the Amazon CloudWatch console, we can access pre-built dashboards to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trace the Workflow:&lt;/strong&gt; See a step-by-step visualization of a single user request, from the initial prompt to the final response. This trace shows every thought, every tool the agent considered, the exact parameters it used for the tool it called, and the tool's output. This is invaluable for debugging and auditing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor Performance:&lt;/strong&gt; Track key operational metrics like invocation latency, error rates, and token usage across all your agent sessions. This helps identify performance bottlenecks and manage costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inspect Payloads:&lt;/strong&gt; For each step in the trace, you can drill down to see the exact input and output, helping you understand precisely why the agent made a particular decision.&lt;/li&gt;
&lt;/ul&gt;
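As a back-of-the-envelope illustration of the kind of analysis these metrics enable, the sketch below aggregates a few sample invocation records into latency and token-cost figures. The record shape and the per-token prices are made up for the example:

```python
import statistics

# Hypothetical invocation records, as you might export from CloudWatch.
RECORDS = [
    {"latency_ms": 820, "input_tokens": 310, "output_tokens": 95},
    {"latency_ms": 1140, "input_tokens": 540, "output_tokens": 210},
    {"latency_ms": 760, "input_tokens": 280, "output_tokens": 60},
    {"latency_ms": 2950, "input_tokens": 1200, "output_tokens": 480},
]

def summarize(records, price_per_1k_in=0.003, price_per_1k_out=0.015):
    """Aggregates latency and token usage; prices are placeholder values."""
    latencies = sorted(r["latency_ms"] for r in records)
    tokens_in = sum(r["input_tokens"] for r in records)
    tokens_out = sum(r["output_tokens"] for r in records)
    return {
        "median_latency_ms": statistics.median(latencies),
        "max_latency_ms": latencies[-1],
        "total_tokens": tokens_in + tokens_out,
        "est_cost_usd": round(
            tokens_in / 1000 * price_per_1k_in
            + tokens_out / 1000 * price_per_1k_out, 4),
    }

print(summarize(RECORDS))
```

Even this toy summary surfaces the outlier invocation (nearly 3 seconds, 1,680 tokens) that a trace drill-down would then explain.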

&lt;p&gt;This level of insight is critical for building trust in agentic systems and for iterating on their performance over time.   &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: The Dawn of Production-Ready Agentic AI
&lt;/h2&gt;

&lt;p&gt;Our journey is now complete. We began with a simple idea for a customer support agent and a few lines of Python code. Using the Strands SDK, we rapidly built the agent's core logic on our local machine. Then, with a single command, we deployed it to the secure and scalable Agentcore Runtime. Finally, using the modular services of Bedrock Agentcore, we progressively hardened our application, adding persistent memory, secure API integration via a central gateway, and comprehensive observability.&lt;/p&gt;

&lt;p&gt;This architecture represents a fundamental shift. The modular services of Agentcore create a decoupled, microservices-like pattern for agentic systems. The agent's reasoning (Runtime), its memory (Memory), and its tools (Gateway) are independent components that can be developed, scaled, and secured separately. This separation of concerns is the key to building complex, maintainable, and future-proof AI applications.&lt;/p&gt;

&lt;p&gt;The era of struggling to bridge the gap between AI prototypes and production systems is ending. With powerful, developer-focused frameworks like Strands and a robust, enterprise-grade platform like Amazon Bedrock Agentcore, builders can now move from idea to production in hours, not quarters, and focus on what they do best: creating the next generation of intelligent, world-changing applications.   &lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>aws</category>
      <category>webdev</category>
    </item>
    <item>
      <title>The Engine Room: Deploying to the Cloud with Bedrock Agentcore Runtime</title>
      <dc:creator>Mark Laszlo</dc:creator>
      <pubDate>Thu, 09 Oct 2025 07:05:30 +0000</pubDate>
      <link>https://forem.com/aws-builders/the-engine-room-deploying-to-the-cloud-with-bedrock-agentcore-runtime-2362</link>
      <guid>https://forem.com/aws-builders/the-engine-room-deploying-to-the-cloud-with-bedrock-agentcore-runtime-2362</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aws-builders/building-the-brains-crafting-a-customer-support-agent-with-strands-1hb4"&gt;In Part 2&lt;/a&gt;, we successfully built and tested the "brain" of our customer support agent using the Strands SDK. It can understand user requests, reason about which tool to use, and execute actions—all on our local machine. Now, it's time to move this agent from our terminal to the cloud. This is where Bedrock Agentcore Runtime shines, providing a secure, serverless engine designed specifically for agentic workloads and enabling us to deploy our agent with minimal code changes and a single command.&lt;/p&gt;
&lt;h2&gt;
  
  
  Introduction to Agentcore Runtime: The Secure, Serverless Engine
&lt;/h2&gt;

&lt;p&gt;Amazon Bedrock Agentcore Runtime is the foundational compute layer of the Agentcore suite. It is not a generic container service; it is purpose-built infrastructure that addresses the unique demands of AI agents. Its core value propositions are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Serverless and Scalable:&lt;/strong&gt; You deploy your agent's code, and Runtime handles everything else: provisioning infrastructure, managing capacity, and automatically scaling based on demand. You pay only for the compute time you consume, eliminating the need for idle servers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure by Design:&lt;/strong&gt; Security is paramount, especially when agents handle sensitive customer data. Runtime provides complete session isolation by running each user's conversation in its own dedicated microVM with isolated CPU, memory, and filesystem resources. This robust separation prevents data leakage and cross-session contamination, a critical requirement for multi-tenant applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible and Resilient:&lt;/strong&gt; Runtime is framework-agnostic, supporting agents built with Strands, LangGraph, CrewAI, or custom logic. It also supports long-running, asynchronous tasks for up to 8 hours, making it suitable for complex workflows that might involve multi-step reasoning or batch processing.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Adapting the Strands Agent for Cloud Deployment
&lt;/h2&gt;

&lt;p&gt;One of the most compelling aspects of the Strands and Agentcore combination is how little you need to change your agent code to make it cloud-ready. We don't need to write a web server using FastAPI or Flask, manage API routing, or even build a Dockerfile manually. The bedrock-agentcore SDK provides a simple, declarative way to expose our agent.&lt;/p&gt;

&lt;p&gt;We will create a new file, app.py, which will serve as the entry point for Agentcore Runtime. This file imports our support_agent from main.py and wraps its invocation logic with a few lines of code.&lt;br&gt;
&lt;br&gt;
   &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# app.py
from bedrock_agentcore import BedrockAgentCoreApp
from main import support_agent # Import the agent we already built

# 1. Instantiate the AgentCore App
app = BedrockAgentCoreApp()

# 2. Define the entrypoint for the runtime
@app.entrypoint
def invoke_agent(payload: dict, context) -&amp;gt; dict:
    """
    This function is the main entry point for the Agentcore Runtime.
    It receives the invocation payload and returns the agent's response.
    """
    try:
        user_message = payload.get("prompt")
        if not user_message:
            return {"error": "Prompt not provided."}

        # Call our existing Strands agent
        result = support_agent(user_message)

        return {"response": result.message}

    except Exception as e:
        print(f"Error invoking agent: {e}")
        return {"error": "An internal error occurred."}

# 3. Add a run block for local testing (optional but recommended)
if __name__ == "__main__":
    app.run()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That's it. The &lt;em&gt;@app.entrypoint&lt;/em&gt; decorator is the key. It transforms our &lt;em&gt;invoke_agent&lt;/em&gt; function into a standardized endpoint that the Agentcore Runtime service knows how to call. This simple pattern abstracts away all the underlying web server and networking complexity. We also need a &lt;em&gt;requirements.txt&lt;/em&gt; file so the deployment toolkit knows which packages to install in the container.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create the requirements.txt file
echo "strands-agents" &amp;gt; requirements.txt
echo "strands-agents-tools" &amp;gt;&amp;gt; requirements.txt
echo "boto3" &amp;gt;&amp;gt; requirements.txt
echo "bedrock-agentcore" &amp;gt;&amp;gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
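The &lt;em&gt;@app.entrypoint&lt;/em&gt; decorator shown earlier is essentially a registration decorator. The toy sketch below is not the real bedrock-agentcore internals, but it illustrates the pattern: the decorator records the function so the hosting runtime can look it up and call it for each incoming request.

```python
# Toy illustration of the entrypoint pattern -- NOT the real SDK internals.
class ToyAgentApp:
    def __init__(self):
        self._entrypoint = None

    def entrypoint(self, func):
        """Decorator: register func as the handler and return it unchanged."""
        self._entrypoint = func
        return func

    def handle_request(self, payload: dict) -> dict:
        # The real runtime would deserialize an HTTP request here.
        return self._entrypoint(payload, context=None)

app = ToyAgentApp()

@app.entrypoint
def invoke_agent(payload: dict, context) -> dict:
    return {"response": f"echo: {payload.get('prompt')}"}

print(app.handle_request({"prompt": "hi"}))  # {'response': 'echo: hi'}
```

Because the decorator returns the function unchanged, your handler stays an ordinary, locally testable Python function.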



&lt;h2&gt;
  
  
  One-Command Deployment: The Magic of the Starter Toolkit
&lt;/h2&gt;

&lt;p&gt;With our code prepared, we can now deploy it using the bedrock-agentcore-starter-toolkit, a powerful command-line interface (CLI) that acts as a domain-specific Infrastructure-as-Code (IaC) tool for agentic workloads. It bridges the gap between our Python code and the required cloud infrastructure, automating a series of complex steps into two simple commands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Configure the Deployment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, we run agentcore configure. This command inspects our project and interactively prompts us for any necessary configuration details, which it then saves to a local .bedrock_agentcore.yaml file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Make sure you have the toolkit installed: pip install bedrock-agentcore-starter-toolkit
agentcore configure --entrypoint app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CLI will guide you through the setup, asking for an agent name and confirming the AWS resources it will create.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Launch to the Cloud&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next, we run the agentcore launch command. This single command triggers a fully automated deployment pipeline:&lt;br&gt;
&lt;br&gt;
   &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agentcore launch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Here's what's happening "under the hood" while you wait:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Containerization:&lt;/strong&gt; The toolkit uses AWS CodeBuild to build an ARM64 container image from your code and requirements.txt. This happens in the cloud, so you don't even need Docker installed locally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure Provisioning:&lt;/strong&gt; If this is your first deployment, the toolkit creates the necessary AWS resources, including an Amazon ECR repository to store your container images and an IAM execution role with the permissions your agent needs to run and access services like Amazon Bedrock.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; The container image is pushed to the ECR repository, and a new Agentcore Runtime is provisioned and started using your image.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logging:&lt;/strong&gt; CloudWatch log groups are configured automatically, so you can immediately monitor your agent's logs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This process encapsulates best practices for container-based deployments on AWS, saving developers from the steep learning curve of manually managing these resources.&lt;/p&gt;
&lt;h2&gt;
  
  
  Invoking and Interacting with the Cloud Agent
&lt;/h2&gt;

&lt;p&gt;Once the launch command completes, it will output the ARN (Amazon Resource Name) of your deployed agent. You can now interact with it from anywhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invocation via the CLI&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The starter toolkit provides a simple invoke command for quick testing. This is a great way to verify that the agent is live and responsive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agentcore invoke '{"prompt": "What is the return policy for apparel?"}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should receive a JSON response containing the agent's answer, just like you did locally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Programmatic Invocation via the AWS SDK (Boto3)&lt;/strong&gt;&lt;br&gt;
For real-world applications, you'll invoke the agent programmatically. The following Python snippet shows how to do this using Boto3. The most important parameter here is the session ID. Agentcore Runtime uses this ID to route all requests for a given conversation to the same isolated microVM, thereby maintaining the conversation's state (in-memory variables, temporary files, etc.) for its duration. To carry on a conversation, simply reuse the same session ID for each subsequent call.&lt;br&gt;
&lt;br&gt;
   &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
import json
import uuid

# Configuration
AGENT_ARN = "arn:aws:bedrock-agentcore:us-east-1:123456789012:agent-runtime/YOUR_AGENT_ID" # Replace with your agent's ARN
AWS_REGION = "us-east-1"

# Create a client for the Agentcore data plane
agentcore_client = boto3.client(
    "bedrock-agentcore",
    region_name=AWS_REGION
)

# Generate a unique session ID for a new conversation
# (Runtime requires session IDs of at least 33 characters; a UUID4 string is 36)
session_id = str(uuid.uuid4())
print(f"Starting new session: {session_id}")

def ask_agent(prompt: str, session_id: str):
    """Invokes the agent and returns its response."""
    response = agentcore_client.invoke_agent_runtime(
        agentRuntimeArn=AGENT_ARN,
        runtimeSessionId=session_id,
        payload=json.dumps({"prompt": prompt}).encode('utf-8')
    )

    response_body = json.loads(response['response'].read().decode('utf-8'))
    return response_body.get("response")

# Start the conversation
response1 = ask_agent("Hi, what is the status of my order 67890?", session_id)
print(f"Agent Response 1: {response1}")

# Continue the same conversation
response2 = ask_agent("Great, can you also tell me the return policy for electronics?", session_id)
print(f"Agent Response 2: {response2}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With our agent now running securely and scalably in the cloud, we have successfully bridged the gap from prototype to a production-grade service. However, our agent still suffers from amnesia between sessions and its tools are simple functions bundled with its code. In the final part of our series, we will elevate our solution to a truly enterprise-ready state by integrating Agentcore Memory, Gateway, Identity, and Observability.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>aws</category>
    </item>
    <item>
      <title>Building the Brains: Crafting a Customer Support Agent with Strands</title>
      <dc:creator>Mark Laszlo</dc:creator>
      <pubDate>Tue, 30 Sep 2025 10:12:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/building-the-brains-crafting-a-customer-support-agent-with-strands-1hb4</link>
      <guid>https://forem.com/aws-builders/building-the-brains-crafting-a-customer-support-agent-with-strands-1hb4</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/aws-builders/from-prototype-to-production-a-modern-blueprint-for-ai-agents-with-strands-and-aws-bedrock-d3c"&gt;Part 1&lt;/a&gt;, we defined the challenge of productionizing AI agents and introduced our modern stack: Strands for building the agent's logic and Bedrock Agentcore for running it at scale. Now, we dive into the heart of the development process—crafting the agent's intelligence. This part is a hands-on guide to using the Strands SDK to build a functional customer support agent locally, before we even think about the cloud.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Deep Dive into the Strands SDK
&lt;/h2&gt;

&lt;p&gt;The philosophy behind Strands is to get out of the developer's way and leverage the reasoning power of modern LLMs. It eschews complex, hardcoded workflows in favor of a simple, model-driven approach. At its core, every Strands agent is composed of three components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A Model:&lt;/strong&gt; The LLM that provides the reasoning capabilities. Strands is model-agnostic, supporting models from Amazon Bedrock, Anthropic, OpenAI, and local models via Ollama, among others.   &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A Prompt:&lt;/strong&gt; A system_prompt that defines the agent's persona, its purpose, and its rules of engagement. This is where you shape the agent's behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A Set of Tools:&lt;/strong&gt; Python functions or external services that the agent can call to perform actions or retrieve information from the outside world.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These components work together in a continuous &lt;strong&gt;agentic loop&lt;/strong&gt;. When a user sends a message, the agent's LLM analyzes the request in the context of the conversation history and the available tools. It then decides whether to respond directly, ask a clarifying question, or call one or more tools to gather information. If a tool is called, its output is fed back into the loop, allowing the agent to reason over the new information and decide on the next step, continuing until it arrives at a final answer.   &lt;/p&gt;
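The loop described above can be simulated in a few lines. The sketch below uses a scripted stand-in for the LLM (a real Strands agent drives this cycle with actual model tool-use decisions) to show the reason/act/observe pattern:

```python
# Minimal simulation of the agentic loop with a scripted "model".
# A real agent replaces scripted_model with actual LLM tool-use output.

def get_weather(city: str) -> str:          # a toy tool
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def scripted_model(messages):
    """Stand-in for the LLM: first ask for a tool, then answer."""
    if messages[-1]["role"] == "user":
        return {"type": "tool_call", "tool": "get_weather",
                "args": {"city": "Paris"}}
    return {"type": "final", "text": f"Forecast: {messages[-1]['content']}"}

def agentic_loop(model, tools, user_prompt: str) -> str:
    messages = [{"role": "user", "content": user_prompt}]
    while True:
        action = model(messages)                          # 1. reason
        if action["type"] == "final":                     # 2a. answer directly
            return action["text"]
        result = tools[action["tool"]](**action["args"])  # 2b. act
        messages.append({"role": "tool", "content": result})  # 3. observe

print(agentic_loop(scripted_model, TOOLS, "Weather in Paris?"))
# Forecast: Sunny in Paris
```

Notice that the loop itself contains no task-specific logic: everything interesting lives in the model's decisions and the tools it can call, which is exactly the division of labor Strands relies on.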

&lt;h2&gt;
  
  
  Setting Up the Local Development Environment
&lt;/h2&gt;

&lt;p&gt;Before writing any agent code, let's establish a clean and reproducible development environment. This is a critical first step that prevents common configuration issues down the line.&lt;/p&gt;

&lt;p&gt;First, create a project directory and set up a Python virtual environment to isolate our dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create a new directory for your project
mkdir customer-support-agent
cd customer-support-agent

# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate
# On Windows use: .venv\Scripts\activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, install the necessary Python packages. We need &lt;em&gt;strands-agents&lt;/em&gt; for the core SDK, &lt;em&gt;strands-agents-tools&lt;/em&gt; for useful pre-built tools, and &lt;em&gt;boto3&lt;/em&gt; to interact with AWS services like Amazon Bedrock:&lt;br&gt;
&lt;br&gt;
   &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install strands-agents strands-agents-tools boto3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Finally, ensure your AWS credentials are configured correctly. The best practice for local development is to use a dedicated AWS CLI profile. This isolates permissions and makes it easy to switch between different AWS accounts or roles. If you haven't already, install the AWS CLI and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Configure a specific profile for our agent
aws configure --profile bedrock-agent

# Verify your identity and check which models you have access to
aws sts get-caller-identity --profile bedrock-agent
aws bedrock list-foundation-models --profile bedrock-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pro Tip:&lt;/strong&gt; A common pitfall is misconfigured AWS regions or IAM permissions. Ensure your bedrock-agent profile has permissions for bedrock:InvokeModel and that you have requested access to the desired foundation model (e.g., Anthropic's Claude Sonnet 4) in the Amazon Bedrock console for your default region.&lt;/p&gt;
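As a starting point, a minimal identity policy for that profile might look like the following. This is illustrative only; in practice you should scope the Resource down to the specific model ARNs you use:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream",
        "bedrock:ListFoundationModels"
      ],
      "Resource": "*"
    }
  ]
}
```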

&lt;h2&gt;
  
  
  Architecting Agent Capabilities: Creating Custom Tools
&lt;/h2&gt;

&lt;p&gt;Tools are how we extend our agent's capabilities beyond the knowledge of its LLM. With Strands, creating a custom tool is as simple as writing a Python function and decorating it with @tool. The decorator is a powerful abstraction that transforms a standard function into a capability that the LLM can discover and call. Strands automatically parses the function's name, parameters (including type hints), and its docstring to create a description for the model. This means the docstring is not just for developers; it's a crucial part of the prompt that guides the LLM's tool selection.   &lt;/p&gt;
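To see why the docstring matters so much, here is a toy version of what a @tool-style decorator can extract using only the standard library. The real Strands decorator is more sophisticated, but the idea is the same: the function's name, typed parameters, and docstring become the specification the LLM sees.

```python
import inspect

def describe_tool(func):
    """Toy extraction of a tool spec from a function, as a @tool-style
    decorator might do: name, typed parameters, and docstring."""
    sig = inspect.signature(func)
    params = {
        name: getattr(p.annotation, "__name__", "any")
        for name, p in sig.parameters.items()
    }
    return {
        "name": func.__name__,
        "parameters": params,
        "description": inspect.getdoc(func),
    }

def get_order_status(order_id: str) -> str:
    """Retrieves the current status and details for a given order ID."""
    return "..."

spec = describe_tool(get_order_status)
print(spec["name"], spec["parameters"])  # get_order_status {'order_id': 'str'}
```

Since the description is lifted straight from the docstring, writing a clear, action-oriented docstring is effectively prompt engineering for tool selection.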

&lt;p&gt;Let's create a &lt;em&gt;tools.py&lt;/em&gt; file and define a suite of tools for our customer support agent, inspired by real-world use cases. For this tutorial, these tools will interact with a mock database, but they could easily be adapted to call real APIs.&lt;br&gt;
&lt;br&gt;
   &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# tools.py
from strands import tool
import json

# A mock database for demonstration purposes
MOCK_ORDERS_DB = {
    "12345": {"status": "Shipped", "estimated_delivery": "2025-10-28"},
    "67890": {"status": "Processing", "items": ["Wireless Keyboard", "USB-C Cable"]},
}

@tool
def get_order_status(order_id: str) -&amp;gt; str:
    """
    Retrieves the current status and details for a given order ID.
    Use this tool to answer customer questions about their orders.
    """
    print(f"--- Tool: get_order_status called with order_id={order_id} ---")
    status = MOCK_ORDERS_DB.get(order_id)
    if status:
        return json.dumps(status)
    return json.dumps({"error": "Order not found."})

@tool
def lookup_return_policy(product_category: str) -&amp;gt; str:
    """
    Looks up the return policy for a specific product category.
    Valid categories are 'electronics', 'apparel', and 'home_goods'.
    """
    print(f"--- Tool: lookup_return_policy called with category={product_category} ---")
    policies = {
        "electronics": "Electronics can be returned within 30 days of purchase with original packaging.",
        "apparel": "Apparel can be returned within 60 days, provided it is unworn with tags attached.",
        "home_goods": "Home goods have a 90-day return policy."
    }
    return policies.get(product_category.lower(), "Sorry, I could not find a policy for that category.")

@tool
def initiate_refund(order_id: str, reason: str) -&amp;gt; str:
    """
    Initiates a refund process for a given order ID and reason.
    Only use this tool when a customer explicitly requests a refund.
    Returns a confirmation number for the refund request.
    """
    print(f"--- Tool: initiate_refund called for order_id={order_id} ---")
    if order_id in MOCK_ORDERS_DB:
        confirmation_number = f"REF-{order_id}-{abs(hash(reason)) % 100000:05d}"
        return json.dumps({"status": "Refund initiated", "confirmation_number": confirmation_number})
    return json.dumps({"error": "Cannot initiate refund. Order not found."})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Assembling and Prompting the Agent
&lt;/h2&gt;

&lt;p&gt;With our tools defined, we can now assemble the agent in a main.py file. We will import the Agent class and our custom tools. The most critical step here is crafting the system_prompt. This prompt acts as the agent's constitution, defining its personality, capabilities, and constraints. A well-crafted prompt is specific enough to guide the agent but flexible enough to let it reason effectively.&lt;br&gt;
&lt;br&gt;
   &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# main.py
import os
import logging
from strands import Agent
from strands_tools import calculator # Import a pre-built tool
from tools import get_order_status, lookup_return_policy, initiate_refund

# Set AWS profile to use the one we configured
os.environ["AWS_PROFILE"] = "bedrock-agent"

# Enable debug logging to see the agent's thought process
logging.basicConfig(level=logging.INFO)
logging.getLogger("strands").setLevel(logging.DEBUG)

# Define the agent's persona and instructions
SYSTEM_PROMPT = """
You are a helpful and efficient customer support assistant for an online retailer.
Your goal is to resolve customer issues accurately and quickly.
You have access to the following tools:
- get_order_status: To check the status of a customer's order.
- lookup_return_policy: To provide information about return policies for different product categories.
- initiate_refund: To start the refund process for an order.
- calculator: To perform any necessary calculations.

Follow these rules:
1. Be polite and empathetic in all your responses.
2. Before using a tool that requires an order ID, always confirm the order ID with the customer if it was not provided.
3. Do not make up information. If you cannot answer a question with your available tools, say so.
"""

# Instantiate the agent
support_agent = Agent(
    tools=[
        get_order_status,
        lookup_return_policy,
        initiate_refund,
        calculator
    ],
    system_prompt=SYSTEM_PROMPT
)

def main():
    print("Customer Support Agent is ready. Type 'exit' to quit.")
    while True:
        user_input = input("You: ")
        if user_input.lower() in ["exit", "quit"]:
            break

        # Invoke the agent
        response = support_agent(user_input)
        print(f"Agent: {response}")  # the AgentResult renders as the final text reply

if __name__ == "__main__":
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Local Testing and the "Aha!" Moment
&lt;/h2&gt;

&lt;p&gt;Now, run the agent from your terminal:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -u main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now interact with your agent. Try a few queries to see its reasoning in action:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query 1:&lt;/strong&gt; Hi, what's the status of order 12345?&lt;br&gt;
The agent will directly call the get_order_status tool and return the status. The debug logs will show the LLM's decision to use the tool and the parameters it supplied.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query 2:&lt;/strong&gt; I need to return a laptop.&lt;br&gt;
The agent will recognize that "laptop" falls under the "electronics" category and will call the lookup_return_policy tool to provide the correct information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query 3:&lt;/strong&gt; Can you please refund my order?&lt;br&gt;
The agent, following the rules in its system prompt, will ask for the order ID before attempting to call the initiate_refund tool. This demonstrates its ability to follow instructions and engage in multi-turn conversation.&lt;/li&gt;
&lt;/ul&gt;
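&lt;p&gt;To rerun these checks without retyping them, a small batch driver can reuse the support_agent from main.py. This is only a sketch: the file name batch_driver.py is illustrative, and any callable with the agent's call signature can be driven the same way.&lt;/p&gt;

```python
# batch_driver.py - a sketch; assumes the support_agent defined in main.py above.
QUERIES = [
    "Hi, what's the status of order 12345?",
    "I need to return a laptop.",
    "Can you please refund my order?",
]

def run_batch(agent, queries):
    """Invoke the agent once per query and collect the replies in order."""
    return [str(agent(query)) for query in queries]

# With the real agent: replies = run_batch(support_agent, QUERIES)
```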

&lt;p&gt;This local feedback loop is the "aha!" moment—seeing the agent correctly interpret natural language, select the right tool from its toolkit, and take action. With just a few Python files, we have built the brain of a sophisticated AI assistant. In Part 3, we will take this exact code and deploy it to the cloud with Bedrock Agentcore, transforming our local prototype into a scalable, production-ready service.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aws</category>
    </item>
    <item>
      <title>From Prototype to Production: A Modern Blueprint for AI Agents with Strands and AWS Bedrock Agentcore</title>
      <dc:creator>Mark Laszlo</dc:creator>
      <pubDate>Tue, 23 Sep 2025 11:10:57 +0000</pubDate>
      <link>https://forem.com/aws-builders/from-prototype-to-production-a-modern-blueprint-for-ai-agents-with-strands-and-aws-bedrock-d3c</link>
      <guid>https://forem.com/aws-builders/from-prototype-to-production-a-modern-blueprint-for-ai-agents-with-strands-and-aws-bedrock-d3c</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: The Agentic AI Revolution in Customer Support
&lt;/h2&gt;

&lt;p&gt;In today's digital landscape, customer service remains a critical battleground for brand loyalty. Yet, traditional support models often fall short, characterized by long wait times, fragmented conversations across siloed channels, and limited 24/7 availability. Customers are forced to repeat themselves, and human agents, burdened by routine queries, have less time for complex, high-value interactions.&lt;br&gt;
Enter the era of agentic AI. This is not just about chatbots answering simple questions. It's about sophisticated AI agents that can reason, plan, and autonomously use tools to execute complex, multi-step tasks. Imagine an agent that doesn't just look up an order status but can also analyze the issue, check the return policy, initiate a refund, and update the customer's profile, all within a single, seamless conversation. This is the promise of agentic AI: a move from reactive, scripted responses to proactive, goal-oriented problem-solving, delivering personalized, efficient, and always-on support.&lt;br&gt;
This four-part series provides a blueprint for building and deploying such an intelligent customer support agent. We will navigate the entire lifecycle, from a local prototype to a secure, scalable, and production-ready application on AWS.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Production Valley of Despair for AI Agents
&lt;/h2&gt;

&lt;p&gt;For many developers, the journey of building an AI agent begins with a moment of triumph. A proof-of-concept (PoC), running on a local machine, flawlessly demonstrates the agent's core capabilities. It understands user intent, calls a few Python functions as tools, and provides intelligent responses. The demo is a success.&lt;br&gt;
Then comes the "reality check". The path from this promising PoC to a reliable production application is fraught with challenges, a chasm many projects fail to cross—the "Production Valley of Despair." The core questions that emerge are daunting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Statelessness and Session Management:&lt;/strong&gt; How do you manage conversations for thousands of concurrent users without their contexts interfering? The agent that works for one user locally becomes an amnesiac in a stateless cloud environment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability and Performance:&lt;/strong&gt; How do you host the agent's endpoint? How do you ensure low latency and automatically scale to handle unpredictable traffic spikes?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent Memory:&lt;/strong&gt; How does the agent remember a customer's preferences or the context from a conversation last week? Building and managing a reliable memory system often requires integrating and maintaining complex components like vector databases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure Tool Integration:&lt;/strong&gt; How do you move from calling local Python functions to securely interacting with production APIs and databases? This involves managing credentials, handling authentication, and ensuring tools are reliable under load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability and Auditing:&lt;/strong&gt; When the agent behaves unexpectedly, how do you trace its reasoning process? Without deep visibility into the agent's "thoughts" and tool calls, debugging becomes nearly impossible, and auditing for compliance is a non-starter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tackling this "undifferentiated heavy lifting" of building enterprise-grade infrastructure can take months, diverting focus from what truly matters: the agent's intelligence and the user experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Modern Stack: Strands for Agility, Agentcore for Durability
&lt;/h2&gt;

&lt;p&gt;To navigate the Production Valley of Despair, developers need a modern stack that separates the logic of the agent from the infrastructure that runs it. This series introduces a powerful combination that achieves precisely this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Strands Agents for Building:&lt;/strong&gt; Strands is an open-source, developer-first Python SDK for building the agent's logic. It champions a model-driven approach, where instead of hardcoding complex workflows, you provide a large language model (LLM) with a prompt and a set of tools. The agent then uses its own reasoning capabilities to plan and execute tasks. Its simplicity and flexibility make it ideal for rapidly developing and iterating on the agent's core intelligence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Bedrock Agentcore for Running:&lt;/strong&gt; Bedrock Agentcore is a suite of fully managed, enterprise-grade services for running any AI agent in production. It is framework-agnostic, meaning it works seamlessly with agents built using Strands, LangChain, or any other framework. Its modular services—Runtime, Memory, Gateway, Identity, and Observability—are purpose-built to solve the exact production challenges outlined above, handling the heavy lifting of security, scalability, and operations.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This complementary relationship is key: "Strands gives you the tools to build the agent, Agentcore gives you the infrastructure to run it at scale".&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution Architecture Overview
&lt;/h2&gt;

&lt;p&gt;Our end-to-end solution will follow a robust, decoupled architecture. A user interacts with a client application, which sends requests to our Strands-based customer support agent. This agent is not running on a manually configured server but is deployed on the Agentcore Runtime, a secure and scalable serverless compute environment.&lt;br&gt;
To maintain conversational context and recall user history, the agent interacts with &lt;strong&gt;Agentcore Memory&lt;/strong&gt;. To perform actions like checking an order status or processing a refund, it securely connects to backend services (e.g., an internal orders API implemented as an AWS Lambda function) via the &lt;strong&gt;Agentcore Gateway&lt;/strong&gt;. This architecture, modeled after production-grade systems, ensures each component is scalable, secure, and independently maintainable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Build vs. Buy Decision for Agentic Infrastructure
&lt;/h2&gt;

&lt;p&gt;The decision to use a managed platform like Bedrock Agentcore is a strategic one, accelerating time-to-market by abstracting away months of complex infrastructure work. By offloading the operational burden, development teams can focus their resources on crafting superior agent logic and user experiences, rather than becoming full-time infrastructure engineers. The following table starkly contrasts the DIY approach with the managed Agentcore solution, making the value proposition clear.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;DIY Approach (The Hard Way)&lt;/th&gt;
&lt;th&gt;Bedrock Agentcore (The Smart Way)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Execution Environment&lt;/td&gt;
&lt;td&gt;Provision and manage EC2/Fargate, configure load balancers, and handle complex scaling policies.&lt;/td&gt;
&lt;td&gt;Agentcore Runtime: Fully managed, serverless compute with intelligent, workload-aware auto-scaling.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Session Management&lt;/td&gt;
&lt;td&gt;Build a custom solution with Redis/DynamoDB for session state, handling timeouts and data isolation manually.&lt;/td&gt;
&lt;td&gt;Agentcore Runtime: Built-in, cryptographically secure session isolation in dedicated microVMs for each user.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistent Memory&lt;/td&gt;
&lt;td&gt;Set up and manage a vector database (e.g., OpenSearch), and build custom logic for conversation history and semantic retrieval.&lt;/td&gt;
&lt;td&gt;Agentcore Memory: Managed short-term and long-term memory with built-in strategies for summaries, facts, and preferences.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool Integration&lt;/td&gt;
&lt;td&gt;Write boilerplate code for every API, manage credentials in code or AWS Secrets Manager, and build custom authentication logic.&lt;/td&gt;
&lt;td&gt;Agentcore Gateway &amp;amp; Identity: Transform APIs into secure tools with minimal code, and manage OAuth/API key auth flows centrally.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Instrument code manually with OpenTelemetry, and build custom CloudWatch dashboards for traces, logs, and metrics.&lt;/td&gt;
&lt;td&gt;Agentcore Observability: Automatic, agent-specific tracing of reasoning steps and tool calls, with pre-built dashboards.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In the next part of this series, we will roll up our sleeves and begin building the "brains" of our operation: a capable customer support agent using the Strands SDK.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aws</category>
      <category>genai</category>
    </item>
    <item>
      <title>Patching Scheduled Auto Scaling Groups with AWS</title>
      <dc:creator>Mark Laszlo</dc:creator>
      <pubDate>Mon, 13 Jan 2025 14:55:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/patching-scheduled-auto-scaling-groups-with-aws-d7c</link>
      <guid>https://forem.com/aws-builders/patching-scheduled-auto-scaling-groups-with-aws-d7c</guid>
      <description>&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Maintaining up-to-date patches on Amazon EC2 instances is critical for security and compliance. However, patching auto-scaling groups (ASGs) can be challenging, especially when dealing with scheduled ASGs that are scaled down during maintenance windows. Traditional patching jobs rely on running instances, creating a gap when instances are unavailable.&lt;/p&gt;

&lt;p&gt;In this post, we address this issue by exploring how to automate the patching process for scheduled ASGs. We’ll leverage AWS Systems Manager (SSM) Maintenance Windows, CloudFormation, and EventBridge to create a solution that ensures patches are applied even when no instances are running at the time of the maintenance job.&lt;/p&gt;

&lt;h2&gt;Problem Statement&lt;/h2&gt;

&lt;p&gt;Organizations often use scheduled ASGs to optimize costs by scaling down during non-peak hours or maintenance windows. However, this introduces several challenges when it comes to patching:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
&lt;strong&gt;No Running Instances:&lt;/strong&gt; Since the ASG scales down to zero, there are no instances to trigger the patching process.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Delayed Compliance:&lt;/strong&gt; Patching jobs remain pending until instances are scaled up manually, leading to security and compliance gaps.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Increased Manual Intervention:&lt;/strong&gt; Administrators may need to manually scale up instances to execute patches, adding operational overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without a tailored solution, these gaps can leave critical systems exposed to vulnerabilities.&lt;/p&gt;

&lt;h2&gt;Reference to Existing Solutions&lt;/h2&gt;

&lt;p&gt;In a previous blog, &lt;a href="https://dev.to/aws-builders/patching-your-auto-scaling-group-on-aws-42b0"&gt;Patching Your Auto Scaling Group on AWS&lt;/a&gt;, I discussed how to patch standard ASGs effectively. That solution focused on ensuring that patching was streamlined for running instances in dynamically scaled environments. However, scheduled ASGs introduce unique challenges due to their scaled-down state during maintenance windows.&lt;/p&gt;

&lt;p&gt;This blog builds on that foundation, offering a targeted solution to patch scheduled ASGs by automating scaling, patch application, and scaling back down—ensuring seamless compliance without manual intervention.&lt;/p&gt;

&lt;h2&gt;AWS Automation for Scheduled Auto Scaling Groups&lt;/h2&gt;

&lt;p&gt;Patching scheduled auto-scaling groups requires an automation strategy that accounts for their scaled-down state during maintenance windows. AWS provides several tools that make this possible:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
&lt;strong&gt;AWS Systems Manager (SSM) Maintenance Windows:&lt;/strong&gt; Automates patching during predefined schedules.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Amazon EventBridge:&lt;/strong&gt; Coordinates events and triggers necessary actions to manage scaling and patching processes.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;IAM Roles and Policies:&lt;/strong&gt; Grants permissions required for automation tasks like scaling, patching, and updating ASG configurations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By combining these services, you can automate scaling up instances for patching, applying patches, and scaling back down—all without manual intervention.&lt;/p&gt;

&lt;h2&gt;Implementation Strategy&lt;/h2&gt;

&lt;h3&gt;Step 1: Identify Scheduled Auto Scaling Groups&lt;/h3&gt;

&lt;p&gt;Tagging plays a critical role in identifying ASGs that require patching. Use tags like &lt;code&gt;ep:asg:patch=true&lt;/code&gt; to specify the groups to be included in the automation process.&lt;/p&gt;
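&lt;p&gt;Given the AutoScalingGroups list that boto3's describe_auto_scaling_groups returns, the opt-in filter can be sketched as a small helper (the function name is illustrative):&lt;/p&gt;

```python
def asgs_to_patch(asgs, tag_key="ep:asg:patch", tag_value="true"):
    """Return the names of ASGs that carry the patch opt-in tag.

    `asgs` is the AutoScalingGroups list from describe_auto_scaling_groups.
    """
    selected = []
    for asg in asgs:
        tags = {t["Key"]: t["Value"] for t in asg.get("Tags", [])}
        if tags.get(tag_key) == tag_value:
            selected.append(asg["AutoScalingGroupName"])
    return selected
```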

&lt;h3&gt;Step 2: Schedule and Automate the Patching Process&lt;/h3&gt;

&lt;p&gt;Leverage SSM Maintenance Windows to define patching schedules using cron expressions. For instance, you can create a window that runs at 4:00 AM on the second Wednesday of each month:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
Schedule: cron(0 4 ? * WED#2 *)
&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;Step 3: Scale Up Instances During Maintenance&lt;/h3&gt;

&lt;p&gt;The automation logic temporarily scales up the ASG to ensure there are running instances for patching. This scaling is coordinated through EventBridge and Lambda functions.&lt;/p&gt;

&lt;h3&gt;Step 4: Apply Patches and Scale Down&lt;/h3&gt;

&lt;p&gt;Once patches are applied, the automation script scales the ASG back to its original size, maintaining cost-efficiency while ensuring compliance.&lt;/p&gt;

&lt;h2&gt;Code Walkthrough&lt;/h2&gt;

&lt;p&gt;The provided CloudFormation (CFN) template is designed to automate this entire process. Below are some key snippets to demonstrate how the solution works:&lt;/p&gt;

&lt;h3&gt;Tagging ASGs for Patching&lt;/h3&gt;

&lt;p&gt;The template uses tags to identify ASGs that require patching. The following parameters define the tag key and value:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
Parameters:
  AsgTagKey:
    Type: String
    Default: ep:asg:patch
  AsgTagValue:
    Type: String
    Default: "true"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This ensures that only tagged ASGs are included in the patching process.&lt;/p&gt;

&lt;h3&gt;Scheduling Maintenance Windows&lt;/h3&gt;

&lt;p&gt;The template creates SSM Maintenance Windows based on environment and month-specific schedules:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
Resources:
  MaintenanceWindow:
    Type: 'AWS::SSM::MaintenanceWindow'
    Properties:
      AllowUnassociatedTargets: false
      Cutoff: 0
      Duration: 1
      Name: !Sub "Maintenance_Window-${AsgTagValue}"
      Schedule: cron(0 4 ? * WED#2 *)
      Description: !Sub "Maintenance window for patching ${AsgTagValue} ASGs"
&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;Scaling Logic&lt;/h3&gt;

&lt;p&gt;The automation script checks the ASG's desired capacity and scales up instances if the ASG is scaled down:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
def scaleUpASG(asg_client, asg_name):
    asg_client.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=1,
        DesiredCapacity=1
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This function ensures there are running instances available for patching.&lt;/p&gt;
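&lt;p&gt;Because the final step restores the group to its original size, the automation has to record that size before touching it. A minimal sketch, assuming the same describe_auto_scaling_groups response shape (the helper name is illustrative):&lt;/p&gt;

```python
def prepare_for_patching(asg):
    """Record the ASG's pre-patch sizing and decide whether a scale-up is needed.

    `asg` is one entry from describe_auto_scaling_groups; the saved values feed
    the scale-down step's original_min / original_desired arguments later.
    """
    original = {
        "MinSize": asg["MinSize"],
        "DesiredCapacity": asg["DesiredCapacity"],
    }
    needs_scale_up = asg["DesiredCapacity"] == 0
    return original, needs_scale_up
```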

&lt;h3&gt;Patching and Creating a New AMI&lt;/h3&gt;

&lt;p&gt;The automation script applies patches to instances and creates a new AMI for the ASG:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
def createAMI(ec2_client, instance_id, new_ami_name):
    ec2_client.create_image(
        InstanceId=instance_id,
        Name=new_ami_name,
        Description="Patched AMI created for ASG",
        NoReboot=False  # reboot before imaging so applied patches are fully captured
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This ensures that patched AMIs are used for subsequent instance launches, maintaining compliance.&lt;/p&gt;
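&lt;p&gt;Before the ASG can be pointed at the patched AMI, a new launch template version has to exist. A sketch of the parameters one might pass to ec2_client.create_launch_template_version, copying the latest version and overriding only the AMI (the helper name and description string are illustrative):&lt;/p&gt;

```python
def new_lt_version_params(launch_template_id, patched_ami_id):
    """Build kwargs for ec2_client.create_launch_template_version: inherit the
    latest version's settings and swap in the patched AMI."""
    return {
        "LaunchTemplateId": launch_template_id,
        "SourceVersion": "$Latest",  # copy everything else from the current version
        "LaunchTemplateData": {"ImageId": patched_ami_id},
        "VersionDescription": "Patched AMI from maintenance window",
    }
```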

&lt;h3&gt;Updating the Auto Scaling Group&lt;/h3&gt;

&lt;p&gt;Once the patched AMI is created, the ASG is updated to use the new AMI:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
def updateASG(asg_client, asg_name, launch_template_id, new_version):
    asg_client.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        LaunchTemplate={
            'LaunchTemplateId': launch_template_id,
            'Version': str(new_version)
        }
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The new launch template version ensures that all future instances in the ASG are launched with the patched AMI.&lt;/p&gt;

&lt;h3&gt;Scaling Down After Patching&lt;/h3&gt;

&lt;p&gt;Finally, the ASG is scaled back to its original size:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
def scaleDownASG(asg_client, asg_name, original_min, original_desired):
    asg_client.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=original_min,
        DesiredCapacity=original_desired
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This step restores the ASG to its cost-efficient state while ensuring patches have been applied.&lt;/p&gt;

&lt;h2&gt;Best Practices&lt;/h2&gt;

&lt;h3&gt;1. Use Consistent Tagging&lt;/h3&gt;

&lt;p&gt;Ensure that all ASGs requiring patching are tagged consistently. This simplifies the automation process and minimizes the risk of missing critical groups.&lt;/p&gt;

&lt;h3&gt;2. Test Automation in Non-Production Environments&lt;/h3&gt;

&lt;p&gt;Before deploying automation scripts in production, test them in non-production environments to validate cron schedules, scaling logic, and patch application processes.&lt;/p&gt;

&lt;h3&gt;3. Monitor Maintenance Windows&lt;/h3&gt;

&lt;p&gt;Integrate monitoring tools like Amazon SNS to receive notifications about the status of maintenance windows. This allows administrators to track successes, failures, and potential issues.&lt;/p&gt;

&lt;h3&gt;4. Audit and Review Launch Templates&lt;/h3&gt;

&lt;p&gt;Regularly review and update ASG launch templates to ensure they reference the latest AMIs with applied patches.&lt;/p&gt;

&lt;h3&gt;5. Plan for Compliance&lt;/h3&gt;

&lt;p&gt;Align patching strategies with organizational and regulatory compliance requirements to avoid penalties and enhance security.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Patching scheduled auto-scaling groups can be a complex task due to their scaled-down state during maintenance windows. By leveraging AWS Systems Manager, CloudFormation, and EventBridge, this blog demonstrates how to automate the entire process—from scaling up instances to applying patches and scaling back down.&lt;/p&gt;

&lt;p&gt;This solution addresses security and compliance gaps without increasing operational overhead, ensuring that your ASGs remain secure and cost-efficient. If you’ve faced similar challenges, consider implementing this automation strategy to streamline your patching workflows.&lt;/p&gt;

&lt;p&gt;Feel free to share your thoughts or questions in the comments below!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>security</category>
      <category>community</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Amazon GuardDuty S3 Malware Protection at Scale in Multi-Account Environments</title>
      <dc:creator>Mark Laszlo</dc:creator>
      <pubDate>Tue, 07 Jan 2025 15:04:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/applying-amazon-guardduty-s3-malware-protection-at-scale-in-multi-account-environments-17f3</link>
      <guid>https://forem.com/aws-builders/applying-amazon-guardduty-s3-malware-protection-at-scale-in-multi-account-environments-17f3</guid>
      <description>&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Amazon GuardDuty S3 Malware Protection is a critical service for organizations aiming to safeguard their data against malicious threats. It provides automated scanning of objects stored in S3 buckets, ensuring that malware threats are identified and mitigated promptly. While it is an effective tool, applying this protection at scale, especially in multi-account and multi-region environments, poses significant challenges.&lt;/p&gt;

&lt;p&gt;In this blog, we’ll explore how to implement S3 Malware Protection for large-scale environments, addressing the complexities of protecting multiple buckets across accounts. We’ll also discuss AWS's recommendations, cost considerations, and an automation strategy using a CloudFormation (CFN) template.&lt;/p&gt;

&lt;p&gt;This post builds upon my previous blog, &lt;a href="https://dev.to/aws-builders/amazon-guardduty-malware-protection-for-amazon-s3-2oe1"&gt;Amazon GuardDuty Malware Protection for Amazon S3&lt;/a&gt;, and focuses on scaling this solution efficiently.&lt;/p&gt;

&lt;h2&gt;Problem Statement&lt;/h2&gt;

&lt;p&gt;Large organizations often operate in environments where hundreds or even thousands of S3 buckets are distributed across multiple accounts and regions. Applying S3 Malware Protection to each bucket manually is time-intensive, prone to errors, and financially impractical. Additionally, AWS’s current design for the service adds certain constraints:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
&lt;strong&gt;Regional and Account-Specific Scope:&lt;/strong&gt; Malware protection must be configured independently for each region and account.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;No Object-Level or Organizational Policies:&lt;/strong&gt; AWS does not currently support selective scanning of specific objects or organization-wide configurations for malware protection.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;High Costs for Comprehensive Scanning:&lt;/strong&gt; Scanning low-risk buckets (e.g., those storing system logs) can lead to unnecessary expenses without significant security benefits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Organizations must identify a scalable, cost-effective solution that adheres to AWS’s recommendations while ensuring critical assets are protected.&lt;/p&gt;

&lt;h2&gt;Why a Scalable Approach Is Necessary&lt;/h2&gt;

&lt;p&gt;Implementing S3 Malware Protection at scale involves navigating key challenges:&lt;/p&gt;

&lt;h3&gt;Cost vs. Security Trade-offs&lt;/h3&gt;

&lt;p&gt;Scanning every object in every bucket across an organization is financially prohibitive. By prioritizing high-risk buckets, organizations can balance robust security with cost efficiency.&lt;/p&gt;

&lt;h3&gt;Operational Complexity&lt;/h3&gt;

&lt;p&gt;Manually enabling and managing malware protection across accounts, regions, and buckets is not feasible for enterprises with dynamic environments.&lt;/p&gt;

&lt;h3&gt;Lack of Organization-Wide Policies&lt;/h3&gt;

&lt;p&gt;With AWS’s focus on regional and account-specific operations, a centralized solution must be built to ensure consistent deployment and management of malware protection.&lt;/p&gt;

&lt;p&gt;Given these challenges, automation becomes essential for enabling a scalable and sustainable S3 Malware Protection strategy.&lt;/p&gt;

&lt;h2&gt;AWS Recommendations: A Risk-Based Approach&lt;/h2&gt;

&lt;p&gt;AWS recommends a targeted, risk-based approach when implementing S3 Malware Protection. This strategy focuses on securing the most vulnerable and high-impact buckets while minimizing costs and operational overhead. Key elements of AWS’s recommendations include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
&lt;strong&gt;Targeted Application:&lt;/strong&gt; Prioritize buckets that receive uploads from external or untrusted sources, store sensitive data, or play a critical role in security-sensitive workflows.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Cost Efficiency:&lt;/strong&gt; Avoid applying malware protection to low-risk buckets, such as those storing system logs, where malware risks are minimal.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Compliance and Governance:&lt;/strong&gt; Align scanning strategies with regulatory requirements using a focused approach to ensure critical assets are protected.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By adopting a risk-based approach, organizations can effectively balance security needs and cost considerations.&lt;/p&gt;

&lt;h2&gt;Implementation Strategy&lt;/h2&gt;

&lt;p&gt;To implement a scalable and cost-efficient S3 Malware Protection strategy, follow these steps:&lt;/p&gt;

&lt;h3&gt;Step 1: Identify and Categorize Buckets&lt;/h3&gt;

&lt;p&gt;Start by auditing your S3 environment to classify buckets based on their risk profiles:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
&lt;strong&gt;High-risk:&lt;/strong&gt; Buckets with external uploads or sensitive data.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Medium-risk:&lt;/strong&gt; Buckets used for internal file sharing with potential external access.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Low-risk:&lt;/strong&gt; Buckets storing system logs or analytics data.&lt;/li&gt;
&lt;/ul&gt;
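&lt;p&gt;If your buckets carry classification tags, this triage can be expressed as a small helper. The tag keys and values below are illustrative assumptions for your own tagging scheme, not an AWS convention:&lt;/p&gt;

```python
def classify_bucket(tags):
    """Map a bucket's tags to one of the risk tiers above (tag names assumed)."""
    exposure = tags.get("exposure", "internal")
    data_class = tags.get("data-classification", "general")
    if exposure == "external" or data_class == "sensitive":
        return "high-risk"
    if exposure == "shared":
        return "medium-risk"
    return "low-risk"
```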

&lt;h3&gt;Step 2: Prioritize High-Risk Buckets&lt;/h3&gt;

&lt;p&gt;Enable malware protection for high-risk buckets first. Extend to medium-risk buckets as needed, focusing on scenarios where the impact of a malware infection would be significant.&lt;/p&gt;

&lt;h3&gt;Step 3: Automate Deployment&lt;/h3&gt;

&lt;p&gt;Use automation tools like AWS CloudFormation to apply malware protection policies consistently across multiple accounts and regions.&lt;/p&gt;

&lt;h3&gt;Step 4: Continuous Review&lt;/h3&gt;

&lt;p&gt;Regularly review and update your scanning strategy to adapt to new risks, evolving compliance requirements, and changes in your AWS environment.&lt;/p&gt;

&lt;h2&gt;Code Walkthrough&lt;/h2&gt;

&lt;p&gt;To address the challenges of manual configuration, I developed a CloudFormation (CFN) template that automates the deployment of S3 Malware Protection at scale. Below, I’ll highlight key aspects of the code used in this solution.&lt;/p&gt;

&lt;h3&gt;Role and Policy Configuration&lt;/h3&gt;

&lt;p&gt;The template creates an IAM role and policy that enable GuardDuty to access and scan relevant S3 buckets:&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;
Resources:
  GuardDutyS3ProtectionRole:
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: 'Allow'
            Principal:
              Service: 'malware-protection-plan.guardduty.amazonaws.com'
            Action: 'sts:AssumeRole'
&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;This role grants GuardDuty the necessary permissions to scan S3 buckets while adhering to the principle of least privilege.&lt;/p&gt;

&lt;h3&gt;Excluding Low-Risk Buckets&lt;/h3&gt;

&lt;p&gt;The template ensures cost efficiency by excluding low-risk buckets based on predefined prefixes:&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;
# literal prefixes only: str.startswith() does not expand '*' wildcards
EXCLUDE_PREFIXES = [
    'aws-controltower-logs',
    'aws-cloudtrail-logs',
    'inventory-',
    'backup-'
]

buckets_to_include = [
    bucket for bucket in all_buckets
    if not any(bucket.startswith(prefix) for prefix in EXCLUDE_PREFIXES)
]
&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;This Python-based logic filters out buckets that don’t require malware protection, ensuring a targeted approach.&lt;/p&gt;

&lt;h3&gt;Enabling GuardDuty S3 Malware Protection&lt;/h3&gt;

&lt;p&gt;The core logic iterates over relevant buckets and applies GuardDuty protection:&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;
def enable_guardduty_s3_protection(guardduty_client, bucket_name):
    guardduty_client.create_malware_protection_plan(
        ClientToken=str(uuid.uuid4()),
        Role=os.environ['ROLE_ARN'],
        ProtectedResource={
            'S3Bucket': {
                'BucketName': bucket_name
            }
        },
        Actions={
            'Tagging': {
                'Status': 'DISABLED'
            }
        }
    )
&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;By invoking this function, the template ensures consistent protection for all high-risk buckets across specified regions.&lt;/p&gt;
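&lt;p&gt;Putting the pieces together, a driver loop can apply the protection call to every included bucket. The sketch below is illustrative rather than the exact code from the template: it takes the client as a parameter so it can be exercised with a stub, while the deployed version would receive a &lt;code&gt;boto3&lt;/code&gt; GuardDuty client:&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;
import uuid

def enable_for_buckets(guardduty_client, buckets, role_arn):
    """Enable Malware Protection for each bucket; returns the buckets processed."""
    for bucket in buckets:
        guardduty_client.create_malware_protection_plan(
            ClientToken=str(uuid.uuid4()),  # unique token per request
            Role=role_arn,
            ProtectedResource={'S3Bucket': {'BucketName': bucket}},
            Actions={'Tagging': {'Status': 'DISABLED'}},
        )
    return list(buckets)
&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;Injecting the client also makes it straightforward to run the same loop against multiple regions by constructing one GuardDuty client per region.&lt;/p&gt;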

&lt;h2&gt;Best Practices for Large-Scale S3 Malware Protection&lt;/h2&gt;

&lt;p&gt;To maximize the effectiveness of your S3 Malware Protection strategy, consider these best practices:&lt;/p&gt;

&lt;h3&gt;1. Align Scanning with Business Priorities&lt;/h3&gt;

&lt;p&gt;Focus on buckets that are critical to your business operations or store sensitive data. For example, prioritize customer-uploaded content or partner data exchanges over system logs.&lt;/p&gt;

&lt;h3&gt;2. Use Automation for Consistency&lt;/h3&gt;

&lt;p&gt;Leverage tools like AWS CloudFormation, as demonstrated in this blog, to automate the deployment and management of malware protection policies. Automation reduces human error and ensures uniform application across accounts and regions.&lt;/p&gt;

&lt;h3&gt;3. Regularly Audit and Update Configurations&lt;/h3&gt;

&lt;p&gt;As your AWS environment evolves, conduct periodic reviews to ensure that high-risk buckets are protected and that low-risk buckets are excluded to minimize costs.&lt;/p&gt;

&lt;h3&gt;4. Monitor GuardDuty Alerts&lt;/h3&gt;

&lt;p&gt;Integrate GuardDuty findings with monitoring tools like Amazon CloudWatch or AWS Security Hub to stay informed of any detected threats and take swift action.&lt;/p&gt;

&lt;h3&gt;5. Advocate for Continuous Improvement&lt;/h3&gt;

&lt;p&gt;Provide feedback to AWS for features like organization-wide protection or selective scanning. Collaboration with AWS can drive enhancements to the service.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Amazon S3 Malware Protection is a robust tool for safeguarding your data, but applying it at scale in multi-account environments requires strategic planning and automation. By adopting AWS's risk-based approach, categorizing buckets by priority, and leveraging tools like CloudFormation, you can implement a cost-effective and efficient malware protection strategy.&lt;/p&gt;

&lt;p&gt;The solution detailed in this blog builds upon my previous work and addresses the unique challenges of large-scale environments. While AWS continues to refine its services, proactive efforts such as these ensure that your critical assets remain secure.&lt;/p&gt;

&lt;p&gt;If you’d like to discuss further or share your feedback, feel free to reach out!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>security</category>
      <category>cloud</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Amazon GuardDuty Malware Protection for Amazon S3</title>
      <dc:creator>Mark Laszlo</dc:creator>
      <pubDate>Mon, 24 Jun 2024 10:27:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/amazon-guardduty-malware-protection-for-amazon-s3-2oe1</link>
      <guid>https://forem.com/aws-builders/amazon-guardduty-malware-protection-for-amazon-s3-2oe1</guid>
      <description>&lt;p&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2024/06/detect-malware-object-uploads-amazon-s3-guardduty/"&gt;Amazon GuardDuty Malware Protection for Amazon S3&lt;/a&gt; is a feature that automatically scans newly uploaded objects in S3 buckets for potential malware. This service provides a seamless, scalable solution to enhance security within AWS environments, particularly focusing on preventing the ingress of malicious files.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Features
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automated Malware Detection&lt;/strong&gt;:&lt;br&gt;
GuardDuty Malware Protection for S3 scans new objects or new versions of existing objects as they are uploaded to your S3 buckets. This automated process ensures that any potential malware is detected in real-time, mitigating risks before the files are accessed or processed downstream.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Event-Driven Architecture&lt;/strong&gt;:&lt;br&gt;
The service uses an event-driven approach, which means that every time an object is uploaded to a bucket or a new version is added, a malware scan is automatically initiated. This timely detection mechanism is crucial for maintaining security without manual intervention.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scanning Scope&lt;/strong&gt;:&lt;br&gt;
GuardDuty Malware Protection for S3 focuses on newly uploaded objects. It does not retroactively scan existing objects in a bucket prior to the feature being enabled. If there is a need to scan existing objects, they must be re-uploaded to trigger the scan process.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Operational Simplicity and Scalability&lt;/strong&gt;:&lt;br&gt;
By being fully managed by AWS, this feature alleviates the need for customers to maintain their own scanning infrastructure. This reduces operational complexity and ensures that scanning operations do not impact the performance and scalability of S3 operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Integration with AWS Services&lt;/strong&gt;:&lt;br&gt;
Results from the malware scans can be integrated with Amazon EventBridge and Amazon CloudWatch. This enables automated workflows such as tagging, quarantine, or notification setups based on scan results. However, currently, the Malware Protection for S3 finding type &lt;a href="https://docs.aws.amazon.com/guardduty/latest/ug/gdu-malware-protection-s3.html"&gt;does not integrate with AWS Security Hub and Amazon Detective&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Getting Started and Usage
&lt;/h3&gt;

&lt;p&gt;To enable GuardDuty Malware Protection for S3:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configure the feature through the GuardDuty console.&lt;/li&gt;
&lt;li&gt;Select the specific S3 buckets to protect and set up necessary permissions through AWS Identity and Access Management (IAM).&lt;/li&gt;
&lt;li&gt;Choose whether to scan all objects in a bucket or only those with a specific prefix.&lt;/li&gt;
&lt;li&gt;Configure post-scan actions like tagging objects based on their scan status.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Organizational-Level Controls
&lt;/h3&gt;

&lt;p&gt;Currently, there are no direct organizational-level controls to enable malware protection for all buckets simultaneously. Each bucket must be enabled individually. Furthermore, delegated GuardDuty administrators cannot enable this feature on buckets belonging to member accounts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security Findings and Notifications
&lt;/h3&gt;

&lt;p&gt;Detailed security findings are generated for each scanned object, categorizing them based on the presence of threats. These findings are visible in the GuardDuty console and can trigger automated responses through EventBridge, ensuring timely handling of detected threats.&lt;/p&gt;
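&lt;p&gt;For example, an EventBridge rule can match only scans that found threats. A sketch of such an event pattern follows; the field names are based on the documented scan-result event shape and should be verified against the current GuardDuty documentation:&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;
{
  "source": ["aws.guardduty"],
  "detail-type": ["GuardDuty Malware Protection Object Scan Result"],
  "detail": {
    "scanResultDetails": {
      "scanResultStatus": ["THREATS_FOUND"]
    }
  }
}
&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;A rule with this pattern can target an SNS topic or a quarantine Lambda so that infected objects are handled immediately.&lt;/p&gt;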

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;The pricing for GuardDuty Malware Protection for S3 is based on the &lt;a href="https://aws.amazon.com/guardduty/pricing/"&gt;volume of data scanned and the number of objects evaluated&lt;/a&gt;. AWS offers a limited free tier that includes 1,000 requests and 1 GB of scanned data per month for the first year or until June 11, 2025, for existing accounts.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>security</category>
    </item>
    <item>
      <title>Enhancing Data Security with Automated Cross-Account Backup in AWS</title>
      <dc:creator>Mark Laszlo</dc:creator>
      <pubDate>Tue, 30 Jan 2024 06:07:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/enhancing-data-security-with-automated-cross-account-backup-in-aws-121m</link>
      <guid>https://forem.com/aws-builders/enhancing-data-security-with-automated-cross-account-backup-in-aws-121m</guid>
      <description>&lt;p&gt;In today's digital landscape, data security and backup are not just options but necessities. With businesses increasingly moving their operations to the cloud, the need for robust backup strategies has never been more critical. In this context, AWS (Amazon Web Services) provides a powerful and flexible platform for backing up data. However, managing backups, especially across multiple accounts and regions, can be challenging. This is where automation using AWS CloudFormation comes into play.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automating Cross-Account Backup in AWS
&lt;/h2&gt;

&lt;p&gt;The concept of automating cross-account backup involves creating backups in one AWS account and then securely transferring these backups to another account, possibly in a different AWS region. This strategy enhances data security by providing geographical redundancy and protecting against account-specific risks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Components of the Automation
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AWS Organizations&lt;/strong&gt;: Utilizing AWS Organizations is fundamental in managing multiple accounts. It allows for centralized governance and streamlined operations across the account landscape.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IAM Role for AWS Backup&lt;/strong&gt;: An IAM role specifically designed for AWS Backup ensures that AWS Backup has the necessary permissions to perform backup tasks across accounts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Backup Vault with Cross-Account Access&lt;/strong&gt;: A Backup Vault is created where the backups are stored. The access policy of this vault is configured to allow cross-account backup copying, ensuring that backups can be securely transferred between accounts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;KMS Customer Managed Key&lt;/strong&gt;: To ensure the security of backups, they are encrypted using a customer-managed key in AWS Key Management Service (KMS). This key is configured with policies that align with cross-account access and security best practices.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cross-Region Backups&lt;/strong&gt;: Backups are not just copied across accounts but can also be replicated across different AWS regions. This ensures data durability and availability even in the event of a regional AWS service disruption.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lambda Function for Backup Automation&lt;/strong&gt;: A Lambda function is at the heart of this automation. It triggers backup copying in response to specific events, such as the completion of a backup job. This function handles the creation of backup vaults in the destination account and initiates the backup copy job.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Amazon EventBridge (formerly CloudWatch Events)&lt;/strong&gt;: Amazon EventBridge rules are set up to trigger the Lambda function based on specific backup events, automating the entire process.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Centralized Logging and Monitoring&lt;/strong&gt;: Through Amazon CloudWatch, all operations are logged, providing visibility into the backup processes and ensuring that any issues can be quickly identified and resolved.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
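&lt;p&gt;The Lambda component (item 6) can be reduced to a small helper. The sketch below is illustrative, not the exact function from the solution: it takes the AWS Backup client as a parameter (a &lt;code&gt;boto3&lt;/code&gt; &lt;code&gt;backup&lt;/code&gt; client in the deployed function), looks up the completed job's recovery point, and starts the cross-account copy:&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;
def copy_completed_backup(backup_client, backup_job_id, dest_vault_arn, role_arn):
    """Copy the recovery point of a completed backup job to the destination vault."""
    job = backup_client.describe_backup_job(BackupJobId=backup_job_id)
    response = backup_client.start_copy_job(
        RecoveryPointArn=job['RecoveryPointArn'],
        SourceBackupVaultName=job['BackupVaultName'],
        DestinationBackupVaultArn=dest_vault_arn,
        IamRoleArn=role_arn,
    )
    return response['CopyJobId']
&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;The EventBridge rule (item 7) supplies the backup job ID in the event detail, so the handler only needs to extract it and call this helper.&lt;/p&gt;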

&lt;h2&gt;
  
  
  Benefits of Automated Cross-Account Backup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced Data Security&lt;/strong&gt;: By storing backups in a separate account, the risk associated with account-level compromises is significantly reduced.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geographical Redundancy&lt;/strong&gt;: Storing backups in different regions protects against region-specific failures, ensuring higher data availability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: As the organization grows, this automated solution scales to handle increased backup requirements without significant additional management overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance and Governance&lt;/strong&gt;: This approach aligns well with various compliance requirements that mandate off-site backups and data redundancy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-Effective&lt;/strong&gt;: Automation reduces the need for manual backup and restore processes, leading to cost savings in terms of manpower and operational expenses.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Automating cross-account backup in AWS using CloudFormation not only streamlines the backup process but also significantly enhances the security and reliability of data stored in the cloud. As businesses continue to leverage cloud services, adopting such advanced backup strategies will be crucial in safeguarding their digital assets against a myriad of threats and ensuring business continuity.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>security</category>
    </item>
    <item>
      <title>Patching your Auto Scaling Group on AWS</title>
      <dc:creator>Mark Laszlo</dc:creator>
      <pubDate>Tue, 23 Jan 2024 06:52:10 +0000</pubDate>
      <link>https://forem.com/aws-builders/patching-your-auto-scaling-group-on-aws-42b0</link>
      <guid>https://forem.com/aws-builders/patching-your-auto-scaling-group-on-aws-42b0</guid>
      <description>&lt;p&gt;In the fast-paced world of cloud computing, maintaining efficiency, security, and reliability is paramount. A key component in achieving these objectives is ensuring that server instances within Auto Scaling Groups (ASGs) are regularly updated and patched. This is where AWS Systems Manager Automation steps in, offering a streamlined, automated solution for managing these critical updates.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Challenge: Keeping Auto Scaling Groups Up-to-Date
&lt;/h3&gt;

&lt;p&gt;ASGs are essential for managing the scalability and availability of applications. They ensure that the number of instances adjusts automatically according to the defined conditions, such as traffic or load changes. However, regularly updating these instances, especially for patch management, can be a complex and time-consuming task. Manual interventions increase the risk of errors and inconsistencies, leading to potential security vulnerabilities and performance issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Automated Solution
&lt;/h3&gt;

&lt;p&gt;To address this challenge, a specialized AWS Systems Manager Automation document is used. This document automates the process of patching instances in an ASG, creating a new Amazon Machine Image (AMI) with the latest patches, and then updating the ASG to use this new AMI. This ensures that all new instances launched by the ASG will be up-to-date with the latest patches, thereby maintaining the security and stability of the environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;The automation process involves several key steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Querying the ASG&lt;/strong&gt;: The process starts by identifying the ASG that needs updating. This is done by querying ASGs based on specific tags, ensuring that the right group is targeted for the update.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Creating a Patched Instance&lt;/strong&gt;: A new EC2 instance is launched using the current AMI of the ASG. This instance is then used for patching.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Patching and Image Creation&lt;/strong&gt;: The newly launched instance undergoes a patching process based on the specified patch baseline. Post-patching, a new AMI is created from this instance. This step can be configured to either reboot or not reboot the instance after patching, depending on the user's choice.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Updating the ASG&lt;/strong&gt;: Once the new AMI is ready, the ASG's launch configuration is updated to use this new, patched AMI. This ensures that any new instances launched will be based on the updated AMI.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Refreshing Instances&lt;/strong&gt;: The final step involves refreshing the instances in the ASG. This means that existing instances are gradually replaced with new instances launched from the updated AMI, thus ensuring that all instances in the ASG are up-to-date.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitoring and Verification&lt;/strong&gt;: Throughout the process, checks and balances are in place to monitor the execution and ensure successful completion. This includes verifying the management state of the new instance and ensuring that the instance refresh in the ASG completes successfully.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
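&lt;p&gt;Steps 4 and 5 above can be sketched in a few API calls. The snippet below is a hypothetical illustration, not the automation document itself: it assumes a launch template rather than a legacy launch configuration (launch configurations are immutable, so modern ASGs typically use launch templates), and it takes the clients as parameters so the logic can be exercised with stubs:&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;
def roll_out_patched_ami(ec2_client, autoscaling_client,
                         launch_template_id, new_ami_id, asg_name):
    """Point the launch template at the patched AMI, then replace running instances."""
    # Step 4: publish a new launch template version with the patched AMI
    # and make it the default.
    ec2_client.create_launch_template_version(
        LaunchTemplateId=launch_template_id,
        SourceVersion='$Latest',
        LaunchTemplateData={'ImageId': new_ami_id},
    )
    ec2_client.modify_launch_template(
        LaunchTemplateId=launch_template_id,
        DefaultVersion='$Latest',
    )
    # Step 5: gradually replace instances so they pick up the new AMI.
    refresh = autoscaling_client.start_instance_refresh(
        AutoScalingGroupName=asg_name,
        Preferences={'MinHealthyPercentage': 90},  # keep most capacity in service
    )
    return refresh['InstanceRefreshId']
&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;The returned refresh ID can then be polled (step 6) until the replacement completes.&lt;/p&gt;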

&lt;h3&gt;
  
  
  Benefits of Automation
&lt;/h3&gt;

&lt;p&gt;By automating the process of updating AMIs and ASGs, several benefits are realized:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consistency and Reliability&lt;/strong&gt;: Automation reduces the risk of human error and ensures that all instances are consistently patched and updated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: Regular patching helps in addressing vulnerabilities quickly, enhancing the overall security posture.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficiency&lt;/strong&gt;: Automating repetitive tasks like patching and AMI updates saves time and allows teams to focus on more strategic initiatives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: As the infrastructure grows, this automated solution scales accordingly, handling updates without additional overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In conclusion, leveraging AWS Systems Manager Automation for managing updates in Auto Scaling Groups offers a robust, efficient, and secure way to handle instance patching and AMI updates. This automated approach not only ensures that the infrastructure remains secure and up-to-date but also significantly reduces the operational burden on IT teams, allowing them to focus on more value-added activities.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>security</category>
    </item>
    <item>
      <title>Best Practices When Designing AWS Architecture: Cost Optimization and Sustainability</title>
      <dc:creator>Mark Laszlo</dc:creator>
      <pubDate>Tue, 09 May 2023 05:54:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/best-practices-when-designing-aws-architecture-cost-optimization-and-sustainability-41lo</link>
      <guid>https://forem.com/aws-builders/best-practices-when-designing-aws-architecture-cost-optimization-and-sustainability-41lo</guid>
      <description>&lt;p&gt;The AWS Well-Architected Framework helps you understand the decisions you make while building workloads on AWS. The Framework provides architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable workloads in the cloud1.&lt;/p&gt;

&lt;p&gt;The AWS Well-Architected Framework is based on six pillars: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Operational Excellence&lt;/li&gt;
&lt;li&gt;Security&lt;/li&gt;
&lt;li&gt;Reliability&lt;/li&gt;
&lt;li&gt;Performance Efficiency&lt;/li&gt;
&lt;li&gt;Cost Optimization&lt;/li&gt;
&lt;li&gt;Sustainability &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post focuses on the Cost Optimization and Sustainability pillars.&lt;/p&gt;

&lt;p&gt;The Cost Optimization pillar includes the ability to run systems to deliver business value at the lowest price point. A cost-optimized workload fully utilizes all resources, achieves an outcome at the lowest possible price point, and meets your functional requirements.&lt;/p&gt;

&lt;p&gt;By adopting the practices in this pillar you will build capability within your organization, design your workload, select your services, configure and operate the services, and apply cost optimization techniques.&lt;/p&gt;

&lt;p&gt;This pillar provides an overview of design principles, best practices, and questions.&lt;/p&gt;

&lt;p&gt;The Sustainability pillar focuses on environmental impacts, especially energy consumption and efficiency, since these are important levers architects can use to inform direct action that reduces resource usage. It provides design principles, operational guidance, best practices, potential trade-offs, and improvement plans you can use to meet sustainability targets for your AWS workloads.&lt;/p&gt;

&lt;p&gt;By adopting the practices in this pillar you can build architectures that maximize efficiency and reduce waste.&lt;/p&gt;

&lt;p&gt;This pillar provides an overview of design principles, best practices, and questions.&lt;/p&gt;

&lt;p&gt;If you have any questions, feel free to reach out.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Best Practices When Designing AWS Architecture: Reliability and Performance Efficiency</title>
      <dc:creator>Mark Laszlo</dc:creator>
      <pubDate>Tue, 02 May 2023 05:49:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/best-practices-when-designing-aws-architecture-reliability-and-performance-efficiency-1cfg</link>
      <guid>https://forem.com/aws-builders/best-practices-when-designing-aws-architecture-reliability-and-performance-efficiency-1cfg</guid>
      <description>&lt;p&gt;The AWS Well-Architected Framework helps you understand the pros and cons of decisions you make while building workloads on AWS. By using the Framework you will learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable workloads in the cloud. It provides a way to consistently measure your architectures against best practices and identify areas for improvement1.&lt;/p&gt;

&lt;p&gt;The AWS Well-Architected Framework is based on six pillars: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Operational Excellence&lt;/li&gt;
&lt;li&gt;Security&lt;/li&gt;
&lt;li&gt;Reliability&lt;/li&gt;
&lt;li&gt;Performance Efficiency&lt;/li&gt;
&lt;li&gt;Cost Optimization&lt;/li&gt;
&lt;li&gt;Sustainability &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post focuses on the Reliability and Performance Efficiency pillars.&lt;/p&gt;

&lt;p&gt;The Reliability pillar includes the ability of a workload to perform its intended function correctly and consistently when it’s expected to. This includes the ability to operate and test the workload through its total lifecycle.&lt;/p&gt;

&lt;p&gt;By adopting the practices in this pillar you will build architectures that have strong foundations, resilient architecture, consistent change management, and proven failure recovery processes.&lt;/p&gt;

&lt;p&gt;The Performance Efficiency pillar includes the ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve. It brings the efficient use of computing resources to the forefront of best practice, including how best to maintain efficiency for users even as requirements and demands change.&lt;/p&gt;

&lt;p&gt;By adopting the practices in this pillar you will build architectures on AWS that efficiently deliver sustained performance over time.&lt;/p&gt;

&lt;p&gt;This pillar provides an overview of design principles, best practices, and questions.&lt;/p&gt;

&lt;p&gt;If you have any questions, feel free to reach out.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>performance</category>
    </item>
    <item>
      <title>Best Practices When Designing AWS Architecture: Security and Operational Excellence</title>
      <dc:creator>Mark Laszlo</dc:creator>
      <pubDate>Tue, 25 Apr 2023 05:43:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/best-practices-when-designing-aws-architecture-security-and-operational-excellence-jae</link>
      <guid>https://forem.com/aws-builders/best-practices-when-designing-aws-architecture-security-and-operational-excellence-jae</guid>
      <description>&lt;p&gt;The AWS Well-Architected Framework helps you understand the benefits and risks of decisions you make while building workloads on AWS. By using the Framework you will learn operational and architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable workloads in the cloud. It provides a way to consistently measure your operations and architectures against best practices and identify areas for improvement.&lt;/p&gt;

&lt;p&gt;The framework is based on six pillars: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Operational Excellence&lt;/li&gt;
&lt;li&gt;Security&lt;/li&gt;
&lt;li&gt;Reliability&lt;/li&gt;
&lt;li&gt;Performance Efficiency&lt;/li&gt;
&lt;li&gt;Cost Optimization&lt;/li&gt;
&lt;li&gt;Sustainability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post focuses on the Operational Excellence and Security pillars.&lt;/p&gt;

&lt;p&gt;Operational Excellence is the first pillar of the AWS Well-Architected Framework. It includes the ability to support development and run workloads effectively, gain insight into their operation, and continuously improve processes and procedures to deliver business value.&lt;/p&gt;

&lt;p&gt;By adopting the practices in this pillar you can build architectures that provide insight to their status, are enabled for effective and efficient operation and event response, and can continue to improve and support your business goals.&lt;/p&gt;

&lt;p&gt;This pillar provides an overview of design principles, best practices, and questions. &lt;/p&gt;

&lt;p&gt;The Security pillar encompasses the ability to protect data, systems, and assets while taking advantage of cloud technologies to improve your security. It covers using cloud capabilities to predict, prevent, and respond to threats, as well as to enforce privacy, maintain data integrity, guard assets, and improve the detection of security events within a software environment.&lt;/p&gt;

&lt;p&gt;By adopting the practices in this pillar you can build architectures that protect your data and systems, control access, and respond automatically to security events.&lt;/p&gt;

&lt;p&gt;This pillar provides an overview of design principles, best practices, and questions. &lt;/p&gt;

&lt;p&gt;If you have any questions, feel free to reach out.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>security</category>
      <category>operations</category>
      <category>cloud</category>
    </item>
  </channel>
</rss>
