<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ryan Nazareth</title>
    <description>The latest articles on Forem by Ryan Nazareth (@ryankarlos).</description>
    <link>https://forem.com/ryankarlos</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F918266%2F27533df8-701e-4623-9e1c-23135f4d37b6.jpeg</url>
      <title>Forem: Ryan Nazareth</title>
      <link>https://forem.com/ryankarlos</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ryankarlos"/>
    <language>en</language>
    <item>
      <title>AWS Sagemaker Notebook Jobs for Accelerating Data Science Experimentation Workflows with Mlflow and Optuna</title>
      <dc:creator>Ryan Nazareth</dc:creator>
      <pubDate>Mon, 05 Jan 2026 03:21:13 +0000</pubDate>
      <link>https://forem.com/aws-builders/aws-sagemaker-notebook-jobs-for-accelerating-data-science-experimentation-workflows-with-mlflow-and-546j</link>
      <guid>https://forem.com/aws-builders/aws-sagemaker-notebook-jobs-for-accelerating-data-science-experimentation-workflows-with-mlflow-and-546j</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Hyperparameter tuning across multiple models presents a common challenge for ML practitioners. Tracking experiment results, managing configurations, and ensuring reproducibility becomes increasingly difficult as the number of models grows. This post walks through a solution that combines &lt;a href="https://aws.amazon.com/sagemaker/" rel="noopener noreferrer"&gt;Amazon SageMaker&lt;/a&gt;, &lt;a href="https://mlflow.org/" rel="noopener noreferrer"&gt;MLflow&lt;/a&gt;, and &lt;a href="https://optuna.org/" rel="noopener noreferrer"&gt;Optuna&lt;/a&gt; to create an automated, scalable hyperparameter optimization pipeline.&lt;/p&gt;

&lt;p&gt;The use case that motivated this work involved training separate demand forecasting models for different product categories—smartphones, laptops, tablets, and accessories. Each category exhibits distinct patterns, making category-specific models more effective than a single unified model. The goal was to automate the hyperparameter search, centralize experiment tracking, and enable parallel training across all categories.&lt;/p&gt;

&lt;p&gt;Manual hyperparameter tuning workflows often suffer from several issues. Experiment results end up scattered across notebooks and spreadsheets. Configurations from previous runs get lost or forgotten. Comparing results across different models requires tedious manual aggregation. And scaling to additional models means duplicating effort. A viable solution needs to address these pain points while integrating smoothly with existing ML workflows and AWS infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The architecture leverages several AWS services working together. &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html" rel="noopener noreferrer"&gt;SageMaker Studio&lt;/a&gt; provides the development environment for notebook-based experimentation. When ready for full optimization runs, &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html" rel="noopener noreferrer"&gt;SageMaker Pipelines&lt;/a&gt; orchestrates notebook jobs for each product category in parallel. Each job uses Optuna to search for optimal XGBoost hyperparameters, with all experiments logged to a &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html" rel="noopener noreferrer"&gt;managed MLflow server&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Model artifacts, metrics, and visualizations are stored in Amazon S3. The entire infrastructure is defined in CloudFormation, enabling consistent deployments across accounts and regions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F93cigklorjqerna7mebk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F93cigklorjqerna7mebk.png" alt=" " width="800" height="764"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Deploy the stack by running the following commands in bash, setting your user (for the SageMaker domain), the bucket name for storing artifacts, and the region.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;infrastructure
./deploy.sh &lt;span class="nt"&gt;--user&lt;/span&gt; ryan &lt;span class="nt"&gt;--bucket&lt;/span&gt; sm-mlflow-optuna &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6cw8qs3coy5n0cfwbq1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6cw8qs3coy5n0cfwbq1.png" alt=" " width="800" height="811"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can monitor the deployment of the resources in the CloudFormation console under the stack name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; Deployment typically takes several minutes, with most of that time spent provisioning the MLflow server (alternatively, you could update the CloudFormation template to use the MLflow serverless option in Studio).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcrf96rhn47buahqk4y1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcrf96rhn47buahqk4y1.png" alt=" " width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After deployment, access your SageMaker Studio domain via the user created by the CloudFormation template. You will see the private space provisioned. The CloudFormation stack deploys this with the latest SageMaker distribution image and the default &lt;code&gt;ml.t3.medium&lt;/code&gt; instance size. A lifecycle configuration script is attached to install additional dependencies, and auto shutdown is enabled to stop the space after 60 minutes of idle time. Start the space and wait for it to reach the running state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi55y9cc4kvhtg0g9h246.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi55y9cc4kvhtg0g9h246.png" alt=" " width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Navigate to the MLflow app; under tracking servers, you should see your tracking server provisioned, with the artifact location set to a prefix within the bucket deployed in CloudFormation. Make a note of the MLflow tracking server ARN, as you will need it to update the notebooks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg78zrcgp4skm9ooyyxl7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg78zrcgp4skm9ooyyxl7.png" alt=" " width="800" height="263"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the space is in the running state, click open. Navigate to the git branch icon in the sidebar and clone the repository &lt;code&gt;https://github.com/ryankarlos/sagemaker_mlflow_optuna.git&lt;/code&gt; over HTTPS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz9lxv9v3y6u1ircxj5jv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz9lxv9v3y6u1ircxj5jv.png" alt=" " width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The repository has two notebooks which will be used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/ryankarlos/sagemaker_mlflow_optuna/blob/master/fm_train.ipynb" rel="noopener noreferrer"&gt;fm_train.ipynb&lt;/a&gt;: Notebook that runs the execution of the data preparation, processing and model training, logging to mlflow server using optuna as backend for hyperparameter tuning. When running the notebook, the execution will run for the category and the parameters defined in the notebook cell. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Open the notebook and update the cell containing the MLflow ARN you noted previously. We will briefly go over what each cell does in the next sections.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6yx81e76nqhlmrcf9ej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6yx81e76nqhlmrcf9ej.png" alt=" " width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/ryankarlos/sagemaker_mlflow_optuna/blob/master/fm_train.ipynb" rel="noopener noreferrer"&gt;nb-job-pipeline.ipynb&lt;/a&gt;: This is the main orchestration notebook, here we execute different configurations of &lt;code&gt;fm_train.ipynb&lt;/code&gt; notebook as Notebook Job Steps and stitch them together into a singular SageMaker Pipeline. This will run the training of models for  each of the 4 product categories in the dataset, so we will have 4 models. We will describe how we accomplish this in future sections as it will require a few settings in Sagemaker Studio from the user.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You will need to update the &lt;code&gt;bucket&lt;/code&gt; and &lt;code&gt;region&lt;/code&gt; variables in the cell in the screenshot below if you deployed the CloudFormation stack with different values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fge67yjngfd1a1u6d4uv8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fge67yjngfd1a1u6d4uv8.png" alt=" " width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Preparation
&lt;/h2&gt;

&lt;p&gt;This example uses synthetic electronics sales data, generated with Claude Opus 4.5 in Kiro, representing the number of daily units sold for laptops, smartphones, accessories, and tablets. The data generator creates features with realistic correlations to the target variable, including price sensitivity, promotional effects, seasonality, weekend effects, and competitive dynamics. One requirement in the generation prompt was to produce feature correlations in the 0.2-0.76 range against the target, providing Optuna with meaningful signal for optimization; weak or nonexistent correlations would limit the effectiveness of any hyperparameter search. The target variable &lt;code&gt;units_sold&lt;/code&gt; was generated from a combination of these features with some added noise. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5e5f5u9o0ovx8jjk5p3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5e5f5u9o0ovx8jjk5p3.png" alt=" " width="800" height="187"&gt;&lt;/a&gt;&lt;/p&gt;
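&lt;p&gt;The generation process can be sketched along these lines. This is an illustrative numpy/pandas recreation, not the actual Kiro-generated code; the feature names and coefficients are assumptions chosen only to produce correlations in a meaningful range:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 730  # two years of daily observations

# Hypothetical features mirroring those described above
price = rng.normal(600, 100, n)                      # unit price
promo = rng.binomial(1, 0.2, n)                      # promotion flag
season = np.sin(2 * np.pi * np.arange(n) / 365.25)   # yearly seasonality
weekend = np.isin(np.arange(n) % 7, [5, 6]).astype(float)

# Target combines the features linearly with added noise, so each feature
# retains a non-trivial correlation with units_sold
units_sold = (
    200 - 0.1 * price + 40 * promo + 25 * season + 15 * weekend
    + rng.normal(0, 10, n)
)

df = pd.DataFrame({
    "price": price, "promotion": promo, "seasonality": season,
    "weekend": weekend, "units_sold": units_sold,
})
print(df.corr()["units_sold"].round(2))
```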

&lt;h2&gt;
  
  
  Hyperparameter Optimization with Optuna
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/optuna/optuna?tab=readme-ov-file" rel="noopener noreferrer"&gt;Optuna&lt;/a&gt; is an automatic hyperparameter optimization software framework, particularly designed for machine learning. it handles the hyperparameter search via Bayesuan Optimisation using a &lt;a href="https://optuna.readthedocs.io/en/stable/reference/samplers/generated/optuna.samplers.TPESampler.html" rel="noopener noreferrer"&gt;Tree-structured Parzen Estimator (TPE)&lt;/a&gt; sampler by default, although users have the option of choosing other sampler options. TPE models the relationship between hyperparameters and objective values, focusing exploration on promising regions of the search space.&lt;/p&gt;

&lt;p&gt;For this example, we use XGBoost to predict the number of units sold for each category. The XGBoost search space includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Booster type (gbtree, gblinear, dart)&lt;/li&gt;
&lt;li&gt;Regularization parameters (lambda, alpha) with log-uniform distributions&lt;/li&gt;
&lt;li&gt;Tree depth and learning rate&lt;/li&gt;
&lt;li&gt;Growth policy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Log-uniform distributions work well for regularization parameters where optimal values can span several orders of magnitude. The &lt;a href="https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/002_configurations.html" rel="noopener noreferrer"&gt;Optuna documentation on search spaces&lt;/a&gt; covers the available distribution options.&lt;/p&gt;

&lt;p&gt;Optuna uses the concepts of a study and a trial. A study is an optimization session based on an objective function; a trial is a single execution of that objective function.&lt;/p&gt;

&lt;p&gt;The objective function for this use case is defined below. The goal of a study is to find the optimal set of hyperparameter values through multiple trials (e.g., &lt;code&gt;n_trials=50&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;objective&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trial&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Optuna objective function with MLflow child runs.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trial-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;trial&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nested&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Suggest hyperparameters
&lt;/span&gt;        &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;objective&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reg:squarederror&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rmse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;booster&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;trial&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;suggest_categorical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;booster&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gbtree&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gblinear&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dart&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lambda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;trial&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;suggest_float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lambda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1e-8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alpha&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;trial&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;suggest_float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alpha&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1e-8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;# Conditional hyperparameters based on booster type
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;booster&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gbtree&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dart&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_depth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trial&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;suggest_int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_depth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trial&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;suggest_float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1e-8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gamma&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trial&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;suggest_float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gamma&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1e-8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grow_policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trial&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;suggest_categorical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grow_policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;depthwise&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lossguide&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Train model
&lt;/span&gt;        &lt;span class="n"&gt;dtrain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DMatrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;dvalid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DMatrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;valid_x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;valid_y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtrain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_boost_round&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dvalid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Calculate metrics
&lt;/span&gt;        &lt;span class="n"&gt;mse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;valid_y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;rmse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Log to MLflow
&lt;/span&gt;        &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rmse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rmse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mse&lt;/span&gt;  &lt;span class="c1"&gt;# Optuna minimizes this value
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Optuna integrates with MLflow, which allows every trial to be systematically recorded. MLflow supports nesting runs within an experiment: each Optuna trial can be treated as a child run that tracks the specific hyperparameters used and the resulting metrics, providing a consolidated view of the entire optimization process. All child runs are grouped under a parent run, which represents the entire optimization study for a particular product category, e.g. laptops. This structure keeps experiments organized in the MLflow UI, where the overall best result appears at the parent level, with individual trials available for detailed inspection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parameterizing Notebooks for Pipeline Execution
&lt;/h2&gt;

&lt;p&gt;SageMaker Pipelines executes notebooks as jobs, but proper parameterization is essential. The mechanism relies on cell tags: specifically, &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-auto-run-troubleshoot-override.html" rel="noopener noreferrer"&gt;tagging a cell with "parameters"&lt;/a&gt; using the JupyterLab metadata editor.&lt;/p&gt;

&lt;p&gt;In this example, you will need to tag the cell shown in the screenshot below with the parameters tag. This cell defines all the parameters and configuration that may change with each training run for a category, e.g. category name, model starting parameters, number of Optuna trials, and so on.&lt;br&gt;
Open the &lt;code&gt;fm_train.ipynb&lt;/code&gt; notebook, select the cell titled &lt;code&gt;Configuration&lt;/code&gt;, and expand the &lt;code&gt;common tools&lt;/code&gt; section in the right sidebar. You should see a parameters tag already in the tag box; click on it to apply the tag to the cell. It will appear with a check mark, as in the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmufxvbrpkzi0l24olvzt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmufxvbrpkzi0l24olvzt.png" alt=" " width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a notebook job runs, the notebook job executor searches for a Jupyter cell tagged with the parameters tag and applies the new parameters or parameter overrides immediately after this cell. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note&lt;/em&gt; All parameter values must be strings, so any constants that are overridden are injected as strings. Hence, one of the cells in the notebook casts the variables back to int and float as required. The &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-auto-run.html" rel="noopener noreferrer"&gt;notebook jobs documentation&lt;/a&gt; provides complete details on parameterization.&lt;/p&gt;
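&lt;p&gt;For example, the tagged parameters cell and the casting cell might look like this. The variable names are illustrative; the actual names are defined in &lt;code&gt;fm_train.ipynb&lt;/code&gt;:&lt;/p&gt;

```python
# --- Parameters cell (tagged "parameters") ---
# Notebook job overrides are injected as strings immediately after this cell,
# so numeric values may arrive as e.g. "50" rather than 50.
category = "laptops"
n_trials = "50"
test_size = "0.2"

# --- Casting cell ---
# Cast the injected string values back to the types the training code expects
n_trials = int(n_trials)
test_size = float(test_size)
```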
&lt;h2&gt;
  
  
  SageMaker Pipeline with Notebook Job Steps for Running Training Across Multiple Categories
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4pb2hcwz98sp6qc3e0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4pb2hcwz98sp6qc3e0w.png" alt=" " width="800" height="947"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-auto-run.html" rel="noopener noreferrer"&gt;SageMaker Notebook Step Jobs&lt;/a&gt; enable automated execution of Jupyter notebooks as managed compute jobs. When integrated with SageMaker Pipelines, they provide a scalable mechanism for running parameterized training workflows across multiple model categories.&lt;/p&gt;

&lt;p&gt;The pipeline creates a &lt;code&gt;NotebookJobStep&lt;/code&gt; for each product category using the SageMaker Pipelines SDK. Each step is configured with category-specific parameters, compute resources, and execution policies. The &lt;a href="https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.notebook_job_step.NotebookJobStep" rel="noopener noreferrer"&gt;NotebookJobStep API reference&lt;/a&gt; details all available configuration options.&lt;br&gt;
The following snippet from the &lt;a href="https://github.com/ryankarlos/sagemaker_mlflow_optuna/blob/master/notebook_pipeline.ipynb" rel="noopener noreferrer"&gt;notebook_pipeline.ipynb&lt;/a&gt; notebook shows how the notebook job step is created using the &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/create-notebook-auto-run-sdk.html" rel="noopener noreferrer"&gt;Python SDK&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sagemaker.workflow.notebook_job_step&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NotebookJobStep&lt;/span&gt;

&lt;span class="n"&gt;nb_step&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;NotebookJobStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;step_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;XGBoost training for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;notebook_job_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;step_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;image_uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;kernel_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;kernel_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;step_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sagemaker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_execution_role&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;s3_root_uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;notebook_artifacts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;additional_dependencies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/home/sagemaker-user/sagemaker_mlflow_optuna/scripts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;initialization_script&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nb_job_init.sh&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;input_notebook&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_notebook&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;instance_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;instance_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;nb_job_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_runtime_in_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_retry_attempts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are a few more non-default settings that need to be included. The &lt;em&gt;parameters&lt;/em&gt; to override in the notebook are passed as a dictionary, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
"category": "smartphones",
"n_trials": 50,
"experiment_name": "electronics-smartphones",
"test_size": 0.25
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;em&gt;additional_dependencies&lt;/em&gt; option lets us include extra files or folders, passed as a list, to be made available alongside the notebook when the job runs on the SageMaker managed instance. By default, SageMaker copies only the main &lt;em&gt;input_notebook&lt;/em&gt; file when initialising the training job instance; here the &lt;code&gt;scripts&lt;/code&gt; folder path is included because the notebook imports functions from Python modules in that folder. The &lt;em&gt;initialization_script&lt;/em&gt; option allows installing any necessary libraries on the instance that are not present in the base image URI. While the notebook job is running, the directory structure remains unchanged, so relative imports continue to work.&lt;/p&gt;
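&lt;p&gt;The initialization script itself is not reproduced in this post; a minimal sketch of what &lt;code&gt;nb_job_init.sh&lt;/code&gt; might contain (package names assumed, not confirmed by the repository) is:&lt;/p&gt;

```shell
#!/bin/bash
# Hypothetical nb_job_init.sh: runs on the managed instance before the
# notebook executes, installing libraries missing from the base image.
set -euo pipefail
pip install --quiet mlflow optuna xgboost
```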

&lt;p&gt;Whilst we use the Python SDK to automate this, a Notebook Job can also be initiated via the console, as explained in the &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/create-notebook-auto-run-studio.html" rel="noopener noreferrer"&gt;docs&lt;/a&gt;. At the top of the notebook tab, click on the blue Notebook Jobs widget. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7be8lzwfmkuzyct12dvg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7be8lzwfmkuzyct12dvg.png" alt=" " width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the next configuration tab, we can input all the required options, e.g. adding parameters or including additional files/scripts folders. The scheduler tries to infer sensible defaults and automatically populates the form to help you get started quickly. If you are using Studio, at a minimum you can submit an on-demand job without setting any options, or submit a scheduled notebook job definition supplying just the time-specific schedule information; other fields can be customized if your scheduled job requires specialized settings. If you are running a local Jupyter notebook, the scheduler extension lets you specify your own defaults (for a subset of options) so you don't have to manually insert the same values every time.&lt;/p&gt;

&lt;p&gt;When you create a notebook job, you can include additional files such as datasets, images, and local scripts. To do so, choose Run job with input folder. The Notebook Job will then have access to all files under the input file's folder. While the notebook job is running, the file structure of the directory remains unchanged.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgmrouf6j8ukmo0rynq0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgmrouf6j8ukmo0rynq0.png" alt=" " width="800" height="660"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each notebook job runs on its own compute instance, enabling true parallel execution. The &lt;code&gt;ml.m5.xlarge&lt;/code&gt; instance type (4 vCPUs, 16 GB RAM) provides sufficient resources for XGBoost training with 50 Optuna trials. For larger workloads or GPU-accelerated training, you can specify different instance types. The &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-available-instance-types.html" rel="noopener noreferrer"&gt;SageMaker instance types documentation&lt;/a&gt; lists all available options for notebook jobs.&lt;/p&gt;

&lt;p&gt;By defining an iterable of Notebook Job Steps with a different parameter configuration for each category, we can then execute these as a SageMaker Pipeline.&lt;br&gt;
&lt;/p&gt;
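&lt;p&gt;The per-category parameter dictionaries for those steps can be generated in a simple loop. A sketch, with illustrative category names:&lt;/p&gt;

```python
# Build one parameter dictionary per product category. Each dictionary is
# passed to its NotebookJobStep via the `parameters` argument.
categories = ["smartphones", "laptops", "tablets", "headphones"]

nb_job_param_sets = {
    category: {
        "category": category,
        "n_trials": 50,
        "experiment_name": f"electronics-{category}",
        "test_size": 0.25,
    }
    for category in categories
}
```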

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PipelineSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sagemaker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_execution_role&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pipeline_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pipeline_steps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sagemaker_session&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role_arn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;execution&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pipeline: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pipeline_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Execution: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;execution&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arn&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the pipeline starts, all four categories begin training simultaneously. Each runs 50 Optuna trials, logs results to MLflow, and saves the best model. You can monitor the pipeline execution from the Sagemaker Studio UI under the Pipelines section and check the logs for each step. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fil12u86yzg7izen87w7m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fil12u86yzg7izen87w7m.png" alt=" " width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Notebook Job Logs and Executed Notebooks
&lt;/h3&gt;

&lt;p&gt;After execution, notebook job outputs for all the steps in the pipeline are stored in S3 at the specified &lt;code&gt;s3_root_uri&lt;/code&gt; under a prefix associated with the SageMaker pipeline execution ID, as shown in the screenshot below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F90qb871ido91sjcje645.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F90qb871ido91sjcje645.png" alt=" " width="800" height="662"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Download the &lt;code&gt;output.tar.gz&lt;/code&gt; file and unzip it; the executed notebook is named after the step, and the SageMaker execution log file is also included. Open the notebook in SageMaker to view the cell executions or any execution errors. &lt;/p&gt;
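&lt;p&gt;After downloading the archive from S3, extraction can also be scripted. A small helper sketch:&lt;/p&gt;

```python
import tarfile
from pathlib import Path

def extract_notebook_outputs(archive_path, dest_dir):
    """Extract a notebook job output.tar.gz and list the extracted files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
        return sorted(m.name for m in tar.getmembers() if m.isfile())
```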

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vv7pgxcwwlb5tqvndhc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vv7pgxcwwlb5tqvndhc.png" alt=" " width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Integration with MLflow and Experiment Tracking
&lt;/h3&gt;

&lt;p&gt;The MLflow UI displays all experiments organized by category. Each experiment shows optimization history—how the objective value improved across trials. The nested run structure (parent run per category, child runs per trial) provides clear organization. Runs can be compared, parameter distributions examined, and artifacts downloaded for offline analysis.&lt;/p&gt;

&lt;p&gt;Notebook jobs integrate seamlessly with MLflow because they run in isolated environments with the same MLflow tracking URI configured. Each job connects to the managed MLflow server independently, ensuring all experiments are logged centrally regardless of which compute instance executed the training.&lt;/p&gt;

&lt;p&gt;The nested run structure provides clarity. At a glance, the best result for each category is visible. Expanding a parent run reveals all child trials with their logged parameters and metrics.&lt;/p&gt;

&lt;p&gt;Optuna's &lt;a href="https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/005_visualization.html" rel="noopener noreferrer"&gt;built-in visualizations&lt;/a&gt;—parameter importance plots, parallel coordinate plots, optimization history—are logged as artifacts alongside the models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa7btfmsu4qpc1gscz670.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa7btfmsu4qpc1gscz670.png" alt="MlflowRunsPipeline" width="800" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Clicking on any child run (tuning trial) reveals the logged metrics associated with that trial.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gved4dyn3easfri1fkc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gved4dyn3easfri1fkc.png" alt=" " width="800" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The parent run stores the best model metrics and the logged model artifacts, signatures, plots, code etc. The model can then be retrieved for inference or added to the model registry for versioning, promotion and deployment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gved4dyn3easfri1fkc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gved4dyn3easfri1fkc.png" alt=" " width="800" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Logged plots, such as feature importance and residual plots, are visible directly in the MLflow console.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmgjsapovqtgmep0m4cc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmgjsapovqtgmep0m4cc.png" alt=" " width="800" height="562"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyam1iut6x8pnrnieq6b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyam1iut6x8pnrnieq6b.png" alt=" " width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By setting the environment variable &lt;code&gt;MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING&lt;/code&gt; to &lt;code&gt;true&lt;/code&gt; in the code, we can also log system metrics automatically for each run and child run, to help decide on optimal instance types for future runs.&lt;/p&gt;
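&lt;p&gt;Setting the variable before any runs are started is sufficient, for example:&lt;/p&gt;

```python
import os

# Must be set before MLflow runs start, so system metrics (CPU, memory,
# GPU where available) are sampled automatically for each run.
os.environ["MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING"] = "true"
```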

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdb9pxrcoykpn7wy2hc67.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdb9pxrcoykpn7wy2hc67.png" alt=" " width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tearing Down Resources
&lt;/h2&gt;

&lt;p&gt;Once you are done experimenting, you can tear down the resources to save cost by navigating to the CloudFormation console and selecting delete stack. Before doing this, make sure you shut down any running SageMaker apps and empty the S3 bucket contents. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7nw2p25j2w00saxyjfd3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7nw2p25j2w00saxyjfd3.png" alt=" " width="779" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can monitor the state of the resources as they are deleted. Note that deletion of the MLflow tracking server may take over 20 minutes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6d0qroi0fg4okwtclxk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6d0qroi0fg4okwtclxk.png" alt=" " width="800" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sagemaker</category>
      <category>mlflow</category>
      <category>optuna</category>
      <category>aws</category>
    </item>
    <item>
      <title>Building a Sports Marketing Video Generator using Nova Canvas and Nova Reel</title>
      <dc:creator>Ryan Nazareth</dc:creator>
      <pubDate>Mon, 19 May 2025 03:04:42 +0000</pubDate>
      <link>https://forem.com/aws-builders/transforming-static-images-into-dynamic-videos-with-amazon-nova-for-sports-marketing-1he3</link>
      <guid>https://forem.com/aws-builders/transforming-static-images-into-dynamic-videos-with-amazon-nova-for-sports-marketing-1he3</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In today's fast-paced digital marketing landscape, creating engaging sports content quickly is essential. Traditional video production is time-consuming and expensive, often requiring specialized skills and equipment. What if marketers could generate professional sports marketing videos from a single image with just a few clicks?&lt;/p&gt;

&lt;p&gt;In this blog post, we'll explore a solution that builds a Streamlit application hosted on ECS and leverages Amazon Nova's generative AI capabilities to transform sports images into dynamic marketing videos.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Business Challenge
&lt;/h2&gt;

&lt;p&gt;Marketing teams across industries face common challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Limited resources for video production&lt;/li&gt;
&lt;li&gt;Need for rapid content creation across multiple channels&lt;/li&gt;
&lt;li&gt;Maintaining brand consistency across all visual assets&lt;/li&gt;
&lt;li&gt;Scaling content production without proportionally scaling costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This solution addresses these challenges by providing an intuitive web interface that allows non-technical users to generate professional-quality video content from static images with just a few clicks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The Canvas Video application is built on a modern, serverless architecture that leverages several AWS services. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;App: Dockerised Streamlit application deployed on AWS ECS (Fargate) as a service, with tasks running in multiple availability zones to ensure high availability.&lt;/li&gt;
&lt;li&gt;Authentication: Amazon Cognito user pool with username and password for authentication.&lt;/li&gt;
&lt;li&gt;Load Balancer: ALB intercepts requests, redirects unauthenticated users to Cognito for authentication, and then forwards authenticated users to the backend application with user claims, enabling secure access to the app running as ECS tasks. &lt;/li&gt;
&lt;li&gt;AI Services: AWS Bedrock (Nova Reel, Nova Canvas) and Amazon Rekognition&lt;/li&gt;
&lt;li&gt;Storage: S3 for storing the videos generated from Nova Reel. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzmb7qda5dcyj69tuadpb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzmb7qda5dcyj69tuadpb.png" alt=" " width="761" height="626"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The basic flow is as follows. A user navigates to a URL alias record in Route 53 and authenticates via Cognito. The user uploads a sports image, and Amazon Rekognition analyzes it to confirm it is sports-related. The user has the option of editing the image with Nova Canvas by providing inpainting/outpainting prompts. Nova Reel then generates a sports marketing video based on the image and prompt. The video is stored in S3, and a presigned URL is provided to the user in the Streamlit application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Sports Image Classification
&lt;/h3&gt;

&lt;p&gt;The app uses Amazon Rekognition to analyze uploaded images and determine if they're sports-related. It can identify specific sports like basketball, football, soccer, etc.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_sports_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Determine if the image is sports-related using Amazon Rekognition&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;rekognition&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rekognition&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rekognition&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;detect_labels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bytes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="c1"&gt;# Extract labels from the response
&lt;/span&gt;        &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Labels&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

        &lt;span class="c1"&gt;# Determine the specific sport type from labels
&lt;/span&gt;        &lt;span class="n"&gt;sport_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;determine_sport_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Labels&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="c1"&gt;# Check if any sports keywords are in the labels
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sports_keywords&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sport_type&lt;/span&gt;

        &lt;span class="c1"&gt;# Check for confidence scores on sports-related activities
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Labels&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sports_keywords&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Confidence&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;70&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sport_type&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;General Sports&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error in sports image classification: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;General Sports&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
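&lt;p&gt;The &lt;code&gt;response['Labels']&lt;/code&gt; structure iterated above follows the shape of a Rekognition DetectLabels response. As a standalone illustration of the same keyword-and-confidence filtering (the sample labels and threshold below are made up):&lt;/p&gt;

```python
# Standalone sketch of filtering a Rekognition DetectLabels-style
# response; the sample labels and the 70% threshold are illustrative.
def filter_labels(response, keywords, threshold=70):
    """Return label names that match a keyword above the confidence threshold."""
    return [
        label["Name"]
        for label in response["Labels"]
        if label["Confidence"] > threshold
        and any(kw.lower() in label["Name"].lower() for kw in keywords)
    ]

sample_response = {
    "Labels": [
        {"Name": "Tennis Racket", "Confidence": 95.2},
        {"Name": "Furniture", "Confidence": 88.0},
        {"Name": "Ball Sports", "Confidence": 55.0},
    ]
}
print(filter_labels(sample_response, ["tennis", "ball"]))  # ['Tennis Racket']
```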



&lt;h3&gt;
  
  
  2. Image Editing with Nova Canvas
&lt;/h3&gt;

&lt;p&gt;Before generating a video, users can enhance their images using AWS Bedrock Nova Canvas. The app supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inpainting&lt;/strong&gt;: A technique used in image processing to fill in missing or damaged parts of an image in a way that blends seamlessly with the surrounding areas. This process can reconstruct removed elements or extend the background of an image while preserving the visual coherence of textures, colors, and patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outpainting&lt;/strong&gt;: A technique used to expand an image beyond its original borders by generating new visual content that seamlessly extends the existing scene. Unlike inpainting, which fills in missing areas within an image, outpainting creatively imagines what might lie beyond the current frame, effectively "uncropping" it to add context or detail.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The method implementing this logic is included in the snippet below.&lt;br&gt;
The mask prompt tells Canvas which parts of the image to edit or preserve. For inpainting, the mask is what you want to change,&lt;br&gt;
whilst for outpainting the mask is what you want to keep.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;negative_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;main_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mask_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;operation_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Process image using Amazon Nova Canvas for inpainting or outpainting with sports focus&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Check if the image is sports-related
&lt;/span&gt;        &lt;span class="n"&gt;is_sports&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sports_classifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_sports_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;is_sports&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NOT_SPORTS_IMAGE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="c1"&gt;# Convert image bytes to base64
&lt;/span&gt;        &lt;span class="n"&gt;image_base64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Use default config if none provided
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DEFAULT_IMAGE_CONFIG&lt;/span&gt;

        &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;taskType&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;operation_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;imageGenerationConfig&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;# Add the appropriate parameters based on operation type
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;operation_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INPAINTING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inPaintingParams&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;main_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maskPrompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mask_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;negativeText&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;negative_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;image_base64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;operation_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OUTPAINTING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outPaintingParams&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;main_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maskPrompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mask_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;negativeText&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;negative_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;image_base64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bedrock_runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amazon.nova-canvas-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;accept&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;accept&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;contentType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content_type&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="n"&gt;base64_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response_body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;images&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;base64_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64_image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ascii&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;image_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base64_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error in Nova Canvas processing: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Video Generation with Nova Reel
&lt;/h3&gt;

&lt;p&gt;The core feature is video generation using AWS Bedrock Nova Reel. This generates dynamic videos using the context provided in the image and a video prompt that is auto-generated in the backend. Users can select from different marketing styles to influence the prompt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic action&lt;/li&gt;
&lt;li&gt;Athlete showcase&lt;/li&gt;
&lt;li&gt;Team spirit&lt;/li&gt;
&lt;li&gt;Fan experience&lt;/li&gt;
&lt;li&gt;Product in action&lt;/li&gt;
&lt;li&gt;Inspirational&lt;/li&gt;
&lt;/ul&gt;
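&lt;p&gt;One way such style selections can be translated into prompt fragments is a simple lookup table. The mapping below is purely illustrative; the app's actual templates live in the backend and will differ:&lt;/p&gt;

```python
# Illustrative mapping from marketing style to a prompt fragment;
# the real templates are defined in the backend and will differ.
STYLE_PROMPTS = {
    "Dynamic action": "fast cuts of athletes mid-motion, dramatic lighting",
    "Athlete showcase": "slow-motion close-ups highlighting a single athlete",
    "Team spirit": "celebratory group shots, synchronized movement",
    "Fan experience": "crowd energy, stadium atmosphere, waving banners",
    "Product in action": "the product in use during play, sharp focus",
    "Inspirational": "uplifting tone, golden-hour light, triumphant finish",
}

def style_fragment(style):
    """Look up the prompt fragment for a selected style, with a fallback."""
    return STYLE_PROMPTS.get(style, STYLE_PROMPTS["Dynamic action"])

print(style_fragment("Team spirit"))  # celebratory group shots, synchronized movement
```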

&lt;p&gt;The video generation request to Amazon Nova Reel is invoked asynchronously, and the results are stored in Amazon S3 for later retrieval. The video object is accessible to the user in the frontend through an auto-generated presigned URL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Amazon Nova Reel currently only allows a video duration of 6 seconds when using a text prompt with an image.&lt;/p&gt;
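&lt;p&gt;A sketch of how such an asynchronous request can be assembled is shown below. The field names follow the Amazon Nova Reel request shape, but the helper name, image payload, and bucket path are illustrative:&lt;/p&gt;

```python
def build_reel_request(prompt, image_base64=None, duration_seconds=6, fps=24):
    """Assemble a Nova Reel TEXT_VIDEO request body (field names per the
    Amazon Nova documentation; values here are illustrative)."""
    params = {"text": prompt}
    if image_base64 is not None:
        # With an input image, Nova Reel is limited to 6-second videos
        params["images"] = [{"format": "png", "source": {"bytes": image_base64}}]
    return {
        "taskType": "TEXT_VIDEO",
        "textToVideoParams": params,
        "videoGenerationConfig": {
            "durationSeconds": duration_seconds,
            "fps": fps,
            "dimension": "1280x720",
        },
    }

body = build_reel_request("Dynamic action shot of a tennis rally", image_base64="...")
# Submitted asynchronously with a boto3 Bedrock runtime client, writing
# the result to S3, e.g.:
# bedrock_runtime.start_async_invoke(
#     modelId="amazon.nova-reel-v1:0",
#     modelInput=body,
#     outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://<your-bucket>/videos/"}},
# )
```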

&lt;h3&gt;
  
  
  4. Prompt Enhancement
&lt;/h3&gt;

&lt;p&gt;The app uses a base prompt template and enhances it with sport-specific details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;enhance_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;marketing_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;brand&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sport_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Enhance the marketing prompt with the base Nova Reel prompt&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;enhanced_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NOVA_REEL_BASE_PROMPT&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;marketing_prompt&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;brand&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;enhanced_prompt&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;brand&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;sport_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;enhanced_prompt&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; in the context of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sport_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;enhanced_prompt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
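&lt;p&gt;As a usage illustration, with a hypothetical stand-in for &lt;code&gt;NOVA_REEL_BASE_PROMPT&lt;/code&gt; (the real base prompt is defined in the app and will differ):&lt;/p&gt;

```python
# Self-contained copy of enhance_prompt for illustration; the base
# prompt below is a made-up stand-in for the app's NOVA_REEL_BASE_PROMPT.
NOVA_REEL_BASE_PROMPT = "High-energy cinematic sports marketing video:"

def enhance_prompt(marketing_prompt, brand=None, sport_type=None):
    enhanced_prompt = NOVA_REEL_BASE_PROMPT.strip() + " " + marketing_prompt
    if brand:
        enhanced_prompt += f" for {brand}"
    if sport_type:
        enhanced_prompt += f" in the context of {sport_type}"
    return enhanced_prompt

print(enhance_prompt("dynamic action shots", brand="Acme", sport_type="Tennis"))
# High-energy cinematic sports marketing video: dynamic action shots for Acme in the context of Tennis
```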



&lt;h2&gt;
  
  
  Deploying the application to AWS
&lt;/h2&gt;

&lt;p&gt;To deploy this solution in your AWS environment, first clone the repository and install the Python dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;git clone https://github.com/ryankarlos/llm-use-cases.git
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;image_and_video

&lt;span class="nv"&gt;$ &lt;/span&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt; venv/bin/activate

pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; src/image_and_video/requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we need to deploy the AWS resources shown in the architecture diagram using Terragrunt and Terraform. You can install them by following the instructions &lt;a href="https://terragrunt.gruntwork.io/docs/getting-started/install/" rel="noopener noreferrer"&gt;here&lt;/a&gt; and &lt;a href="https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli#install-terraform" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;inputs&lt;/code&gt; and &lt;code&gt;locals&lt;/code&gt; blocks in the terragrunt.hcl file in the terragrunt folder define default variables which will be passed to the Terraform scripts during deployment of the resources.&lt;br&gt;
You can update some of the variables listed below (the username and email must be set to your own, so that you can reset the temporary password created by Cognito).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;locals {
  region = "us-east-1"
  username = &amp;lt;replace with your own username for cognito&amp;gt;
  email = &amp;lt;replace with your own email&amp;gt;
  ecr_repo_name = "canvas-video"
}

inputs = {
  hosted_zone_name = "awscommunity.com"
  ....
  bucket_name = "image-llm-example"
  subdomain = "nova-video"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now run the following Terragrunt commands to generate a plan of the resources to be deployed and then apply the changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;terragrunt
&lt;span class="nv"&gt;$ &lt;/span&gt;terragrunt plan
&lt;span class="nv"&gt;$ &lt;/span&gt;terragrunt apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj8d9knupa3z8gmtk4e9z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj8d9knupa3z8gmtk4e9z.png" alt=" " width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the resources are deployed, we need to build the Docker image with the app code and push it to ECR, from where it will be deployed to the ECS service. Execute the &lt;code&gt;ecr-build-push.sh&lt;/code&gt; script, which handles the Docker build and push process. Note that it uses the defaults below for the region and ECR repository name; if you have chosen different values, update these defaults or set the corresponding environment variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Configuration&lt;/span&gt;
&lt;span class="nv"&gt;AWS_REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;AWS_REGION&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
&lt;span class="nv"&gt;ECR_REPOSITORY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ECR_REPOSITORY&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="s2"&gt;"canvas-video"&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
&lt;span class="nv"&gt;IMAGE_TAG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;IMAGE_TAG&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="s2"&gt;"latest"&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
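&lt;p&gt;These defaults feed into the registry URI that the script tags and pushes to. A minimal sketch of how that URI is typically composed (the account ID below is a placeholder; the real script would resolve it, for example via &lt;code&gt;aws sts get-caller-identity&lt;/code&gt;):&lt;/p&gt;

```shell
AWS_REGION=${AWS_REGION:-"us-east-1"}
ECR_REPOSITORY=${ECR_REPOSITORY:-"canvas-video"}
IMAGE_TAG=${IMAGE_TAG:-"latest"}

# Placeholder account ID; a real script resolves it with:
#   aws sts get-caller-identity --query Account --output text
ACCOUNT_ID="123456789012"
ECR_URI="${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ECR_REPOSITORY}"

# The image is then built, tagged and pushed along these lines:
#   docker build -t "${ECR_REPOSITORY}:${IMAGE_TAG}" .
#   docker tag "${ECR_REPOSITORY}:${IMAGE_TAG}" "${ECR_URI}:${IMAGE_TAG}"
#   docker push "${ECR_URI}:${IMAGE_TAG}"
echo "${ECR_URI}:${IMAGE_TAG}"
```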



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Felwpilcl0dlunezu6lif.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Felwpilcl0dlunezu6lif.png" alt=" " width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the image is pushed, a manual step is needed to deploy the task in ECS. Navigate to the ECS console, open the service in the cluster that was deployed, click the &lt;code&gt;Update Service&lt;/code&gt; option and select &lt;code&gt;force new deployment&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gciptu8qmeu8a8hdrfa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gciptu8qmeu8a8hdrfa.png" alt=" " width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Wait a few minutes for the deployment. The tasks will transition from &lt;code&gt;PROVISIONING&lt;/code&gt; to &lt;code&gt;PENDING&lt;/code&gt; to &lt;code&gt;ACTIVATING&lt;/code&gt; and, if successful, finally show a &lt;code&gt;RUNNING&lt;/code&gt; status.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F496uh5p1szqttcbdlzes.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F496uh5p1szqttcbdlzes.png" alt=" " width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdhek7bz7n1qxdvh293m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdhek7bz7n1qxdvh293m.png" alt=" " width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Navigate to the Route 53 URL and you should see the Cognito login page. Enter your username and the temporary password sent to your email address; you will be prompted to reset the password. After resetting it, you should see the Streamlit application home page.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using the Application
&lt;/h2&gt;

&lt;p&gt;Upload the image using the upload image tab. This app uses Amazon Rekognition behind the scenes to label the image. If the image is not sports-related, it will throw an error.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4c2ltxbrfnyglg3asnuz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4c2ltxbrfnyglg3asnuz.png" alt=" " width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the image is sports-related, you will see the image uploaded and the tags. The app automatically identifies elements like "basketball," "soccer," or "athlete" in your image, helping tailor the marketing content to your specific sport. &lt;/p&gt;

&lt;p&gt;In the example below, an image of a tennis racquet with balls on a clay court has been uploaded.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxq5po94zplfnqypx8zik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxq5po94zplfnqypx8zik.png" alt=" " width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Image Processing options
&lt;/h3&gt;

&lt;p&gt;In the left tab, you will see different options for performing inpainting, outpainting, or no processing on the image.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;No Processing: Use your image as-is for video generation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Inpainting: Replace or modify specific areas within your image. Perfect for adding brand elements or removing unwanted objects. For example, in the image above, we can ask to replace the yellow balls (mask) with striped balls (processing prompt). We can set the negative prompt to &lt;code&gt;blur&lt;/code&gt; to prevent any blurring.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftlny6bomwibv1i7tnysr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftlny6bomwibv1i7tnysr.png" alt=" " width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftuk6kzku2nywl5ukxrph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftuk6kzku2nywl5ukxrph.png" alt=" " width="800" height="612"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Outpainting: Extend your image beyond its original boundaries. Great for creating more dynamic compositions or adding space for text. In the image above, we can set the mask as the racquet and balls, and set the processing prompt to garden to show an image of the racquet and balls in a garden instead of a clay court.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnb02flq9fxd3eckgfaqb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnb02flq9fxd3eckgfaqb.png" alt=" " width="800" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmv8yzrnxdt7yaawgcph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmv8yzrnxdt7yaawgcph.png" alt=" " width="800" height="612"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you choose inpainting or outpainting, you'll need to provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A main prompt describing what you want to add or modify. &lt;/li&gt;
&lt;li&gt;A mask prompt indicating which area to modify&lt;/li&gt;
&lt;li&gt;An optional negative prompt to specify what to avoid&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Image Video Generation
&lt;/h3&gt;

&lt;p&gt;In the sidebar, you'll find several options to customize your marketing video under the Video Settings options. Choose a Marketing Video Style template from the dropdown menu, such as "Dynamic Action," "Team Spirit," or "Product in Action," to set the tone of your video. You can also add your brand name to personalize the marketing message.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5lm85h9f2ultycuu920.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5lm85h9f2ultycuu920.png" alt=" " width="534" height="604"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The app uses these selections to craft a specialized prompt for the AI video generator, optimizing it for sports marketing content.&lt;br&gt;
The "Review Final Prompt" expander allows you to see and edit this final prompt. This is helpful if you want to fine-tune specific aspects of your marketing message. You can experiment with different combinations of image processing, marketing styles, and sport types to create the perfect sports marketing video for your brand.&lt;/p&gt;
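&lt;p&gt;A minimal sketch of how the sidebar selections could be combined into the final prompt; the helper and its wording are hypothetical, not the app's actual implementation:&lt;/p&gt;

```python
def build_marketing_prompt(style, sport, brand=None):
    # Hypothetical helper mirroring how the app might merge the
    # sidebar selections into a single video-generation prompt.
    prompt = f"{style} sports marketing video featuring {sport}"
    if brand:
        prompt += f", prominently showing the {brand} brand"
    return prompt

print(build_marketing_prompt("Dynamic Action", "football", brand="Acme"))
```

&lt;p&gt;Whatever this produces is what the "Review Final Prompt" expander lets you inspect and edit before generation.&lt;/p&gt;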

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz9lht5f74r4ujy1h7hhh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz9lht5f74r4ujy1h7hhh.png" alt=" " width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you're satisfied with your settings, click the "Generate Sports Marketing Video" button. The system will create a specialized marketing prompt based on your selections and generate a dynamic sports marketing video using Amazon Bedrock Nova Reel. During generation, you'll see progress updates. This process typically takes 3-5 minutes as the AI creates your custom video. The video is stored securely in the AWS S3 bucket that you selected when deploying the Terraform infrastructure.&lt;/p&gt;
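&lt;p&gt;Under the hood, Nova Reel video generation runs asynchronously. A hedged sketch of the two request structures involved, assuming the request shape described in the Bedrock documentation (the bucket URI and config values are illustrative, not the app's actual settings):&lt;/p&gt;

```python
def build_reel_request(prompt, s3_output_uri):
    # Model input for a text-to-video task; field names follow the
    # Nova Reel docs but should be checked against the API reference.
    model_input = {
        "taskType": "TEXT_VIDEO",
        "textToVideoParams": {"text": prompt},
        "videoGenerationConfig": {
            "durationSeconds": 6,
            "fps": 24,
            "dimension": "1280x720",
        },
    }
    # Where Bedrock writes the finished video.
    output_config = {"s3OutputDataConfig": {"s3Uri": s3_output_uri}}
    return model_input, output_config

model_input, output_config = build_reel_request(
    "Dynamic Action sports marketing video", "s3://my-marketing-videos/"
)
```

&lt;p&gt;These two structures would be passed as modelInput and outputDataConfig to the Bedrock runtime's start_async_invoke call, and the job polled with get_async_invoke until the finished video appears in the S3 bucket.&lt;/p&gt;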

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftuurxltz7bmoo7hdu838.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftuurxltz7bmoo7hdu838.png" alt=" " width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is also accessible from the frontend through a presigned URL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Improvements
&lt;/h2&gt;

&lt;p&gt;There are several ways this application could be enhanced:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multi-image input&lt;/strong&gt; - Allow users to upload multiple images for more dynamic videos&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom audio&lt;/strong&gt; - Add options for background music or voiceovers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video templates&lt;/strong&gt; - More specialized templates for different sports and marketing needs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analytics integration&lt;/strong&gt; - Track video performance metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch processing&lt;/strong&gt; - Generate multiple videos at once&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The Sports Marketing Video Generator demonstrates how Amazon Nova's generative AI capabilities can transform sports marketing. By combining image classification, image editing, and video generation in a simple interface, marketers can create professional videos in minutes instead of days.&lt;/p&gt;

&lt;p&gt;This project showcases the power of combining multiple AWS services (Bedrock, Rekognition, Cognito, ECS, S3) with a user-friendly frontend (Streamlit) to solve real business problems. This approach not only democratizes video creation but also enables marketing teams to produce more content, faster, and at a lower cost—all while maintaining brand consistency and quality standards.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/exploring-creative-possibilities-a-visual-guide-to-amazon-nova-canvas/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/machine-learning/exploring-creative-possibilities-a-visual-guide-to-amazon-nova-canvas/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/aws/amazon-nova-reel-1-1-featuring-up-to-2-minutes-multi-shot-videos/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/aws/amazon-nova-reel-1-1-featuring-up-to-2-minutes-multi-shot-videos/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/aws-samples/deploy-streamlit-app" rel="noopener noreferrer"&gt;https://github.com/aws-samples/deploy-streamlit-app&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>bedrock</category>
      <category>streamlit</category>
      <category>ecs</category>
    </item>
    <item>
      <title>Data Transfer from S3 to Cloud Storage using GCP Storage Transfer Service</title>
      <dc:creator>Ryan Nazareth</dc:creator>
      <pubDate>Sat, 04 Jan 2025 03:57:02 +0000</pubDate>
      <link>https://forem.com/ryankarlos/data-transfer-from-s3-to-cloud-storage-using-gcp-storage-transfer-service-25m</link>
      <guid>https://forem.com/ryankarlos/data-transfer-from-s3-to-cloud-storage-using-gcp-storage-transfer-service-25m</guid>
      <description>&lt;p&gt;Storage Transfer Service automates the transfer of data to, from, and between object and file storage systems, including Google Cloud Storage, Amazon S3, Azure Storage, on-premises data, and more. It can be used to transfer large amounts of data quickly and reliably, without the need to write any code. Depending on your source type, you can easily create and run Google-managed transfers, or configure self-hosted transfers that give you full control over network routing and bandwidth usage. Storage transfer service only allows transfer into GCP and does not support bi-directional transfer e.g. from GCP to AWS.&lt;/p&gt;

&lt;p&gt;In this blog, we will demonstrate how to create a one-off storage transfer job to transfer data from an S3 bucket to GCP Cloud Storage. In addition, we will also demonstrate how to set up an event-driven transfer job that transfers objects by continuously listening to event notifications for objects added or modified in the source S3 bucket.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before you begin, make sure you have the following prerequisites:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A GCP account with the necessary permissions to create and manage storage buckets and transfer jobs.&lt;/li&gt;
&lt;li&gt;An AWS account with the necessary permissions to create and manage S3 buckets.&lt;/li&gt;
&lt;li&gt;The AWS CLI installed and configured on your local machine.&lt;/li&gt;
&lt;li&gt;The gcloud CLI installed and configured on your local machine.&lt;/li&gt;
&lt;li&gt;The necessary IAM roles and permissions set up in both AWS and GCP.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Create a source S3 bucket &lt;code&gt;demo-s3-transfer&lt;/code&gt; and destination cloud storage bucket &lt;code&gt;demo-storage-transfer&lt;/code&gt;. In the source S3 bucket, we will upload some parquet files in a prefix &lt;code&gt;2024/12&lt;/code&gt;. We will be transferring the parquet files in this prefix into the &lt;code&gt;demo-storage-transfer&lt;/code&gt; bucket. &lt;/p&gt;

&lt;h2&gt;
  
  
  Storage Transfer REST API
&lt;/h2&gt;

&lt;p&gt;Storage Transfer Service uses a Google-managed service account to move your data. This service account is automatically created the first time you create a transfer job, call &lt;code&gt;googleServiceAccounts.get&lt;/code&gt;, or visit the job creation page in the Google Cloud console. The service account's format is typically &lt;code&gt;project-PROJECT_NUMBER@storage-transfer-service.iam.gserviceaccount.com&lt;/code&gt;. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We can use the &lt;code&gt;googleServiceAccounts.get&lt;/code&gt; method to retrieve the managed Google service account that is used by Storage Transfer Service to access buckets in the project where transfers run or in other projects. Each Google service account is associated with one Google Cloud project. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Navigate to the googleServiceAccounts.get reference page &lt;a href="https://cloud.google.com/storage-transfer/docs/reference/rest/v1/googleServiceAccounts/get" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;br&gt;
On the right, a window will open where you can enter the project ID under the request parameters. Executing this will return the subjectId in the response, along with the storage transfer account email. Keep a note of the subject ID and the Google-managed service account email, as we will need them in later sections.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fte3q7o5dp2oxjly7w07l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fte3q7o5dp2oxjly7w07l.png" alt="Image description" width="800" height="621"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alternatively, we can do the same via the CLI, using a curl command and passing the bearer token in the header.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; GET &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud auth print-access-token&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-goog-user-project: &amp;lt;project-id&amp;gt;"&lt;/span&gt; https://storagetransfer.googleapis.com/v1/googleServiceAccounts/&amp;lt;project-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The x-goog-user-project header key is required to set the default project quota for the request; see the &lt;a href="https://cloud.google.com/docs/authentication/adc-troubleshooting/user-creds" rel="noopener noreferrer"&gt;troubleshooting guide&lt;/a&gt;. If excluded, you may get the following error: &lt;code&gt;The storagetransfer.googleapis.com API requires a quota project, which is not set by default&lt;/code&gt;&lt;/p&gt;
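&lt;p&gt;The same lookup can also be done from Python with the google-cloud-storage-transfer client. The call is shown as comments since it needs GCP credentials to execute, and the project ID is a placeholder:&lt;/p&gt;

```python
# Request body for the googleServiceAccounts.get equivalent in the
# Python client; "my-gcp-project" is a hypothetical project ID.
request = {"project_id": "my-gcp-project"}

# from google.cloud import storage_transfer
# client = storage_transfer.StorageTransferServiceClient()
# account = client.get_google_service_account(request=request)
# print(account.account_email, account.subject_id)
```

&lt;p&gt;The returned object carries both the service account email and the subjectId needed for the AWS trust policy in the next section.&lt;/p&gt;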

&lt;h2&gt;
  
  
  AWS IAM role permissions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;In the AWS console, navigate to IAM and create a new role. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select Custom trust policy and paste the following trust policy.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"Federated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"accounts.google.com"&lt;/span&gt;&lt;span class="w"&gt;

            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts:AssumeRoleWithWebIdentity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"accounts.google.com:sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;&amp;lt;subject-id&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt; Replace the &lt;code&gt;&amp;lt;subject-id&amp;gt;&lt;/code&gt; value with the subjectId of the Google-managed service account that you retrieved in the previous section using the &lt;code&gt;googleServiceAccounts.get&lt;/code&gt; reference page. It should look like the screenshot below.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ggnahjgnp1kkesz82th.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ggnahjgnp1kkesz82th.png" alt="Image description" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; Paste the following JSON policy to grant the role permissions to list the bucket and get objects from the S3 bucket.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"s3:GetObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"s3:ListBucket"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3katrx42n72jq4n49zcc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3katrx42n72jq4n49zcc.png" alt="Image description" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; Once the role is created, note down the ARN value, which will be passed to Storage Transfer Service when initiating the transfer programmatically in Python.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Transfer permissions in GCP
&lt;/h2&gt;

&lt;p&gt;The GCP service account used to create the transfer job will need to be granted the Storage Transfer User role (&lt;code&gt;roles/storagetransfer.user&lt;/code&gt;) and &lt;code&gt;roles/iam.roleViewer&lt;/code&gt;. In addition, we need to give the Google-managed service account retrieved in the previous section access to the resources needed to complete transfers. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; Navigate to the Cloud Storage Bucket &lt;code&gt;demo-storage-transfer&lt;/code&gt;. In the permissions tab, click grant access. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1o3q89u14zaq8gh361f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1o3q89u14zaq8gh361f.png" alt="Image description" width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the new window, enter the principal as the managed gcp transfer service email. Assign the Storage Admin Role.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtwnoihuxprbqsr2d6hu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtwnoihuxprbqsr2d6hu.png" alt="Image description" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Create one-off batch Storage Transfer Job
&lt;/h2&gt;

&lt;p&gt;We can interact with Storage Transfer Service programmatically with Python. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Copy this &lt;a href="https://github.com/ryankarlos/GCP_patterns/tree/master/storage_transfer_service" rel="noopener noreferrer"&gt;folder&lt;/a&gt;, which contains the requirements.txt and the script for initiating the storage transfer job, checking its status and verifying completion.&lt;/li&gt;
&lt;li&gt;In a command-line terminal window, run &lt;code&gt;pip install -r requirements.txt&lt;/code&gt; to install the google-cloud-storage-transfer and cloud-storage libraries.&lt;/li&gt;
&lt;li&gt;If you use a service account JSON key, set the environment variable &lt;strong&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/strong&gt; to the path of this file. Otherwise, use one of the other &lt;a href="https://cloud.google.com/docs/authentication" rel="noopener noreferrer"&gt;GCP authentication options&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;Now, run the following command to execute the storage_transfer_batch.py job script in a terminal of your choosing. This will transfer the data from the 2024/12 prefix in the S3 bucket to the GCP bucket under a Data prefix. We pass in the ARN of the role we created earlier, which will be assumed during the transfer to generate temporary credentials with the required permissions.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python python/storage_transfer.py &lt;span class="nt"&gt;--gcp_project_id&lt;/span&gt; &amp;lt;your-gcp-project-id&amp;gt; &lt;span class="nt"&gt;--gcp_bucket&lt;/span&gt; &amp;lt;your-gcp-bucket&amp;gt; &lt;span class="nt"&gt;--s3_bucket&lt;/span&gt; &amp;lt;your-s3-bucket&amp;gt; &lt;span class="nt"&gt;--s3_prefix&lt;/span&gt; &amp;lt;s3-prefix&amp;gt; &lt;span class="nt"&gt;--gcp_prefix&lt;/span&gt; &amp;lt;gcp-prefix&amp;gt; &lt;span class="nt"&gt;--role_arn&lt;/span&gt; &amp;lt;aws-role-arn&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
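&lt;p&gt;Internally, a script like this builds a transfer job specification and submits it through the google-cloud-storage-transfer client. A hedged sketch of what that job body might look like for our buckets; field names assume the library's documented TransferJob shape, the project ID and role ARN are placeholders, and the create call is shown as a comment since it needs credentials:&lt;/p&gt;

```python
transfer_job = {
    "project_id": "my-gcp-project",  # hypothetical project ID
    "status": "ENABLED",
    "transfer_spec": {
        "aws_s3_data_source": {
            "bucket_name": "demo-s3-transfer",
            "path": "2024/12/",
            # Role from the AWS IAM section; assumed during the transfer
            # to mint temporary credentials.
            "role_arn": "arn:aws:iam::123456789012:role/gcp-transfer-role",
        },
        "gcs_data_sink": {
            "bucket_name": "demo-storage-transfer",
            "path": "Data/",
        },
    },
}

# from google.cloud import storage_transfer
# client = storage_transfer.StorageTransferServiceClient()
# job = client.create_transfer_job({"transfer_job": transfer_job})
# client.run_transfer_job({"job_name": job.name,
#                          "project_id": transfer_job["project_id"]})
```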



&lt;ul&gt;
&lt;li&gt;You should see the logs as in the screenshot below. Wait for the job to show as completed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprhn6qj6g04wykjo8b4g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprhn6qj6g04wykjo8b4g.png" alt="Image description" width="800" height="98"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Navigate to the cloud storage bucket and you should see the data in the bucket under the Data prefix.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsh4ktgwy8gqmy18hbag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsh4ktgwy8gqmy18hbag.png" alt="Image description" width="800" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can monitor and check your transfer jobs from the Google Cloud Console UI.  Open the Google Cloud Console and navigate to "Transfer Service". The jobs executed will be listed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy86bkc2elvy8gvc2tl9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy86bkc2elvy8gvc2tl9w.png" alt="Image description" width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the monitoring tab, we can see plots for performance metrics (bytes transferred, objects processed, transfer rate etc).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr9h72ile6crz3nquxyrd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr9h72ile6crz3nquxyrd.png" alt="Image description" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the operations and configuration tabs, we can get more details regarding the transfer, e.g. run history, data transferred, and the other configuration details we set for the transfer job.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bfe29uy9gh6kqlia3bz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bfe29uy9gh6kqlia3bz.png" alt="Image description" width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Create event driven transfer job
&lt;/h2&gt;

&lt;p&gt;Event-driven transfers listen to Amazon S3 Event Notifications sent to Amazon SQS to know when objects in the source bucket have been modified or added. &lt;/p&gt;

&lt;h3&gt;
  
  
  Create an SQS queue in AWS
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;In the AWS management console, go to the SQS service, click on "Create queue" and provide a name for the queue.&lt;/li&gt;
&lt;li&gt;In the Access policy section, select Advanced. A JSON object is displayed. Paste the policy below, replacing the values for &lt;code&gt;&amp;lt;SQS-RESOURCE-ARN&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;AWS-ACCOUNT-ID&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;S3_BUCKET_ARN&amp;gt;&lt;/code&gt;. This will only permit the SQS:SendMessage action on the SQS queue from the S3 bucket in the AWS account.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"example-ID"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"example-statement-ID"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Service"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3.amazonaws.com"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SQS:SendMessage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;&amp;lt;SQS-RESOURCE-ARN&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"aws:SourceAccount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;&amp;lt;AWS-ACCOUNT-ID&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"ArnLike"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"aws:SourceArn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;&amp;lt;S&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="err"&gt;_BUCKET_ARN&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvw48w6s4jvh1txr2fhdf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvw48w6s4jvh1txr2fhdf.png" alt="Image description" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we need to enable notifications in the S3 bucket, setting the SQS queue as destination.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Navigate to the S3 bucket and select the Properties tab. In the Event notifications section, click Create event notification.&lt;/li&gt;
&lt;li&gt;Specify a name for this event. In the Event types section, select "All object create events", as in the screenshot below.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qumxkqx15ppizkur23w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qumxkqx15ppizkur23w.png" alt="Image description" width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;As the Destination select SQS queue and select the queue you created previously.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcx5stfr8k67aq14173z1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcx5stfr8k67aq14173z1.png" alt="Image description" width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Create an event driven Storage transfer job
&lt;/h2&gt;

&lt;p&gt;We will now use the GCP cloud console to create an event-driven transfer job. Navigate to the GCP Transfer Service page and click Create transfer job.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select Amazon S3 as the source type, and Cloud Storage as the destination.&lt;/li&gt;
&lt;li&gt;For the Scheduling mode select Event-driven and click Next.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxg23occ1tu5zwy5qxcn0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxg23occ1tu5zwy5qxcn0.png" alt="Image description" width="800" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enter the S3 bucket name. We will use the same bucket we used previously for the one-off transfer but you can use a different one if you wish.&lt;/li&gt;
&lt;li&gt;Enter the Amazon SQS queue ARN that you created earlier, as in the screenshot below.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flat0i8yz7fq3rthiqyb5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flat0i8yz7fq3rthiqyb5.png" alt="Image description" width="800" height="669"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select the destination Cloud Storage bucket path (which can optionally include a prefix) as in the screenshot below.&lt;/li&gt;
&lt;li&gt;Leave the rest of the options as defaults and click Create.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzgwcjtvgt3aexi9hvvl5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzgwcjtvgt3aexi9hvvl5.png" alt="Image description" width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The transfer job starts running, and an event listener waits for notifications on the SQS queue.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6o97qg5mzbmc7qjaadu6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6o97qg5mzbmc7qjaadu6.png" alt="Image description" width="800" height="236"&gt;&lt;/a&gt;&lt;/p&gt;
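&lt;p&gt;If you prefer automation over the console, the same event-driven job can be created through the Storage Transfer Service API. The sketch below builds the request body for the &lt;code&gt;transferJobs.create&lt;/code&gt; method; the project id, bucket names, and queue ARN are placeholder values, and the field names follow my reading of the REST resource, so check them against the current API reference before use.&lt;/p&gt;

```python
import json

# Placeholder identifiers -- substitute your own values.
transfer_job = {
    "description": "Event-driven replication from S3 to GCS",
    "status": "ENABLED",
    "projectId": "my-gcp-project",
    "transferSpec": {
        "awsS3DataSource": {"bucketName": "my-source-s3-bucket"},
        "gcsDataSink": {"bucketName": "my-destination-gcs-bucket"},
    },
    # The presence of eventStream (pointing at the SQS queue ARN) is what
    # makes the job event-driven rather than scheduled.
    "eventStream": {
        "name": "arn:aws:sqs:us-east-1:123456789012:gcp-transfer-queue"
    },
}

print(json.dumps(transfer_job, indent=2))
```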

&lt;p&gt;We can test this by putting some data into the S3 bucket source location and observing the objects being replicated from the S3 bucket to the GCS bucket. You can also view monitoring details in the SQS queue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qcyiarrfgwjq2nnv3kq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qcyiarrfgwjq2nnv3kq.png" alt="Image description" width="800" height="606"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;GCP's Storage Transfer Service is a powerful tool for transferring data from S3 to GCS, offering a cost-effective, scalable, and secure solution for data migration with flexible scheduling and data filtering options. In this practical blog, we walked through the steps required to set up both one-off and event-driven transfers, so you can migrate your data from S3 to GCS with minimal effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/storage-transfer/docs/create-transfers/agentless/s3" rel="noopener noreferrer"&gt;https://cloud.google.com/storage-transfer/docs/create-transfers/agentless/s3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/storage-transfer/docs/libraries" rel="noopener noreferrer"&gt;https://cloud.google.com/storage-transfer/docs/libraries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/storage-transfer/docs/source-amazon-s3" rel="noopener noreferrer"&gt;https://cloud.google.com/storage-transfer/docs/source-amazon-s3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/storage-transfer/docs/iam-cloud" rel="noopener noreferrer"&gt;https://cloud.google.com/storage-transfer/docs/iam-cloud&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>googlecloud</category>
      <category>aws</category>
      <category>cloudstorage</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>AWS Neptune for analysing event ticket sales between users - Part 2</title>
      <dc:creator>Ryan Nazareth</dc:creator>
      <pubDate>Mon, 29 May 2023 21:41:05 +0000</pubDate>
      <link>https://forem.com/aws-builders/aws-neptune-for-analysing-event-ticket-sales-between-users-part-2-3i5g</link>
      <guid>https://forem.com/aws-builders/aws-neptune-for-analysing-event-ticket-sales-between-users-part-2-3i5g</guid>
      <description>&lt;p&gt;This blog follows on from the setup of Neptune DB with Worldwide Events data in the &lt;a href="https://dev.to/aws-builders/aws-neptune-for-analysing-event-ticket-sales-between-users-part-1-4ag"&gt;first part&lt;/a&gt;. Here we will run some queries and investigate node relationships in the Neptune Notebook. &lt;br&gt;
We can use some of the magic commands available in the &lt;a href="https://github.com/aws/graph-notebook" rel="noopener noreferrer"&gt;graph notebook&lt;/a&gt; project, open-sourced by AWS and available in the Neptune Notebook instance configured in the &lt;a href="https://dev.to/aws-builders/aws-neptune-for-analysing-event-ticket-sales-between-users-part-1-4ag"&gt;first part&lt;/a&gt; of this blog. We can inspect the visualisation options used when a query is executed with the &lt;code&gt;%graph_notebook_vis_options&lt;/code&gt; command. This will output a JSON document containing the default configuration options for rendering the graphs. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbib7arux0ht2dy7hebje.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbib7arux0ht2dy7hebje.png" alt="Image description" width="800" height="556"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To modify the executing notebook's &lt;a href="https://visjs.github.io/vis-network/docs/network/#options" rel="noopener noreferrer"&gt;vis.js options&lt;/a&gt;, we can use &lt;code&gt;%%graph_notebook_vis_options&lt;/code&gt; with the modified JSON payload provided in the cell body. For example, in the screenshot below I have switched the &lt;a href="https://visjs.github.io/vis-network/docs/network/physics.html" rel="noopener noreferrer"&gt;physics solver&lt;/a&gt; from &lt;strong&gt;barnesHut&lt;/strong&gt; to  &lt;strong&gt;forceAtlas2Based&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi0d0cbakku2vd7zxg13e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi0d0cbakku2vd7zxg13e.png" alt="Image description" width="800" height="967"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will also create a mapping between node label and property used to label the node in the visualisation. Run the following in the notebook cell.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;display_var = '{"user":"name","event":"name"}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
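&lt;p&gt;The value assigned here is just a JSON string mapping each node label to the property used as its caption, so it is worth checking that it parses cleanly (a quick local check, not Neptune-specific):&lt;/p&gt;

```python
import json

# Same mapping as in the notebook cell above: caption the user and event
# nodes by their "name" property.
display_var = '{"user": "name", "event": "name"}'

mapping = json.loads(display_var)  # raises ValueError if not valid JSON
print(mapping)
```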



&lt;p&gt;We will reference the value of this variable as &lt;code&gt;$display_var&lt;/code&gt; when running the first query. The &lt;code&gt;%%oc&lt;/code&gt; magic command indicates that we want to execute an openCypher query. The &lt;code&gt;-d&lt;/code&gt; hint enables the display mappings defined above, so we pass it &lt;code&gt;$display_var&lt;/code&gt;. The &lt;code&gt;-l&lt;/code&gt; hint sets the maximum length of text that can be displayed in a node, here 20 characters. The cypher query will return all the connected nodes (along with the edges) in the database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;oc&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="n"&gt;$display_var&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;l20&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;((&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="ss"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, we see the results in text form in the console tab.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdqqlph9d8xuyy4d3mv6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdqqlph9d8xuyy4d3mv6.png" alt="Image description" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we switch to the graph tab, we can see the rendered visualisation. We can zoom in and out using the +/- icons, or move the nodes by clicking and dragging. Zooming into a cluster of nodes and relationships should also show the associated labels. Hovering over any node should also show its label.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F58uqqn2qsl043a3p57bz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F58uqqn2qsl043a3p57bz.png" alt="Image description" width="800" height="583"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Users from New York
&lt;/h4&gt;

&lt;p&gt;Now let's find all users (buyers and sellers) who are from New York. We will use the &lt;code&gt;user&lt;/code&gt; label to match only the user nodes, and then use the &lt;code&gt;WHERE&lt;/code&gt; clause to filter for nodes whose &lt;code&gt;city&lt;/code&gt; property is New York.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;oc&lt;/span&gt; 
&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;t:&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;t.city&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"New York"&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Finxsdjviapy3gl5zwhgx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Finxsdjviapy3gl5zwhgx.png" alt="Image description" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Users buying and selling for events in Toronto
&lt;/h4&gt;

&lt;p&gt;To find paths containing users who listed and bought tickets for events in Toronto, we can use the following query to match the path &lt;code&gt;(n:user)-[]-(e:event)-[]-(u:user)&lt;/code&gt; and then filter the &lt;code&gt;city&lt;/code&gt; property of the event node to &lt;code&gt;Toronto&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;oc&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="n"&gt;$display_var&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;l20&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;n:&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;e:&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;u:&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;e.city&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Toronto"&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0ev7bpvd99dsv7diu70.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0ev7bpvd99dsv7diu70.png" alt="Image description" width="800" height="636"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Match event property directly in path
&lt;/h4&gt;

&lt;p&gt;Let us now match all sellers and buyers of tickets to the &lt;code&gt;The Police&lt;/code&gt; event. Instead of using the &lt;code&gt;WHERE&lt;/code&gt; clause after &lt;code&gt;MATCH&lt;/code&gt;, we can filter the required paths directly by specifying the property &lt;code&gt;name&lt;/code&gt; as &lt;code&gt;The Police&lt;/code&gt; on the event node in the &lt;code&gt;MATCH&lt;/code&gt; pattern.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;oc&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="n"&gt;$display_var&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;seller:&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;--&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;name:&lt;/span&gt; &lt;span class="s1"&gt;'The Police'&lt;/span&gt;&lt;span class="ss"&gt;})&lt;/span&gt;&lt;span class="o"&gt;--&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;buyer:&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It looks like we have two events for which tickets listed by a user were purchased by another user. However, we may need more granular information about the transactions and whether all the tickets listed by the seller were bought.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gd3yuozhiyt5rqe1qpz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gd3yuozhiyt5rqe1qpz.png" alt="Image description" width="800" height="573"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can then return the associated properties of the relationships as separate columns in a table. The previous query can be modified to return the properties and the relationship type instead of the path, aliasing the names (which become the table column names). We also do not need to match the full path (explicitly, the directions between users and events) as in the previous query, since we are interested in all relationship types connected to the &lt;code&gt;The Police&lt;/code&gt; event node(s). This output will not give the option of displaying a graph in the widget, as we have not returned a path.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;oc&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="n"&gt;$display_var&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="ss"&gt;()&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;name:&lt;/span&gt; &lt;span class="s1"&gt;'The Police'&lt;/span&gt;&lt;span class="ss"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;e.date&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
&lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;e.quantity&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;number_of_tickets&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;e.price&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79v40wt7oz63hfpoeewy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79v40wt7oz63hfpoeewy.png" alt="Image description" width="800" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Path length and hops
&lt;/h4&gt;

&lt;p&gt;Here we will try to find a user who has listed a ticket for the event &lt;code&gt;Mary Poppins&lt;/code&gt; and is at least 11 hops away from any other user node. In the cypher query below, we match a user node with a directed relationship to an event node with the property name &lt;code&gt;Mary Poppins&lt;/code&gt;. Since this already accounts for the first hop, we need to match the remaining minimum of 10 hops from the event node to any other user node. We can achieve this using &lt;a href="https://neo4j.com/docs/cypher-manual/current/syntax/patterns/#cypher-pattern-varlength" rel="noopener noreferrer"&gt;variable length pattern matching&lt;/a&gt; in cypher, which allows us to specify a range of lengths in the relationship description of a pattern. Here we use a lower bound for the range followed by an ellipsis (&lt;code&gt;*10..&lt;/code&gt;) to signify no upper bound. Finally, we return the username.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;u:&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;--&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;name:&lt;/span&gt; &lt;span class="s1"&gt;'Mary Poppins'&lt;/span&gt;&lt;span class="ss"&gt;})&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mf"&gt;10.&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;:user&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;u.name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns user &lt;code&gt;QRG30DIY&lt;/code&gt;. Now let's return the path so we can visualise who this person is connected to.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0qj0jdnrjz3n4qwom42.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0qj0jdnrjz3n4qwom42.png" alt="Image description" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can modify the query to match the user &lt;code&gt;QRG30DIY&lt;/code&gt; who listed the ticket for &lt;code&gt;Mary Poppins&lt;/code&gt; event and then return all subsequent relationships and nodes connected any number of hops away from the &lt;code&gt;Mary Poppins&lt;/code&gt; event node (using the &lt;code&gt;*&lt;/code&gt; notation).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;u:&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;name:&lt;/span&gt;&lt;span class="s1"&gt;'QRG30DIY'&lt;/span&gt;&lt;span class="ss"&gt;})&lt;/span&gt;&lt;span class="o"&gt;--&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;name:&lt;/span&gt; &lt;span class="s1"&gt;'Mary Poppins'&lt;/span&gt;&lt;span class="ss"&gt;})&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user node &lt;code&gt;QRG30DIY&lt;/code&gt; is highlighted in the visual below. If we count the number of connections from this node, there are two paths with at least 11 hops, each ending at a user node with a single connection to the &lt;code&gt;Macbeth&lt;/code&gt; event node. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ze98dhoypehwgzkd3wn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ze98dhoypehwgzkd3wn.png" alt="Image description" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Deleting Resources
&lt;/h3&gt;

&lt;p&gt;Once you have finished with the queries and analysis, you will need to delete the Neptune DB instance and Redshift Serverless namespace (and associated workgroup). The Neptune DB instance can be deleted with or without final snapshot by following the instructions in the &lt;a href="https://docs.aws.amazon.com/neptune/latest/userguide/manage-console-instances-delete.html" rel="noopener noreferrer"&gt;docs&lt;/a&gt;. Then delete the Redshift Serverless workgroup by following the steps &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/serverless_delete-workgroup.html" rel="noopener noreferrer"&gt;here&lt;/a&gt; followed by the &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/serverless-console-namespace-delete.html" rel="noopener noreferrer"&gt;namespace&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>networks</category>
      <category>neptune</category>
      <category>visualisation</category>
      <category>cypher</category>
    </item>
    <item>
      <title>AWS Neptune for analysing event ticket sales between users - Part 1</title>
      <dc:creator>Ryan Nazareth</dc:creator>
      <pubDate>Mon, 29 May 2023 21:39:57 +0000</pubDate>
      <link>https://forem.com/aws-builders/aws-neptune-for-analysing-event-ticket-sales-between-users-part-1-4ag</link>
      <guid>https://forem.com/aws-builders/aws-neptune-for-analysing-event-ticket-sales-between-users-part-1-4ag</guid>
<description>&lt;p&gt;This is the first of a two-part blog series, where we will walk through the setup for using AWS Neptune to analyse a property graph modelled from the &lt;a href="https://aws.amazon.com/marketplace/pp/prodview-4ozlpl4r3k7cg" rel="noopener noreferrer"&gt;Worldwide Event Attendance&lt;/a&gt; dataset from AWS Marketplace Data Exchange, which is free to subscribe to. This contains data for user ticket purchases and sales for fictional daily events (operas, plays, pop concerts etc.) across 2008 in the USA. This data is accessible from Redshift, so part of this setup will involve loading the data in the required format from Redshift to an S3 bucket, and then loading it into a Neptune DB instance for running queries and generating visualisations in &lt;a href="https://dev.to/aws-builders/aws-neptune-for-analysing-event-ticket-sales-between-users-part-2-3i5g"&gt;Part 2&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up the Neptune Cluster and Notebook
&lt;/h2&gt;

&lt;p&gt;First we will need to create the Neptune cluster and database instance. I have configured this from the AWS console, following the steps in the &lt;a href="https://docs.aws.amazon.com/neptune/latest/userguide/manage-console-launch-console.html" rel="noopener noreferrer"&gt;docs&lt;/a&gt;, but this could also be automated via one of the CloudFormation templates &lt;a href="https://docs.aws.amazon.com/neptune/latest/userguide/get-started-cfn-create.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For the &lt;strong&gt;Engine&lt;/strong&gt; options, select provisioned mode and the latest version of Neptune&lt;/li&gt;
&lt;li&gt;For &lt;strong&gt;Settings&lt;/strong&gt;, select the &lt;strong&gt;Development and testing&lt;/strong&gt; option rather than Production, as this will give us the option to select the cheaper burstable class (db.t3.medium).&lt;/li&gt;
&lt;li&gt;We will not create any Neptune replicas in different availability zones so click &lt;strong&gt;No&lt;/strong&gt; for &lt;strong&gt;Multi-AZ Deployment&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhz8tlrye2lzytr2mpi0c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhz8tlrye2lzytr2mpi0c.png" alt="Image description" width="800" height="1273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For the &lt;strong&gt;Connectivity&lt;/strong&gt; option, I have selected my default VPC, for which I already have a security group configured with an inbound rule allowing access on any port, with the existing security group id as the source. Alternatively, you could add another custom rule to only allow inbound traffic to the specific default port for Neptune (8182). &lt;/li&gt;
&lt;li&gt;You can also choose to create a new VPC and new security group if you do not want to use the existing ones.
&lt;/li&gt;
&lt;li&gt;We will configure the notebook separately after creating the cluster, so skip the &lt;strong&gt;Notebook configuration&lt;/strong&gt; option.&lt;/li&gt;
&lt;li&gt;You can either skip the &lt;strong&gt;Additional configuration&lt;/strong&gt; option and accept the defaults, which enable deletion protection, encryption at rest and auto minor version upgrades or disable the options you do not want.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foz5ygdktps0tihd6yzs0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foz5ygdktps0tihd6yzs0.png" alt="Image description" width="800" height="1255"&gt;&lt;/a&gt;&lt;/p&gt;
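&lt;p&gt;If you opt for the narrower rule, the inbound permission for Neptune's default port can be sketched as the &lt;code&gt;IpPermissions&lt;/code&gt; structure accepted by &lt;code&gt;aws ec2 authorize-security-group-ingress&lt;/code&gt;. The CIDR range and group id below are placeholders; restrict them to your own network.&lt;/p&gt;

```python
import json

# Allow TCP traffic to Neptune's default port (8182) only.
ingress_rule = {
    "IpProtocol": "tcp",
    "FromPort": 8182,
    "ToPort": 8182,
    "IpRanges": [{"CidrIp": "10.0.0.0/16", "Description": "Neptune access"}],
}

# Save the list to ingress.json and apply it with (group id is a placeholder):
#   aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0
#       --ip-permissions file://ingress.json
print(json.dumps([ingress_rule], indent=2))
```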

&lt;p&gt;We will now configure a &lt;a href="https://docs.aws.amazon.com/neptune/latest/userguide/graph-notebooks.html" rel="noopener noreferrer"&gt;Neptune graph notebook&lt;/a&gt; to access the cluster, so we can run queries and generate interactive visualisations. &lt;strong&gt;Neptune Workbench&lt;/strong&gt; provides a fully managed Jupyter notebook environment in Sagemaker, running the latest release of the open source &lt;a href="https://github.com/aws/graph-notebook" rel="noopener noreferrer"&gt;graph-notebook project&lt;/a&gt;. This has the benefit of offering built-in capabilities like &lt;a href="https://docs.aws.amazon.com/neptune/latest/userguide/notebooks-visualization.html" rel="noopener noreferrer"&gt;visualisation of queries&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click &lt;strong&gt;Notebooks&lt;/strong&gt; from the navigation pane on the left and select &lt;strong&gt;Create notebook&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;In the Cluster list, choose your Neptune DB cluster. If you don't yet have a DB cluster, choose &lt;strong&gt;Create cluster&lt;/strong&gt; to create one.&lt;/li&gt;
&lt;li&gt;For &lt;strong&gt;Notebook instance type&lt;/strong&gt;, select &lt;strong&gt;ml.t3.medium&lt;/strong&gt; which should be sufficient for this example.&lt;/li&gt;
&lt;li&gt;Under &lt;strong&gt;IAM role name&lt;/strong&gt;, select &lt;strong&gt;create an IAM role&lt;/strong&gt; for the notebook, and enter a name for the new role. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fesl936ewufcb8n4ylry4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fesl936ewufcb8n4ylry4.png" alt="Image description" width="800" height="817"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, we need to create an IAM role that Neptune can assume in order to load data from S3. Also, since the Neptune DB instance is within a VPC, we need to create an S3 gateway endpoint to allow access to S3. Both can be achieved by following the steps in the &lt;a href="https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-IAM.html#bulk-load-tutorial-vpc" rel="noopener noreferrer"&gt;IAM prerequisites for the Neptune Bulk Loader&lt;/a&gt;. &lt;/p&gt;
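&lt;p&gt;As a sketch of what those prerequisites involve, the role that Neptune assumes to read from S3 trusts the &lt;code&gt;rds.amazonaws.com&lt;/code&gt; service principal (Neptune clusters are managed under the RDS umbrella). Building the trust policy document programmatically might look like this:&lt;/p&gt;

```python
import json

# Trust policy from the Neptune bulk-loader IAM prerequisites: lets the
# Neptune (RDS) service principal assume the S3 read role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "rds.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Serialized form, e.g. for iam.create_role(AssumeRolePolicyDocument=...)
trust_policy_json = json.dumps(trust_policy, indent=2)
```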

&lt;h2&gt;
  
  
  Redshift Serverless Data Query and Unload
&lt;/h2&gt;

&lt;p&gt;In this &lt;a href="https://dev.to/aws-builders/sagemaker-501e-temp-slug-9236610?preview=850aebb1bd0ecf9213710c0b676e1f93a3f2ce4ce9b852476448ed854ca96c1ca0b803afcd1ca165e02198f35013dd28e398083bf31486ac98bc5e64"&gt;previous blog&lt;/a&gt;, I described how to configure AWS Redshift Serverless with access to the AWS Marketplace Worldwide Events Dataset. Follow the steps there to configure a datashare for accessing this database from the Redshift cluster. &lt;/p&gt;

&lt;p&gt;We will model the users and events as nodes, and the relationship between each user and event as an edge. For example, a seller (node) would list (relationship) a ticket for a given event (node), for which one or many buyers (nodes) would purchase (relationships) tickets (or, unluckily, no one may purchase from the seller).&lt;/p&gt;

&lt;p&gt;Open the query editor from the navigation pane in the Redshift Serverless console.  We will first create a view that filters the &lt;code&gt;all_users&lt;/code&gt; view in the worldwide events datashare to only contain users who like theatre, concerts and opera. The additional constraint is that we only keep rows with no NULLs in any of the selected boolean columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;user_sample_vw&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; 
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;userid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;liketheatre&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;likeconcerts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;likeopera&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; 
&lt;span class="nv"&gt;"worldwide_event_data_exchange"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"all_users"&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;liketheatre&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt;  &lt;span class="n"&gt;likeconcerts&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt;  &lt;span class="n"&gt;likeopera&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="k"&gt;no&lt;/span&gt; &lt;span class="k"&gt;schema&lt;/span&gt; &lt;span class="n"&gt;binding&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1q98y0sgpbm99kaycpdw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1q98y0sgpbm99kaycpdw.png" alt="Image description" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's also create another view containing a snapshot of events and related transactions between selected buyers and sellers in our &lt;strong&gt;user_sample_vw&lt;/strong&gt; for the month of January. We also need to pull in additional columns corresponding to venue, event and ticket purchase and listing details (e.g. number of tickets and price), so we need to join to the respective tables.&lt;br&gt;
&lt;strong&gt;Note:&lt;/strong&gt; we only want records where neither the buyer nor the seller is NULL, and all users must be from the subset we sampled in &lt;strong&gt;user_sample_vw&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;network_vw&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; 
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;  &lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saletime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sellerid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;listtime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buyerid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eventid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eventname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;venuename&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;C&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;catname&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;venuecity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;venuestate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="n"&gt;pricepaid&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;qtysold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;caldate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;priceperticket&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;listprice&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;numtickets&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;listtickets&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"worldwide_event_data_exchange"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"date"&lt;/span&gt; &lt;span class="n"&gt;D&lt;/span&gt; 
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="nv"&gt;"worldwide_event_data_exchange"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"sales"&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dateid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dateid&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="nv"&gt;"worldwide_event_data_exchange"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"listing"&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;listid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;listid&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="nv"&gt;"worldwide_event_data_exchange"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"event"&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eventid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eventid&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt;  &lt;span class="nv"&gt;"worldwide_event_data_exchange"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"category"&lt;/span&gt; &lt;span class="k"&gt;C&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;catid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;C&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;catid&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="nv"&gt;"worldwide_event_data_exchange"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"venue"&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;venueid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;venueid&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="nv"&gt;"dev"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"user_sample_vw"&lt;/span&gt; &lt;span class="n"&gt;U&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buyerid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;userid&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;qtr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="nv"&gt;"dev"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"user_sample_vw"&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sellerid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;userid&lt;/span&gt; 
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="k"&gt;no&lt;/span&gt; &lt;span class="k"&gt;schema&lt;/span&gt; &lt;span class="n"&gt;binding&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see the &lt;strong&gt;network_vw&lt;/strong&gt; view if you refresh the dev database and expand the views dropdown in the tree. A sample of the rows and columns of the view is shown below. We will use this later to simplify the creation of the edge records for the csv we export to S3. We will also use &lt;strong&gt;eventid&lt;/strong&gt; and its related properties to create the nodes csv.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83rt1wi1ql4kcy0lu7ah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83rt1wi1ql4kcy0lu7ah.png" alt="Image description" width="800" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We need to generate two csv files (one containing all the node records, the other containing all the relationship records) in the S3 bucket. This is a requirement for subsequently using the Neptune Bulk Loader to load the data into Neptune in the openCypher-specific csv format (since we will be using openCypher to query the graph data). In addition, the openCypher load format requires system column headers in the node and relationship files, as detailed in the &lt;a href="https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-format-opencypher.html" rel="noopener noreferrer"&gt;docs&lt;/a&gt;. Any column that holds the values for a particular property needs to use a property column header of the form &lt;strong&gt;propertyname:type&lt;/strong&gt;. &lt;/p&gt;
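&lt;p&gt;A minimal illustration of that csv layout, with the system headers (&lt;strong&gt;:ID&lt;/strong&gt;, &lt;strong&gt;:LABEL&lt;/strong&gt;, &lt;strong&gt;:START_ID&lt;/strong&gt;, &lt;strong&gt;:END_ID&lt;/strong&gt;, &lt;strong&gt;:TYPE&lt;/strong&gt;) alongside typed property columns. The row values here are made up for the sketch:&lt;/p&gt;

```python
import csv
import io

# Nodes file: :ID and :LABEL are system columns, the rest are
# propertyname:type property columns.
nodes = io.StringIO()
writer = csv.writer(nodes)
writer.writerow([":ID", "name:String", "city:String", ":LABEL"])
writer.writerow(["u101", "JSG99FHE", "New York", "user"])   # a user node
writer.writerow(["e2", "Macbeth", "Chicago", "event"])      # an event node

# Edges file: :START_ID/:END_ID reference node :IDs, :TYPE is the
# relationship label.
edges = io.StringIO()
writer = csv.writer(edges)
writer.writerow([":ID", ":START_ID", ":END_ID", ":TYPE", "price:Double"])
writer.writerow(["1", "e2", "u101", "TICKET_PURCHASE", "145.00"])

node_header = nodes.getvalue().splitlines()[0]
```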

&lt;p&gt;We also need to create a role to associate with the Redshift Serverless endpoint so it can unload data into S3. &lt;br&gt;
In the Redshift Serverless console, go to Namespace configuration and select the namespace. Then go to the Security and encryption tab and click Manage IAM roles under the Permissions section. Click the &lt;strong&gt;Create IAM role&lt;/strong&gt; option in the &lt;strong&gt;Manage IAM roles&lt;/strong&gt; dropdown. This creates a default IAM role with the AWS managed policy &lt;strong&gt;AmazonRedshiftAllCommandsFullAccess&lt;/strong&gt; attached, which includes permissions to run SQL commands to COPY, UNLOAD, and query data with Amazon Redshift Serverless.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2n5j5mre505c3y03xbpz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2n5j5mre505c3y03xbpz.png" alt="Image description" width="800" height="752"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select the option &lt;strong&gt;Specific S3 buckets&lt;/strong&gt; and select the S3 bucket created for unloading the nodes and relationship data to. Then click &lt;strong&gt;Create IAM role as default&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This default role also grants permissions to run SELECT statements against other services besides S3, including Sagemaker, Glue etc. The policy attached to the new role would need to be updated in IAM if you want to limit permissions to fewer services.&lt;/p&gt;

&lt;p&gt;If you navigate back to the Namespace, you should see the IAM role and the associated arn (highlighted in yellow) which you will need to specify when running commands to unload data to S3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwfnd4otcmbo03yrc5jh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwfnd4otcmbo03yrc5jh.png" alt="Image description" width="800" height="559"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will use the &lt;a href="https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html" rel="noopener noreferrer"&gt;UNLOAD&lt;/a&gt; command to unload the results of the queries above to S3 in csv format, with the following options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CSV DELIMITER AS&lt;/strong&gt;: use csv format with ',' as the delimiter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HEADER&lt;/strong&gt;: write the column names as the first row&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CLEANPATH&lt;/strong&gt;: remove any existing S3 files before unloading the new query results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PARALLEL OFF&lt;/strong&gt;: turn off parallel writes, as we want a single CSV file rather than multiple partitions.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;unload ('&amp;lt;query&amp;gt;')
to &amp;lt;s3://object-path/name-prefix&amp;gt;
iam_role &amp;lt;your role-arn&amp;gt;
CSV DELIMITER AS ','
HEADER
cleanpath
parallel off;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
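&lt;p&gt;Since we will run the same UNLOAD options for both the nodes and edges exports, a small helper (hypothetical, for illustration) can fill in the template above so the options stay consistent. The role ARN used in the example is a placeholder:&lt;/p&gt;

```python
def build_unload(query: str, s3_path: str, iam_role_arn: str) -> str:
    """Fill in the UNLOAD template with a query, S3 target and role ARN,
    applying the CSV/HEADER/CLEANPATH/PARALLEL OFF options."""
    return (
        f"unload ('{query}')\n"
        f"to '{s3_path}'\n"
        f"iam_role '{iam_role_arn}'\n"
        "CSV DELIMITER AS ','\n"
        "HEADER\n"
        "cleanpath\n"
        "parallel off;"
    )

# Placeholder ARN; substitute the role ARN shown in your namespace.
stmt = build_unload(
    "SELECT * FROM user_sample_vw",
    "s3://redshift-worldwide-events/nodes",
    "arn:aws:iam::123456789012:role/redshift-unload-role",
)
```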


&lt;p&gt;The query below unloads the results for all the user and event node records to the S3 bucket &lt;strong&gt;s3://redshift-worldwide-events&lt;/strong&gt; with the object name prefix &lt;strong&gt;nodes&lt;/strong&gt;. Replace the iam_role arn with your role arn. The first line forces the column names to keep the same case as used in the query (by default, all column names are overridden to lowercase).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;enable_case_sensitive_identifier&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;unload&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="s1"&gt;'SELECT DISTINCT *
FROM
(
    SELECT CONCAT(&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;u&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;, A.buyerid) AS ":ID", B.username AS "name:String", 
    B.liketheatre AS "liketheatre:Bool", B.likeconcerts AS "likeconcerts:Bool", B.likeopera AS "likeopera:Bool", 
    NULL AS "venue:String", NULL AS "category:String",  B.city AS "city:String",  B.state AS "state:String",  &lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt; AS ":LABEL" 
    FROM "dev"."public"."network_vw" A 
    JOIN user_sample_vw B
    ON A.buyerid = B.userid
)
UNION 
(
    SELECT CONCAT(&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;u&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;, A.sellerid) AS ":ID", B.username AS "name:String", 
    B.liketheatre AS "liketheatre:Bool", B.likeconcerts AS "likeconcerts:Bool", B.likeopera AS "likeopera:Bool", 
    NULL AS "venue:String", NULL AS "category:String",  B.city AS "city:String",  B.state AS "state:String",  &lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt; AS ":LABEL" 
    FROM "dev"."public"."network_vw" A 
    JOIN user_sample_vw B
    ON A.sellerid = B.userid
)
UNION
(
    SELECT CONCAT(&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;e&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;, eventid) AS ":ID",  eventname  AS "name:String", 
    NULL AS "liketheatre:Bool", NULL AS "likeconcerts:Bool", NULL AS "likeopera:Bool",
    venuename AS "venue:String", catname AS "category:String", venuecity AS "city:String", venuestate AS "state:String", &lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;event&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt; AS ":LABEL"
    FROM "dev"."public"."network_vw" B
)'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="s1"&gt;'s3://redshift-worldwide-events/nodes'&lt;/span&gt; 
&lt;span class="n"&gt;iam_role&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;your-iam-role&amp;gt;'&lt;/span&gt;
&lt;span class="n"&gt;CSV&lt;/span&gt; &lt;span class="k"&gt;DELIMITER&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="s1"&gt;','&lt;/span&gt;
&lt;span class="n"&gt;HEADER&lt;/span&gt;
&lt;span class="n"&gt;cleanpath&lt;/span&gt;
&lt;span class="n"&gt;parallel&lt;/span&gt; &lt;span class="k"&gt;off&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it ran successfully, we should see a message saying that 239 rows unloaded successfully.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66bmc07qwidyurh0e4t2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66bmc07qwidyurh0e4t2.png" alt="Image description" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's break down the query and see what it is doing. The first and second subqueries create records for the buyer and seller nodes respectively, aliasing the column names to the openCypher format and setting the event property columns to NULL. We need to join &lt;strong&gt;network_vw&lt;/strong&gt; (which contains the list of seller and buyer pairs) with &lt;strong&gt;user_sample_vw&lt;/strong&gt; (which contains the properties of all users) to select additional information per user, like username, city and whether they like concerts, theatre and/or opera. The final subquery creates the records for the event nodes from &lt;strong&gt;network_vw&lt;/strong&gt;, similarly aliasing the column names to the required format and setting the columns corresponding to the user nodes to NULL. We then &lt;strong&gt;UNION&lt;/strong&gt; the separate subqueries to combine them into the same result set. &lt;/p&gt;
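&lt;p&gt;A toy illustration of why the node query needs both a buyer and a seller subquery plus DISTINCT: the same user can appear as a buyer in some rows and a seller in others, but must yield a single node record. The ids below are made up:&lt;/p&gt;

```python
# Simplified rows from a view like network_vw: user 101 is both a
# buyer and a seller across different transactions.
rows = [
    {"buyerid": 101, "sellerid": 202},
    {"buyerid": 303, "sellerid": 101},
]

# Mirror the CONCAT(''u'', ...) prefixing of the SQL, then take the set
# union, which deduplicates u101 just as UNION + DISTINCT does.
buyers = {f"u{r['buyerid']}" for r in rows}
sellers = {f"u{r['sellerid']}" for r in rows}
node_ids = sorted(buyers | sellers)
```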

&lt;p&gt;We can similarly run a query to unload the edge records result set. Here the S3 location option is slightly modified to use the object name prefix 'edges'.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;enable_case_sensitive_identifier&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;unload&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="s1"&gt;'SELECT ROW_NUMBER() OVER() AS ":ID",":START_ID",":END_ID", ":TYPE", "price:Double", "quantity:Int", 
"date:DateTime"
FROM 
    (
        ( 
        SELECT  CONCAT(&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;u&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;, sellerid) AS ":START_ID", 
        CONCAT(&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;e&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;, eventid) AS ":END_ID",&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;TICKETS_LISTED_FOR&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt; AS ":TYPE",
        pricepaid AS "price:Double" ,qtysold AS "quantity:Int", caldate AS "date:DateTime"
        FROM "dev"."public"."network_vw"
        )
    UNION 
        (
            SELECT CONCAT(&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;e&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;, eventid) AS ":START_ID", 
            CONCAT(&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;u&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;, buyerid) AS ":END_ID",&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;TICKET_PURCHASE&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt; AS ":TYPE", 
            pricepaid AS "price:Double" ,qtysold AS "quantity:Int" , caldate AS "date:DateTime"
            FROM "dev"."public"."network_vw"
        )
    )'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="s1"&gt;'s3://redshift-worldwide-events/edges'&lt;/span&gt; 
&lt;span class="n"&gt;iam_role&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;your-iam-role&amp;gt;'&lt;/span&gt;
&lt;span class="n"&gt;CSV&lt;/span&gt; &lt;span class="k"&gt;DELIMITER&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="s1"&gt;','&lt;/span&gt; 
&lt;span class="n"&gt;HEADER&lt;/span&gt; 
&lt;span class="n"&gt;cleanpath&lt;/span&gt; 
&lt;span class="n"&gt;parallel&lt;/span&gt; &lt;span class="k"&gt;off&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that we have also used a window function to rank the edge records for the same node ids by date, so we can take only the latest transaction between the same pair of users.&lt;br&gt;
The screenshot below shows edge records where there are multiple transactions between the same buyer and seller on different dates. We will only keep the latest record.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqde0e5oj9ohimuf1dzl2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqde0e5oj9ohimuf1dzl2.png" alt="Image description" width="800" height="125"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the query ran successfully, check that the two objects are visible in the S3 bucket.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmdixupxp6qc9ote8qb3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmdixupxp6qc9ote8qb3.png" alt="Image description" width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Loading S3 Data into Neptune
&lt;/h2&gt;

&lt;p&gt;Now we will load the data from the S3 bucket to the Neptune cluster. To do this, we will open the notebook we configured in Sagemaker to access the Neptune cluster. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to the Sagemaker console and select &lt;strong&gt;Notebook instances&lt;/strong&gt; under the Notebook tab. &lt;/li&gt;
&lt;li&gt;You should see the notebook instance with status &lt;strong&gt;InService&lt;/strong&gt; if the create notebook task ran successfully. &lt;/li&gt;
&lt;li&gt;Under &lt;strong&gt;Actions&lt;/strong&gt;, click on Open Jupyter or Jupyter lab. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gdrbj2az4wa8omx2j9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gdrbj2az4wa8omx2j9w.png" alt="Image description" width="800" height="78"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You should see a number of subfolders containing sample notebooks on various topics, one level below the Neptune parent folder. Either open one of the existing notebooks or start a blank new one. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuoc6widul9kbpblp04t9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuoc6widul9kbpblp04t9.png" alt="Image description" width="800" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First we will check that the notebook configuration is as we expect. Graph notebook offers a number of &lt;a href="https://github.com/aws/graph-notebook#features" rel="noopener noreferrer"&gt;magic extensions&lt;/a&gt; in the ipython3 kernel for running specific tasks in a cell, such as running a query in a specific language (openCypher, Gremlin), checking the status of a load job or query, changing configuration settings, setting visualisation options, etc. &lt;/p&gt;

&lt;p&gt;In a new cell, execute the magic command &lt;code&gt;%graph_notebook_config&lt;/code&gt;. This should return a JSON payload containing connection information for the Neptune host instance the notebook is connected to.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo259v04so3g84i993ib7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo259v04so3g84i993ib7.png" alt="Image description" width="800" height="572"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we want to override any of these settings (for example, if we have set a port different to the default 8182), we can copy the JSON output from the previous cell and modify the required value. Run the cell with the cell magic &lt;code&gt;%%graph_notebook_config&lt;/code&gt; to apply the new configuration.&lt;/p&gt;
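&lt;p&gt;For reference, the configuration payload has roughly the following shape (the exact fields vary with the graph-notebook version, and the host value here is a placeholder):&lt;/p&gt;

```json
{
  "host": "your-neptune-endpoint.cluster-xxxxxxxxxxxx.us-east-1.neptune.amazonaws.com",
  "port": 8182,
  "auth_mode": "DEFAULT",
  "load_from_s3_arn": "",
  "ssl": true,
  "aws_region": "us-east-1"
}
```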

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxe4hyq7iehyis2wcvi0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxe4hyq7iehyis2wcvi0.png" alt="Image description" width="800" height="1195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check that the status of the Neptune cluster endpoint is showing as &lt;strong&gt;healthy&lt;/strong&gt; using the &lt;code&gt;%status&lt;/code&gt; magic extension.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7wblx76q3g7c91chgsmf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7wblx76q3g7c91chgsmf.png" alt="Image description" width="762" height="568"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can use the Neptune loader command to send a POST request to the Neptune endpoint, as described &lt;a href="https://docs.aws.amazon.com/neptune/latest/userguide/load-api-reference-load.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;. We will use the following request parameters: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;source&lt;/strong&gt; : "s3://redshift-worldwide-events/",&lt;br&gt;
&lt;strong&gt;format&lt;/strong&gt; : "opencypher",&lt;br&gt;
&lt;strong&gt;iamRoleArn&lt;/strong&gt; : &lt;br&gt;
&lt;strong&gt;region&lt;/strong&gt; : "us-east-1",&lt;br&gt;
&lt;strong&gt;failOnError&lt;/strong&gt; : "FALSE",&lt;br&gt;
&lt;strong&gt;parallelism&lt;/strong&gt; : "MEDIUM",&lt;br&gt;
&lt;strong&gt;updateSingleCardinalityProperties&lt;/strong&gt; : "FALSE",&lt;br&gt;
&lt;strong&gt;queueRequest&lt;/strong&gt; : "FALSE"&lt;/p&gt;
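&lt;p&gt;As an illustration, the same load request can be sketched in Python using only the standard library. The endpoint and IAM role ARN below are placeholders (the post leaves the role ARN unspecified) and must be replaced with your own values:&lt;/p&gt;

```python
import json
import urllib.request

# Placeholder -- replace with your Neptune cluster endpoint and port.
NEPTUNE_ENDPOINT = "https://your-neptune-endpoint:8182"

# Request parameters from this post. iamRoleArn is a placeholder for the role
# that grants Neptune read access to the S3 bucket.
payload = {
    "source": "s3://redshift-worldwide-events/",
    "format": "opencypher",
    "iamRoleArn": "arn:aws:iam::123456789012:role/YourNeptuneLoadRole",  # placeholder
    "region": "us-east-1",
    "failOnError": "FALSE",
    "parallelism": "MEDIUM",
    "updateSingleCardinalityProperties": "FALSE",
    "queueRequest": "FALSE",
}

def start_load(endpoint: str = NEPTUNE_ENDPOINT) -> dict:
    """POST the load request to the Neptune loader endpoint and return the JSON response."""
    req = urllib.request.Request(
        f"{endpoint}/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

&lt;p&gt;Calling &lt;code&gt;start_load()&lt;/code&gt; from a network location that can reach the cluster should return the same payload as the curl request, including the &lt;code&gt;loadId&lt;/code&gt;.&lt;/p&gt;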

&lt;p&gt;This will output a &lt;code&gt;loadid&lt;/code&gt; in the payload.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxd17mhts0x0f22i9s0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxd17mhts0x0f22i9s0w.png" alt="Image description" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then we can check the load status by using the &lt;a href="https://docs.aws.amazon.com/neptune/latest/userguide/load-api-reference-status-requests.html" rel="noopener noreferrer"&gt;loader get-status request&lt;/a&gt;, replacing your &lt;code&gt;neptune endpoint&lt;/code&gt;, &lt;code&gt;port&lt;/code&gt; and &lt;code&gt;loadId&lt;/code&gt; in the command: &lt;code&gt;curl -G https://your-neptune-endpoint:port/loader/loadId&lt;/code&gt;&lt;br&gt;
If successful, you should see an output similar to the payload below. This returns one or more &lt;a href="https://docs.aws.amazon.com/neptune/latest/userguide/loader-message.html" rel="noopener noreferrer"&gt;loader feed codes&lt;/a&gt;. If the load was successful, you should see only a &lt;strong&gt;LOAD_COMPLETED&lt;/strong&gt; code.&lt;/p&gt;
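&lt;p&gt;The same status check can be sketched in Python. This is a minimal illustration, assuming a placeholder endpoint; in the documented response, the overall feed code sits under &lt;code&gt;payload.overallStatus.status&lt;/code&gt;:&lt;/p&gt;

```python
import json
import urllib.request

def load_status_url(endpoint: str, load_id: str) -> str:
    """Build the loader get-status URL for a given load job."""
    return f"{endpoint}/loader/{load_id}"

def get_feed_code(endpoint: str, load_id: str) -> str:
    """Fetch the load status and return the overall feed code, e.g. LOAD_COMPLETED."""
    with urllib.request.urlopen(load_status_url(endpoint, load_id)) as resp:
        body = json.load(resp)
    return body["payload"]["overallStatus"]["status"]
```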

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fteouipp7fbh0grjqhl22.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fteouipp7fbh0grjqhl22.png" alt="Image description" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If there is an issue with one or both CSV files, you may see a &lt;strong&gt;LOAD_FAILED&lt;/strong&gt; code or one of the other codes listed &lt;a href="https://docs.aws.amazon.com/neptune/latest/userguide/loader-message.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;. In the next section, we will look at some options for diagnosing the errors. If one of the loads is still in progress, you will see a &lt;strong&gt;LOAD_IN_PROGRESS&lt;/strong&gt; key whose value is the number of S3 object loads still in progress. Running the curl command to check the load status again should update the code to &lt;strong&gt;LOAD_COMPLETED&lt;/strong&gt;, or to one of the error codes if there was an error.&lt;/p&gt;

&lt;p&gt;Check that you can access some data by submitting an openCypher query to the openCypher HTTPS endpoint using curl, as explained in the &lt;a href="https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-opencypher-queries.html" rel="noopener noreferrer"&gt;docs&lt;/a&gt;. In this case, we will just return a single pair of connected nodes from the database by passing the query &lt;code&gt;MATCH (n)-[r]-(p) RETURN n,r,p LIMIT 1&lt;/code&gt; as the value of the query attribute, as in the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; that the endpoint has the format &lt;code&gt;HTTPS://(cluster endpoint):(port number)/openCypher&lt;/code&gt;. Your cluster endpoint will be different from mine in the screenshot below, so you will need to copy it from the Neptune dashboard for your database cluster identifier.&lt;/p&gt;
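&lt;p&gt;The curl query can be mirrored with a short Python sketch (the endpoint is a placeholder; the openCypher HTTPS endpoint accepts the query as a form field named &lt;code&gt;query&lt;/code&gt;):&lt;/p&gt;

```python
import json
import urllib.parse
import urllib.request

QUERY = "MATCH (n)-[r]-(p) RETURN n,r,p LIMIT 1"

def run_opencypher(endpoint: str, query: str = QUERY) -> dict:
    """POST an openCypher query to the Neptune HTTPS endpoint and return the JSON result."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    req = urllib.request.Request(f"{endpoint}/openCypher", data=data, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```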

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllrrc59tuyodyja8mztm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllrrc59tuyodyja8mztm.png" alt="Image description" width="800" height="615"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Debugging Neptune data load errors
&lt;/h2&gt;

&lt;p&gt;Running the loader status check can sometimes return errors. To diagnose these further, we can run this &lt;a href="https://docs.aws.amazon.com/neptune/latest/userguide/load-api-reference-error-logs-examples.html" rel="noopener noreferrer"&gt;curl command&lt;/a&gt; with additional query parameters, replacing neptune-endpoint, port and loadid with your values. This gives a more detailed response with an errorLogs object listing the errors encountered, as shown in the screenshot below. Here, the load failed because some of the node ids referenced by edge records in the relationship CSV file were missing from the node CSV file.&lt;/p&gt;
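&lt;p&gt;Given the detailed status response, the individual error entries can be pulled out programmatically. The sketch below assumes the documented &lt;code&gt;details&lt;/code&gt;/&lt;code&gt;errors&lt;/code&gt; query parameters and a simplified response shape:&lt;/p&gt;

```python
def load_status_details_url(endpoint: str, load_id: str) -> str:
    """Build the get-status URL with the parameters that include error logs."""
    return f"{endpoint}/loader/{load_id}?details=true&errors=true"

def summarise_errors(status_response: dict) -> list:
    """Return the error messages from the errorLogs section of a detailed status response."""
    logs = status_response.get("payload", {}).get("errors", {}).get("errorLogs", [])
    return [entry.get("errorMessage", "") for entry in logs]
```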

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi8auoc0mh7yav16bb657.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi8auoc0mh7yav16bb657.png" alt="Image description" width="800" height="671"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next screenshot below shows a cardinality violation error because some of the edge record ids in the original data are duplicated.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2xbdiywlxawfhnvmg6q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2xbdiywlxawfhnvmg6q.png" alt="Image description" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also reset the database and remove any existing data by using the magic command &lt;code&gt;%db_reset&lt;/code&gt;. This will prompt you to&lt;br&gt;
tick an acknowledgement option and click Delete. A status check will then run; wait for it to complete, and you should get a &lt;code&gt;database has been reset&lt;/code&gt; message.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8r195c6ryvi5ywabedqu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8r195c6ryvi5ywabedqu.png" alt="Image description" width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are now set up for running more complex queries to generate insights from our data. &lt;a href="https://dev.to/aws-builders/aws-neptune-for-analysing-event-ticket-sales-between-users-part-2-3i5g"&gt;Part 2&lt;/a&gt; of this blog will run a number of openCypher queries to explore the property graph modelling the worldwide events network.&lt;/p&gt;

</description>
      <category>neptune</category>
      <category>cypher</category>
      <category>serverless</category>
      <category>graphs</category>
    </item>
    <item>
      <title>Data Analysis with Redshift Serverless and Quicksight - Part 2</title>
      <dc:creator>Ryan Nazareth</dc:creator>
      <pubDate>Sat, 13 May 2023 23:18:09 +0000</pubDate>
      <link>https://forem.com/aws-builders/data-analysis-with-redshift-serverless-and-quicksight-part-2-5c9o</link>
      <guid>https://forem.com/aws-builders/data-analysis-with-redshift-serverless-and-quicksight-part-2-5c9o</guid>
<description>&lt;p&gt;In the &lt;a href="https://dev.to/aws-builders/data-analysis-with-redshift-serverless-and-quicksight-part-1-1lg8"&gt;first part&lt;/a&gt; of this blog, we introduced the Redshift Serverless offering and set up a workgroup and namespace configured with datasharing to allow access to AWS Marketplace Data Exchange. In this second part, we will focus on accessing the data from Amazon Quicksight to generate interactive visualisations, and explore some of the other features it has to offer.&lt;/p&gt;

&lt;p&gt;Amazon Quicksight is a fully managed business intelligence (BI) service which allows users to publish dashboards and share them amongst team members. Since it is a serverless offering, it scales to tens of thousands of users without you having to manage the underlying infrastructure. It can connect to a wide variety of data sources in the cloud (S3, RDS, Redshift and Athena, to name a few) and on-premises.&lt;/p&gt;

&lt;p&gt;It also provides some more advanced features such as integrating machine learning insights with dashboards in the form of forecasting, anomaly detection and natural language querying. We will explore some of these in this blog.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up a Quicksight subscription
&lt;/h2&gt;

&lt;p&gt;If this is the first time using Quicksight, you will need to set up an account. Sign in to your IAM user account and navigate to the Quicksight service. Follow the setup instructions in the Quicksight documentation and choose the options to set up an enterprise account, with federated users and a QuickSight-managed role as the method of authentication, and grant access to Redshift. You will get a 30-day free trial for a Standard or Enterprise subscription. If your free trial has expired, you can sign up for one of the cheaper pay-as-you-go Reader subscriptions, which is only charged for active sessions and can be stopped after this tutorial is complete.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbhesqbo1mowu1b9kmins.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbhesqbo1mowu1b9kmins.png" alt="Image description" width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have set up the subscription, you should be able to &lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/signing-in.html" rel="noopener noreferrer"&gt;sign into Quicksight as an IAM user&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/using-quicksight-menu-and-landing-page.html" rel="noopener noreferrer"&gt;manage your account&lt;/a&gt; by choosing the user icon at the upper right of the page and selecting &lt;code&gt;Manage Quicksight&lt;/code&gt;. You can now check your active subscription or, as an admin user, invite users to your account if required and manage permissions accordingly.&lt;/p&gt;

&lt;p&gt;Quicksight also uses &lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/managing-spice-capacity.html" rel="noopener noreferrer"&gt;SPICE&lt;/a&gt; to run fast in-memory computations on data for visual analytics. For Enterprise subscriptions, data is also encrypted at rest by default. By default, we get a total of 11 GB of SPICE capacity per region per subscription account, which can be shared amongst the Quicksight users added to the account. We will be loading the data from Redshift into SPICE for this tutorial, and this capacity is more than enough.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gr63z1ywh7aaye2zxup.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gr63z1ywh7aaye2zxup.png" alt="Image description" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting to Redshift
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;On the Amazon QuickSight start page choose Datasets from the options on the left and on the Datasets page, choose the New data set option on the top right (screenshot below).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl3u5ld1yqbxf5pnjblg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl3u5ld1yqbxf5pnjblg.png" alt="Image description" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the new window, choose the Redshift Manual connect icon. 
A new window will pop up requiring the connection information for the data source to be filled in.&lt;/li&gt;
&lt;li&gt;For Data source name, enter a name for the data source.&lt;/li&gt;
&lt;li&gt;For Database server, you will need to retrieve the endpoint 
of the cluster. You can get the endpoint value from the Endpoint field in the general information section when clicking on the cluster workgroup in the Redshift Serverless dashboard. The server address is the first part of the endpoint, before the colon, as highlighted in yellow below.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpw40w7wjmhzt0vycshe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpw40w7wjmhzt0vycshe.png" alt="Image description" width="800" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The port will be the default port for Redshift (5439), unless it was set differently during setup, in which case confirm it from the endpoint address (the number following the first colon).&lt;/li&gt;
&lt;li&gt;Enter the name of the database (after the second colon in the endpoint). In my case, it is dev.&lt;/li&gt;
&lt;li&gt;For Username and Password, enter the user name and password you configured in part 1 of this blog when setting up the Redshift cluster.&lt;/li&gt;
&lt;li&gt;Click on &lt;code&gt;Validate Connection&lt;/code&gt;. If successful, you should see a green tick saying validated. If validation failed, check the following: 

&lt;ul&gt;
&lt;li&gt;Check that the security group attached to the Redshift cluster allows inbound traffic from the IP address range associated with the region Quicksight was set up in, as explained in the previous blog.&lt;/li&gt;
&lt;li&gt;Check that you have made the Redshift cluster in its VPC publicly accessible.&lt;/li&gt;
&lt;li&gt;Check that you are using the correct username and/or password (these can be reset from the Redshift dashboard).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Assuming everything worked, click Create DataSource. &lt;/li&gt;

&lt;/ul&gt;
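&lt;p&gt;The endpoint splitting described above can be captured in a small helper (the example endpoint is hypothetical; the real value comes from your workgroup's general information panel):&lt;/p&gt;

```python
def parse_redshift_endpoint(endpoint: str) -> dict:
    """Split a Redshift Serverless endpoint of the form host:port/database
    into the fields that Quicksight asks for."""
    host, _, rest = endpoint.partition(":")
    port, _, database = rest.partition("/")
    return {"server": host, "port": int(port), "database": database}

# Hypothetical endpoint value, for illustration only.
example = "default-wg.123456789012.us-east-1.redshift-serverless.amazonaws.com:5439/dev"
fields = parse_redshift_endpoint(example)
```

&lt;p&gt;Here &lt;code&gt;fields["server"]&lt;/code&gt; is the database server, &lt;code&gt;fields["port"]&lt;/code&gt; is 5439 and &lt;code&gt;fields["database"]&lt;/code&gt; is dev, matching the three fields in the connection form.&lt;/p&gt;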

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhghm48xxam2481mud6k1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhghm48xxam2481mud6k1.png" alt="Image description" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You will be presented with the schema and the set of tables to connect to. The view &lt;code&gt;worldwide_events_vw&lt;/code&gt; created in the previous blog should be visible. Select it and click next.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6ytwkuzs5cf7br701ca.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6ytwkuzs5cf7br701ca.png" alt="Image description" width="800" height="663"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the next pop-up, we need to select whether to query the dataset directly from the source or import the table data as-is into SPICE. The latter is the recommended method, as it improves performance and speeds up analytics, provided you have enough SPICE capacity. Select the &lt;code&gt;Import to SPICE&lt;/code&gt; option.&lt;/li&gt;
&lt;li&gt;If you do not want to be emailed when a refresh fails, untick the box. Then choose &lt;code&gt;Visualize&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxe59n5vmnxvxyxh6mp8x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxe59n5vmnxvxyxh6mp8x.png" alt="Image description" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Accept the default settings for creating a new sheet and you should now be presented with the dashboard for creating the charts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Insights
&lt;/h2&gt;

&lt;p&gt;Quicksight offers a number of &lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/working-with-visual-types.html" rel="noopener noreferrer"&gt;visual types&lt;/a&gt; which can be selected from the visual types pane using the representative visual icon. The AWS docs on &lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/creating-a-visual.html" rel="noopener noreferrer"&gt;creating quicksight visuals&lt;/a&gt; go through the steps for adding a visual to the dashboard. First, we will create a line chart from the visual types to plot the fields caldate and totalprice from the fields list.&lt;/p&gt;

&lt;p&gt;Quicksight allows non-technical users to generate forecasts using the built-in &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html" rel="noopener noreferrer"&gt;Random Cut Forest algorithm&lt;/a&gt;, which analyses historical data and generates a forecast for a specified period, with a prediction interval at the required confidence level.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F749083j9i3kya24ufeye.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F749083j9i3kya24ufeye.png" alt="Image description" width="800" height="917"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For forecast length, set the periods forward to 14.&lt;/li&gt;
&lt;li&gt;Set the prediction interval to 90. &lt;/li&gt;
&lt;li&gt;Set the seasonality to &lt;code&gt;auto&lt;/code&gt; and leave the other settings as the default values. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We get a wide confidence interval, which suggests that the forecast could lie anywhere within that range. A smaller prediction interval will generate a narrower band, but gives less confidence that the actual values will fall within it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq04o4c0hklifmrirpa13.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq04o4c0hklifmrirpa13.png" alt="Image description" width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also generate a forecast for a period in history and compare it to actual data. To do this, edit the forecast and for the forecast length setting, set the periods forward option to 0 and the periods backward setting to 100. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foejdvg5xp1mbhb7vk1qa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foejdvg5xp1mbhb7vk1qa.png" alt="Image description" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amazon Quicksight also provides users with ML-powered anomaly insights by analysing a number of combinations of metrics and trends in the data. The &lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/anomaly-detection-outliers-and-key-drivers.html" rel="noopener noreferrer"&gt;concepts for detecting outliers&lt;/a&gt; are based on whether an extreme data point occurs by random chance or is a significant event. Quicksight notifies users when there are any anomalies in the visuals and whether they are worth investigating. Click on the bulb icon in the top right-hand corner of the chart to see the largest anomaly detected in the time series via ML insights. Click on the more options menu and then view more details. On the left-hand panel, you should see a list of anomalies with additional statistics on the percentage change from the average expected total price. Click 'Add anomaly to sheet'.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4mle7i5ee8us1mbklzn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4mle7i5ee8us1mbklzn.png" alt="Image description" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This will open an insight widget in the same sheet. Click get started in the widget. You are then taken to a &lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/anomaly-detection-adding-anomaly-insights.html" rel="noopener noreferrer"&gt;configuration screen with a preview&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8rlw45dsp572tlq9ak4t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8rlw45dsp572tlq9ak4t.png" alt="Image description" width="800" height="776"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amazon Quicksight provides &lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/anomaly-detection-adding-key-drivers.html" rel="noopener noreferrer"&gt;contribution analysis&lt;/a&gt;, identifying the key drivers that contribute to the anomalous outcomes. Expand the top contributors option and tick up to 4 features to use as key drivers for running the contribution analysis. The screenshot below shows the results for day, eventname, month and venuecity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcda0peczrfwnhvsqbf05.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcda0peczrfwnhvsqbf05.png" alt="Image description" width="800" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Choose Save to confirm your choices. You are taken back to the&lt;br&gt;
insight widget, where you can select Run now to run the anomaly detection and view your insight. This will take a few minutes to complete. Once complete, you should see an update in the widget with the latest anomaly detected and an option to explore anomalies, which you can click.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3iqkvdtue19n4o5io1bm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3iqkvdtue19n4o5io1bm.png" alt="Image description" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This will open the anomalies screen, as in the screenshot below. Select &lt;code&gt;SHOW ANOMALIES BY DATE&lt;/code&gt; to display the Number of anomalies chart, which shows the outliers detected over time. We can see two outliers detected, at the end of May and the end of June. On the left pane, we can re-run the contribution analysis if required with a different set of key drivers. In the screenshot below, I have run this between May 26, 2008 and May 27, 2008 (corresponding to the first anomaly) and selected &lt;code&gt;eventname&lt;/code&gt; and &lt;code&gt;eventcity&lt;/code&gt;. We can also explore &lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/anomaly-exploring.html" rel="noopener noreferrer"&gt;anomalies per category or dimension&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57masa44xlo5hrnhdlqi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57masa44xlo5hrnhdlqi.png" alt="Image description" width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The dashboard below uses the &lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/bar-charts.html" rel="noopener noreferrer"&gt;vertical bar chart&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/histogram-charts.html" rel="noopener noreferrer"&gt;histogram&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/box-plots.html" rel="noopener noreferrer"&gt;boxplot&lt;/a&gt; visual types. The bar chart shows the total number of tickets for each quarter of the year, split by the day of the week the event fell on. We can see that the first quarter (Jan-March) has the fewest tickets sold, and quarter 3 has a larger variation in tickets sold across the week, with Sunday being the most popular day for events. In the last 3 months of the year, more tickets are sold between Friday and Monday than during the rest of the week.&lt;/p&gt;

&lt;p&gt;Jan is the month in the first quarter where the range and median of total transactions for an event were the lowest, possibly due to fewer tickets sold. We can see a strong right skew in February, with a long upper whisker. For the rest of the months, the median remains consistent, between £12k and £15k. November showed a slight left skew, and maximum transaction values for a given event of just over £32k were seen for Dec, Feb and May.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmj4z2k5wgad876ao2k1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmj4z2k5wgad876ao2k1.png" alt="Image description" width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This sheet contains a &lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/tree-map.html" rel="noopener noreferrer"&gt;tree map&lt;/a&gt; with the &lt;code&gt;venuename&lt;/code&gt; dimension arranged by &lt;code&gt;total_tickets&lt;/code&gt; (rectangle size) and color encoded by &lt;code&gt;venueseats&lt;/code&gt;. The larger the venue, the darker the shade of green (e.g. FedEx Field, New York Giants Stadium, Arrowhead Stadium), whilst smaller venues appear in brighter shades of yellow (e.g. Shoreline Amphitheatre). We can see that some events at smaller venues, with between 20k and 50k seats, sold a larger number of tickets (possibly because more events were held at these venues during this timeframe). The &lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/pie-chart.html" rel="noopener noreferrer"&gt;pie chart&lt;/a&gt; shows the proportion of total transactions for the top &lt;code&gt;eventname&lt;/code&gt; values. Here the top 6 events are represented and the rest grouped into an 'others' category. The Greg Kihn and Yaz (Yazoo) bands accounted for more than 65% of total transaction sales. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnf960tba7y3t027xmdgp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnf960tba7y3t027xmdgp.png" alt="Image description" width="800" height="730"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Deleting Resources
&lt;/h3&gt;

&lt;p&gt;Finally, remember to delete all the Redshift resources and cancel the Quicksight subscription created across the two parts of this blog, to avoid being charged further. &lt;strong&gt;Note&lt;/strong&gt; that for Redshift Serverless, although you do not pay for compute capacity when you do not run any queries, you still pay for storage (more details can be found &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/serverless-billing.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;). &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To delete the Quicksight Enterprise subscription follow the instructions &lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/managing-subscriptions.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;. You can also &lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/export-dashboard-to-pdf.html" rel="noopener noreferrer"&gt;export&lt;/a&gt; the dashboard to pdf and &lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/deleting-a-dashboard.html" rel="noopener noreferrer"&gt;delete&lt;/a&gt; the dashboard if required.&lt;/li&gt;
&lt;li&gt;The Redshift Serverless workgroup and associated namespace can be deleted by following these &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/serverless-console-namespace-delete.html" rel="noopener noreferrer"&gt;instructions&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
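&lt;p&gt;For anyone who prefers scripting the cleanup, a minimal sketch of the equivalent API calls is below. It only builds the request parameters for the redshift-serverless DeleteWorkgroup and DeleteNamespace operations in boto3; the workgroup and namespace names are placeholders for whatever was chosen at setup, and the workgroup must be deleted before its namespace.&lt;/p&gt;

```python
# Sketch: scripted cleanup of Redshift Serverless resources.
# Pass each params dict to boto3.client("redshift-serverless") as
# client.delete_workgroup(**params) / client.delete_namespace(**params).
# Names here are placeholders for whatever you chose at setup.

def cleanup_calls(workgroup: str, namespace: str) -> list:
    """Ordered (api_name, params) pairs: the workgroup goes first."""
    return [
        ("delete_workgroup", {"workgroupName": workgroup}),
        ("delete_namespace", {"namespaceName": namespace}),
    ]

calls = cleanup_calls("default", "default")
```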

</description>
      <category>serverless</category>
      <category>visualisation</category>
      <category>quicksight</category>
      <category>data</category>
    </item>
    <item>
      <title>Data Analysis with Redshift Serverless and Quicksight - Part 1</title>
      <dc:creator>Ryan Nazareth</dc:creator>
      <pubDate>Sat, 13 May 2023 23:17:28 +0000</pubDate>
      <link>https://forem.com/aws-builders/data-analysis-with-redshift-serverless-and-quicksight-part-1-1lg8</link>
      <guid>https://forem.com/aws-builders/data-analysis-with-redshift-serverless-and-quicksight-part-1-1lg8</guid>
<description>&lt;p&gt;In the first part of this blog, we will focus mainly on setting up a Redshift Serverless cluster and configuring access to the external Worldwide Event Attendance data exchange via the Redshift data sharing feature, so that it can be accessed from the database in the provisioned cluster. We will then run some queries and unload data to S3. Other features provided by Redshift, such as cluster performance monitoring, data recovery and guarding against surprise bills, will also be touched upon. In the &lt;a href="https://dev.to/aws-builders/data-analysis-with-redshift-serverless-and-quicksight-part-2-5c9o"&gt;second part&lt;/a&gt;, we will set up Quicksight and connect it to our Redshift cluster to access the data and build dashboards that generate some interesting insights. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; that the queries could cost between $20-$30, as we are using the entire dataset with minimal filtering. Serverless will try to optimise the computation by scaling to more RPUs, which will increase cost. In addition, you will also be charged for data storage. This is still well within the free trial. You can adjust the RPU base capacity or set usage limits, which is explained further in the &lt;code&gt;Data Recovery, Monitoring and Cost Management&lt;/code&gt; section.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/marketplace/pp/prodview-4ozlpl4r3k7cg#overview" rel="noopener noreferrer"&gt;Worldwide Event Attendance&lt;/a&gt; is a free product available in AWS Marketplace, which allows subscribers to query, analyse and build applications quickly. Instructions on how to subscribe to this product can be found &lt;a href="https://docs.aws.amazon.com/data-exchange/latest/userguide/subscriber-tutorial-RS-product.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;br&gt;
Once subscribed to, we need to create a datashare in the Redshift cluster to access the data immediately. &lt;/p&gt;

&lt;p&gt;We will use Redshift Serverless, the serverless offering of Redshift, which removes the need to set up and manage the underlying cluster specs and scaling. All new users get $300 in free credits for a trial period of 3 months. Sign in to the Redshift console and select &lt;a href="https://aws.amazon.com/redshift/free-trial/" rel="noopener noreferrer"&gt;Serverless free trial&lt;/a&gt;. You are only billed according to the capacity used in a given duration (RPU hours), which scales automatically to optimise running the query. There is also a charge for Redshift Managed Storage (RMS). &lt;/p&gt;
&lt;h2&gt;
  
  
  Redshift Serverless Setup
&lt;/h2&gt;

&lt;p&gt;If this is the first time using Redshift Serverless, you will need to create a default workgroup. A &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/serverless-workgroup-namespace.html" rel="noopener noreferrer"&gt;workgroup&lt;/a&gt; is a collection of compute resources (VPC subnet groups, security groups, RPUs) and can be associated with one namespace, a collection of database objects and users comprising tables, schemas, KMS keys etc. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sign in to the Redshift Serverless console and choose Create workgroup.&lt;/li&gt;
&lt;li&gt;Specify a value for Workgroup name: e.g. default &lt;/li&gt;
&lt;li&gt;In the Network and security section, choose the VPC and security groups. I have chosen the default VPC and associated security group. In addition, I also created another security group to allow access from Quicksight, which we will later need to connect to our database to access the data and create dashboards. Setting up the security group for this is described &lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/enabling-access-redshift.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The security group you choose should have an inbound rule to allow traffic to port 5439 (default Redshift port) from CIDR address range where Quicksight was created. This can be looked up &lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/regions.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;. For example, if Quicksight is configured in &lt;code&gt;us-east-1&lt;/code&gt;, the associated IP address range for data source connectivity is &lt;code&gt;52.23.63.224/27&lt;/code&gt;&lt;/p&gt;
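&lt;p&gt;As a rough sketch, the inbound rule described above can also be expressed programmatically. The dict below is shaped for the EC2 AuthorizeSecurityGroupIngress API in boto3 (passed as &lt;code&gt;ec2.authorize_security_group_ingress(**payload)&lt;/code&gt;); the security group id is a placeholder.&lt;/p&gt;

```python
# Sketch: build the ingress-rule payload that allows QuickSight (us-east-1,
# CIDR 52.23.63.224/27) to reach Redshift on its default port 5439.
# The security-group id below is a placeholder.

QUICKSIGHT_CIDR = "52.23.63.224/27"  # us-east-1, from the QuickSight regions table
REDSHIFT_PORT = 5439                 # default Redshift port

def quicksight_ingress_rule(security_group_id: str) -> dict:
    """Parameters for ec2.authorize_security_group_ingress."""
    return {
        "GroupId": security_group_id,
        "IpPermissions": [{
            "IpProtocol": "tcp",
            "FromPort": REDSHIFT_PORT,
            "ToPort": REDSHIFT_PORT,
            "IpRanges": [{
                "CidrIp": QUICKSIGHT_CIDR,
                "Description": "QuickSight us-east-1 data source range",
            }],
        }],
    }

rule = quicksight_ingress_rule("sg-0123456789abcdef0")
```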

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpsbww7xmfkptf7lhe4dd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpsbww7xmfkptf7lhe4dd.png" alt="Image description" width="800" height="943"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select Create a New namespace and specify the name&lt;/li&gt;
&lt;li&gt;Under admin user credentials, setup a username and password to connect to the database in the cluster. We will use this later when manually connecting to the redshift endpoint from Quicksight. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F430617wxc433cnggxj4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F430617wxc433cnggxj4x.png" alt="Image description" width="800" height="765"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will now need to create a role to associate with redshift serverless endpoint so it can unload data into S3.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the Permissions section, click the Create IAM role option in the Manage IAM roles dropdown in the Associate IAM roles subsection.&lt;/li&gt;
&lt;li&gt;Select the option Specific S3 buckets, select the S3 bucket created for unloading the data to, and then click Create IAM role as default. This will create an IAM role as the default, with the AWS managed policy AmazonRedshiftAllCommandsFullAccess attached. This includes permissions to run SQL commands to COPY, UNLOAD, and query data with Amazon Redshift Serverless.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe6vcxtw7toguh7bpn2s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe6vcxtw7toguh7bpn2s.png" alt="Image description" width="800" height="752"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Now back in the Associate IAM roles section, select Associate IAM role and tick the role just created.&lt;/li&gt;
&lt;li&gt;By default, KMS encryption is provided with AWS owned key. The Encryption and security section can be skipped unless you want to provide your own KMS key for encryption and enable database logging.&lt;/li&gt;
&lt;li&gt;Click Next&lt;/li&gt;
&lt;li&gt;In the Review Step, check that all the options and configuration are set correctly and select Create.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1aa7kdjnec5f8l1e66s5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1aa7kdjnec5f8l1e66s5.png" alt="Image description" width="800" height="1062"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgoe1kvqv2yb08okajv7t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgoe1kvqv2yb08okajv7t.png" alt="Image description" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now go back to the Serverless Dashboard and check the workgroup list to see the workgroup and namespace created, with status showing as 'Available'. &lt;/p&gt;

&lt;p&gt;We will also need to connect Quicksight to our Redshift endpoint. Hence we will need to make the database publicly accessible to allow access from outside the VPC. The VPC security group we configured earlier should have an inbound rule that only allows access from Quicksight. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the navigation panel on the left, click on Workgroup configuration and select the Workgroup created&lt;/li&gt;
&lt;li&gt;In the Network and security panel under the Data access tab, the Publicly accessible option is turned off by default. Click the edit button (highlighted in yellow in the screenshot) and turn on the Publicly accessible option.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; if our Redshift cluster was in a private subnet, we would need to create a private connection from Quicksight to the VPC in which the cluster is located, as described &lt;a href="https://repost.aws/knowledge-center/quicksight-redshift-private-connection" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcvveunh9ta62qfkmkts.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcvveunh9ta62qfkmkts.png" alt="Image description" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Accessing the DataShare
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/redshift/features/data-sharing/" rel="noopener noreferrer"&gt;Redshift data sharing&lt;/a&gt; allows you to share live data across clusters with different accounts and regions with relative ease. This also decouples storage and compute and ensures access to live data and consistency, without the need to copy or move data.&lt;/p&gt;

&lt;p&gt;We now want to create a datashare to access the Worldwide Event Attendance data exchange from our cluster. Carry out the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Navigate to the Redshift Serverless  dashboard in the Amazon Redshift console and select default namespace.&lt;/li&gt;
&lt;li&gt;Navigate to the Datashares tab and scroll down to subscriptions to AWS Data Exchange datashares. &lt;/li&gt;
&lt;li&gt;Click on the datashare &lt;code&gt;worldwide_event_test_data&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Choose Create database from datashare.&lt;/li&gt;
&lt;li&gt;In the Create database from datashare pop-up, specify &lt;code&gt;worldwide_event_test_data&lt;/code&gt; as the Database name.&lt;/li&gt;
&lt;li&gt;Choose Create. You will see a message confirming successful database creation. You are now ready to run read-only queries on this database.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbn1l3ag97vzyeirdhj6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbn1l3ag97vzyeirdhj6.png" alt="Image description" width="800" height="848"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Running Example Queries in the Editor
&lt;/h3&gt;

&lt;p&gt;In order to successfully query the datashare database, you must first connect to the Redshift cluster using the cluster's native database, and then use the cross-database query notation &lt;code&gt;&amp;lt;shareddatabase&amp;gt;.&amp;lt;schema&amp;gt;.&amp;lt;object&amp;gt;&lt;/code&gt; to query the data in the shared database.&lt;/p&gt;
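&lt;p&gt;As a small illustration of this notation, the hypothetical helper below assembles the fully qualified name; the quoted form matches the example query later in this post.&lt;/p&gt;

```python
# Sketch: build a cross-database reference shareddatabase.schema.object,
# quoting each part as in the example query further down.

def shared_object(database: str, schema: str, obj: str) -> str:
    return f'"{database}"."{schema}"."{obj}"'

ref = shared_object("worldwide_event_data_exchange", "public", "event")
```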

&lt;ul&gt;
&lt;li&gt;Navigate to the Redshift query editor v2 page. Select your Serverless workgroup (default), and you will be presented with a window to select authentication as a Federated User (which will generate temporary credentials) or to provide the Database Username and Password set up during cluster creation.&lt;/li&gt;
&lt;li&gt;If selecting Federated User, set the database name to the database in your cluster, 'dev'. Then click Save.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbk7lvudt5dt0usftfg8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbk7lvudt5dt0usftfg8.png" alt="Image description" width="800" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For Database Username and Password authentication, set the database name to 'dev' and input your username and password. Then click Save.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zjmm272ksars4l7larc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zjmm272ksars4l7larc.png" alt="Image description" width="800" height="751"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This will also save the credentials in AWS Secrets Manager. In the Secrets Manager console, choose Secrets, and then choose the secret. Scroll down to the Secret value section of the secret's detail page and click &lt;code&gt;Retrieve Secret Value&lt;/code&gt; on the right hand side. You will see the username and password saved as key-value pairs alongside the associated &lt;code&gt;engine&lt;/code&gt;, &lt;code&gt;dbname&lt;/code&gt;, &lt;code&gt;port&lt;/code&gt; and &lt;code&gt;dbClusterIdentifier&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqgji8d21dz6pwrykuru.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqgji8d21dz6pwrykuru.png" alt="Image description" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You may wonder why we do not just connect directly to the &lt;code&gt;worldwide_event_data_exchange&lt;/code&gt; datashare. This is because Amazon Redshift data sharing has the following considerations, as detailed in the &lt;a href="https://docs.aws.amazon.com/redshift/latest/dg/considerations.html" rel="noopener noreferrer"&gt;docs&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connecting directly to a datashare database is not possible. &lt;/li&gt;
&lt;li&gt;As a datashare user, you can still only connect to your local cluster database. &lt;/li&gt;
&lt;li&gt;Creating databases from a datashare does not allow you to connect to them, but you can read from them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you try to connect directly, as a federated user for example, you will get the following error. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqp2pmfd94t1mc46wj0oe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqp2pmfd94t1mc46wj0oe.png" alt="Image description" width="800" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are now ready to run queries in the editor. Before moving on, as &lt;strong&gt;mentioned previously&lt;/strong&gt;, running all the queries once to join all the tables with most of the data will incur a cost (probably less than $20), still within the free credits. Should you wish to control this, you can reduce the &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/serverless-capacity.html" rel="noopener noreferrer"&gt;RPU base capacity&lt;/a&gt;, which defaults to 128 RPUs when creating the cluster (where 1 RPU provides 16 GB of memory). This is discussed in the &lt;code&gt;Data Recovery, Monitoring and Cost Management&lt;/code&gt; section.&lt;/p&gt;
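&lt;p&gt;For reference, the base capacity can also be changed outside the console. The sketch below only builds parameters for the redshift-serverless UpdateWorkgroup API (&lt;code&gt;update_workgroup&lt;/code&gt; in boto3); the workgroup name is a placeholder, and the multiple-of-8 check reflects my understanding of the allowed RPU values.&lt;/p&gt;

```python
# Sketch: lower the RPU base capacity from the default 128 to cap cost.
# Pass the dict to boto3.client("redshift-serverless").update_workgroup(**params).
# Workgroup name is a placeholder for whatever you chose at setup.

def base_capacity_update(workgroup: str, rpus: int) -> dict:
    if rpus % 8 != 0:
        raise ValueError("base capacity must be a multiple of 8 RPUs")
    return {"workgroupName": workgroup, "baseCapacity": rpus}

params = base_capacity_update("default", 32)
```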

&lt;p&gt;In the query editor, run the SQL block below. This will join the event, sales, venue, category and date tables and create a view of the results for the aggregated ticket price, commissions and total number of tickets sold for a given event at a venue on a calendar date.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;worldwide_events_vw&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; 
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;caldate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qtr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;holiday&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eventname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;catname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="n"&gt;venuename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;venuecity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;venueseats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pricepaid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;commission&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_commission&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qtysold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_tickets&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"worldwide_event_data_exchange"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"event"&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="nv"&gt;"worldwide_event_data_exchange"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"sales"&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; 
&lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eventid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eventid&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"worldwide_event_data_exchange"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"venue"&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;venueseats&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;venue&lt;/span&gt; 
&lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;venueid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;venue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;venueid&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="nv"&gt;"worldwide_event_data_exchange"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"category"&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cat&lt;/span&gt;
&lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;catid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;catid&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="nv"&gt;"worldwide_event_data_exchange"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"date"&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;datetable&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;datetable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dateid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dateid&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;caldate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qtr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;holiday&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eventname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;catname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;venuename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;venuecity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;venueseats&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="k"&gt;no&lt;/span&gt; &lt;span class="k"&gt;schema&lt;/span&gt; &lt;span class="n"&gt;binding&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
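&lt;p&gt;The same statement could also be submitted without the console via the Redshift Data API (redshift-data ExecuteStatement, which accepts a workgroup name for Serverless). The sketch below only assembles the request parameters; the SQL is abbreviated and the workgroup and database names match this walkthrough.&lt;/p&gt;

```python
# Sketch: parameters for redshift_data.execute_statement to run the
# view-creation query against the Serverless workgroup. The SQL here is
# abbreviated; paste the full statement from the editor above.

CREATE_VIEW_SQL = (
    "CREATE VIEW worldwide_events_vw AS "
    "SELECT ... "  # abbreviated: full SELECT with joins as shown above
    "WITH NO SCHEMA BINDING;"
)

def execute_statement_params(workgroup: str, database: str, sql: str) -> dict:
    return {"WorkgroupName": workgroup, "Database": database, "Sql": sql}

params = execute_statement_params("default", "dev", CREATE_VIEW_SQL)
```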



&lt;p&gt;By default, creating views from external tables is not supported and will throw an error as shown below. The &lt;code&gt;with no schema binding&lt;/code&gt; clause at the end of the query allows us to create the view successfully.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjobufp9v5ks6gicntzf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjobufp9v5ks6gicntzf.png" alt="Image description" width="800" height="105"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the query ran successfully, the view should be visible under the Views dropdown after a refresh (highlighted in blue in the screenshot below). Now let's check the data in the view by running &lt;code&gt;SELECT * FROM "public"."worldwide_events_vw"&lt;/code&gt;. The results should be similar to the screenshot below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvp21sf34w8z7847x19i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvp21sf34w8z7847x19i.png" alt="Image description" width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Unloading data to S3
&lt;/h2&gt;

&lt;p&gt;Create an S3 bucket or use an existing one if you wish. I have created a new S3 bucket &lt;code&gt;redshift-worldwide-events&lt;/code&gt; with an events folder. The &lt;a href="https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html" rel="noopener noreferrer"&gt;UNLOAD&lt;/a&gt; SQL command allows users to unload the results of a query to an S3 bucket. This requires the query, the S3 path and an IAM role ARN granting permission to write to the S3 bucket. In addition, we will add the following extra options to the command, to create a single CSV file that includes a header row.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CSV DELIMITER AS&lt;/code&gt;: use CSV format with ',' as the delimiter&lt;/li&gt;
&lt;li&gt;&lt;code&gt;HEADER&lt;/code&gt;: write the first row as a header row&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CLEANPATH&lt;/code&gt;: remove any existing files on the S3 path before unloading the new query results&lt;/li&gt;
&lt;li&gt;&lt;code&gt;PARALLEL OFF&lt;/code&gt;: turn off parallel writes, as we want a single CSV file rather than multiple partitions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the query will look like below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;unload&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'SELECT * FROM worldwide_events_vw'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="s1"&gt;'s3://redshift-worldwide-events/events/worldwide_events'&lt;/span&gt; 
&lt;span class="n"&gt;iam_role&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;your-iam-role-arn&amp;gt;'&lt;/span&gt;
&lt;span class="n"&gt;CSV&lt;/span&gt; &lt;span class="k"&gt;DELIMITER&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="s1"&gt;','&lt;/span&gt; 
&lt;span class="n"&gt;HEADER&lt;/span&gt; 
&lt;span class="n"&gt;cleanpath&lt;/span&gt; 
&lt;span class="n"&gt;parallel&lt;/span&gt; &lt;span class="k"&gt;off&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The S3 bucket path needs to be in the format &lt;code&gt;&amp;lt;s3://object-path/name-prefix&amp;gt;&lt;/code&gt;. So if the bucket created is &lt;code&gt;redshift-worldwide-events&lt;/code&gt; with an events folder, the object path becomes &lt;code&gt;redshift-worldwide-events/events&lt;/code&gt;. The name-prefix is the object name prefix, which gets concatenated with a slice number (000 for a single file). So if the name-prefix is set to &lt;code&gt;worldwide_events&lt;/code&gt;, the object stored will be named &lt;code&gt;worldwide_events000&lt;/code&gt;. The IAM role ARN is that of the role we associated with the Redshift cluster during setup. You can find it in the dashboard by navigating to the namespace configuration (highlighted in yellow in the screenshot below).&lt;/p&gt;
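&lt;p&gt;The naming rule above can be sketched with a couple of hypothetical helpers:&lt;/p&gt;

```python
# Sketch of the UNLOAD naming convention: the target is
# s3://object-path/name-prefix, and with PARALLEL OFF the single output
# object gets slice number 000 appended to the prefix.

def unload_target(bucket: str, folder: str, prefix: str) -> str:
    return f"s3://{bucket}/{folder}/{prefix}"

def expected_object_key(folder: str, prefix: str, slice_no: int = 0) -> str:
    return f"{folder}/{prefix}{slice_no:03d}"

target = unload_target("redshift-worldwide-events", "events", "worldwide_events")
key = expected_object_key("events", "worldwide_events")
```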

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9k3aaj906zvinkgf9lxu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9k3aaj906zvinkgf9lxu.png" alt="Image description" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;
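&lt;p&gt;The key naming rule above can be sketched in a few lines of Python. The helper below is illustrative only (it is not part of any AWS SDK); it simply joins the object path with the name prefix and a zero-padded slice number, the way UNLOAD does:&lt;/p&gt;

```python
def unload_object_key(object_path, name_prefix, slice_num=0):
    """Build the S3 key an UNLOAD produces: object path, then the
    name prefix concatenated with a zero-padded slice number."""
    return f"{object_path}/{name_prefix}{slice_num:03d}"

# With PARALLEL OFF the output is a single file, slice number 000.
key = unload_object_key("redshift-worldwide-events/events", "worldwide_events")
print(key)  # redshift-worldwide-events/events/worldwide_events000
```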

&lt;p&gt;If the query is successful, you should see a success message in the editor as below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F493q4sq19sjhyckib1f6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F493q4sq19sjhyckib1f6.png" alt="Image description" width="800" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Navigate to the S3 bucket and check that the file is visible. It can now be downloaded or accessed via other AWS services for further analysis as required. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngi2jy8l2wf1dmexw60j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngi2jy8l2wf1dmexw60j.png" alt="Image description" width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;
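&lt;p&gt;The same check can also be scripted. Below is a minimal sketch of the filtering logic, run against a mocked &lt;code&gt;list_objects_v2&lt;/code&gt;-style response so it works without AWS credentials; with boto3, &lt;code&gt;contents&lt;/code&gt; would come from &lt;code&gt;s3.list_objects_v2(Bucket="redshift-worldwide-events")["Contents"]&lt;/code&gt;:&lt;/p&gt;

```python
def keys_with_prefix(contents, prefix):
    """Filter a list_objects_v2-style 'Contents' list down to keys under a prefix."""
    return [obj["Key"] for obj in contents if obj["Key"].startswith(prefix)]

# Mocked response entries; in practice these would come from
# boto3.client("s3").list_objects_v2(Bucket="redshift-worldwide-events")["Contents"].
contents = [
    {"Key": "events/worldwide_events000", "Size": 1024},
    {"Key": "logs/query.log", "Size": 10},
]
print(keys_with_prefix(contents, "events/worldwide_events"))  # ['events/worldwide_events000']
```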

&lt;h2&gt;
  
  
  Data Recovery, Monitoring and Cost Management
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/serverless-snapshots-recovery.html" rel="noopener noreferrer"&gt;Recovery points&lt;/a&gt; in Amazon Redshift Serverless are created approximately every 30 minutes and saved for 24 hours.&lt;br&gt;
In the Redshift Serverless console, the data backup tab shows the list of recovery points, which can be restored if there is a failure. Since recovery points are only retained for 24 hours, we can also create a snapshot from one to keep it for later use. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzwc7lh264v2cnp0vm4l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzwc7lh264v2cnp0vm4l.png" alt="Image description" width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;
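&lt;p&gt;The 24-hour retention window is easy to model. This plain-Python sketch illustrates the retention rule only (the timestamps are hypothetical; restoring an actual recovery point is done via the console or API):&lt;/p&gt;

```python
from datetime import datetime, timedelta

def restorable_recovery_points(points, now, retention_hours=24):
    """Keep only recovery points still inside the retention window."""
    cutoff = now - timedelta(hours=retention_hours)
    return [p for p in points if p >= cutoff]

now = datetime(2024, 1, 2, 12, 0)
# Recovery points are taken roughly every 30 minutes; four example ages here.
points = [now - timedelta(hours=h) for h in (1, 12, 23, 30)]
print(len(restorable_recovery_points(points, now)))  # 3 (the 30-hour-old point has expired)
```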

&lt;p&gt;We can monitor &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/metrics.html" rel="noopener noreferrer"&gt;cluster performance&lt;/a&gt; via CloudWatch metrics such as CPU utilization and latency, surfaced in both the Redshift console and CloudWatch. In addition, we can monitor database query and load events directly in the console at a 1-minute resolution. The screenshot below shows the RPU capacity used for some of the queries executed. As the number of queries increases, Redshift Serverless automatically scales to optimise performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6b4hm30624ap12ipqth.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6b4hm30624ap12ipqth.png" alt="Image description" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Based on usage patterns, we can monitor usage and control billing by updating the &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/serverless-capacity.html" rel="noopener noreferrer"&gt;Base Capacity&lt;/a&gt; and the maximum RPU-hours per day, week or month in the respective sections of the workgroup configuration dashboard, as shown in the screenshot below. The default base capacity is 128 RPUs, which can be reduced to a minimum of 8 for simpler queries on smaller data. Under &lt;code&gt;Manage usage limits&lt;/code&gt;, we can set a maximum RPU-hours limit per frequency and decide what action to take if it is breached (e.g. turn off user queries, send an alert). How to configure this is described in more detail in the &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/serverless-billing.html" rel="noopener noreferrer"&gt;docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq35sggwj93563kzkxnxz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq35sggwj93563kzkxnxz.png" alt="Image description" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;
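&lt;p&gt;To reason about the cost impact of these settings, RPU-hours can be estimated as capacity multiplied by runtime. A small sketch follows; the price per RPU-hour used below is an assumed figure for illustration, so check your region's actual rate:&lt;/p&gt;

```python
def rpu_hours(rpus, seconds):
    """Billed RPU-hours for a query running at a given capacity."""
    return rpus * seconds / 3600

def estimated_cost(rpus, seconds, price_per_rpu_hour):
    return rpu_hours(rpus, seconds) * price_per_rpu_hour

# A 90-second query at the default base capacity of 128 RPUs,
# with an assumed price of $0.375 per RPU-hour.
print(rpu_hours(128, 90))                        # 3.2
print(round(estimated_cost(128, 90, 0.375), 2))  # 1.2
```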

&lt;p&gt;CloudWatch alarms can also be set up from the Redshift Serverless dashboard by choosing Alarms from the navigation menu and selecting Create alarm. We can then set an alarm on a metric to track in either the namespace or the workgroup. In the screenshot below, I have set an alarm to monitor compute capacity and trigger when it exceeds a threshold of 80 RPUs for 10 periods of 1 minute each. However, you can lower the threshold and/or the number of consecutive periods if required. The minimum duration of a period is 1 minute, and it can be increased in increments from the dropdown as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkdpxjtfun6ohsonwkmq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkdpxjtfun6ohsonwkmq.png" alt="Image description" width="800" height="703"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the serverless dashboard, you can now see the status of the alarm. It starts in the &lt;code&gt;INSUFFICIENT_DATA&lt;/code&gt; state until enough data has been gathered to evaluate it, at which point it changes to either &lt;code&gt;OK&lt;/code&gt; or &lt;code&gt;ALARM&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnnnh7sfhem0p1rk4uozf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnnnh7sfhem0p1rk4uozf.png" alt="Image description" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;
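&lt;p&gt;The alarm lifecycle can be sketched with a simplified model of CloudWatch's evaluation: too few datapoints gives &lt;code&gt;INSUFFICIENT_DATA&lt;/code&gt;, and a run of consecutive breaching datapoints flips the state to &lt;code&gt;ALARM&lt;/code&gt;. Real alarms also handle missing data and evaluation ranges, which this sketch ignores:&lt;/p&gt;

```python
def alarm_state(datapoints, threshold=80, periods=10):
    """Simplified CloudWatch-style evaluation: ALARM when `periods`
    consecutive datapoints breach the threshold."""
    if len(datapoints) < periods:
        return "INSUFFICIENT_DATA"
    run = 0
    for value in datapoints:
        run = run + 1 if value > threshold else 0
        if run >= periods:
            return "ALARM"
    return "OK"

print(alarm_state([8.0] * 5))                 # INSUFFICIENT_DATA
print(alarm_state([8.0] * 20))                # OK
print(alarm_state([8.0] * 5 + [128.0] * 10))  # ALARM
```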

&lt;p&gt;Clicking on the &lt;code&gt;View in CloudWatch&lt;/code&gt; widget allows users to view more details in the CloudWatch dashboard like the alarm status over the most recent time period and history of the alarm state changes as shown in the screenshot below. For a more detailed explanation on Alarm Periods and Evaluation Periods, please refer to the &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html" rel="noopener noreferrer"&gt;AWS documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgpr0dv7ccs3buyos8qh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgpr0dv7ccs3buyos8qh.png" alt="Image description" width="800" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Clicking on the date of a state change in the history tab brings up a summary of the change in JSON format. Here we can diagnose why an alarm changed state, for example from OK to ALARM in the screenshot below: compute capacity exceeded 80 RPUs for 10 consecutive datapoints (each evaluated over a 1-minute period), which matches the configuration we set. In this case it was probably a longer-running query that required a higher compute capacity (128 RPUs). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wlq25lxw0bdgrplwuyd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wlq25lxw0bdgrplwuyd.png" alt="Image description" width="800" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That concludes the first part of this tutorial. Continue to &lt;a href="https://dev.to/aws-builders/data-analysis-with-redshift-serverless-and-quicksight-part-2-5c9o"&gt;part 2&lt;/a&gt; for creating visualisations in Quicksight.&lt;/p&gt;

</description>
      <category>quicksight</category>
      <category>redshift</category>
      <category>serverless</category>
      <category>data</category>
    </item>
    <item>
      <title>Building an entirely Serverless Workflow to Analyse Music Data using Step Functions, Glue and Athena</title>
      <dc:creator>Ryan Nazareth</dc:creator>
      <pubDate>Sun, 26 Feb 2023 18:52:37 +0000</pubDate>
      <link>https://forem.com/aws-builders/building-an-entirely-serverless-workflow-to-analyse-music-data-using-step-functions-glue-and-athena-4j2l</link>
      <guid>https://forem.com/aws-builders/building-an-entirely-serverless-workflow-to-analyse-music-data-using-step-functions-glue-and-athena-4j2l</guid>
      <description>&lt;p&gt;This blog will demonstrate how to create and run an entirely serverless ETL workflow using step functions to execute a glue job to read csv data from S3, carry out transformations in pyspark and writing the results to S3 destination key in parquet format. This will then trigger a glue crawler to create or update tables with the metadata from the parquet files. A successful job run, should then send an SNS notification to a user by email.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qhhktn0pbw9zlnpr76z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qhhktn0pbw9zlnpr76z.png" alt="Image description" width="800" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will use the &lt;a href="http://ocelma.net/MusicRecommendationDataset/lastfm-1K.html" rel="noopener noreferrer"&gt;LastFM dataset&lt;/a&gt; which represents listening habits for nearly 1,000 users. These are split into two &lt;code&gt;tsv&lt;/code&gt; files, one containing user profiles (gender, age, location, registration date) and the other containing details of music tracks each user has listened to, with associated timestamp.&lt;br&gt;
Using AWS Glue, we can carry out data transformations in PySpark to generate insights about the users, such as the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of distinct songs each user has played.&lt;/li&gt;
&lt;li&gt;100 most popular songs (artist and title) in the dataset, with the number of times each was played.&lt;/li&gt;
&lt;li&gt;Top 10 longest sessions (by elapsed time), with the associated information about the userid, timestamp of first and last songs in the session, and the list of songs played in the session (in order of play). A user's “session” will be assumed to be comprised of one or more songs played by that user, where each song is started within 20 minutes of the previous song’s start time.&lt;/li&gt;
&lt;/ul&gt;
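&lt;p&gt;The session rule in the last bullet can be illustrated in plain Python before we build it in PySpark. This standalone sketch (with made-up timestamps) groups one user's plays into sessions using the 20-minute cutoff:&lt;/p&gt;

```python
from datetime import datetime, timedelta

def split_into_sessions(start_times, cutoff_mins=20):
    """Group one user's play start times into sessions: a new session
    begins whenever the gap since the previous start exceeds the cutoff."""
    sessions = []
    for ts in sorted(start_times):
        if sessions and ts - sessions[-1][-1] <= timedelta(minutes=cutoff_mins):
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

t0 = datetime(2009, 5, 1, 9, 0)
plays = [t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=50)]
print(len(split_into_sessions(plays)))  # 2: the 45-minute gap starts a new session
```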
&lt;h3&gt;
  
  
  Glue Notebook and Spark Transformations
&lt;/h3&gt;

&lt;p&gt;We will create a glue job by uploading this &lt;a href="https://github.com/ryankarlos/bigdataeng/blob/master/notebooks/AWS_Glue_Notebook.ipynb" rel="noopener noreferrer"&gt;notebook&lt;/a&gt; to Amazon Glue Studio Notebooks. Before setting up any resources, let's first go through the various code snippets and functions in the notebook to describe the different transformation steps to answer the questions listed above. &lt;/p&gt;

&lt;p&gt;The first cell imports and initializes a GlueContext object, which is used to create a SparkSession to be used inside the AWS Glue job.&lt;br&gt;
Spark provides a number of classes (&lt;code&gt;StructType&lt;/code&gt;, &lt;code&gt;StructField&lt;/code&gt;) to specify the structure of the spark dataframe. &lt;code&gt;StructType&lt;/code&gt; is a collection of &lt;code&gt;StructField&lt;/code&gt; which is used to define the column name, data type and a flag for nullable or not.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.types import (
    StringType,
    StructField,
    StructType,
    TimestampType,
)
from pyspark.sql import DataFrame, Window
from pyspark.sql.functions import (
    col,
    collect_list,
    count,
    desc,
    expr,
    lag,
    max,
    min,
    round,
    sum,
)
from awsglue.dynamicframe import DynamicFrame
import boto3

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
client = boto3.client('s3')


SESSION_SCHEMA = StructType(
    [
        StructField("userid", StringType(), False),
        StructField("timestamp", TimestampType(), True),
        StructField("artistid", StringType(), True),
        StructField("artistname", StringType(), True),
        StructField("trackid", StringType(), True),
        StructField("trackname", StringType(), True),
    ]
)

S3_PATH="s3://lastfm-dataset/user-session-track.tsv"
BUCKET="lastfm-dataset"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function reads the LastFM dataset (tab-delimited, via Spark's csv reader) into a Spark dataframe from the S3 bucket &lt;code&gt;lastfm-dataset&lt;/code&gt;, using the S3_PATH and schema definition defined above. We drop the columns we do not need. The schema is printed below by calling the &lt;code&gt;printSchema()&lt;/code&gt; method of the Spark dataframe.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def read_session_data(spark):
    data = (
        spark.read.format("csv")
        .option("header", "false")
        .option("delimiter", "\t")
        .schema(SESSION_SCHEMA)
        .load(S3_PATH)
    )
    cols_to_drop = ("artistid", "trackid")
    return data.drop(*cols_to_drop).cache()

df = read_session_data(spark)
df.printSchema()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikqm91tk6iovnunpf2db.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikqm91tk6iovnunpf2db.png" alt="Image description" width="592" height="148"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The function &lt;code&gt;create_users_and_distinct_songs_count&lt;/code&gt; computes the number of distinct songs played per user, by selecting the columns &lt;code&gt;userid&lt;/code&gt;, &lt;code&gt;artistname&lt;/code&gt; and &lt;code&gt;trackname&lt;/code&gt;, dropping duplicate rows and performing a groupBy count for each &lt;code&gt;userid&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def create_users_and_distinct_songs_count(df: DataFrame) -&amp;gt; DataFrame:
    df1 = df.select("userid", "artistname", "trackname").dropDuplicates()
    df2 = (
        df1.groupBy("userid")
        .agg(count("*").alias("DistinctTrackCount"))
        .orderBy(desc("DistinctTrackCount"))
    )
    return df2

songs_per_user = create_users_and_distinct_songs_count(df)
songs_per_user.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vnvet7fobf4afuzrqo4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vnvet7fobf4afuzrqo4.png" alt="Image description" width="364" height="621"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;create_popular_songs&lt;/code&gt; function performs a groupBy count over the &lt;code&gt;artistname&lt;/code&gt; and &lt;code&gt;trackname&lt;/code&gt; columns, orders the result in descending order of counts, and applies a limit to get the 100 most popular songs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def create_popular_songs(df: DataFrame, limit=100) -&amp;gt; DataFrame:
    df1 = (
        df.groupBy("artistname", "trackname")
        .agg(count("*").alias("CountPlayed"))
        .orderBy(desc("CountPlayed"))
        .limit(limit)
    )
    return df1

popular_songs = create_popular_songs(df)
popular_songs.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyi9sjcpuw2vzjzudgts0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyi9sjcpuw2vzjzudgts0.png" alt="Image description" width="800" height="645"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next snippet lags the previous timestamp within each user partition (using a window function) and computes the difference between the current and previous timestamps per user. We then create a binary session flag per user, set when the time between successively played tracks exceeds the session cutoff (20 minutes). A &lt;code&gt;SessionID&lt;/code&gt; column then computes a cumulative sum over the sessionflag column for each user. &lt;/p&gt;

&lt;p&gt;We then group the Spark DataFrame by &lt;code&gt;userid&lt;/code&gt; and &lt;code&gt;SessionID&lt;/code&gt; and compute min and max timestamp as session start and end columns. Then create a session_length (hrs) column which computes the difference between session end and start for each row and convert to hours. Order the DataFrame from max to min session length and limit to top 10 sessions as required.&lt;/p&gt;

&lt;p&gt;To get the list of tracks for each session, join to the original raw dataframe read in and group by &lt;code&gt;userid&lt;/code&gt;, &lt;code&gt;sessionID&lt;/code&gt; and &lt;code&gt;session_length&lt;/code&gt; in hours. Now apply the &lt;code&gt;pyspark.sql&lt;/code&gt; function &lt;code&gt;collect_list&lt;/code&gt; to each group to create a list of tracks for each session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def create_session_ids_for_all_users(
    df: DataFrame, session_cutoff: int
) -&amp;gt; DataFrame:
    w1 = Window.partitionBy("userid").orderBy("timestamp")
    df1 = (
        df.withColumn("pretimestamp", lag("timestamp").over(w1))
        .withColumn(
            "delta_mins",
            round(
                (
                    col("timestamp").cast("long")
                    - col("pretimestamp").cast("long")
                )
                / 60
            ),
        )
        .withColumn(
            "sessionflag",
            expr(
                f"CASE WHEN delta_mins &amp;gt; {session_cutoff} OR delta_mins IS NULL THEN 1 ELSE 0 END"
            ),
        )
        .withColumn("sessionID", sum("sessionflag").over(w1))
    )
    return df1


def compute_top_n_longest_sessions(df: DataFrame, limit: int) -&amp;gt; DataFrame:
    df1 = (
        df.groupBy("userid", "sessionID")
        .agg(
            min("timestamp").alias("session_start_ts"),
            max("timestamp").alias("session_end_ts"),
        )
        .withColumn(
            "session_length(hrs)",
            round(
                (
                    col("session_end_ts").cast("long")
                    - col("session_start_ts").cast("long")
                )
                / 3600
            ),
        )
        .orderBy(desc("session_length(hrs)"))
        .limit(limit)
    )
    return df1


def longest_sessions_with_tracklist(
    df: DataFrame, session_cutoff: int = 20, limit: int = 10
) -&amp;gt; DataFrame:
    df1 = create_session_ids_for_all_users(df, session_cutoff)
    df2 = compute_top_n_longest_sessions(df1, limit)
    df3 = (
        df1.join(df2, ["userid", "sessionID"])
        .select("userid", "sessionID", "trackname", "session_length(hrs)")
        .groupBy("userid", "sessionID", "session_length(hrs)")
        .agg(collect_list("trackname").alias("tracklist"))
        .orderBy(desc("session_length(hrs)"))
    )
    return df3

df_sessions = longest_sessions_with_tracklist(df)
df_sessions.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bdvgcmti9qvi29afrk7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bdvgcmti9qvi29afrk7.png" alt="Image description" width="733" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, the snippet below converts the PySpark dataframe to a Glue dynamic frame and writes it to the S3 bucket in Parquet format, using the &lt;code&gt;write_dynamic_frame()&lt;/code&gt; method. By default, this method saves the output files with an auto-generated name beginning with a &lt;code&gt;run-&lt;/code&gt; prefix. It would be better to rename this to something simpler. To do this, we can use the &lt;code&gt;copy_object()&lt;/code&gt; method of the boto3 S3 client to copy the existing object to a new location within the bucket (using a custom name as suffix, e.g. &lt;code&gt;popular_songs.parquet&lt;/code&gt;). The original object can then be deleted using the &lt;code&gt;delete_object()&lt;/code&gt; method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def rename_s3_results_key(source_key_prefix, dest_key):
    response = client.list_objects_v2(Bucket=BUCKET)
    body = response["Contents"]
    key =  [obj['Key'] for obj in body if source_key_prefix in obj['Key']]
    client.copy_object(Bucket=BUCKET, CopySource={'Bucket': BUCKET, 'Key': key[0]}, Key=dest_key)
    client.delete_object(Bucket=BUCKET, Key=key[0])

def write_ddf_to_s3(df: DataFrame, name: str):
    dyf = DynamicFrame.fromDF(df.repartition(1), glueContext, name)
    sink = glueContext.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="s3a",
        format="glueparquet",
        connection_options={"path": f"s3a://{BUCKET}/results/{name}/", "partitionKeys": []},
        transformation_ctx=f"{name}_sink",
    )
    source_key_prefix = f"results/{name}/run-"
    dest_key = f"results/{name}/{name}.parquet"
    rename_s3_results_key(source_key_prefix, dest_key)
    return sink

write_ddf_to_s3(popular_songs, "popular_songs")
write_ddf_to_s3(df_sessions, "df_sessions")
write_ddf_to_s3(songs_per_user, "distinct_songs")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the next sections we will set up all the resources defined in the architecture diagram and execute the state machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data upload to S3
&lt;/h2&gt;

&lt;p&gt;First we will create a standard bucket &lt;code&gt;lastfm-dataset&lt;/code&gt; from the AWS console to store the source files in, and enable &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/transfer-acceleration.html" rel="noopener noreferrer"&gt;transfer acceleration&lt;/a&gt; in the bucket properties to optimise transfer speed. This generates an accelerated endpoint, &lt;code&gt;s3-accelerate.amazonaws.com&lt;/code&gt;, which can be used for uploads via the CLI. Since some of these files are large, it is easier to use the &lt;code&gt;aws s3&lt;/code&gt; commands (such as &lt;code&gt;aws s3 cp&lt;/code&gt;) for uploading to the S3 bucket, as these automatically use the multipart upload feature when the file size exceeds 100MB. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1bpksh35paja91nn2t3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1bpksh35paja91nn2t3.png" alt="Image description" width="800" height="76"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1bybclhtiy4lxy3e5063.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1bybclhtiy4lxy3e5063.png" alt="Image description" width="800" height="411"&gt;&lt;/a&gt;&lt;/p&gt;
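&lt;p&gt;As a rough illustration of how a multipart upload splits a file, the part count is just the file size divided by the part size, rounded up. The 8 MiB default part size below reflects the AWS CLI's documented default and is treated as an assumption here (it is configurable):&lt;/p&gt;

```python
import math

def multipart_part_count(file_size_bytes, part_size_bytes=8 * 1024 * 1024):
    """Number of parts a multipart upload splits a file into."""
    return math.ceil(file_size_bytes / part_size_bytes)

# A ~2.4 GB listening-history file with an 8 MiB part size:
print(multipart_part_count(2_400_000_000))  # 287
```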

&lt;h3&gt;
  
  
  AWS Glue Job and Crawler
&lt;/h3&gt;

&lt;p&gt;We will then create a Glue job by uploading this &lt;a href="https://github.com/ryankarlos/bigdataeng/blob/master/notebooks/AWS_Glue_Notebook.ipynb" rel="noopener noreferrer"&gt;notebook&lt;/a&gt; to Amazon Glue Studio Notebooks. We first need to create a role for Glue to assume, with permissions to access S3 as below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjgtia2am5fcdb3aoz1v8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjgtia2am5fcdb3aoz1v8.png" alt="Glue Job role" width="800" height="245"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the Amazon Glue Studio console, choose Jobs from the navigation menu. &lt;/li&gt;
&lt;li&gt;In the Create Job options section, select Upload and then select the &lt;code&gt;AWS_Glue_Notebook.ipynb&lt;/code&gt; file to upload. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnu2z8klyunf5fe0or5wq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnu2z8klyunf5fe0or5wq.png" alt="Image description" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On the next screen, name the job &lt;code&gt;LastFM_Analysis&lt;/code&gt;. Select the Glue role created previously in the IAM Role dropdown list. Choose the Spark kernel and &lt;code&gt;Start Notebook&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ngzcx0c71bi1pb7gfbb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ngzcx0c71bi1pb7gfbb.png" alt="Image description" width="800" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We should see the notebook in the next screen. Click 'Save'.
If we navigate back to the AWS Glue Studio Jobs tab, we should see the new job &lt;code&gt;LastFM_Analysis&lt;/code&gt; created. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx848gmus53zyysom90pc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx848gmus53zyysom90pc.png" alt="Image description" width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can now set up the &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/console-crawlers.html" rel="noopener noreferrer"&gt;glue crawler&lt;/a&gt; from the AWS Glue console, with the settings in the screenshot below. This will collect metadata from the Glue output Parquet files in S3, and update the Glue catalog tables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2para0tusgaxv1keb4c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2para0tusgaxv1keb4c.png" alt="Image description" width="800" height="670"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  SNS topic
&lt;/h3&gt;

&lt;p&gt;We will also need to set up a subscription to an SNS topic so that notifications are sent to an email address, by following the &lt;a href="https://docs.aws.amazon.com/sns/latest/dg/sns-email-notifications.html" rel="noopener noreferrer"&gt;AWS docs&lt;/a&gt;. We will set up a separate task at the end of the Step Function workflow to publish to the SNS topic. Alternatively, one could configure S3 event notifications for specific S3 keys, so that any Parquet output from the Glue job publishes to the SNS topic. &lt;/p&gt;

&lt;p&gt;Once you have set up the subscription from the console, you should get an email asking you to confirm the subscription, as below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4iavj3q6p1glx62a69ri.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4iavj3q6p1glx62a69ri.png" alt="Image description" width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step Function setup and execution
&lt;/h3&gt;

&lt;p&gt;Now we need to create a state machine. This is a Step Functions workflow consisting of a set of &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/concepts-states.html" rel="noopener noreferrer"&gt;states&lt;/a&gt;,&lt;br&gt;
each of which represents a single unit of work. The state machine is defined in &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/concepts-amazon-states-language.html" rel="noopener noreferrer"&gt;Amazon States Language&lt;/a&gt;, a JSON-based notation. In this example, the Amazon States Language specification is as below. We will use this when creating the state machine in the console.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Comment": "Glue ETL flights pipeline execution",
  "StartAt": "Glue StartJobRun",
  "States": {
    "Glue StartJobRun": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun",
      "Parameters": {
        "JobName": "LastFM_Analysis",
        "MaxCapacity": 2
      },
      "ResultPath": "$.gluejobresults",
      "Next": "Wait"
    },
    "Wait": {
      "Type": "Wait",
      "Seconds": 30,
      "Next": "Get Glue Job status"
    },
    "Get Glue Job status": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:glue:getJobRun",
      "Parameters": {
        "JobName.$": "$.gluejobresults.JobName",
        "RunId.$": "$.gluejobresults.JobRunId"
      },
      "Next": "Check Glue Job status",
      "ResultPath": "$.gluejobresults.status"
    },
    "Check Glue Job status": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.gluejobresults.status.JobRun.JobRunState",
          "StringEquals": "SUCCEEDED",
          "Next": "StartCrawler"
        }
      ],
      "Default": "Wait"
    },
    "StartCrawler": {
      "Type": "Task",
      "Parameters": {
        "Name": "LastFM-crawler"
      },
      "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
      "Next": "Wait for crawler to complete"
    },
    "Wait for crawler to complete": {
      "Type": "Wait",
      "Seconds": 70,
      "Next": "SNS Publish Success"
    },
    "SNS Publish Success": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:*:Default",
        "Message.$": "$"
      },
      "Next": "Success"
    },
    "Success": {
      "Type": "Succeed"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before creating the state machine, we will also need to create a role for Step Functions to assume, with permissions to call the various services, e.g. Glue, Athena, SNS, and CloudWatch (if logging is enabled when creating the state machine), using AWS managed policies as below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw06fkss2cdgqvxumg6u6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw06fkss2cdgqvxumg6u6.png" alt="Image description" width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the Step Functions console, in the State Machine tab:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select Create State Machine &lt;/li&gt;
&lt;li&gt;Select "Write your workflow in code" with Type "Standard"&lt;/li&gt;
&lt;li&gt;Paste in the Amazon States Language specification. If the definition is valid, this will generate a visual representation of the state machine as below.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhl1bgyptnn8dm6lw1aku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhl1bgyptnn8dm6lw1aku.png" alt="Image description" width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select Next, then in the "Specify Details" section fill in the state machine name, select the execution role created previously from the dropdown, and turn on logging to CloudWatch. Then click "Create State Machine".&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fys9wh9yj8z5yorakni41.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fys9wh9yj8z5yorakni41.png" alt="Image description" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let us go through what each state does. The first task state &lt;code&gt;Glue StartJobRun&lt;/code&gt; starts the Glue job &lt;code&gt;LastFM_Analysis&lt;/code&gt; with a capacity of 2 data processing units (DPUs), as specified in the parameters block. The output of this state is then included under the &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/input-output-resultpath.html" rel="noopener noreferrer"&gt;ResultPath&lt;/a&gt; &lt;code&gt;$.gluejobresults&lt;/code&gt;, along with the original input. This gives subsequent states access to Glue job metadata such as the job name, run id, and status, to use as parameters. &lt;/p&gt;
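To make the ResultPath behaviour concrete, the minimal sketch below mimics how &lt;code&gt;"ResultPath": "$.gluejobresults"&lt;/code&gt; grafts a task's output onto the state input (a simplification of the full ASL semantics, handling only top-level paths; the run id shown is illustrative):

```python
import copy

def apply_result_path(state_input, task_output, result_path):
    """Insert task_output into state_input at result_path.

    Only simple top-level paths of the form '$.<key>' are handled here.
    """
    key = result_path.split(".", 1)[1]
    merged = copy.deepcopy(state_input)
    merged[key] = task_output
    return merged

state = apply_result_path(
    {"Comment": "original input"},
    {"JobName": "LastFM_Analysis", "JobRunId": "jr_abc123"},  # illustrative run id
    "$.gluejobresults",
)
# Subsequent states can now reference $.gluejobresults.JobName
# and $.gluejobresults.JobRunId, while the original input is preserved.
```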

&lt;p&gt;The next state is a &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-wait-state.html" rel="noopener noreferrer"&gt;Wait state&lt;/a&gt;, which pauses execution of the state machine for 30 seconds before proceeding to the task that checks the Glue job status. Using a &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-choice-state.html" rel="noopener noreferrer"&gt;Choice state&lt;/a&gt;, we add a condition to proceed to the next task (&lt;code&gt;StartCrawler&lt;/code&gt;) if the Glue job status is &lt;code&gt;SUCCEEDED&lt;/code&gt;; otherwise it loops back to the Wait state for another 30 seconds before checking again. This ensures we only start crawling the data in S3 once the Glue job has completed, so the output parquet files are available and ready to be crawled. &lt;/p&gt;

&lt;p&gt;Similarly, after the &lt;code&gt;StartCrawler&lt;/code&gt; task, we add a wait state that pauses the state machine for 70 seconds (we expect the crawler to complete within a minute), so that the notification is published to the SNS topic &lt;code&gt;Default&lt;/code&gt; only after the crawler has finished.&lt;/p&gt;
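A fixed 70-second wait is a pragmatic shortcut; a more robust alternative would poll the crawler until it returns to the READY state. A rough sketch (the boto3 call needs real credentials, and the timeout values are arbitrary choices):

```python
import time

def is_finished(crawler_state):
    # A Glue crawler reports RUNNING, then STOPPING, then READY once done
    return crawler_state == "READY"

def wait_for_crawler(name, poll_seconds=15, timeout=600):
    import boto3  # deferred import; requires AWS credentials
    glue = boto3.client("glue")
    waited = 0
    while waited < timeout:
        state = glue.get_crawler(Name=name)["Crawler"]["State"]
        if is_finished(state):
            return True
        time.sleep(poll_seconds)
        waited += poll_seconds
    return False  # crawler did not finish within the timeout
```

In the Step Functions workflow itself, the same effect could be achieved by looping a Wait state into a `glue:getCrawler` task and a Choice state, mirroring the Glue job polling pattern above.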

&lt;p&gt;Now the state machine can be executed. If the Step Functions execution completes successfully, we should see output similar to the below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgaouqgw1n64vh20lza44.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgaouqgw1n64vh20lza44.png" alt="Image description" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;
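Executions can also be started programmatically instead of from the console; a minimal sketch, with a placeholder state machine ARN:

```python
import json

def execution_input(comment):
    # The input becomes $ in the first state; keep it JSON-serialisable
    return json.dumps({"Comment": comment})

def start_execution(state_machine_arn, payload):
    import boto3  # deferred import; needs credentials and the real ARN
    return boto3.client("stepfunctions").start_execution(
        stateMachineArn=state_machine_arn, input=payload
    )

payload = execution_input("triggered manually")
# start_execution("arn:aws:states:us-east-1:123456789012:stateMachine:GlueETL", payload)
```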

&lt;p&gt;If the Glue job succeeds, we should see the parquet files in dedicated subfolders under the results folder of the S3 bucket. You should also get a notification at the email address subscribed to the SNS topic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2g58uzqy37ukcuwa473t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2g58uzqy37ukcuwa473t.png" alt="Image description" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcdpqgr9cjosysmz0gl9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcdpqgr9cjosysmz0gl9.png" alt="Image description" width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The catalog tables should be created after the crawler completes successfully. We can now query the tables in Athena, as below. The tables could also be accessed from Amazon QuickSight or Tableau to build visualisation dashboards for further insights.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25528ictnv2fbhjvzaga.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25528ictnv2fbhjvzaga.png" alt="Image description" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvz3f0ki45mb47uho9g3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvz3f0ki45mb47uho9g3.png" alt="Image description" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>blockchain</category>
      <category>web3</category>
      <category>crypto</category>
      <category>cryptocurrency</category>
    </item>
    <item>
      <title>What movie to watch next ? Amazon Personalize to the rescue - Part 2</title>
      <dc:creator>Ryan Nazareth</dc:creator>
      <pubDate>Tue, 04 Oct 2022 10:24:56 +0000</pubDate>
      <link>https://forem.com/aws-builders/what-movie-to-watch-next-amazon-personalize-to-the-rescue-part-2-2pj9</link>
      <guid>https://forem.com/aws-builders/what-movie-to-watch-next-amazon-personalize-to-the-rescue-part-2-2pj9</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/aws-builders/recommending-movies-with-amazon-personalize-part-1-4o7a"&gt;first part&lt;/a&gt; of this blog, we used AWS Step Functions to orchestrate a workflow to run a Glue Job on data from S3, trigger an import job in Personalize and train a model (recipe). In this section, we will focus on deploying the model and getting batch and realtime recommendations for movies.The designed architecture is as shown in the screenshot below. Again, all scripts referenced in code snippets in this blog can be found in my &lt;a href="https://github.com/ryankarlos/AWS-ML-services/tree/master/projects/personalize" rel="noopener noreferrer"&gt;github&lt;/a&gt; repository.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4n3ujqm4juvgegbrwkf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4n3ujqm4juvgegbrwkf.png" alt="personalize-recommendation-workflow" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the &lt;a href="https://docs.aws.amazon.com/personalize/latest/dg/native-recipe-new-item-USER_PERSONALIZATION.html" rel="noopener noreferrer"&gt;User-Personalization recipe&lt;/a&gt;, Amazon Personalize generates scores for items based on a user's interaction data and metadata. These scores represent the relative certainty that Amazon Personalize has in whether the user will interact with the item next. Higher scores represent greater certainty, as described in the &lt;a href="https://docs.aws.amazon.com/personalize/latest/dg/recommendations.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;. Amazon Personalize scores all the items in your catalog relative to each other on a scale from 0 to 1 (both inclusive), so that the total of all scores equals 1. For example, if you're getting movie recommendations for a user and there are three movies in the Items dataset, their scores might be 0.6, 0.3, and 0.1. Similarly, if you have 1,000 movies in your inventory, the highest-scoring movies might have very small scores (the average score would be .001), but, because scoring is relative, the recommendations are still valid. &lt;/p&gt;

&lt;h3&gt;
  
  
  Personalize Batch Inference Job
&lt;/h3&gt;

&lt;p&gt;The CloudFormation stack created in part 1 should have deployed the necessary resources to run the batch job, i.e. a Lambda function which is triggered when input data is added to S3 and creates the batch job in Personalize. The Lambda function will automatically trigger either a batch segment job or a batch inference job, depending on the filename. A &lt;code&gt;users.json&lt;/code&gt; file is assumed to contain the user ids for which we require item recommendations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4638"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"663"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"94"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3384"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1030"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"162540"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"15000"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"13"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"50"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"80000"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"20000"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"110000"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"5000"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"9000"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"34567"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will trigger a batch inference job using the solution version ARN specified as a Lambda environment variable (defined through the CloudFormation stack parameters). We will be using the solution version trained with the &lt;strong&gt;USER_PERSONALIZATION&lt;/strong&gt; recipe. An &lt;code&gt;items.json&lt;/code&gt; file, on the other hand, will trigger a batch segment job and should have the format below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"itemId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1240"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"itemId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"33794"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"itemId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"89745"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"itemId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"89747"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"itemId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"89753"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"itemId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1732"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"itemId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"8807"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"itemId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"7153"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"itemId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"44"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"itemId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"165"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"itemId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"307"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"itemId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"306"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"itemId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"457"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"itemId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"586"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"itemId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"588"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"itemId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"589"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"itemId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"596"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will return, for each item, a list of the users with the highest probabilities of interacting with it. &lt;strong&gt;Note&lt;/strong&gt; that a batch segment job requires the solution to be trained with a &lt;strong&gt;USER_SEGMENTATION&lt;/strong&gt; recipe and will throw an error if another recipe is used. Training a new solution with this recipe is beyond the scope of this tutorial. The Lambda config should look as below, with the event trigger set as S3.&lt;/p&gt;
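The filename-based routing inside the first Lambda could look roughly like this (a sketch of the convention described above, not the actual handler code from the repository):

```python
def job_type_for_key(key):
    """Map an S3 object key to the Personalize batch job type to create."""
    filename = key.rsplit("/", 1)[-1]
    if filename == "users.json":
        return "batch_inference"   # item recommendations per user
    if filename == "items.json":
        return "batch_segment"     # user segments per item
    raise ValueError(f"unexpected input file: {filename}")
```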

&lt;p&gt;A second Lambda function runs a transform operation when the results from the batch job are written to S3 by Personalize. If successful, a notification is sent to an SNS topic configured with an email endpoint, to alert when the workflow completes. The batch job output from Personalize is JSON in the following format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"recommendedItems"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="err"&gt;....&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="nl"&gt;"scores"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="err"&gt;....&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"2"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"recommendedItems"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="err"&gt;....&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="nl"&gt;"scores"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;......&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;.....&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the transformation, we intend to return a structured dataset serialised in parquet format (with snappy compression), with the following schema:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;userID&lt;/strong&gt;: integer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recommendations&lt;/strong&gt;: string&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The movie id is mapped to the title, associated genre, and release year for each user, as below. Each recommendation is separated by a &lt;code&gt;|&lt;/code&gt; delimiter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   userId                           Recommendations
0    1       Movie Title (year) (genre) | Movie Title (year) (genre) | ....
1    2       Movie Title (year) (genre) | Movie Title (year) (genre) | ....
......
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This Lambda also uses the AWS-managed Data Wrangler Lambda layer, so the pandas and numpy libraries are available. The configuration should look like below, with the Lambda layer attached and SNS as the destination.&lt;/p&gt;
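Stripped of the pandas specifics, the core of the transform maps each recommended movie id to its display string and joins them with the delimiter. A simplified, self-contained sketch with a hypothetical two-movie catalog:

```python
def format_recommendations(record, catalog):
    """record: one parsed line of the Personalize batch output.
    catalog: movieId -> 'Title (year) (genre)' lookup (hypothetical here)."""
    user_id = int(record["input"]["userId"])
    titles = [catalog[i] for i in record["output"]["recommendedItems"] if i in catalog]
    return {"userId": user_id, "Recommendations": " | ".join(titles)}

# Illustrative catalog; the real Lambda builds this from the movies metadata
catalog = {
    "1": "Toy Story (1995) (Adventure)",
    "2": "Jumanji (1995) (Adventure)",
}
row = format_recommendations(
    {
        "input": {"userId": "1"},
        "output": {"recommendedItems": ["2", "1"], "scores": [0.6, 0.4]},
        "error": None,
    },
    catalog,
)
```

The actual Lambda assembles such rows into a DataFrame and writes it out as snappy-compressed parquet via the Data Wrangler layer.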

&lt;p&gt;To trigger the batch inference job workflow, copy the sample &lt;code&gt;users.json&lt;/code&gt; batch data to the S3 path below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;datasets/personalize/ml-25m/batch/input/users.json s3://recommendation-sample-data/movie-lens/batch/input/users.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a batch inference job whose name has the Unix timestamp appended. We should receive an email notification when the entire workflow completes. The outputs of the batch job and the subsequent transformation should be visible in the bucket under the keys &lt;code&gt;movie-lens/batch/results/inference/users.json.out&lt;/code&gt; and &lt;br&gt;
&lt;code&gt;movie-lens/batch/results/inference/transformed.parquet&lt;/code&gt; respectively. These have also been copied and stored &lt;a href="https://github.com/ryankarlos/AWS-ML-services/tree/master/datasets/personalize/ml-25m/batch/results" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    userId                                    Recommendations
0    15000  Kiss the Girls (1997) (Crime) | Scream (1996) ...
1   162540  Ice Age 2: The Meltdown (2006) (Adventure) | I...
2     5000  Godfather, The (1972) (Crime) | Star Wars: Epi...
3       94  Jumanji (1995) (Adventure) | Nell (1994) (Dram...
4     4638  Inglourious Basterds (2009) (Action) | Watchme...
5     9000  Die Hard 2 (1990) (Action) | Lethal Weapon 2 (...
6      663  Crow, The (1994) (Action) | Nightmare Before C...
7     1030  Sister Act (1992) (Comedy) | Lethal Weapon 4 (...
8     3384  Ocean's Eleven (2001) (Crime) | Matrix, The (1...
9    34567  Lord of the Rings: The Fellowship of the Ring,...
10      50  Grand Budapest Hotel, The (2014) (Comedy) | He...
11   80000  Godfather: Part II, The (1974) (Crime) | One F...
12  110000  Manhattan (1979) (Comedy) | Raging Bull (1980)...
13      13  Knocked Up (2007) (Comedy) | Other Guys, The (...
14   20000  Sleepless in Seattle (1993) (Comedy) | Four We...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Creating a Campaign for realtime recommendations
&lt;/h3&gt;

&lt;p&gt;A campaign is a deployed solution version (trained model) with provisioned dedicated transaction capacity for creating real-time recommendations for your application users. After preparing and importing data and creating a solution, you are ready to deploy your solution version by creating an &lt;a href="https://docs.aws.amazon.com/personalize/latest/dg/campaigns.html" rel="noopener noreferrer"&gt;AWS Personalize Campaign&lt;/a&gt;. If you only need batch recommendations, you don't need to create a campaign.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;python projects/personalize/deploy_solution.py &lt;span class="nt"&gt;--campaign_name&lt;/span&gt; MoviesCampaign &lt;span class="nt"&gt;--sol_version_arn&lt;/span&gt; &amp;lt;solution_version_arn&amp;gt; &lt;span class="nt"&gt;--mode&lt;/span&gt; create

2022-07-09 21:12:08,412 - deploy - INFO - Name: MoviesCampaign
2022-07-09 21:12:08,412 - deploy - INFO - ARN: arn:aws:personalize:........:campaign/MoviesCampaign
2022-07-09 21:12:08,412 - deploy - INFO - Status: CREATE PENDING
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An additional arg &lt;code&gt;--config&lt;/code&gt; can be passed to set the &lt;strong&gt;explorationWeight&lt;/strong&gt; and &lt;strong&gt;explorationItemAgeCutOff&lt;/strong&gt; parameters for the &lt;a href="https://docs.aws.amazon.com/personalize/latest/dg/native-recipe-new-item-USER_PERSONALIZATION.html#bandit-hyperparameters" rel="noopener noreferrer"&gt;User-Personalization recipe&lt;/a&gt;. These parameters default to 0.3 and 30.0 respectively if not passed (as in the previous example).&lt;br&gt;
To set the explorationWeight and explorationItemAgeCutOff to 0.6 and 100 respectively, run the script as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;python projects/personalize/deploy_solution.py &lt;span class="nt"&gt;--campaign_name&lt;/span&gt; MoviesCampaign &lt;span class="nt"&gt;--sol_version_arn&lt;/span&gt; &amp;lt;solution_version_arn&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--config&lt;/span&gt; &lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;itemExplorationConfig&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;explorationWeight&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;0.6&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;explorationItemAgeCutOff&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;100&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}}"&lt;/span&gt; &lt;span class="nt"&gt;--mode&lt;/span&gt; create

2022-07-09 21:12:08,412 - deploy - INFO - Name: MoviesCampaign
2022-07-09 21:12:08,412 - deploy - INFO - ARN: arn:aws:personalize:........:campaign/MoviesCampaign
2022-07-09 21:12:08,412 - deploy - INFO - Status: CREATE PENDING
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
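Since the escaped JSON passed to `--config` is easy to get wrong on the command line, here is a minimal helper (hypothetical, not part of the repo) that builds the same payload programmatically:

```python
import json

def build_campaign_config(exploration_weight="0.3", item_age_cutoff="30.0"):
    # Values are passed as strings, matching the shape shown above.
    return json.dumps({
        "itemExplorationConfig": {
            "explorationWeight": exploration_weight,
            "explorationItemAgeCutOff": item_age_cutoff,
        }
    })

# The resulting string can be passed directly to --config:
print(build_campaign_config("0.6", "100"))
```

The output can be pasted straight after `--config` without any manual quote escaping.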



&lt;h3&gt;
  
  
  Setting up API Gateway with Lambda Proxy Integration
&lt;/h3&gt;

&lt;p&gt;You can also get &lt;a href="https://docs.aws.amazon.com/personalize/latest/dg/getting-real-time-recommendations.html" rel="noopener noreferrer"&gt;real-time recommendations&lt;/a&gt; from Amazon Personalize, using the campaign created earlier to serve movie recommendations. To increase recommendation relevance, include contextual metadata for a user, such as their device type or the time of day, when you get recommendations or a personalized ranking. The API Gateway integration with the Lambda backend should already be configured if CloudFormation ran successfully. We have configured the method request to accept a querystring parameter &lt;em&gt;user_id&lt;/em&gt; and defined a model schema. An API method can be integrated with Lambda using one of two &lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/set-up-lambda-integrations.html" rel="noopener noreferrer"&gt;integration methods&lt;/a&gt;: &lt;em&gt;Lambda proxy integration&lt;/em&gt; or &lt;em&gt;Lambda non-proxy (custom) integration&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;By default, we use &lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/set-up-lambda-proxy-integrations.html" rel="noopener noreferrer"&gt;Lambda Proxy Integration&lt;/a&gt; when creating the resource in CloudFormation, which allows the client to call a single lambda function in the backend. When a client submits a request, API Gateway sends the raw request to lambda without necessarily preserving the order of the parameters. This request data includes the request headers, query string parameters, URL path variables, payload, and API configuration data as detailed &lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/set-up-lambda-proxy-integrations.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We could also use &lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/set-up-lambda-custom-integrations.html" rel="noopener noreferrer"&gt;Lambda non-proxy integration&lt;/a&gt; by setting the template parameter &lt;em&gt;APIGatewayIntegrationType&lt;/em&gt; to AWS. The difference from the proxy integration method is that we also need to configure a mapping template to map the incoming request data to the integration request, as required by the backend Lambda function. In the CloudFormation template &lt;em&gt;personalize_predict.yaml&lt;/em&gt;, this is already predefined in the &lt;em&gt;RequestTemplates&lt;/em&gt; property of the &lt;em&gt;ApiGatewayRootMethod&lt;/em&gt; resource, which translates the user_id query string parameter to the user_id property of the JSON payload. This is necessary because input to a Lambda function must be expressed in the request body. However, as the default integration type is set to &lt;em&gt;AWS_PROXY&lt;/em&gt;, the mapping template is ignored as it is not required.&lt;/p&gt;
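To make the difference concrete, here is a sketch of how a backend Lambda handler might read &lt;em&gt;user_id&lt;/em&gt; under each integration type. The handler shape is an assumption for illustration, not the repo's actual function:

```python
import json

def lambda_handler(event, context):
    # Proxy integration: API Gateway passes the raw request through, so
    # the query string arrives under 'queryStringParameters'.
    if event.get("queryStringParameters"):
        user_id = event["queryStringParameters"]["user_id"]
    else:
        # Non-proxy integration: the mapping template has already moved
        # user_id into the body, so it appears at the top of the event.
        user_id = event["user_id"]
    # Proxy integration also requires this exact response shape.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"user_id": user_id}),
    }
```

With proxy integration the function must do its own parsing but needs no mapping template; with custom integration the template does the parsing up front.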

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vs6ohlmu63ypbjfyjq2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vs6ohlmu63ypbjfyjq2.png" alt="api-gateway-get-method-execution" width="800" height="175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The API endpoint URL to be invoked should be visible from the console, under the stage tab.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqwbuqsqfkmqhczvpkik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqwbuqsqfkmqhczvpkik.png" alt="api-gateway-dev-stage-console" width="800" height="617"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Invocation and Monitoring with AWS X-Ray
&lt;/h3&gt;

&lt;p&gt;The API can be tested by typing the URL, along with the querystring parameters, into a browser address bar. For example, &lt;em&gt;&lt;a href="https://knmel67a1g.execute-api.us-east-1.amazonaws.com/dev?user_id=5" rel="noopener noreferrer"&gt;https://knmel67a1g.execute-api.us-east-1.amazonaws.com/dev?user_id=5&lt;/a&gt;&lt;/em&gt; will generate recommendations for the user with id 5.&lt;br&gt;
For monitoring, we have also configured API Gateway to send traces to X-Ray and logs to CloudWatch. Since the API is integrated with a single Lambda function, you will see nodes in the service map containing information about the overall time spent and other performance metrics for the API Gateway service, the Lambda service, and the Lambda function. The timeline shows the hierarchy of segments and subsegments. Further details on request/response times and faults/errors can be found by clicking on each segment/subsegment in the timeline. For more information, refer to the &lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-using-xray-maps.html" rel="noopener noreferrer"&gt;AWS documentation&lt;/a&gt; on using AWS X-Ray service maps and trace views with API Gateway.&lt;/p&gt;
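The same request can also be made programmatically; a minimal sketch using only the Python standard library, with a placeholder stage URL:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder - substitute the invoke URL from your API Gateway stage tab.
API_URL = "https://<api-id>.execute-api.us-east-1.amazonaws.com/dev"

def build_url(base: str, user_id: int) -> str:
    # user_id is passed as a query string parameter, as configured in the
    # method request.
    return f"{base}?{urlencode({'user_id': user_id})}"

def get_recommendations(user_id: int) -> dict:
    with urlopen(build_url(API_URL, user_id), timeout=10) as resp:
        return json.loads(resp.read())
```

Calling `get_recommendations(5)` is equivalent to the browser request above.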

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsu9jsvlz130f37w2lhtz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsu9jsvlz130f37w2lhtz.png" alt="Xrayconsole-APIGateway-lambda" width="800" height="165"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvt0je7yunwq0nt9oenc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvt0je7yunwq0nt9oenc.png" alt="Xrayconsole-trace-timeline" width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>api</category>
      <category>aws</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>What movie to watch next ? Amazon Personalize to the rescue - Part 1</title>
      <dc:creator>Ryan Nazareth</dc:creator>
      <pubDate>Sun, 02 Oct 2022 20:58:30 +0000</pubDate>
      <link>https://forem.com/aws-builders/recommending-movies-with-amazon-personalize-part-1-4o7a</link>
      <guid>https://forem.com/aws-builders/recommending-movies-with-amazon-personalize-part-1-4o7a</guid>
      <description>&lt;p&gt;Amazon Personalize allows developers with no prior machine learning experience to easily build sophisticated personalization capabilities into their applications. With Personalize, you provide an activity stream from your application, as well as an inventory of the items you want to recommend, and Personalize will process the data to train a personalization model that is customized for your data. &lt;/p&gt;

&lt;p&gt;In this tutorial, we will be using the &lt;a href="https://movielens.org/" rel="noopener noreferrer"&gt;MovieLens dataset&lt;/a&gt; which is a popular dataset used for recommendation research. We will be using the &lt;a href="https://grouplens.org/datasets/movielens/" rel="noopener noreferrer"&gt;MovieLens 25M Dataset&lt;/a&gt; under the &lt;strong&gt;Recommended for new research&lt;/strong&gt; section. This contains 25 million ratings and one million tag applications applied to 62,000 movies by 162,000 users. The scripts and code referenced in this tutorial can be found in my &lt;a href="https://github.com/ryankarlos/AWS-ML-services/tree/master/projects/personalize" rel="noopener noreferrer"&gt;github&lt;/a&gt; repository.&lt;/p&gt;

&lt;p&gt;Download the respective zip file, navigate to the folder where it is stored, and run the unzip command. You may need to install the unzip package if it is not already installed, as described in this &lt;a href="https://www.tecmint.com/install-zip-and-unzip-in-linux/" rel="noopener noreferrer"&gt;link&lt;/a&gt;. For example, on Ubuntu: &lt;code&gt;sudo apt-get install -y unzip&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;datasets/personalize
&lt;span class="nv"&gt;$ &lt;/span&gt;unzip ml-25m.zip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Important Note on Pricing
&lt;/h2&gt;

&lt;p&gt;Depending on the personalize recipe used and the size of the dataset, this can result in a large bill when training a personalize solution. I learnt this the hard way by not reading the AWS Personalize billing docs properly which resulted in this exercise &lt;strong&gt;costing me over $100&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;So I thought I would share the ways in which one could mitigate this and what to look out for when configuring the training solution.&lt;/p&gt;

&lt;p&gt;For the purpose of this tutorial, I have used the MovieLens 25M Dataset. However, one could sample a smaller dataset from this or use the MovieLens Latest Datasets recommended for education and development, which is a lot smaller (100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users). &lt;/p&gt;
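As a sketch of that sampling approach, assuming pandas and the &lt;code&gt;userId&lt;/code&gt;/&lt;code&gt;movieId&lt;/code&gt; column names from the MovieLens &lt;code&gt;ratings.csv&lt;/code&gt; (this helper is not part of the repo):

```python
import pandas as pd

def sample_users(df: pd.DataFrame, frac: float = 0.1, seed: int = 42) -> pd.DataFrame:
    # Sample whole users rather than individual rows, so every kept user
    # retains their full interaction history (Personalize trains on
    # per-user interaction sequences).
    users = df["userId"].drop_duplicates().sample(frac=frac, random_state=seed)
    return df[df["userId"].isin(users)]

# Usage (paths illustrative):
# ratings = pd.read_csv("datasets/personalize/ml-25m/ratings.csv")
# sample_users(ratings, frac=0.01).to_csv("ratings_sample.csv", index=False)
```

Sampling by user rather than by row keeps the interaction histories intact, which matters more for recommendation quality than raw row count.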

&lt;p&gt;Secondly, it should be noted that if the model complexity requires it, AWS Personalize &lt;strong&gt;will automatically scale up&lt;/strong&gt; to a suitable instance. This means more compute resources will be used to complete your jobs faster, and hence result in a larger bill.&lt;/p&gt;

&lt;p&gt;The training hours can be broken down to the following components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A training hour represents 1 hour of compute capacity using 4 vCPUs and 8 GiB of memory&lt;/li&gt;
&lt;li&gt;The number of training jobs created for the solution if HPO is enabled&lt;/li&gt;
&lt;/ol&gt;
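To see how quickly these components add up, here is a back-of-envelope estimate. The per-training-hour price below is an assumption for illustration, so check the current AWS Personalize pricing page for your region:

```python
# Illustrative price per training hour - an assumption, not a quote;
# check the AWS Personalize pricing page for the current rate.
PRICE_PER_TRAINING_HOUR = 0.24  # USD

def estimate_training_cost(training_hours: float) -> float:
    return training_hours * PRICE_PER_TRAINING_HOUR

# e.g. a run billed at 560 training hours:
print(f"${estimate_training_cost(560):.2f}")
```

At that assumed rate, 560 training hours already lands well above the $100 mark, consistent with the bill described earlier.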

&lt;p&gt;If one has enabled hyperparameter optimization (HPO) or tuning, the optimal hyperparameters are determined by running many training jobs using different values from the specified ranges of possibilities as described in the &lt;a href="https://docs.aws.amazon.com/personalize/latest/dg/customizing-solution-config-hpo.html#hpo-tuning" rel="noopener noreferrer"&gt;docs&lt;/a&gt;. In this tutorial, I have used HPO tuning with the following configuration for the training solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"hpoResourceConfig"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"maxNumberOfTrainingJobs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"16"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"maxParallelTrainingJobs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"8"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;maxNumberOfTrainingJobs&lt;/code&gt; setting indicates that the maximum number of training jobs is set to 16. Each job needs its own resources. In other words, the 560 training hours were the result of 16 training jobs as well as larger compute resources.&lt;/p&gt;

&lt;p&gt;I was wondering how to reduce the cost for future solutions, so I contacted AWS Technical Support. They recommended the following:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Disable HPO; if you want to tune your model, build something that works first, then optimise later. You can check whether HPO is enabled or disabled by running &lt;code&gt;aws personalize describe-solution &amp;lt;arn&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Go over our &lt;a href="https://github.com/aws-samples/amazon-personalize-samples/blob/master/PersonalizeCheatSheet2.0.md" rel="noopener noreferrer"&gt;cheat sheet&lt;/a&gt; provided by the Service Team.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;There is also no way to force AWS Personalize to use a specific instance type when it scales to complete the training job faster. The best cost optimisation in this scenario is to turn off HPO. Once the model is trained, there is no extra cost for keeping it active. The &lt;code&gt;ACTIVE&lt;/code&gt; status is only shown once training is complete; it does not mean training is still in progress, as described in the &lt;a href="https://docs.aws.amazon.com/personalize/latest/dg/creating-a-solution-version.html" rel="noopener noreferrer"&gt;docs&lt;/a&gt;.&lt;/p&gt;
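The HPO check can also be scripted. A small sketch that inspects the shape of the DescribeSolution response (the helper function is illustrative, not part of the repo):

```python
def hpo_enabled(describe_response: dict) -> bool:
    # The DescribeSolution response carries a performHPO flag on the
    # solution object; treat a missing flag as disabled.
    return bool(describe_response.get("solution", {}).get("performHPO"))

# Usage with boto3 (assumes credentials and a solution ARN):
# import boto3
# resp = boto3.client("personalize").describe_solution(solutionArn=solution_arn)
# print(hpo_enabled(resp))
```

This is handy to run before kicking off a solution version, given how much HPO multiplies the training hours.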

&lt;h2&gt;
  
  
  Loading data into S3
&lt;/h2&gt;

&lt;p&gt;Create an S3 bucket named &lt;code&gt;recommendation-sample-data&lt;/code&gt; and run the following command in the CLI to enable &lt;strong&gt;Transfer Acceleration&lt;/strong&gt; for the bucket. All Amazon S3 requests made by the s3 and s3api AWS CLI commands can then be directed to the accelerate endpoint: &lt;code&gt;s3-accelerate.amazonaws.com&lt;/code&gt;. We also need to set the configuration value &lt;code&gt;use_accelerate_endpoint&lt;/code&gt; to &lt;code&gt;true&lt;/code&gt; in a profile in the AWS config file. For further details, please consult the &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/transfer-acceleration-examples.html" rel="noopener noreferrer"&gt;AWS docs&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;aws s3api put-bucket-accelerate-configuration &lt;span class="nt"&gt;--bucket&lt;/span&gt; recommendation-sample-data &lt;span class="nt"&gt;--accelerate-configuration&lt;/span&gt; &lt;span class="nv"&gt;Status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Enabled
&lt;span class="nv"&gt;$ &lt;/span&gt;aws configure &lt;span class="nb"&gt;set &lt;/span&gt;default.s3.use_accelerate_endpoint &lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to &lt;strong&gt;Transfer Acceleration&lt;/strong&gt;, this &lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/s3-upload-large-files/" rel="noopener noreferrer"&gt;AWS article&lt;/a&gt; recommends using the CLI for uploads of large files, as it automatically performs &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html" rel="noopener noreferrer"&gt;multipart uploading&lt;/a&gt; when the file size is large. We can also set the maximum number of concurrent requests to 20 to use more of the host's bandwidth and resources during the upload; by default, the AWS CLI uses a maximum of 10 concurrent requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;aws configure &lt;span class="nb"&gt;set &lt;/span&gt;default.s3.max_concurrent_requests 20
&lt;span class="nv"&gt;$ &lt;/span&gt;aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;datasets/personalize/ml-25m/ s3://recommendation-sample-data/movie-lens/raw_data/ &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="nt"&gt;--recursive&lt;/span&gt; &lt;span class="nt"&gt;--endpoint-url&lt;/span&gt; https://recommendation-sample-data.s3-accelerate.amazonaws.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ua4srlsioha2og3pvu0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ua4srlsioha2og3pvu0.png" alt="s3-data-upload" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, we need to add the Glue script and Lambda function to the S3 bucket as well. This assumes the Lambda function is zipped as in &lt;code&gt;lambdas/data_import_personalize.zip&lt;/code&gt; and you have a bucket with key &lt;code&gt;aws-glue-assets-376337229415-us-east-1/scripts&lt;/code&gt;. If not, adapt the commands accordingly. Run the following commands from the root of the repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;step_functions/personalize-definition.json s3://recommendation-sample-data/movie-lens/personalize-definition.json
&lt;span class="nv"&gt;$ &lt;/span&gt;aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;lambdas/trigger_glue_personalize.zip s3://recommendation-sample-data/movie-lens/lambda/trigger_glue_personalize.zip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you have not configured Transfer Acceleration for the default Glue assets bucket, set &lt;code&gt;use_accelerate_endpoint&lt;/code&gt; back to false before running the &lt;code&gt;cp&lt;/code&gt; command, as below. Otherwise, you will get the following error: &lt;/p&gt;

&lt;p&gt;&lt;em&gt;An error occurred (InvalidRequest) when calling the PutObject operation: S3 Transfer Acceleration is not configured on this bucket&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;aws configure &lt;span class="nb"&gt;set &lt;/span&gt;default.s3.use_accelerate_endpoint &lt;span class="nb"&gt;false&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;projects/personalize/glue/Personalize_Glue_Script.py s3://aws-glue-assets-376337229415-us-east-1/scripts/Personalize_Glue_Script.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CloudFormation Templates
&lt;/h2&gt;

&lt;p&gt;The CloudFormation template for creating the resources for this example is located in this &lt;a href="https://github.com/ryankarlos/AWS-ML-services/tree/master/cloudformation" rel="noopener noreferrer"&gt;folder&lt;/a&gt;. The CloudFormation template &lt;code&gt;personalize.yaml&lt;/code&gt; creates the following resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Glue Job &lt;/li&gt;
&lt;li&gt;Personalize resources (Dataset, DatasetGroup, Schema) and associated Role &lt;/li&gt;
&lt;li&gt;Step Function for orchestrating the Glue and Personalize DatasetImport Jobs and creating a Personalize Solution&lt;/li&gt;
&lt;li&gt;Lambda function and associated Role, for triggering step function execution with S3 event notification.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can use the following CLI command to create the stack, with the path to the template passed to the &lt;code&gt;--template-body&lt;/code&gt; argument. Adapt this depending on where your template is stored. We also need to pass the &lt;code&gt;CAPABILITY_NAMED_IAM&lt;/code&gt; value to the &lt;code&gt;--capabilities&lt;/code&gt; argument, as the template includes IAM resources (e.g. an IAM role) with custom names such as a &lt;code&gt;RoleName&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt; &lt;span class="nv"&gt;$ &lt;/span&gt;aws cloudformation create-stack &lt;span class="nt"&gt;--stack-name&lt;/span&gt; PersonalizeGlue &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--template-body&lt;/span&gt; file://cloudformation/personalize.yaml &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--capabilities&lt;/span&gt; CAPABILITY_NAMED_IAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
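Rather than refreshing the console, stack creation can also be awaited with a CloudFormation waiter. A sketch with the client passed in (the helper is illustrative, not part of the repo):

```python
def wait_for_stack(cf_client, stack_name: str = "PersonalizeGlue") -> str:
    # Block until CloudFormation finishes creating the stack, then return
    # its final status (e.g. CREATE_COMPLETE).
    cf_client.get_waiter("stack_create_complete").wait(StackName=stack_name)
    stacks = cf_client.describe_stacks(StackName=stack_name)["Stacks"]
    return stacks[0]["StackStatus"]

# Usage (assumes credentials are configured):
# import boto3
# print(wait_for_stack(boto3.client("cloudformation")))
```

Injecting the client keeps the helper easy to test and to point at a different region or profile.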



&lt;p&gt;If successful, we should see the following resources created in the resources tab.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4az38c5145lwdagmhec6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4az38c5145lwdagmhec6.png" alt="cloud_formation_parameters_tab" width="800" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we run the command as above, just using the default parameters, we should see the key-value pairs listed in the parameters tab, as in the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmw656fu2141b9wzs5j98.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmw656fu2141b9wzs5j98.png" alt="cloud_formation_resources_tab" width="800" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All the services should now be created. For example, navigate to the Step Functions console and click on the state machine name &lt;code&gt;GlueETLPersonalizeTraining&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5um2b302procacchzh8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5um2b302procacchzh8.png" alt="step_function_diagram" width="778" height="1094"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F048wrqdgtzb8ieahichp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F048wrqdgtzb8ieahichp.png" alt="definition_file" width="800" height="1114"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  S3 event notifications
&lt;/h2&gt;

&lt;p&gt;We need to configure S3 event notifications for the training and prediction workflows. For the training workflow, we need an S3-to-Lambda notification for when raw data is loaded into S3, to trigger the step function execution. For the prediction workflow (batch and realtime), the following configurations are required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S3-to-Lambda notification for triggering the Personalize batch job when a batch sample data object is put into the S3 bucket prefix&lt;/li&gt;
&lt;li&gt;S3-to-Lambda notification for triggering a lambda to transform the output of the batch job added to S3. &lt;/li&gt;
&lt;li&gt;S3 notification to an SNS topic when the output of the lambda transform lands in the S3 bucket. We have configured an email subscriber to the SNS topic via CloudFormation, using the email protocol. SNS will then send an email to the subscriber address when an event message is received from S3.&lt;/li&gt;
&lt;/ul&gt;
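For reference, a sketch of the payload shape such a script would pass to S3's &lt;code&gt;put_bucket_notification_configuration&lt;/code&gt; API; the ARNs, prefixes, and the csv suffix filter here are illustrative assumptions, not the script's actual defaults:

```python
def notification_config(lambda_arn: str, topic_arn: str,
                        lambda_prefix: str, sns_prefix: str) -> dict:
    # Both configurations fire on object creation, filtered to csv
    # objects under the given prefix.
    def _filter(prefix):
        return {"Key": {"FilterRules": [
            {"Name": "prefix", "Value": prefix},
            {"Name": "suffix", "Value": ".csv"},
        ]}}
    return {
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": lambda_arn,
            "Events": ["s3:ObjectCreated:*"],
            "Filter": _filter(lambda_prefix),
        }],
        "TopicConfigurations": [{
            "TopicArn": topic_arn,
            "Events": ["s3:ObjectCreated:*"],
            "Filter": _filter(sns_prefix),
        }],
    }
```

The dict would be passed as the `NotificationConfiguration` argument of `put_bucket_notification_configuration` on a boto3 S3 client.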

&lt;p&gt;To add the bucket event notification for starting the training workflow via Step Functions, run the custom script, passing the arg &lt;code&gt;--workflow&lt;/code&gt; with value &lt;code&gt;train&lt;/code&gt;. By default, this will send an S3 event when a csv file is uploaded into the &lt;code&gt;movie-lens/batch/results/&lt;/code&gt; prefix in the bucket.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;python projects/personalize/put_notification_s3.py &lt;span class="nt"&gt;--workflow&lt;/span&gt; train

INFO:botocore.credentials:Found credentials &lt;span class="k"&gt;in &lt;/span&gt;shared credentials file: ~/.aws/credentials
INFO:__main__:Lambda arn arn:aws:lambda:........:function:LambdaSFNTrigger &lt;span class="k"&gt;for function &lt;/span&gt;LambdaSFNTrigger
INFO:__main__:HTTPStatusCode: 200
INFO:__main__:RequestId: X6X9E99JE13YV6RH
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To add bucket event notifications for batch/realtime predictions, run the script and pass &lt;code&gt;--workflow&lt;/code&gt; with value &lt;code&gt;predict&lt;/code&gt;. The default prefixes set for the object event triggers for the S3-to-Lambda and S3-to-SNS notifications can be found in the &lt;a href="https://github.com/ryankarlos/AWS-MLservices/blob/master/projects/personalize/put_notification_s3.py" rel="noopener noreferrer"&gt;source code&lt;/a&gt;. These can be overridden by passing the respective argument names (see the click options in the &lt;a href="https://github.com/ryankarlos/AWS-ML%20services/blob/master/projects/personalize/put_notification_s3.py" rel="noopener noreferrer"&gt;source code&lt;/a&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;python projects/personalize/put_notification_s3.py &lt;span class="nt"&gt;--workflow&lt;/span&gt; predict

INFO:botocore.credentials:Found credentials &lt;span class="k"&gt;in &lt;/span&gt;shared credentials file: ~/.aws/credentials
INFO:__main__:Lambda arn arn:aws:lambda:us-east-1:376337229415:function:LambdaBatchTrigger &lt;span class="k"&gt;for function &lt;/span&gt;LambdaBatchTrigger
INFO:__main__:Lambda arn arn:aws:lambda:us-east-1:376337229415:function:LambdaBatchTransform &lt;span class="k"&gt;for function &lt;/span&gt;LambdaBatchTransform
INFO:__main__:Topic arn arn:aws:sns:us-east-1:376337229415:PersonalizeBatch &lt;span class="k"&gt;for &lt;/span&gt;PersonalizeBatch
INFO:__main__:HTTPStatusCode: 200
INFO:__main__:RequestId: Q0BCATSW52X1V299
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: There is currently no support for notifications to FIFO type SNS topics. &lt;/p&gt;

&lt;h2&gt;
  
  
  Trigger Workflow for Training Solution
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26hxvd628lnqtfy324tw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26hxvd628lnqtfy324tw.png" alt="personalize_train" width="800" height="620"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Lambda function and step function resources in the workflow should already be created via CloudFormation. We will trigger the workflow by uploading the raw dataset into the S3 path for which the S3 event notification is configured, which triggers the Lambda function. This will execute the state machine, which will run all the steps defined in the definition file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmu24fcuff0bfne49yk1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmu24fcuff0bfne49yk1.png" alt="execution-input" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98bp1gukmwmom9sdb2gs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98bp1gukmwmom9sdb2gs.png" alt="sfn-diagram" width="732" height="814"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewymricvco2jhde9erm1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewymricvco2jhde9erm1.png" alt="step-output" width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Firstly, it will run the Glue job to transform the raw data into the schema and format required for &lt;a href="https://docs.aws.amazon.com/personalize/latest/dg/interactions-dataset-requirements.html" rel="noopener noreferrer"&gt;importing the interactions dataset&lt;/a&gt; into Personalize. The outputs from the Glue job are stored in a different S3 folder to the raw data. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flas3e27n2j9dzt6chji7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flas3e27n2j9dzt6chji7.png" alt="s3_glue_output" width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It then imports the interactions dataset into Personalize. The custom dataset group and interactions dataset resources were already created and defined when the CloudFormation stack was deployed.&lt;/p&gt;
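&lt;p&gt;For reference, an import job of this kind can be started with a single boto3 call. The sketch below only assembles the request; the job name, ARNs, bucket path and role name are illustrative placeholders, not values from this stack.&lt;/p&gt;

```python
def build_import_job_request(job_name, dataset_arn, s3_path, role_arn):
    # Shape of the request expected by personalize.create_dataset_import_job
    return {
        "jobName": job_name,
        "datasetArn": dataset_arn,
        "dataSource": {"dataLocation": s3_path},
        "roleArn": role_arn,
    }

# All values below are placeholders for illustration
request = build_import_job_request(
    "interactions-import",
    "arn:aws:personalize:us-east-1:123456789012:dataset/example/INTERACTIONS",
    "s3://example-bucket/glue-output/interactions.csv",
    "arn:aws:iam::123456789012:role/PersonalizeS3AccessRole",
)
# boto3.client("personalize").create_dataset_import_job(**request)
```

The role must grant Personalize read access to the S3 bucket holding the Glue output.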

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhp9qbqz15ick20ae20k0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhp9qbqz15ick20ae20k0.png" alt="interactions-dataset-console" width="800" height="973"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Wait for the solution version to reach &lt;strong&gt;ACTIVE&lt;/strong&gt; status. Training can take a while, depending on the dataset size and the number of user-item interactions, and longer still if using AutoML. The training time (hrs) value is based on 1 hr of compute capacity (the default is 4 CPUs and 8 GiB of memory). However, as discussed in the &lt;strong&gt;Pricing section&lt;/strong&gt; of this blog, Amazon Personalize automatically chooses a more efficient instance type to complete the job more quickly. In that case, the computed training-hours metric is adjusted upwards, resulting in a larger bill.&lt;/p&gt;
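&lt;p&gt;Rather than refreshing the console, the status can be polled with boto3. This is a minimal sketch; the polling interval is arbitrary, and the terminal status strings (&lt;code&gt;ACTIVE&lt;/code&gt; and &lt;code&gt;CREATE FAILED&lt;/code&gt;) are taken from the Personalize solution version lifecycle.&lt;/p&gt;

```python
import time

# Statuses after which polling should stop
TERMINAL_STATES = {"ACTIVE", "CREATE FAILED"}


def is_terminal(status):
    # True once training has finished, successfully or not
    return status in TERMINAL_STATES


def wait_for_solution_version(solution_version_arn, poll_seconds=60):
    import boto3  # imported lazily so the helper above works without boto3 installed

    personalize = boto3.client("personalize")
    while True:
        response = personalize.describe_solution_version(
            solutionVersionArn=solution_version_arn
        )
        status = response["solutionVersion"]["status"]
        print(f"Solution version status: {status}")
        if is_terminal(status):
            return status
        time.sleep(poll_seconds)
```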

&lt;h3&gt;
  
  
   Analysing Traces and Debugging with AWS X-Ray
&lt;/h3&gt;

&lt;p&gt;To diagnose any faults in execution, we can look at the X-Ray traces and logs. You can now view the service map within the Amazon CloudWatch console: open the CloudWatch console and choose Service map under X-Ray traces in the left navigation pane. The service map indicates the health of each node by colouring it based on the ratio of successful calls to errors and faults. Each AWS resource that sends data to X-Ray appears as a service in the graph, and edges connect the services that work together to serve requests, as detailed &lt;a href="https://docs.aws.amazon.com/xray/latest/devguide/xray-concepts.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;. In the center of each node, the console shows the average response time and the number of traces sent per minute during the chosen time range. A trace collects all the segments generated by a single request.&lt;/p&gt;

&lt;p&gt;Choose a service node to view requests for that node, or an edge between two nodes to view requests that traversed that connection. The service map splits the workflow into two trace IDs per request, with the following groups of segments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lambda service and function segments&lt;/li&gt;
&lt;li&gt;Step Functions, Glue, and Personalize segments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsdh0ncve87fnd4in2dzg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsdh0ncve87fnd4in2dzg.png" alt="train_app_service_map_xray" width="800" height="746"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxea53a1xk0ungbp95vw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxea53a1xk0ungbp95vw.png" alt="latency-metric" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yly9pwy19ckemgl7crh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yly9pwy19ckemgl7crh.png" alt="request-metrics" width="800" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F534s1if0mqwi1qoecnh0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F534s1if0mqwi1qoecnh0.png" alt="Image description" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also choose a trace ID to view the trace map and timeline for a trace.  The Timeline view shows a hierarchy of segments and subsegments. The first entry in the list is the segment, which represents all data recorded by the service for a single request. Below the segment are subsegments. This example shows subsegments recorded by Lambda segments. Lambda records a segment for the Lambda service that handles the invocation request, and one for the work done by the function as described &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/services-xray.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;. The function segment comes with subsegments for the following phases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Initialization phase&lt;/strong&gt;: the Lambda execution environment is initialised.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invocation phase&lt;/strong&gt;: the function handler is invoked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overhead phase&lt;/strong&gt;: the dwell time between sending the response and the signal for the next invocation.&lt;/li&gt;
&lt;/ul&gt;
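&lt;p&gt;Each subsegment in a trace's segment document carries &lt;code&gt;start_time&lt;/code&gt; and &lt;code&gt;end_time&lt;/code&gt; epoch timestamps, so per-phase durations can be computed directly. The segment document below is a hypothetical, heavily simplified example for illustration only.&lt;/p&gt;

```python
def phase_durations(function_segment):
    # Duration in seconds of each subsegment (Initialization, Invocation, Overhead),
    # computed from the start_time/end_time epoch timestamps X-Ray records
    return {
        sub["name"]: round(sub["end_time"] - sub["start_time"], 3)
        for sub in function_segment.get("subsegments", [])
    }


# Hypothetical, simplified function segment document
segment = {
    "name": "my-function",
    "subsegments": [
        {"name": "Initialization", "start_time": 1657.0, "end_time": 1657.4},
        {"name": "Invocation", "start_time": 1657.4, "end_time": 1659.1},
        {"name": "Overhead", "start_time": 1659.1, "end_time": 1659.2},
    ],
}
print(phase_durations(segment))
```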

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxxdif7vvb9gbj643pkp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxxdif7vvb9gbj643pkp.png" alt="xray_traces_sfn_1" width="800" height="235"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3sd2vcab466o7bqsic60.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3sd2vcab466o7bqsic60.png" alt="xray-sfn-duration" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For Step Functions, we can see the various subsegments corresponding to the different states in the state machine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5dl7o7te8d74hm1uvx3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5dl7o7te8d74hm1uvx3.png" alt="xray_traces_sfn_2" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluating solution metrics
&lt;/h2&gt;

&lt;p&gt;You can use &lt;a href="https://docs.aws.amazon.com/personalize/latest/dg/working-with-training-metrics.html" rel="noopener noreferrer"&gt;offline metrics&lt;/a&gt; to evaluate the performance of the trained model before you create a campaign and provide recommendations. Offline metrics let you view the effects of modifying a solution's hyperparameters, or compare results from models trained on the same data. To compute performance metrics, Amazon Personalize splits the input interactions data into a training set and a testing set. The split depends on the type of recipe you choose. For &lt;strong&gt;USER_SEGMENTATION&lt;/strong&gt; recipes, the training set consists of 80% of each user's interactions data and the testing set of the remaining 20%. For all other recipe types, the training set consists of 90% of your users and their interactions data, and the testing set of the remaining 10% of users and their interactions data.&lt;/p&gt;

&lt;p&gt;Amazon Personalize then creates the solution version using the training set. After training completes, Amazon Personalize gives the new solution version the oldest 90% of each user’s data from the testing set as input. Amazon Personalize then calculates metrics by comparing the recommendations the solution version generates to the actual interactions in the newest 10% of each user’s data from the testing set, as described &lt;a href="https://docs.aws.amazon.com/personalize/latest/dg/working-with-training-metrics.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
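&lt;p&gt;The holdout scheme can be illustrated with a small sketch. This is not Personalize's internal implementation, just the oldest-90%/newest-10% split described above applied to one test-set user's time-ordered interactions:&lt;/p&gt;

```python
def split_user_interactions(interactions, holdout_fraction=0.1):
    # interactions must be sorted oldest to newest; the newest fraction is held out
    cut = int(len(interactions) * (1 - holdout_fraction))
    return interactions[:cut], interactions[cut:]


# Ten interactions for one test-set user: the oldest 90% become model input,
# and recommendations are scored against the newest 10% held out as ground truth
history, holdout = split_user_interactions(list(range(10)))
print(history, holdout)
```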

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcc34l4q09ppk4prrrxku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcc34l4q09ppk4prrrxku.png" alt="personalize_solution_user-personalization_recipe_with_HPO" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can retrieve the metrics for the trained solution version above by running the following script, which calls the GetSolutionMetrics operation with the &lt;code&gt;solutionVersionArn&lt;/code&gt; parameter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python projects/personalize/evaluate_solution.py &lt;span class="nt"&gt;--solution_version_arn&lt;/span&gt; &amp;lt;solution-version-arn&amp;gt;

2022-07-09 20:31:24,671 - evaluate - INFO - Solution version status: ACTIVE
2022-07-09 20:31:24,787 - evaluate - INFO - Metrics:

 &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'coverage'&lt;/span&gt;: 0.1233, &lt;span class="s1"&gt;'mean_reciprocal_rank_at_25'&lt;/span&gt;: 0.1208, &lt;span class="s1"&gt;'normalized_discounted_cumulative_gain_at_10'&lt;/span&gt;: 0.1396, &lt;span class="s1"&gt;'normalized_discounted_cumulative_gain_at_25'&lt;/span&gt;: 0.1996, &lt;span class="s1"&gt;'normalized_discounted_cumulative_gain_at_5'&lt;/span&gt;: 0.1063, &lt;span class="s1"&gt;'precision_at_10'&lt;/span&gt;: 0.0367, &lt;span class="s1"&gt;'precision_at_25'&lt;/span&gt;: 0.0296, &lt;span class="s1"&gt;'precision_at_5'&lt;/span&gt;: 0.0423&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The metrics above are summarised from the &lt;a href="https://docs.aws.amazon.com/personalize/latest/dg/working-with-training-metrics.html" rel="noopener noreferrer"&gt;AWS docs&lt;/a&gt; as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;coverage&lt;/strong&gt;: An evaluation metric that tells you the proportion of unique items that Amazon Personalize might recommend using your model, out of the total number of unique items in the Interactions and Items datasets. To make sure Amazon Personalize recommends more of your items, use a model with a higher coverage score. Recipes that feature item exploration, such as &lt;strong&gt;User-Personalization&lt;/strong&gt;, have higher coverage than those that do not, such as popularity-count.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mean reciprocal rank at 25&lt;/strong&gt;: An evaluation metric that assesses the relevance of a model’s highest ranked recommendation. Amazon Personalize calculates this metric using the average accuracy of the model when ranking the most relevant recommendation out of the top 25 recommendations over all requests for recommendations. This metric is useful if you're interested in the single highest ranked recommendation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;normalized discounted cumulative gain (NDCG) at K&lt;/strong&gt;: An evaluation metric that tells you about the relevance of your model’s highly ranked recommendations, where K is a sample size of 5, 10, or 25 recommendations. Amazon Personalize calculates this by assigning weight to recommendations based on their position in a ranked list, where each recommendation is discounted (given a lower weight) by a factor dependent on its position. The normalized discounted cumulative gain at K assumes that recommendations that are lower on a list are less relevant than recommendations higher on the list. Amazon Personalize uses a weighting factor of &lt;em&gt;1/log(1 + position)&lt;/em&gt;, where the top of the list is position 1. This metric rewards relevant items that appear near the top of the list, because the top of a list usually draws more attention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;precision at K&lt;/strong&gt;: An evaluation metric that tells you how relevant your model’s recommendations are based on a sample size of K (5, 10, or 25) recommendations. Amazon Personalize calculates this metric as the number of relevant recommendations out of the top K recommendations, divided by K. This metric rewards precise recommendation of the relevant items, as described &lt;a href="https://docs.aws.amazon.com/personalize/latest/dg/working-with-training-metrics.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;
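&lt;p&gt;For intuition, precision at K and NDCG at K can be computed by hand from a binary relevance list (1 where a recommendation matched the user's held-out interactions, 0 otherwise). This is an illustrative sketch, not Amazon Personalize's implementation; note that the log base cancels once DCG is normalised, so using log2 below gives the same NDCG as the &lt;em&gt;1/log(1 + position)&lt;/em&gt; weighting above.&lt;/p&gt;

```python
import math


def precision_at_k(relevance, k):
    # Share of the top-k recommendations that are relevant (1) rather than not (0)
    return sum(relevance[:k]) / k


def ndcg_at_k(relevance, k):
    # DCG of the recommended ordering divided by the DCG of the ideal ordering;
    # each position's weight is 1/log2(1 + position), with position 1 at the top
    dcg = sum(
        rel / math.log2(1 + pos) for pos, rel in enumerate(relevance[:k], start=1)
    )
    ideal = sorted(relevance, reverse=True)[:k]
    idcg = sum(rel / math.log2(1 + pos) for pos, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0


print(round(precision_at_k([1, 0, 1], 3), 4))  # 0.6667
print(round(ndcg_at_k([1, 0, 1], 3), 4))       # 0.9197
```

A relevant item slipping from position 2 to position 3 leaves precision unchanged but lowers NDCG, which is why the two metrics are reported together.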

&lt;p&gt;In the &lt;a href="https://dev.to/aws-builders/what-movie-to-watch-next-amazon-personalize-to-the-rescue-part-2-2pj9"&gt;second part&lt;/a&gt; of this blog, we will create a campaign with the deployed solution version and set up API Gateway with Lambda for generating real-time recommendations. &lt;/p&gt;

</description>
      <category>aws</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Setting up AWS Code Pipeline to automate deployment of tweets streaming application</title>
      <dc:creator>Ryan Nazareth</dc:creator>
      <pubDate>Sat, 17 Sep 2022 16:07:03 +0000</pubDate>
      <link>https://forem.com/aws-builders/setting-up-aws-code-pipeline-to-automate-deployment-of-tweets-streaming-application-m7o</link>
      <guid>https://forem.com/aws-builders/setting-up-aws-code-pipeline-to-automate-deployment-of-tweets-streaming-application-m7o</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;In this tutorial, we will configure &lt;a href="https://docs.aws.amazon.com/codepipeline/latest/userguide/concepts.html" rel="noopener noreferrer"&gt;AWS CodePipeline&lt;/a&gt; to build an ECR image and deploy the latest version to a Lambda container. The application code will stream tweets using &lt;a href="https://www.tweepy.org/" rel="noopener noreferrer"&gt;Tweepy&lt;/a&gt;, a Python library for accessing the Twitter API. First we need to set up CodePipeline and the various stages to deploy the application code to the Lambda image, which will stream tweets when invoked. The intended DevOps architecture is shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ca951z24elp5yihgkeu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ca951z24elp5yihgkeu.png" alt="architecture_tweets_deploy_lambda-container" width="800" height="623"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Typically, a CodePipeline job contains the following stages: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: In this step the latest version of our source code is fetched from our repository and uploaded to an S3 bucket. The application source code is maintained in a repository configured as a GitHub source action in the pipeline. When developers push commits to the repository, CodePipeline detects the change and a pipeline execution starts from the Source stage. When the GitHub source action completes successfully, the latest changes have been downloaded and stored in the artifact bucket unique to that execution. The output artifacts produced by the GitHub source action, which are the application files from the repository, are then used as the input artifacts for the actions in the next stage. This is described in more detail in the &lt;a href="https://docs.aws.amazon.com/codepipeline/latest/userguide/welcome-introducing-artifacts.html" rel="noopener noreferrer"&gt;AWS docs&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build&lt;/strong&gt;: During this step we will use this uploaded source code and automate our manual packaging step using a &lt;a href="https://docs.aws.amazon.com/codebuild/latest/userguide/concepts.html" rel="noopener noreferrer"&gt;CodeBuild&lt;/a&gt; project. The build task pulls a build environment image and builds the application in a virtual container.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unit Test&lt;/strong&gt;: The next action can be a unit test project created in CodeBuild and configured as a test action in the pipeline. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deploy to Dev/Test Environment&lt;/strong&gt;: This deploys the application to a dev/test env environment using &lt;a href="https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html" rel="noopener noreferrer"&gt;CodeDeploy&lt;/a&gt; or another action provider such as CloudFormation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Integration Test&lt;/strong&gt;: This runs an end-to-end integration testing project created in CodeBuild and configured as a test action in the pipeline.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deploy to Production Environment&lt;/strong&gt;: This deploys the application to a production environment. The pipeline could be configured so that this stage requires &lt;a href="https://docs.aws.amazon.com/codepipeline/latest/userguide/approvals-action-add.html" rel="noopener noreferrer"&gt;manual approval&lt;/a&gt; to execute.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
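&lt;p&gt;As an illustration of the &lt;strong&gt;Build&lt;/strong&gt; stage, a CodeBuild buildspec for building and pushing the container image might look like the sketch below. The repository name, account ID and region are placeholders, not values from this project.&lt;/p&gt;

```yaml
version: 0.2

phases:
  pre_build:
    commands:
      # Authenticate Docker to ECR (account ID and region are placeholders)
      - aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.eu-west-1.amazonaws.com
  build:
    commands:
      # Build the image from the Dockerfile in the source artifact
      - docker build -t tweets-stream:latest .
      - docker tag tweets-stream:latest 123456789012.dkr.ecr.eu-west-1.amazonaws.com/tweets-stream:latest
  post_build:
    commands:
      # Push the new image so the deploy stage can point Lambda at it
      - docker push 123456789012.dkr.ecr.eu-west-1.amazonaws.com/tweets-stream:latest
```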

&lt;p&gt;The stages described above are just an example and could be fewer or more depending on the application. For example, we could have more environments for testing before deploying to production. For the rest of the blog, we will describe how some of these stages can be configured for deploying the twitter streaming application. All source code referenced in the rest of the article can be accessed &lt;a href="https://github.com/ryankarlos/AWS-CICD" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Application code for Streaming Tweets
&lt;/h3&gt;

&lt;p&gt;First, you will need to sign up for a Twitter Developer account and create a project and associated developer application. Then generate the following credentials: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Key and Secret&lt;/strong&gt;:  Username and password for the App&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access Token and Secret&lt;/strong&gt;: Represent the user that owns the app and will be used for authenticating requests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These steps are explained in more detail &lt;a href="https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api" rel="noopener noreferrer"&gt;here&lt;/a&gt;. We can then define a Python function which requires these credentials as parameters to make requests to the API with OAuth 1.0a authentication. This requires the latest version of &lt;strong&gt;tweepy&lt;/strong&gt; to be installed (&lt;code&gt;pip install tweepy&lt;/code&gt;). The &lt;code&gt;event&lt;/code&gt; parameter will be the payload with keys &lt;code&gt;keyword&lt;/code&gt; and &lt;code&gt;duration&lt;/code&gt;, which determine the keyword to search for and how long the stream should last. For example, to stream tweets containing the keyword &lt;code&gt;machine learning&lt;/code&gt; for 30 seconds, the payload would be &lt;code&gt;{"keyword": "machine learning", "duration": 30}&lt;/code&gt;. The code also excludes any retweets to reduce noise.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tweepy&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tweepy_search_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;consumer_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;consumer_secret&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;access_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;access_secret&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="n"&gt;auth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tweepy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OAuth1UserHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;consumer_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;consumer_secret&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;access_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;access_secret&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;time_limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;api&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tweepy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;API&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_on_rate_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tweepy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Cursor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;search_tweets&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tweet_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extended&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;time_limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;time_limit&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; seconds time limit reached, so disconnecting stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tweets streamed ! &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;full_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
                &lt;span class="n"&gt;dt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;
                &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%H:%M:%S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;handle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;screen_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;full_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;favourite_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;favourites_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retweet_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retweet_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retweeted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retweeted&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;followers_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;followers_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;friends_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;friends_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lang&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we will be using AWS for this, we can store the credentials in &lt;strong&gt;AWS Secrets Manager&lt;/strong&gt; rather than defining them in the code or passing them as environment variables, which is more secure. We can use the boto3 SDK to create a client session for &lt;strong&gt;Secrets Manager&lt;/strong&gt; and access the secrets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_secrets&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
   &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
   &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secretsmanager&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,)&lt;/span&gt;
   &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_secrets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Filters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;twitterkeys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}])&lt;/span&gt;
   &lt;span class="n"&gt;arn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SecretList&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ARN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
   &lt;span class="n"&gt;get_secret_value_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
   &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_secret_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SecretId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;arn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="n"&gt;secret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_secret_value_response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SecretString&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
   &lt;span class="n"&gt;secret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Successfully retrieved secrets !&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;secret&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
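The handler shown further below unpacks the first four values of the returned secret positionally with `itertools.islice`. The sketch below illustrates that extraction with a hypothetical `SecretString` payload; the actual key names stored in the `twitterkeys` secret may differ.

```python
import itertools
import json

# Hypothetical example of the SecretString JSON stored in Secrets Manager;
# the real key names in the "twitterkeys" secret may be different.
secret_string = json.dumps({
    "consumer_key": "xxxx",
    "consumer_secret": "xxxx",
    "access_token": "xxxx",
    "access_token_secret": "xxxx",
})

# get_secrets() returns json.loads(SecretString), i.e. a dict of credentials
secret = json.loads(secret_string)

# The lambda handler later takes the first four values positionally:
api_keys = list(itertools.islice(secret.values(), 4))
print(api_keys)
```

Note that `itertools.islice(secret.values(), 4)` relies on dictionary insertion order (guaranteed since Python 3.7), so the keys must be stored in the secret in the order that `tweepy_search_api` expects its positional credential arguments.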



&lt;p&gt;We will also be invoking a Lambda function to call the Twitter API to stream tweets, so we will need to define another script that calls the &lt;code&gt;tweepy_search_api&lt;/code&gt; and &lt;code&gt;get_secrets&lt;/code&gt; functions defined above from a Lambda handler, as shown below. This assumes the &lt;code&gt;tweepy_search_api&lt;/code&gt; and &lt;code&gt;get_secrets&lt;/code&gt; functions are defined in the modules &lt;code&gt;tweets_api.py&lt;/code&gt; and &lt;code&gt;secrets.py&lt;/code&gt; respectively, in the same directory as the Lambda function module. The snippet below parses the user parameters from the payload depending on whether the Lambda function is invoked via a CodePipeline action or invoked directly from the Lambda console or a local machine (for testing purposes). If invoked via a CodePipeline action, the &lt;strong&gt;UserParameters&lt;/strong&gt; key contains the parameters stored as a string value, which needs to be converted to a dictionary programmatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tweets_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tweepy_search_api&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;secrets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_secrets&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;itertools&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Event payload type: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Event:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CodePipeline.job&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloud&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mode: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloud&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CodePipeline.job&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;actionConfiguration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;configuration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UserParameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;job_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CodePipeline.job&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Params:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, JobID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Params:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;code_pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;codepipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_secrets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;api_keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;itertools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;islice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Searching and delivering Tweets with Tweepy API: &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;tweepy_search_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;api_keys&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloud&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;code_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_job_success_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jobId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Exception:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloud&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;code_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_job_failure_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;jobId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;failureDetails&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JobFailed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
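The branching at the top of the handler can be seen more clearly by comparing the two event shapes side by side. The following sketch isolates just the parameter-extraction logic; the parameter names (`query`, `max_results`) are illustrative only and not taken from the repository.

```python
import json

# Minimal sketch (not the full handler) of how the two invocation payloads
# differ. The parameter names below are hypothetical.
def extract_params(event):
    if event.get("CodePipeline.job") is not None:
        # CodePipeline Invoke action: UserParameters is a JSON *string*
        data = event["CodePipeline.job"]["data"]["actionConfiguration"][
            "configuration"
        ]["UserParameters"]
        return json.loads(data)
    # Direct invocation (console/local test): event is already a dict
    return event.copy()

# Event shape when invoked by a CodePipeline Lambda Invoke action
pipeline_event = {
    "CodePipeline.job": {
        "id": "11111111-abcd-1111-abcd-111111abcdef",
        "data": {
            "actionConfiguration": {
                "configuration": {
                    "UserParameters": '{"query": "aws", "max_results": 10}'
                }
            }
        },
    }
}

# Event shape when invoked directly for testing
local_event = {"query": "aws", "max_results": 10}

# Both paths yield the same params dictionary
assert extract_params(pipeline_event) == extract_params(local_event)
```

Either way the handler ends up with a plain dictionary of parameters; only the CodePipeline path additionally needs the job id so it can report success or failure back to the pipeline.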



&lt;p&gt;The following &lt;a href="https://github.com/ryankarlos/AWS-CICD/tree/master/projects/deploy-lambda-image" rel="noopener noreferrer"&gt;folder&lt;/a&gt; contains all the code described above, in addition to a few more configuration files: a &lt;code&gt;buildspec.yml&lt;/code&gt; and a &lt;code&gt;Dockerfile&lt;/code&gt; for building the Docker image (containing the application code) and pushing it to ECR in the CodeBuild stage. These will be explained in more detail in the next sections.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating the resources with CloudFormation
&lt;/h3&gt;

&lt;p&gt;First, we will need to create the following resources with CloudFormation, which will be referenced when configuring the pipeline. &lt;strong&gt;Note:&lt;/strong&gt; This assumes you already have an ECR repository named &lt;code&gt;tweepy-stream-deploy&lt;/code&gt;, which is referenced as a parameter in the CloudFormation template, although it can be overridden.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/ryankarlos/AWS-CICD/blob/master/cf-templates/CodeDeployLambdaTweepy.yaml" rel="noopener noreferrer"&gt;Lambda Function&lt;/a&gt; with URI reference to ECR repository&lt;/li&gt;
&lt;li&gt;Roles for &lt;a href="https://github.com/ryankarlos/AWS-CICD/blob/master/cf-templates/roles/RoleLambdaImageStaging.yaml" rel="noopener noreferrer"&gt;Lambda&lt;/a&gt;, &lt;a href="https://github.com/ryankarlos/AWS-CICD/blob/master/cf-templates/roles/CodepipelineRole.yaml" rel="noopener noreferrer"&gt;CodePipeline&lt;/a&gt; and &lt;a href="https://github.com/ryankarlos/AWS-CICD/blob/master/cf-templates/roles/CloudFormationRole.yaml" rel="noopener noreferrer"&gt;CloudFormation&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ECRRepoName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tweepy-stream-deploy"&lt;/span&gt;
    &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ECR repository for tweets application&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;String&lt;/span&gt;
&lt;span class="na"&gt;Resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;LambdaImageStaging&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AWS::Lambda::Function'&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;PackageType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Image&lt;/span&gt;
      &lt;span class="na"&gt;FunctionName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;codedeploy-staging"&lt;/span&gt;
      &lt;span class="na"&gt;Code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;ImageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Sub&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/${ECRRepoName}:latest'&lt;/span&gt;
      &lt;span class="na"&gt;Role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Fn::ImportValue&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RoleLambdaImage-TwitterArn&lt;/span&gt;
      &lt;span class="na"&gt;Timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;302&lt;/span&gt;
      &lt;span class="na"&gt;MemorySize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1024&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;aws cloudformation create-stack &lt;span class="nt"&gt;--stack-name&lt;/span&gt; CodeDeployLambdaTweets &lt;span class="nt"&gt;--template-body&lt;/span&gt; file://cf-templates/CodeDeployLambdaTweepy.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create all the role resources using the templates &lt;a href="https://github.com/ryankarlos/AWS-CICD/tree/master/cf-templates/roles" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws cloudformation create-stack &lt;span class="nt"&gt;--stack-name&lt;/span&gt; RoleCloudFormationforCodeDeploy &lt;span class="nt"&gt;--template-body&lt;/span&gt; file://cf-templates/roles/CloudFormationRole.yaml

aws cloudformation create-stack &lt;span class="nt"&gt;--stack-name&lt;/span&gt; RoleCodePipeline &lt;span class="nt"&gt;--template-body&lt;/span&gt; file://cf-templates/roles/CodepipelineRole.yaml

aws cloudformation create-stack &lt;span class="nt"&gt;--stack-name&lt;/span&gt; RoleLambdaImage &lt;span class="nt"&gt;--template-body&lt;/span&gt; file://cf-templates/roles/RoleLambdaImageStaging.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can validate the CloudFormation templates before deploying by using &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-validate-template.html" rel="noopener noreferrer"&gt;validate-template&lt;/a&gt; to check the template file for syntax errors. During validation, AWS CloudFormation first checks whether the template is valid JSON. If it isn't, CloudFormation then checks whether it is valid YAML. If both checks fail, CloudFormation returns a template &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-validate-template.html" rel="noopener noreferrer"&gt;validation error&lt;/a&gt;. Note that the &lt;code&gt;aws cloudformation validate-template&lt;/code&gt; command checks only the syntax of your template. It does not ensure that the property values you have specified for a resource are valid for that resource, nor does it determine the number of resources that will exist when the stack is created.&lt;/p&gt;
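The JSON-first check order can be illustrated locally with a minimal sketch. This only mimics CloudFormation's first validation step and is in no way a substitute for running `validate-template` itself:

```python
import json

def is_valid_json(template_body: str) -> bool:
    """Mimic CloudFormation's first validation step: try parsing as JSON."""
    try:
        json.loads(template_body)
        return True
    except json.JSONDecodeError:
        return False

# A YAML-style template body fails the JSON check, so CloudFormation
# would fall through to its YAML parser next.
yaml_template = "Resources:\n  MyBucket:\n    Type: AWS::S3::Bucket\n"
json_template = '{"Resources": {"MyBucket": {"Type": "AWS::S3::Bucket"}}}'

print(is_valid_json(yaml_template))   # falls through to the YAML check
print(is_valid_json(json_template))
```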

&lt;h3&gt;
  
  
  Creating the CodePipeline Stages
&lt;/h3&gt;

&lt;p&gt;This will use the CodePipeline &lt;a href="https://github.com/ryankarlos/AWS-CICD/blob/master/cp-definitions/deploy-lambda-image.json" rel="noopener noreferrer"&gt;definition file&lt;/a&gt; to create the &lt;strong&gt;Source&lt;/strong&gt;, &lt;strong&gt;Build&lt;/strong&gt; and &lt;strong&gt;Deploy&lt;/strong&gt; stages via the AWS CLI. Alternatively, one could also do this with CloudFormation, but since I had initially created the pipeline via the AWS console, I found it easier to generate the structure of the pipeline using &lt;a href="https://docs.aws.amazon.com/cli/latest/reference/codepipeline/get-pipeline.html" rel="noopener noreferrer"&gt;get-pipeline&lt;/a&gt; from the CLI and reuse the definition file to create the pipeline again in the future, which will be described in the next steps.&lt;/p&gt;

&lt;p&gt;CodePipeline also supports a number of actions, as listed &lt;a href="https://docs.aws.amazon.com/codepipeline/latest/userguide/actions.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;. An action is part of a sequence within a stage and is a task performed on an artifact. CodePipeline can integrate with a number of action providers such as CodeCommit, S3, GitHub, CodeBuild, Jenkins, CodeDeploy, CloudFormation and ECS across the Source, Build, Test and Deploy stages. The full list of providers can be found in the &lt;a href="https://docs.aws.amazon.com/codepipeline/latest/userguide/integrations-action-type.html" rel="noopener noreferrer"&gt;docs&lt;/a&gt;. We will be using the CodeCommit &lt;strong&gt;Source&lt;/strong&gt; action, the CodeBuild &lt;strong&gt;Build&lt;/strong&gt; action, the CloudFormation &lt;strong&gt;Deploy&lt;/strong&gt; action and the Lambda &lt;strong&gt;Invoke&lt;/strong&gt; action.&lt;/p&gt;

&lt;p&gt;Next, we will zip the &lt;a href="https://github.com/ryankarlos/AWS-CICD/tree/master/cf-templates" rel="noopener noreferrer"&gt;cf templates folder&lt;/a&gt;. This is required for the Deploy stage in CodePipeline, which will use CloudFormation actions to update the roles for CloudFormation, CodePipeline and Lambda if the templates have changed, and to update the existing Lambda resource with the latest image tag to deploy the updated application source code. These templates will need to be committed to &lt;strong&gt;CodeCommit&lt;/strong&gt; in the &lt;strong&gt;Source&lt;/strong&gt; stage and output as artifacts. We will copy this zipped folder to S3 and configure CodePipeline in the &lt;a href="https://github.com/ryankarlos/AWS-CICD/blob/master/cp-definitions/deploy-lambda-image.json" rel="noopener noreferrer"&gt;definition file&lt;/a&gt; so that the action in the Source stage reads the template file from its S3 location.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;cf-templates 
&lt;span class="nv"&gt;$ &lt;/span&gt;zip template-source-artifacts.zip CodeDeployLambdaTweepy.yaml roles/&lt;span class="k"&gt;*&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;template-source-artifacts.zip s3://codepipeline-us-east-1-49345350114/lambda-image-deploy/template-source-artifacts.zip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
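The packaging step above can also be scripted with the standard-library `zipfile` module if you prefer Python over shell. The sketch below recreates the folder layout in a scratch directory for illustration; the actual S3 upload via boto3 is left as a comment since it requires AWS credentials, and the bucket/key names simply mirror the shell snippet above.

```python
import pathlib
import tempfile
import zipfile

# Recreate the cf-templates layout in a scratch directory for illustration
root = pathlib.Path(tempfile.mkdtemp())
(root / "roles").mkdir()
(root / "CodeDeployLambdaTweepy.yaml").write_text("Resources: {}\n")
(root / "roles" / "CodepipelineRole.yaml").write_text("Resources: {}\n")

# Zip the main template plus everything under roles/, as in `zip ... roles/*`
archive = root / "template-source-artifacts.zip"
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write(root / "CodeDeployLambdaTweepy.yaml", "CodeDeployLambdaTweepy.yaml")
    for path in (root / "roles").glob("*"):
        zf.write(path, f"roles/{path.name}")

# Inspect the archive contents
with zipfile.ZipFile(archive) as zf:
    names = sorted(zf.namelist())
print(names)

# To upload, one could then use boto3 (bucket name as in the shell snippet):
# import boto3
# boto3.client("s3").upload_file(
#     str(archive), "codepipeline-us-east-1-49345350114",
#     "lambda-image-deploy/template-source-artifacts.zip")
```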



&lt;p&gt;The &lt;a href="(https://github.com/ryankarlos/AWS-CICD/blob/master/cp-definitions/deploy-lambda-image.json)"&gt;definition json&lt;/a&gt; assumes CodePipeline role is created as described above. It is worth having a look at the file contents to understand the settings before we create the pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CodeCommitSource"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"actionTypeId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Source"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"owner"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AWS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CodeCommit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"runOrder"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"configuration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"BranchName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"master"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"OutputArtifactFormat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CODE_ZIP"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"PollForSourceChanges"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"false"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"RepositoryName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deploy-lambda-image"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"outputArtifacts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CodeCommitSourceArtifact"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"inputArtifacts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"region"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"namespace"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SourceVariables"&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This configures a CodeCommit source action for the repository named &lt;code&gt;deploy-lambda-image&lt;/code&gt; and defines the output artifact &lt;code&gt;CodeCommitSourceArtifact&lt;/code&gt;. The artifact is a ZIP file containing the contents of the configured repository and branch at the commit specified as the source revision for the pipeline execution; we will later pass it to the build stage. The next action in the source stage loads the CloudFormation templates ZIP file that we previously uploaded to the S3 bucket. This is an &lt;strong&gt;S3 Source Action&lt;/strong&gt;, which is triggered via a CloudWatch Events rule when a new object is uploaded to the source bucket. More details can be found in the AWS docs &lt;a href="https://docs.aws.amazon.com/codepipeline/latest/userguide/action-reference-S3.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CFTemplatesArtifact"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"actionTypeId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Source"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                      &lt;/span&gt;&lt;span class="nl"&gt;"owner"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AWS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                      &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"S3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                      &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"runOrder"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"configuration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                     &lt;/span&gt;&lt;span class="nl"&gt;"PollForSourceChanges"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"false"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                      &lt;/span&gt;&lt;span class="nl"&gt;"S3Bucket"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"codepipeline-us-east-1-49345350114"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                      &lt;/span&gt;&lt;span class="nl"&gt;"S3ObjectKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lambda-image-deploy/template-source-artifacts.zip"&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"outputArtifacts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                       &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CFTemplatesArtifact"&lt;/span&gt;&lt;span class="w"&gt;
                       &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
                       &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"inputArtifacts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"region"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
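&lt;p&gt;As a quick (hypothetical) sketch, the artifact wiring of the two source actions can be inspected programmatically; the dicts below mirror the trimmed action structure shown above, with a made-up name for the CodeCommit action.&lt;/p&gt;

```python
# Hypothetical helper: map each Source action to the artifacts it produces.
# The dicts mirror the (trimmed) action structure shown above.
def source_output_artifacts(actions):
    return {
        action["name"]: [a["name"] for a in action.get("outputArtifacts", [])]
        for action in actions
        if action["actionTypeId"]["category"] == "Source"
    }

actions = [
    {
        "name": "Source",  # hypothetical name for the CodeCommit action
        "actionTypeId": {"category": "Source", "owner": "AWS",
                         "provider": "CodeCommit", "version": "1"},
        "outputArtifacts": [{"name": "CodeCommitSourceArtifact"}],
    },
    {
        "name": "CFTemplatesArtifact",
        "actionTypeId": {"category": "Source", "owner": "AWS",
                         "provider": "S3", "version": "1"},
        "outputArtifacts": [{"name": "CFTemplatesArtifact"}],
    },
]

print(source_output_artifacts(actions))
```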



&lt;p&gt;This action also creates an output artifact, &lt;code&gt;CFTemplatesArtifact&lt;/code&gt;, which we can pass to the downstream deploy stage. The build stage defines how to run a build: where to get the source code, which build environment to use, which build commands to run, and where to store the build output. It uses the following &lt;a href="https://docs.aws.amazon.com/codebuild/latest/userguide/getting-started-cli-create-build-spec.html" rel="noopener noreferrer"&gt;buildspec.yml&lt;/a&gt; file, which will be included when we copy the source code in the next section. This action takes the &lt;code&gt;CodeCommitSourceArtifact&lt;/code&gt;, containing the application code to be built, as its input.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Build"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"actions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Build-Tweepy-Stream"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"actionTypeId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Build"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"owner"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AWS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CodeBuild"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"runOrder"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"configuration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"BatchEnabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"false"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"EnvironmentVariables"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"[{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;IMAGE_REPO_NAME&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;value&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;tweepy-stream-deploy&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;type&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;PLAINTEXT&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;},{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;IMAGE_TAG&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;value&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;latest&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;type&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span 
class="s2"&gt;PLAINTEXT&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;},{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;AWS_DEFAULT_REGION&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;value&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;us-east-1&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;type&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;PLAINTEXT&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;},{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;AWS_ACCOUNT_ID&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;value&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;[ACCT_ID]&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;type&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;PLAINTEXT&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span 
class="s2"&gt;}]"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"ProjectName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Build-Twitter-Stream"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"outputArtifacts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"inputArtifacts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CodeCommitSourceArtifact"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"region"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will build a &lt;a href="https://docs.aws.amazon.com/codebuild/latest/userguide/sample-docker.html" rel="noopener noreferrer"&gt;Docker image&lt;/a&gt; and push it to ECR. We set the following environment variables, which are referenced in the &lt;code&gt;buildspec.yml&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS_DEFAULT_REGION&lt;/strong&gt;: us-east-1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS_ACCOUNT_ID&lt;/strong&gt;: the AWS account ID&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IMAGE_TAG&lt;/strong&gt;: latest&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IMAGE_REPO_NAME&lt;/strong&gt;: tweepy-stream-deploy&lt;/li&gt;
&lt;/ul&gt;
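&lt;p&gt;The escaped &lt;code&gt;EnvironmentVariables&lt;/code&gt; value in the build action configuration is simply a JSON array serialized into a single string. A minimal sketch of building it programmatically, which avoids hand-escaping the quotes (the account ID placeholder is kept as-is):&lt;/p&gt;

```python
import json

# Build the EnvironmentVariables value for the CodeBuild action configuration.
# [ACCT_ID] is kept as a placeholder for the real AWS account ID.
env_vars = [
    {"name": "IMAGE_REPO_NAME", "value": "tweepy-stream-deploy", "type": "PLAINTEXT"},
    {"name": "IMAGE_TAG", "value": "latest", "type": "PLAINTEXT"},
    {"name": "AWS_DEFAULT_REGION", "value": "us-east-1", "type": "PLAINTEXT"},
    {"name": "AWS_ACCOUNT_ID", "value": "[ACCT_ID]", "type": "PLAINTEXT"},
]

configuration = {
    "BatchEnabled": "false",
    "EnvironmentVariables": json.dumps(env_vars, separators=(",", ":")),
    "ProjectName": "Build-Twitter-Stream",
}

# The serialized string round-trips back to the original list
assert json.loads(configuration["EnvironmentVariables"]) == env_vars
```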

&lt;p&gt;The &lt;code&gt;buildspec.yml&lt;/code&gt; file is similar to the example in the &lt;a href="https://docs.aws.amazon.com/codebuild/latest/userguide/sample-docker.html" rel="noopener noreferrer"&gt;AWS docs&lt;/a&gt; for pushing a Docker image to ECR. In the &lt;strong&gt;pre-build&lt;/strong&gt; phase, we use the &lt;a href="https://docs.aws.amazon.com/cli/latest/reference/ecr/get-login-password.html" rel="noopener noreferrer"&gt;get-login-password&lt;/a&gt; CLI command, which calls the &lt;code&gt;GetAuthorizationToken&lt;/code&gt; API to retrieve an authentication token. The token is piped to the Docker CLI's login command to authenticate to the ECR registry, allowing Docker to push and pull images from the registry until the token expires after 12 hours. The &lt;strong&gt;build&lt;/strong&gt; phase runs the steps in the &lt;code&gt;Dockerfile&lt;/code&gt; shown in the snippet below to build the image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM public.ecr.aws/lambda/python:3.9.2022.03.23.16

# Copy function code
COPY main_twitter.py ${LAMBDA_TASK_ROOT}
COPY secrets.py ${LAMBDA_TASK_ROOT}
COPY tweets_api.py ${LAMBDA_TASK_ROOT}

COPY requirements.txt  .
RUN  pip3 install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"

# Set the CMD to your handler
CMD [ "main_twitter.handler" ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The steps in the Dockerfile include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pulling the Python 3.9 Lambda base image from the public ECR gallery&lt;/li&gt;
&lt;li&gt;Copying the module &lt;code&gt;main_twitter.py&lt;/code&gt;, which contains the lambda handler, along with the other modules it imports&lt;/li&gt;
&lt;li&gt;Copying the requirements.txt file and installing the Python dependencies&lt;/li&gt;
&lt;li&gt;Finally, setting the container entrypoint to the lambda handler&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the image is successfully built in the &lt;strong&gt;build&lt;/strong&gt; phase, it is tagged with the &lt;code&gt;latest&lt;/code&gt; tag. Finally, the &lt;strong&gt;post-build&lt;/strong&gt; phase pushes the tagged image to the private ECR repository URI.&lt;/p&gt;
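&lt;p&gt;The tag and push commands in the &lt;strong&gt;post-build&lt;/strong&gt; phase assemble the repository URI from the environment variables set earlier. A minimal sketch of the private ECR URI format, using a made-up account ID:&lt;/p&gt;

```python
# Build the private ECR repository URI the image is tagged with and pushed to.
# The account ID below is a placeholder.
def ecr_image_uri(account_id, region, repo_name, tag):
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_name}:{tag}"

uri = ecr_image_uri("123456789012", "us-east-1", "tweepy-stream-deploy", "latest")
print(uri)  # 123456789012.dkr.ecr.us-east-1.amazonaws.com/tweepy-stream-deploy:latest
```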

&lt;p&gt;The next stage is the &lt;strong&gt;Deploy&lt;/strong&gt; stage, named &lt;code&gt;DeployLambda&lt;/code&gt;, which uses CloudFormation as the action provider to perform a number of actions: updating the resource roles, deleting the existing lambda resource, and deploying the latest image to lambda. All of these actions use the &lt;code&gt;CFTemplatesArtifact&lt;/code&gt; from the source stage to reference the path to the CloudFormation template (in the &lt;code&gt;TemplatePath&lt;/code&gt; property) relative to the root of the artifact. The input artifact must therefore be the CloudFormation templates artifact output from the &lt;strong&gt;Source&lt;/strong&gt; stage. We provide the stack name and CloudFormation role in the configuration. The &lt;strong&gt;ActionMode&lt;/strong&gt; depends on whether we need to create, update or delete the stack.&lt;/p&gt;

&lt;p&gt;The first three actions update the roles for CloudFormation, CodePipeline and Lambda if the CloudFormation templates have changed. The &lt;strong&gt;runOrder&lt;/strong&gt; property is set to 1 for these actions so that they run in parallel. The next action deletes any lambda image that may already exist. Its &lt;strong&gt;runOrder&lt;/strong&gt; value is incremented to 2 so that it runs after the roles are updated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DeleteExistingLambdaImage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"actionTypeId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Deploy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"owner"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AWS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CloudFormation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"runOrder"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"configuration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"ActionMode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DELETE_ONLY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"RoleArn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::376337229415:role/CloudFormationRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StackName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CodeDeployLambdaTweets"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"outputArtifacts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"inputArtifacts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CFTemplatesArtifact"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"region"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
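&lt;p&gt;The scheduling implied by &lt;strong&gt;runOrder&lt;/strong&gt; can be sketched as grouping a stage's actions into sequential batches, where actions sharing a &lt;strong&gt;runOrder&lt;/strong&gt; value run in parallel. The three role-update action names below are hypothetical:&lt;/p&gt;

```python
from itertools import groupby

# Group a stage's actions into batches: equal runOrder values run in
# parallel, and each batch waits for all lower-numbered batches to finish.
def execution_batches(actions):
    ordered = sorted(actions, key=lambda a: a["runOrder"])
    return [
        [a["name"] for a in batch]
        for _, batch in groupby(ordered, key=lambda a: a["runOrder"])
    ]

deploy_actions = [
    {"name": "UpdateCloudFormationRole", "runOrder": 1},  # hypothetical names
    {"name": "UpdateCodePipelineRole", "runOrder": 1},
    {"name": "UpdateLambdaRole", "runOrder": 1},
    {"name": "DeleteExistingLambdaImage", "runOrder": 2},
    {"name": "DeployLambdaImage", "runOrder": 3},
]

print(execution_batches(deploy_actions))
# [['UpdateCloudFormationRole', 'UpdateCodePipelineRole', 'UpdateLambdaRole'],
#  ['DeleteExistingLambdaImage'], ['DeployLambdaImage']]
```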



&lt;p&gt;We then deploy the lambda image using a &lt;a href="https://docs.aws.amazon.com/codepipeline/latest/userguide/action-reference-CloudFormation.html" rel="noopener noreferrer"&gt;CloudFormation action&lt;/a&gt; called &lt;code&gt;DeployLambdaImage&lt;/code&gt; in this stage. In the configuration, we specify the &lt;strong&gt;OutputFileName&lt;/strong&gt; and the &lt;strong&gt;outputArtifacts&lt;/strong&gt; name, which we will pass to the next stage. The &lt;strong&gt;TemplatePath&lt;/strong&gt; references the required template YAML within &lt;strong&gt;CFTemplatesArtifact&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DeployLambdaImage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"actionTypeId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Deploy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"owner"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AWS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CloudFormation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"runOrder"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"configuration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"ActionMode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CREATE_UPDATE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"OutputFileName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lambda-codedeploy-output"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"RoleArn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::376337229415:role/CloudFormationRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StackName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CodeDeployLambdaTweets"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"TemplatePath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CFTemplatesArtifact::CodeDeployLambdaTweepy.yaml"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"outputArtifacts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"LambdaDeployArtifact"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"inputArtifacts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CFTemplatesArtifact"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"region"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
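&lt;p&gt;The &lt;strong&gt;TemplatePath&lt;/strong&gt; value above follows the &lt;code&gt;ArtifactName::path&lt;/code&gt; convention: the artifact that holds the file, followed by the file path relative to the artifact root. A tiny sketch of splitting it:&lt;/p&gt;

```python
# Split a CodePipeline TemplatePath of the form "ArtifactName::path".
def split_template_path(template_path):
    artifact, _, path = template_path.partition("::")
    return artifact, path

artifact, path = split_template_path("CFTemplatesArtifact::CodeDeployLambdaTweepy.yaml")
print(artifact, path)  # CFTemplatesArtifact CodeDeployLambdaTweepy.yaml
```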



&lt;p&gt;In the final stage of the pipeline, we do a test invocation of the deployed lambda image. This uses a Lambda &lt;strong&gt;Invoke&lt;/strong&gt; action for &lt;a href="https://docs.aws.amazon.com/codepipeline/latest/userguide/actions-invoke-lambda-function.html" rel="noopener noreferrer"&gt;invoking lambda&lt;/a&gt;, with the artifact from the previous stage as input. The configuration sets the function name and the parameter values for invoking the lambda function, i.e. we will stream tweets containing the keyword &lt;code&gt;Machine Learning&lt;/code&gt; for a duration of 10 seconds.&lt;br&gt;
&lt;/p&gt;
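&lt;p&gt;Inside the invoked function, the &lt;code&gt;UserParameters&lt;/code&gt; string arrives as part of the CodePipeline job event. A minimal sketch of parsing it, assuming the documented event shape (a real handler would also report the outcome back via the PutJobSuccessResult or PutJobFailureResult API):&lt;/p&gt;

```python
import json

# Parse the UserParameters JSON string from a CodePipeline job event.
def parse_user_parameters(event):
    config = event["CodePipeline.job"]["data"]["actionConfiguration"]["configuration"]
    return json.loads(config["UserParameters"])

# Trimmed-down example of the event shape CodePipeline sends to the function
event = {
    "CodePipeline.job": {
        "id": "example-job-id",
        "data": {
            "actionConfiguration": {
                "configuration": {
                    "UserParameters": '{"keyword": "Machine Learning", "duration": 10}'
                }
            }
        },
    }
}

params = parse_user_parameters(event)
print(params["keyword"], params["duration"])
```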

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"LambdaInvocationTest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"actions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"LambdaStagingInvocation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"actionTypeId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Invoke"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"owner"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AWS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Lambda"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"runOrder"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"configuration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"FunctionName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"codedeploy-staging"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"UserParameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;keyword&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Machine Learning&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;duration&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:10}"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"outputArtifacts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"LambdaInvocationArtifact"&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"inputArtifacts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"LambdaDeployArtifact"&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"region"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
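The invoked Lambda function receives the `UserParameters` string above inside the CodePipeline job event, as a JSON-encoded string that must be parsed by the handler. As a minimal sketch (the job `id` below is an illustrative placeholder, not from the original post):

```python
import json

def get_user_params(event):
    """Extract and parse the UserParameters JSON string from a
    CodePipeline Lambda-invoke job event."""
    job = event["CodePipeline.job"]
    raw = job["data"]["actionConfiguration"]["configuration"]["UserParameters"]
    return json.loads(raw)

# Abbreviated event shape, limited to the fields used above
event = {
    "CodePipeline.job": {
        "id": "11111111-abcd-1111-abcd-111111abcdef",  # placeholder job id
        "data": {
            "actionConfiguration": {
                "configuration": {
                    "FunctionName": "codedeploy-staging",
                    "UserParameters": '{"keyword": "Machine Learning", "duration":10}',
                }
            }
        },
    }
}

params = get_user_params(event)
```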



&lt;p&gt;Now we can create the CodePipeline resource with the &lt;strong&gt;Source&lt;/strong&gt;, &lt;strong&gt;Build&lt;/strong&gt; and &lt;strong&gt;Deploy&lt;/strong&gt; stages using the following &lt;a href="https://docs.aws.amazon.com/codepipeline/latest/userguide/pipelines-create.html" rel="noopener noreferrer"&gt;command&lt;/a&gt; from the CLI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;aws codepipeline create-pipeline &lt;span class="nt"&gt;--cli-input-json&lt;/span&gt; file://cp-definitions/deploy-lambda-image.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should create the pipeline, which should now be visible in the console or via the &lt;code&gt;list-pipelines&lt;/code&gt; CLI command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws codepipeline list-pipelines
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
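The same check can be scripted. The helper below operates on the response shape `list-pipelines` returns, so it can be exercised offline; the pipeline name is an assumed example, and the commented boto3 call requires configured AWS credentials:

```python
def pipeline_exists(response, name):
    """Return True if `name` appears in a CodePipeline list_pipelines response."""
    return any(p["name"] == name for p in response.get("pipelines", []))

# Offline example using the documented response shape:
sample = {"pipelines": [{"name": "TweetsLambdaDeploy", "version": 1}]}
print(pipeline_exists(sample, "TweetsLambdaDeploy"))

# Against a live account (requires credentials):
# import boto3
# client = boto3.client("codepipeline", region_name="us-east-1")
# print(pipeline_exists(client.list_pipelines(), "TweetsLambdaDeploy"))
```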



&lt;p&gt;The next section configures our setup so that we can pull from and push to the CodeCommit repository from our local machine. The CodeCommit repository we have just created in CodePipeline is empty, so we also need to copy the &lt;a href="https://github.com/ryankarlos/AWS%20CICD/tree/master/projects/deploy-lambda-image" rel="noopener noreferrer"&gt;application code&lt;/a&gt; into the repository before running CodePipeline end to end. &lt;/p&gt;

&lt;h3&gt;
  
  
  Setting up a local repository
&lt;/h3&gt;

&lt;p&gt;In this step, you set up a local repository to connect to your remote CodeCommit repository. This assumes you have SSH keys installed on your machine; if not, generate them with &lt;code&gt;ssh-keygen&lt;/code&gt; as described in the &lt;a href="https://docs.aws.amazon.com/codecommit/latest/userguide/setting-up-ssh-unixes.html" rel="noopener noreferrer"&gt;AWS docs&lt;/a&gt;. Upload your SSH public key to your IAM user and copy the SSH Key ID. Then edit the SSH configuration file named "config" in your local ~/.ssh directory, adding the following lines, where the value for User is the SSH Key ID.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Host git-codecommit.*.amazonaws.com
User Your-IAM-SSH-Key-ID-Here
IdentityFile ~/.ssh/Your-Private-Key-File-Name-Here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
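The stanza above can also be generated programmatically, which is handy when provisioning several machines. A minimal sketch (the key ID below follows AWS's documented `APKA…EXAMPLE` placeholder format and is not a real key):

```python
def codecommit_ssh_config(key_id, key_file):
    """Render the ~/.ssh/config stanza CodeCommit expects, where `key_id`
    is the SSH Key ID shown against the uploaded key in the IAM console."""
    return (
        "Host git-codecommit.*.amazonaws.com\n"
        f"User {key_id}\n"
        f"IdentityFile ~/.ssh/{key_file}\n"
    )

stanza = codecommit_ssh_config("APKAEIBAERJR2EXAMPLE", "id_rsa")
print(stanza)
```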



&lt;p&gt;Once you have saved the file, make sure it has the right permissions by running the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/.ssh
&lt;span class="nb"&gt;chmod &lt;/span&gt;600 config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clone the CodeCommit repository to your local computer so you can start working on the code. You can get the SSH URI from the console under &lt;strong&gt;Clone URL&lt;/strong&gt; for the CodeCommit repository. Navigate to a local directory (e.g. &lt;code&gt;/tmp&lt;/code&gt;) where you'd like your local repository to be stored and run the following &lt;a href="https://docs.aws.amazon.com/codecommit/latest/userguide/how-to-connect.html" rel="noopener noreferrer"&gt;command&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;git clone ssh://git-codecommit.us-east-1.amazonaws.com/v1/repos/deploy-lambda-image
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we will copy all the files from the following &lt;a href="https://github.com/ryankarlos/AWS-CICD/tree/master/projects/deploy-lambda-image" rel="noopener noreferrer"&gt;folder&lt;/a&gt;&lt;br&gt;
into the local repository you cloned earlier (for example, /tmp/deploy-lambda-image). Be sure to place the files directly into your local repository. The directory and file hierarchy should look like this, assuming you have cloned a repository named &lt;code&gt;deploy-lambda-image&lt;/code&gt; into the &lt;code&gt;/tmp&lt;/code&gt; directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/tmp
   └-- deploy-lambda-image
        ├── README.md
        ├── __init__.py
        ├── appspec.yaml
        ├── buildspec.yml
        ├── dockerfile
        ├── local_run.py
        ├── main_twitter.py
        ├── requirements.txt
        ├── secrets.py
        └── tweets_api.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
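Before committing, it can be worth checking that the files the pipeline depends on actually landed in the repository root. A small sketch (the list of files to require is an assumption drawn from the tree above):

```python
from pathlib import Path
import tempfile

# Files the Build and Deploy stages rely on (assumed from the tree above)
REQUIRED = ["appspec.yaml", "buildspec.yml", "dockerfile", "requirements.txt"]

def missing_files(repo_dir):
    """Return the required pipeline files missing from the local repo root."""
    repo = Path(repo_dir)
    return [f for f in REQUIRED if not (repo / f).exists()]

# Quick sanity check against a throwaway directory with one file missing:
with tempfile.TemporaryDirectory() as tmp:
    for f in REQUIRED[:-1]:
        (Path(tmp) / f).touch()
    result = missing_files(tmp)

print(result)
```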



&lt;p&gt;Run the following commands to stage all of the files, commit them with a commit message, and push them to the CodeCommit repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git add &lt;span class="nb"&gt;.&lt;/span&gt;
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Add sample application files"&lt;/span&gt;
git push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The files you downloaded and added to your local repository have now been pushed to the main branch of the CodeCommit &lt;code&gt;deploy-lambda-image&lt;/code&gt; repository and are ready to be included in a pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code Pipeline Execution
&lt;/h3&gt;

&lt;p&gt;CodePipeline can be configured to trigger on every push to CodeCommit via EventBridge, by defining an event rule that starts the pipeline when there are changes to the CodeCommit repository associated with it. This can be done from the console as detailed &lt;a href="https://docs.aws.amazon.com/codepipeline/latest/userguide/pipelines-trigger-source-repo-changes-console.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;, substituting the Arns of your own CodeCommit and CodePipeline resources. &lt;/p&gt;
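The same rule can be created programmatically. The event pattern below follows the documented CodeCommit repository state-change shape; all account IDs, Arns, rule and role names are placeholders to substitute with your own:

```python
import json

# Placeholder Arn -- substitute your own repository
REPO_ARN = "arn:aws:codecommit:us-east-1:123456789012:deploy-lambda-image"

# Documented event pattern for pushes to a branch in a CodeCommit repo
event_pattern = {
    "source": ["aws.codecommit"],
    "detail-type": ["CodeCommit Repository State Change"],
    "resources": [REPO_ARN],
    "detail": {
        "event": ["referenceCreated", "referenceUpdated"],
        "referenceType": ["branch"],
        "referenceName": ["main"],
    },
}

rule_json = json.dumps(event_pattern)

# With credentials configured, the rule and its CodePipeline target could be
# created roughly like this (pipeline and role Arns are placeholders):
# import boto3
# events = boto3.client("events", region_name="us-east-1")
# events.put_rule(Name="codecommit-to-pipeline", EventPattern=rule_json)
# events.put_targets(Rule="codecommit-to-pipeline", Targets=[{
#     "Id": "codepipeline",
#     "Arn": "arn:aws:codepipeline:us-east-1:123456789012:TweetsLambdaDeploy",
#     "RoleArn": "arn:aws:iam::123456789012:role/start-pipeline-role",
# }])
```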

&lt;p&gt;We can now push the code in our local repository to CodeCommit. This creates a CodeCommit event, which EventBridge processes according to the configured rule, which in turn triggers CodePipeline to execute its stages as shown in the screenshots below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fog28nexxmvv5kr36rysg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fog28nexxmvv5kr36rysg.png" alt="TweetsLambdaDeploy-pipelineviz-1" width="800" height="641"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fng92016o4ojm9wuo9c7y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fng92016o4ojm9wuo9c7y.png" alt="TweetsLambdaDeploy-pipelineviz-2" width="800" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For manual triggering, choose &lt;strong&gt;Release change&lt;/strong&gt; on the pipeline details page on the console. This runs the most recent revision available in each source location specified in a CodeCommit Source action through the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6i4hcgde1pw8ghyiyalr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6i4hcgde1pw8ghyiyalr.png" alt="codepipeline_executionhistory" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the pipeline has finished, we can check &lt;strong&gt;CloudWatch&lt;/strong&gt; for the invocation logs in the corresponding log stream. The &lt;code&gt;main_twitter.handler&lt;/code&gt; calls the &lt;a href="https://docs.aws.amazon.com/codepipeline/latest/APIReference/API_PutJobSuccessResult.html" rel="noopener noreferrer"&gt;PutJobSuccessResult&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/codepipeline/latest/APIReference/API_PutJobFailureResult.html" rel="noopener noreferrer"&gt;PutJobFailureResult&lt;/a&gt; actions to report the success or failure of the Lambda execution back to the pipeline, which terminates the &lt;strong&gt;LambdaInvocationTest&lt;/strong&gt; stage accordingly.&lt;/p&gt;
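The success/failure reporting pattern can be sketched as below. This is not the actual `main_twitter.handler` code, just the general shape: the boto3 CodePipeline client's `put_job_success_result` and `put_job_failure_result` methods exist as shown, and the client is injected so the logic can be exercised with a stub:

```python
def report_job_result(client, job_id, run_test):
    """Run `run_test` and report the outcome back to CodePipeline.
    `client` is a boto3 codepipeline client (or any object exposing the
    same two methods); `job_id` comes from event["CodePipeline.job"]["id"]."""
    try:
        run_test()
        client.put_job_success_result(jobId=job_id)
        return "success"
    except Exception as exc:
        client.put_job_failure_result(
            jobId=job_id,
            failureDetails={"type": "JobFailed", "message": str(exc)},
        )
        return "failure"

# Exercise the logic with a stub client (no AWS calls made):
class StubClient:
    def __init__(self):
        self.calls = []
    def put_job_success_result(self, **kw):
        self.calls.append(("success", kw))
    def put_job_failure_result(self, **kw):
        self.calls.append(("failure", kw))

stub = StubClient()
outcome = report_job_result(stub, "job-1", lambda: None)
outcome2 = report_job_result(stub, "job-2", lambda: 1 / 0)
```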

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foullyomjtg3q5ph1clt4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foullyomjtg3q5ph1clt4.png" alt="lambda_invocation_logs" width="800" height="201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[1] &lt;a href="https://docs.aws.amazon.com/codepipeline/latest/userguide/concepts-devops-example.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/codepipeline/latest/userguide/concepts-devops-example.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[2] &lt;a href="https://docs.aws.amazon.com/codepipeline/latest/userguide/concepts.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/codepipeline/latest/userguide/concepts.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[3] &lt;a href="https://docs.aws.amazon.com/codepipeline/latest/userguide/approvals-action-add.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/codepipeline/latest/userguide/approvals-action-add.html&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;[4] &lt;a href="https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-pull-ecr-image.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-pull-ecr-image.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[5] &lt;a href="https://docs.aws.amazon.com/AmazonECR/latest/userguide/registry_auth.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AmazonECR/latest/userguide/registry_auth.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[6] &lt;a href="https://docs.aws.amazon.com/codebuild/latest/userguide/getting-started-create-build-project-console.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/codebuild/latest/userguide/getting-started-create-build-project-console.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[7] &lt;a href="https://docs.aws.amazon.com/codebuild/latest/userguide/getting-started-run-build-console.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/codebuild/latest/userguide/getting-started-run-build-console.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[8] &lt;a href="https://docs.aws.amazon.com/codebuild/latest/userguide/sample-docker.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/codebuild/latest/userguide/sample-docker.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[9] &lt;a href="https://docs.aws.amazon.com/codebuild/latest/userguide/getting-started-cli-create-build-spec.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/codebuild/latest/userguide/getting-started-cli-create-build-spec.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[10] &lt;a href="https://docs.aws.amazon.com/codecommit/latest/userguide/how-to-create-repository.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/codecommit/latest/userguide/how-to-create-repository.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[11] &lt;a href="https://docs.aws.amazon.com/codedeploy/latest/userguide/applications-create.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/codedeploy/latest/userguide/applications-create.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[12] &lt;a href="https://docs.aws.amazon.com/codedeploy/latest/userguide/deployment-groups-create.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/codedeploy/latest/userguide/deployment-groups-create.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[13] &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-validate-template.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-validate-template.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[14] &lt;a href="https://docs.aws.amazon.com/codepipeline/latest/userguide/pipelines-create.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/codepipeline/latest/userguide/pipelines-create.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[15] &lt;a href="https://docs.aws.amazon.com/cli/latest/reference/lambda/create-function.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/cli/latest/reference/lambda/create-function.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[16] &lt;a href="https://docs.aws.amazon.com/cli/latest/reference/lambda/invoke.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/cli/latest/reference/lambda/invoke.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[17] &lt;a href="https://bobbyhadz.com/blog/aws-cli-invalid-base64-lambda-error" rel="noopener noreferrer"&gt;https://bobbyhadz.com/blog/aws-cli-invalid-base64-lambda-error&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>git</category>
      <category>testing</category>
    </item>
    <item>
      <title>AWS Fraud Detector for classifying fraudulent online registered accounts - Part 2</title>
      <dc:creator>Ryan Nazareth</dc:creator>
      <pubDate>Wed, 14 Sep 2022 20:21:06 +0000</pubDate>
      <link>https://forem.com/aws-builders/aws-fraud-detector-for-classifying-fraudulent-online-registered-accounts-part-2-4ngo</link>
      <guid>https://forem.com/aws-builders/aws-fraud-detector-for-classifying-fraudulent-online-registered-accounts-part-2-4ngo</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/aws-builders/aws-fraud-detector-for-classifying-fraudulent-online-registered-accounts-part-1-1j4p"&gt;first part&lt;/a&gt; of this blog, we trained and evaluated the fraud detector model performance. Now we will need to make it active by deploying it and then make predictions. We will also setup a &lt;strong&gt;REST API&lt;/strong&gt; with &lt;strong&gt;API gateway (Lambda Integration)&lt;/strong&gt; for making realtime predictions. The architecture diagram below shows the workflow. All the references to code snippets are in &lt;a href="https://github.com/ryankarlos/AWS-ML-services/tree/master/projects/fraud" rel="noopener noreferrer"&gt;Github&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmrinc4jlz4ixzk4zibd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmrinc4jlz4ixzk4zibd.png" alt="Fraud Predict Architecture" width="800" height="638"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the AWS Fraud Detector console, choose the model we have trained from the &lt;strong&gt;Models&lt;/strong&gt; page. Scroll to the top of the &lt;strong&gt;Version details&lt;/strong&gt; page and choose &lt;strong&gt;Actions&lt;/strong&gt; and &lt;strong&gt;Deploy model version&lt;/strong&gt;. On the &lt;strong&gt;Deploy model version&lt;/strong&gt; prompt that appears, choose &lt;strong&gt;Deploy version&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwovbhpq6jp7545m5mt8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwovbhpq6jp7545m5mt8.png" alt="variables" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Version details&lt;/strong&gt; page shows a &lt;strong&gt;Status&lt;/strong&gt; of 'Deploying'. When the model is ready, the status changes to &lt;strong&gt;Active&lt;/strong&gt;. Once the model has finished deploying and its status is active, we need to associate it with the fraud detector for predictions. However, we also need to update the rule expressions, as the default detector version 1 created from CloudFormation uses the variable &lt;strong&gt;amt&lt;/strong&gt; in the rule expression, as seen in the screenshot below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feifk9pzhoqsagpylcx8t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feifk9pzhoqsagpylcx8t.png" alt="detector-version-1-default-rules" width="800" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We need to change this to the model insight score, a new variable created once model training has completed. This variable is not available when the CloudFormation stack is created, since the model has not been trained yet, so we needed an existing variable as a placeholder to keep the rule expression valid and avoid the stack throwing an error. We can run the following custom &lt;a href="https://github.com/ryankarlos/AWS-ML-services/blob/master/projects/fraud/deploy.py" rel="noopener noreferrer"&gt;script&lt;/a&gt; from the command line to update the detector rules and associate the new model with the detector.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python projects/fraud/deploy.py &lt;span class="nt"&gt;--update_rule&lt;/span&gt; 1 &lt;span class="nt"&gt;--model_version&lt;/span&gt; 1.0 &lt;span class="nt"&gt;--rules_version&lt;/span&gt; 2 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will carry out two steps: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Update the existing rule version with the correct expression, based on the number passed to the &lt;code&gt;--update_rule&lt;/code&gt; argument. This creates a new rule version, incremented from the original version number. &lt;/li&gt;
&lt;li&gt;Then it creates a new detector version, associating the model version (&lt;code&gt;--model_version&lt;/code&gt; argument) and the rule version (&lt;code&gt;--rules_version&lt;/code&gt; argument), which should be set to the increment of the existing rule version. This automatically increments the detector version to &lt;strong&gt;2.0&lt;/strong&gt;, as the existing version is &lt;strong&gt;1.0&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;
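The core of the rule update is swapping the placeholder variable for the `fraud_model_insightscore` variable in each rule expression. A sketch of how those expressions might be built; the threshold values here are illustrative assumptions, not taken from the deploy script:

```python
def insight_rule_expressions(model_name="fraud_model",
                             threshold_high=900, threshold_low=700):
    """Build the three rule expressions (investigate/review/approve) around
    the <model>_insightscore variable created after training.
    Thresholds are illustrative placeholders."""
    score = f"${model_name}_insightscore"
    return {
        "investigate": f"{score} > {threshold_high}",
        "review": f"{score} <= {threshold_high} and {score} > {threshold_low}",
        "approve": f"{score} <= {threshold_low}",
    }

rules = insight_rule_expressions()
print(rules["investigate"])
```

These strings could then be passed as the `expression` argument to the boto3 `frauddetector` client's `update_rule_version` call, as the deploy script's log output above suggests it does for each rule in turn.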

&lt;p&gt;If the script runs successfully, we should see the following output streamed to stdout.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;26-06-2022 04:50:34 : INFO : deploy : main : 121 : Updating rule version 1
26-06-2022 04:50:34 : INFO : deploy : update_detector_rules : 71 : Updating Investigate rule ....
{'detectorId': 'fraud_detector_demo', 'ruleId': 'investigate', 'ruleVersion': '2'}

26-06-2022 04:50:35 : INFO : deploy : update_detector_rules : 80 : Updating review rule ....
{'detectorId': 'fraud_detector_demo', 'ruleId': 'review', 'ruleVersion': '2'}

26-06-2022 04:50:35 : INFO : deploy : update_detector_rules : 89 : Updating approve rule ....
{'detectorId': 'fraud_detector_demo', 'ruleId': 'approve', 'ruleVersion': '2'}

26-06-2022 04:50:35 : INFO : deploy : main : 123 : Deploying trained model version 1.0 to new detector version 
{'detectorId': 'fraud_detector_demo', 'detectorVersionId': '2', 'status': 'DRAFT', 'ResponseMetadata': {'RequestId': 'da37d973-2c43-4c56-93e5-f9b9bd132bb3', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Sun, 26 Jun 2022 03:50:36 GMT', 'content-type': 'application/x-amz-json-1.1', 'content-length': '77', 'connection': 'keep-alive', 'x-amzn-requestid': 'da37d973-2c43-4c56-93e5-f9b9bd132bb3'}, 'RetryAttempts': 0}}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The images below show the model associated with the new detector version and the corrected rule expressions, which use the &lt;strong&gt;fraud_model_insightscore&lt;/strong&gt; variable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3x4yu1mvf73kwre9x7ox.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3x4yu1mvf73kwre9x7ox.png" alt="fraud-detector-update-rules-version2" width="800" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8toxp0ntn9qmj6qydhcn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8toxp0ntn9qmj6qydhcn.png" alt="fraud-detector-associate-model" width="800" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the next section, we will set up API Gateway to create a REST API endpoint that serves HTTP requests, with Lambda in the backend.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting up API gateway
&lt;/h3&gt;

&lt;p&gt;In this section, we will walk through the steps to create the REST API: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open the &lt;strong&gt;API Gateway console&lt;/strong&gt;, select &lt;strong&gt;Create API&lt;/strong&gt;, and choose &lt;strong&gt;REST API&lt;/strong&gt; as the type. &lt;/li&gt;
&lt;li&gt;To create an empty API, select &lt;strong&gt;Create New API&lt;/strong&gt; and then &lt;strong&gt;New API&lt;/strong&gt;. In &lt;strong&gt;Settings&lt;/strong&gt;, choose an &lt;strong&gt;API name&lt;/strong&gt; such as &lt;code&gt;FraudLambdaProxy&lt;/code&gt; and an optional &lt;strong&gt;Description&lt;/strong&gt;. Then choose &lt;code&gt;Create API&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Choose the root resource (&lt;strong&gt;/&lt;/strong&gt;) just created and select &lt;strong&gt;Create Method&lt;/strong&gt; from the &lt;strong&gt;Actions&lt;/strong&gt; menu. Then select &lt;code&gt;GET&lt;/code&gt; from the dropdown menu.&lt;/li&gt;
&lt;li&gt;For &lt;strong&gt;Integration Type&lt;/strong&gt;, select &lt;code&gt;Lambda Function&lt;/code&gt; and choose &lt;code&gt;Use Lambda Proxy integration&lt;/code&gt;. The Lambda function should already have been created via CloudFormation in &lt;a href="https://dev.to/aws-builders/aws-fraud-detector-for-classifying-fraudulent-online-registered-accounts-part-1-1j4p"&gt;part 1&lt;/a&gt;. Select the &lt;strong&gt;Lambda Region&lt;/strong&gt; where the function was created (&lt;code&gt;us-east-1&lt;/code&gt;), select &lt;code&gt;PredictFraudModel&lt;/code&gt; from the &lt;strong&gt;Lambda Function&lt;/strong&gt; dropdown, and then click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;Method Execution&lt;/strong&gt; pane, choose &lt;strong&gt;Method Request&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;In settings, set &lt;strong&gt;Request Validator&lt;/strong&gt; to &lt;code&gt;Validate query string parameters and headers&lt;/code&gt;. Leave &lt;strong&gt;Authorization&lt;/strong&gt; as &lt;code&gt;None&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Expand the &lt;strong&gt;URL Query String Parameters&lt;/strong&gt; dropdown, then choose &lt;strong&gt;Add query string&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Enter the following variables, each as a separate name field. Mark all of them as required except for the &lt;code&gt;flow_definition&lt;/code&gt; variable:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; amt, category, cc_num, city, city_pop, event_timestamp, first, flow_definition, gender, job, last, merchant, state, street, trans_num, zip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
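Once deployed, the endpoint can be called with these variables as query string parameters. A minimal sketch of building such a request; the invoke URL is a placeholder (use the one API Gateway displays after deployment), the parameter values are made-up illustrative data, and the dict is abbreviated (a real request must include every required parameter listed above):

```python
from urllib.parse import urlencode

# Placeholder invoke URL -- substitute the one shown after deploying the API
BASE_URL = "https://abc123.execute-api.us-east-1.amazonaws.com/test"

# Made-up illustrative values, abbreviated to a subset of the parameters
params = {
    "trans_num": "2e9c3e3f99",
    "amt": "693.23",
    "zip": "28611",
    "city": "Collettsville",
    "gender": "M",
    "city_pop": "885",
}

url = f"{BASE_URL}?{urlencode(params)}"
print(url)

# A GET request could then be issued with, e.g.:
# import requests
# response = requests.get(url)
```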



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybztcjttoatrr681k3dd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybztcjttoatrr681k3dd.png" alt="API-gateway-method-request-console" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go back to the &lt;strong&gt;Method Execution&lt;/strong&gt; pane. It should look like the screenshot below.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl8h4xo4dnpxyvv4vv9ww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl8h4xo4dnpxyvv4vv9ww.png" alt="API-gateway-method-execution-console" width="800" height="173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose &lt;strong&gt;Integration Request&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Choose the &lt;strong&gt;Mapping Templates&lt;/strong&gt; dropdown and then choose &lt;strong&gt;Add mapping template&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;For the &lt;strong&gt;Content-Type&lt;/strong&gt; field, enter &lt;code&gt;application/json&lt;/code&gt; and then choose the check mark icon.&lt;/li&gt;
&lt;li&gt;In the pop-up that appears, choose &lt;strong&gt;Yes&lt;/strong&gt; to secure this integration.&lt;/li&gt;
&lt;li&gt;For &lt;strong&gt;Request body passthrough&lt;/strong&gt;, choose &lt;code&gt;When there are no templates defined (recommended)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;mapping template&lt;/strong&gt; editor, copy and replace the existing script with the following code:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;#if(&lt;/span&gt;&lt;span class="s2"&gt;"$input.params('flow_definition')"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;#set(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;$my_default_value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$input.params('flow_definition')"&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;#else&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;#set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;($my_default_value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ignore"&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;#end&lt;/span&gt;&lt;span class="w"&gt;


&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"variables"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"trans_num"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"$input.params('trans_num')"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"amt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"$input.params('amt')"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"zip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"$input.params('zip')"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"city"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"$input.params('city')"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"first"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"$input.params('first')"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"job"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"$input.params('job')"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"street"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"$input.params('street')"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"$input.params('category')"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"city_pop"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"$input.params('city_pop')"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"gender"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"$input.params('gender')"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"cc_num"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"$input.params('cc_num')"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"last"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"$input.params('last')"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"state"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"$input.params('state')"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"merchant"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"$input.params('merchant')"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"EVENT_TIMESTAMP"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"$input.params('event_timestamp')"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"flow_definition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"$my_default_value"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
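&lt;p&gt;The template above wraps the query-string parameters under a &lt;code&gt;variables&lt;/code&gt; key and defaults &lt;code&gt;flow_definition&lt;/code&gt; to &lt;code&gt;ignore&lt;/code&gt; when the parameter is absent. The equivalent logic, sketched in Python for clarity (the function itself is illustrative; the keys match the template):&lt;/p&gt;

```python
def build_payload(params):
    """Mimic the VTL mapping template: wrap the query-string parameters
    under 'variables' and default flow_definition to 'ignore' when it
    is not supplied in the request."""
    keys = ["trans_num", "amt", "zip", "city", "first", "job", "street",
            "category", "city_pop", "gender", "cc_num", "last", "state",
            "merchant"]
    return {
        "variables": {k: params.get(k, "") for k in keys},
        "EVENT_TIMESTAMP": params.get("event_timestamp", ""),
        # same fallback as the #if/#else block in the template
        "flow_definition": params.get("flow_definition") or "ignore",
    }
```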



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhbz6iwa4vd57i7qzyqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhbz6iwa4vd57i7qzyqw.png" alt="API-gateway-integration-request-console" width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose &lt;strong&gt;Save&lt;/strong&gt;, and go back to the &lt;strong&gt;Method Execution&lt;/strong&gt; pane. Click the &lt;strong&gt;Test&lt;/strong&gt; button on the left.&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;Query Strings&lt;/strong&gt; box, paste the following:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;trans_num=6cee353a9d618adfbb12ecad9d427244&amp;amp;amt=245.97&amp;amp;zip=97383&amp;amp;city=Stayton&amp;amp;first=Erica&amp;amp;job=Engineer, biomedical&amp;amp;street=213 Girll Expressway&amp;amp;category=shopping_pos&amp;amp;city_pop=116001&amp;amp;gender=F&amp;amp;cc_num=180046165512893&amp;amp;last=Walker&amp;amp;state=OR&amp;amp;merchant=fraud_Macejkovic-Lesch&amp;amp;event_timestamp=2020-10-13T09:21:53.000Z&amp;amp;flow_definition=arn:aws:sagemaker:us-east-1:376337229415:flow-definition/fraud-detector-a2i-1656277295743
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If successful, you should see the response and logs as in the screenshots below. You can also navigate to the CloudWatch log group for the Lambda invocation and confirm it ran successfully.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuli6lsn7v562o8rs5fxe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuli6lsn7v562o8rs5fxe.png" alt="API-gateway-get-method-test1-console" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpuyzmyh67uy2swxh2fr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpuyzmyh67uy2swxh2fr.png" alt="API-gateway-get-method-test2logs-console" width="800" height="609"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can then deploy the API. Go back to &lt;strong&gt;Resources&lt;/strong&gt;, choose &lt;strong&gt;Actions&lt;/strong&gt;, and then &lt;strong&gt;Deploy API&lt;/strong&gt;. Select &lt;code&gt;New Stage&lt;/code&gt; as the &lt;strong&gt;Deployment Stage&lt;/strong&gt; and set the &lt;strong&gt;name&lt;/strong&gt; to &lt;code&gt;dev&lt;/code&gt;. You should see the API endpoint to invoke in the console. Finally, make sure logging is set up to allow debugging errors in the REST API, by following the instructions &lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/api-gateway-cloudwatch-logs/" rel="noopener noreferrer"&gt;here&lt;/a&gt;. The setup should look like the screenshot below. Note that when you add the IAM role in the API Gateway console, it should automatically create the log group in the format &lt;code&gt;API-Gateway-Execution-Logs_apiId/stageName&lt;/code&gt;. The ARN for the log group ends with &lt;code&gt;dev:*&lt;/code&gt;, but you need to include the ARN only up to the stage name &lt;code&gt;dev&lt;/code&gt;, as shown in the image below; otherwise the validation checks will fail.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhea2ri13ze15gkmsx2ih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhea2ri13ze15gkmsx2ih.png" alt="api-rest-stage-editor-logs-config" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To test the API's new endpoint, we can use &lt;a href="https://learning.postman.com/docs/getting-started/sending-the-first-request/" rel="noopener noreferrer"&gt;Postman&lt;/a&gt; to send an API request. Create a Postman account and select &lt;strong&gt;GET&lt;/strong&gt; from the list of request types. Since the &lt;strong&gt;GET&lt;/strong&gt; method is configured on the &lt;code&gt;/&lt;/code&gt; root resource, we can invoke the API endpoint &lt;code&gt;https://d9d16i7hbc.execute-api.us-east-1.amazonaws.com/dev&lt;/code&gt; with the query string parameters appended at the end (&lt;code&gt;key=value&lt;/code&gt; pairs separated by &lt;code&gt;&amp;amp;&lt;/code&gt;). Paste the following URL into the box, as in the screenshot below; the parameters and values are automatically parsed and populated in the &lt;code&gt;KEY/VALUE&lt;/code&gt; rows underneath. Click &lt;strong&gt;Send&lt;/strong&gt; and you should see the response body at the bottom.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://d9d16i7hbc.execute-api.us-east-1.amazonaws.com/dev?trans_num=6cee353a9d618adfbb12ecad9d427244&amp;amp;amt=245.97&amp;amp;zip=97383&amp;amp;city=Stayton&amp;amp;first=Erica&amp;amp;job='Engineer, biomedical'&amp;amp;street='213 Girll Expressway'&amp;amp;category=shopping_pos&amp;amp;city_pop=116001&amp;amp;gender=F&amp;amp;cc_num=180046165512893&amp;amp;last=Walker&amp;amp;state=OR&amp;amp;merchant=fraud_Macejkovic-Lesch&amp;amp;event_timestamp=2020-10-13T09:21:53.000Z
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
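&lt;p&gt;Rather than hand-quoting values like &lt;code&gt;job&lt;/code&gt; and &lt;code&gt;street&lt;/code&gt;, the query string can be built programmatically. A minimal sketch using Python's standard library (values copied from the example request; only a subset of the event fields is shown):&lt;/p&gt;

```python
from urllib.parse import urlencode

# Sample values taken from the test request above; the remaining
# event fields can be supplied the same way.
params = {
    "trans_num": "6cee353a9d618adfbb12ecad9d427244",
    "amt": "245.97",
    "zip": "97383",
    "city": "Stayton",
    "job": "Engineer, biomedical",
    "event_timestamp": "2020-10-13T09:21:53.000Z",
}

# urlencode percent-encodes spaces and commas, so no manual quoting is needed
url = ("https://d9d16i7hbc.execute-api.us-east-1.amazonaws.com/dev?"
       + urlencode(params))
print(url)
```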



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujtijy4e82raznlp4e87.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujtijy4e82raznlp4e87.png" alt="postman-api-gateway" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can check the log streams associated with the latest invocation in the CloudWatch log group for API Gateway. These will show messages with the execution or access details of your request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3bbpmggbv06vsfn51q0l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3bbpmggbv06vsfn51q0l.png" alt="API-gateway-cloudwatch-log-group" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: If any changes are made to the API configuration or parameters, the API needs to be redeployed for the changes to take effect.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generate Predictions
&lt;/h3&gt;

&lt;p&gt;You can use a batch predictions job in Amazon Fraud Detector to get predictions for a set of events that do not require real-time scoring. For example, you may want to generate fraud predictions for a batch of events, such as payment fraud, account takeover or compromise, and free-tier misuse, while performing an offline proof of concept. &lt;/p&gt;

&lt;p&gt;You can also use batch predictions to evaluate the risk of events on an hourly, daily, or weekly basis, depending on your business needs. If you want to analyse fraudulent transactions after the fact, you can perform batch fraud predictions with Amazon Fraud Detector and store the prediction results in an Amazon S3 bucket. Although beyond the scope of this example, additional services such as Amazon Athena could be used to analyse the fraud prediction results (once delivered to S3) and Amazon QuickSight to visualise them on a dashboard. Copy the batch sample file delivered to the &lt;code&gt;glue_transformed&lt;/code&gt; folder (following a successful Glue job run) into the &lt;code&gt;batch_predict&lt;/code&gt; folder. This triggers a notification to the SQS queue, which has a Lambda function as its target; the Lambda then starts the batch prediction job in Fraud Detector.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;s3://fraud-sample-data/glue_transformed/test/fraudTest.csv s3://fraud-sample-data/batch_predict/fraudTest.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
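&lt;p&gt;Inside the Lambda, the uploaded object's location has to be recovered from the SQS message, whose body is the JSON-encoded S3 notification. A minimal sketch of that step (the event shape follows the standard S3-to-SQS notification format; the function name is illustrative):&lt;/p&gt;

```python
import json

def object_uri_from_sqs_event(event):
    """Extract the S3 URI of the uploaded file from an S3 notification
    delivered to Lambda via an SQS queue. Each SQS record's body is a
    JSON-encoded S3 event with its own Records list."""
    body = json.loads(event["Records"][0]["body"])
    s3 = body["Records"][0]["s3"]
    # note: S3 URL-encodes object keys in notifications; apply
    # urllib.parse.unquote_plus if keys may contain spaces
    return "s3://{}/{}".format(s3["bucket"]["name"], s3["object"]["key"])
```

The returned URI can then be passed as the input path when the Lambda starts the batch prediction job.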



&lt;p&gt;We can monitor the batch prediction jobs in Fraud Detector. Once complete, we should see the output in S3. An example of a batch output is available &lt;a href="https://github.com/ryankarlos/AWS-ML-services/blob/master/datasets/fraud-sample-data/dataset1/results/DetectorBatchResults.csv" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpe0pe8kyk1b73kb6bg8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpe0pe8kyk1b73kb6bg8.png" alt="batch_prediction_jobs" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In realtime mode, we make use of the API Gateway endpoint created and integrated with the Lambda function, which makes the &lt;code&gt;get_event_prediction&lt;/code&gt; API call to Fraud Detector. In this example we use the same Lambda for both batch and realtime predictions. The Lambda code inspects the event payload for keys that are expected in a request from API Gateway (i.e. after the request has been transformed by the mapping template). Since the mapping template creates a &lt;code&gt;variables&lt;/code&gt; key, the presence of &lt;code&gt;variables&lt;/code&gt; in the payload triggers a realtime prediction. If the payload has a &lt;code&gt;Records&lt;/code&gt; key instead, the event came from SQS and a batch prediction job is started. &lt;/p&gt;

&lt;p&gt;Ideally, separate Lambdas would be used for realtime and batch prediction, to make them easier to manage. To run a realtime prediction, the API Gateway REST API has been configured to accept query string parameters and forward the request to Lambda, as explained in the previous section.&lt;/p&gt;
&lt;h3&gt;
  
  
  Teardown resources
&lt;/h3&gt;

&lt;p&gt;The custom bash script below can be executed to tear down all the Fraud Detector resources, including the trained fraud model, the detector (including rules), and the event type (including outcomes, variables, and labels).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="nv"&gt;variables&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt; &lt;span class="s2"&gt;"trans_num"&lt;/span&gt; &lt;span class="s2"&gt;"amt"&lt;/span&gt; &lt;span class="s2"&gt;"city_pop"&lt;/span&gt; &lt;span class="s2"&gt;"street"&lt;/span&gt; &lt;span class="s2"&gt;"job"&lt;/span&gt; &lt;span class="s2"&gt;"cc_num"&lt;/span&gt; &lt;span class="s2"&gt;"gender"&lt;/span&gt; &lt;span class="s2"&gt;"merchant"&lt;/span&gt; &lt;span class="s2"&gt;"last"&lt;/span&gt; &lt;span class="s2"&gt;"category"&lt;/span&gt; &lt;span class="s2"&gt;"zip"&lt;/span&gt; &lt;span class="s2"&gt;"city"&lt;/span&gt; &lt;span class="s2"&gt;"state"&lt;/span&gt; &lt;span class="s2"&gt;"first"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="s2"&gt;"legit"&lt;/span&gt; &lt;span class="s2"&gt;"fraud"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;rules&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="s2"&gt;"investigate"&lt;/span&gt; &lt;span class="s2"&gt;"review"&lt;/span&gt; &lt;span class="s2"&gt;"approve"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;outcomes&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="s2"&gt;"high_risk"&lt;/span&gt; &lt;span class="s2"&gt;"low_risk"&lt;/span&gt; &lt;span class="s2"&gt;"medium_risk"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;event_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"credit_card_transaction"&lt;/span&gt;
&lt;span class="nv"&gt;entity_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"customer"&lt;/span&gt;
&lt;span class="nv"&gt;detector_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"fraud_detector_demo"&lt;/span&gt;
&lt;span class="nv"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;fraud_model

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Delete model versions"&lt;/span&gt;    
aws frauddetector  delete-model-version &lt;span class="nt"&gt;--model-id&lt;/span&gt; &lt;span class="nv"&gt;$model_name&lt;/span&gt; &lt;span class="nt"&gt;--model-type&lt;/span&gt; ONLINE_FRAUD_INSIGHTS &lt;span class="nt"&gt;--model-version-number&lt;/span&gt; 1.0
aws frauddetector  delete-model-version &lt;span class="nt"&gt;--model-id&lt;/span&gt; &lt;span class="nv"&gt;$model_name&lt;/span&gt; &lt;span class="nt"&gt;--model-type&lt;/span&gt; ONLINE_FRAUD_INSIGHTS &lt;span class="nt"&gt;--model-version-number&lt;/span&gt; 2.0

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Delete model"&lt;/span&gt;
aws frauddetector  delete-model &lt;span class="nt"&gt;--model-id&lt;/span&gt; &lt;span class="nv"&gt;$model_name&lt;/span&gt; &lt;span class="nt"&gt;--model-type&lt;/span&gt; ONLINE_FRAUD_INSIGHTS


&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Deleting detector version id 1"&lt;/span&gt;
aws frauddetector delete-detector-version &lt;span class="nt"&gt;--detector-id&lt;/span&gt; &lt;span class="nv"&gt;$detector_name&lt;/span&gt; &lt;span class="nt"&gt;--detector-version-id&lt;/span&gt; 1

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;var &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;rules&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;do
        &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Deleting rule &lt;/span&gt;&lt;span class="nv"&gt;$var&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
        aws frauddetector  delete-rule &lt;span class="nt"&gt;--rule&lt;/span&gt; &lt;span class="nv"&gt;detectorId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$detector_name&lt;/span&gt;,ruleId&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$var&lt;/span&gt;,ruleVersion&lt;span class="o"&gt;=&lt;/span&gt;1
    &lt;span class="k"&gt;done&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Deleting detector id &lt;/span&gt;&lt;span class="nv"&gt;$detector_name&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
aws frauddetector delete-detector &lt;span class="nt"&gt;--detector-id&lt;/span&gt; &lt;span class="nv"&gt;$detector_name&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Deleting event-type &lt;/span&gt;&lt;span class="nv"&gt;$event_type&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
aws frauddetector delete-event-type &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="nv"&gt;$event_type&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Deleting entity-type &lt;/span&gt;&lt;span class="nv"&gt;$entity_type&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
aws frauddetector delete-entity-type &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="nv"&gt;$entity_type&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;var &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;variables&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;do
        &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Deleting variable &lt;/span&gt;&lt;span class="nv"&gt;$var&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
        aws frauddetector  delete-variable &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="nv"&gt;$var&lt;/span&gt;
    &lt;span class="k"&gt;done&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;var &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;do
        &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Deleting label &lt;/span&gt;&lt;span class="nv"&gt;$var&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
        aws frauddetector  delete-label &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="nv"&gt;$var&lt;/span&gt;
    &lt;span class="k"&gt;done&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;var &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;outcomes&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;do
        &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Deleting outcome &lt;/span&gt;&lt;span class="nv"&gt;$var&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
        aws frauddetector  delete-outcome &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="nv"&gt;$var&lt;/span&gt;
    &lt;span class="k"&gt;done&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Deleting cloud formation stack"&lt;/span&gt;
aws cloudformation delete-stack &lt;span class="nt"&gt;--stack-name&lt;/span&gt; FraudDetectorGlue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, from the S3 console, empty the bucket contents and then delete the bucket.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdih03km0z0jual0nlte3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdih03km0z0jual0nlte3.png" alt="bucket-final-folder-structure" width="800" height="190"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>api</category>
      <category>aws</category>
      <category>machinelearning</category>
      <category>serverless</category>
    </item>
  </channel>
</rss>
