<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: omarkhater</title>
    <description>The latest articles on Forem by omarkhater (@omarkhater).</description>
    <link>https://forem.com/omarkhater</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F840982%2Ff97cecc1-9bb0-48c7-89d0-4caa463ff71c.jpeg</url>
      <title>Forem: omarkhater</title>
      <link>https://forem.com/omarkhater</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/omarkhater"/>
    <language>en</language>
    <item>
      <title>Service in review: Sagemaker Modeling Pipelines</title>
      <dc:creator>omarkhater</dc:creator>
      <pubDate>Sat, 18 Mar 2023 09:26:39 +0000</pubDate>
      <link>https://forem.com/aws-builders/service-in-review-sagemaker-modeling-pipelines-83a</link>
      <guid>https://forem.com/aws-builders/service-in-review-sagemaker-modeling-pipelines-83a</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;




&lt;p&gt;Welcome back to my blog, where I share insights and tips on machine learning workflows using Sagemaker Pipelines. If you're new here, I recommend checking out my &lt;a href="https://dev.to/omarkhater/revolutionize-your-machine-learning-workflow-with-sagemaker-pipelines-build-train-and-deploy-models-with-ease-2c8d"&gt;first post&lt;/a&gt; to learn more about this AWS fully managed machine learning service. In my &lt;a href="https://dev.to/aws-builders/unlocking-flexibility-and-efficiency-how-to-leverage-sagemaker-pipelines-parameterization-3jl0"&gt;second post&lt;/a&gt;, I discussed how parameterization can help you customize the workflow and make it more flexible and efficient.&lt;/p&gt;

&lt;p&gt;After using Sagemaker Pipelines extensively in real-life projects, I've gained a comprehensive understanding of the service. In this post, I'll summarize the key benefits of using Sagemaker Pipelines and the limitations you should consider before implementing it. Whether you're a newcomer to the service or a seasoned user, you'll gain valuable insights from this concise review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;




&lt;p&gt;1. &lt;strong&gt;Sagemaker Integration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This service is integrated directly with Sagemaker, so the user doesn't have to deal with other AWS services. Pipelines can also be created programmatically via the Sagemaker Python SDK. Further, the service can be used from the console thanks to the seamless integration with Sagemaker Studio. &lt;/p&gt;

&lt;p&gt;2. &lt;strong&gt;Data Lineage Tracking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data lineage is the process of tracking and documenting the origin, movement, transformation, and destination of data throughout its lifecycle. It refers to the ability to trace the path of data from its creation to its current state and provides a complete view of data movement, including data sources, transformations, and destinations.&lt;/p&gt;
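&lt;p&gt;To make the idea concrete, lineage can be modeled as a directed graph of artifacts. The toy sketch below is plain Python, not the Sagemaker lineage API; every artifact name is invented for illustration. It traces a model artifact back to its original data sources:&lt;/p&gt;

```python
# Toy illustration of data lineage as a directed graph (not the Sagemaker API).
# Each artifact maps to the list of artifacts it was derived from.
lineage = {
    "model.tar.gz": ["train.csv"],
    "train.csv": ["cleaned.csv"],
    "cleaned.csv": ["raw_s3_dump.csv"],
    "raw_s3_dump.csv": [],
}

def trace_origins(artifact, graph):
    """Walk upstream edges to find the original sources of an artifact."""
    parents = graph.get(artifact, [])
    if not parents:
        return [artifact]  # no parents: this artifact is an original source
    origins = []
    for parent in parents:
        origins.extend(trace_origins(parent, graph))
    return origins

print(trace_origins("model.tar.gz", lineage))  # -> ['raw_s3_dump.csv']
```

&lt;p&gt;Sagemaker maintains this kind of graph for you automatically across processing, training, and model artifacts.&lt;/p&gt;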

&lt;p&gt;Sagemaker Pipelines makes this process easier, as visualized below &lt;sup&gt;&lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/lineage-tracking.html" rel="noopener noreferrer"&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbb0yekynd8d3eoetr9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbb0yekynd8d3eoetr9b.png" alt="Lineage Metadata"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3. &lt;strong&gt;Curated list of steps for the whole ML life cycle&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sagemaker Pipelines provides a convenient way to manage the highly iterative process of ML development through constructs called steps. This enables easier development and maintenance, either individually or within a team. The service currently supports the following &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#build-and-manage-steps-types" rel="noopener noreferrer"&gt;step types&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A more comprehensive guide to these steps is articulated in &lt;a href="https://dev.to/omarkhater/revolutionize-your-machine-learning-workflow-with-sagemaker-pipelines-build-train-and-deploy-models-with-ease-2c8d"&gt;this post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3j8lsbpelq2aa0tdbryj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3j8lsbpelq2aa0tdbryj.png" alt="Steps By Functionalityn"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4. &lt;strong&gt;Parallelism&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are several ways to run ML workflows in parallel using Sagemaker Pipelines. For example, parallel executions can vary the data, the algorithm, or both. The ability to integrate smoothly with other Sagemaker capabilities greatly simplifies the process of creating repeatable and well-organized machine learning workflows.&lt;/p&gt;

&lt;p&gt;A simple Python program to launch many pipelines in parallel can look like the code snippet below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sagemaker.workflow.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;multiprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Process&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;concurrent.futures&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ThreadPoolExecutor&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Pipeline_Parameters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;execution_parameters&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ct_start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Executing pipeline: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;execution_parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pipeline_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; with the following parameters:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Pipeline_Parameters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Pipeline_Parameters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;execution_parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pipeline_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="n"&gt;execution&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;execution_display_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;execution_parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;disp_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                       &lt;span class="n"&gt;execution_description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;execution_parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;execution_description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                       &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Pipeline_Parameters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;execution_parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wait&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Waiting for the pipeline to finish...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;execution&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

            &lt;span class="c1"&gt;## Wait for maximum 8.3 (30 seconds * 1000 attempts)  hours before raising waiter error. 
&lt;/span&gt;            &lt;span class="n"&gt;execution&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# The polling interval
&lt;/span&gt;                           &lt;span class="n"&gt;max_attempts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="c1"&gt;# The maximum number of polling attempts. (Defaults to 60 polling attempts)
&lt;/span&gt;                          &lt;span class="p"&gt;)&lt;/span&gt; 
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;execution&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_steps&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Executing the pipeline without waiting to finish...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Executing pipeline: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;execution_parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pipeline_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; done&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;ct_end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;ET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ct_end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;ct_start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Time Elapsed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ET&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (hh:mm:ss.ms)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;execution&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Couldn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t run pipeline: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;execution_parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;disp_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; due to:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;worer_func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exitcode&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;proc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="c1"&gt;# List of all required executions such as display name. Each configuation should be a dictionary
&lt;/span&gt;    &lt;span class="n"&gt;Execution_args_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt; 
    &lt;span class="c1"&gt;# List of parameters per execution. Each configuation should be a dictionary
&lt;/span&gt;    &lt;span class="n"&gt;pipeline_parameters_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;Execution_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipeline_parameters&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Execution_args_list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipeline_parameters_list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;start_pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pipeline_parameters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Execution_args&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;ThreadPoolExecutor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;worer_func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;as_completed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;process&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                            &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;terminate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;PermissionError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                            &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
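&lt;p&gt;For illustration, the two configuration lists in the snippet above might be populated as follows. Every pipeline name, display name, and parameter here is a hypothetical example, not an API requirement:&lt;/p&gt;

```python
# Hypothetical configurations for the parallel launcher; one execution-config
# dictionary must line up with one parameter dictionary.
Execution_args_list = [
    {
        "pipeline_name": "churn-pipeline",
        "disp_name": "churn-xgboost-run",
        "execution_description": "XGBoost baseline",
        "wait": True,
    },
    {
        "pipeline_name": "churn-pipeline",
        "disp_name": "churn-linear-run",
        "execution_description": "Linear learner baseline",
        "wait": True,
    },
]
pipeline_parameters_list = [
    {"TrainingInstanceType": "ml.m5.xlarge", "MaxDepth": 6},
    {"TrainingInstanceType": "ml.m5.xlarge", "MaxDepth": 2},
]

# Sanity check before launching: the two lists are zipped together.
assert len(Execution_args_list) == len(pipeline_parameters_list)
```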



&lt;h2&gt;
  
  
  Limitations and areas of improvement
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-condition" rel="noopener noreferrer"&gt;Pipelines with conditions&lt;/a&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SageMaker Pipelines doesn't support the use of nested condition steps. You can't pass a condition step as the input for another condition step.&lt;/li&gt;
&lt;li&gt;A condition step can't use identical steps in both branches. If you need the same step functionality in both branches, duplicate the step and give it a different name.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Loops&lt;/strong&gt;: &lt;/p&gt;

&lt;p&gt;Sagemaker Pipelines doesn't provide a direct way to iterate over a subset of the steps in an ML flow. For example, if you need to repeat data processing and model training until a certain accuracy is met, you have to implement this logic yourself. &lt;/p&gt;
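&lt;p&gt;One common workaround is a small driver script that re-executes the pipeline until the target metric is reached. In the sketch below, the stubbed run_training function stands in for starting a pipeline execution and reading back its evaluation metric, and the parameter name is invented:&lt;/p&gt;

```python
# Sketch of an iterate-until-target driver around a pipeline execution.
# The canned metrics simulate a model that improves across rounds.
canned_metrics = iter([0.78, 0.85, 0.92])

def run_training(params):
    # Stub: a real implementation would start a pipeline execution with
    # `params`, wait for it, and read the evaluation metric from its output.
    return next(canned_metrics)

def train_until(target_accuracy, max_rounds=5):
    params = {"MaxDepth": 4}  # hypothetical pipeline parameter
    accuracy = 0.0
    for _ in range(max_rounds):
        accuracy = run_training(params)
        if accuracy >= target_accuracy:
            break
        params["MaxDepth"] += 2  # adjust a knob and re-run the "pipeline"
    return accuracy, params

final_accuracy, final_params = train_until(0.9)
print(final_accuracy, final_params)  # -> 0.92 {'MaxDepth': 8}
```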


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Passing data between steps&lt;/strong&gt;: &lt;/p&gt;

&lt;p&gt;It is typical in ML development to pass many data arrays between different steps. While pipelines can be customized so that the developer saves and loads data files from S3, this creates a development bottleneck during rapid prototyping. Reading and writing data files is error-prone by nature, and the developer needs to handle the errors of this process effectively to avoid failed pipeline executions. &lt;/p&gt;
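&lt;p&gt;A common mitigation is to centralize the save and load logic in one small, defensive helper instead of scattering reads and writes across steps. The sketch below uses local JSON files and simple retries as a stand-in for S3 I/O; in a real pipeline the same pattern would wrap the S3 calls:&lt;/p&gt;

```python
import json
import tempfile
import time
from pathlib import Path

def save_payload(path, payload, retries=3):
    """Persist a step's output, retrying transient I/O failures."""
    for attempt in range(retries):
        try:
            Path(path).write_text(json.dumps(payload))
            return
        except OSError as exc:
            if attempt == retries - 1:
                raise RuntimeError(f"could not persist {path}") from exc
            time.sleep(0.1 * (attempt + 1))  # back off, then retry

def load_payload(path, retries=3):
    """Read a step's input back, failing loudly with a clear message."""
    for attempt in range(retries):
        try:
            return json.loads(Path(path).read_text())
        except (OSError, json.JSONDecodeError) as exc:
            if attempt == retries - 1:
                raise RuntimeError(f"could not load {path}") from exc
            time.sleep(0.1 * (attempt + 1))

out = Path(tempfile.gettempdir()) / "step_output.json"
save_payload(out, {"rows": 1000, "split": "train"})
restored = load_payload(out)
print(restored)  # -> {'rows': 1000, 'split': 'train'}
```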


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Operations on pipeline parameters&lt;/strong&gt;: &lt;/p&gt;

&lt;p&gt;Sagemaker enables using variables within the pipeline that can be changed at run time via &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-parameters.html" rel="noopener noreferrer"&gt;Pipeline Parameters&lt;/a&gt;. However, parameters can't be manipulated with ordinary Python operations (for example, string concatenation) when defining the pipeline; the SDK provides dedicated constructs such as Join for this purpose. I dedicated &lt;a href="https://dev.to/aws-builders/unlocking-flexibility-and-efficiency-how-to-leverage-sagemaker-pipelines-parameterization-3jl0"&gt;this post&lt;/a&gt; to an in-depth summary of this key feature. &lt;/p&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;




&lt;p&gt;In conclusion, Sagemaker Model Building Pipelines is a valuable service that simplifies the creation, management, and monitoring of machine learning workflows. Its integration with Sagemaker makes it easy to use without the need to deal with other AWS services, and the availability of a Python SDK enables the creation of pipelines programmatically. The service provides a curated list of steps for all stages of the ML life cycle and enables data lineage tracking, making it easier to trace the path of data throughout its lifecycle. Additionally, the service supports the parallel execution of ML workflows, which is helpful when processing large amounts of data. However, there are still some limitations to address, such as the inability to loop over specific parts of a pipeline's steps. Overall, Sagemaker Model Building Pipelines is a powerful tool for data scientists and machine learning engineers, and its many features make it a valuable addition to the machine learning ecosystem.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>machinelearning</category>
      <category>automation</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Unlocking Flexibility and Efficiency: How to Leverage Sagemaker Pipelines Parameterization</title>
      <dc:creator>omarkhater</dc:creator>
      <pubDate>Sun, 05 Mar 2023 07:09:45 +0000</pubDate>
      <link>https://forem.com/aws-builders/unlocking-flexibility-and-efficiency-how-to-leverage-sagemaker-pipelines-parameterization-3jl0</link>
      <guid>https://forem.com/aws-builders/unlocking-flexibility-and-efficiency-how-to-leverage-sagemaker-pipelines-parameterization-3jl0</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;




&lt;p&gt;Are you using &lt;a href="https://dev.to/omarkhater/revolutionize-your-machine-learning-workflow-with-sagemaker-pipelines-build-train-and-deploy-models-with-ease-2c8d"&gt;Sagemaker Pipelines for your machine learning workflows&lt;/a&gt; yet? If not, you're missing out on a powerful tool that simplifies the entire process from building to deployment. But even if you're already familiar with this service, you may not know about one of its key features: parameterization. With parameterization, you can customize the whole workflow, making it more flexible and dynamic. In this article, we'll take a deep dive into this feature, share some real-world examples, and discuss the pros and cons. We'll even offer some workarounds for addressing the limitations of this service. So grab your coffee and let's get started!&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview of Sagemaker Pipelines Parameterization
&lt;/h2&gt;




&lt;p&gt;Before we dive into the pros and cons of Sagemaker Pipelines parameterization, let's take a quick look at how it works. Sagemaker Pipelines allows users to specify input parameters using the Parameter class. These parameters can be specified when defining the pipeline, and their values can be set at execution. Parameterization allows users to create flexible pipelines that can be customized for different use cases.&lt;/p&gt;

&lt;p&gt;In addition, Sagemaker Studio provides a stunning GUI to execute the pipeline with different values for the defined parameters. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3laheaw3eayh3uv93yfs.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3laheaw3eayh3uv93yfs.gif" alt="Parameterization overview" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Generally speaking&lt;/em&gt;, Sagemaker Modeling Pipelines &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-parameters.html" rel="noopener noreferrer"&gt;supports&lt;/a&gt; four different types of parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ParameterString&lt;/strong&gt; – Representing a string parameter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ParameterInteger&lt;/strong&gt; – Representing an integer parameter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ParameterFloat&lt;/strong&gt; – Representing a float parameter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ParameterBoolean&lt;/strong&gt; – Representing a Boolean parameter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and the syntax is as simple as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;parameter&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;parameter_type&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;parameter_name&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;default_value&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;default_value&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Real-world Examples
&lt;/h2&gt;




&lt;p&gt;The Amazon Sagemaker Example Notebooks include several complete walkthroughs of Sagemaker Pipelines parameterization. The posts below are really helpful for getting started: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-pipeline-parameterization/parameterized-pipeline.html" rel="noopener noreferrer"&gt;Parameterize Sagemaker Pipelines (Introductory example)&lt;/a&gt;: shows how to create a parameterized Sagemaker Pipeline using the Amazon Sagemaker SDK.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-pipeline-compare-model-versions/notebook.html" rel="noopener noreferrer"&gt;Comparing model metrics with Sagemaker Pipelines and Sagemaker Model Registry (Advanced)&lt;/a&gt;: provides an example of how to use Sagemaker Pipelines in deploying the model based on a parmeterized performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Benefits of Sagemaker Pipelines Parameterization
&lt;/h2&gt;




&lt;p&gt;Clearly, parameterization is a key advantage of using Sagemaker Pipelines in automating ML workflows. &lt;/p&gt;

&lt;p&gt;There are numerous advantages such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GUI-based executions&lt;/strong&gt;: Typically, one can define the pipeline once and then execute the whole workflow smoothly. This is a significant benefit if you work with a colleague data scientist who prefers low-code solutions. You can still execute it using the &lt;a href="https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-pipeline-parameterization/parameterized-pipeline.html#Starting-the-pipeline-with-the-SDK" rel="noopener noreferrer"&gt;Sagemaker SDK&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rapid prototyping&lt;/strong&gt;: Evidently, it enables more efficient experimentation and testing by allowing for easy modification of pipeline components without the need for extensive manual changes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Collaboration&lt;/strong&gt;: By dividing the ML workflow into modular, parameterized parts, teamwork becomes more practical and efficient. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automation&lt;/strong&gt;: It facilitates automation by enabling the use of scripts and code to modify pipeline parameters, allowing for fully automated end-to-end machine learning workflows.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The list goes on and on. &lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations of Sagemaker Pipelines Parameterization
&lt;/h2&gt;




&lt;p&gt;While parameterization is a useful feature of Sagemaker Pipelines, it also has some limitations that can make it difficult to use in certain situations. Here are some common limitations to be aware of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Limited support for dynamic or runtime parameters&lt;/strong&gt;: Sagemaker Pipelines only supports static parameters that are set during pipeline definition. There is no support for runtime or dynamic parameters that can be set during pipeline execution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Limited support for nested parameters&lt;/strong&gt;: Sagemaker Pipelines does not support nested parameters or hierarchical parameters, which can be limiting in more complex pipeline use cases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Limited parameter validation&lt;/strong&gt;: Sagemaker Pipelines does not provide extensive parameter validation capabilities, which can make it harder to catch errors or issues during pipeline execution. For example, it may not automatically validate the format or type of the input data, or ensure that the parameters are within acceptable ranges or limits. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quota limitations&lt;/strong&gt;: AWS sets a non-adjustable quota of 200 on the maximum number of parameters in a pipeline. This might be a problem in large-scale pipelines. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In addition, the &lt;a href="https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html#limitations" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt; lists some other limitations: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not 100% compatible with other Sagemaker Python SDK modules&lt;/strong&gt;: For example, pipeline parameters can't be used to pass &lt;code&gt;image_uri&lt;/code&gt; for &lt;a href="https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.FrameworkProcessor" rel="noopener noreferrer"&gt;Framework Processors&lt;/a&gt; but can be used with &lt;a href="https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.Processor" rel="noopener noreferrer"&gt;Processor&lt;/a&gt;. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not all arguments can be parameterized&lt;/strong&gt;: Remember to read the documentation carefully to see whether a certain parameter can be a Pipeline variable or not. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, the &lt;code&gt;role&lt;/code&gt; can be parameterized while the &lt;code&gt;base_job_name&lt;/code&gt; cannot be parameterized in the &lt;a href="https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.Processor" rel="noopener noreferrer"&gt;Processor&lt;/a&gt; API, as shown below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj107x7o1r34tijhbu0pp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj107x7o1r34tijhbu0pp.png" alt="readdocs" width="800" height="802"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not all built-in Python operations can be applied to parameters&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# An example of what not to do
&lt;/span&gt;&lt;span class="n"&gt;my_string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://{}/training&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ParameterString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MyBucket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Another example of what not to do
&lt;/span&gt;&lt;span class="n"&gt;int_param&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ParameterInteger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MyBucket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Instead, if you want to convert the parameter to string type, do
&lt;/span&gt;&lt;span class="n"&gt;int_param&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Useful Workarounds
&lt;/h2&gt;




&lt;p&gt;While these limitations can make parameterization in Sagemaker Pipelines challenging, there are solutions for overcoming some of them. Here are a few examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Using Lambda functions for dynamic parameters&lt;/strong&gt;: To work around the limitation of static parameters, you can use a Lambda function to determine a parameter value dynamically based on other pipeline inputs or external data. For example, a Lambda function could calculate the minimum star rating to include in your analysis based on the average star rating of all customer reviews. The Lambda step fits this purpose; all the step types are summarized in &lt;a href="https://dev.to/omarkhater/revolutionize-your-machine-learning-workflow-with-sagemaker-pipelines-build-train-and-deploy-models-with-ease-2c8d"&gt;this post&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Using &lt;a href="https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html#property-file" rel="noopener noreferrer"&gt;PropertyFiles&lt;/a&gt; for nested parameters&lt;/strong&gt;: If you need to specify nested or hierarchical parameters, you can write a JSON file and consume it within the Pipeline using both JsonGet and PropertyFile, as shown in the code snippet below:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sagemaker&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sagemaker.workflow.properties&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PropertyFile&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sagemaker.workflow.steps&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ProcessingStep&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sagemaker.processing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt;  &lt;span class="n"&gt;FrameworkProcessor&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sagemaker.workflow.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;JsonGet&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sagemaker.sklearn.estimator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SKLearn&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sagemaker.workflow.pipeline_context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PipelineSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sagemaker.workflow.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sagemaker.processing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt;  &lt;span class="n"&gt;ProcessingOutput&lt;/span&gt;

&lt;span class="n"&gt;pipeline_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PipelineSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;pp_outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;ProcessingOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;paths&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                               &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/opt/ml/processing/output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;Paths_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PropertyFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NestedParameter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;paths&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nested_parameter.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FrameworkProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sagemaker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_execution_role&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                        &lt;span class="n"&gt;instance_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.t3.medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;instance_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;estimator_cls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SKLearn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;framework_version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.23-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;sagemaker_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipeline_session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;step_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DoNothing.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pp_outputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;step_process&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProcessingStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dummystep&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;step_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;step_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;property_files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Paths_file&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;train_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;JsonGet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
 &lt;span class="n"&gt;step_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;step_process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;property_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Paths_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;json_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;paths.train.URI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;test_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;JsonGet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
 &lt;span class="n"&gt;step_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;step_process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;property_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Paths_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;json_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;paths.test.URI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;   
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;The JSON file could be something like this:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"paths"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"train"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"URI"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3://&amp;lt;path_to_train_data&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"test"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"URI"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3://&amp;lt;path_to_test_data&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Building custom validation scripts&lt;/strong&gt;: To catch errors or issues with your pipeline parameters, you can build custom validation scripts that check the parameter values before the pipeline runs. This can help catch errors early and prevent pipeline failures due to invalid parameters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Careful design&lt;/strong&gt;: All in all, design the exposed parameters so that you stay within the quota. If you need a further increase, you can contact AWS Support about your case. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
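&lt;p&gt;The custom-validation idea above can be sketched as a small pre-flight check that runs before the pipeline is created or executed. This is a hypothetical, self-contained example; the rule format and the parameter names are illustrative, and the quota figure mirrors the 200-parameter limit discussed earlier:&lt;/p&gt;

```python
import operator

MAX_PIPELINE_PARAMETERS = 200  # current non-adjustable quota

def validate_parameters(params, rules):
    """params: {name: value}; rules: {name: (expected_type, low, high)}."""
    errors = []
    # Catch quota overruns before any AWS call is made.
    if operator.gt(len(params), MAX_PIPELINE_PARAMETERS):
        errors.append("parameter count exceeds the quota of 200")
    for name, (expected_type, low, high) in rules.items():
        if name not in params:
            errors.append("missing parameter: " + name)
            continue
        value = params[name]
        # Type check, then range checks against the declared bounds.
        if not isinstance(value, expected_type):
            errors.append(name + ": wrong type " + type(value).__name__)
        elif low is not None and operator.lt(value, low):
            errors.append(name + ": value below minimum")
        elif high is not None and operator.gt(value, high):
            errors.append(name + ": value above maximum")
    return errors

# Example: an invalid instance count is caught locally, before execution.
issues = validate_parameters(
    {"ProcessingInstanceCount": 0},
    {"ProcessingInstanceCount": (int, 1, 10)},
)
```

Running such a check in CI or right before &lt;code&gt;pipeline.start()&lt;/code&gt; prevents failures that would otherwise only surface mid-execution.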

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;




&lt;p&gt;Parameterization is a useful feature in Sagemaker Pipelines, but it does have some limitations that can make it challenging to use in certain situations. By using Lambda functions, PropertyFiles, and custom validation scripts, you can work around some of these limitations and create more flexible pipelines. By following best practices for parameterization, you can also ensure that your pipelines are well-organized and easy to use. With these tips and tricks, you'll be able to make the most of Sagemaker Pipelines parameterization and create powerful machine learning workflows.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>aws</category>
    </item>
    <item>
      <title>Revolutionize Your Machine Learning Workflow with SageMaker Pipelines: Build, Train, and Deploy Models with Ease!</title>
      <dc:creator>omarkhater</dc:creator>
      <pubDate>Mon, 27 Feb 2023 03:39:02 +0000</pubDate>
      <link>https://forem.com/omarkhater/revolutionize-your-machine-learning-workflow-with-sagemaker-pipelines-build-train-and-deploy-models-with-ease-2c8d</link>
      <guid>https://forem.com/omarkhater/revolutionize-your-machine-learning-workflow-with-sagemaker-pipelines-build-train-and-deploy-models-with-ease-2c8d</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;




&lt;p&gt;Are you interested in machine learning and looking for ways to optimize your workflow? Look no further than Sagemaker, Amazon Web Services' (AWS) fully managed machine learning service. With Sagemaker, one can develop, train, and deploy machine learning models at scale. Impressively, &lt;strong&gt;Sagemaker Pipelines&lt;/strong&gt; helps automate the highly iterative process of training, tuning, and deploying models.&lt;/p&gt;

&lt;p&gt;In recent months, I had the opportunity to use &lt;strong&gt;Sagemaker Pipelines&lt;/strong&gt; extensively within my team. So, I decided to start a series of posts to review this service in depth. In this first post, we break down the basic elements of this service, called &lt;strong&gt;Steps&lt;/strong&gt;, and provide some real-world illustrations of combining these steps elegantly. &lt;/p&gt;

&lt;h2&gt;
  
  
  Steps Summary
&lt;/h2&gt;




&lt;p&gt;Are you ready to supercharge your machine-learning workflow with Sagemaker pipelines? With 15 different step types at your fingertips, organizing your pipeline has never been easier. Plus, these steps are grouped based on functionality for effortless recall. Keep reading for a summary of the steps below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybyrxmn7p6p29pxe9a4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybyrxmn7p6p29pxe9a4s.png" alt="Steps Summary" width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Data Processing
&lt;/h3&gt;

&lt;p&gt;There are two ways to process data in SageMaker Pipelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Processing&lt;/strong&gt;: This step offers the flexibility to preprocess data using a custom script or a pre-built container. The resulting output can be utilized as an input to the Training step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EMR&lt;/strong&gt;: This step enables data processing using Amazon Elastic MapReduce (EMR) clusters. EMR offers a managed Hadoop framework that allows processing of massive amounts of data quickly and efficiently.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Modeling
&lt;/h3&gt;

&lt;p&gt;Under the Training category, the following steps are included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Training&lt;/strong&gt;: This step trains a machine learning model using the data that was preprocessed in the previous Processing step. You can specify the algorithm to be used, along with input/output channels and hyperparameters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tuning&lt;/strong&gt;: This step tunes the hyperparameters of the model generated in the previous Training step. It tries a range of values for each hyperparameter and selects the combination of hyperparameters that yields the best performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AutoML&lt;/strong&gt;: This step automatically selects the optimal machine learning algorithm and hyperparameters for a specific problem. It uses various techniques, such as feature engineering and hyperparameter tuning, to generate the most effective model.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Monitoring
&lt;/h3&gt;

&lt;p&gt;This category includes the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ClarifyCheck&lt;/strong&gt;: This step examines the input data to detect any instances of bias and fairness problems. It generates a report that can assist in improving the data's quality and guaranteeing that machine learning models are impartial and equitable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;QualityCheck&lt;/strong&gt;: This step examines the results of the Training step to confirm that the model fulfills predefined quality standards. The standards can be based on metrics like accuracy, precision, and recall.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Deployment
&lt;/h3&gt;

&lt;p&gt;The Deployment category includes the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model&lt;/strong&gt;: This step creates a model artifact that can be used for inference. The artifact includes the trained model parameters, as well as any additional files or libraries required for inference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CreateModel&lt;/strong&gt;: This step creates an Amazon Sagemaker model using the model artifact generated by the Model step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RegisterModel&lt;/strong&gt;: This step registers the created model with Amazon Sagemaker, which allows you to easily deploy the model to different endpoints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform&lt;/strong&gt;: This step runs a batch transform job to generate predictions from the registered model on an entire dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Control Flow
&lt;/h3&gt;

&lt;p&gt;The Control Flow category includes the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Condition&lt;/strong&gt;: This step allows you to define conditional logic in the pipeline. It can be used to branch the pipeline based on a specified condition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fail&lt;/strong&gt;: This step allows you to intentionally fail the pipeline if certain conditions are met. It can be used to ensure that the pipeline stops if something unexpected happens.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. Custom Functionalities
&lt;/h3&gt;

&lt;p&gt;The Custom Functionalities category includes the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Callback&lt;/strong&gt;: This step enables designating a custom script to be executed during the pipeline operation. It can be utilized for executing custom actions, like dispatching notifications or running extra processing steps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda&lt;/strong&gt;: This step allows you to execute an AWS Lambda function during pipeline operation. It can be utilized to accomplish various tasks, such as extracting metadata or transforming data.&lt;/li&gt;
&lt;/ul&gt;
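&lt;p&gt;As a concrete illustration, the function invoked by a Lambda step is an ordinary Lambda handler. The sketch below derives a value at runtime that later steps can read from the step's output properties; the event fields and the derivation rule are hypothetical:&lt;/p&gt;

```python
# Hypothetical handler for a Lambda step. The step passes its inputs in
# `event`; keys of the returned dict become output properties that
# downstream steps can reference.
def lambda_handler(event, context):
    # Illustrative rule: derive a minimum star rating from the average
    # rating supplied by an earlier step (defaults to 3.0 if absent).
    avg_rating = float(event.get("average_star_rating", 3.0))
    min_star_rating = max(1, round(avg_rating))
    return {"statusCode": 200, "min_star_rating": min_star_rating}
```

The returned values would then be wired into subsequent steps through the Lambda step's declared outputs.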

&lt;h2&gt;
  
  
  Example Scenarios
&lt;/h2&gt;




&lt;p&gt;The &lt;a href="https://sagemaker-examples.readthedocs.io/en/latest/index.html" rel="noopener noreferrer"&gt;Sagemaker official documentation&lt;/a&gt; contains several great examples on using Sagemaker pipelines. We shed the light on few examples that demonstrate how one can use these steps together, effectively. &lt;/p&gt;

&lt;p&gt;These examples show just a few of the many ways Sagemaker Pipeline Steps can be combined to create end-to-end machine learning workflows. With the flexibility and scalability of Sagemaker, the possibilities are endless!&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 1: Create a Sagemaker Pipeline to Automate All the Steps from Data Preparation to Model Deployment
&lt;/h3&gt;

&lt;p&gt;In the &lt;a href="https://sagemaker-examples.readthedocs.io/en/latest/end_to_end/fraud_detection/pipeline-e2e.html" rel="noopener noreferrer"&gt;Fraud Detection for Automobile Claim&lt;/a&gt; example, Sagemaker Pipelines is used to facilitate collaboration within a team of a data scientist, a machine learning engineer, and an ML Ops engineer. &lt;/p&gt;

&lt;p&gt;The architecture below is used to automate all the steps from data preparation to model deployment. (&lt;a href="https://sagemaker-examples.readthedocs.io/en/latest/end_to_end/fraud_detection/pipeline-e2e.html" rel="noopener noreferrer"&gt;source&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fvheqalec42vroc0oz0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fvheqalec42vroc0oz0.png" alt="Scenario 1 architecture" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The actual pipeline used in this demo after completion would be as shown below. (&lt;a href="https://sagemaker-examples.readthedocs.io/en/latest/end_to_end/fraud_detection/pipeline-e2e.html" rel="noopener noreferrer"&gt;source&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs5qxeie4krbe3z3ysl55.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs5qxeie4krbe3z3ysl55.png" alt="Executed Pipeline for scenario 1" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The mapping between each block name and type is described as well. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step Number&lt;/th&gt;
&lt;th&gt;Step Name&lt;/th&gt;
&lt;th&gt;Step Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;ClaimsDataWranglerProcessingStep&lt;/td&gt;
&lt;td&gt;Processing Step&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;CustomerDataWranglerProcessingStep&lt;/td&gt;
&lt;td&gt;Processing Step&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;CreateDataset&lt;/td&gt;
&lt;td&gt;Processing Step&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;XGBoostTrain&lt;/td&gt;
&lt;td&gt;Training Step&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;ClarifyProcessor&lt;/td&gt;
&lt;td&gt;Processing Step&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;ModelPreDeployment&lt;/td&gt;
&lt;td&gt;CreateModel Step&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;XgboostRegisterModel&lt;/td&gt;
&lt;td&gt;Register Model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;DeployModel&lt;/td&gt;
&lt;td&gt;Processing Step&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note that this example creates several processing steps using different classes (e.g., sagemaker.processing.Processor, SKLearnProcessor). This is an extremely powerful feature of Sagemaker, as it enables processing data with either pre-built containers provided by Sagemaker or fully custom Docker images. The same applies to the training step: while this example uses the pre-built XGBoost container, it is also possible to use a built-in algorithm, extend an existing container, or build a custom Docker image from scratch. &lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 2: Orchestrate Jobs to Train and Evaluate Models with Amazon Sagemaker Pipelines
&lt;/h3&gt;

&lt;p&gt;In this &lt;a href="https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-pipelines/tabular/abalone_build_train_deploy/sagemaker-pipelines-preprocess-train-evaluate-batch-transform_outputs.html" rel="noopener noreferrer"&gt;abalone age regression problem&lt;/a&gt;, Sagemaker Pipelines are used to conditionally approve trained models for deployment or flag an error. &lt;/p&gt;

&lt;p&gt;The pipeline for this scenario is shown below (&lt;a href="https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-pipelines/tabular/abalone_build_train_deploy/sagemaker-pipelines-preprocess-train-evaluate-batch-transform_outputs.html" rel="noopener noreferrer"&gt;source&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcckozhthstvv6026nyyz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcckozhthstvv6026nyyz.png" alt="Architecture for problem: Age of an abalone snail from its physical measurements" width="800" height="172"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, Sagemaker pipelines offer a powerful and flexible solution for building, training, and deploying machine learning models at scale. By automating the end-to-end machine learning process, Sagemaker pipelines can save significant time and effort for data scientists, machine learning engineers, and ML Ops engineers. With a wide range of steps available, including data processing, training, monitoring, deployment, control flow, and custom functionalities, Sagemaker pipelines provide a comprehensive solution for building complex machine learning workflows. The example scenarios provided in this post demonstrate just a few of the many ways in which Sagemaker pipeline steps can be combined to create end-to-end machine learning workflows. Overall, Sagemaker pipelines are an excellent tool for teams looking to collaborate efficiently and deploy machine learning models with ease.&lt;/p&gt;

&lt;p&gt;Stay tuned for future updates in this series, where I'll cover practical, up-to-date tips and tricks for this service, cost optimization, and more. Would you like me to include or review a specific part of the service? Let's connect and discuss.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Amazon CodeWhisperer in review: The newest AI code companion</title>
      <dc:creator>omarkhater</dc:creator>
      <pubDate>Sat, 22 Oct 2022 23:50:30 +0000</pubDate>
      <link>https://forem.com/omarkhater/amazon-codewhisperer-the-newest-ai-code-companion-1ji4</link>
      <guid>https://forem.com/omarkhater/amazon-codewhisperer-the-newest-ai-code-companion-1ji4</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;




&lt;p&gt;Recently, AWS announced a &lt;a href="https://aws.amazon.com/about-aws/whats-new/2022/06/aws-announces-amazon-codewhisperer-preview/#:~:text=Amazon%20CodeWhisperer%20is%20a%20machine,integrated%20development%20environment%20(IDE)"&gt;machine-learning-powered service&lt;/a&gt; that helps improve developer productivity by generating code recommendations based on developers' natural-language comments and their existing code in the integrated development environment. The service, &lt;strong&gt;called Amazon CodeWhisperer&lt;/strong&gt;, is still in preview and can be used at no cost. It is similar to &lt;a href="https://github.com/features/copilot"&gt;GitHub Copilot&lt;/a&gt;, which Microsoft launched last year. &lt;/p&gt;

&lt;p&gt;In the past few months, I had a chance to experiment with this new service in a few use cases. As a machine learning (ML) developer, I had the advantage of utilizing ML to help develop ML solutions. Here, I share some observations from my early access to the service. In addition, I offer specific suggestions on how to make it &lt;em&gt;smarter&lt;/em&gt; and more accessible. &lt;/p&gt;

&lt;h2&gt;
  
  
  The service in action
&lt;/h2&gt;




&lt;p&gt;The service provides real-time code suggestions based on comments in the code editor and existing code in the same document. It may suggest single-line completions or complete code blocks (e.g., whole methods). &lt;/p&gt;

&lt;p&gt;In Visual Studio Code, some handy shortcuts make using the service more convenient. While the extension is enabled, the service provides online inference similar to the auto-complete feature supported by many IDEs. However, the user can press Alt+C to request recommendations on demand without waiting for an automatic response.  &lt;/p&gt;

&lt;p&gt;Below is an example of writing the well-known binary search method.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XHYv1Zgt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dfg4uvghd6etlegdofrr.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XHYv1Zgt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dfg4uvghd6etlegdofrr.gif" alt="Binary Search" width="880" height="414"&gt;&lt;/a&gt;&lt;/p&gt;
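&lt;p&gt;For readers who cannot view the animation, the kind of function the service suggests for this prompt typically looks like the classic iterative binary search below (a hand-written sketch, not the service's verbatim output):&lt;/p&gt;

```python
def binary_search(arr, target):
    """Return the index of target in the sorted list arr, or -1 if absent."""
    low, high = 0, len(arr) - 1
    while high >= low:
        mid = (low + high) // 2
        if arr[mid] == target:
            return mid
        if target > arr[mid]:
            low = mid + 1      # search the upper half
        else:
            high = mid - 1     # search the lower half
    return -1

print(binary_search([1, 3, 5, 7, 9], 7))  # 3
```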

&lt;p&gt;Interestingly, the service may suggest multiple code snippets that could be easily navigated (with left/right arrows) to choose the most suitable recommendation. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xuzVgr00--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i28l2a6go5smg96vrypw.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xuzVgr00--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i28l2a6go5smg96vrypw.gif" alt="Recursive BS" width="880" height="172"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amazon CodeWhisperer is like a companion that tries to whisper the right code in your ear. It is a fancy and highly descriptive name; good job on naming the service. &lt;/p&gt;

&lt;h2&gt;
  
  
  A deep dive: How to get the most out of the service?
&lt;/h2&gt;




&lt;p&gt;An AI code companion is a powerful tool that can boost developer productivity. Although some argue that such a tool could replace developers in the future, it is still too early to jump to that conclusion: like any other service, it is subject to &lt;strong&gt;&lt;em&gt;garbage in, garbage out&lt;/em&gt;&lt;/strong&gt;. That is, it depends heavily on the quality of its input to return good results. Below is an example of how input quality directly affects output quality. &lt;/p&gt;

&lt;p&gt;Here, the provided description was vague and lacked clear requirements, so after a relatively long wait the output was a chaotic set of imports. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QEXOF9R2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8jvm6ub03ge2h37qrw0n.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QEXOF9R2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8jvm6ub03ge2h37qrw0n.gif" alt="Vague Input" width="880" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the input description becomes clearer, the output improves considerably, as shown below for a similar but better-specified problem. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S9wPAGw0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pfyjulpxile0wpio60qv.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S9wPAGw0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pfyjulpxile0wpio60qv.gif" alt="Better Input" width="880" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In addition, the quality of the recommendations improves significantly as the user adds more context, i.e., as the developer writes more code. For example, you can expect faster and more personalized results deep into a single project than for isolated tasks in the same document or in the early stages of a project, when there is simply not enough context yet. &lt;/p&gt;

&lt;p&gt;Nevertheless, the service is not expected to return helpful answers for uncommon, custom tasks. Below is the same binary search problem, but with a slight modification to the input format. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vDAUyDAW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qq6h1jooktu30grt5yly.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vDAUyDAW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qq6h1jooktu30grt5yly.gif" alt="Binary Search Duplicates" width="880" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apparently, the engine could not understand the slight modification to the problem (i.e., allowing duplicated elements) and still produced the same code it suggested earlier.&lt;/p&gt;
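&lt;p&gt;For comparison, handling duplicates correctly (for instance, returning the leftmost occurrence of the target) requires a small but deliberate change that the engine missed. One common hand-written sketch of that variant:&lt;/p&gt;

```python
def binary_search_leftmost(arr, target):
    """Return the index of the first occurrence of target in sorted arr, or -1."""
    low, high = 0, len(arr)
    while high > low:
        mid = (low + high) // 2
        if target > arr[mid]:
            low = mid + 1   # target is strictly to the right
        else:
            high = mid      # keep narrowing toward the leftmost match
    if low != len(arr) and arr[low] == target:
        return low
    return -1

print(binary_search_leftmost([1, 3, 3, 3, 5], 3))  # 1
```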

&lt;p&gt;As the service is still in preview, it naturally has room for improvement. Below is a curated list of changes that could make it much better. &lt;/p&gt;

&lt;h3&gt;
  
  
  Inference speed:
&lt;/h3&gt;

&lt;p&gt;As the examples above show, the service takes non-trivial time to produce recommendations. I believe there is considerable room for improvement in this aspect. &lt;/p&gt;

&lt;h3&gt;
  
  
  Consistency and real-time capabilities:
&lt;/h3&gt;

&lt;p&gt;The service is expected to give real-time recommendations as the developer writes code. However, the real-time suggestions sometimes produce no output at a given moment, while, surprisingly, pressing Alt+C at that same moment returns workable solutions without anything having changed.&lt;/p&gt;

&lt;h3&gt;
  
  
  End-user Customization:
&lt;/h3&gt;

&lt;p&gt;Under the hood, the recommendation engine draws on a huge library of code from many sources, written for different purposes. It would be reasonable to let users restrict which sources are acceptable for a given project. &lt;/p&gt;

&lt;p&gt;It might also be beneficial to tailor predictions to the project's theme. For example, machine learning development is completely different from mobile application development. &lt;/p&gt;

&lt;p&gt;As another example, one project might require multiple blocks of code to be designed and aggregated, while another might benefit from prioritizing line completion over block suggestions. &lt;/p&gt;

&lt;p&gt;The list of possible customizations is long and needs careful design.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solutions Ranking:
&lt;/h3&gt;

&lt;p&gt;Suggesting multiple solutions is a great feature. In practice, however, the ranking of these solutions is not optimal, and the user needs to navigate through all of them to find the right suggestion. This can be tedious and reduce the overall productivity gain.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Problem Customization:
&lt;/h3&gt;

&lt;p&gt;The engine effectively understands common problems found in its training corpus. However, it struggles to adapt to subtle modifications of those same problems. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;




&lt;p&gt;In summary, Amazon CodeWhisperer (and AI code companions in general) is not magic that can solve every problem. However, it is a great tool for enhancing developer productivity, letting developers focus on the right problems instead of tedious, repetitive tasks.&lt;/p&gt;

&lt;p&gt;To get the most out of Amazon CodeWhisperer (and AI code companions in general), the following practices might help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Concise comments&lt;/strong&gt;: the clearer and better defined the input task, the higher the probability of getting quality results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified projects&lt;/strong&gt;: the AI engine collects context from the entire document and enriches it continuously, so it is most beneficial on tasks that are connected to one another.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid advanced custom problems&lt;/strong&gt;: the less common the problem, the less likely the service is to return a helpful answer.
&lt;/li&gt;
&lt;/ul&gt;
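&lt;p&gt;To make the first point concrete: a prompt comment with explicit requirements, such as the hypothetical one below, gives the engine enough information to generate a correct, focused function. The implementation shown is my own hand-written illustration of the kind of output a precise prompt tends to produce, not the service's verbatim suggestion.&lt;/p&gt;

```python
# Prompt comment with clear, well-defined requirements:
# Write a function that removes duplicate items from a list
# while preserving the original order of the remaining items.
def remove_duplicates(items):
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)      # remember items we have already kept
            result.append(item)
    return result

print(remove_duplicates([3, 1, 3, 2, 1]))  # [3, 1, 2]
```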

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LC_j0JZV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/04egizvhvjezsja4rp5g.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LC_j0JZV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/04egizvhvjezsja4rp5g.gif" alt="Binary Search" width="880" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
