<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: ahmeddaker</title>
    <description>The latest articles on Forem by ahmeddaker (@ahmeddaker).</description>
    <link>https://forem.com/ahmeddaker</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F748169%2F1014c1b7-293b-40cf-9411-dc4cef88ea87.png</url>
      <title>Forem: ahmeddaker</title>
      <link>https://forem.com/ahmeddaker</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ahmeddaker"/>
    <language>en</language>
    <item>
      <title>SageMaker Studio Administration Best Practices part-1</title>
      <dc:creator>ahmeddaker</dc:creator>
      <pubDate>Sat, 21 Jan 2023 16:41:24 +0000</pubDate>
      <link>https://forem.com/ahmeddaker/sagemaker-studio-administration-best-practices-part-1-2m46</link>
      <guid>https://forem.com/ahmeddaker/sagemaker-studio-administration-best-practices-part-1-2m46</guid>
      <description>&lt;p&gt;In this series of articles, we will discuss and explain the best practices for SageMaker Studio administration, based on the AWS whitepaper documentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
When using SageMaker Studio as your ML platform, it's important to follow best practices for scaling and organization. Consider the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Choose an operating model and organize ML environments to meet your business goals.&lt;/li&gt;
&lt;li&gt;Set up domain authentication for user identities and be aware of limitations.&lt;/li&gt;
&lt;li&gt;Federate user identity and authorization for fine-grained access control and auditing.&lt;/li&gt;
&lt;li&gt;Set up permissions and guardrails for various ML roles.&lt;/li&gt;
&lt;li&gt;Plan your VPC network topology based on workload sensitivity, user numbers, and launched instances and jobs.&lt;/li&gt;
&lt;li&gt;Monitor your platform's performance and resource usage to ensure that it meets your needs and to identify any potential bottlenecks.&lt;/li&gt;
&lt;li&gt;Regularly review and update your security and compliance controls to ensure that they are up-to-date and meet your organization's requirements.&lt;/li&gt;
&lt;li&gt;Automate your ML pipeline as much as possible to reduce manual errors and improve the efficiency of your ML development and deployment process.&lt;/li&gt;
&lt;li&gt;Continuously evaluate and update your ML models to ensure that they are performing well and that the data used for training and validation is up-to-date.&lt;/li&gt;
&lt;li&gt;Collaborate with your team to share best practices, knowledge and expertise to improve the performance of your ML platform.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Recommended account structure&lt;/strong&gt;&lt;br&gt;
When setting up an operating model for your SageMaker Studio platform, it's important to follow best practices for organization and management. Here are a few recommendations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use AWS Control Tower for account setup, management and governance&lt;/li&gt;
&lt;li&gt;Centralize identities with an Identity Provider and AWS IAM Identity Center and enable secure access to workloads&lt;/li&gt;
&lt;li&gt;Isolate ML workloads across development, test, and production accounts&lt;/li&gt;
&lt;li&gt;Stream logs to a log archive account for analysis and filtering&lt;/li&gt;
&lt;li&gt;Use a centralized governance account for data access provisioning and auditing&lt;/li&gt;
&lt;li&gt;Embed security and governance services in each account for security and compliance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Model account structures for data science teams&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Centralized:&lt;/strong&gt;&lt;br&gt;
All data science activities are managed by one team or organization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0p772xt1kkurcjez7e28.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0p772xt1kkurcjez7e28.png" alt="Image description" width="800" height="693"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this model, the ML platform team will be responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Providing shared services and tools for MLOps across data science teams&lt;/li&gt;
&lt;li&gt;Managing shared accounts for ML workload development, testing, and production&lt;/li&gt;
&lt;li&gt;Implementing governance policies for workload isolation&lt;/li&gt;
&lt;li&gt;Ensuring adherence to common best practices for the platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Decentralized:&lt;/strong&gt;&lt;br&gt;
Data science activities are spread across different business functions or divisions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3yq2mraizrs3iew96ih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3yq2mraizrs3iew96ih.png" alt="Image description" width="800" height="697"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this model, each ML team is responsible for their own ML accounts and resources. However, it is recommended to use a centralized approach for monitoring and managing data governance for ease of audit management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Federated:&lt;/strong&gt;&lt;br&gt;
Shared services are managed by a centralized team, while business units or product teams are managed by decentralized teams. This model is similar to a hub-and-spoke model: each business unit has its own team, but they coordinate with the central team.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhk9cp2vwf2lngib2mj9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhk9cp2vwf2lngib2mj9.png" alt="Image description" width="800" height="685"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This model is similar to the centralized model, but with the added benefit of each data science/ML team having their own set of accounts for development, testing, and production. This allows for better isolation of resources and independent scaling for each team without affecting others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ML platform multitenancy&lt;/strong&gt;&lt;br&gt;
Multitenancy is a way to manage multiple user groups within a single software instance. Each group, called a tenant, has its own set of privileges and access to the software.&lt;br&gt;
In machine learning (ML) platforms like SageMaker Studio, multitenancy allows multiple teams to work within the same platform, but with separate access and resources.&lt;br&gt;
It is possible to have multiple teams within one SageMaker Studio instance, but it's important to consider factors such as cost, security, and account limitations.&lt;br&gt;
A best practice is to have each team work within its own Studio Domain, using separate accounts. This can be done with the help of tools like AWS Service Catalog, which allows self-service deployment of Studio resources across multiple accounts and regions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm8g9pes59xoog0b8os8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm8g9pes59xoog0b8os8.png" alt="Image description" width="800" height="881"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Domain management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmx5eq1cmccf32ntb4t1z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmx5eq1cmccf32ntb4t1z.png" alt="Image description" width="800" height="881"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Set up your domain for Identity and Access Management (IAM) federation.&lt;/strong&gt;&lt;br&gt;
Before you can use IAM federation for your Studio Domain, you need to create an IAM federation user role (such as a platform administrator) in your IdP (Identity Provider). You can find more information on how to do this in the Identity Management section. To set up SageMaker Studio with IAM, refer to the guide "Onboard to Amazon SageMaker Domain Using IAM" for detailed instructions.&lt;/p&gt;
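
&lt;p&gt;As a rough sketch (not from the whitepaper itself), a domain with IAM authentication mode can be created with the AWS SDK for Python (boto3). The domain name, execution role, VPC, and subnet IDs below are placeholders for resources that must already exist in your account:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

sm = boto3.client("sagemaker")

# Create a Studio domain that authenticates users through IAM.
# All identifiers below are placeholders, not values from the article.
response = sm.create_domain(
    DomainName="ml-platform-dev",
    AuthMode="IAM",
    DefaultUserSettings={
        "ExecutionRole": "arn:aws:iam::111122223333:role/StudioUserRole"
    },
    VpcId="vpc-0123456789abcdef0",
    SubnetIds=["subnet-0123456789abcdef0"],
    AppNetworkAccessType="VpcOnly",  # route Studio traffic through your VPC
)
print(response["DomainArn"])
&lt;/code&gt;&lt;/pre&gt;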

&lt;p&gt;&lt;strong&gt;- Set up your domain for single sign-on (SSO) federation.&lt;/strong&gt;&lt;br&gt;
To use Single Sign-On (SSO) with SageMaker Studio, you must first enable AWS SSO in your AWS Organization management account in the same region where you plan to run SageMaker Studio. The process for setting up the domain is similar to setting up IAM federation, with the exception of selecting SSO in the authentication section.&lt;br&gt;
For more information on how to set it up, please refer to the guide "Onboard to Amazon SageMaker Domain Using IAM Identity Center".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Studio user profile.&lt;/strong&gt;&lt;br&gt;
A user profile is an entity that represents an individual user within a SageMaker Studio domain. It is created when a user is onboarded to SageMaker Studio, and it is used for sharing, reporting, and other user-related features. When an administrator invites a person by email or imports them from SSO, a user profile is created automatically. The user profile is the main way to reference a user, and it's recommended to create one for each physical user of the application. Each user profile has its own settings and its own private Amazon EFS home directory; there is no shared directory between users.&lt;br&gt;
Each user profile also has its own dedicated compute resources, such as EC2 instances, to run notebooks. The resources allocated to one user are completely isolated from those allocated to another user, and resources allocated to users in one account are separate from those allocated to users in another account. Each user can run up to four applications within isolated Docker containers or images on the same instance type.&lt;/p&gt;
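
&lt;p&gt;For illustration, a user profile can also be created through the API. This is a minimal boto3 sketch; the domain ID, profile name, and role ARN are placeholder values:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

sm = boto3.client("sagemaker")

# Create one user profile per physical user. The domain ID and
# execution role are placeholders for values from your own account.
sm.create_user_profile(
    DomainId="d-xxxxxxxxxxxx",              # returned by create_domain
    UserProfileName="data-scientist-jane",  # hypothetical user
    UserSettings={
        "ExecutionRole": "arn:aws:iam::111122223333:role/StudioUserRole"
    },
)
&lt;/code&gt;&lt;/pre&gt;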

&lt;p&gt;&lt;strong&gt;- Jupyter Server app.&lt;/strong&gt;&lt;br&gt;
When you start a Studio Notebook for a user by using the pre-signed URL or by logging in with AWS SSO, the Jupyter Server App will be launched on a SageMaker service-managed VPC instance. Each user has their own dedicated Jupyter Server App. By default, the Jupyter Server App for SageMaker Studio Notebooks runs on a dedicated ml.t3.medium instance, which is reserved as a "system" instance type and the compute for this instance is not charged to the customer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- The Jupyter Kernel Gateway app.&lt;/strong&gt;&lt;br&gt;
The Kernel Gateway app allows users to run multiple Jupyter notebook kernels, terminal sessions, and interactive consoles within a SageMaker Studio image/Kernel Gateway app. Users can create it through the API or the SageMaker Studio interface, and it runs on a chosen instance type. Users can also run up to four Kernel Gateway apps or images on the same physical instance, each one isolated by its container or image.&lt;br&gt;
Users can use built-in SageMaker Studio images that are preconfigured with popular data science and deep learning packages, such as TensorFlow, Apache MXNet, and PyTorch.&lt;br&gt;
To create more apps, you'll need to use a different instance type, as each user profile can only have one running instance of any given type. Users are billed for the time an instance is running, so to save costs, users can shut down instances when not in use; a small boto3 sketch of doing this programmatically follows below.&lt;br&gt;
When a user shuts down and reopens a Kernel Gateway app from the SageMaker Studio interface, the app starts on a new instance, so the packages installed will not persist. Similarly, if a user changes the instance type on a notebook, the packages and session variables will be lost. However, users can use features such as bring-your-own-image and lifecycle scripts to bring their own packages to Studio and persist them through instance switches and new instance launches.&lt;br&gt;
For more information, refer to Shut Down and Update Studio Apps.&lt;/p&gt;
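
&lt;p&gt;As mentioned above, shutting down idle apps can be scripted. Below is a minimal boto3 sketch that stops a user's running Kernel Gateway apps; the domain ID and user profile name are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

sm = boto3.client("sagemaker")

# Find and stop running KernelGateway apps for one user to avoid
# paying for idle instances. Domain and user names are placeholders.
apps = sm.list_apps(
    DomainIdEquals="d-xxxxxxxxxxxx",
    UserProfileNameEquals="data-scientist-jane",
)["Apps"]

for app in apps:
    if app["AppType"] == "KernelGateway" and app["Status"] == "InService":
        sm.delete_app(
            DomainId="d-xxxxxxxxxxxx",
            UserProfileName="data-scientist-jane",
            AppType=app["AppType"],
            AppName=app["AppName"],
        )
&lt;/code&gt;&lt;/pre&gt;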

&lt;p&gt;&lt;strong&gt;- EFS volume.&lt;/strong&gt;&lt;br&gt;
When a domain is created, a single EFS volume is created for use by all the users within the domain. Each user profile receives a private home directory within the EFS volume for storing the user’s notebooks, GitHub repositories, and data files. Access to the folders is segregated by user, through filesystem permissions. SageMaker Studio creates a global unique user ID for each user profile, and applies it as a Portable Operating System Interface (POSIX) user/group ID for the user’s home directory on EFS, which prevents other users from accessing its data.&lt;/p&gt;
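
&lt;p&gt;If you want to audit that mapping, the domain's EFS file system ID and a profile's POSIX user ID are exposed by the SageMaker API. A small boto3 sketch, with placeholder identifiers:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

sm = boto3.client("sagemaker")

# Inspect the EFS mapping for a domain and one of its users.
domain = sm.describe_domain(DomainId="d-xxxxxxxxxxxx")
print(domain["HomeEfsFileSystemId"])    # the shared EFS volume

profile = sm.describe_user_profile(
    DomainId="d-xxxxxxxxxxxx",
    UserProfileName="data-scientist-jane",
)
print(profile["HomeEfsFileSystemUid"])  # POSIX uid owning the home dir
&lt;/code&gt;&lt;/pre&gt;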

&lt;p&gt;It's important to back up your EFS volume to another EFS volume or to Amazon S3 in case of accidental deletion, so that the SageMaker Studio domain can be restored. To restore from a backup, the administrator needs to list all user profiles and their associated EFS user IDs, delete all apps, user profiles, and the SageMaker Studio domain, create a new Studio domain, re-create the user profiles, and copy the files from the backup on EFS/Amazon S3.&lt;br&gt;
You can use lifecycle configurations to back up data to and from S3 every time a user starts their app; a small sketch follows below.&lt;br&gt;
For more detailed instructions, refer to the appendix section Studio Domain Backup and Recovery.&lt;/p&gt;
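
&lt;p&gt;As one possible sketch of that approach (the bucket name and script body are illustrative assumptions, not the whitepaper's), a lifecycle configuration that syncs the home directory to S3 on each Jupyter Server start could be registered like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import base64
import boto3

sm = boto3.client("sagemaker")

# A lifecycle script (a one-line S3 sync, purely illustrative) that
# runs each time a user's Jupyter Server app starts. The bucket name
# is a placeholder.
script = "#!/bin/bash\naws s3 sync /home/sagemaker-user s3://my-studio-backup/home/ --quiet\n"

sm.create_studio_lifecycle_config(
    StudioLifecycleConfigName="backup-home-to-s3",  # hypothetical name
    StudioLifecycleConfigContent=base64.b64encode(script.encode()).decode(),
    StudioLifecycleConfigAppType="JupyterServer",
)
&lt;/code&gt;&lt;/pre&gt;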

&lt;p&gt;&lt;strong&gt;- EBS volume.&lt;/strong&gt;&lt;br&gt;
When you launch a SageMaker Studio Notebook instance, an EBS storage volume is also attached to it. It's used as the main storage for the container or image running on the instance. However, unlike EFS storage, which is persistent, the EBS volume attached to the container is temporary. That means that if you delete the app or image, the data stored locally on the EBS volume will be lost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Securing access to the pre-signed URL.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8xbgqs2yvx901mbjueu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8xbgqs2yvx901mbjueu.png" alt="Image description" width="800" height="340"&gt;&lt;/a&gt;&lt;br&gt;
When a user opens a notebook link in SageMaker Studio, the Studio validates the user's IAM policy to authorize access and generates a pre-signed URL for the user. However, since the Studio console runs on an internet domain, this generated pre-signed URL is visible in the browser session, which could present a security risk for data theft if proper access controls are not in place.&lt;/p&gt;

&lt;p&gt;To prevent this, Studio offers several methods to enforce access controls against pre-signed URL data theft:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Client IP validation, using the IAM policy condition aws:sourceIp&lt;/li&gt;
&lt;li&gt;Client VPC validation, using the IAM policy condition aws:sourceVpc&lt;/li&gt;
&lt;li&gt;Client VPC endpoint validation, using the IAM policy condition aws:sourceVpce&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When accessing Studio notebooks from the Studio console, the only available option is client IP validation with the IAM policy condition aws:sourceIp. However, many enterprises use browser traffic routing products like Zscaler to ensure scale and compliance for internet access; these products generate their own source IP, which is not controlled by the enterprise customer, so the aws:sourceIp condition cannot be used.&lt;br&gt;
To use client VPC endpoint validation with the IAM policy condition aws:sourceVpce, the pre-signed URL needs to be created in the same customer VPC where Studio is deployed and accessed via a Studio VPC endpoint on the customer VPC. This can be done by using DNS forwarding rules in Zscaler and the corporate DNS, then using an Amazon Route 53 inbound resolver to reach the Studio VPC endpoint on the customer VPC.&lt;/p&gt;
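
&lt;p&gt;For illustration only, the client IP validation in option 1 could be expressed as an explicit deny attached to the user's role. The role name, policy name, and CIDR range below are placeholder assumptions, and the exact policy should follow your own security review:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json
import boto3

iam = boto3.client("iam")

# Deny creation of Studio pre-signed URLs from outside a corporate
# CIDR range. Role name and CIDR are placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": "sagemaker:CreatePresignedDomainUrl",
        "Resource": "*",
        "Condition": {
            "NotIpAddress": {"aws:SourceIp": "192.0.2.0/24"}
        },
    }],
}

iam.put_role_policy(
    RoleName="StudioAdminRole",
    PolicyName="RestrictPresignedUrlBySourceIp",
    PolicyDocument=json.dumps(policy),
)
&lt;/code&gt;&lt;/pre&gt;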

&lt;p&gt;&lt;strong&gt;- SageMaker domain quotas and limits.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each AWS account can only have one domain per region in IAM mode and one domain per account in SSO mode.&lt;/li&gt;
&lt;li&gt;SageMaker Studio domain SSO federation is only supported in the region where AWS SSO is set up, across member accounts of the AWS Organization.&lt;/li&gt;
&lt;li&gt;Once created, the VPC and subnet configuration of a domain cannot be changed.&lt;/li&gt;
&lt;li&gt;It is not possible to switch between IAM and SSO modes after creating a domain.&lt;/li&gt;
&lt;li&gt;Each user can launch up to four Kernel Gateway apps per instance type.&lt;/li&gt;
&lt;li&gt;Each user can only launch one instance of each instance type.&lt;/li&gt;
&lt;li&gt;There are limits on the resources consumed within a domain, such as the number of instances launched per instance type and the number of user profiles that can be created; refer to the service quota page for a complete list of limits (a small sketch for listing your current quotas programmatically follows this list).&lt;/li&gt;
&lt;li&gt;There is a hard limit of 1,000 user profiles per SageMaker Studio Domain.&lt;/li&gt;
&lt;li&gt;Customers can request an increase to the default resource limits by submitting an enterprise support case with a business justification; the request will be subject to account-level guardrails.&lt;/li&gt;
&lt;/ul&gt;
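
&lt;p&gt;To check the quota values currently applied to your account, one option is the Service Quotas API; a minimal boto3 sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

quotas = boto3.client("service-quotas")

# List the current SageMaker quotas applied to this account, e.g. the
# maximum number of running apps per instance type.
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        print(quota["QuotaName"], quota["Value"])
&lt;/code&gt;&lt;/pre&gt;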

</description>
      <category>discuss</category>
    </item>
    <item>
      <title>Architectural Patterns to Build End-to-End Data Driven Applications on AWS</title>
      <dc:creator>ahmeddaker</dc:creator>
      <pubDate>Wed, 11 Jan 2023 15:37:24 +0000</pubDate>
      <link>https://forem.com/ahmeddaker/architectural-patterns-to-build-end-to-end-data-driven-applications-on-aws-4b92</link>
      <guid>https://forem.com/ahmeddaker/architectural-patterns-to-build-end-to-end-data-driven-applications-on-aws-4b92</guid>
      <description>&lt;p&gt;Amazon Web Services (AWS) offers a wide range of services to customers, with over 200 options available. While this diversity can make it easier for customers to find the right tool for a specific task, it can also make it challenging to understand how different services can be used together to achieve a particular outcome. In particular, customers may find it difficult to identify proven patterns for building data-driven applications using the various AWS services. Conducting multiple proof-of-concepts (POCs) to find a suitable pattern can be time-consuming and may not provide the level of confidence needed to implement a solution based on other customers' experiences. This guide walks through the process of developing data-driven applications on AWS using examples of proven patterns drawn from customers' successful implementations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modern data architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When building a modern data strategy, there are three main stages to consider:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L0O7zJzJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ht4r4wzwyy2ggx1nlhgn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L0O7zJzJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ht4r4wzwyy2ggx1nlhgn.png" alt="Image description" width="880" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Modernization: This stage involves updating your data infrastructure to take advantage of the scalability, security, and reliability offered by a cloud provider like AWS.&lt;/li&gt;
&lt;li&gt;Unification: This stage involves consolidating and integrating your data in data lakes and purpose-built data stores, such as Amazon S3, Amazon DynamoDB and Amazon Redshift. This allows you to easily access, process and analyze your data.&lt;/li&gt;
&lt;li&gt;Innovation: This stage involves leveraging artificial intelligence (AI) and machine learning (ML) to create new experiences, optimize business processes and gain insights from your data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It is important to note that these stages do not have to be completed in a specific order, and it is possible to start at any stage, depending on your organization's data journey.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modern data architecture on AWS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_4sMuk6P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uh5x1few7gxqcnbynlz0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_4sMuk6P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uh5x1few7gxqcnbynlz0.png" alt="Image description" width="880" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Choose the right type of database that works best for your modern application; it can be a traditional relational database or something newer, like a NoSQL database. This will help your application perform better.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6SmevGpu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b4wv3ztoijw4q86z4ecj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6SmevGpu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b4wv3ztoijw4q86z4ecj.png" alt="Image description" width="880" height="447"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use a data lake, which is a big storage area for storing all kinds of data in a single place. This storage is often provided by Amazon S3. It will give you more freedom to use your data in different ways.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Build analytics on top of the data stored in the data lake, this can be used for things like getting reports, finding trends, and making predictions.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JKRGQit8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aliyvfkgyyyqmyle9v2c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JKRGQit8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aliyvfkgyyyqmyle9v2c.png" alt="Image description" width="880" height="579"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use AI and Machine Learning to make predictions and make your systems and applications smarter. AWS has many different tools to help with this.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y7I_fZ5n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gjlgqscylysge8v9d222.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y7I_fZ5n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gjlgqscylysge8v9d222.png" alt="Image description" width="880" height="430"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Make sure to properly manage and share your data, this includes making sure it's safe, making sure the right people have access to it and making sure it complies with regulations.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Data-driven architectural patterns&lt;/strong&gt;&lt;br&gt;
The five most commonly seen architecture patterns on AWS, covering a range of use cases across different industries and customer sizes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Customer 360 architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GAQ-JfBQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/62wex48n3ear47206iyz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GAQ-JfBQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/62wex48n3ear47206iyz.png" alt="Image description" width="880" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Data Ingestion:&lt;/strong&gt; This involves consolidating data from various sources and storing it in a schemaless manner. The data is ingested as close to the source system as possible, and includes historical data as well as data that is predicted to be useful in the future. This can be done using connectors for sources such as SAP, and services like Amazon AppFlow, Google Analytics integrations, and data movement services.&lt;br&gt;
&lt;strong&gt;- Building a Unified Customer Profile:&lt;/strong&gt; This step involves extracting and linking elements from each customer record to create a single, 360-degree dataset that serves as the source of truth for customer data. Amazon Neptune is used to create a near real-time, persistent, and precise view of the customer and their journey.&lt;br&gt;
&lt;strong&gt;- Intelligence Layer:&lt;/strong&gt; This step involves analyzing the data using analytical stores such as Amazon Redshift or S3 to refine the ontologies, and accessing raw information using AWS Glue DataBrew and serverless Amazon Athena.&lt;br&gt;
&lt;strong&gt;- Activation Layer:&lt;/strong&gt; This final step involves activating the refined customer ontology by using AI/ML to make recommendations and predictions and to create next-best-action APIs using Amazon Personalize and Amazon Pinpoint. These actions are then integrated and presented across various channels for an optimized, personalized experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Event-driven architecture with IoT data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fv7E02vh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6itpd0bpfyh0kewemiob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fv7E02vh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6itpd0bpfyh0kewemiob.png" alt="Image description" width="880" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The architecture "Derive near real-time predictive analytics from IOT data" focuses on using IoT data to gain insights through predictive analytics. The process involves several steps including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data collection from IoT devices such as medical devices, car sensors, and industrial IoT sensors. This telemetry data is collected close to the devices using AWS IoT Greengrass, which brings cloud capabilities to local devices.&lt;/li&gt;
&lt;li&gt;Data ingestion into the cloud using edge-to-cloud interface services such as AWS IoT Core, which is a managed cloud platform that allows connected devices to easily and securely interact with cloud applications, and AWS IoT SiteWise, which is a managed service that allows you to collect, model, analyze, and visualize data from industrial equipment at scale.&lt;/li&gt;
&lt;li&gt;Data transformation in near real-time using Amazon Kinesis Data Analytics, which offers an easy way to transform and analyze streaming data in near real-time with Apache Flink and Apache Beam frameworks. The stream data is often enriched using lookup data hosted in a data warehouse such as Amazon Redshift.&lt;/li&gt;
&lt;li&gt;Machine learning models are trained and deployed in Amazon SageMaker, and inferences are invoked in micro-batches using AWS Lambda. The inference results are sent to Amazon OpenSearch Service to create personalized monitoring dashboards.&lt;/li&gt;
&lt;li&gt;The data lake stores telemetry data for future batch analytics; this is done by micro-batch streaming the data into the S3 data lake using Amazon Kinesis Data Firehose, a fully managed service for delivering near real-time streaming data to destinations such as S3, Amazon Redshift, Amazon OpenSearch Service, Splunk, and custom HTTP endpoints or endpoints owned by supported third-party service providers (a minimal ingestion sketch follows this list).&lt;/li&gt;
&lt;/ul&gt;
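
&lt;p&gt;As a minimal sketch of the Firehose ingestion step referenced above (the delivery stream name and record shape are placeholder assumptions):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json
import boto3

firehose = boto3.client("firehose")

# Push a single telemetry reading into a Firehose delivery stream that
# is configured to deliver to S3. The stream name is a placeholder.
record = {"device_id": "sensor-42", "temperature_c": 71.3}

firehose.put_record(
    DeliveryStreamName="iot-telemetry-to-s3",
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
&lt;/code&gt;&lt;/pre&gt;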

&lt;p&gt;&lt;strong&gt;3. Personalized architecture recommendations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BlZoOMNy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m3lumvmwfln26letwgbu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BlZoOMNy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m3lumvmwfln26letwgbu.png" alt="Image description" width="880" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The architecture "Build real-time recommendations on AWS" focuses on using user interaction data to create personalized recommendations in real-time. The process involves several steps including:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Data preparation:&lt;/strong&gt; Collect user interaction data, such as item views and item clicks. Upload this data into Amazon S3 and perform data cleaning using AWS Glue DataBrew to train the model in Amazon Personalize for real-time recommendations.&lt;br&gt;
&lt;strong&gt;- Train the model with Amazon Personalize:&lt;/strong&gt; The data used for modeling on Amazon Personalize consists of three types: user activity or events, details about items in the catalog, and details about the users.&lt;br&gt;
&lt;strong&gt;- Get real-time recommendations:&lt;/strong&gt; After training the model, it can be used to provide recommendations to users through an API exposed through Amazon API Gateway. These recommendations are custom, private, and personalized, and can be set up in just a few clicks; a minimal runtime call is sketched below.&lt;/p&gt;
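
&lt;p&gt;A minimal sketch of such a runtime call with boto3; the campaign ARN and user ID are placeholders for resources created during training:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

personalize_rt = boto3.client("personalize-runtime")

# Fetch real-time recommendations for one user from a trained
# Personalize campaign. The campaign ARN and user ID are placeholders.
response = personalize_rt.get_recommendations(
    campaignArn="arn:aws:personalize:us-east-1:111122223333:campaign/demo",
    userId="user-123",
    numResults=10,
)
for item in response["itemList"]:
    print(item["itemId"])
&lt;/code&gt;&lt;/pre&gt;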

&lt;p&gt;&lt;strong&gt;4. Near real-time customer engagement&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lj1awsEt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/79piovrp3mwjfmfyzgni.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lj1awsEt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/79piovrp3mwjfmfyzgni.png" alt="Image description" width="880" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The architecture "Near real-time customer engagement architecture on AWS" focuses on using Amazon Pinpoint to collect customer engagement data and use it for creating insights through machine learning and data visualization. The process involves several steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Initializing Pinpoint and creating a marketing project:&lt;/strong&gt; Setting up the project to add users and their contact information, such as email addresses, and configuring metrics collection to capture customer interactions in Amazon S3.&lt;br&gt;
&lt;strong&gt;- Near real-time data ingestion:&lt;/strong&gt; Collecting data from Amazon Pinpoint in near real-time through Amazon Kinesis Data Firehose (optionally replaced with Kinesis Data Streams for real-time use cases), and storing it in S3.&lt;br&gt;
&lt;strong&gt;- SageMaker model implementation:&lt;/strong&gt; Training a model on a combination of Amazon Pinpoint engagement data and other customer demographic data to predict the likelihood of customer churn or to segment customers. This is done iteratively, and the model is hosted on a SageMaker endpoint.&lt;br&gt;
&lt;strong&gt;- Data consumption with Athena and QuickSight:&lt;/strong&gt; Analyzing the Amazon Pinpoint engagement data, combining it with other facts from the data lake using Amazon Athena, and visualizing it with Amazon QuickSight to share insights with others in the organization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Data anomaly and fraud detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gnvUaxHe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j3ucf91ey0antev1c15e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gnvUaxHe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j3ucf91ey0antev1c15e.png" alt="Image description" width="880" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The architecture "Fraud detection architecture on AWS" is a solution for fraud detection by training machine learning models on credit card transaction data and using it to predict fraud. The process involves several steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Develop a fraud prediction machine learning model:&lt;/strong&gt; A dataset of credit card transactions is deployed using an AWS CloudFormation template, and an Amazon SageMaker notebook is used to train different ML models on it.&lt;br&gt;
&lt;strong&gt;- Perform fraud prediction:&lt;/strong&gt; An AWS Lambda function processes transactions from the dataset, assigning anomaly and classification scores to incoming data points using SageMaker endpoints (a minimal endpoint call is sketched below). An Amazon API Gateway REST API is used to initiate predictions. The processed transactions are loaded into an S3 bucket for storage using Amazon Kinesis Data Firehose.&lt;br&gt;
&lt;strong&gt;- Analyze fraud transactions:&lt;/strong&gt; Once the transactions are loaded into S3, different analytics tools and services, such as visualization, reporting, and ad-hoc queries, can be used for further analysis.&lt;/p&gt;
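
&lt;p&gt;A minimal sketch of the endpoint call such a Lambda function might make; the endpoint name and payload shape are placeholder assumptions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Score one transaction against a deployed fraud-detection endpoint.
# The endpoint name and feature payload are placeholders.
payload = {"amount": 182.50, "merchant_category": "5411", "hour": 23}

response = runtime.invoke_endpoint(
    EndpointName="fraud-detection-endpoint",
    ContentType="application/json",
    Body=json.dumps(payload),
)
score = json.loads(response["Body"].read())
print(score)
&lt;/code&gt;&lt;/pre&gt;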

&lt;p&gt;In summary, when building data-driven applications on AWS, it's important to start by identifying key business requirements and user personas, and then use reference patterns to select the appropriate services for the use case. This includes using purpose-built services for data ingestion, such as Amazon Kinesis for real-time data and AWS Database Migration Service for batch data, and using tools like AWS DataSync and Amazon AppFlow to move data from file shares and SaaS applications to the storage layer. Data should be treated as an organizational asset and made available to the entire organization to drive actionable insights.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>architecture</category>
      <category>design</category>
    </item>
  </channel>
</rss>
