<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Michael Stainsbury </title>
    <description>The latest articles on Forem by Michael Stainsbury  (@mlexam).</description>
    <link>https://forem.com/mlexam</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1125740%2F931c84a7-57cd-4a3b-bb47-14765863e8bf.jpg</url>
      <title>Forem: Michael Stainsbury </title>
      <link>https://forem.com/mlexam</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mlexam"/>
    <language>en</language>
    <item>
      <title>AWS Machine Learning exam guide</title>
      <dc:creator>Michael Stainsbury </dc:creator>
      <pubDate>Wed, 02 Aug 2023 17:03:11 +0000</pubDate>
      <link>https://forem.com/mlexam/aws-machine-learning-exam-guide-3nl1</link>
      <guid>https://forem.com/mlexam/aws-machine-learning-exam-guide-3nl1</guid>
      <description>&lt;p&gt;&lt;strong&gt;A guide to the guide&lt;/strong&gt;&lt;br&gt;
Syllabus, specification and blueprint are all terms for the knowledge domain of an exam or course. However, AWS calls the description of the content of its Machine Learning exam the Exam Guide. Perhaps this is a telling choice, since the information provided is far from comprehensive: it is just a guide.&lt;/p&gt;

&lt;p&gt;If you come from an AWS Machine Learning background the Exam Guide PDF will be sufficient for you. However, if you are already a Data Scientist wishing to move into AWS, or you use AWS and want to learn SageMaker and Machine Learning, then large chunks of the Exam Guide will be unintelligible. This guide to the guide fills the gaps and explains the high-level concepts.&lt;/p&gt;

&lt;p&gt;The Exam Guide is where the exam subjects are listed, split into four domains and fifteen sub-domains. This article describes each sub-domain in enough detail for the complete newbie to get a good idea of what it is about. If you intend to study for the AWS Machine Learning certificate this will give you an overview of what you are getting yourself into.&lt;/p&gt;

&lt;p&gt;AWS pdf: &lt;a href="https://d1.awsstatic.com/training-and-certification/docs-ml/AWS-Certified-Machine-Learning-Specialty_Exam-Guide.pdf"&gt;https://d1.awsstatic.com/training-and-certification/docs-ml/AWS-Certified-Machine-Learning-Specialty_Exam-Guide.pdf&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Domain 1: Data Engineering
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.mlexam.com/home/domain-1-data-engineering/"&gt;Domain 1 Data Engineering&lt;/a&gt; is concerned with obtaining the data, transforming it and putting it in a repository. It comprises 20% of the exam marks. There are three sub-domains that can be summarised as:&lt;/p&gt;

&lt;p&gt;1.1 &lt;a href="https://www.mlexam.com/data-repositories/"&gt;Data repositories&lt;/a&gt;&lt;br&gt;
1.2 &lt;a href="https://www.mlexam.com/data-cleansing/"&gt;Data ingestion&lt;/a&gt;&lt;br&gt;
1.3 &lt;a href="https://www.mlexam.com/data-transformation/"&gt;Data transformation&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 Data repositories
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Create data repositories for machine learning&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The data repository is where raw and processed data is stored. S3 is the repository of choice for Machine Learning in AWS and all built-in algorithms and services can consume data from S3. Other data stores are also mentioned in the exam guide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Database (Relational Database Service)&lt;/li&gt;
&lt;li&gt;Data Lake (Lake Formation)&lt;/li&gt;
&lt;li&gt;EFS&lt;/li&gt;
&lt;li&gt;EBS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Often data is generated by the business itself, but sometimes data from other sources is needed to train the model, for example libraries of image data to train the Object Detection algorithm. Many data sources are publicly available.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.2 Data ingestion
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Identify and implement a data ingestion solution&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The data ingestion sub-domain is concerned with gathering the raw data into the repository. This can be via batch processing or streaming data. With &lt;a href="https://www.mlexam.com/batch-processing/"&gt;batch processing&lt;/a&gt;, data is collected and grouped at a point in time and passed to the data store. &lt;a href="https://www.mlexam.com/streaming-data-for-machine-learning/"&gt;Streaming data&lt;/a&gt; is constantly being collected and fed into the data store. The AWS streaming services are:&lt;/p&gt;

&lt;p&gt;Kinesis family of streaming data services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kinesis Data Streams&lt;/li&gt;
&lt;li&gt;Kinesis Firehose&lt;/li&gt;
&lt;li&gt;Kinesis Analytics&lt;/li&gt;
&lt;li&gt;Kinesis Video Streams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zbPakx5S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dynjieq533pcdmx22sxx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zbPakx5S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dynjieq533pcdmx22sxx.jpg" alt="An infographic describing the Kinesis family of streaming data services: Kinesis Data Streams, Kinesis Firehose, Kinesis Analytics and Kinesis Video Streams" width="450" height="625"&gt;&lt;/a&gt;&lt;/p&gt;
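&lt;p&gt;As a sketch of streaming ingestion, the snippet below builds the parameters for a Kinesis Data Streams &lt;code&gt;put_record&lt;/code&gt; call. The stream name and payload are invented for illustration, and the boto3 call itself is only shown in a comment so the example runs without AWS credentials.&lt;/p&gt;

```python
import json

def build_kinesis_record(stream_name, payload, partition_key):
    # Build the parameters for a Kinesis Data Streams put_record call.
    # Data must be bytes; JSON is a common encoding for event payloads.
    return {
        "StreamName": stream_name,
        "Data": json.dumps(payload).encode("utf-8"),
        "PartitionKey": partition_key,
    }

record = build_kinesis_record("clickstream", {"user": "u1", "page": "/home"}, "u1")

# With boto3 installed and credentials configured, the record would be sent with:
#   import boto3
#   kinesis = boto3.client("kinesis")
#   kinesis.put_record(**record)
```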

&lt;p&gt;Batch processing requires a way to schedule or trigger the processing, also called job scheduling. Examples are Glue Workflow and Step Functions. AWS batch services include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EMR (Hadoop)&lt;/li&gt;
&lt;li&gt;Glue (Spark)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  1.3 Data transformation
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Identify and implement a data transformation solution&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The third Data Engineering sub-domain focuses on how raw data is transformed into data that can be used for ML processing. The transformation process changes the data structure. The data may also need to be cleaned up, de-duplicated and have its attributes standardised, with incomplete data managed. The AWS services are similar to those used for data ingestion:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Glue (Spark)&lt;/li&gt;
&lt;li&gt;EMR (Hadoop, Spark, Hive)&lt;/li&gt;
&lt;li&gt;AWS Batch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once these data engineering processes are complete, the data is ready for further pre-processing before being fed into a Machine Learning algorithm. This pre-processing is covered by the second knowledge domain, Exploratory Data Analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Domain 2: Exploratory Data Analysis
&lt;/h2&gt;

&lt;p&gt;In the &lt;a href="https://www.mlexam.com/home/domain-2-exploratory-data-analysis/"&gt;Exploratory Data Analysis domain&lt;/a&gt; the data is analysed so it can be understood and cleaned up. It comprises 24% of the exam marks. There are three sub-domains:&lt;/p&gt;

&lt;p&gt;2.1 &lt;a href="https://www.mlexam.com/data-cleansing/"&gt;Prep and sanitise data&lt;/a&gt;&lt;br&gt;
2.2 &lt;a href="https://www.mlexam.com/feature-engineering/"&gt;Feature engineering&lt;/a&gt;&lt;br&gt;
2.3 &lt;a href="https://www.mlexam.com/data-visualization/"&gt;Analyse and visualize data&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 Prep and sanitise data
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Sanitize and prepare data for modeling&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://www.mlexam.com/data-cleansing/"&gt;Sanitize and prepare data for modeling&lt;/a&gt;, the data can be cleaned up using techniques to remove distortions and fill in gaps.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;missing data&lt;/li&gt;
&lt;li&gt;corrupt data&lt;/li&gt;
&lt;li&gt;stop words&lt;/li&gt;
&lt;li&gt;formatting&lt;/li&gt;
&lt;li&gt;normalizing&lt;/li&gt;
&lt;li&gt;augmenting&lt;/li&gt;
&lt;li&gt;scaling data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GIf4aer2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/86z2vo4trcr87z9v79ez.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GIf4aer2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/86z2vo4trcr87z9v79ez.jpg" alt="An infographic showing techniques used to clean data." width="450" height="625"&gt;&lt;/a&gt;&lt;/p&gt;
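&lt;p&gt;A minimal sketch of two of these clean-up steps, mean imputation for missing data and min-max scaling, in plain Python (the column values are invented for illustration):&lt;/p&gt;

```python
def impute_mean(values):
    # Replace missing entries (None) with the mean of the observed values.
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_scale(values):
    # Scale values into the range 0..1, a common normalisation step.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

raw = [10.0, None, 30.0]
clean = impute_mean(raw)       # [10.0, 20.0, 30.0]
scaled = min_max_scale(clean)  # [0.0, 0.5, 1.0]
```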

&lt;blockquote&gt;
&lt;p&gt;Data labeling is the process of identifying raw data and adding one or more meaningful and informative labels to provide context. (&lt;a href="https://aws.amazon.com/sagemaker/data-labeling/what-is-data-labeling/"&gt;AWS&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Data labeling can be costly and time-consuming because it involves applying the labels manually. AWS provides the Mechanical Turk service to reduce the cost and speed up the labelling process.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Feature engineering
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Perform feature engineering&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.mlexam.com/feature-engineering/"&gt;Feature Engineering&lt;/a&gt; is about creating new features from existing ones to make the Machine Learning algorithms more powerful. Feature Engineering techniques are used to reduce the number of features and categorise the data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;binning&lt;/li&gt;
&lt;li&gt;tokenization&lt;/li&gt;
&lt;li&gt;outliers&lt;/li&gt;
&lt;li&gt;synthetic features&lt;/li&gt;
&lt;li&gt;one-hot encoding&lt;/li&gt;
&lt;li&gt;reducing dimensionality of data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EksCgaFv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zit1oce33wu9n4ouvo5e.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EksCgaFv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zit1oce33wu9n4ouvo5e.jpg" alt="An infographic showing aspects of feature engineering." width="800" height="2000"&gt;&lt;/a&gt;&lt;/p&gt;
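&lt;p&gt;Two of these techniques, one-hot encoding and binning, can be sketched in a few lines of plain Python (the categories and bin edges are invented for illustration):&lt;/p&gt;

```python
def one_hot(value, categories):
    # One-hot encoding: one binary indicator per known category.
    return [1 if value == c else 0 for c in categories]

def bin_index(value, edges):
    # Binning: the bucket index is the number of edges the value exceeds.
    return sum(1 for e in edges if value > e)

colours = ["red", "green", "blue"]
encoded = one_hot("green", colours)   # [0, 1, 0]
bucket = bin_index(42, [18, 35, 65])  # ages 36-65 fall in bucket 2
```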

&lt;h3&gt;
  
  
  2.3 Analyse and visualize data
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Analyze and visualize data for machine learning&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.mlexam.com/data-visualization/"&gt;Analyzing and visualizing the data&lt;/a&gt; overlaps with the other two sub-domains which use these techniques. The techniques include graphs, charts and matrices.&lt;/p&gt;

&lt;p&gt;Before data can be sanitized and prepared it has to be understood. This is done using statistics that focus on specific aspects of the data, and graphs and charts that allow relationships and distributions to be seen.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scatter plot&lt;/li&gt;
&lt;li&gt;histogram&lt;/li&gt;
&lt;li&gt;box plot&lt;/li&gt;
&lt;li&gt;correlation&lt;/li&gt;
&lt;li&gt;summary statistics&lt;/li&gt;
&lt;li&gt;p value&lt;/li&gt;
&lt;li&gt;elbow plot&lt;/li&gt;
&lt;li&gt;cluster size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You now understand your data and have cleaned it up, ready for the next stage: modeling.&lt;/p&gt;
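&lt;p&gt;As an illustration of the statistics mentioned above, here is a small plain-Python sketch of summary statistics and Pearson correlation (the sample values are invented):&lt;/p&gt;

```python
def summary(values):
    # Basic summary statistics used during exploratory data analysis.
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return {"n": n, "mean": mean, "min": min(values), "max": max(values), "var": var}

def correlation(xs, ys):
    # Pearson correlation coefficient between two equal-length series.
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

stats = summary([1.0, 2.0, 3.0, 4.0])
r = correlation([1, 2, 3], [2, 4, 6])  # perfectly linear relationship
```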

&lt;h2&gt;
  
  
  Domain 3: Modeling
&lt;/h2&gt;

&lt;p&gt;When people talk about Machine Learning they are mostly thinking about Modeling. Modeling is selecting and testing the algorithms to process data to find the information of value. It comprises 36% of the exam marks. This domain has five sub-domains:&lt;/p&gt;

&lt;p&gt;3.1 &lt;a href="https://www.mlexam.com/problem-framing-machine-learning/"&gt;Frame the business problem&lt;/a&gt;&lt;br&gt;
3.2 &lt;a href="https://www.mlexam.com/how-to-select-a-model-for-a-given-machine-learning-problem/"&gt;Select the appropriate model&lt;/a&gt;&lt;br&gt;
3.3 &lt;a href="https://www.mlexam.com/training-machine-learning-models/"&gt;Train the model&lt;/a&gt;&lt;br&gt;
3.4 &lt;a href="https://www.mlexam.com/model-tuning/"&gt;Tune the model&lt;/a&gt;&lt;br&gt;
3.5 &lt;a href="https://www.mlexam.com/how-to-evaluate-machine-learning-models/"&gt;Evaluate the model&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Frame the business problem
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Frame business problems as machine learning problems&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;First we decide whether Machine Learning is appropriate for the problem. Machine Learning is good for data-driven problems involving large amounts of data where the rules cannot easily be coded. The business problem can probably be framed in many ways, and this determines what kind of Machine Learning problem is being solved. For example, the business problem could be framed to require a yes/no answer, as in fraud detection, or a value, as in a share price. This sub-domain also identifies the type of data available, and so whether the algorithm will use a supervised or unsupervised learning paradigm. From the type of problem to be solved, the required capabilities of the algorithm can be identified, for example classification, regression, forecasting, clustering or recommendation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4SJ2vDvb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tqpjt3bdikfh23bf5l7o.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4SJ2vDvb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tqpjt3bdikfh23bf5l7o.jpg" alt="An infographic describing best practice when framing Machine Learning problems." width="450" height="675"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Select the appropriate model
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Select the appropriate model(s) for a given machine learning problem&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Many models are available through AWS Machine Learning services, with SageMaker alone providing over seventeen built-in algorithms. Each model has its own use cases and requirements. Once a model has been chosen, an iterative process of training, tuning and evaluation is undertaken.&lt;/p&gt;

&lt;p&gt;The exam guide explicitly lists only two SageMaker built-in algorithms, XGBoost and K-means; since there are many built-in algorithms, perhaps these are just the most important. Modeling concepts are also listed:&lt;/p&gt;

&lt;p&gt;linear regression — Linear Learner, K-Nearest Neighbors, Factorization Machines&lt;br&gt;
logistic regression — XGBoost&lt;br&gt;
decision trees — XGBoost&lt;br&gt;
random forests — Random Cut Forest&lt;br&gt;
RNN — DeepAR forecasting, Sequence to Sequence&lt;br&gt;
CNN — Sequence to Sequence&lt;br&gt;
ensemble learning — XGBoost&lt;br&gt;
transfer learning — Image classification&lt;/p&gt;

&lt;h3&gt;
  
  
  3.3 Train the model
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Train machine learning models&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Model training is the process of providing a model with data to learn from. During model training the data is split into three parts: most is used as training data, with the remainder used for validation and testing. Cross-validation is a technique used when training data is limited. By understanding the internal workings of algorithms, model training can be optimised. Concepts used by models in training include gradient descent, loss functions, local minima, convergence, batches, optimizers and probability.&lt;/p&gt;
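&lt;p&gt;The three-way data split described above can be sketched in plain Python. The 80/10/10 fractions are a common convention, not a requirement:&lt;/p&gt;

```python
import random

def split_dataset(rows, train_frac=0.8, val_frac=0.1, seed=42):
    # Shuffle, then split into training, validation and test sets.
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(100))
# 80 rows for training, 10 for validation, 10 for testing
```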

&lt;p&gt;The speed and cost of training depend on choices about the compute resources used. The instance type, and so its processing power, can be specified. Graphics Processing Units (GPUs) can provide more compute power, but not all algorithms can utilise them; those that cannot may instead use cheaper CPU instances. For heavy training loads, distributed processing options may be available to speed up training. Spark and non-Spark data processing can be used to pre-process training data.&lt;/p&gt;

&lt;p&gt;Model training is also concerned with how and when models are updated and retrained.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.4 Tune the model
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Perform hyperparameter optimization&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Model tuning is also known as hyperparameter optimisation. Machine Learning algorithms can be thought of as black boxes, with hyperparameters being the exposed controls that can be changed and optimised. Hyperparameter settings do not change during training. They can be tuned manually before training commences, using search methods, or automatically by using SageMaker guided search. Model tuning can be improved by using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regularization&lt;/li&gt;
&lt;li&gt;Drop out&lt;/li&gt;
&lt;li&gt;L1/L2&lt;/li&gt;
&lt;li&gt;Model initialization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Models that utilise a neural network architecture use other hyperparameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;layers / nodes&lt;/li&gt;
&lt;li&gt;learning rate&lt;/li&gt;
&lt;li&gt;activation functions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tree-based models have hyperparameters that influence the number of trees and number of levels. The learning rate is used to optimise linear models.&lt;/p&gt;
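&lt;p&gt;SageMaker automatic model tuning uses guided (Bayesian) search, but the idea of searching a hyperparameter space can be illustrated with a simple exhaustive grid search over a toy objective. The hyperparameter names and values below are invented for illustration:&lt;/p&gt;

```python
import itertools

def grid_search(objective, space):
    # Evaluate every combination in the grid and keep the lowest loss.
    names = list(space)
    best_params, best_loss = None, float("inf")
    for values in itertools.product(*(space[n] for n in names)):
        params = dict(zip(names, values))
        loss = objective(params)
        if best_loss > loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

def toy_loss(p):
    # Toy objective: pretend the ideal settings are lr=0.1 and depth=6.
    return abs(p["learning_rate"] - 0.1) + abs(p["max_depth"] - 6)

space = {"learning_rate": [0.001, 0.01, 0.1, 0.3], "max_depth": [2, 4, 6, 8]}
best, loss = grid_search(toy_loss, space)
# best == {"learning_rate": 0.1, "max_depth": 6}, loss == 0.0
```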

&lt;h3&gt;
  
  
  3.5 Evaluate the model
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Evaluate machine learning models&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Model evaluation is used to find out how well a Model will perform in predicting the desired outcome. This is done using metrics to measure the performance of the Model. Metrics measure accuracy, precision and other features of the Model by comparing the results from the Model with the known contents of the training data.&lt;/p&gt;

&lt;p&gt;Metrics commonly used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AUC-ROC&lt;/li&gt;
&lt;li&gt;accuracy&lt;/li&gt;
&lt;li&gt;precision&lt;/li&gt;
&lt;li&gt;recall&lt;/li&gt;
&lt;li&gt;RMSE&lt;/li&gt;
&lt;li&gt;F1 score&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A confusion matrix is used to compare the labels a model predicts with the actual labels.&lt;/p&gt;
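&lt;p&gt;Several of these metrics can be derived directly from confusion-matrix counts. A small plain-Python sketch, with invented counts:&lt;/p&gt;

```python
def classification_metrics(tp, fp, fn, tn):
    # Derive common evaluation metrics from confusion-matrix counts:
    # true/false positives (tp, fp) and false/true negatives (fn, tn).
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

m = classification_metrics(tp=8, fp=2, fn=2, tn=88)
# precision 0.8, recall 0.8, f1 0.8, accuracy 0.96
```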

&lt;p&gt;Evaluation methods can be performed offline or online. A/B testing can also be used to compare the performance of model variants. Metrics allow the detection of a poorly fitting model, caused by bias (underfitting) or variance (overfitting), where the model performs poorly on real-world data.&lt;/p&gt;

&lt;p&gt;Other metrics allow models and model variants to be compared using metrics that are not directly related to data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;time to train a model&lt;/li&gt;
&lt;li&gt;quality of model&lt;/li&gt;
&lt;li&gt;engineering costs&lt;/li&gt;
&lt;li&gt;Cross validation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your model is now ready to be used with real data. But before it can be let loose on your corporate data it has to be deployed into the production environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Domain 4: Machine Learning Implementation and Operations
&lt;/h2&gt;

&lt;p&gt;This domain is about Systems Architecture and DevOps skills to make everything work in production. It comprises 20% of the exam marks. There are four sub-domains:&lt;/p&gt;

&lt;p&gt;4.1 &lt;a href="https://www.mlexam.com/machine-learning-production-environment/"&gt;Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.&lt;/a&gt;&lt;br&gt;
4.2 &lt;a href="https://www.mlexam.com/aws-ml-services/"&gt;Recommend and implement the appropriate machine learning services and features for a given problem.&lt;/a&gt;&lt;br&gt;
4.3 &lt;a href="https://www.mlexam.com/aws-security/"&gt;Apply basic AWS security practices to machine learning solutions.&lt;/a&gt;&lt;br&gt;
4.4 &lt;a href="https://www.mlexam.com/deploy-ml-model/"&gt;Deploy and operationalize machine learning solutions.&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4.1 The production environment
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Designing AWS production environments for performance, availability, scalability, resiliency, and fault tolerance is part of AWS best practice. Resilience and availability are provided by deploying models across multiple AWS Regions and multiple Availability Zones. Auto Scaling groups and load balancing provide scalability for compute resources. Performance is optimised by rightsizing EC2 instances and volumes, and by provisioned IOPS. There are a variety of deployment options, including EC2, SageMaker-managed EC2 via endpoints, and Docker containers. CloudTrail and CloudWatch are used for AWS environment logging and monitoring, which assists in building fault-tolerant systems and error monitoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 ML services and features
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Recommend and implement the appropriate machine learning services and features for a given problem&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;AWS provides a range of services and features to choose from for a given Machine Learning problem. AWS provides AI services, which are highly optimised algorithms deployed on AWS-managed infrastructure. Some of the services contain pre-trained models ready for production inferencing. Some examples are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Polly, text to speech&lt;/li&gt;
&lt;li&gt;Lex, chatbot&lt;/li&gt;
&lt;li&gt;Transcribe, speech to text&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NS7peyMJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/367kyek1hbfaeb6gb756.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NS7peyMJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/367kyek1hbfaeb6gb756.jpg" alt="An Infographic listing the Amazon AI services provided by AWS." width="450" height="675"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When using AI services, AWS does all the heavy lifting of managing infrastructure, models and training. There are other options if you need more control of these aspects: SageMaker built-in algorithms can be used, or you can bring your own model. This allows cost considerations to influence the choice of compute services. Even more sophisticated cost control can be achieved by using spot instances to train deep learning models using AWS Batch.&lt;/p&gt;

&lt;p&gt;AWS service limits cap the amount of resources that can be used, for example the number of instances of a service in an account. Service limits can be increased by AWS on request, but sometimes there is a hard limit, the maximum for that service in a single AWS account or Region.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.3 Security
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Apply basic AWS security practices to machine learning solutions&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Security in AWS starts with the ubiquitous IAM, Identity and Access Management, which controls the activities of all AWS services. Since S3 is the most common storage for Machine Learning services S3 bucket policies are also included. It may seem that access to VPCs, &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html"&gt;Amazon Virtual Private Cloud&lt;/a&gt;, and VPC &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/vpc-security-groups.html"&gt;Security Groups&lt;/a&gt; may not be needed if you are implementing serverless applications. However, under the hood, SageMaker uses these services and the security has to be configured. As well as configuring security for the services, data security also has to be considered. This includes encryption of data both at rest and in transit. Anonymisation can be used to protect PII data, &lt;a href="https://docs.aws.amazon.com/comprehend/latest/dg/pii.html"&gt;Personally identifiable information&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.4 Deploy and operationalize
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Deploy and operationalize machine learning solutions&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are many ways to deploy Machine Learning models in production; one method is to use SageMaker endpoints. Despite the name, a SageMaker endpoint is more than an isolated interface: it sits on top of serious processing power, provided by SageMaker-managed EC2 instances that are set up by the endpoint configuration. SageMaker endpoints can host multiple variants of the same model, enabling those variants to be compared using testing strategies such as A/B testing.&lt;/p&gt;

&lt;p&gt;Once in production the model is monitored because the performance of a model may degrade over time as real world data changes. This drop in performance can be detected and used to trigger the retraining of the model via a retrain pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The AWS Certified Machine Learning — Specialty exam guide is good for outlining the breadth of the syllabus and how it is divided into four domains and fifteen sub-domains. Whilst it lists and mentions many subjects, only a few are described in any detail, and even those are a little light. This article provides additional description of the subjects, to allow someone considering studying for the exam to understand what has to be learnt to achieve exam success.&lt;/p&gt;

&lt;h2&gt;
  
  
  Credits
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Photo by &lt;a href="https://unsplash.com/@overlyawesome?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Daniel Gonzalez&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/map?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Originally published at &lt;a href="https://www.mlexam.com/aws-machine-learning-exam-guide/"&gt;www.mlexam.com&lt;/a&gt; on October 8, 2020.&lt;/li&gt;
&lt;li&gt;All infographics by &lt;a href="https://www.linkedin.com/in/michael-stainsbury-b695392b/"&gt;Michael Stainsbury&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Copyright &lt;a href="https://www.linkedin.com/in/michael-stainsbury-b695392b/"&gt;Michael Stainsbury&lt;/a&gt; 2020, 2023&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>machinelearning</category>
      <category>certification</category>
    </item>
    <item>
      <title>AWS SageMaker BlazingText Algorithm</title>
      <dc:creator>Michael Stainsbury </dc:creator>
      <pubDate>Wed, 26 Jul 2023 05:00:00 +0000</pubDate>
      <link>https://forem.com/mlexam/aws-sagemaker-blazingtext-algorithm-1lhc</link>
      <guid>https://forem.com/mlexam/aws-sagemaker-blazingtext-algorithm-1lhc</guid>
      <description>&lt;p&gt;BlazingText is the name AWS has given it’s SageMaker built-in algorithm that can identify relationships between words in text documents. These relationships, which are also called embeddings, are expressed as vectors. The semantic relationship between words is preserved by the vectors which cluster words with similar semantics together. This conversion of words to meaningful numeric vectors is very useful for Natural Language Processing which requires input data in vector format. This is why BlazingText is used as a precursor for Natural Language Processing.&lt;/p&gt;

&lt;p&gt;Word2Vec is used to pre-process documents containing text to be used by other systems, for example sentiment analysis or machine translation from one language to another. Word2Vec generates a numerical representation of words called embeddings. This captures the relationships between words, so king, queen and president would be closely related. These relationships are used by Natural Language Processing systems. BlazingText is an implementation of the Word2Vec algorithm. Word2Vec was published by Google in 2013 and is compatible with Facebook’s FastText.&lt;/p&gt;

&lt;p&gt;Text Classification is used to classify documents, search engines and for document ranking. Text Classification uses embeddings generated by Word2Vec.&lt;/p&gt;

&lt;p&gt;This article contains revision notes for the &lt;a href="https://www.mlexam.com/aws-machine-learning-exam-guide/"&gt;AWS certified exam MLS-C01, Machine Learning — Specialty&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does the BlazingText algorithm do
&lt;/h2&gt;

&lt;p&gt;BlazingText is used for textual analysis and text classification problems. BlazingText is the only SageMaker built-in algorithm to have both unsupervised and supervised learning modes: Word2Vec is unsupervised and Text Classification is supervised learning.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Word2Vec — &lt;a href="https://www.mlexam.com/unsupervised-learning/"&gt;Unsupervised learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Text Classifier — &lt;a href="https://www.mlexam.com/supervised-learning-for-machine-learning/"&gt;Supervised learning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Usually for Text Classification you would pre-process the data by passing it through a Word2Vec algorithm and then a Text Classifier. The BlazingText algorithm implements the Word2Vec and Text Classifier steps as a single process.&lt;/p&gt;

&lt;h2&gt;
  
  
  How is BlazingText implemented
&lt;/h2&gt;

&lt;p&gt;BlazingText is a SageMaker built-in algorithm and so can be trained via SageMaker Jupyter Notebooks and deployed on SageMaker endpoints. BlazingText processes text data. The input data is presented in a single file with one sentence per line.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are the training data formats for BlazingText
&lt;/h3&gt;

&lt;p&gt;There are two input file formats:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;File Mode&lt;/li&gt;
&lt;li&gt;Augmented Manifest Text (AMT) format&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The data in File Mode is text with space-separated words and one sentence per line. Each line begins with a label like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;__label__1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data in Augmented Manifest Text format is in JSON (json lines) format. Each line can contain a single sentence or be split up into phrases by commas as a JSON array. Here are some examples:&lt;/p&gt;

&lt;p&gt;A single line in File Mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;__label__1 Our aim is to increase the year-round consumption of berries in the UK, working closely with British growers during the spring and summer months, and collaborating with UK importers and overseas exporters during winter and early spring.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A single JSON line in Augmented Manifest Text format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"source":"Our aim is to increase the year-round consumption of berries in the UK, working closely with British growers during the spring and summer months, and collaborating with UK importers and overseas exporters during winter and early spring","label":1}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A single JSON line with multiple labels in Augmented Manifest Text format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"source":"Our aim is to increase the year-round consumption of berries in the UK, working closely with British growers during the spring and summer months, and collaborating with UK importers and overseas exporters during winter and early spring","label":[1,3]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
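&lt;p&gt;To make the relationship between the two formats concrete, here is a minimal sketch that converts a File Mode line into an Augmented Manifest Text record using only the standard library. The helper name is my own, not part of SageMaker:&lt;/p&gt;

```python
import json

def file_mode_to_amt(line: str) -> str:
    """Convert one File Mode line ("__label__1 some text") into an
    Augmented Manifest Text (JSON Lines) record."""
    # The label token is the first space-separated word on the line.
    label_token, text = line.split(" ", 1)
    label = int(label_token.replace("__label__", ""))
    return json.dumps({"source": text, "label": label})

line = "__label__1 Our aim is to increase the year-round consumption of berries in the UK"
print(file_mode_to_amt(line))
# -> {"source": "Our aim is to increase the year-round consumption of berries in the UK", "label": 1}
```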



&lt;h3&gt;
  
  
  Model artifacts and inference
&lt;/h3&gt;

&lt;p&gt;BlazingText uses different artifacts depending on its processing mode. The lists below summarise the artifacts and file names used by BlazingText.&lt;/p&gt;


&lt;h4&gt;
  
  
  Word2Vec
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Model binaries: vectors.bin&lt;/li&gt;
&lt;li&gt;Supporting artifacts: vectors.txt, eval.json (optional)&lt;/li&gt;
&lt;li&gt;Request format: JSON&lt;/li&gt;
&lt;li&gt;Result: List of vectors. If word not found: zeros&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Text Classification
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Model binaries: model.bin&lt;/li&gt;
&lt;li&gt;Supporting artifacts: none&lt;/li&gt;
&lt;li&gt;Request format: JSON&lt;/li&gt;
&lt;li&gt;Result: One prediction&lt;/li&gt;
&lt;/ul&gt;
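&lt;p&gt;Both modes accept a JSON request. As a sketch based on the AWS BlazingText documentation, the payloads take a list of &lt;code&gt;instances&lt;/code&gt;, and for text classification an optional &lt;code&gt;configuration&lt;/code&gt; with &lt;code&gt;k&lt;/code&gt; requests the top-k predictions; the example sentences are taken from the training data above:&lt;/p&gt;

```python
import json

# Word2Vec inference: request vectors for a list of words.
w2v_request = json.dumps({"instances": ["berries", "growers"]})

# Text classification inference: classify whole sentences. The optional
# "configuration" with "k" asks for the top-k predictions.
clf_request = json.dumps({
    "instances": ["Our aim is to increase the year-round consumption of berries"],
    "configuration": {"k": 2},
})

print(w2v_request)
print(clf_request)
```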

&lt;h2&gt;
  
  
  Processing environment
&lt;/h2&gt;

&lt;p&gt;BlazingText can be run on a single CPU or GPU instance, or multiple CPU instances. The choice depends on the type of processing being performed. Word2Vec has three processing methods:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Skip-gram&lt;/li&gt;
&lt;li&gt;Continuous Bag Of Words (CBOW)&lt;/li&gt;
&lt;li&gt;Batch Skip-gram&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Skip-gram and CBOW are opposites of each other. In skip-gram mode you supply a word and the model predicts the context of that word. With CBOW you provide the context and a predicted word is returned.&lt;/p&gt;
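&lt;p&gt;The intuition behind the two directions can be illustrated without any ML libraries. This sketch (my own illustration, not the BlazingText internals) turns one sentence into skip-gram and CBOW training examples using a context window of 1:&lt;/p&gt;

```python
# Frame one sentence as training examples, with a context window of 1.
sentence = "berries grow in summer".split()
window = 1

skipgram_pairs = []  # (centre word, context word to predict)
cbow_pairs = []      # (context words, centre word to predict)
for i, centre in enumerate(sentence):
    # Neighbouring words within the window, excluding the centre itself.
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    for c in context:
        skipgram_pairs.append((centre, c))
    cbow_pairs.append((tuple(context), centre))

print(skipgram_pairs[:2])  # [('berries', 'grow'), ('grow', 'berries')]
print(cbow_pairs[0])       # (('grow',), 'berries')
```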

&lt;p&gt;Below is a summary showing the instance types and processing modes used by BlazingText.&lt;/p&gt;

&lt;h3&gt;
  
  
  Word2Vec
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Single CPU instance: Skip-gram, CBOW, Batch skip-gram&lt;/li&gt;
&lt;li&gt;Single GPU instance (with 1 or more GPUs): Skip-gram, CBOW&lt;/li&gt;
&lt;li&gt;Multiple CPU instances: Batch skip-gram&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Text Classification
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Single CPU instance: Yes&lt;/li&gt;
&lt;li&gt;Single GPU instance (with 1 or more GPUs): Yes&lt;/li&gt;
&lt;li&gt;Multiple CPU instances: No&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From this summary you can see that all processing methods can be performed on a single CPU instance. Only Word2Vec using the batch skip-gram method can run on multiple CPU instances, and this method cannot utilise GPUs.&lt;/p&gt;
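&lt;p&gt;When training, the method is selected with the &lt;code&gt;mode&lt;/code&gt; hyperparameter (&lt;code&gt;skipgram&lt;/code&gt;, &lt;code&gt;cbow&lt;/code&gt;, &lt;code&gt;batch_skipgram&lt;/code&gt;, or &lt;code&gt;supervised&lt;/code&gt; for text classification, per the AWS docs). The compatibility summary above can be sketched as a small lookup for sanity-checking an instance choice; the function itself is my own illustration, not part of the SageMaker SDK:&lt;/p&gt;

```python
# Which processing environments support each BlazingText mode
# (single CPU instance, single GPU instance, multiple CPU instances).
SUPPORTED = {
    "skipgram":       {"single_cpu", "single_gpu"},
    "cbow":           {"single_cpu", "single_gpu"},
    "batch_skipgram": {"single_cpu", "multi_cpu"},
    "supervised":     {"single_cpu", "single_gpu"},  # text classification
}

def is_supported(mode: str, environment: str) -> bool:
    """Return True if the given mode can run in the given environment."""
    return environment in SUPPORTED.get(mode, set())

print(is_supported("batch_skipgram", "multi_cpu"))  # True
print(is_supported("cbow", "multi_cpu"))            # False
```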

&lt;h2&gt;
  
  
  What are BlazingText’s strengths and weaknesses
&lt;/h2&gt;

&lt;p&gt;The strength of BlazingText is high performance: it is more than 20x faster than other popular alternatives such as Facebook’s FastText. This enables inferences to be made in real time for online transactions rather than in batch. The main weakness of BlazingText is handling words that were not present in the training data. These are called Out Of Vocabulary (OOV) words, and typically such words will be marked as Unknown. There are other ways to perform Word2Vec processing, but they do not have the high performance of BlazingText.&lt;/p&gt;
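&lt;p&gt;The OOV behaviour for Word2Vec inference (a vector of zeros when a word is not found, as noted in the artifact summary above) can be mimicked with a toy lookup. The vocabulary and vector values here are invented purely for illustration:&lt;/p&gt;

```python
# Toy Word2Vec lookup: known words map to learned vectors, while
# out-of-vocabulary (OOV) words come back as a vector of zeros.
DIM = 4
vectors = {
    "berries": [0.1, 0.3, -0.2, 0.8],
    "growers": [0.5, -0.1, 0.0, 0.2],
}

def lookup(word):
    """Return the word's vector, or a zero vector if it is OOV."""
    return vectors.get(word, [0.0] * DIM)

print(lookup("berries"))  # [0.1, 0.3, -0.2, 0.8]
print(lookup("xyzzy"))    # OOV -> [0.0, 0.0, 0.0, 0.0]
```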

&lt;h2&gt;
  
  
  What is the Use Case for BlazingText
&lt;/h2&gt;

&lt;p&gt;BlazingText can only ingest words, so the input data must be text. Word2Vec converts that text into the numeric vectors required for Natural Language Processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Word2Vec:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Sentiment analysis&lt;/li&gt;
&lt;li&gt;Named entity recognition&lt;/li&gt;
&lt;li&gt;Machine translation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Text classification:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Web searches&lt;/li&gt;
&lt;li&gt;Information retrieval&lt;/li&gt;
&lt;li&gt;Ranking&lt;/li&gt;
&lt;li&gt;Document classification&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Video: AWS re:Invent 2019: Natural language modeling with Amazon SageMaker BlazingText algorithm (AIM375-P)
&lt;/h4&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/E04JXdnPN0w"&gt;
&lt;/iframe&gt;
&lt;br&gt;
This is a 50.36 minute video from AWS by &lt;a href="https://www.linkedin.com/in/denis-v-batalov-59a3111/"&gt;Denis Batalov&lt;/a&gt;. The presentation can be split into four parts, as shown in the timestamps below. I suggest you skip the first two parts and start with the overview of SageMaker BlazingText at 17.13. This is the link to the Jupyter Notebook used in the demo (part 4):&lt;/p&gt;

&lt;p&gt;SageMaker notebook on Github: &lt;a href="https://github.com/dbatalov/wikipedia-embedding"&gt;https://github.com/dbatalov/wikipedia-embedding&lt;/a&gt;&lt;br&gt;
0 — Introduction&lt;br&gt;
2.17 — Word embedding&lt;br&gt;
2.56 — Word representations&lt;br&gt;
3.43 — One hot encoding&lt;br&gt;
4.37 — Intuition, given a sentence, try to maximise the probability of predicting the context of words.&lt;br&gt;
6.20 — Word2Vec algorithm&lt;br&gt;
8.20 — t-SNE diagram&lt;br&gt;
9.23 — Overview of Amazon SageMaker&lt;br&gt;
12.20 — Build, train and deploy ML Models&lt;br&gt;
13.16 — Built-in algorithms&lt;br&gt;
14.10 — Deep learning frameworks&lt;br&gt;
15.17 — Automatic Model Tuning&lt;br&gt;
16.27 — Amazon SageMaker Neo&lt;br&gt;
17.13 — Overview of SageMaker BlazingText&lt;br&gt;
18.28 — BlazingText highlights&lt;br&gt;
18.45 — Optimization on CPU negative samples sharing&lt;br&gt;
19.40 — Throughput characteristics&lt;br&gt;
20.35 — BlazingText benchmarking&lt;br&gt;
23.00 — Demo — Georgian Wikipedia&lt;/p&gt;

&lt;h4&gt;
  
  
  Selected articles with examples of BlazingText being used
&lt;/h4&gt;

&lt;p&gt;This article, by Evan Harris, describes the usefulness of having a website search feature tuned to the specific vocabulary used on that website. The example Evan uses is a search for a specific grape variety, which returns a list of wines that use that variety.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/building-ibotta/heating-up-word2vec-blazingtext-for-real-time-search-c2121bd1396"&gt;https://medium.com/building-ibotta/heating-up-word2vec-blazingtext-for-real-time-search-c2121bd1396&lt;/a&gt;&lt;br&gt;
This article has a good worked example of BlazingText being used:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://t-redactyl.io/blog/2020/09/training-and-evaluating-a-word2vec-model-using-blazingtext-in-sagemaker.html"&gt;https://t-redactyl.io/blog/2020/09/training-and-evaluating-a-word2vec-model-using-blazingtext-in-sagemaker.html&lt;/a&gt;&lt;br&gt;
This article is a worked example of using BlazingText in Word2Vec mode: Training Word Embeddings On AWS SageMaker Using BlazingText by Roald Schuring.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://towardsdatascience.com/training-word-embeddings-on-aws-sagemaker-using-blazingtext-93d0a0838212"&gt;https://towardsdatascience.com/training-word-embeddings-on-aws-sagemaker-using-blazingtext-93d0a0838212&lt;/a&gt;&lt;br&gt;
This example, from AWS, uses a method to enable BlazingText to generate vectors for out-of-vocabulary (OOV) words.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/blazingtext_word2vec_subwords_text8/blazingtext_word2vec_subwords_text8.html"&gt;https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/blazingtext_word2vec_subwords_text8/blazingtext_word2vec_subwords_text8.html&lt;/a&gt;&lt;br&gt;
This is an example SageMaker Notebook on GitHub which uses a dataset derived from Wikipedia.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/aws/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/blazingtext_text_classification_dbpedia/blazingtext_text_classification_dbpedia.ipynb"&gt;https://github.com/aws/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/blazingtext_text_classification_dbpedia/blazingtext_text_classification_dbpedia.ipynb&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Video: Amazon SageMaker’s Built-in Algorithm Webinar Series: Blazing Text
&lt;/h4&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/G2tX0YpNHfc"&gt;
&lt;/iframe&gt;
&lt;br&gt;
This is a 1.14.36 video from AWS by &lt;a href="https://www.linkedin.com/in/pratapramamurthy/"&gt;Pratap Ramamurthy&lt;/a&gt;. This is a very long video, so use the timestamps below to select the parts you wish to see.&lt;/p&gt;

&lt;p&gt;0 — Introduction&lt;br&gt;
2.19 — What are Amazon algorithms&lt;br&gt;
3.08 — BlazingText algorithms&lt;br&gt;
3.17 — BlazingText use case&lt;br&gt;
4.16 — Typical deep learning task on Text&lt;br&gt;
5.36 — Integer encoding&lt;br&gt;
9.20 — One hot encoding&lt;br&gt;
14.00 — Requirements for word vectors&lt;br&gt;
16.32 — Word2Vec mechanism&lt;br&gt;
16.42 — Word2Vec setup&lt;br&gt;
18.07 — Skip-gram preprocessing&lt;br&gt;
20.30 — Neural network setup&lt;br&gt;
25.38 — BlazingText word embedding&lt;br&gt;
27.35 — Word vectors used for further ML training&lt;br&gt;
28.20 — Intuition&lt;br&gt;
28.25 — Random or is there a pattern? (t-SNE plot)&lt;br&gt;
31.14 — Distance between related words&lt;br&gt;
32.26 — How did the magic work?&lt;br&gt;
35.08 — OOV handling using BlazingText&lt;br&gt;
39.38 — Subword detection&lt;br&gt;
41.43 — Text classification with BlazingText&lt;br&gt;
42.18 — Typical NLP pipeline&lt;br&gt;
44.25 — Parameters&lt;br&gt;
47.43 — Demo&lt;br&gt;
1.00.11 — Questions&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;BlazingText is a high performance algorithm for analyzing text. Its two processing modes produce either numeric vectors for Natural Language Processing via the Word2Vec algorithm, which can infer words from context or context from words, or text classifications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;p&gt;These revision notes support subdomain 3.2 &lt;em&gt;Select the appropriate model(s) for a given machine learning problem&lt;/em&gt; of the AWS certification exam: AWS Certified Machine Learning — Specialty (MLS-C01).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;3.2 Select the appropriate model(s) for a given machine learning problem.&lt;br&gt;
Xgboost, logistic regression, K-means, linear regression, decision trees, random forests, RNN, CNN, Ensemble, Transfer learning. Express intuition behind models&lt;br&gt;
AWS Certified Machine Learning — Specialty, (MLS-C01) Exam Guide&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://www.mlexam.com/aws-machine-learning-exam-guide/"&gt;AWS Certified Machine Learning exam guide&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.mlexam.com/home/domain-3-modeling/"&gt;Domain 3 Modeling articles index&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.mlexam.com/sagemaker-text-processing-algorithms/"&gt;3.2 Text processing algorithms&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.mlexam.com/35-q-a-for-sagemaker-built-in-algorithms/"&gt;Questions for SageMaker built-in algorithms and their uses&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.mlexam.com/aws-machine-learning-practice-exam/"&gt;Free Practice exam with 65 questions&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS docs: &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html"&gt;https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html&lt;/a&gt;&lt;br&gt;
Wikipedia Word2vec: &lt;a href="https://en.wikipedia.org/wiki/Word2vec"&gt;https://en.wikipedia.org/wiki/Word2vec&lt;/a&gt;&lt;br&gt;
Google original papers from 2013: &lt;a href="https://arxiv.org/abs/1301.3781"&gt;https://arxiv.org/abs/1301.3781&lt;/a&gt;&lt;br&gt;
Google original papers from 2013: &lt;a href="https://arxiv.org/abs/1310.4546"&gt;https://arxiv.org/abs/1310.4546&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Training data format resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Augmented Manifest Text (AMT) format: &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/augmented-manifest.html"&gt;https://docs.aws.amazon.com/sagemaker/latest/dg/augmented-manifest.html&lt;/a&gt;&lt;br&gt;
Json lines format: &lt;a href="http://jsonlines.org/"&gt;http://jsonlines.org/&lt;/a&gt;&lt;br&gt;
Text examples from &lt;a href="https://www.britishsummerfruits.co.uk/about"&gt;https://www.britishsummerfruits.co.uk/about&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Processing environment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/enhanced-text-classification-and-word-vectors-using-amazon-sagemaker-blazingtext"&gt;https://aws.amazon.com/blogs/machine-learning/enhanced-text-classification-and-word-vectors-using-amazon-sagemaker-blazingtext&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Credits
&lt;/h2&gt;

&lt;p&gt;Burning book photo by &lt;a href="https://unsplash.com/@gasparuhas?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Gaspar Uhas&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/writing-on-fire?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="//mlexam.com/blazingtext-algorithm/"&gt;www.mlexam.com&lt;/a&gt; on March 2, 2021.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>sagemaker</category>
      <category>machinelearning</category>
      <category>certification</category>
    </item>
  </channel>
</rss>
