<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Prasanth Mathesh</title>
    <description>The latest articles on Forem by Prasanth Mathesh (@prasanth_mathesh).</description>
    <link>https://forem.com/prasanth_mathesh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F538904%2F1c81f004-cae8-48c8-bb7a-d02a220006f3.png</url>
      <title>Forem: Prasanth Mathesh</title>
      <link>https://forem.com/prasanth_mathesh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/prasanth_mathesh"/>
    <language>en</language>
    <item>
      <title>Serverless Full Stack Data Analytics Engineering on AWS Cloud</title>
      <dc:creator>Prasanth Mathesh</dc:creator>
      <pubDate>Thu, 27 Oct 2022 07:59:33 +0000</pubDate>
      <link>https://forem.com/aws-builders/serverless-full-stack-data-analytics-engineering-on-aws-cloud-43n4</link>
      <guid>https://forem.com/aws-builders/serverless-full-stack-data-analytics-engineering-on-aws-cloud-43n4</guid>
      <description>&lt;p&gt;&lt;strong&gt;Overview&lt;/strong&gt;&lt;br&gt;
A full-stack developer is someone who does both client-side and server-side development for mobile and web applications. In the data domain, a full-stack engineer takes similar end-to-end ownership of duties such as ingesting, processing, and publishing data. Full-stack data engineering requires data analysis, SQL scripting, ETL/ELT programming, data model design, workload orchestration, visualization, and more.&lt;/p&gt;

&lt;p&gt;In some cases, a cloud full-stack data analytics engineer has additional responsibilities such as provisioning data infrastructure, DataOps, DevOps, etc. Full-stack development and full-stack data analytics are two different domains, and the only layer of intersection is the datastore. The backend of a full-stack web or mobile application can be a source or a consumer of the data analytics application. But there is another layer in the full-stack data analytics architecture, the modern visualisation stack, that needs full-stack dev engineering.&lt;/p&gt;

&lt;p&gt;In this first part of the article, we will see why full-stack development is needed in a data analytics architecture and how it can be built and owned within the data analytics application in a serverless manner.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedded Analytics&lt;/strong&gt;&lt;br&gt;
The industry's popular low-code/no-code visualisation SaaS products are built using full-stack technologies like GraphQL, JavaScript, etc. But these products still require training on their built-in modules, and they are often not fully customizable. The metrics visualised in dashboards are the output of the heavy lifting done in the data processing layer: GBs of data are processed and finally shared with the semantic layer for building self-service analytics and interactive dashboards. In some cases, visualisation teams need to embed analytical metrics directly into web applications or mobile apps for business users. These metrics are called embedded analytics and often come with minimal filters.&lt;/p&gt;

&lt;p&gt;These kinds of implementations need full-stack development, involving both client- and server-side programming using frameworks such as Angular or React. For mobile devices, it comes with another set of tech stacks, like React Native.&lt;/p&gt;

&lt;p&gt;The modern data stack has evolved considerably to support embedded web/mobile analytics. Amazon QuickSight is one such example that supports low-code embedded analytics via APIs. SaaS data platforms like Snowflake and Databricks have inbuilt dashboard features, but these cannot be embedded in mobile/web apps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedded Analytics using AWS Amplify&lt;/strong&gt;&lt;br&gt;
AWS Amplify lets full-stack developers host a serverless backend for their front end. Amplify Studio has features to build front-end UI, design app data models with minimal effort, and more. JavaScript charting libraries such as D3.js and Chart.js can be used along with popular web frameworks like React to build custom dashboards on AWS Amplify.&lt;/p&gt;

&lt;p&gt;Periodic data refreshes for the back-end data stores can be performed using continuous data engineering pipelines within the AWS cloud. Payload size plays a pivotal role in delivering low-latency custom dashboards, so one has to choose the right web framework and data model based on the access patterns of the front-end analytics dashboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;br&gt;
The diagram below depicts the end-to-end flow for embedding analytics into any business application, product, website, or portal. No additional BI tool or licensing is involved for an organisation that relies on static dashboards with basic filters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foogy8w6mxahyghbrvfth.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foogy8w6mxahyghbrvfth.png" alt="Image description" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The data engineering pipeline processes raw data into aggregates suitable for business strategy decisions, measuring KPIs, etc. The processed data is made available to business users both in real time and at batch intervals. AppSync acts as an integration layer that shares data with the front-end applications in real time.&lt;/p&gt;
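&lt;p&gt;To make this concrete, here is a minimal Python sketch of how an embedded dashboard's backend might query a pre-aggregated metric through an AppSync GraphQL endpoint. The endpoint URL, API key, and the &lt;code&gt;dailyRevenue&lt;/code&gt; field are hypothetical, and a real Amplify app would typically use the Amplify client libraries instead.&lt;/p&gt;

```python
import json

def build_appsync_request(endpoint: str, api_key: str, query: str) -> dict:
    """Build the HTTP request an embedded dashboard would send to AppSync.

    AppSync exposes a GraphQL endpoint; with API-key auth, the key is sent
    in the x-api-key header and the query travels as a JSON body.
    """
    return {
        "url": endpoint,
        "headers": {"Content-Type": "application/json", "x-api-key": api_key},
        "body": json.dumps({"query": query}),
    }

# Hypothetical metric field exposed by the AppSync schema.
metrics_query = """
query Metrics {
  dailyRevenue(last: 7) { date amount }
}
"""

request = build_appsync_request(
    "https://example123.appsync-api.us-east-1.amazonaws.com/graphql",
    "da2-example-api-key",
    metrics_query,
)
```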

&lt;p&gt;A few of the advantages of using AWS Amplify for embedded analytics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster time to market&lt;/li&gt;
&lt;li&gt;Serverless&lt;/li&gt;
&lt;li&gt;Pay-per-use pricing&lt;/li&gt;
&lt;li&gt;UI-based development to build analytics apps&lt;/li&gt;
&lt;li&gt;Option to have your own data cache layer&lt;/li&gt;
&lt;li&gt;Build cloud-native apps with responsive design&lt;/li&gt;
&lt;li&gt;Support for additional JavaScript modules for the web frameworks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
In this article, we have seen what embedded analytics is and how it can be achieved using AWS Amplify. Due to the complexity of the technical stack, full-stack engineering is often kept as a separate product and is not owned by the data engineering or analytics team. AWS Amplify has enough features to simplify the prototyping and development of embedded analytics dashboards. Going forward, one can expect wider adoption of AWS Amplify for data analytics use cases. In the next article, I will describe how raw data is enriched, stored, and visualised using AWS Amplify and AppSync.&lt;/p&gt;

</description>
      <category>dataanalytics</category>
      <category>spark</category>
      <category>amplify</category>
      <category>appsync</category>
    </item>
    <item>
      <title>Amazon SQS and serverless DataEngineering workloads</title>
      <dc:creator>Prasanth Mathesh</dc:creator>
      <pubDate>Tue, 25 Oct 2022 16:05:41 +0000</pubDate>
      <link>https://forem.com/aws-builders/amazon-sqs-and-serverless-dataengineering-workloads-1b40</link>
      <guid>https://forem.com/aws-builders/amazon-sqs-and-serverless-dataengineering-workloads-1b40</guid>
      <description>&lt;p&gt;&lt;strong&gt;Overview&lt;/strong&gt;&lt;br&gt;
Amazon SQS provides fully managed message queuing for microservices, distributed systems, and serverless applications. It is one of the earliest services launched by AWS, is still widely used by many organizations, and forms a core service of many SaaS/PaaS products built on top of the AWS cloud.&lt;/p&gt;

&lt;p&gt;A variety of use cases exist for Amazon SQS in microservices architectures, but in data engineering, SQS is commonly used to publish messages with dynamic configurations that in turn trigger consumers to scale or parallelize workloads based on the message data. The reason is that SQS, by default, doesn't handle messages larger than 256 KB.&lt;/p&gt;

&lt;p&gt;Not all SaaS applications support bulk load or query, due to API rate limits, performance factors, and so on. In an event-driven data architecture, especially when producers and consumers are SaaS applications, cloud web apps, etc., event ingestion is done using Amazon AppFlow, Kinesis, or custom batch query processes.&lt;/p&gt;

&lt;p&gt;If a SaaS application is customizable, the team can develop their own API using the AWS SDK to support ingestion and queries. There are cases where GBs of processed data have to be uploaded in bulk mode to a SaaS or cloud web app datastore at intervals of less than a minute. Below are a few scenarios that can be used in serverless event-driven data architectures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DataEngineering Workload&lt;/strong&gt;&lt;br&gt;
SQS can receive large payloads with the help of the Extended Client Library. This feature has been in place since 2015 but is less commonly used in data engineering workloads. The SQS Extended Client Library can be used to process up to 2 GB of data using S3 object storage: producers write messages larger than 256 KB to S3 and publish only the metadata to SQS, and consumers read that metadata and process the data from S3. After processing is complete, the consumer writes the data to the target object store or shares it via an API.&lt;/p&gt;
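&lt;p&gt;The pattern can be sketched in a few lines of Python. The official Extended Client Library is a Java library, so the sketch below only imitates its behaviour (the &lt;code&gt;sqs-payloads/&lt;/code&gt; key prefix and function names are assumptions); the clients passed in would normally be boto3 S3 and SQS clients.&lt;/p&gt;

```python
import json
import uuid

# SQS rejects message bodies larger than 256 KB; bigger payloads are
# offloaded to S3 and only a small pointer message is queued.
SQS_LIMIT_BYTES = 256 * 1024

def should_offload(payload: bytes) -> bool:
    """True when the payload exceeds the SQS message size limit."""
    return len(payload) > SQS_LIMIT_BYTES

def publish(sqs_client, s3_client, queue_url: str, bucket: str, payload: bytes):
    """Publish a payload, offloading large bodies to S3 (extended-client style)."""
    if should_offload(payload):
        key = f"sqs-payloads/{uuid.uuid4()}"
        s3_client.put_object(Bucket=bucket, Key=key, Body=payload)
        # The queued message carries only a pointer to the S3 object.
        body = json.dumps({"s3_bucket": bucket, "s3_key": key})
    else:
        body = payload.decode("utf-8")
    return sqs_client.send_message(QueueUrl=queue_url, MessageBody=body)
```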

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5a49cpefqnvb5jo3k0v.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5a49cpefqnvb5jo3k0v.jpg" alt="Image description" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Consider an organisation using a customizable SaaS built on AWS services. The extended Java client library can be leveraged when a user needs to upload documents larger than 256 KB or when an event is triggered by a user action. These events can be used for building a decoupled data engineering workload that requires document parsing or tagging using services like Amazon Textract, AWS Glue, or serverless EMR. Spark's DataSourceRegister API along with Spark Structured Streaming can be used as an SQS consumer for high volumes. The processed data can be shared with a web app or SaaS using microservices, or loaded into OLAP or OLTP services.&lt;/p&gt;

&lt;p&gt;Multipart files of less than 2 GB can be written into S3 using a SQL unload. Later, SQS consumers such as AWS Lambda or AWS Glue can be invoked to write the processed files into target SaaS object stores concurrently, provided the SaaS application supports a bulk-load API.&lt;/p&gt;
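&lt;p&gt;A consumer of this kind can be sketched as an SQS-triggered Lambda handler in Python. The &lt;code&gt;s3_bucket&lt;/code&gt;/&lt;code&gt;s3_key&lt;/code&gt; message schema is an assumption for illustration; the actual S3 read and bulk-load calls are left as comments.&lt;/p&gt;

```python
import json

def lambda_handler(event, context):
    """SQS-triggered Lambda: resolve S3 pointers carried in message bodies.

    Each SQS record body is assumed to carry {"s3_bucket": ..., "s3_key": ...}
    metadata published by the producer.
    """
    resolved = []
    for record in event.get("Records", []):
        meta = json.loads(record["body"])
        resolved.append((meta["s3_bucket"], meta["s3_key"]))
    # Downstream: read each object from S3 (boto3 get_object) and push it
    # to the target SaaS datastore via its bulk-load API.
    return {"resolved": resolved}
```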

&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;&lt;br&gt;
Unlike Kafka, SQS does not accept arbitrary message bodies; it has limited support for Unicode characters in the message payload. So one has to consider the type of messages, the complexity of transformation/aggregation, the message retention period, etc. before deciding on the type of message broker needed for data engineering workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Typically, it's a common design pattern to use MQs or Kafka as message brokers, but Amazon SQS can also be leveraged for building loosely coupled data engineering pipelines for data generated by SaaS applications. Serverless AWS services are inherently scalable, and SQS can help achieve parallel data processing in an event-driven data architecture without the need for any additional technical stack.&lt;/p&gt;

</description>
      <category>sqs</category>
      <category>amazon</category>
      <category>dataengineering</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Spark as function - Containerize PySpark code for AWS Lambda and Amazon Kubernetes</title>
      <dc:creator>Prasanth Mathesh</dc:creator>
      <pubDate>Wed, 29 Sep 2021 02:32:19 +0000</pubDate>
      <link>https://forem.com/aws-builders/spark-as-function-containerize-pyspark-code-for-aws-lambda-and-amazon-kubernetes-1bka</link>
      <guid>https://forem.com/aws-builders/spark-as-function-containerize-pyspark-code-for-aws-lambda-and-amazon-kubernetes-1bka</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pm8mgd3ypmtjz4r4pao.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pm8mgd3ypmtjz4r4pao.png" alt="Alt Text" width="800" height="120"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Python, Java, and other applications can be containerized as Docker images for deployment on AWS Lambda and AWS EKS, using AWS ECR as the container registry. The Spark framework, commonly used for distributed big data processing, supports various deployment modes such as local, cluster, and YARN. I have discussed serverless data processing architecture patterns in my other articles; in this one, we will see how to build and run a Spark data processing application on AWS EKS and also on the serverless Lambda runtime. The working code used for this article is kept on &lt;strong&gt;&lt;a href="https://github.com/prasanth-m/AWS/tree/master/Spark-Docker" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Requirements
&lt;/h1&gt;

&lt;p&gt;The following client tools should already be installed in the working dev environment:&lt;br&gt;
AWS CLI, kubectl, eksctl, Docker&lt;br&gt;
Ensure the right version of each tool, including the Spark, AWS SDK, and delta.io dependencies.&lt;/p&gt;
&lt;h1&gt;
  
  
  Kubernetes Deployment
&lt;/h1&gt;

&lt;p&gt;AWS EKS Anywhere, which was launched recently, enables organizations to create and operate Kubernetes clusters on customer-managed infrastructure. This new service is going to change the scalability, disaster planning, and recovery practices currently followed for Kubernetes.&lt;/p&gt;

&lt;p&gt;The following are a few typical Kubernetes deployment patterns, since containerized applications run the same way on different hosts:&lt;br&gt;
1. Build and test the application on-premise, and deploy on the cloud for availability and scalability&lt;br&gt;
2. Build, test, and run the application on-premise, and use the cloud environment for disaster recovery&lt;br&gt;
3. Build, test, and run the application on-premise, and burst slaves on the cloud for on-demand scaling&lt;br&gt;
4. Build and test the application on-premise, deploy the master on a primary cloud, and create slaves on a secondary cloud&lt;/p&gt;

&lt;p&gt;For ever-growing, data-intensive applications that process and store terabytes of data, RPO is critical, and it's better to use on-premise for dev and the cloud for PROD.&lt;/p&gt;
&lt;h1&gt;
  
  
  Spark on Server
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Local&lt;/strong&gt;&lt;br&gt;
First, let's containerize the application and test it in the local environment.&lt;/p&gt;

&lt;p&gt;The PySpark &lt;strong&gt;&lt;a href="https://github.com/prasanth-m/AWS/blob/master/Spark-Docker/cda-spark-kubernetes/cda_spark_kubernetes.py" rel="noopener noreferrer"&gt;code&lt;/a&gt;&lt;/strong&gt; used in this article reads a CSV file from S3 and writes it into a Delta table in append mode. After the write operation is complete, the Spark code displays the Delta table records.&lt;/p&gt;

&lt;p&gt;Build the image with its dependencies and push the Docker image to AWS ECR using the command below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./build_and_push.sh cda-spark-kubernetes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfdrm4gt2nbah9pyjqsu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfdrm4gt2nbah9pyjqsu.png" alt="Alt Text" width="214" height="161"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After the build, the Docker image is also available on the local dev host, where it can be tested using the Docker CLI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run cda-spark-kubernetes driver local:///opt/application/cda_spark_kubernetes.py {args}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpjreouseys105wfeixs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpjreouseys105wfeixs.png" alt="Alt Text" width="800" height="196"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above image shows the output of the delta read operation&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS EKS&lt;/strong&gt;&lt;br&gt;
Build the AWS EKS cluster using eksctl.yaml and apply the RBAC role for the Spark user using the CLI below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eksctl create cluster -f ./eksctl.yaml
kubectl apply -f ./spark-rbac.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the cluster is created, verify the nodes and the cluster IP.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxecdwov0yqcwjtfdgiy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxecdwov0yqcwjtfdgiy.png" alt="Alt Text" width="639" height="30"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above is a plain cluster, ready but without any application or its dependencies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Install spark-operator&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spark Operator is an open-source Kubernetes operator for deploying Spark applications. Helm is to Kubernetes what yum and apt are to Linux distributions, and the Spark Operator can be installed using Helm.&lt;/p&gt;

&lt;p&gt;Install the spark-operator using the Helm CLI below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm install spark-operator spark-operator/spark-operator --set webhook.enable=true
kubectl get pods
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd48q63uf9tlmd38zksh7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd48q63uf9tlmd38zksh7.png" alt="Alt Text" width="663" height="51"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The containerized Spark code can be submitted from a client in cluster mode using the Spark Operator, and its status can be checked using the kubectl CLI.&lt;/p&gt;

&lt;p&gt;Apply spark-job.yaml, which contains the config parameters required by the Spark Operator, on the command line.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f ./spark-job.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CLI command to list Spark applications is given below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get sparkapplication
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fph7h8ee74yp4zy66w6mx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fph7h8ee74yp4zy66w6mx.png" alt="Alt Text" width="711" height="35"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The CLI command to get the Spark driver logs on the client side is given below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs spark-job-driver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Delta operation has appended to the Delta table, and the records are displayed in the driver logs as given below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxz2spf6bot3kjqjxsfn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxz2spf6bot3kjqjxsfn.png" alt="Alt Text" width="800" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additionally, the driver's Spark UI can be forwarded to a localhost port.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7698q5pk9uagomlkxka.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7698q5pk9uagomlkxka.png" alt="Alt Text" width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Kubernetes deployment requests driver and executor pods on demand and shuts them down once processing is complete. This pod-level resource sharing and isolation is a key difference between Spark on YARN and Spark on Kubernetes.&lt;/p&gt;

&lt;h1&gt;
  
  
  Spark on Serverless
&lt;/h1&gt;

&lt;p&gt;Spark is a distributed data processing framework that thrives on RAM and CPU. Spark on an AWS Lambda function is suitable for workloads that can complete within 15 minutes, the Lambda timeout limit.&lt;/p&gt;

&lt;p&gt;For workloads that take more than 15 minutes, the same code can be run in parallel, by leveraging continuous/event-driven pipelines with proper CDC, partitioning, and storage techniques, to achieve the required latency of the data pipeline.&lt;/p&gt;
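&lt;p&gt;One way to reason about this split is a small planner that groups data partitions into batches that each fit inside the 15-minute limit, with one Lambda invocation per batch. This is an illustrative sketch, not a prescribed API; the per-partition time estimate would come from profiling the job.&lt;/p&gt;

```python
def plan_invocations(partitions, minutes_per_partition, limit_minutes=15):
    """Group partitions into batches that each fit the Lambda time limit.

    Each batch is meant to be processed by one parallel Lambda invocation,
    so a job longer than 15 minutes is spread across concurrent functions.
    """
    per_batch = max(1, int(limit_minutes // minutes_per_partition))
    return [partitions[i:i + per_batch]
            for i in range(0, len(partitions), per_batch)]

# 10 partitions at ~5 minutes each -> 3 partitions per invocation.
batches = plan_invocations(list(range(10)), minutes_per_partition=5)
```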

&lt;p&gt;The base Spark image used for the AWS EKS deployment is taken from Docker Hub and is pre-built with the AWS SDK and delta.io dependencies.&lt;/p&gt;

&lt;p&gt;For the AWS Lambda deployment, the AWS-supported Python base image is used to build the code along with its dependencies using the &lt;strong&gt;&lt;a href="https://github.com/prasanth-m/AWS/blob/master/Spark-Docker/cda-spark-lambda/Dockerfile" rel="noopener noreferrer"&gt;Dockerfile&lt;/a&gt;&lt;/strong&gt;, which is then pushed to AWS ECR.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM public.ecr.aws/lambda/python:3.8

ARG HADOOP_VERSION=3.2.0
ARG AWS_SDK_VERSION=1.11.375

RUN yum -y install java-1.8.0-openjdk

RUN pip install pyspark

ENV SPARK_HOME="/var/lang/lib/python3.8/site-packages/pyspark"
ENV PATH=$PATH:$SPARK_HOME/bin
ENV PATH=$PATH:$SPARK_HOME/sbin
ENV PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
ENV PATH=$SPARK_HOME/python:$PATH

RUN mkdir $SPARK_HOME/conf

RUN echo "SPARK_LOCAL_IP=127.0.0.1" &amp;gt; $SPARK_HOME/conf/spark-env.sh

#ENV PYSPARK_SUBMIT_ARGS="--master local pyspark-shell"
ENV JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.302.b08-0.amzn2.0.1.x86_64/jre"
ENV PATH=${PATH}:${JAVA_HOME}/bin

# Set up the ENV vars for code
ENV AWS_ACCESS_KEY_ID=""
ENV AWS_SECRET_ACCESS_KEY=""
ENV AWS_REGION=""
ENV AWS_SESSION_TOKEN=""
ENV s3_bucket=""
ENV inp_prefix=""
ENV out_prefix=""

RUN yum -y install wget
# copy hadoop-aws and aws-sdk
RUN wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar -P ${SPARK_HOME}/jars/ &amp;amp;&amp;amp; \
    wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar -P ${SPARK_HOME}/jars/

COPY spark-class $SPARK_HOME/bin/
COPY delta-core_2.12-0.8.0.jar ${SPARK_HOME}/jars/
COPY cda_spark_lambda.py ${LAMBDA_TASK_ROOT}

CMD [ "cda_spark_lambda.lambda_handler" ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F923lcrqenghm6mp7nb93.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F923lcrqenghm6mp7nb93.png" alt="Alt Text" width="196" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local&lt;/strong&gt;&lt;br&gt;
Test the code on a local machine using the Docker CLI as given below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -e s3_bucket=referencedata01 -e inp_prefix=delta/input/students.csv -e out_prefix=/delta/output/students_table -e AWS_REGION=ap-south-1 -e AWS_ACCESS_KEY_ID=$(aws configure get default.aws_access_key_id) -e AWS_SECRET_ACCESS_KEY=$(aws configure get default.aws_secret_access_key) -e AWS_SESSION_TOKEN=$(aws configure get default.aws_session_token) -p 9000:8080 kite-collect-data-hist:latest cda-spark-lambda
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Local-mode testing requires an event to be triggered; until then, the AWS Lambda runtime will be in wait mode.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16gejydizn8m1eqjcwl4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16gejydizn8m1eqjcwl4.png" alt="Alt Text" width="800" height="49"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Trigger an event for the Lambda function using the CLI below in another terminal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the AWS Lambda invocation completes, we can see the output below on the local machine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxon7bd59h33wlw0iouv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxon7bd59h33wlw0iouv.png" alt="Alt Text" width="800" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Lambda&lt;/strong&gt;&lt;br&gt;
Deploy a Lambda function using the ECR image and set the necessary environment variables for the Lambda handler. Once the Lambda is triggered and completes successfully, we can see the logs in CloudWatch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsc8k2s1e1q5x1kfzgr7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsc8k2s1e1q5x1kfzgr7.png" alt="Alt Text" width="800" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Lambda currently supports up to 6 vCPU cores and 10 GB of memory, and it is billed for the elapsed run time and memory consumption as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj34juhcp8f95qj6ijr4k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj34juhcp8f95qj6ijr4k.png" alt="Alt Text" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Lambda pricing is based on the number of requests and GB-seconds.&lt;br&gt;
 &lt;br&gt;
The same code was run with various configurations, and it is evident from the table below that even when memory is overprovisioned, the AWS Lambda pricing model keeps costs down.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm65xowku8vajxj2cdej6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm65xowku8vajxj2cdej6.png" alt="Alt Text" width="567" height="135"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Going forward, wider adoption of containerized data pipelines for Spark will be the need of the hour, since sources like web apps and SaaS products built on top of Kubernetes generate a lot of data in a continuous manner for big data platforms.&lt;/p&gt;

&lt;p&gt;The most common operations, such as data extraction and ingestion into the S3 data lake, loading processed data into data stores, and pushing down SQL workloads to Amazon Redshift, can be done easily using Spark on AWS Lambda.&lt;/p&gt;

&lt;p&gt;Thus, by leveraging AWS Lambda along with Kubernetes, one can bring down TCO while building planet-scale data pipelines.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>aws</category>
      <category>serverless</category>
      <category>docker</category>
    </item>
    <item>
      <title>Machine Learning Predictions using AWS Redshift ML</title>
      <dc:creator>Prasanth Mathesh</dc:creator>
      <pubDate>Wed, 02 Jun 2021 06:49:04 +0000</pubDate>
      <link>https://forem.com/aws-builders/machine-learning-predictions-using-aws-redshift-ml-3gn6</link>
      <guid>https://forem.com/aws-builders/machine-learning-predictions-using-aws-redshift-ml-3gn6</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;In the previous &lt;a href="https://dev.to/aws-builders/automate-sagemaker-machine-learning-inference-pipeline-in-a-serverless-way-bpk"&gt;article&lt;/a&gt;, we saw how to train models and infer predictions for bring-your-own-algorithm (BYOA) models. In a standard ML pipeline, the features used for inference are either in raw format or the output of a feature engineering pipeline stored in a feature store. Redshift ML enables creating, training, and deploying models using SQL, and inferring predictions in SQL as well. This makes it possible to build feature stores in Redshift, infer predictions, and share them without much overhead or additional orchestration services.&lt;/p&gt;

&lt;h1&gt;
  
  
  Redshift-ML
&lt;/h1&gt;

&lt;p&gt;Redshift ML provides the following options using SQL.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create, Train and Deploy the Model&lt;/li&gt;
&lt;li&gt;Localize the model in Redshift DB&lt;/li&gt;
&lt;li&gt;Infer the predictions for the deployed Model&lt;/li&gt;
&lt;/ol&gt;
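The three options above can be sketched as SQL statements submitted from Python (for example via the boto3 Redshift Data API). The table, column, function, role, and bucket names below are hypothetical, and the statements only illustrate the general shape of Redshift ML SQL.

```python
# Sketch of Redshift ML statements built as strings; all object names
# (tables, function, IAM role, bucket) are hypothetical placeholders.
create_model = """
CREATE MODEL demo.coupon_model
FROM (SELECT destination, weather, time_of_day, accepted FROM demo.coupon_train)
TARGET accepted
FUNCTION predict_coupon_acceptance
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
SETTINGS (S3_BUCKET 'redshift-ml-artifacts');
"""

# Once the model is trained and localized in the Redshift database,
# the generated SQL function infers predictions in-database:
predict = """
SELECT destination, predict_coupon_acceptance(destination, weather, time_of_day)
FROM demo.coupon_test;
"""
```

The statements could then be executed with the Redshift Data API or any SQL client connected to the cluster.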

&lt;p&gt;Additionally, users can bring their own model (BYOM) trained in Amazon SageMaker. Inference can be local or remote through a SageMaker endpoint.&lt;/p&gt;

&lt;p&gt;The reference architecture for BYOM with local inference will be as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7ynz9c7fmnasetq9krp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7ynz9c7fmnasetq9krp.png" alt="Alt Text" width="731" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Local inference saves the infrastructure cost of batch transforms and removes the overhead of setting up model endpoints, especially when models are served in real time. The Redshift cluster can scale, and predictions can be shared via the Data API. A materialized view or table can be created, and predictions can be consumed from web apps.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>machinelearning</category>
      <category>cloud</category>
      <category>database</category>
    </item>
    <item>
      <title>Automate SageMaker Real-Time ML Inference in a ServerLess way</title>
      <dc:creator>Prasanth Mathesh</dc:creator>
      <pubDate>Wed, 19 May 2021 08:56:30 +0000</pubDate>
      <link>https://forem.com/aws-builders/automate-sagemaker-machine-learning-inference-pipeline-in-a-serverless-way-bpk</link>
      <guid>https://forem.com/aws-builders/automate-sagemaker-machine-learning-inference-pipeline-in-a-serverless-way-bpk</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Amazon SageMaker is a fully managed service that enables data scientists and ML engineers to quickly create, train, and deploy models and ML pipelines in an easily scalable and cost-effective way. SageMaker was launched around November 2017, and I had a chance to learn about its built-in algorithms and features from &lt;a href="https://www.linkedin.com/in/skrinak/" rel="noopener noreferrer"&gt;Kris Skrinak&lt;/a&gt; during a boot camp roadshow for Amazon Partners. Over time, SageMaker has matured considerably, enabling ML engineers to deploy and track models quickly and at scale. Beyond its built-in algorithms, it has added features such as Autopilot, Clarify, Feature Store, and Docker container support. This blog looks into these newer SageMaker features and a serverless way of training, deployment, and real-time inference.&lt;/p&gt;

&lt;h1&gt;
  
  
  Architecture
&lt;/h1&gt;

&lt;p&gt;The steps for the below reference architecture are explained at the end of the SageMaker Pipeline section of this article.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr50n1d813i63fmvc5bd4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr50n1d813i63fmvc5bd4.png" alt="Alt Text" width="718" height="444"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  SageMaker Features
&lt;/h1&gt;
&lt;h1&gt;
  
  
  A) Auto Pilot-Low Code Machine Learning
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Launched around DEC 2019&lt;/li&gt;
&lt;li&gt;Industry-first Automated ML to give control and visibility to ML 
Models&lt;/li&gt;
&lt;li&gt;Does Feature Processing, picks the best algorithm, trains and 
selects the best model with just a few clicks&lt;/li&gt;
&lt;li&gt;Vertical AI services like Amazon Personalize and Amazon Forecast 
can be used for personalized recommendation and forecasting 
problems&lt;/li&gt;
&lt;li&gt;AutoPilot is a generic ML service for all kinds of 
classification and regression problems like fraud detection and 
churn analysis and targeted marketing&lt;/li&gt;
&lt;li&gt;Supports inbuilt Algorithms of SageMaker like xgboost and linear 
learner&lt;/li&gt;
&lt;li&gt;Default max size of input dataset is 5 GB but can be increased 
in GBs only&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Auto-Pilot Demo&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Data for AutoPilot Experiment&lt;/em&gt;&lt;br&gt;
The dataset considered is public data provided by &lt;a href="http://archive.ics.uci.edu/ml/machine-learning-databases/00603/" rel="noopener noreferrer"&gt;UCI&lt;/a&gt;.&lt;br&gt;
&lt;em&gt;Data Set Information&lt;/em&gt;&lt;br&gt;
The survey data describes different driving scenarios, including the destination, current time, weather, passenger, etc., and then asks whether the person would accept the coupon while driving. The task we will perform on this dataset is classification.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AutoPilot Experiment&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Import the data for training.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%sh
wget https://archive.ics.uci.edu/ml/machine-learning-databases/00603/in-vehicle-coupon-recommendation.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the data is uploaded, the Autopilot experiment can be set up within minutes using SageMaker Studio. Add the training input and output data paths and the label to predict, and enable auto-deployment of the model. SageMaker deploys the best model and creates an endpoint after successful training.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5slt29v6vkep9v8i5iz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5slt29v6vkep9v8i5iz.png" alt="Alt Text" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alternatively, one can select a model of choice and deploy it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzomm03ijc5fs8ewfigv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzomm03ijc5fs8ewfigv.png" alt="Alt Text" width="800" height="124"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The endpoint configuration and endpoint details of the deployed model can be found in the console.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw56e41qwppc01lhrvtx4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw56e41qwppc01lhrvtx4.png" alt="Alt Text" width="800" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2q0b8pxpl0pyokv7k6po.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2q0b8pxpl0pyokv7k6po.png" alt="Alt Text" width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Infer and Evaluate Model&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Take a validation record and invoke the endpoint. The feature engineering is handled by Autopilot, so raw feature data can be sent directly to the trained model for prediction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;No Urgent Place,Friend(s),Sunny,80,10AM,Carry out &amp;amp; Take away,2h,Female,21,Unmarried partner,1,Some college - no 
degree,Unemployed,$37500 - $49999,,never,never,,4~8,1~3,1,1,0,0,1,1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
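A record like the one above is sent to the endpoint as a single line with the `text/csv` content type. A minimal sketch of the encoding step follows; the record is truncated for brevity, the endpoint name is a hypothetical placeholder, and the actual invocation (which requires AWS credentials) is shown only as comments.

```python
# Encode a raw validation record as the text/csv payload expected by the
# Autopilot endpoint.
def to_csv_payload(features):
    """Join raw feature values into one CSV line, encoded for Body=bytes."""
    return ",".join("" if f is None else str(f) for f in features).encode("utf-8")

# Truncated example record from the validation data above.
record = ["No Urgent Place", "Friend(s)", "Sunny", 80, "10AM"]
payload = to_csv_payload(record)

# The invocation would then look like (requires AWS credentials;
# the endpoint name is a placeholder):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# runtime.invoke_endpoint(EndpointName="coupon-autopilot-endpoint",
#                         ContentType="text/csv", Body=payload)
```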



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnt1rk8uea4p9wra4vep5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnt1rk8uea4p9wra4vep5.png" alt="Alt Text" width="800" height="136"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Infer the model using validation data set using the code given in &lt;a href="https://github.com/prasanth-m/sagemaker/blob/master/Auto-Pilot/infer.ipynb" rel="noopener noreferrer"&gt;Github&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sjx660pcfi391syfmc6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sjx660pcfi391syfmc6.png" alt="Alt Text" width="800" height="132"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  B) SageMaker Clarify
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Launched around DEC 2020&lt;/li&gt;
&lt;li&gt;Explains how machine learning (ML) models made predictions 
during the Autopilot experiments&lt;/li&gt;
&lt;li&gt;Monitors Bias Drift for Models in Production&lt;/li&gt;
&lt;li&gt;Provides components that help AWS customers build less biased 
and more understandable machine learning models&lt;/li&gt;
&lt;li&gt;Provides explanations for individual predictions available via 
API&lt;/li&gt;
&lt;li&gt;Helps in establishing the model governance for ML applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The bias information can be generated for the AutoPilot experiment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bias_data_config = sagemaker.clarify.DataConfig(
    s3_data_input_path=training_data_s3_uri,
    s3_output_path=bias_report_1_output_path,
    label="Y",
    headers=train_cols,
    dataset_type="text/csv",
)

model_config = sagemaker.clarify.ModelConfig(
    model_name=model_name,
    instance_type=train_instance_type,
    instance_count=1,
    accept_type="text/csv",
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  C) SageMaker Feature Store
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Launched around DEC 2020&lt;/li&gt;
&lt;li&gt;Amazon SageMaker Feature Store is a fully managed repository to 
store, update, retrieve, and share machine learning (ML) 
features in S3.&lt;/li&gt;
&lt;li&gt;The feature set that was used to train the model needs to be 
available to make real-time predictions (inference).&lt;/li&gt;
&lt;li&gt;Data Wrangler of SageMaker Studio can be used to engineer 
features and ingest features into a feature store&lt;/li&gt;
&lt;li&gt;Feature Store - both online and offline stores can be ingested 
via separate Featuring Engineering Pipeline via SDK&lt;/li&gt;
&lt;li&gt;Streaming sources can directly ingest features to the online 
feature store for inference or feature creation&lt;/li&gt;
&lt;li&gt;Feature Store automatically builds an Amazon Glue Data Catalog 
when Feature Groups are created and can optionally be turned off&lt;/li&gt;
&lt;/ul&gt;
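For ingestion, each record is written to a feature group as a list of feature-name/value pairs. Below is a minimal sketch of building that structure for the Feature Store runtime's `put_record` call; the feature group name is a hypothetical placeholder, and the boto3 call itself is shown only as comments.

```python
# Build the Record structure used by the SageMaker Feature Store runtime.
def to_feature_record(features):
    """Convert a plain dict into put_record's FeatureName/ValueAsString pairs."""
    return [{"FeatureName": k, "ValueAsString": str(v)} for k, v in features.items()]

record = to_feature_record({"record_id": 2990130, "destination": "No Urgent Place"})

# Ingestion would then be (requires AWS credentials; the feature group
# name is a placeholder):
# import boto3
# runtime = boto3.client("sagemaker-featurestore-runtime")
# runtime.put_record(FeatureGroupName="coupon-feature-group", Record=record)
```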

&lt;p&gt;The table below shows various data stores used to maintain features. Open-source frameworks like Feast have evolved into feature store platforms, and any key-value data store that supports fast lookups can be used as a feature store.&lt;/p&gt;

&lt;p&gt;The feature store is the end stage of the feature engineering pipeline, and features can also be stored in cloud data warehouses such as Snowflake and Redshift, as shown in the image from &lt;a href="https://www.featurestore.org" rel="noopener noreferrer"&gt;featurestore.org&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe89ykt7n071k0nmwy05p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe89ykt7n071k0nmwy05p.png" alt="Alt Text" width="250" height="513"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;record_identifier_value = str(2990130)
featurestore_runtime.get_record(FeatureGroupName=transaction_feature_group_name, RecordIdentifierValueAsString=record_identifier_value)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The feature group can also be accessed as a Hive external table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE EXTERNAL TABLE IF NOT EXISTS sagemaker_featurestore.coupon (
  write_time TIMESTAMP
  event_time TIMESTAMP
  is_deleted BOOLEAN
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS
  INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
  OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
LOCATION 's3://coupon-featurestore/onlinestore/139219451296/sagemaker/ap-south-1/offline-store/coupon-1621050755/data'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  D) SageMaker Pipelines
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Launched around DEC 2020&lt;/li&gt;
&lt;li&gt;SageMaker natively supports MLOPS via the SageMaker project and 
pipelines are created during the SageMaker Project creation&lt;/li&gt;
&lt;li&gt;MLOPS is a standard to streamline the continuous delivery of 
models. It is essential for a successful production-grade ML 
application.&lt;/li&gt;
&lt;li&gt;SageMaker pipeline is a series of interconnected steps that are 
defined by a JSON pipeline definition to perform build, train 
and deploy or only train and deploy etc.&lt;/li&gt;
&lt;li&gt;Alternate ways to set up MLOps with SageMaker include MLflow, 
Airflow, Kubeflow, Step Functions, etc.&lt;/li&gt;
&lt;/ul&gt;
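As the bullets above note, a pipeline is defined by a JSON document describing an ordered series of steps. The sketch below only illustrates that general shape; the step names and S3 paths are hypothetical, and the exact schema should be checked against the SageMaker pipeline definition documentation.

```python
import json

# Illustrative shape of a pipeline definition: ordered steps for training
# and model registration. Names and paths are hypothetical placeholders.
pipeline_definition = {
    "Version": "2020-12-01",
    "Steps": [
        {"Name": "TrainCouponModel", "Type": "Training",
         "Arguments": {"OutputDataConfig": {"S3OutputPath": "s3://coupon-ml/models"}}},
        {"Name": "RegisterCouponModel", "Type": "Model",
         "Arguments": {"PrimaryContainer": {"ModelDataUrl": "s3://coupon-ml/models/model.tar.gz"}}},
    ],
}

# The definition is passed to SageMaker as a JSON string.
definition_json = json.dumps(pipeline_definition)
```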

&lt;h1&gt;
  
  
  Docker Containers
&lt;/h1&gt;

&lt;p&gt;SageMaker Studio itself runs from a Docker container. The docker containers can be used to migrate the existing on-premise live ML pipelines and models into the SageMaker environment.&lt;/p&gt;

&lt;p&gt;Both stateful and stateless inference pipelines can be created. For example, anomaly and fraud detection pipelines are stateless, while the example considered in this article is a stateful model inference pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SageMaker Container Demo&lt;/strong&gt;&lt;br&gt;
Download the &lt;a href="https://github.com/prasanth-m/sagemaker/tree/master/container" rel="noopener noreferrer"&gt;Github&lt;/a&gt; folder. The container folder should show files as shown in the image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyesl78n0gn5hspmavd6r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyesl78n0gn5hspmavd6r.png" alt="Alt Text" width="294" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The dataset is the same as we have considered for Autopilot Experiment. &lt;/p&gt;

&lt;p&gt;The scikit-learn algorithm is used for local training and model tuning. After several iterations, features with low importance were removed, and encoding was performed on the key features.&lt;/p&gt;

&lt;p&gt;The final encoded features (97 labels) are stored in coupon_train.csv and will be used for training and validation locally. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker Container Build&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The following steps have to be performed in order.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build the image
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -t recommend-in-vehicle-coupon:latest .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ngomw9417p1t17crzq9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ngomw9417p1t17crzq9.png" alt="Alt Text" width="660" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Train the features in local mode
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./train_local.sh recommend-in-vehicle-coupon:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fje73c1wea6abxckemplb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fje73c1wea6abxckemplb.png" alt="Alt Text" width="711" height="67"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Serve the model in local mode
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./serve_local.sh recommend-in-vehicle-coupon:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The server is up and waiting for requests.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx063fngh2s0n7ven234k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx063fngh2s0n7ven234k.png" alt="Alt Text" width="621" height="132"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predict locally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The payload.csv file holds the features for prediction. Run the below command to get the model's response for the features in the CSV.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./predict.sh payload.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfgmxox8axjis2wadips.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfgmxox8axjis2wadips.png" alt="Alt Text" width="524" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once a request is accepted, the listening server responds to it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4gqz5wzx8z6hdw170y9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4gqz5wzx8z6hdw170y9.png" alt="Alt Text" width="668" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Push Image&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once local testing is completed, the container image for train, deploy, and serve can be pushed to Amazon ECR. If any code change is made, only this final build-and-push step needs to be repeated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./build_and_push.sh

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0tncuyw8dpraxuwsjfw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0tncuyw8dpraxuwsjfw.png" alt="Alt Text" width="718" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Deployed Container Image&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsj4izs1q7wqy4h01zsc3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsj4izs1q7wqy4h01zsc3.png" alt="Alt Text" width="800" height="56"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The images can be pulled from Amazon ECR and run as containers on Lambda, Amazon EKS, etc.&lt;/p&gt;
&lt;h1&gt;
  
  
  Lambda Function 
&lt;/h1&gt;

&lt;p&gt;The SageMaker API calls for training, deployment, and inference are implemented as Lambda functions. The deployed Lambda handler is then integrated with API Gateway so that the pipeline can run for any triggered API event.&lt;br&gt;
The Lambda function kept in &lt;a href="https://github.com/prasanth-m/sagemaker/blob/master/Lambda/function.py" rel="noopener noreferrer"&gt;Github&lt;/a&gt; has three major blocks.&lt;/p&gt;
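The three blocks can be sketched as a single handler that dispatches on the request body's "key" field, matching the request bodies used later in the testing section. This is a minimal sketch, not the repository's exact code: the SageMaker API calls are elided as comments, and the return values are illustrative.

```python
import json

# Sketch of the Lambda handler's dispatch: route the API Gateway event to
# training, deployment, or inference based on "key" in the request body.
def lambda_handler(event, context=None):
    # API Gateway wraps the request body as a JSON string; direct test
    # invocations may pass the dict itself.
    body = json.loads(event["body"]) if isinstance(event.get("body"), str) else event
    key = body.get("key")
    if key == "train_data":
        # client.create_training_job(**create_training_params)
        return {"statusCode": 200, "action": "training_started"}
    if key == "deploy_model":
        # client.create_model(...); client.create_endpoint_config(...)
        # client.create_endpoint(...)
        return {"statusCode": 200, "action": "deploying",
                "training_job": body.get("training_job")}
    # Default: treat the body as features and invoke the endpoint.
    # runtime.invoke_endpoint(EndpointName=..., Body=..., ContentType="text/csv")
    return {"statusCode": 200, "action": "inference"}
```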

&lt;p&gt;&lt;strong&gt;Create SageMaker Training Function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Lambda function reads the features from S3 and runs the training job.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;client = boto3.client("sagemaker", region_name=region)
        client.create_training_job(**create_training_params) 
        status = client.describe_training_job(TrainingJobName=job_name)["TrainingJobStatus"]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy53yp761oq1tw45dlme4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy53yp761oq1tw45dlme4.png" alt="Alt Text" width="357" height="81"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Create SageMaker Model and Endpoint Function
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Create the model&lt;/strong&gt;&lt;br&gt;
The training job places model artifacts in S3, and the model then has to be registered with SageMaker.&lt;/p&gt;

&lt;p&gt;Register the models in the SageMaker environment using the below API call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create_model_response = client.create_model(
              ModelName=model_name, ExecutionRoleArn=role, PrimaryContainer=primary_container
              )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ti4thk0853n3g4ivo6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ti4thk0853n3g4ivo6e.png" alt="Alt Text" width="800" height="44"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create End Point Config&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;response = client.create_endpoint_config(
            EndpointConfigName=endpoint_config_name,
            ProductionVariants=[
                {
                    'VariantName': 'variant-1',
                    'ModelName': model_name,
                    'InitialInstanceCount': 1,
                    'InstanceType': 'ml.t2.medium'
                }
            ]
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ti4thk0853n3g4ivo6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ti4thk0853n3g4ivo6e.png" alt="Alt Text" width="800" height="44"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create End Point&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;response = client.create_endpoint(
            EndpointName=endpoint_name,
            EndpointConfigName=endpoint_config_name
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vvwvbhhjnqz38f0zr97.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vvwvbhhjnqz38f0zr97.png" alt="Alt Text" width="655" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invoke SageMaker Model Function&lt;/strong&gt;&lt;br&gt;
Based on the API request body, the Lambda function invokes the endpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;response = client.invoke_endpoint(
            EndpointName=EndpointName,
            Body=event_body.encode('utf-8'),
            ContentType='text/csv'
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The status of the in-service endpoint and the requests made to it can be checked in the CloudWatch logs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0c352dvuwqbi3pvta3wt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0c352dvuwqbi3pvta3wt.png" alt="Alt Text" width="800" height="62"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Testing Stateful Real-Time Inference
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Trigger SageMaker Training&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once API Gateway and Lambda have been integrated, the training job can be triggered by passing the below request body to the Lambda function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"key":"train_data"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1w71cmqbrbbotz9oslm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1w71cmqbrbbotz9oslm.png" alt="Alt Text" width="507" height="186"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trigger SageMaker Model and Endpoint Deployment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the training job is completed, deploy the model with the below request body. The training job name should be the one created in the previous step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"key" : "deploy_model",
"training_job" :"&amp;lt;training job name&amp;gt;"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Trigger SageMaker Model Endpoint&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Invoke the endpoint with the below request. The features are encoded and must match the encoding used during training.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk90q0pccyh0c0sywmfqg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk90q0pccyh0c0sywmfqg.png" alt="Alt Text" width="353" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The predicted response will be as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frozfy00i1r1e40k558ak.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frozfy00i1r1e40k558ak.png" alt="Alt Text" width="488" height="233"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The events created while invoking the endpoint can be viewed in CloudWatch Logs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmv0eyrgv1p20ap1m0sva.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmv0eyrgv1p20ap1m0sva.png" alt="Alt Text" width="800" height="45"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Machine learning inference can account for more than 80 percent of the operational cost of running ML workloads. SageMaker capabilities such as container orchestration, multi-model endpoints and serverless inference can reduce both operational and development costs. Event-driven training and inference pipelines also let a non-technical user from a sales or marketing team refresh both batch and real-time predictions at the click of a button, built with mechanisms like APIs and webhooks in their sales portal, on an ad hoc basis before running a campaign.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>serverless</category>
      <category>aws</category>
      <category>cloud</category>
    </item>
    <item>
      <title>IoT/TimeSeries event processing using AWS Serverless Services and AWS Managed Kafka Streaming</title>
      <dc:creator>Prasanth Mathesh</dc:creator>
      <pubDate>Sun, 14 Mar 2021 08:24:32 +0000</pubDate>
      <link>https://forem.com/aws-builders/iot-timeseries-event-processing-using-aws-serverless-services-and-aws-managed-kafka-streaming-42k0</link>
      <guid>https://forem.com/aws-builders/iot-timeseries-event-processing-using-aws-serverless-services-and-aws-managed-kafka-streaming-42k0</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;TimeSeries and IoT data share most characteristics, apart from a few such as the timestamp attribute: time-series data arrives at predefined intervals, while IoT events can arrive in random windows. IoT event analytics can be done either with IoT Analytics or with Kafka, but both have certain data-processing limitations, so the solution considered in this article mixes Kafka streaming and IoT Analytics. The integration of AWS MSK with Kinesis Data Analytics or AWS Glue will be covered in detail in another article.&lt;/p&gt;

&lt;h1&gt;
  
  
  Architecture
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folokpc5fpnn78f1z0xda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folokpc5fpnn78f1z0xda.png" alt="Alt Text" width="800" height="543"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Why IoT Core?
&lt;/h1&gt;

&lt;p&gt;The Kafka protocol is built on top of TCP/IP, whereas standard IoT devices speak MQTT, a protocol designed for poor networks and intermittent communication. The AWS IoT Core device gateway supports both HTTP and MQTT, provides bi-directional communication with IoT devices, and can filter and route data across enterprise applications. It also makes device registration simple, accelerating IoT application development.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why AWS MSK?
&lt;/h1&gt;

&lt;p&gt;Apache Kafka is a distributed messaging framework that provides fast, scalable, high-throughput ingestion. Kafka can replay IoT events, provides long-term storage, acts as a buffer for high-velocity data and integrates easily with other enterprise applications. Real-time device monitoring is not possible in IoT Analytics, whereas Kafka streaming combined with an event-processing framework can trigger actions on anomalies in real time. There is no downtime during a Kafka cluster upgrade, and clusters can be provisioned in about 15 minutes. Another key advantage is that there is no charge for data-replication traffic across Availability Zones, a deciding factor when compared with expensive, highly available self-hosted Kafka clusters.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why IoT Analytics?
&lt;/h1&gt;

&lt;p&gt;AWS IoT Analytics can enrich the IoT data, perform ad-hoc analysis and build dashboards using QuickSight. It is a simple and serverless way to do data prep, clean and feature engineering and can be integrated with notebooks, AWS SageMaker to build machine learning models. Custom analysis code packaged in a container can also be executed on AWS IoT Analytics for use cases like understanding the performance of devices, predicting device failures, etc.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why TimeStream?
&lt;/h1&gt;

&lt;p&gt;Timestream can store and analyze trillions of events per day, with data retention controlled according to the analytics need. It has built-in time-series analytics functions, helping identify trends and patterns in your data in near real time. Timestream can provide up to 1,000x faster query performance at 1/10th the cost of relational databases. When the same record is received for a timestamp, Timestream deduplicates it, which addresses a common problem with streaming events. For very high ingestion volumes, events can be buffered through a canonical store such as Kinesis Data Firehose and written to S3 for long-term storage, with Timestream used as the serving database.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why AWS Lambda?
&lt;/h1&gt;

&lt;p&gt;AWS Lambda can process data from an event source such as Apache Kafka. Lambda is inexpensive, its memory can be scaled from 128 MB to 10,240 MB, and a processing timeout can be set per function. IoT device payloads are small, and real-time device-control actions based on incoming payload values are easier to implement with Lambda than with serverless services like AWS Glue or AWS Kinesis Data Analytics. Lambda can be triggered by an event source such as AWS MSK, or it can run on a schedule.&lt;/p&gt;

&lt;h1&gt;
  
  
  Let's Get Started
&lt;/h1&gt;

&lt;p&gt;Device registration and IoT message simulation are critical tasks during IoT development, and data engineering teams often need to simulate IoT events using SaaS providers such as MQTTLab or Cumulocity IoT.&lt;br&gt;
For this article, device registration and IoT event simulation were done through the AWS IoT simulator. The simulator provides device-type registration, control over the number of devices simulating data, and payload-structure definition for each device. Automotive telemetry data was considered and simulated as shown in the steps below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add Device Type&lt;/strong&gt;&lt;br&gt;
Create a simulation stack in an AWS region and add custom devices. The automotive telemetry payload attributes are inbuilt and can’t be changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simulate IoT Data&lt;/strong&gt;&lt;br&gt;
Automotive telemetry data is quite comprehensive, and the simulator considered here publishes messages on three topics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/gy0cycmWk0L84HomqR/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/gy0cycmWk0L84HomqR/giphy.gif" alt="Alt text" width="480" height="173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Telemetry Topic Payload&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "speed",
  "value": 47.4,
  "vin": "1NXBR32E84Z995078",
  "trip_id": "799fc110-fee2-43b2-a6ed-a504fa77931a",
  "timestamp": "2018-02-15 08:20:18.000000000"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Trip Topic Payload&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{"vehicle_speed_mean":64.10065503146477,"engine_speed_mean":3077.59476197646,"torque_at_transmission_mean":210.70915084517395,"oil_temp_mean":237.417022870719,"accelerator_pedal_position_mean":28.819512721817887,"brake_mean":4.268754736044446,"high_speed_duration":0,"high_acceleration_event":3,"high_braking_event":0,"idle_duration":75323,"start_time":"2021-03-06T07:40:02.454Z","ignition_status":"run","brake_pedal_status":false,"transmission_gear_position":"fifth","odometer":35.27425650210172,"fuel_level":97.7129231345363,"fuel_consumed_since_restart":0.9155057461854811,"latitude":38.938734,"longitude":-77.269385,"timestamp":"2021-03-06 08:13:04.981000000","trip_id":"c6bacef5-1bfc-4a72-8261-6c1272772f13","vin":"5JO226H6QR3J3T7TI","name":"aggregated_telemetrics"}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Diagnostics Topic Payload&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{"timestamp":"2021-03-06 08:14:33.087000000","trip_id":"a4c55e6e-a7eb-4c3e-b0a0-dfcace119e03","vin":"MQYK4Z8WGTJDFDA05","name":"dtc","value":"P0404"}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;After completing the exercises, stop the simulation.&lt;/p&gt;

&lt;p&gt;It is evident from the image below that IoT Core costs are low.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4l9srntc1n5rij8gqdi4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4l9srntc1n5rij8gqdi4.png" alt="Alt Text" width="700" height="44"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  IoT Analytics
&lt;/h1&gt;

&lt;p&gt;IoT Analytics setup is a simple process: the channel, pipeline, datastore and dataset can be created in a single click, as shown below.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8id9yrh6c55f5u2a53dy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8id9yrh6c55f5u2a53dy.png" alt="Alt Text" width="700" height="264"&gt;&lt;/a&gt;&lt;br&gt;
The above setup was created to select all telemetry payloads from the subscribed topic.&lt;/p&gt;

&lt;p&gt;The pipeline can be used to filter any unwanted attributes from the payload before creating a datastore.&lt;/p&gt;

&lt;p&gt;S3 is used as an underlying data store and the data format can be parquet or JSON.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup IoT Rule and Action&lt;/strong&gt;&lt;br&gt;
Create a rule and action for the telemetry payload as given below. The objective is to select all records and ingest them into the telemetry channel.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2m5s3esm7j0jpwzq9nq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2m5s3esm7j0jpwzq9nq.png" alt="Alt Text" width="700" height="145"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhqk7vmis6cbi060uf40.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhqk7vmis6cbi060uf40.png" alt="Alt Text" width="700" height="251"&gt;&lt;/a&gt;&lt;/p&gt;
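&lt;p&gt;The rule's query statement boils down to selecting everything from the simulator's telemetry topic. The topic name below is illustrative; use the topic your devices actually publish on:&lt;/p&gt;

```sql
SELECT * FROM 'connectedcar/telemetry/#'
```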

&lt;p&gt;Once the simulation is started, the ingestion process can be monitored in the IoT Analytics window.&lt;/p&gt;

&lt;p&gt;The dataset can be created and scheduled to be refreshed from the datastore. The dataset can be refreshed from the last import timestamp using the delta time option.&lt;/p&gt;

&lt;p&gt;The IoT Analytics dataset can be used in AWS SageMaker notebooks and as a data source for QuickSight.&lt;/p&gt;

&lt;h1&gt;
  
  
  QuickSight
&lt;/h1&gt;

&lt;p&gt;Create a new IoT Analytics data source using the telemetry_dataset.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuwxzn0ehtyi90uyoo89.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuwxzn0ehtyi90uyoo89.png" alt="Alt Text" width="567" height="339"&gt;&lt;/a&gt;&lt;br&gt;
Once the import is completed, the records can be enriched or filtered.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhk2n8bum5tm46ajvh9fn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhk2n8bum5tm46ajvh9fn.png" alt="Alt Text" width="700" height="271"&gt;&lt;/a&gt;&lt;br&gt;
The import process can be scheduled for auto-refresh.&lt;/p&gt;

&lt;p&gt;The imported telemetry dataset values can be visualized in QuickSight.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6vsph8ecklpdjxguur9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6vsph8ecklpdjxguur9.png" alt="Alt Text" width="578" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  SageMaker
&lt;/h1&gt;

&lt;p&gt;Create a notebook and open it on the SageMaker instance.&lt;/p&gt;

&lt;p&gt;Import dataset and analyze the data for building machine learning models for predictive maintenance of devices.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cqfvqol6eax4hl3nzht.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cqfvqol6eax4hl3nzht.png" alt="Alt Text" width="700" height="66"&gt;&lt;/a&gt;&lt;br&gt;
A basic working version of the SageMaker code is kept on &lt;a href="https://github.com/prasanth-m/AWS/blob/master/IoT%20Analytics/telemetry.ipynb" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Feature engineering, model training and the inference pipeline for near real-time data will be covered in detail in a separate article.&lt;/p&gt;

&lt;h1&gt;
  
  
  TimeStream
&lt;/h1&gt;

&lt;p&gt;Timestream pricing is $0.50 per 1 million writes of 1 KB. The average telemetry record is about 145 bytes and the average aggregated trip record is about 800 bytes, so by merging all the dimensions such as speed, fuel_level etc. at the VIN level we can reduce the number of writes per batch, and hence the cost. A detailed explanation of the payload structure and write batching is given in my previous &lt;a href="https://dev.to/aws-builders/collect-and-store-streaming-timeseries-data-into-amazon-timestream-db-21jf"&gt;article&lt;/a&gt;. Create a rule and action for Timestream if the data is aggregated at the source itself, or process the IoT events stored in MSK using Kinesis Data Analytics for Flink or a custom Apache Spark application.&lt;/p&gt;
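&lt;p&gt;The VIN-level batching can be sketched as follows (the database and table names are placeholders): several measures for one VIN share a single set of common attributes, so one &lt;code&gt;write_records&lt;/code&gt; call carries the whole batch.&lt;/p&gt;

```python
def build_batch(vin, trip_id, measures, timestamp_ms):
    """Fold several measures (speed, fuel_level, ...) for one VIN into a
    single batch: the dimensions and time go into CommonAttributes so they
    are not repeated per record."""
    records = [
        {"MeasureName": name, "MeasureValue": str(value), "MeasureValueType": "DOUBLE"}
        for name, value in measures.items()
    ]
    common = {
        "Dimensions": [
            {"Name": "vin", "Value": vin},
            {"Name": "trip_id", "Value": trip_id},
        ],
        "Time": str(timestamp_ms),
        "TimeUnit": "MILLISECONDS",
    }
    return records, common


def write_batch(vin, trip_id, measures, timestamp_ms):
    # Database and table names are hypothetical; boto3 is imported lazily.
    import boto3
    records, common = build_batch(vin, trip_id, measures, timestamp_ms)
    boto3.client("timestream-write").write_records(
        DatabaseName="iot", TableName="telemetry",
        CommonAttributes=common, Records=records,
    )
```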

&lt;h1&gt;
  
  
  AWS MSK
&lt;/h1&gt;

&lt;p&gt;AWS MSK was recently introduced as one of the actions for IoT Core, so an IoT rule action can act as a producer for Managed Streaming for Kafka. Create a VPC destination and add the cluster details and authentication mechanism of the Kafka cluster that will receive the messages.&lt;/p&gt;

&lt;p&gt;Enable error logging for the action.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ypw1e5z282dy2ialz3j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ypw1e5z282dy2ialz3j.png" alt="Alt Text" width="658" height="188"&gt;&lt;/a&gt;&lt;br&gt;
Enable CloudWatch Logs and CloudTrail for IoT and AWS MSK to analyze the logs and API calls.&lt;/p&gt;

&lt;h1&gt;
  
  
  Lambda
&lt;/h1&gt;

&lt;p&gt;Create a Lambda function to consume and process the data.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for partition in partitions:
    consumer.seek_to_beginning(partition)  # sample read; use stored offsets in production
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswlodhokpunp23gn6vd7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswlodhokpunp23gn6vd7.png" alt="Alt Text" width="800" height="65"&gt;&lt;/a&gt;&lt;br&gt;
The diagnostic data is low in volume but needs immediate action. The Lambda will consume it, check for a problem code and issue a STOP signal to the device by publishing a payload to the MQTT topic meant for the device shadow.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;response = IOTCLIENT.publish(topic=iottopic, qos=0, payload=json.dumps(msg))&lt;/code&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft64xrrwq19ch38frs8se.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft64xrrwq19ch38frs8se.png" alt="Alt Text" width="700" height="205"&gt;&lt;/a&gt;&lt;br&gt;
The working version of the lambda function is kept in GitHub &lt;a href="https://github.com/prasanth-m/AWS/blob/master/IoT%20Analytics/kafk_consumer.py" rel="noopener noreferrer"&gt;kafka_consumer.py&lt;/a&gt;&lt;/p&gt;
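&lt;p&gt;When MSK is wired up as a Lambda event source (rather than polling with a Kafka client as above), record values arrive base64-encoded and grouped by topic-partition. A sketch of decoding them and flagging a hypothetical set of stop-worthy trouble codes:&lt;/p&gt;

```python
import base64
import json

STOP_CODES = {"P0404"}  # illustrative set of diagnostic trouble codes


def iter_payloads(event):
    """Yield decoded JSON payloads from an MSK-triggered Lambda event."""
    for records in event.get("records", {}).values():
        for record in records:
            yield json.loads(base64.b64decode(record["value"]))


def handler(event, context=None):
    flagged = []
    for payload in iter_payloads(event):
        if payload.get("name") == "dtc" and payload.get("value") in STOP_CODES:
            flagged.append(payload["vin"])
            publish_stop(payload["vin"])
    return flagged


def publish_stop(vin):
    # Topic name is a placeholder; boto3 is imported lazily.
    import boto3
    boto3.client("iot-data").publish(
        topic=f"device/{vin}/stop", qos=0,
        payload=json.dumps({"command": "STOP"}),
    )
```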

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;IoT data can be processed and stored by various AWS services depending on the use case. Serverless AWS services aid the rapid development of large-scale, complicated architectures in an easily scalable manner. Managed services such as AWS MSK, along with a choice of distributed processing frameworks like Apache Flink-based Kinesis Data Analytics or Apache Spark-based AWS Glue and AWS Lambda, help build scalable, highly available and cost-effective processing for IoT/TimeSeries events.&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>iot</category>
      <category>aws</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Collect and Store Streaming TimeSeries data into Amazon TimeStream DB</title>
      <dc:creator>Prasanth Mathesh</dc:creator>
      <pubDate>Sat, 26 Dec 2020 15:48:30 +0000</pubDate>
      <link>https://forem.com/aws-builders/collect-and-store-streaming-timeseries-data-into-amazon-timestream-db-21jf</link>
      <guid>https://forem.com/aws-builders/collect-and-store-streaming-timeseries-data-into-amazon-timestream-db-21jf</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;This post discusses the serverless architectural considerations and configuration steps for deploying a streaming time-series data solution on Amazon Timestream in the Amazon Web Services (AWS) Cloud. It includes links to a code repository that can be used as a base to deploy this solution following AWS best practices for security and availability.&lt;/p&gt;

&lt;h1&gt;
  
  
  Components Basics
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;AWS Lambda&lt;/strong&gt;&lt;br&gt;
Lambda is a serverless compute service that lets you run code without provisioning or managing servers.&lt;br&gt;
&lt;strong&gt;Amazon Timestream&lt;/strong&gt;&lt;br&gt;
Amazon Timestream is a fast, scalable, and serverless time series database service for IoT and other operational applications.&lt;br&gt;
&lt;strong&gt;AWS Kinesis&lt;/strong&gt;&lt;br&gt;
Amazon Kinesis Data Streams ingests a large amount of data in real-time, durably stores the data, and makes the data available for consumption.&lt;/p&gt;

&lt;h1&gt;
  
  
  Data Flow
&lt;/h1&gt;

&lt;p&gt;After deployment, the streaming data pipeline will look as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkuuo33ltz7jf0i3igmlz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkuuo33ltz7jf0i3igmlz.png" alt="Alt Text" width="681" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Getting Started
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;AWS Kinesis Setup&lt;/strong&gt;&lt;br&gt;
Create a data stream named timeseries-stream with a single shard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon TimeStream Setup&lt;/strong&gt;&lt;br&gt;
Create a database named ecomm in the same region as the Kinesis data stream, and a table named inventory in the ecomm database, using the gists shared on GitHub.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Lambda Setup&lt;/strong&gt;&lt;br&gt;
Create a Kinesis producer to generate and ingest time-series data into the Kinesis data stream, and a Kinesis consumer to read it back in the same order. The Python SDK examples used for this article are kept in the GitHub repository &lt;a href="https://github.com/prasanth-m/AWS/tree/master/TimeStream" rel="noopener noreferrer"&gt;TimeStream&lt;/a&gt;.&lt;/p&gt;
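&lt;p&gt;The producer side can be sketched as below. The sku and quantity fields are illustrative, and the stream name matches the setup above; using the item identifier as the partition key keeps events for the same item in order on the same shard:&lt;/p&gt;

```python
import json
import time


def make_record(sku, quantity, event_time_ms=None):
    """Build one inventory event for PutRecord; the fields are illustrative."""
    data = {
        "sku": sku,
        "quantity": quantity,
        "time": event_time_ms or int(time.time() * 1000),
    }
    return {"Data": json.dumps(data), "PartitionKey": sku}


def put_inventory_event(sku, quantity, stream_name="timeseries-stream"):
    # boto3 is imported lazily so record building stays testable offline.
    import boto3
    boto3.client("kinesis").put_record(StreamName=stream_name,
                                       **make_record(sku, quantity))
```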

&lt;h1&gt;
  
  
  Deployment
&lt;/h1&gt;

&lt;p&gt;The Serverless Framework makes development and deployment faster. Deploy the Lambda producers and consumers into AWS and schedule them to run on time-series event triggers or on any schedule. In a typical production scenario, the producers might sit outside the cloud region and events might arrive through API Gateway.&lt;br&gt;
Producer and consumer logs will be available in CloudWatch.&lt;/p&gt;

&lt;p&gt;The written results can be queried using the Amazon Timestream query editor.&lt;br&gt;
   &lt;code&gt;select * from "ecomm"."inventory"&lt;br&gt;
    limit 10&lt;/code&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In most organizations, time-series data points are written once and read many times. It is clear that time-series data can be collected and stored using serverless services. Although Timestream integrates with various AWS services, Kinesis was chosen here for its data-retention and replay features. The next time-series article will cover a use case built on the kappa data-processing architecture.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>database</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
