<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Prasanth Mathesh</title>
    <description>The latest articles on Forem by Prasanth Mathesh (@prasanth_mathesh).</description>
    <link>https://forem.com/prasanth_mathesh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F538904%2F1c81f004-cae8-48c8-bb7a-d02a220006f3.png</url>
      <title>Forem: Prasanth Mathesh</title>
      <link>https://forem.com/prasanth_mathesh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/prasanth_mathesh"/>
    <language>en</language>
    <item>
      <title>Serverless Full Stack Data Analytics Engineering on AWS Cloud</title>
      <dc:creator>Prasanth Mathesh</dc:creator>
      <pubDate>Thu, 27 Oct 2022 07:59:33 +0000</pubDate>
      <link>https://forem.com/aws-builders/serverless-full-stack-data-analytics-engineering-on-aws-cloud-43n4</link>
      <guid>https://forem.com/aws-builders/serverless-full-stack-data-analytics-engineering-on-aws-cloud-43n4</guid>
      <description>&lt;p&gt;&lt;strong&gt;Overview&lt;/strong&gt;&lt;br&gt;
A full-stack developer is someone who does both client-side and server-side development for mobile and web applications. In the data domain, a full-stack engineer takes similar end-to-end ownership of duties such as ingesting, processing, and publishing data. Full-stack data engineering requires data analysis, SQL scripting, ETL/ELT programming, data model design, workload orchestration, visualization, and more.&lt;/p&gt;

&lt;p&gt;In some cases, a cloud full-stack data analytics engineer has additional responsibilities such as provisioning data infrastructure, DataOps, DevOps, etc. Full-stack development and full-stack data analytics are two different domains, and the only layer of intersection is the datastore. The backend of a full-stack web or mobile application can be a source or a consumer of the data analytics application. But there is another layer in the full-stack data analytics architecture, the modern visualisation stack, that needs full-stack dev engineering.&lt;/p&gt;

&lt;p&gt;In this first part of the article, we will see why full-stack development is needed in a data analytics architecture and how it can be built and owned within the data analytics application in a serverless manner.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedded Analytics&lt;/strong&gt;&lt;br&gt;
The industry's popular low-code/no-code visualisation SaaS products are built using full-stack technologies like GraphQL, JavaScript, etc. But these products still require training on their built-in modules, and they are often not fully customizable. The metrics visualised in dashboards are the output of the heavy lifting done in the data processing layer: GBs of data are processed and finally shared with the semantic layer for building self-service analytics and interactive dashboards. In some cases, visualisation teams need to embed analytical metrics directly into web applications or mobile apps for business users. These metrics are called embedded analytics and often come with minimal filters.&lt;/p&gt;

&lt;p&gt;These kinds of implementations need full-stack development, involving both client- and server-side programming using frameworks such as Angular or React. For mobile devices, it comes with another set of tech stacks, like React Native.&lt;/p&gt;

&lt;p&gt;The modern data stack has evolved considerably to support embedded web/mobile analytics. Amazon QuickSight is one such example that supports low-code embedded analytics via APIs. SaaS data platforms like Snowflake and Databricks have inbuilt dashboard features, but these cannot be embedded in mobile/web apps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedded Analytics using AWS Amplify&lt;/strong&gt;&lt;br&gt;
AWS Amplify lets full-stack developers host a serverless backend for their front end. Amplify Studio has features to build front-end UI, design app data models with minimal effort, and more. JavaScript charting libraries such as D3.js and Chart.js can be used along with popular web frameworks like React to build custom dashboards on AWS Amplify.&lt;/p&gt;

&lt;p&gt;Periodic data refreshes for the back-end data stores can be performed using continuous data engineering pipelines within the AWS cloud. Payload size plays a pivotal role in delivering low-latency custom dashboards, so one has to choose the right web framework and data model based on the access patterns of the front-end analytics dashboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;br&gt;
The diagram below depicts the end-to-end flow for embedding analytics into any business application, product, website, or portal. No additional BI tool or licensing is involved for an organisation that relies on static dashboards with basic filters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foogy8w6mxahyghbrvfth.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foogy8w6mxahyghbrvfth.png" alt="Image description" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The data engineering pipeline processes raw data into aggregates suitable for business strategy decisions, measuring KPIs, etc. The processed data is made available to business users both in real time and at batch intervals. AppSync acts as an integration layer that shares data with the front-end applications in real time.&lt;/p&gt;
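&lt;p&gt;To make this concrete, here is a minimal Python sketch of how an embedded dashboard's backend might query a pre-aggregated metric through an AppSync GraphQL endpoint. The endpoint URL, API key, and the &lt;code&gt;dailyRevenue&lt;/code&gt; field are hypothetical, and a real Amplify app would typically use the Amplify client libraries instead.&lt;/p&gt;

```python
import json

def build_appsync_request(endpoint: str, api_key: str, query: str) -> dict:
    """Build the HTTP request an embedded dashboard would send to AppSync.

    AppSync exposes a GraphQL endpoint; with API-key auth, the key is sent
    in the x-api-key header and the query travels as a JSON body.
    """
    return {
        "url": endpoint,
        "headers": {"Content-Type": "application/json", "x-api-key": api_key},
        "body": json.dumps({"query": query}),
    }

# Hypothetical metric field exposed by the AppSync schema.
metrics_query = """
query Metrics {
  dailyRevenue(last: 7) { date amount }
}
"""

request = build_appsync_request(
    "https://example123.appsync-api.us-east-1.amazonaws.com/graphql",
    "da2-example-api-key",
    metrics_query,
)
```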

&lt;p&gt;A few of the advantages of using AWS Amplify for embedded analytics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster time to market&lt;/li&gt;
&lt;li&gt;Serverless&lt;/li&gt;
&lt;li&gt;Pay-per-use pricing&lt;/li&gt;
&lt;li&gt;UI-based development to build analytics apps&lt;/li&gt;
&lt;li&gt;Option to have your own data cache layer&lt;/li&gt;
&lt;li&gt;Build cloud-native apps with responsive design&lt;/li&gt;
&lt;li&gt;Support for additional JavaScript modules for the web frameworks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
In this article, we have seen what embedded analytics is and how it can be achieved using AWS Amplify. Due to the complexity of the technical stack, full-stack engineering is often kept as a separate product and is not owned by the data engineering or analytics team. AWS Amplify has enough features to simplify the prototyping and development of embedded analytics dashboards. Going forward, one can expect wider adoption of AWS Amplify for data analytics use cases. In the next article, I will describe how raw data is enriched, stored, and visualised using AWS Amplify and AppSync.&lt;/p&gt;

</description>
      <category>dataanalytics</category>
      <category>spark</category>
      <category>amplify</category>
      <category>appsync</category>
    </item>
    <item>
      <title>Amazon SQS and serverless DataEngineering workloads</title>
      <dc:creator>Prasanth Mathesh</dc:creator>
      <pubDate>Tue, 25 Oct 2022 16:05:41 +0000</pubDate>
      <link>https://forem.com/aws-builders/amazon-sqs-and-serverless-dataengineering-workloads-1b40</link>
      <guid>https://forem.com/aws-builders/amazon-sqs-and-serverless-dataengineering-workloads-1b40</guid>
      <description>&lt;p&gt;&lt;strong&gt;Overview&lt;/strong&gt;&lt;br&gt;
Amazon SQS provides fully managed message queuing for microservices, distributed systems, and serverless applications. It is one of the earliest services launched by AWS, is still widely used by many organizations, and forms a core service of many SaaS/PaaS products built on top of the AWS cloud.&lt;/p&gt;

&lt;p&gt;A variety of use cases exist for Amazon SQS in microservices architectures, but in data engineering, SQS is commonly used to publish messages with dynamic configurations that in turn trigger consumers to scale or parallelize workloads based on the message data. The reason is that SQS, by default, doesn't handle messages larger than 256 KB.&lt;/p&gt;

&lt;p&gt;Not all SaaS applications support bulk load or query, due to API rate limits, performance factors, and so on. In an event-driven data architecture, especially when producers and consumers are SaaS applications, cloud web apps, etc., event ingestion is done using Amazon AppFlow, Kinesis, or custom batch query processes.&lt;/p&gt;

&lt;p&gt;If a SaaS application is customizable, the team can develop their own API using the AWS SDK to support ingestion and queries. There are cases where GBs of processed data have to be uploaded in bulk mode to a SaaS or cloud web app datastore at intervals of less than a minute. Below are a few scenarios that can be used in serverless event-driven data architectures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DataEngineering Workload&lt;/strong&gt;&lt;br&gt;
SQS can receive large payloads with the help of the Extended Client Library. This feature has been in place since 2015 but is less commonly used in data engineering workloads. The SQS Extended Client Library can be used to process up to 2 GB of data using S3 object storage: producers write messages larger than 256 KB to S3 and publish only the metadata to SQS, and consumers read that metadata and process the data from S3. After processing is complete, the consumer writes the data to the target object store or shares it via an API.&lt;/p&gt;
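&lt;p&gt;The pattern can be sketched in a few lines of Python. The official Extended Client Library is a Java library, so the sketch below only imitates its behaviour (the &lt;code&gt;sqs-payloads/&lt;/code&gt; key prefix and function names are assumptions); the clients passed in would normally be boto3 S3 and SQS clients.&lt;/p&gt;

```python
import json
import uuid

# SQS rejects message bodies larger than 256 KB; bigger payloads are
# offloaded to S3 and only a small pointer message is queued.
SQS_LIMIT_BYTES = 256 * 1024

def should_offload(payload: bytes) -> bool:
    """True when the payload exceeds the SQS message size limit."""
    return len(payload) > SQS_LIMIT_BYTES

def publish(sqs_client, s3_client, queue_url: str, bucket: str, payload: bytes):
    """Publish a payload, offloading large bodies to S3 (extended-client style)."""
    if should_offload(payload):
        key = f"sqs-payloads/{uuid.uuid4()}"
        s3_client.put_object(Bucket=bucket, Key=key, Body=payload)
        # The queued message carries only a pointer to the S3 object.
        body = json.dumps({"s3_bucket": bucket, "s3_key": key})
    else:
        body = payload.decode("utf-8")
    return sqs_client.send_message(QueueUrl=queue_url, MessageBody=body)
```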

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5a49cpefqnvb5jo3k0v.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5a49cpefqnvb5jo3k0v.jpg" alt="Image description" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Consider an organisation using a customizable SaaS built on AWS services. The extended Java client library can be leveraged when a user needs to upload documents larger than 256 KB or when an event is triggered by a user action. These events can be used for building a decoupled data engineering workload that requires document parsing or tagging using services like Amazon Textract, AWS Glue, or serverless EMR. Spark's DataSourceRegister API along with Spark Structured Streaming can be used as an SQS consumer for high volumes. The processed data can be shared with a web app or SaaS using microservices, or loaded into OLAP or OLTP services.&lt;/p&gt;

&lt;p&gt;Multipart files of less than 2 GB can be written into S3 using a SQL unload. Later, SQS consumers such as AWS Lambda or AWS Glue can be invoked to write the processed files into target SaaS object stores concurrently, provided the SaaS application supports a bulk-load API.&lt;/p&gt;
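&lt;p&gt;A consumer of this kind can be sketched as an SQS-triggered Lambda handler in Python. The &lt;code&gt;s3_bucket&lt;/code&gt;/&lt;code&gt;s3_key&lt;/code&gt; message schema is an assumption for illustration; the actual S3 read and bulk-load calls are left as comments.&lt;/p&gt;

```python
import json

def lambda_handler(event, context):
    """SQS-triggered Lambda: resolve S3 pointers carried in message bodies.

    Each SQS record body is assumed to carry {"s3_bucket": ..., "s3_key": ...}
    metadata published by the producer.
    """
    resolved = []
    for record in event.get("Records", []):
        meta = json.loads(record["body"])
        resolved.append((meta["s3_bucket"], meta["s3_key"]))
    # Downstream: read each object from S3 (boto3 get_object) and push it
    # to the target SaaS datastore via its bulk-load API.
    return {"resolved": resolved}
```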

&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;&lt;br&gt;
Unlike Kafka, SQS does not accept arbitrary message bodies; it has limited support for Unicode characters in the message payload. So one has to consider the type of messages, the complexity of transformation/aggregation, the message retention period, etc. before deciding on the type of message broker needed for data engineering workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Typically, it's a common design pattern to use MQs or Kafka as message brokers, but Amazon SQS can also be leveraged for building loosely coupled data engineering pipelines for data generated by SaaS applications. Serverless AWS services are inherently scalable, and SQS can help achieve parallel data processing in an event-driven data architecture without the need for any additional technical stack.&lt;/p&gt;

</description>
      <category>sqs</category>
      <category>amazon</category>
      <category>dataengineering</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Spark as function - Containerize PySpark code for AWS Lambda and Amazon Kubernetes</title>
      <dc:creator>Prasanth Mathesh</dc:creator>
      <pubDate>Wed, 29 Sep 2021 02:32:19 +0000</pubDate>
      <link>https://forem.com/aws-builders/spark-as-function-containerize-pyspark-code-for-aws-lambda-and-amazon-kubernetes-1bka</link>
      <guid>https://forem.com/aws-builders/spark-as-function-containerize-pyspark-code-for-aws-lambda-and-amazon-kubernetes-1bka</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pm8mgd3ypmtjz4r4pao.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pm8mgd3ypmtjz4r4pao.png" alt="Alt Text" width="800" height="120"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Python, Java, and other applications can be containerized as Docker images for deployment on AWS Lambda and AWS EKS, using AWS ECR as the container registry. The Spark framework, commonly used for distributed big data processing, supports various deployment modes such as local, cluster, and YARN. I have discussed serverless data processing architecture patterns in my other articles; in this one, we will see how to build and run a Spark data processing application on AWS EKS and also on the serverless Lambda runtime. The working code used for this article is kept on &lt;strong&gt;&lt;a href="https://github.com/prasanth-m/AWS/tree/master/Spark-Docker" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Requirements
&lt;/h1&gt;

&lt;p&gt;The following client tools should already be installed in the working dev environment:&lt;br&gt;
AWS CLI, kubectl, eksctl, Docker&lt;br&gt;
Ensure the right version of each tool, including the Spark, AWS SDK, and delta.io dependencies.&lt;/p&gt;
&lt;h1&gt;
  
  
  Kubernetes Deployment
&lt;/h1&gt;

&lt;p&gt;AWS EKS Anywhere, which was launched recently, enables organizations to create and operate Kubernetes clusters on customer-managed infrastructure. This new service is going to change the scalability, disaster planning, and recovery practices currently followed for Kubernetes.&lt;/p&gt;

&lt;p&gt;The following are a few typical Kubernetes deployment patterns, since containerized applications run the same way on different hosts:&lt;br&gt;
1. Build and test the application on-premise, and deploy on the cloud for availability and scalability&lt;br&gt;
2. Build, test, and run the application on-premise, and use the cloud environment for disaster recovery&lt;br&gt;
3. Build, test, and run the application on-premise, and burst slaves on the cloud for on-demand scaling&lt;br&gt;
4. Build and test the application on-premise, deploy the master on a primary cloud, and create slaves on a secondary cloud&lt;/p&gt;

&lt;p&gt;For ever-growing, data-intensive applications that process and store terabytes of data, RPO is critical, and it's better to use on-premise for dev and the cloud for PROD.&lt;/p&gt;
&lt;h1&gt;
  
  
  Spark on Server
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Local&lt;/strong&gt;&lt;br&gt;
First, let's containerize the application and test it in the local environment.&lt;/p&gt;

&lt;p&gt;The PySpark &lt;strong&gt;&lt;a href="https://github.com/prasanth-m/AWS/blob/master/Spark-Docker/cda-spark-kubernetes/cda_spark_kubernetes.py" rel="noopener noreferrer"&gt;code&lt;/a&gt;&lt;/strong&gt; used in this article reads a CSV file from S3 and writes it into a Delta table in append mode. After the write operation is complete, the Spark code displays the Delta table records.&lt;/p&gt;

&lt;p&gt;Build the image with its dependencies and push the Docker image to AWS ECR using the command below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./build_and_push.sh cda-spark-kubernetes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfdrm4gt2nbah9pyjqsu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfdrm4gt2nbah9pyjqsu.png" alt="Alt Text" width="214" height="161"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After the build, the Docker image is also available on the local dev host, where it can be tested using the Docker CLI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run cda-spark-kubernetes driver local:///opt/application/cda_spark_kubernetes.py {args}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpjreouseys105wfeixs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpjreouseys105wfeixs.png" alt="Alt Text" width="800" height="196"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above image shows the output of the delta read operation&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS EKS&lt;/strong&gt;&lt;br&gt;
Build the AWS EKS cluster using eksctl.yaml and apply the RBAC role for the Spark user using the CLI below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eksctl create cluster -f ./eksctl.yaml
kubectl apply -f ./spark-rbac.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the cluster is created, verify the nodes and the cluster IP.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxecdwov0yqcwjtfdgiy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxecdwov0yqcwjtfdgiy.png" alt="Alt Text" width="639" height="30"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above is a plain cluster, ready but without any application or its dependencies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Install spark-operator&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spark Operator is an open-source Kubernetes operator for deploying Spark applications. Helm is to Kubernetes what yum and apt are to Linux distributions, and the Spark Operator can be installed using Helm.&lt;/p&gt;

&lt;p&gt;Install the spark-operator using the Helm CLI below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm install spark-operator spark-operator/spark-operator --set webhook.enable=true
kubectl get pods
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd48q63uf9tlmd38zksh7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd48q63uf9tlmd38zksh7.png" alt="Alt Text" width="663" height="51"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The containerized Spark code can be submitted from a client in cluster mode using the Spark Operator, and its status can be checked using the kubectl CLI.&lt;/p&gt;

&lt;p&gt;Apply spark-job.yaml, which contains the config parameters required by the Spark Operator, on the command line.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f ./spark-job.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CLI command to list Spark applications is given below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get sparkapplication
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fph7h8ee74yp4zy66w6mx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fph7h8ee74yp4zy66w6mx.png" alt="Alt Text" width="711" height="35"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The CLI command to get the Spark driver logs on the client side is given below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs spark-job-driver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Delta operation has appended to the Delta table, and the records are displayed in the driver logs as given below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxz2spf6bot3kjqjxsfn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxz2spf6bot3kjqjxsfn.png" alt="Alt Text" width="800" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additionally, the driver's Spark UI can be forwarded to a localhost port.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7698q5pk9uagomlkxka.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7698q5pk9uagomlkxka.png" alt="Alt Text" width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Kubernetes deployment requests driver and executor pods on demand and shuts them down once processing is complete. This pod-level resource sharing and isolation is a key difference between Spark on YARN and Spark on Kubernetes.&lt;/p&gt;

&lt;h1&gt;
  
  
  Spark on Serverless
&lt;/h1&gt;

&lt;p&gt;Spark is a distributed data processing framework that thrives on RAM and CPU. Spark on an AWS Lambda function is suitable for workloads that can complete within 15 minutes, the Lambda timeout limit.&lt;/p&gt;

&lt;p&gt;For workloads that take more than 15 minutes, the same code can be run in parallel, by leveraging continuous/event-driven pipelines with proper CDC, partitioning, and storage techniques, to achieve the required latency of the data pipeline.&lt;/p&gt;
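&lt;p&gt;One way to reason about this split is a small planner that groups data partitions into batches that each fit inside the 15-minute limit, with one Lambda invocation per batch. This is an illustrative sketch, not a prescribed API; the per-partition time estimate would come from profiling the job.&lt;/p&gt;

```python
def plan_invocations(partitions, minutes_per_partition, limit_minutes=15):
    """Group partitions into batches that each fit the Lambda time limit.

    Each batch is meant to be processed by one parallel Lambda invocation,
    so a job longer than 15 minutes is spread across concurrent functions.
    """
    per_batch = max(1, int(limit_minutes // minutes_per_partition))
    return [partitions[i:i + per_batch]
            for i in range(0, len(partitions), per_batch)]

# 10 partitions at ~5 minutes each -> 3 partitions per invocation.
batches = plan_invocations(list(range(10)), minutes_per_partition=5)
```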

&lt;p&gt;The base Spark image used for the AWS EKS deployment is taken from Docker Hub and is pre-built with the AWS SDK and delta.io dependencies.&lt;/p&gt;

&lt;p&gt;For the AWS Lambda deployment, the AWS-supported Python base image is used to build the code along with its dependencies using the &lt;strong&gt;&lt;a href="https://github.com/prasanth-m/AWS/blob/master/Spark-Docker/cda-spark-lambda/Dockerfile" rel="noopener noreferrer"&gt;Dockerfile&lt;/a&gt;&lt;/strong&gt;, which is then pushed to AWS ECR.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM public.ecr.aws/lambda/python:3.8

ARG HADOOP_VERSION=3.2.0
ARG AWS_SDK_VERSION=1.11.375

RUN yum -y install java-1.8.0-openjdk

RUN pip install pyspark

ENV SPARK_HOME="/var/lang/lib/python3.8/site-packages/pyspark"
ENV PATH=$PATH:$SPARK_HOME/bin
ENV PATH=$PATH:$SPARK_HOME/sbin
ENV PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
ENV PATH=$SPARK_HOME/python:$PATH

RUN mkdir $SPARK_HOME/conf

RUN echo "SPARK_LOCAL_IP=127.0.0.1" &amp;gt; $SPARK_HOME/conf/spark-env.sh

#ENV PYSPARK_SUBMIT_ARGS="--master local pyspark-shell"
ENV JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.302.b08-0.amzn2.0.1.x86_64/jre"
ENV PATH=${PATH}:${JAVA_HOME}/bin

# Set up the ENV vars for code
ENV AWS_ACCESS_KEY_ID=""
ENV AWS_SECRET_ACCESS_KEY=""
ENV AWS_REGION=""
ENV AWS_SESSION_TOKEN=""
ENV s3_bucket=""
ENV inp_prefix=""
ENV out_prefix=""

RUN yum -y install wget
# copy hadoop-aws and aws-sdk
RUN wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar -P ${SPARK_HOME}/jars/ &amp;amp;&amp;amp; \
    wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar -P ${SPARK_HOME}/jars/

COPY spark-class $SPARK_HOME/bin/
COPY delta-core_2.12-0.8.0.jar ${SPARK_HOME}/jars/
COPY cda_spark_lambda.py ${LAMBDA_TASK_ROOT}

CMD [ "cda_spark_lambda.lambda_handler" ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F923lcrqenghm6mp7nb93.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F923lcrqenghm6mp7nb93.png" alt="Alt Text" width="196" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local&lt;/strong&gt;&lt;br&gt;
Test the code on a local machine using the Docker CLI as given below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -e s3_bucket=referencedata01 -e inp_prefix=delta/input/students.csv -e out_prefix=/delta/output/students_table -e AWS_REGION=ap-south-1 -e AWS_ACCESS_KEY_ID=$(aws configure get default.aws_access_key_id) -e AWS_SECRET_ACCESS_KEY=$(aws configure get default.aws_secret_access_key) -e AWS_SESSION_TOKEN=$(aws configure get default.aws_session_token) -p 9000:8080 kite-collect-data-hist:latest cda-spark-lambda
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Local-mode testing requires an event to be triggered; until then, the AWS Lambda runtime will be in wait mode.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16gejydizn8m1eqjcwl4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16gejydizn8m1eqjcwl4.png" alt="Alt Text" width="800" height="49"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Trigger an event for the Lambda function using the CLI below in another terminal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the AWS Lambda invocation completes, we can see the output below on the local machine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxon7bd59h33wlw0iouv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxon7bd59h33wlw0iouv.png" alt="Alt Text" width="800" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Lambda&lt;/strong&gt;&lt;br&gt;
Deploy a Lambda function using the ECR image and set the necessary environment variables for the Lambda handler. Once the Lambda is triggered and completes successfully, we can see the logs in CloudWatch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsc8k2s1e1q5x1kfzgr7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsc8k2s1e1q5x1kfzgr7.png" alt="Alt Text" width="800" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Lambda currently supports up to 6 vCPU cores and 10 GB of memory, and it is billed for the elapsed run time and memory consumption as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj34juhcp8f95qj6ijr4k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj34juhcp8f95qj6ijr4k.png" alt="Alt Text" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Lambda pricing is based on the number of requests and GB-seconds.&lt;br&gt;
 &lt;br&gt;
The same code was run with various configurations, and it is evident from the table below that even when memory is overprovisioned, the AWS Lambda pricing model keeps costs down.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm65xowku8vajxj2cdej6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm65xowku8vajxj2cdej6.png" alt="Alt Text" width="567" height="135"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Going forward, wider adoption of containerized data pipelines for Spark will be the need of the hour, since sources like web apps and SaaS products built on top of Kubernetes generate a lot of data in a continuous manner for big data platforms.&lt;/p&gt;

&lt;p&gt;The most common operations, such as data extraction and ingestion into the S3 data lake, loading processed data into data stores, and pushing down SQL workloads to Amazon Redshift, can be done easily using Spark on AWS Lambda.&lt;/p&gt;

&lt;p&gt;Thus, by leveraging AWS Lambda along with Kubernetes, one can bring down TCO while building planet-scale data pipelines.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>aws</category>
      <category>serverless</category>
      <category>docker</category>
    </item>
    <item>
      <title>Machine Learning Predictions using AWS Redshift ML</title>
      <dc:creator>Prasanth Mathesh</dc:creator>
      <pubDate>Wed, 02 Jun 2021 06:49:04 +0000</pubDate>
      <link>https://forem.com/aws-builders/machine-learning-predictions-using-aws-redshift-ml-3gn6</link>
      <guid>https://forem.com/aws-builders/machine-learning-predictions-using-aws-redshift-ml-3gn6</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;In the previous &lt;a href="https://dev.to/aws-builders/automate-sagemaker-machine-learning-inference-pipeline-in-a-serverless-way-bpk"&gt;article&lt;/a&gt;, we saw how to train models and infer predictions for bring-your-own-algorithm (BYOA) models. In a standard ML pipeline, the features used for inference are either in raw format or the output of a feature engineering pipeline stored in a feature store. Redshift ML enables creating, training, and deploying models using SQL, and inferring predictions in SQL as well. This makes it possible to build feature stores in Redshift, infer predictions, and share them without much overhead or additional orchestration services.&lt;/p&gt;

&lt;h1&gt;
  
  
  Redshift-ML
&lt;/h1&gt;

&lt;p&gt;Redshift ML provides the following options using SQL.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create, Train and Deploy the Model&lt;/li&gt;
&lt;li&gt;Localize the model in Redshift DB&lt;/li&gt;
&lt;li&gt;Infer the predictions for the deployed Model&lt;/li&gt;
&lt;/ol&gt;
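The three options above can be sketched as SQL statements submitted from Python (for example via the boto3 Redshift Data API). The table, column, function, role, and bucket names below are hypothetical, and the statements only illustrate the general shape of Redshift ML SQL.

```python
# Sketch of Redshift ML statements built as strings; all object names
# (tables, function, IAM role, bucket) are hypothetical placeholders.
create_model = """
CREATE MODEL demo.coupon_model
FROM (SELECT destination, weather, time_of_day, accepted FROM demo.coupon_train)
TARGET accepted
FUNCTION predict_coupon_acceptance
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
SETTINGS (S3_BUCKET 'redshift-ml-artifacts');
"""

# Once the model is trained and localized in the Redshift database,
# the generated SQL function infers predictions in-database:
predict = """
SELECT destination, predict_coupon_acceptance(destination, weather, time_of_day)
FROM demo.coupon_test;
"""
```

The statements could then be executed with the Redshift Data API or any SQL client connected to the cluster.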

&lt;p&gt;Additionally, users can bring their own model (BYOM) trained in Amazon SageMaker. Inference can be local or remote through a SageMaker endpoint.&lt;/p&gt;

&lt;p&gt;The reference architecture for BYOM with local inference will be as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7ynz9c7fmnasetq9krp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7ynz9c7fmnasetq9krp.png" alt="Alt Text" width="731" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Local inference saves the infrastructure cost of batch transforms and removes the overhead of setting up model endpoints, especially when models are served in real time. The Redshift cluster can scale, and predictions can be shared via the Data API. A materialized view or table can be created, and predictions can be consumed from web apps.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>machinelearning</category>
      <category>cloud</category>
      <category>database</category>
    </item>
    <item>
      <title>Automate SageMaker Real-Time ML Inference in a ServerLess way</title>
      <dc:creator>Prasanth Mathesh</dc:creator>
      <pubDate>Wed, 19 May 2021 08:56:30 +0000</pubDate>
      <link>https://forem.com/aws-builders/automate-sagemaker-machine-learning-inference-pipeline-in-a-serverless-way-bpk</link>
      <guid>https://forem.com/aws-builders/automate-sagemaker-machine-learning-inference-pipeline-in-a-serverless-way-bpk</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Amazon SageMaker is a fully managed service that enables data scientists and ML engineers to quickly create, train, and deploy models and ML pipelines in an easily scalable and cost-effective way. SageMaker was launched around November 2017, and I had a chance to learn about its built-in algorithms and features from &lt;a href="https://www.linkedin.com/in/skrinak/" rel="noopener noreferrer"&gt;Kris Skrinak&lt;/a&gt; during a boot camp roadshow for Amazon Partners. Over time, SageMaker has matured considerably, enabling ML engineers to deploy and track models quickly and at scale. Beyond its built-in algorithms, it has added features such as Autopilot, Clarify, Feature Store, and Docker container support. This blog looks into these newer SageMaker features and a serverless way of training, deployment, and real-time inference.&lt;/p&gt;

&lt;h1&gt;
  
  
  Architecture
&lt;/h1&gt;

&lt;p&gt;The steps for the below reference architecture are explained at the end of the SageMaker Pipeline section of this article.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr50n1d813i63fmvc5bd4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr50n1d813i63fmvc5bd4.png" alt="Alt Text" width="718" height="444"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  SageMaker Features
&lt;/h1&gt;
&lt;h1&gt;
  
  
  A) Auto Pilot-Low Code Machine Learning
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Launched around DEC 2019&lt;/li&gt;
&lt;li&gt;Industry-first Automated ML to give control and visibility to ML 
Models&lt;/li&gt;
&lt;li&gt;Does Feature Processing, picks the best algorithm, trains and 
selects the best model with just a few clicks&lt;/li&gt;
&lt;li&gt;Vertical AI services like Amazon Personalize and Amazon Forecast 
can be used for personalized recommendation and forecasting 
problems&lt;/li&gt;
&lt;li&gt;AutoPilot is a generic ML service for all kinds of 
classification and regression problems like fraud detection and 
churn analysis and targeted marketing&lt;/li&gt;
&lt;li&gt;Supports inbuilt Algorithms of SageMaker like xgboost and linear 
learner&lt;/li&gt;
&lt;li&gt;Default max size of input dataset is 5 GB but can be increased 
in GBs only&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Auto-Pilot Demo&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Data for AutoPilot Experiment&lt;/em&gt;&lt;br&gt;
The dataset considered is public data provided by &lt;a href="http://archive.ics.uci.edu/ml/machine-learning-databases/00603/" rel="noopener noreferrer"&gt;UCI&lt;/a&gt;.&lt;br&gt;
&lt;em&gt;Data Set Information&lt;/em&gt;&lt;br&gt;
The survey data describes different driving scenarios, including the destination, current time, weather, passenger, etc., and then asks whether the person would accept the coupon while driving. The task we will perform on this dataset is classification.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AutoPilot Experiment&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Import the data for training.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%sh
wget https://archive.ics.uci.edu/ml/machine-learning-databases/00603/in-vehicle-coupon-recommendation.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the data is uploaded, the Autopilot experiment can be set up within minutes using SageMaker Studio. Add the training input and output data paths and the label to predict, and enable auto-deployment of the model. SageMaker deploys the best model and creates an endpoint after successful training.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5slt29v6vkep9v8i5iz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5slt29v6vkep9v8i5iz.png" alt="Alt Text" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alternatively, one can select a model of choice and deploy it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzomm03ijc5fs8ewfigv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzomm03ijc5fs8ewfigv.png" alt="Alt Text" width="800" height="124"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The endpoint configuration and endpoint details of the deployed model can be found in the console.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw56e41qwppc01lhrvtx4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw56e41qwppc01lhrvtx4.png" alt="Alt Text" width="800" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2q0b8pxpl0pyokv7k6po.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2q0b8pxpl0pyokv7k6po.png" alt="Alt Text" width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Infer and Evaluate Model&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Take a validation record and invoke the endpoint. The feature engineering is handled by Autopilot, so raw feature data can be sent directly to the trained model for prediction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;No Urgent Place,Friend(s),Sunny,80,10AM,Carry out &amp;amp; Take away,2h,Female,21,Unmarried partner,1,Some college - no 
degree,Unemployed,$37500 - $49999,,never,never,,4~8,1~3,1,1,0,0,1,1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
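A record like the one above is sent to the endpoint as a single line with the `text/csv` content type. A minimal sketch of the encoding step follows; the record is truncated for brevity, the endpoint name is a hypothetical placeholder, and the actual invocation (which requires AWS credentials) is shown only as comments.

```python
# Encode a raw validation record as the text/csv payload expected by the
# Autopilot endpoint.
def to_csv_payload(features):
    """Join raw feature values into one CSV line, encoded for Body=bytes."""
    return ",".join("" if f is None else str(f) for f in features).encode("utf-8")

# Truncated example record from the validation data above.
record = ["No Urgent Place", "Friend(s)", "Sunny", 80, "10AM"]
payload = to_csv_payload(record)

# The invocation would then look like (requires AWS credentials;
# the endpoint name is a placeholder):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# runtime.invoke_endpoint(EndpointName="coupon-autopilot-endpoint",
#                         ContentType="text/csv", Body=payload)
```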



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnt1rk8uea4p9wra4vep5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnt1rk8uea4p9wra4vep5.png" alt="Alt Text" width="800" height="136"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Infer the model using validation data set using the code given in &lt;a href="https://github.com/prasanth-m/sagemaker/blob/master/Auto-Pilot/infer.ipynb" rel="noopener noreferrer"&gt;Github&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sjx660pcfi391syfmc6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sjx660pcfi391syfmc6.png" alt="Alt Text" width="800" height="132"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  B) SageMaker Clarify
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Launched around DEC 2020&lt;/li&gt;
&lt;li&gt;Explains how machine learning (ML) models made predictions 
during the Autopilot experiments&lt;/li&gt;
&lt;li&gt;Monitors Bias Drift for Models in Production&lt;/li&gt;
&lt;li&gt;Provides components that help AWS customers build less biased 
and more understandable machine learning models&lt;/li&gt;
&lt;li&gt;Provides explanations for individual predictions available via 
API&lt;/li&gt;
&lt;li&gt;Helps in establishing the model governance for ML applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The bias information can be generated for the AutoPilot experiment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bias_data_config = sagemaker.clarify.DataConfig(
    s3_data_input_path=training_data_s3_uri,
    s3_output_path=bias_report_1_output_path,
    label="Y",
    headers=train_cols,
    dataset_type="text/csv",
)

model_config = sagemaker.clarify.ModelConfig(
    model_name=model_name,
    instance_type=train_instance_type,
    instance_count=1,
    accept_type="text/csv",
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  C) SageMaker Feature Store
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Launched around DEC 2020&lt;/li&gt;
&lt;li&gt;Amazon SageMaker Feature Store is a fully managed repository to 
store, update, retrieve, and share machine learning (ML) 
features in S3.&lt;/li&gt;
&lt;li&gt;The feature set that was used to train the model needs to be 
available to make real-time predictions (inference).&lt;/li&gt;
&lt;li&gt;Data Wrangler of SageMaker Studio can be used to engineer 
features and ingest features into a feature store&lt;/li&gt;
&lt;li&gt;Feature Store - both online and offline stores can be ingested 
via separate Featuring Engineering Pipeline via SDK&lt;/li&gt;
&lt;li&gt;Streaming sources can directly ingest features to the online 
feature store for inference or feature creation&lt;/li&gt;
&lt;li&gt;Feature Store automatically builds an Amazon Glue Data Catalog 
when Feature Groups are created and can optionally be turned off&lt;/li&gt;
&lt;/ul&gt;
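For ingestion, each record is written to a feature group as a list of feature-name/value pairs. Below is a minimal sketch of building that structure for the Feature Store runtime's `put_record` call; the feature group name is a hypothetical placeholder, and the boto3 call itself is shown only as comments.

```python
# Build the Record structure used by the SageMaker Feature Store runtime.
def to_feature_record(features):
    """Convert a plain dict into put_record's FeatureName/ValueAsString pairs."""
    return [{"FeatureName": k, "ValueAsString": str(v)} for k, v in features.items()]

record = to_feature_record({"record_id": 2990130, "destination": "No Urgent Place"})

# Ingestion would then be (requires AWS credentials; the feature group
# name is a placeholder):
# import boto3
# runtime = boto3.client("sagemaker-featurestore-runtime")
# runtime.put_record(FeatureGroupName="coupon-feature-group", Record=record)
```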

&lt;p&gt;The table below shows various data stores used to maintain features. Open-source frameworks like Feast have evolved into feature store platforms, and any key-value data store that supports fast lookups can be used as a feature store.&lt;/p&gt;

&lt;p&gt;The feature store is the end stage of the feature engineering pipeline, and features can also be stored in cloud data warehouses such as Snowflake and Redshift, as shown in the image from &lt;a href="https://www.featurestore.org" rel="noopener noreferrer"&gt;featurestore.org&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe89ykt7n071k0nmwy05p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe89ykt7n071k0nmwy05p.png" alt="Alt Text" width="250" height="513"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;record_identifier_value = str(2990130)
featurestore_runtime.get_record(FeatureGroupName=transaction_feature_group_name, RecordIdentifierValueAsString=record_identifier_value)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The feature group can also be accessed as a Hive external table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE EXTERNAL TABLE IF NOT EXISTS sagemaker_featurestore.coupon (
  write_time TIMESTAMP
  event_time TIMESTAMP
  is_deleted BOOLEAN
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS
  INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
  OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
LOCATION 's3://coupon-featurestore/onlinestore/139219451296/sagemaker/ap-south-1/offline-store/coupon-1621050755/data'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  D) SageMaker Pipelines
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Launched around DEC 2020&lt;/li&gt;
&lt;li&gt;SageMaker natively supports MLOPS via the SageMaker project and 
pipelines are created during the SageMaker Project creation&lt;/li&gt;
&lt;li&gt;MLOPS is a standard to streamline the continuous delivery of 
models. It is essential for a successful production-grade ML 
application.&lt;/li&gt;
&lt;li&gt;SageMaker pipeline is a series of interconnected steps that are 
defined by a JSON pipeline definition to perform build, train 
and deploy or only train and deploy etc.&lt;/li&gt;
&lt;li&gt;Alternate ways to set up MLOps with SageMaker include MLflow, 
Airflow, Kubeflow, Step Functions, etc.&lt;/li&gt;
&lt;/ul&gt;
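As the bullets above note, a pipeline is defined by a JSON document describing an ordered series of steps. The sketch below only illustrates that general shape; the step names and S3 paths are hypothetical, and the exact schema should be checked against the SageMaker pipeline definition documentation.

```python
import json

# Illustrative shape of a pipeline definition: ordered steps for training
# and model registration. Names and paths are hypothetical placeholders.
pipeline_definition = {
    "Version": "2020-12-01",
    "Steps": [
        {"Name": "TrainCouponModel", "Type": "Training",
         "Arguments": {"OutputDataConfig": {"S3OutputPath": "s3://coupon-ml/models"}}},
        {"Name": "RegisterCouponModel", "Type": "Model",
         "Arguments": {"PrimaryContainer": {"ModelDataUrl": "s3://coupon-ml/models/model.tar.gz"}}},
    ],
}

# The definition is passed to SageMaker as a JSON string.
definition_json = json.dumps(pipeline_definition)
```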

&lt;h1&gt;
  
  
  Docker Containers
&lt;/h1&gt;

&lt;p&gt;SageMaker Studio itself runs from a Docker container. The docker containers can be used to migrate the existing on-premise live ML pipelines and models into the SageMaker environment.&lt;/p&gt;

&lt;p&gt;Both stateful and stateless inference pipelines can be created. For example, anomaly and fraud detection pipelines are stateless, while the example considered in this article is a stateful model inference pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SageMaker Container Demo&lt;/strong&gt;&lt;br&gt;
Download the &lt;a href="https://github.com/prasanth-m/sagemaker/tree/master/container" rel="noopener noreferrer"&gt;Github&lt;/a&gt; folder. The container folder should show files as shown in the image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyesl78n0gn5hspmavd6r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyesl78n0gn5hspmavd6r.png" alt="Alt Text" width="294" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The dataset is the same as we have considered for Autopilot Experiment. &lt;/p&gt;

&lt;p&gt;The scikit-learn algorithm is used for local training and model tuning. After several iterations, features with low importance were removed, and encoding was performed on the key features.&lt;/p&gt;

&lt;p&gt;The final encoded features (97 labels) are stored in coupon_train.csv and will be used for training and validation locally. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker Container Build&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The following steps have to be performed in order.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build the image
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -t recommend-in-vehicle-coupon:latest .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ngomw9417p1t17crzq9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ngomw9417p1t17crzq9.png" alt="Alt Text" width="660" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Train the features in local mode
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./train_local.sh recommend-in-vehicle-coupon:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fje73c1wea6abxckemplb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fje73c1wea6abxckemplb.png" alt="Alt Text" width="711" height="67"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Serve the model in local mode
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./serve_local.sh recommend-in-vehicle-coupon:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The server is up and waiting for requests.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx063fngh2s0n7ven234k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx063fngh2s0n7ven234k.png" alt="Alt Text" width="621" height="132"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predict locally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The payload.csv file holds the features for prediction. Run the below command to get the model's response for the features in the CSV.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./predict.sh payload.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfgmxox8axjis2wadips.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfgmxox8axjis2wadips.png" alt="Alt Text" width="524" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once a request is accepted, the listening server responds to it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4gqz5wzx8z6hdw170y9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4gqz5wzx8z6hdw170y9.png" alt="Alt Text" width="668" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Push Image&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once local testing is completed, the container image for train, deploy, and serve can be pushed to Amazon ECR. If any code change is made, only this final build-and-push step needs to be repeated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./build_and_push.sh

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0tncuyw8dpraxuwsjfw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0tncuyw8dpraxuwsjfw.png" alt="Alt Text" width="718" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Deployed Container Image&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsj4izs1q7wqy4h01zsc3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsj4izs1q7wqy4h01zsc3.png" alt="Alt Text" width="800" height="56"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The images can be pulled from Amazon ECR and run as containers on Lambda, Amazon EKS, etc.&lt;/p&gt;
&lt;h1&gt;
  
  
  Lambda Function 
&lt;/h1&gt;

&lt;p&gt;The SageMaker API calls for training, deployment, and inference are implemented as Lambda functions. The deployed Lambda handler is then integrated with API Gateway so that the pipeline can run for any triggered API event.&lt;br&gt;
The Lambda function kept in &lt;a href="https://github.com/prasanth-m/sagemaker/blob/master/Lambda/function.py" rel="noopener noreferrer"&gt;Github&lt;/a&gt; has three major blocks.&lt;/p&gt;
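The three blocks can be sketched as a single handler that dispatches on the request body's "key" field, matching the request bodies used later in the testing section. This is a minimal sketch, not the repository's exact code: the SageMaker API calls are elided as comments, and the return values are illustrative.

```python
import json

# Sketch of the Lambda handler's dispatch: route the API Gateway event to
# training, deployment, or inference based on "key" in the request body.
def lambda_handler(event, context=None):
    # API Gateway wraps the request body as a JSON string; direct test
    # invocations may pass the dict itself.
    body = json.loads(event["body"]) if isinstance(event.get("body"), str) else event
    key = body.get("key")
    if key == "train_data":
        # client.create_training_job(**create_training_params)
        return {"statusCode": 200, "action": "training_started"}
    if key == "deploy_model":
        # client.create_model(...); client.create_endpoint_config(...)
        # client.create_endpoint(...)
        return {"statusCode": 200, "action": "deploying",
                "training_job": body.get("training_job")}
    # Default: treat the body as features and invoke the endpoint.
    # runtime.invoke_endpoint(EndpointName=..., Body=..., ContentType="text/csv")
    return {"statusCode": 200, "action": "inference"}
```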

&lt;p&gt;&lt;strong&gt;Create SageMaker Training Function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Lambda function reads the features from S3 and runs the training job.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;client = boto3.client("sagemaker", region_name=region)
        client.create_training_job(**create_training_params) 
        status = client.describe_training_job(TrainingJobName=job_name)["TrainingJobStatus"]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy53yp761oq1tw45dlme4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy53yp761oq1tw45dlme4.png" alt="Alt Text" width="357" height="81"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Create SageMaker Model and Endpoint Function
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Create the model&lt;/strong&gt;&lt;br&gt;
The training job places model artifacts in S3, and the model then has to be registered with SageMaker.&lt;/p&gt;

&lt;p&gt;Register the models in the SageMaker environment using the below API call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create_model_response = client.create_model(
              ModelName=model_name, ExecutionRoleArn=role, PrimaryContainer=primary_container
              )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ti4thk0853n3g4ivo6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ti4thk0853n3g4ivo6e.png" alt="Alt Text" width="800" height="44"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create End Point Config&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;response = client.create_endpoint_config(
            EndpointConfigName=endpoint_config_name,
            ProductionVariants=[
                {
                    'VariantName': 'variant-1',
                    'ModelName': model_name,
                    'InitialInstanceCount': 1,
                    'InstanceType': 'ml.t2.medium'
                }
            ]
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ti4thk0853n3g4ivo6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ti4thk0853n3g4ivo6e.png" alt="Alt Text" width="800" height="44"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create End Point&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;response = client.create_endpoint(
            EndpointName=endpoint_name,
            EndpointConfigName=endpoint_config_name
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vvwvbhhjnqz38f0zr97.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vvwvbhhjnqz38f0zr97.png" alt="Alt Text" width="655" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invoke SageMaker Model Function&lt;/strong&gt;&lt;br&gt;
Based on the API request body, the Lambda function invokes the endpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;response = client.invoke_endpoint(
            EndpointName=EndpointName,
            Body=event_body.encode('utf-8'),
            ContentType='text/csv'
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The status of the in-service endpoint and the requests made to it can be checked in the CloudWatch logs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0c352dvuwqbi3pvta3wt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0c352dvuwqbi3pvta3wt.png" alt="Alt Text" width="800" height="62"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Testing Stateful Real-Time Inference
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Trigger SageMaker Training&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once API Gateway and Lambda have been integrated, the training job can be triggered by passing the below request body to the Lambda function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"key":"train_data"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1w71cmqbrbbotz9oslm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1w71cmqbrbbotz9oslm.png" alt="Alt Text" width="507" height="186"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trigger SageMaker Model and Endpoint Deployment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the training job is completed, deploy the model with the below request body. The training job name should be the one created in the previous step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"key" : "deploy_model",
"training_job" :"&amp;lt;training job name&amp;gt;"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Trigger SageMaker Model Endpoint&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Invoke the endpoint with the below request. The features are encoded and must match the encoding used during training.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk90q0pccyh0c0sywmfqg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk90q0pccyh0c0sywmfqg.png" alt="Alt Text" width="353" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The predicted response will be as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frozfy00i1r1e40k558ak.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frozfy00i1r1e40k558ak.png" alt="Alt Text" width="488" height="233"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The events created while invoking the endpoint can be viewed in CloudWatch Logs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmv0eyrgv1p20ap1m0sva.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmv0eyrgv1p20ap1m0sva.png" alt="Alt Text" width="800" height="45"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Machine learning inference can account for more than 80 percent of the operational cost of running ML workloads. SageMaker capabilities such as container orchestration, multi-model endpoints and serverless inference can reduce both operational and development costs. Event-driven training and inference pipelines also let a non-technical user from a sales or marketing team refresh both batch and real-time predictions at the click of a button, built with mechanisms like APIs and webhooks in their sales portal, on an ad hoc basis before running a campaign.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>serverless</category>
      <category>aws</category>
      <category>cloud</category>
    </item>
    <item>
      <title>IoT/TimeSeries event processing using AWS Serverless Services and AWS Managed Kafka Streaming</title>
      <dc:creator>Prasanth Mathesh</dc:creator>
      <pubDate>Sun, 14 Mar 2021 08:24:32 +0000</pubDate>
      <link>https://forem.com/aws-builders/iot-timeseries-event-processing-using-aws-serverless-services-and-aws-managed-kafka-streaming-42k0</link>
      <guid>https://forem.com/aws-builders/iot-timeseries-event-processing-using-aws-serverless-services-and-aws-managed-kafka-streaming-42k0</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;TimeSeries and IoT data share most characteristics, apart from a few such as the timestamp attribute: time-series data arrives at predefined intervals, while IoT events can arrive in random windows. IoT event analytics can be done either with IoT Analytics or with Kafka, but both have certain data-processing limitations, so the solution considered in this article mixes Kafka streaming and IoT Analytics. The integration of AWS MSK with Kinesis Data Analytics or AWS Glue will be covered in detail in another article.&lt;/p&gt;

&lt;h1&gt;
  
  
  Architecture
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folokpc5fpnn78f1z0xda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folokpc5fpnn78f1z0xda.png" alt="Alt Text" width="800" height="543"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Why IoT Core?
&lt;/h1&gt;

&lt;p&gt;The Kafka protocol is built on top of TCP/IP, whereas standard IoT devices speak MQTT, a protocol designed for poor networks and intermittent communication. The AWS IoT Core device gateway supports both HTTP and MQTT, provides bi-directional communication with IoT devices, and can filter and route data across enterprise applications. It also makes device registration simple, accelerating IoT application development.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why AWS MSK?
&lt;/h1&gt;

&lt;p&gt;Apache Kafka is a distributed messaging framework that provides fast, scalable, high-throughput ingestion. Kafka can replay IoT events, provides long-term storage, acts as a buffer for high-velocity data and integrates easily with other enterprise applications. Real-time device monitoring is not possible in IoT Analytics, whereas Kafka streaming combined with an event-processing framework can trigger actions on anomalies in real time. There is no downtime during a Kafka cluster upgrade, and clusters can be provisioned in about 15 minutes. Another key advantage is that there is no charge for data-replication traffic across Availability Zones, a deciding factor when compared with expensive, highly available self-hosted Kafka clusters.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why IoT Analytics?
&lt;/h1&gt;

&lt;p&gt;AWS IoT Analytics can enrich the IoT data, perform ad-hoc analysis and build dashboards using QuickSight. It is a simple and serverless way to do data prep, clean and feature engineering and can be integrated with notebooks, AWS SageMaker to build machine learning models. Custom analysis code packaged in a container can also be executed on AWS IoT Analytics for use cases like understanding the performance of devices, predicting device failures, etc.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why TimeStream?
&lt;/h1&gt;

&lt;p&gt;Timestream can store and analyze trillions of events per day, with data retention controlled according to the analytics need. It has built-in time-series analytics functions, helping identify trends and patterns in your data in near real time. Timestream can provide up to 1,000x faster query performance at 1/10th the cost of relational databases. When the same record is received for a timestamp, Timestream deduplicates it, which addresses a common problem with streaming events. For very high ingestion volumes, events can be buffered through a canonical store such as Kinesis Data Firehose and written to S3 for long-term storage, with Timestream used as the serving database.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why AWS Lambda?
&lt;/h1&gt;

&lt;p&gt;AWS Lambda can process data from an event source such as Apache Kafka. Lambda is inexpensive, its memory can be scaled from 128 MB to 10,240 MB, and a processing timeout can be set per function. IoT device payloads are small, and real-time device-control actions based on incoming payload values are easier to implement with Lambda than with serverless services like AWS Glue or AWS Kinesis Data Analytics. Lambda can be triggered by an event source such as AWS MSK, or it can run on a schedule.&lt;/p&gt;

&lt;h1&gt;
  
  
  Let's Get Started
&lt;/h1&gt;

&lt;p&gt;Device registration and IoT message simulation are critical tasks during IoT development, and data engineering teams often need to simulate IoT events using SaaS providers such as MQTTLab or Cumulocity IoT.&lt;br&gt;
For this article, device registration and IoT event simulation were done through the AWS IoT simulator. The simulator provides device-type registration, control over the number of devices simulating data, and payload-structure definition for each device. Automotive telemetry data was considered and simulated as shown in the steps below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add Device Type&lt;/strong&gt;&lt;br&gt;
Create a simulation stack in an AWS region and add custom devices. The automotive telemetry payload attributes are inbuilt and can’t be changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simulate IoT Data&lt;/strong&gt;&lt;br&gt;
Automotive telemetry data is quite comprehensive, and the simulator considered here publishes messages on three topics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/gy0cycmWk0L84HomqR/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/gy0cycmWk0L84HomqR/giphy.gif" alt="Alt text" width="480" height="173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Telemetry Topic Payload&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "speed",
  "value": 47.4,
  "vin": "1NXBR32E84Z995078",
  "trip_id": "799fc110-fee2-43b2-a6ed-a504fa77931a",
  "timestamp": "2018-02-15 08:20:18.000000000"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Trip Topic Payload&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{"vehicle_speed_mean":64.10065503146477,"engine_speed_mean":3077.59476197646,"torque_at_transmission_mean":210.70915084517395,"oil_temp_mean":237.417022870719,"accelerator_pedal_position_mean":28.819512721817887,"brake_mean":4.268754736044446,"high_speed_duration":0,"high_acceleration_event":3,"high_braking_event":0,"idle_duration":75323,"start_time":"2021-03-06T07:40:02.454Z","ignition_status":"run","brake_pedal_status":false,"transmission_gear_position":"fifth","odometer":35.27425650210172,"fuel_level":97.7129231345363,"fuel_consumed_since_restart":0.9155057461854811,"latitude":38.938734,"longitude":-77.269385,"timestamp":"2021-03-06 08:13:04.981000000","trip_id":"c6bacef5-1bfc-4a72-8261-6c1272772f13","vin":"5JO226H6QR3J3T7TI","name":"aggregated_telemetrics"}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Diagnostics Topic Payload&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{"timestamp":"2021-03-06 08:14:33.087000000","trip_id":"a4c55e6e-a7eb-4c3e-b0a0-dfcace119e03","vin":"MQYK4Z8WGTJDFDA05","name":"dtc","value":"P0404"}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;After completing the exercises, stop the simulation.&lt;/p&gt;

&lt;p&gt;It is evident from the image below that IoT Core costs are low.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4l9srntc1n5rij8gqdi4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4l9srntc1n5rij8gqdi4.png" alt="Alt Text" width="700" height="44"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  IoT Analytics
&lt;/h1&gt;

&lt;p&gt;IoT Analytics setup is a simple process: the channel, pipeline, datastore and dataset can be created in a single click, as shown below.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8id9yrh6c55f5u2a53dy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8id9yrh6c55f5u2a53dy.png" alt="Alt Text" width="700" height="264"&gt;&lt;/a&gt;&lt;br&gt;
The above setup was created to select all telemetry payloads from the subscribed topic.&lt;/p&gt;

&lt;p&gt;The pipeline can be used to filter any unwanted attributes from the payload before creating a datastore.&lt;/p&gt;

&lt;p&gt;S3 is used as an underlying data store and the data format can be parquet or JSON.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup IoT Rule and Action&lt;/strong&gt;&lt;br&gt;
Create a rule and action for the telemetry payload as given below. The objective is to select all records and ingest them into the telemetry channel.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2m5s3esm7j0jpwzq9nq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2m5s3esm7j0jpwzq9nq.png" alt="Alt Text" width="700" height="145"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhqk7vmis6cbi060uf40.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhqk7vmis6cbi060uf40.png" alt="Alt Text" width="700" height="251"&gt;&lt;/a&gt;&lt;/p&gt;
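&lt;p&gt;The rule's query statement boils down to selecting everything from the simulator's telemetry topic. The topic name below is illustrative; use the topic your devices actually publish on:&lt;/p&gt;

```sql
SELECT * FROM 'connectedcar/telemetry/#'
```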

&lt;p&gt;Once the simulation is started, the ingestion process can be monitored in the IoT Analytics window.&lt;/p&gt;

&lt;p&gt;The dataset can be created and scheduled to be refreshed from the datastore. The dataset can be refreshed from the last import timestamp using the delta time option.&lt;/p&gt;

&lt;p&gt;The IoT Analytics dataset can be used in AWS SageMaker notebooks and as a data source for QuickSight.&lt;/p&gt;

&lt;h1&gt;
  
  
  QuickSight
&lt;/h1&gt;

&lt;p&gt;Create a new IoT Analytics data source using the telemetry_dataset.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuwxzn0ehtyi90uyoo89.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuwxzn0ehtyi90uyoo89.png" alt="Alt Text" width="567" height="339"&gt;&lt;/a&gt;&lt;br&gt;
Once the import is completed, the records can be enriched or filtered.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhk2n8bum5tm46ajvh9fn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhk2n8bum5tm46ajvh9fn.png" alt="Alt Text" width="700" height="271"&gt;&lt;/a&gt;&lt;br&gt;
The import process can be scheduled for auto-refresh.&lt;/p&gt;

&lt;p&gt;The imported telemetry dataset values can be visualized in QuickSight.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6vsph8ecklpdjxguur9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6vsph8ecklpdjxguur9.png" alt="Alt Text" width="578" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  SageMaker
&lt;/h1&gt;

&lt;p&gt;Create a notebook and open it on the SageMaker instance.&lt;/p&gt;

&lt;p&gt;Import dataset and analyze the data for building machine learning models for predictive maintenance of devices.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cqfvqol6eax4hl3nzht.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cqfvqol6eax4hl3nzht.png" alt="Alt Text" width="700" height="66"&gt;&lt;/a&gt;&lt;br&gt;
A basic working version of the SageMaker code is kept on &lt;a href="https://github.com/prasanth-m/AWS/blob/master/IoT%20Analytics/telemetry.ipynb" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Feature engineering, model training and the inference pipeline for near real-time data will be covered in detail in a separate article.&lt;/p&gt;

&lt;h1&gt;
  
  
  TimeStream
&lt;/h1&gt;

&lt;p&gt;Timestream pricing is $0.50 per 1 million writes of 1 KB. The average telemetry record is about 145 bytes and the average aggregated trip record is about 800 bytes, so by merging all the dimensions such as speed, fuel_level etc. at the VIN level we can reduce the number of writes per batch, and hence the cost. A detailed explanation of the payload structure and write batching is given in my previous &lt;a href="https://dev.to/aws-builders/collect-and-store-streaming-timeseries-data-into-amazon-timestream-db-21jf"&gt;article&lt;/a&gt;. Create a rule and action for Timestream if the data is aggregated at the source itself, or process the IoT events stored in MSK using Kinesis Data Analytics for Flink or a custom Apache Spark application.&lt;/p&gt;
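&lt;p&gt;The VIN-level batching can be sketched as follows (the database and table names are placeholders): several measures for one VIN share a single set of common attributes, so one &lt;code&gt;write_records&lt;/code&gt; call carries the whole batch.&lt;/p&gt;

```python
def build_batch(vin, trip_id, measures, timestamp_ms):
    """Fold several measures (speed, fuel_level, ...) for one VIN into a
    single batch: the dimensions and time go into CommonAttributes so they
    are not repeated per record."""
    records = [
        {"MeasureName": name, "MeasureValue": str(value), "MeasureValueType": "DOUBLE"}
        for name, value in measures.items()
    ]
    common = {
        "Dimensions": [
            {"Name": "vin", "Value": vin},
            {"Name": "trip_id", "Value": trip_id},
        ],
        "Time": str(timestamp_ms),
        "TimeUnit": "MILLISECONDS",
    }
    return records, common


def write_batch(vin, trip_id, measures, timestamp_ms):
    # Database and table names are hypothetical; boto3 is imported lazily.
    import boto3
    records, common = build_batch(vin, trip_id, measures, timestamp_ms)
    boto3.client("timestream-write").write_records(
        DatabaseName="iot", TableName="telemetry",
        CommonAttributes=common, Records=records,
    )
```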

&lt;h1&gt;
  
  
  AWS MSK
&lt;/h1&gt;

&lt;p&gt;AWS MSK was recently introduced as one of the actions for IoT Core, so an IoT rule action can act as a producer for Managed Streaming for Kafka. Create a VPC destination and add the cluster details and authentication mechanism of the Kafka cluster that will receive the messages.&lt;/p&gt;

&lt;p&gt;Enable error logging for the action.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ypw1e5z282dy2ialz3j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ypw1e5z282dy2ialz3j.png" alt="Alt Text" width="658" height="188"&gt;&lt;/a&gt;&lt;br&gt;
Enable CloudWatch Logs and CloudTrail for IoT and AWS MSK to analyze the logs and API calls.&lt;/p&gt;

&lt;h1&gt;
  
  
  Lambda
&lt;/h1&gt;

&lt;p&gt;Create a Lambda function to consume and process the data.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for partition in partitions:
    consumer.seek_to_beginning(partition)  # sample read; use stored offsets in production
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswlodhokpunp23gn6vd7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswlodhokpunp23gn6vd7.png" alt="Alt Text" width="800" height="65"&gt;&lt;/a&gt;&lt;br&gt;
The diagnostic data is low in volume but needs immediate action. The Lambda will consume it, check for a problem code and issue a STOP signal to the device by publishing a payload to the MQTT topic meant for the device shadow.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;response = IOTCLIENT.publish(topic=iottopic, qos=0, payload=json.dumps(msg))&lt;/code&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft64xrrwq19ch38frs8se.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft64xrrwq19ch38frs8se.png" alt="Alt Text" width="700" height="205"&gt;&lt;/a&gt;&lt;br&gt;
The working version of the lambda function is kept in GitHub &lt;a href="https://github.com/prasanth-m/AWS/blob/master/IoT%20Analytics/kafk_consumer.py" rel="noopener noreferrer"&gt;kafka_consumer.py&lt;/a&gt;&lt;/p&gt;
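&lt;p&gt;When MSK is wired up as a Lambda event source (rather than polling with a Kafka client as above), record values arrive base64-encoded and grouped by topic-partition. A sketch of decoding them and flagging a hypothetical set of stop-worthy trouble codes:&lt;/p&gt;

```python
import base64
import json

STOP_CODES = {"P0404"}  # illustrative set of diagnostic trouble codes


def iter_payloads(event):
    """Yield decoded JSON payloads from an MSK-triggered Lambda event."""
    for records in event.get("records", {}).values():
        for record in records:
            yield json.loads(base64.b64decode(record["value"]))


def handler(event, context=None):
    flagged = []
    for payload in iter_payloads(event):
        if payload.get("name") == "dtc" and payload.get("value") in STOP_CODES:
            flagged.append(payload["vin"])
            publish_stop(payload["vin"])
    return flagged


def publish_stop(vin):
    # Topic name is a placeholder; boto3 is imported lazily.
    import boto3
    boto3.client("iot-data").publish(
        topic=f"device/{vin}/stop", qos=0,
        payload=json.dumps({"command": "STOP"}),
    )
```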

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;IoT data can be processed and stored by various AWS services depending on the use case. Serverless AWS services aid the rapid development of large-scale, complicated architectures in an easily scalable manner. Managed services such as AWS MSK, along with a choice of distributed processing frameworks like Apache Flink-based Kinesis Data Analytics or Apache Spark-based AWS Glue and AWS Lambda, help build scalable, highly available and cost-effective processing for IoT/TimeSeries events.&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>iot</category>
      <category>aws</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Collect and Store Streaming TimeSeries data into Amazon TimeStream DB</title>
      <dc:creator>Prasanth Mathesh</dc:creator>
      <pubDate>Sat, 26 Dec 2020 15:48:30 +0000</pubDate>
      <link>https://forem.com/aws-builders/collect-and-store-streaming-timeseries-data-into-amazon-timestream-db-21jf</link>
      <guid>https://forem.com/aws-builders/collect-and-store-streaming-timeseries-data-into-amazon-timestream-db-21jf</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;This post discusses the serverless architectural considerations and configuration steps for deploying a streaming time-series data solution on Amazon Timestream in the Amazon Web Services (AWS) Cloud. It includes links to a code repository that can be used as a base to deploy this solution following AWS best practices for security and availability.&lt;/p&gt;

&lt;h1&gt;
  
  
  Components Basics
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;AWS Lambda&lt;/strong&gt;&lt;br&gt;
Lambda is a serverless compute service that lets you run code without provisioning or managing servers.&lt;br&gt;
&lt;strong&gt;Amazon Timestream&lt;/strong&gt;&lt;br&gt;
Amazon Timestream is a fast, scalable, and serverless time series database service for IoT and other operational applications.&lt;br&gt;
&lt;strong&gt;AWS Kinesis&lt;/strong&gt;&lt;br&gt;
Amazon Kinesis Data Streams ingests a large amount of data in real-time, durably stores the data, and makes the data available for consumption.&lt;/p&gt;

&lt;h1&gt;
  
  
  Data Flow
&lt;/h1&gt;

&lt;p&gt;After deployment, the streaming data pipeline will look as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkuuo33ltz7jf0i3igmlz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkuuo33ltz7jf0i3igmlz.png" alt="Alt Text" width="681" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Getting Started
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;AWS Kinesis Setup&lt;/strong&gt;&lt;br&gt;
Create a data stream named timeseries-stream with a single shard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon TimeStream Setup&lt;/strong&gt;&lt;br&gt;
Create a database named ecomm in the same region as the Kinesis data stream, and a table named inventory in the ecomm database, using the gists shared on GitHub.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Lambda Setup&lt;/strong&gt;&lt;br&gt;
Create a Kinesis producer to generate and ingest time-series data into the Kinesis data stream, and a Kinesis consumer to read it back in the same order. The Python SDK examples used for this article are kept in the GitHub repository &lt;a href="https://github.com/prasanth-m/AWS/tree/master/TimeStream" rel="noopener noreferrer"&gt;TimeStream&lt;/a&gt;.&lt;/p&gt;
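&lt;p&gt;The producer side can be sketched as below. The sku and quantity fields are illustrative, and the stream name matches the setup above; using the item identifier as the partition key keeps events for the same item in order on the same shard:&lt;/p&gt;

```python
import json
import time


def make_record(sku, quantity, event_time_ms=None):
    """Build one inventory event for PutRecord; the fields are illustrative."""
    data = {
        "sku": sku,
        "quantity": quantity,
        "time": event_time_ms or int(time.time() * 1000),
    }
    return {"Data": json.dumps(data), "PartitionKey": sku}


def put_inventory_event(sku, quantity, stream_name="timeseries-stream"):
    # boto3 is imported lazily so record building stays testable offline.
    import boto3
    boto3.client("kinesis").put_record(StreamName=stream_name,
                                       **make_record(sku, quantity))
```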

&lt;h1&gt;
  
  
  Deployment
&lt;/h1&gt;

&lt;p&gt;The Serverless Framework makes development and deployment faster. Deploy the Lambda producers and consumers into AWS and schedule them to run on time-series event triggers or on any schedule. In a typical production scenario, the producers might sit outside the cloud region and events might arrive through API Gateway.&lt;br&gt;
Producer and consumer logs will be available in CloudWatch.&lt;/p&gt;

&lt;p&gt;The written results can be queried using the Amazon Timestream query editor.&lt;br&gt;
   &lt;code&gt;select * from "ecomm"."inventory"&lt;br&gt;
    limit 10&lt;/code&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In most organizations, time-series data points are written once and read many times. It is clear that time-series data can be collected and stored using serverless services. Although Timestream integrates with various AWS services, Kinesis was chosen here for its data-retention and replay features. The next time-series article will cover a use case built on the kappa data-processing architecture.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>database</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
