<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Chatchai Komrangded (Bas)</title>
    <description>The latest articles on Forem by Chatchai Komrangded (Bas) (@chatchaikomrangded).</description>
    <link>https://forem.com/chatchaikomrangded</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F693771%2Ff432c265-b1c2-4344-89b6-9fd13dec3fd2.jpeg</url>
      <title>Forem: Chatchai Komrangded (Bas)</title>
      <link>https://forem.com/chatchaikomrangded</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/chatchaikomrangded"/>
    <language>en</language>
    <item>
      <title>Build your own data quality rules with AWS Glue DataBrew</title>
      <dc:creator>Chatchai Komrangded (Bas)</dc:creator>
      <pubDate>Sun, 21 Nov 2021 09:53:31 +0000</pubDate>
      <link>https://forem.com/awscommunity-asean/validating-data-quality-with-aws-glue-databrew-4df4</link>
      <guid>https://forem.com/awscommunity-asean/validating-data-quality-with-aws-glue-databrew-4df4</guid>
      <description>&lt;p&gt;In previous post, I have showed How AWS Glue DataBrew can help you to &lt;a href="https://dev.to/awscommunity-asean/identifying-and-handling-personally-identifiable-information-pii-dwy-aws-glue-databrew-13ed"&gt;Handling PII data&lt;/a&gt; (The post is Thai language) For an English reader, the screenshot speaks for itself, you can easily follow.&lt;/p&gt;

&lt;p&gt;There was a recent announcement AWS Glue DataBrew users can now develop data quality rules, which are customized validation tests that set business needs for specific data, according to the company. As a result, Any data person who does not want to invest in a high-cost DQ licensing product or does not want to use an open-source framework that requires coding knowledge can define their own quality rules and populate them in a data quality dashboard and validation report, allowing customers to quickly view rule outcomes and determine whether their data is fit for use. As a Builder, with New two announcement AWS GlueDataBrew with &lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/11/aws-glue-databrew-data-quality-rules-validate-requirements/" rel="noopener noreferrer"&gt;DQ&lt;/a&gt;, and &lt;a href="https://dev.to/awscommunity-asean/identifying-and-handling-personally-identifiable-information-pii-dwy-aws-glue-databrew-13ed"&gt;PII&lt;/a&gt; Handing will help us a lot in term of design DQ process at scale, as it's no compute server to maintenance. In this post I will walk you though How to do validate your data quality with AWS Glue DataBrew.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/glue/features/databrew/" rel="noopener noreferrer"&gt;What is AWS Glue Databrew?&lt;/a&gt;&lt;br&gt;
&lt;a href="https://docs.aws.amazon.com/databrew/latest/dg/profile.data-quality-rules.html" rel="noopener noreferrer"&gt;What do you mean by Data Quality rule?&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Table Of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Architecture Diagram&lt;/li&gt;
&lt;li&gt;Pre-requisites&lt;/li&gt;
&lt;li&gt;Preparing dataset&lt;/li&gt;
&lt;li&gt;Run data profile without DQ ruleset&lt;/li&gt;
&lt;li&gt;Create DQ ruleset&lt;/li&gt;
&lt;li&gt;Re-run data profile with DQ ruleset&lt;/li&gt;
&lt;li&gt;View DQ dashboard&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Architecture Diagram &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx144lnkhg1a3w9j505kz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx144lnkhg1a3w9j505kz.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pre-requisites &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;AWS Account❗ &lt;/li&gt;
&lt;li&gt;Download Data &lt;a href="https://synthetichealth.github.io/synthea-sample-data/downloads/10k_synthea_covid19_csv.zip" rel="noopener noreferrer"&gt;here&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Unzip, and upload only patient.csv upload to Amazon S3.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Preparing dataset &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Go to AWS Glue DataBrew console &lt;a href="https://console.aws.amazon.com/databrew/home?region=us-east-1#landing" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwyfcw6a1cjysiamnkpkq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwyfcw6a1cjysiamnkpkq.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the left you click &lt;a href="https://console.aws.amazon.com/databrew/home?region=us-east-1#datasets" rel="noopener noreferrer"&gt;Datasets&lt;/a&gt; to create Dataset for Glue DataBrew&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftlukia8aa3cgs4e9e087.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftlukia8aa3cgs4e9e087.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;click &lt;a href="https://console.aws.amazon.com/databrew/home?region=us-east-1#create-dataset" rel="noopener noreferrer"&gt;Create new dataset&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs60ps64thufsqa2tawbd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs60ps64thufsqa2tawbd.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Choose patient.csv that you just uploaded, you also have an option to get the data from Glue data catalog, Amazon Redshift, Appflow, or Snowflake as well. Click Create dataset&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fao7lcm7d5lm592lvqu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fao7lcm7d5lm592lvqu.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Run data profile &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;At Datasets, choose Patient dataset, and Click "Run data profile"&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk60w63dmvio4aik2riuy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk60w63dmvio4aik2riuy.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You will wait few minutes for data profile to populate Data profile overview, Column statistics, Data quality rules suggestion, and Data lineage, once it's done click "View data profile"&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8dg7hq1ufjj0lr6ievc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8dg7hq1ufjj0lr6ievc.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You expect to see all the data statistics that useful to understand your dataset as follow&lt;/p&gt;

&lt;p&gt;Dataset preview&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Funy4jbbp887qmmzd9tt7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Funy4jbbp887qmmzd9tt7.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data profile overview&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fylkk2a3i2yinkg4e7o0s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fylkk2a3i2yinkg4e7o0s.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Potential PII detection&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz7ist9hg4djkhlnjmx1c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz7ist9hg4djkhlnjmx1c.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Attribute Correlations&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjm2onqc8l1spipkqldxk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjm2onqc8l1spipkqldxk.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Individual Column level statistic&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F997mdy45rhim9knhqiqz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F997mdy45rhim9knhqiqz.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyv10deneqjqmdfpa9l8b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyv10deneqjqmdfpa9l8b.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All Columns level statistic summary such as min, max, distribution, column type, unique value, and etc&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbd3p3iv39js8od25a5k7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbd3p3iv39js8od25a5k7.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also drill down to columns that identified as PII as well in All Column level statistic&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1jv7e8ktzasd5yepy5n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1jv7e8ktzasd5yepy5n.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also see Data lineage as well!&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuw1cn1dv81927bdbboip.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuw1cn1dv81927bdbboip.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Enough for Data exploration. The purpose of this post related to Data quality, let's take a look on the Data quality tab, you will see nothing, as you don't have a rule yet! but after you run Data profile job, it suggest you based on standard DQ such as uniqueness, completeness, and etc on the right hand side&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ii96dm1c9gk7vhkftwj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ii96dm1c9gk7vhkftwj.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Create DQ ruleset &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;You can pick what that reasonable to your business criteria, Again there's zero code require! I choose to check uniqueness in my Id, SSN, and length for SSN, and etc&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv06n8s2xx1tvzdxbzf14.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv06n8s2xx1tvzdxbzf14.png" alt="Image description"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Again this is recommendation from AWS DataBrew. you can create your own DQ rule later, after I have done, I just click "Create ruleset"&lt;/p&gt;

&lt;p&gt;I can add, remove, adjust, or review DQ rule I have just added&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwvy8rsd78ciqakxwtpzs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwvy8rsd78ciqakxwtpzs.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I just click "Create ruleset", I can see DQ rulesets &lt;a href="https://console.aws.amazon.com/databrew/home?region=us-east-1#rulesets" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnhhmuc0pcgvy8a45imv1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnhhmuc0pcgvy8a45imv1.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;click DQ rulesets name, in my case is "DQ Rulesets for Patient data" I should be able to review all my DQ rules.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foyq04btu6zhbe1t6j3te.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foyq04btu6zhbe1t6j3te.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Associate DQ rulesets, with Data profile job, Click &lt;a href="https://console.aws.amazon.com/databrew/home?region=us-east-1#jobs?tab=profile" rel="noopener noreferrer"&gt;here&lt;/a&gt; to see all your Data profile job&lt;/p&gt;

&lt;p&gt;For my case it's "Patients - Data Profile job, choose, and click Actions, and Edit&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbwjfvu5b6eyk9d4rjf7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbwjfvu5b6eyk9d4rjf7.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Search for "Data quality rules", and click "Apply data quality ruleset"&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fissvxe4yvygqn03bbdjl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fissvxe4yvygqn03bbdjl.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Choose your DQ Rulesets, and click Apply selected rulesets, and click Save&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7e9akxchdw04fip2p75.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7e9akxchdw04fip2p75.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdj76mjzokdoir68d6wv7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdj76mjzokdoir68d6wv7.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Re-run data profile with DQ ruleset &lt;a&gt;
&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;You choose your Profile jobs, and Click "Run job"&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xt2egkjawrglavdclu0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xt2egkjawrglavdclu0.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  View DQ dashboard &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Wait a few minutes, you should be able to see Result from Data profile job, with DQ result, based on your DQ rules!&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvwj7gdmgzul1eql52tf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvwj7gdmgzul1eql52tf.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7rjjgs1x9llhjbxvaju.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7rjjgs1x9llhjbxvaju.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My dataset has 4 pass, with 6 failed, based on My DQ rules, obviously I have to fix my data, before downstream process will consume!&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7s9gd0ul0uvx2egupzbg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7s9gd0ul0uvx2egupzbg.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;AWS Glue DataBrew is a visual data preparation tool that makes it easy to clean and normalize data using over 250 pre-built transformations, all without the need to write any code, with New feature for DQ, and &lt;a href="https://dev.to/awscommunity-asean/identifying-and-handling-personally-identifiable-information-pii-dwy-aws-glue-databrew-13ed"&gt;PII handing&lt;/a&gt; it will help you to add your own data quality rules, and PII tokenization in your automated data pipelines (AWS managed airflow, or AWS step function) to make sure your data are clean, and protected.&lt;/p&gt;

&lt;p&gt;You can think about How add to PII obfuscate, and DQ zone like this&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjezuvlc1g7sa4e4scjop.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjezuvlc1g7sa4e4scjop.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.pinimg.com%2Foriginals%2F88%2F71%2F3c%2F88713c217a98cac0d46781b38f11c019.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.pinimg.com%2Foriginals%2F88%2F71%2F3c%2F88713c217a98cac0d46781b38f11c019.gif" alt="good_stuff"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>awsthai</category>
      <category>bigdata</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Identifying and handling personally identifiable information (PII) ด้วย AWS Glue DataBrew</title>
      <dc:creator>Chatchai Komrangded (Bas)</dc:creator>
      <pubDate>Sat, 20 Nov 2021 06:33:26 +0000</pubDate>
      <link>https://forem.com/awscommunity-asean/identifying-and-handling-personally-identifiable-information-pii-dwy-aws-glue-databrew-13ed</link>
      <guid>https://forem.com/awscommunity-asean/identifying-and-handling-personally-identifiable-information-pii-dwy-aws-glue-databrew-13ed</guid>
      <description>&lt;p&gt;โพสต์นี้เราจะทดสอบการใช้ AWS Glue DataBrew ในการ handling sensitive data ที่อาจอยู่ในระบบบน AWS เช่น Amazon S3 ซึ่งกระบวนการนี้หลักการคือให้เนื้อข้อมูลไม่อยู่ในรูปปัจจุบันบางคนเรียกว่า Hasing, Encryption, or Masking/Redact ในโพสน์นี้เราจะทำการ Redact ข้อมูลแบบ One-way (เอาคืนกลับไม่ได้) และ Two-way (เราสามารถเรียกข้อมูลที่เป็น Orginal value ได้ อาจจะใช้กับปลายทางทีหลัง)&lt;/p&gt;

&lt;p&gt;สำหรับที่ใครไม่รู้จักที่ข้างบนคืออะไรสามารถอ่านได้ตามลิ้งนี้&lt;br&gt;
&lt;a href="https://aws.amazon.com/glue/features/databrew/"&gt;AWS Glue Databrew คืออะไร?&lt;/a&gt;&lt;br&gt;
&lt;a href="https://docs.aws.amazon.com/databrew/latest/dg/personal-information-protection.html"&gt;Common Technique มีอยู่กี่วิธี?&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Table Of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Architecture Diagram&lt;/li&gt;
&lt;li&gt;Pre-requisites&lt;/li&gt;
&lt;li&gt;Preparing dataset&lt;/li&gt;
&lt;li&gt;Build Transformation Receipt&lt;/li&gt;
&lt;li&gt;Obfuscation reversible&lt;/li&gt;
&lt;li&gt;Create Transformation job&lt;/li&gt;
&lt;li&gt;Query Result with Amazon Athena&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Architecture Diagram &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xpRZNurQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/19vskeq8n464w2x6oa0z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xpRZNurQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/19vskeq8n464w2x6oa0z.png" alt="arc" width="880" height="355"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Pre-requisites &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;AWS Account❗ &lt;/li&gt;
&lt;li&gt;Data ที่จะใช้ ที่ใช้ทั้งหมดโหลดจาก &lt;a href="https://synthetichealth.github.io/synthea-sample-data/downloads/10k_synthea_covid19_csv.zip"&gt;ที่นี่&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;พอ Download ข้อมูลเสร็จให้ unzip แล้วเอาเฉพาะ file patient.csv upload ขึ้น Amazon S3 ก่อนนะ&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Preparing dataset &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;เราจะเข้าไปที่ AWS Glue Databrew console &lt;a href="https://console.aws.amazon.com/databrew/home?region=us-east-1#landing"&gt;ที่นี่&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--188djcuT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wyfcw6a1cjysiamnkpkq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--188djcuT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wyfcw6a1cjysiamnkpkq.png" alt="Image description" width="880" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ด้านซ้ายมือกด &lt;a href="https://console.aws.amazon.com/databrew/home?region=us-east-1#datasets"&gt;Datasets&lt;/a&gt; เพื่อจะสร้าง Dataset สำหรับ Glue Databrew ก่อน &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wssHjMPI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tlukia8aa3cgs4e9e087.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wssHjMPI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tlukia8aa3cgs4e9e087.png" alt="Image description" width="880" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;กด &lt;a href="https://console.aws.amazon.com/databrew/home?region=us-east-1#create-dataset"&gt;Create new dataset&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m9C53GxR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s60ps64thufsqa2tawbd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m9C53GxR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s60ps64thufsqa2tawbd.png" alt="Image description" width="880" height="516"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;เลือก csv file มะกี้ที่ upload ไปซึ่งตัวอย่างข้อมูลอยู่ใน Amazon S3 แต่อาจเอาข้อมูลมาจาก Glue data catalog, Amazon Redshift, Appflow, or Snowflake ก็ได้ จากนั้นกด Create dataset&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--14RUnr4z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6fao7lcm7d5lm592lvqu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--14RUnr4z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6fao7lcm7d5lm592lvqu.png" alt="Image description" width="880" height="727"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Build Transformation Receipt &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;ในหน้า Glue Databrew console กด &lt;a href="https://console.aws.amazon.com/databrew/home?region=us-east-1#create-project"&gt;Create project&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ตั้งชื่อ Project แล้วเลือก Dataset ที่เราสร้างเมื่อกี้แล้วกด Create project&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PmydjHZ_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e5o86gmuz4q0fb9gbze3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PmydjHZ_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e5o86gmuz4q0fb9gbze3.png" alt="Image description" width="880" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;เราสามารถกำหนดจำนวน Row sampling ได้ และเราต้องกำหนด IAM role เพื่อให้ Data brew สำหรับเข้าถึง Datastore นั้นๆได้เช่น Amazon S3&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PubjBE7M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/35xl2wtcvqngf4ztw25j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PubjBE7M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/35xl2wtcvqngf4ztw25j.png" alt="Image description" width="880" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS จะ provision Glue databrew node สำหรับในการสร้าง Receipt (Transformation step ที่จะใช้ Handling PII data)&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JXSjegiN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pwbnw2kd95pp3kbt78be.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JXSjegiN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pwbnw2kd95pp3kbt78be.png" alt="Image description" width="880" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;เราจะมาลองทดสอบการใช้ Redaction function (Oneway ในการ masking ข้อมูลโดยที่เราจะไม่สามารถเอา orginal กลับมาได้) เราเลือก Data masking แล้วเลือก Technique Redaction &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_3ouSynL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qi3k843wyy750xfwot0s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_3ouSynL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qi3k843wyy750xfwot0s.png" alt="Image description" width="880" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;เราเลือก SSN column มาทดสอบในการทำ Redaction เรากำหนดได้ว่าจะ Redact กี่ค่าเช่น First N, Last N ในที่นี่เราเลือกเป็น full character&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NYAaITeP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a83d4k5a6egrylevcswj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NYAaITeP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a83d4k5a6egrylevcswj.png" alt="Image description" width="880" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;เลือกเป็น Full string value, และ Apply กับทุก Rows&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3GvfxTzm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1xt6oy7btx80xd1eafyr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3GvfxTzm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1xt6oy7btx80xd1eafyr.png" alt="Image description" width="400" height="590"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Glue DataBrew สามารถให้เราเข้าใจข้อมูลที่จะเปลี่ยนไปเทียบกับ Original value จริงผ่าน Preview feature (Nice!!)&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IGMDTxn4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6yotvzcpqb5vi3rq51h6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IGMDTxn4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6yotvzcpqb5vi3rq51h6.png" alt="Image description" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;หลังจาก Ok แล้วให้กด Apply สิ! เราจะเห็น receipt อยู่ด้านขวาก็คือชุด Transformation ที่ AWS Glue databrew จะรันให้เรานั่นเอง&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LhRc53z7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gfzz4qfpix9nv8ey7m2o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LhRc53z7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gfzz4qfpix9nv8ey7m2o.png" alt="Image description" width="880" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ทีนี้เรามาทดสอบการทำ obfuscation reversible ในบางกรณีแทนที่จะ masking. เราอาจต้องการแค่ encrypt column values เมื่อไหร่ก็ต้องที่เราอาจจะใช้ข้อมูลนั้น เราสามารถ decrypt เพื่อนำ orginal data กลับมา เราจะใช้ Deterministic encryption ในการทำการ encrypt columns first_name, last_name&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UEtmRLZJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xuxi5c8ett55m0qvrnzo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UEtmRLZJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xuxi5c8ett55m0qvrnzo.png" alt="Image description" width="880" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;เลือก FIRST, LAST columns สำหรับ encryption&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XVQO5AVU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4cxwxz9zmpqge8w6rzb9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XVQO5AVU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4cxwxz9zmpqge8w6rzb9.png" alt="Image description" width="880" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;เลือก Secret key สำหรับการทำ encryption โดย default เราจะมี databrew!default ใน secret manager เราสามารถใช้ของเราเองก็ได้ ในที่นี้จะใช้ค่า default&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Nt_8qUIf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vqa9vr1hux3s4sr4akax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Nt_8qUIf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vqa9vr1hux3s4sr4akax.png" alt="Image description" width="382" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;เลือก Apply all rows, และ Full string&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Nf9qKa4l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xdr26koq5m33ndjp3rnn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Nf9qKa4l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xdr26koq5m33ndjp3rnn.png" alt="Image description" width="391" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;เราสามารถ Preview ได้ก่อนเพื่อเปรียบเทียบ&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lqdHJcbJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1iq5sx9kdbgqf7osh450.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lqdHJcbJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1iq5sx9kdbgqf7osh450.png" alt="Image description" width="880" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;กด Apply เราก็จะได้ผลลัพธ์ตามนี้&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xJ7Jw16B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y6bos4dm2r7s5s1l9aks.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xJ7Jw16B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y6bos4dm2r7s5s1l9aks.png" alt="Image description" width="880" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;หลังจากนั้นให้เรากด Publish Receipt&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qT5gsGI4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8an6tlc5yzbvmxr6s3y9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qT5gsGI4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8an6tlc5yzbvmxr6s3y9.png" alt="Image description" width="880" height="434"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3A7SIKm5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g4a8sr2r9vu1xf69b9ls.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3A7SIKm5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g4a8sr2r9vu1xf69b9ls.png" alt="Image description" width="603" height="548"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Obfuscation reversible &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;ในทางกลับกันถ้าเราต้องการ Decrypt ข้อมูลกลับไปใน Orginal format เราก็ใช้  Deterministic decryption โดยเลือก Secret key ตัวเดิมดังรูป เราจะได้ค่า FIRST กลับมาที่ Original format (Field ที่ highlight สีเหลือง)&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bz9EGvcu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q5aom48883tk9h1b0pjp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bz9EGvcu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q5aom48883tk9h1b0pjp.png" alt="Image description" width="880" height="385"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Create Transformation job &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;ถัดมาเราจะสร้าง Transformation job จาก Databrew receipt กดสร้าง &lt;a href="https://console.aws.amazon.com/databrew/home?region=us-east-1#create-job?type=recipe"&gt;Create job&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ตั้งชื่อ Job และเลือก Receipt ที่สร้างมะกี้&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DtI4an_E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yh9phwjc6bboy6jkws7f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DtI4an_E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yh9phwjc6bboy6jkws7f.png" alt="Image description" width="880" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;เลือก Output to Amazon S3 กด Create and Run job&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fn5A9qGc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xy97ofc9h5k415hw39so.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fn5A9qGc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xy97ofc9h5k415hw39so.png" alt="Image description" width="880" height="757"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;เราจะเห็น Glue DataBrew job running&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UhCCIcEn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lvykzeujbvaxcg1onf73.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UhCCIcEn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lvykzeujbvaxcg1onf73.png" alt="Image description" width="880" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ไปสร้าง Table ใน &lt;a href="https://aws.amazon.com/athena/?nc1=h_ls&amp;amp;whats-new-cards.sort-by=item.additionalFields.postDateTime&amp;amp;whats-new-cards.sort-order=desc"&gt;Athena&lt;/a&gt; รอเลย!&lt;/p&gt;
&lt;h3&gt;
  
  
  Query Result with Amazon Athena &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;สร้าง Athena table แล้วทดสอบ Query&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;patient_masked_encrypted_data&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="nv"&gt;`id`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`birthdate`&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`deathdate`&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`ssn`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`drivers`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`passport`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`prefix`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`first`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`last`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`suffix`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`maiden`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`marital`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`race`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`ethnicity`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`gender`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`birthplace`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`address`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`city`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`state`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`county`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`zip`&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`lat`&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`lon`&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`healthcare_expenses`&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`healthcare_coverage`&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt; 
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;STORED&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;PARQUET&lt;/span&gt;
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'s3://&amp;lt;YOUR_S3_BUCKET&amp;gt;/cleaned_data_output/'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--J5706Rb---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6z31ca0l597zvls98677.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--J5706Rb---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6z31ca0l597zvls98677.png" alt="Image description" width="880" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ข้อมูลถูก masking, or encryption จาก Athena query เรียบร้อย&lt;/p&gt;

&lt;h3&gt;
  
  
  ✅ Conclusion  &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;จบแล้ว! AWS Glue DataBrew สามารถ integrated ร่วมกับ AWS Services ตัวอืื่นในการทำ automation pipelines ได้ในการทำ obfuscate the PII data เช่น AWS Managed Apache Airflow หรือ Step function เยี่ยม! AWS Glue DataBrew เป็น serverless no code data prep เราสามารถ Scale out ตามขนาดข้อมูลที่เราต้องการได้แล้วอีกอย่างคือ &lt;a href="https://aws.amazon.com/glue/pricing/?nc1=h_ls"&gt;Pay per use model&lt;/a&gt; จ่ายเท่าที่ใช้ Technical blog &lt;a href="https://aws.amazon.com/blogs/big-data/introducing-pii-data-identification-and-handling-using-aws-glue-databrew/"&gt;สำหรับภาคภาษาอังกฤษอ่านได้ที่นี่&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YYkXzGR5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://i.pinimg.com/originals/88/71/3c/88713c217a98cac0d46781b38f11c019.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YYkXzGR5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://i.pinimg.com/originals/88/71/3c/88713c217a98cac0d46781b38f11c019.gif" alt="good_stuff" width="564" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>awsthai</category>
      <category>bigdata</category>
      <category>datascience</category>
    </item>
    <item>
      <title>ทดสอบทำ Machine Learning predict customer churn โดยใช้งาน Amazon SageMaker กับ Snowflake!</title>
      <dc:creator>Chatchai Komrangded (Bas)</dc:creator>
      <pubDate>Fri, 17 Sep 2021 16:49:50 +0000</pubDate>
      <link>https://forem.com/awscommunity-asean/machine-learning-predict-customer-churn-amazon-sagemaker-snowflake-26bn</link>
      <guid>https://forem.com/awscommunity-asean/machine-learning-predict-customer-churn-amazon-sagemaker-snowflake-26bn</guid>
      <description>&lt;p&gt;โพสต์นี้เราจะทดสอบการใช้ Amazon SageMaker ด้วย AutoGlueon Tabular ในการ train และ inference machine learning โดยเอาข้อมูลจาก Cloud data warehouse อย่าง snowflake ซึ่ง snowflake ถือว่าเป็น AWS partner solution ที่อยู่บน &lt;a href="https://aws.amazon.com/marketplace/pp/prodview-3gdrsg3vnyjmo?sr=0-1&amp;amp;ref_=beagle&amp;amp;applicationId=AWSMPContessa" rel="noopener noreferrer"&gt;AWS marketplace&lt;/a&gt;, หลายๆคนอาจสงสัยว่าเอ๊ะ AWS ก็มี Amazon Redshift นิ? ใช่ครับ เรามีตัวอย่างเยอะเลยที่ใช้ Amazon redshift ร่วมกัย Sage Maker หาแปปเดียวก็เจอ 55 &lt;a href="https://aws.amazon.com/blogs/machine-learning/exploring-data-warehouse-tables-with-machine-learning-and-amazon-sagemaker-notebooks/" rel="noopener noreferrer"&gt;ตัวอย่าง&lt;/a&gt;, แต่เราก็จะมีบางลูกค้าในไทยที่มีการใช้ Solution ของ AWS partner เช่น snowflake คำถามที่ลูกค้าถามบ่อยๆคือจะ integrated solution รอบๆเหล่านี้เข้าไปที่ AWS บ้านหลังใหญ่ๆของเค้าได้ยังไงเช่นทำพวก ML บน Sage Maker, ก็เลยคิดว่าคงมีประโยชน์ถ้าเราทำตัวอย่างการ integration ระหว่าง Amazon Sage maker กับ Snowflake นี้กันดู! แน่นอนเหมือนเดิมเราเน้นทฏษดีพอเพียงน้ำไม่ต้องเยอะ, Let's get hand dirty!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/sagemaker/?nc1=h_ls" rel="noopener noreferrer"&gt;Sage Maker คืออะไร?&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.snowflake.com/" rel="noopener noreferrer"&gt;Snowflake คืออะไร?&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Table Of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Architecture Diagram&lt;/li&gt;
&lt;li&gt;Pre-requisites&lt;/li&gt;
&lt;li&gt;Preparing to Load Data &amp;amp; Loading Data&lt;/li&gt;
&lt;li&gt;Machine Learning Workflow in SageMaker&lt;/li&gt;
&lt;li&gt;Update the Churn Scores to Snowflake&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Architecture Diagram &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftni9wm55dlmwuqbb2swy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftni9wm55dlmwuqbb2swy.png" alt="arc"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pre-requisites &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;AWS Account❗ Sage Maker มี free tier นะ&lt;/li&gt;
&lt;li&gt;Snowflake Account❗ ทดลองเล่นได้ &lt;a href="https://signup.snowflake.com/" rel="noopener noreferrer"&gt;30 วัน&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Code ที่ใช้ทั้งหมดโหลดจาก&lt;a href="https://awsshare.ckomrangweb.com/snowflake_sagemaker.zip" rel="noopener noreferrer"&gt;ที่นี่&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Preparing to Load Data &amp;amp; Loading Data &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;เราจะทำการ load ข้อมูลเข้า snowflake ก่อนโดยเพื่อจะเอาไปทำการ predict churn ใน Sage Maker อีกทีนึง หลังจากเรา login เข้ามาแล้วหน้าตา snowflake console ก็จะประมาณนี้&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4j15bmcrbze9yc7rivl5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4j15bmcrbze9yc7rivl5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;เราจะทำการเปลี่ยน Role เป็น Account Admin ก่อน ใช้คำสั่งด้านล่างใน snowflake worksheet ได้เลย&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;ACCOUNTADMIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ทำการ configure the ตัว Warehouse spec, Role, User และ Database&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--Create Warehouse for AI/ML work&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;SAGEMAKER_WH&lt;/span&gt;
  &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'XSMALL'&lt;/span&gt;
  &lt;span class="n"&gt;AUTO_SUSPEND&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;
  &lt;span class="n"&gt;AUTO_RESUME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
  &lt;span class="n"&gt;INITIALLY_SUSPENDED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;--Create ROLE และ USER สำหรับ Sagemaker ในการ ดึงข้อมูลจาก Snowflake&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;SAGEMAKER_ROLE&lt;/span&gt; &lt;span class="k"&gt;COMMENT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'SageMaker Role'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;SAGEMAKER_WH&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;SAGEMAKER_ROLE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;SAGEMAKER_ROLE&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;SYSADMIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="n"&gt;SAGEMAKER&lt;/span&gt; &lt;span class="n"&gt;PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;YOUR_PASSWORD&amp;gt;'&lt;/span&gt; 
    &lt;span class="n"&gt;DEFAULT_ROLE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SAGEMAKER_ROLE&lt;/span&gt; 
    &lt;span class="n"&gt;DEFAULT_WAREHOUSE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SAGEMAKER_WH&lt;/span&gt;
    &lt;span class="n"&gt;DEFAULT_NAMESPACE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ML_WORKSHOP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;PUBLIC&lt;/span&gt;
    &lt;span class="k"&gt;COMMENT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'SageMaker User'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;SAGEMAKER_ROLE&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="n"&gt;SAGEMAKER&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;SYSADMIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ทำการสร้าง Databasse และ Schema และ grant access&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;ML_WORKSHOP&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;USAGE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;ML_WORKSHOP&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;SAGEMAKER_ROLE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;ML_WORKSHOP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;PUBLIC&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;SAGEMAKER_ROLE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;--สร้าง Table แล้ว load Customer data จาก Amazon S3&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;ML_WORKSHOP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;PUBLIC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;SAGEMAKER_WH&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;CUSTOMER_CHURN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Cust_ID&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;State&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;Account_Length&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Area_Code&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Phone&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;Intl_Plan&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;VMail_Plan&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;VMail_Message&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Day_Mins&lt;/span&gt; &lt;span class="nb"&gt;FLOAT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Day_Calls&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Day_Charge&lt;/span&gt;  &lt;span class="nb"&gt;FLOAT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Eve_Mins&lt;/span&gt; &lt;span class="nb"&gt;FLOAT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Eve_Calls&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Eve_Charge&lt;/span&gt; &lt;span class="nb"&gt;FLOAT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Night_Mins&lt;/span&gt; &lt;span class="nb"&gt;FLOAT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Night_Calls&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Night_Charge&lt;/span&gt; &lt;span class="nb"&gt;FLOAT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Intl_Mins&lt;/span&gt; &lt;span class="nb"&gt;FLOAT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Intl_Calls&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Intl_Charge&lt;/span&gt; &lt;span class="nb"&gt;FLOAT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;CustServ_Calls&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Churn&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;CUSTOMER_CHURN&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;SAGEMAKER_ROLE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;เนื่องจากเราจะเอา ML scoring result สุดท้ายมาโชใน snowflake ดังนั้นเราจะสร้าง Table รอไว้ก่อน!!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;ML_RESULTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Churn_IN&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Cust_ID&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Churn_Score&lt;/span&gt; &lt;span class="nb"&gt;REAL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;ML_RESULTS&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;SAGEMAKER_ROLE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure External Data Stages ใน Snowflake เพื่อใช้เป็น pointer ชี้ไปที่ Amazon S3 ซึ่งเป็นที่เก็บข้อมูลของเรา&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="n"&gt;FILE&lt;/span&gt; &lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="n"&gt;CSVHEADER&lt;/span&gt;
    &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'CSV'&lt;/span&gt;
    &lt;span class="n"&gt;FIELD_DELIMITER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;','&lt;/span&gt;
    &lt;span class="n"&gt;SKIP_HEADER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;USAGE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;FILE&lt;/span&gt; &lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="n"&gt;CSVHEADER&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;SAGEMAKER_ROLE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;    

&lt;span class="c1"&gt;--Snowflake เค้ามี sample dataset ตัวนี้ไว้ให้ละ เราก็โหลดมาเลย&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="n"&gt;STAGE&lt;/span&gt; &lt;span class="n"&gt;CHURN_DATA&lt;/span&gt;
  &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'s3://snowflake-corp-se-workshop/sagemaker-snowflake-devdays-v1.5/sourcedata/'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;--Grant Stage Object Piviliges ไปที่ SAGEMAKER_ROLE&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;USAGE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;STAGE&lt;/span&gt; &lt;span class="n"&gt;CHURN_DATA&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;SAGEMAKER_ROLE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;--เราสามารถดู files ที่อยู่ใน external stage ได้List and view the files in the external stage&lt;/span&gt;
&lt;span class="n"&gt;LIST&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;CHURN_DATA&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;--สามารถลอง Query ดูได้&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;CHURN_DATA&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;--โหลด Data เข้า Snowflake!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--เปลี่ยนมาใช้ SAGEMAKER_ROLE&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;SAGEMAKER_ROLE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;--โหลดข้อมูลจาก External Stage ไป Snowflake table&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;CUSTOMER_CHURN&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;CHURN_DATA&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;FILE_FORMAT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FORMAT_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CSVHEADER&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;--ทดสอบดึงข้อมูลดู!&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;CUSTOMER_CHURN&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Machine Learning Workflow in SageMaker &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;เราจะใช้ตัว AutoGluon-Tabular บน &lt;a href="https://aws.amazon.com/marketplace/pp/prodview-n4zf5pmjt7ism?sr=0-1&amp;amp;ref_=beagle&amp;amp;applicationId=AWSMPContessa" rel="noopener noreferrer"&gt;AWS marketeplace&lt;/a&gt; เพื่อช่วยประหยัดเวลาในการ handling พวก feature engineering ต่างๆเช่น missing data, feature transformation, model selection, algo selection, hyper parameter selection และ หรือ ensembling multiple models&lt;/p&gt;

&lt;p&gt;ถ้าจะใช้ code ตามตัวอย่างกด Continue to Subscibe (แต่ถ้าใครจะใช้ Built in sage maker algo อย่าง XGBoost หรือ Open source framework เองก็ไม่ต้องกดนะจ้า)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4g170iymbqti6r9tlhsf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4g170iymbqti6r9tlhsf.png" alt="sub"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;เราต้อง Provision SagaMaker notebook ก่อนนะ ถ้าก็กดเข้าไป ทีนี้เราก็จะเจอ &lt;a href="https://ap-southeast-1.console.aws.amazon.com/sagemaker/home?region=ap-southeast-1#/notebook-instances" rel="noopener noreferrer"&gt;notebook&lt;/a&gt; แล้ว open jupyer lab ก็จะได้ภาพแบบข้างล่าง Notebook ที่คุ้นเคย&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ksaqliy6101czfaihn6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ksaqliy6101czfaihn6.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;เราจะมาไล่ในส่วนแรกก่อน BAU, import lib ที่จะใช้ใข้ boto3 sdk สร้าง sagemaker session, สร้าง ตัวแปร region และ iam role&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;IPython.display&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;display&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gmtime&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sagemaker&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sagemaker&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AlgorithmEstimator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_execution_role&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sagemaker.predictor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RealTimePredictor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;csv_serializer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StringDeserializer&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ConfusionMatrixDisplay&lt;/span&gt;

&lt;span class="n"&gt;sess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sagemaker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_execution_role&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;region_name&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IAM role ARN: {}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;YOUR_BUCKET_NAME&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;prefix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;churn-analytics&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Access ข้อมูลใน snowflake ผ่าน snowflake connector&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;snowflake.connector&lt;/span&gt;
&lt;span class="c1"&gt;# Connecting to Snowflake using the default authenticator
&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;snowflake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sagemaker&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;YOUR_PASSWORD&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;account&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;snowflake_account&amp;gt;.&amp;lt;aws_region&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;warehouse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SAGEMAKER_WH&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ML_WORKSHOP&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PUBLIC&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ในเชิง practice, security standard เราไม่ควรเอา credentials เป็น clear text ฝังในนี้!!!, best practices เราควรเก็บ secret credential สักที่นึงอย่าง &lt;a href="https://aws.amazon.com/th/secrets-manager/" rel="noopener noreferrer"&gt;AWS Secrets Manager&lt;/a&gt; ช่วยท่านได้&lt;/p&gt;

&lt;p&gt;ดึงข้อมูลใน snowflake ดูหน่อย&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;allrows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;select Cust_ID,STATE,ACCOUNT_LENGTH,AREA_CODE,PHONE,INTL_PLAN,VMAIL_PLAN,VMAIL_MESSAGE,
                   DAY_MINS,DAY_CALLS,DAY_CHARGE,EVE_MINS,EVE_CALLS,EVE_CHARGE,NIGHT_MINS,NIGHT_CALLS,
                   NIGHT_CHARGE,INTL_MINS,INTL_CALLS,INTL_CHARGE,CUSTSERV_CALLS,
                   CHURN from CUSTOMER_CHURN &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;churn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;allrows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;churn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Cust_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;State&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Account Length&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Area Code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Phone&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Intl Plan&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;VMail Plan&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;VMail Message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Day Mins&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Day Calls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Day Charge&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Eve Mins&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Eve Calls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Eve Charge&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Night Mins&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Night Calls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Night Charge&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Intl Mins&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Intl Calls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Intl Charge&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CustServ Calls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Churn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;display.max_columns&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# Make sure we can see all of the columns
&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;display.max_rows&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;         &lt;span class="c1"&gt;# Keep the output on one page
&lt;/span&gt;&lt;span class="n"&gt;churn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;เตรียม Dataset สำหรับเอาไป Train!!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;churn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;churn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Phone&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;churn&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Area Code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;churn&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Area Code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;churn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;churn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Day Charge&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Eve Charge&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Night Charge&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Intl Charge&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;to_split_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;churn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Cust_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_split_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frac&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1729&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_split_data&lt;/span&gt;&lt;span class="p"&gt;))])&lt;/span&gt;
&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;train.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;display.max_columns&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;display.width&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Get AutoGlue on container image ARN&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;AUTOGLUON_PRODUCT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;autogluon-tabular-v2-a11d018f6028f8192e60553704ee3d97&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_algorithm_arn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;algo_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;acct_mapping&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ap-northeast-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;977537786026&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ap-northeast-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;745090734665&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ap-southeast-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;192199979996&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ap-southeast-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;666831318237&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;865070037744&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eu-central-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;   &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;446921602837&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ap-south-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;     &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;077584701553&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sa-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;270155090741&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ca-central-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;   &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;470592106596&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eu-west-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;985815980388&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eu-west-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;856760150666&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eu-west-3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;843114510376&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eu-north-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;     &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;136758871317&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-west-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;382657785993&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;057799348421&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-west-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;594846645681&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:sagemaker:{}:{}:algorithm/{}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;acct_mapping&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;algo_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;algorithm_arn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_algorithm_arn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AUTOGLUON_PRODUCT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The Tabular AutoGluon ARN in your region is {}.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;algorithm_arn&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;เมื่อเรา subscribed มาแล้วก็จะให้อารมณ์คล้ายกับ SageMaker built-in algorithm, ซึ่งเรียกได้ว่า deployed กันแบบ "low-to-no-code" กันไปเลย&lt;/p&gt;

&lt;p&gt;มี 3 ค่าที่ต้องคำนึง &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Hyperparamters: AutoML algorithms like AutoGluon ถูกออกมาให้ automated hpo, คือการ set คือ optional, แต่เราสามารถ override ได้! เราแค่ต้องกำหนด target column แค่นั้นอะ&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Infrastructure: ใน SageMaker เราจะใช้ remote training instance ข้างนอกซึ่งเรากำหนด size กับจำนวนตามใจชอบ SageMaker จะ provision, และ de-provision ให้เรา auto (อย่าใช้ notebook นี้ train นะ❗ นั่นคือการใช้ที่ผิด) &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ข้อมูลที่จะใช้ train เราควรเอาข้อมูลที่ train เก็บไว้ใน Amazon S3 นะ! สำหรับการรองรับ Scale training ใหญ่ๆในอนาคต&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;hyperparameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;#"hyperparameters": {
&lt;/span&gt;    &lt;span class="c1"&gt;#    "NN":{"num_epochs": "1"}
&lt;/span&gt;    &lt;span class="c1"&gt;#},
&lt;/span&gt;    &lt;span class="c1"&gt;#"auto_stack": "True",
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Churn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;compatible_training_instance_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ml.m5.4xlarge&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; 
&lt;span class="n"&gt;s3_input_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sagemaker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TrainingInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://{}/{}/train&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#กำหนดค่า Configuration
&lt;/span&gt;&lt;span class="n"&gt;autogluon&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AlgorithmEstimator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;algorithm_arn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;algorithm_arn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                  &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                  &lt;span class="n"&gt;instance_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                  &lt;span class="n"&gt;instance_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;compatible_training_instance_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                  &lt;span class="n"&gt;sagemaker_session&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sess&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                  &lt;span class="n"&gt;base_job_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;autogluon&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                  &lt;span class="n"&gt;hyperparameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hyperparameters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                  &lt;span class="n"&gt;train_volume_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="c1"&gt;#สั่ง Train
&lt;/span&gt;&lt;span class="n"&gt;autogluon&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;training&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s3_input_train&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28hcnavmjcwx6lki6nvj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28hcnavmjcwx6lki6nvj.png" alt="train_result"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ทำ Batch inference คือการทำ offline สร้างตัว churn scores เพื่อดูว่าลูกค้าคนไหนที่มีโอกาสจะหนีจากเรา เตรียม Test data ก่อน&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;batch_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;churn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;batch_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;batch.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nc"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nc"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;batch/in/batch.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;upload_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;batch.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;s3uri_batch_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://{}/{}/batch/in&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Batch Transform input S3 uri: {}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3uri_batch_input&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;s3uri_batch_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://{}/{}/batch/out&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Batch Transform output S3 uri: {}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3uri_batch_output&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ทำการเรียก transformer() เพื่อสร้างและรัน batch inference process&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sagemaker.transformer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Transformer&lt;/span&gt;
&lt;span class="n"&gt;BATCH_INSTANCE_TYPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ml.c5.2xlarge&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="n"&gt;transformer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;autogluon&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;instance_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                         &lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SingleRecord&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                         &lt;span class="n"&gt;assemble_with&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Line&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                         &lt;span class="n"&gt;instance_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BATCH_INSTANCE_TYPE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                         &lt;span class="n"&gt;accept&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text/csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                         &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s3uri_batch_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;transformer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3uri_batch_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;split_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Line&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text/csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   
                      &lt;span class="n"&gt;input_filter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$[1:]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;join_source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;output_filter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$[0,-1]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;logs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ลอง produce ผลผ่าน confusion matrix ง่ายๆดู&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;batched_churn_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3uri_batch_output&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/batch.csv.out&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;usecols&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scores&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;batched_churn_scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scores&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batched_churn_scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scores&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;True.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;#batched_churn_scores['Churn'] = (churn['Churn'] == "True.").astype(int)
&lt;/span&gt;&lt;span class="n"&gt;gt_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;churn&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Churn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;True.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;results_df&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;gt_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;batched_churn_scores&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;results_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Churn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scores&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;rownames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;actual&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;colnames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;predictions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Update the Churn Scores to Snowflake &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;สุดท้ายเราจะไปโชผลลัพธ์ของ Predict scoring ผ่าน snowflake กัน! เอาข้อมูลกลับ snowflake ผ่าน write_pandas&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;snowflake.connector.pandas_tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;write_pandas&lt;/span&gt;

&lt;span class="n"&gt;results_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CHURN_IN&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CUST_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CHURN_SCORE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Write the predictions to the table named "ML_RESULTS".
&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nchunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nrows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;write_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ML_RESULTS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nrows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;กลับมาที่หน้า snowflake เราจะมาลอง Query ดู&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5a3kgipge3o5bm4bbaqb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5a3kgipge3o5bm4bbaqb.png" alt="query"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--ดู CHURN SCORES output ที่ได้มาจาก Sagemaker&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ML_RESULTS&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;--ตัวอย่างลอง identify พื้นที่ที่มี rate ที่คาดว่าลูกค้า จะ churn โดยเราจะทำการ join ผลลัพธ์จาก predict score กลับไปที่ข้อมูลลูกค้า โดยเงื่อนไขที่โอกาส churn มากกว่าหรือเท่ากับ 10 คน และ churn scoring confidence มากกว่า 75%&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;C&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;STATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;C&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CUST_ID&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ML_RESULTS&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt; &lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt;  &lt;span class="n"&gt;CUSTOMER_CHURN&lt;/span&gt; &lt;span class="k"&gt;C&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CUST_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;C&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CUST_ID&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CHURN_SCORE&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;C&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;STATE&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;C&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CUST_ID&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;C&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CUST_ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdh69rmihbgvw7n642bz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdh69rmihbgvw7n642bz.png" alt="snow_query"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;มีประมาน 17 รัฐที่ต้อง focus ว่าลูกค้าเราน่ากำลังจะ churn!&lt;/p&gt;

&lt;h3&gt;
  
  
  ✅ Conclusion  &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;จบแล้ว!&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.pinimg.com%2Foriginals%2F88%2F71%2F3c%2F88713c217a98cac0d46781b38f11c019.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.pinimg.com%2Foriginals%2F88%2F71%2F3c%2F88713c217a98cac0d46781b38f11c019.gif" alt="good_stuff"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>awsthai</category>
      <category>bigdata</category>
      <category>datascience</category>
    </item>
    <item>
      <title>ทดสอบการทำ Anonymize data in your data lake with Amazon Athena</title>
      <dc:creator>Chatchai Komrangded (Bas)</dc:creator>
      <pubDate>Tue, 31 Aug 2021 11:23:37 +0000</pubDate>
      <link>https://forem.com/awscommunity-asean/anonymize-data-in-your-data-lake-with-amazon-athena-m38</link>
      <guid>https://forem.com/awscommunity-asean/anonymize-data-in-your-data-lake-with-amazon-athena-m38</guid>
      <description>&lt;p&gt;การแชข้อมูลหรือใช้ต้อง comply ตาม regulations เวลา deal กับข้อมูลลูกค้าซึ่งอาจจะประกอบไปด้วยข้อมูลส่วนตัว (PII) ที่ส่วนมากจะถูกทำ data anonymized (การลบข้อมูลโดยไม่ระบุชื่อ) เช่นพวก PII อย่าง ชื่อ หรือ นามสกุล&lt;/p&gt;

&lt;p&gt;ในโพสต์นี้เราจะลองมาทำกระบวนการ annonymize data โดยใช้ Amazon Athena!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/athena/?whats-new-cards.sort-by=item.additionalFields.postDateTime&amp;amp;whats-new-cards.sort-order=desc" rel="noopener noreferrer"&gt;Athena คืออะไร?&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Table Of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Architecture Diagram&lt;/li&gt;
&lt;li&gt;Pre-requisites&lt;/li&gt;
&lt;li&gt;Anonymize with Athena&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Architecture Diagram &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd2908q01vomqb2.cloudfront.net%2Fb6692ea5df920cad691c20319a6fffd7a4a766b8%2F2020%2F06%2F25%2FAnonymizeManageDataLake1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd2908q01vomqb2.cloudfront.net%2Fb6692ea5df920cad691c20319a6fffd7a4a766b8%2F2020%2F06%2F25%2FAnonymizeManageDataLake1.png" alt="delta-lake"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;กระบวนการที่จะทำคือ เราจะ upload ข้อมูลเข้าไปใน Amazon S3 ทำการ annonymize data และเขียนข้อมูลลงไปใน S3 (persistent) แล้วสุดท้ายเราจะมาลองสร้าง annonymize data ให้อยู่ในรูปแบบ view&lt;/p&gt;

&lt;h3&gt;
  
  
  Pre-requisites &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;AWS Account - ❗ ดูวิธีการคิดราคาของ &lt;a href="https://aws.amazon.com/athena/pricing/" rel="noopener noreferrer"&gt;Amazon Athena&lt;/a&gt; ก่อนนะจ๊ะ..&lt;/li&gt;
&lt;li&gt;โหลดข้อมูลแล้วลุยเลย! &lt;a href="https://awsshare.ckomrangweb.com/customers.csv" rel="noopener noreferrer"&gt;ลิ้งนี้&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Anonymize with Athena &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;เราจะสร้าง tables ที่มีข้อมูลที่ถูก anonymized แล้ว แต่ก่อนที่จะทำลองคำนึงสิ่งที่เราจะเอาข้อมูลไปใช้ต่อ ตัวอย่างเช่น ถ้า columns นั้นเราคิดว่าไม่มีโอกาสหรือไม่แน่ใจว่าจะใช้หรือใน analytics process บางครั้งการที่เราละเว้นหรือไม่นำข้อมูลนั้นมาอาจจะเป็นวิธีที่ดีกว่า วิธีนึงในการทำ anonymize sensitive information คือการทำ hashing &lt;/p&gt;

&lt;p&gt;hash function คือ function ที่เราใช้ในการ map data จากขนาดไหนก็ได้ให้อยู่ในรูปแบบขนาดที่คงที่ หาอ่านเพิ่มเกี่ยวกับ &lt;a href="https://en.wikipedia.org/wiki/Hash_function" rel="noopener noreferrer"&gt;Hash function&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;เริ่มสร้าง Table โดยใช้ Athena &lt;a href="https://ap-southeast-1.console.aws.amazon.com/athena/home?region=ap-southeast-1#query" rel="noopener noreferrer"&gt;ลิ้งนี้&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`cust_data`&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nv"&gt;`customer_id`&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`prefix`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`first_name`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`middle_name`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`last_name`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`suffix`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`gender`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`dob`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`address`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="n"&gt;DELIMITED&lt;/span&gt; 
  &lt;span class="n"&gt;FIELDS&lt;/span&gt; &lt;span class="n"&gt;TERMINATED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="s1"&gt;','&lt;/span&gt; 
&lt;span class="n"&gt;STORED&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;INPUTFORMAT&lt;/span&gt; 
  &lt;span class="s1"&gt;'org.apache.hadoop.mapred.TextInputFormat'&lt;/span&gt; 
&lt;span class="n"&gt;OUTPUTFORMAT&lt;/span&gt; 
  &lt;span class="s1"&gt;'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'&lt;/span&gt;
&lt;span class="k"&gt;LOCATION&lt;/span&gt;
  &lt;span class="s1"&gt;'s3://chatkom-doc-shared/data/'&lt;/span&gt;
&lt;span class="n"&gt;TBLPROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="s1"&gt;'classification'&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'csv'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="s1"&gt;'delimiter'&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;','&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="s1"&gt;'skip.header.line.count'&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="s1"&gt;'typeOfData'&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'file'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ลอง Query ผ่าน &lt;a href="https://ap-southeast-1.console.aws.amazon.com/athena/home?region=ap-southeast-1#query" rel="noopener noreferrer"&gt;Athena&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"customers"&lt;/span&gt; &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fawsshare.ckomrangweb.com%2Fannonymize_athena.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fawsshare.ckomrangweb.com%2Fannonymize_athena.png" alt="annonymize_athena.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;สร้าง Table ที่ทำการ Anonymize data โดยใช้ Athena &lt;a href="https://ap-southeast-1.console.aws.amazon.com/athena/home?region=ap-southeast-1#query" rel="noopener noreferrer"&gt;ลิ้งนี้&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="nv"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"cust_data_anonym"&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'parquet'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;as&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_utf8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;first_name_anonym&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_utf8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;middle_name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;middle_name_anonym&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;last_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;gender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;dob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;address&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"cust_data"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ลอง Query ผ่าน &lt;a href="https://ap-southeast-1.console.aws.amazon.com/athena/home?region=ap-southeast-1#query" rel="noopener noreferrer"&gt;Athena&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"cust_data_anonym"&lt;/span&gt; &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fawsshare.ckomrangweb.com%2Fannonymize_athena_2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fawsshare.ckomrangweb.com%2Fannonymize_athena_2.png" alt="annonymize_athena.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;เราจะเห็นข้อมูลใน first_name กับ last_name ถูก Anonymize โดย hashing function&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ในกรณีเราไม่ต้องการสร้าง Table สามารถทำเป็น View ครอบได้เช่นกัน  โดยใช้ Athena &lt;a href="https://ap-southeast-1.console.aws.amazon.com/athena/home?region=ap-southeast-1#query" rel="noopener noreferrer"&gt;ลิ้งนี้&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="nv"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"v_cust_data_anonym"&lt;/span&gt;
&lt;span class="k"&gt;as&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_utf8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;first_name_anonym&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_utf8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;middle_name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;middle_name_anonym&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;last_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;gender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;dob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;address&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"cust_data"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ✅ Conclusion  &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;จบแล้ว! Hasing function เป็นทางเลือกนึงในการ Anonymizing dataset ซึ่งเป็นสิ่งเราเจอบ่อยๆก่อนที่จะ  analyze ข้อมูล ซึ่งในตัวอย่างนี้เรามีการ Anonymizing dataset ใน S3 กับ Athena hashing function &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.pinimg.com%2Foriginals%2F88%2F71%2F3c%2F88713c217a98cac0d46781b38f11c019.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.pinimg.com%2Foriginals%2F88%2F71%2F3c%2F88713c217a98cac0d46781b38f11c019.gif" alt="good_stuff"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>awsthai</category>
      <category>bigdata</category>
      <category>datascience</category>
    </item>
    <item>
      <title>เริ่มใช้งาน SQL-based INSERTS, DELETES and UPSERTS in S3 โดยใช้ AWS Glue 3.0 และ Delta Lake</title>
      <dc:creator>Chatchai Komrangded (Bas)</dc:creator>
      <pubDate>Mon, 30 Aug 2021 17:24:53 +0000</pubDate>
      <link>https://forem.com/awscommunity-asean/sql-based-inserts-deletes-and-upserts-in-s3-aws-glue-3-0-delta-lake-4daf</link>
      <guid>https://forem.com/awscommunity-asean/sql-based-inserts-deletes-and-upserts-in-s3-aws-glue-3-0-delta-lake-4daf</guid>
      <description>&lt;ul&gt;
&lt;li&gt;Credit &lt;a href="https://dev.to/awscommunity-asean/sql-based-inserts-deletes-and-upserts-in-s3-using-aws-glue-3-0-and-delta-lake-42f0#conclusion"&gt;Original English Version&lt;/a&gt; written by Kyle Data Science Specialist at Info Alchemy, It's great work!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;โพสต์นี้เราจะมาดูวิธีการใช้ Spark SQL Engines โดยใช้ AWS Glue ในการทำ UPSERTS, DELETES, และ INSERTS&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/th/glue/" rel="noopener noreferrer"&gt;AWS Glue คืออะไร?&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ในเดือนนี้ AWS ออก Glue เวอร์ชั่น 3.0! AWS Glue 3.0 สิ่งที่เพิ่มมาคือ การทำให้ Performance Apache Spark 3.1 สำหรับงานที่เป็นการประมวลผลแบบแบตช์และสตรีม เพิ่มความเร็วทั้งในส่วนการทำ ingestion, processing, และ integration​! ลองมาดู benchmark กัน&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd2908q01vomqb2.cloudfront.net%2Fb6692ea5df920cad691c20319a6fffd7a4a766b8%2F2021%2F08%2F16%2FBDB-1727-image001.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd2908q01vomqb2.cloudfront.net%2Fb6692ea5df920cad691c20319a6fffd7a4a766b8%2F2021%2F08%2F16%2FBDB-1727-image001.jpg" alt="BDB-1727-image001.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd2908q01vomqb2.cloudfront.net%2Fb6692ea5df920cad691c20319a6fffd7a4a766b8%2F2021%2F08%2F17%2FBDB-1727-image005_resized2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd2908q01vomqb2.cloudfront.net%2Fb6692ea5df920cad691c20319a6fffd7a4a766b8%2F2021%2F08%2F17%2FBDB-1727-image005_resized2.jpg" alt="BDB-1727-image005_resized2.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ได้ Thoughput เพิ่มขึ้นหลายเท่า!&lt;/p&gt;

&lt;h2&gt;
  
  
  Table Of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Architecture Diagram&lt;/li&gt;
&lt;li&gt;Pre-requisites&lt;/li&gt;
&lt;li&gt;Format to Delta&lt;/li&gt;
&lt;li&gt;Upsert&lt;/li&gt;
&lt;li&gt;Delete&lt;/li&gt;
&lt;li&gt;Insert&lt;/li&gt;
&lt;li&gt;Partitioned Data&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Architecture Diagram &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpracticaldev%2Fimage%2Ffetch%2Fs--CbQ7YnqO--%2Fc_limit%252Cf_auto%252Cfl_progressive%252Cq_auto%252Cw_880%2Fhttps%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2inq5fmmbqogt8wffjf1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpracticaldev%2Fimage%2Ffetch%2Fs--CbQ7YnqO--%2Fc_limit%252Cf_auto%252Cfl_progressive%252Cq_auto%252Cw_880%2Fhttps%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2inq5fmmbqogt8wffjf1.png" alt="delta-lake"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Delta lake คืออะไร&lt;br&gt;
Storage layer open source project ที่อำนวยความสะดวกให้เราสามารถสร้าง Lakehouse architecture บน data lake โดยคุณสมบัติประกอบไปด้วย รองรับ ACID transactions (คล้าย RDBMS), การจัดการ metadata, และสามารถสร้าง streaming, หรือ batch processing บนพวก data lakes storage ต่างๆได้เช่น Amazon S3&lt;/p&gt;

&lt;p&gt;รายละเอียดเพิ่มเติมที่นี่&lt;br&gt;
&lt;a href="https://docs.delta.io/latest/delta-intro.html" rel="noopener noreferrer"&gt;https://docs.delta.io/latest/delta-intro.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ทฏษดีพอละ! ลอง Build เล่นกันดู! เราจะสร้าง flow ง่ายๆคือจะเอาข้อมูลใน csv file มาแล้ว upload ไปที่ S3 bucket แล้วจะ process โดยใช้ AWS Glue! หลังจากนั้นจะ Query ข้อมูลผ่าน Amazon Athena&lt;/p&gt;

&lt;h3&gt;
  
  
  Pre-requisites &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;โหลด the Delta Lake package &lt;a href="https://mvnrepository.com/artifact/io.delta/delta-core_2.12/1.0.0" rel="noopener noreferrer"&gt;ลิ้งนี้&lt;/a&gt; - หา file ในตารางที่ลงท้ายด้วย .jar &lt;/li&gt;
&lt;li&gt;AWS Account - ❗ Glue ETL ไม่อยู่ใน free tier นะจ๊ะ..&lt;/li&gt;
&lt;li&gt;โหลดข้อมูลก่อนเลย! &lt;a href="https://github.com/klescosia/aws-glue-delta-lake/tree/main/data" rel="noopener noreferrer"&gt;ลิ้งนี้&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Codes สามารถอ้างจาก Github เจ้าของเดิมได้ &lt;a href="https://github.com/klescosia/aws-glue-delta-lake" rel="noopener noreferrer"&gt;ที่นี่&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Format to Delta Table &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;อันดับแรกเราจะ convert dataset ของเราก่อน &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;สร้าง Spark job จาก Glue &lt;a href="https://ap-southeast-1.console.aws.amazon.com/gluestudio/home?region=ap-southeast-1#/jobs" rel="noopener noreferrer"&gt;ลิ้งนี้&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;เลือก Spark script editor&lt;/li&gt;
&lt;li&gt;เลือก Create a new script with boilerplate code&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="c1"&gt;# Import the packages
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Spark Session along with configs for Delta Lake
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Read Source
&lt;/span&gt;&lt;span class="n"&gt;inputDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://delta-lake-aws-glue-demo/raw/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write data as a DELTA TABLE
&lt;/span&gt;&lt;span class="n"&gt;inputDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-aws-glue-demo/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read Source
&lt;/span&gt;&lt;span class="n"&gt;updatesDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://delta-lake-aws-glue-demo/updates/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write data as a DELTA TABLE
&lt;/span&gt;&lt;span class="n"&gt;updatesDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-aws-glue-demo/updates_delta/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate MANIFEST file for Athena/Catalog
&lt;/span&gt;&lt;span class="n"&gt;deltaTable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-aws-glue-demo/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symlink_format_manifest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;หลักๆคือการ converts ตัว data เป็น delta format ซึ่งทำ 2 datasets ที่เป็นข้อมูลตั้งต้นกับข้อมูลที่จะใช้ updates. หลังจาก convert เราจะมีการสร้าง SYMLINK MANIFEST file ซึ่งเราจะใช้ในการดูผ่าน Athena&lt;/p&gt;

&lt;p&gt;ลองสร้าง Table ใน &lt;a href="https://ap-southeast-1.console.aws.amazon.com/athena/home?region=ap-southeast-1#query" rel="noopener noreferrer"&gt;Athena&lt;/a&gt; บน Delta Lake storage&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;superstore&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; 
    &lt;span class="n"&gt;row_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ship_date&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ship_mode&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;segment&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;state&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;postal_code&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sub_category&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;discount&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;profit&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="n"&gt;SERDE&lt;/span&gt; &lt;span class="s1"&gt;'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'&lt;/span&gt; 
&lt;span class="n"&gt;STORED&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;INPUTFORMAT&lt;/span&gt; &lt;span class="s1"&gt;'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'&lt;/span&gt;
&lt;span class="n"&gt;OUTPUTFORMAT&lt;/span&gt; &lt;span class="s1"&gt;'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'&lt;/span&gt; 
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'s3://delta-lake-aws-glue-demo/current/_symlink_format_manifest/'&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;ลอง Query ผ่าน &lt;a href="https://ap-southeast-1.console.aws.amazon.com/athena/home?region=ap-southeast-1#query" rel="noopener noreferrer"&gt;Athena&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;superstore&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpracticaldev%2Fimage%2Ffetch%2Fs--_BTkt_1l--%2Fc_limit%252Cf_auto%252Cfl_progressive%252Cq_auto%252Cw_880%2Fhttps%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpyc4kt35cqfrtwhi626.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpracticaldev%2Fimage%2Ffetch%2Fs--_BTkt_1l--%2Fc_limit%252Cf_auto%252Cfl_progressive%252Cq_auto%252Cw_880%2Fhttps%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpyc4kt35cqfrtwhi626.PNG" alt="athena"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Upsert &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Upsert คือกระบวนการในการ inserts rows เข้าไปใน db table ถ้าข้อมูลนั้นยังไม่มี หรือ update ข้อมูลถ้าข้อมูลนั้นมีแล้ว ในตัวอย่างนี้เราจะ update ค่าบาง rows ของ ship_mode, customer_name, sales, and profit columns&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;สร้าง Spark job จาก Glue &lt;a href="https://ap-southeast-1.console.aws.amazon.com/gluestudio/home?region=ap-southeast-1#/jobs" rel="noopener noreferrer"&gt;ลิ้งนี้&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;เลือก Spark script editor&lt;/li&gt;
&lt;li&gt;เลือก Create a new script with boilerplate code&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="c1"&gt;# Import as always
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Spark Session along with configs for Delta Lake
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="n"&gt;updateDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;

MERGE INTO delta.`s3a://delta-lake-aws-glue-demo/current/` as superstore
USING delta.`s3a://delta-lake-aws-glue-demo/updates_delta/` as updates
ON superstore.row_id = updates.row_id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED
  THEN INSERT *
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate MANIFEST file for Athena/Catalog
&lt;/span&gt;&lt;span class="n"&gt;deltaTable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-aws-glue-demo/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symlink_format_manifest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;คำสั่งด้านบนนั้นเราะทำการ updates ข้อมูลบน table หลัก (superstore) ถ้ามีการพบข้อมูลใน updates table (updates) จากการ matching ค่าใน row_id ถ้า row_id ตรงกันจะทำการ UPDATE ALL. ถ้าไม่มีจะทำการ INSERT ALL &lt;/p&gt;

&lt;p&gt;สามารถดูรายละเอียดได้ &lt;a href="https://docs.delta.io/latest/delta-update.html#operation-semantics" rel="noopener noreferrer"&gt;ที่นี่&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Delete &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;ลองทำการลบข้อมูลกันมั่ง! Delete จะค่อนข้างตรงไปตรงมา..&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;สร้าง Spark job จาก Glue &lt;a href="https://ap-southeast-1.console.aws.amazon.com/gluestudio/home?region=ap-southeast-1#/jobs" rel="noopener noreferrer"&gt;ลิ้งนี้&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;เลือก Spark script editor&lt;/li&gt;
&lt;li&gt;เลือก Create a new script with boilerplate code&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;


&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="n"&gt;deleteDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
DELETE 
FROM delta.`s3a://delta-lake-aws-glue-demo/current/` as superstore 
WHERE CAST(superstore.row_id as integer) &amp;lt;= 20
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate MANIFEST file for Athena/Catalog
&lt;/span&gt;&lt;span class="n"&gt;deltaTable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-aws-glue-demo/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symlink_format_manifest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;หลักๆ code ด้านบนคือลบข้อมูลโดย row_id&lt;/p&gt;

&lt;p&gt;ลอง Query ผ่าน &lt;a href="https://ap-southeast-1.console.aws.amazon.com/athena/home?region=ap-southeast-1#query" rel="noopener noreferrer"&gt;Athena&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"superstore"&lt;/span&gt; 
&lt;span class="c1"&gt;-- Need to CAST hehe bec it is currently a STRING&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row_id&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpracticaldev%2Fimage%2Ffetch%2Fs--3jmq_me1--%2Fc_limit%252Cf_auto%252Cfl_progressive%252Cq_auto%252Cw_880%2Fhttps%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pm4a9w651ne1623k34p.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpracticaldev%2Fimage%2Ffetch%2Fs--3jmq_me1--%2Fc_limit%252Cf_auto%252Cfl_progressive%252Cq_auto%252Cw_880%2Fhttps%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pm4a9w651ne1623k34p.PNG" alt="Athena_output"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;เราจะเห็น records หายไปละ Data Lake กลายเป็น Delta จริงๆ!&lt;/p&gt;

&lt;h3&gt;
  
  
  Insert &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;คล้ายๆ Deletes, Inserts ค่อนข้างตรงไปตรงมา มาลองดู&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;สร้าง Spark job จาก Glue &lt;a href="https://ap-southeast-1.console.aws.amazon.com/gluestudio/home?region=ap-southeast-1#/jobs" rel="noopener noreferrer"&gt;ลิ้งนี้&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;เลือก Spark script editor&lt;/li&gt;
&lt;li&gt;เลือก Create a new script with boilerplate code&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;


&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="n"&gt;insertDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
INSERT INTO delta.`s3a://delta-lake-aws-glue-demo/current/`
SELECT *
FROM delta.`s3a://delta-lake-aws-glue-demo/updates_delta/`
WHERE CAST(row_id as integer) &amp;lt;= 20
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate MANIFEST file for Athena/Catalog
&lt;/span&gt;&lt;span class="n"&gt;deltaTable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-aws-glue-demo/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symlink_format_manifest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Partitioned Data  &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;ทำไปหมดเกือบทุกกระบวนท่า แต่นี่เป็นแค่ dataset ง่ายๆ ซึ่งชีวิตจริงเราต้องคำนึงถึงการทำ parition ด้วยนะ! ทำไมต้องทำหละ คิดดูข้อมูลเราใหญ่นะ เราต้องหั่นไปชิ้นๆเพื่อเลือกดึงเฉพาะข้อมูลที่ใช้สิ, การทำ parition นี่จะช่วยเรา 2 เรื่องใหญ่ๆเลยคือ Performance กับ Cost กับบาง services ที่ใช้ในการ Scan data อย่าง Athena&lt;/p&gt;

&lt;p&gt;เรามีการแก้ไขบางอย่างใน Spark code ในการ write data เป็น parition ทำให้โครงสร้างการเก็บข้อมูลใน S3 เป็นแบบนี้&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpracticaldev%2Fimage%2Ffetch%2Fs--zMNRMxvw--%2Fc_limit%252Cf_auto%252Cfl_progressive%252Cq_auto%252Cw_880%2Fhttps%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma6m9ykdzzuzjl7cdr0g.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpracticaldev%2Fimage%2Ffetch%2Fs--zMNRMxvw--%2Fc_limit%252Cf_auto%252Cfl_progressive%252Cq_auto%252Cw_880%2Fhttps%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma6m9ykdzzuzjl7cdr0g.PNG" alt="s3_structure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;แนวคิดของ Delta Lake ขึ้นกับ log history โดยตัว Delta lake จะมีการสร้าง delta logs ทุกๆการ committed transactions&lt;/p&gt;

&lt;p&gt;โดย Delta logs นั้นจะมี delta files ซึ่งเก็บเป็น JSON ซึ่งในนั้นเก็บข้อมูลของการทำที่เกิดขึ้นและรายละเอียดของ snapshpt file ล่าสุด และมีการเก็บข้อมูลเกี่ยวกับ statistics ของข้อมูลเช่นกัน &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;จาก &lt;a href="https://datafloq.com/read/understand-the-fundamentals-of-delta-lake-concept/7610#:~:text=How%20does%20it%20works%3F,existing%20cloud%20storage%20data%20lake.&amp;amp;text=When%20you%20store%20data%20as,logs%20for%20each%20committed%20transactions." rel="noopener noreferrer"&gt;Data Floq&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;เช่นตัวอย่างข้างล่าง&lt;/p&gt;
&lt;h5&gt;
  
  
  raw date_part=2014-08-27/
&lt;/h5&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpracticaldev%2Fimage%2Ffetch%2Fs--8_n1vYW5--%2Fc_limit%252Cf_auto%252Cfl_progressive%252Cq_auto%252Cw_880%2Fhttps%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi081l8rb7l2x1c7jxx24.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpracticaldev%2Fimage%2Ffetch%2Fs--8_n1vYW5--%2Fc_limit%252Cf_auto%252Cfl_progressive%252Cq_auto%252Cw_880%2Fhttps%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi081l8rb7l2x1c7jxx24.PNG" alt="date_part=2014-08-27"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h5&gt;
  
  
  current date_part=2014-08-27/ - DELETED ROWS
&lt;/h5&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpracticaldev%2Fimage%2Ffetch%2Fs--8_n1vYW5--%2Fc_limit%252Cf_auto%252Cfl_progressive%252Cq_auto%252Cw_880%2Fhttps%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi081l8rb7l2x1c7jxx24.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpracticaldev%2Fimage%2Ffetch%2Fs--8_n1vYW5--%2Fc_limit%252Cf_auto%252Cfl_progressive%252Cq_auto%252Cw_880%2Fhttps%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi081l8rb7l2x1c7jxx24.PNG" alt="date_part=2014-08-27_deleted"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ถ้าเราลองเปิด parquet files ดู&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpracticaldev%2Fimage%2Ffetch%2Fs--SLi2g-hQ--%2Fc_limit%252Cf_auto%252Cfl_progressive%252Cq_auto%252Cw_880%2Fhttps%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84h396lvhqgpyx4uexmn.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpracticaldev%2Fimage%2Ffetch%2Fs--SLi2g-hQ--%2Fc_limit%252Cf_auto%252Cfl_progressive%252Cq_auto%252Cw_880%2Fhttps%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84h396lvhqgpyx4uexmn.PNG" alt="open_parquet"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;เราจะเห็นว่า code เราทำให้เกิดการเขียน parquet file ใหม่ระหว่างการ delete โดย row_id ตั้งแต่ 21 ขึ้นไปไม่ได้ถูกลบตามเงื่อนไข หลังจากนั้น JSON file ก็จะทำการ map เข้ากับ file ที่เพิ่งถูกสร้างขึ้นมาใหม่ในการทำ &lt;a href="https://docs.delta.io/latest/quick-start.html#-read-older-versions-of-data-using-time-travel" rel="noopener noreferrer"&gt;versioning&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;เราสามารถสร้าง Table ที่มีการทำ partition แล้วใน &lt;a href="https://ap-southeast-1.console.aws.amazon.com/athena/home?region=ap-southeast-1#query" rel="noopener noreferrer"&gt;Athena&lt;/a&gt; ตามคำสั่งด้านล่าง  &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;superstore&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; 
    &lt;span class="n"&gt;row_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ship_date&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ship_mode&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;segment&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;state&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;postal_code&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sub_category&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;discount&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;profit&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;date_part&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;

&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;-- Add PARTITIONED BY option&lt;/span&gt;
&lt;span class="n"&gt;PARTITIONED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_part&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="n"&gt;SERDE&lt;/span&gt; &lt;span class="s1"&gt;'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'&lt;/span&gt; 
&lt;span class="n"&gt;STORED&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;INPUTFORMAT&lt;/span&gt; &lt;span class="s1"&gt;'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'&lt;/span&gt;
&lt;span class="n"&gt;OUTPUTFORMAT&lt;/span&gt; &lt;span class="s1"&gt;'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'&lt;/span&gt; 
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'s3://delta-lake-aws-glue-demo/current/_symlink_format_manifest/'&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;หลังจากนั้นให้รันคำสั่งด้านล่างเพื่อทำการ update metadata ใน catalog เพื่อให้เราสามารถ query ข้อมูลใน paritions ได้&lt;/p&gt;

&lt;p&gt;MSCK REPAIR &lt;/p&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; to add the partitions.
&lt;h3&gt;
  
  
  ✅ Conclusion  &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;จบแล้ว!&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.pinimg.com%2Foriginals%2F88%2F71%2F3c%2F88713c217a98cac0d46781b38f11c019.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.pinimg.com%2Foriginals%2F88%2F71%2F3c%2F88713c217a98cac0d46781b38f11c019.gif" alt="good_stuff"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;/table&gt;&lt;/div&gt;

</description>
      <category>aws</category>
      <category>awsthai</category>
      <category>bigdata</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
