<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Aparna Aravind</title>
    <description>The latest articles on Forem by Aparna Aravind (@aparnasaravind).</description>
    <link>https://forem.com/aparnasaravind</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F978811%2Fcdd22a28-bb43-4063-8b5c-0dfe448686ce.png</url>
      <title>Forem: Aparna Aravind</title>
      <link>https://forem.com/aparnasaravind</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/aparnasaravind"/>
    <language>en</language>
    <item>
      <title>Snowflake AWS Lambda Integration</title>
      <dc:creator>Aparna Aravind</dc:creator>
      <pubDate>Fri, 25 Nov 2022 05:49:33 +0000</pubDate>
      <link>https://forem.com/aparnasaravind/snowflake-aws-lambda-integration-44ka</link>
      <guid>https://forem.com/aparnasaravind/snowflake-aws-lambda-integration-44ka</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ty0Z6aZs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ix6kp0j0n5k6sjbjfh8k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ty0Z6aZs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ix6kp0j0n5k6sjbjfh8k.png" alt="Image description" width="519" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post, I will walk you through the steps for Snowflake-AWS Lambda integration.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“AWS Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let's learn it with a sample use case: read from a Snowflake table where a data engineer has captured anomalies from that day’s load insights, and send the report to the required recipients. From a technical perspective, we will go through the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Snowflake-AWS Lambda integration: using the Snowflake Python connector in AWS Lambda. Since AWS Lambda does not ship the Snowflake connector out of the box, we need to add it explicitly as a layer.&lt;/li&gt;
&lt;li&gt; Sending notifications from the Snowflake table inference: send the table content to an email using SNS. The contents are formatted with an external Python library called “tabulate”, which also needs to be added explicitly as a Lambda layer.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Sample use case&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;em&gt;Steps&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;I. Create a lambda function&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Create a Lambda function with Python 3.6 as the runtime. Please choose the correct execution role.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--reWlKViL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297688929/5FFFjL_Ucl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--reWlKViL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297688929/5FFFjL_Ucl.png" alt="" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Go to the Configuration tab and increase the timeout to ~4 minutes, as the default is 3 seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ej3XKjM7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297690213/9_2gmKgaJ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ej3XKjM7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297690213/9_2gmKgaJ.png" alt="" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zsl8qzC4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297691914/417A0eSqC.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zsl8qzC4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297691914/417A0eSqC.png" alt="" width="655" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  If you want to launch the function in a VPC, please configure it accordingly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t51UYdLs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297693394/T6M4siTgu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t51UYdLs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297693394/T6M4siTgu.png" alt="" width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;II. Create zip archive for snowflake dependency and add Lambda layer&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As Lambda does not have the Snowflake dependencies installed out of the box, add them using AWS Lambda layers.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“A Lambda layer is a .zip file archive that can contain additional code or data. A layer can contain libraries, a&lt;/em&gt; &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/runtimes-custom.html"&gt;&lt;em&gt;custom runtime&lt;/em&gt;&lt;/a&gt;&lt;em&gt;, data, or configuration files. Layers promote code sharing and separation of responsibilities so that you can iterate faster on writing business logic.”&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Spin up an EC2 t2.micro instance with an Ubuntu/Linux OS to avoid dependency issues, since Lambda runs on Linux. I have used the following EC2 AMI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SONqwq6V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297694882/pRLJI4X69.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SONqwq6V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297694882/pRLJI4X69.png" alt="" width="800" height="72"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**\## Install Prerequisites  
~$** sudo apt update**~$** sudo apt-get install python3-pip**~$** sudo apt install zip  
**~$** sudo apt install awscli  
**~$** sudo apt-get install -y libssl-dev libffi-dev

**\## Install snowflake dependency  
~$** sudo apt install virtualenv  
**~$** mkdir sf\_lambda; cd sf\_lambda  
**~/sf\_lambda$** virtualenv v-env --python=python3;source v-env/bin/activate  
**~/sf\_lambda$** cd v-env/lib/python3.6/site-packages/  
**~/sf\_lambda/v-env/lib/python3.6/site-packages$** pip3 install -r [https://raw.githubusercontent.com/snowflakedb/snowflake-connector-python/v2.3.10/tested\_requirements/requirements\_36.reqs](https://raw.githubusercontent.com/snowflakedb/snowflake-connector-python/v2.3.10/tested_requirements/requirements_36.reqs) -t .  
**~/sf\_lambda/v-env/lib/python3.6/site-packages$** pip3 install snowflake-connector-python==2.3.10 -t .  
**~/sf\_lambda/v-env/lib/python3.6/site-packages$** chmod -R 755 .  
**~/sf\_lambda/v-env/lib/python3.6/site-packages$** deactivate

**\## Create zip archive  
~/sf\_lambda/v-env/lib/python3.6/site-packages**$ cd ../../../../  
**~/sf\_lambda$** mkdir python;cd python/  
**~/sf\_lambda/python$** cp -r ../v-env/lib/python3.6/site-packages/\* .  
**~/sf\_lambda/python$** cd ..  
**~/sf\_lambda$** zip -r sf\_lambda.zip python

**\## Upload zip file in S3  
~/sf\_lambda$** aws s3 cp sf\_lambda.zip s3://&amp;lt;bucket&amp;gt;/&amp;lt;prefix&amp;gt;/ --profile &amp;lt;profile-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  Please note that in the step
pip3 install -r &lt;a href="https://raw.githubusercontent.com/snowflakedb/snowflake-connector-python/v2.3.10/tested_requirements/requirements_36.reqs"&gt;https://raw.githubusercontent.com/snowflakedb/snowflake-connector-python/v2.3.10/tested_requirements/requirements_&lt;strong&gt;36&lt;/strong&gt;.reqs&lt;/a&gt; -t .
the &lt;strong&gt;36&lt;/strong&gt; depends on your Python version: it becomes &lt;strong&gt;35&lt;/strong&gt; for Python 3.5, &lt;strong&gt;38&lt;/strong&gt; for Python 3.8, and so on.&lt;/li&gt;
&lt;li&gt;  Please make sure your security group allows SSH (for connecting via PuTTY to create the zip file) and outbound TCP (to download the required Python dependencies). Also make sure that your instance is configured to access S3.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Go to the AWS Lambda console -&amp;gt; Layers -&amp;gt; Create layer. Fill in the details as shown in the screenshot below and click Create.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wuW76-_U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297696729/YfX5cWd0e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wuW76-_U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297696729/YfX5cWd0e.png" alt="" width="706" height="731"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Click the Code tab and, under Layers, click Add a layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--n49L6Jn5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297698323/FRau0yB9x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--n49L6Jn5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297698323/FRau0yB9x.png" alt="" width="800" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Copy and paste the ARN of the layer we just created, and click Add&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cuouInpz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297700034/SnImpooK6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cuouInpz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297700034/SnImpooK6.png" alt="" width="799" height="651"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;III. Create an SNS topic&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Go to SNS -&amp;gt; Topics -&amp;gt; Create topic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v86B1c2Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297701778/lWwiExxXi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v86B1c2Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297701778/lWwiExxXi.png" alt="" width="678" height="686"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Click on the topic you created, click Create subscription, and add the required email address&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QBfCOlGc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297703683/ByWCodaTv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QBfCOlGc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297703683/ByWCodaTv.png" alt="" width="800" height="68"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Confirm the subscription by clicking the link sent to the inbox of the email address given in the previous step&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Email Formatting&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  To pretty-print the table content in the email body, I have used the Python library “tabulate”. Create another Lambda layer following steps similar to &lt;em&gt;II. Create zip archive for snowflake dependency and add Lambda layer&lt;/em&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**~$** mkdir tabulate\_lambda; cd tabulate\_lambda  
**~/tabulate\_lambda$** virtualenv v-env-t --python=python3  
**~/tabulate\_lambda$** source v-env-t/bin/activate  
**~/tabulate\_lambda$** cd v-env-t/lib/python3.6/site-packages/  
**~/tabulate\_lambda/v-env-t/lib/python3.6/site-packages$** pip3 install tabulate -t .  
**~/tabulate\_lambda/v-env-t/lib/python3.6/site-packages$** chmod -R 755  
 .  
**~/tabulate\_lambda/v-env-t/lib/python3.6/site-packages$** deactivate  
**~/tabulate\_lambda/v-env-t/lib/python3.6/site-packages**$ cd ../../../../  
**~/tabulate\_lambda$** mkdir python;cd python/  
**~/tabulate\_lambda/python$** cp -r ../v-env-t/lib/python3.6/site-packages/\* .  
**~/tabulate\_lambda/python$** cd ..  
**~/tabulate\_lambda$** zip -r tabulate\_lambda.zip python  
**~/tabulate\_lambda$** aws s3 cp tabulate\_lambda.zip s3://&amp;lt;bucket&amp;gt;/&amp;lt;prefix&amp;gt;/ --profile &amp;lt;profile-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Code&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now all the ingredients are ready. As the final step, deploy the code below in the Lambda function you created, fill in your Snowflake/SNS details in the &amp;lt;&amp;gt; placeholders, and test :)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/aparnasaravind/Code_Samples/tree/master/medium/snowflake/lambda"&gt;&lt;strong&gt;Code_Samples/medium/snowflake/lambda at master · aparnasaravind/Code_Samples&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
*Code Samples - learning. Contribute to aparnasaravind/Code_Samples development by creating an account on GitHub.*github.com&lt;/a&gt;&lt;a href="https://github.com/aparnasaravind/Code_Samples/tree/master/medium/snowflake/lambda"&gt;&lt;/a&gt;&lt;/p&gt;
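&lt;p&gt;The repository above holds the full code. As a rough, untested sketch of the shape such a handler can take (the environment-variable names, the query, and the plain-text table helper standing in for tabulate are all my assumptions, not the repository's code):&lt;/p&gt;

```python
import os

def format_report(headers, rows):
    # Plain-text table renderer. The post formats the email body with the
    # "tabulate" library from the Lambda layer; this stand-in keeps the
    # sketch dependency-free.
    cols = [[str(h)] + [str(r[i]) for r in rows] for i, h in enumerate(headers)]
    widths = [max(len(v) for v in col) for col in cols]
    def line(values):
        return " | ".join(str(v).ljust(w) for v, w in zip(values, widths))
    sep = "-+-".join("-" * w for w in widths)
    return "\n".join([line(headers), sep] + [line(r) for r in rows])

def lambda_handler(event, context):
    # Imported lazily: the connector comes from the layer built above,
    # and boto3 is provided by the Lambda runtime.
    import snowflake.connector
    import boto3

    conn = snowflake.connector.connect(
        user=os.environ["SF_USER"],          # hypothetical env vars
        password=os.environ["SF_PASSWORD"],
        account=os.environ["SF_ACCOUNT"],
        warehouse=os.environ["SF_WAREHOUSE"],
    )
    try:
        cur = conn.cursor()
        # Hypothetical anomaly table; replace with your own query.
        cur.execute("SELECT table_name, anomaly FROM load_anomalies")
        headers = [c[0] for c in cur.description]
        body = format_report(headers, cur.fetchall())
    finally:
        conn.close()

    boto3.client("sns").publish(
        TopicArn=os.environ["SNS_TOPIC_ARN"],
        Subject="Daily load anomaly report",
        Message=body,
    )
    return {"status": "sent"}
```

&lt;p&gt;The imports sit inside the handler because both dependencies exist only at runtime, inside the function's execution environment.&lt;/p&gt;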

&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In this post we performed the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Created a Lambda function with the required configuration&lt;/li&gt;
&lt;li&gt; Created two zip files for the tabulate and Snowflake dependencies and added them as layers to the Lambda function&lt;/li&gt;
&lt;li&gt; Created an SNS topic and subscription, and confirmed the subscription&lt;/li&gt;
&lt;li&gt; Deployed the code in Lambda&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Thank you!&lt;/p&gt;

</description>
      <category>snowflake</category>
      <category>aws</category>
      <category>python</category>
      <category>awslambda</category>
    </item>
    <item>
      <title>Deequ for generating data quality reports</title>
      <dc:creator>Aparna Aravind</dc:creator>
      <pubDate>Fri, 25 Nov 2022 05:40:45 +0000</pubDate>
      <link>https://forem.com/aparnasaravind/deequ-for-generating-data-quality-reports-ne7</link>
      <guid>https://forem.com/aparnasaravind/deequ-for-generating-data-quality-reports-ne7</guid>
      <description>&lt;p&gt;Ensuring data quality checks are really important in data driven projects&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  To ensure data correctness for correct business decisions&lt;/li&gt;
&lt;li&gt;  To validate the data beforehand and avoid broken production pipelines&lt;/li&gt;
&lt;li&gt;  To validate data from disparate sources (FTP, data lakes, or sources other than an RDBMS) that lack schema and integrity constraints&lt;/li&gt;
&lt;li&gt;  Data quality underpins machine learning model performance&lt;/li&gt;
&lt;li&gt;  Many more…&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is a real time saver to have a tool or framework that helps ensure data quality, and several exist. One such open-source project is Deequ by AWS.&lt;/p&gt;

&lt;p&gt;In this post we will explore the &lt;strong&gt;Constraint verification&lt;/strong&gt; module of Deequ. As we progress, let's build a &lt;em&gt;dynamic verification&lt;/em&gt; module that takes the Deequ rules from a file, a store, or a dictionary.&lt;/p&gt;

&lt;p&gt;Before that, a quick refresher on AWS Deequ:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/" rel="noopener noreferrer"&gt;aws documentation&lt;/a&gt; — Deequ allows you to calculate data quality metrics on your dataset, define and verify data quality constraints, and be informed about changes in the data distribution. Instead of implementing checks and verification algorithms on your own, you can focus on describing how your data should look. Deequ supports you by suggesting checks for you. Deequ is implemented on top of &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;&lt;strong&gt;Apache Spark&lt;/strong&gt;&lt;/a&gt; and is designed to scale with large datasets (think billions of rows) that typically live in a distributed filesystem or a data warehouse.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Long story short: for anything you can fit into a Spark dataframe (S3, Snowflake, RDBMS, etc.), Deequ helps you perform data quality tests, and at scale.&lt;/p&gt;

&lt;p&gt;Deequ has four main components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Metrics computation&lt;/strong&gt; — gives statistical insight into data quality, such as completeness, correlation, uniqueness, etc.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Constraint suggestion&lt;/strong&gt; — unsure which data quality checks to run? Deequ can suggest some based on your data. Please review the suggestions and keep the ones that make sense before using them.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Constraint verification&lt;/strong&gt; — we verify the data by defining quality constraint rules, and Deequ returns the status of each check&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Metrics repository&lt;/strong&gt; — lets us store Deequ results (the metrics we computed) and later compare them with subsequent results. At the time of writing, only two repository types are supported: file and in-memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/awslabs/python-deequ" rel="noopener noreferrer"&gt;click here - aws reference&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  AWS Deequ for generating data quality reports
&lt;/h4&gt;

&lt;p&gt;The &lt;strong&gt;Constraint verification&lt;/strong&gt; module helps us generate data quality reports from a set of checks that run on top of our dataframe. Below is a usage example taken from the project's Git repository.&lt;/p&gt;

&lt;p&gt;Note — for this post I have used pydeequ, a Python wrapper around Deequ (Deequ is originally written in Scala).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import SparkSession, Row  
import pydeequ  
from pydeequ.checks import \*  
from pydeequ.verification import \*  
spark = (SparkSession  
    .builder  
    .config("spark.jars.packages", pydeequ.deequ\_maven\_coord)  
    .config("spark.jars.excludes", pydeequ.f2j\_maven\_coord)  
    .getOrCreate())  

df = spark.sparkContext.parallelize(\[  
            Row(a="foo", b=1, c=5),  
            Row(a="bar", b=2, c=6),  
            Row(a="baz", b=3, c=None)\]).toDF()  


check = Check(spark, CheckLevel.Warning, "Review Check")  

checkResult = VerificationSuite(spark) \\  
    .onData(df) \\  
    .addCheck(  
        check.hasSize(lambda x: x &amp;gt;= 3) \\  
        .hasMin("b", lambda x: x == 0) \\  
        .isComplete("c")  \\  
        .isUnique("a")  \\  
        .isContainedIn("a", \["foo", "bar", "baz"\]) \\  
        .isNonNegative("b")) \\  
    .run()  

checkResult\_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)  
checkResult\_df.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  hasSize — check that the dataframe has at least 3 rows — Success/fail&lt;/li&gt;
&lt;li&gt;  hasMin — check that the minimum value of column b is 0 — Success/fail&lt;/li&gt;
&lt;li&gt;  isComplete — check that no value of c is null — Success/fail&lt;/li&gt;
&lt;li&gt;  isUnique — check that all values of a are unique — Success/fail&lt;/li&gt;
&lt;li&gt;  isContainedIn — check that every value of a is in the given list — Success/fail&lt;/li&gt;
&lt;li&gt;  isNonNegative — check that all values of b are non-negative — Success/fail&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Details of all the available quality checks — &lt;a href="https://github.com/awslabs/python-deequ/blob/9bcc6bc69f450b5459866448ebcbc1f8d65d65a2/pydeequ/checks.py#L663" rel="noopener noreferrer"&gt;click here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqw7zdnhasnyingsiofr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqw7zdnhasnyingsiofr.png" width="800" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample Use case&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We will try the same example from the pydeequ Git repository, but pass the constraints dynamically. For simplicity, we will configure our Deequ rules in a Python dictionary; you can configure them in a file, a key-value store, or anywhere else. Please refer &lt;a href="https://github.com/aparnasaravind/Code_Samples/tree/master/medium/deequ/checks" rel="noopener noreferrer"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt; for the complete code.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Configure Deequ rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3udrw9j6klkfkrhnubxy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3udrw9j6klkfkrhnubxy.png" width="800" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Initialize Check object state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhd16unvj9enb6x04co4p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhd16unvj9enb6x04co4p.png" width="689" height="36"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Create constraints from string&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this we will leverage getattr in Python (&lt;a href="https://www.w3schools.com/python/ref_func_getattr.asp" rel="noopener noreferrer"&gt;reference&lt;/a&gt;), a built-in function that enables reflection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdi4r758cugcemg3s60y7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdi4r758cugcemg3s60y7.png" width="800" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Call VerificationSuite, which is responsible for running our checks, and pass it the dataframe the checks should run on along with the check object itself&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficua8wuleudhko8lcoxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficua8wuleudhko8lcoxx.png" width="541" height="140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Get check results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpha5olw84tnvt98zfv9l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpha5olw84tnvt98zfv9l.png" width="800" height="83"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5xi0cdx4n48soocjbmh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5xi0cdx4n48soocjbmh.png" width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Output&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this post we had a quick refresher on AWS Deequ and walked through a sample use case: building a &lt;em&gt;dynamic verification&lt;/em&gt; module that takes the Deequ rules from a file, a store, or a dictionary, which adds flexibility to our module. You can always take this to another level by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Defining appropriate actions to take based on the verification status, such as sending a notification when you see a quality breach in your data&lt;/li&gt;
&lt;li&gt;  Designing a Deequ rule repository where users can configure new rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks, hope this helps!&lt;/p&gt;

</description>
      <category>discuss</category>
    </item>
    <item>
      <title>Handling schema changes in snowflake</title>
      <dc:creator>Aparna Aravind</dc:creator>
      <pubDate>Fri, 25 Nov 2022 05:33:22 +0000</pubDate>
      <link>https://forem.com/aparnasaravind/handling-schema-changes-in-snowflake-54c0</link>
      <guid>https://forem.com/aparnasaravind/handling-schema-changes-in-snowflake-54c0</guid>
      <description>&lt;p&gt;&lt;em&gt;With spark-snowflake connector writes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this post we will observe how schema changes such as missing columns, extra columns, and data type changes behave with &lt;em&gt;spark-snowflake connector&lt;/em&gt; writes. In other words, how a schema mismatch between the Spark dataframe and the Snowflake table is handled.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aparnaaravind.hashnode.dev/spark-snowflake-writes-behind-the-scenes-34007bc490b7"&gt;Click here - Spark Snowflake Write - Behind the Scenes&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Say we have a requirement to ingest and append data from a source (files or JDBC source systems) to a landing Snowflake table.&lt;/li&gt;
&lt;li&gt;  And there are n downstream systems that consume data from these landing-zone tables.&lt;/li&gt;
&lt;li&gt;  Schema changes should not be propagated from the source to this landing Snowflake table, and the data pipeline should not fail on them either (in fact, it should ignore the schema changes).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some schema changes are not as straightforward and may need carefully thought-through data pipelines to handle them holistically. The purpose of this post, however, is to identify the &lt;strong&gt;straightforward scenarios&lt;/strong&gt; and their corresponding solutions, which should not require complicated or overengineered data pipelines. &lt;em&gt;Remember, simplification is the ultimate sophistication&lt;/em&gt; :)&lt;/p&gt;

&lt;p&gt;There are a few ways to handle simple schema changes in the sources (depending on the scenario):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; See if we can create a dataframe with the same schema as the Snowflake landing table, irrespective of the schema changes, and use that dataframe to write/append to the Snowflake table.&lt;/li&gt;
&lt;li&gt; See if we can leverage the spark-snowflake append mode with Snowflake mapping properties to handle some of the schema change scenarios.&lt;/li&gt;
&lt;li&gt; Or, by design, use landing Snowflake tables with variant columns and build a standardized layer after that.&lt;/li&gt;
&lt;li&gt; Create landing tables with all columns as varchar and cast them per the business meaning of the columns in the subsequent layers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this post, let's explore Option 2 in more detail. Hope it saves you some testing time as well.&lt;/p&gt;

&lt;p&gt;Please note that all the schema change impacts are discussed in the context of &lt;em&gt;append&lt;/em&gt; mode (as append mode doesn’t change the schema of the target Snowflake table); I have also added an extra note/reference on overwrite mode in this post.&lt;/p&gt;

&lt;p&gt;Let's first create a table with some sample data, as in the snippet below. As I want the Spark connector to create the Snowflake table for me, I am choosing overwrite mode for the first-time load.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = spark.createDataFrame(  
            \[  
                ("Aparna", "Learning",'original')  
            \],  
            \["NAME", "ACTIVITY","COMMENTS"\])  
df.write\\  
    .format(SNOWFLAKE\_SOURCE\_NAME)\\  
    .options(\*\*sfOptions)\\  
    .option("dbtable", "DUMMY\_SC\_TEST")\\  
    .mode("overwrite")\\  
    .save()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Table Schema&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EoZtZ0KR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297717807/XIpPPOgVpo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EoZtZ0KR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297717807/XIpPPOgVpo.png" alt="" width="497" height="147"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Table Content&lt;/p&gt;

&lt;h4&gt;
  Behavior of schema changes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;  Column order change
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = spark.createDataFrame(  
            \[  
                ("Learning", "Aparna",'column order change')  
            \],  
            \["ACTIVITY", "NAME","COMMENTS"\])  
df.write\\  
    .format(SNOWFLAKE\_SOURCE\_NAME)\\  
    .options(\*\*sfOptions)\\  
    .option("dbtable", "DUMMY\_SC\_TEST")\\  
    .mode("append")\\  
    .save()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5nOH4Bc3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297718948/PA7AjcjTH.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5nOH4Bc3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297718948/PA7AjcjTH.png" alt="" width="509" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Row 2 — Column order change&lt;/p&gt;

&lt;p&gt;By default, the Spark Snowflake connector maps columns based on column order, which has resulted in the wrong mapping here. If we need to map the columns based on column names, irrespective of their order, we can set the property &lt;em&gt;“column_mapping”:”name”&lt;/em&gt; in the Snowflake options.&lt;/p&gt;
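&lt;p&gt;Setting that property is just an extra key in the options dictionary. A minimal sketch (the base connection values here are placeholders):&lt;/p&gt;

```python
# Base connector options (placeholder values for illustration).
sfOptions = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "my_user",
}

# column_mapping=name makes the connector match DataFrame columns
# to Snowflake table columns by name instead of by position.
sfOptionsByName = {**sfOptions, "column_mapping": "name"}
```

The merged dictionary can then be passed as `.options(**sfOptionsByName)` on the writer.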

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K8ZoeqOB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297720444/HiBZkBRKn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K8ZoeqOB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297720444/HiBZkBRKn.png" alt="" width="524" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Row 3 - Mapped based on the column names despite the column order change&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Column missing
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = spark.createDataFrame(  
            \[  
                ("Learning", 'column missing')  
            \],  
            \["ACTIVITY","COMMENTS"\])  
df.write\\  
    .format(SNOWFLAKE\_SOURCE\_NAME)\\  
    .options(\*\*sfOptions)\\  
    .option("dbtable", "DUMMY\_SC\_TEST")\\  
    .mode("append")\\  
    .save()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a column is missing and the column_mapping=name property is not set (i.e. the default column-order mapping), the connector throws an exception.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--J4coELbb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297721963/-vHNvR8yX.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--J4coELbb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297721963/-vHNvR8yX.png" alt="" width="800" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As before, let's set column_mapping to name and see how it behaves with a missing column.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--u3DWQnom--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297724004/5q_iFPLTk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u3DWQnom--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297724004/5q_iFPLTk.png" alt="" width="800" height="148"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Again it threw an error, complaining about a column count mismatch. Let's try setting one more Snowflake option: &lt;em&gt;“column_mismatch_behavior”:”ignore”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VRatSgB0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297725363/Z6wOutYX1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VRatSgB0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297725363/Z6wOutYX1.png" alt="" width="566" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Row 4 — Missing column populated as null&lt;/p&gt;

&lt;p&gt;Voila, the missing column has been populated as NULL with the new property.&lt;/p&gt;

&lt;p&gt;Please note that column_mismatch_behavior only applies when column_mapping is set to “name”. By default, column_mismatch_behavior is ‘error’, as we saw earlier.&lt;/p&gt;
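&lt;p&gt;Putting the two properties together, the tolerant configuration looks like this (the base connection values are placeholders):&lt;/p&gt;

```python
# Base connector options (placeholder values for illustration).
sfOptions = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "my_user",
}

# column_mismatch_behavior only takes effect when column_mapping is "name":
# missing columns load as NULL, extra columns are dropped.
sfOptionsTolerant = {
    **sfOptions,
    "column_mapping": "name",
    "column_mismatch_behavior": "ignore",
}
```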

&lt;ul&gt;
&lt;li&gt;  Extra Column&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since we learned from the previous example that column_mapping set to name, together with column_mismatch_behavior set to ignore, helps when the number of columns differs between Snowflake and the Spark DataFrame, let's try those properties in a scenario where we have an extra column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = spark.createDataFrame(  
            \[  
                ("Aparna","Learning", 'medium','column addition')  
            \],  
            \["NAME","ACTIVITY","PORTAL","COMMENTS"\])  
df.write\\  
    .format(SNOWFLAKE\_SOURCE\_NAME)\\  
    .options(\*\*sfOptions)\\  
    .option("dbtable", "DUMMY\_SC\_TEST")\\  
    .mode("append")\\  
    .save()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vqIHQi5R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297726634/gp_jwoZXR.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vqIHQi5R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297726634/gp_jwoZXR.png" alt="" width="532" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Row 5 — extra column ignored&lt;/p&gt;

&lt;p&gt;As the column named ‘PORTAL’ is not defined in the target Snowflake table, it was ignored and the rest of the columns were mapped correctly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Column data type change behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's assume the NAME column in the DataFrame has a numeric data type.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = spark.createDataFrame(  
            \[  
                (123,"Learning",'data type change')  
            \],  
            \["NAME","ACTIVITY","COMMENTS"\])  
df.write\\  
    .format(SNOWFLAKE\_SOURCE\_NAME)\\  
    .options(\*\*sfOptions)\\  
    .option("dbtable", "DUMMY\_SC\_TEST")\\  
    .mode("append")\\  
    .save()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iIBlK1Gn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297727962/dC-oEPgkY.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iIBlK1Gn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297727962/dC-oEPgkY.png" alt="" width="501" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Row 6 — Numeric value 123 mapped to “123” varchar&lt;/p&gt;

&lt;p&gt;As the data type in the target table is varchar, the numeric NAME column was cast to varchar. The reverse is also true, as long as the value can be parsed as a number.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = spark.createDataFrame(  
            \[  
                ("Aparna","Learning",'original',"123")  
            \],  
            \["NAME","ACTIVITY","COMMENTS","USER\_ID"\])  
df.printSchema()  
df.write\\  
    .format(SNOWFLAKE\_SOURCE\_NAME)\\  
    .options(\*\*sfOptions)\\  
    .option("dbtable", "DUMMY\_DT\_TYPE")\\  
    .mode("append")\\  
    .save()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8kEnidE7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297729291/hurT7JGGY.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8kEnidE7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297729291/hurT7JGGY.png" alt="" width="550" height="189"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Spark Data types&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = spark.createDataFrame(  
            \[  
                ("Aparna","Learning",'varchar insert to numeric column',38.00)  
            \],  
            \["NAME","ACTIVITY","COMMENTS","USER\_ID"\])  
df.printSchema()  
df.write\\  
    .format(SNOWFLAKE\_SOURCE\_NAME)\\  
    .options(\*\*sfOptions)\\  
    .option("dbtable", "DUMMY\_DT\_TYPE")\\  
    .mode("append")\\  
    .save()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pVtzqIFz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297730554/O_KrAQRqk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pVtzqIFz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297730554/O_KrAQRqk.png" alt="" width="521" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YMjHKs8o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297731725/fNbG0-Jsi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YMjHKs8o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297731725/fNbG0-Jsi.png" alt="" width="626" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Row 3 — 38.00 mapped to 38&lt;/p&gt;

&lt;p&gt;Date and Timestamp&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = spark.createDataFrame(  
            \[  
                ("Aparna","Learning",'original')  
            \],  
            \["NAME","ACTIVITY","COMMENTS"\])

#Add timestamp column in the dataframe  
df=df.withColumn("TIMESTAMP\_COLUMN",F.current\_timestamp())  
df.write\\  
    .format(SNOWFLAKE\_SOURCE\_NAME)\\  
    .options(\*\*sfOptions)\\  
    .option("dbtable", "DUMMY\_DT\_TYPE")\\  
    .mode("overwrite")\\  
    .save()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oYJWiYJP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297733460/4AOzb2HaE.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oYJWiYJP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297733460/4AOzb2HaE.png" alt="" width="720" height="201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Row 1 — Table with Timestamp column&lt;/p&gt;

&lt;p&gt;Let's try to append to the table with a string value for the timestamp column, as shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = spark.createDataFrame(  
            \[  
                ("Aparna","Learning",'Timestamp col as string','2021-11-18 20:15:49')  
            \],  
            \["NAME","ACTIVITY","COMMENTS","TIMESTAMP\_COLUMN"\])  
df.write\\  
    .format(SNOWFLAKE\_SOURCE\_NAME)\\  
    .options(\*\*sfOptions)\\  
    .option("dbtable", "DUMMY\_DT\_TYPE")\\  
    .mode("append")\\  
    .save()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nRzHxMLM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297734666/Ks4ACB-kI.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nRzHxMLM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297734666/Ks4ACB-kI.png" alt="" width="800" height="43"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Error when trying to insert a timestamp string into a timestamp column&lt;/p&gt;

&lt;p&gt;Even though we have handled these schema changes with simple settings, it is a best practice to notify or log the schema changes somewhere, so that the business can decide later whether to incorporate them.&lt;/p&gt;
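&lt;p&gt;One simple way to do that is to diff the DataFrame columns against the target table's columns before each append and log any drift. A minimal sketch; the target column list is hard-coded here for illustration, whereas in practice you would fetch it from Snowflake metadata (e.g. DESCRIBE TABLE):&lt;/p&gt;

```python
import logging

logging.basicConfig(level=logging.INFO)

def log_schema_drift(df_columns, target_columns):
    """Log columns that differ between the incoming frame and the target table."""
    missing = set(target_columns) - set(df_columns)  # will be loaded as NULL
    extra = set(df_columns) - set(target_columns)    # will be ignored by the connector
    if missing:
        logging.warning("Columns missing from source: %s", sorted(missing))
    if extra:
        logging.warning("Extra columns in source: %s", sorted(extra))
    return missing, extra

# Example: the DataFrame lacks NAME and carries an extra PORTAL column.
missing, extra = log_schema_drift(
    ["ACTIVITY", "COMMENTS", "PORTAL"],
    ["NAME", "ACTIVITY", "COMMENTS"],
)
```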

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Overwrite mode&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://community.snowflake.com/s/article/How-to-Load-Data-in-Spark-with-Overwrite-mode-without-Changing-table-Structure"&gt;How to: Load Data in Spark with Overwrite mode without Changing Table Structure (snowflake.com)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this post we have discussed&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  how schema changes in a Spark DataFrame impact appends to a Snowflake table with a different schema&lt;/li&gt;
&lt;li&gt;  how we can handle those changes in a &lt;em&gt;simple&lt;/em&gt; way without much impact to our data pipeline.&lt;/li&gt;
&lt;li&gt;  how this learning can be applied to use cases where Snowflake tables need a static, standardized schema because of the nature of the downstream systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thank you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.snowflake.com/en/user-guide/spark-connector-use.html"&gt;Using the Spark Connector — Snowflake Documentation&lt;/a&gt;&lt;/p&gt;

</description>
      <category>snowflake</category>
      <category>dataengineering</category>
      <category>spark</category>
      <category>schemaevolution</category>
    </item>
  </channel>
</rss>
