<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Aparna Aravind</title>
    <description>The latest articles on Forem by Aparna Aravind (@aparnasaravind).</description>
    <link>https://forem.com/aparnasaravind</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F978811%2Fcdd22a28-bb43-4063-8b5c-0dfe448686ce.png</url>
      <title>Forem: Aparna Aravind</title>
      <link>https://forem.com/aparnasaravind</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/aparnasaravind"/>
    <language>en</language>
    <item>
      <title>Snowflake AWS Lambda Integration</title>
      <dc:creator>Aparna Aravind</dc:creator>
      <pubDate>Fri, 25 Nov 2022 05:49:33 +0000</pubDate>
      <link>https://forem.com/aparnasaravind/snowflake-aws-lambda-integration-44ka</link>
      <guid>https://forem.com/aparnasaravind/snowflake-aws-lambda-integration-44ka</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ty0Z6aZs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ix6kp0j0n5k6sjbjfh8k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ty0Z6aZs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ix6kp0j0n5k6sjbjfh8k.png" alt="Image description" width="519" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post, I will walk you through the steps for Snowflake-AWS Lambda integration.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“AWS Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let's learn it with a sample use case: read from a Snowflake table where a data engineer has captured anomalies from that day’s load insights, and send the report to the required recipients. From a technical perspective, we will go through the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Snowflake-AWS Lambda integration: using the Snowflake Python connector in AWS Lambda. Since AWS Lambda does not ship the Snowflake connector out of the box, we need to add it explicitly as a layer.&lt;/li&gt;
&lt;li&gt; Sending notifications from the Snowflake table inference: send the table content to an email using SNS. The contents are formatted with an external Python library called “tabulate”, which also needs to be added explicitly as a Lambda layer.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Sample use case&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;em&gt;Steps&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;I. Create a lambda function&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Create a Lambda function with Python 3.6 as the runtime. Please choose the correct execution role.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--reWlKViL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297688929/5FFFjL_Ucl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--reWlKViL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297688929/5FFFjL_Ucl.png" alt="" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Go to the Configuration tab and increase the timeout to ~4 minutes, as the default is 3 seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ej3XKjM7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297690213/9_2gmKgaJ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ej3XKjM7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297690213/9_2gmKgaJ.png" alt="" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zsl8qzC4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297691914/417A0eSqC.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zsl8qzC4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297691914/417A0eSqC.png" alt="" width="655" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  If you want to launch the function in a VPC, please configure it accordingly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t51UYdLs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297693394/T6M4siTgu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t51UYdLs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297693394/T6M4siTgu.png" alt="" width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;II. Create zip archive for snowflake dependency and add Lambda layer&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As Lambda does not have the Snowflake dependencies installed out of the box, add them using AWS Lambda layers.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“A Lambda layer is a .zip file archive that can contain additional code or data. A layer can contain libraries, a&lt;/em&gt; &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/runtimes-custom.html"&gt;&lt;em&gt;custom runtime&lt;/em&gt;&lt;/a&gt;&lt;em&gt;, data, or configuration files. Layers promote code sharing and separation of responsibilities so that you can iterate faster on writing business logic.”&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Spin up an EC2 t2.micro instance with an Ubuntu/Linux OS to avoid dependency issues, since Lambda runs on Linux. I have used the following EC2 AMI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SONqwq6V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297694882/pRLJI4X69.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SONqwq6V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297694882/pRLJI4X69.png" alt="" width="800" height="72"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**\## Install Prerequisites  
~$** sudo apt update**~$** sudo apt-get install python3-pip**~$** sudo apt install zip  
**~$** sudo apt install awscli  
**~$** sudo apt-get install -y libssl-dev libffi-dev

**\## Install snowflake dependency  
~$** sudo apt install virtualenv  
**~$** mkdir sf\_lambda; cd sf\_lambda  
**~/sf\_lambda$** virtualenv v-env --python=python3;source v-env/bin/activate  
**~/sf\_lambda$** cd v-env/lib/python3.6/site-packages/  
**~/sf\_lambda/v-env/lib/python3.6/site-packages$** pip3 install -r [https://raw.githubusercontent.com/snowflakedb/snowflake-connector-python/v2.3.10/tested\_requirements/requirements\_36.reqs](https://raw.githubusercontent.com/snowflakedb/snowflake-connector-python/v2.3.10/tested_requirements/requirements_36.reqs) -t .  
**~/sf\_lambda/v-env/lib/python3.6/site-packages$** pip3 install snowflake-connector-python==2.3.10 -t .  
**~/sf\_lambda/v-env/lib/python3.6/site-packages$** chmod -R 755 .  
**~/sf\_lambda/v-env/lib/python3.6/site-packages$** deactivate

**\## Create zip archive  
~/sf\_lambda/v-env/lib/python3.6/site-packages**$ cd ../../../../  
**~/sf\_lambda$** mkdir python;cd python/  
**~/sf\_lambda/python$** cp -r ../v-env/lib/python3.6/site-packages/\* .  
**~/sf\_lambda/python$** cd ..  
**~/sf\_lambda$** zip -r sf\_lambda.zip python

**\## Upload zip file in S3  
~/sf\_lambda$** aws s3 cp sf\_lambda.zip s3://&amp;lt;bucket&amp;gt;/&amp;lt;prefix&amp;gt;/ --profile &amp;lt;profile-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  Please note that in the step
pip3 install -r &lt;a href="https://raw.githubusercontent.com/snowflakedb/snowflake-connector-python/v2.3.10/tested_requirements/requirements_36.reqs"&gt;https://raw.githubusercontent.com/snowflakedb/snowflake-connector-python/v2.3.10/tested_requirements/requirements_&lt;strong&gt;36&lt;/strong&gt;.reqs&lt;/a&gt; -t .
the &lt;strong&gt;36&lt;/strong&gt; depends on your Python version: it becomes &lt;strong&gt;35&lt;/strong&gt; for Python 3.5, &lt;strong&gt;38&lt;/strong&gt; for Python 3.8, and so on.&lt;/li&gt;
&lt;li&gt;  Please make sure your security group allows SSH (for connecting via PuTTY to create the zip file) and outbound TCP (to download the required Python dependencies). Also make sure that your instance is configured to access S3.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Go to the AWS Lambda console -&amp;gt; Layers -&amp;gt; Create layer. Fill in the details as shown in the screenshot below and click Create.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wuW76-_U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297696729/YfX5cWd0e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wuW76-_U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297696729/YfX5cWd0e.png" alt="" width="706" height="731"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Click the Code tab and, under Layers, click Add a layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--n49L6Jn5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297698323/FRau0yB9x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--n49L6Jn5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297698323/FRau0yB9x.png" alt="" width="800" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Copy and paste the ARN of the layer we just created, and click Add&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cuouInpz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297700034/SnImpooK6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cuouInpz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297700034/SnImpooK6.png" alt="" width="799" height="651"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;III. Create an SNS topic&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Go to SNS -&amp;gt; Topics -&amp;gt; Create topic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v86B1c2Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297701778/lWwiExxXi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v86B1c2Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297701778/lWwiExxXi.png" alt="" width="678" height="686"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Click on the topic you created, click Create subscription, and add the required email address&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QBfCOlGc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297703683/ByWCodaTv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QBfCOlGc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297703683/ByWCodaTv.png" alt="" width="800" height="68"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Confirm the subscription by clicking the link sent to the inbox of the email address given in the previous step&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Email Formatting&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  To pretty-print the table content in the email body, I have used the Python library “tabulate”. Create another Lambda layer following steps similar to &lt;em&gt;II. Create zip archive for snowflake dependency and add Lambda layer&lt;/em&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**~$** mkdir tabulate\_lambda; cd tabulate\_lambda  
**~/tabulate\_lambda$** virtualenv v-env-t --python=python3  
**~/tabulate\_lambda$** source v-env-t/bin/activate  
**~/tabulate\_lambda$** cd v-env-t/lib/python3.6/site-packages/  
**~/tabulate\_lambda/v-env-t/lib/python3.6/site-packages$** pip3 install tabulate -t .  
**~/tabulate\_lambda/v-env-t/lib/python3.6/site-packages$** chmod -R 755  
 .  
**~/tabulate\_lambda/v-env-t/lib/python3.6/site-packages$** deactivate  
**~/tabulate\_lambda/v-env-t/lib/python3.6/site-packages**$ cd ../../../../  
**~/tabulate\_lambda$** mkdir python;cd python/  
**~/tabulate\_lambda/python$** cp -r ../v-env-t/lib/python3.6/site-packages/\* .  
**~/tabulate\_lambda/python$** cd ..  
**~/tabulate\_lambda$** zip -r tabulate\_lambda.zip python  
**~/tabulate\_lambda$** aws s3 cp tabulate\_lambda.zip s3://&amp;lt;bucket&amp;gt;/&amp;lt;prefix&amp;gt;/ --profile &amp;lt;profile-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Code&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now all the ingredients are ready. As the final step, deploy the code below in the Lambda function you created, fill in your Snowflake/SNS details in the &amp;lt;&amp;gt; placeholders, and test :)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/aparnasaravind/Code_Samples/tree/master/medium/snowflake/lambda"&gt;&lt;strong&gt;Code_Samples/medium/snowflake/lambda at master · aparnasaravind/Code_Samples&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
*Code Samples - learning. Contribute to aparnasaravind/Code_Samples development by creating an account on GitHub.*github.com&lt;/a&gt;&lt;a href="https://github.com/aparnasaravind/Code_Samples/tree/master/medium/snowflake/lambda"&gt;&lt;/a&gt;&lt;/p&gt;
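&lt;p&gt;The repository above holds the full code. As a rough, untested sketch of the shape such a handler can take (the environment-variable names, the query, and the plain-text table helper standing in for tabulate are all my assumptions, not the repository's code):&lt;/p&gt;

```python
import os

def format_report(headers, rows):
    # Plain-text table renderer. The post formats the email body with the
    # "tabulate" library from the Lambda layer; this stand-in keeps the
    # sketch dependency-free.
    cols = [[str(h)] + [str(r[i]) for r in rows] for i, h in enumerate(headers)]
    widths = [max(len(v) for v in col) for col in cols]
    def line(values):
        return " | ".join(str(v).ljust(w) for v, w in zip(values, widths))
    sep = "-+-".join("-" * w for w in widths)
    return "\n".join([line(headers), sep] + [line(r) for r in rows])

def lambda_handler(event, context):
    # Imported lazily: the connector comes from the layer built above,
    # and boto3 is provided by the Lambda runtime.
    import snowflake.connector
    import boto3

    conn = snowflake.connector.connect(
        user=os.environ["SF_USER"],          # hypothetical env vars
        password=os.environ["SF_PASSWORD"],
        account=os.environ["SF_ACCOUNT"],
        warehouse=os.environ["SF_WAREHOUSE"],
    )
    try:
        cur = conn.cursor()
        # Hypothetical anomaly table; replace with your own query.
        cur.execute("SELECT table_name, anomaly FROM load_anomalies")
        headers = [c[0] for c in cur.description]
        body = format_report(headers, cur.fetchall())
    finally:
        conn.close()

    boto3.client("sns").publish(
        TopicArn=os.environ["SNS_TOPIC_ARN"],
        Subject="Daily load anomaly report",
        Message=body,
    )
    return {"status": "sent"}
```

&lt;p&gt;The imports sit inside the handler because both dependencies exist only at runtime, inside the function's execution environment.&lt;/p&gt;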

&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In this post we performed the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Created a Lambda function with the required configuration&lt;/li&gt;
&lt;li&gt; Created two zip files for the tabulate and Snowflake dependencies and added them as layers to the Lambda function&lt;/li&gt;
&lt;li&gt; Created an SNS topic and subscription, and confirmed the subscription&lt;/li&gt;
&lt;li&gt; Deployed the code in Lambda&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Thank you!&lt;/p&gt;

</description>
      <category>snowflake</category>
      <category>aws</category>
      <category>python</category>
      <category>awslambda</category>
    </item>
    <item>
      <title>Deequ for generating data quality reports</title>
      <dc:creator>Aparna Aravind</dc:creator>
      <pubDate>Fri, 25 Nov 2022 05:40:45 +0000</pubDate>
      <link>https://forem.com/aparnasaravind/deequ-for-generating-data-quality-reports-ne7</link>
      <guid>https://forem.com/aparnasaravind/deequ-for-generating-data-quality-reports-ne7</guid>
      <description>&lt;p&gt;Ensuring data quality checks are really important in data driven projects&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  To ensure data correctness for correct business decisions&lt;/li&gt;
&lt;li&gt;  To validate the data beforehand and avoid broken production pipelines&lt;/li&gt;
&lt;li&gt;  To validate data from disparate sources (FTP, data lakes, or sources other than an RDBMS) that lack schema and integrity constraints&lt;/li&gt;
&lt;li&gt;  Data quality underpins machine learning model performance&lt;/li&gt;
&lt;li&gt;  Many more…&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is a real time saver to have a tool or framework that helps ensure data quality, and several exist. One such open-source project is Deequ by AWS.&lt;/p&gt;

&lt;p&gt;In this post we will explore the &lt;strong&gt;Constraint verification&lt;/strong&gt; module of Deequ. As we progress, let's build a &lt;em&gt;dynamic verification&lt;/em&gt; module that takes the Deequ rules from a file, a store, or a dictionary.&lt;/p&gt;

&lt;p&gt;Before that, a quick refresher on AWS Deequ:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/" rel="noopener noreferrer"&gt;aws documentation&lt;/a&gt; — Deequ allows you to calculate data quality metrics on your dataset, define and verify data quality constraints, and be informed about changes in the data distribution. Instead of implementing checks and verification algorithms on your own, you can focus on describing how your data should look. Deequ supports you by suggesting checks for you. Deequ is implemented on top of &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;&lt;strong&gt;Apache Spark&lt;/strong&gt;&lt;/a&gt; and is designed to scale with large datasets (think billions of rows) that typically live in a distributed filesystem or a data warehouse.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Long story short: for anything you can fit into a Spark dataframe (S3, Snowflake, RDBMS, etc.), Deequ helps you perform data quality tests, and at scale.&lt;/p&gt;

&lt;p&gt;Deequ has four main components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Metrics computation&lt;/strong&gt; — gives statistical insight into data quality, such as completeness, correlation, uniqueness, etc.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Constraint suggestion&lt;/strong&gt; — unsure which data quality checks to run? Deequ can suggest some based on your data. Please review the suggestions and keep the ones that make sense before using them.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Constraint verification&lt;/strong&gt; — we verify the data by defining quality constraint rules, and Deequ returns the status of each check&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Metrics repository&lt;/strong&gt; — lets us store Deequ results (the metrics we computed) and later compare them with subsequent results. At the time of writing, only two repository types are supported: file and in-memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/awslabs/python-deequ" rel="noopener noreferrer"&gt;click here - aws reference&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  AWS Deequ for generating data quality reports
&lt;/h4&gt;

&lt;p&gt;The &lt;strong&gt;Constraint verification&lt;/strong&gt; module helps us generate data quality reports from a set of checks that run on top of our dataframe. Below is a usage example taken from the project's Git repository.&lt;/p&gt;

&lt;p&gt;Note — for this post I have used pydeequ, a Python wrapper around Deequ (Deequ is originally written in Scala).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import SparkSession, Row  
import pydeequ  
from pydeequ.checks import \*  
from pydeequ.verification import \*  
spark = (SparkSession  
    .builder  
    .config("spark.jars.packages", pydeequ.deequ\_maven\_coord)  
    .config("spark.jars.excludes", pydeequ.f2j\_maven\_coord)  
    .getOrCreate())  

df = spark.sparkContext.parallelize(\[  
            Row(a="foo", b=1, c=5),  
            Row(a="bar", b=2, c=6),  
            Row(a="baz", b=3, c=None)\]).toDF()  


check = Check(spark, CheckLevel.Warning, "Review Check")  

checkResult = VerificationSuite(spark) \\  
    .onData(df) \\  
    .addCheck(  
        check.hasSize(lambda x: x &amp;gt;= 3) \\  
        .hasMin("b", lambda x: x == 0) \\  
        .isComplete("c")  \\  
        .isUnique("a")  \\  
        .isContainedIn("a", \["foo", "bar", "baz"\]) \\  
        .isNonNegative("b")) \\  
    .run()  

checkResult\_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)  
checkResult\_df.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  hasSize — check that the dataframe has at least 3 rows — Success/fail&lt;/li&gt;
&lt;li&gt;  hasMin — check that the minimum value of column b is 0 — Success/fail&lt;/li&gt;
&lt;li&gt;  isComplete — check that no value of c is null — Success/fail&lt;/li&gt;
&lt;li&gt;  isUnique — check that all values of a are unique — Success/fail&lt;/li&gt;
&lt;li&gt;  isContainedIn — check that every value of a is in the given list — Success/fail&lt;/li&gt;
&lt;li&gt;  isNonNegative — check that all values of b are non-negative — Success/fail&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Details of all the available quality checks — &lt;a href="https://github.com/awslabs/python-deequ/blob/9bcc6bc69f450b5459866448ebcbc1f8d65d65a2/pydeequ/checks.py#L663" rel="noopener noreferrer"&gt;click here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqw7zdnhasnyingsiofr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqw7zdnhasnyingsiofr.png" width="800" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample Use case&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We will try the same example from the pydeequ Git repository, but pass the constraints dynamically. For simplicity, we will configure our Deequ rules in a Python dictionary; you can configure them in a file, a key-value store, or anywhere else. Please refer &lt;a href="https://github.com/aparnasaravind/Code_Samples/tree/master/medium/deequ/checks" rel="noopener noreferrer"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt; for the complete code.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Configure Deequ rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3udrw9j6klkfkrhnubxy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3udrw9j6klkfkrhnubxy.png" width="800" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Initialize Check object state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhd16unvj9enb6x04co4p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhd16unvj9enb6x04co4p.png" width="689" height="36"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Create constraints from string&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this we will leverage getattr in Python (&lt;a href="https://www.w3schools.com/python/ref_func_getattr.asp" rel="noopener noreferrer"&gt;reference&lt;/a&gt;), a built-in function that enables reflection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdi4r758cugcemg3s60y7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdi4r758cugcemg3s60y7.png" width="800" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Call VerificationSuite, which is responsible for running our checks, and pass it the dataframe the checks should run on along with the check object itself&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficua8wuleudhko8lcoxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficua8wuleudhko8lcoxx.png" width="541" height="140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Get check results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpha5olw84tnvt98zfv9l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpha5olw84tnvt98zfv9l.png" width="800" height="83"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5xi0cdx4n48soocjbmh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5xi0cdx4n48soocjbmh.png" width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Output&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this post we had a quick refresher on AWS Deequ and walked through a sample use case: building a &lt;em&gt;dynamic verification&lt;/em&gt; module that takes the Deequ rules from a file, a store, or a dictionary, which adds flexibility to our module. You can always take this to another level by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Defining appropriate actions to take based on the verification status, such as sending a notification when you see a quality breach in your data&lt;/li&gt;
&lt;li&gt;  Designing a Deequ rule repository where users can configure new rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks, hope this helps!&lt;/p&gt;

</description>
      <category>discuss</category>
    </item>
    <item>
      <title>Handling schema changes in snowflake</title>
      <dc:creator>Aparna Aravind</dc:creator>
      <pubDate>Fri, 25 Nov 2022 05:33:22 +0000</pubDate>
      <link>https://forem.com/aparnasaravind/handling-schema-changes-in-snowflake-54c0</link>
      <guid>https://forem.com/aparnasaravind/handling-schema-changes-in-snowflake-54c0</guid>
      <description>&lt;p&gt;&lt;em&gt;With spark-snowflake connector writes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this post we will observe how schema changes such as missing columns, extra columns, and data type changes behave with &lt;em&gt;spark-snowflake connector&lt;/em&gt; writes. In other words, how a schema mismatch between the Spark dataframe and the Snowflake table is handled.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aparnaaravind.hashnode.dev/spark-snowflake-writes-behind-the-scenes-34007bc490b7"&gt;Click here - Spark Snowflake Write - Behind the Scenes&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Say we have a requirement to ingest and append data from a source (files or JDBC source systems) to a landing Snowflake table.&lt;/li&gt;
&lt;li&gt;  And there are n downstream systems that consume data from these landing-zone tables.&lt;/li&gt;
&lt;li&gt;  Schema changes should not be propagated from the source to this landing Snowflake table, and the data pipeline should not fail on them either (in fact, it should ignore the schema changes).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some schema changes are not as straightforward and may need carefully thought-through data pipelines to handle them holistically. The purpose of this post, however, is to identify the &lt;strong&gt;straightforward scenarios&lt;/strong&gt; and their corresponding solutions, which should not require complicated or overengineered data pipelines. &lt;em&gt;Remember, simplification is the ultimate sophistication&lt;/em&gt; :)&lt;/p&gt;

&lt;p&gt;There are a few ways to handle simple schema changes in the sources (depending on the scenario):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; See if we can create a dataframe with the same schema as the Snowflake landing table, irrespective of the schema changes, and use that dataframe to write/append to the Snowflake table.&lt;/li&gt;
&lt;li&gt; See if we can leverage the spark-snowflake append mode with Snowflake mapping properties to handle some of the schema change scenarios.&lt;/li&gt;
&lt;li&gt; Or, by design, use landing Snowflake tables with variant columns and build a standardized layer after that.&lt;/li&gt;
&lt;li&gt; Create landing tables with all columns as varchar and cast them per the business meaning of the columns in the subsequent layers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this post, let's explore Option 2 in more detail. Hope it saves you some testing time as well.&lt;/p&gt;

&lt;p&gt;Please note that all the schema change impacts are discussed in the context of &lt;em&gt;append&lt;/em&gt; mode (as append mode doesn’t change the schema of the target Snowflake table); I have also added an extra note/reference on overwrite mode in this post.&lt;/p&gt;

&lt;p&gt;Let's first create a table with some sample data, as in the snippet below. As I want the Spark connector to create the Snowflake table for me, I am choosing overwrite mode for the first-time load.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = spark.createDataFrame(  
            \[  
                ("Aparna", "Learning",'original')  
            \],  
            \["NAME", "ACTIVITY","COMMENTS"\])  
df.write\\  
    .format(SNOWFLAKE\_SOURCE\_NAME)\\  
    .options(\*\*sfOptions)\\  
    .option("dbtable", "DUMMY\_SC\_TEST")\\  
    .mode("overwrite")\\  
    .save()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Table Schema&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EoZtZ0KR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297717807/XIpPPOgVpo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EoZtZ0KR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297717807/XIpPPOgVpo.png" alt="" width="497" height="147"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Table Content&lt;/p&gt;

&lt;h4&gt;
  Behavior of schema changes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;  Column order change
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = spark.createDataFrame(  
            \[  
                ("Learning", "Aparna",'column order change')  
            \],  
            \["ACTIVITY", "NAME","COMMENTS"\])  
df.write\\  
    .format(SNOWFLAKE\_SOURCE\_NAME)\\  
    .options(\*\*sfOptions)\\  
    .option("dbtable", "DUMMY\_SC\_TEST")\\  
    .mode("append")\\  
    .save()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5nOH4Bc3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297718948/PA7AjcjTH.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5nOH4Bc3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297718948/PA7AjcjTH.png" alt="" width="509" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Row 2 — Column order change&lt;/p&gt;

&lt;p&gt;By default, the Spark Snowflake connector maps columns based on column order, which has resulted in the wrong mapping here. If we need to map the columns based on column names, irrespective of their order, we can set the property &lt;em&gt;“column_mapping”:”name”&lt;/em&gt; in the Snowflake options.&lt;/p&gt;
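&lt;p&gt;Setting that property is just an extra key in the options dictionary. A minimal sketch (the base connection values here are placeholders):&lt;/p&gt;

```python
# Base connector options (placeholder values for illustration).
sfOptions = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "my_user",
}

# column_mapping=name makes the connector match DataFrame columns
# to Snowflake table columns by name instead of by position.
sfOptionsByName = {**sfOptions, "column_mapping": "name"}
```

The merged dictionary can then be passed as `.options(**sfOptionsByName)` on the writer.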

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K8ZoeqOB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297720444/HiBZkBRKn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K8ZoeqOB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297720444/HiBZkBRKn.png" alt="" width="524" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Row 3 - Mapped based on the column names despite the column order change&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Column missing
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = spark.createDataFrame(  
            \[  
                ("Learning", 'column missing')  
            \],  
            \["ACTIVITY","COMMENTS"\])  
df.write\\  
    .format(SNOWFLAKE\_SOURCE\_NAME)\\  
    .options(\*\*sfOptions)\\  
    .option("dbtable", "DUMMY\_SC\_TEST")\\  
    .mode("append")\\  
    .save()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a column is missing and the column_mapping=name property is not set (i.e. the default column-order mapping), the connector throws an exception.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--J4coELbb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297721963/-vHNvR8yX.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--J4coELbb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297721963/-vHNvR8yX.png" alt="" width="800" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As before, let's set column_mapping to name and see how it behaves with a missing column.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--u3DWQnom--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297724004/5q_iFPLTk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u3DWQnom--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297724004/5q_iFPLTk.png" alt="" width="800" height="148"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Again it threw an error, complaining about a column count mismatch. Let's try setting one more Snowflake option: &lt;em&gt;“column_mismatch_behavior”:”ignore”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VRatSgB0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297725363/Z6wOutYX1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VRatSgB0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297725363/Z6wOutYX1.png" alt="" width="566" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Row 4 — Missing column populated as null&lt;/p&gt;

&lt;p&gt;Voila, the missing column has been populated as NULL with the new property.&lt;/p&gt;

&lt;p&gt;Please note that column_mismatch_behavior only applies when column_mapping is set to “name”. By default, column_mismatch_behavior is ‘error’, as we saw earlier.&lt;/p&gt;
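&lt;p&gt;Putting the two properties together, the tolerant configuration looks like this (the base connection values are placeholders):&lt;/p&gt;

```python
# Base connector options (placeholder values for illustration).
sfOptions = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "my_user",
}

# column_mismatch_behavior only takes effect when column_mapping is "name":
# missing columns load as NULL, extra columns are dropped.
sfOptionsTolerant = {
    **sfOptions,
    "column_mapping": "name",
    "column_mismatch_behavior": "ignore",
}
```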

&lt;ul&gt;
&lt;li&gt;  Extra Column&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since we learned from the previous example that column_mapping set to name, together with column_mismatch_behavior set to ignore, helps when the number of columns differs between Snowflake and the Spark DataFrame, let's try those properties in a scenario where we have an extra column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = spark.createDataFrame(  
            \[  
                ("Aparna","Learning", 'medium','column addition')  
            \],  
            \["NAME","ACTIVITY","PORTAL","COMMENTS"\])  
df.write\\  
    .format(SNOWFLAKE\_SOURCE\_NAME)\\  
    .options(\*\*sfOptions)\\  
    .option("dbtable", "DUMMY\_SC\_TEST")\\  
    .mode("append")\\  
    .save()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vqIHQi5R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297726634/gp_jwoZXR.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vqIHQi5R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297726634/gp_jwoZXR.png" alt="" width="532" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Row 5 — extra column ignored&lt;/p&gt;

&lt;p&gt;As the column named ‘PORTAL’ is not defined in the target Snowflake table, it was ignored and the rest of the columns were mapped correctly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Column data type change behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's assume the NAME column in the DataFrame has a numeric data type.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = spark.createDataFrame(  
            \[  
                (123,"Learning",'data type change')  
            \],  
            \["NAME","ACTIVITY","COMMENTS"\])  
df.write\\  
    .format(SNOWFLAKE\_SOURCE\_NAME)\\  
    .options(\*\*sfOptions)\\  
    .option("dbtable", "DUMMY\_SC\_TEST")\\  
    .mode("append")\\  
    .save()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iIBlK1Gn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297727962/dC-oEPgkY.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iIBlK1Gn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297727962/dC-oEPgkY.png" alt="" width="501" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Row 6 — Numeric value 123 mapped to “123” varchar&lt;/p&gt;

&lt;p&gt;As the data type in the target table is varchar, the numeric NAME column was cast to varchar. The reverse is also true, as long as the value can be parsed as a number.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = spark.createDataFrame(  
            \[  
                ("Aparna","Learning",'original',"123")  
            \],  
            \["NAME","ACTIVITY","COMMENTS","USER\_ID"\])  
df.printSchema()  
df.write\\  
    .format(SNOWFLAKE\_SOURCE\_NAME)\\  
    .options(\*\*sfOptions)\\  
    .option("dbtable", "DUMMY\_DT\_TYPE")\\  
    .mode("append")\\  
    .save()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8kEnidE7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297729291/hurT7JGGY.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8kEnidE7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297729291/hurT7JGGY.png" alt="" width="550" height="189"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Spark Data types&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = spark.createDataFrame(  
            \[  
                ("Aparna","Learning",'varchar insert to numeric column',38.00)  
            \],  
            \["NAME","ACTIVITY","COMMENTS","USER\_ID"\])  
df.printSchema()  
df.write\\  
    .format(SNOWFLAKE\_SOURCE\_NAME)\\  
    .options(\*\*sfOptions)\\  
    .option("dbtable", "DUMMY\_DT\_TYPE")\\  
    .mode("append")\\  
    .save()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pVtzqIFz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297730554/O_KrAQRqk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pVtzqIFz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297730554/O_KrAQRqk.png" alt="" width="521" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YMjHKs8o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297731725/fNbG0-Jsi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YMjHKs8o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297731725/fNbG0-Jsi.png" alt="" width="626" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Row 3 — 38.00 mapped to 38&lt;/p&gt;

&lt;p&gt;Date and Timestamp&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = spark.createDataFrame(  
            \[  
                ("Aparna","Learning",'original')  
            \],  
            \["NAME","ACTIVITY","COMMENTS"\])

#Add timestamp column in the dataframe  
df=df.withColumn("TIMESTAMP\_COLUMN",F.current\_timestamp())  
df.write\\  
    .format(SNOWFLAKE\_SOURCE\_NAME)\\  
    .options(\*\*sfOptions)\\  
    .option("dbtable", "DUMMY\_DT\_TYPE")\\  
    .mode("overwrite")\\  
    .save()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oYJWiYJP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297733460/4AOzb2HaE.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oYJWiYJP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297733460/4AOzb2HaE.png" alt="" width="720" height="201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Row 1 — Table with Timestamp column&lt;/p&gt;

&lt;p&gt;Let's try to append to the table with a string value for the timestamp column, as shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = spark.createDataFrame(  
            \[  
                ("Aparna","Learning",'Timestamp col as string','2021-11-18 20:15:49')  
            \],  
            \["NAME","ACTIVITY","COMMENTS","TIMESTAMP\_COLUMN"\])  
df.write\\  
    .format(SNOWFLAKE\_SOURCE\_NAME)\\  
    .options(\*\*sfOptions)\\  
    .option("dbtable", "DUMMY\_DT\_TYPE")\\  
    .mode("append")\\  
    .save()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nRzHxMLM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297734666/Ks4ACB-kI.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nRzHxMLM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1669297734666/Ks4ACB-kI.png" alt="" width="800" height="43"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Error when trying to insert a timestamp string into a timestamp column&lt;/p&gt;

&lt;p&gt;Even though we have handled these schema changes with simple settings, it is a best practice to notify or log the schema changes somewhere, so that the business can decide later whether to incorporate them.&lt;/p&gt;
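&lt;p&gt;One simple way to do that is to diff the DataFrame columns against the target table's columns before each append and log any drift. A minimal sketch; the target column list is hard-coded here for illustration, whereas in practice you would fetch it from Snowflake metadata (e.g. DESCRIBE TABLE):&lt;/p&gt;

```python
import logging

logging.basicConfig(level=logging.INFO)

def log_schema_drift(df_columns, target_columns):
    """Log columns that differ between the incoming frame and the target table."""
    missing = set(target_columns) - set(df_columns)  # will be loaded as NULL
    extra = set(df_columns) - set(target_columns)    # will be ignored by the connector
    if missing:
        logging.warning("Columns missing from source: %s", sorted(missing))
    if extra:
        logging.warning("Extra columns in source: %s", sorted(extra))
    return missing, extra

# Example: the DataFrame lacks NAME and carries an extra PORTAL column.
missing, extra = log_schema_drift(
    ["ACTIVITY", "COMMENTS", "PORTAL"],
    ["NAME", "ACTIVITY", "COMMENTS"],
)
```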

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Overwrite mode&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://community.snowflake.com/s/article/How-to-Load-Data-in-Spark-with-Overwrite-mode-without-Changing-table-Structure"&gt;How to: Load Data in Spark with Overwrite mode without Changing Table Structure (snowflake.com)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this post we have discussed&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  how schema changes in a Spark DataFrame impact appends to a Snowflake table with a different schema&lt;/li&gt;
&lt;li&gt;  how we can handle those changes in a &lt;em&gt;simple&lt;/em&gt; way without much impact to our data pipeline.&lt;/li&gt;
&lt;li&gt;  how this learning can be applied to use cases where Snowflake tables need a static, standardized schema because of the nature of the downstream systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thank you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.snowflake.com/en/user-guide/spark-connector-use.html"&gt;Using the Spark Connector — Snowflake Documentation&lt;/a&gt;&lt;/p&gt;

</description>
      <category>snowflake</category>
      <category>dataengineering</category>
      <category>spark</category>
      <category>schemaevolution</category>
    </item>
  </channel>
</rss>
