<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Srinidhi </title>
    <description>The latest articles on Forem by Srinidhi  (@srinidhi).</description>
    <link>https://forem.com/srinidhi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F249314%2Ff32132e1-435f-4157-b2d9-4f5206da9446.jpg</url>
      <title>Forem: Srinidhi </title>
      <link>https://forem.com/srinidhi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/srinidhi"/>
    <language>en</language>
    <item>
      <title>Data Engineering Series #3: Apache Airflow - the modern Workflow management tool. Getting Started</title>
      <dc:creator>Srinidhi </dc:creator>
      <pubDate>Tue, 30 Mar 2021 04:00:13 +0000</pubDate>
      <link>https://forem.com/srinidhi/data-engineering-series-3-apache-airflow-the-modern-workflow-management-tool-what-do-you-need-to-know-78l</link>
      <guid>https://forem.com/srinidhi/data-engineering-series-3-apache-airflow-the-modern-workflow-management-tool-what-do-you-need-to-know-78l</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Why so much attention towards Airflow?&lt;/strong&gt;
&lt;/h3&gt;

&lt;h6&gt;
  
  
  Interest in Airflow over time. Source: Google Trends
&lt;/h6&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fy3v4zut344lht5r78tkb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fy3v4zut344lht5r78tkb.png" alt="Source: Google Trends"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;"The software industry has seen a notable rise in the development of tools to manage data. These tools range from storage solutions that house data to pipelines that transport data".&lt;br&gt;&lt;/p&gt;

&lt;p&gt;Data-driven companies like Airbnb and Quizlet rely on these data pipelines, which come with tedious tasks like&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scheduling&lt;/li&gt;
&lt;li&gt;Testing&lt;/li&gt;
&lt;li&gt;Handling errors&lt;/li&gt;
&lt;li&gt;Versioning&lt;/li&gt;
&lt;li&gt;Scaling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;the current data flow setup.&lt;/p&gt;

&lt;p&gt;We call these pipelines data workflows. And a well-known tool in this space is Apache Airflow.&lt;/p&gt;

&lt;p&gt;Airflow solves those tedious tasks with the following features:&lt;/p&gt;

&lt;h6&gt;
  
  
  Click on a feature to jump to its section directly.
&lt;/h6&gt;

&lt;p&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4fxoqkq0pintsjx0wji.png" alt="Alt Text"&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8slaav00tgwbt4u9xy3.png" alt="Alt Text"&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fei4oh9sued6ckuzc31yv.png" alt="Alt Text"&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ejrss7rl0dpeih9ymj0.png" alt="Alt Text"&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fai9jl8836y0tzuash0ro.png" alt="Alt Text"&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flucmwct3d8javiity63g.png" alt="Alt Text"&gt;&lt;/p&gt;

&lt;p&gt;And best of all, Airflow is an open-source tool with a rapidly growing user base and contributor community 🙂&lt;/p&gt;




&lt;p&gt;Without further ado,&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Let's dive deep into the workflow framework&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;And create a minimal workflow pipeline that leverages all the Airflow features listed above. &lt;/p&gt;

&lt;h4&gt;✨ DAGS&lt;/h4&gt;
&lt;br&gt;
Every workflow is constructed as a Directed Acyclic Graph (DAG). &lt;br&gt;
A DAG is created in a .py file, which should ideally have three sections configured:

&lt;ol&gt;
&lt;li&gt;DAG configuration - &lt;code&gt;default_args&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;DAG Instance &lt;code&gt;DAG()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;DAG Tasks - &lt;code&gt;Operators&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  DAG Configuration - &lt;code&gt;default_args&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;default_args&lt;/code&gt; is used to set properties (arguments) that are common to all tasks in the DAG.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Initializing the default arguments that we'll pass to our DAG
&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Srinidhi&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2018&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retries&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;on_failure_callback&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task_failure_alert&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h4&gt;
  
  
  DAG Instance - &lt;code&gt;DAG()&lt;/code&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Creating a DAG instance
&lt;/span&gt;&lt;span class="n"&gt;my_workflow_dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;first_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sample DAG for DE Series&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0 0 * * *&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h4&gt;
  
  
  DAG Tasks - &lt;code&gt;Operators&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;Airflow Operators are used to create individual tasks in a DAG. Properties specific to a task are configured in its operator.&lt;/p&gt;

&lt;h6&gt;Some of the most commonly used Airflow Operators&lt;/h6&gt;  

&lt;p&gt;&lt;code&gt;PythonOperator&lt;/code&gt; - Calls a Python function. An alternative way to create a Python function task is the TaskFlow API, available from Airflow 2.0.&lt;br&gt;
&lt;code&gt;BashOperator&lt;/code&gt; - Executes a UNIX command&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#Operators
# Bash operator tasks that executes unix command - echo
&lt;/span&gt;&lt;span class="n"&gt;task_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;first_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;First Task&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_workflow_dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Python Operator that prints the details of current job using context variable
&lt;/span&gt; &lt;span class="n"&gt;task_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;second_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;print_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;provide_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;on_success_callback&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task_success_alert&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_workflow_dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Python Function
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;print_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Implement your python task logic here.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;#Prints job details passed through provide_context parameter
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have our DAG ready, let's see where each of the features is configured in it.&lt;/p&gt;

&lt;h4&gt;✨ RETRY&lt;/h4&gt;
&lt;br&gt;
Airflow handles errors and failures gracefully.&lt;br&gt;
If you prefer to re-run a failed task multiple times before aborting the workflow run, use the &lt;strong&gt;retries&lt;/strong&gt; argument, so that an erroneous task is executed up to the defined number of times before it's marked as failed.
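&lt;p&gt;As a minimal sketch (illustrative values only), &lt;code&gt;retries&lt;/code&gt; is often paired with the standard &lt;code&gt;retry_delay&lt;/code&gt; argument to space out the attempts:&lt;/p&gt;

```python
from datetime import timedelta

# Illustrative retry settings for default_args; both keys are standard
# Airflow task arguments, but the values here are assumptions.
retry_args = {
    "retries": 3,                         # re-run a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),  # wait 5 minutes between attempts
}
```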

&lt;h4&gt;✨ ALERTS&lt;/h4&gt;
&lt;br&gt;
&lt;strong&gt;on_success_callback&lt;/strong&gt; and &lt;strong&gt;on_failure_callback&lt;/strong&gt; arguments are used to trigger actions once a task succeeds or fails, respectively. This is useful for sending personalized alerts to internal teams via Slack, email, or any other API call when a workflow task succeeds or fails.
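&lt;p&gt;A hedged sketch of what such callbacks can look like: Airflow invokes them with a context dict, and the names match the &lt;code&gt;task_failure_alert&lt;/code&gt; / &lt;code&gt;task_success_alert&lt;/code&gt; references in the DAG above, but the message formatting is an assumption (a real implementation would post the message to Slack or email it):&lt;/p&gt;

```python
# Hypothetical callback sketches; Airflow passes a context dict with keys
# such as 'ds' (the run date). Slack/email delivery is left out.
def task_failure_alert(context):
    task = context.get("task_instance_key_str", "unknown task")
    return f"Task failed: {task}"

def task_success_alert(context):
    ds = context.get("ds", "unknown date")
    return f"Task succeeded on {ds}"
```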
&lt;h6&gt;
  
  
  👇 Sample Slack notification received from Airflow
&lt;/h6&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwyaq96ipep80t560o5yd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwyaq96ipep80t560o5yd.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  View Code - &lt;a href="https://github.com/Sri-nidhi/Airflow-Notes/blob/f03ba5033b1a7e5c1571c5e8edef969fbd65ae76/utils/alert.py#L11" rel="noopener noreferrer"&gt;Success&lt;/a&gt;, &lt;a href="https://github.com/Sri-nidhi/Airflow-Notes/blob/f03ba5033b1a7e5c1571c5e8edef969fbd65ae76/utils/alert.py#L30" rel="noopener noreferrer"&gt;Failure&lt;/a&gt;
&lt;/h6&gt;

&lt;h4&gt;✨ WEBUI &lt;/h4&gt;
&lt;br&gt;
Executing &lt;code&gt;airflow webserver&lt;/code&gt; in the CLI starts the webserver service, which serves the Airflow WebUI at &lt;code&gt;http://0.0.0.0:8080&lt;/code&gt; by default. 

&lt;p&gt;Since Airflow is built on the Flask framework, you can even extend the WebUI by creating additional pages using &lt;a href="https://www.coditation.com/airflow-ui-plugin-development-a-walkthrough/" rel="noopener noreferrer"&gt;Flask AppBuilder&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;✨ SCHEDULING&lt;/h4&gt;
&lt;br&gt;
The scheduler service, the heart of Airflow, needs to be up and running for DAGs to execute. &lt;br&gt;
Once Airflow is installed, execute &lt;code&gt;airflow scheduler&lt;/code&gt; to start the scheduler service.

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08jet45zmbjgtwfyfi51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08jet45zmbjgtwfyfi51.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In addition to that, DAGs can be made to run automatically at a particular time by providing a &lt;a href="https://crontab.guru/" rel="noopener noreferrer"&gt;cron expression&lt;/a&gt; in &lt;code&gt;schedule_interval&lt;/code&gt;.&lt;/p&gt;
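&lt;p&gt;For reference, a cron expression has five fields, so the &lt;code&gt;0 0 * * *&lt;/code&gt; value used in the DAG above means "daily at midnight" (Airflow also accepts presets such as &lt;code&gt;@daily&lt;/code&gt;):&lt;/p&gt;

```python
# The five cron fields, in order; '0 0 * * *' = minute 0, hour 0,
# every day of month, every month, every day of week
minute, hour, day_of_month, month, day_of_week = "0 0 * * *".split()
```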

&lt;h4&gt;✨ API&lt;/h4&gt;
&lt;br&gt;
Airflow 2.0 introduced a stable REST API that opens the door to automation with Airflow. With the API, one can automate DAG triggers, reads, and other &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html" rel="noopener noreferrer"&gt;operations&lt;/a&gt; that are possible through the WebUI.&lt;br&gt;
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkyyn19zsnirqg7izov5.png" alt="Alt Text"&gt;
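&lt;p&gt;As a hedged sketch of such automation: the stable API exposes &lt;code&gt;POST /api/v1/dags/{dag_id}/dagRuns&lt;/code&gt; to trigger a DAG run. The base URL below is an assumption (a local webserver on the default port, with authentication omitted), and the request is only built here, not sent:&lt;/p&gt;

```python
import json
from urllib import request

# Assumed local webserver; adjust host/port and add auth for a real deployment
AIRFLOW_BASE = "http://localhost:8080/api/v1"

def build_trigger_request(dag_id, conf=None):
    # POST to /dags/{dag_id}/dagRuns creates a new DAG run
    url = f"{AIRFLOW_BASE}/dags/{dag_id}/dagRuns"
    payload = json.dumps({"conf": conf or {}}).encode()
    return request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# req = build_trigger_request("first_dag")
# request.urlopen(req)  # would send it to a running webserver
```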


&lt;h3&gt;
  
  
  &lt;strong&gt;Setting up your Airflow Instance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here are a few great resources:&lt;br&gt;
&lt;a href="https://dev.to/jfhbrook/how-to-run-airflow-on-windows-with-docker-2d01"&gt;On Windows with Docker&lt;/a&gt;&lt;br&gt;
&lt;a href="https://medium.com/@taufiq_ibrahim/apache-airflow-installation-on-ubuntu-ddc087482c14" rel="noopener noreferrer"&gt;On Ubuntu Local Setup&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/rolanddb/airflow-on-kubernetes" rel="noopener noreferrer"&gt;On Kubernetes&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/villasv/aws-airflow-stack" rel="noopener noreferrer"&gt;AWS Using cloudformation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you don't have the bandwidth to set up and maintain Airflow in your own infrastructure, here are the commercial Airflow-as-a-service providers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/composer/" rel="noopener noreferrer"&gt;Google Cloud Composer&lt;/a&gt; - managed service built atop Google Cloud and Airflow.&lt;br&gt;
&lt;a href="https://www.astronomer.io/" rel="noopener noreferrer"&gt;Astronomer.io&lt;/a&gt; - In addition to hosting Airflow in their infrastructure, they provide solutions focused on airflow and support services.&lt;/p&gt;




&lt;h3&gt;
  
  
  Is Airflow the only option for workflows?
&lt;/h3&gt;

&lt;p&gt;Certainly not. It depends on several factors.&lt;br&gt;
Say your entire product is hosted with a single cloud provider (AWS / Azure / GCP) and you are fine with vendor lock-in. &lt;br&gt;
Then,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For AWS, &lt;a href="https://aws.amazon.com/step-functions/" rel="noopener noreferrer"&gt;AWS Step Functions&lt;/a&gt; will be a good option. &lt;/li&gt;
&lt;li&gt;For Azure, you can opt for &lt;a href="https://azure.microsoft.com/en-in/services/data-factory/" rel="noopener noreferrer"&gt;Azure Data Factory&lt;/a&gt;. &lt;/li&gt;
&lt;li&gt;For GCP, &lt;a href="https://cloud.google.com/composer/" rel="noopener noreferrer"&gt;Google Cloud Composer&lt;/a&gt; will be the best fit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whereas if you want to host your product across multiple clouds, &lt;br&gt;
Airflow will be a better fit. There are other workflow tools in the market similar to Airflow:&lt;br&gt;
&lt;a href="https://www.prefect.io/" rel="noopener noreferrer"&gt;Prefect&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/spotify/luigi" rel="noopener noreferrer"&gt;Luigi&lt;/a&gt;&lt;br&gt;
&lt;a href="https://docs.DAGster.io/" rel="noopener noreferrer"&gt;Dagster&lt;/a&gt; &lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Airflow Best practices / Tips:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Once you have Airflow installed, modify the configuration of your Airflow instance in the main config file - &lt;strong&gt;&lt;a href="https://github.com/apache/airflow/blob/master/airflow/config_templates/default_airflow.cfg" rel="noopener noreferrer"&gt;airflow.cfg&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Avoid implementing complex data processing logic in Airflow, as Airflow is not intended to work as a data processing engine; its focus is orchestrating data flow jobs. (For data processing, run the processing script on a batch compute or Spark framework and invoke it from Airflow.)&lt;/li&gt;
&lt;li&gt;Use &lt;a href="https://marclamberti.com/blog/templates-macros-apache-airflow/" rel="noopener noreferrer"&gt;macros and templates&lt;/a&gt; to avoid hard coding values.&lt;/li&gt;
&lt;li&gt;While migrating the Airflow metadata DB to a new database, use &lt;code&gt;airflow upgradedb&lt;/code&gt; (&lt;code&gt;airflow db upgrade&lt;/code&gt; in 2.0) instead of &lt;code&gt;airflow initdb&lt;/code&gt; (&lt;code&gt;airflow db init&lt;/code&gt; in 2.0).&lt;/li&gt;
&lt;li&gt;When you need a task that isn't available out of the box, create &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/howto/custom-operator.html" rel="noopener noreferrer"&gt;custom operators&lt;/a&gt; using the &lt;code&gt;BaseOperator&lt;/code&gt; class so they can be reused.&lt;/li&gt;
&lt;li&gt;Use &lt;a href="https://github.com/airflow-plugins/" rel="noopener noreferrer"&gt;Airflow Plugins&lt;/a&gt;, to connect to third-party tools/data sources.&lt;/li&gt;
&lt;li&gt;Use a Postgres database as Airflow's metadata DB to enable the Local and Celery executors.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you have any queries, or any more tips that you think might be useful, comment on the post below.&lt;/p&gt;

&lt;p&gt;Going forward, I'll publish detailed posts on tools and frameworks used by Data Engineers day in and day out. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/J4zA6LplubvC5weDyo/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/J4zA6LplubvC5weDyo/giphy.gif"&gt;&lt;/a&gt;&lt;br&gt;
Follow for updates.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>python</category>
      <category>beginners</category>
      <category>career</category>
    </item>
    <item>
      <title>Data Engineering Series #2: Cloud Services and FOSS in Data Engineer's world</title>
      <dc:creator>Srinidhi </dc:creator>
      <pubDate>Wed, 24 Jun 2020 05:25:02 +0000</pubDate>
      <link>https://forem.com/srinidhi/data-engineering-series-2-cloud-services-and-foss-in-data-engineer-s-world-5c46</link>
      <guid>https://forem.com/srinidhi/data-engineering-series-2-cloud-services-and-foss-in-data-engineer-s-world-5c46</guid>
      <description>&lt;p&gt;&lt;br&gt;&lt;strong&gt;"Open Source (OSS)&lt;/strong&gt; frameworks have improved the quality of Big Data processing with its diverse set of tools addressing numerous use cases &lt;/p&gt;

&lt;p&gt;In fact, if you are part of a team building a modern data architecture, chances are high you are using an open-source stack. &lt;br&gt;
Similarly, &lt;strong&gt;Cloud Computing&lt;/strong&gt; has been enabling scalable and cost-effective Big Data solutions in the analytics space.&lt;/p&gt;




&lt;h3&gt;&lt;center&gt;&lt;b&gt;Open Source and Cloud : The Correlation&lt;/b&gt;&lt;/center&gt;&lt;/h3&gt;

&lt;p&gt;In the cloud ecosystem, many of the commercially available &lt;strong&gt;cloud services&lt;/strong&gt; are either  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Similar to an OSS&lt;/strong&gt; ➡ Similar in Features (&lt;strong&gt;Eg:&lt;/strong&gt; AWS Step Functions and Apache Airflow )&lt;/p&gt;

&lt;center&gt;OR&lt;/center&gt;

&lt;p&gt;&lt;strong&gt;Modeled after an OSS&lt;/strong&gt; ➡ Follows/inherits the design principles of an existing Open Source framework. (&lt;strong&gt;Eg:&lt;/strong&gt; AWS Kinesis and Apache Kafka)&lt;/p&gt;

&lt;center&gt;OR&lt;/center&gt;

&lt;p&gt;&lt;strong&gt;Managed service of an OSS&lt;/strong&gt; ➡ Takes care of deployment &amp;amp; maintenance of the OSS framework, making it ready to use. (&lt;strong&gt;Eg:&lt;/strong&gt; AWS RDS Postgres and PostgreSQL)&lt;/p&gt;

&lt;p&gt;To understand more, let's touch upon the basics...&lt;/p&gt;




&lt;h3&gt;&lt;center&gt;&lt;b&gt;Getting to know the cloud&lt;/b&gt;&lt;/center&gt;&lt;/h3&gt;

&lt;p&gt;The first step for many of us getting to know cloud services is wondering where to start amid the plethora of services available out there. &lt;/p&gt;

&lt;p&gt;So, for ease of understanding, irrespective of the cloud provider (AWS, Azure, GCP, etc.), let's group the &lt;strong&gt;big data related cloud services&lt;/strong&gt; into these stages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdi5xj6ds1bxu9iid7o7s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdi5xj6ds1bxu9iid7o7s.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Now, let's try to understand the cloud ecosystem by comparing &lt;strong&gt;AWS&lt;/strong&gt; cloud services with their equivalent open-source frameworks. (A similar comparison can be drawn with Azure and GCP as well.)&lt;/p&gt;

&lt;h4&gt;&lt;b&gt;📍 Data Ingestion:&lt;/b&gt;&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AWS Service&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Relation with OSS&lt;/th&gt;
&lt;th&gt;OSS Alternative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/kinesis/" rel="noopener noreferrer"&gt;Kinesis&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Stream Processing&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Modelled After&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/sqs/" rel="noopener noreferrer"&gt;SQS&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Message Queue&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Similar to&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.rabbitmq.com/" rel="noopener noreferrer"&gt;RabbitMQ&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/msk/" rel="noopener noreferrer"&gt;Managed Streaming for Kafka (MSK)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Stream Processing&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Managed Service of&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;&lt;b&gt;📍 Data Storage:&lt;/b&gt;&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AWS   Service&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Relation with OSS&lt;/th&gt;
&lt;th&gt;OSS   Alternative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/s3/" rel="noopener noreferrer"&gt;S3&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Object store&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Similar to&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://min.io/" rel="noopener noreferrer"&gt;Minio&lt;/a&gt;, &lt;a href="https://launchpad.net/swift" rel="noopener noreferrer"&gt;Swift&lt;/a&gt;, &lt;a href="https://ceph.io/" rel="noopener noreferrer"&gt;Ceph&lt;/a&gt;,   ...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/rds/" rel="noopener noreferrer"&gt;RDS&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Relational database&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Managed Service of&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://mariadb.org/" rel="noopener noreferrer"&gt;MariaDB&lt;/a&gt;, &lt;a href="https://www.mysql.com/" rel="noopener noreferrer"&gt;MySQL&lt;/a&gt;, &lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;Postgres&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/dynamodb/" rel="noopener noreferrer"&gt;DynamoDB&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;NoSQL database&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Similar to&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://cassandra.apache.org/" rel="noopener noreferrer"&gt;Apache Cassandra&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/elasticache/" rel="noopener noreferrer"&gt;ElastiCache&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;In-memory cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Managed Service of&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://www.memcached.org/" rel="noopener noreferrer"&gt;Memcached&lt;/a&gt;, &lt;a href="https://redis.io/" rel="noopener noreferrer"&gt;Redis&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/neptune/" rel="noopener noreferrer"&gt;Neptune&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Graph database&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Similar to&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://neo4j.com/" rel="noopener noreferrer"&gt;Neo4j&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/qldb/" rel="noopener noreferrer"&gt;Amazon QLDB&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Ledger database&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Modelled After&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.hyperledger.org/" rel="noopener noreferrer"&gt;Hyperledger&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/documentdb/" rel="noopener noreferrer"&gt;Amazon DocumentDB&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Document database&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Similar to&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.mongodb.com/" rel="noopener noreferrer"&gt;MongoDB&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/lake-formation/" rel="noopener noreferrer"&gt;AWS Lake Formation&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data lake&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Similar to&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html" rel="noopener noreferrer"&gt;HDFS&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/ebs/" rel="noopener noreferrer"&gt;EC2 EBS&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Block storage for EC2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Similar to&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://www.openebs.io/" rel="noopener noreferrer"&gt;OpenEBS&lt;/a&gt;, &lt;a href="https://github.com/portworx/px-dev" rel="noopener noreferrer"&gt;Portworx&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;&lt;b&gt;📍 Data Processing:&lt;/b&gt;&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AWS   Service&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Relation with OSS&lt;/th&gt;
&lt;th&gt;OSS   Alternative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/emr/" rel="noopener noreferrer"&gt;Elastic Map Reduce&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Hadoop&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Managed Service of&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://hadoop.apache.org/" rel="noopener noreferrer"&gt;Hadoop&lt;/a&gt;,&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/step-functions/" rel="noopener noreferrer"&gt;Step Functions&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Workflow Orchestrator&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Similar to&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://airflow.apache.org/" rel="noopener noreferrer"&gt;Apache Airflow&lt;/a&gt; ,  &lt;a href="https://flyte.org/" rel="noopener noreferrer"&gt;Flyte&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/glue/" rel="noopener noreferrer"&gt;AWS Glue&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;ETL&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Managed Service of&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/lambda/" rel="noopener noreferrer"&gt;Lambda&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Serverless&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Similar to&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://knative.dev/" rel="noopener noreferrer"&gt;Knative&lt;/a&gt;, &lt;a href="https://www.openfaas.com/" rel="noopener noreferrer"&gt;OpenFaaS&lt;/a&gt;, &lt;a href="https://fnproject.io/" rel="noopener noreferrer"&gt;Fn&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/batch/" rel="noopener noreferrer"&gt;Batch&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Batch Job Computing&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Similar to&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://airflow.apache.org/docs/stable/kubernetes.html" rel="noopener noreferrer"&gt;Apache Airflow on Kubernetes&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;&lt;b&gt;📍 Data Analysis &amp;amp; Visualization:&lt;/b&gt;&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AWS Service&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Relation with OSS&lt;/th&gt;
&lt;th&gt;OSS Alternative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/redshift" rel="noopener noreferrer"&gt;Amazon   Redshift&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data warehousing&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Similar to&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://spark.apache.org/sql/" rel="noopener noreferrer"&gt;Spark SQL&lt;/a&gt;, &lt;a href="https://hive.apache.org/" rel="noopener noreferrer"&gt;Apache Hive&lt;/a&gt;, &lt;a href="https://prestodb.io/" rel="noopener noreferrer"&gt;Presto&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/athena" rel="noopener noreferrer"&gt;Athena&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data warehousing&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Similar to&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://spark.apache.org/sql/" rel="noopener noreferrer"&gt;Spark SQL&lt;/a&gt;, &lt;a href="https://hive.apache.org/" rel="noopener noreferrer"&gt;Apache Hive&lt;/a&gt;, &lt;a href="https://prestodb.io/" rel="noopener noreferrer"&gt;Presto&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/cloudsearch" rel="noopener noreferrer"&gt;CloudSearch&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Search&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Similar to&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.elastic.co/" rel="noopener noreferrer"&gt;Elasticsearch&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/elasticsearch-service" rel="noopener noreferrer"&gt;Elasticsearch Service&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Search&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Managed Service of&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.elastic.co/" rel="noopener noreferrer"&gt;Elasticsearch&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/quicksight" rel="noopener noreferrer"&gt;QuickSight&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Business analytics&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Similar to&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://powerbi.microsoft.com/en-us/" rel="noopener noreferrer"&gt;PowerBI&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;&lt;b&gt;📍 Deployment:&lt;/b&gt;&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AWS Service&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Relation with OSS&lt;/th&gt;
&lt;th&gt;OSS Alternative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/ecr" rel="noopener noreferrer"&gt;Elastic Container Registry (ECR)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Container registry&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Managed Service of&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://www.docker.com/" rel="noopener noreferrer"&gt;Docker Registry&lt;/a&gt;, &lt;a href="https://quay.io/" rel="noopener noreferrer"&gt;Quay&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/ecs" rel="noopener noreferrer"&gt;Elastic Container Service (ECS)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Container orchestration&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Managed Service of&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt;, &lt;a href="https://mesosphere.github.io/marathon/" rel="noopener noreferrer"&gt;Marathon&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/eks" rel="noopener noreferrer"&gt;Elastic Kubernetes Services (EKS)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Container orchestration&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Managed Service of&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/cloudformation" rel="noopener noreferrer"&gt;Cloud Formation&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Infrastructure as code&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Similar to&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;Some notable cloud-adoption figures for Big Data&lt;/h4&gt;

&lt;blockquote&gt;- To date, AWS users have launched more than 15 million Hadoop clusters (EMR / containerized versions). &lt;br&gt;
- "Container-as-a-service" (EKS, ECS) and "Database-as-a-service" (RDS, DynamoDB) were the most commonly used managed services in 2020.&lt;br&gt;
- Database services usage is up 127% year over year.
&lt;/blockquote&gt;

&lt;p&gt;Next Steps...&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;See how these services are put to use in real-world scenarios in this &lt;a href="https://aws.amazon.com/big-data/use-cases/" rel="noopener noreferrer"&gt;article&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;This &lt;a href="http://d0.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf" rel="noopener noreferrer"&gt;whitepaper&lt;/a&gt; from AWS is a good place to understand its Big Data services.&lt;/li&gt;
&lt;li&gt;Then get hands-on by following this &lt;a href="https://github.com/manuparra/starting-bigdata-aws" rel="noopener noreferrer"&gt;repo&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Going forward, I'll publish detailed posts on tools and frameworks used by Data Engineers day in and day out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/J4zA6LplubvC5weDyo/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/J4zA6LplubvC5weDyo/giphy.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow for updates.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>opensource</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Data Engineering Series #1: 10 Key tech skills you need, to become a competent Data Engineer.</title>
      <dc:creator>Srinidhi </dc:creator>
      <pubDate>Tue, 19 May 2020 04:41:53 +0000</pubDate>
      <link>https://forem.com/srinidhi/data-engineering-series-1-10-key-tech-skills-you-need-to-become-a-competent-data-engineer-2n46</link>
      <guid>https://forem.com/srinidhi/data-engineering-series-1-10-key-tech-skills-you-need-to-become-a-competent-data-engineer-2n46</guid>
      <description>&lt;p&gt;Bridging the gap between Application Developers and Data Scientists, &lt;a href="https://www.datanami.com/2020/02/12/demand-for-data-engineers-up-50/"&gt;the demand for Data Engineers rose up to 50% in 2020&lt;/a&gt;, especially due to increase in investments on AI based SaaS products.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xc9exkFL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/elt76s4dd15d8mcimf8q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xc9exkFL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/elt76s4dd15d8mcimf8q.png" alt="Alt Text" width="823" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
After going through multiple job descriptions, and based on my experience in the field, I have come up with a detailed list of skill sets needed to become a competent Data Engineer.
&lt;/blockquote&gt;

&lt;p&gt;If you are a Backend Developer, some of your skills will overlap with the list below. &lt;b&gt;Yes&lt;/b&gt;, it's quite easy for you to make the jump, provided the skill gaps are addressed.&lt;/p&gt;

&lt;h3&gt; 🎯 Must Haves&lt;/h3&gt;

&lt;p&gt;&lt;b&gt; 1️⃣ The Art of Scripting and Automating &lt;/b&gt;&lt;br&gt;
Can't stress this enough. &lt;br&gt;
Ability to write reusable code, and knowledge of the common libraries and frameworks used in Python for:&lt;/p&gt;

&lt;blockquote&gt;
        * Data Wrangling operations - pandas, numpy, re&lt;br&gt;
        * Data Scraping - requests / BeautifulSoup / lxml / Scrapy&lt;br&gt;
        * Interacting with external APIs and other data sources, logging&lt;br&gt;
        * Parallel processing libraries - Dask, multiprocessing&lt;br&gt;
&lt;/blockquote&gt;
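&lt;p&gt;As a minimal taste of the wrangling side, here is a stdlib-only sketch using &lt;code&gt;re&lt;/code&gt; (the sample records and cleaning rules are made up for illustration):&lt;/p&gt;

```python
import re

# Hypothetical raw records, as they might arrive from a scrape or a CSV export.
raw = ["  Alice , 34 ", "BOB,  41", "carol ,29"]

def clean(record):
    """Normalize whitespace and case, then split into (name, age)."""
    name, age = re.split(r"\s*,\s*", record.strip())
    return name.title(), int(age)

rows = [clean(r) for r in raw]
# rows == [("Alice", 34), ("Bob", 41), ("Carol", 29)]
```

&lt;p&gt;In practice pandas handles this kind of cleanup at scale, but the habit of writing small reusable helpers like &lt;code&gt;clean&lt;/code&gt; is the skill itself.&lt;/p&gt;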

&lt;p&gt;&lt;b&gt;2️⃣ Cloud Computing Platforms&lt;/b&gt;&lt;br&gt;&lt;a href="https://www.sisense.com/blog/data-engineering-today-all-about-the-cloud/"&gt;The rise of cloud storage and computing has changed a lot for data engineers.&lt;/a&gt; So much so that being well versed in at least one of the cloud platforms is a requirement. &lt;/p&gt;

&lt;blockquote&gt;
* Serverless Computing, Virtual Instances, Managed Docker and Kubernetes Services&lt;br&gt;
* Security Standards, User Authentication and Authorization, Virtual Private Cloud, Subnets
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Either start with &lt;b&gt;&lt;a href="https://www.udemy.com/course/aws-big-data/?utm_source=adwords&amp;amp;utm_medium=udemyads&amp;amp;utm_campaign=LongTail_la.EN_cc.INDIA&amp;amp;utm_content=deal4584&amp;amp;utm_term=_._ag_77882236223_._ad_387397828060_._kw__._de_c_._dm__._pl__._ti_dsa-1007766171032_._li_9061899_._pd__._&amp;amp;matchtype=b&amp;amp;gclid=CjwKCAiAzJLzBRAZEiwAmZb0app7tODlXWDk89Zpdcgj6QDgpnubj7aURaUQTgXjz3RkUy_TRKrlTxoCBFwQAvD_BwE"&gt;AWS&lt;/a&gt;&lt;/b&gt; or &lt;b&gt;&lt;a href="https://www.coursera.org/professional-certificates/gcp-data-engineering"&gt;GCP&lt;/a&gt;&lt;/b&gt; services. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;b&gt;3️⃣ Linux OS &lt;/b&gt;&lt;br&gt;
The importance of working with a Linux OS is often overlooked. &lt;br&gt;&lt;a href="https://www.redhat.com/en/resources/state-of-linux-in-public-cloud-for-enterprises"&gt;"90% of the public cloud workloads are running on Linux based OS"&lt;/a&gt; &lt;/p&gt;

&lt;blockquote&gt;        
        * Bash Scripting concepts in Linux like control flow, looping, passing input parameters&lt;br&gt;
        * File System Commands&lt;br&gt;
        * Running daemon processes&lt;br&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;b&gt;4️⃣ Database Management - Relational Databases, OLAP vs OLTP, NoSQL&lt;/b&gt;&lt;/p&gt;

&lt;blockquote&gt;
        * Creating tables; read, write, update and delete operations; joins, procedures, materialized views, aggregated views, window functions&lt;br&gt;
        * Database vs Data warehouse. Star and snowflake schemas, facts and dimension tables.&lt;br&gt;
&lt;/blockquote&gt;
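&lt;p&gt;A quick, zero-setup way to practice the CRUD-and-joins part is the &lt;code&gt;sqlite3&lt;/code&gt; module from the standard library. The tables below form a toy fact/dimension pair in the star-schema spirit; names and values are invented:&lt;/p&gt;

```python
import sqlite3

# In-memory database: a small fact table (orders) joined to a dimension (customers).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 15.0), (12, 2, 40.0);
""")

# Update one row, then aggregate with a join (the read side of CRUD).
con.execute("UPDATE orders SET amount = 30.0 WHERE id = 10")
totals = con.execute("""
    SELECT c.name, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
# totals == [('Alice', 45.0), ('Bob', 40.0)]
```

&lt;p&gt;The same queries transfer almost verbatim to PostgreSQL or MySQL once you move past practice databases.&lt;/p&gt;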

&lt;ul&gt;
&lt;li&gt; Common Relational Databases preferred - &lt;b&gt;PostgreSQL, MySQL etc&lt;/b&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;b&gt; 5️⃣ Distributed Data Storage Systems &lt;/b&gt;&lt;/p&gt;

&lt;blockquote&gt;
        * Knowledge of how a distributed data store works.&lt;br&gt;        
        * Understanding concepts like partitioned data storage, sorting keys, SerDes, data replication, caching and persistence.
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt; Some of the most used ones - &lt;b&gt;HDFS, AWS S3, or a NoSQL database (MongoDB, DynamoDB, Cassandra)&lt;/b&gt;
&lt;/li&gt;
&lt;/ul&gt;
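&lt;p&gt;To make "partitioned storage and replication" concrete, here is a deliberately simplified placement sketch. The node names are hypothetical and this is not the algorithm of any particular store, just the shape of the idea:&lt;/p&gt;

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical storage nodes

def replicas(key, n=2):
    """Pick n distinct nodes for a key: hash the key to a "home" partition,
    then place replicas on the next nodes in sequence. md5 keeps the
    placement deterministic across processes, unlike Python's built-in hash."""
    home = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(NODES)
    return [NODES[(home + i) % len(NODES)] for i in range(n)]
```

&lt;p&gt;Real systems refine this with consistent hashing and rack awareness, but reads and writes still route by key hash, which is why hot keys and skewed partitions matter so much.&lt;/p&gt;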

&lt;p&gt;&lt;b&gt; 6️⃣ Distributed Data Processing Systems &lt;/b&gt;&lt;/p&gt;

&lt;blockquote&gt;
        * Common techniques and patterns for data processing such as partitioning, predicate pushdown, sort by partition, maintaining the size of shuffle blocks, window functions&lt;br&gt;
        * Leveraging all cores and memory available in the cluster to improve concurrency.
&lt;/blockquote&gt;
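&lt;p&gt;The partition-then-aggregate pattern these frameworks implement can be sketched in plain Python. This runs in one process; a real engine spreads the partitions across cores (e.g. &lt;code&gt;multiprocessing.Pool&lt;/code&gt;) or cluster nodes. Records here are invented:&lt;/p&gt;

```python
from collections import defaultdict

def partition(records, num_partitions):
    """Shuffle step: route each (key, value) pair to a partition by key hash,
    so all values for one key land in the same partition."""
    parts = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in records:
        parts[hash(key) % num_partitions][key].append(value)
    return parts

def reduce_partitions(parts):
    """Reduce step: each partition aggregates independently of the others."""
    out = {}
    for part in parts:
        for key, values in part.items():
            out[key] = sum(values)
    return out

records = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)]
totals = reduce_partitions(partition(records, num_partitions=4))
# totals == {"a": 4, "b": 6, "c": 5}
```

&lt;p&gt;Because no key ever spans two partitions, the reduce step needs no coordination; that independence is what lets Spark scale it out.&lt;/p&gt;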

&lt;ul&gt;
&lt;li&gt; Common Distributed processing frameworks - &lt;b&gt;Map Reduce, Apache Spark&lt;/b&gt; (start with PySpark if you are already comfortable with Python)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Credits: startdataengineering&lt;/p&gt;

&lt;p&gt;&lt;b&gt; 7️⃣ ETL/ELT tools and Modern Workflow Management Frameworks&lt;/b&gt;&lt;br&gt;
Different companies pick ETL frameworks in different ways; one with an in-house data engineering team would prefer ETL jobs set up with a properly managed workflow management tool for batch processing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ETL - ETL vs ELT, Data connectivity, Mapping, Metadata, Types of Data Loading&lt;/li&gt;
&lt;li&gt;When to use a Workflow Management System - Directed Acyclic Graphs, CRON scheduling, Operators&lt;/li&gt;
&lt;li&gt;ETL Tools: &lt;b&gt;Informatica, Talend&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Workflow Management Frameworks: &lt;b&gt;Airflow, Luigi&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
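&lt;p&gt;Workflow managers like Airflow model a pipeline as a Directed Acyclic Graph and run each task once its upstream dependencies finish. Here is a framework-free sketch of that scheduling idea (Kahn's algorithm); the task names are made up, and Airflow expresses the same dependencies with operators and its &lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt; syntax:&lt;/p&gt;

```python
from collections import deque

# A toy pipeline: extract feeds validate and transform, which both feed load.
deps = {
    "extract": [],
    "validate": ["extract"],
    "transform": ["extract"],
    "load": ["transform", "validate"],
}

def run_order(deps):
    """Kahn's algorithm: repeatedly run tasks whose dependencies are all done."""
    remaining = {task: set(d) for task, d in deps.items()}
    ready = deque(sorted(t for t, d in remaining.items() if not d))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for t, d in remaining.items():
            if task in d:
                d.remove(task)
                if not d and t not in order and t not in ready:
                    ready.append(t)
    return order

order = run_order(deps)
```

&lt;p&gt;A scheduler adds retries, CRON-based triggering and backfills on top, but the dependency resolution at the core is exactly this.&lt;/p&gt;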

&lt;h3&gt;🎯Good To Have&lt;/h3&gt;

&lt;p&gt;&lt;b&gt;8️⃣ JAVA / JVM Based Frameworks &lt;/b&gt;&lt;br&gt;
Knowledge of a JVM based language such as &lt;b&gt;Java or Scala&lt;/b&gt; will be extremely useful&lt;/p&gt;

&lt;blockquote&gt;
        - Understand both functional and object oriented programming concepts&lt;br&gt;
        - Many of the high performance data science frameworks that are built on top of Hadoop usually are written using Scala or Java.
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt; JVM Based Frameworks - &lt;b&gt;Apache Spark, Apache Flink&lt;/b&gt;, etc &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;b&gt;9️⃣ Message Queuing Systems &lt;/b&gt;&lt;/p&gt;

&lt;blockquote&gt;
        * Understanding how data ingestion happens in message queues&lt;br&gt;
        * What producers and consumers are, and how they are implemented&lt;br&gt;
        * Sharding, data retention, replay, de-duplication
&lt;/blockquote&gt;
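&lt;p&gt;The producer/consumer pattern, plus consumer-side de-duplication, can be sketched with the stdlib &lt;code&gt;queue&lt;/code&gt; module standing in for a broker topic. The message shape and duplicate delivery are contrived to show at-least-once semantics:&lt;/p&gt;

```python
import queue
import threading

q = queue.Queue()          # stands in for a broker topic/partition
seen, results = set(), []  # consumer-side de-duplication state

def producer():
    # Publish a few messages; message 2 is delivered twice (at-least-once).
    for msg_id in [1, 2, 2, 3]:
        q.put({"id": msg_id, "payload": f"event-{msg_id}"})
    q.put(None)  # sentinel: end of stream

def consumer():
    while True:
        msg = q.get()
        if msg is None:
            break
        if msg["id"] not in seen:  # drop redelivered duplicates
            seen.add(msg["id"])
            results.append(msg["payload"])

t = threading.Thread(target=producer)
t.start()
consumer()
t.join()
# results == ["event-1", "event-2", "event-3"]
```

&lt;p&gt;Kafka and Kinesis decouple the two sides across machines and persist the log, but the idempotent-consumer idea shown here is the same.&lt;/p&gt;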

&lt;ul&gt;
&lt;li&gt; Popular messaging queues: &lt;b&gt;Kafka, RabbitMQ, Kinesis, SQS, etc.&lt;/b&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;b&gt;🔟 Stream Data Processing &lt;/b&gt;&lt;/p&gt;

&lt;blockquote&gt;       
* Differentiating between real-time, stream and batch processing&lt;br&gt;
* Sharding, repartitioning, poll wait time, topics/groups, brokers
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt; Commonly used frameworks: &lt;b&gt;AWS Kinesis Streams, Apache Spark, Storm, Samza&lt;/b&gt;, etc.&lt;/li&gt;
&lt;/ul&gt;
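&lt;p&gt;One core streaming idea, grouping events into fixed, non-overlapping time windows, in a stdlib-only sketch (timestamps and values are invented; real engines add watermarks and late-data handling):&lt;/p&gt;

```python
from collections import defaultdict

def tumbling_window(events, window_seconds):
    """Assign each (timestamp, value) event to the window it falls in,
    then aggregate per window -- the essence of micro-batch processing."""
    windows = defaultdict(list)
    for ts, value in events:
        window_start = ts - ts % window_seconds
        windows[window_start].append(value)
    return {start: sum(vals) for start, vals in sorted(windows.items())}

events = [(0, 1), (3, 2), (5, 4), (9, 8), (12, 16)]
sums = tumbling_window(events, window_seconds=5)
# sums == {0: 3, 5: 12, 10: 16}
```

&lt;p&gt;Spark Structured Streaming and Kinesis Analytics expose the same windowing as declarative operators instead of hand-rolled loops.&lt;/p&gt;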

&lt;p&gt;If you find any other skill that will be helpful, comment below on the post.&lt;/p&gt;

&lt;p&gt;Going forward, I'll publish detailed posts on tools and frameworks used by Data Engineers day in and day out. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/J4zA6LplubvC5weDyo/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/J4zA6LplubvC5weDyo/giphy.gif" width="320" height="320"&gt;&lt;/a&gt;&lt;br&gt;
Follow for updates.&lt;/p&gt;

</description>
      <category>dev</category>
      <category>career</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
