<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ralph Brooks</title>
    <description>The latest articles on Forem by Ralph Brooks (@ralphbrooks).</description>
    <link>https://forem.com/ralphbrooks</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F605635%2Fa0c458ca-510d-4664-bb96-6c1b376cfcac.png</url>
      <title>Forem: Ralph Brooks</title>
      <link>https://forem.com/ralphbrooks</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ralphbrooks"/>
    <language>en</language>
    <item>
      <title>Conquer logging once and for all with Vertex AI and Google Cloud</title>
      <dc:creator>Ralph Brooks</dc:creator>
      <pubDate>Wed, 16 Jun 2021 13:39:54 +0000</pubDate>
      <link>https://forem.com/ralphbrooks/conquer-logging-once-and-for-all-with-vertex-ai-and-google-cloud-311n</link>
      <guid>https://forem.com/ralphbrooks/conquer-logging-once-and-for-all-with-vertex-ai-and-google-cloud-311n</guid>
      <description>&lt;p&gt;Vertex AI was announced at Google I/O 2021. More than just a rebranding of the Google AI Platform, this product starts to unify a lot of different APIs (including AutoML) under one product offering. Google states in a press release that this allows companies to start to implement MLOps easier.&lt;/p&gt;

&lt;p&gt;In this blog, we are going to do the equivalent of "Hello World" for Data Science using the Vertex AI platform. In short, we are going to use a Vertex AI "Jupyter" Notebook to communicate with the logging service of Google Cloud. Think of a notebook as a way of running Python code in an iterative manner that allows you to capture the results along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  TLDR - Show me the code!
&lt;/h2&gt;

&lt;p&gt;If you use a Vertex AI notebook, you can easily test out the &lt;a href="https://googleapis.dev/python/logging/latest/usage.html#writing-log-entries" rel="noopener noreferrer"&gt;Python library&lt;/a&gt; for Cloud Logging within Google Cloud. The notebook that you need in order to test Cloud Logging can be &lt;a href="https://whiteowleducation-ml-mastery.s3.amazonaws.com/2021-06-01-vertex-ai-logging/2021-06-03-google-cloud-logging-v2.ipynb" rel="noopener noreferrer"&gt;downloaded here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;In order to complete the steps in this blog, you need to have the following:&lt;/p&gt;

&lt;p&gt;1) You need to have a Google Cloud Account. If you don't already have an account, take a look at &lt;a href="https://courses.whiteowleducation.com/courses/machine-learning-mastery/lectures/30703728" rel="noopener noreferrer"&gt;this video&lt;/a&gt; which shows how to set up an account.&lt;/p&gt;

&lt;p&gt;2) You need to enable the Cloud Logging API. After you have set up an account, you can find details about enabling this API at &lt;a href="https://console.cloud.google.com/apis/api/logging.googleapis.com" rel="noopener noreferrer"&gt;https://console.cloud.google.com/apis/api/logging.googleapis.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;3) You need to go to &lt;a href="https://console.cloud.google.com/vertex-ai/notebooks" rel="noopener noreferrer"&gt;https://console.cloud.google.com/vertex-ai/notebooks&lt;/a&gt;. Enable the notebooks API if you see a corresponding warning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fig9n9qr6350qa1oyqrtv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fig9n9qr6350qa1oyqrtv.png" alt="Images courtesy of https://www.whiteowleducation.com "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Images courtesy of &lt;a href="https://www.whiteowleducation.com" rel="noopener noreferrer"&gt;https://www.whiteowleducation.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Create a Vertex AI notebook
&lt;/h2&gt;

&lt;p&gt;Within the Vertex AI console, the first step is to create a notebook instance. This instance is backed by a CPU that bills while it runs, so work through this exercise and, when you're done, be sure to delete the notebook instance so that you don't incur additional fees.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9pruedofpg2sy0u44gi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9pruedofpg2sy0u44gi.png" alt="Make sure to delete the notebook instance after you get done using it. This is important to manage costs."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Make sure to delete the notebook instance after you get done using it. This is important to manage costs.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As seen above, I'm creating a notebook called test-logging in us-central1 (and you should create your instance in a location that is close to you). I create this notebook with libraries such as TensorFlow and Pandas that would typically be used in data science, and I do this by selecting the TensorFlow Enterprise 2.5 environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimso51b3tok66s3ikk9s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimso51b3tok66s3ikk9s.png" alt="Since we are just examining logging, I am minimizing CPU to manage costs."&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Since we are just examining logging, I am minimizing CPU to manage costs.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After clicking create, you will see the test-logging notebook appear in the console. Click on "OPEN JUPYTERLAB" to continue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshq7vj88orud3lhgizk4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshq7vj88orud3lhgizk4.png" alt="Notebook in console with JupyterLab"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you are in JupyterLab, click on Python [conda env:root] in order to open up a notebook for experimentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsf0f4hsyx1p9227x970.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsf0f4hsyx1p9227x970.png" alt="Vertex Notebook Options"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now go ahead and enter the following Python code into the notebook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;google.cloud.logging_v2&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;logging_v2&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;environ&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging_v2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;google_log_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Formatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%(name)s | %(module)s | %(funcName)s | %(message)s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;datefmt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-$dT%H:%M:%S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_default_handler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setFormatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;google_log_format&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;cloud_logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vertex-ai-notebook-logger&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cloud_logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INFO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cloud_logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vertex-ai-notebook-logger&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This is a log from a Vertex AI Notebook!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If any of the above code looks unfamiliar, or if you have not used the google-cloud-logging library before, I strongly encourage you to take a look at &lt;a href="https://courses.whiteowleducation.com/courses/machine-learning-mastery/lectures/32084006" rel="noopener noreferrer"&gt;this video&lt;/a&gt;, which discusses how to set up the format for logging and how to get a Python logger to output information to the cloud.&lt;/p&gt;
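&lt;p&gt;If you want to experiment with the same formatter-and-handler pattern without a Google Cloud account, the Python standard library is enough. This sketch swaps the Cloud Logging handler for a plain StreamHandler writing to an in-memory buffer (the logger name here is made up for illustration):&lt;/p&gt;

```python
import io
import logging

# Same format string as in the notebook above, but with a plain
# StreamHandler instead of the Cloud Logging handler.
log_format = logging.Formatter(
    fmt='%(name)s | %(module)s | %(funcName)s | %(message)s',
    datefmt='%Y-%m-%dT%H:%M:%S')

buffer = io.StringIO()              # capture output so we can inspect it
handler = logging.StreamHandler(buffer)
handler.setFormatter(log_format)

local_logger = logging.getLogger("local-test-logger")
local_logger.setLevel("INFO")
local_logger.addHandler(handler)

local_logger.info("This is a local log entry!")
print(buffer.getvalue().strip())
```

&lt;p&gt;The same formatter can later be attached to the Cloud Logging handler, as shown above, without changing any of the logging calls.&lt;/p&gt;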




&lt;h2&gt;
  
  
  Verify Results
&lt;/h2&gt;

&lt;p&gt;After running this test code in the notebook, you can head over to the Logs Explorer to see your results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwlydiu4rwhog79xz2z2w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwlydiu4rwhog79xz2z2w.png" alt="Google Cloud Log Explorer"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, remember to go back into the notebook console within Vertex AI to:&lt;/p&gt;

&lt;p&gt;1) Select the "instance name" that you created (such as "test-logging")&lt;/p&gt;

&lt;p&gt;2) Click the delete icon at the top of the console in order to delete the instance&lt;/p&gt;

&lt;p&gt;When successful, you should see a message on the console that states "You don't have any notebook instances in this project yet."&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this blog post, we briefly reviewed Vertex AI notebooks, and we looked at how those notebooks can communicate with the centralized logging in Google Cloud.&lt;/p&gt;

&lt;p&gt;It would be great to hear your thoughts about this blog. You can reach me through my company (White Owl Education), which is on Twitter at @&lt;a href="https://twitter.com/whiteowled" rel="noopener noreferrer"&gt;whiteowled&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>python</category>
      <category>jupyter</category>
    </item>
    <item>
      <title>Best Practices to Become a Data Engineer</title>
      <dc:creator>Ralph Brooks</dc:creator>
      <pubDate>Fri, 07 May 2021 22:37:05 +0000</pubDate>
      <link>https://forem.com/ralphbrooks/best-practices-to-become-a-data-engineer-4656</link>
      <guid>https://forem.com/ralphbrooks/best-practices-to-become-a-data-engineer-4656</guid>
      <description>&lt;p&gt;Steps to go from doing data analysis to ingesting and cleaning data in order get better insights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Q: I come from a business intelligence background. I’m looking to make the transition to a data engineer. How do I go about doing this? Do I need to learn NumPy? Do I need to learn Pandas? What are the key concepts that I need to understand in order to ramp up on data engineering quickly?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you’re doing business intelligence, maybe you’re working with data visualization tools such as Power BI or Tableau. Either way, you’re doing a lot of analysis. At some point, you will be ready to ingest new data so that you can derive richer, deeper insights — you’re ready to start the journey to become a data engineer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Master the basics first.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The following are the six main things that you need to do in order to get to the next level in your career as a future data engineer:&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Learn the Basics of SQL
&lt;/h2&gt;

&lt;p&gt;If you have never done ANY programming before, then the first place that you want to start is by learning to sift through data. The way to do this is by learning a language called SQL (Structured Query Language). Among other things, SQL is a tool that can be used to look at relevant information in TABLES and to filter information with WHERE and SELECT statements.&lt;/p&gt;
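&lt;p&gt;To make TABLES, SELECT, and WHERE concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name and data are made up purely for illustration:&lt;/p&gt;

```python
import sqlite3

# An in-memory database with a small, made-up orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 120.0), ("bob", 35.5), ("alice", 80.0)])

# SELECT picks the columns, WHERE filters the rows.
rows = conn.execute(
    "SELECT customer, amount FROM orders WHERE customer = 'alice'"
).fetchall()
print(rows)  # only alice's orders come back
```

&lt;p&gt;The same SELECT/WHERE ideas apply unchanged whether the table lives in SQLite, a traditional data warehouse, or BigQuery.&lt;/p&gt;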

&lt;h3&gt;
  
  
  GOOD RESOURCES TO GET STARTED WITH SQL
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.khanacademy.org/computing/computer-programming/sql/sql-basics/v/welcome-to-sql" rel="noopener noreferrer"&gt;khanacademy.org&lt;/a&gt;  - Khan Academy has a good set of videos that goes through the basics of SQL. The videos cover how to select data, and how to join data together. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.amazon.com/dp/B006QNDJZI/ref=dp-kindle-redirect?_encoding=UTF8&amp;amp;btkr=1" rel="noopener noreferrer"&gt;Head First SQL&lt;/a&gt; – When I was first starting out, I definitely looked at one or two.  Head First books published by O’Reilly. Head First covers the basics of a topic while focusing on different ways to engage the brain so that you learn the material quickly. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.amazon.com/SQL-Cookbook-Query-Solutions-Techniques/dp/1492077445/ref=sr_1_1?dchild=1&amp;amp;gclid=Cj0KCQjw1a6EBhC0ARIsAOiTkrFnexKQYh8wu-h-nOzvf4jPebcmpVPaOXBYeWLzuKAcUfskwv_D108aAmw6EALw_wcB&amp;amp;hvadid=241663586626&amp;amp;hvdev=c&amp;amp;hvlocphy=9026808&amp;amp;hvnetw=g&amp;amp;hvqmt=e&amp;amp;hvrand=7614006686566039404&amp;amp;hvtargid=kwd-1023089072&amp;amp;hydadcr=16371_10302015&amp;amp;keywords=sql+cookbook&amp;amp;qid=1619811009&amp;amp;sr=8-1" rel="noopener noreferrer"&gt;SQL Cookbook&lt;/a&gt; - SQL Cookbook gives step by step instructions on ways to look at data ("recipes") and different ways to think about how to analyze data. The book helps someone form an intuition about how to approach data analysis. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/bigquery/docs/#docs" rel="noopener noreferrer"&gt;Google Cloud Reference Documentation for Big Query&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I am a big fan of jumping into the deep end of the pool and learning how to swim quickly. Frankly, there's no better way to do this than Standard SQL with BigQuery in Google Cloud. &lt;/p&gt;

&lt;p&gt;If you start your SQL journey with BigQuery, then you are learning about Google Cloud technology while you are learning data analysis. If you are going this route, then a good place to start is to go through the &lt;a href="https://www.qwiklabs.com/quests/68" rel="noopener noreferrer"&gt;BigQuery for Data Warehousing&lt;/a&gt; tutorial.&lt;/p&gt;

&lt;p&gt;When you learn this Google version of SQL (&lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types" rel="noopener noreferrer"&gt;Standard SQL&lt;/a&gt;), you will not only learn how to analyze data, but you will also learn how to make predictions on data. For example, you could use this flavor of SQL to predict sales with a &lt;a href="https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create" rel="noopener noreferrer"&gt;linear regression&lt;/a&gt;. As another example, you could also use this language to do a basic prediction as to whether or not a customer will make a transaction with a &lt;a href="https://cloud.google.com/bigquery-ml/docs/bigqueryml-web-ui-start" rel="noopener noreferrer"&gt;logistic regression&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Learn the Basics of Python
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3v17y1jrh84nula0vt4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3v17y1jrh84nula0vt4.jpg" alt="Python (the programming language) has nothing to do with a snake which has the same name (Image courtesy of pexels.com)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Python (the programming language) has nothing to do with a snake which has the same name (Image courtesy of pexels.com)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you want to get to the next level in your career, it is almost essential to learn some type of programming language. Personally, I would recommend learning Python. Python is relatively straightforward to learn, and you can use it to process streaming data with data pipelines, to analyze data with Jupyter notebooks, and to build artificial intelligence models. For me, it is one language that does a lot, and it is super flexible.&lt;/p&gt;

&lt;p&gt;These are some top resources to learn Python, though not all of them are free. In some cases, you “get what you pay for”, and paying for something may save you a lot of time in the long run.&lt;/p&gt;

&lt;h3&gt;
  
  
  BEST RESOURCES TO GET STARTED WITH PYTHON
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://learnpythonthehardway.org/python3/" rel="noopener noreferrer"&gt;Learn Python 3 the Hard Way&lt;/a&gt;  – White Owl Education has no affiliation with the author of “Learn Python 3 the Hard Way”, but when I was learning Python this one was one of the books that I used in order to ramp up. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One note here - The book is very specific about what editor to use and how to get through the class. I would follow the instructions in the book to the letter without deviation. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.python.org/3/tutorial/introduction.html" rel="noopener noreferrer"&gt;Official Python Tutorial&lt;/a&gt; - The official Python tutorial (which is part of the reference documentation) is actually pretty good, but it doesn't endorse any particular software for writing Python. Because of this, you're still better off using a book like "Learn Python the Hard Way" before jumping into the official tutorial.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.python.org/" rel="noopener noreferrer"&gt;Official Python Documentation&lt;/a&gt; – Think of the official Python Documentation like a dictionary. You're not going to read this front to back, but it is definitely a reference that you may want to use from time to time. &lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 3: Learn How to Navigate Code Quickly
&lt;/h2&gt;

&lt;p&gt;When you are developing in a programming language, it is helpful to write code in a program that can check for syntax errors, and that can quickly navigate through large amounts of code (large "code bases"). These programs that help you write code are called Integrated Development Environments, and the following are two of the most popular IDEs:&lt;/p&gt;

&lt;h3&gt;
  
  
  TOP DEVELOPMENT ENVIRONMENTS FOR PYTHON
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;PyCharm: White Owl Education has a &lt;a href="https://courses.whiteowleducation.com/courses/machine-learning-mastery/lectures/30654103" rel="noopener noreferrer"&gt;free tutorial on how to set up Pycharm on your laptop&lt;/a&gt; .&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://code.visualstudio.com/" rel="noopener noreferrer"&gt;Visual Studio Code&lt;/a&gt;: This is a lightweight integrated development environment.  As a side note, colleagues of mine swear by VS Code for Python development, but I have only used VS code for React, JavaScript, and TypeScript development. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  HOW MUCH PYTHON DO YOU NEED TO KNOW TO MASTER DATA ENGINEERING?
&lt;/h3&gt;

&lt;p&gt;On this journey to becoming a data engineer, you need to master the basics of Python. How do you do this? Do a technique, and then do that same technique again and again until it becomes intuitive. This is critical because as you start to use Python to stream in data or to “do artificial intelligence,” you don't want to worry about very basic Python syntax mistakes.&lt;/p&gt;

&lt;p&gt;So what do you need to know? You should be able to do the following in your sleep.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.python.org/3/tutorial/controlflow.html#defining-functions" rel="noopener noreferrer"&gt;Creation of a function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.python.org/3/tutorial/classes.html" rel="noopener noreferrer"&gt;Create a Python Class&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.python.org/3/tutorial/controlflow.html" rel="noopener noreferrer"&gt;Understand control flow using ‘if’ and ‘for’&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.python.org/3/howto/logging.html" rel="noopener noreferrer"&gt;Debugging with a logger&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files" rel="noopener noreferrer"&gt;Reading and writing from a file&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.python.org/3/tutorial/modules.html" rel="noopener noreferrer"&gt;Creation of a Python module&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.python.org/3/library/unittest.html" rel="noopener noreferrer"&gt;Create a basic unit test&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;You should be able to use the &lt;a href="https://docs.python-requests.org/en/master/" rel="noopener noreferrer"&gt;requests package&lt;/a&gt; to pull data from an API.&lt;/li&gt;
&lt;/ul&gt;
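&lt;p&gt;Most of the items on the checklist above fit in a few lines of code. The following sketch is just an illustration (the names Counter, double, and basics-demo.txt are made up), covering functions, classes, control flow, logging, and file I/O:&lt;/p&gt;

```python
import logging
import os
import tempfile

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("basics-demo")

def double(x):
    """A basic function."""
    return x * 2

class Counter:
    """A basic class with one piece of state."""
    def __init__(self):
        self.total = 0
    def add(self, n):
        self.total += n

counter = Counter()
for n in [1, 2, 3]:          # control flow with 'for'
    if n % 2 == 1:           # ...and with 'if'
        counter.add(double(n))

log.info("total is %d", counter.total)   # debugging with a logger

# Reading and writing a file.
path = os.path.join(tempfile.gettempdir(), "basics-demo.txt")
with open(path, "w") as f:
    f.write(str(counter.total))
with open(path) as f:
    assert f.read() == "8"   # a basic unit-test-style check
```

&lt;p&gt;Once steps like these feel automatic, the data engineering tooling built on top of Python becomes much easier to pick up.&lt;/p&gt;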

&lt;h2&gt;
  
  
  Step 4: Learn the Basics of NumPy
&lt;/h2&gt;

&lt;p&gt;As you start to analyze more and more data, you may start to group things together, you may convert these things into numbers, and eventually you may start to apply math operations to these groups in order to make predictions about data. Pretty cool, right?&lt;/p&gt;

&lt;p&gt;NumPy is a Python package that helps you make these transformations efficiently. In addition, concepts from NumPy carry over to a data analysis package called Pandas, and they also appear in a machine learning framework from Google called &lt;a href="https://www.tensorflow.org/" rel="noopener noreferrer"&gt;TensorFlow&lt;/a&gt;. Long story short – if you are planning to do data analysis or machine learning, then sooner or later you will need to learn NumPy.&lt;/p&gt;

&lt;p&gt;Let's look at an example of NumPy to make things more concrete.&lt;/p&gt;

&lt;h3&gt;
  
  
  NumPy Example
&lt;/h3&gt;

&lt;p&gt;In Python, this is a list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;a_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is just a list with 6 numbers (0 to 5).  Here is the same code using NumPy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="n"&gt;nd_array&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This NumPy array ("nd_array") also contains 6 numbers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;nd_array&lt;/span&gt;
&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What if we now wanted to take these numbers and put them into two groups of 3? How could we express that?&lt;br&gt;
With NumPy, we only need the following line of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nd_array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;
&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
       &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's break this one line of code down into steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;STEP 1&lt;/em&gt;&lt;/strong&gt;: We use the NumPy package (“np”).&lt;br&gt;
&lt;strong&gt;&lt;em&gt;STEP 2&lt;/em&gt;&lt;/strong&gt;: We call a function within that package named reshape.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;STEP 3&lt;/em&gt;&lt;/strong&gt;: We reshape the array into different groups ("dimensions").&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The (-1, 3) means to use "as many groups as possible" (the -1) where the group size is 3. &lt;/li&gt;
&lt;/ul&gt;
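&lt;p&gt;The -1 trick works for any group size that divides the array evenly. For example, asking for groups of 2 instead of 3:&lt;/p&gt;

```python
import numpy as np

nd_array = np.arange(6)

# -1 asks NumPy to infer the number of groups from the group size.
C = np.reshape(nd_array, (-1, 2))   # three groups of 2
print(C)
```

&lt;p&gt;Here NumPy infers that 6 elements in groups of 2 means three groups, so C has shape (3, 2).&lt;/p&gt;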
&lt;h3&gt;
  
  
  BEST BOOKS AND RESOURCES ON NUMPY
&lt;/h3&gt;

&lt;p&gt;When you are starting out, you need to be able to do the following with NumPy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Installation of NumPy&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://numpy.org/devdocs/user/absolute_beginners.html" rel="noopener noreferrer"&gt;Understand how to determine the shape and size of an array&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://numpy.org/devdocs/user/absolute_beginners.html" rel="noopener noreferrer"&gt;Understand how to index and slice an array&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following resources can help with this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://numpy.org/install/" rel="noopener noreferrer"&gt;NumPy Installation documentation&lt;/a&gt; - This is another reference which gives a different approach on how to install NumPy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython-ebook-dp-B075X4LT6K/dp/B075X4LT6K/ref=mt_other?_encoding=UTF8&amp;amp;me=&amp;amp;qid=1620395658&amp;amp;asin=B075X4LT6K&amp;amp;revisionId=&amp;amp;format=2&amp;amp;depth=1" rel="noopener noreferrer"&gt;Python for Data Analysis&lt;/a&gt; - This book by Wes McKinney is a couple of years old, but it gives a really good walk through of NumPy and how to use it in an interactive Python environment called a &lt;a href="https://jupyter.org/" rel="noopener noreferrer"&gt;Jupyter Notebook&lt;/a&gt;.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://colab.research.google.com/" rel="noopener noreferrer"&gt;Google Colabratory&lt;/a&gt; - If you are looking for a free resource to run Python, NumPy, and TensorFlow, you may want to try Google CoLab. This site allows you to run code using GPUs that work well with machine learning operations. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Step 5: Learn the Basics of Pandas
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmajcrgp56whaau3e9qa.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmajcrgp56whaau3e9qa.jpg" alt="Turns out the Python package called Pandas ALSO is unrelated to the animal by the same name (Image courtesy of pexels.com)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Turns out the Python package called Pandas ALSO is unrelated to the animal by the same name (Image courtesy of pexels.com)&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;If you are making a transition to a career as a data engineer, then the manipulation of data and the cleaning of data are going to become extremely important. The first step in this journey may be to take a subset of data, and to work with &lt;a href="https://pandas.pydata.org/" rel="noopener noreferrer"&gt;Pandas&lt;/a&gt; (a Python package which is “Excel on steroids”) in order to really understand the data. &lt;/p&gt;

&lt;p&gt;Let's make this more concrete with a practical example.&lt;/p&gt;
&lt;h3&gt;
  
  
  Pandas Example
&lt;/h3&gt;

&lt;p&gt;If you want to follow along, check out the corresponding &lt;a href="https://colab.research.google.com/drive/1qs0iLUfTLu0Zui14fT1Faxhke8frQ8v3?usp=sharing" rel="noopener noreferrer"&gt;Google Colab Notebook&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example we are going to read from a comma-separated values (CSV) file. This file will contain a header row and three names. In Unix, we use the echo command to create the file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo "category,name" &amp;gt;&amp;gt; customers.csv
echo "A, Ralph Brooks" &amp;gt;&amp;gt; customers.csv
echo "B, John Doe" &amp;gt;&amp;gt; customers.csv
echo "B, Jane Doe" &amp;gt;&amp;gt; customers.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
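&lt;p&gt;If you would rather create the file from Python than from the shell, a few lines are equivalent to the echo commands above (a sketch; the filename matches the example):&lt;/p&gt;

```python
# Write the same customers.csv that the echo commands produce.
rows = [
    "category,name",
    "A, Ralph Brooks",
    "B, John Doe",
    "B, Jane Doe",
]
with open("customers.csv", "w") as f:
    f.write("\n".join(rows) + "\n")
```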



&lt;p&gt;Pandas is used to read in this file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customers.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a dataframe (df in the above example): an Excel-like grid of data that contains the category and name columns (the name holds the first and last name). &lt;/p&gt;

&lt;p&gt;Now we are going to create a function that is going to do the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take a full name and split it on the space character into a list of components&lt;/li&gt;
&lt;li&gt;Treat the first component as the first_name&lt;/li&gt;
&lt;li&gt;If the name_list has exactly two parts (a first name and a last name), return the first name&lt;/li&gt;
&lt;li&gt;Return an empty string if a first name and a last name can't be identified
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_first_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;full_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;first_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;name_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# if the list has two elements, then there is a first name and a last name
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;first_name&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We then extract just the names from the dataframe into a Python list.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_list&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we map the get_first_name function to our list of names.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;first_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;first_names&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Ralph&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;John&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Jane&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pretty powerful stuff.  We just used the pandas library to read in data and to process the name part of that data. &lt;/p&gt;

&lt;p&gt;When you are starting out with data analysis with Pandas, you want to make sure that you take the time to master the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html" rel="noopener noreferrer"&gt;Read data from a comma-separated file into a DataFrame&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://pandas.pydata.org/docs/user_guide/reshaping.html" rel="noopener noreferrer"&gt;Select only those rows in a dataframe which have a certain value&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Merge two dataframes together based on a common key using the &lt;a href="https://pandas.pydata.org/docs/user_guide/merging.html" rel="noopener noreferrer"&gt;merge function&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the Pandas Example that we covered, you should be able to create a subset of our dataframe which only contains the second category (a dataframe that only contains category B). &lt;/p&gt;
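&lt;p&gt;Both skills can be sketched against the customers data from earlier (a minimal example; the credit_limit dataframe is made up purely for illustration):&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "B", "B"],
    "name": ["Ralph Brooks", "John Doe", "Jane Doe"],
})

# Select only the rows that have category B.
category_b = df[df["category"] == "B"]
print(len(category_b))  # 2

# Merge with a second dataframe on the shared category key.
limits = pd.DataFrame({"category": ["A", "B"], "credit_limit": [1000, 500]})
merged = df.merge(limits, on="category")
print(sorted(merged.columns))  # ['category', 'credit_limit', 'name']
```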

&lt;h3&gt;
  
  
  BEST BOOKS AND RESOURCES ON PANDAS
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html" rel="noopener noreferrer"&gt;Pandas Tutorial&lt;/a&gt; - At this point in your programming journey, you really want to get good at looking at open source documentation, and moving through the relevant parts of that documentation quickly. With regards to Pandas, take a look at the tutorial, and then take a look at the &lt;a href="https://pandas.pydata.org/docs/reference/index.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; on an "as needed basis."&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 6: Learn How to Build Data Pipelines
&lt;/h2&gt;

&lt;p&gt;Congrats on making it this far in the blog. At this point, you know that you at least have some homework to do on SQL, Python, PyCharm, NumPy, and Pandas. The payoff is that once you have a basic handle on these technologies, you are ready to combine them to PULL DATA into your analysis. It is the difference between "working with the data you have" and "working with the data that you need."&lt;/p&gt;

&lt;p&gt;Data engineering is a discipline unto itself, but the basics here are to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pull data from a source (such as using the Twitter API to pull in Tweets)&lt;/li&gt;
&lt;li&gt;Clean the data (so that bad punctuation or bad data does not affect other processing that you do "downstream")&lt;/li&gt;
&lt;li&gt;Place the clean data in a different destination - An example here would be to pull in streaming data, clean the data in a pipeline, and then export that data into BigQuery on Google Cloud for further processing. &lt;/li&gt;
&lt;/ul&gt;
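&lt;p&gt;The "clean" step can be as small as a single Python function (a sketch; clean_tweet is a made-up name, and only the standard library is assumed):&lt;/p&gt;

```python
import string

def clean_tweet(text):
    # Strip punctuation, collapse whitespace, and lowercase the text
    # so that "downstream" steps see consistent input.
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split()).lower()

print(clean_tweet("Big   news!!  Pipelines are HERE."))  # big news pipelines are here
```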

&lt;h3&gt;
  
  
  BEST ONLINE COURSES AND RESOURCES ON BUILDING DATA PIPELINES
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://courses.whiteowleducation.com/p/machine-learning-mastery" rel="noopener noreferrer"&gt;Machine Learning Mastery&lt;/a&gt; - The Machine Learning Mastery course from &lt;a href="https://www.whiteowleducation.com" rel="noopener noreferrer"&gt;White Owl Education&lt;/a&gt; not only covers setup of Python and installation of packages (including NumPy and TensorFlow). It also shows how to set up a data pipeline that can read streaming information and that can process streaming data with machine learning.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://beam.apache.org/" rel="noopener noreferrer"&gt;Apache Beam&lt;/a&gt; - Apache Beam is a scalable framework that allows you to implement batch and streaming data processing jobs. It is a framework that you can use in order to create a data pipeline on &lt;a href="https://cloud.google.com/dataflow" rel="noopener noreferrer"&gt;Google Cloud&lt;/a&gt; or on &lt;a href="https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-kinesis-data-analytics-now-supports-java-based-apache-beam-streaming-workloads/" rel="noopener noreferrer"&gt;Amazon Web Services&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this blog post, you learned about the 6 main steps that are needed in order to take your data analysis to the next level.  These steps are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Learn the Basics of SQL&lt;/li&gt;
&lt;li&gt;Learn the Basics of Python&lt;/li&gt;
&lt;li&gt;Learn How to Navigate Code Quickly&lt;/li&gt;
&lt;li&gt;Learn the Basics of NumPy&lt;/li&gt;
&lt;li&gt;Learn the Basics of Pandas&lt;/li&gt;
&lt;li&gt;Learn How to Build Data Pipelines&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you have any questions, feel free to reach out. You can direct message me on Twitter at @&lt;a href="https://twitter.com/whiteowled" rel="noopener noreferrer"&gt;whiteowled&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>sql</category>
    </item>
    <item>
      <title>4 Ways to Effectively Debug Data Pipelines using Python and Apache Beam</title>
      <dc:creator>Ralph Brooks</dc:creator>
      <pubDate>Tue, 30 Mar 2021 01:55:14 +0000</pubDate>
      <link>https://forem.com/ralphbrooks/4-ways-to-effectively-debug-data-pipelines-using-python-and-apache-beam-53c6</link>
      <guid>https://forem.com/ralphbrooks/4-ways-to-effectively-debug-data-pipelines-using-python-and-apache-beam-53c6</guid>
      <description>&lt;p&gt;Apache Beam is an open source framework that is useful for cleaning and processing data at scale. It is also useful for processing streaming data in real time.  In fact, you can even develop in Apache Beam on your laptop and deploy it to Google Cloud for processing (the Google Cloud version is called DataFlow).&lt;/p&gt;

&lt;p&gt;Beyond this, Beam touches the world of artificial intelligence. More formally, it is used as part of machine learning pipelines or in automated deployments of machine learning models (MLOps). As a specific example, Beam could be used to clean up spelling errors or punctuation in Twitter data before the data is sent to a machine learning model that determines whether a tweet expresses happy or sad emotion.&lt;/p&gt;

&lt;p&gt;One of the challenges when working with Beam, though, is figuring out how to debug the Python that creates your data pipelines, and how to test basic functionality on your laptop. In this blog post, I am going to show you 4 ways that can help you improve your debugging.&lt;/p&gt;




&lt;p&gt;QUICK NOTE: This blog gives a high level overview of how to debug data pipelines. For a deeper dive you may want to check out &lt;a href="https://courses.whiteowleducation.com/courses/machine-learning-mastery/lectures/30683503"&gt;this video&lt;/a&gt; which talks about unittests with Apache Beam and &lt;a href="https://courses.whiteowleducation.com/courses/machine-learning-mastery/lectures/31386945"&gt;this video&lt;/a&gt; which walks you through the debugging process for a basic data pipeline.&lt;/p&gt;




&lt;h3&gt;
  
  
  1)  Only run time-consuming unit tests if dependent libraries are installed
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;try:
    from apitools.base.py.exceptions import HttpError
except ImportError:
    HttpError = None


@unittest.skipIf(HttpError is None,
 'GCP dependencies are not installed')
class TestBeam(unittest.TestCase):
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are using unittest, it is helpful to have a test that only runs if the correct libraries are installed. In the above Python example, I have a try block which looks for a class within a Google Cloud library. If the class isn’t found, the unit test is skipped, and a message is displayed that says 'GCP dependencies are not installed.'&lt;/p&gt;

&lt;h3&gt;
  
  
  2) Use TestPipeline when running local unit tests
&lt;/h3&gt;

&lt;p&gt;Apache Beam uses a Pipeline object in order to help construct a &lt;a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph"&gt;directed acyclic graph&lt;/a&gt; (DAG) of transformations. If you are running tests, you could also use apache_beam.testing.TestPipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4zlwRWSf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Tred-G.svg/220px-Tred-G.svg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4zlwRWSf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Tred-G.svg/220px-Tred-G.svg.png" alt="DAG"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Example of a directed acyclic graph&lt;/em&gt; &lt;/p&gt;
&lt;h3&gt;
  
  
  3) Parentheses are helpful
&lt;/h3&gt;

&lt;p&gt;The reference Beam documentation talks about using a "with" block so that each time you transform your data, you are doing it within the context of a pipeline. Example Python pseudo-code might look like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;With beam.Pipeline(…)as p:
    emails = p | 'CreateEmails' &amp;gt;&amp;gt; beam.Create(self.emails_list) 
    phones = p | 'CreatePhones' &amp;gt;&amp;gt; beam.Create(self.phones_list) 
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It may also be helpful to construct the transformation without the "with" block. The modified pseudo-code would then look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;emails_list = [
            ('amy', 'amy@example.com'),
            ('carl', 'carl@example.com'),
            ('julia', 'julia@example.com'),
            ('carl', 'carl@email.com'),
        ]

phones_list = [
    ('amy', '111-222-3333'),
    ('james', '222-333-4444'),
    ('amy', '333-444-5555'),
    ('carl', '444-555-6666'),
]

p = beam.Pipeline(...)

def list_to_pcollection(a_pipeline, a_list_in_memory, a_label):
    # () are required to span multiple lines
    return ( a_pipeline | a_label &amp;gt;&amp;gt; beam.Create(a_list_in_memory) )


emails = list_to_pcollection(p, emails_list, 'CreateEmails')
phones = list_to_pcollection(p, phones_list, 'CreatePhones')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In either case ('using the with block' or skipping the block), parentheses ARE YOUR FRIEND.&lt;/p&gt;

&lt;p&gt;Because Beam can do "composite transforms" where one transformation "chains" to the next, transformations that span multiple lines are quite likely. As seen in the above example, when a transformation spans multiple lines, you need either parentheses or the line continuation character ('\').&lt;/p&gt;

&lt;h3&gt;
  
  
  4) Using labels is recommended but each label MUST be unique
&lt;/h3&gt;

&lt;p&gt;Beam can use labels in order to keep track of transformations. As you can see in the beam pipeline on Google Cloud below, labels make it VERY easy for you to identify different stages of processing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cV2OaMx8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m787i7uf2aa5wixqct2r.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cV2OaMx8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m787i7uf2aa5wixqct2r.PNG" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Different Stages of Processing in DataFlow&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The main caveat here is that EACH LABEL must be unique. Going back to our example above, the following pseudo-code would fail:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;p = beam.Pipeline(...)

def list_to_pcollection(a_pipeline, a_list_in_memory, a_label):
    # () are required when there is no "with" block
    return ( a_pipeline | a_label &amp;gt;&amp;gt; beam.Create(a_list_in_memory) )


emails = list_to_pcollection(p, emails_list, 'CreateEmails')
# The line below would cause a failure because labels must be unique
phones = list_to_pcollection(p, phones_list, 'CreateEmails')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, the following code would work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;index = 1
a_label = “create” + str(index)
emails = list_to_pcollection(p, emails_list, a_label)
index = index + 1
a_label = “create” + str(index)
phones = list_to_pcollection(p, phones_list, a_label)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bottom line is that if you are programmatically creating labels, you need to make sure they are unique. &lt;/p&gt;
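&lt;p&gt;One simple way to guarantee that programmatically generated labels stay unique is to fold a running index into each one (a sketch; the helper name make_labels is made up for illustration):&lt;/p&gt;

```python
def make_labels(prefix, count):
    # Append an index so that repeated calls for the same prefix never collide.
    return [prefix + str(i) for i in range(1, count + 1)]

print(make_labels("create", 2))  # ['create1', 'create2']
```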




&lt;h3&gt;
  
  
  SUMMARY
&lt;/h3&gt;

&lt;p&gt;In this post, we reviewed 4 ways that should help you with debugging. To put everything in context, consider the following:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Test to see that your dependent libraries are installed:
&lt;/h4&gt;

&lt;p&gt;If you test this first, then you can save time that would be wasted if half of your unit tests run before this error is detected. When you think about how many times tests are run, the time savings in the long run can be significant.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Use TestPipeline when running local tests:
&lt;/h4&gt;

&lt;p&gt;Unlike apache_beam.Pipeline, TestPipeline can handle setting PipelineOptions internally. This means that there is less configuration involved in order to get your unit test coded, and less configuration typically means time saved.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Parentheses are helpful :
&lt;/h4&gt;

&lt;p&gt;Since transformations can be chained together into a "composite transform," it is quite likely that transformations will span multiple lines. When parentheses are used around these multiple lines, you don't have to worry about forgetting line continuation characters.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Using labels is recommended but each label MUST be unique:
&lt;/h4&gt;

&lt;p&gt;Using labels for steps within your pipeline is critical. When you deploy your beam pipeline to Google Cloud, you may notice a step that doesn’t meet performance requirements, and the label will help you QUICKLY identify the problematic code.&lt;/p&gt;

&lt;p&gt;As your pipeline grows, it is likely that different transformations of data will be constructed from functions with different parameters. This is a good thing: it means you are reusing code as opposed to creating something new every time you need to transform data. The main warning here is that each label MUST be unique; just make sure you adjust your code to reflect this.&lt;/p&gt;

&lt;h3&gt;
  
  
  NEXT STEPS
&lt;/h3&gt;

&lt;p&gt;OK, those are 4 ways that should hopefully improve your debugging experience. For more information on how to process data in real time and on how to deploy machine learning models into production, I encourage you to take a look at the new &lt;a href="https://courses.whiteowleducation.com/p/machine-learning-mastery"&gt;Machine Learning Mastery&lt;/a&gt; course that White Owl Education is putting together. &lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>apachebeam</category>
    </item>
  </channel>
</rss>
