<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: wrighter</title>
    <description>The latest articles on Forem by wrighter (@wrighter).</description>
    <link>https://forem.com/wrighter</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F323341%2Fbd59e3ab-8a5e-4fd6-94ae-5b3feed68773.jpeg</url>
      <title>Forem: wrighter</title>
      <link>https://forem.com/wrighter</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/wrighter"/>
    <language>en</language>
    <item>
      <title>Parameterizing and automating Jupyter notebooks with papermill</title>
      <dc:creator>wrighter</dc:creator>
      <pubDate>Sun, 14 Nov 2021 13:59:51 +0000</pubDate>
      <link>https://forem.com/wrighter/parameterizing-and-automating-jupyter-notebooks-with-papermill-4ci1</link>
      <guid>https://forem.com/wrighter/parameterizing-and-automating-jupyter-notebooks-with-papermill-4ci1</guid>
      <description>&lt;p&gt;Have you ever created a Jupyter notebook and wished you could generate the notebook with a different set of parameters? If so, you’ve probably done at least one of the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Edited the variables in a cell and reran the notebook, saving off a copy as needed&lt;/li&gt;
&lt;li&gt;Saved a copy of the notebook and maybe hacked up code to edit the values directly in the .ipynb files and reran notebooks&lt;/li&gt;
&lt;li&gt;Built some custom code to set the variables with data loaded from a database or configuration file, then reran the notebook &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It turns out that there is a good solution for this problem, one that parameterizes interactive notebooks and coexists well with automated jobs. It’s called &lt;a href="https://papermill.readthedocs.io/"&gt;papermill&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Motivation
&lt;/h3&gt;

&lt;p&gt;Many notebook authors use the standard practice of designating a cell near the top of their notebooks for global variables. The author or other users of the notebook then modify the values in the cell and run the entire notebook to obtain different results. To persist the output, the author will manually download the notebook in another format or save it as a different notebook file. But using only a notebook server and these manual methods can quickly become messy and difficult to track, not to mention error-prone. Which notebook is the one you edit? Papermill helps solve this problem. In this article, I’ll introduce papermill and basic usage, walk through an example of parameterization, and finally talk about ways to fully schedule and automate notebook execution using cron.&lt;/p&gt;

&lt;p&gt;With papermill, a special cell in the notebook is designated for parameters. When papermill executes a parameterized notebook, either via the command line interface (CLI) or using the Python API, parameters are passed in and executed in a subsequent cell. This allows the notebook to be run multiple times with different parameters quickly. The resulting executed notebook can then be saved in a variety of places, including local or cloud storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;To install papermill, use pip. I’d recommend installing it in a virtual environment created with virtualenv or conda. I often recommend &lt;a href="https://www.wrighters.io/you-can-easily-and-sensibly-run-multiple-versions-of-python-with-pyenv/"&gt;pyenv&lt;/a&gt; for installing a recent Python version and for creating a &lt;a href="https://www.wrighters.io/use-pyenv-and-virtual-environments-to-manage-python-complexity/"&gt;virtualenv&lt;/a&gt;. But use whatever you are most comfortable with.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install papermill
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you would like to use the various input and output options (like Amazon’s &lt;code&gt;s3&lt;/code&gt; or Microsoft’s &lt;code&gt;azure&lt;/code&gt;), you can install all the dependencies. I won’t get into the details here, but the documentation covers those options, and you can even extend papermill to add other handlers for input/output (I/O) of notebooks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install papermill[all]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Basic use
&lt;/h2&gt;

&lt;p&gt;The first thing most users will want to do with papermill is parameterize a notebook. I made a simple &lt;a href="https://github.com/wrighter/python_blogposts/blob/main/tools/papermill_example1.ipynb"&gt;example notebook&lt;/a&gt; that you can download and follow along. Once you have Jupyter running and have opened a notebook, all you need to do is add a parameters tag to the cell with parameters in it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HRhVFiar--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://i0.wp.com/www.wrighters.io/wp-content/uploads/2021/07/adding_parameters.gif%3Fresize%3D656%252C260%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HRhVFiar--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://i0.wp.com/www.wrighters.io/wp-content/uploads/2021/07/adding_parameters.gif%3Fresize%3D656%252C260%26ssl%3D1" alt="How you add a tag in Jupyter notebook" width="656" height="260"&gt;&lt;/a&gt;How you add a tag in Jupyter notebook.&lt;/p&gt;

&lt;p&gt;Save the notebook, and now you are ready to execute it using papermill. For the example notebook, use the CLI to run the notebook, supplying your own name.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;papermill -p name Matt papermill_example1.ipynb papermill_matt.ipynb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command is telling papermill to execute the input notebook &lt;code&gt;papermill_example1.ipynb&lt;/code&gt; and write the output to &lt;code&gt;papermill_matt.ipynb&lt;/code&gt;, while setting the parameter &lt;code&gt;name&lt;/code&gt; to the value &lt;code&gt;Matt&lt;/code&gt;. If you open the resulting notebook, the contents will now include a new cell after the &lt;code&gt;parameters&lt;/code&gt;-tagged one with an &lt;code&gt;injected-parameters&lt;/code&gt; tag like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e-faEPFd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/www.wrighters.io/wp-content/uploads/2021/07/papermill_matt.jpg%3Fresize%3D656%252C297%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e-faEPFd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/www.wrighters.io/wp-content/uploads/2021/07/papermill_matt.jpg%3Fresize%3D656%252C297%26ssl%3D1" alt="The notebook after parameters are injected " width="656" height="297"&gt;&lt;/a&gt;The notebook after parameters are injected (with the new cell)&lt;/p&gt;

&lt;p&gt;You should now see how you can add as many parameters as you need to make new notebooks from an existing notebook. Think of the main notebook (in our case, &lt;code&gt;papermill_example1.ipynb&lt;/code&gt;) as a template that you can use to make as many copies as you want by quickly injecting parameters.&lt;/p&gt;
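&lt;p&gt;To make the injection step more concrete, here is a minimal sketch of the idea using only the standard library. It is not papermill’s actual implementation: the function &lt;code&gt;inject_parameters&lt;/code&gt; and the tiny in-memory &lt;code&gt;template&lt;/code&gt; notebook are hypothetical, and the sketch only mimics the behavior described above, which is adding a cell tagged &lt;code&gt;injected-parameters&lt;/code&gt; right after the cell tagged &lt;code&gt;parameters&lt;/code&gt;.&lt;/p&gt;

```python
import json


def inject_parameters(notebook, params):
    """Return a copy of a notebook dict with an injected-parameters cell added."""
    # deep copy via a JSON round trip so the template is left untouched
    nb = json.loads(json.dumps(notebook))
    source = "# Parameters\n" + "".join(
        f"{key} = {value!r}\n" for key, value in params.items()
    )
    new_cell = {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {"tags": ["injected-parameters"]},
        "outputs": [],
        "source": source,
    }
    # place the new cell right after the cell tagged "parameters",
    # or at the top of the notebook when no such cell exists
    position = 0
    for i, cell in enumerate(nb["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            position = i + 1
            break
    nb["cells"].insert(position, new_cell)
    return nb


# hypothetical two-cell notebook standing in for the example notebook
template = {
    "cells": [
        {"cell_type": "code", "metadata": {"tags": ["parameters"]},
         "outputs": [], "execution_count": None, "source": 'name = "Joe"\n'},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": 'print(f"Hello {name}")\n'},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
}

result = inject_parameters(template, {"name": "Matt"})
print(result["cells"][1]["source"])
```

&lt;p&gt;Because the injected cell runs after the defaults, its assignments override them, and the template itself is never modified.&lt;/p&gt;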

&lt;h2&gt;
  
  
  Basic API use
&lt;/h2&gt;

&lt;p&gt;You may want to fetch or build your injected parameters using Python code, so a Python API is also available to execute papermill. We can achieve the exact same result as above in a Python script. It also works well in a notebook, where it shows you the progress dynamically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import papermill as pm

name = "Matt"
res = pm.execute_notebook(
    'papermill_example1.ipynb',
    f'papermill_{name}.ipynb',
    parameters = dict(name=name)
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
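&lt;p&gt;Since the API takes plain Python values, it is easy to generate several runs from a list. The following is a small sketch of that pattern; the helper names &lt;code&gt;build_jobs&lt;/code&gt; and &lt;code&gt;run_jobs&lt;/code&gt; and the second name in the list are made up for illustration. Only &lt;code&gt;pm.execute_notebook&lt;/code&gt; comes from papermill, and it is imported lazily so the job list can be inspected without papermill installed.&lt;/p&gt;

```python
def build_jobs(names, template="papermill_example1.ipynb"):
    """Return (input_path, output_path, parameters) tuples, one per name."""
    return [
        (template, f"papermill_{name}.ipynb", {"name": name})
        for name in names
    ]


def run_jobs(jobs):
    # imported lazily so build_jobs can be used without papermill installed
    import papermill as pm

    for input_path, output_path, parameters in jobs:
        pm.execute_notebook(input_path, output_path, parameters=parameters)


jobs = build_jobs(["Matt", "Ana"])
print(jobs[0])
# run_jobs(jobs)  # uncomment once papermill and the example notebook are available
```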



&lt;h2&gt;
  
  
  More parameter passing
&lt;/h2&gt;

&lt;p&gt;So far we’ve passed only one parameter, and have used the -p option to do this. You can pass parameters a couple of ways.&lt;/p&gt;

&lt;h3&gt;
  
  
  Command Line
&lt;/h3&gt;

&lt;p&gt;You can run all of these using the example notebook, then view the results yourself. First, you can specify multiple parameters from the CLI. Even if a &lt;code&gt;parameters&lt;/code&gt;-tagged cell doesn’t exist in the notebook yet, parameters can still be passed in and created. In that case, papermill will create an &lt;code&gt;injected-parameters&lt;/code&gt; cell and execute it at the top of the notebook.&lt;/p&gt;

&lt;p&gt;Here’s an example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;papermill -p name Matt -p level 5 -p factor 0.33 -p alive True papermill_example1.ipynb papermill_matt.ipynb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or with long options instead…&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;papermill --parameters name Matt --parameters level 5 --parameters factor 0.33 --parameters alive True papermill_example1.ipynb papermill_matt.ipynb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the &lt;code&gt;-p&lt;/code&gt; or &lt;code&gt;--parameters&lt;/code&gt; option will try to parse integers and floats, so if you want them to be interpreted as strings, use the &lt;code&gt;-r&lt;/code&gt; or &lt;code&gt;--raw&lt;/code&gt; option to get all values in as strings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;papermill -r name Matt -r level 5 -r factor 0.33 -r alive True papermill_example1.ipynb papermill_matt.ipynb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
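&lt;p&gt;To see why the two options can give different results, here is a rough sketch of the kind of type guessing involved. It is an approximation written for illustration, not papermill’s own code: the function name &lt;code&gt;resolve_type&lt;/code&gt; is hypothetical, and the handling of &lt;code&gt;True&lt;/code&gt;, &lt;code&gt;False&lt;/code&gt; and &lt;code&gt;None&lt;/code&gt; is an assumption about how such values are treated.&lt;/p&gt;

```python
def resolve_type(value):
    """Guess a Python type for a command line string (a rough approximation)."""
    if value == "True":
        return True
    if value == "False":
        return False
    if value == "None":
        return None
    # try the narrower numeric type first, then fall back to float
    for convert in (int, float):
        try:
            return convert(value)
        except ValueError:
            continue
    # anything else stays a string, which is what a raw option always gives
    return value


args = {"name": "Matt", "level": "5", "factor": "0.33", "alive": "True"}
parsed = {key: resolve_type(text) for key, text in args.items()}
print(parsed)
```

&lt;p&gt;With the raw option, every value in &lt;code&gt;args&lt;/code&gt; would simply stay a string.&lt;/p&gt;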



&lt;p&gt;You can also use &lt;a href="http://yaml.org"&gt;YAML&lt;/a&gt; for specifying parameters. This can be passed in via a file (&lt;code&gt;-f&lt;/code&gt; or &lt;code&gt;--parameters_file&lt;/code&gt;), a string (&lt;code&gt;-y&lt;/code&gt; or &lt;code&gt;--parameters_yaml&lt;/code&gt;) or a base64-encoded string (&lt;code&gt;-b&lt;/code&gt; or &lt;code&gt;--parameters_base64&lt;/code&gt;). This allows you to pass in more complex data, including lists and dictionaries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;papermill papermill_example1.ipynb papermill_matt.ipynb -y "
name: Matt
level: 5
factor: 0.33
alive: True
sizes:
    - 1.0
    - 2.5
    - 3.7
params:
    x: 3
    y: 4"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To try the file and base64 options, first save the same YAML to a file. (Run this in your shell on Mac, Linux, or Windows WSL, in the directory where the notebook file is.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo "
name: Matt
level: 5
factor: 0.33
alive: True
sizes:
    - 1.0
    - 2.5
    - 3.7
params:
    x: 3
    y: 4" &amp;gt; params.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can run the file version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;papermill papermill_example1.ipynb papermill_matt.ipynb -f params.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or the base64 version&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PARAMS=$(cat params.yaml | base64 | tr -d '\n') # base64 of the yaml file, with any line wrapping removed
papermill papermill_example1.ipynb papermill_matt.ipynb -b "$PARAMS"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Either way, you should get the idea that you can pass complex data into your notebook from the command line, and also via the API. These examples all use the local filesystem for input and output of notebooks, but note that you can read and write notebooks from Amazon &lt;code&gt;s3&lt;/code&gt;, Azure, Google Cloud Storage, or web servers.&lt;/p&gt;
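&lt;p&gt;If you are building the parameters in Python rather than in a shell, the standard library can produce the base64 string directly. This is a small sketch with a shortened, made-up YAML document; the resulting string is the kind of value the &lt;code&gt;-b&lt;/code&gt; option expects.&lt;/p&gt;

```python
import base64

# shortened version of the YAML used in the examples above
yaml_text = """\
name: Matt
level: 5
sizes:
    - 1.0
    - 2.5
"""

# b64encode returns bytes with no line wrapping, so decoding gives one plain string
encoded = base64.b64encode(yaml_text.encode("utf-8")).decode("ascii")

# decoding restores the original YAML text exactly
decoded = base64.b64decode(encoded).decode("utf-8")
print(encoded)
```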

&lt;h2&gt;
  
  
  Inspecting notebooks
&lt;/h2&gt;

&lt;p&gt;You can also inspect the available parameters of a notebook, from the CLI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ papermill --help-notebook papermill_example1.ipynb
Usage: papermill [OPTIONS] NOTEBOOK_PATH [OUTPUT_PATH]

Parameters inferred for notebook 'papermill_example1.ipynb':
  name: Unknown type (default "Joe")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or using the Python API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pm.inspect_notebook('papermill_example1.ipynb')

{'name': {'name': 'name',
  'inferred_type_name': 'None',
  'default': '"Joe"',
  'help': ''}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Executing a full workflow
&lt;/h2&gt;

&lt;p&gt;A typical workflow for papermill is to have a parameterized notebook, run it with multiple values, then convert the resulting notebooks into another format for review or reporting. Let’s walk through an example of how this might be set up.&lt;/p&gt;

&lt;p&gt;First, we have a &lt;a href="https://github.com/wrighter/python_blogposts/blob/main/tools/papermill_example2.ipynb"&gt;parameterized notebook&lt;/a&gt; that uses the Yahoo! finance API to fetch stock prices and plot data with the all-time high price of the stock (or at least it’s the high for the last two years since I’m only fetching that much data at this point).&lt;/p&gt;

&lt;p&gt;If you want to run this example, you will need to ensure you have the &lt;code&gt;yfinance&lt;/code&gt; package installed as well as &lt;code&gt;matplotlib&lt;/code&gt;. You can install both with pip if needed.&lt;/p&gt;

&lt;p&gt;We can use the papermill CLI to inspect the parameters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ papermill --help-notebook papermill_example2.ipynb
Usage: papermill [OPTIONS] NOTEBOOK_PATH [OUTPUT_PATH]

Parameters inferred for notebook 'papermill_example2.ipynb':
  symbol: Unknown type (default 'AAPL')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll run this notebook with several symbols. I’ve chosen to use a shell script for this so that I can run it through a scheduled cron job. If desired, this could just as easily be done using a simple Python script. However, if you are using a virtual environment you may end up needing a script anyway to ensure the virtualenv is loaded properly. In that case, it might just be easier to use a shell script for the entire process.&lt;/p&gt;

&lt;p&gt;I’m also going to use the &lt;a href="https://nbconvert.readthedocs.io/en/latest/usage.html"&gt;&lt;code&gt;jupyter nbconvert&lt;/code&gt;&lt;/a&gt; command (you can also run it as &lt;code&gt;jupyter-nbconvert&lt;/code&gt;) to convert the notebook into an HTML file for viewing in a web browser. Just like papermill, nbconvert is available via the command line or using the Python API.&lt;/p&gt;

&lt;h3&gt;
  
  
  The automation script
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/bash

set -eux

# activate our virtualenv (this was created using pyenv-virtualenv, yours will be elsewhere)
source /Users/mcw/.pyenv/versions/3.8.6/envs/pandas/bin/activate

# get to the script directory if running via cron
cd "$(dirname "${BASH_SOURCE[0]}")"

for S in AAPL MSFT GOOG FB
do
        papermill -p symbol $S papermill_example2.ipynb papermill_${S}.ipynb
        jupyter-nbconvert --no-input --to html papermill_${S}.ipynb
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can run this command from your shell (after adjusting the line that activates the virtual environment to reflect your own setup). You can also schedule it to run regularly in cron pretty easily. For example, you can run this report every weekday at 4 PM like this (with your own path).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;00 16 * * mon-fri /Users/mcw/projects/python_blogposts/tools/run_papermill.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
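&lt;p&gt;As mentioned earlier, the same loop could be written as a simple Python script. Here is one possible sketch that shells out to the same two commands used in the script above. The helper names &lt;code&gt;build_commands&lt;/code&gt; and &lt;code&gt;run_reports&lt;/code&gt; are hypothetical, and the sketch assumes that &lt;code&gt;papermill&lt;/code&gt; and &lt;code&gt;jupyter-nbconvert&lt;/code&gt; are already on the path (for example, because the virtualenv is active).&lt;/p&gt;

```python
import subprocess


def build_commands(symbol, template="papermill_example2.ipynb"):
    """Return the papermill and nbconvert commands for one symbol."""
    output = f"papermill_{symbol}.ipynb"
    return [
        ["papermill", "-p", "symbol", symbol, template, output],
        ["jupyter-nbconvert", "--no-input", "--to", "html", output],
    ]


def run_reports(symbols):
    for symbol in symbols:
        for command in build_commands(symbol):
            # check=True raises on the first failing command, similar to set -e
            subprocess.run(command, check=True)


commands = build_commands("AAPL")
print(commands[0])
# run_reports(["AAPL", "MSFT", "GOOG", "FB"])  # needs papermill and nbconvert installed
```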



&lt;h3&gt;
  
  
  Extending the example
&lt;/h3&gt;

&lt;p&gt;With just a little more creativity (and software configuration on nbconvert), you can output the notebooks to PDF or other formats, send them via email, or upload them to a server to have nice looking reports updated on a daily basis.&lt;/p&gt;

&lt;p&gt;Note that the per-symbol notebooks are saved to the local disk. They can be opened in a Jupyter server and re-executed easily if debugging or further work is required. Just know that if you have an automated job running, the notebooks will be replaced each time it runs. Ideally, you want to work on your main template notebook, then generate new versions for each symbol with automation.&lt;/p&gt;

&lt;p&gt;One other tip is that papermill can read and write to standard input and output. This means that if you have other tools that take notebook files as input, you don’t have to write the files out to disk. For example, in our shell script above, we could prevent writing out each individual notebook file per symbol and do the following inside our loop instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;papermill -p symbol $S papermill_example2.ipynb | jupyter-nbconvert --stdin --no-input --to html --output report_${S}.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that if you do this, you’ll need to open the main notebook (&lt;code&gt;papermill_example2.ipynb&lt;/code&gt;) and edit your parameters to debug issues. But maybe that’s preferable if you need to save disk space and don’t need the ability to debug each notebook separately.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Papermill is a useful library to parameterize and execute Jupyter notebooks. You can use it to automate execution of your notebooks with any sets of parameters you can dream up. Follow this up with a conversion of the notebook using nbconvert to provide readable and useful versions of your notebooks.&lt;/p&gt;

&lt;p&gt;There is much more that can be done with notebook automation, but starting with papermill as a tool to execute and parameterize notebooks is a good platform to build on.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.wrighters.io/parameters-jupyter-notebooks-with-papermill/"&gt;Parameterizing and automating Jupyter notebooks with papermill&lt;/a&gt; appeared first on &lt;a href="https://www.wrighters.io"&gt;wrighters.io&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
    </item>
    <item>
      <title>Indexing time series data in pandas</title>
      <dc:creator>wrighter</dc:creator>
      <pubDate>Wed, 10 Nov 2021 00:35:47 +0000</pubDate>
      <link>https://forem.com/wrighter/indexing-time-series-data-in-pandas-2ai0</link>
      <guid>https://forem.com/wrighter/indexing-time-series-data-in-pandas-2ai0</guid>
      <description>&lt;p&gt;Quite often the data that we want to analyze has a time based component. Think about data like daily temperatures or rainfall, stock prices, sales data, student attendance, or events like clicks or views of a web application. There is no shortage of sources of data, and new sources are being added all the time. As a result, most pandas users will need to be familiar with time series data at some point.&lt;/p&gt;

&lt;p&gt;A time series is just a pandas &lt;code&gt;DataFrame&lt;/code&gt; or &lt;code&gt;Series&lt;/code&gt; that has a time-based index. The values in the time series can be anything that the containers can hold; they are just accessed using date or time values. A time series container can be manipulated in many ways in pandas, but for this article I will focus just on the basics of indexing. Knowing how indexing works first is important for data exploration and use of more advanced features.&lt;/p&gt;

&lt;h2&gt;
  
  
  DatetimeIndex
&lt;/h2&gt;

&lt;p&gt;In pandas, a &lt;code&gt;DatetimeIndex&lt;/code&gt; is used to provide indexing for pandas &lt;code&gt;Series&lt;/code&gt; and &lt;code&gt;DataFrame&lt;/code&gt;s and works just like other &lt;code&gt;Index&lt;/code&gt; types, but provides special functionality for time series operations. We’ll cover the common functionality with other &lt;code&gt;Index&lt;/code&gt; types first, then talk about the basics of partial string indexing.&lt;/p&gt;

&lt;p&gt;One word of warning before we get started. It’s important for your index to be sorted, or you may get some strange results.&lt;/p&gt;
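&lt;p&gt;A quick way to check for this, and to fix it, is sketched below. The three-row &lt;code&gt;unsorted&lt;/code&gt; series is made-up sample data; the check uses the index’s &lt;code&gt;is_monotonic_increasing&lt;/code&gt; attribute and the fix uses &lt;code&gt;sort_index&lt;/code&gt;.&lt;/p&gt;

```python
import pandas as pd

# hypothetical sample data: three daily values stored out of order
unsorted = pd.Series(
    [3.0, 1.0, 2.0],
    index=pd.to_datetime(["2021-01-03", "2021-01-01", "2021-01-02"]),
)

# a DatetimeIndex reports whether its values only ever increase
was_sorted = unsorted.index.is_monotonic_increasing

# sort_index returns a new Series ordered by timestamp
ordered = unsorted.sort_index()
now_sorted = ordered.index.is_monotonic_increasing

print(was_sorted, now_sorted)
```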

&lt;h2&gt;
  
  
  Examples
&lt;/h2&gt;

&lt;p&gt;To show how this functionality works, let’s create some sample time series data with different time resolutions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;datetime&lt;/span&gt;

&lt;span class="c1"&gt;# this is an easy way to create a DatetimeIndex
# both dates are inclusive
&lt;/span&gt;&lt;span class="n"&gt;d_range&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"2021-01-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"2021-01-20"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# this creates another DatetimeIndex, 10000 minutes long
&lt;/span&gt;&lt;span class="n"&gt;m_range&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"2021-01-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;periods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"min"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# daily data in a Series
&lt;/span&gt;&lt;span class="n"&gt;daily&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d_range&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;d_range&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# minute data in a DataFrame
&lt;/span&gt;&lt;span class="n"&gt;minute&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m_range&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                      &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                      &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;m_range&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# time boundaries not on the minute boundary, add some random jitter
&lt;/span&gt;&lt;span class="n"&gt;mr_range&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m_range&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;microseconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1_000_000.0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m_range&lt;/span&gt;&lt;span class="p"&gt;))])&lt;/span&gt; 
&lt;span class="c1"&gt;# minute data in a DataFrame, but at a higher resolution
&lt;/span&gt;&lt;span class="n"&gt;minute2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mr_range&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                       &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                       &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mr_range&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;daily&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2021-01-01 0.293300
2021-01-02 0.921466
2021-01-03 0.040813
2021-01-04 0.107230
2021-01-05 0.201100
Freq: D, dtype: float64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;minute&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                        value
2021-01-01 00:00:00 0.124186
2021-01-01 00:01:00 0.542545
2021-01-01 00:02:00 0.557347
2021-01-01 00:03:00 0.834881
2021-01-01 00:04:00 0.732195
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                               value
2021-01-01 00:00:00.641049 0.527961
2021-01-01 00:01:00.088244 0.142192
2021-01-01 00:02:00.976195 0.269042
2021-01-01 00:03:00.922019 0.509333
2021-01-01 00:04:00.452614 0.646703
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Resolution
&lt;/h2&gt;

&lt;p&gt;A &lt;code&gt;DatetimeIndex&lt;/code&gt; has a resolution that indicates to what level the &lt;code&gt;Index&lt;/code&gt; is indexing the data. The three indices created above have distinct resolutions. This will have ramifications in how we index later on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"daily:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;daily&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resolution&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"minute:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;minute&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resolution&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"randomized minute:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resolution&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;daily: day
minute: minute
randomized minute: microsecond
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Typical indexing
&lt;/h2&gt;

&lt;p&gt;Before we get into some of the “special” ways to index a pandas &lt;code&gt;Series&lt;/code&gt; or &lt;code&gt;DataFrame&lt;/code&gt; with a &lt;code&gt;DatetimeIndex&lt;/code&gt;, let’s just look at some of the typical indexing functionality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basics
&lt;/h3&gt;

&lt;p&gt;I’ve covered the basics of indexing before, so I won’t cover too many details here. However, it’s important to realize that a &lt;code&gt;DatetimeIndex&lt;/code&gt; works just like other indices in pandas, but has extra functionality. (The extra functionality can be more useful and convenient, but hold tight, those details are next.) If you already understand basic indexing, you may want to skim until you get to partial string indexing. If you haven’t read my articles on indexing, you should start with the &lt;a href="https://www.wrighters.io/indexing-and-selecting-in-pandas-part-1/"&gt;basics&lt;/a&gt; and go from there.&lt;/p&gt;

&lt;p&gt;Indexing a &lt;code&gt;DatetimeIndex&lt;/code&gt; using a &lt;code&gt;datetime&lt;/code&gt;-like object will use &lt;a href="https://pandas.pydata.org/docs/user_guide/timeseries.html#exact-indexing"&gt;exact indexing&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;__getitem__&lt;/code&gt;, a.k.a. the array indexing operator (&lt;code&gt;[]&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;When using &lt;code&gt;datetime&lt;/code&gt;-like objects for indexing, we need to match the resolution of the index.&lt;/p&gt;

&lt;p&gt;This ends up looking fairly obvious for our daily time series.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;daily&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Timestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"2021-01-01"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.29330017699861666
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;minute&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Timestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"2021-01-01 00:00:00"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;KeyError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ke&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ke&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Timestamp('2021-01-01 00:00:00')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This &lt;code&gt;KeyError&lt;/code&gt; is raised because, on a &lt;code&gt;DataFrame&lt;/code&gt;, passing a single key to the &lt;code&gt;[]&lt;/code&gt; operator looks for a &lt;em&gt;column&lt;/em&gt;, not a row. Our &lt;code&gt;DataFrame&lt;/code&gt; has a single column called &lt;code&gt;value&lt;/code&gt;, and no column is named by that timestamp, so the lookup fails with a &lt;code&gt;KeyError&lt;/code&gt;. We will use other methods for indexing rows in a &lt;code&gt;DataFrame&lt;/code&gt;.&lt;/p&gt;
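&lt;p&gt;To make the column-versus-row distinction concrete, here is a minimal sketch. The three-row &lt;code&gt;frame&lt;/code&gt; and its values are hypothetical; it is only shaped like the minute data, with a single &lt;code&gt;value&lt;/code&gt; column.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical frame shaped like the article's minute data: one "value" column.
idx = pd.date_range("2021-01-01", periods=3, freq="min")
frame = pd.DataFrame({"value": [0.1, 0.2, 0.3]}, index=idx)
ts = pd.Timestamp("2021-01-01 00:00:00")

column = frame["value"]  # [] with a single key selects a column
row = frame.loc[ts]      # .loc selects the row labelled by the timestamp

try:
    frame[ts]            # there is no column named by this timestamp
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(type(column).__name__, type(row).__name__, lookup_failed)
```

&lt;p&gt;Both successful lookups return a &lt;code&gt;Series&lt;/code&gt;, but one is a column and the other is a row.&lt;/p&gt;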

&lt;h3&gt;
  
  
  &lt;code&gt;.iloc&lt;/code&gt; indexing
&lt;/h3&gt;

&lt;p&gt;Since the &lt;code&gt;iloc&lt;/code&gt; indexer is based on integer offsets, it’s pretty clear how it works, and there is not much else to say here. It works the same for all resolutions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;daily&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.29330017699861666
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;minute&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;value 0.999354
Name: 2021-01-07 22:39:00, dtype: float64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;value 0.646703
Name: 2021-01-01 00:04:00.452614, dtype: float64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;.loc&lt;/code&gt; indexing
&lt;/h3&gt;

&lt;p&gt;When using &lt;code&gt;datetime&lt;/code&gt;-like objects, you need to have exact matches for single indexing. It’s important to realize that when you make &lt;code&gt;datetime&lt;/code&gt; or &lt;code&gt;pd.Timestamp&lt;/code&gt; objects, all the time fields you don’t specify explicitly (hour, minute, second, microsecond) will default to 0.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;jan1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;daily&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;jan1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.29330017699861666
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;minute&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;jan1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# the defaults for hour, minute, second make this work
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;value 0.124186
Name: 2021-01-01 00:00:00, dtype: float64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# we don't have that exact time, due to the jitter
&lt;/span&gt;    &lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;jan1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; 
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;KeyError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ke&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Missing in index: "&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ke&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# but we do have a value on that day
# we could construct it manually to the microsecond if needed
&lt;/span&gt;&lt;span class="n"&gt;jan1_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;microsecond&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;microsecond&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;jan1_ms&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Missing in index: datetime.datetime(2021, 1, 1, 0, 0)
value 0.527961
Name: 2021-01-01 00:00:00.641049, dtype: float64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
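&lt;p&gt;The effect of those zero defaults can be checked directly. This is a small sketch with a hypothetical two-row jittered index, not the article’s data.&lt;/p&gt;

```python
import datetime
import pandas as pd

jan1 = datetime.datetime(2021, 1, 1)
# Unspecified time fields default to zero, so these are the same instant.
same_as_midnight = jan1 == datetime.datetime(2021, 1, 1, 0, 0, 0, 0)
same_as_timestamp = pd.Timestamp("2021-01-01") == pd.Timestamp(jan1)

# Hypothetical jittered index: no row sits exactly on midnight.
idx = pd.DatetimeIndex(["2021-01-01 00:00:00.641049", "2021-01-01 00:01:00.088244"])
frame = pd.DataFrame({"value": [0.5, 0.1]}, index=idx)
midnight_in_index = jan1 in frame.index
first_label_in_index = idx[0] in frame.index

print(same_as_midnight, same_as_timestamp, midnight_in_index, first_label_in_index)
```

&lt;p&gt;Checking membership with &lt;code&gt;in&lt;/code&gt; before calling &lt;code&gt;.loc&lt;/code&gt; is one way to avoid the &lt;code&gt;KeyError&lt;/code&gt; when you are not sure an exact label exists.&lt;/p&gt;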



&lt;h3&gt;
  
  
  Slicing
&lt;/h3&gt;

&lt;p&gt;Slicing with integers works as expected; you can read more about regular slicing &lt;a href="https://www.wrighters.io/indexing-and-selecting-in-pandas-slicing/"&gt;here&lt;/a&gt;. Here are a few examples of “regular” slicing, which works with the array indexing operator (&lt;code&gt;[]&lt;/code&gt;) or the &lt;code&gt;.iloc&lt;/code&gt; indexer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;daily&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# first two, end is not inclusive
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2021-01-01 0.293300
2021-01-02 0.921466
Freq: D, dtype: float64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;minute&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# same
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                        value
2021-01-01 00:00:00 0.124186
2021-01-01 00:01:00 0.542545
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# every other
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                               value
2021-01-01 00:01:00.088244 0.142192
2021-01-01 00:03:00.922019 0.509333
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# works with the iloc indexer as well
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                               value
2021-01-01 00:01:00.088244 0.142192
2021-01-01 00:03:00.922019 0.509333
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Slicing with &lt;code&gt;datetime&lt;/code&gt;-like objects also works. Note that the end item is inclusive. Because the hours, minutes, seconds, and microseconds you leave out default to 0, the slice over the randomized data is cut off exactly on a minute boundary (in our case), so the jittered row just after the end point is excluded.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;daily&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="c1"&gt;# end is inclusive
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2021-01-01 0.293300
2021-01-02 0.921466
2021-01-03 0.040813
Freq: D, dtype: float64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;minute&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                        value
2021-01-01 00:00:00 0.124186
2021-01-01 00:01:00 0.542545
2021-01-01 00:02:00 0.557347
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                               value
2021-01-01 00:00:00.641049 0.527961
2021-01-01 00:01:00.088244 0.142192
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sort of slicing works with &lt;code&gt;[]&lt;/code&gt; and &lt;code&gt;.loc&lt;/code&gt;, but not &lt;code&gt;.iloc&lt;/code&gt;, as expected. Remember, &lt;code&gt;.iloc&lt;/code&gt; is for integer offset indexing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                               value
2021-01-01 00:00:00.641049 0.527961
2021-01-01 00:01:00.088244 0.142192
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# no! use integers with iloc
&lt;/span&gt;    &lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;TypeError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;te&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;te&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cannot do positional indexing on DatetimeIndex with these indexers [2021-01-01 00:00:00] of type datetime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
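&lt;p&gt;The difference between the two kinds of slice end points is easy to miss, so here is a compact side-by-side sketch. The five-day &lt;code&gt;series&lt;/code&gt; is a hypothetical stand-in for the daily data.&lt;/p&gt;

```python
import datetime
import pandas as pd

# Hypothetical daily series standing in for the article's data.
idx = pd.date_range("2021-01-01", periods=5, freq="D")
series = pd.Series([10, 20, 30, 40, 50], index=idx)

# Positional slice: positions 0 and 1, position 2 is excluded.
by_position = series.iloc[0:2]
# Label slice: January 3 itself is included.
by_label = series.loc[datetime.datetime(2021, 1, 1):datetime.datetime(2021, 1, 3)]

print(len(by_position), len(by_label))
```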



&lt;h3&gt;
  
  
  Special indexing with strings
&lt;/h3&gt;

&lt;p&gt;Now things get really interesting. When working with time series data, partial string indexing can be far less cumbersome than working with &lt;code&gt;datetime&lt;/code&gt; objects. I know we started with objects, but for interactive use and exploration, strings are much more convenient. You can pass in a string that can be parsed as a full date, and it will work for indexing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;daily&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"2021-01-04"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.10723013753233923
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;minute&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"2021-01-01 00:03:00"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;value 0.834881
Name: 2021-01-01 00:03:00, dtype: float64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Strings also work for slicing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;minute&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"2021-01-01 00:03:00"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;"2021-01-01 00:05:00"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# end is inclusive
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                        value
2021-01-01 00:03:00 0.834881
2021-01-01 00:04:00 0.732195
2021-01-01 00:05:00 0.291089
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
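&lt;p&gt;A quick way to convince yourself that a full date string and the equivalent &lt;code&gt;pd.Timestamp&lt;/code&gt; select the same row is to compare the results. The ten-row &lt;code&gt;frame&lt;/code&gt; below is hypothetical, built only for this check.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical minute-resolution frame; the values are just row numbers.
idx = pd.date_range("2021-01-01", periods=10, freq="min")
frame = pd.DataFrame({"value": range(10)}, index=idx)

from_string = frame.loc["2021-01-01 00:03:00"]
from_timestamp = frame.loc[pd.Timestamp("2021-01-01 00:03:00")]
# String slices include both end points, like other label slices.
string_slice = frame.loc["2021-01-01 00:03:00":"2021-01-01 00:05:00"]

print(from_string.equals(from_timestamp), len(string_slice))
```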



&lt;h3&gt;
  
  
  Partial String Indexing
&lt;/h3&gt;

&lt;p&gt;Partial strings can also be used, so you only need to specify part of the date. This can be useful for pulling out a single year, month, or day from a longer dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;daily&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"2021"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# all items match (since they were all in 2021)
&lt;/span&gt;&lt;span class="n"&gt;daily&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"2021-01"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# this one as well (and only in January for our data)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2021-01-01 0.293300
2021-01-02 0.921466
2021-01-03 0.040813
2021-01-04 0.107230
2021-01-05 0.201100
2021-01-06 0.534822
2021-01-07 0.070303
2021-01-08 0.413683
2021-01-09 0.316605
2021-01-10 0.438853
2021-01-11 0.258554
2021-01-12 0.473523
2021-01-13 0.497695
2021-01-14 0.250582
2021-01-15 0.861521
2021-01-16 0.589558
2021-01-17 0.574399
2021-01-18 0.951196
2021-01-19 0.967695
2021-01-20 0.082931
Freq: D, dtype: float64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can do this on a &lt;code&gt;DataFrame&lt;/code&gt; as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;minute&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"2021-01-01"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;ipython-input-67-96027d36d9fe&amp;gt;:1: FutureWarning: Indexing a DataFrame with a datetimelike index using a single string to slice the rows, like `frame[string]`, is deprecated and will be removed in a future version. Use `frame.loc[string]` instead.
  minute["2021-01-01"]

                        value
2021-01-01 00:00:00 0.124186
2021-01-01 00:01:00 0.542545
2021-01-01 00:02:00 0.557347
2021-01-01 00:03:00 0.834881
2021-01-01 00:04:00 0.732195
... ...
2021-01-01 23:55:00 0.687931
2021-01-01 23:56:00 0.001978
2021-01-01 23:57:00 0.770587
2021-01-01 23:58:00 0.154300
2021-01-01 23:59:00 0.777973

[1440 rows x 1 columns]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See that deprecation warning? You should no longer use &lt;code&gt;[]&lt;/code&gt; for &lt;code&gt;DataFrame&lt;/code&gt; string indexing (as we saw above, &lt;code&gt;[]&lt;/code&gt; should be used for column access, not rows). Depending on whether the value is found in the index or not, you may get an error or a warning. Use &lt;code&gt;.loc&lt;/code&gt; instead so you can avoid the confusion.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"2021-01-01"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                               value
2021-01-01 00:00:00.641049 0.527961
2021-01-01 00:01:00.088244 0.142192
2021-01-01 00:02:00.976195 0.269042
2021-01-01 00:03:00.922019 0.509333
2021-01-01 00:04:00.452614 0.646703
... ...
2021-01-01 23:55:00.642728 0.749619
2021-01-01 23:56:00.238864 0.053027
2021-01-01 23:57:00.168598 0.598910
2021-01-01 23:58:00.103543 0.107069
2021-01-01 23:59:00.687053 0.941584

[1440 rows x 1 columns]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
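&lt;p&gt;Here is a sketch of why &lt;code&gt;.loc&lt;/code&gt; is the safer habit. It assumes a hypothetical two-day minute frame, and it handles both outcomes for the bracket form: older pandas versions warn and return rows, while my understanding is that pandas 2.0 and later treat the string as a missing column and raise a &lt;code&gt;KeyError&lt;/code&gt;.&lt;/p&gt;

```python
import warnings
import pandas as pd

# Hypothetical two-day frame at minute resolution.
idx = pd.date_range("2021-01-01", periods=2 * 1440, freq="min")
frame = pd.DataFrame({"value": range(len(idx))}, index=idx)

day_one = frame.loc["2021-01-01"]  # unambiguous: selects rows by label

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # older versions only warn here
    try:
        bracket_rows = len(frame["2021-01-01"])
    except KeyError:
        bracket_rows = None  # the string was treated as a missing column

print(len(day_one), bracket_rows)
```

&lt;p&gt;Whichever pandas version you run, the &lt;code&gt;.loc&lt;/code&gt; line behaves the same way.&lt;/p&gt;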



&lt;p&gt;When slicing with date-only strings, the end point includes &lt;em&gt;all times in the day&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"2021-01-01"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;"2021-01-02"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                               value
2021-01-01 00:00:00.641049 0.527961
2021-01-01 00:01:00.088244 0.142192
2021-01-01 00:02:00.976195 0.269042
2021-01-01 00:03:00.922019 0.509333
2021-01-01 00:04:00.452614 0.646703
... ...
2021-01-02 23:55:00.604411 0.987777
2021-01-02 23:56:00.134674 0.159338
2021-01-02 23:57:00.508329 0.973378
2021-01-02 23:58:00.573397 0.223098
2021-01-02 23:59:00.751779 0.685637

[2880 rows x 1 columns]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But if the end point includes a time, the slice stops at that exact time, down to the microsecond if one is specified, so the final day is only partially included.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"2021-01-01"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;"2021-01-02 13:32:01"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                               value
2021-01-01 00:00:00.641049 0.527961
2021-01-01 00:01:00.088244 0.142192
2021-01-01 00:02:00.976195 0.269042
2021-01-01 00:03:00.922019 0.509333
2021-01-01 00:04:00.452614 0.646703
... ...
2021-01-02 13:28:00.925951 0.969213
2021-01-02 13:29:00.037827 0.758476
2021-01-02 13:30:00.309543 0.473163
2021-01-02 13:31:00.363813 0.846199
2021-01-02 13:32:00.867343 0.007899

[2253 rows x 1 columns]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
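&lt;p&gt;The row counts above follow from simple arithmetic, which this sketch reproduces on a hypothetical three-day frame without jitter: two full days are 2 × 1440 rows, and cutting the second day at 13:32:01 keeps 13 × 60 + 33 of its rows.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical three-day frame at minute resolution (no jitter).
idx = pd.date_range("2021-01-01", periods=3 * 1440, freq="min")
frame = pd.DataFrame({"value": range(len(idx))}, index=idx)

# A date-only end point covers the whole of that day.
whole_days = frame.loc["2021-01-01":"2021-01-02"]
# An end point with a time stops at that time.
cut_off = frame.loc["2021-01-01":"2021-01-02 13:32:01"]

print(len(whole_days), whole_days.index[-1])
print(len(cut_off), cut_off.index[-1])
```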



&lt;h2&gt;
  
  
  Slicing vs. exact matching
&lt;/h2&gt;

&lt;p&gt;Our three datasets have different resolutions in their index: day, minute, and microsecond respectively. If we pass in a string indexing parameter and the string is &lt;em&gt;less&lt;/em&gt; precise than the resolution of the index, it will be treated as a slice. If it’s equally or more precise, it’s treated as an exact match. Let’s use our microsecond (&lt;code&gt;minute2&lt;/code&gt;) and minute (&lt;code&gt;minute&lt;/code&gt;) resolution data examples. Note that every time you get a slice of the &lt;code&gt;DataFrame&lt;/code&gt;, the value returned is a &lt;code&gt;DataFrame&lt;/code&gt;. When it’s an exact match, it’s a &lt;code&gt;Series&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"2021-01-01"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# slice - the entire day
&lt;/span&gt;&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"2021-01-01 00"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# slice - the first hour of the day
&lt;/span&gt;&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"2021-01-01 00:00"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# slice - the first minute of the day
&lt;/span&gt;&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"2021-01-01 00:00:00"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# slice - the first minute and second of the day
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                            value
2021-01-01 00:00:00.641049 0.527961
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; &lt;span class="c1"&gt;# note the string representation includes the full microseconds
&lt;/span&gt;&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt; &lt;span class="c1"&gt;# slice - this seems incorrect to me, should return Series not DataFrame
&lt;/span&gt;&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="c1"&gt;# exact match
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2021-01-01 00:00:00.641049

value 0.527961
Name: 2021-01-01 00:00:00.641049, dtype: float64

minute.loc["2021-01-01"] # slice - the entire day
minute.loc["2021-01-01 00"] # slice - the first hour of the day
minute.loc["2021-01-01 00:00"] # exact match

value 0.124186
Name: 2021-01-01 00:00:00, dtype: float64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that for a microsecond resolution string match, I don’t see an exact match (where the return would be a &lt;code&gt;Series&lt;/code&gt;), but instead a slice match (because the return value is a &lt;code&gt;DataFrame&lt;/code&gt;). On the minute resolution &lt;code&gt;DataFrame&lt;/code&gt; it worked as I expected.&lt;/p&gt;
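
&lt;p&gt;To make the slice versus exact match rule concrete, here is a minimal sketch on a small stand-in for the minute resolution data. The frame below is hypothetical (120 rows, one per minute) and is not the dataset used earlier in the post.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical stand-in for the minute-resolution data: one row per minute.
idx = pd.date_range("2021-01-01", periods=120, freq="min")
minute = pd.DataFrame({"value": range(120)}, index=idx)

# The string is coarser (hour) than the index (minute): treated as a slice.
hour_slice = minute.loc["2021-01-01 00"]

# The string has the same resolution as the index: treated as an exact match.
exact_row = minute.loc["2021-01-01 00:00"]

print(type(hour_slice).__name__, len(hour_slice))
print(type(exact_row).__name__)
```

&lt;p&gt;Checking the type of the result is a quick way to tell which of the two behaviors pandas chose for a given string.&lt;/p&gt;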

&lt;h2&gt;
  
  
  asof
&lt;/h2&gt;

&lt;p&gt;One way to deal with this sort of issue is to use &lt;code&gt;asof&lt;/code&gt;. Often, when you have data that is either randomized in time or may have missing values, getting the most recent value as of a certain time is preferred. You could do this yourself, but it looks a little cleaner to use &lt;code&gt;asof&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="s"&gt;"2021-01-01 00:00:03"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# vs
&lt;/span&gt;&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;asof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"2021-01-01 00:00:03"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;value 0.527961
Name: 2021-01-01 00:00:03, dtype: float64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
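
&lt;p&gt;As a small self-contained sketch of the same idea, the snippet below builds a hypothetical &lt;code&gt;Series&lt;/code&gt; of three irregularly timed ticks (the name &lt;code&gt;ticks&lt;/code&gt; and its values are made up for illustration) and asks for the latest value as of two different times.&lt;/p&gt;

```python
import pandas as pd

# Three irregularly spaced observations at 0.5s, 2.25s and 7s after midnight.
start = pd.Timestamp("2021-01-01")
idx = start + pd.to_timedelta([500, 2250, 7000], unit="ms")
ticks = pd.Series([1.0, 2.0, 3.0], index=idx)

# Most recent value at or before 00:00:03 is the 2.25s observation.
latest = ticks.asof(pd.Timestamp("2021-01-01 00:00:03"))

# Nothing has been observed yet at midnight, so the result is NaN.
before_first = ticks.asof(pd.Timestamp("2021-01-01 00:00:00"))

print(latest, before_first)
```

&lt;p&gt;Note that &lt;code&gt;asof&lt;/code&gt; expects the index to be sorted, which is the case here.&lt;/p&gt;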



&lt;h2&gt;
  
  
  truncate
&lt;/h2&gt;

&lt;p&gt;You can also use &lt;code&gt;truncate&lt;/code&gt;, which is sort of like slicing. You specify a value of &lt;code&gt;before&lt;/code&gt; or &lt;code&gt;after&lt;/code&gt; (or both) to indicate cutoffs for data. Unlike slicing, which includes all values that partially match the date, &lt;code&gt;truncate&lt;/code&gt; assumes 0 for any unspecified values of the date.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;minute2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;truncate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;after&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"2021-01-01 00:00:03"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                               value
2021-01-01 00:00:00.641049 0.527961
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
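
&lt;p&gt;The difference from slicing is easiest to see with a coarse date string. The sketch below uses a hypothetical daily &lt;code&gt;Series&lt;/code&gt; covering January and February 2021 (not one of the datasets above) and compares the two approaches with the same month-level string.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical daily data: 2021-01-01 through 2021-02-28 (59 days).
idx = pd.date_range("2021-01-01", "2021-02-28", freq="D")
daily = pd.Series(range(len(idx)), index=idx)

# truncate fills the unspecified day with the start of the month,
# so the cutoff is 2021-02-01.
truncated = daily.truncate(after="2021-02")

# Slicing keeps every date that partially matches "2021-02",
# so all of February is included.
sliced = daily.loc[:"2021-02"]

print(len(truncated), truncated.index[-1])
print(len(sliced), sliced.index[-1])
```

&lt;p&gt;Here &lt;code&gt;truncate&lt;/code&gt; stops at the first day of February, while the slice runs to the end of the month.&lt;/p&gt;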



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;You can now see that time series data can be indexed a bit differently than other types of &lt;code&gt;Index&lt;/code&gt; in pandas. Understanding time series slicing will allow you to quickly navigate time series data and quickly move on to more advanced time series analysis.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.wrighters.io/indexing-time-series-data-in-pandas/"&gt;Indexing time series data in pandas&lt;/a&gt; appeared first on &lt;a href="https://www.wrighters.io"&gt;wrighters.io&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>pandas</category>
      <category>python</category>
    </item>
    <item>
      <title>Building Jupyter notebook workflows with scrapbook</title>
      <dc:creator>wrighter</dc:creator>
      <pubDate>Mon, 02 Aug 2021 21:36:05 +0000</pubDate>
      <link>https://forem.com/wrighter/building-jupyter-notebook-workflows-with-scrapbook-27lo</link>
      <guid>https://forem.com/wrighter/building-jupyter-notebook-workflows-with-scrapbook-27lo</guid>
      <description>&lt;p&gt;One principle of good software design is to limit the functionality and scope of a software component. Jupyter notebooks often grow in size and complexity as they are developed. It is tempting to put all of the logic for a complex workflow in one notebook. Breaking a workflow into multiple notebooks requires a way to communicate data between the notebooks. A notebook author needs to be able to persist data or results from one notebook and read it in another in order to build a workflow. There are many common options for this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Saving data to CSV/Pickle/Parquet, etc.&lt;/li&gt;
&lt;li&gt;Saving to a database (relational or object store)&lt;/li&gt;
&lt;li&gt;Inter-process communication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these options have one common problem: the notebook and the data are separate. It would be useful to have the data and notebook co-exist in one place. This is what the &lt;a href="https://nteract-scrapbook.readthedocs.io/"&gt;scrapbook&lt;/a&gt; library from &lt;a href="https://nteract.io"&gt;nteract&lt;/a&gt; does. Scrapbook allows a notebook author to persist some of the data from a notebook session into the notebook file itself. Then other notebooks (or Python applications) can read the notebook files and use the data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building workflows
&lt;/h2&gt;

&lt;p&gt;Instead of one notebook that executes an entire workflow, smaller notebooks can be created, &lt;a href="https://www.wrighters.io/unit-testing-python-code-in-jupyter-notebooks/"&gt;unit tested&lt;/a&gt;, and then &lt;a href="https://www.wrighters.io/parameters-jupyter-notebooks-with-papermill/"&gt;parameterized and executed with papermill&lt;/a&gt;. The outputs of each notebook can then be read by subsequent notebooks in the workflow. Each notebook executes and persists any results to be used by the next step in the process. Scrapbook persists the values in the notebook file itself. Later in the workflow, the notebook file is read and the values are retrieved. Any Python objects or display values can be persisted, as long as they can be serialized. The library includes some basic encoders, and new ones can be created easily.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;

&lt;p&gt;First, to use scrapbook, you have to install it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapbook
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or, if you want to install all the optional dependencies (for remote servers like Amazon S3 or Azure):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapbook[all]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How does it work?
&lt;/h3&gt;

&lt;p&gt;Scrapbook takes advantage of the fact that notebooks are just &lt;a href="https://nbformat.readthedocs.io/en/latest/"&gt;JSON documents&lt;/a&gt; with the ability to store different types of outputs for cells. The best way to understand this is to look at a simple example.&lt;/p&gt;

&lt;p&gt;First, create a source notebook and import the scrapbook library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import scrapbook as sb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, in a cell, define a value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we save the notebook, the cell above (in the JSON .ipynb file) will look something like this (you may see a different id and execution count):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
   "cell_type": "code",
   "execution_count": 1,
   "id": "6b5d2b33",
   "metadata": {},
   "outputs": [],
   "source": [
    "x = 1"
   ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, in a subsequent cell, we can use &lt;code&gt;scrapbook&lt;/code&gt; to &lt;code&gt;glue&lt;/code&gt; the value of &lt;code&gt;x&lt;/code&gt; to the current notebook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sb.glue("x", x)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After saving the notebook, the cell above (in the JSON .ipynb file) will look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
   "cell_type": "code",
   "execution_count": 1,
   "id": "228fc7d4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/scrapbook.scrap.json+json": {
       "data": 1,
       "encoder": "json",
       "name": "x",
       "version": 1
      }
     },
     "metadata": {
      "scrapbook": {
       "data": true,
       "display": false,
       "name": "x"
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "sb.glue(\"x\", x)"
   ]
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
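
&lt;p&gt;Because the scrap is stored as ordinary JSON in the cell outputs, it can be located with nothing but the standard library. The sketch below is only an illustration of the file layout shown above, not how scrapbook itself reads notebooks. The helper &lt;code&gt;find_scraps&lt;/code&gt; and the trimmed-down &lt;code&gt;notebook&lt;/code&gt; dictionary are hypothetical, and the sketch assumes data scraps use a MIME key starting with &lt;code&gt;application/scrapbook.scrap.&lt;/code&gt;, as in the cell above.&lt;/p&gt;

```python
import json

# Assumption: data scraps live under a MIME key with this prefix,
# as in the saved cell shown above.
SCRAP_MIME_PREFIX = "application/scrapbook.scrap."

# A trimmed-down notebook containing only the glued cell.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "outputs": [
                {
                    "data": {
                        "application/scrapbook.scrap.json+json": {
                            "data": 1,
                            "encoder": "json",
                            "name": "x",
                            "version": 1,
                        }
                    },
                    "output_type": "display_data",
                }
            ],
            "source": ['sb.glue("x", x)'],
        }
    ]
}


def find_scraps(nb):
    """Return a dict of scrap name to stored data for JSON-encoded scraps."""
    scraps = {}
    for cell in nb.get("cells", []):
        for output in cell.get("outputs", []):
            for mime, payload in output.get("data", {}).items():
                if mime.startswith(SCRAP_MIME_PREFIX):
                    scraps[payload["name"]] = payload["data"]
    return scraps


# Round-trip through JSON text to mimic reading an .ipynb file from disk.
loaded = json.loads(json.dumps(notebook))
print(find_scraps(loaded))
```

&lt;p&gt;In practice you would let scrapbook do this for you, since it also handles decoding values stored with other encoders.&lt;/p&gt;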



&lt;p&gt;While you don’t see any output in your notebook for this cell, there still is data hidden in the cell outputs as an encoded numeric value. Metadata is saved as well so that scrapbook can properly read the value later.&lt;/p&gt;

&lt;p&gt;Again, if the notebook up to this point has been saved, we can now read the notebook using &lt;code&gt;scrapbook&lt;/code&gt; and use it to fetch the value out of the notebook file. Usually, we do this in a different notebook or Python application, but it does work inside the same notebook (as long as it’s been saved to disk).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nb = sb.read_notebook("scrapbook_and_jupyter.ipynb")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The notebook object (&lt;code&gt;nb&lt;/code&gt;) has a number of attributes which correspond directly to the JSON schema of a notebook file, just as documented in the &lt;code&gt;nbformat&lt;/code&gt; &lt;a href="https://nbformat.readthedocs.io/en/latest/"&gt;docs&lt;/a&gt;. But it also has a few extra methods for dealing with &lt;code&gt;scraps&lt;/code&gt;, the values that have been glued to the notebook. You can see the scraps directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nb.scraps

Scraps([('x', Scrap(name='x', data=1, encoder='json', display=None))])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or see them in a &lt;code&gt;DataFrame&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nb.scrap_dataframe

  name data encoder display filename
0 x 1 json None scrapbook_and_jupyter.ipynb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And you can fetch the value easily.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = nb.scraps['x'].data
x

1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we’ve covered the basics, let’s put the work together for a more complicated example.&lt;/p&gt;

&lt;h2&gt;
  
  
  A sample workflow
&lt;/h2&gt;

&lt;p&gt;For this workflow, let’s build on the example from my article on &lt;a href="https://www.wrighters.io/parameters-jupyter-notebooks-with-papermill/"&gt;papermill&lt;/a&gt;. Let’s say we want to run a single notebook for a number of stock tickers and look for any symbols that are within a threshold of their All Time High price (ATH). Then, we will run a second notebook that reads all the notebooks from the first step, and only shows data from those tickers within the threshold.&lt;/p&gt;

&lt;p&gt;In the example we will use more &lt;code&gt;scrapbook&lt;/code&gt; features.&lt;/p&gt;

&lt;h3&gt;
  
  
  The first step of the workflow
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/wrighter/python_blogposts/tree/main/tools/scrapbook_example_source.ipynb"&gt;source notebook&lt;/a&gt; will be executed once for each ticker. To keep things simple (and fast), the notebook will generate fake data for this example, but could easily be connected to real data. The notebook generates a price series, an All Time High (ATH) price, and then determines if the last price is within a threshold of the ATH, along with a plot. The notebook saves the plot, the source data, and a few values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;length = 1000
symbol = "XYZ"
d = {
    "a": 1,
    "b": 2,
}
threshold = 0.1 # 10%

import pandas as pd
import numpy as np
import scrapbook as sb

import matplotlib.pyplot as plt

# generate a DataFrame that has synthetic price information
idx = pd.date_range(start='20100101', periods=length, freq='B')
prices = pd.DataFrame({'price' : np.cumsum(np.random.random(length) - .5)}, index=idx)
# normalize to always be above 0
prices['price'] += abs(prices['price'].min())
prices['ATH'] = prices['price'].expanding().max()

distance = 1 - prices.iloc[-1]['price']/prices.iloc[-1]['ATH']
if distance &amp;lt;= threshold:
    close_to_ath = True
else:
    close_to_ath = False

fig, ax = plt.subplots(figsize=(12,8))
ax.plot(prices['price'])
ax.plot(prices['ATH'])
ax.text(prices.index[-1], prices['price'].iloc[-1], f"{distance * 100: .1f}%");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--79YsnpGp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/www.wrighters.io/wp-content/uploads/2021/08/image.png%3Fresize%3D656%252C425%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--79YsnpGp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/www.wrighters.io/wp-content/uploads/2021/08/image.png%3Fresize%3D656%252C425%26ssl%3D1" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Gluing different types
&lt;/h3&gt;

&lt;p&gt;We’ve already covered the &lt;code&gt;glue&lt;/code&gt; method for a basic type. If the type passed in can be serialized using one of the built in encoders, it will be. To preserve numeric types, they will be encoded as JSON.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sb.glue("length", length) # numeric - int (stored as json)
sb.glue("symbol", symbol) # text
sb.glue("distance", distance) # numeric - float
sb.glue("close_to_ath", close_to_ath) # bool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also specify the encoder for more complex types. At this time (as of version 0.5 of scrapbook), there are encoders included for json, pandas, text, and display.&lt;/p&gt;

&lt;p&gt;There is also a &lt;code&gt;display&lt;/code&gt; &lt;em&gt;parameter&lt;/em&gt; to the &lt;code&gt;glue&lt;/code&gt; function. This determines whether the value is visible in the notebook when it is glued. By default you will not see the value in the notebook when it is stored.&lt;/p&gt;

&lt;p&gt;The display encoder will only save the displayed value, not the underlying data that backs it. This might make sense for visual types that can have a lot of data needed to create the result, and where you only want the visual result, not the data. For example, if we only wanted our plot from above, we could persist just the display. We don’t have an encoder that will encode a &lt;code&gt;matplotlib.figure.Figure&lt;/code&gt; (so an exception is raised), but since it can be displayed, it can be stored that way.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# with display set, this will display the value, see it in the output below?
sb.glue("dj", d, encoder="json", display=True)  
sb.glue("prices", prices, encoder="pandas")
sb.glue("message", "This is a message", encoder="text")

try:
    sb.glue("chart", fig)
except NotImplementedError as nie:
    print(nie)
# but we can store the display result (will also display the value)
sb.glue("chart", fig, encoder="display")

{'a': 1, 'b': 2}
Scrap of type &amp;lt;class 'matplotlib.figure.Figure'&amp;gt; has no supported encoder registered
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6PU_TJYb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/www.wrighters.io/wp-content/uploads/2021/08/image-1.png%3Fresize%3D656%252C425%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6PU_TJYb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/www.wrighters.io/wp-content/uploads/2021/08/image-1.png%3Fresize%3D656%252C425%26ssl%3D1" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that a parameterized notebook exists and can be executed with different values, we run it with a simple script (or from the command line) for a number of tickers. For example, we might do something like this in the directory where the notebook file exists (with some fake tickers):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir tickers
for s in AAA ABC BCD DEF GHI JKL MNO MMN OOP PQD XYZ PDQ
do
    papermill -p symbol $s scrapbook_example_source.ipynb tickers/${s}.ipynb
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
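
&lt;p&gt;If you would rather drive the loop from Python, one possible sketch is to build the same command lines as lists. The helper name &lt;code&gt;build_papermill_commands&lt;/code&gt; is hypothetical. It only assembles the arguments used in the shell loop above and does not run anything itself.&lt;/p&gt;

```python
def build_papermill_commands(symbols,
                             source="scrapbook_example_source.ipynb",
                             out_dir="tickers"):
    """Build one papermill command (as an argument list) per ticker symbol."""
    commands = []
    for symbol in symbols:
        output = f"{out_dir}/{symbol}.ipynb"
        # Same arguments as the shell loop: -p symbol SYMBOL input output
        commands.append(["papermill", "-p", "symbol", symbol, source, output])
    return commands


commands = build_papermill_commands(["AAA", "ABC", "XYZ"])
for cmd in commands:
    print(" ".join(cmd))
```

&lt;p&gt;Each list can then be handed to &lt;code&gt;subprocess.run(cmd, check=True)&lt;/code&gt; once the output directory exists, which stops the loop on the first notebook that fails.&lt;/p&gt;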



&lt;p&gt;At this point, assuming there were no failures in the notebooks, there should be a directory of notebook files with data for each ticker.&lt;/p&gt;

&lt;h3&gt;
  
  
  The second workflow step
&lt;/h3&gt;

&lt;p&gt;Our &lt;a href="https://github.com/wrighter/python_blogposts/tree/main/tools/scrapbook_example_dest.ipynb"&gt;second notebook&lt;/a&gt; in the workflow loads each of the notebooks generated above, creating a report of those that are within the threshold.&lt;/p&gt;

&lt;p&gt;An additional API is used here: the &lt;code&gt;read_notebooks&lt;/code&gt; function, which allows us to fetch the notebooks all at once. We’ll iterate through them and display the ticker and distance for each notebook, and show the chart for each that is within the threshold.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source_dir = "tickers"
sbook = sb.read_notebooks(source_dir)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we have a scrapbook of notebooks (&lt;code&gt;sbook&lt;/code&gt;) that we can iterate through.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for nb in sbook.notebooks:
    print(f"{nb.scraps['symbol'].data: &amp;lt;5} {nb.scraps['distance'].data * 100: .2f}%")
    if nb.scraps['close_to_ath'].data:
        display(nb.scraps['chart'].display['data'], raw=True)   

AAA 49.81%
ABC 60.51%
BCD 0.13%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---Gi59bzz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/www.wrighters.io/wp-content/uploads/2021/08/image-2.png%3Fresize%3D656%252C424%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---Gi59bzz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/www.wrighters.io/wp-content/uploads/2021/08/image-2.png%3Fresize%3D656%252C424%26ssl%3D1" alt=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DEF 94.09%
FB 80.13%
GHI 19.65%
JKL 44.80%
MMN 100.00%
MNO 2.42%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gWuX_ZGW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/www.wrighters.io/wp-content/uploads/2021/08/image-3.png%3Fresize%3D656%252C419%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gWuX_ZGW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/www.wrighters.io/wp-content/uploads/2021/08/image-3.png%3Fresize%3D656%252C419%26ssl%3D1" alt=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OOP 24.18%
PDQ 93.33%
PQD 18.19%
XYZ 44.14%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Reglue
&lt;/h3&gt;

&lt;p&gt;One last API to mention is &lt;code&gt;reglue&lt;/code&gt;. You can use this method on an existing notebook to “re”-glue a scrap into the current notebook. You can also rename the scrap.&lt;/p&gt;

&lt;p&gt;This is probably most useful if you want to propagate some data forward to another notebook that will be reading the current notebook.&lt;/p&gt;

&lt;p&gt;Another use of &lt;code&gt;reglue&lt;/code&gt; is to display visual elements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nb.reglue("length", "length2") # new name
nb.reglue("chart") # will display chart, just like earlier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Some possible drawbacks to scrapbook
&lt;/h3&gt;

&lt;p&gt;Using notebooks to store your data is not an optimized way to store data. There are a number of potential issues with choosing a tool like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It obviously doesn’t scale like a relational or object database.&lt;/li&gt;
&lt;li&gt;It’s not obvious to those reading notebook code how much data is being persisted, or where it is.&lt;/li&gt;
&lt;li&gt;It also does not have good tool support for editing data manually, especially for more complex types that will be large chunks of Base64 encoded text.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You won’t want to use a tool like this to support large amounts of data produced by notebooks. But for smaller amounts of data, especially concise summaries or outputs of a notebook, it provides the highly desirable feature of keeping data with the notebook that generated it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extending scrapbook
&lt;/h3&gt;

&lt;p&gt;You can extend the framework by writing your own encoders. The documentation shows a simple example of this, so if you end up with data that can’t be encoded using the default encoders, you can create your own.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Scrapbook is a useful small library for keeping notebooks and the data they produce together in one file. It integrates well with papermill, which allows you to pass in parameters to your notebooks. Scrapbook is especially useful for running workflows of multiple notebooks that feed data to one another.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.wrighters.io/building-jupyter-notebook-workflows-with-scrapbook/"&gt;Building Jupyter notebook workflows with scrapbook&lt;/a&gt; appeared first on &lt;a href="https://www.wrighters.io"&gt;wrighters.io&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
    </item>
    <item>
      <title>How to iterate over pandas DataFrame rows (and should you?)</title>
      <dc:creator>wrighter</dc:creator>
      <pubDate>Sun, 30 May 2021 22:38:09 +0000</pubDate>
      <link>https://forem.com/wrighter/how-to-iterate-over-dataframe-rows-and-should-you-35o3</link>
      <guid>https://forem.com/wrighter/how-to-iterate-over-dataframe-rows-and-should-you-35o3</guid>
      <description>&lt;p&gt;One of the most searched for (and discussed) questions about pandas is how to iterate over rows in a &lt;code&gt;DataFrame&lt;/code&gt;. Often this question comes up right away for new users who have loaded some data into a &lt;code&gt;DataFrame&lt;/code&gt; and now want to do something useful with it. The natural way for most programmers to think of what to do next is to build a loop. They may not understand the “correct” way to work with &lt;code&gt;DataFrames&lt;/code&gt; yet, but even experienced pandas and NumPy developers will consider iterating over the rows of a &lt;code&gt;DataFrame&lt;/code&gt; to solve a problem. Instead of trying to find the one right answer about iteration, it makes better sense to understand the issues involved and know when to choose the best solution.&lt;/p&gt;

&lt;p&gt;As of this writing, the &lt;a href="https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas"&gt;top voted question tagged with ‘pandas’ on Stack Overflow&lt;/a&gt; is about how to iterate over &lt;code&gt;DataFrame&lt;/code&gt; rows. &lt;a href="https://stackoverflow.blog/2021/04/19/how-often-do-people-actually-copy-and-paste-from-stack-overflow-now-we-know/"&gt;It also turns out&lt;/a&gt; that question has &lt;a href="https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas/16476974#16476974"&gt;the most copied answer with a code block&lt;/a&gt; on the entire site. The Stack Overflow developers say thousands of people view the answer weekly and copy it to solve their problem. Obviously people want to iterate over &lt;code&gt;DataFrame&lt;/code&gt; rows!&lt;/p&gt;

&lt;p&gt;It is also true that there can be serious consequences with iterating over &lt;code&gt;DataFrame&lt;/code&gt; rows using the top solution. Other answers to the question (especially the second highest rated answer) do a fairly good job of giving other options, but the entire list of 26 (and counting!) answers is extremely confusing. Instead of asking &lt;em&gt;how&lt;/em&gt; to iterate over &lt;code&gt;DataFrame&lt;/code&gt; rows, it makes more sense to understand what options are available, what their advantages and disadvantages are, and then choose the one that makes sense for you. In some cases, the top voted answer for iteration might be the best choice!&lt;/p&gt;

&lt;h2&gt;
  
  
  But I have heard that iteration is wrong, is that true?
&lt;/h2&gt;

&lt;p&gt;First, choosing to iterate over the rows of a &lt;code&gt;DataFrame&lt;/code&gt; is not automatically the wrong way to solve a problem. In most cases, though, what beginners are trying to do with iteration is better done with another approach. Still, no one should ever feel bad about writing a first solution that uses iteration instead of other (perhaps better) ways. That’s often the best way to learn. You can think of a first solution as the first draft of your essay: you can improve it with some editing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Now what do we want to do with the &lt;code&gt;DataFrame&lt;/code&gt;?
&lt;/h2&gt;

&lt;p&gt;Let’s start with basic questions. If we look at the original question on Stack Overflow, the question and answer just print the content of the &lt;code&gt;DataFrame&lt;/code&gt;. First off, let’s all agree that this is not a good way to look at the content of a &lt;code&gt;DataFrame&lt;/code&gt;. The standard rendering of a &lt;code&gt;DataFrame&lt;/code&gt;, whether it is rendered with &lt;code&gt;print&lt;/code&gt;, viewed in a Jupyter notebook using &lt;code&gt;display&lt;/code&gt;, or shown as the output of a cell, will be far better than what would be printed using custom formatting.&lt;/p&gt;

&lt;p&gt;If the &lt;code&gt;DataFrame&lt;/code&gt; is large, only some columns and rows may be visible by default. Use &lt;code&gt;head&lt;/code&gt; and &lt;code&gt;tail&lt;/code&gt; to get a sense of the data. If you want to only look at subsets of a &lt;code&gt;DataFrame&lt;/code&gt;, instead of using a loop to only display those rows, use the &lt;a href="https://www.wrighters.io/indexing-and-selecting-in-pandas-part-1/"&gt;powerful indexing capabilities of pandas&lt;/a&gt;. With a little practice, you can select any combinations of rows or columns to show. Start there first.&lt;/p&gt;
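
&lt;p&gt;As a quick sketch of what that looks like, the snippet below selects subsets with &lt;code&gt;loc&lt;/code&gt; instead of a loop. The tiny frame and its values are hypothetical and stand in for real data. The &lt;code&gt;gt&lt;/code&gt; method is simply the method form of a greater-than comparison.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical customer data, just large enough to show the selections.
df = pd.DataFrame({
    "first_name": ["Ann", "Bob", "Cy", "Dee"],
    "state": ["Ohio", "Iowa", "Ohio", "Utah"],
    "balance": [-20, 650, 120, 900],
})

# Rows with a balance above 500, showing only two of the columns.
high_balance = df.loc[df["balance"].gt(500), ["first_name", "balance"]]

# All columns for customers in a single state.
ohio = df.loc[df["state"] == "Ohio"]

print(high_balance)
print(ohio)
```

&lt;p&gt;Both selections return a new &lt;code&gt;DataFrame&lt;/code&gt;, so they render with the standard formatting rather than anything hand-built.&lt;/p&gt;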

&lt;p&gt;Now instead of a trivial printing example, let’s look at ways to actually use data for a row in a &lt;code&gt;DataFrame&lt;/code&gt; that includes some logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;Let’s build an example &lt;code&gt;DataFrame&lt;/code&gt; to use. I’ll do this by making some fake data (using &lt;a href="https://faker.readthedocs.io/en/master/"&gt;Faker&lt;/a&gt;). Note that the columns are different data types (we have some strings, an integer, and dates).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import datetime, timedelta

import pandas as pd
import numpy as np
from faker import Faker

fake = Faker()

today = datetime.now()
next_month = today + timedelta(days=30)
df = pd.DataFrame([[fake.first_name(), fake.last_name(),
                    fake.date_this_decade(), fake.date_between_dates(today, next_month),
                    fake.city(), fake.state(), fake.zipcode(), fake.random_int(-100,1000)]
                  for r in range(100)],
                  columns=['first_name', 'last_name', 'start_date',
                           'end_date', 'city', 'state', 'zipcode', 'balance'])

df['start_date'] = pd.to_datetime(df['start_date']) # convert to datetimes
df['end_date'] = pd.to_datetime(df['end_date'])

df.dtypes

first_name object
last_name object
start_date datetime64[ns]
end_date datetime64[ns]
city object
state object
zipcode object
balance int64
dtype: object

df.head()

  first_name last_name start_date end_date city state \
0 Katherine Moody 2020-02-04 2021-06-28 Longberg Maryland   
1 Sarah Merritt 2021-03-02 2021-05-30 South Maryborough Tennessee   
2 Karen Hensley 2020-02-29 2021-06-23 Brentside Missouri   
3 David Ferguson 2020-02-02 2021-06-14 Judithport Virginia   
4 Phillip Davis 2020-07-17 2021-06-04 Louisberg Minnesota   

  zipcode balance  
0 20496 493  
1 18495 680  
2 63702 427  
3 66787 587  
4 98616 211  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  A first attempt
&lt;/h3&gt;

&lt;p&gt;Let’s say that our &lt;code&gt;DataFrame&lt;/code&gt; contains customer data and we have a scoring function for customers that uses multiple customer attributes to give them a score between ‘A’ and ‘F’. Any customer with a negative balance is scored an ‘F’ and any customer with a balance above 500 is scored an ‘A’. After that, the logic depends on whether a customer is a ‘legacy’ customer and what state they live in.&lt;/p&gt;

&lt;p&gt;Note that I made doctests for this function, see &lt;a href="https://www.wrighters.io/unit-testing-python-code-in-jupyter-notebooks/"&gt;my post on Jupyter unit testing&lt;/a&gt; for more details on how to unit test in Jupyter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass
class Customer:
    first_name: str
    last_name: str
    start_date: datetime
    end_date: datetime
    city: str
    state: str
    zipcode: str
    balance: int

def score_customer(customer:Customer) -&amp;gt; str:
    """Give a customer a credit score.
    &amp;gt;&amp;gt;&amp;gt; score_customer(Customer("Joe", "Smith", datetime(2020, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, -5))
    'F'
    &amp;gt;&amp;gt;&amp;gt; score_customer(Customer("Joe", "Smith", datetime(2020, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 50))
    'C'
    &amp;gt;&amp;gt;&amp;gt; score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 50))
    'D'
    &amp;gt;&amp;gt;&amp;gt; score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 150))
    'C'
    &amp;gt;&amp;gt;&amp;gt; score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 250))
    'B'
    &amp;gt;&amp;gt;&amp;gt; score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 350))
    'B'
    &amp;gt;&amp;gt;&amp;gt; score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Santa Fe", "California", 88888, 350))
    'A'
    &amp;gt;&amp;gt;&amp;gt; score_customer(Customer("Joe", "Smith", datetime(2020, 1, 1), datetime(2023,1,1), "Santa Fe", "California", 88888, 50))
    'C'
    """
    if customer.balance &amp;lt; 0:
        return 'F'
    if customer.balance &amp;gt; 500:
        return 'A'
    # legacy vs. non-legacy
    if customer.start_date &amp;gt; datetime(2020, 1, 1):
        if customer.balance &amp;lt; 100:
            return 'D'
        elif customer.balance &amp;lt; 200:
            return 'C'
        elif customer.balance &amp;lt; 300:
            return 'B'
        else:
            if customer.state in ['Illinois', 'Indiana']:
                return 'B'
            else:
                return 'A'
    else:
        if customer.balance &amp;lt; 100:
            return 'C'
        else:
            return 'A'

import doctest
doctest.testmod()

TestResults(failed=0, attempted=8)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Scoring our customers
&lt;/h2&gt;

&lt;p&gt;OK, now that we have a concrete example, how do we obtain the score for all of our customers? Let’s just go straight to the top answer from the Stack Overflow question, &lt;code&gt;DataFrame.iterrows&lt;/code&gt;. This is a generator that returns the index for a row along with the row as a &lt;code&gt;Series&lt;/code&gt;. If you aren’t familiar with what a &lt;a href="https://wiki.python.org/moin/Generators"&gt;generator&lt;/a&gt; is, you can think of it as a function you can iterate over. As a result, calling &lt;code&gt;next&lt;/code&gt; on it will yield the first element.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;next(df.iterrows())

(0,
 first_name Katherine
 last_name Moody
 start_date 2020-02-04 00:00:00
 end_date 2021-06-28 00:00:00
 city Longberg
 state Maryland
 zipcode 20496
 balance 493
 Name: 0, dtype: object)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
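&lt;p&gt;If generators are new to you, here is a minimal sketch in plain Python. The &lt;code&gt;iter_rows&lt;/code&gt; function and its data are made up for illustration and are not how pandas implements &lt;code&gt;iterrows&lt;/code&gt;; they only show how &lt;code&gt;next&lt;/code&gt; pulls one (index, row) pair at a time.&lt;/p&gt;

```python
def iter_rows(rows):
    """Yield (index, row) pairs one at a time, loosely mimicking iterrows."""
    for index, row in enumerate(rows):
        yield index, row


# hypothetical stand-in data, not the DataFrame from this article
rows = [{"first_name": "Ann", "balance": 10},
        {"first_name": "Bob", "balance": 20}]

gen = iter_rows(rows)
first = next(gen)      # runs the function body up to the first yield
second = next(gen)     # resumes right after that yield
remaining = list(gen)  # the generator is exhausted, so this is empty

print(first)
print(second)
print(remaining)
```

&lt;p&gt;Nothing is computed until a value is requested, which is why a generator can be dropped straight into a loop or a list comprehension.&lt;/p&gt;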



&lt;p&gt;This looks promising! This is a tuple containing the index of the first row and the row data itself. Maybe we can just pass it right into our function. Let’s try that out and see what happens. Even though the row is a &lt;code&gt;Series&lt;/code&gt;, the columns are the same as the attributes of our &lt;code&gt;Customer&lt;/code&gt; class, so we might be able to just pass this into our scoring function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;score_customer(next(df.iterrows())[1])

'A'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wow, that seemed to work. Can we just score the entire table?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df['score'] = [score_customer(c[1]) for c in df.iterrows()]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Is this our best choice?
&lt;/h2&gt;

&lt;p&gt;Wow, that seems too easy. You can see why this is the top voted answer, since it seems to do exactly what we want. Why would there be any controversy about this answer?&lt;/p&gt;

&lt;p&gt;As is usually the case with pandas (and really with any software engineering question), picking an ideal solution depends on the inputs. Let’s summarize what the issues could be with various design choices. If the issues raised don’t fit your specific use case, iteration using &lt;code&gt;iterrows&lt;/code&gt; may be a perfectly acceptable solution! I won’t judge you. I use it plenty of times, and will summarize at the end how to make decisions about the possible solutions.&lt;/p&gt;

&lt;p&gt;The arguments for and against using &lt;code&gt;iterrows&lt;/code&gt; can be grouped into the following categories.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Efficiency (Speed and Memory)&lt;/li&gt;
&lt;li&gt;Mixed types in a row causing issues&lt;/li&gt;
&lt;li&gt;Readability and maintainability&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Speed and Memory
&lt;/h2&gt;

&lt;p&gt;In general, if you want things to be fast in pandas (or NumPy, or any framework that offers vectorized calculations), you will not want to iterate through elements but instead choose a vectorized solution. However, even if the solution &lt;em&gt;can&lt;/em&gt; be vectorized, it might be a lot of work for the programmer to do so, especially a beginner. Other answers to the question on Stack Overflow present a host of other solutions. Most of them fall into one of the following categories, listed in order of preference for speed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Vectorization&lt;/li&gt;
&lt;li&gt;Cython routines&lt;/li&gt;
&lt;li&gt;List comprehensions (vanilla for loop)&lt;/li&gt;
&lt;li&gt;DataFrame.apply()&lt;/li&gt;
&lt;li&gt;DataFrame.itertuples() and iteritems()&lt;/li&gt;
&lt;li&gt;DataFrame.iterrows()&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Vectorization
&lt;/h3&gt;

&lt;p&gt;The main problem with always telling people to vectorize everything is that at times a vectorized solution may be a real chore to write, debug, and maintain. The examples given to prove that vectorization is preferred often show trivial operations, like simple multiplication. But since the example I started with in this article is not just a single calculation, I decided to write one possible vectorized solution to this problem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def vectorized_score(df):
    return np.select([df['balance'] &amp;lt; 0,
                      df['balance'] &amp;gt; 500, # must come before the state rule below, which would otherwise score these 'B'
                      ((df['start_date'] &amp;gt; datetime(2020,1,1)) &amp;amp;
                       (df['balance'] &amp;lt; 100)),
                      ((df['start_date'] &amp;gt; datetime(2020,1,1)) &amp;amp;
                       (df['balance'] &amp;gt;= 100) &amp;amp;
                       (df['balance'] &amp;lt; 200)),
                      ((df['start_date'] &amp;gt; datetime(2020,1,1)) &amp;amp;
                       (df['balance'] &amp;gt;= 200) &amp;amp;
                       (df['balance'] &amp;lt; 300)),
                      ((df['start_date'] &amp;gt; datetime(2020,1,1)) &amp;amp;
                       (df['balance'] &amp;gt;= 300) &amp;amp;
                       df['state'].isin(['Illinois', 'Indiana'])),
                      ((df['start_date'] &amp;lt;= datetime(2020,1,1)) &amp;amp; # legacy customers
                       (df['balance'] &amp;lt; 100)),
                     ], # conditions
                     ['F',
                      'A',
                      'D',
                      'C',
                      'B',
                      'B',
                      'C'], # choices
                     'A') # default score

assert (df['score'] == vectorized_score(df)).all()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There’s more than one way to do this, of course. I chose to use &lt;code&gt;np.select&lt;/code&gt; (you can read more about it and other ways to update &lt;code&gt;DataFrame&lt;/code&gt;s in &lt;a href="https://www.wrighters.io/selecting-in-pandas-using-where-and-mask/"&gt;my article on using &lt;code&gt;where&lt;/code&gt; and &lt;code&gt;mask&lt;/code&gt;&lt;/a&gt;). I sort of like using &lt;code&gt;np.select&lt;/code&gt; when you have multiple conditions like this, although it’s not extremely readable. We could have also done this using more code with vectorized updates for each step and made it much more readable. It would probably be similar in terms of speed.&lt;/p&gt;

&lt;p&gt;I personally find this very unreadable, but maybe with some good comments it could be clearly explained to future maintainers (or my future self). But the reason we are doing vectorized code is to make this faster. How does performance look for our sample &lt;code&gt;DataFrame&lt;/code&gt;?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%timeit vectorized_score(df)

2.75 ms ± 489 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s also time our original solution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%timeit [score_customer(c[1]) for c in df.iterrows()] 

13.5 ms ± 911 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OK, so we’re almost 5x faster, just with our tiny dataset. This speedup wouldn’t be enough to matter for small sizes, but with big datasets a simple rewrite to get that much of a speedup makes sense. And I’m sure that a faster vectorized version could be written with a little thought and profiling applied to the situation. But hold on until the end to see what the performance looks like for larger datasets.&lt;/p&gt;
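&lt;p&gt;To illustrate the step-by-step alternative mentioned earlier, here is one possible sketch that applies a single rule per line using &lt;code&gt;Series.mask&lt;/code&gt; and &lt;code&gt;Series.where&lt;/code&gt;. The function name &lt;code&gt;stepwise_score&lt;/code&gt; and the small &lt;code&gt;cases&lt;/code&gt; table are my own for illustration. I haven’t timed it against the &lt;code&gt;np.select&lt;/code&gt; version, so treat it as a readability experiment rather than a benchmark.&lt;/p&gt;

```python
from datetime import datetime

import pandas as pd


def stepwise_score(df: pd.DataFrame) -> pd.Series:
    """Score customers with one vectorized step per rule.

    Later steps overwrite earlier ones, so the rules are applied
    from lowest to highest priority.
    """
    balance = df["balance"]
    is_new = df["start_date"].gt(datetime(2020, 1, 1))

    # non-legacy customers: start from the default and tighten it
    new_score = pd.Series("A", index=df.index)
    new_score = new_score.mask(df["state"].isin(["Illinois", "Indiana"]), "B")
    new_score = new_score.mask(balance.lt(300), "B")
    new_score = new_score.mask(balance.lt(200), "C")
    new_score = new_score.mask(balance.lt(100), "D")

    # legacy customers only have one extra rule
    legacy_score = pd.Series("A", index=df.index).mask(balance.lt(100), "C")

    # keep new_score where is_new is True, otherwise use legacy_score
    score = new_score.where(is_new, legacy_score)

    # the balance extremes win over everything else
    score = score.mask(balance.gt(500), "A")
    score = score.mask(balance.lt(0), "F")
    return score


# hypothetical rows mirroring the doctest cases, plus one high balance
cases = pd.DataFrame({
    "start_date": pd.to_datetime(["2020-01-01", "2020-01-01", "2021-01-01",
                                  "2021-01-01", "2021-01-01", "2021-01-01",
                                  "2021-01-01", "2021-01-01"]),
    "state": ["Illinois", "Illinois", "Illinois", "Illinois",
              "Illinois", "Illinois", "California", "Illinois"],
    "balance": [-5, 50, 50, 150, 250, 350, 350, 600],
})
scores = stepwise_score(cases)
print(scores.tolist())
```

&lt;p&gt;Because each line overwrites the previous result, the order of the steps carries the priority that the nested &lt;code&gt;if&lt;/code&gt; statements express in the original function.&lt;/p&gt;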

&lt;h3&gt;
  
  
  Cython
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cython.readthedocs.io/en/latest/"&gt;Cython&lt;/a&gt; is a project that makes it easy to write C extensions for Python using (mostly) Python syntax. I confess that I’m far from a Cython expert, but have found that even just a little bit of effort in Cython can make a Python code hotspot much faster. In this case, we have shown that we can make a vectorized solution, so using Cython in a non-vectorized solution would probably not be worth pursuing as a first choice. However, I did write a simple Cython version &lt;a href="https://github.com/wrighter/python_blogposts/blob/main/pandas/iter.pyx"&gt;here&lt;/a&gt; and it was the fastest of the non-vectorized solutions at smaller sized inputs, even with just a tiny bit of effort. Especially for cases where there is a lot of calculation done per row that can’t be vectorized, using Cython might be a great choice, but will require an investment in time.&lt;/p&gt;

&lt;h3&gt;
  
  
  List comprehensions
&lt;/h3&gt;

&lt;p&gt;Now the next option is a little different. I admit that I don’t think I’ve used this technique often. The idea here is to use a list comprehension, invoking your function with each element in your &lt;code&gt;DataFrame&lt;/code&gt;. Note that I did use a list comprehension already in our first solution, but it was along with &lt;code&gt;iterrows&lt;/code&gt;. This time instead of using &lt;code&gt;iterrows&lt;/code&gt;, the data is pulled out of each column in the &lt;code&gt;DataFrame&lt;/code&gt; directly and then iterated over. No &lt;code&gt;Series&lt;/code&gt; is created in this case. If your function has multiple arguments, you can use &lt;code&gt;zip&lt;/code&gt; to make tuples of the arguments, passing in the columns in your &lt;code&gt;DataFrame&lt;/code&gt; to match the argument order. Now to do this, I’ll need a modified scoring function, since I don’t have already constructed &lt;code&gt;Customer&lt;/code&gt; objects in my &lt;code&gt;DataFrame&lt;/code&gt;, and creating them just to invoke the function would add another layer. I only use three attributes of the customer, so here’s a simple rewrite.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def score_customer_attributes(balance:int, start_date:datetime, state:str) -&amp;gt; str:
    if balance &amp;lt; 0:
        return 'F'
    if balance &amp;gt; 500:
        return 'A'
    # legacy vs. non-legacy
    if start_date &amp;gt; datetime(2020, 1, 1):
        if balance &amp;lt; 100:
            return 'D'
        elif balance &amp;lt; 200:
            return 'C'
        elif balance &amp;lt; 300:
            return 'B'
        else:
            if state in ['Illinois', 'Indiana']:
                return 'B'
            else:
                return 'A'
    else:
        if balance &amp;lt; 100:
            return 'C'
        else:
            return 'A'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here’s what the first loop of the list comprehension will look like when calling the function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;next(zip(df['balance'], df['start_date'], df['state']))

(493, Timestamp('2020-02-04 00:00:00'), 'Maryland')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will now build a list of all the scores for the entire &lt;code&gt;DataFrame&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df['score3'] = [score_customer_attributes(*a) for a in zip(df['balance'], df['start_date'], df['state'])]
assert (df['score'] == df['score3']).all()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now how fast is this?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%timeit [score_customer_attributes(*a) for a in zip(df['balance'], df['start_date'], df['state'])]

171 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wow, that’s much faster, over 70x faster than the original for this data. By just taking the raw data and invoking a simple Python function, the scores are all calculated quickly in Python space. No row conversions to &lt;code&gt;Series&lt;/code&gt; need to take place.&lt;/p&gt;

&lt;p&gt;Note that we could also invoke our original function, we’d just have to make a &lt;code&gt;Customer&lt;/code&gt; object to pass in. This is a bit uglier, but still quite fast.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%timeit [score_customer(Customer(first_name='', last_name='', end_date=None, city=None, zipcode=None, balance=a[0], start_date=a[1], state=a[2])) for a in zip(df['balance'], df['start_date'], df['state'])]

254 µs ± 2.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  DataFrame.apply
&lt;/h3&gt;

&lt;p&gt;We can also use &lt;code&gt;DataFrame.apply&lt;/code&gt;. Note that to apply this to rows, you need to pass in the correct axis since it defaults to applying to each column. The axis argument here is specifying which index you want to have in the object passed to your function. We want each object to be a customer row, with the columns as the index.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;assert (df.apply(score_customer, axis=1) == df['score']).all()

%timeit df.apply(score_customer, axis=1)

3.57 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The performance here is better than our original, over 3x faster. This is also very readable, and allows us to use our easy to read and maintain original function. It’s still slower than the list comprehension though because it is constructing a &lt;code&gt;Series&lt;/code&gt; object for each row.&lt;/p&gt;

&lt;h3&gt;
  
  
  DataFrame.iteritems and DataFrame.itertuples
&lt;/h3&gt;

&lt;p&gt;Now we will look at the regular iteration methods in more detail. There are three &lt;code&gt;iter&lt;/code&gt; functions available for &lt;code&gt;DataFrame&lt;/code&gt;s: &lt;code&gt;iteritems&lt;/code&gt;, &lt;code&gt;itertuples&lt;/code&gt;, and &lt;code&gt;iterrows&lt;/code&gt;. &lt;code&gt;DataFrame&lt;/code&gt;s also support iteration directly, but these functions don’t all iterate over the same things. Since the names alone don’t make it obvious what each method does, let’s review them all here.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;iter(df)&lt;/code&gt; (calls the &lt;code&gt;DataFrame.__iter__&lt;/code&gt; method). Iterates over the info axis, which for &lt;code&gt;DataFrame&lt;/code&gt;s is the column names, not the values.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;next(iter(df)) # 'first_name'

'first_name'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;iteritems&lt;/code&gt;. Iterate over the columns, returning a tuple of column name and the column as a &lt;code&gt;Series&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;next(df.iteritems())
next(df.items()) # these two are equivalent

('first_name',
 0 Katherine
 1 Sarah
 2 Karen
 3 David
 4 Phillip
          ...     
 95 Robert
 96 Christopher
 97 Kristen
 98 Nicholas
 99 Caroline
 Name: first_name, Length: 100, dtype: object)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;items&lt;/code&gt;. This is the same as above. &lt;code&gt;iteritems&lt;/code&gt; actually just invokes &lt;code&gt;items&lt;/code&gt;. Note that &lt;code&gt;iteritems&lt;/code&gt; was deprecated in pandas 1.5 and removed in 2.0, so prefer &lt;code&gt;items&lt;/code&gt; in new code.
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;iterrows&lt;/code&gt;. We have already seen this. It iterates through the rows, returning each one as a tuple of the index and the row as a &lt;code&gt;Series&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;next(df.iterrows())

(0,
 first_name Katherine
 last_name Moody
 start_date 2020-02-04 00:00:00
 end_date 2021-06-28 00:00:00
 city Longberg
 state Maryland
 zipcode 20496
 balance 493
 score A
 score3 A
 Name: 0, dtype: object)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;itertuples&lt;/code&gt;. Iterates over the rows, returning a &lt;code&gt;namedtuple&lt;/code&gt; for each row. You can optionally change the name of the tuple and disable the index being returned.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;next(df.itertuples())

Pandas(Index=0, first_name='Katherine', last_name='Moody', start_date=Timestamp('2020-02-04 00:00:00'), end_date=Timestamp('2021-06-28 00:00:00'), city='Longberg', state='Maryland', zipcode='20496', balance=493, score='A', score3='A')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
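&lt;p&gt;The two optional arguments to &lt;code&gt;itertuples&lt;/code&gt; are &lt;code&gt;index&lt;/code&gt; and &lt;code&gt;name&lt;/code&gt;. Here is a small sketch of both, using a made-up two-row &lt;code&gt;people&lt;/code&gt; frame rather than the customer data above.&lt;/p&gt;

```python
import pandas as pd

# hypothetical two-row frame, just to show the arguments
people = pd.DataFrame({"first_name": ["Ann", "Bob"], "balance": [10, 20]})

default_row = next(people.itertuples())
custom_row = next(people.itertuples(index=False, name="Customer"))

print(default_row)  # includes the Index field and uses the default tuple name
print(custom_row)   # no Index field, and the tuple type is named Customer
```

&lt;p&gt;Dropping the index is handy when the tuple fields need to line up with the arguments of an existing function or class.&lt;/p&gt;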



&lt;h2&gt;
  
  
  Using itertuples
&lt;/h2&gt;

&lt;p&gt;Since we already looked at &lt;code&gt;iterrows&lt;/code&gt;, we only need to look at &lt;code&gt;itertuples&lt;/code&gt;. As you can see, the returned value, a &lt;code&gt;namedtuple&lt;/code&gt;, can be used in our original function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;assert ([score_customer(t) for t in df.itertuples()] == df['score']).all()

%timeit [score_customer(t) for t in df.itertuples()] 

858 µs ± 5.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The performance here is pretty good, roughly 15x faster than our original &lt;code&gt;iterrows&lt;/code&gt; version. The construction of a &lt;code&gt;namedtuple&lt;/code&gt; for each row is much faster than construction of a &lt;code&gt;Series&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mixed types in a row
&lt;/h2&gt;

&lt;p&gt;Now is a good time to bring up another difference between &lt;code&gt;iterrows&lt;/code&gt; and &lt;code&gt;itertuples&lt;/code&gt;. A &lt;code&gt;namedtuple&lt;/code&gt; can properly represent any type in a single row. In our case, we have strings, date types, and integers. A pandas &lt;code&gt;Series&lt;/code&gt;, however, has to have only one datatype for the entire &lt;code&gt;Series&lt;/code&gt;. Because our datatypes were diverse enough, they were all represented as &lt;code&gt;object&lt;/code&gt; types, and ended up retaining their type, with no functionality issues for us. But this is not always the case!&lt;/p&gt;

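&lt;p&gt;If your columns have different numerical types, for example, the values in each row &lt;code&gt;Series&lt;/code&gt; will be converted to a single type that can represent all of them. This means the data returned by &lt;code&gt;iterrows&lt;/code&gt; can differ slightly from the data returned by &lt;code&gt;itertuples&lt;/code&gt;, so watch out. In the example below, the integer &lt;code&gt;1&lt;/code&gt; comes back from &lt;code&gt;iterrows&lt;/code&gt; as the float &lt;code&gt;1.0&lt;/code&gt;.&lt;br&gt;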
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dfmixed = pd.DataFrame({'integer_column': [1,2,3], 'float_column': [1.1, 2.2, 3.3]})
dfmixed.dtypes

integer_column int64
float_column float64
dtype: object

next(dfmixed.itertuples())

Pandas(Index=0, integer_column=1, float_column=1.1)

next(dfmixed.iterrows())

(0,
 integer_column 1.0
 float_column 1.1
 Name: 0, dtype: float64)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Column names
&lt;/h3&gt;

&lt;p&gt;One other word of warning. If your &lt;code&gt;DataFrame&lt;/code&gt; has columns that cannot be represented as Python variable names, you will not be able to access them using dot syntax on the &lt;code&gt;namedtuple&lt;/code&gt; returned by &lt;code&gt;itertuples&lt;/code&gt;. So if you have a column named &lt;code&gt;2b&lt;/code&gt; or &lt;code&gt;My Column&lt;/code&gt;, you’ll have to access it using a positional name (i.e. the first column will be called &lt;code&gt;_1&lt;/code&gt;). For &lt;code&gt;iterrows&lt;/code&gt;, the row will be a &lt;code&gt;Series&lt;/code&gt;, so you’ll have to access the columns using &lt;code&gt;["2b"]&lt;/code&gt; or &lt;code&gt;["My Column"]&lt;/code&gt;.&lt;/p&gt;
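&lt;p&gt;Here is a quick sketch of that renaming, using a throwaway frame I’m calling &lt;code&gt;odd&lt;/code&gt; whose column names are not valid Python identifiers.&lt;/p&gt;

```python
import pandas as pd

# hypothetical frame with awkward column names
odd = pd.DataFrame({"2b": [1, 2], "My Column": ["x", "y"]})

row = next(odd.itertuples())
print(row._fields)    # invalid names are replaced by positional ones
first_value = row._1  # the "2b" column; position 0 is the Index

index, series = next(odd.iterrows())
same_value = series["2b"]          # bracket access works with any name
other_value = series["My Column"]
```

&lt;p&gt;If the positional names get confusing, renaming the columns before iterating is usually the simpler fix.&lt;/p&gt;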

&lt;h2&gt;
  
  
  Other choices
&lt;/h2&gt;

&lt;p&gt;There are other options for iteration, of course. For example, you could increment an integer offset and use the &lt;code&gt;iloc&lt;/code&gt; indexer on the &lt;code&gt;DataFrame&lt;/code&gt; to select any row. Of course, this is really no different from any other iteration, while also being non-idiomatic so others reading your code will probably find it hard to read and understand. I built a naive version of this in the performance comparison code for the summary below, if you want to see it (the performance was horrible).&lt;/p&gt;
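&lt;p&gt;For reference, an offset-based loop looks roughly like the sketch below. This is not the code from my performance comparison, just a tiny made-up example with a hypothetical &lt;code&gt;frame&lt;/code&gt; to show the shape of the approach.&lt;/p&gt;

```python
import pandas as pd

# hypothetical single-column frame
frame = pd.DataFrame({"balance": [10, -5, 700]})

flags = []
for offset in range(len(frame)):
    row = frame.iloc[offset]  # builds a new Series on every pass
    flags.append("ok" if row["balance"] >= 0 else "F")

print(flags)
```

&lt;p&gt;Each &lt;code&gt;iloc&lt;/code&gt; lookup constructs a &lt;code&gt;Series&lt;/code&gt;, much like &lt;code&gt;iterrows&lt;/code&gt; does, and the loop carries extra bookkeeping on top of that.&lt;/p&gt;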

&lt;h2&gt;
  
  
  Choosing well
&lt;/h2&gt;

&lt;p&gt;Choosing the right solution depends on essentially two factors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How big is your data set?&lt;/li&gt;
&lt;li&gt;What can you write (and maintain) easily?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the image below, you can see the running time for the solutions we’ve considered (&lt;a href="https://github.com/wrighter/python_blogposts/blob/main/pandas/iter.py"&gt;the code to generate this is here&lt;/a&gt;). As you can see, only the vectorized solution holds up well with larger data. If your data set is huge, vectorized solutions may be your only reasonable choice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---Hizxr_w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/www.wrighters.io/wp-content/uploads/2021/05/iter_speeds.png%3Fresize%3D656%252C304%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---Hizxr_w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/www.wrighters.io/wp-content/uploads/2021/05/iter_speeds.png%3Fresize%3D656%252C304%26ssl%3D1" alt="Comparative runtimes for various methods on our DataFrame." width="656" height="304"&gt;&lt;/a&gt;Comparative runtimes for various methods on our DataFrame.&lt;/p&gt;

&lt;p&gt;However, depending on how many times you need to execute your code, how long it takes you to write it correctly, and how well you can maintain it going forward, you may choose any of the other solutions and be fine. In fact, the running time of all of these solutions grows linearly with the size of the data.&lt;/p&gt;

&lt;p&gt;Maybe one way to think about this is not just big-O notation, but “big-U” notation. In other words, how long will it take YOU to write a correct solution? If a faster solution would take you longer to write than the slower code takes to run, an iterative solution may be totally fine. However, if you’re writing production code, take the time to learn how to vectorize.&lt;/p&gt;

&lt;p&gt;One other point: sometimes writing the iterative solution on a smaller set is easy, and you may want to do that first, then write the vectorized version. Verify your results against the iterative solution to make sure you did it correctly, then use the vectorized version on the larger full data set.&lt;/p&gt;
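&lt;p&gt;That workflow might look something like the sketch below. The two &lt;code&gt;overdrawn_*&lt;/code&gt; functions and the data are invented for illustration; the point is only the pattern of checking a vectorized version against a slow reference on a small sample before running it on everything.&lt;/p&gt;

```python
import numpy as np
import pandas as pd


def overdrawn_iterative(df):
    """Slow but obviously correct reference: one comparison per row."""
    return [0 > balance for balance in df["balance"].tolist()]


def overdrawn_vectorized(df):
    """Vectorized version of the same rule."""
    return df["balance"].lt(0)


# hypothetical data: balances from -50 up to 49
full = pd.DataFrame({"balance": np.arange(-50, 50)})
sample = full.sample(n=20, random_state=0)

# verify on the small sample first
expected = overdrawn_iterative(sample)
actual = overdrawn_vectorized(sample).tolist()
assert actual == expected

# then run the vectorized version on the full data
result = overdrawn_vectorized(full)
print(int(result.sum()))
```

&lt;p&gt;Keeping the iterative version around as a test is cheap, and it documents what the vectorized code is supposed to do.&lt;/p&gt;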

&lt;p&gt;I hope you’ve found this dive into &lt;code&gt;DataFrame&lt;/code&gt; iteration interesting. I know I learned a few useful things along the way.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.wrighters.io/how-to-iterate-over-dataframe-rows-and-should-you/"&gt;How to iterate over DataFrame rows (and should you?)&lt;/a&gt; appeared first on &lt;a href="https://www.wrighters.io"&gt;wrighters.io&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>pandas</category>
      <category>python</category>
    </item>
    <item>
      <title>4 ways to run Jupyter notebooks</title>
      <dc:creator>wrighter</dc:creator>
      <pubDate>Mon, 10 May 2021 17:13:38 +0000</pubDate>
      <link>https://forem.com/wrighter/4-ways-to-run-jupyter-notebooks-16d6</link>
      <guid>https://forem.com/wrighter/4-ways-to-run-jupyter-notebooks-16d6</guid>
      <description>&lt;p&gt;Jupyter notebooks are an increasingly popular way to write, execute, document, and share code and communicate the results, especially in the Python ecosystem. This article will cover four ways to run Jupyter notebooks. It will also talk about some of the advantages and disadvantages of each. The notebook ecosystem is expanding and there are a lot of options, so let’s dig in.&lt;/p&gt;

&lt;h2&gt;
  
  
  First, what is a notebook?
&lt;/h2&gt;

&lt;p&gt;Before we look at the options, let’s review what a Jupyter notebook is. A notebook is a combination of code, documentation, and output. It’s essentially a captured interactive session with an interpreter. It contains cells that hold code or descriptive text, along with the output of executing the code. Since a cell can be executed multiple times in an interactive session, the notebook will contain the most recent execution and results. A notebook file is usually created interactively by the author, using a web application for authoring the notebook document. As you can see in the architecture diagram below, the notebook server can communicate with multiple kernels. The kernel is the process where the notebook’s code runs, and each kernel is independent of the others.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5Whdjsf0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/www.wrighters.io/wp-content/uploads/2021/05/image.png%3Fresize%3D656%252C522%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5Whdjsf0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/www.wrighters.io/wp-content/uploads/2021/05/image.png%3Fresize%3D656%252C522%26ssl%3D1" alt=""&gt;&lt;/a&gt;The Jupyter Notebook Architecture&lt;/p&gt;

&lt;p&gt;Jupyter supports other languages, but for now, we’ll assume we’re talking about Python. However, nothing in this article requires that Python be the language of choice for the kernel.&lt;/p&gt;

&lt;p&gt;A user interacts with a notebook server (usually, but not always via a web browser as you’ll soon see) to edit cells in the notebook. The cells can contain code or documentation, like markdown. The server ensures that all user edits and actions are executed in the kernel. When a cell is executed, the output from the kernel is captured. The notebook server persists the output in a file, ending in .ipynb. The file format is JSON. You can open it in a text editor and save it via version control (although it’s not very clean and can be messy and hard to diff, especially for large outputs like images or graphs). You can also send it to others to open or use.&lt;/p&gt;

&lt;h1&gt;
  
  
  How do you view a notebook?
&lt;/h1&gt;

&lt;p&gt;First, let’s separate the concept of viewing a notebook from actually executing it. Since a notebook file contains all the data from an interpreter session, it can be rendered into a human readable format to show that data, without re-executing the code. So viewing a notebook is a lot easier than executing it, since you don’t need a kernel. You can just take the input JSON and convert it to whichever output format you want. This is a good way to share your code and output with others, and if they only want to view it, this is all they will need. Executed notebooks can be shared via a number of tools.&lt;/p&gt;
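&lt;p&gt;To show that no kernel is involved, here is a minimal sketch that renders a notebook as plain text using only the standard library. The notebook content is hand-written in the version 4 layout of the notebook format, and &lt;code&gt;render_text&lt;/code&gt; is a hypothetical helper that only handles stream outputs; real tools handle many more cases.&lt;/p&gt;

```python
import json

# A hand-written, minimal notebook in the nbformat 4 layout (hypothetical content)
notebook_json = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
        {"cell_type": "code", "execution_count": 1, "metadata": {},
         "source": ["print(1 + 1)"],
         "outputs": [{"output_type": "stream", "name": "stdout",
                      "text": ["2\n"]}]},
    ],
})


def render_text(raw):
    """Turn stored cells and outputs into plain text without running anything."""
    lines = []
    for cell in json.loads(raw)["cells"]:
        lines.append("".join(cell["source"]))
        # markdown cells have no outputs key, so default to an empty list
        for output in cell.get("outputs", []):
            if output["output_type"] == "stream":
                lines.append("".join(output["text"]).rstrip())
    return "\n".join(lines)


rendered = render_text(notebook_json)
print(rendered)
```

&lt;p&gt;The output shown for the code cell is whatever was saved in the file, not the result of running the cell again.&lt;/p&gt;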

&lt;h2&gt;
  
  
  nbconvert
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://nbconvert.readthedocs.io/en/latest/index.html"&gt;&lt;code&gt;nbconvert&lt;/code&gt;&lt;/a&gt; tool will convert a notebook into various output formats. Depending on which software packages are installed in the environment, notebooks can be rendered in HTML, PDF, LaTeX, and other formats. It can also execute a notebook from the command line, without a server running, but it isn’t intended for interactive use. The resulting converted notebooks can be sent to others for viewing using whichever tool they prefer, like a web browser or PDF viewer.&lt;/p&gt;

&lt;h2&gt;
  
  
  nbviewer
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://nbviewer.jupyter.org/"&gt;nbviewer&lt;/a&gt; web site is another option for sharing notebooks. Think of it as a web based &lt;code&gt;nbconvert&lt;/code&gt; tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other services (like GitHub)
&lt;/h2&gt;

&lt;p&gt;A number of services support rendering notebooks as web pages. For example, GitHub will render a notebook for you if the .ipynb file is part of a repository that you are browsing. I put many of my articles in &lt;a href="https://github.com/wrighter/python_blogposts"&gt;GitHub&lt;/a&gt;, and some of them render right in the browser.&lt;/p&gt;

&lt;h1&gt;
  
  
  How do you run or execute a Jupyter notebook?
&lt;/h1&gt;

&lt;p&gt;OK, enough about viewing notebooks. If we want to create new notebooks or execute existing ones, what are our options? To work with a notebook, you need a notebook server running. The notebook server will launch the necessary kernel, provide you with a user interface via your web browser (or other authoring tool), and send data back and forth to the kernel for execution.&lt;/p&gt;

&lt;p&gt;Let’s look at four different options for executing notebooks.&lt;/p&gt;

&lt;h1&gt;
  
  
  Standard Jupyter servers
&lt;/h1&gt;

&lt;p&gt;Your first option is to run one of the standard Jupyter notebook servers. You can do this by installing the server in your Python environment, and then running the server and connecting to it via a browser.&lt;/p&gt;

&lt;h2&gt;
  
  
  Jupyter notebook
&lt;/h2&gt;

&lt;p&gt;The standard &lt;a href="https://jupyter-notebook.readthedocs.io/en/stable/"&gt;Jupyter notebook&lt;/a&gt; is a reliable and simple way to execute notebooks, and is what I tend to use most of the time. You can install it with &lt;code&gt;pip&lt;/code&gt; or, if you use Anaconda, with &lt;code&gt;conda&lt;/code&gt;. I’d recommend using something like &lt;a href="https://www.wrighters.io/you-can-easily-and-sensibly-run-multiple-versions-of-python-with-pyenv/"&gt;pyenv&lt;/a&gt; and &lt;a href="https://www.wrighters.io/use-pyenv-and-virtual-environments-to-manage-python-complexity/"&gt;a virtual environment&lt;/a&gt; to set up and run a newer version of Python if you don’t choose &lt;code&gt;conda&lt;/code&gt;. The Jupyter project recommends using Anaconda &lt;a href="https://jupyter.readthedocs.io/en/latest/install/notebook-classic.html"&gt;in their docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Note that the Jupyter notebook is fairly configurable, so you can check out &lt;a href="https://github.com/ipython-contrib/jupyter_contrib_nbextensions"&gt;the extensions&lt;/a&gt; once you’re comfortable with the basic setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  JupyterLab
&lt;/h2&gt;

&lt;p&gt;A second option from the Jupyter project is &lt;a href="https://jupyterlab.readthedocs.io/en/latest/"&gt;JupyterLab&lt;/a&gt;, the next generation notebook server. It provides a more sophisticated front end and may be a lot easier for beginning users to understand. It also supports extensions.&lt;/p&gt;

&lt;p&gt;Both Jupyter notebook and JupyterLab are supported as part of &lt;a href="https://jupyterhub.readthedocs.io/en/stable/"&gt;JupyterHub&lt;/a&gt;, a way to serve up Jupyter notebooks for multiple users. You might consider this if you are planning on having multiple users in a class or workgroup run notebooks at the same time, and you don’t want users to have to run their own Jupyter notebook or JupyterLab instance.&lt;/p&gt;

&lt;h1&gt;
  
  
  IDE integration
&lt;/h1&gt;

&lt;p&gt;A second way to execute notebooks is via your Integrated Development Environment (IDE). Many IDEs support Jupyter notebooks, sometimes via a plugin. For example, &lt;a href="https://www.jetbrains.com/help/pycharm/jupyter-notebook-support.html"&gt;PyCharm&lt;/a&gt; supports notebooks in the professional version. If you use Microsoft Visual Studio Code, &lt;a href="https://code.visualstudio.com/docs/python/jupyter-support"&gt;Jupyter support&lt;/a&gt; is also available. For other IDEs, check for Jupyter support. If your IDE lacks support, you might be very interested in the next option.&lt;/p&gt;

&lt;h1&gt;
  
  
  Hosted services
&lt;/h1&gt;

&lt;p&gt;A third popular way to execute notebooks is via hosted services. With a hosted service, you don’t have to maintain a server. You can access your notebook from anywhere. Sharing code with others can be easier, especially with some of the services offering collaborative editing of the same notebook file. Some of these are free or offer a free version. Some support advanced features like enhanced visualizations, easier environment setup, GPU support, and other IDE-like functionality. With these environments, you can create a notebook from scratch or upload an existing .ipynb file, so you can take work from one environment (or your own setup) and move it to the service. If you are using source code control (I hope you are), then you can easily add your notebooks by cloning your repository.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://deepnote.com/"&gt;Deepnote&lt;/a&gt; – a data science notebook with a free version. Supports collaboration with other users and a number of advanced integrations.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cocalc.com/"&gt;CoCalc&lt;/a&gt; – a service that targets classroom settings, supports a wide variety of languages and environments.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://replit.com/"&gt;Replit&lt;/a&gt; – online IDE with collaborative tools, supports over 50 languages, free version available.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://datalore.jetbrains.com/"&gt;Datalore (from JetBrains)&lt;/a&gt; – a Jupyter notebook implementation with PyCharm functionality, free version available.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://colab.research.google.com/"&gt;Google Colab&lt;/a&gt; – free Jupyter notebooks from Google, Pro version available.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This appears to be a competitive space with new options appearing all the time.&lt;/p&gt;

&lt;h1&gt;
  
  
  The command line
&lt;/h1&gt;

&lt;p&gt;Last but not least, you may be a command line nerd wondering if you have to use a browser or fancy IDE. It turns out you also have an option. The &lt;a href="https://github.com/davidbrochart/nbterm"&gt;nbterm&lt;/a&gt; project allows you to interactively run Jupyter notebooks from the command line.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;As you can see, there are a number of ways to execute Jupyter notebooks. Depending on your needs, you should be able to find a solution that works well for you. I’d encourage you to try a couple out and see if they help you be more productive.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.wrighters.io/4-ways-to-run-jupyter-notebooks/"&gt;4 ways to run Jupyter notebooks&lt;/a&gt; appeared first on &lt;a href="https://www.wrighters.io"&gt;wrighters.io&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
    </item>
    <item>
      <title>How to use ipywidgets to make your Jupyter notebook interactive</title>
      <dc:creator>wrighter</dc:creator>
      <pubDate>Mon, 03 May 2021 00:32:14 +0000</pubDate>
      <link>https://forem.com/wrighter/how-to-use-ipywidgets-to-make-your-jupyter-notebook-interactive-5dem</link>
      <guid>https://forem.com/wrighter/how-to-use-ipywidgets-to-make-your-jupyter-notebook-interactive-5dem</guid>
      <description>&lt;p&gt;Have you ever created a Python-based Jupyter notebook and analyzed data that you want to explore in a number of different ways? For example, you may want to look at a plot of data, but filter it ten different ways. What are your options to view these ten different results?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Copy and paste a cell, changing the filter for each cell, then executing the cell. You will end up with ten different cells with ten different values.&lt;/li&gt;
&lt;li&gt;Modify the same cell, execute it and view the results, then modify it again, ten times.&lt;/li&gt;
&lt;li&gt;Parameterize the notebook (perhaps using a tool like &lt;a href="https://papermill.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;Papermill&lt;/a&gt;) and execute the entire notebook with ten different sets of parameters.&lt;/li&gt;
&lt;li&gt;Some combination of the above.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of these is ideal if we want quick interaction and the ability to explore the data. They are also prone to typing errors and involve lots of extra editing work. They may work well for the original developer of a notebook, but asking a user who doesn’t understand Python syntax to modify variables and re-execute cells may not be the best option. What if you could just give the user a simple form, with a button, and they could modify the form and see the results they want?&lt;/p&gt;

&lt;p&gt;It turns out you can do this pretty easily right in Jupyter, without creating a full webapp. This is possible with &lt;code&gt;ipywidgets&lt;/code&gt;, also known just as widgets. I’ll show you the basics in this article of building a few simple forms to view and analyze some data.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are widgets?
&lt;/h2&gt;

&lt;p&gt;Jupyter widgets are special bits of code that embed JavaScript and HTML in your notebook and present a visual representation in your browser when executed in a notebook. These components allow a user to interact with the widgets. The widgets can execute code on certain actions, allowing you to update cells without a user having to re-execute them or even modify any code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;First, you need to make sure that &lt;code&gt;ipywidgets&lt;/code&gt; is installed in your environment. This will depend a bit on which Jupyter environment you are using. For older Jupyter and JupyterLab installs, make sure to check the details in &lt;a href="https://ipywidgets.readthedocs.io/en/latest/user_install.html" rel="noopener noreferrer"&gt;the docs&lt;/a&gt;. But for a basic install, just use pip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install ipywidgets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or for conda&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda install -c conda-forge ipywidgets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should be all that you need to do in most situations to get things running. &lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;Instead of going through all the widgets and getting into details right away, let’s grab some interesting data and explore it manually. Then we’ll use widgets to make a more interactive version of some of this data exploration. Let’s grab some data from the &lt;a href="https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses-Current-Active/uupf-x98q" rel="noopener noreferrer"&gt;Chicago Data Portal&lt;/a&gt; – specifically their dataset of current active business licenses. Note that if you just run the code as below, you’ll only get 1000 rows of data. Check the documentation on how to grab all the data.&lt;/p&gt;

&lt;p&gt;Note: all of this code was written in a Jupyter notebook using Python 3.8.6. While this article shows the output, the best way to experience widgets is to interact with them in your own environment. You can download a notebook of this article &lt;a href="https://github.com/wrighter/python_blogposts/blob/main/tools/jupyter_ipywidgets.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
df = pd.read_csv('https://data.cityofchicago.org/resource/uupf-x98q.csv')
df[['LEGAL NAME', 'ZIP CODE', 'BUSINESS ACTIVITY']].head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi1.wp.com%2Fwww.wrighters.io%2Fwp-content%2Fuploads%2F2021%2F05%2FScreen-Shot-2021-05-02-at-6.36.39-PM.png%3Fresize%3D656%252C185%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi1.wp.com%2Fwww.wrighters.io%2Fwp-content%2Fuploads%2F2021%2F05%2FScreen-Shot-2021-05-02-at-6.36.39-PM.png%3Fresize%3D656%252C185%26ssl%3D1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see from the data, the business activity is pretty verbose, but the zip code is an easy way to do some simple searches and filters of data. For our smaller data set, let’s just grab the zip codes that have more than 20 businesses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;zips = df.groupby('ZIP CODE').count()['ID'].sort_values(ascending=False)
zips = list(zips[zips &amp;gt; 20].index)
zips

[60618, 60622, 60639, 60609, 60614, 60608, 60619, 60607]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, a reasonable scenario for filtering data might be to create a report filtered by zip code, showing the legal name and address of a business, ordered by expiration date of the license. This would be a pretty simple (even if somewhat messy) expression in pandas. For example, in this data set we can take the top zip code and look at a few columns like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.loc[df['ZIP CODE'] == zips[0]].sort_values(by='LICENSE TERM EXPIRATION DATE', ascending=False)[['LEGAL NAME', 'ADDRESS', 'LICENSE TERM EXPIRATION DATE']]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi1.wp.com%2Fwww.wrighters.io%2Fwp-content%2Fuploads%2F2021%2F05%2FScreen-Shot-2021-05-02-at-6.39.27-PM.png%3Fresize%3D656%252C321%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi1.wp.com%2Fwww.wrighters.io%2Fwp-content%2Fuploads%2F2021%2F05%2FScreen-Shot-2021-05-02-at-6.39.27-PM.png%3Fresize%3D656%252C321%26ssl%3D1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now what if someone wanted to be able to run this report for different zip codes, looking at different columns, and sorting by other columns? The user would have to be comfortable editing the cell above, rerunning it, and maybe executing other cells to look for the column names and other values.&lt;/p&gt;
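
&lt;p&gt;One way to see which parts of that expression a user would want to change is to wrap it in a function. The sketch below is only an illustration: the function name &lt;code&gt;license_report&lt;/code&gt; and the three-row sample frame are made up, and only the column names are borrowed from the data set above.&lt;/p&gt;

```python
import pandas as pd

# Small made-up frame that mimics a few columns of the licenses data set.
sample = pd.DataFrame({
    'LEGAL NAME': ['Acme Bakery', 'Birch Cafe', 'Cedar Books'],
    'ZIP CODE': [60618, 60618, 60622],
    'LICENSE TERM EXPIRATION DATE': ['2022-01-15', '2021-11-15', '2022-03-15'],
})

def license_report(frame, zipcode, sort_column, ascending, view_columns):
    """Filter by zip code, sort, and keep only the requested columns."""
    matches = frame.loc[frame['ZIP CODE'] == zipcode]
    return matches.sort_values(by=sort_column, ascending=ascending)[list(view_columns)]

# Each argument is something a user might want to change between runs.
report = license_report(
    sample,
    zipcode=60618,
    sort_column='LICENSE TERM EXPIRATION DATE',
    ascending=False,
    view_columns=['LEGAL NAME', 'LICENSE TERM EXPIRATION DATE'],
)
print(report)
```

&lt;p&gt;The four arguments after the frame are exactly the values a form would need to collect.&lt;/p&gt;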

&lt;h2&gt;
  
  
  Using widgets
&lt;/h2&gt;

&lt;p&gt;Instead, we can use widgets to make a form that allows this interaction to be executed visually. In this article you will learn enough about widgets to build a form and dynamically show the results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Widget types
&lt;/h3&gt;

&lt;p&gt;Since most of us are familiar with forms in our web browsers, it makes sense to think about widgets as parts of typical forms. Widgets can represent numerical, boolean, or text values. They can be selectors of pre-existing lists, or can accept free text (or password text). You can also use them to display formatted output or images. The &lt;a href="https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20List.html" rel="noopener noreferrer"&gt;full list of widgets&lt;/a&gt; describes them in more detail. You can also create your own custom widgets, but for our purposes, we will be able to do all the work with standard widgets.&lt;/p&gt;

&lt;p&gt;A widget is just an object that can be displayed in a Jupyter notebook once created. It will render itself (and its underlying content) and (possibly) allow user interaction.&lt;/p&gt;
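
&lt;p&gt;As a rough sketch of what “just an object” means, the snippet below creates one numerical, one boolean, and one text widget and reads each one’s &lt;code&gt;value&lt;/code&gt; attribute. The variable names and starting values are arbitrary choices for illustration.&lt;/p&gt;

```python
import ipywidgets as widgets

# One widget per kind of value; nothing is rendered until a notebook displays them.
count_slider = widgets.IntSlider(value=5, min=0, max=10, description='Count:')
flag_checkbox = widgets.Checkbox(value=False, description='Enabled?')
name_text = widgets.Text(value='hello', description='Name:')

# Each widget holds its current state in a plain Python attribute.
current_values = (count_slider.value, flag_checkbox.value, name_text.value)
print(current_values)
```

&lt;p&gt;In a notebook, evaluating any of these objects as the last expression in a cell renders it, and user interaction updates the same &lt;code&gt;value&lt;/code&gt; attribute.&lt;/p&gt;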

&lt;h3&gt;
  
  
  Making a form
&lt;/h3&gt;

&lt;p&gt;For our form, we will need to gather four pieces of information:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The zip code to filter&lt;/li&gt;
&lt;li&gt;The column to sort on&lt;/li&gt;
&lt;li&gt;Whether the sort is ascending or descending&lt;/li&gt;
&lt;li&gt;The columns to display.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These four pieces of information will be captured by the following form elements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A selection dropdown&lt;/li&gt;
&lt;li&gt;A selection dropdown&lt;/li&gt;
&lt;li&gt;A checkbox&lt;/li&gt;
&lt;li&gt;A multi-selection list&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These three widget types (two dropdowns, a checkbox, and a multi-selection list) will provide a quick intro to widgets; once you know how to instantiate and use one widget, the others are quite similar. Before we can create a widget, we need to import the library. Let’s look at dropdowns first.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import ipywidgets as widgets

widgets.Dropdown(
    options=zips,
    value=zips[0],
    description='Zip Code:',
    disabled=False,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Fwww.wrighters.io%2Fwp-content%2Fuploads%2F2021%2F05%2FScreen-Shot-2021-05-02-at-6.40.35-PM.png%3Fresize%3D656%252C74%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Fwww.wrighters.io%2Fwp-content%2Fuploads%2F2021%2F05%2FScreen-Shot-2021-05-02-at-6.40.35-PM.png%3Fresize%3D656%252C74%26ssl%3D1"&gt;&lt;/a&gt;Of course, just creating an object doesn’t allow us to use it, so we need to assign it to a variable, and the &lt;code&gt;display&lt;/code&gt; function can be used to render it, the same as we see above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;zips_dropdown = widgets.Dropdown(
    options=zips,
    value=zips[0],
    description='Zip Code:',
    disabled=False,
)

display(zips_dropdown)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
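
&lt;p&gt;Keeping a reference to the widget is what lets later code read or react to the selection. The following is a minimal sketch, assuming a hypothetical two-item zip list rather than the real data; it reads the current &lt;code&gt;value&lt;/code&gt; and registers a callback with &lt;code&gt;observe&lt;/code&gt; that runs whenever the value changes.&lt;/p&gt;

```python
import ipywidgets as widgets

# Hypothetical stand-in for the zips list computed from the data set.
example_zips = [60618, 60622]

example_dropdown = widgets.Dropdown(
    options=example_zips,
    value=example_zips[0],
    description='Zip Code:',
)

seen_changes = []

def record_change(change):
    # change is a dict-like object describing the old and new values.
    seen_changes.append((change['old'], change['new']))

# Only notify us about changes to the value trait.
example_dropdown.observe(record_change, names='value')

# Setting the value in code fires the same callback as a user selection.
example_dropdown.value = 60622
print(example_dropdown.value, seen_changes)
```

&lt;p&gt;In a notebook, picking a different entry in the rendered dropdown triggers &lt;code&gt;record_change&lt;/code&gt; in the same way.&lt;/p&gt;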



&lt;p&gt;We can easily do the same for the columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;columns_dropdown = widgets.Dropdown(
    options=df.columns,
    value=df.columns[4],
    description='Sort Column:',
    disabled=False,
)

display(columns_dropdown)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Fwww.wrighters.io%2Fwp-content%2Fuploads%2F2021%2F05%2FScreen-Shot-2021-05-02-at-6.43.18-PM.png%3Fresize%3D656%252C74%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Fwww.wrighters.io%2Fwp-content%2Fuploads%2F2021%2F05%2FScreen-Shot-2021-05-02-at-6.43.18-PM.png%3Fresize%3D656%252C74%26ssl%3D1"&gt;&lt;/a&gt;For boolean values, you have a few options. You can use a &lt;code&gt;Checkbox&lt;/code&gt; or a &lt;code&gt;ToggleButton&lt;/code&gt;. I’ll go with the first.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sort_checkbox = widgets.Checkbox(
    value=False,
    description='Ascending?',
    disabled=False)
display(sort_checkbox)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi2.wp.com%2Fwww.wrighters.io%2Fwp-content%2Fuploads%2F2021%2F05%2FScreen-Shot-2021-05-02-at-6.44.18-PM.png%3Fresize%3D656%252C74%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi2.wp.com%2Fwww.wrighters.io%2Fwp-content%2Fuploads%2F2021%2F05%2FScreen-Shot-2021-05-02-at-6.44.18-PM.png%3Fresize%3D656%252C74%26ssl%3D1"&gt;&lt;/a&gt;Finally, for this example, we want to be able to select all the columns we want to see in the output. We’ll use a &lt;code&gt;SelectMultiple&lt;/code&gt; for that. Note that you can use the shift and ctrl (or Command on a Mac) keys to select multiple options.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;columns_selectmultiple = widgets.SelectMultiple(
    options=df.columns,
    value=['LEGAL NAME'],
    rows=10,
    description='Visible:',
    disabled=False
)
display(columns_selectmultiple)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi1.wp.com%2Fwww.wrighters.io%2Fwp-content%2Fuploads%2F2021%2F05%2FScreen-Shot-2021-05-02-at-6.45.39-PM.png%3Fresize%3D656%252C362%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi1.wp.com%2Fwww.wrighters.io%2Fwp-content%2Fuploads%2F2021%2F05%2FScreen-Shot-2021-05-02-at-6.45.39-PM.png%3Fresize%3D656%252C362%26ssl%3D1"&gt;&lt;/a&gt;Last, we will show a button that we can click to force updates. (Note that we won’t end up needing this, because there’s a simpler way to interact with our elements, but buttons can be useful in many situations.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;button = widgets.Button(
    description='Run',
    disabled=False,
    button_style='', # 'success', 'info', 'warning', 'danger' or ''
    tooltip='Run report',
    icon='check' # (FontAwesome names without the `fa-` prefix)
)
display(button)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi1.wp.com%2Fwww.wrighters.io%2Fwp-content%2Fuploads%2F2021%2F05%2FScreen-Shot-2021-05-02-at-6.46.39-PM.png%3Fresize%3D396%252C96%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi1.wp.com%2Fwww.wrighters.io%2Fwp-content%2Fuploads%2F2021%2F05%2FScreen-Shot-2021-05-02-at-6.46.39-PM.png%3Fresize%3D396%252C96%26ssl%3D1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Handling output
&lt;/h2&gt;

&lt;p&gt;Before we hook our button up to a function, we need to make sure we can capture the output of our function. If we want to view a &lt;code&gt;DataFrame&lt;/code&gt;, or print text, or log some information to stdout, we need to be able to capture that information and clear it, if necessary. This is what the &lt;code&gt;Output&lt;/code&gt; widget is for. Note that you don’t have to use an output widget, but if you want your output to appear in a certain cell, you will need to use this. The cell where the &lt;code&gt;Output&lt;/code&gt; widget is displayed will render the results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;out = widgets.Output(layout={'border': '1px solid black'})
display(out)  # render the Output so that captured text has a cell to appear in
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Hooking it all up
&lt;/h2&gt;

&lt;p&gt;Now that we’ve generated all our user interface components, how do we display them all in one spot and hook them up to generate actions? &lt;/p&gt;

&lt;p&gt;First, let’s create a simple layout with all the items together.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;box = widgets.VBox([zips_dropdown, columns_dropdown, sort_checkbox, columns_selectmultiple, button])
display(box)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Fwww.wrighters.io%2Fwp-content%2Fuploads%2F2021%2F05%2FScreen-Shot-2021-05-02-at-6.48.10-PM.png%3Fresize%3D654%252C614%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Fwww.wrighters.io%2Fwp-content%2Fuploads%2F2021%2F05%2FScreen-Shot-2021-05-02-at-6.48.10-PM.png%3Fresize%3D654%252C614%26ssl%3D1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Handling events
&lt;/h2&gt;

&lt;p&gt;For widgets that can produce events, you can provide a function that will receive the event. For a &lt;code&gt;Button&lt;/code&gt;, you register the handler with &lt;code&gt;on_click&lt;/code&gt;, and the handler must take a single argument, the &lt;code&gt;Button&lt;/code&gt; itself. If we use the &lt;code&gt;Output&lt;/code&gt; we created above (as a context manager using a &lt;code&gt;with&lt;/code&gt; statement), clicking the button will cause the text “Button clicked.” to be appended to the cell output. Note that the cell that receives the output will be the one where the &lt;code&gt;Output&lt;/code&gt; was rendered.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def on_button_clicked(b):
    with out:
        print("Button clicked.")

button.on_click(on_button_clicked, False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi2.wp.com%2Fwww.wrighters.io%2Fwp-content%2Fuploads%2F2021%2F05%2FScreen-Shot-2021-05-02-at-6.49.29-PM.png%3Fresize%3D374%252C112%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi2.wp.com%2Fwww.wrighters.io%2Fwp-content%2Fuploads%2F2021%2F05%2FScreen-Shot-2021-05-02-at-6.49.29-PM.png%3Fresize%3D374%252C112%26ssl%3D1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A better way to hook things up
&lt;/h2&gt;

&lt;p&gt;The above example is simple, but doesn’t show us how we’d get the values from the other inputs. Another way to do that is to use &lt;code&gt;interact&lt;/code&gt;. It works as either a function or a function decorator and automatically creates widgets that let you interactively change the inputs to a function. Based on the type of each named argument, it generates a widget for changing that value. Using &lt;code&gt;interact&lt;/code&gt; is a quick way to provide user interaction around a function. The function will be called each time a widget is updated. In the example below, as you move the slider, the square of the number is printed if the checkbox is checked, and the number is printed unchanged otherwise.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from ipywidgets import interact

def my_function2(x, y):
    if y:
        print(x*x)
    else:
        print(x)

interact(my_function2,x=10,y=False);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi2.wp.com%2Fwww.wrighters.io%2Fwp-content%2Fuploads%2F2021%2F05%2FScreen-Shot-2021-05-02-at-6.51.26-PM.png%3Fresize%3D656%252C168%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi2.wp.com%2Fwww.wrighters.io%2Fwp-content%2Fuploads%2F2021%2F05%2FScreen-Shot-2021-05-02-at-6.51.26-PM.png%3Fresize%3D656%252C168%26ssl%3D1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that you can pass more information to &lt;code&gt;interact&lt;/code&gt; to get more appropriate user interface elements (see the docs for examples). But since we already made widgets, we can just use those instead. The best way to do that is with another function, &lt;code&gt;interactive&lt;/code&gt;. It is like &lt;code&gt;interact&lt;/code&gt;, but it gives you access to the widgets that were created (or lets you supply them directly), so you can display values when you want. Since we already made some widgets, we can just let &lt;code&gt;interactive&lt;/code&gt; know about them by providing each of them as a keyword argument. The first argument is a function, and that function’s arguments need to match the subsequent keyword arguments to &lt;code&gt;interactive&lt;/code&gt;. Each time we change one of the values in the form, the function will be invoked with the values from the form widgets. With just a few lines of code, we now have an interactive tool for looking at and filtering this data.&lt;/p&gt;

&lt;p&gt;But first, I’ll make a cell with an output to receive the display.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;report_output = widgets.Output()
display(report_output)



from ipywidgets import interactive

def filter_function(zipcode, sort_column, sort_ascending, view_columns):
    filtered = df.loc[df['ZIP CODE'] == zipcode].sort_values(by=sort_column, ascending=sort_ascending)[list(view_columns)]
    with report_output:
        report_output.clear_output()
        display(filtered)

interactive(filter_function, zipcode=zips_dropdown, sort_column=columns_dropdown,
                    sort_ascending=sort_checkbox, view_columns=columns_selectmultiple) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, the same form created earlier is rendered in the cell. The output will appear in whichever cell the &lt;code&gt;display(report_output)&lt;/code&gt; line was executed. As you modify any of the form elements, the resulting filtered &lt;code&gt;DataFrame&lt;/code&gt; will be displayed in that cell.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This has been just a quick overview of using &lt;code&gt;ipywidgets&lt;/code&gt; to make Jupyter notebooks more interactive. Even if you are comfortable editing Python code and re-executing cells to update and explore data, widgets may be a great way to make that exploration more dynamic and convenient, along with being less error prone. If you need to share notebooks with people who are not comfortable editing Python code, widgets can be a lifesaver and really help the data come alive.&lt;/p&gt;

&lt;p&gt;Just reading about these widgets is not nearly as interesting as running examples and working with them yourself. Give these examples a try and then try using widgets in your own notebooks.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.wrighters.io/use-ipywidgets-with-jupyter-notebooks/" rel="noopener noreferrer"&gt;How to use ipywidgets to make your Jupyter notebook interactive&lt;/a&gt; appeared first on &lt;a href="https://www.wrighters.io" rel="noopener noreferrer"&gt;wrighters.io&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
    </item>
    <item>
      <title>Profiling Python code with memory_profiler</title>
      <dc:creator>wrighter</dc:creator>
      <pubDate>Thu, 22 Apr 2021 01:03:46 +0000</pubDate>
      <link>https://forem.com/wrighter/profiling-python-code-with-memoryprofiler-285d</link>
      <guid>https://forem.com/wrighter/profiling-python-code-with-memoryprofiler-285d</guid>
      <description>&lt;p&gt;What do you do when your Python program is using too much memory? How do you find the spots in your code with memory allocation, especially in large chunks? It turns out that there is not usually an easy answer to these question, but a number of tools exist that can help you figure out where your code is allocating memory. In this article, I’m going to focus on one of them, &lt;a href="https://github.com/pythonprofilers/memory_profiler"&gt;&lt;code&gt;memory_profiler&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;memory_profiler&lt;/code&gt; tool is similar in spirit to (and inspired by) the &lt;a href="https://github.com/pyutils/line_profiler"&gt;&lt;code&gt;line_profiler&lt;/code&gt;&lt;/a&gt; tool, which I’ve &lt;a href="https://www.wrighters.io/profiling-python-code-with-line_profiler/"&gt;written about as well&lt;/a&gt;. Whereas &lt;code&gt;line_profiler&lt;/code&gt; tells you how much &lt;em&gt;time&lt;/em&gt; is spent on each line, &lt;code&gt;memory_profiler&lt;/code&gt; tells you how much &lt;em&gt;memory&lt;/em&gt; is allocated (or freed) by each line. This allows you to see the real impact of each line of code and get a sense of where memory is being used. While the tool is quite helpful, there are a few things to know about it to use it effectively. I’ll cover some details in this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;memory_profiler&lt;/code&gt; is written in Python and can be installed using pip. The package will include the library, as well as a few command line utilities.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install memory_profiler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It uses the &lt;a href="https://github.com/giampaolo/psutil"&gt;&lt;code&gt;psutil&lt;/code&gt;&lt;/a&gt; library (or can use tracemalloc or posix) to access process information in a cross platform way, so it works on Windows, Mac, and Linux.&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic profiling
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;memory_profiler&lt;/code&gt; is a set of tools for profiling a Python program’s memory usage, and the documentation gives a nice overview of those tools. The tool that provides the most detail is the line-by-line memory usage that the module will report when profiling a single function. You can obtain this by running the module from the command line against a Python file. It’s also available via Jupyter/IPython magics, or in your own code. I’ll cover all those options in this article. &lt;/p&gt;

&lt;p&gt;I’ve extended the example code from the documentation to show several ways that you might see memory grow and be reclaimed in Python code, and what the line-by-line output looks like on my computer. Using the sample code below, saved in a source file (&lt;code&gt;performance_memory_profiler.py&lt;/code&gt;), you can follow along by running the profile yourself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from functools import lru_cache

from memory_profiler import profile

import pandas as pd
import numpy as np

@profile
def simple_function():
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 7)
    del b
    return a

@profile
def simple_function2():
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 8)
    del b
    return a

@lru_cache
def caching_function(size):
    return np.ones(size)

@profile
def test_caching_function():
    for i in range(10_000):
        caching_function(i)

    for i in range(10_000,0,-1):
        caching_function(i)

if __name__ == '__main__':
    simple_function()
    simple_function()
    simple_function2()
    test_caching_function()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running &lt;code&gt;memory_profiler&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;To provide line-by-line results, &lt;code&gt;memory_profiler&lt;/code&gt; requires that a method be decorated with the &lt;code&gt;@profile&lt;/code&gt; decorator. Just add this to the methods you want to profile, as I have done with three methods above. Then you’ll need a way to actually execute those methods, such as a command line script. Running a unit test can work as well, as long as you can run it from the command line. You do this by running the &lt;code&gt;memory_profiler&lt;/code&gt; module and supplying the Python script that drives your code. You can give it a &lt;code&gt;-h&lt;/code&gt; to see the help:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python -m memory_profiler -h
usage: python -m memory_profiler script_file.py

positional arguments:
  program python script or module followed by command line arguements to run

optional arguments:
  -h, --help show this help message and exit
  --version show program's version number and exit
  --pdb-mmem MAXMEM step into the debugger when memory exceeds MAXMEM
  --precision PRECISION
                        precision of memory output in number of significant digits
  -o OUT_FILENAME path to a file where results will be written
  --timestamp print timestamp instead of memory measurement for decorated functions
  --include-children also include memory used by child processes
  --backend {tracemalloc,psutil,posix}
                        backend using for getting memory info (one of the {tracemalloc, psutil, posix})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To view the results from the sample program, just run it with the defaults. Since we marked three of the functions with the &lt;code&gt;@profile&lt;/code&gt; decorator, every invocation of those functions is printed (four reports here, because &lt;code&gt;simple_function&lt;/code&gt; is called twice). Be careful when profiling a method or function that is invoked many times, since it will print a result for each invocation. Below are the results from my computer, which I’ll explain in the next section. For each function, we get the source line number on the left, the actual Python source code on the right, and three metrics for each line: the memory usage of the entire process when that line of code was executed, how much of an increment (positive numbers) or decrement (negative numbers) of memory occurred for that line, and how many times that line was executed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python -m memory_profiler performance_memory_profiler.py
Filename: performance_memory_profiler.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     8     67.2 MiB     67.2 MiB           1   @profile
     9                                         def simple_function():
    10     74.8 MiB      7.6 MiB           1       a = [1] * (10 ** 6)
    11    227.4 MiB    152.6 MiB           1       b = [2] * (2 * 10 ** 7)
    12    227.4 MiB      0.0 MiB           1       del b
    13    227.4 MiB      0.0 MiB           1       return a

Filename: performance_memory_profiler.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     8    227.5 MiB    227.5 MiB           1   @profile
     9                                         def simple_function():
    10    235.1 MiB      7.6 MiB           1       a = [1] * (10 ** 6)
    11    235.1 MiB      0.0 MiB           1       b = [2] * (2 * 10 ** 7)
    12    235.1 MiB      0.0 MiB           1       del b
    13    235.1 MiB      0.0 MiB           1       return a

Filename: performance_memory_profiler.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    15    235.1 MiB    235.1 MiB           1   @profile
    16                                         def simple_function2():
    17    235.1 MiB      0.0 MiB           1       a = [1] * (10 ** 6)
    18   1761.0 MiB   1525.9 MiB           1       b = [2] * (2 * 10 ** 8)
    19    235.1 MiB  -1525.9 MiB           1       del b
    20    235.1 MiB      0.0 MiB           1       return a

Filename: performance_memory_profiler.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    27    235.1 MiB    235.1 MiB           1   @profile
    28                                         def test_caching_function():
    29    275.6 MiB      0.0 MiB       10001       for i in range(10_000):
    30    275.6 MiB     40.5 MiB       10000           caching_function(i)
    31
    32    280.6 MiB      0.0 MiB       10001       for i in range(10_000,0,-1):
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Interpreting the results
&lt;/h2&gt;

&lt;p&gt;If you check the official docs, you’ll see slightly different results in their example output than mine when I executed &lt;code&gt;simple_function&lt;/code&gt;. For instance, in my first two invocations of the function, the &lt;code&gt;del&lt;/code&gt; seems to have no effect, whereas their example shows memory being freed. This is because Python manages memory for you, so &lt;code&gt;del&lt;/code&gt; is not the same as freeing memory in a language like &lt;code&gt;C&lt;/code&gt; or &lt;code&gt;C++&lt;/code&gt;: it only removes a reference, and even after the object itself is gone, the interpreter may keep the memory around for reuse instead of returning it to the operating system. You can see that the memory spiked on the first invocation of the method, but on the second invocation no new memory was needed to create &lt;code&gt;b&lt;/code&gt; a second time. To clarify this point, I added another method, &lt;code&gt;simple_function2&lt;/code&gt;, that creates a bigger list, and this time we see the memory being released after the &lt;code&gt;del&lt;/code&gt;. This is just one example of how profiling code may require multiple runs with varied input data to get realistic results for your code. Also consider the hardware used; production issues may not match a development workstation. Crafting a good test program can take just as much time as interpreting the results and deciding how to improve things.&lt;/p&gt;
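
&lt;p&gt;One way to separate what Python itself has allocated from what the process is holding is the standard library’s &lt;code&gt;tracemalloc&lt;/code&gt; module. The sketch below is not part of &lt;code&gt;memory_profiler&lt;/code&gt;; the helper name &lt;code&gt;traced_bytes&lt;/code&gt; and the list size are made up for illustration, and the numbers assume CPython on a 64-bit build.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import tracemalloc

def traced_bytes():
    """Return the bytes currently allocated by Python, as seen by tracemalloc."""
    current, _peak = tracemalloc.get_traced_memory()
    return current

tracemalloc.start()
baseline = traced_bytes()

b = [2] * (2 * 10 ** 6)      # roughly 16 MB of list slots on a 64-bit build
after_alloc = traced_bytes() - baseline

del b                        # drop the only reference to the list
after_del = traced_bytes() - baseline

tracemalloc.stop()
print(f"after allocation: {after_alloc / 2 ** 20:.1f} MiB")
print(f"after del:        {after_del / 2 ** 20:.1f} MiB")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here the Python-level allocation drops back as soon as the last reference is removed, even in cases where the process-level figure reported by the default &lt;code&gt;psutil&lt;/code&gt; backend stays flat.&lt;/p&gt;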

&lt;p&gt;The second thing to note from my results is the profiling of &lt;code&gt;test_caching_function&lt;/code&gt;, which drives &lt;code&gt;caching_function&lt;/code&gt;. The test driver runs through the function with 10,000 values, then runs through them again in reverse. On the second pass the cache is hit only for roughly the first 128 calls (128 is the default &lt;code&gt;maxsize&lt;/code&gt; of the &lt;code&gt;functools.lru_cache&lt;/code&gt; decorator). We see that there is much less memory growth the second time around (this is both because of the cache hits and because previously allocated memory is reused rather than requested again). In general, look for continual or large memory increments without decrements. Also look for cases where memory grows every time the function is called, even if it’s in smaller amounts.&lt;/p&gt;
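
&lt;p&gt;To check how often the cache is really hit, &lt;code&gt;functools.lru_cache&lt;/code&gt; exposes a &lt;code&gt;cache_info()&lt;/code&gt; method on the decorated function. The sketch below repeats the two loops from the test driver, but swaps in a cheap stand-in (&lt;code&gt;cached_square&lt;/code&gt;, a name made up for this example) so it runs without NumPy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from functools import lru_cache

@lru_cache(maxsize=128)   # 128 is also the default when maxsize is omitted
def cached_square(n):
    # Stand-in for caching_function; cheap to compute so the demo runs quickly.
    return n * n

# First pass: every call is a miss, and only the last 128 results are kept.
for i in range(10_000):
    cached_square(i)
after_first_pass = cached_square.cache_info()

# Second pass, in reverse: only the most recently cached values can be hits.
for i in range(10_000, 0, -1):
    cached_square(i)
info = cached_square.cache_info()

print("hits on second pass:", info.hits - after_first_pass.hits)
print("entries kept in cache:", info.currsize)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With these exact ranges the second pass gets 127 hits, because its first argument (10,000) was never cached and pushes out one of the 128 stored entries.&lt;/p&gt;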

&lt;h2&gt;
  
  
  Profiling in regular code
&lt;/h2&gt;

&lt;p&gt;If the function decorator is imported in your code (as above) and run as normal, profiling data is sent to stdout. This can be a handy way to profile single methods quickly. You can annotate any function and just run your code using whichever scripts you normally use. Note you can send this output to a file or log it using the &lt;code&gt;logging&lt;/code&gt; module as well. See the docs for details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Jupyter/IPython magics
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;memory_profiler&lt;/code&gt; project also includes Jupyter/IPython magics, which can be useful. It’s very important to note that to get line-by-line output (as of v0.58, the most recent version at the time of writing), code has to be saved in local Python source files; it can’t be read directly from notebooks or the IPython interpreter. But the magics can still be useful for debugging memory issues. To use them, load the extension.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%load_ext memory_profiler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  mprun
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;%mprun&lt;/code&gt; magic is similar to running the functions as described above, but you can do some more ad-hoc checking. First, just import the functions, then run them. Note that I found it didn’t seem to play well with &lt;a href="https://www.wrighters.io/using-autoreload-to-speed-up-ipython-and-jupyter-work/"&gt;&lt;code&gt;autoreload&lt;/code&gt;&lt;/a&gt;, so your mileage may vary in trying to modify code and test it without doing a full kernel restart.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from performance_memory_profiler import test_caching_function, simple_function

%mprun -f simple_function simple_function()
Filename: /Users/mcw/projects/python_blogposts/performance/performance_memory_profiler.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     8     76.4 MiB     76.4 MiB           1   @profile
     9                                         def simple_function():
    10     84.0 MiB      7.6 MiB           1       a = [1] * (10 ** 6)
    11    236.6 MiB    152.6 MiB           1       b = [2] * (2 * 10 ** 7)
    12    236.6 MiB      0.0 MiB           1       del b
    13    236.6 MiB      0.0 MiB           1       return a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  memit
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;%memit&lt;/code&gt; and &lt;code&gt;%%memit&lt;/code&gt; magics are helpful for checking what the peak memory and incremental memory growth are for the code executed. You don’t get line-by-line output, but this can allow for interactive debugging and testing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%memit
range(1000)
peak memory: 237.00 MiB, increment: 0.32 MiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Looking at specific objects, not using memory_profiler
&lt;/h2&gt;

&lt;p&gt;Let’s just look quickly at NumPy and pandas objects and how we can see the memory usage of those objects. Objects from these two libraries are very likely to be large in many use cases. For newer versions of the libraries, you can use &lt;code&gt;sys.getsizeof&lt;/code&gt; to see their memory usage. Under the hood, pandas objects will just call their &lt;code&gt;memory_usage&lt;/code&gt; method, which you can also use directly. Note that you need to specify &lt;code&gt;deep=True&lt;/code&gt; if you also want to see the memory usage of objects in pandas containers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sys

import numpy as np
import pandas as pd

def make_big_arrays():
    x = np.ones(int(1e7))
    return x

def make_big_series():
    return pd.Series(np.ones(int(1e7)))

def make_big_string_series():
    return pd.Series([str(i) for i in range(int(1e7))])

arr = make_big_arrays()
ser = make_big_series()
ser2 = make_big_string_series()

print("arr: ", sys.getsizeof(arr))
print("ser: ", sys.getsizeof(ser))
print("ser2: ", sys.getsizeof(ser2))
print("ser: ", ser.memory_usage(), ser.memory_usage(deep=True))
print("ser2: ", ser2.memory_usage(), ser2.memory_usage(deep=True))

arr: 80000096
ser: 80000144
ser2: 638889034
ser: 80000128 80000128
ser2: 80000128 638889018

%memit make_big_string_series()

peak memory: 1883.11 MiB, increment: 780.45 MiB

%%memit
x = make_big_string_series()
del x

peak memory: 1883.14 MiB, increment: 696.07 MiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things to point out here. First, you can see the size of a &lt;code&gt;Series&lt;/code&gt; of &lt;code&gt;float&lt;/code&gt; values is the same whether you use &lt;code&gt;deep=True&lt;/code&gt; or not. For string objects, the shallow size of the &lt;code&gt;Series&lt;/code&gt; is the same as for the &lt;code&gt;float&lt;/code&gt; &lt;code&gt;Series&lt;/code&gt;, but the underlying objects are much bigger. Second, our &lt;code&gt;Series&lt;/code&gt; made of string objects is over 600 MiB, and using &lt;code&gt;%memit&lt;/code&gt; we can see a comparable increment when we invoke the function. This tool will help you narrow down which functions allocate the most memory and should be investigated further with line-by-line profiling.&lt;/p&gt;
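
&lt;p&gt;The same shallow versus deep distinction shows up with plain Python containers. As a rough sketch using only the standard library (the helper &lt;code&gt;deep_list_size&lt;/code&gt; is a made-up name, and it only looks one level deep), compare the size of a list with the size of the list plus the strings it refers to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sys

def deep_list_size(items):
    """Size of a list plus the sizes of the objects it refers to (one level deep)."""
    return sys.getsizeof(items) + sum(sys.getsizeof(item) for item in items)

strings = [str(i) for i in range(100_000)]

shallow = sys.getsizeof(strings)   # the list object and its array of references
deep = deep_list_size(strings)     # also counts each string object

print("shallow:", shallow)
print("deep:   ", deep)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The shallow figure only covers the references held by the list, which is why it stays small compared to the total once the string objects are included.&lt;/p&gt;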

&lt;h2&gt;
  
  
  Further investigation
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;memory_profiler&lt;/code&gt; project also has tools for investigating longer running programs and seeing how memory grows over time. Check out the &lt;code&gt;mprof&lt;/code&gt; command for that functionality. It also supports tracking memory in forked processes in a multiprocessing context. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Debugging memory issues can be a very difficult and laborious process, but having a few tools to help understand where the memory is being allocated can be very helpful in moving the debugging sessions along. When used along with other profiling tools, such as &lt;a href="https://www.wrighters.io/profiling-python-code-with-line_profiler/"&gt;&lt;code&gt;line_profiler&lt;/code&gt;&lt;/a&gt; or &lt;a href="https://www.wrighters.io/profiling-python-code-with-py-spy/"&gt;&lt;code&gt;py-spy&lt;/code&gt;&lt;/a&gt;, you can get a much better idea of where your code needs improvement.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.wrighters.io/profiling-python-code-with-memory_profiler/"&gt;Profiling Python code with memory_profiler&lt;/a&gt; appeared first on &lt;a href="https://www.wrighters.io"&gt;wrighters.io&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>profiling</category>
      <category>python</category>
    </item>
    <item>
      <title>How to view all your variables in a Jupyter notebook</title>
      <dc:creator>wrighter</dc:creator>
      <pubDate>Thu, 15 Apr 2021 01:45:10 +0000</pubDate>
      <link>https://forem.com/wrighter/how-to-view-all-your-variables-in-a-jupyter-notebook-a49</link>
      <guid>https://forem.com/wrighter/how-to-view-all-your-variables-in-a-jupyter-notebook-a49</guid>
      <description>&lt;p&gt;Bring up the subject of Jupyter notebooks around Python developers and you’ll likely get a variety of opinions about them. Many developers think that using notebooks can promote some bad habits, cause confusion, and result in ugly code. A very common problem raised is the idea of hidden state in a notebook. This hidden state can show up in a few ways, but one common way is by executing notebook cells out of order. This often happens during development and exploration. It can be common to modify a cell, execute it multiple times, and even delete it. Once a cell is deleted or modified and re-executed, the hidden state from that cell remains in the current session. Variables, functions, classes, and any other code will continue to exist and possibly affect code in other cells. &lt;/p&gt;

&lt;p&gt;This causes some obvious problems, first for the current session of the notebook, and second for any future invocations of the notebook. In order for a notebook to reflect reality, it should contain valid code that can be executed in order to produce consistent results. Practically, you can work towards this goal in a couple of ways.&lt;/p&gt;

&lt;h2&gt;
  
  
  Nuke it
&lt;/h2&gt;

&lt;p&gt;If your notebook is small and runs quickly, you can always restart your kernel and run all the code again. This mimics the more typical development workflow of unit testing or running scripts from the command line (or in an IDE integration). If you just run a new Python instance with the saved code, no hidden state can exist and the output will be consistent. This makes sense for small notebooks where you can quickly visualize all the code and verify it on inspection. &lt;/p&gt;

&lt;p&gt;But this may not be practical for all cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  View it
&lt;/h2&gt;

&lt;p&gt;If a developer doesn’t want to continually restart their interpreter, they can also view what the current state is. Let’s walk through a few ways to do this, from the simple to more complex. Note that this code example uses Jupyter 6.15 with IPython 7.19.0 as the kernel.&lt;/p&gt;

&lt;p&gt;First, let’s make some data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def a_function():
    pass

class MyClass:
    def __init__(self, name):
        self.name = name

var = "a variable"
var2 = "another variable"
x = np.ones(20)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now once a cell with the above Python code has been executed, I can inspect the state of my current session by either executing a single cell with one of the variables in it, or using the IPython &lt;code&gt;display&lt;/code&gt; function. A cell will display the value of the last expression in the cell (unless you append a &lt;code&gt;;&lt;/code&gt; at the end of the line). If using the default interpreter, &lt;code&gt;display&lt;/code&gt; is not available, but executing any variable will show you the value (based on its &lt;code&gt;__repr__&lt;/code&gt; method).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;display(a_function)
display(var2)
display(MyClass)
display(x)
var
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;function __main__.a_function()&amp;gt;
'another variable'
__main__.MyClass
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1.])
'a variable'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  But what if the code is gone?
&lt;/h2&gt;

&lt;p&gt;OK, this above method is obvious, we can view items that we know exist. But how do we find objects that we don’t know exist? Maybe we deleted the cell that created the values, or if we’re using an IPython command line, our history is not visible anymore for that code. Or maybe we edited the cell a few times and re-executed it, and changed some variable names.&lt;/p&gt;

&lt;p&gt;One function to consider is the &lt;a href="https://docs.python.org/3/library/functions.html#dir"&gt;&lt;code&gt;dir&lt;/code&gt;&lt;/a&gt; builtin. When you invoke this function with no arguments, it will return a list of all the variable names in the local scope. If you supply a module or class, it will list the attributes of the module or the class (and, for a class, those of its base classes).&lt;/p&gt;

&lt;p&gt;When we do this, we can see that our variables are all present. Note this is available in standard Python, not just IPython.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dir()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['In',
 'MyClass',
 'Out',
 '_',
 '_2',
 '__',
 '___',
 '__builtin__',
 '__builtins__',
 '__doc__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_dh',
 '_i',
 '_i1',
 '_i2',
 '_i3',
 '_ih',
 '_ii',
 '_iii',
 '_oh',
 'a_function',
 'exit',
 'get_ipython',
 'np',
 'quit',
 'var',
 'var2',
 'x']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Whoa, there’s also a lot of other stuff in there. Most of the variables are added by IPython and relate to command history, so if you run this sample with the default interpreter, there won’t be quite as many variables present. Also, some functions are loaded at startup (and you can configure IPython to load others as well). Other objects exist because Python places them in the global scope.&lt;/p&gt;

&lt;p&gt;Note that the special variable &lt;code&gt;_&lt;/code&gt; is the value of the last executed cell (or line).&lt;/p&gt;

&lt;h2&gt;
  
  
  Using &lt;code&gt;globals&lt;/code&gt; and &lt;code&gt;locals&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;There are two other functions that are helpful: &lt;a href="https://docs.python.org/3/library/functions.html#locals"&gt;&lt;code&gt;locals&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://docs.python.org/3/library/functions.html#globals"&gt;&lt;code&gt;globals&lt;/code&gt;&lt;/a&gt;. These return the symbol table, a dictionary keyed by variable name and containing the values. For &lt;code&gt;globals&lt;/code&gt;, this is the symbol table of the current module (when invoked in a function or method, the module is the one where the function was defined, not where it was called). &lt;code&gt;locals&lt;/code&gt; is the same as &lt;code&gt;globals&lt;/code&gt; when invoked at the module level; inside a function it returns that function’s local variables, including any free variables.&lt;/p&gt;

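&lt;p&gt;Note: avoid modifying these dictionaries. Changes to &lt;code&gt;globals()&lt;/code&gt; alter the running session, and changes to &lt;code&gt;locals()&lt;/code&gt; may not affect the variables the interpreter actually uses.&lt;br&gt;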
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;locals() # get the full dictionary
globals()['var'] # grab out a single value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'a variable'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
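
&lt;p&gt;To see the difference between the two inside a function, here is a small sketch in plain Python. The function and variable names (&lt;code&gt;show_scopes&lt;/code&gt;, &lt;code&gt;argument&lt;/code&gt;, &lt;code&gt;inner&lt;/code&gt;) are made up for the example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def show_scopes(argument):
    inner = argument * 2
    local_names = sorted(locals())    # names that exist only for this call
    global_names = set(globals())     # names defined at the module level
    # Local names do not leak into the module's symbol table.
    return local_names, "inner" in global_names

names, leaked = show_scopes(21)
print(names)
print(leaked)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This prints &lt;code&gt;['argument', 'inner']&lt;/code&gt; and then &lt;code&gt;False&lt;/code&gt;: inside the function, &lt;code&gt;locals()&lt;/code&gt; only knows about the function’s own names, while &lt;code&gt;globals()&lt;/code&gt; still refers to the module.&lt;/p&gt;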



&lt;h2&gt;
  
  
  Can I see something a little nicer?
&lt;/h2&gt;

&lt;p&gt;Working with a big dictionary that has some extra values added by IPython might not be the easiest way to inspect your variables. You could build a function to beautify the symbol table, but luckily there’s already some nice magics for this. (Magics are special functions in IPython, look &lt;a href="https://www.wrighters.io/using-autoreload-to-speed-up-ipython-and-jupyter-work/"&gt;here&lt;/a&gt; for a quick intro to magics, and specifically the &lt;code&gt;autoreload&lt;/code&gt; magic.)&lt;/p&gt;
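
&lt;p&gt;For comparison, a hand-rolled helper along those lines could be as small as the sketch below. The function name &lt;code&gt;list_user_variables&lt;/code&gt; and the sample dictionary are made up for illustration; in a real session you would pass &lt;code&gt;globals()&lt;/code&gt;, and IPython’s own helpers such as &lt;code&gt;In&lt;/code&gt;, &lt;code&gt;Out&lt;/code&gt;, &lt;code&gt;exit&lt;/code&gt; and &lt;code&gt;quit&lt;/code&gt; would still show up unless you filter them too.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def list_user_variables(namespace):
    """Return (name, type name) pairs, skipping names that start with an underscore."""
    rows = []
    for name, value in sorted(namespace.items()):
        if name.startswith("_"):
            continue
        rows.append((name, type(value).__name__))
    return rows

# A small stand-in for globals(), so the example behaves the same everywhere.
example_namespace = {"var": "a variable", "x": [1.0, 2.0], "_i1": "history entry"}

for name, type_name in list_user_variables(example_namespace):
    print(f"{name:12}{type_name}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;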

&lt;p&gt;Jupyter/IPython provide three helpful magics for inspecting variables. First, there is &lt;code&gt;%who&lt;/code&gt;. With no arguments it prints all the interactive variables with minimal formatting. You can supply types to only show variables matching the type given.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%who
MyClass a_function np var var2 x

# just functions
%who function
a_function
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;%who_ls&lt;/code&gt; magic does the same thing, but returns the variables as a list. It can also limit what you see by type.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%who_ls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['MyClass', 'a_function', 'np', 'var', 'var2', 'x']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%who_ls str function
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['a_function', 'var', 'var2']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last magic is &lt;code&gt;%whos&lt;/code&gt;, which provides a nicely formatted table that shows you the variable, type, and a string representation. It includes helpful information about NumPy and pandas data structures.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%whos
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Variable Type Data/Info
---------------------------------------
MyClass type &amp;lt;class ' __main__.MyClass'&amp;gt;
a_function function &amp;lt;function a_function at 0x10ca51e50&amp;gt;
np module &amp;lt;module 'numpy' from '/Us&amp;lt;...&amp;gt;kages/numpy/ __init__.py'&amp;gt;
var str a variable
var2 str another variable
x ndarray 20: 20 elems, type `float64`, 160 bytes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Fancy output
&lt;/h2&gt;

&lt;p&gt;Now if you want to get fancy, Jupyter has an extension available through &lt;a href="https://github.com/ipython-contrib/jupyter_contrib_nbextensions/"&gt;nbextensions&lt;/a&gt;. The Variable Inspector extension will give you a nice option for viewing variables in an output similar to the &lt;code&gt;%whos&lt;/code&gt; output above. For developers used to an IDE with an automatically updating variable inspector, this extension may prove useful and worth checking out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Removing variables
&lt;/h2&gt;

&lt;p&gt;After looking at the variables defined in your local scope, you may want to remove some of them. For example, if you deleted a cell and want the objects created by that cell to be removed, just &lt;code&gt;del&lt;/code&gt; them. Verify they are gone with any of the methods above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;del var2
%whos
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Variable Type Data/Info
---------------------------------------
MyClass type &amp;lt;class ' __main__.MyClass'&amp;gt;
a_function function &amp;lt;function a_function at 0x10ca51e50&amp;gt;
np module &amp;lt;module 'numpy' from '/Us&amp;lt;...&amp;gt;kages/numpy/ __init__.py'&amp;gt;
var str a variable
x ndarray 20: 20 elems, type `float64`, 160 bytes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Now you know of a few tools that you can use to look for variables in your current Python session. Use them to better understand the code you’ve already executed and maybe save yourself a little bit of time.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.wrighters.io/how-to-view-all-your-variables-in-a-jupyter-notebook/"&gt;How to view all your variables in a Jupyter notebook&lt;/a&gt; appeared first on &lt;a href="https://www.wrighters.io"&gt;wrighters.io&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
    </item>
    <item>
      <title>Using autoreload to speed up IPython and Jupyter work</title>
      <dc:creator>wrighter</dc:creator>
      <pubDate>Mon, 05 Apr 2021 23:59:20 +0000</pubDate>
      <link>https://forem.com/wrighter/using-autoreload-to-speed-up-ipython-and-jupyter-work-38lf</link>
      <guid>https://forem.com/wrighter/using-autoreload-to-speed-up-ipython-and-jupyter-work-38lf</guid>
      <description>&lt;p&gt;I try to do all of my interactive Python development with either Jupyter notebooks or an IPython session. One of the main reasons I like these environments is the &lt;a href="https://ipython.readthedocs.io/en/stable/config/extensions/autoreload.html"&gt;&lt;code&gt;%autoreload&lt;/code&gt;&lt;/a&gt; magic. What’s so special about &lt;code&gt;%autoreload&lt;/code&gt; and why does it often make development faster and simpler? &lt;/p&gt;

&lt;h2&gt;
  
  
  Why IPython and Jupyter?
&lt;/h2&gt;

&lt;p&gt;Before going further, if you haven’t yet used both IPython and Jupyter, check out the &lt;a href="https://ipython.readthedocs.io/en/stable/interactive/tutorial.html"&gt;ipython interactive tutorial&lt;/a&gt; first. It does a good job of explaining why using IPython is superior to the default Python interpreter. It has a host of useful features, but in this article I will only be talking about one feature (magics) and specifically one of those magics (&lt;code&gt;%autoreload&lt;/code&gt;). &lt;a href="https://jupyter-notebook.readthedocs.io/en/stable/"&gt;Jupyter notebooks&lt;/a&gt;, like IPython, support most of the same magics, so much of the tutorial will work in either an interactive IPython session or a Jupyter notebook session. One thing to note is that I’m talking about Python here, not other languages running in a Jupyter notebook.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a magic?
&lt;/h2&gt;

&lt;p&gt;Magics are just special functions that you can call in your IPython or Jupyter session. They come in two forms: line and cell. A line magic is prefixed with one &lt;code&gt;%&lt;/code&gt;, a cell magic is prefixed with two, &lt;code&gt;%%&lt;/code&gt;. A line magic consumes one line, whereas a cell magic consumes the lines below the magic, allowing for more input. For this article, we’ll look at just one of the line magics, the &lt;code&gt;%autoreload&lt;/code&gt; magic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why autoreload?
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;%autoreload&lt;/code&gt; magic changes the Python session so that modules are automatically reloaded in that session before entering the execution of code typed at the IPython prompt (or the Jupyter notebook cell). What this means is that modules loaded into your session can be modified (outside your session), and the changes will be detected and reloaded without you having to restart your session.&lt;/p&gt;
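
&lt;p&gt;To make that concrete, here is roughly what you would have to do by hand without the extension, using the standard library’s &lt;code&gt;importlib.reload&lt;/code&gt;. This is only a sketch: the module name &lt;code&gt;demo_data_source&lt;/code&gt; and its contents are invented for the demonstration and written to a temporary directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import importlib
import pathlib
import sys
import tempfile

# Create a throwaway module on disk (hypothetical name and contents).
module_dir = pathlib.Path(tempfile.mkdtemp())
module_path = module_dir / "demo_data_source.py"
module_path.write_text("def load():\n    return 'version 1'\n")

sys.path.insert(0, str(module_dir))
import demo_data_source

first = demo_data_source.load()

# Simulate editing the file in an editor, then reload it manually.
module_path.write_text("def load():\n    return 'version 2, with changes'\n")
importlib.reload(demo_data_source)
second = demo_data_source.load()

print(first, "then", second)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With &lt;code&gt;%autoreload&lt;/code&gt; enabled, that explicit reload step happens for you before each piece of code you run.&lt;/p&gt;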

&lt;p&gt;This can be tremendously useful. Let me describe a typical scenario. Let’s say you have a Jupyter notebook that you’ve created and are enhancing, and you require data from several sources. You get the data by executing functions in modules you import at the beginning of your session, and those modules are Python code that you control. This will be a very typical use case for many users. Furthermore, let’s say in your notebook you load all the data into memory and this takes a full 5 minutes. You then start to work with the data and soon realize that you need slightly different data from one of the functions in one of the modules you control, so you need to add another parameter to query data differently. How do you&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make this change&lt;/li&gt;
&lt;li&gt;Test this change&lt;/li&gt;
&lt;li&gt;Continue your work&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In most cases you will open the underlying code in your editor or IDE, modify it, test it in another session (or with unit tests), then optionally install changes locally. But what about the notebook that already has some of the data loaded? One way to continue your work is to restart your Jupyter kernel to pick up the changes you just made, reload all data into memory (taking 5 minutes at least), and then continue your work.&lt;/p&gt;

&lt;p&gt;But there’s a better way, using &lt;code&gt;autoreload&lt;/code&gt;. In your Jupyter session, you first load the &lt;code&gt;autoreload&lt;/code&gt; extension, using the &lt;code&gt;%load_ext&lt;/code&gt; magic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%load_ext autoreload
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, the &lt;code&gt;%autoreload&lt;/code&gt; magic is available in your session. It can take a single argument that specifies how &lt;code&gt;autoreload&lt;/code&gt;ing of modules will behave. The extension also provides another magic, &lt;code&gt;%aimport&lt;/code&gt;, which allows for fine-grained control of which modules are affected by the autoreload. If no arguments are given to &lt;code&gt;%autoreload&lt;/code&gt;, then it will reload all modules immediately (except those excluded by &lt;code&gt;%aimport&lt;/code&gt; as seen below). You can run it once and then use your updated code.&lt;/p&gt;

&lt;p&gt;The optional argument for &lt;code&gt;autoreload&lt;/code&gt; has three valid values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0 – disable automatic reloading&lt;/li&gt;
&lt;li&gt;1 – reload all the modules imported by &lt;code&gt;%aimport&lt;/code&gt; every time before executing Python code that has been typed&lt;/li&gt;
&lt;li&gt;2 – reload all modules (except those excluded by &lt;code&gt;%aimport&lt;/code&gt;) every time before executing Python code that has been typed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To regulate the modules affected by &lt;code&gt;autoreload&lt;/code&gt;, use the &lt;code&gt;%aimport&lt;/code&gt; magic. It works as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no arguments – lists the modules that will be reloaded or skipped&lt;/li&gt;
&lt;li&gt;with one argument – the module provided will be imported with &lt;code&gt;%autoreload 1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;with comma-separated arguments – all modules in the list will be imported with &lt;code&gt;%autoreload 1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;with a &lt;code&gt;-&lt;/code&gt; before argument – that module will &lt;em&gt;not&lt;/em&gt; be autoreloaded&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For me, the most common way I use &lt;code&gt;%autoreload&lt;/code&gt; is to just include everything during my initial development work when I’m likely to be changing Python modules and notebook code (i.e. to run &lt;code&gt;%autoreload 2&lt;/code&gt;), and to not use it at all otherwise. But having the control can be useful, especially if you are loading a lot of modules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;For a concrete example that you can use to follow along, make two Python files, &lt;code&gt;auto.py&lt;/code&gt; and &lt;code&gt;auto2.py&lt;/code&gt;, and save them alongside a Jupyter notebook with the imports below. Each of the Python files should have a simple function in them, as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# in auto.py
def my_api(model, year):
    # dummy result
    return { 'model': model, 'year': year, }

# in auto2.py
def my_api2(model, year):
    # dummy result
    return { 'model': model, 'year': year, }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let’s import both modules and inspect the API functions using the IPython/Jupyter help by appending a &lt;code&gt;?&lt;/code&gt; to the function name. You should see that the imported module matches your code in the Python file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import auto
import auto2

auto.my_api?

Signature: auto.my_api(model, year)
Docstring: &amp;lt;no docstring&amp;gt;
File: ~/projects/python_blogposts/tools/auto.py
Type: function
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, in a separate editor, add a third argument (say, &lt;code&gt;color&lt;/code&gt;) to the &lt;code&gt;auto.my_api&lt;/code&gt; function. Save the file. Do we see it? Re-run the help cell to see.&lt;/p&gt;

&lt;p&gt;No, not yet. Let’s turn on autoreload.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%autoreload 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, when I inspect &lt;code&gt;auto.my_api&lt;/code&gt;, I see the new argument. It worked!&lt;/p&gt;

&lt;p&gt;Now I can modify settings so that only the &lt;code&gt;auto2&lt;/code&gt; module is reloaded, not &lt;code&gt;auto&lt;/code&gt;. But first, let’s see the modules to reload and skip. By default, it includes all modules and skips none (because I used &lt;code&gt;2&lt;/code&gt; as the initial argument).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%aimport
Modules to reload:

Modules to skip:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s turn off &lt;code&gt;auto&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%aimport -auto
%aimport
Modules to reload:

Modules to skip:
auto
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, if I modify the code in &lt;code&gt;auto&lt;/code&gt;, I shouldn’t see the changes in this session. Using &lt;code&gt;%aimport&lt;/code&gt; you can restrict which code is being reloaded.&lt;/p&gt;

&lt;h2&gt;
  
  
  Caveats
&lt;/h2&gt;

&lt;p&gt;It’s important to note that module reloading is not perfect. You should not leave this on for production code because it will slow things down. Also, if you are live editing your code and leave it in a broken state, the most recent successfully loaded code will be the code running in your session, so it can make things confusing for you. This is probably not the way you want to modify large amounts of code, but when making incremental changes, it can work well.&lt;/p&gt;

&lt;p&gt;To observe what broken code will look like, open the module that is being autoreloaded (&lt;code&gt;auto2.py&lt;/code&gt;), add a syntax error (for example, mismatched parentheses somewhere), and save the file. Then execute the function from that module in a notebook cell. You should see &lt;code&gt;autoreload&lt;/code&gt; report a traceback of the syntax error in the cell. You’ll only see this error once; if you re-execute the cell, it won’t show you the same error but will use the version of the code last loaded.&lt;/p&gt;

&lt;p&gt;Also, note that there are a few things that don’t work all the time, like removing functions from a module, changing a @property in a class to an ordinary method, or reloading C extensions. In those cases, you’ll need to restart your session. You can see more details in the &lt;a href="https://ipython.readthedocs.io/en/stable/config/extensions/autoreload.html"&gt;docs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;If you’ve never used &lt;code&gt;%autoreload&lt;/code&gt; before, give it a try next time you have an IPython or Jupyter session with a lot of data in it and want to make a small change to a local module. Hopefully it will save you some time.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.wrighters.io/using-autoreload-to-speed-up-ipython-and-jupyter-work/"&gt;Using autoreload to speed up IPython and Jupyter work&lt;/a&gt; appeared first on &lt;a href="https://www.wrighters.io"&gt;wrighters.io&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
    </item>
    <item>
      <title>Unit testing Python code in Jupyter notebooks</title>
      <dc:creator>wrighter</dc:creator>
      <pubDate>Tue, 23 Mar 2021 03:01:33 +0000</pubDate>
      <link>https://forem.com/wrighter/unit-testing-python-code-in-jupyter-notebooks-32l6</link>
      <guid>https://forem.com/wrighter/unit-testing-python-code-in-jupyter-notebooks-32l6</guid>
      <description>&lt;p&gt;Most of us agree that we should write unit tests, and many of us actually do. This should be especially true for production code, library code, or, if you subscribe to test-driven development, during the entire development process.&lt;/p&gt;

&lt;p&gt;Jupyter notebooks with Python are often used for data exploration, so users may not choose (or need) to write unit tests for their notebook code: they typically look at the results of each cell as they progress through the notebook, come to a conclusion, and move on. However, in my experience, notebook code soon moves beyond data exploration and becomes useful for further work. Or, perhaps the notebook itself produces results that are useful and need to be run on a regular basis. Perhaps the code needs to be maintained and integrated with external data sources. Then it becomes important to ensure that the code in the notebook can be tested and verified.&lt;/p&gt;

&lt;p&gt;In this case, what are our options for unit testing notebook code? In this article I’ll cover several options for unit testing Python code in a Jupyter notebook.&lt;/p&gt;

&lt;h2&gt;
  
  
  Maybe just don’t do it?
&lt;/h2&gt;

&lt;p&gt;The first option for Jupyter notebook unit testing is to just not do it at all. By this, I don’t mean don’t unit test your code, but rather &lt;em&gt;extract&lt;/em&gt; it from the notebook into separate Python modules that you import back into your notebook. That code should be tested the way you usually unit test your code, whether that be with &lt;code&gt;unittest&lt;/code&gt;, &lt;code&gt;pytest&lt;/code&gt;, &lt;code&gt;doctest&lt;/code&gt;, or another unit testing framework. This article won’t cover all those frameworks in detail, but a great choice for Python developers is to not test inside their Jupyter notebooks, but to use the rich assortment of testing frameworks already available for Python code, and to move code to external modules as soon as possible in the development process.&lt;/p&gt;

&lt;h2&gt;
  
  
  OK, so you can test in a notebook
&lt;/h2&gt;

&lt;p&gt;If you end up deciding you want to leave your code inside a Jupyter notebook, there actually are some unit testing options. Before reviewing a few of them, let’s just set up a code example that we might encounter in a Jupyter notebook. Let’s say your notebook pulls some data from an API, calculates some results from it, then produces some graphs and other data summaries that it persists elsewhere. Maybe there’s a function that produces the proper API URL, and we want to unit test that function. This function has some logic that changes the URL format based on the date for the report. Here’s a debugged version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import datetime
import dateutil.parser  # the parser submodule must be imported explicitly

def make_url(date):
    """Return the url for our API call based on date."""

    if isinstance(date, str):
        date = dateutil.parser.parse(date).date()
    elif not isinstance(date, datetime.date):
        raise ValueError("must be a date")
    if date &amp;gt;= datetime.date(2020, 1, 1):
        return f"https://api.example.com/v2/{date.year}/{date.month}/{date.day}"
    else:
        return f"https://api.example.com/v1/{date:%Y-%m-%d}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Unit testing with unittest
&lt;/h2&gt;

&lt;p&gt;Normally, when we test with &lt;a href="https://docs.python.org/3/library/unittest.html"&gt;&lt;code&gt;unittest&lt;/code&gt;&lt;/a&gt; we would either put our test methods in a separate test module, or possibly we’d mix those methods inside the main module. Then we’d need to call &lt;code&gt;unittest.main&lt;/code&gt;, possibly inside a &lt;code&gt;__main__&lt;/code&gt; guard. We can basically do the same thing in our Jupyter notebook. We can make a &lt;code&gt;unittest.TestCase&lt;/code&gt; class, perform the tests we want, and then just execute the unit tests in any cell. The results of the tests can even be inspected or asserted to include no failures if you want the notebook execution to fail on errors. You just need to save the return value of &lt;code&gt;unittest.main&lt;/code&gt; and inspect it for errors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import unittest

class TestUrl(unittest.TestCase):
    def test_make_url_v2(self):
        date = datetime.date(2020, 1, 1)
        self.assertEqual(make_url(date), "https://api.example.com/v2/2020/1/1")

    def test_make_url_v1(self):
        date = datetime.date(2019, 12, 31)
        self.assertEqual(make_url(date), "https://api.example.com/v1/2019-12-31")


res = unittest.main(argv=[''], verbosity=3, exit=False)

# if we want our notebook to stop processing due to failures, we need a cell itself to fail
# wasSuccessful() is False if any test failed or raised an unexpected error
assert res.result.wasSuccessful()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test_make_url_v1 (__main__.TestUrl) ... ok
test_make_url_v2 (__main__.TestUrl) ... ok

---------------------------------------------------------------------------
Ran 2 tests in 0.001s

OK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This turns out to be fairly straightforward, and if you don’t mind commingling code and tests in your notebook, it works fine.&lt;/p&gt;
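
&lt;p&gt;If you would rather not run every test defined in the session, one variation (a sketch, not the only way to do it) is to load a single &lt;code&gt;TestCase&lt;/code&gt; with the standard library’s test loader and run it with a &lt;code&gt;TextTestRunner&lt;/code&gt;. The &lt;code&gt;make_url&lt;/code&gt; below is a simplified, date-only stand-in for the notebook function so that the block runs on its own.&lt;/p&gt;

```python
import datetime
import io
import unittest

def make_url(date):
    """Simplified stand-in for the notebook's make_url (dates only)."""
    if date >= datetime.date(2020, 1, 1):
        return f"https://api.example.com/v2/{date.year}/{date.month}/{date.day}"
    return f"https://api.example.com/v1/{date:%Y-%m-%d}"

class TestUrl(unittest.TestCase):
    def test_v2(self):
        self.assertEqual(make_url(datetime.date(2020, 1, 1)),
                         "https://api.example.com/v2/2020/1/1")

    def test_v1(self):
        self.assertEqual(make_url(datetime.date(2019, 12, 31)),
                         "https://api.example.com/v1/2019-12-31")

# Load only this TestCase rather than everything defined in the session.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestUrl)
stream = io.StringIO()
result = unittest.TextTestRunner(stream=stream, verbosity=2).run(suite)
print(stream.getvalue())

# Fail the cell on failures and on unexpected errors.
assert result.wasSuccessful(), "unit tests failed"
```

&lt;p&gt;Running a named suite like this keeps the result scoped to the tests you care about, which can help when a long notebook defines several test classes.&lt;/p&gt;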

&lt;h2&gt;
  
  
  Unit testing with doctest
&lt;/h2&gt;

&lt;p&gt;Another way to include tests in your code is to use &lt;a href="https://docs.python.org/3/library/doctest.html#module-doctest"&gt;doctest&lt;/a&gt;. Doctest uses specially formatted code documentation that includes our tests and the expected results. Below is an updated function with this special code documentation included, both for positive and negative test cases. This is a simple way to test and document code in one place, and it is often used in Python modules where the main guard just runs the doctests, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import doctest

if __name__ == "__main__":
    doctest.testmod()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we’re in a notebook, we will just add this to a cell below where our code is defined, and it will also work. First, here’s our updated &lt;code&gt;make_url&lt;/code&gt; function with the doctest examples in its docstring.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def make_url(date):
    """Return the url for our API call based on date.
    &amp;gt;&amp;gt;&amp;gt; make_url("1/1/2020")
    'https://api.example.com/v2/2020/1/1'

    &amp;gt;&amp;gt;&amp;gt; make_url("1-1-x1")
    Traceback (most recent call last):
        ...
    dateutil.parser._parser.ParserError: Unknown string format: 1-1-x1

    &amp;gt;&amp;gt;&amp;gt; make_url("1/1/20001")
    Traceback (most recent call last):
        ...
    dateutil.parser._parser.ParserError: year 20001 is out of range: 1/1/20001

    &amp;gt;&amp;gt;&amp;gt; make_url(datetime.date(2020,1,1))
    'https://api.example.com/v2/2020/1/1'

    &amp;gt;&amp;gt;&amp;gt; make_url(datetime.date(2019,12,31))
    'https://api.example.com/v1/2019-12-31'
    """
    if isinstance(date, str):
        date = dateutil.parser.parse(date).date()
    elif not isinstance(date, datetime.date):
        raise ValueError("must be a date")
    if date &amp;gt;= datetime.date(2020, 1, 1):
        return f"https://api.example.com/v2/{date.year}/{date.month}/{date.day}"
    else:
        return f"https://api.example.com/v1/{date:%Y-%m-%d}"

import doctest
doctest.testmod()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TestResults(failed=0, attempted=5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
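
&lt;p&gt;As with unittest, you may want the notebook to stop if a doctest fails. Here is a minimal sketch, assuming a simplified date-only &lt;code&gt;make_url&lt;/code&gt; so that the block is self-contained: it collects the doctests for one function, runs them, and asserts on the counts. In a notebook, asserting on the &lt;code&gt;failed&lt;/code&gt; field of the &lt;code&gt;doctest.testmod()&lt;/code&gt; result should have the same effect.&lt;/p&gt;

```python
import datetime
import doctest

def make_url(date):
    """Return the url for our API call based on date (simplified stand-in).

    >>> make_url(datetime.date(2020, 1, 1))
    'https://api.example.com/v2/2020/1/1'

    >>> make_url(datetime.date(2019, 12, 31))
    'https://api.example.com/v1/2019-12-31'
    """
    if date >= datetime.date(2020, 1, 1):
        return f"https://api.example.com/v2/{date.year}/{date.month}/{date.day}"
    return f"https://api.example.com/v1/{date:%Y-%m-%d}"

# Collect and run only the doctests attached to make_url.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
failed = attempted = 0
for test in finder.find(make_url, name="make_url",
                        globs={"make_url": make_url, "datetime": datetime}):
    outcome = runner.run(test)
    failed += outcome.failed
    attempted += outcome.attempted

print(f"failed={failed}, attempted={attempted}")

# A failing assertion makes the cell (and an automated notebook run) fail.
assert failed == 0, "doctests failed"
```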



&lt;h2&gt;
  
  
  Unit testing with testbook
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/nteract/testbook"&gt;testbook&lt;/a&gt; project is a different take on notebook unit testing. It allows you to refer to your notebooks in pure Python code from outside a notebook. This allows you to use any testing framework you like (for example, &lt;code&gt;pytest&lt;/code&gt;, or &lt;code&gt;unittest&lt;/code&gt;) in separate Python modules. You may have a situation where allowing users to modify and update notebook code is the best way to keep code updated and to allow for flexibility for end users. But you may prefer that the code still be tested and verified separately. Testbook makes this an option.&lt;/p&gt;

&lt;p&gt;First, you have to install it in your environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install testbook
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or in your notebook&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%pip install testbook
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, in a separate Python file, you can import your notebook code and test it there. In that file, you’ll create code that looks like the following, and then you’ll use whichever unit testing framework you prefer to actually execute the unit test. You might create the following code in a Python file (say &lt;code&gt;jupyter_unit_tests.py&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import datetime
import testbook

@testbook.testbook('./jupyter_unit_tests.ipynb', execute=True)
def test_make_url(tb):
    func = tb.ref("make_url")  # reference to the function defined in the notebook
    date = datetime.date(2020, 1, 2)
    assert func(date) == "https://api.example.com/v2/2020/1/2"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, you can now run the tests with any unit testing framework. For example, with pytest, you would just run the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pytest jupyter_unit_tests.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works as a normal unit test, and the tests should pass. However, in developing this article, I realized that the &lt;code&gt;testbook&lt;/code&gt; code has limited support for passing arguments in the unit test back into the notebook kernel for testing. These arguments are JSON serialized, and the current code knows how to handle a wide array of Python types. But it doesn’t pass a datetime as an object, for example, but as a string. Since our code makes an attempt to parse strings into dates (after I modified it), it works. In other words, the unit test above is not passing in a &lt;code&gt;datetime.date&lt;/code&gt; to the &lt;code&gt;make_url&lt;/code&gt; method, but rather a string (&lt;code&gt;2020-01-02&lt;/code&gt;) that is then parsed into a date. How could you pass in a date from the unit test into the notebook code? You have several options. First, you can make a date object in your notebook just for testing purposes and then refer to that in your unit tests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;testdate1 = datetime.date(2020,1,1) # for unit test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, you could write your unit test to use that variable in the test.&lt;/p&gt;

&lt;p&gt;A second option is to inject Python code into the notebook, then refer to it back in your unit test. Both options are shown in the final version of the external unit test. Just save that over &lt;code&gt;jupyter_unit_tests.py&lt;/code&gt; and run it using your favorite unit testing framework.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import datetime

import testbook

@testbook.testbook('./jupyter_unit_tests.ipynb', execute=True)
def test_make_url(tb):
    f = tb.ref("make_url")
    d = "2020-01-02"
    assert f(d) == "https://api.example.com/v2/2020/1/2"

    # note that this is actually converted to a string
    d = datetime.date(2020, 1, 2)
    assert f(d) == "https://api.example.com/v2/2020/1/2"

    # this one will be testing the date functionality
    d2 = tb.ref("testdate1")
    assert f(d2) == "https://api.example.com/v2/2020/1/1"

    # this one will inject similar code as above, then use it
    tb.inject("d3 = datetime.date(2020, 2, 3)")
    d3 = tb.ref("d3")
    assert f(d3) == "https://api.example.com/v2/2020/2/3"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;So whether you are a unit testing purist or you just want to sprinkle a few unit tests into your notebooks, there are several options for you to consider. Don’t let your use of notebooks prevent you from doing the right thing in terms of testing your code.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.wrighters.io/unit-testing-python-code-in-jupyter-notebooks/"&gt;Unit testing Python code in Jupyter notebooks&lt;/a&gt; appeared first on &lt;a href="https://www.wrighters.io"&gt;wrighters.io&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
    </item>
    <item>
      <title>Profiling Python code with py-spy</title>
      <dc:creator>wrighter</dc:creator>
      <pubDate>Mon, 15 Mar 2021 22:16:27 +0000</pubDate>
      <link>https://forem.com/wrighter/profiling-python-code-with-py-spy-17oj</link>
      <guid>https://forem.com/wrighter/profiling-python-code-with-py-spy-17oj</guid>
      <description>&lt;p&gt;If you have a Python program that is currently running you may want to understand what the real-world performance profile of the code is. This program could be in a production environment or just on your local machine. You will want to understand where the running program spends its time and if any “hot spots” exist that should be investigated further for improvement. You may be dealing with a production system that is misbehaving and you may want to profile it in an unobtrusive way that doesn’t further impact production performance or require code modifications. What’s a good way to do this? This article will talk about &lt;a href="https://github.com/benfred/py-spy"&gt;py-spy&lt;/a&gt;, a tool that allows you to profile Python programs that are already running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deterministic vs. Sampling profilers
&lt;/h2&gt;

&lt;p&gt;In earlier articles, I wrote about two deterministic profilers, &lt;a href="https://www.wrighters.io/profiling-python-with-cprofile-and-a-speedup-tip/"&gt;cProfile&lt;/a&gt; and &lt;a href="https://www.wrighters.io/profiling-python-code-with-line_profiler/"&gt;line_profiler&lt;/a&gt;. These profilers are useful when you are developing code and want to profile either sections of code or an entire process. Since they are deterministic, they will tell you exactly how many times a function (or in the case of &lt;code&gt;line_profiler&lt;/code&gt;, a line) is executed and how much time it takes &lt;em&gt;relative&lt;/em&gt; to the rest of your code. Because these profilers run within the observed process, they slow it down somewhat: they have to do bookkeeping and calculations in the midst of the program execution. For production code, modifying the code or restarting it with a profiler enabled is often not an option.&lt;/p&gt;

&lt;p&gt;This is where sampling profilers can be helpful. A sampling profiler looks at an existing process and uses various tricks to determine what the running process is doing. You can try this manually yourself. For example, on Linux you can use the &lt;code&gt;pstack &amp;lt;pid&amp;gt;&lt;/code&gt; (or &lt;code&gt;gstack &amp;lt;pid&amp;gt;&lt;/code&gt;) command to see what your process is doing. On a Mac, you can execute &lt;code&gt;echo "thread backtrace all" | lldb -p &amp;lt;pid&amp;gt;&lt;/code&gt; to see something similar. The output will be the stack of all the threads in your process. This works for any process, not just Python programs. For your Python programs, you’ll see the underlying C functions, not your Python functions. In some cases, checking the stack a few times this way may tell you if your process is stuck or where it is slow, provided you can do the translation to your own code. But doing this provides only a single sample in time. Since the process is continually executing, your sample may change each time you run the command (unless it’s blocked or you just happened to be very lucky).&lt;/p&gt;

&lt;p&gt;A sampling profiler and surrounding tools take multiple snapshots of the system over time and then provide you with the ability to look over this data and understand where your code is slow.&lt;/p&gt;
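
&lt;p&gt;To make the idea concrete, here is a toy in-process sampler. It is only a sketch of the sampling principle, not how a real tool does it: production sampling profilers typically observe the process from the outside, whereas this uses &lt;code&gt;sys._current_frames()&lt;/code&gt; from a helper thread inside the same interpreter. The names &lt;code&gt;busy_work&lt;/code&gt; and &lt;code&gt;sample_stacks&lt;/code&gt; are made up for illustration.&lt;/p&gt;

```python
import collections
import sys
import threading
import time

def busy_work(duration):
    """CPU-bound loop standing in for a hot spot."""
    end = time.monotonic() + duration
    total = 0
    while end > time.monotonic():
        total += 1
    return total

def sample_stacks(target_id, counts, stop, interval=0.005):
    """Record the innermost function of the target thread every `interval` seconds."""
    while not stop.is_set():
        frame = sys._current_frames().get(target_id)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)

counts = collections.Counter()
stop = threading.Event()
sampler = threading.Thread(
    target=sample_stacks,
    args=(threading.get_ident(), counts, stop),
    daemon=True,
)
sampler.start()
busy_work(0.3)   # the code being observed
stop.set()
sampler.join()

# Functions seen most often are where the thread spent most of its time.
for name, hits in counts.most_common(3):
    print(f"{name}: {hits} samples")
```

&lt;p&gt;The counts are statistical, so they vary from run to run, but the function where the time is actually spent should dominate the samples.&lt;/p&gt;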

&lt;h2&gt;
  
  
  py-spy
&lt;/h2&gt;

&lt;p&gt;Py-spy uses system calls (&lt;code&gt;process_vm_readv&lt;/code&gt; on Linux, &lt;code&gt;vm_read&lt;/code&gt; on OSX, &lt;code&gt;ReadProcessMemory&lt;/code&gt; on Windows) to obtain the call stack, then translates that information into the Python function calls that you see in your source code. It samples multiple times per second so it has a good chance of seeing your program in the various states that it will be in over time. It is written in Rust for speed.&lt;/p&gt;

&lt;p&gt;Getting py-spy into your project is very simple: it’s installable via &lt;code&gt;pip&lt;/code&gt;. To show you how to use it, I’ve created some sample code to profile and observe how py-spy can tell us about a running Python process. If you want to follow along, you can easily reproduce these steps.&lt;/p&gt;

&lt;p&gt;First, I set up a new virtual environment using &lt;a href="https://www.wrighters.io/you-can-easily-and-sensibly-run-multiple-versions-of-python-with-pyenv/"&gt;pyenv&lt;/a&gt; and the &lt;a href="https://www.wrighters.io/use-pyenv-and-virtual-environments-to-manage-python-complexity/"&gt;pyenv-virtualenv plugin&lt;/a&gt; for this project. You can do this or set up a virtual environment using your preferred tool.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# whichever Python version you prefer
pyenv install 3.8.7             
# make our virtualenv (with above version)
pyenv virtualenv 3.8.7 py-spy   
# activate it
pyenv activate py-spy           
# install py-spy
pip install py-spy              
# make sure we pick up the commands in our path
pyenv rehash                    
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s all there is to it, we now have the tools available. If you run py-spy, you can see the common usage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ py-spy
py-spy 0.3.4
Sampling profiler for Python programs

USAGE:
    py-spy &amp;lt;SUBCOMMAND&amp;gt;

OPTIONS:
    -h, --help Prints help information
    -V, --version Prints version information

SUBCOMMANDS:
    record Records stack trace information to a flamegraph, speedscope or raw file
    top Displays a top like view of functions consuming CPU
    dump Dumps stack traces for a target program to stdout
    help Prints this message or the help of the given subcommand(s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  An example
&lt;/h2&gt;

&lt;p&gt;In order to demonstrate py-spy, I’ve written a simple long-running process that will consume streaming prices from a cryptocurrency exchange and generate a record every minute (this is also known as a bar). The bar contains various information from the past minute: the high, low, and last price, the volume, and the Volume Weighted Average Price (VWAP). Right now, the code only logs these values, but could be extended to update a database. While it’s simple, it is a useful example to use since cryptocurrencies trade around the clock and will give us real world data to work with.&lt;/p&gt;
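
&lt;p&gt;As a small worked example of what goes into a bar, the sketch below computes the high, low, last price, volume, and VWAP from a handful of trades. VWAP is the sum of price times size divided by the total size; it can be computed from stored lists or from running totals, and both give the same answer. The trade values here are invented for illustration and are not real exchange data.&lt;/p&gt;

```python
# Made-up (price, size) trades for one minute; not real exchange data.
trades = [(100.0, 2.0), (101.0, 1.0), (99.0, 1.0)]

prices = [price for price, _ in trades]
sizes = [size for _, size in trades]

# Option 1: recompute from the stored lists each time it is needed.
vwap_from_lists = sum(p * s for p, s in zip(prices, sizes)) / sum(sizes)

# Option 2: keep running totals and update them on every trade.
weighted_price = 0.0
volume = 0.0
for price, size in trades:
    weighted_price += price * size
    volume += size
vwap_running = weighted_price / volume

bar = {
    "high": max(prices),
    "low": min(prices),
    "last": prices[-1],
    "volume": volume,
    "vwap": vwap_running,
}
print(bar)
```

&lt;p&gt;The running-total version does a constant amount of work per trade, while the list version redoes the whole sum every time it is called.&lt;/p&gt;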

&lt;p&gt;I’m using a &lt;a href="https://docs.pro.coinbase.com"&gt;Coinbase Pro API&lt;/a&gt; for &lt;a href="https://github.com/danpaquin/coinbasepro-python"&gt;Python&lt;/a&gt; to access data from the WebSocket feed. Here’s a first cut that has some debugging code left in place (along with two ways to generate the VWAP, one inefficient (the &lt;code&gt;_vwap&lt;/code&gt; method) and one more efficient). Let’s see if py-spy reveals how much time this code uses.&lt;/p&gt;

&lt;p&gt;This code will end up generating a thread for the WebSocket client. The asyncio loop will set a timer for the next minute boundary to tell the client to log the bar data. It will run until you kill it (with Ctrl-C, for example).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/env python

import argparse
import functools
import datetime
import asyncio
import logging

import arrow
import cbpro

class BarClient(cbpro.WebsocketClient):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._bar_volume = 0
        self._weighted_price = 0.0
        self._ticks = 0
        self._bar_high = None
        self._bar_low = None
        self.last_msg = {}

        self._pxs = []
        self._volumes = []

    def next_minute_delay(self):
        delay = (arrow.now().shift(minutes=1).floor('minutes') - arrow.now())
        return (delay.seconds + delay.microseconds/1e6)

    def _vwap(self):
        if len(self._pxs):
            wp = sum([x*y for x,y in zip(self._pxs, self._volumes)])
            v = sum(self._volumes)

            return wp/v

    def on_message(self, msg):
        if 'last_size' in msg and 'price' in msg:
            last_size = float(msg['last_size'])
            price = float(msg['price'])
            self._bar_volume += last_size
            self._weighted_price += last_size * price
            self._ticks += 1
            if self._bar_high is None or price &amp;gt; self._bar_high:
                self._bar_high = price
            if self._bar_low is None or price &amp;lt; self._bar_low:
                self._bar_low = price
            self._pxs.append(price)
            self._volumes.append(last_size)
            logging.debug("VWAP: %s", self._vwap())
        self.last_msg = msg
        logging.debug("Message: %s", msg)

    def on_bar(self, loop):
        if self.last_msg is not None:
            if self._bar_volume == 0:
                self.last_msg['vwap'] = None
            else:
                self.last_msg['vwap'] = self._weighted_price/self._bar_volume
            self.last_msg['bar_bar_volume'] = self._bar_volume
            self.last_msg['bar_ticks'] = self._ticks
            self.last_msg['bar_high'] = self._bar_high
            self.last_msg['bar_low'] = self._bar_low
            last = self.last_msg.get('price')
            if last:
                last = float(last)
            self._bar_high = last
            self._bar_low = last
            logging.info("Bar: %s", self.last_msg)
        self._bar_volume = 0
        self._weighted_price = 0.0
        self._ticks = 0
        self._pxs.clear()
        self._volumes.clear()
        # reschedule
        loop.call_at(loop.time() + self.next_minute_delay(),
                     functools.partial(self.on_bar, loop))

def main():
    argparser = argparse.ArgumentParser()
    argparser.add_argument("--product", default="BTC-USD",
                           help="coinbase product")
    argparser.add_argument('-d', "--debug", action='store_true',
                           help="debug logging")
    args = argparser.parse_args()

    cfg = {"format": "%(asctime)s - %(levelname)s - %(message)s"}
    if args.debug:
        cfg["level"] = logging.DEBUG
    else:
        cfg["level"] = logging.INFO

    logging.basicConfig(**cfg)

    client = BarClient(url="wss://ws-feed.pro.coinbase.com",
                       products=args.product,
                       channels=["ticker"])

    loop = asyncio.get_event_loop()
    loop.call_at(loop.time() + client.next_minute_delay(), functools.partial(client.on_bar, loop))
    loop.call_soon(client.start)

    try:
        loop.run_forever()
    finally:
        loop.close()

if __name__ == '__main__':
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running the example
&lt;/h2&gt;

&lt;p&gt;To run this code, you’ll need to install a few extra modules. The cbpro module is a simple Python wrapper of the Coinbase APIs. &lt;a href="https://arrow.readthedocs.io/en/stable/"&gt;Arrow&lt;/a&gt; is a nice library for doing datetime logic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install arrow cbpro
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, you can run the code with debug logging and hopefully see some messages, depending on how busy the exchange is.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; ./coinbase_client.py -d
2021-03-14 17:20:12,828 - DEBUG - Using selector: KqueueSelector
-- Subscribed! --

2021-03-14 17:20:13,059 - DEBUG - Message: {'type': 'subscriptions', 'channels': [{'name': 'ticker', 'product_ids': ['BTC-USD']}]}
2021-03-14 17:20:13,060 - DEBUG - VWAP: 60132.57
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Profiling the example
&lt;/h2&gt;

&lt;p&gt;Now, let’s review the py-spy commands. First, using the dump command will give us a quick view of the stack, translated to Python functions. &lt;/p&gt;

&lt;p&gt;A quick side note here: if you’re using a Mac, you will need to run py-spy with sudo. On Linux, it depends on your security settings. Also, since I was using pyenv, I needed to pass my environment on to sudo using the &lt;code&gt;-E&lt;/code&gt; flag so that it picks up the right Python interpreter and the py-spy script in the path. I obtained the process id for my process using the &lt;code&gt;ps&lt;/code&gt; command in my shell (in my case it was 97520). &lt;/p&gt;

&lt;h3&gt;
  
  
  py-spy dump
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; sudo -E py-spy dump -p 97520
Process 97520: /Users/mcw/.pyenv/versions/py-spy/bin/python ./coinbase_client.py -d
Python v3.8.7 (/Users/mcw/.pyenv/versions/3.8.7/bin/python3.8)

Thread 0x113206DC0 (idle): "MainThread"
    select (selectors.py:558)
    _run_once (asyncio/base_events.py:1823)
    run_forever (asyncio/base_events.py:570)
    main (coinbase_client.py:107)
    &amp;lt;module&amp;gt; (coinbase_client.py:113)
Thread 0x700009CAA000 (idle): "Thread-1"
    read (ssl.py:1101)
    recv (ssl.py:1226)
    recv (websocket/_socket.py:80)
    _recv (websocket/_core.py:427)
    recv_strict (websocket/_abnf.py:371)
    recv_header (websocket/_abnf.py:286)
    recv_frame (websocket/_abnf.py:336)
    recv_frame (websocket/_core.py:357)
    recv_data_frame (websocket/_core.py:323)
    recv_data (websocket/_core.py:310)
    recv (websocket/_core.py:293)
    _listen (cbpro/websocket_client.py:84)
    _go (cbpro/websocket_client.py:41)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see there are two threads running. One is reading data, the other is in &lt;code&gt;select&lt;/code&gt; in the run loop. A single dump like this is mainly useful for profiling when our program is stuck. One really nice feature, though, is that if you give it the &lt;code&gt;--locals&lt;/code&gt; option, it will show you any local variables, which can be really helpful for debugging!&lt;/p&gt;

&lt;h3&gt;
  
  
  py-spy top
&lt;/h3&gt;

&lt;p&gt;The next command to try is &lt;code&gt;top&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo -E py-spy top -p 97520
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will bring up an interface that looks very similar to the Unix &lt;code&gt;top&lt;/code&gt; command. As your program runs and py-spy gathers samples, it will show you where the program is spending its time. Here is a screenshot of what that looked like for me after about 30 seconds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DqBAkzlv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/www.wrighters.io/wp-content/uploads/2021/03/pyspy-top.png%3Fresize%3D656%252C518%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DqBAkzlv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/www.wrighters.io/wp-content/uploads/2021/03/pyspy-top.png%3Fresize%3D656%252C518%26ssl%3D1" alt=""&gt;&lt;/a&gt;py-spy top output&lt;/p&gt;

&lt;h3&gt;
  
  
  py-spy record
&lt;/h3&gt;

&lt;p&gt;Finally, you can record data using py-spy for later analysis or output. There is a raw format, speedscope format, and a flamegraph output. You can specify the amount of time you want to collect data (in seconds), or just let it collect data until you exit the program using Ctrl-C. For example, this command will generate a useful little SVG file flamegraph that you can interact with in a web browser.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo -E py-spy record -p 97520 --output py-spy.svg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also export the data in &lt;a href="https://www.speedscope.app/"&gt;speedscope&lt;/a&gt; format and then upload it to the speedscope web tool for further analysis. This is a great tool for interactively seeing how your code executes.&lt;/p&gt;

&lt;p&gt;I’d encourage you to run this code on your own and play with both the speedscope tool and the SVG output, but here are two screenshots that help explain how it works. This first view is the overall SVG output. If you hover over the cells, it will show you the function details. You can see that most of the time is spent in the WebSocket client &lt;code&gt;_listen&lt;/code&gt; method, but the &lt;code&gt;on_message&lt;/code&gt; method shows up to the right of that (designated by the arrow).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dyYgRaDp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/www.wrighters.io/wp-content/uploads/2021/03/flamegraph1.png%3Fresize%3D656%252C265%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dyYgRaDp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i0.wp.com/www.wrighters.io/wp-content/uploads/2021/03/flamegraph1.png%3Fresize%3D656%252C265%26ssl%3D1" alt=""&gt;&lt;/a&gt;py-spy svg output&lt;/p&gt;

&lt;p&gt;If we click on it, we get a detailed view.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KpmIKagg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/www.wrighters.io/wp-content/uploads/2021/03/flamegraph2.png%3Fresize%3D656%252C254%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KpmIKagg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/www.wrighters.io/wp-content/uploads/2021/03/flamegraph2.png%3Fresize%3D656%252C254%26ssl%3D1" alt=""&gt;&lt;/a&gt;py-spy svg detailed output&lt;/p&gt;

&lt;p&gt;In my case, I see that my list comprehension and logging in the unneeded &lt;code&gt;_vwap&lt;/code&gt; method show up fairly high in the profile. I can easily remove this method (and the extra prices and volumes that I was tracking) since the VWAP can be calculated with just a running sum of price times size and the total volume (as I’m doing already in the code). It’s also interesting to see how much time logging takes when the script is run in debug mode.&lt;/p&gt;
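
&lt;p&gt;As a rough illustration of that point, here is a minimal sketch of a running VWAP accumulator. The &lt;code&gt;RunningVwap&lt;/code&gt; class, its method names, and the sample ticks are all hypothetical and are not part of the client code; the sketch only shows that two running totals are enough, so no per-tick lists need to be kept.&lt;/p&gt;

```python
class RunningVwap:
    """Hypothetical helper: tracks VWAP with two running totals."""

    def __init__(self):
        self.weighted_price = 0.0  # running sum of price * size
        self.volume = 0.0          # running sum of size

    def add(self, price, size):
        # Constant work per tick, nothing is appended to a list.
        self.weighted_price += price * size
        self.volume += size

    def value(self):
        # No trades yet means the VWAP is undefined.
        if self.volume == 0:
            return None
        return self.weighted_price / self.volume

    def reset(self):
        # Call at the end of each bar to start a fresh window.
        self.weighted_price = 0.0
        self.volume = 0.0


# Made-up (price, size) ticks, purely for illustration.
ticks = [(60000.0, 0.5), (60100.0, 0.25), (59900.0, 0.25)]

running = RunningVwap()
for price, size in ticks:
    running.add(price, size)

print(running.value())  # 60000.0
```

&lt;p&gt;With this approach the per-message cost stays constant no matter how many ticks arrive within a bar, whereas recomputing from lists grows with the number of ticks.&lt;/p&gt;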

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In summary, I’d encourage you to try out py-spy on some of your code. If you try to predict where your code will spend its time, how correct are you? Are there any findings that surprise you? Maybe compare the output of py-spy to a deterministic profiler like line_profiler.&lt;/p&gt;

&lt;p&gt;I hope this overview of py-spy has been helpful and that you can deploy this tool in diagnosing performance issues in your own code.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.wrighters.io/profiling-python-code-with-py-spy/"&gt;Profiling Python code with py-spy&lt;/a&gt; appeared first on &lt;a href="https://www.wrighters.io"&gt;wrighters.io&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
    </item>
    <item>
      <title>How to remove a column from a DataFrame, with some extra detail</title>
      <dc:creator>wrighter</dc:creator>
      <pubDate>Mon, 08 Mar 2021 00:14:57 +0000</pubDate>
      <link>https://forem.com/wrighter/how-to-remove-a-column-from-a-dataframe-with-some-extra-detail-374g</link>
      <guid>https://forem.com/wrighter/how-to-remove-a-column-from-a-dataframe-with-some-extra-detail-374g</guid>

&lt;p&gt;First, what’s the “correct” way to remove a column from a &lt;code&gt;DataFrame&lt;/code&gt;? The standard way to do this is to think in SQL and use &lt;code&gt;drop&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(25).reshape((5,5)),               
                  columns=list("abcde"))

display(df)

try:
    df.drop('b')
except KeyError as ke:
    print(ke)

   a  b  c  d  e
0  0  1  2  3  4
1  5  6  7  8  9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
"['b'] not found in axis"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait, what? Why an error? That’s because the default axis that &lt;code&gt;drop&lt;/code&gt; works with is the rows. As with many pandas methods, there’s more than one way to invoke the method (which some people find frustrating). &lt;/p&gt;

&lt;p&gt;You can drop rows using &lt;code&gt;axis=0&lt;/code&gt; or &lt;code&gt;axis='rows'&lt;/code&gt;, or using the &lt;code&gt;labels&lt;/code&gt; argument.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.drop(0) # drop a row, on axis 0 or 'rows'
df.drop(0, axis=0) # same
df.drop(0, axis='rows') # same
df.drop(labels=0) # same
df.drop(labels=[0]) # same
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   a  b  c  d  e
1  5  6  7  8  9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Again, how do we drop a column?
&lt;/h3&gt;

&lt;p&gt;We want to drop a column, so what does that look like? You can specify the &lt;code&gt;axis&lt;/code&gt; or use the &lt;code&gt;columns&lt;/code&gt; parameter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.drop('b', axis=1) # drop a column
df.drop('b', axis='columns') # same
df.drop(columns='b') # same
df.drop(columns=['b']) # same
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   a  c  d  e
0  0  2  3  4
1  5  7  8  9
2 10 12 13 14
3 15 17 18 19
4 20 22 23 24
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There you go, that’s how you drop a column. Now you have to either assign to a new variable, or back to your old variable, or pass in &lt;code&gt;inplace=True&lt;/code&gt; to make the change permanent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df2 = df.drop('b', axis=1)

print(df2.columns)
print(df.columns)

Index(['a', 'c', 'd', 'e'], dtype='object')
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It’s also worth noting that you can drop &lt;em&gt;both&lt;/em&gt; rows and columns at the same time using drop by using the &lt;code&gt;index&lt;/code&gt; and &lt;code&gt;columns&lt;/code&gt; arguments at once, and you can pass in multiple values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.drop(index=[0,2], columns=['b','c'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   a  d  e
1  5  8  9
3 15 18 19
4 20 23 24
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you didn’t have the drop method, you could obtain essentially the same results through indexing. There are many ways to accomplish this, but one equivalent solution is indexing using the &lt;code&gt;.loc&lt;/code&gt; indexer and &lt;code&gt;isin&lt;/code&gt;, along with inverting the selection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.loc[~df.index.isin([0,2]), ~df.columns.isin(['b', 'c'])]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   a  d  e
1  5  8  9
3 15 18 19
4 20 23 24
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
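
&lt;p&gt;To check that the two approaches really do agree, here is a small sketch that builds the same frame as above and compares both results with &lt;code&gt;DataFrame.equals&lt;/code&gt;. The variable names &lt;code&gt;dropped&lt;/code&gt; and &lt;code&gt;selected&lt;/code&gt; are just illustrative.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Same 5x5 frame used throughout this article.
df = pd.DataFrame(np.arange(25).reshape((5, 5)), columns=list("abcde"))

# Approach 1: drop rows and columns by label.
dropped = df.drop(index=[0, 2], columns=["b", "c"])

# Approach 2: keep everything that is NOT in the given labels.
selected = df.loc[~df.index.isin([0, 2]), ~df.columns.isin(["b", "c"])]

# equals() compares shape, labels and values.
print(dropped.equals(selected))

# Neither approach modifies the original frame.
print(list(df.columns))
```

&lt;p&gt;Both expressions return a new object, so the original &lt;code&gt;df&lt;/code&gt; still has all five columns afterwards.&lt;/p&gt;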



&lt;p&gt;If none of that makes sense to you, I would suggest reading through my series on selecting and indexing in pandas, starting &lt;a href="https://www.wrighters.io/indexing-and-selecting-in-pandas-part-1/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Back to the question
&lt;/h2&gt;

&lt;p&gt;Looking back at the original question though, we see there is another available technique for removing a column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;del df['a']
df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   b  c  d  e
0  1  2  3  4
1  6  7  8  9
2 11 12 13 14
3 16 17 18 19
4 21 22 23 24
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Poof! It’s gone. This is like doing a drop with &lt;code&gt;inplace=True&lt;/code&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  What about attribute access?
&lt;/h2&gt;

&lt;p&gt;We also know that we can use attribute access to &lt;em&gt;select&lt;/em&gt; columns of a &lt;code&gt;DataFrame&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0  1
1  6
2 11
3 16
4 21
Name: b, dtype: int64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Can we delete the column this way?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;del df.b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--------------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
&amp;lt;ipython-input-10-0dca358a6ef9&amp;gt; in &amp;lt;module&amp;gt;
---------&amp;gt; 1 del df.b

AttributeError: b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We cannot. This is not an option for removing columns with the current pandas design. Is this technically impossible? How come &lt;code&gt;del df['b']&lt;/code&gt; works but &lt;code&gt;del df.b&lt;/code&gt; doesn’t? Let’s dig into those details and see whether it would be possible to make the second work as well.&lt;/p&gt;

&lt;p&gt;The first version works because in pandas, the &lt;code&gt;DataFrame&lt;/code&gt; implements the &lt;code&gt;__delitem__&lt;/code&gt; method which gets invoked when you execute &lt;code&gt;del df['b']&lt;/code&gt;. But what about &lt;code&gt;del df.b&lt;/code&gt;, is there a way to handle that?&lt;/p&gt;

&lt;p&gt;First, let’s make a simple class that shows how this works under the hood. Instead of being a real &lt;code&gt;DataFrame&lt;/code&gt;, we’ll just use a &lt;code&gt;dict&lt;/code&gt; as a container for our columns (which could really contain anything, we’re not doing any indexing here).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class StupidFrame:
    def __init__ (self, columns):
        self.columns = columns

    def __delitem__ (self, item):
        del self.columns[item]

    def __getitem__ (self, item):
        return self.columns[item]

    def __setitem__ (self, item, val):
        self.columns[item] = val

f = StupidFrame({'a': 1, 'b': 2, 'c': 3})
print("StupidFrame value for a:", f['a'])
print("StupidFrame columns: ", f.columns)
del f['b']
f.d = 4
print("StupidFrame columns: ", f.columns)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;StupidFrame value for a: 1
StupidFrame columns: {'a': 1, 'b': 2, 'c': 3}
StupidFrame columns: {'a': 1, 'c': 3}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A couple of things to note here. First, we show that we can access the data in our &lt;code&gt;StupidFrame&lt;/code&gt; with the index operators (&lt;code&gt;[]&lt;/code&gt;), and use that for setting, getting, and deleting items. Second, when we assigned &lt;code&gt;d&lt;/code&gt; to our frame, it wasn’t added to our columns because it’s just a normal instance attribute. If we want to be able to handle the columns as attributes, we have to do a little bit more work.&lt;/p&gt;

&lt;p&gt;So following the example from pandas (which supports attribute access of columns), we add the &lt;code&gt;__getattr__&lt;/code&gt; method, but we will also handle setting it with the &lt;code&gt;__setattr__&lt;/code&gt; method and pretend that any attribute assignment is a ‘column’. We have to update our instance dictionary (&lt;code&gt;__dict__&lt;/code&gt;) directly to avoid an infinite recursion.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class StupidFrameAttr:
    def __init__ (self, columns):
        self. __dict__ ['columns'] = columns

    def __delitem__ (self, item):
        del self. __dict__ ['columns'][item]

    def __getitem__ (self, item):
        return self. __dict__ ['columns'][item]

    def __setitem__ (self, item, val):
        self. __dict__ ['columns'][item] = val

    def __getattr__ (self, item):
        if item in self. __dict__ ['columns']:
            return self. __dict__ ['columns'][item]
        elif item == 'columns':
            return self. __dict__ [item]
        else:
            raise AttributeError

    def __setattr__ (self, item, val):
        if item != 'columns':
            self. __dict__ ['columns'][item] = val
        else:
            raise ValueError("Overwriting columns prohibited") 


f = StupidFrameAttr({'a': 1, 'b': 2, 'c': 3})
print("StupidFrameAttr value for a", f['a'])
print("StupidFrameAttr columns: ", f.columns)
del f['b']
print("StupidFrameAttr columns: ", f.columns)
print("StupidFrameAttr value for a", f.a)
f.d = 4
print("StupidFrameAttr columns: ", f.columns)
del f['d']
print("StupidFrameAttr columns: ", f.columns)
f.d = 5
print("StupidFrameAttr columns: ", f.columns)
del f.d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;StupidFrameAttr value for a 1
StupidFrameAttr columns: {'a': 1, 'b': 2, 'c': 3}
StupidFrameAttr columns: {'a': 1, 'c': 3}
StupidFrameAttr value for a 1
StupidFrameAttr columns: {'a': 1, 'c': 3, 'd': 4}
StupidFrameAttr columns: {'a': 1, 'c': 3}
StupidFrameAttr columns: {'a': 1, 'c': 3, 'd': 5}
--------------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
&amp;lt;ipython-input-12-fd29f59ea01e&amp;gt; in &amp;lt;module&amp;gt;
     39 f.d = 5
     40 print("StupidFrameAttr columns: ", f.columns)
--------&amp;gt; 41 del f.d

AttributeError: d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How could we handle deletion?
&lt;/h2&gt;

&lt;p&gt;Everything works but deletion using attribute access. We handle setting/getting columns using both the array index operator (&lt;code&gt;[]&lt;/code&gt;) and attribute access. But what about detecting deletion? Is that possible?&lt;/p&gt;

&lt;p&gt;One way to do this is using the &lt;code&gt;__delattr__&lt;/code&gt; method, which is described in the &lt;a href="https://docs.python.org/3.8/reference/datamodel.html"&gt;data model&lt;/a&gt; documentation. If you define this method in your class, it will be invoked instead of updating an instance’s attribute dictionary directly. This gives us a chance to redirect this to our columns instance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class StupidFrameDelAttr(StupidFrameAttr):
    def __delattr__(self, item):
        # trivial implementation using the data model methods
        del self.__dict__['columns'][item]

f = StupidFrameDelAttr({'a': 1, 'b': 2, 'c': 3})
print("StupidFrameDelAttr value for a", f['a'])
print("StupidFrameDelAttr columns: ", f.columns)
del f['b']
print("StupidFrameDelAttr columns: ", f.columns)
print("StupidFrameDelAttr value for a", f.a)
f.d = 4
print("StupidFrameDelAttr columns: ", f.columns)
del f.d 
print("StupidFrameDelAttr columns: ", f.columns)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;StupidFrameDelAttr value for a 1
StupidFrameDelAttr columns: {'a': 1, 'b': 2, 'c': 3}
StupidFrameDelAttr columns: {'a': 1, 'c': 3}
StupidFrameDelAttr value for a 1
StupidFrameDelAttr columns: {'a': 1, 'c': 3, 'd': 4}
StupidFrameDelAttr columns: {'a': 1, 'c': 3}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
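
&lt;p&gt;One caveat with the trivial &lt;code&gt;__delattr__&lt;/code&gt; above is that deleting a name that is not a column raises a &lt;code&gt;KeyError&lt;/code&gt;, while callers of &lt;code&gt;del obj.name&lt;/code&gt; normally expect an &lt;code&gt;AttributeError&lt;/code&gt;. Here is a minimal, slightly more defensive sketch. The &lt;code&gt;ColumnStore&lt;/code&gt; class is a hypothetical stand-in written for this example; it is not how pandas is implemented.&lt;/p&gt;

```python
class ColumnStore:
    """Hypothetical container that exposes dict entries as attributes."""

    def __init__(self, columns):
        # Bypass our own __setattr__ so 'columns' is a real instance attribute.
        object.__setattr__(self, "columns", dict(columns))

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self.__dict__["columns"][name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        if name == "columns":
            raise ValueError("Overwriting columns prohibited")
        self.__dict__["columns"][name] = value

    def __delattr__(self, name):
        if name == "columns":
            raise AttributeError("columns cannot be deleted")
        columns = self.__dict__["columns"]
        if name in columns:
            del columns[name]
        else:
            # Not a column: the default implementation raises AttributeError.
            object.__delattr__(self, name)


f = ColumnStore({"a": 1, "b": 2})
f.d = 4
del f.d
print(f.columns)  # {'a': 1, 'b': 2}

try:
    del f.missing
except AttributeError as err:
    print("AttributeError:", err)
```

&lt;p&gt;Falling back to &lt;code&gt;object.__delattr__&lt;/code&gt; keeps the error type consistent with ordinary Python objects, which matters for any code that catches &lt;code&gt;AttributeError&lt;/code&gt;.&lt;/p&gt;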



&lt;p&gt;Now I’m not suggesting that attribute deletion for columns would be easy to add to pandas, but at least this shows how it could be possible. In the case of current pandas, deleting columns is best done using &lt;code&gt;drop&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Also, it’s worth mentioning here that when you create a new column in pandas, you don’t assign it as an attribute. To better understand how to properly create a column, you can check out &lt;a href="https://www.wrighters.io/basic-pandas-how-to-add-a-column-to-a-dataframe/"&gt;this article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you already knew how to drop a column in pandas, hopefully you understand a little bit more about how this works.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.wrighters.io/how-to-remove-a-column-from-a-dataframe/"&gt;How to remove a column from a DataFrame, with some extra detail&lt;/a&gt; appeared first on &lt;a href="https://www.wrighters.io"&gt;wrighters.io&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>pandas</category>
      <category>python</category>
    </item>
  </channel>
</rss>
